· 7 years ago · Nov 20, 2018, 04:22 AM
1Searching & Sorting
2====================
3
4* Processing data is a fundamental operation in Computer Science
5* Two fundamental operations in processing data are *searching* and *sorting*
6* Form the basis or preprocessing step of many algorithms
7* Large variety of algorithms have been developed
8
9## Searching
10
11* *What* are you searching? strings? numbers? files? objects?
12* *How* are you searching? (Not algorithmically, but by what *criteria or criterion*)
13 * You could be searching for the first or the last or all instances of a particular criterion
14 * You could search for extremal elements: minimum, maximum, median, i-th order statistic
15 * There are many basic and advanced criteria that you can search on
16* We want to develop a general, abstract search algorithm
17* *Pseudocode*: a general, generic "code" language that allows you to specify a general abstract algorithm without getting to the specifics of any particular programming language
18* Linear search pseudocode: see book
19* Let's translate the linear search algorithm into actual C code
20
21```c
22/**
23 * This function searches the given array for
24 * the given key value and returns the first
25 * index at which it finds a matching element.
26 *
27 * This function returns -1 if no matching element
28 * is found
29 */
30int linearSearch(const int *arr, int n, int key) {
31 for(int i=0; i<n; i++) {
32 if(a[i] == key) {
33 return i;
34 }
35 }
36 return -1;
37}
38
39//alternatively:
40double * linearSearch(const double *arr, int n, double key) {
41 for(int i=0; i<n; i++) {
42 if(a[i] == key) {
43 return &a[i];
44 }
45 }
46 return NULL;
47}
48
49```
50
51* Problem: the above functions are only good for: arrays of integers.
52* If we wanted a version for arrays of `double` or `char *` (strings) or `Student` structures, etc.
53* Goal (eventually): design/implement ONE search/sort function to search/sort *any* type of variable
54
55### Binary Search
56
57* Suppose that the input array (collection) were *sorted*: how can you exploit this structure?
58* You can exploit this structure by checking the middle element $m$ (searching for $k$):
59 * if equal $m = k$, you stop because you've found your needle
60 * if $k < m$: you "recursively" search on the left hand side
61 * if $m < k$: you recursively search on the right hand side
62 * You stop when the size of the array is zero
63
64
65```c
66int binarySearch(const int *a, int i, int j, int key) {
67
68 if(i > j) {
69 //no such element exists, the array is of "size" zero
70 return -1;
71 } else {
72 int m = (i + j) / 2; //m is the middle index
73 if(a[m] == key) {
74 return m;
75 } else if(key < a[m]) {
76 //recurse on the left hand side
77 binarySearch(a, i, m-1, key);
78 } else if(a[m] < key) {
79 //recurse on the right hand side
80 binarySearch(a, m+1, j, key);
81 }
82 }
83
84}
85```
86
87### Analysis
88
89* Suppose you have an array of $n$ elements
90* How many comparisons does linear search perform to search the given array?
91 * Best case scenario: you find the key at the first element: one comparison!
92 * Worst case scenario: you find the key at the end of the array (or you don't find it at all): $n$ comparisons
93 * Average case scenario: naive analysis:
94 $$\frac{n + 1}{2} \approx \frac{n}{2}$$
95 * Both the average and worst cases are *linear*
96* Binary Search?
97 * Derivation: whitepaper and in the book
98 * Only takes about $\log_2{(n)}$ comparisons
99
100#### Perspective
101
102* Suppose you have a database of $10^{12}$ (1 trillion) elements
103 * If unsorted you would have to use linear search to find a particular element:
104 $$500 \times 10^{11}$$
105 500 billion comparisons
106 * If sorted (indexed) then you can exploit the sort using binary search:
107 $$\log_2{(10^{12})} < 40$$
108
109* Another perspective: suppose you "double" the input size of an array
110 $$n \rightarrow 2n$$
111 * linear search: the number of comparisons goes from
112 $$\frac{n}{2} \rightarrow n$$
113 The number of comparisons doubles
114 * binary search:
115 $$\log_2{(n)} \rightarrow \log_2{(2n)} = \log_2{(n)} + \log_2{(2)} = \log_2{(n)} + 1$$
116
117## Sorting
118
119## Searching & Sorting in Practice
120
121* In practice: you don't generally roll your own searching and sorting algorithms unless you have a Very Good Reason to do so
122* In practice you use built-in search/sort algorithms/functions in a standard library
123* To avoid multiple solutions you use *generic* searching/sorting functions
124* Problem: a sorting algorithm/function may know how to "sort" but if it is generic, it may not know how to "order"
125 
126* You can use a comparator function that specifies how to "order" two elements, $a, b$
127* Given two elements $a, b$ which goes first? Are $a, b$ in order or out of order? Should they be swapped?
128* A comparator takes two elements $a, b$ as input and returns
129 * something negative if $a < b$
130 * zero if $a = b$
131 * something positive if $a > b$
132
133### Comparator functions in C
134
135* In C, a *comparator function* has the following signature:
136
137`int cmp(const void *a, const void *b)`
138
139* `void *` is a "generic void pointer": it can point to anything!
140* the compartor returns an integer according to the "contract" outlined above
141* Standard pattern:
142 * cast the pointers to the specific type of data you want to compare
143 * Use the state of the "object" or structure to determine the proper ordering
144 * Return a valid integer value that expresses the order of the two elements
145* Best practices:
146 * Use descriptive names for your functions
147 * Be explicit in your comparisons
148 * Avoid "tricks" (difference tricks)
149 * Reuse comaprator functionality whenever possible
150* Examples:
151 * Write a comparator to order integers in non-decreasing order
152 * Write a comparator to order integers in non-increasing order
153 * Write a comparator to order `Student` structures by NUID
154 * Write a comparator to order `Student` structures by GPA
155 * Write a comparator to order `Student` structures by last name/first name
156
157#### Searching & Sorting
158
159* The standard C library provides a function:
160
161```c
162void qsort(void *base,
163 size_t nel,
164 size_t size,
165 int (*compar)(const void *, const void *));
166```
167
168* `base` is the array of elements to be sorted (also note: it is not `const`)
169* `nel` is the number of elements in the array
170* `size` is the number of bytes each element in the array takes, usually you use: `sizeof` for this parameter
171* `compar` - a function pointer to a comparator function that defines the order that you want to sort in
172
173#### Function Pointers
174
175* How can we "pass" a function to another function as an argument?
176* Generally these are referred to as "callbacks"
177* Ex: GUIs: you can define a button but how do you specify the code that runs when someone presses the button: you provide a callback
178* Ex: `qsort` needs a way to order elements, so it needs a comparator function
179* In C pointers can point to data stored in memory
180* But, a pointer is just a memory location
181* What else is stored in memory? Your code!
182* It makes sense that you can create a pointer that "points" to your code; points to a function in your code
183
184```c
185//create a function pointer called ptrToFunc that can
186//point to any function that returns
187//an integer and takes three arguments:
188//(int, double, char)
189int (*ptrToFunc)(int, double, char) = NULL;
190
191//declare a pointer that can point to math's sqrt function:
192double (*ptrToSqrt)(double) = NULL;
193
194//make our new pointer point to math's sqrt function:
195ptrToSqrt = sqrt;
196
197//cool: you can also invoke the function via its pointer:
198double y = ptrToSqrt(2.0);
199
200//a whole world of hurt:
201sqrt = sin;
202
203ptrToSqrt = sin;
204
205```
206
207## Searching in practice
208
209* Searching: `search.h` contains linear search functions, `lfind`, `lsearch`
210* Standard library:
211
212```c
213void * bsearch(const void *key,
214 const void *base,
215 size_t nel,
216 size_t size,
217 int (*compar) (const void *, const void *));
218```
219
220#### Sample Code
221
222```c
223#include <stdlib.h>
224#include <stdio.h>
225#include <math.h>
226#include <string.h>
227
228typedef struct {
229 int nuid;
230 char *firstName;
231 char *lastName;
232 double gpa;
233 //forget the date for now
234} Student;
235
236
237int cmpInt(const void *a, const void *b) {
238 const int *x = (const int *)a;
239 const int *y = (const int *)b;
240 if(*x < *y) {
241 //avoid the difference trick: return *x - *y
242 //suceptible to overflow
243 return -1;
244 } else if(*x > *y) {
245 return 1;
246 } else {
247 return 0;
248 }
249 return 0;
250}
251
252
253
254int cmpIntDesc(const void *a, const void *b) {
255 return cmpInt(b, a);
256}
257
258int cmpDouble(const void *a, const void *b) {
259 const double *x = (const double *)a;
260 const double *y = (const double *)b;
261 if(*x < *y) {
262 //avoid the difference trick: return *x - *y
263 //.5 - .75 = -.25 -> 0
264 return -1;
265 } else if(*x > *y) {
266 return 1;
267 } else {
268 return 0;
269 }
270 return 0;
271}
272
273int cmpStudentByNuid(const void *a, const void *b) {
274 const Student *x = (const Student *) a;
275 const Student *y = (const Student *) b;
276 return cmpInt( &(x->nuid), &(y->nuid) );
277}
278
279int cmpStudentByGpa(const void *a, const void *b) {
280 const Student *x = (const Student *) a;
281 const Student *y = (const Student *) b;
282 return cmpDouble( &(y->gpa), &(x->gpa) );
283}
284
285int cmpStudentByName(const void *a, const void *b) {
286 const Student *x = (const Student *) a;
287 const Student *y = (const Student *) b;
288
289 int result = strcmp(x->lastName, y->lastName);
290 if(result != 0) {
291 return result;
292 } else if(result == 0) {
293 //tie: same last name!
294 return strcmp(x->firstName, y->firstName);
295 //there may still be a tie, but that's okay
296 }
297
298}
299
300int main(void) {
301
302 int n = 10;
303 int arr[] = {6, 3, 9, 4, 1, 8, 3, 12, 2, 21};
304 for(int i=0; i<n; i++) {
305 printf("%d ", arr[i]);
306 }
307 printf("\n");
308 qsort(arr, n, sizeof(int), cmpIntDesc);
309 for(int i=0; i<n; i++) {
310 printf("%d ", arr[i]);
311 }
312 printf("\n");
313
314 int k = 8;
315 int *ptr = (int *) bsearch(&k, arr, n, sizeof(int), cmpDouble);
316 if(ptr != NULL) {
317 printf("%d found at memory location %p, which contains %d\n", k, ptr, *ptr);
318 } else {
319 printf("not found!\n");
320 }
321 return 0;
322}
323```
324
325## Generic Programming Demonstration
326
327* How can we use generic programming to avoid duplicate work
328* Demonstration: write a `getMax` function
329
330```c
331#include <stdlib.h>
332#include <stdio.h>
333
334int cmpInt(const void *a, const void *b) {
335 const int *x = (const int *) a;
336 const int *y = (const int *) b;
337 if(*x < *y) {
338 return -1;
339 } else if(*x > *y) {
340 return 1;
341 } else {
342 return 0;
343 }
344}
345
346int cmpDouble(const void *a, const void *b) {
347 const double *x = (const double *) a;
348 const double *y = (const double *) b;
349 if(*x < *y) {
350 return -1;
351 } else if(*x > *y) {
352 return 1;
353 } else {
354 return 0;
355 }
356}
357
358/**
359 * This function finds the maximum integer in the
360 * given array and returns its index
361 */
362int getMaxInt(const int *arr, int n) {
363 int maxIndex = 0;
364 for(int i=1; i<n; i++) {
365 if(arr[i] > arr[maxIndex]) {
366 maxIndex = i;
367 }
368 }
369 return maxIndex;
370}
371
372/**
373 * This function finds the maximum element in the
374 * given array and returns its index
375 */
376int getMax(const void *arr, int n, int size, int (*compar)(const void *, const void *)) {
377 int maxIndex = 0;
378 for(int i=1; i<n; i++) {
379 //want a *pointer* to the i-th element
380 const void *a = arr + i * size;
381 //want a *poitner* to the maxIndex-th element
382 const void *b = arr + maxIndex * size;
383 if( compar(a, b) > 0 ) {
384 maxIndex = i;
385 }
386 }
387 return maxIndex;
388}
389
390int main(void) {
391 int n = 8;
392 double arr[] = {1, 42.5, 3.14, 104.2, 9, 87.3, 7, 62.89};
393 int index = getMax(arr, n, sizeof(double), cmpDouble);
394 printf("Max found at index %d with value %f\n", index, arr[index]);
395 return 0;
396}
397```
398
399## Sorting (Revisited)
400
401### Selection Sort
402
403* Basic Idea: search through the array and find the minimal element, swap it with the first element, `a[0]`
404 * Repeat: find the second smallest among the remaining $n-1$ elements, swap it with `a[1]`
405 * In general: on the $i$-th iteration, the first $i-1$ elements have been sorted, search the remaining $n-i$ elements and find the minimum, swap it with `a[i]`
406
407* Analysis
408 * How many comparisons does selection sort make on an array of size $n$?
409 * Selection sort requires about $n^2$ operations, it is a "quadratic" sorting algorithm
410 * IN practice, quadratic sorting algorithms are NOT feasible,
411 * For any even moderately large data set, quadratic algorithms are not feasible
412
413 ### Quick Sort
414
415 * Basic Idea:
416 * Choose a *pivot* element in the array
417 * Partition all other elements around the pivot: place element smaller to the left, larger to the right
418 * Repeat by recursing on the left partition and the right partition
419 * Until your partition becomes of size 0 or 1
420 * Analysis:
421 * Quick Sort make an average number of comparisons proportional to
422$$\approx n\log{(n)}$$
423
424### Comparison
425
426* Selection sort: $n^2$ vs Quick Sort $n\log{(n)}$
427* Mathematical perspective: $n\log{(n)}$ is a smaller and smaller growing function (quasilinear function)
428* Consider doubling the input size, $n$:
429 * Selection sort would have the number of comparisons go from:
430$$n^2 \rightarrow (2n)^2 = 4n^2$$
431 * Doubling the input size, quadruples the number of operations that selection sort performs
432 * Quick sort would have the number of comparisons go from:
433$$n\log{(n)} \rightarrow 2n\log{(2n)} = 2n\log{(2)} + 2n\log{(n)} = 2n\log{n} + 2n$$
434 * Doubling the input size only doubles the number of operations (roughly)
435
436* Practical perspective: Consider sorting a "large" array of 1 trillion elements: $10^{12}$:
437 * Selection Sort:
438 $$(10^{12})^2 = 10^{24}$$
439 (1 septillion operations), at 7TFlops -> 4 million years
440 * Quick Sort:
441 $$10^{12}\log{(10^{12})} = 12 \cdot 10^{12}\log{10}$$
442 Only about 1.5 hours with a 7TFLOP computer
443
444## Misc & Summary
445
446### Natural vs Artificial Ordering
447
448* It is natural to sort numbers in ascending order
449* It is "natural" to sort strings in lexicographic (ASCII text table) order
450* What is natural about "Freshmen", "Sophomores", "Juniors", "Seniors" (This is actually an "artificial" ordering), a computer would want to sort these lexicographically
451* How do we impose an artificial ordering to change a natural ordering?
452* In C you could define an enumerated type:
453
454```c
455typedef enum {
456 FRESHMAN,
457 SOPHOMORE,
458 JUNIOR,
459 SENIOR
460} Year;
461
462char *yearToString[] = {"Freshman", "Sophomore", "Juniors", "Seniors"};
463
464//what do you get when you use:
465printf("%s", yearToString[FRESHMAN]); //print "Freshman"
466```
467
468* Now our "natural" ordering (wrt numbers) works!
469
470### Sorting Stability
471
472* If two elements, $e_a$ and $e_b$ appear in the original array and are equal $e_a = e_b$ and they are never put out of the original order, then the sorting algorithm is said to be *stable*
473* In general you should prefer stable sorting algorithms and if an algorithm is unstable it can usually be made stable
474
475### Sorting Strings in C
476
477* You cannot use `strcmp` as it is not technically a comparator
478* Solution: cast them as pointers to pointers to chars, then dereference
479
480```c
481#include <stdio.h>
482#include <stdlib.h>
483#include <string.h>
484
485int cmpString(const void *a, const void *b) {
486 const char *x = *((const char**)a);
487 const char *y = *((const char**)b);
488 return strcmp(x, y);
489}
490
491int main(void) {
492
493 char *names[] = {
494 "Chris", "Margaret", "Alan", "Grace", "Zeppo"
495 };
496 qsort(names, 5, sizeof(char *), cmpString);
497
498 for(int i=0; i<5; i++) {
499 printf("%s\n", names[i]);
500 }
501
502 return 0;
503}
504```
505
506### Handling NULLs
507
508* Often you need complex logic to order `NULL` values
509
510```c
511int cmp(const void *a, const void *b) {
512 if(a == NULL && b == NULL) {
513 return 0;
514 } else if(a == NULL && b != NULL) {
515 return 1;
516 } else if(a != NULL && b == NULL) {
517 return -1;
518 } else {
519 //neither a nor b is null, so fall back to
520 //your regular logic
521 }
522}
523```
524
525### Solving a Problem: Demonstration
526
527
528
529### Rest of Semester
530
531* Next week: GUIs (1 lecture)
532* Next week + dead week: Databases (2 lectures)
533* last lecture: review
534* Dead week: live streaming (M-R) Advent of Code