Sort Race

Implement Quick-sort, Merge-sort, and Heap-sort. Optimise your implementations for speed. The point is to find out by experiment which is the fastest.

The data to be sorted will be randomly generated strings. It is your choice whether they should be modern C++ strings, or traditional C strings (i.e. char *s), and it is your choice whether they will be stored in an array or in a standard C++ vector.

Due dates:
  1. 6th Oct.: Preliminary quicksort. It does not need to be perfect yet, but it does need to be clear and readable. This is mainly so that I can be sure you are doing it, and give you some fairly quick feed-back.
  2. 18th Oct.: Quicksort working properly.
  3. 30th Oct.: Mergesort working properly.
  4. Extra credit: Heapsort working properly.
  5. T.B.A.: Last chance to turn in your specially optimised versions of all three prior to the race.


You only have to write the sorting functions. I have provided the functions that generate the random strings, verify that your function does sort correctly, and accurately measure its execution time. These functions, described below, are provided as a C++ source file (testlib.cpp) and the corresponding header file (testlib.h). You should not modify these files at all. You don't even need to look inside them, although you are welcome to: they contain no secrets. They are fully tested on rabbit only.

These are the files: testlib.cpp, testlib.h bsort.cpp.
But rather than download them, it will probably be more convenient for you to just copy them directly into your rabbit accounts. The command to do that is:
            cp /home/www/testlib.* .
(note that the dot at the end is essential). Then, just once, compile the .cpp file with this command:
            CC -c testlib.cpp
and you will never have to worry about them ever again.

The use of this software is very simple, and best illustrated by example. As bubble-sort is not an option for this assignment, I will use it for the example. First I write my bubble-sort function in any way I like, and put it in a C++ source file with #include "testlib.h" at the top, and the special main() as shown.

This is the contents of the file bsort.cpp:
               #include "testlib.h"
               #include <string>
               
               using namespace std;
               
               void swap(string & a, string & b)
               { string temp = a;
                 a = b;
                 b = temp; }
               
               void bubblesort(string A[], int N)         
               { for (int i = N; i > 1; i -= 1)
                 { for (int j = 0; j < i - 1; j += 1)
                   { if (A[j] > A[j+1])
                     { swap(A[j], A[j+1]); } } } }           
               
               int main(int argc, char * argv[])
               { string * data;
                 tester T(argc, argv, data);
                 int length = T.get_number();
               
                 bubblesort(data, length);
               
                 T.finish(); }
As you can see, I chose to sort an array of modern C++ strings. The first line in main: "string * data;" creates a variable called array that can point to an array of strings. If I had wanted to use an array of old-fashioned C strings, I would have typed "char * * data;" instead. If I had wanted a vector of strings, it would have been "vector <string> data;", and if I had wanted a vector of old C strings it would be "vector <char *> data;".
No matter what I choose, the rest of main stays exactly the same. Of course, it is up to you exactly what parameters your sorting function should take.

To compile this program along with the testing functions, I would use this command:
          CC bsort.cpp testlib.o -o bsort
Here is a transcript of a short session in which I test it twice:
               $ CC bsort.cpp testlib.o -o bsort
               $ bsort 1000
                     1000 items sorted in    0.0760 seconds, nlogn=9.9658e+03, ratio=1.311e+05
               $ bsort 10000
                    10000 items sorted in    7.7802 seconds, nlogn=1.3288e+05, ratio=1.708e+04
It is probably obvious what is going on. The number on the command line says how many random strings should be produced. If your sorting function is correct, the system simply reports how long it took. You can see the quadraticness of bubblsort perfectly illustrated.

If the sorting function doesn't produce correct results, you will see something like this:
               $ CC bsort.cpp testlib.o -o bsort
               $ bsort 1000
               Not correctly sorted: entry 0 is Wduanxmhad, should be Aabhzhrzqc
               Sorting failed
That is all you need to know. If you don't want to read the rest of the explanation, that's fine. I strongly suggest you copy and paste my bubble-sort sample from above and go through the steps of compiling and running it, so that you know you have got the basic framework working properly before working on your sorting algorithms, but apart from that, you can just get on with it.


Detailed explanation.

The prototype for main: void main(int argc, char * argv[])
     is a standard unix thing for getting information from the command line. If I ran the program by typing "bsort 100 5", then argc would be equal to three (the number of things), and argv, which is an array of old C strings would contain these three strings: "bsort", "100", and "5".

The second line of main, tester T(argc, argv, data);
     is obviously creating a tester object called T. The constructor for a tester receives everything that appeared on the command line, and generates the indicated number of strings, according to whatever other options are provided, stores them in your array or vector, and starts the timer.
     There are four different constructors for tester, each taking a different type of third parameter. You could use any of these four combinations:
              string * data;
              tester T(argc, argv, data);
     or
              vector <string> data;
              tester T(argc, argv, data);
     or
              char * * data;
              tester T(argc, argv, data);
     or
              vector <char *> data;
              tester T(argc, argv, data);
If you are using an array, the tester constructor creates the array for you. You should only provide it with an appropriate pointer variable as in the examples.

Because the amount of data isn't known until the tester constructor processes argc and argv, your program needs to ask: int length = T.get_number();

Once your sorting function returns control to main, T.finish();
     stops the timer. The tester then checks your results. It kept a copy of the original random strings, it now sorts them itself (using the standard qsort function), and checks that the results are an exact match. For a very large data set, this verification procedure adds a little bit to the overall run time from the program, but that is excluded from the time calculations. The time reported is only the CPU time actually used between the tester constructor ending and its finish method being called.


Options on the Command Line, by example.

bsort
          The default, if no information is provided on the command line, is to generate 1,000,000 random strings, each 10 characters long, in completely random order.
bsort 1000
          The first number on the command line changes the data size. This will only generate 1,000 strings, each 10 characters long.
bsort 1000 7
          The second number on the command line changes the string length. This will generate 1,000 strings, each 7 characters long.
bsort 1000 7 16
          The third number on the command line randomises the string length. This will generate 1,000 strings, each somewhere between 7 and 10 characters long.
bsort v 100
          The letter v (for view) should only be used when debugging with small data sets. In this case, the finish method will print all of the strings, both in the order your sort function produced, and in the correct sorted order, side by side.
          All options that begin with a letter may appear anywhere on the command line, and in any order.
bsort q
          The letter q (for quick) should only be used when you are sure your algorithm is correct. It makes tests with large data sets less annoying by abandoning the complete verification procedure. Instead of keeping its own copy of the data set and independently sorting it, the tester only checks that the data appears in ascending order after your sort. It will not notice if some of the original data is missing.
bsort 2000 s
          The letter s (for seed) when it appears alone, makes the system print out the seed for the random number generator when it first starts up. It will print a line that looks like this seed = 627432563, then continue exactly as normal. The point is that you will be able to reproduce exactly the same results at any time in the future be re-using this same seed value, as in:
bsort 2000 s627432563
          The letter s when immediately followed by a number (no space) uses that number to seed the random number generator. The same seed will always result in exactly the same sequence of "random" strings being produced.
          I will be using this feature when testing your programs, to ensure that everybody's preograms are tested on exactly the same inputs.
bsort 2000 p0.01
          The letter p (for permutations) is used to produce strings that are not in completely random order. The floating point number immediately following the p gives the "proportion of randomness". p1.0 means completely random (i.e. normal behaviour), p0.0 meand no randomness, the strings will be produced completely in order. p0.1 means that 90% of the list of strings will be in order, just 10% of them will be randomly jiggled about.