Sort Race
Implement Quick-sort, Merge-sort, and Heap-sort.
Optimise your implementations for speed. The point is to find out by experiment which is the fastest.
The data to be sorted will be randomly generated strings. It is your choice whether they should be
modern C++ strings, or traditional C strings (i.e. char *s), and it is your
choice whether they will be stored in an array or in a standard C++ vector.
Due dates:
- Sat. 14th Sept: Preliminary quicksort. It does not need to be perfect yet, but
it does need to be clear and readable. This is mainly so that I can be sure you are
doing it, and give you some fairly quick feed-back.
- T.B.A.: Quicksort working properly.
- Mon. 21st Oct: Mergesort working properly.
- Mon. 28th Oct: Heapsort working properly.
- T.B.A.: Last chance to turn in your specially optimised versions of all three
prior to the race.
You only have to write the sorting functions. I have provided the functions that generate the
random strings, verify that your function does sort correctly, and accurately measure its
execution time. These functions, described below, are provided as a C++ source file (testlib.cpp)
and the corresponding header file (testlib.h). You should not modify these files at all. You
don't even need to look inside them, although you are welcome to: they contain no secrets.
They are fully tested on rabbit only.
These are the files: testlib.cpp, testlib.h.
But rather than download them, it will probably be more convenient for you to just copy
them directly into your rabbit accounts. The command to do that is:
cp /home/www/testlib.* .
(note that the dot at the end is essential). Then, just once, compile the .cpp file with
this command:
CC -c testlib.cpp
and you will never have to worry about them ever again.
The use of this software is very simple, and best illustrated by example. As bubble-sort is not
an option for this assignment, I will use it for the example. First I write my bubble-sort function
in any way I like, and put it in a C++ source file with #include "testlib.h" at the
top, and the special main() as shown.
This is the contents of the file bsort.cpp:
#include "testlib.h"
#include <string>
using namespace std;
void swap(string & a, string & b)
{ string temp=a;
a=b;
b=temp; }
void bubblesort(string A[], int N)
{ for (int i=N; i>1; i-=1)
{ for (int j=0; j<i-1; j+=1)
{ if (A[j]>A[j+1])
{ swap(A[j], A[j+1]); } } } }
void main(int argc, char * argv[])
{ string * data;
tester T(argc, argv, data);
int length = T.get_number();
bubblesort(data, length);
T.finish(); }
As you can see, I chose to sort an array of modern C++ strings. The first line in
main: "string * data;" creates a variable called
array that can point to an array of strings. If I had wanted to use an array
of old-fashioned C strings, I would have typed "char * * data;"
instead. If I had wanted a vector of strings, it would have been
"vector <string> data;", and if I had wanted a vector
of old C strings it would be "vector <char *> data;".
No matter what I choose, the rest of main stays exactly the same.
Of course, it is up to you exactly what parameters your sorting function should
take.
To compile this program along with the testing functions, I would use this
command:
CC bsort.cpp testlib.o -o bsort
Here is a transcript of a short session in which I test it twice:
$ CC bsort.cpp testlib.o -o bsort
$ bsort 1000
1000 items sorted in 0.0760 seconds, nlogn=9.9658e+03, ratio=1.311e+05
$ bsort 10000
10000 items sorted in 7.7802 seconds, nlogn=1.3288e+05, ratio=1.708e+04
It is probably obvious what is going on. The number on the command line says how many random
strings should be produced. If your sorting function is correct, the system simply reports
how long it took. You can see the quadraticness of bubblsort perfectly illustrated.
If the sorting function doesn't produce correct results, you will see something like
this:
$ CC bsort.cpp testlib.o -o bsort
$ bsort 1000
Not correctly sorted: entry 0 is Wduanxmhad, should be Aabhzhrzqc
Sorting failed
That is all you need to know. If you don't want to read the rest of the explanation,
that's fine. I strongly suggest you copy and paste my bubble-sort sample from above and
go through the steps of compiling and running it, so that you know you have got the basic
framework working properly before working on your sorting algorithms, but apart from
that, you can just get on with it.
Detailed explanation.
The prototype for main: void main(int argc, char * argv[])
is a standard unix thing for getting information from the
command line. If I ran the program by typing "bsort 100 5", then argc
would be equal to three (the number of things), and argv, which is an
array of old C strings would contain these three strings: "bsort", "100", and "5".
The second line of main, tester T(argc, argv, data);
is obviously creating a tester object called T.
The constructor for a tester receives everything that appeared on the command
line, and generates the indicated number of strings, according to whatever other
options are provided, stores them in your array or vector, and starts the timer.
There are four different constructors for tester, each taking a different
type of third parameter. You could use any of these four combinations:
string * data;
tester T(argc, argv, data);
or
vector <string> data;
tester T(argc, argv, data);
or
char * * data;
tester T(argc, argv, data);
or
vector <char *> data;
tester T(argc, argv, data);
If you are using an array, the tester constructor creates the array for you. You should only
provide it with an appropriate pointer variable as in the examples.
Because the amount of data isn't known until the tester constructor processes argc and argv,
your program needs to ask: int length = T.get_number();
Once your sorting function returns control to main, T.finish();
stops the timer. The tester then checks your results. It kept a copy of
the original random strings, it now sorts them itself (using the standard qsort function), and
checks that the results are an exact match. For a very large data set, this verification procedure
adds a little bit to the overall run time from the program, but that is excluded from the time calculations.
The time reported is only the CPU time actually used between the tester constructor ending
and its finish method being called.
Options on the Command Line, by example.
bsort
The default, if no information is provided
on the command line, is to generate 1,000,000 random strings, each 10 characters long, in completely
random order.
bsort 1000
The first number on the command line
changes the data size. This will only generate 1,000 strings, each 10 characters long.
bsort 1000 7
The second number on the command line
changes the string length. This will generate 1,000 strings, each 7 characters long.
bsort 1000 7 16
The third number on the command line
randomises the string length. This will generate 1,000 strings, each somewhere between 7 and 10 characters long.
bsort v 100
The letter v (for view) should only
be used when debugging with small data sets. In this case, the finish method will print
all of the strings, both in the order your sort function produced, and in the correct
sorted order, side by side.
All options that begin with a letter may
appear anywhere on the command line, and in any order.
bsort q
The letter q (for quick) should only
be used when you are sure your algorithm is correct. It makes tests with large data sets less
annoying by abandoning the complete verification procedure. Instead of keeping its own copy
of the data set and independently sorting it, the tester only checks that the data
appears in ascending order after your sort. It will not notice if some of the original data
is missing.
bsort 2000 s
The letter s (for seed)
when it appears alone, makes the system print out the seed for the random number generator
when it first starts up. It will print a line that looks like this seed = 627432563,
then continue exactly as normal. The point is that you will be able to reproduce exactly the
same results at any time in the future be re-using this same seed value, as in:
bsort 2000 s627432563
The letter s when
immediately followed by a number (no space) uses that number to seed the random number
generator. The same seed will always result in exactly the same sequence of "random" strings
being produced.
I will be using this feature
when testing your programs, to ensure that everybody's preograms are tested on exactly
the same inputs.
bsort 2000 p0.01
The letter p (for
permutations) is used to produce strings that are not in completely random order. The
floating point number immediately following the p gives the "proportion of randomness".
p1.0 means completely random (i.e. normal behaviour), p0.0
meand no randomness, the strings will be produced completely in order. p0.1
means that 90% of the list of strings will be in order, just 10% of them will be
randomly jiggled about.
Judgement
The overall champion sorter will be the one that is fastest for really big data sets
that are completely random, but the fastest sorter of almost-sorted strings will
also be smiled upon.