Assignment 5

Write a program which gives rapid access to a large amount of data. You can find your own data sets if you want, or if you are feeling lazy use the ones linked below.

You do not need to use a really huge data set for testing your program, but your program must be scalable. It should work well and quickly for really big data collections.

The data set you choose should of course be human-readable, and include something that could be used as a key for searches in each data record. For example, if you have a lot of information about people, a person's name could be used as a key to search for all the data associated with that name. On the other hand, if you just use a word list, things will not work very well: once you have found the word, there is no additional information to print.

There is no need for a fancy interface. Just a simple loop in which the user types in a search key (word, name, etc), and the program responds by printing all the information associated with that key (or complaining if the key is not present in the data set).

Options for doing the assignment:
  1. Re-open and re-read the (on-disc) data file for each query, read through the file until you find a matching record, and print it. This option will get you very few points indeed, but not quite zero.
  2. Read the whole data file into memory when the program starts, then perform a simple "brute force" search to answer each query. This option will get you quite a few points.
  3. Read the whole data file into memory and sort it when the program starts, then answer each query with a very fast binary chop search.
    1. Use a simple but slow (e.g. SelectionSort) sorting algorithm for most of the points.
    2. Use a fast but more complex (e.g. MergeSort) sorting algorithm for ALL the points.
    I recommend you first get a good working solution using a simple slow sort, then if you still have time replace it with a fast sort.
  4. Extra credit if you allow the user to add new data interactively. You must not destroy the efficiency of the searching operation, and must save the new data when the program ends.



For convenience if you cant find a reasonable data set, I have provided a big English dictionary. The dictionary is REALLY big (total of 13 MB), so it is split into 26 files, one for each possible first letter. I suggest you pick on one letter, download its file, and use that in all your testing. There is no need to use all 26, and rabbit doesn't really have enough memory for all 41 of you to be using >13MB at the same time.

Here are the files: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

They are really just plain text files, but in standard (unix) format. If you look at them on a PC, they will seem to be one giant line (unless you use a cleverer than average viewing program). Don't worry about that. Download them straight to disc, then ftp to your rabbit account as a binary file.

The data format is very simple. Each word in each file occupies exactly four separate lines always.
  1. The first line is always just a single + sign.
  2. The second line is the word being defined. It always consists of nothing but lower case letters, and (in the case of phrases only) spaces.
  3. The third line shows the parts of speech, "n." for noun, "adv." for adverb, etc. If more than one p.o.s. appears, they are separated by vertical lines, e.g. n.|adv.
  4. The fourth line is the definition of the word. No matter how long the definition is, it is all squeezed onto a single line.
Those four lines are repeated for every word in the dictionary. At the very end, there is a single line that contains a minus sign -. This appears exactly where the + would be if another definition were to follow.
       Every line is exactly as it appears to be, no strange invisible characters. At the end of each line, there is a single \n (ascii code 10). No \r or anything else, which is why PCs will be confused.