Assignment 5
Write a program which gives rapid access to a large amount of data.
You can find your own data sets if you want, or if you are feeling lazy
use the ones linked below.
You do not need to use a really huge data set for testing your program, but
your program must be scalable. It should work well and quickly for really
big data collections.
The data set you choose should of course be human-readable, and include
something that could be used as a key for searches in each data record.
For example, if you have a lot of information about people, a person's name
could be used as a key to search for all the data associated with that
name. On the other hand, if you just use a word list, things will not work
very well: once you have found the word, there is no additional information
to print.
There is no need for a fancy interface. Just a simple loop in which the
user types in a search key (word, name, etc), and the program responds by
printing all the information associated with that key (or complaining if
the key is not present in the data set).
Options for doing the assignment:
- Re-open and re-read the (on-disc) data file for each query, read
through the file until you find a matching record, and print it. This
option will get you very few points indeed, but not quite zero.
- Read the whole data file into memory when the program starts, then
perform a simple "brute force" search to answer each query. This option
will get you quite a few points.
- Read the whole data file into memory and sort it when the program
starts, then answer each query with a very fast binary chop search.
- Use a simple but slow (e.g. SelectionSort) sorting algorithm for
most of the
points.
- Use a fast but more complex (e.g. MergeSort) sorting algorithm for
ALL the points.
I recommend you first get a good working solution using a simple slow sort,
then if you still have time replace it with a fast sort.
- Extra credit if you allow the user to add new data interactively.
You must not destroy the efficiency of the searching operation, and must
save the new data when the program ends.
For convenience if you cant find a reasonable data set, I have provided a
big English dictionary. The dictionary is REALLY big (total of 13 MB),
so it is split into 26 files, one for each possible first letter. I suggest
you pick on one letter, download its file, and use that in all your
testing. There is no need to use all 26, and rabbit doesn't really have
enough memory for all 41 of you to be using >13MB at the same time.
Here are the files:
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
They are really just plain text files, but in standard (unix) format.
If you look at them on a PC, they will seem to be one giant line (unless
you use a cleverer than average viewing program). Don't worry about that.
Download them straight to disc, then ftp to your rabbit account as a binary
file.
The data format is very simple. Each word in each file occupies exactly
four separate lines always.
- The first line is always just a single + sign.
- The second line is the word being defined. It always consists of
nothing but lower case letters, and (in the case of phrases only) spaces.
- The third line shows the parts of speech, "n." for noun, "adv."
for adverb, etc. If more than one p.o.s. appears, they are separated by
vertical lines, e.g. n.|adv.
- The fourth line is the definition of the word. No matter how long the
definition is, it is all squeezed onto a single line.
Those four lines are repeated for every word in the dictionary. At the very
end, there is a single line that contains a minus sign -. This appears
exactly where the + would be if another definition were to follow.
Every line is exactly as
it appears to be, no strange invisible characters. At the end of each line,
there is a single \n (ascii code 10). No \r or anything else, which is why
PCs will be confused.