Goldfish Club
We have been asked by the G.O.P. (Goldfish Operators of Pennsylvania)
to computerise their membership records. At the moment, they just have a
big book, in which they write the details of all their members and their
goldfish. Each entry in the book occupies a complete line, and always
contains the same information in the same order:
- Membership number (e.g. 27364)
- Member's title (Mr, Mrs, Miss, Ms, Uncle, etc.),
- Member's Last name (Bloggs, Smith, Jones, etc.),
- Member's First name (Sally, Hubert, Binkie, etc.)
- Member's Street address (e.g. 123a Ant St),
- Member's City (e.g. Antville),
- Member's State (e.g. AL),
- Member's Zip code (e.g. 10203),
- Member's Telephone number (e.g. 321-455-3838),
- Goldfish's Name (Goldie, Orangey, Fluffy, etc.),
- Goldfish's Species (Orange, Yellow, etc.),
- Goldfish's Birthday (e.g. 23rd June),
- Goldfish's Year of birth (e.g. 1932),
- Goldfish's Status (Dead, Alive),
- Goldfish's Favourite colour (Orange, Yellow, Gold, Blue, etc.).
A quick inspection of the big book reveals that there are about
10,000 entries, and the length of the entries (the number of
characters on a line) varies between 35 and 175.
As a first, quick and easy implementation, we might decide to
store the data as a giant array of 15,000 strings (to allow for future
expansion), each holding 35 to 175 characters. We realise there
would be a lot of waste this way, but it'll let us put together
a working demonstration very quickly, and ensure we win the
contract.
string database[15000];
would create the storage we need. It would occupy a lot of
memory, but not so much that a modern computer would have any trouble.
(Note: Older PC operating systems, and older compilers for PCs, were
incapable of handling any data object more than 65536 bytes long.
We are concerned with correct programming, not conforming to
short-sighted commercial design decisions. Of course, in the
so-called real world you have to be aware of the restrictions
imposed by your customers' hardware. If the G.O.P. are using such
old stuff, perhaps we can make an additional profit by selling them
a new computer too.)
A better plan would be to define a constant right at the beginning of the
program, and use it everywhere instead of writing the number itself:
const int MAX_NUM_ENTRIES=1000;
string database[MAX_NUM_ENTRIES];
Then, any time we need to change the size, all we have to
do is change that one place where the definition is made, and recompile
the program. You won't have to search out all the places where the
number 1000 appears.
The next task would be to define a function that lets us read some
real data into that array. For the purposes of initial testing, we
would write a function that reads entries typed on the keyboard, but
we would be careful to ensure that it will be very easy to convert
the function to read from a file instead.
This is probably
what would be added:
const int MAX_LINE_LENGTH = 1000;
int num_entries = 0;
void read_data()
{ num_entries=0;
while (1)
{ if (num_entries>=MAX_NUM_ENTRIES)
{ cerr << "** File Too Long! Buy The Upgrade!!! **\n";
exit(1); }
getline(cin, database[num_entries]);
if (cin.eof()) break;
num_entries+=1; }
cout << "[" << num_entries << " entries read]\n"; }
Take it one thing at a time:
- cin >> s; (when s is a string variable) is not a good way of reading a line
from a file. Strings read that way end at the first space, so if any of the data
items (names, addresses, etc.) could have spaces in them, this way of reading will not work.
Instead we have to use a rather ungainly function called getline. getline always reads
a whole line, regardless of any spaces it may contain. The first argument is the input stream
to read a line from, and the second is a string to store the line in.
- It would be very bad to read the 1001st line of a file if we have only made space for 1000
strings. It is common, but not at all nice, to just exit a program when this sort of problem
occurs. To use the exit function, you should #include <stdlib.h> at the
beginning of the file.
- database is an array of strings; database[6] refers to string number 6 in that
array.
- How can you tell when the input is finished? When reading from the terminal it is OK to insist
that the last line contains some special string, such as "The End", then the end can be detected
with a simple test: if (database[num_entries]=="The End"), but that isn't
always satisfactory, especially when reading from a disc file. It is much better to be able
to detect the real end of the file. cin.eof() becomes true once a read operation has run into
the end of the file. (If the very last line has no newline after it, eof() is already true when
that line is read, so the loop above would not count it; make sure the data ends with a newline.)
When reading from the terminal, end-of-file is simulated by typing control-D.
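The reading loop can also be written so that the result of getline itself decides when to stop. The following is only a minimal sketch of that idea: the name read_lines and the use of std::vector are illustrative assumptions, not part of the program above. Because it takes an istream parameter, the same function can be handed cin, a file, or a string for testing.

```cpp
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read up to max_entries lines from any input stream.
// Stops at the end of the input or when the limit is reached.
std::vector<std::string> read_lines(std::istream &in, std::size_t max_entries)
{
    std::vector<std::string> lines;
    std::string line;
    // getline returns the stream, which tests false once a read fails,
    // so a last line without a trailing newline is still kept.
    while (lines.size() < max_entries && std::getline(in, line))
        lines.push_back(line);
    return lines;
}
```

Calling read_lines(cin, MAX_NUM_ENTRIES) would read from the keyboard, and a std::istringstream can stand in for the keyboard while testing.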
To modify the function so that it reads data from a file instead of the terminal is not difficult.
The modified function would look like this:
const int MAX_LINE_LENGTH = 1000;
int num_entries = 0;
void read_data(string filename)
{ ifstream file(filename.c_str());
if (file.fail())
{ cerr << "** Can't read the file '" << filename << "'\n";
exit(1); }
num_entries=0;
while (1)
{ if (num_entries>=MAX_NUM_ENTRIES)
{ cerr << "** File Too Long! Buy The Upgrade!!! **\n";
exit(1); }
getline(file, database[num_entries]);
if (file.eof()) break;
num_entries+=1; }
cout << "[" << num_entries << " entries read]\n"; }
In this version, the name of the data file is supplied as a string parameter, but there are some very annoying tricks:
- ifstream is a standard C++ type, and so is string. Even so, before the C++11 standard the ifstream
constructor did not understand file names presented to it as C++ strings. This is absolutely absurd,
but it was the official standard for C++. When the ifstream constructor is used to open a file, the
file's name has to be provided as an old plain C string. Fortunately C++ strings have a special method
called c_str() which produces an equivalent plain-C string. (From C++11 onwards the constructor also
accepts a C++ string directly, and c_str() still works.)
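- If for some reason it is not possible to open the specified file, file.fail() becomes
true, so you can easily test for an error condition. (file.bad() is not the right test: it reports
only unrecoverable I/O errors, and stays false when a file simply cannot be opened.)
- getline and eof work on C++ input files in the same way as on cin.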
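As a small illustration of opening a file and checking whether that worked, here is a minimal sketch. The function name count_lines and its return convention (-1 for "could not open") are assumptions made for this example only. is_open() is a standard ifstream member; testing file.fail() straight after construction would serve equally well.

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Hypothetical helper: count the lines in a named file.
// Returns -1 if the file cannot be opened.
int count_lines(const std::string &filename)
{
    std::ifstream file(filename.c_str());
    if (!file.is_open())          // the open failed: no such file, no permission, ...
        return -1;
    int count = 0;
    std::string line;
    while (std::getline(file, line))
        count += 1;
    return count;
}
```

Returning a special value instead of calling exit leaves the decision about what to do next to the caller.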
After all that, we would probably provide a function for printing out the whole database, even
if it is only so that we can check that it was read correctly. That is extremely easy:
void print_data()
{ cout << "[" << num_entries << " entries]\n";
for (int i=0; i<num_entries; i+=1)
cout << i << ": '" << database[i] << "'\n"; }
A note:
- I print quotes around the strings just for safety. It is very easy
for stray spaces to creep into strings that are read from files, and if they are at the beginning
or the end of a string, you can't see them. Making quotes appear around a printed string
just makes it possible to see exactly what's there and what isn't. If you do this experiment,
you will see the value of that safety check.
A complete program to test those two functions would be very easy to write.
Its main function contains a useful trick:
int main(int argc, char *argv[])
{ string progname=string(argv[0]);
string filename="";
for (int i=1; i<argc; i+=1) /* scan the command line arguments */
{ string thisarg=string(argv[i]);
if (thisarg == "-h") /* -h means want help */
{ cout << "To use this program, enter the command:\n";
cout << " " << progname << " filename\n";
cout << "where filename is the name of the database-containing text file, or:\n";
exit(0); }
else
filename=thisarg; } /* anything else must be the file name. */
/* if no filename provided, the variable filename will still */
/* be empty. read_data is built to detect that and use stdin.*/
read_data(filename);
print_data(); }
It is unreasonable to have to build fixed file names into programs, and sometimes annoying to
have to write an interactive program that asks the user the name of the data file every time it is run.
Often, it is much more convenient to provide simple inputs on the command line, so that if you
normally run the program by typing "a.out" or "fishclub", you could instead run
it by typing "a.out filename" or "fishclub filename", and somehow the
program would be able to see the file name and make use of it.
That is what this special main is doing.
If you declare main with two parameters, the first an int, and the second an array of char*'s
(as above), you automatically have access to everything that appeared on the command line. Unfortunately,
the command line is given to your program as an array of plain-C strings, not C++ strings, but conversion
is easy. That's why the second parameter is an array of "char *": char* is the
most common way of describing an old-fashioned plain-C string.
The
first argument is simply the number of things on the command line (argc = ARGument Count), and the
second argument is an array of those values (argv = ARGument Values). There is always at least one
thing on the command line, and that is the command itself. If you run the program by typing
"a.out filename", then argc will be 2. argv[0] will be "a.out", and argv[1]
will be "filename". Sometimes argv[0] is useful in printing out error messages that tell
the user what he/she should have typed to run the program.
It is traditional in unixy systems to make programs
be able to explain themselves. Typically typing "a.out -h" would ask the program "a.out"
not to run normally, but to simply print out a little bit of help, then exit. Very often, programs
will scan through their arguments to see if the string "-h" appears anywhere before
starting to run properly.
The main function shown above converts argv[0]
to a proper C++ string using the string constructor in the normal way, then runs through all the
rest of the arguments (if there are any) by saying for (i=1; i<argc; ...).
Each argument is converted to a normal C++ string and inspected. "-h" is treated as a
request for help, anything else is assumed to be a filename. If no filename was provided,
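the loop will terminate with the variable filename still containing the empty string "".
(With read_data as shown earlier, this would of course cause an error when it tries to open that file;
the comment in main assumes a version of read_data that checks for an empty name and reads from cin instead.)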
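The argument scan can be pulled out into a function of its own, which makes it easy to try out with made-up command lines. This is only a sketch: the name scan_arguments and the want_help flag are assumptions for illustration, and unlike the main above it reports the "-h" request to the caller instead of printing the help and exiting.

```cpp
#include <string>

// Hypothetical helper: scan the arguments the way main does above.
// Sets want_help if "-h" appears; returns the last other argument,
// or "" if no file name was given.
std::string scan_arguments(int argc, const char *const argv[], bool &want_help)
{
    want_help = false;
    std::string filename = "";
    for (int i = 1; i < argc; i += 1)     // argv[0] is the command itself
    {
        std::string thisarg = argv[i];    // plain-C string to C++ string
        if (thisarg == "-h")
            want_help = true;
        else
            filename = thisarg;
    }
    return filename;
}
```

From main it would be called as scan_arguments(argc, argv, help), where help is a bool variable declared in main.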
So...
Our program is now able to read a whole lot of lines of text from a file, and then show us all
those lines afterwards. Not very impressive. A useful database program would be able to answer queries about
the data, and as it stands, that would be quite difficult. We have not even considered the data format,
so have no chance of being able to process it properly.
We
know that each line of data describes one club member, providing 15 independent pieces of information.
- Membership number (e.g. 27364)
- Member's title (Mr, Mrs, Miss, Ms, Uncle, etc.),
- Member's Last name (Bloggs, Smith, Jones, etc.),
- Member's First name (Sally, Hubert, Binkie, etc.)
- Member's Street address (e.g. 123a Ant St),
- Member's City (e.g. Antville),
- Member's State (e.g. AL),
- Member's Zip code (e.g. 10203),
- Member's Telephone number (e.g. 321-455-3838),
- Goldfish's Name (Goldie, Orangey, Fluffy, etc.),
- Goldfish's Species (Orange, Yellow, etc.),
- Goldfish's Birthday (e.g. 23rd June),
- Goldfish's Year of birth (e.g. 1932),
- Goldfish's Status (Dead, Alive),
- Goldfish's Favourite colour (Orange, Yellow, Gold, Blue, etc.).
It is absolutely necessary that we should be able to separate those 15 items from each other. The usual method is to
pick a character that can never possibly appear in any of those items, and use it as a separator. A colon
':' or the vertical bar '|' are common choices. With this plan, a sample of a couple of lines from the data file
might look like this:
13523|Mrs|Spuggins|Mary|1234 Ant Street|Abracadabra|GA|27653|123-453-3123|Goldie|Goldfish|26 July|1999|Alive|Green
78123|Mr|Drab|Bub|73739 SW 353 St|Blammo|WA|93485|417-343-7667|Arthur|Haddock|29 February|1804|Dead|Grey
Queries made on the database are likely to be based on the values of these items (e.g. "list all fish born
in 1999", "Find member number 78123", etc.), and although it is not terribly difficult, separating the individual
items out from the big string does take some time. Therefore it would seem sensible to separate the big strings
into their component parts just once, when they are first read in, and then store all the components separately. Then
it will be much easier and much faster to search for particular items.
For
this scheme, instead of having one big array of strings called database, we would expect to have 15
arrays of smaller strings:
string membernum[MAX_NUM_ENTRIES];
string title[MAX_NUM_ENTRIES];
string lastname[MAX_NUM_ENTRIES];
string firstname[MAX_NUM_ENTRIES];
string street[MAX_NUM_ENTRIES];
string city[MAX_NUM_ENTRIES];
string state[MAX_NUM_ENTRIES];
string zipcode[MAX_NUM_ENTRIES];
string phone[MAX_NUM_ENTRIES];
string fishname[MAX_NUM_ENTRIES];
string fishspecies[MAX_NUM_ENTRIES];
string fishbirthday[MAX_NUM_ENTRIES];
string fishbirthyear[MAX_NUM_ENTRIES];
string fishstatus[MAX_NUM_ENTRIES];
string fishfavcol[MAX_NUM_ENTRIES];
Then, after reading each line, we need to find where all the vertical bars are, and use string's substring
method to split the string into its 15 parts (substr's two int parameters are the starting position
within the original string, and the number of characters wanted):
void read_data()
{ num_entries=0;
while (1)
{ if (num_entries >= MAX_NUM_ENTRIES)
{ cerr << "** File Too Long! Buy The Upgrade!!! **\n";
exit(1); }
string input;
getline(cin, input);
if (cin.eof()) break;
const int MAX_BARS = 20;
int bar_position[MAX_BARS];
int num_bars=0;
int line_length = input.length();
for (int i=0; i<line_length; i+=1)
{ if (input[i] == '|')
{ if (num_bars < MAX_BARS) /* never write past the end of bar_position */
bar_position[num_bars] = i;
num_bars+=1; } }
if (num_bars != 14)
{ cerr << "Line has incorrect format:\n " << input << "\n";
continue; }
/* Each item runs from just after one bar to just before the next, */
/* so its length is the distance between the two bars minus one. */
membernum[num_entries] = input.substr( 0, bar_position[0] );
title[num_entries] = input.substr( bar_position[0]+1, bar_position[1]-bar_position[0]-1 );
lastname[num_entries] = input.substr( bar_position[1]+1, bar_position[2]-bar_position[1]-1 );
firstname[num_entries] = input.substr( bar_position[2]+1, bar_position[3]-bar_position[2]-1 );
street[num_entries] = input.substr( bar_position[3]+1, bar_position[4]-bar_position[3]-1 );
city[num_entries] = input.substr( bar_position[4]+1, bar_position[5]-bar_position[4]-1 );
state[num_entries] = input.substr( bar_position[5]+1, bar_position[6]-bar_position[5]-1 );
zipcode[num_entries] = input.substr( bar_position[6]+1, bar_position[7]-bar_position[6]-1 );
phone[num_entries] = input.substr( bar_position[7]+1, bar_position[8]-bar_position[7]-1 );
fishname[num_entries] = input.substr( bar_position[8]+1, bar_position[9]-bar_position[8]-1 );
fishspecies[num_entries] = input.substr( bar_position[9]+1, bar_position[10]-bar_position[9]-1 );
fishbirthday[num_entries] = input.substr( bar_position[10]+1, bar_position[11]-bar_position[10]-1 );
fishbirthyear[num_entries] = input.substr( bar_position[11]+1, bar_position[12]-bar_position[11]-1 );
fishstatus[num_entries] = input.substr( bar_position[12]+1, bar_position[13]-bar_position[12]-1 );
fishfavcol[num_entries] = input.substr( bar_position[13]+1, line_length-bar_position[13]-1 );
num_entries+=1; }
cout << "[" << num_entries << " entries read]\n"; }
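The search for bars and the fifteen substr calls follow one pattern, so they could also be written as a single loop. The sketch below is one possible way of doing that, not the method used in the program above: the name split_fields and the use of std::vector are assumptions for illustration. It relies on string's find method to locate each separator and on substr(start, count) to copy the characters before it.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split a line into the pieces between separators.
// A line with n separators always yields n+1 pieces (some may be empty).
std::vector<std::string> split_fields(const std::string &line, char separator)
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true)
    {
        std::string::size_type bar = line.find(separator, start);
        if (bar == std::string::npos)
        {
            fields.push_back(line.substr(start));   // last piece: rest of the line
            break;
        }
        // the count (bar - start) stops just before the separator itself
        fields.push_back(line.substr(start, bar - start));
        start = bar + 1;
    }
    return fields;
}
```

With this helper, a line is correctly formatted when split_fields(input, '|').size() is 15, and item number k of the line is simply element k of the result.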
Yes, the program has grown quite a lot longer, but there isn't much more real programming. 15 of the lines
are almost identical, and don't require much thought. Even if the program is now bigger, it is much more
useful. We could easily write a function that searches for a particular record. For example:
int findRecordForMemberNamed(string fname, string lname)
{ for (int i=0; i<num_entries; i+=1)
if (lastname[i]==lname && firstname[i]==fname)
return i;
return -1; }
....
void DoQuery()
{ string fname, lname;
cout << "Enter name of member to be found: ";
cin >> fname >> lname;
int p = findRecordForMemberNamed(fname, lname);
if (p == -1)
cout << "No such member found\n";
else
cout << "Member #" << membernum[p] << " of " << city[p] << ", " << state[p] << "\n"; }
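A query such as "list all fish born in 1999" follows the same pattern: one loop over the records, one comparison per record. The following is a minimal sketch; the function name listFishBornIn is an assumption, and the arrays are passed in as parameters only so that the example stands alone (in the program above they would be the global fishname and fishbirthyear arrays together with num_entries).

```cpp
#include <iostream>
#include <string>

// Hypothetical helper: print the name of every fish whose birth-year item
// equals `year`, and return how many were found.
// Years are compared as strings because that is how they are stored.
int listFishBornIn(const std::string fishname[], const std::string fishbirthyear[],
                   int num_entries, const std::string &year, std::ostream &out)
{
    int found = 0;
    for (int i = 0; i < num_entries; i += 1)
        if (fishbirthyear[i] == year)
        {
            out << fishname[i] << "\n";
            found += 1;
        }
    return found;
}
```

Unlike findRecordForMemberNamed, this loop does not stop at the first match, because any number of fish may share a birth year.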