Goldfish Club
We have been asked by the G.O.P. (Goldfish Operators of Pennsylvania)
to computerise their membership records. At the moment, they just have a
big book, in which they write the details of all their members and their
goldfish. Each entry in the book occupies a complete line, and always
contains the same information in the same order:
- Membership number (e.g. 27364)
- Member's title (Mr, Mrs, Miss, Ms, Uncle, etc.),
- Member's Last name (Bloggs, Smith, Jones, etc.),
- Member's First name (Sally, Hubert, Binkie, etc.)
- Member's Street address (e.g. 123a Ant St),
- Member's City (e.g. Antville),
- Member's State (e.g. AL),
- Member's Zip code (e.g. 10203),
- Member's Telephone number (e.g. 321-455-3838),
- Goldfish's Name (Goldie, Orangey, Fluffy, etc.),
- Goldfish's Species (Orange, Yellow, etc.),
- Goldfish's Birthday (e.g. 23rd June),
- Goldfish's Year of birth (e.g. 1932),
- Goldfish's Status (Dead, Alive),
- Goldfish's Favourite colour (Orange, Yellow, Gold, Blue, etc.).
A quick inspection of the big book reveals that there are about
10,000 entries, and the length of the entries (the number of
characters on a line) varies between 35 and 175.
As a first, quick and easy implementation, we might decide to
store the data as a giant array of 15,000 strings each of 200
characters (to allow for future expansion). We realise there
would be a lot of waste this way, but it'll let us put together
a working demonstration very quickly, and ensure we win the
contract.
char database[15000][200];
would create the storage we need. It would occupy 3000000 bytes of
memory, but there aren't many computers these days that haven't got
3MB to spare.
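The arithmetic is easy to confirm with sizeof, because a char is by definition exactly one byte. This is only a small sketch: database_bytes is a name made up for the illustration, not part of the program.

```c
#include <stddef.h>

/* The quick-and-easy storage: 15000 lines of up to 200 characters each.
   It is declared outside any function, so it is not created on the stack. */
char database[15000][200];

/* database_bytes is a hypothetical helper, here only to show the arithmetic:
   15000 * 200 * sizeof(char) = 3000000, because sizeof(char) is always 1. */
size_t database_bytes(void)
{ return sizeof(database); }
```

Calling database_bytes() gives 3000000, and sizeof(database[0]), the size of a single line, gives 200.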
(Note: Older PC operating systems, and older compilers for PCs, were
incapable of handling any data object more than 65536 bytes long.
We are concerned with correct C programming, not conforming to
short-sighted commercial design decisions. Of course, in the
so-called real world you have to be aware of the restrictions
imposed by your customers' hardware. If the G.O.P. are using such
old stuff, perhaps we can make an additional profit by selling them
a new computer too.)
If you do this experiment, make the array smaller. Rabbit has
a lot of memory, but there are also a lot of you sharing it. A better
plan would be to define two constants right at the beginning of the
program, and use them everywhere:
#define MAX_NUM_ENTRIES 1000
#define MAX_LINE_LENGTH 100
char database[MAX_NUM_ENTRIES][MAX_LINE_LENGTH];
Then, any time we need to change one of the sizes, all we have to
do is change that one place where the definition is made, and recompile
the program. You won't have to search out all the places where the
number 1000 appears.
The next task would be to define a function that lets us read some
real data into that array. For the purposes of initial testing, we
would write a function that reads entries typed on the keyboard, but
we would be careful to ensure that it will be very easy to convert
the function to read from a file instead.
This is probably
what would be added:
int num_entries=0;
void read_data(void)
{ FILE *f;
  f=stdin;
  while (1)
     { char *t;
       int len;
       if (num_entries>=MAX_NUM_ENTRIES)
          { printf("** File Too Long! Buy The Upgrade!!! **\n");
            exit(1); }
       t=fgets(database[num_entries], MAX_LINE_LENGTH, f);
       if (t==NULL) break;
       len=strlen(database[num_entries]);
       if (len>0 && database[num_entries][len-1]=='\n')
          database[num_entries][len-1]=0;     /* remove the \n, if there is one */
       num_entries+=1; }
  printf("[%d entries read]\n", num_entries); }
Take it one thing at a time:
- stdin and stdout are variables of type FILE * always
available to C programs. They are input and output files initially connected to your
keyboard and screen. If you are writing a program that is supposed to take interactive input
while you are developing it, but later become file based, it is a good idea to use stdin.
That way you use all the normal file operations right from the beginning, and only need a tiny change
to start using real files instead.
- It would be very bad to read the 1001st line of a file if we have only made space for 1000
strings. It is common, but not at all nice, to just exit a program when this sort of problem
occurs. To use the exit function, you should #include <stdlib.h> at the
beginning of the file.
- database is an array of strings; database[6] refers to string number 6 in that
array, and when passed to a function it acts as a pointer to the beginning of that string.
- fgets reads a whole line of text from a file. If the line is longer than the specified
maximum size, the end is left unread until next time.
- As well as reading the characters into the string you provide as a parameter, fgets
also returns a pointer to that string. Normally that is useless, but when you reach the end
of the file (or type control-D on your keyboard), it returns NULL instead. This gives us a very
convenient way of detecting end-of-file.
- The older function gets just reads a line from your keyboard, but it has no way of knowing
how big your string is, so it is unsafe (it was removed from the C standard in 2011). fgets
reads a line from any file, and never stores more characters than you allow, but it also leaves
the newline character '\n' at the end of the string
it gives you. We don't want it, it isn't part of the real data, so we remove it. At least it
is easy to find; if it is there at all, it must be the very last character in the string. The library function
strlen returns the length of a string. It counts all the characters, even the invisible
ones, so if you type C-A-T-ENTER, the string will be "CAT\n", and its length will be
4. (Remember '\n' represents a single character, not two.) Strings are arrays, and array
indexes start from zero, so string[0] is 'C', string[1] is 'A',
string[2] is 'T', and string[3] is '\n'.
Whenever a complete line has been read, the newline is at string[len-1], so setting string[len-1]=0 will remove
it by making it look like the end of the string. If the line was too long to fit, or the last
line of the file has no newline, string[len-1] is real data, so it is safest to check that it
really is '\n' before overwriting it.
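The newline-removal step can be wrapped up in a small function of its own. This is only a sketch, and strip_newline is a made-up name, not a library function. It looks at the last character before overwriting it, so a line that arrives without a newline (the last line of a file, for instance) loses nothing.

```c
#include <string.h>

/* strip_newline is a hypothetical helper, not part of the C library.
   It removes one trailing '\n' if there is one, and returns the
   length of the string afterwards. */
size_t strip_newline(char *s)
{ size_t len = strlen(s);
  if (len > 0 && s[len-1] == '\n')
     { len -= 1;
       s[len] = 0; }          /* the string now ends where the \n used to be */
  return len; }
```

Given "CAT\n" it leaves "CAT" and returns 3; given "CAT" it changes nothing and also returns 3.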
After all that, we would probably provide a function for printing out the whole database, even
if it's only so that we can check it was read correctly. That is extremely easy:
void print_data(void)
{ int i;
  printf("[%d entries]\n", num_entries);
  for (i=0; i<num_entries; i+=1)
     { printf("%6d: \"%s\"\n", i, database[i]); } }
A few notes:
- The format %6d is exactly the same as the familiar %d, except that
it says always use six characters for the number even if they aren't all needed. This way
all the output is nicely lined up in columns.
- I print quotes around the strings, by saying \"%s\", just for safety. It is very easy
for stray spaces to creep into strings that are read from files, and if they are at the beginning
or the end of a string, you can't see them. Making quotes appear around a printed string
just makes it possible to see exactly what's there and what isn't.
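To see exactly what that format produces, here is a small sketch that builds one output line in a buffer instead of printing it. format_entry is a made-up name for the illustration; snprintf is the standard library function that works like printf but writes into a string.

```c
#include <stdio.h>

/* format_entry is a hypothetical helper: it builds the same text that
   print_data prints for one entry (without the final newline), but puts
   it in the buffer out, which has room for size characters.
   Like snprintf, it returns the number of characters the line needs. */
int format_entry(char *out, size_t size, int index, const char *text)
{ return snprintf(out, size, "%6d: \"%s\"", index, text); }
```

With index 7 and the string " cat " (a space at each end), the buffer ends up holding five spaces, then 7: " cat ". The number is padded out to six characters, and the stray spaces inside the quotes are plain to see.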
Once we have done a few tests on this program, we would get really fed up with typing
all the input each time, so converting it to read from a file would be a high priority.
It would also be very easy. In fact, only two little changes are needed.
- Replace f=stdin; by this:
f=fopen("filename","r");
if (f==NULL)
   { printf("Can't read file \"filename\"\n");
     exit(1); }
- And at the end, just after the loop, but before printing the number of entries read,
add: fclose(f);.
Then the program will behave exactly as before, except that it will read all its input
from a file called "filename". The second parameter to fopen: "r"
specifies that the file will only be used for reading. This prevents your program from
accidentally overwriting data if something goes wrong. If you say "w" instead, a totally
new file will be created (wiping out any existing file of the same name), and opened for writing only. fopen returns a "File pointer",
an object of the same type as stdin, so you can use it in exactly the same ways.
If fopen fails, it returns NULL instead.
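The open, check, use, close pattern is the same whatever the program does with the file. As a minimal sketch, here is a function that counts the lines in a named file. count_lines is a made-up name, and returning -1 for "could not open" is just a convention chosen for this example.

```c
#include <stdio.h>

/* count_lines is a hypothetical helper: it opens the named file for
   reading only, counts the '\n' characters in it, and closes it again.
   It returns -1 if the file cannot be opened. */
long count_lines(const char *filename)
{ FILE *f;
  long count = 0;
  int c;
  f = fopen(filename, "r");
  if (f == NULL)
     { return -1; }                 /* fopen failed: nothing to close */
  while ((c = fgetc(f)) != EOF)     /* read one character at a time */
     { if (c == '\n') count += 1; }
  fclose(f);
  return count; }
```

Unlike read_data, this version hands the failure back to its caller instead of calling exit, which leaves the caller free to decide what to do about it.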
It is very annoying to have filenames built into programs like that. If you ever want to
test it on a different file, you must edit and recompile the whole program. You could make
the program ask the user which file to read, with something like this:
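FILE *f;
char filename[100];
int len;
printf("File to read: ");
if (fgets(filename, sizeof(filename), stdin)==NULL)
   { printf("No filename given\n");
     exit(1); }
len=strlen(filename);
if (len>0 && filename[len-1]=='\n')
   filename[len-1]=0;       /* fgets leaves the \n on the end; fopen would treat it as part of the name */
f=fopen(filename, "r");
if (f==NULL)
   { printf("Can't read \"%s\"\n", filename);
     exit(1); }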
which isn't too bad, or we could do something much better and more professional looking. When
you use a serious program, it doesn't ask you a lot of questions; instead you pre-provide
the answers on the command line. For example, with the compiler you say "cc prog.c";
you don't just type "cc" and then have the compiler ask "what program would
you like to compile?". It is easy to make a normal program behave that way:
Normally, we declare the main function
like this: int main(void), which seems to be the only sensible way (the C standard
requires main to return an int, so void main is not correct C). How could
the program itself have parameters? Well, the filenames and options typed on the command line
could be treated as parameters to the whole program, and if you want to receive them, you have to
declare main like this instead:
int main(int argc, char *argv[])
So main has two arguments, which you will immediately recognise as an integer and an
array of strings. The integer is the number of things that appeared on the command line; the array
of strings simply contains all those things. It really does contain everything that appeared
on the command line, including the command itself.
If argc is 1, that means the only
string in the array is the command, and there were no extra parameters supplied.
If argc is 2, that means that
argv[0] will be the command, and argv[1] its one and only parameter.
If argc is 3, that means that
argv[0] will be the command, and argv[1] and argv[2] are its two
parameters. You get the picture.
Using all this, it would be very easy to make main grab the filename off the
command line, and then call our function read_data to do the work; it would probably
also be sensible to modify read_data slightly so that it accepts a filename as a
parameter.
This sort of design allows for a lot
of flexibility in real use and in debugging. We could make it so that the program
read from your keyboard if you don't provide a filename, or read from a file if you do provide one.
We can even add some of the standard command line options that unix programs always seem
to need. Traditionally, if you don't know what parameters a program expects, you can run it
with the "-h" option (by typing progname -h), (H for Help) and it
will tell you.
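The decision main has to make can be sketched as a small function that inspects argc and argv. Everything named here (parse_args and the three constants) is invented for the illustration; the real program may well organise this differently.

```c
#include <string.h>

#define USE_KEYBOARD 0      /* no parameter given: read from stdin */
#define USE_FILE     1      /* a filename was given */
#define SHOW_HELP    2      /* the -h option was given */

/* parse_args is a hypothetical helper that main could call with its own
   argc and argv. It reports what the user asked for, and if a filename
   was given, makes *filename point at it (otherwise *filename is NULL). */
int parse_args(int argc, char *argv[], char **filename)
{ *filename = NULL;
  if (argc < 2)                       /* only the command itself: argv[0] */
     { return USE_KEYBOARD; }
  if (strcmp(argv[1], "-h") == 0)     /* first parameter is the help option */
     { return SHOW_HELP; }
  *filename = argv[1];                /* anything else is taken as a filename */
  return USE_FILE; }
```

main would then set f=stdin for USE_KEYBOARD, pass the filename to fopen for USE_FILE, and print a short description of the expected parameters for SHOW_HELP.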
To see the whole program as it stands after all those changes, follow this link.
You can download the file as it is, compile it and run it if you want to try it out. If you
save the file as "goldfish.c", use this command to compile it
"cc goldfish.c -o gf", then you will have a nice
convenient little command "gf" that you can type to run it.
(A little aside: Many compilers today, our one included, sometimes treat programmers like
children, giving warning messages about things that are perfectly correct, but they think
are suspicious. To avoid that, you could add an extra option to the command when compiling:
"cc -w goldfish.c -o gf" explicitly turns off all warning
messages. That is dangerous. Once in a while, a warning is about something that really is
wrong. Instead, the best plan is to put a prototype for every function that your program
defines before its first use. That's why prototypes for read_data and print_data
appear near the top of the completed program.)
Now that we have a nicely working program, it would be nice to make it more efficient
and flexible. Fixing the maximum line length at 100 characters (or whatever it was) could
be a little annoying when The Honourable Mrs. Penelope Flumptonby-Smuggins wants to register
her pedigree siamese goldfish Princess Henrietta Fluffy Snookums III. Fixing all the lines
at 100 characters is extremely wasteful when most will really be much shorter.
To make each line be exactly the size it needs to be, no more and no less, we need to use
string pointers (char *) instead of pre-allocated strings (char [100]).
So, the declaration of the database would be changed to:
char *database[MAX_NUM_ENTRIES];
and we would be able to remove the now-useless definition of MAX_LINE_LENGTH. Also,
reading the string has become a little more complex because besides reading the string we
must also allocate the memory for it to live in. So we would be sensible to write a
special dynamic-string-reading function that can replace the call to fgets in
the read_data function:
char *read_line(FILE *f)        /* we tell it to read a line from a particular input file, */
                                /* it reads a line, and returns a new string containing it. */
{ char temp[500];               /* a temporary place to keep the string */
  char *result, *ok;
  int len;
  ok=fgets(temp, 499, f);       /* read a line into temporary place */
  if (ok==NULL)
     { return(NULL); }          /* we want to return NULL for EOF too. */
  len=strlen(temp);
  if (len>0 && temp[len-1]=='\n')
     { len-=1;
       temp[len]=0; }           /* remove the \n from the end, if there is one */
  result=malloc(len+1);         /* new string just the right size: the characters plus the final 0 */
  if (result==NULL)
     { printf("** Out of memory **\n");
       exit(1); }
  strcpy(result,temp);          /* copy the line into it */
  return (result); }
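Inside the function read_data, the line
t=fgets(database[num_entries],MAX_LINE_LENGTH,f); would be replaced by
t=read_line(f);, and once t has been checked against NULL, database[num_entries]=t; stores
the new string in the database. The two lines that removed the newline are no longer needed,
because read_line has already done that. Because C sees very few differences between arrays and pointers,
nothing else, not even print_data, needs to change.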
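read_line still relies on a 500-character temporary array, so a long enough line would still be cut into pieces. One possible way around that, sketched here under the assumption that we are happy to use realloc, is to read the line in chunks and double the buffer each time it fills up. read_long_line is a made-up name, and this is an alternative technique, not the version used in the rest of these notes. The caller owns the returned string and should free it when finished.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* read_long_line is a hypothetical replacement for read_line.
   It returns a newly allocated string holding one complete line without
   its '\n', however long the line is, or NULL at end of file (or if
   memory runs out). */
char *read_long_line(FILE *f)
{ size_t size = 16, used = 0;
  char *line = malloc(size), *bigger;
  if (line == NULL)
     { return NULL; }
  line[0] = 0;
  while (fgets(line + used, (int)(size - used), f) != NULL)
     { used += strlen(line + used);
       if (used > 0 && line[used-1] == '\n')
          { line[used-1] = 0;              /* complete line: drop the \n and stop */
            return line; }
       bigger = realloc(line, size * 2);   /* no \n yet: double the space and carry on */
       if (bigger == NULL)
          { free(line);
            return NULL; }
       line = bigger;
       size *= 2; }
  if (used == 0)                           /* nothing was read at all: end of file */
     { free(line);
       return NULL; }
  return line; }                           /* last line of the file had no \n */
```

The starting size of 16 is deliberately tiny so that the doubling is exercised even by short test files; a real program would probably start larger.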
A much more serious defect is the fixed number of lines that may be stored in the database.
For large files it just doesn't work, and for small ones it is very inefficient. If we knew
in advance how many lines the database would have, there would be a reasonable solution, but
that information isn't available. By looking at a file, you can find out its total length in bytes,
but not the number of lines it contains. So we'll have to leave this serious problem for
a little later.
Another problem, and the one that we will examine next, is the difficulty involved in extracting
useful information from the data. Suppose we need to find all the club members who live in
Antville. We would have to reprocess every line in the whole database, searching through
each one to extract the city part of the address. All useful database operations involve
the actual pieces of data in the line rather than the whole line itself, so this annoying
task would be carried out very frequently.
We need some way to split a line into its
component parts once and for all, and then keep those components separately, so any
part of a data record can be found instantly at any time. Your Second
Homework Assignment is to write a string splitting function that would do the most
important part of that job. We'll pretend you've already written it, and see how to
build it into the program.
It might be reasonable to keep a separate array
of strings for all the different pieces of information, so instead of having one array of
MAX_NUM_ENTRIES giant strings, we would have 15 arrays of smaller strings:
char *mem_num[MAX_NUM_ENTRIES];
char *title[MAX_NUM_ENTRIES];
char *last_name[MAX_NUM_ENTRIES];
char *first_name[MAX_NUM_ENTRIES];
char *street_ad[MAX_NUM_ENTRIES];
char *city[MAX_NUM_ENTRIES];
char *state[MAX_NUM_ENTRIES];
char *zip_code[MAX_NUM_ENTRIES];
char *tel_num[MAX_NUM_ENTRIES];
char *fish_name[MAX_NUM_ENTRIES];
char *fish_species[MAX_NUM_ENTRIES];
char *fish_birthday[MAX_NUM_ENTRIES];
char *fish_birthyear[MAX_NUM_ENTRIES];
char *fish_status[MAX_NUM_ENTRIES];
char *fish_fav_col[MAX_NUM_ENTRIES];
so that last_name[67] would be the last name of the member described on line 67
of the big book; city[67] would be the city that he, she, or it lives in,
fish_name[67] would be his, her, or its goldfish's name, and so on and so on.
Handling all these strings isn't as hard
as you might imagine, it just takes a bit more typing. Remember how split is
supposed to work: you give it a string and an array of strings; it splits the string into parts
and puts them in the array for you to use as you wish. It returns as its result the number
of parts it made. We could quite easily modify read_data to make use of it.
In fact, it would make a lot of sense to have our special read_line function
do the splitting, so that it never has to allocate memory for the whole line.
So read_line would be given a
FILE to read from, just as before. It would read a line from that file, just as before.
But now, it would split the line into its components (making your split do the
work), putting the components into an array of string pointers that we would pass into it.
It could also return as its result the number of components found (or something like -1 that really
stands out, to indicate end-of-file).
int read_line(FILE *f, char *parts[])
{ char temp[500];               /* a temporary place to keep the string */
  char *ok;
  int len, numparts;
  ok=fgets(temp, 499, f);       /* read a line into temporary place */
  if (ok==NULL)
     { return(-1); }            /* we want to return -1 for EOF indication */
  len=strlen(temp);
  if (len>0 && temp[len-1]=='\n')
     temp[len-1]=0;             /* remove the \n from the end, if there is one */
  numparts=split(temp, parts);  /* make split do the real work */
  return (numparts); }
Of course, read_data has to be modified to take advantage of this. It must supply an
empty array of string pointers for read_line to fill, and move the components from
that array into their correct places in the database. It's actually not so difficult to understand.
The following just shows the loop inside read_data, as nothing else changes:
while (1)
   { char *parts[50];             /* extra space in case of mistakes in the input file */
     int num_parts;
     if (num_entries>=MAX_NUM_ENTRIES)
        { printf("** File Too Long! Buy The Upgrade!!! **\n");
          exit(1); }
     num_parts=read_line(f, parts);
     if (num_parts==-1) break;    /* -1 parts signals end of file */
     if (num_parts!=15)
        { printf("Error in file, wrong number of parts\n");
          continue; }             /* skip the bad line and go on to the next one */
     mem_num[num_entries]=parts[0];
     title[num_entries]=parts[1];
     last_name[num_entries]=parts[2];
     first_name[num_entries]=parts[3];
     street_ad[num_entries]=parts[4];
     city[num_entries]=parts[5];
     state[num_entries]=parts[6];
     zip_code[num_entries]=parts[7];
     tel_num[num_entries]=parts[8];
     fish_name[num_entries]=parts[9];
     fish_species[num_entries]=parts[10];
     fish_birthday[num_entries]=parts[11];
     fish_birthyear[num_entries]=parts[12];
     fish_status[num_entries]=parts[13];
     fish_fav_col[num_entries]=parts[14];
     num_entries+=1; }
Copying the parts into the database is done with a simple assignment, not a strcpy.
The 15 database arrays are just arrays of pointers; they have no memory allocated for strings,
so there would be nowhere for the strings to be strcpy'ed to. To be able to do its
job at all, split has to find memory for the components that it splits up to live in,
so we might as well continue to use that same memory. We just make the database entries point to
the strings that split created.
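The difference between assigning a pointer and copying a string can be seen in miniature. In this sketch, make_part is a made-up stand-in for what split is assumed to do for each component (find memory and copy the text into it), and demo_city is a cut-down version of the city array; neither is part of the real program.

```c
#include <stdlib.h>
#include <string.h>

#define DEMO_ENTRIES 4
char *demo_city[DEMO_ENTRIES];      /* a miniature stand-in for the city array */

/* make_part is a hypothetical stand-in for what split does for each
   component: it finds memory for the text and copies the text into it. */
char *make_part(const char *text)
{ char *p = malloc(strlen(text) + 1);
  if (p != NULL)
     { strcpy(p, text); }
  return p; }

/* Storing a part is a pointer assignment: no characters are copied, the
   database entry simply points at the memory that already holds them. */
void store_city(int entry, char *part)
{ demo_city[entry] = part; }
```

After store_city(2, p), demo_city[2] and p are the very same pointer, so there is one string in memory and two ways of reaching it. Freeing it through either name makes the other one invalid too.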
Of course, print_data would also have to be changed, so that it would print out
each of the fifteen little strings on each line it produces, but there's nothing to that.
The end result is that the program is a little more complex (but not much. It looks more
complex because it is bigger, but large parts of it are very repetitious), and a lot more
flexible. For instance, to find out all the members who live in Antville, we could now have
a very simple and efficient little loop:
printf("List of members who live in Antville:\n");
for (i=0; i<num_entries; i+=1)
   if (strcmp(city[i],"Antville")==0)
      printf("%s %s %s\n", title[i], first_name[i], last_name[i]);
And it's hard to imagine anything being much simpler than that.
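The same loop works for any of the fifteen arrays, so it could be turned into a general-purpose search. This is a sketch only: find_matches and its parameters are invented for the illustration, and it collects the matching positions instead of printing them, so the caller can decide what to do with each one.

```c
#include <string.h>

/* find_matches is a hypothetical helper. field is any one of the database
   arrays (city, state, fish_name, ...), and n is the number of entries in
   use. The position of every entry whose field exactly equals value is
   stored in found[], up to max_found of them. The result is the total
   number of matches, which may be more than max_found. */
int find_matches(char *field[], int n, const char *value, int found[], int max_found)
{ int i, count = 0;
  for (i = 0; i < n; i += 1)
     { if (strcmp(field[i], value) == 0)
          { if (count < max_found)
               { found[count] = i; }
            count += 1; } }
  return count; }
```

A call such as find_matches(city, num_entries, "Antville", found, 100) would fill found with the positions of the Antville members, ready to be used as indexes into title, first_name and last_name. Note that strcmp is case-sensitive, so "antville" would not match.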