Goldfish Club

We have been asked by the G.O.P. (Goldfish Operators of Pennsylvania) to computerise their membership records. At the moment, they just have a big book, in which they write the details of all their members and their goldfish. Each entry in the book occupies a complete line, and always contains the same information in the same order:
  1. Membership number (e.g. 27364)
  2. Member's title (Mr, Mrs, Miss, Ms, Uncle, etc.),
  3. Member's Last name (Bloggs, Smith, Jones, etc.),
  4. Member's First name (Sally, Hubert, Binkie, etc.)
  5. Member's Street address (e.g. 123a Ant St),
  6. Member's City (e.g. Antville),
  7. Member's State (e.g. AL),
  8. Member's Zip code (e.g. 10203),
  9. Member's Telephone number (e.g. 321-455-3838),
  10. Goldfish's Name (Goldie, Orangey, Fluffy, etc.),
  11. Goldfish's Species (Orange, Yellow, etc.),
  12. Goldfish's Birthday (e.g. 23rd June),
  13. Goldfish's Year of birth (e.g. 1932),
  14. Goldfish's Status (Dead, Alive),
  15. Goldfish's Favourite colour (Orange, Yellow, Gold, Blue, etc.).
A quick inspection of the big book reveals that there are about 10,000 entries, and the length of the entries (the number of characters on a line) varies between 35 and 175.

As a first, quick and easy implementation, we might decide to store the data as a giant array of 15,000 strings each of 200 characters (to allow for future expansion). We realise there would be a lot of waste this way, but it'll let us put together a working demonstration very quickly, and ensure we win the contract.
          char database[15000][200];
would create the storage we need. It would occupy 3000000 bytes of memory, but there aren't many computers these days that haven't got 3MB to spare.
(Note: Older PC operating systems, and older compilers for PCs, were incapable of handling any data object more than 65536 bytes long. We are concerned with correct C programming, not conforming to short-sighted commercial design decisions. Of course, in the so-called real world you have to be aware of the restrictions imposed by your customers' hardware. If the G.O.P. are using such old stuff, perhaps we can make an additional profit by selling them a new computer too.)
If you do this experiment, make the array smaller. Rabbit has a lot of memory, but there are also a lot of you sharing it. A better plan would be to define two constants right at the beginning of the program, and use them everywhere:
          #define MAX_NUM_ENTRIES 1000
          #define MAX_LINE_LENGTH 100

          char database[MAX_NUM_ENTRIES][MAX_LINE_LENGTH];
Then, any time we need to change one of the sizes, all we have to do is change that one place where the definition is made, and recompile the program. You won't have to search out all the places where the number 1000 appears.
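To see that everything really does follow from those two definitions, here is a minimal sketch. The helper names database_bytes and entry_bytes are made up for this illustration; they simply report how much storage the compiler reserved.

```c
#include <stddef.h>

#define MAX_NUM_ENTRIES 1000
#define MAX_LINE_LENGTH 100

char database[MAX_NUM_ENTRIES][MAX_LINE_LENGTH];

/* Hypothetical helper: total storage reserved for the database, in bytes. */
size_t database_bytes(void)
{ return sizeof database; }

/* Hypothetical helper: storage reserved for a single entry, in bytes. */
size_t entry_bytes(void)
{ return sizeof database[0]; }
```

With the sizes above, the whole array takes 1000 x 100 = 100000 bytes. Change either #define, recompile, and both functions report the new sizes without any other edits.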

The next task would be to define a function that lets us read some real data into that array. For the purposes of initial testing, we would write a function that reads entries typed on the keyboard, but we would be careful to ensure that it will be very easy to convert the function to read from a file instead.
        This is probably what would be added:
          int num_entries=0;

          void read_data(void)
          { FILE *f;
            f=stdin;
            while (1)
            { char *t;
              int len;
              if (num_entries>=MAX_NUM_ENTRIES)
              { printf("** File Too Long! Buy The Upgrade!!! **\n");
                exit(1); }
              t=fgets(database[num_entries], MAX_LINE_LENGTH, f);
              if (t==NULL) break;
              len=strlen(database[num_entries]);
              if (len>0 && database[num_entries][len-1]=='\n')
                database[num_entries][len-1]=0;
              num_entries+=1; }
            printf("[%d entries read]\n", num_entries); }
Take it one thing at a time:
  1. stdin and stdout are variables of type FILE * always available to C programs. They are input and output files initially connected to your keyboard and screen. If you are writing a program that is supposed to take interactive input while you are developing it, but later become file based, it is a good idea to use stdin. That way you use all the normal file operations right from the beginning, and only need a tiny change to start using real files instead.
  2. It would be very bad to read the 1001st line of a file if we have only made space for 1000 strings. It is common, but not at all nice, to just exit a program when this sort of problem occurs. To use the exit function, you should #include <stdlib.h> at the beginning of the file (just as printf and fgets need <stdio.h>, and strlen needs <string.h>).
  3. database is an array of strings; database[6] refers to string number 6 in that array. It is a pointer to the beginning of that string.
  4. fgets reads a whole line of text from a file. If the line is longer than the specified maximum size, the end is left unread until next time.
  5. As well as reading the characters into the string you provide as a parameter, fgets also returns a pointer to that string. Normally that is useless, but when you reach the end of the file (or type control-D on your keyboard), it returns NULL instead. This gives us a very convenient way of detecting end-of-file.
  6. The more familiar function gets just reads a line from your keyboard, but it has no way of knowing how big your array is, so it can run off the end of it; it was removed from the C standard in 2011, and fgets is the safe replacement. fgets reads a line from any file, but it also leaves the newline character '\n' at the end of the string it gives you. We don't want it, it isn't part of the real data, so we remove it. At least it is easy to find; it must be the very last character in the string. The library function strlen returns the length of a string. It counts all the characters, even the invisible ones, so if you type C-A-T-ENTER, the string will be "CAT\n", and its length will be 4. (Remember '\n' represents a single character, not two.) Strings are arrays, and array indexes start from zero, so string[0] is 'C', string[1] is 'A', string[2] is 'T', and string[3] is '\n'. Whenever a whole line was read, the newline is at string[len-1], so setting string[len-1]=0 will remove it by making it look like the end of the string. (The newline is missing only when the line was too long to fit, or when the file ends without one.)
After all that, we would probably provide a function for printing out the whole database, even if it's only so that we can check it was read correctly. That is extremely easy:
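The newline handling in point 6 can be packaged as a small function. This is only a sketch: the name chop_newline is invented here, and it is a little more cautious than the text, because it looks before it chops, in case fgets stopped early and there is no newline to remove.

```c
#include <string.h>

/* Hypothetical helper: remove one trailing '\n' if there is one.
   Returns the length of the string afterwards. */
int chop_newline(char *s)
{ int len = strlen(s);
  if (len > 0 && s[len-1] == '\n')   /* only chop a real newline        */
  { s[len-1] = 0;                    /* the string now ends one earlier */
    len -= 1; }
  return len; }
```

Given "CAT\n" it leaves "CAT" and returns 3; given "CAT" or an empty string it changes nothing.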
          void print_data(void)
          { int i;
            printf("[%d entries]\n", num_entries);
            for (i=0; i<num_entries; i+=1)
            { printf("%6d: \"%s\"\n", i, database[i]); } }
A few notes:
  1. The format %6d is exactly the same as the familiar %d, except that it says always use at least six characters for the number even if they aren't all needed. This way all the output is nicely lined up in columns.
  2. I print quotes around the strings, by saying \"%s\", just for safety. It is very easy for stray spaces to creep into strings that are read from files, and if they are at the beginning or the end of a string, you can't see them. Making quotes appear around a printed string just makes it possible to see exactly what's there and what isn't.
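To check what those two formatting tricks really produce, the line can be built in a string instead of being printed. A minimal sketch, assuming a made-up helper called format_entry; snprintf is the standard function that works like printf but writes into an array of a stated size.

```c
#include <stdio.h>

/* Hypothetical helper: build one line of print_data's output in buf
   (without the final newline). Returns what snprintf returns. */
int format_entry(char *buf, size_t size, int i, const char *s)
{ return snprintf(buf, size, "%6d: \"%s\"", i, s); }
```

Entry 7 containing " Goldie " comes out as five spaces, then 7: " Goldie ", so the stray spaces inside the quotes are plain to see. A number with more than six digits is not cut short; it just makes that line wider.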


Once we have done a few tests on this program, we would get really fed up with typing all the input each time, so converting it to read from a file would be a high priority. It would also be very easy. In fact, only two little changes are needed.
  1. Replace f=stdin; by this:
              f=fopen("filename","r");
              if (f==NULL)
              { printf("Can't read file \"filename\"\n");
                exit(1); }
    
  2. And at the end, just after the loop, but before printing the number of entries read, add: fclose(f);.
Then the program will behave exactly as before, except that it will read all its input from a file called "filename". The second parameter to fopen, "r", specifies that the file will only be used for reading. This prevents your program from accidentally overwriting data if something goes wrong. If you say "w" instead, a totally new file will be created (wiping out any existing file of that name), and opened for writing only. fopen returns a "File pointer", an object of the same type as stdin, so you can use it in exactly the same ways. If fopen fails, it returns NULL instead.
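The same open, check, use, close pattern works for any file-reading job. As a small illustration, here is a sketch of a function that counts the lines in a file. The name count_lines and the choice of returning -1 on failure are ours, not part of the goldfish program; fgetc is the standard function that reads a single character.

```c
#include <stdio.h>

/* Hypothetical helper: count the lines in a file.
   Returns -1 if the file can't be opened for reading. */
int count_lines(const char *filename)
{ FILE *f;
  int c, last = '\n', lines = 0;
  f = fopen(filename, "r");          /* "r": reading only               */
  if (f == NULL)
  { return -1; }                     /* fopen failed                    */
  while ((c = fgetc(f)) != EOF)
  { if (c == '\n') lines += 1;
    last = c; }
  if (last != '\n') lines += 1;      /* a final line with no newline    */
  fclose(f);
  return lines; }
```

Notice that the only way to find the number of lines is to read the whole file.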

It is very annoying to have filenames built into programs like that. If you ever want to test it on a different file, you must edit and recompile the whole program. You could make the program ask the user which file to read, with something like this:
          FILE *f;
          char filename[100];
          int len;
          printf("File to read: ");
          if (fgets(filename, 99, stdin)==NULL)
          { exit(1); }                   /* no input at all */
          len=strlen(filename);
          if (len>0 && filename[len-1]=='\n')
            filename[len-1]=0;           /* fopen must not see the \n that fgets keeps */
          f=fopen(filename, "r");
          if (f==NULL)
          { printf("Can't read \"%s\"\n", filename);
            exit(1); }
which isn't too bad, or we could do something much better and more professional looking. When you use a serious program, it doesn't ask you a lot of questions, instead you pre-provide the answers on the command line. For example, with the compiler you say "cc prog.c"; you don't just type "cc" and then have the compiler ask "what program would you like to compile?". It is easy to make a normal program behave that way:
        Normally, we declare the main function like this: int main(void), which seems to be the only sensible way. How could the program itself have parameters? Well, the filenames and options typed on the command line could be treated as parameters to the whole program, and if you want to receive them, you have to declare main like this instead:
          int main(int argc, char *argv[])
So main has two arguments, which you will immediately recognise as an integer and an array of strings. The integer is the number of things that appeared on the command line; the array of strings simply contains all those things. It really does contain everything that appeared on the command line, including the command itself.
        If argc is 1, that means the only string in the array is the command, and there were no extra parameters supplied.
        If argc is 2, that means that argv[0] will be the command, and argv[1] its one and only parameter.
        If argc is 3, that means that argv[0] will be the command, and argv[1] and argv[2] are its two parameters. You get the picture. Using all this, it would be very easy to make main grab the filename off the command line, and then call our function read_data to do the work; it would probably also be sensible to modify read_data slightly so that it accepts a filename as a parameter.
        This sort of design allows for a lot of flexibility in real use and in debugging. We could make it so that the program reads from your keyboard if you don't provide a filename, or reads from a file if you do provide one. We can even add some of the standard command line options that Unix programs always seem to need. Traditionally, if you don't know what parameters a program expects, you can run it with the "-h" option (H for Help) by typing progname -h, and it will tell you.
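One possible way to arrange that is sketched below. It is not the version in the completed program; the function name choose_input and the wording of the usage message are invented for this illustration. It looks at argc and argv and hands back the FILE * that the rest of the program should read from.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: decide where the input comes from.
   Returns stdin when no parameter was given, an open file when a
   filename was given, and NULL when there is nothing to read
   (help was requested, or the file could not be opened). */
FILE *choose_input(int argc, char *argv[])
{ FILE *f;
  if (argc < 2)                          /* only the command itself     */
  { return stdin; }
  if (strcmp(argv[1], "-h") == 0)        /* H for Help                  */
  { printf("usage: %s [filename]\n", argv[0]);
    return NULL; }
  f = fopen(argv[1], "r");
  if (f == NULL)
  { printf("Can't read \"%s\"\n", argv[1]); }
  return f; }
```

main would call choose_input(argc, argv), stop if the result is NULL, and otherwise pass the result on to the reading function. Because stdin and a file opened by fopen have the same type, the reading code never needs to know which one it was given.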

To see the whole program as it stands after all those changes, follow this link. You can download the file as it is, compile it and run it if you want to try it out. If you save the file as "goldfish.c", use this command to compile it "cc goldfish.c -o gf", then you will have a nice convenient little command "gf" that you can type to run it.

(A little aside: Many compilers today, our one included, sometimes treat programmers like children, giving warning messages about things that are perfectly correct, but that they think are suspicious. To avoid that, you could add an extra option to the command when compiling: "cc -w goldfish.c -o gf" explicitly turns off all warning messages. That is dangerous. Once in a while, a warning is about something that really is wrong. Instead, the best plan is to put a prototype for every function that your program defines before its first use. That's why prototypes for read_data and print_data appear near the top of the completed program.)


Now that we have a nicely working program, it would be good to make it more efficient and flexible. Fixing the maximum line length at 100 characters (or whatever it was) could be a little annoying when The Honourable Mrs. Penelope Flumptonby-Smuggins wants to register her pedigree Siamese goldfish Princess Henrietta Fluffy Snookums III. Fixing all the lines at 100 characters is extremely wasteful when most will really be much shorter.

To make each line be exactly the size it needs to be, no more and no less, we need to use string pointers (char *) instead of pre-allocated strings (char [50]). So, the declaration of the database would be changed to:
          char *database[MAX_NUM_ENTRIES];
and we would be able to remove the now-useless definition of MAX_LINE_LENGTH. Also, reading the string has become a little more complex because besides reading the string we must also allocate the memory for it to live in. So we would be sensible to write a special dynamic-string-reading function that can replace the call to fgets in the read_data function:
          char *read_line(FILE *f)   /* we tell it to read a line from a particular input file,  */
                                     /* it reads a line, and returns a new string containing it. */
          { char temp[500];                  /* a temporary place to keep the string   */
            char *result, *ok;
            int len;
            ok=fgets(temp, 499, f);          /* read a line into temporary place       */
            if (ok==NULL)
            { return(NULL); }                /* we want to return NULL for EOF too.    */
            len=strlen(temp);
            if (len>0 && temp[len-1]=='\n')
            { len-=1;
              temp[len]=0; }                 /* remove the \n from the end, if any     */
            result=malloc(len+1);            /* just the right size: len chars + the 0 */
            if (result==NULL)
            { printf("** Out of memory! **\n");
              exit(1); }
            strcpy(result,temp);             /* copy the line into it                  */
            return (result); }
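Inside the function read_data, the line t=fgets(database[num_entries],MAX_LINE_LENGTH,f); would be replaced by t=database[num_entries]=read_line(f);, and the lines after the NULL test that measure the string and chop off its newline would be deleted, because read_line has already done that job. Because C sees very few differences between arrays and pointers, no other changes would be required.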
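The size passed to malloc is the easiest part of this to get wrong. It has to cover every character that will be copied plus the zero that ends the string, which is one more than strlen reports for the string being copied. A minimal sketch of just that step, using an invented name copy_string:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a newly allocated copy of s that is
   exactly the right size, or NULL if no memory is available.
   The caller should free() the copy when it is finished with it. */
char *copy_string(const char *s)
{ char *result = malloc(strlen(s) + 1);   /* +1 for the terminating 0 */
  if (result == NULL)
  { return NULL; }
  strcpy(result, s);
  return result; }
```

A six-letter name such as "Goldie" therefore occupies seven bytes, not 100.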


A much more serious defect is the fixed number of lines that may be stored in the database. For large files it just doesn't work, and for small ones it is very inefficient. If we knew in advance how many lines the database would have, there would be a reasonable solution, but that information isn't available. By looking at a file, you can find out its total length in bytes, but not the number of lines it contains. So we'll have to leave this serious problem for a little later.


Another problem, and the one that we will examine next, is the difficulty involved in extracting useful information from the data. Suppose we need to find all the club members who live in Antville. We would have to reprocess every line in the whole database, searching through each one to extract the city part of the address. All useful database operations involve the actual pieces of data in the line rather than the whole line itself, so this annoying task would be carried out very frequently.
        We need some way to split a line into its component parts once and for all, and then keep those components separately, so any part of a data record can be found instantly at any time. Your Second Homework Assignment is to write a string splitting function that would do the most important part of that job. We'll pretend you've already written it, and see how to build it into the program.
        It might be reasonable to keep a separate array of strings for all the different pieces of information, so instead of having one array of giant strings, we would have 15 arrays of smaller strings:
        char *mem_num[MAX_NUM_ENTRIES];
        char *title[MAX_NUM_ENTRIES];
        char *last_name[MAX_NUM_ENTRIES];
        char *first_name[MAX_NUM_ENTRIES];
        char *street_ad[MAX_NUM_ENTRIES];
        char *city[MAX_NUM_ENTRIES];
        char *state[MAX_NUM_ENTRIES];
        char *zip_code[MAX_NUM_ENTRIES];
        char *tel_num[MAX_NUM_ENTRIES];
        char *fish_name[MAX_NUM_ENTRIES];
        char *fish_species[MAX_NUM_ENTRIES];
        char *fish_birthday[MAX_NUM_ENTRIES];
        char *fish_birthyear[MAX_NUM_ENTRIES];
        char *fish_status[MAX_NUM_ENTRIES];
        char *fish_fav_col[MAX_NUM_ENTRIES];
so that last_name[67] would be the last name of the member described on line 67 of the big book; city[67] would be the city that he, she, or it lives in; fish_name[67] would be his, her, or its goldfish's name, and so on and so on.
        Handling all these strings isn't as hard as you might imagine, it just takes a bit more typing. Remember how split is supposed to work: you give it a string and an array of strings; it splits the string into parts and puts them in the array for you to use as you wish. It returns as its result the number of parts it made. We could quite easily modify read_data to make use of it. In fact, it would make a lot of sense to have our special read_line function do the splitting, so that it never has to allocate memory for the whole line.
        So read_line would be given a FILE to read from, just as before. It would read a line from that file, just as before. But now, it would split the line into its components (making your split do the work), putting the components into an array of string pointers that we would pass into it. It could also return as its result the number of components found (or something like -1 that really stands out, to indicate end-of-file).
          int read_line(FILE *f, char *parts[])
          { char temp[500];                  /* a temporary place to keep the string      */
            char *ok;
            int len, numparts;
            ok=fgets(temp, 499, f);          /* read a line into temporary place          */
            if (ok==NULL)
            { return(-1); }                  /* we want to return -1 for EOF indication   */
            len=strlen(temp);
            if (len>0 && temp[len-1]=='\n')
            { temp[len-1]=0; }               /* remove the \n from the end, if any        */
            numparts=split(temp, parts);     /* make split do the real work               */
            return (numparts); }
Of course, read_data has to be modified to take advantage of this. It must supply an empty array of string pointers for read_line to fill, and move the components from that array into their correct places in the database. It's actually not so difficult to understand; the following just shows the loop inside read_data, as nothing else changes:
  while (1)
  { char *parts[50];                 /* extra space in case of mistakes in the input file */
    int num_parts;
    if (num_entries>=MAX_NUM_ENTRIES)
    { printf("** File Too Long! Buy The Upgrade!!! **\n");
      exit(1); }
    num_parts=read_line(f, parts);
    if (num_parts==-1) break;            /*  -1 parts signals end of file  */
    if (num_parts!=15)
    { printf("Error in file, wrong number of parts\n");
      continue; }
    mem_num[num_entries]=parts[0];
    title[num_entries]=parts[1];
    last_name[num_entries]=parts[2];
    first_name[num_entries]=parts[3];
    street_ad[num_entries]=parts[4];
    city[num_entries]=parts[5];
    state[num_entries]=parts[6];
    zip_code[num_entries]=parts[7];
    tel_num[num_entries]=parts[8];
    fish_name[num_entries]=parts[9];
    fish_species[num_entries]=parts[10];
    fish_birthday[num_entries]=parts[11];
    fish_birthyear[num_entries]=parts[12];
    fish_status[num_entries]=parts[13];
    fish_fav_col[num_entries]=parts[14];
    num_entries+=1; }
Copying the parts into the database is done with a simple assignment, not a strcpy. The 15 database arrays are just arrays of pointers; they have no memory allocated for strings, so there would be nowhere for the strings to be strcpy'ed to. To be able to do its job at all, split has to find memory for the components that it splits up to live in, so we might as well continue to use that same memory. We just make the database entries point to the strings that split created.
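Here is a tiny sketch of what that assignment does, cut down to a single database array. The function name store_city is invented for the illustration.

```c
#include <string.h>

#define MAX_NUM_ENTRIES 1000

char *city[MAX_NUM_ENTRIES];

/* Hypothetical helper: store a part by pointing at it.
   No characters are copied; only the address is remembered. */
void store_city(int entry, char *part)
{ city[entry] = part; }
```

After store_city(0, parts[5]), city[0] and parts[5] are the same address, so the parts array can safely be reused for the next line: changing parts[5] later changes where parts[5] points, not where city[0] points. It also shows why split must find fresh memory for each component. If it handed back pointers into read_line's temp array, the database would be left pointing at memory that stops being valid as soon as read_line returns.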

Of course, print_data would also have to be changed, so that it would print out each of the fifteen little strings on each line it produces, but there's nothing to that.

The end result is that the program is a little more complex (but not much: it looks more complex because it is bigger, but large parts of it are very repetitious), and a lot more flexible. For instance, to find out all the members who live in Antville, we could now have a very simple and efficient little loop:
          printf("List of members who live in Antville:\n");
          for (i=0; i<num_entries; i+=1)
            if (strcmp(city[i],"Antville")==0)
              printf("%s %s %s\n", title[i], first_name[i], last_name[i]);
And it's hard to imagine anything being much simpler than that.
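The same loop works for any city if it is wrapped in a function. The sketch below is one way that might look; the name list_members_in and the idea of returning a count are our additions, and only the four arrays the loop needs are declared.

```c
#include <stdio.h>
#include <string.h>

#define MAX_NUM_ENTRIES 1000

char *title[MAX_NUM_ENTRIES];
char *first_name[MAX_NUM_ENTRIES];
char *last_name[MAX_NUM_ENTRIES];
char *city[MAX_NUM_ENTRIES];
int num_entries=0;

/* Hypothetical helper: print every member who lives in the wanted
   city, and return how many were found. */
int list_members_in(const char *wanted)
{ int i, found=0;
  printf("List of members who live in %s:\n", wanted);
  for (i=0; i<num_entries; i+=1)
    if (strcmp(city[i], wanted)==0)
    { printf("%s %s %s\n", title[i], first_name[i], last_name[i]);
      found+=1; }
  return found; }
```

Calling list_members_in("Antville") prints the same list as the loop above, and the returned count makes it easy to say "none found" when the list is empty.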