Goldfish Club

We have been asked by the G.O.P. (Goldfish Operators of Pennsylvania) to computerise their membership records. At the moment, they just have a big book, in which they write the details of all their members and their goldfish. Each entry in the book occupies a complete line, and always contains the same information in the same order:
  1. Membership number (e.g. 27364)
  2. Member's title (Mr, Mrs, Miss, Ms, Uncle, etc.),
  3. Member's Last name (Bloggs, Smith, Jones, etc.),
  4. Member's First name (Sally, Hubert, Binkie, etc.)
  5. Member's Street address (e.g. 123a Ant St),
  6. Member's City (e.g. Antville),
  7. Member's State (e.g. AL),
  8. Member's Zip code (e.g. 10203),
  9. Member's Telephone number (e.g. 321-455-3838),
  10. Goldfish's Name (Goldie, Orangey, Fluffy, etc.),
  11. Goldfish's Species (Orange, Yellow, etc.),
  12. Goldfish's Birthday (e.g. 23rd June),
  13. Goldfish's Year of birth (e.g. 1932),
  14. Goldfish's Status (Dead, Alive),
  15. Goldfish's Favourite colour (Orange, Yellow, Gold, Blue, etc.).
A quick inspection of the big book reveals that there are about 10,000 entries, and the length of the entries (the number of characters on a line) varies between 35 and 175.

As a first, quick and easy implementation, we might decide to store the data as a giant array of 15,000 strings (to allow for future expansion), each holding 35 to 175 characters. We realise there would be a lot of waste this way, but it'll let us put together a working demonstration very quickly, and ensure we win the contract.
          string database[15000];
would create the storage we need. It would occupy a lot of memory, but not so much that a modern computer would have any trouble.
(Note: Older PC operating systems, and older compilers for PCs, were incapable of handling any data object more than 65536 bytes long. We are concerned with correct programming, not conforming to short-sighted commercial design decisions. Of course, in the so-called real world you have to be aware of the restrictions imposed by your customers' hardware. If the G.O.P. are using such old stuff, perhaps we can make an additional profit by selling them a new computer too.)
A better plan would be to define a constant right at the beginning of the program, and use it everywhere:
          const int MAX_NUM_ENTRIES=1000;

          string database[MAX_NUM_ENTRIES];
Then, any time we need to change the size, all we have to do is change that one place where the definition is made, and recompile the program. You won't have to search out all the places where the number 1000 appears.

The next task would be to define a function that lets us read some real data into that array. For the purposes of initial testing, we would write a function that reads entries typed on the keyboard, but we would be careful to ensure that it will be very easy to convert the function to read from a file instead.
        This is probably what would be added:
          const int MAX_LINE_LENGTH = 1000;
          int num_entries = 0;

          void read_data()
          { num_entries=0;
            while (1)
            { if (num_entries>=MAX_NUM_ENTRIES)
              { cerr << "** File Too Long! Buy The Upgrade!!! **\n";
                exit(1); }
              getline(cin, database[num_entries]);
              if (cin.eof()) break;
              num_entries+=1; }
            cout << "[" << num_entries << " entries read]\n"; }
Take it one thing at a time:
  1. cin >> s; (when s is a string variable) is not a good way of reading a line from a file. Strings read this way are terminated by whitespace, so if any of the data items (names, addresses, etc.) could have spaces in them, this way of reading will not work. Instead we have to use a rather ungainly function called getline. getline always reads a whole line, regardless of any spaces it may contain. The first argument is the input stream to read a line from, and the second is a string to store the line in.
  2. It would be very bad to read the 1001st line of a file if we have only made space for 1000 strings. It is common, but not at all nice, to just exit a program when this sort of problem occurs. To use the exit function, you should #include <stdlib.h> at the beginning of the file.
  3. database is an array of strings; database[6] refers to string number 6 in that array.
  4. How can you tell when the input is finished? When reading from the terminal it is OK to insist that the last line contains some special string, such as "The End", then the end can be detected with a simple test: if (database[num_entries]=="The End"), but that isn't always satisfactory, especially when reading from a disc file. It is much better to be able to detect the real end of the file. cin.eof() becomes true after a read operation has run into the end of the file. When reading from the terminal, end-of-file is simulated by typing control-D on unixy systems (on Windows it is control-Z followed by Enter).
To modify the function so that it reads data from a file instead of the terminal is not difficult. The modified function would look like this:
          const int MAX_LINE_LENGTH = 1000;
          int num_entries = 0;

          void read_data(string filename)
          { ifstream file(filename.c_str());
            if (file.fail())                 /* a failed open sets failbit, which bad() does not report */
            { cerr << "** Can't read the file '" << filename << "'\n";
              exit(1); }
            num_entries=0;
            while (1)
            { if (num_entries>=MAX_NUM_ENTRIES)
              { cerr << "** File Too Long! Buy The Upgrade!!! **\n";
                exit(1); }
              getline(file, database[num_entries]);
              if (file.eof()) break;
              num_entries+=1; }
            cout << "[" << num_entries << " entries read]\n"; }
In this version, the name of the data file is supplied as a string parameter, but there are some very annoying tricks:
  1. ifstream is a standard C++ type, and so is string. When the ifstream constructor is used to open a file, it needs to be given the file's name, and yet in the original C++ standard it does not understand file names presented to it as C++ strings. This is absolutely absurd, but it was the official standard until C++11, which finally added a constructor that accepts a C++ string. With an older compiler, you have to provide the file's name as an old plain-C string. Fortunately C++ strings have a special method called c_str() which produces an equivalent plain-C string, and using it does no harm with newer compilers either.
  2. If for some reason it is not possible to open the specified file, file.fail() becomes true, so you can easily test for an error condition. (file.bad() is not the right test here: a failed open sets the stream's failbit, and bad() reports only the more serious badbit.)
  3. getline and eof work on C++ input files in the same way as on cin.
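The difference between fail() and bad() after an unsuccessful open is easy to check for yourself. The two small functions below are hypothetical helpers written just for this experiment; the file name used with them is assumed not to exist on your system.

```cpp
#include <fstream>
#include <string>

// Returns true if the named file could be opened for reading.
// A failed open sets failbit, so fail() is the test that notices it.
bool can_open(const std::string &filename)
{ std::ifstream file(filename.c_str());
  return !file.fail(); }

// Reports what bad() says after the same open attempt, for comparison.
// bad() looks only at badbit, which a failed open does not set.
bool bad_after_open(const std::string &filename)
{ std::ifstream file(filename.c_str());
  return file.bad(); }
```

Calling both with a name such as "/no/such/directory/goldfish.txt" should show can_open returning false while bad_after_open also returns false, which is why a test based on bad() would let the program carry on with an unopened file.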
After all that, we would probably provide a function for printing out the whole database, even if it is only so that we can check that it was read correctly. That is extremely easy:
          void print_data()
          { cout << "[" << num_entries << " entries]\n";
            for (int i=0; i<num_entries; i+=1)
              cout << i << ": '" << database[i] << "'\n"; }
A note:
  1. I print quotes around the strings just for safety. It is very easy for stray spaces to creep into strings that are read from files, and if they are at the beginning or the end of a string, you can't see them. Making quotes appear around a printed string just makes it possible to see exactly what's there and what isn't. If you do this experiment, you will see the value of that safety check.
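To try the quoting experiment without a data file, the formatting used by print_data can be pulled out into a small function and fed strings with deliberate stray spaces. This is only a sketch; quoted_entry is a name invented for the example.

```cpp
#include <sstream>
#include <string>

// Builds the same text that print_data writes for one entry:
// the index, a colon, and the string wrapped in single quotes.
std::string quoted_entry(int index, const std::string &text)
{ std::ostringstream out;
  out << index << ": '" << text << "'";
  return out.str(); }
```

quoted_entry(0, "Goldie ") produces 0: 'Goldie ' with the trailing space clearly visible before the closing quote, whereas printing the bare string would look identical to "Goldie".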
A complete program to test those two functions would be very easy to write.

The main function contains a useful trick:
        int main(int argc, char *argv[])
        { string progname=string(argv[0]);
          string filename="";
          for (int i=1; i<argc; i+=1)         /* scan the command line arguments                   */    
          { string thisarg=string(argv[i]);
            if (thisarg == "-h")              /* -h means want help                                */    
            { cout << "To use this program, enter the command:\n";
              cout << "    " << progname << " filename\n";
              cout << "where filename is the name of the database-containing text file.\n";
              exit(0); }
            else
              filename=thisarg; }            /* anything else must be the file name.               */
                                      /* if no filename provided, the variable filename will still */
                                      /* be empty. read_data is built to detect that and use stdin.*/
          read_data(filename);
          print_data(); }
It is unreasonable to have to build fixed file names into programs, and sometimes annoying to have to write an interactive program that asks the user the name of the data file every time it is run. Often, it is much more convenient to provide simple inputs on the command line, so that if you normally run the program by typing "a.out" or "fishclub", you could instead run it by typing "a.out filename" or "fishclub filename", and somehow the program would be able to see the file name and make use of it.
        That is what this special main is doing. If you declare main with two parameters, the first an int, and the second an array of char*'s (as above), you automatically have access to everything that appeared on the command line. Unfortunately, the command line is given to your program as an array of plain-C strings, not C++ strings, but conversion is easy. That's why the second parameter is an array of "char *": char* is the most common way of describing an old-fashioned plain-C string.
        The first argument is simply the number of things on the command line (argc = ARGument Count), and the second argument is an array of those values (argv = ARGument Values). There is always at least one thing on the command line, and that is the command itself. If you run the program by typing "a.out filename", then argc will be 2. argv[0] will be "a.out", and argv[1] will be "filename". Sometimes argv[0] is useful in printing out error messages that tell the user what he/she should have typed to run the program.
        It is traditional in unixy systems to make programs be able to explain themselves. Typically typing "a.out -h" would ask the program "a.out" not to run normally, but to simply print out a little bit of help, then exit. Very often, programs will scan through their arguments to see if the string "-h" appears anywhere before starting to run properly.
        The main function shown above converts argv[0] to a proper C++ string using the string constructor in the normal way, then runs through all the rest of the arguments (if there are any) by saying for (i=1; i<argc; ...). Each argument is converted to a normal C++ string and inspected. "-h" is treated as a request for help, anything else is assumed to be a filename. If no filename was provided, the loop will terminate with the variable filename still containing the empty string "". (This would of course cause an error when read_data tries to open that file).


So...

Our program is now able to read a whole lot of lines of text from a file, and then show us all those lines afterwards. Not very impressive. A useful database program would be able to answer queries about the data, and as it stands, that would be quite difficult. We have not even considered the data format, so have no chance of being able to process it properly.
        We know that each line of data describes one club member, providing 15 independent pieces of information.
  1. Membership number (e.g. 27364)
  2. Member's title (Mr, Mrs, Miss, Ms, Uncle, etc.),
  3. Member's Last name (Bloggs, Smith, Jones, etc.),
  4. Member's First name (Sally, Hubert, Binkie, etc.)
  5. Member's Street address (e.g. 123a Ant St),
  6. Member's City (e.g. Antville),
  7. Member's State (e.g. AL),
  8. Member's Zip code (e.g. 10203),
  9. Member's Telephone number (e.g. 321-455-3838),
  10. Goldfish's Name (Goldie, Orangey, Fluffy, etc.),
  11. Goldfish's Species (Orange, Yellow, etc.),
  12. Goldfish's Birthday (e.g. 23rd June),
  13. Goldfish's Year of birth (e.g. 1932),
  14. Goldfish's Status (Dead, Alive),
  15. Goldfish's Favourite colour (Orange, Yellow, Gold, Blue, etc.).
It is absolutely necessary that we should be able to separate those 15 items from each other. The usual method is to pick on a character that can never possibly appear in any of those items, and use it as a separator. A colon ':' or the vertical bar '|' are common choices. With this plan, a sample of a couple of lines from the data file might look like this:
13523|Mrs|Spuggins|Mary|1234 Ant Street|Abracadabra|GA|27653|123-453-3123|Goldie|Goldfish|26 July|1999|Alive|Green
78123|Mr|Drab|Bub|73739 SW 353 St|Blammo|WA|93485|417-343-7667|Arthur|Haddock|29 February|1804|Dead|Grey
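To make the separator idea concrete, here is a minimal sketch of cutting one such line into its items. It uses std::vector and string's find method, neither of which the rest of this text relies on, and the name split_fields is invented for the example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits line at every occurrence of separator and returns the pieces
// in order. A line containing n separators always yields n+1 pieces,
// some of which may be empty.
std::vector<std::string> split_fields(const std::string &line, char separator)
{ std::vector<std::string> fields;
  std::size_t start = 0;
  while (true)
  { std::size_t bar = line.find(separator, start);
    if (bar == std::string::npos)
    { fields.push_back(line.substr(start));          // last piece: rest of the line
      break; }
    fields.push_back(line.substr(start, bar - start)); // text between two separators
    start = bar + 1; }
  return fields; }
```

Applied to the first sample line with '|' as the separator, this gives 15 pieces, the first being "13523" and the last "Green". A line that does not produce exactly 15 pieces is badly formatted.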
Queries made on the database are likely to be based on the values of these items (e.g. "list all fish born in 1999", "Find member number 78123", etc.), and although it is not terribly difficult, separating the individual items out from the big string does take some time. Therefore it would seem sensible to separate the big strings into their component parts just once, when they are first read in, and then store all the components separately. Then it will be much easier and much faster to search for particular items.
        For this scheme, instead of having one big array of strings called database, we would expect to have 15 arrays of smaller strings:
string membernum[MAX_NUM_ENTRIES];
string title[MAX_NUM_ENTRIES];
string lastname[MAX_NUM_ENTRIES];
string firstname[MAX_NUM_ENTRIES];
string street[MAX_NUM_ENTRIES];
string city[MAX_NUM_ENTRIES];
string state[MAX_NUM_ENTRIES];
string zipcode[MAX_NUM_ENTRIES];
string phone[MAX_NUM_ENTRIES];
string fishname[MAX_NUM_ENTRIES];
string fishspecies[MAX_NUM_ENTRIES];
string fishbirthday[MAX_NUM_ENTRIES];
string fishbirthyear[MAX_NUM_ENTRIES];
string fishstatus[MAX_NUM_ENTRIES];
string fishfavcol[MAX_NUM_ENTRIES];
Then, after reading each line, we need to find where all the vertical bars are, and use string's substring method to split the string into its 15 parts (substr's two int parameters are the starting position within the original string, and the number of characters wanted):
          void read_data()
          { num_entries=0;
            while (1)
            { if (num_entries >= MAX_NUM_ENTRIES)
              { cerr << "** File Too Long! Buy The Upgrade!!! **\n";
                exit(1); }
              string input;
              getline(cin, input);
              if (cin.eof()) break;
              int bar_position[20];
              int num_bars=0;
              int line_length = input.length();
              for (int i=0; i<line_length; i+=1)
              { if (input[i] == '|')
                { if (num_bars < 20)                 /* never write past the end of bar_position */
                    bar_position[num_bars] = i;
                  num_bars+=1; } }
              if (num_bars != 14)
              { cerr << "Line has incorrect format:\n  " << input << "\n";
                continue; }
              /* each item starts just after one bar and stops just before the next, */
              /* so its length is the distance between the two bars minus one.       */
              membernum[num_entries] =     input.substr(                  0,  bar_position[0]                    );
              title[num_entries] =         input.substr(  bar_position[0]+1,  bar_position[1]-bar_position[0]-1  );
              lastname[num_entries] =      input.substr(  bar_position[1]+1,  bar_position[2]-bar_position[1]-1  );
              firstname[num_entries] =     input.substr(  bar_position[2]+1,  bar_position[3]-bar_position[2]-1  );
              street[num_entries] =        input.substr(  bar_position[3]+1,  bar_position[4]-bar_position[3]-1  );
              city[num_entries] =          input.substr(  bar_position[4]+1,  bar_position[5]-bar_position[4]-1  );
              state[num_entries] =         input.substr(  bar_position[5]+1,  bar_position[6]-bar_position[5]-1  );
              zipcode[num_entries] =       input.substr(  bar_position[6]+1,  bar_position[7]-bar_position[6]-1  );
              phone[num_entries] =         input.substr(  bar_position[7]+1,  bar_position[8]-bar_position[7]-1  );
              fishname[num_entries] =      input.substr(  bar_position[8]+1,  bar_position[9]-bar_position[8]-1  );
              fishspecies[num_entries] =   input.substr(  bar_position[9]+1, bar_position[10]-bar_position[9]-1  );
              fishbirthday[num_entries] =  input.substr( bar_position[10]+1, bar_position[11]-bar_position[10]-1 );
              fishbirthyear[num_entries] = input.substr( bar_position[11]+1, bar_position[12]-bar_position[11]-1 );
              fishstatus[num_entries] =    input.substr( bar_position[12]+1, bar_position[13]-bar_position[12]-1 );
              fishfavcol[num_entries] =    input.substr( bar_position[13]+1,      line_length-bar_position[13]-1 );
              num_entries+=1; }
            cout << "[" << num_entries << " entries read]\n"; }
Yes, the program has grown quite a lot longer, but there isn't much more real programming. 15 of the lines are almost identical, and don't require much thought. Even if the program is now bigger, it is much more useful. We could easily write a function that searches for a particular record. For example:
        int findRecordForMemberNamed(string fname, string lname)
        { for (int i=0; i<num_entries; i+=1)
            if (lastname[i]==lname && firstname[i]==fname)
              return i;
          return -1; }

        ....

        void DoQuery()
        { string fname, lname;
          cout << "Enter name of member to be found: ";
          cin >> fname >> lname;
          int p = findRecordForMemberNamed(fname, lname);
          if (p == -1)
            cout << "No such member found\n";
          else
            cout << "Member #" << membernum[p] << " of " << city[p] << ", " << state[p] << "\n"; }
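The other kind of query mentioned earlier, "list all fish born in 1999", can match more than one record, so a single return value is not enough. One possible shape for it is sketched below, in the same array-based style. The function name findFishBornIn and the idea of passing in an array to receive the matching positions are assumptions made for this example; the declarations at the top repeat only the parts of the database the sketch needs.

```cpp
#include <string>

const int MAX_NUM_ENTRIES = 1000;
std::string fishbirthyear[MAX_NUM_ENTRIES];
int num_entries = 0;

// Stores the positions of every record whose fish was born in the given
// year into results, stopping once max_results positions have been stored.
// Returns how many positions were stored.
int findFishBornIn(const std::string &year, int results[], int max_results)
{ int found = 0;
  for (int i = 0; i < num_entries && found < max_results; i += 1)
    if (fishbirthyear[i] == year)
    { results[found] = i;
      found += 1; }
  return found; }
```

The caller can then loop over the first few entries of results and print whichever of the 15 arrays it is interested in for each position, just as DoQuery does for a single member.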