Hash Indexes

A Hash Function reduces a data object (typically a string, but it could be anything) to a random-seeming number within a pre-determined range. The function must be deterministic (i.e. if you call it twice with the same argument it must deliver the same result each time), but must also randomly distribute the results (i.e. similar inputs should not produce similar results, and all possible results should be equally likely). A typical hash function might use a strange expression to combine all the characters of a string, perhaps like this:
#include <string.h>                /* for strlen */

int hash(char *s)
{ unsigned int res=44923;          /* unsigned, so the arithmetic simply wraps around on overflow */
  int i, n=strlen(s);
  for (i=0; i<n; i+=1)
    res=9*res*res+66177*res+s[i];  /* mix each character into the running result */
  return (res%hashsize); }         /* hashsize is the size of the index array, defined elsewhere */
Such a function is guaranteed to produce a result in the range 0 to (hashsize-1). If we had database records that look like this:
struct person
{ char name[20];
  int socsec;
  char address[50];
  char phone[10]; };
create table person
( name char(20),
  socsec int,
  address char(50),
  phone char(10) )
and we wanted to create an index based on the "name" attribute, and we could guarantee that the hash value for every name in the table would be different, we would have a very easy job. I'll illustrate it in C++, just for variety.
class HashIndex
{ protected:
    struct person *index[hashsize];    // one slot for every possible hash value
  public:
    HashIndex()                        // every slot starts out empty
    { for (int i=0; i<hashsize; i+=1)
      { index[i]=NULL; } }

    struct person *lookup(char *key)   // NULL means "no such record"
    { int h=hash(key);
      return (index[h]); }

    void insert(struct person *record) // file the record under the hash of its name
    { int h=hash(record->name);
      index[h]=record; } };
All we have is a big array of pointers to records; its size is the same as the range of possible results from the hash function. Initially every pointer is set to null. When a new record is added to the index, we simply calculate the hash value of its name attribute, and store a pointer to the record at the indicated position in the array. Whenever we want to find a record based on a name, we simply calculate the hash value of that name, and return whatever record is found at that position in the array. If it is still null, that means no matching record has been inserted; any proper pointer must refer to the correct record because of the assumption that all records have unique hash values.
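As a quick illustration of how the class would be used (the sample record is made up, and hashsize plus the definitions above are assumed to be in scope):
#include <stdio.h>        // struct person, hash() and HashIndex as defined above

int main()
{ struct person fred = { "Fred Bloggs", 123456789, "19 Acacia Avenue", "555-0137" };

  HashIndex idx;                                  // every slot starts out NULL
  idx.insert(&fred);                              // filed under hash("Fred Bloggs")

  struct person *p = idx.lookup("Fred Bloggs");
  if (p!=NULL)
    printf("%s lives at %s\n", p->name, p->address);
  else
    printf("no such person\n");
  return 0; }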
        It would be almost as easy to implement this as a real disc-based index. The array would become a file of record numbers, just like an ISAM index. Adding a new record to the index would require two disc accesses (after calculating the hash value, one access to read the block containing that portion of the index file, then another to write that block back to disc after adding the new pointer). Finding a record would also require two disc accesses (after calculating the hash value, one access to read the block containing that portion of the index file, then another to read the actual record from the data file).
        Every operation would take 22mS (under the standard assumptions) regardless of the size of the data file. That is the real strength of a hash index: absolutely minimal access times that do not increase with file size.
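        A rough sketch of the lookup half of that disc-based version is shown below. The block size of 128 record numbers, the plain-int record-number format, and the use of -1 for an empty slot are all assumptions for illustration, not something fixed by the notes; the second disc access (reading the data record itself) is left to the caller.
#include <stdio.h>

const int BLOCKSIZE = 128;     // record numbers held in one index block (assumed)

// read one block's worth of record numbers from the index file
void readBlock(FILE *f, int blocknum, int *buffer)
{ fseek(f, (long)blocknum*BLOCKSIZE*(long)sizeof(int), SEEK_SET);
  fread(buffer, sizeof(int), BLOCKSIZE, f); }

// one disc access to find the record number; the caller spends the
// second access reading that record from the data file
int diskLookup(FILE *indexfile, char *key)
{ int block[BLOCKSIZE];
  int h=hash(key);
  readBlock(indexfile, h/BLOCKSIZE, block);
  return block[h%BLOCKSIZE]; }   // -1 would mean "no record with this hash value"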
        
        Sadly, that is too good to be true. In any reasonably realistic situation, it is not possible to guarantee that every key value will have a different hash value. Clashes will occur and must be taken into account. For normal programming, when the whole thing is kept in memory, it would be natural to use a linked list: every entry in the index is a pointer not to a data record, but to a linked list of pointers to data records (each of which has that same hash value). If we choose a sensible size for our hash table, the linked lists can be kept very short and hence very efficient.
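A minimal in-memory sketch of that chained version (reusing struct person, hash() and hashsize from above; the names ChainedHashIndex and link are just for illustration):
#include <cstring>         // for strcmp

// each index entry heads a linked list of records sharing one hash value
struct link
{ struct person *record;
  struct link *next; };

class ChainedHashIndex
{ protected:
    struct link *index[hashsize];
  public:
    ChainedHashIndex()
    { for (int i=0; i<hashsize; i+=1)
        index[i]=NULL; }

    void insert(struct person *record)      // new links go at the front of the chain
    { int h=hash(record->name);
      struct link *l=new link;
      l->record=record;
      l->next=index[h];
      index[h]=l; }

    struct person *lookup(char *key)        // follow the chain, comparing the real keys
    { for (struct link *l=index[hash(key)]; l!=NULL; l=l->next)
        if (strcmp(l->record->name, key)==0)
          return (l->record);
      return (NULL); } };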
        The frequencies of hash value clashes follow a Poisson Distribution; if we arrange for the hash index to have exactly as many entries as the data file has records, we can calculate the following statistics:
36.7%  of the linked lists will have 0 data record pointers
36.7%  of the linked lists will have 1 data record pointer
18.4%  of the linked lists will have 2 data record pointers
 6.1%  of the linked lists will have 3 data record pointers
 1.5%  of the linked lists will have 4 data record pointers
0.36%  of the linked lists will have more than 4 data record pointers
(The general formula is p(n) = e^(-L) × L^n ÷ n!, where L = (number of records in data file) ÷ (number of entries in index file), which is 1 in the above example; n is the number of data records having the same hash value, and hence the length of a linked list; and p(n) is the proportion of linked lists that have length n. Multiplying p(n) by 100 gives the percentages above.)
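The table above can be reproduced with a few lines of code (L = 1 here, matching the example; the function name poisson is just for illustration):
#include <stdio.h>
#include <math.h>

// p(n) = e^(-L) * L^n / n!  -- the proportion of chains holding exactly n records
double poisson(double L, int n)
{ double p=exp(-L);
  for (int i=1; i<=n; i+=1)
    p*=L/i;                       // build up L^n/n! one factor at a time
  return (p); }

int main()
{ double L=1.0, rest=1.0;
  for (int n=0; n<=4; n+=1)
  { double p=poisson(L, n);
    rest-=p;
    printf("%5.2f%% of the chains have %d data record pointers\n", 100*p, n); }
  printf("%5.2f%% of the chains have more than 4 data record pointers\n", 100*rest);
  return 0; }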
        Normal linked lists are not ideal for disc implementations, because every link that is followed requires an extra disc access (you have to read the disc block that contains the next link in the list). The average length of a linked list is L (which is 1 in the above example), which suggests that the extra accesses don't matter much: the average index operation will require just three disc accesses. One to read the initial pointer out of the index, then an average of one more to read the links in the linked list starting there, and finally one more to read the actual data record.
        However, that average length is irrelevant and misleading. When you search for a record using an index, you normally expect that record to be present; the 36.7% of table entries that are empty will never be looked at, so they should not be included in the average. The average length of a non-empty list in our example comes out as 1.58 (general case: L÷(1-p(0))). That means that with a straightforward disc-based linked-list implementation of a hash index, the average access would take 46mS (1 initial access to find the beginning of the linked list from the index, followed by 1.58 times (1 access to read a link from the list, plus 1 access to read the corresponding data record to see if it matches)). Actually it is a little faster if you do the calculation completely as we did in class, because you don't always have to search the whole linked list; the entry you're looking for will on average appear half way along the list. However, that doesn't mean you can just halve the factor of 1.58, because there is a minimum of 1 link to be followed before anything can be found. The actual time comes to around 40mS.
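Those figures follow directly from the Poisson numbers; a quick check of the 46mS estimate (using the standard 11mS per disc access assumed throughout these notes):
#include <stdio.h>
#include <math.h>

int main()
{ double L=1.0;
  double p0=exp(-L);                          // proportion of empty chains
  double nonEmptyLength=L/(1.0-p0);           // = 1.58 when L is 1
  double accesses=1.0+nonEmptyLength*2.0;     // index read + (link + data record) per chain entry
  printf("average non-empty chain length: %.2f\n", nonEmptyLength);
  printf("average accesses: %.2f, average time: %.1f mS\n", accesses, accesses*11.0);
  return 0; }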
        That sounds very good. 40mS to find anything, regardless of the size of the data file. Unfortunately, things can sometimes get out of hand. If you can ensure that the index file is large enough, so that the value of L (records in data file divided by pointers in index file) never gets much above 1, everything will remain perfect. However, if the data file grows beyond predictable limits, the average access time starts to grow alarmingly.
        
        If you can't ensure that L will never rise much above 1, the trick is to use Buckets. A bucket is just like a link in a linked list, except that it contains not one but several pointers to data records. If we chose to use buckets with four data pointers each, and made the hash table an array of buckets (rather than pointers to buckets), then 99.64% of all accesses would be resolved just by reading a single bucket from the index file. After that we would of course still have to read the same number of data records from the data file, so the saving isn't as spectacular as one might have expected, but it is still an improvement.
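On disc a bucket is just a small fixed-size block of record numbers; a sketch of the layout (the four-pointer size comes from the example above, while BUCKETSIZE and the -1 convention for unused slots are assumptions for illustration):
const int BUCKETSIZE = 4;     // data record pointers per bucket, as in the example

struct bucket
{ int records[BUCKETSIZE];    // record numbers in the data file, -1 for unused slots
  int nextBucket;             // overflow bucket in the index file, -1 if none
};
The index file starts out as one bucket per hash value; only the rare chains with more than four records need an overflow bucket chained on the end.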
        Here are the calculations for our example; all the percentages are divided by (1-0.367) to take account of the fact that empty chains (p(0) = 0.367) are not being considered.
One record with this hash value
0.367÷(1-0.367) = 58% of all non-empty buckets will have just one data record pointer; in these cases it takes 2 disc accesses to retrieve the information (read the bucket from the index, then read the data record).
Two records with same hash value
0.184÷(1-0.367) = 29% of all non-empty buckets have two data pointers. Sometimes it will take 2 disc accesses because the first record you look at is the right one, sometimes it will take three, giving an average of 2.5 disc accesses.
Three records with same hash value
0.061÷(1-0.367) = 10% of all non-empty buckets have three data pointers. Sometimes it will take 2 disc accesses because the first record you look at is the right one, sometimes it will take three, and sometimes it will take four (read the bucket from the index file, then read all 3 data records before finding the right one), giving an average of 3 disc accesses.
Four records with same hash value
0.015÷(1-0.367) = 2% of all non-empty buckets have four data pointers. Sometimes it will take 2 disc accesses, sometimes 3, sometimes 4, and sometimes 5, giving an average of 3.5 disc accesses.
Five records with same hash value
0.003÷(1-0.367) = 0.4% of all non-empty buckets have four data pointers plus a "next bucket" pointer pointing to another bucket with just one data pointer. If the correct record is amongst the first four it will take 2, 3, 4, or 5 disc accesses; if it is the fifth it will take 7 (not six because the next bucket must be read too), giving an average of 4.2 disc accesses.
We can continue the computation until the numbers get too small to matter (which they almost have now) and then calculate the weighted average:
        (58%×2) + (29%×2.5) + (10%×3) + (2%×3.5) + (0.4%×4.2) + ...
comes to an average of 2.27 disc accesses or 25mS per database access. Not bad at all.
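That weighted average is easy to check mechanically; a minimal sketch using the rounded figures from the list above (buckets of one to four records average (n+3)÷2 accesses, and the five-record case averages 4.2, as explained above):
#include <stdio.h>

int main()
{ // conditional probabilities and average access counts from the list above
  double prob[]     = { 0.58, 0.29, 0.10, 0.02, 0.004 };
  double accesses[] = { 2.0,  2.5,  3.0,  3.5,  4.2   };
  double avg=0.0;
  for (int i=0; i<5; i+=1)
    avg+=prob[i]*accesses[i];
  printf("average: %.2f accesses = %.0f mS per database access\n", avg, avg*11.0);
  return 0; }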


Creating a hash index is also very easy; simply read through the whole data file in its natural order. For each record calculate its hash value and add it to the correct bucket. Adding a record is basically the same operation as finding one, except that it is at least one disc access faster because the bucket we are adding the record to will not yet be at its full length.
        Going back to the standard assumptions, if we can only read a single block at a time (giving us 8 records every 11mS) we read each record in an average of 1.4mS, and add it to the index in an average of (a little under) 14mS (2.27 accesses minus one access, times 11mS), giving about 15mS per record. For the 1,000,000 record file, that comes to roughly 15,000 seconds, or a little over four hours. That sounds like a long time, but compared to the 60 hours for an ISAM index it is really good.
        There is no great advantage to be gained from reading more than one block at a time; the 1.4mS per record would be reduced, but the 14mS per insertion would not, so the overall change would not be significant.
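For anyone who wants the build-time arithmetic spelled out, here is a small sketch using the same standard assumptions (11mS per disc access, 8 records per block) and the 2.27-access average computed earlier:
#include <stdio.h>

int main()
{ double accessTime=11.0;                        // mS per disc access (standard assumption)
  double readPerRecord=accessTime/8.0;           // 8 records arrive with every block read
  double insertPerRecord=(2.27-1.0)*accessTime;  // one access fewer than an average lookup
  double perRecord=readPerRecord+insertPerRecord;
  double total=perRecord*1000000.0/1000.0;       // seconds for the million-record file
  printf("%.1f mS per record, %.0f seconds (%.1f hours) in total\n",
         perRecord, total, total/3600.0);
  return 0; }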


Hash tables are generally the fastest kinds of indexes we are ever likely to see. They beat all others in almost every way (the exception being disc space: buckets are much bigger than the usual single data pointers). The one major disadvantage that stops them from being the standard index used everywhere is that the "Find Next" operation is not supported. Hashing randomises the order of records, so it is not possible to move through the file in alphabetical (or any other) order.