All about Hash Tables

Introduction

The data structure upon which hash tables are based is the array. Hash tables are really nothing but arrays, used in a special way. Thus, we define our hash table like so:

char *names[TABLE_SIZE], *addresses[TABLE_SIZE];

We use names and addresses as parallel arrays. This means that the ith name in names corresponds to the ith address in addresses.

The hash function

The algorithm at the core of hash tables is of course the hashing algorithm. Its purpose is to transform some piece of data (the "key" of a record) into an integer in the range 0 ... TABLE_SIZE - 1. The key is very often a string; in our program it is a person's name.

Here is a sample hash function. It's output is seemingly random: you can't easily tell just by looking at a string what its hash value will be, and even similar strings have completely different hash values. But notice that the hash value it produces is determined only by the input string. Thus, whether you call it now or 500 years from now, so long as you give it the same string, it will produce the same hash value.

int hash(char *s)
{ int h = 0, i, len = strlen(s);
  for (i = 0; i < len; i++) 
    h = h * 3943 + s[i];
  if (h < 0)
    h = -h;
  return h % TABLE_SIZE; }

The number 3943 has no particular significance, as we will see below when we test the effectiveness of various hash functions (the effectiveness of a hash function is measured by how well it distributes hash values: an even distribution is desired for keys similar to those expected in actual usage ). Many books recommend making TABLE_SIZE a prime number to increase the effectiveness of a hash function like this one.

Implementation: enter and lookup

A hash table is implemented by providing two functions: one to enter records into the table, and another to lookup records in the table. Each of these functions calls upon the hash function that implements the hashing algorithm.

The way in which these functions work is actually very simple. The enter function, responsible for adding a record to the hash table, simply puts the record in the position in the table equal to the hash of the key. The lookup function, to find a record, simply finds the hash value of the key it is given, and looks in that position in the table.

There is just one complicating factor, but it is easily dealt with. The problem is that two different keys can have the same hash value. This is an obvious consequence of the fact there are many more possible keys that there are positions in our table (no matter how large we make it). Thus, no matter how well we distribute hash values among keys, there will be clashes (a.k.a. collisions).

There are several common ways of dealing with clashes. The one you have been shown is called linear probing. It works by simply putting a new record that clashes with an existing one in the next available space. During lookup, you start at the position equal to the hash value of the key you are looking for, and you look at the key that is in that position in the table, continuing until you find the one you want (or until you see an empty spot, which means what you are looking for isn't in the table).

Here are sample enter and lookup functions that operate just as described above.

int enter(char *name, char *address) 
{ /* returns 1 if record is successfully added, 0 if not */
  int h, pos;
  h = hash(name);
  hashes[h]++; /* for experimental purposes only, see discussion below */
  pos = h;
  while (names[pos] != NULL)
  { pos++;
    if (pos == TABLE_SIZE) /* wrap-around: go back to beginning of table */
      pos = 0;
    if (pos == h) /* we've come back to where we started: the table must be full */ 
      return 0; }
  names[pos] = name;
  addresses[pos] = address; 
  num_records++;
  return 1; } 

char *lookup(char *name)
{ /* returns address associated with name if found, otherwise NULL */
  int h, pos;
  h = hash(name);
  if (names[h] == NULL)
    return NULL;
  pos = h;
  while (strcmp(names[pos], name) != 0)
  { pos++;
    if (pos == TABLE_SIZE) /* wrap-around: go back to beginning of table */
      pos = 0;
    if (names[pos] == NULL || pos == h) /* this is an empty spot, or it is where */
      return NULL; }                    /* we started: name must not be in table */
  return addresses[pos]; }

Note that this linear probing stuff would really slow things down if we had to do too much of it. But, since we expect our hash function to be good and distribute things evenly throughout the table, and we expect our table to be much larger than the number of things we are actually going to put in it, the performance of these enter and lookup functions is actually extremely fast.

Explanation of sample program, things for you to try

Below is a complete program that incorporates the ideas discussed above. It also includes a few other features.

It keeps track of how many records are in the hash table with a global variable, num_records.

It has a function print_all to display all the records in the table. Notice the result of printing all the records is not the records in sorted order. The hash function has the result of scattering the records throughout the table in a seemingly random order, not of sorting them. Thus, while a hash table is ideal for applications that require very quick retrieval of records, some other structure will be in order if the data must be maintained in sorted order.

Finally, notice the array hashes and the function print_hashes. These are not of interest to an end user of this sort of application, but are there to help us see whether the hash function we are using is a good one. Any time a record is entered into the table, the position in hashes corresponding to its hash value is incremented. Thus, by printing hashes we can see the distribution of hash values for the data entered.

Below the program is some sample data. You can run the program with it by typing

hash_table < sample_data.txt

assuming these are the names you have given to the program and the data file. (Reminder: to compile, type cc -Wall -o executable_name source_file(s). The -Wall option turns on all Warnings, which is good because it helps you spot problems with your programs.)

Try modifying the hash function to see what effect it has on the distribution of hash values. Perhaps make print_hashes draw a plot instead of printing numbers; visual data is more easily interpreted. Note: You will need to reduce the size of the table and/or increase the size of the data set to see anything meaningful.

Try adding a remove function to take records out of the hash table.

Try researching and implementing alternative methods of dealing with clashes.

The sample program

Note: HTML does funny things with arrow brackets (the things you put after a #include), so the program below might not look quite right. View the HTML source if you're not sure how it should look.

#include 
#include 
#include 

#define TABLE_SIZE 1999

char *names[TABLE_SIZE], *addresses[TABLE_SIZE];
int num_records, hashes[TABLE_SIZE];

char *read_string(char *prompt)
{ char buffer[1000];
  printf("%s", prompt);
  fgets(buffer, 999, stdin); 
  buffer[strlen(buffer) - 1] = '\0';
  return strdup(buffer); } 

void init_hash_table(void)
{ int i;
  for (i = 0; i < TABLE_SIZE; i++)
  { names[i] = NULL;
    addresses[i] = NULL; 
    hashes[i] = 0; } 
  num_records = 0; }

int hash(char *s)
{ int h = 0, i, len = strlen(s);
  for (i = 0; i < len; i++) 
    h = h * 3943 + s[i];
  if (h < 0)
    h = -h;
  return h % TABLE_SIZE; }
               
int enter(char *name, char *address) 
{ int h, pos;
  h = hash(name);
  hashes[h]++;
  pos = h;
  while (names[pos] != NULL)
  { pos++;
    if (pos == TABLE_SIZE)
      pos = 0;
    if (pos == h)
      return 0; }
  num_records++;
  names[pos] = name;
  addresses[pos] = address; 
  return 1; } 

char *lookup(char *name)
{ int h, pos;
  h = hash(name);
  if (names[h] == NULL)
    return NULL;
  pos = h;
  while (strcmp(names[pos], name) != 0)
  { pos++;
    if (pos == TABLE_SIZE)
      pos = 0;
    if (names[pos] == NULL || pos == h)
      return NULL; }
  return addresses[pos]; }

void print_all(void) 
{ int i;
  for (i = 0; i < TABLE_SIZE; i++)
    if (names[i] != NULL)
      printf("%s   %s\n", names[i], addresses[i]); }

void print_hashes(void)
{ int i;
  for (i = 0; i < TABLE_SIZE; i++)
    printf("%4d %4d\n", i, hashes[i]); }
    
int main(void)
{ int command, result;
  char *name, *address;
  init_hash_table();
  while (1) 
  { printf("There are %d records in the hash table\n", num_records);
    printf("1) Enter    2) Lookup    3) Display all    4) Exit\n");
    printf("Command: ");
    scanf("%d", &command);
    getchar(); /* get newline, but do nothing with it */
    switch (command)
    { case 1: 
        name = read_string("Name: ");
        address = read_string("Address: ");
        result = enter(name, address);
        if (result == 0) 
          printf("The table is full!\n");
        else 
          printf("The record was successfully stored.\n");
        break;
      case 2:
        name = read_string("Name: ");
        address = lookup(name);
        if (address == NULL)
          printf("The record was not found!\n");
        else
          printf("The address of %s is %s\n", name, address);
        free(name);
        break;
      case 3:
        print_all();
        break;
      case 4:
        print_hashes();
        exit(0); 
      default:
        printf("Invalid command!\n"); } } 
  return 0; }

A small sample data set

(Use as directed above.)

1
Bob Gonzalez
1234 Here St.
1
Joseph Kuptz
2344 There Dr.
1
Kelly Green
991 Crest Blvd.
1
Hercules Johnson
1 Microway St.
1
Alice Wassel
3457 Go Pkwy.
1
Lucky Murphy
75711 SW 999 Ave.
4

Miscellaneous notes

Memory management
The declarations of the two arrays that comprise our hash table above do not request memory for the actual data that will be stored in the hash table. They are simply arrays of pointers. When we enter a record into the hash table, we simply make these pointers point to the strings that have been passed to the enter function. Thus, users of the enter function must be sure to call it with strings that are "safe". In this context, safe means stored in permanent (a.k.a. heap) memory. In the sample program, we arrange for this by using the strdup library function in our read_string function. You can read more about strdup and related issues in Dynamic Memory Allocation and You.

Structures
An alternative to using parallel arrays would be to use structures. (If you haven't seen or don't remember structures, you don't need to read the following.) Consider, for example, the following structure definition.

struct hash_table_record
{ char *name, *address; };

You might use it to define the hash table as just one array, an array of pointers to these structures.

struct hash_table_record *hash_table[TABLE_SIZE];

Little changes would then be needed to the syntax of the program, but really everything would be about the same. There is no real advantage to using structures in our simple sample program, and later you will have classes, so don't worry too much about this approach.

AT 20-Oct-2000