ISAM Indexes

An ISAM (Indexed Sequential Access Method) file is simply a list of record numbers. They are arranged so that if you were to read the records of the data file in the order in which their numbers appear in the index file, you would be reading them in their proper order.
        That is a very simple concept, but surprisingly tricky to put exactly into words. An example will sort that out.

Going back to the original example data file:
Original Data File, Unsorted
Record#   Name                 Department                  Position
   1      Jimmy Jumbalaya      Sales                       salesperson
   2      Ferdinand Frog       Marketing                   liar
   3      Persephone Pang      Accounts                    junior accountant
   4      Elvis Earwig         Catering and Cleaning       soup designer
   5      Boris Baskerville    Marketing                   liar
   6      Mavis Mulligatawny   Research and Development    inventor
   7      Arbuthnot Ant        Sales                       tea boy
   8      Zebidee Zing         Personnel                   weasel
   9      Semolina Saltine     Sales                       senior under-manager
  10      Catherine Cat        Marketing                   liar
  11      Phillip Pharmer      Training                    buffoon
  12      Doris Day            Catering and Cleaning       hamburger helper
  13      Quintin Queeg        Hexagons                    vice president
  14      Iolanthe Ig          Research and Development    shouter
  15      Welliam Wedgie       Marketing                   liar
An ISAM index file would also have 15 records, but each would just be a number. They would be (assuming that "name" is the primary key):
ISAM index for it
7  5  10  12  4  2  14  1  6  3  11  13  9  15  8
If we read the data file records in that order, first we see record #7 "Arbuthnot Ant", then we see record #5 "Boris Baskerville", then we see record #10 "Catherine Cat", and so on, until finally we see record #8 "Zebidee Zing". All appearing in their proper order. That is all an ISAM file is.
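To make that concrete, here is a minimal Python sketch, with in-memory lists standing in for the two disc files:

```python
# In-memory stand-ins for the two files: data records keyed by record
# number, and an ISAM index that is nothing but a list of record numbers.
data = {1: "Jimmy Jumbalaya", 2: "Ferdinand Frog", 3: "Persephone Pang",
        4: "Elvis Earwig", 5: "Boris Baskerville", 6: "Mavis Mulligatawny",
        7: "Arbuthnot Ant", 8: "Zebidee Zing", 9: "Semolina Saltine",
        10: "Catherine Cat", 11: "Phillip Pharmer", 12: "Doris Day",
        13: "Quintin Queeg", 14: "Iolanthe Ig", 15: "Welliam Wedgie"}
index = [7, 5, 10, 12, 4, 2, 14, 1, 6, 3, 11, 13, 9, 15, 8]

# Reading the data records in the order their numbers appear in the
# index yields the names in sorted order, without moving any data.
for recno in index:
    print(data[recno])
```

The data records stay exactly where they are; only the little list of numbers encodes the ordering.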

A search on a file with an ISAM index again follows the binary chop scheme. Looking for Iolanthe Ig again, the procedure is:
Preparation
We have no idea where Iolanthe Ig is in the file, so we set MinPos=1, MaxPos=15 to represent this.
One
We look at the index file in the middle of the possible region, calculating AvgPos as ½(MinPos+MaxPos) = ½(1+15) = 8. At record 8 of the index we find the number 1; at record 1 of the data file we find Jimmy Jumbalaya; we know the file is ordered, and Iolanthe Ig should come before Jimmy Jumbalaya, so we can exclude index records 8 to 15 from the search. Set MaxPos to AvgPos-1 = 7 and continue.
Two
We look at the index file in the middle of the possible region, calculating AvgPos as ½(MinPos+MaxPos) = ½(1+7) = 4. At record 4 of the index we find the number 12; at record 12 of the data file we find Doris Day; we know the file is ordered, and Iolanthe Ig should come after Doris Day, so we can exclude index records 1 to 4 from the search. Set MinPos to AvgPos+1 = 5 and continue.
Three
We look at the index file in the middle of the possible region, calculating AvgPos as ½(MinPos+MaxPos) = ½(5+7) = 6. At record 6 of the index we find the number 2; at record 2 of the data file we find Ferdinand Frog; we know the file is ordered, and Iolanthe Ig should come after Ferdinand Frog, so we can exclude index records 1 to 6 from the search. Set MinPos to AvgPos+1 = 7 and continue.
Four
MinPos and MaxPos are both 7, meaning that if Iolanthe Ig is in the file, he must be referred to by index record 7, as all others have been eliminated. We look at record 7 of the index and find the number 14, so we look at record 14 of the data file and miraculously find Iolanthe Ig there.
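The steps above can be sketched as a simple loop. This is an illustrative Python version using the example data; each iteration costs one index access and one data access:

```python
data = {1: "Jimmy Jumbalaya", 2: "Ferdinand Frog", 3: "Persephone Pang",
        4: "Elvis Earwig", 5: "Boris Baskerville", 6: "Mavis Mulligatawny",
        7: "Arbuthnot Ant", 8: "Zebidee Zing", 9: "Semolina Saltine",
        10: "Catherine Cat", 11: "Phillip Pharmer", 12: "Doris Day",
        13: "Quintin Queeg", 14: "Iolanthe Ig", 15: "Welliam Wedgie"}
index = [7, 5, 10, 12, 4, 2, 14, 1, 6, 3, 11, 13, 9, 15, 8]

def isam_search(key):
    """Binary chop over the index: returns the number of the data
    record holding `key`, or None if it is not in the file."""
    min_pos, max_pos = 1, len(index)        # positions are 1-based, as above
    while min_pos <= max_pos:
        avg_pos = (min_pos + max_pos) // 2  # middle of the possible region
        recno = index[avg_pos - 1]          # one index access
        name = data[recno]                  # one data access
        if key == name:
            return recno
        if key < name:
            max_pos = avg_pos - 1           # key belongs before this record
        else:
            min_pos = avg_pos + 1           # key belongs after this record
    return None

print(isam_search("Iolanthe Ig"))           # finds record 14
```

Tracing it with "Iolanthe Ig" reproduces exactly the four steps worked through above.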
Just as with the simple sorted file, the number of iterations of this procedure is logarithm-base-two of number-of-records-in-file. This time, each iteration involves two disc accesses (one to read the index, one to read the data), so the access time (under our normal assumptions) would be 2×log_2(N)×11mS. For a file of 1,000,000 records that comes to 440mS, less than half a second.

It seems that creating an ISAM index is a backwards step. Compared to just sorting the data file, it takes twice as long, and consumes more space (the index has to be stored somewhere, and would occupy 3MB for a 1,000,000 record file - it takes 3 bytes to store a number in the range 1 to 1,000,000).
        Two advantages make this backwards step worthwhile: it becomes possible to have multiple indexes on the same file, and we never spend time moving the large data records around.

Creating an ISAM index

Creating an ISAM index is almost exactly the same as sorting a file; we just don't move the records. There are plenty of ways to sort a file, and I'm not going to spend the whole night typing things we spent over a week on in class. Instead I'll just review the more interesting and efficient methods.

Using Quicksort

Quicksort, partly because of its name, is the self-proclaimed expert programmer's favourite. It probably shouldn't be, because in its traditional form it has a fatal flaw that is quite likely to expose itself during database operations. Having said that, it is a clever and valuable method to have in your box of tricks, and easy to modify so that the flaw is removed.
A reminder of how it works:
Preparation
Create an initial index file just containing all the record numbers in order. For our example data, the initial index would be this:
1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
This index contains all the right information, just in the wrong order. The quicksort algorithm just jiggles this index about until it is in the right order.
One: Partition the index.
  1. Pick a random record from the index, and swap it with the first one. This simple step removes the "fatal flaw" by disrupting any unfortunate patterns in the data. The new first record is called the "pivot". Read the corresponding data record, and remember it for the rest of the partitioning operation.
  2. Run through the rest of the index record-by-record. For each, read the corresponding record from the data file and compare it with the pivot record. Rearrange things so that all index records corresponding to data records that belong before the pivot record come first in the index file.
After that, the index will have been partitioned into (essentially) two portions. First come references to data records that should come before the pivot, last come references to data records that should come after the pivot.
Two: Recursion.
Each of the two partitions just created is certainly not sorted, but at least each is in the right place. The whole quicksort algorithm (steps One and Two) is applied recursively to each partition in turn.
Imagine that the random record picked in One.1 is record number 6, "Mavis Mulligatawny". The partitioning process would arrange that records 1, 2, 4, 5, 7, 10, 12, and 14 (those that belong before Mavis) would come first in the jiggled index (although not in any particular order beyond that), then record 6 itself would appear, then records 3, 8, 9, 11, 13, and 15 would finally appear. So the partitioned index might be:
1  2  4  5  7  10  12  14  6  3  8  9  11  13  15
which has produced two nearly equal partitions (eight records and six). The recursive procedure would then be applied to the 1, 2, 4, 5, 7, 10, 12, 14 portion and later to the 3, 8, 9, 11, 13, 15 portion separately.
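As a sketch only (not anybody's production code), the partition-and-recurse scheme looks like this in Python, again with in-memory stand-ins for the two files:

```python
import random

data = {1: "Jimmy Jumbalaya", 2: "Ferdinand Frog", 3: "Persephone Pang",
        4: "Elvis Earwig", 5: "Boris Baskerville", 6: "Mavis Mulligatawny",
        7: "Arbuthnot Ant", 8: "Zebidee Zing", 9: "Semolina Saltine",
        10: "Catherine Cat", 11: "Phillip Pharmer", 12: "Doris Day",
        13: "Quintin Queeg", 14: "Iolanthe Ig", 15: "Welliam Wedgie"}

def quicksort_index(index, lo, hi):
    """Sort index[lo..hi] in place by comparing the data records the
    entries refer to. The data records themselves never move."""
    if lo >= hi:
        return
    # One.1: swap a randomly chosen entry into the first position;
    # that entry is the pivot. This defuses the "fatal flaw".
    r = random.randint(lo, hi)
    index[lo], index[r] = index[r], index[lo]
    pivot = data[index[lo]]                # remember the pivot data record
    # One.2: move entries whose data records belong before the pivot
    # to the front of this region.
    boundary = lo
    for i in range(lo + 1, hi + 1):
        if data[index[i]] < pivot:         # one data-record read per comparison
            boundary += 1
            index[boundary], index[i] = index[i], index[boundary]
    index[lo], index[boundary] = index[boundary], index[lo]
    # Two: recurse on the two partitions either side of the pivot.
    quicksort_index(index, lo, boundary - 1)
    quicksort_index(index, boundary + 1, hi)

index = list(range(1, 16))                 # the initial, unsorted index
quicksort_index(index, 0, len(index) - 1)
print(index)                               # the finished ISAM index
```

Whatever pivots the random choices produce, the finished index is the same one shown earlier: 7, 5, 10, 12, 4, 2, 14, 1, 6, 3, 11, 13, 9, 15, 8.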

         If you need more detail than that, look it up in a book. I can't imagine typing the complete quicksort algorithm coherently in HTML.

In the best possible case, each partitioning step divides the index into two exactly equal partitions, so again the number of steps before the partitions are reduced to a size of 1 (and therefore do not need to be sorted) would be log_2(N). Of course, the best possible case will not usually arise. Sometimes the worst possible case will happen, and a partitioning step will produce one partition of size 1, and one partition of size N-1. On average, we will have an averagely-good-to-bad case, with one partition having about a quarter of the index and the other having the rest. The number of steps it should be expected to take to reduce the partitions to 1 member is now the solution of (¾)^t×N = 1, that is t = log-base-4/3(N), which is about 2.5×log_2(N).
        Each of these 2.5×log_2(N) levels of the recursive procedure has to run through all N index records, and read all N data records for comparison. Although we do read the index in sequential order, the data records are all jumbled up and can only be read one at a time, so the expected time to create an ISAM index using quicksort is about 2.5×N×log_2(N)×11mS. For the standard example of N=1,000,000 records, this gives 2.5 × 1,000,000 × 20 × 11mS, which is 550,000 seconds, or about 1 week.
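The arithmetic is easy to check directly; the 11mS access time and the factor of 2.5 are the assumptions stated above:

```python
import math

N = 1_000_000                        # records in the file
access_ms = 11                       # assumed cost of one random disc access
levels = 2.5 * math.log2(N)          # expected number of partitioning levels
total_ms = levels * N * access_ms    # one data-record read per index entry per level
print(total_ms / 1000 / 86400)       # roughly 6.3 days: "about 1 week"
```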

All methods of creating an ISAM index involve sorting, and the best general purpose sorting algorithms all require something in the region of N×log(N) operations. All of them rearrange the index but do not rearrange the data file, so all of them suffer from the problem of effectively random-order accesses to the data file which prevents any speed-up from reading multiple records.

Mergesort always cuts the index exactly in half, so does not have the factor of 2.5 in its timing (it is 2.5 times faster than quicksort on average), but has the small disadvantage of needing extra disc space equal to the size of the index being created.

Heapsort is naturally almost as fast as mergesort, and does not require extra space, but does have another disadvantage. Quicksort and mergesort both scan through the indexes they are creating in sequential order (it is the data file that they access randomly). Heapsort accesses both the index file and the data file randomly, which has the effect of doubling the time it takes to execute.

As a general rule, mergesort can be expected to be the fastest sorting algorithm for creating an ISAM index, and it takes approximately two and a half days to create an index for our 1,000,000 record file. What this teaches us is that it is not normal to create a new ISAM index on an already existing large data file. Create the index when the file is empty, and just maintain it whenever insertions or deletions are performed.
        The usual trick for maintaining an ISAM index is to wait until quite a few new records have been added, then sort them (just the new records). So long as you don't wait too long, just sorting the new records won't take long (you should expect to be able to sort a few thousand records entirely in memory, without having to wait for any disc accesses at all). Then, the sorted new records can be merged with the sorted original records in just N operations. So adding 1,000 records to a file of 1,000,000 records would take something in the region of 1,000,000×11mS, which is 11,000 seconds or 3 hours; not wonderful but at least tolerable. This is why serious databases offer other options for indexes besides ISAM.
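That merge step can be sketched in a few lines of Python. The two new records here (numbers 16 and 17, and their names) are invented purely for illustration; they are not part of the original example file:

```python
# The original example file plus two hypothetical new records (16 and 17).
data = {1: "Jimmy Jumbalaya", 2: "Ferdinand Frog", 3: "Persephone Pang",
        4: "Elvis Earwig", 5: "Boris Baskerville", 6: "Mavis Mulligatawny",
        7: "Arbuthnot Ant", 8: "Zebidee Zing", 9: "Semolina Saltine",
        10: "Catherine Cat", 11: "Phillip Pharmer", 12: "Doris Day",
        13: "Quintin Queeg", 14: "Iolanthe Ig", 15: "Welliam Wedgie",
        16: "Norman Newcomer", 17: "Annie Addition"}
old_index = [7, 5, 10, 12, 4, 2, 14, 1, 6, 3, 11, 13, 9, 15, 8]
new_index = [17, 16]                 # the new records, sorted in memory

def merge_indexes(a, b):
    """Merge two sorted ISAM indexes in one sequential pass:
    about len(a) + len(b) operations in total."""
    merged, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if data[a[i]] <= data[b[j]]:     # one data read per comparison
            merged.append(a[i]); i += 1
        else:
            merged.append(b[j]); j += 1
    merged.extend(a[i:])                 # at most one index has leftovers
    merged.extend(b[j:])
    return merged

merged = merge_indexes(old_index, new_index)
```

Because both inputs are already sorted, the pass never has to back up, which is what keeps the cost down to roughly N operations.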