ISAM Indexes
An ISAM (Indexed Sequential Access Method) file is simply a list of
record numbers, arranged so that if you were to read the records of the
data file in the order in which their numbers appear in the index file,
you would be reading them in their proper sorted order. That is a very
simple concept, but surprisingly tricky to put exactly into words. An
example will sort that out.
Going back to the original example data file:
Original Data File, Unsorted

Record# | Name | Department | Position
---|---|---|---
1 | Jimmy Jumbalaya | Sales | salesperson
2 | Ferdinand Frog | Marketing | liar
3 | Persephone Pang | Accounts | junior accountant
4 | Elvis Earwig | Catering and Cleaning | soup designer
5 | Boris Baskerville | Marketing | liar
6 | Mavis Mulligatawny | Research and Development | inventor
7 | Arbuthnot Ant | Sales | tea boy
8 | Zebidee Zing | Personnel | weasel
9 | Semolina Saltine | Sales | senior under-manager
10 | Catherine Cat | Marketing | liar
11 | Phillip Pharmer | Training | buffoon
12 | Doris Day | Catering and Cleaning | hamburger helper
13 | Quintin Queeg | Hexagons | vice president
14 | Iolanthe Ig | Research and Development | shouter
15 | Welliam Wedgie | Marketing | liar
An ISAM index file would also have 15 records, but each would just be a number.
They would be (assuming that "name" is the primary key):
ISAM Index on Name

7 | 5 | 10 | 12 | 4 | 2 | 14 | 1 | 6 | 3 | 11 | 13 | 9 | 15 | 8
If we read the data file records in that order, first we see record #7 "Arbuthnot Ant",
then we see record #5 "Boris Baskerville", then we see record #10 "Catherine Cat", and
so on, until finally we see record #8 "Zebidee Zing". All appearing in their proper
order. That is all an ISAM file is.
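The whole idea fits in a few lines of code. This is a minimal sketch (Python, with a dict and a list standing in for the two disc files; the names are taken from the example table above):

```python
# A minimal sketch of the ISAM idea: a dict stands in for the data file
# (record number -> name) and a plain list stands in for the index file.

data_file = {
    1: "Jimmy Jumbalaya", 2: "Ferdinand Frog", 3: "Persephone Pang",
    4: "Elvis Earwig", 5: "Boris Baskerville", 6: "Mavis Mulligatawny",
    7: "Arbuthnot Ant", 8: "Zebidee Zing", 9: "Semolina Saltine",
    10: "Catherine Cat", 11: "Phillip Pharmer", 12: "Doris Day",
    13: "Quintin Queeg", 14: "Iolanthe Ig", 15: "Welliam Wedgie",
}

# The ISAM index: record numbers in the order that puts the names in order.
isam_index = [7, 5, 10, 12, 4, 2, 14, 1, 6, 3, 11, 13, 9, 15, 8]

# Reading the data records in index order yields them in their proper order.
names_in_index_order = [data_file[r] for r in isam_index]
```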
A search on a file with an ISAM index again follows the binary chop scheme. Looking for
Iolanthe Ig again, the procedure is:
- Preparation
- We have no idea where Iolanthe Ig is in the file, so we set
MinPos=1, MaxPos=15 to represent this.
- One
- We look at the index file in the middle of the possible region,
calculating AvgPos as ½(MinPos+MaxPos)
= ½(1+15) = 8.
At record 8 of the index we find the number 1; at record 1 of the data file we find
Jimmy Jumbalaya; we know the file is ordered, and Iolanthe Ig should
come before Jimmy Jumbalaya, so we can exclude index records 8 to 15 from the search. Set
MaxPos to AvgPos-1 = 7 and continue.
- Two
- We look at the index file in the middle of the possible region,
calculating AvgPos as ½(MinPos+MaxPos)
= ½(1+7) = 4.
At record 4 of the index we find the number 12; at record 12 of the data file we find
Doris Day; we know the file is ordered, and Iolanthe Ig should
come after Doris Day, so we can exclude index records 1 to 4 from the search. Set
MinPos to AvgPos+1 = 5 and continue.
- Three
- We look at the index file in the middle of the possible region,
calculating AvgPos as ½(MinPos+MaxPos)
= ½(5+7) = 6.
At record 6 of the index we find the number 2; at record 2 of the data file we find
Ferdinand Frog; we know the file is ordered, and Iolanthe Ig should
come after Ferdinand Frog, so we can exclude index records 1 to 6 from the search. Set
MinPos to AvgPos+1 = 7 and continue.
- Four
- MinPos and MaxPos are both 7, meaning that if Iolanthe Ig is in the file at all, it must be
the record referred to by index record 7, as all others have been eliminated.
We look at record 7 of the index and find the number 14, so we look at record
14 of the data file and miraculously
find Iolanthe Ig there.
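The walkthrough above can be expressed as a loop. This is only a sketch, not a production implementation; the variable names follow the MinPos, MaxPos and AvgPos of the text, and each iteration performs the two disc reads (one on the index, one on the data file):

```python
def isam_search(data_file, isam_index, target):
    """Binary chop over an ISAM index. isam_index holds record numbers in
    key order; positions are 1-based, matching the walkthrough."""
    min_pos, max_pos = 1, len(isam_index)
    while min_pos <= max_pos:
        avg_pos = (min_pos + max_pos) // 2    # middle of the possible region
        record_no = isam_index[avg_pos - 1]   # disc access 1: read the index
        key = data_file[record_no]            # disc access 2: read the data
        if key == target:
            return record_no                  # found it
        if target < key:
            max_pos = avg_pos - 1             # target comes before: exclude the top
        else:
            min_pos = avg_pos + 1             # target comes after: exclude the bottom
    return None                               # not in the file

data_file = {
    1: "Jimmy Jumbalaya", 2: "Ferdinand Frog", 3: "Persephone Pang",
    4: "Elvis Earwig", 5: "Boris Baskerville", 6: "Mavis Mulligatawny",
    7: "Arbuthnot Ant", 8: "Zebidee Zing", 9: "Semolina Saltine",
    10: "Catherine Cat", 11: "Phillip Pharmer", 12: "Doris Day",
    13: "Quintin Queeg", 14: "Iolanthe Ig", 15: "Welliam Wedgie",
}
isam_index = [7, 5, 10, 12, 4, 2, 14, 1, 6, 3, 11, 13, 9, 15, 8]
record = isam_search(data_file, isam_index, "Iolanthe Ig")  # record 14, as above
```

Tracing the call for "Iolanthe Ig" reproduces steps One to Four exactly: index positions 8, 4, 6, then 7.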
Just as with the simple sorted file, the number of iterations of this procedure is
log-base-two of the number of records in the file. This time, each iteration involves
two disc accesses (one to read the index, one to read the data), so the access time
(under our normal assumptions) would be 2×log_2(N)×11ms. For a file
of 1,000,000 records that comes to 440ms, less than half a second.
At first sight, creating an ISAM index looks like a backwards step. Compared to just sorting
the data file, searching takes twice as long (two disc accesses per iteration instead of one),
and the index consumes extra space: it has to be stored somewhere, and would occupy 3MB for a
1,000,000 record file, since it takes 3 bytes to store a number in the range 1 to 1,000,000.
Two advantages make this backward step worthwhile: it becomes possible to have multiple
indexes on the same file, and we never spend time moving the large data records around.
Creating an ISAM index
Creating an ISAM index is almost exactly the same as sorting a file; we just don't move the
records. There are plenty of ways to sort a file, and I'm not going to spend the whole
night typing things we spent over a week on in class. Instead I'll just review the more
interesting and efficient methods.
Using Quicksort
Quicksort, partly because of its name, is the self-proclaimed expert programmer's
favourite. It probably shouldn't be, because in its traditional form
it has a fatal flaw that is quite likely to
expose itself during database operations. Having said that, it is a clever and
valuable method to have in your box of tricks, and easy to modify so that the flaw
is removed.
A reminder of how it works:
- Preparation
- Create an initial index file just containing all the record numbers in order.
For our example data, the initial index would be this:

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15

This index contains all the right information, just in the wrong order. The quicksort
algorithm just jiggles this index about until it is in the right order.
- One: Partition the index.
-
- Pick a random record from the index, and swap it with the first one. This simple step
removes the "fatal flaw" by disrupting any unfortunate patterns in the data.
The new first record is called the "pivot". Read the corresponding data record, and
remember it for the rest of the partitioning operation.
- Run through the rest of the index record-by-record. For each, read the corresponding record
from the data file and compare it with the pivot record. Rearrange things so that all index records
corresponding to data records that belong before the pivot record come first in the index file.
After that, the index will have been partitioned into (essentially) two portions. First come
references to data records that should come before the pivot, last come references to data
records that should come after the pivot.
- Two: Recursion.
- Each of the two partitions just created is certainly not sorted, but at least it is in
the right place. The whole quicksort algorithm (steps One and Two) is applied recursively
to each partition in turn.
Imagine that the random record picked in One.1 is record number 6 "Mavis Mulligatawny". The
partitioning process would arrange that records 1, 2, 4, 5, 7, 10, and 12 (those that belong
before Mavis) would come first in the jiggled index (although not in any particular order
beyond that), then record 6 itself would appear, then records 3, 8, 9, 11, 13, 14, 15 would finally
appear. So the partitioned index might be:

1 | 2 | 4 | 5 | 7 | 10 | 12 | 6 | 3 | 8 | 9 | 11 | 13 | 14 | 15

(the exact order within each portion depends on how the partitioning shuffles things),
which has neatly produced two equal-sized partitions. The recursive procedure would then be
applied to the 1, 2, 4, 5, 7, 10, 12 portion and later to the 3, 8, 9, 11, 13, 14, 15
portion separately.
If you need more detail than that, look it up in a book. I can't imagine typing the complete
quicksort algorithm coherently in HTML.
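A compact Python sketch is more manageable. This is only a sketch, with in-memory structures standing in for the index and data files; it uses the random-pivot swap from step One.1, and a Lomuto-style partition as one of several ways of arranging step One.2:

```python
import random

def quicksort_index(data_file, index, lo=0, hi=None):
    """Sort a list of record numbers in place by their data-file keys.
    Only the (small) record numbers ever move; the data records do not."""
    if hi is None:
        hi = len(index) - 1
    if lo >= hi:
        return                                    # partitions of size 0 or 1 are done
    # Step One.1: swap a random element to the front; it becomes the pivot.
    r = random.randint(lo, hi)
    index[lo], index[r] = index[r], index[lo]
    pivot_key = data_file[index[lo]]              # read (and remember) the pivot record
    # Step One.2: move everything that belongs before the pivot to the front.
    boundary = lo
    for i in range(lo + 1, hi + 1):
        if data_file[index[i]] < pivot_key:       # read a data record to compare
            boundary += 1
            index[boundary], index[i] = index[i], index[boundary]
    index[lo], index[boundary] = index[boundary], index[lo]  # pivot into final place
    # Step Two: recurse on the two partitions.
    quicksort_index(data_file, index, lo, boundary - 1)
    quicksort_index(data_file, index, boundary + 1, hi)

data_file = {
    1: "Jimmy Jumbalaya", 2: "Ferdinand Frog", 3: "Persephone Pang",
    4: "Elvis Earwig", 5: "Boris Baskerville", 6: "Mavis Mulligatawny",
    7: "Arbuthnot Ant", 8: "Zebidee Zing", 9: "Semolina Saltine",
    10: "Catherine Cat", 11: "Phillip Pharmer", 12: "Doris Day",
    13: "Quintin Queeg", 14: "Iolanthe Ig", 15: "Welliam Wedgie",
}
index = list(range(1, 16))         # the initial index: 1 to 15 in order
quicksort_index(data_file, index)  # jiggle it into key order
```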
In the best possible case, each partitioning step divides the index into two exactly equal
partitions, so again the number of steps before the partitions are reduced to a size of 1
(and therefore do not need to be sorted) would be log_2(N). Of course, the best possible case
will not usually arise. Sometimes the worst possible case will happen, and a partitioning
step will produce one partition of size 1, and one partition of size N-1. On average, we
will have an averagely-good-to-bad case, with one partition having about a quarter of the
index and the other having the rest. The number of steps it should be expected to take
to reduce the partitions to 1 member is now the solution of N×(¾)^t = 1, that is
t = log-base-4/3(N), which is about 2.5×log_2(N).
Each of these 2.5×log_2(N) levels of
the recursive procedure has to run through all N index records, and read all N
data records for comparison. Although we do read the index in sequential order, the data
records are all jumbled up and can only be read one at a time, so the expected time to
create an ISAM index using quicksort is about 2.5×N×log_2(N)×11ms.
For the standard example of N=1,000,000 records, this gives 2.5 × 1,000,000 × 20
× 11ms, which is 550,000 seconds, or about a week.
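That arithmetic is easy to check. A tiny sketch, under the standing assumptions (11ms per disc access, and log_2(1,000,000) taken as 20, as in the text):

```python
N = 1_000_000
LOG2_N = 20          # log_2(1,000,000) is just under 20
DISC_ACCESS = 0.011  # 11ms per disc access, our standing assumption

# Searching via the index: two disc accesses per binary-chop iteration.
search_seconds = 2 * LOG2_N * DISC_ACCESS           # 0.44 seconds

# Building the index with quicksort: about 2.5 * N * log_2(N) record reads.
quicksort_seconds = 2.5 * N * LOG2_N * DISC_ACCESS  # 550,000 seconds

quicksort_days = quicksort_seconds / 86_400         # roughly 6.4 days: call it a week
```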
All methods of creating an ISAM index involve sorting, and the best general purpose
sorting algorithms all require something in the region of N×log(N) operations.
All of them rearrange the index but do not rearrange the data file, so all of them
suffer from the problem of effectively random-order accesses to the data file which
prevents any speed-up from reading multiple records.
Mergesort always cuts the index exactly in half, so does not have the factor
of 2.5 in its timing (it is 2.5 times faster than quicksort on average), but has the small
disadvantage of needing extra disc space equal to the size of the index being created.
Heapsort is naturally almost as fast as mergesort, and does not require extra space,
but it has another disadvantage. Quicksort and mergesort both scan through the indexes they
are creating in sequential order (it is the data file that they access randomly); heapsort
accesses both the index file and the data file randomly, which has the effect of roughly
doubling the time it takes to execute.
As a general rule, Mergesort can be expected to be the fastest sorting algorithm for
creating an ISAM index, and it takes approximately two and a half days to create an index for
our 1,000,000 record file. What this teaches us is that it is not normal to create a new ISAM
index on an already existing large data file. Create the index when the file is empty, and
just maintain it whenever insertions or deletions are performed.
The usual trick for maintaining an ISAM
index is to wait until quite a few new records have been added, then sort them (just the
new records). So long as you don't wait too long, just sorting the new records won't take long
(you should expect to be able to sort a few thousand records entirely in memory, without
having to wait for any disc accesses at all). Then, the sorted new records can be merged
with the sorted original records in just N operations. So adding 1,000 records to a file
of 1,000,000 records would take something in the region of 1,000,000×11ms, which is
11,000 seconds or about 3 hours; not wonderful but at least tolerable. This is why serious
databases offer other options for indexes besides ISAM.
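The maintenance trick can be sketched in a few lines. This is an illustrative Python sketch (the function name and the tiny data set are invented for the example): sort just the new record numbers in memory, then combine them with the old index in one sequential merging pass.

```python
def merge_new_records(data_file, old_index, new_records):
    """Extend a sorted ISAM index with a batch of new record numbers:
    sort the small batch in memory, then merge in one sequential pass."""
    key = lambda rec_no: data_file[rec_no]
    new_sorted = sorted(new_records, key=key)  # a few thousand: all in memory
    merged, i, j = [], 0, 0
    while i < len(old_index) and j < len(new_sorted):
        if key(old_index[i]) <= key(new_sorted[j]):
            merged.append(old_index[i])
            i += 1
        else:
            merged.append(new_sorted[j])
            j += 1
    merged.extend(old_index[i:])               # one side runs out first;
    merged.extend(new_sorted[j:])              # append whatever remains
    return merged

# A toy data file: records 1-3 are already indexed, 4 and 5 are new arrivals.
data_file = {1: "Boris", 2: "Doris", 3: "Arbuthnot", 4: "Catherine", 5: "Elvis"}
old_index = [3, 1, 2]                          # Arbuthnot, Boris, Doris
new_index = merge_new_records(data_file, old_index, [4, 5])
```

The merge pass reads both lists strictly in order, which is what keeps it down to N operations.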