Indexes
Normally, nobody can effectively control the order in which
data records are added to a table. Data appears when it feels
like appearing, and has to be stored immediately, otherwise it
gets lost.
Original Data File, Unsorted

| Record# | Name | Department | Position |
|---|---|---|---|
| 1 | Jimmy Jumbalaya | Sales | salesperson |
| 2 | Ferdinand Frog | Marketing | liar |
| 3 | Persephone Pang | Accounts | junior accountant |
| 4 | Elvis Earwig | Catering and Cleaning | soup designer |
| 5 | Boris Baskerville | Marketing | liar |
| 6 | Mavis Mulligatawny | Research and Development | inventor |
| 7 | Arbuthnot Ant | Sales | tea boy |
| 8 | Zebidee Zing | Personnel | weasel |
| 9 | Semolina Saltine | Sales | senior under-manager |
| 10 | Catherine Cat | Marketing | liar |
| 11 | Phillip Pharmer | Training | buffoon |
| 12 | Doris Day | Catering and Cleaning | hamburger helper |
| 13 | Quintin Queeg | Hexagons | vice president |
| 14 | Iolanthe Ig | Research and Development | shouter |
| 15 | Welliam Wedgie | Marketing | liar |
When a file gets large enough to be interesting, finding the
information can take a long time. If we want to find out what
Iolanthe Ig does, we have to read the whole file. With only
15 records, who cares? But with a realistic size of 1,000,000
records things are different. Under the reasonably conservative
assumptions of a 3600 rpm disc rotation speed and a 3 ms
track-to-track seek time, giving 11 ms (seek time plus an average
half rotation) per disc access, reading a 1,000,000 record file
would take 1,000,000 × 11 ms, which is 11,000 seconds or about
3 hours. That would not be entirely satisfactory.
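To make the arithmetic easy to replay, here is a small Python sketch of
the same back-of-envelope calculation (the 3600 rpm and 3 ms figures are
the assumptions above; the 11.3 ms result is rounded to 11 ms in the text):

```python
# Back-of-envelope cost of a naive full scan: one disc access per record.
RPM = 3600          # disc rotation speed assumed above
SEEK_MS = 3.0       # track-to-track seek time assumed above

rotation_ms = 60_000 / RPM                 # one full rotation: ~16.7 ms
access_ms = SEEK_MS + rotation_ms / 2      # seek + average half rotation: ~11 ms

records = 1_000_000
scan_s = records * access_ms / 1000        # total time for the whole file
print(f"{access_ms:.1f} ms per access; full scan {scan_s:,.0f} s "
      f"(~{scan_s / 3600:.1f} hours)")
```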
However, a
sensible operating system doesn't do exactly as it's told. If you
tell it to read a single record, it will at least read the entire
block that the record is stored in, and remember it. Then if you
later ask it to read the next record it may not have to do any
disc accesses at all; the data could already be in memory. If the
operating system "knows" what you are doing, it may be able to
read a whole bunch of blocks in one swoop of the disc's read/write
heads, and speed things up a lot.
How much data can
be read at once? At least a block, and at most a track (to read
more than one track, the heads would have to move). In the
minimum case of reading just one block, and making the reasonable
assumptions that our data file has 60 byte fixed-size records
and our disc has 512 byte blocks, reading a single block would
still take 11 ms but would get us 512÷60 = 8 records, so the
average time for reading a single record becomes 11÷8 ≈ 1.4 ms, and
the time to scan the whole file becomes 1,000,000 × 1.4 ms, which
is 1,400 seconds or about 23 minutes.
If we are able
to read a whole track at once (which relies on the cooperation
of the operating system: some do not allow you to choose how
your data is stored), and assuming that a track consists of the
usual 63 blocks, we would be able to read 63×512÷60 = 537
records in 20 ms (seek time plus one full rotation), giving an average
of 20÷537 ≈ 0.037 ms per record, which comes to 37 seconds for
the whole file. A huge improvement, but still not good enough. If
every database access took 37 seconds, nothing would ever get done.
One wonders how large corporations survived before computerisation.
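Continuing the sketch, here are the same amortised figures for block-sized
and track-sized reads, under the block and track assumptions above:

```python
# Amortised cost when each 11 ms (block) or 20 ms (track) access
# fetches many 60-byte records at once.
BLOCK_BYTES, RECORD_BYTES, BLOCKS_PER_TRACK = 512, 60, 63
records = 1_000_000

recs_per_block = BLOCK_BYTES // RECORD_BYTES                     # 8
recs_per_track = BLOCKS_PER_TRACK * BLOCK_BYTES // RECORD_BYTES  # 537

print(f"block reads: {records * 11 / recs_per_block / 1000:,.0f} s")  # ~1,375 s (quoted as 1,400)
print(f"track reads: {records * 20 / recs_per_track / 1000:,.0f} s")  # ~37 s
```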
Organising the file
in a rational way can help.
The simplest and most direct way to arrange a file to speed up
database operations is to sort its records so that they appear
in ascending order (based on the value of the primary key). Then
any search based on the primary key can be organised as a "binary
chop", and execute in time proportional to log(N), where N is the
number of records. Maintaining this ordering can be quite expensive
(in terms of time) because every time a new record is added, a space
has to be made for it in just the right place. Putting that aside for
a moment, and just assuming that it is a practical possibility, our
data file would be reordered to become:
Data File, Sorted on Name

| Record# | Name | Department | Position |
|---|---|---|---|
| 1 | Arbuthnot Ant | Sales | tea boy |
| 2 | Boris Baskerville | Marketing | liar |
| 3 | Catherine Cat | Marketing | liar |
| 4 | Doris Day | Catering and Cleaning | hamburger helper |
| 5 | Elvis Earwig | Catering and Cleaning | soup designer |
| 6 | Ferdinand Frog | Marketing | liar |
| 7 | Iolanthe Ig | Research and Development | shouter |
| 8 | Jimmy Jumbalaya | Sales | salesperson |
| 9 | Mavis Mulligatawny | Research and Development | inventor |
| 10 | Persephone Pang | Accounts | junior accountant |
| 11 | Phillip Pharmer | Training | buffoon |
| 12 | Quintin Queeg | Hexagons | vice president |
| 13 | Semolina Saltine | Sales | senior under-manager |
| 14 | Welliam Wedgie | Marketing | liar |
| 15 | Zebidee Zing | Personnel | weasel |
With this organisation, to find out what Iolanthe Ig does requires
four iterations of a simple procedure (sketched in code after the walkthrough):
- Preparation: We have no idea where Iolanthe Ig is in the file, so we set
MinPos=1, MaxPos=15 to represent this (MinPos and MaxPos represent the minimum
and maximum record numbers at which our target could possibly be found).
- One: We look in the middle of the possible region, calculating AvgPos as ½(MinPos+MaxPos)
= ½(1+15) = 8.
At record 8 we find Jimmy Jumbalaya; we know the file is ordered, and Iolanthe Ig should
come before Jimmy Jumbalaya, so we can exclude records 8 to 15 from the search. Set
MaxPos to AvgPos-1 = 7 and continue.
- Two: We look in the middle of the possible region, calculating AvgPos as ½(MinPos+MaxPos)
= ½(1+7) = 4.
At record 4 we find Doris Day; we know the file is ordered, and Iolanthe Ig should
come after Doris Day, so we can exclude records 1 to 4 from the search. Set
MinPos to AvgPos+1 = 5 and continue.
- Three: We look in the middle of the possible region, calculating AvgPos as ½(MinPos+MaxPos)
= ½(5+7) = 6.
At record 6 we find Ferdinand Frog; we know the file is ordered, and Iolanthe Ig should
come after Ferdinand Frog, so we can exclude records 1 to 6 from the search. Set
MinPos to AvgPos+1 = 7 and continue.
- Four: MinPos and MaxPos are both 7, meaning that if Iolanthe Ig is in the file, he must be
in record 7, as all others have been eliminated. We look at record 7 and miraculously
find him there.
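Here is a minimal sketch of the binary chop in Python, run against the
sorted table above (positions are 0-based here, where the walkthrough
counts records from 1):

```python
def binary_chop(records, target_name):
    """Search a list of records sorted by name; return (record, iterations)."""
    min_pos, max_pos = 0, len(records) - 1      # whole file is possible at first
    iterations = 0
    while min_pos <= max_pos:
        iterations += 1
        avg_pos = (min_pos + max_pos) // 2      # middle of the possible region
        name = records[avg_pos][0]
        if name == target_name:
            return records[avg_pos], iterations
        if target_name < name:                  # target sorts earlier: drop the top half
            max_pos = avg_pos - 1
        else:                                   # target sorts later: drop the bottom half
            min_pos = avg_pos + 1
    return None, iterations                     # not in the file

# The sorted data file from the table above.
sorted_file = [
    ("Arbuthnot Ant", "Sales", "tea boy"),
    ("Boris Baskerville", "Marketing", "liar"),
    ("Catherine Cat", "Marketing", "liar"),
    ("Doris Day", "Catering and Cleaning", "hamburger helper"),
    ("Elvis Earwig", "Catering and Cleaning", "soup designer"),
    ("Ferdinand Frog", "Marketing", "liar"),
    ("Iolanthe Ig", "Research and Development", "shouter"),
    ("Jimmy Jumbalaya", "Sales", "salesperson"),
    ("Mavis Mulligatawny", "Research and Development", "inventor"),
    ("Persephone Pang", "Accounts", "junior accountant"),
    ("Phillip Pharmer", "Training", "buffoon"),
    ("Quintin Queeg", "Hexagons", "vice president"),
    ("Semolina Saltine", "Sales", "senior under-manager"),
    ("Welliam Wedgie", "Marketing", "liar"),
    ("Zebidee Zing", "Personnel", "weasel"),
]

record, n = binary_chop(sorted_file, "Iolanthe Ig")
print(record, "found in", n, "iterations")      # 4 iterations, as in the walkthrough
```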
At each step, we halve the size of the possible region, which was originally the whole
file. When this repeated halving reduces the possible region to one record, we have found
our target. So, if we solve the equation 2^t = N, where N is the number of records in the
file, we will have t = the number of iterations required to find any particular record. The
solution is of course t = log₂(N). People who work with computers should have no trouble
with powers of two, particularly if you remember that 2^10 is very close to 1,000, so 2^20
is about 1,000,000 and 2^30 is about 1,000,000,000; therefore log₂(1,000) is 10,
log₂(1,000,000) is 20, and so on.
Returning to the original
question, we can now say that if our file had 1,000,000 records it would take just 20
iterations to find any record. Each iteration involves reading just one record, so the time
would be 20 × 11 ms = 220 ms, about a quarter of a second. Another enormous
improvement, and this time it is enough to make database operations viable.
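The same estimate in code, assuming the 11 ms per access from earlier:

```python
import math

# Binary chop iterations for a file of N records: ceil(log2(N)).
for n in (1_000, 1_000_000, 1_000_000_000):
    t = math.ceil(math.log2(n))
    print(f"{n:>13,} records: {t} iterations, ~{t * 11} ms")
```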
It is very important to notice that
we cannot make further significant improvements by reading a lot of records at once. This
is because our disc accesses are not to consecutive records, but jump all over the file
(in the example, the sequence was 8, 4, 6, 7; if we searched for Aaron Aardvark in a file
of 1,000,000 records, the sequence would be 500000, 250000, 125000, 62500, 31250, ...: all
very widely spread). Once we are really close to the target, we will be looking at
nearly-neighbouring records, but the effect is not enough to make a significant difference.
Unfortunately,
a file can only be sorted in one order at a time, so we have problems
if we want fast searches on more than one key for the same file; and if the records
are at all large, moving them around takes a lot of work, so sorting
the file can be rather inefficient. Instead, one creates an index.
The usual scheme for an index
is that the original data file is left completely alone. An index is a separate
file which contains information intended to make searches of the data file
more efficient. Because indexes are separate files that do not modify the
data file in any way, it is possible to have multiple indexes for the same
data.
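As a taste of what is coming, here is a hedged sketch of that scheme: the
data file stays in its arrival order, and each index is just a separate
sorted list of (key, record number) pairs, searched with the same binary
chop. The field positions and helper names here are illustrative only.

```python
import bisect

# The data file in its original arrival order; the list position serves
# as the record number. (Just a few of the records, for brevity.)
data_file = [
    ("Jimmy Jumbalaya", "Sales", "salesperson"),
    ("Ferdinand Frog", "Marketing", "liar"),
    ("Persephone Pang", "Accounts", "junior accountant"),
    ("Elvis Earwig", "Catering and Cleaning", "soup designer"),
    ("Iolanthe Ig", "Research and Development", "shouter"),
]

def build_index(records, field):
    """A separate, sorted list of (key, record#) pairs; the data is untouched."""
    return sorted((rec[field], pos) for pos, rec in enumerate(records))

name_index = build_index(data_file, 0)    # one index on Name...
dept_index = build_index(data_file, 1)    # ...and another on Department

def lookup(index, key):
    """Binary chop on the index, then a single access into the data file."""
    i = bisect.bisect_left(index, (key,))
    if i < len(index) and index[i][0] == key:
        return data_file[index[i][1]]
    return None

print(lookup(name_index, "Iolanthe Ig"))  # found without sorting the data file
```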