Indexes

Normally, nobody can effectively control the order in which data records are added to a table. Data appears when it feels like appearing, and has to be stored immediately, otherwise it gets lost.
Original Data File, Unsorted
Record#  Name                 Department                  Position
      1  Jimmy Jumbalaya      Sales                       salesperson
      2  Ferdinand Frog       Marketing                   liar
      3  Persephone Pang      Accounts                    junior accountant
      4  Elvis Earwig         Catering and Cleaning       soup designer
      5  Boris Baskerville    Marketing                   liar
      6  Mavis Mulligatawny   Research and Development    inventor
      7  Arbuthnot Ant        Sales                       tea boy
      8  Zebidee Zing         Personnel                   weasel
      9  Semolina Saltine     Sales                       senior under-manager
     10  Catherine Cat        Marketing                   liar
     11  Phillip Pharmer      Training                    buffoon
     12  Doris Day            Catering and Cleaning       hamburger helper
     13  Quintin Queeg        Hexagons                    vice president
     14  Iolanthe Ig          Research and Development    shouter
     15  Welliam Wedgie       Marketing                   liar
When a file gets large enough to be interesting, finding the information can take a long time. If we want to find out what Iolanthe Ig does, we have to read the whole file. With only 15 records who cares? But with a realistic size of 1,000,000 records things are different:
        Under the reasonable, conservative assumptions of a 3600 rpm disc rotation speed and a 3ms track-to-track seek time, each disc access takes about 11ms (seek time plus an average of half a rotation). Reading a 1,000,000 record file one record at a time would therefore take 1,000,000×11ms, which is 11,000 seconds, or about 3 hours, which would not be entirely satisfactory.
        However, a sensible operating system doesn't do exactly as it's told. If you tell it to read a single record, it will at least read the entire block that the record is stored in, and remember it. Then if you later ask it to read the next record it may not have to do any disc accesses at all; the data could already be in memory. If the operating system "knows" what you are doing, it may be able to read a whole bunch of blocks in one swoop of the disc's read/write heads, and speed things up a lot.
        How much data can be read at once? At least a block, and at most a track (to read more than one track, the heads would have to move). In the minimum case of reading just one block, and making the reasonable assumption that our data file has 60 byte fixed-size records and our disc has 512 byte blocks, reading a single block would still take 11ms but would get us 512÷60 ≈ 8 records, so the average time for reading a single record becomes 11÷8 ≈ 1.4ms, and the time to scan the whole file becomes 1,000,000×1.4ms, which is 1,400 seconds, or about 23 minutes.
        If we are able to read a whole track at once (which relies on the cooperation of the operating system: some do not allow you to choose how your data is stored), and assuming that a track consists of the usual 63 blocks, we would be able to read 63×512÷60 ≈ 537 records in 20ms (seek time plus one full rotation), giving an average of 20÷537 ≈ 0.037ms per record, which comes to 37 seconds for the whole file. A huge improvement, but still not good enough. If every database access took 37 seconds, nothing would ever get done. One wonders how large corporations survived before computerisation.
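        These figures are easy to verify with a few lines of arithmetic. Here is a minimal sketch in Python, using only the assumptions stated above (3600 rpm, 3ms seek, 512 byte blocks, 63 blocks per track, 60 byte fixed-size records, 1,000,000 records); the figures are illustrative assumptions, not measurements of any real disc:

RPM = 3600                        # disc rotation speed (assumed)
SEEK_MS = 3.0                     # track-to-track seek time (assumed)
BLOCK_BYTES = 512
BLOCKS_PER_TRACK = 63
RECORD_BYTES = 60
RECORDS = 1_000_000

rotation_ms = 60_000 / RPM                # one full rotation: about 16.7ms
access_ms = SEEK_MS + rotation_ms / 2     # seek plus average half rotation: about 11ms

# 1) One disc access per record
print("record at a time:", RECORDS * access_ms / 1000, "seconds")

# 2) One disc access per block (several records come back with each block)
records_per_block = BLOCK_BYTES // RECORD_BYTES
print("block at a time: ", RECORDS / records_per_block * access_ms / 1000, "seconds")

# 3) One disc access per track (seek plus one full rotation reads the whole track)
records_per_track = BLOCKS_PER_TRACK * BLOCK_BYTES // RECORD_BYTES
track_ms = SEEK_MS + rotation_ms
print("track at a time: ", RECORDS / records_per_track * track_ms / 1000, "seconds")

        Running it prints roughly 11,000, 1,400 and 37 seconds, matching the three estimates above.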
        Organising the file in a rational way can help.

The simplest and most direct way to arrange a file to speed up database operations is to sort its records so that they appear in ascending order (based on the value of the primary key). Then any search based on the primary key can be organised as a "binary chop", and execute in time proportional to log(N), where N is the number of records. Maintaining this ordering can be quite expensive (in terms of time) because every time a new record is added, a space has to be made for it in just the right place. Putting that aside for a moment, and just assuming that it is a practical possibility, our data file would be reordered to become:
Data File, Sorted on Name
Record#  Name                 Department                  Position
      1  Arbuthnot Ant        Sales                       tea boy
      2  Boris Baskerville    Marketing                   liar
      3  Catherine Cat        Marketing                   liar
      4  Doris Day            Catering and Cleaning       hamburger helper
      5  Elvis Earwig         Catering and Cleaning       soup designer
      6  Ferdinand Frog       Marketing                   liar
      7  Iolanthe Ig          Research and Development    shouter
      8  Jimmy Jumbalaya      Sales                       salesperson
      9  Mavis Mulligatawny   Research and Development    inventor
     10  Persephone Pang      Accounts                    junior accountant
     11  Phillip Pharmer      Training                    buffoon
     12  Quintin Queeg        Hexagons                    vice president
     13  Semolina Saltine     Sales                       senior under-manager
     14  Welliam Wedgie       Marketing                   liar
     15  Zebidee Zing         Personnel                   weasel
With this organisation, to find out what Iolanthe Ig does requires four iterations of a simple procedure:
Preparation
We have no idea where Iolanthe Ig is in the file, so we set MinPos=1, MaxPos=15 to represent this (MinPos and MaxPos represent the minimum and maximum record numbers at which our target could possibly be found).
One
We look in the middle of the possible region, calculating AvgPos as ½(MinPos+MaxPos) = ½(1+15) = 8. At record 8 we find Jimmy Jumbalaya; we know the file is ordered, and Iolanthe Ig should come before Jimmy Jumbalaya, so we can exclude records 8 to 15 from the search. Set MaxPos to AvgPos-1 = 7 and continue.
Two
We look in the middle of the possible region, calculating AvgPos as ½(MinPos+MaxPos) = ½(1+7) = 4. At record 4 we find Doris Day; we know the file is ordered, and Iolanthe Ig should come after Doris Day, so we can exclude records 1 to 4 from the search. Set MinPos to AvgPos+1 = 5 and continue.
Three
We look in the middle of the possible region, calculating AvgPos as ½(MinPos+MaxPos) = ½(5+7) = 6. At record 6 we find Ferdinand Frog; we know the file is ordered, and Iolanthe Ig should come after Ferdinand Frog, so we can exclude records 1 to 6 from the search. Set MinPos to AvgPos+1 = 7 and continue.
Four
MinPos and MaxPos are both 7, meaning that if Iolanthe Ig is in the file, he must be in record 7, as all others have been eliminated. We look at record 7 and miraculously find him there.
At each step, we halve the size of the possible region, which was originally the whole file. When this repeated halving reduces the possible region to one record, we have found our target. So, if we solve the equation 2 to-the-power-of t = number-of-records-in-file, we will have t = number of iterations required to find any particular record. The solution is of course t = logarithm-base-two of number-of-records-in-file. People who work with computers should have no trouble with powers of two, particularly if you remember that 2 to the power of 10 is very close to 1,000, so 2**20 is about 1,000,000, and 2**30 is about 1,000,000,000, and therefore log_2(1000) is about 10, log_2(1,000,000) is about 20, and so on.
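        The procedure is simple enough to express directly in code. Here is a minimal sketch in Python, assuming (unlike a real database) that the sorted records are already in memory as a list of (name, department, position) tuples rather than being fetched from disc one at a time:

def binary_chop(records, target_name):
    # records must be sorted by name; returns the matching record, or None if absent.
    min_pos, max_pos = 0, len(records) - 1     # zero-based, unlike the 1-based record numbers above
    while min_pos <= max_pos:
        avg_pos = (min_pos + max_pos) // 2     # look in the middle of the possible region
        name = records[avg_pos][0]
        if name == target_name:
            return records[avg_pos]
        if target_name < name:                 # target must come earlier in the ordering
            max_pos = avg_pos - 1
        else:                                  # target must come later in the ordering
            min_pos = avg_pos + 1
    return None                                # possible region shrank to nothing: not in the file

        On the sorted table above, binary_chop(records, "Iolanthe Ig") inspects records 8, 4, 6 and 7 (in the table's 1-based numbering), exactly the sequence worked through in steps One to Four.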
        Returning to the original question, we can now say that if our file had 1,000,000 records it would take just 20 iterations to find any record. Each iteration involves reading just one record, so the time would be 20×11ms = 220ms, under a quarter of a second. Another enormous improvement. This time, it is enough of an improvement to make database operations viable.
        It is very important to notice that we cannot make further significant improvements by reading a lot of records at once. This is because our disc accesses are not to consecutive records, but jump all over the file (in the example, the sequence was 8, 4, 6, 7; if we searched for Aaron Aardvark in a file of 1,000,000 records, the sequence would be 500000, 250000, 125000, 62500, 31250, ...: all very widely spread). Once we are really close to the target, we will be looking at nearly-neighbouring records, but the effect is not enough to make a significant difference.


Unfortunately, a file can only be sorted in one order at a time, so we have problems if we want fast searches on more than one key for the same file; and if the records are at all large, moving them around takes a lot of work, so keeping the file sorted can be rather inefficient. Instead, one creates an index.
        The usual scheme for an index is that the original data file is completely left alone. An index is a separate file which contains information intended to make searches of the data file more efficient. Because indexes are separate files that do not modify the data file in any way, it is possible to have multiple indexes for the same data.
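        As a rough illustration of the idea (a sketch only, not how any real database system lays out its files), an index over the unsorted data file can be as simple as a separate, sorted list of (key, record number) pairs. A search is then a binary chop over the small index, followed by a single direct access to the data file:

def build_name_index(data_records):
    # data_records: (name, department, position) tuples in their original arrival order.
    # The index is a separate structure, sorted by name, pointing back at record numbers.
    return sorted((name, recno) for recno, (name, _, _) in enumerate(data_records, start=1))

def lookup(name_index, data_records, target_name):
    # Binary chop over the small index, then one direct access to the untouched data file.
    lo, hi = 0, len(name_index) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        name, recno = name_index[mid]
        if name == target_name:
            return data_records[recno - 1]     # record numbers are 1-based, the list is 0-based
        if target_name < name:
            hi = mid - 1
        else:
            lo = mid + 1
    return None

        A second index, say on Department, would just be another sorted list of (department, record number) pairs built over the same, unmodified data records, which is exactly why several indexes can coexist on one file.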