523 Second Test, 15 Apr 99
File Formats
A typical practical format for disc files might be the Three-Level Index, in which
a whole file is represented by one block, called the Header. The file's header
contains all the vital information about the file (such as name, size,
creation date, etc) plus some pointers that give indirect access to the
actual data of the file. If we assume/guess that the vital information
requires 412 bytes, we have 100 bytes left in the 512-byte block for the pointers.
When we're dealing with disc structures, a pointer always means
a pointer to a block on the same disc. Pointers to blocks are block addresses,
in the form (surface number, track number, sector number) at a low level,
but squeezed into a single integer for general use.
A large disc may have many millions of blocks, so a pointer needs to be
four bytes long. Thus we have room for 25 pointers in the header block.
If the file is
empty of data, all 25 of these pointers are NULL (i.e. zero). If the file is not empty,
the first few of these 25 pointers will point to other blocks;
each of these other blocks contains 128 pointers (512÷4) to further
blocks; each of those blocks in turn contains 128 pointers to yet other blocks,
which contain the actual data.
So we have:
- in the header, up to 25 pointers to "level 2" indexes;
- a level 2 index is just a block of up to 128 pointers, each pointing to a "level 3" index;
- a level 3 index is just a block of up to 128 pointers, each pointing to a data block;
- a data block contains up to 512 bytes of data.
Giving a maximum total of 25×128×128×512 = 209,715,200 bytes, or about 200MB in
a file.
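As a quick sanity check on that figure, here is a small Python sketch of the arithmetic (the 512-byte block, 4-byte pointers and 25 header pointers are just the assumptions made above):

BLOCK = 512                       # bytes per block (assumed above)
PTR = 4                           # bytes per block pointer (assumed above)
HEADER_PTRS = 25                  # pointers left in the header after the vital information
PTRS_PER_BLOCK = BLOCK // PTR     # 128 pointers in a full index block

# header -> level 2 indexes -> level 3 indexes -> data blocks
max_data_blocks = HEADER_PTRS * PTRS_PER_BLOCK * PTRS_PER_BLOCK
max_file_size = max_data_blocks * BLOCK
print(max_data_blocks)            # 409600 data blocks
print(max_file_size)              # 209715200 bytes, i.e. about 200MB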
Describing a one-level
index would not get full credit, because the very small maximum file size
means that it cannot really be considered practical. Two-level indexes and
I-Nodes would be acceptable, but linked lists would not, because of the terrible
access times.
When the file is first opened, any sensible operating system will read the header
into memory and keep it there until the file is closed. A random access will
have to read three further blocks (a level 2 index block, then a level 3 index
block, then a data block) before it can reach the data it needs, so we would
expect the average random access to take 30mS, regardless of the number of
bytes it is reading (so long as it doesn't read more than one block of data).
Sequential accesses are
different. To read the first byte of data, you still have to do the three disc accesses,
but then you get 512 bytes all for the price of one. For each 512 bytes, you
only have to read one further data block, until you have exhausted a whole
level 3 index block (after reading 128x512 bytes), when you have to read a new
level 3 index block. After each group of 128x128x512 bytes, you also have to read
a new level 2 index block. Fortunately, we only want the average time, and we don't
need six digits of accuracy in it. The average time per byte is very close to
10mS÷512, which is 20 microseconds per byte. The extra 10mS÷(512x128)
for reading an extra level 3 index, and the extra 10mS÷(512x128x128) for
reading an extra level 2 index do not have a significant effect on the result.
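The same arithmetic as a small Python sketch (using the 10mS access time and 512-byte blocks assumed above):

TA = 0.010                            # one disc access, 10mS (assumed above)
BLOCK = 512                           # bytes of data per data block

# sequential reading: one data-block read per 512 bytes, plus occasional index reads
per_byte = TA / BLOCK                 # about 19.5 microseconds per byte
extra_l3 = TA / (BLOCK * 128)         # a new level 3 index every 128 data blocks
extra_l2 = TA / (BLOCK * 128 * 128)   # a new level 2 index every 128*128 data blocks
print(per_byte, extra_l3, extra_l2)   # ~2e-05, ~1.5e-07, ~1.2e-09 seconds
print(per_byte + extra_l3 + extra_l2) # still about 20 microseconds per byte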
Query Trees
The table definitions are not worth retyping here; I'll just note that if you
add up all the data field sizes, you see that a customer record is 103 bytes long,
a doctor record is 103 bytes long, a medicine record is 56 bytes long, and a
prescription record is 20 bytes long. So we can fit 4 customers or 4 doctors
or 9 medicines or 25 prescriptions in a block. I'll also note that prescriptions
has 10,000,000 records, medicines has 1,000 records, doctors has 10,000 records,
and customers has 1,000,000 records.
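Those records-per-block figures are just the 512-byte block size divided by the record size, rounded down; here is a small Python sketch (record sizes and row counts as quoted above):

BLOCK = 512
sizes = {'customers': 103, 'doctors': 103, 'medicines': 56, 'prescriptions': 20}
rows  = {'customers': 1_000_000, 'doctors': 10_000,
         'medicines': 1_000, 'prescriptions': 10_000_000}

for table, size in sizes.items():
    per_block = BLOCK // size             # whole records that fit in one block
    blocks = -(-rows[table] // per_block) # blocks needed to hold the whole table
    print(table, per_block, blocks)
# customers 4 250000 / doctors 4 2500 / medicines 9 112 / prescriptions 25 400000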
The query under consideration was:
select c.fname, c.lname
from customers c, doctors d, medicines m, prescriptions p
where c.state='South Dakota'
and m.descr='Poison'
and p.when>=19720301 and p.when<=19720331
and d.city!=c.city
and p.patient_id=c.id
and p.doctor_id=d.id
and p.medicine_id=m.id;
The three tables customers, doctors, medicines, can not be joined directly;
they are only connected through the prescriptions table. So, the first join
must be between prescriptions and something else. The order of joining the
remaining tables is not fixed. I'll simply join the tables in the reverse of the
order in which they appear in the "from" clause.
The canonical query, according to some definitions, joins all the tables first
and then applies all the select conditions. According to other definitions,
we move conditions into the join operations when they form a natural join,
but leave all other conditions until after all the joins have been performed.
I'll do it both ways. Representing the canonical query tree as a sequence of operations
(it is too hard to draw a tree in HTML) we have:
- join prescriptions with medicines, natural join condition is p.medicine_id=m.id.
- join previous result with doctors, natural join condition is p.doctor_id=d.id.
- join previous result with customers, natural join condition is p.patient_id=c.id;
- select from previous result only records meeting condition c.state='South Dakota'
and m.descr='Poison'
and p.when>=19720301 and p.when<=19720331
and d.city!=c.city
- project from previous result only c.fname and c.lname.
So now we can calculate the amount of work done at each step. Remember that
a join without indexes has to be implemented as nested loops. In programming terms
we would have something like:
rewind(prescriptions)
while not(EOF(prescriptions))                // loop 1
{ p=read(prescriptions)
  rewind(medicines)
  while not(EOF(medicines))                  // loop 2
  { m=read(medicines)
    rewind(doctors)
    while not(EOF(doctors))                  // loop 3
    { d=read(doctors)
      rewind(customers)
      while not(EOF(customers))              // loop 4
      { c=read(customers)
        if (all conditions are true)
          print(c.fname, c.lname) } } } }
Loop 1 will execute 10,000,000 times. Therefore loop 2 will execute 10,000,000
x 1,000 times. Therefore loop 3 will execute 10,000,000 x 1,000 x 10,000 times.
Therefore loop 4 will execute 10,000,000 x 1,000 x 10,000 x 1,000,000 times.
Giving a total of roughly 100,000,000,000,000,000,000 read operations on
the customer table. (There will also be a lot of reads on the other tables, but
not nearly so many, so they won't have much effect on the total time). As
previously calculated, we can fit 4 customer records in a block, so we only have to
really do a disc read once for each 4 records, giving 25,000,000,000,000,000,000
disc reads. At 10mS each, that gives 250,000,000,000,000,000 seconds, or roughly
8,000,000,000 years.
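A quick Python sketch of that arithmetic (assuming, as above, 4 customer records per block, a 10mS disc access, and about 31,500,000 seconds in a year):

TA = 0.010                                        # seconds per disc read
loop4 = 10_000_000 * 1_000 * 10_000 * 1_000_000   # customer records read by loop 4
disc_reads = loop4 // 4                           # 4 customer records per block
seconds = disc_reads * TA
print(disc_reads)                                 # 25000000000000000000 (2.5e19)
print(seconds / (365 * 24 * 3600))                # roughly 8e9 years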
That was the most naive possible canonical query tree. Normally
it is considered OK to move the natural join conditions so that they
are tested as their joins are computed. That would result in a different
canonical query tree:
rewind(prescriptions)
while not(EOF(prescriptions))                // loop 1
{ p=read(prescriptions)
  rewind(medicines)
  while not(EOF(medicines))                  // loop 2
  { m=read(medicines)
    if (p.medicine_id==m.id)                 // join condition 2
    { rewind(doctors)
      while not(EOF(doctors))                // loop 3
      { d=read(doctors)
        if (p.doctor_id==d.id)               // join condition 3
        { rewind(customers)
          while not(EOF(customers))          // loop 4
          { c=read(customers)
            if (p.patient_id==c.id)          // join condition 4
            { if (remaining conditions are true)
                print(c.fname, c.lname) } } } } } } }
Loop 1 still executes 10,000,000 times, and loop 2 still executes
10,000,000 x 1,000 times. But only one medicine record can match
each prescription, so join condition 2 only succeeds 10,000,000 times.
So loop 3 will execute 10,000,000 x 10,000 times, and each prescription
only matches one doctor, so join condition 3 only succeeds 10,000,000 times.
So loop 4 will execute 10,000,000 x 1,000,000 times, giving a total of
10,000,000,000,000 records read from the customer table. This is one
ten-millionth of the number for the naive canonical query tree, so this
version should only take about 800 years.
Both versions of the canonical query would be accepted as correct.
For the optimised query tree, we would really have to try all
six possible orderings for joining the tables (six, not twenty-four, because
prescriptions has to be in the first join). However, that is likely to be
too much work under examination conditions, so we don't bother. Let's
just hope that the order we've already got is good enough, but keep alert
for obvious improvements. The only other
optimisation worth performing is moving the remaining select conditions
as far down the tree as possible. (We could also move projects, but that
has a much smaller effect). The optimised query tree would be:
rewind(prescriptions)
while not(EOF(prescriptions))                // loop 1
{ p=read(prescriptions)
  if (p.when>=19720301 and
      p.when<=19720331)                      // select condition 1
  { rewind(medicines)
    while not(EOF(medicines))                // loop 2
    { m=read(medicines)
      if (p.medicine_id==m.id and            // join condition 2
          m.descr=='Poison')                 // select condition 2
      { rewind(doctors)
        while not(EOF(doctors))              // loop 3
        { d=read(doctors)
          if (p.doctor_id==d.id)             // join condition 3
          { rewind(customers)
            while not(EOF(customers))        // loop 4
            { c=read(customers)
              if (p.patient_id==c.id and     // join condition 4
                  d.city!=c.city and         // select condition 4
                  c.state=='South Dakota')
                print(c.fname, c.lname) } } } } } } }
Loop 1 will execute 10,000,000 times, but condition 1 will only be true
(assumption: 1% of prescriptions are from March 1972) 1% of the time,
so only 100,000 prescriptions survive. Therefore loop 2 will execute
100,000 x 1,000 times; each prescription only matches one medicine,
so only 100,000 combinations survive join condition 2, and (assumption:
only 1% of prescriptions are for poison) only 1% of those, which is 1,000
combinations survive select condition 2.
So loop 3 is executed 1,000
x 10,000 times, giving 10,000,000 combinations, but each prescription only matches
one doctor, so only 1,000 combinations survive join condition 3.
So loop 4 is executed 1,000
x 1,000,000 times, giving 1,000,000,000 combinations, but each prescription
only matches one customer, so only 1,000 survive join condition 4, and
(assumption: 2% of people live in South Dakota, and 5% of people go to
out-of-town doctors) 2% of 5% of that 1,000, which is just one record
survives select condition 4.
The busiest piece of
the code is loop 4, which executes 1,000,000,000 times, meaning that
1,000,000,000 records are read from the customers file. That is just one
ten-thousandth of the number for the non-naive canonical query tree, so
the time should be one ten-thousandth of 800 years, which is about one
month.
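It may help to see those selectivity guesses chained together; a small Python sketch (the 1%, 1%, 2% and 5% figures are just the assumptions stated above):

# rows surviving each stage (integer arithmetic keeps the counts exact)
after_cond1 = 10_000_000 // 100              # assume 1% of prescriptions are from March 1972
loop2_reads = after_cond1 * 1_000            # every medicine scanned for each survivor
after_cond2 = after_cond1 // 100             # assume 1% of those are for poison
loop3_reads = after_cond2 * 10_000           # every doctor scanned for each survivor
loop4_reads = after_cond2 * 1_000_000        # every customer scanned for each survivor
result = after_cond2 * 2 // 100 * 5 // 100   # 2% South Dakota, then 5% out-of-town doctors
print(loop2_reads, loop3_reads, loop4_reads, result)
# 100000000 10000000 1000000000 1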
You may have noticed that there was no "select condition 3" in the optimised
query tree, and that meant that we couldn't reduce the amount of work done
at that stage. If we had joined customers before doctors perhaps
the query would run even faster. Let's try it:
rewind(prescriptions)
while not(EOF(prescriptions))                // loop 1
{ p=read(prescriptions)
  if (p.when>=19720301 and
      p.when<=19720331)                      // select condition 1
  { rewind(medicines)
    while not(EOF(medicines))                // loop 2
    { m=read(medicines)
      if (p.medicine_id==m.id and            // join condition 2
          m.descr=='Poison')                 // select condition 2
      { rewind(customers)
        while not(EOF(customers))            // loop 3
        { c=read(customers)
          if (p.patient_id==c.id and         // join condition 3
              c.state=='South Dakota')       // select condition 3
          { rewind(doctors)
            while not(EOF(doctors))          // loop 4
            { d=read(doctors)
              if (p.doctor_id==d.id and      // join condition 4
                  d.city!=c.city)            // select condition 4
                print(c.fname, c.lname) } } } } } } }
Loop 1 will execute 10,000,000 times, but condition 1 will only be true
(assumption: 1% of prescriptions are from March 1972) 1% of the time,
so only 100,000 prescriptions survive. Therefore loop 2 will execute
100,000 x 1,000 times; each prescription only matches one medicine,
so only 100,000 combinations survive join condition 2, and (assumption:
only 1% of prescriptions are for poison) only 1% of those, which is 1,000
combinations survive select condition 2.
So loop 3 is executed 1,000
x 1,000,000 times, giving 1,000,000,000 combinations, but each prescription only matches
one customer, so only 1,000 combinations survive join condition 3. Also, only
(assumption: 2% of people live in South Dakota) 2% of those survive
select condition 3, so we are left with just 20 combinations.
So loop 4 is executed 20
x 10,000 times, giving 200,000 combinations, but each prescription
only matches one doctor, so only 20 survive join condition 4, and
(assumption: 5% of people go to
out-of-town doctors) 5% of that 20, which is just one record,
survives select condition 4.
Loop 3 is still by far the busiest part of the code. Loop 1 reads 10,000,000
prescriptions, and therefore does 10,000,000÷25 = 400,000 disc reads. Loop 2
reads 100,000,000 medicines, and therefore does 100,000,000÷9 ≈ 11,000,000 disc
reads. Loop 3 reads 1,000,000,000 customers, and therefore does 250,000,000 disc
reads. Loop 4 reads 200,000 doctors, and therefore does 50,000 disc reads. We now
have a total of about 261,000,000 disc reads, at 10mS each giving a total time of
about 2,600,000 seconds, or roughly one month again.
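The disc-read arithmetic as a small Python sketch (per-block figures and the 10mS access time as assumed above):

TA = 0.010                         # seconds per disc read
loop1 = 10_000_000 // 25           # prescriptions, 25 per block
loop2 = 100_000_000 // 9           # medicines, 9 per block
loop3 = 1_000_000_000 // 4         # customers, 4 per block
loop4 = 200_000 // 4               # doctors, 4 per block
total = loop1 + loop2 + loop3 + loop4
print(loop1, loop2, loop3, loop4)  # 400000 11111111 250000000 50000
print(total, total * TA / 86400)   # about 261,000,000 reads, about 30 days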
So reordering the joins gains us almost nothing on its own: the dominant cost is
still scanning the whole customers table once for every surviving prescription,
and without an index there is no way to avoid that. Still far too slow.
Obviously indexes are needed.
Creating and Using Indexes
Using the last version of the plan for query execution, it would obviously help
a lot if we could jump straight to the group of prescriptions that meet the
condition (when>=19720301 and when<=19720331) and just run through them
one-by-one without having to waste time on all the others. For this, an ordering
index (B-tree or ISAM) should be created on prescriptions.when.
As we run through the
selected prescriptions one-by-one, we would like to be able to jump directly
to the correct medicine without searching the whole medicine table. For this
we need an index on medicines.id. Once we have found the right medicine, no extra
work is required to check that its description is poison; no additional index
will help.
Next, we would like to be able to jump
straight to the customer whose id matches the prescription, so an index on
customers.id would help. If that customer does not live in South Dakota, the
record is abandoned; no additional index will help.
Next we want to jump
straight to the correct doctor record, so an index on doctors.id would help.
No other indexes are needed;
we can't avoid looking up the correct doctor, and once that is done, no more
table accesses are needed to check that the cities are different.
So, we want indexes on
- prescriptions.when
- medicines.id
- customers.id
- doctors.id
The only restriction is that the index on prescriptions.when must be an ordering
index, therefore not a hash table.
For simplicity, let's assume
they are all ISAM indexes.
Creating an ISAM index takes something in the order of 2×Ta×N×log2(N),
where Ta is the disc access time (10mS) and N is the table size. The prescriptions
table is the biggest, with log2(10,000,000) ≈ 24, giving time
2×10mS×10,000,000×24 = 4,800,000 seconds, which is nearly two months.
If we are really careful with programming,
we should be able to reduce that by a factor of 25 (because we can fit 25
prescription records in a block, and therefore read them at the rate of 25
for the price of 1), giving a little over two days. Thus we realise that it is a really good
idea to create indexes before we fill our tables with data. The customers index takes
a little over a day by the same reasoning, and the other tables are much shorter,
so the total index creation time would be around three and a half days.
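A rough Python sketch of that estimate (the 2×Ta×N×log2(N) cost and the factor of 25 from block packing are the figures quoted above):

import math

TA = 0.010                                    # disc access time, 10mS
N = 10_000_000                                # records in the prescriptions table

naive = 2 * TA * N * math.ceil(math.log2(N))  # one disc access per record touched
blocked = naive / 25                          # handle 25 prescriptions per block read
print(naive, naive / 86400)                   # ~4,800,000 seconds, i.e. nearly two months
print(blocked, blocked / 86400)               # ~192,000 seconds, a little over two days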
The plan of execution becomes:
seek(prescriptions, when>=19720301)          // index search 1
while not(EOF(prescriptions))                // loop
{ p=read(prescriptions)
  if (p.when>19720331) break                 // termination condition 1
  seek(medicines, id==p.medicine_id)         // index search 2
  if (not found) continue
  m=read(medicines)
  if (m.descr!='Poison') continue            // select condition 2
  seek(customers, id==p.patient_id)          // index search 3
  if (not found) continue
  c=read(customers)
  if (c.state!='South Dakota') continue      // select condition 3
  seek(doctors, id==p.doctor_id)             // index search 4
  if (not found) continue
  d=read(doctors)
  if (d.city==c.city) continue               // select condition 4
  print(c.fname, c.lname) }
(Recall that "continue" means go back to the top of the current loop;
we can only get away with just having a single loop because the
id fields are all clearly supposed to be unique keys, which means we
can guarantee that only one record will match any given value, so there
is no need for a nested loop to walk through all the matches).
Finding something in an ISAM index takes time 2×Ta×log2(N).
For index search 1, N=10,000,000, so we have about 48×Ta. We have already
made the assumption that 1% of prescriptions meet the date requirement,
so the loop executes 100,000 times, each time taking Ta to read one
prescription record.
Inside
that loop, we have an index search
on medicines, N=1000 so time=20×Ta executed 100,000 times. Only 1%
or 1,000 records survive select condition 2.
Next we have an
index search on customers, N=1,000,000 so time=40×Ta executed 1,000
times. Only 2% or 20 records survive select condition 3.
Next we have an index
search on doctors, N=10,000 so time=28×Ta executed 20 times.
The total time for all these operations is Ta × (48 + 100,000
+ 100,000×20 + 1,000×40 + 20×28), which comes to about
2,100,000×10mS, or 21,000 seconds, or nearly six hours.
If we had used hash tables for each of the indexes (except of course
prescriptions.when), the average index search time would be something
in the region of 2.5×Ta instead of 2×Ta×log2(N), which
would reduce the total time to about one hour.
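Putting those per-search costs together, as a Python sketch (the log2 factors are rounded up as above, and the 2.5×Ta hash figure is the rough estimate just quoted):

import math

TA = 0.010                                    # disc access time, 10mS

def search(n):                                # ISAM search cost, in units of Ta
    return 2 * math.ceil(math.log2(n))

isam = (search(10_000_000)                    # index search 1 on prescriptions.when
        + 100_000                             # read the 100,000 surviving prescriptions
        + 100_000 * search(1_000)             # index search 2, once per prescription
        + 1_000 * search(1_000_000)           # index search 3, once per poison row
        + 20 * search(10_000))                # index search 4, once per SD customer
print(isam, isam * TA / 3600)                 # ~2,140,000 Ta, i.e. nearly six hours

hashed = 100_000 + (100_000 + 1_000 + 20) * 2.5   # hash lookups average about 2.5 Ta
print(hashed, hashed * TA / 3600)             # ~353,000 Ta, i.e. about one hour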
The total time to create the indexes and then use them in a query (around four
days) is far less than the month or so it would take to answer the query without
them, and once the indexes have been created, they can be used over and over again
at no extra cost. Even a single query of this sort would justify the time it takes
to create the indexes.
We might like to note that in the real world, we would probably not
keep all the records for as far back as 1972 in the current, on-line version
of the database. There would probably be a collection of much smaller archive files
for dates in the distant past, so such searches would be a lot faster. Also, it
is very likely that records are added to a table in more-or-less chronological
order, so the prescriptions file might already be ordered by date; this would
mean that creating that first index could be much faster.