523 Second Test, 15 Apr 99
File Formats
A typical practical format for disc files might be the Three-Level Index, in which
a whole file is represented by one block, called the Header. The file's header
contains all the vital information about the file (such as name, size,
creation date, etc) plus some pointers that give indirect access to the
actual data of the file. If we assume/guess that the vital information
requires 412 bytes, we have 100 bytes left in the 512-byte block for the pointers.
When we're dealing with disc structures, a pointer always means
a pointer to a block on the same disc. Pointers to blocks are block addresses,
in the form (surface number, track number, sector number) at a low level,
but squeezed into a single integer for general use.
A large disc may have many millions of blocks, so a pointer needs to be
four bytes long. Thus we have room for 25 pointers in the header block.
If the file is
empty of data, all 25 of these pointers are NULL (i.e. zero). If the file is not empty,
the first few of these 25 pointers will point to other blocks;
each of these other blocks contains 128 pointers (512÷4) to further
blocks; each of those blocks in turn contains 128 pointers to yet other blocks,
which contain the actual data.
So we have:
- in the header, up to 25 pointers to "level 2" indexes;
- a level 2 index is just a block of up to 128 pointers, each pointing to a "level 3" index;
- a level 3 index is just a block of up to 128 pointers, each pointing to a data block;
- a data block contains up to 512 bytes of data.
Giving a maximum total of 25×128×128×512 = 209,715,200 bytes, or about 200MB in
a file.
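As a quick sanity check on that figure, here is a small Python sketch of the arithmetic (the 512-byte block, 4-byte pointers and 25 header pointers are just the assumptions made above):

BLOCK = 512                       # bytes per block (assumed above)
PTR = 4                           # bytes per block pointer (assumed above)
HEADER_PTRS = 25                  # pointers left in the header after the vital information
PTRS_PER_BLOCK = BLOCK // PTR     # 128 pointers in a full index block

# header -> level 2 indexes -> level 3 indexes -> data blocks
max_data_blocks = HEADER_PTRS * PTRS_PER_BLOCK * PTRS_PER_BLOCK
max_file_size = max_data_blocks * BLOCK
print(max_data_blocks)            # 409600 data blocks
print(max_file_size)              # 209715200 bytes, i.e. about 200MB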
Describing a one-level
index would not get full credit, because the very small maximum file size
means that it cannot really be considered practical. Two-level indexes and
I-Nodes would be acceptable, but linked lists would not, because of the terrible
access times.
When the file is first opened, any sensible operating system will read the header
into memory and keep it there until the file is closed. A random access will
have to read three further blocks (a level 2 index block, then a level 3 index
block, then a data block) before it can reach the data it needs, so we would
expect the average random access to take 30mS, regardless of the number of
bytes it is reading (so long as it doesn't read more than one block of data).
Sequential accesses are
different. To read the first byte of data, you still have to do the three disc accesses,
but then you get 512 bytes all for the price of one. For each 512 bytes, you
only have to read one further data block, until you have exhausted a whole
level 3 index block (after reading 128x512 bytes), when you have to read a new
level 3 index block. After each group of 128x128x512 bytes, you also have to read
a new level 2 index block. Fortunately, we only want the average time, and we don't
need six digits of accuracy in it. The average time per byte is very close to
10mS÷512, which is 20 microseconds per byte. The extra 10mS÷(512x128)
for reading an extra level 3 index, and the extra 10mS÷(512x128x128) for
reading an extra level 2 index do not have a significant effect on the result.
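The same arithmetic as a small Python sketch (using the 10mS access time and 512-byte blocks assumed above):

TA = 0.010                            # one disc access, 10mS (assumed above)
BLOCK = 512                           # bytes of data per data block

# sequential reading: one data-block read per 512 bytes, plus occasional index reads
per_byte = TA / BLOCK                 # about 19.5 microseconds per byte
extra_l3 = TA / (BLOCK * 128)         # a new level 3 index every 128 data blocks
extra_l2 = TA / (BLOCK * 128 * 128)   # a new level 2 index every 128*128 data blocks
print(per_byte, extra_l3, extra_l2)   # ~2e-05, ~1.5e-07, ~1.2e-09 seconds
print(per_byte + extra_l3 + extra_l2) # still about 20 microseconds per byte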
Query Trees
The table definitions are not worth retyping here; I'll just note that if you
add up all the data field sizes, you see that a customer record is 103 bytes long,
a doctor record is 103 bytes long, a medicine record is 56 bytes long, and a
prescription record is 20 bytes long. So we can fit 4 customers or 4 doctors
or 9 medicines or 25 prescriptions in a block. I'll also note that prescriptions
has 10,000,000 records, medicines has 1,000 records, doctors has 10,000 records,
and customers has 1,000,000 records.
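Those records-per-block figures are just the 512-byte block size divided by the record size, rounded down; here is a small Python sketch (record sizes and row counts as quoted above):

BLOCK = 512
sizes = {'customers': 103, 'doctors': 103, 'medicines': 56, 'prescriptions': 20}
rows  = {'customers': 1_000_000, 'doctors': 10_000,
         'medicines': 1_000, 'prescriptions': 10_000_000}

for table, size in sizes.items():
    per_block = BLOCK // size             # whole records that fit in one block
    blocks = -(-rows[table] // per_block) # blocks needed to hold the whole table
    print(table, per_block, blocks)
# customers 4 250000 / doctors 4 2500 / medicines 9 112 / prescriptions 25 400000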
The query under consideration was:
select c.fname, c.lname
from customers c, doctors d, medicines m, prescriptions p
where c.state='South Dakota'
and m.descr='Poison'
and p.when>=19720301 and p.when<=19720331
and d.city!=c.city
and p.patient_id=c.id
and p.doctor_id=d.id
and p.medicine_id=m.id;
The three tables customers, doctors, medicines, can not be joined directly;
they are only connected through the prescriptions table. So, the first join
must be between prescriptions and something else. The order of joining the
remaining tables is not fixed. I'll simply join the tables in the reverse of the
order in which they appear in the "from" clause.
The canonical query, according to some definitions, joins all the tables first
and then applies all the select conditions. According to other definitions,
we move conditions into the join operations when they form a natural join,
but leave all other conditions until after all the joins have been performed.
I'll do it both ways. Representing the canonical query tree as a sequence of operations
(it is too hard to draw a tree in HTML) we have:
- join prescriptions with medicines, natural join condition is p.medicine_id=m.id.
- join previous result with doctors, natural join condition is p.doctor_id=d.id.
- join previous result with customers, natural join condition is p.patient_id=c.id;
- select from previous result only records meeting condition c.state='South Dakota'
and m.descr='Poison'
and p.when>=19720301 and p.when<=19720331
and d.city!=c.city
- project from previous result only c.fname and c.lname.
So now we can calculate the amount of work done at each step. Remember that
a join without indexes has to be implemented as nested loops. In programming terms
we would have something like:
rewind(prescriptions)
while not(EOF(prescriptions))                // loop 1
{ p=read(prescriptions)
  rewind(medicines)
  while not(EOF(medicines))                  // loop 2
  { m=read(medicines)
    rewind(doctors)
    while not(EOF(doctors))                  // loop 3
    { d=read(doctors)
      rewind(customers)
      while not(EOF(customers))              // loop 4
      { c=read(customers)
        if (all conditions are true)
          print(c.fname, c.lname) } } } }
Loop 1 will execute 10,000,000 times. Therefore loop 2 will execute 10,000,000
x 1,000 times. Therefore loop 3 will execute 10,000,000 x 1,000 x 10,000 times.
Therefore loop 4 will execute 10,000,000 x 1,000 x 10,000 x 1,000,000 times.
Giving a total of roughly 100,000,000,000,000,000,000 read operations on
the customer table. (There will also be a lot of reads on the other tables, but
not nearly so many, so they won't have much effect on the total time). As
previously calculated, we can fit 4 customer records in a block, so we only have to
really do a disc read once for each 4 records, giving 25,000,000,000,000,000,000
disc reads. At 10mS each, that gives 250,000,000,000,000,000 seconds, or roughly
8,000,000,000 years.
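A quick Python sketch of that arithmetic (assuming, as above, 4 customer records per block, a 10mS disc access, and about 31,500,000 seconds in a year):

TA = 0.010                                        # seconds per disc read
loop4 = 10_000_000 * 1_000 * 10_000 * 1_000_000   # customer records read by loop 4
disc_reads = loop4 // 4                           # 4 customer records per block
seconds = disc_reads * TA
print(disc_reads)                                 # 25000000000000000000 (2.5e19)
print(seconds / (365 * 24 * 3600))                # roughly 8e9 years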
That was the most naive possible canonical query tree. Normally
it is considered OK to move the natural join conditions so that they
are tested as their joins are computed. That would result in a different
canonical query tree:
rewind(prescriptions)
while not(EOF(prescriptions))                // loop 1
{ p=read(prescriptions)
  rewind(medicines)
  while not(EOF(medicines))                  // loop 2
  { m=read(medicines)
    if (p.medicine_id==m.id)                 // join condition 2
    { rewind(doctors)
      while not(EOF(doctors))                // loop 3
      { d=read(doctors)
        if (p.doctor_id==d.id)               // join condition 3
        { rewind(customers)
          while not(EOF(customers))          // loop 4
          { c=read(customers)
            if (p.patient_id==c.id)          // join condition 4
            { if (remaining conditions are true)
                print(c.fname, c.lname) } } } } } } }
Loop 1 still executes 10,000,000 times, and loop 2 still executes
10,000,000 x 1,000 times. But only one medicine record can match
each prescription, so join condition 2 only succeeds 10,000,000 times.
So loop 3 will execute 10,000,000 x 10,000 times, and each prescription
only matches one doctor, so join condition 3 only succeeds 10,000,000 times.
So loop 4 will execute 10,000,000 x 1,000,000 times, giving a total of
10,000,000,000,000 records read from the customer table. This is one
ten-millionth of the number for the naive canonical query tree, so this
version should only take about 800 years.
Both versions of the canonical query would be accepted as correct.
For the optimised query tree, we would really have to try all
six possible orderings for joining the tables (six, not twenty-four, because
prescriptions has to be in the first join). However, that is likely to be
too much work under examination conditions, so we don't bother. Let's
just hope that the order we've already got is good enough, but keep alert
for obvious improvements. The only other
optimisation worth performing is moving the remaining select conditions
as far down the tree as possible. (We could also move projects, but that
has a much smaller effect). The optimised query tree would be:
rewind(prescriptions)
while not(EOF(prescriptions))                // loop 1
{ p=read(prescriptions)
  if (p.when>=19720301 and
      p.when<=19720331)                      // select condition 1
  { rewind(medicines)
    while not(EOF(medicines))                // loop 2
    { m=read(medicines)
      if (p.medicine_id==m.id and            // join condition 2
          m.descr=='Poison')                 // select condition 2
      { rewind(doctors)
        while not(EOF(doctors))              // loop 3
        { d=read(doctors)
          if (p.doctor_id==d.id)             // join condition 3
          { rewind(customers)
            while not(EOF(customers))        // loop 4
            { c=read(customers)
              if (p.patient_id==c.id and     // join condition 4
                  d.city!=c.city and         // select condition 4
                  c.state=='South Dakota')
                print(c.fname, c.lname) } } } } } } }
Loop 1 will execute 10,000,000 times, but condition 1 will only be true
(assumption: 1% of prescriptions are from March 1972) 1% of the time,
so only 100,000 prescriptions survive. Therefore loop 2 will execute
100,000 x 1,000 times; each prescription only matches one medicine,
so only 100,000 combinations survive join condition 2, and (assumption:
only 1% of prescriptions are for poison) only 1% of those, which is 1,000
combinations survive select condition 2.
So loop 3 is executed 1,000
x 10,000 times, giving 10,000,000 combinations, but each prescription only matches
one doctor, so only 1,000 combinations survive join condition 3.
So loop 4 is executed 1,000
x 1,000,000 times, giving 1,000,000,000 combinations, but each prescription
only matches one customer, so only 1,000 survive join condition 4, and
(assumption: 2% of people live in South Dakota, and 5% of people go to
out-of-town doctors) 2% of 5% of that 1,000, which is just one record
survives select condition 4.
The busiest piece of
the code is loop 4, which executes 1,000,000,000 times, meaning that
1,000,000,000 records are read from the customers file. That is just one
ten-thousandth of the number for the non-naive canonical query tree, so
the time should be one ten-thousandth of 800 years, which is about one
month.
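It may help to see those selectivity guesses chained together; a small Python sketch (the 1%, 1%, 2% and 5% figures are just the assumptions stated above):

# rows surviving each stage (integer arithmetic keeps the counts exact)
after_cond1 = 10_000_000 // 100              # assume 1% of prescriptions are from March 1972
loop2_reads = after_cond1 * 1_000            # every medicine scanned for each survivor
after_cond2 = after_cond1 // 100             # assume 1% of those are for poison
loop3_reads = after_cond2 * 10_000           # every doctor scanned for each survivor
loop4_reads = after_cond2 * 1_000_000        # every customer scanned for each survivor
result = after_cond2 * 2 // 100 * 5 // 100   # 2% South Dakota, then 5% out-of-town doctors
print(loop2_reads, loop3_reads, loop4_reads, result)
# 100000000 10000000 1000000000 1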
You may have noticed that there was no "select condition 3" in the optimised
query tree, and that meant that we couldn't reduce the amount of work done
at that stage. If we had joined customers before doctors perhaps
the query would run even faster. Let's try it:
rewind(prescriptions)
while not(EOF(prescriptions))                // loop 1
{ p=read(prescriptions)
  if (p.when>=19720301 and
      p.when<=19720331)                      // select condition 1
  { rewind(medicines)
    while not(EOF(medicines))                // loop 2
    { m=read(medicines)
      if (p.medicine_id==m.id and            // join condition 2
          m.descr=='Poison')                 // select condition 2
      { rewind(customers)
        while not(EOF(customers))            // loop 3
        { c=read(customers)
          if (p.patient_id==c.id and         // join condition 3
              c.state=='South Dakota')       // select condition 3
          { rewind(doctors)
            while not(EOF(doctors))          // loop 4
            { d=read(doctors)
              if (p.doctor_id==d.id and      // join condition 4
                  d.city!=c.city)            // select condition 4
                print(c.fname, c.lname) } } } } } } }
Loop 1 will execute 10,000,000 times, but condition 1 will only be true
(assumption: 1% of prescriptions are from March 1972) 1% of the time,
so only 100,000 prescriptions survive. Therefore loop 2 will execute
100,000 x 1,000 times; each prescription only matches one medicine,
so only 100,000 combinations survive join condition 2, and (assumption:
only 1% of prescriptions are for poison) only 1% of those, which is 1,000
combinations survive select condition 2.
So loop 3 is executed 1,000
x 1,000,000 times, giving 1,000,000,000 combinations, but each prescription only matches
one customer, so only 1,000 combinations survive join condition 3. Also, only
(assumption: 2% of people live in South Dakota) 2% of those survive
select condition 3, so we are left with just 20 combinations.
So loop 4 is executed 20
x 10,000 times, giving 200,000 combinations, but each prescription
only matches one doctor, so only 20 survive join condition 4, and
(assumption: 5% of people go to
out-of-town doctors) 5% of that 20, which is just one record,
survives select condition 4.
Loop 3 is still by far the busiest part of the code. Loop 1 reads 10,000,000
prescriptions, and therefore does 10,000,000÷25 = 400,000 disc reads. Loop 2
reads 100,000,000 medicines, and therefore does 100,000,000÷9 ≈ 11,000,000 disc
reads. Loop 3 reads 1,000,000,000 customers, and therefore does 250,000,000 disc
reads. Loop 4 reads 200,000 doctors, and therefore does 50,000 disc reads. We now
have a total of about 261,000,000 disc reads, at 10mS each giving a total time of
about 2,600,000 seconds, or roughly one month again.
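The disc-read arithmetic as a small Python sketch (per-block figures and the 10mS access time as assumed above):

TA = 0.010                         # seconds per disc read
loop1 = 10_000_000 // 25           # prescriptions, 25 per block
loop2 = 100_000_000 // 9           # medicines, 9 per block
loop3 = 1_000_000_000 // 4         # customers, 4 per block
loop4 = 200_000 // 4               # doctors, 4 per block
total = loop1 + loop2 + loop3 + loop4
print(loop1, loop2, loop3, loop4)  # 400000 11111111 250000000 50000
print(total, total * TA / 86400)   # about 261,000,000 reads, about 30 days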
So reordering the joins gains us almost nothing on its own: the dominant cost is
still scanning the whole customers table once for every surviving prescription,
and without an index there is no way to avoid that. Still far too slow.
Obviously indexes are needed.
Creating and Using Indexes
Using the last version of the plan for query execution, it would obviously help
a lot if we could jump straight to the group of prescriptions that meet the
condition (when>=19720301 and when<=19720331) and just run through them
one-by-one without having to waste time on all the others. For this, an ordering
index (B-tree or ISAM) should be created on prescriptions.when.
As we run through the
selected prescriptions one-by-one, we would like to be able to jump directly
to the correct medicine without searching the whole medicine table. For this
we need an index on medicines.id. Once we have found the right medicine, no extra
work is required to check that its description is poison; no additional index
will help.
Next, we would like to be able to jump
straight to the customer whose id matches the prescription, so an index on
customers.id would help. If that customer does not live in South Dakota, the
record is abandoned; no additional index will help.
Next we want to jump
straight to the correct doctor record, so an index on doctors.id would help.
No other indexes are needed;
we can't avoid looking up the correct doctor, and once that is done, no more
table accesses are needed to check that the cities are different.
So, we want indexes on
- prescriptions.when
- medicines.id
- customers.id
- doctors.id
The only restriction is that the index on prescriptions.when must be an ordering
index, therefore not a hash table.
For simplicity, let's assume
they are all ISAM indexes.
Creating an ISAM index takes something in the order of 2×Ta×N×log2(N),
where Ta is the disc access time (10mS) and N is the table size. The prescriptions
table is the biggest, with log2(10,000,000) ≈ 24, giving time
2×10mS×10,000,000×24 = 4,800,000 seconds, which is nearly two months.
If we are really careful with programming,
we should be able to reduce that by a factor of 25 (because we can fit 25
prescription records in a block, and therefore read them at the rate of 25
for the price of 1), giving a little over two days. Thus we realise that it is a really good
idea to create indexes before we fill our tables with data. The customers index takes
a little over a day by the same reasoning, and the other tables are much shorter,
so the total index creation time would be around three and a half days.
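A rough Python sketch of that estimate (the 2×Ta×N×log2(N) cost and the factor of 25 from block packing are the figures quoted above):

import math

TA = 0.010                                    # disc access time, 10mS
N = 10_000_000                                # records in the prescriptions table

naive = 2 * TA * N * math.ceil(math.log2(N))  # one disc access per record touched
blocked = naive / 25                          # handle 25 prescriptions per block read
print(naive, naive / 86400)                   # ~4,800,000 seconds, i.e. nearly two months
print(blocked, blocked / 86400)               # ~192,000 seconds, a little over two days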
The plan of execution becomes:
seek(prescriptions, when>=19720301)          // index search 1
while not(EOF(prescriptions))                // loop
{ p=read(prescriptions)
  if (p.when>19720331) break                 // termination condition 1
  seek(medicines, id==p.medicine_id)         // index search 2
  if (not found) continue
  m=read(medicines)
  if (m.descr!='Poison') continue            // select condition 2
  seek(customers, id==p.patient_id)          // index search 3
  if (not found) continue
  c=read(customers)
  if (c.state!='South Dakota') continue      // select condition 3
  seek(doctors, id==p.doctor_id)             // index search 4
  if (not found) continue
  d=read(doctors)
  if (d.city==c.city) continue               // select condition 4
  print(c.fname, c.lname) }
(Recall that "continue" means go back to the top of the current loop;
we can only get away with just having a single loop because the
id fields are all clearly supposed to be unique keys, which means we
can guarantee that only one record will match any given value, so there
is no need for a nested loop to walk through all the matches).
Finding something in an ISAM index takes time 2×Ta×log2(N).
For index search 1, N=10,000,000, so we have about 48×Ta. We have already
made the assumption that 1% of prescriptions meet the date requirement,
so the loop executes 100,000 times, each time taking Ta to read one
prescription record.
Inside
that loop, we have an index search
on medicines, N=1000 so time=20×Ta executed 100,000 times. Only 1%
or 1,000 records survive select condition 2.
Next we have an
index search on customers, N=1,000,000 so time=40×Ta executed 1,000
times. Only 2% or 20 records survive select condition 3.
Next we have an index
search on doctors, N=10,000 so time=28×Ta executed 20 times.
The total time for all these operations is Ta × (48 + 100,000
+ 100,000×20 + 1,000×40 + 20×28), which comes to about
2,100,000×10mS, or 21,000 seconds, or nearly six hours.
If we had used hash tables for each of the indexes (except of course
prescriptions.when), the average index search time would be something
in the region of 2.5×Ta instead of 2×Ta×log2(N), which
would reduce the total time to about one hour.
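Putting those per-search costs together, as a Python sketch (the log2 factors are rounded up as above, and the 2.5×Ta hash figure is the rough estimate just quoted):

import math

TA = 0.010                                    # disc access time, 10mS

def search(n):                                # ISAM search cost, in units of Ta
    return 2 * math.ceil(math.log2(n))

isam = (search(10_000_000)                    # index search 1 on prescriptions.when
        + 100_000                             # read the 100,000 surviving prescriptions
        + 100_000 * search(1_000)             # index search 2, once per prescription
        + 1_000 * search(1_000_000)           # index search 3, once per poison row
        + 20 * search(10_000))                # index search 4, once per SD customer
print(isam, isam * TA / 3600)                 # ~2,140,000 Ta, i.e. nearly six hours

hashed = 100_000 + (100_000 + 1_000 + 20) * 2.5   # hash lookups average about 2.5 Ta
print(hashed, hashed * TA / 3600)             # ~353,000 Ta, i.e. about one hour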
The total time to create the indexes and then use them in a query (around four
days) is far less than the month or so it would take to answer the query without
them, and once the indexes have been created, they can be used over and over again
at no extra cost. Even a single query of this sort would justify the time it takes
to create the indexes.
We might like to note that in the real world, we would probably not
keep all the records for as far back as 1972 in the current, on-line version
of the database. There would probably be a collection of much smaller archive files
for dates in the distant past, so such searches would be a lot faster. Also, it
is very likely that records are added to a table in more-or-less chronological
order, so the prescriptions file might already be ordered by date; this would
mean that creating that first index could be much faster.