Query Cost:--
The cost of query evaluation can be measured in terms of a number of different resources such as
Disk accesses
CPU time to execute a query
the cost of communication (in distributed systems)
The response time for a query evaluation plan assuming that no other activity is going on the computer while the query is being executed would account for all these costs.
Since CPU speeds are much faster than disk access speeds and also estimation of CPU time is relatively hard, disk access cost is considered a reasonable measure of the cost of query evaluation plan.
A measure of the number of block transfers would estimate
The number of seek operations performed
the number of blocks read
the number of blocks written
The costs of all the algorithms depend on the size of the buffer in main memory.
In the best case, all data can be read into buffers, and disk does not need to be accessed again.
In the worst case, we assume that the buffer can hold only a few blocks of data, approximately one block per relation.
Join Operation
Assume the following information about two relations depositor and customer
Number of records of customer, ncustomer=10000
Number of blocks of customer, bcustomer=400
Number of records of depositor, ndepositor=5000
Number of blocks of depositor, bdepositor=100
Nested Loop Join
for each tuple tr in r do begin
for each tuple ts in s do begin
test pair (tr , ts) to see if they satisfy the join condition θ
if they do, add tr.ts to the result
end
end
The nested loop join algorithm is expensive, since it examines every pair of tuples in the two relations.
The number of pairs of tuples to be considered is nr * ns, where nr denotes the no. of tuples in r, and ns is the no. of tuples in relation s.
For each record in r we have to perform a complete scan on s.
In the worst case, the buffer can hold only one block of each relation, and a total of nr * bs + br block accesses would be required, where br and bs denote the number of blocks containing tuples of r and s
In the best case, there is enough space for both relations to fit simultaneously in memory, so each block would have to read only once; hence, only br + bs block accesses would be required.
Considering the natural join of depositor and customer
Join Strategies
1. We've seen how to estimate the size of a join. Now we look at estimating the cost of processing a join.
2. Several factors influence the selection of an optimal strategy:
o Physical order of tuples in a relation.
o Presence of indices and type of index (clustering or not).
o Cost of computing a temporary index for the sole purpose of processing one query.
3. We'll look at computing the expression deposit customer assuming no indices exist. We also let
(number of deposit tuples)
(number of customer tuples)
Simple Iteration
1. If we don't create an index, we must examine every pair of tuples in deposit and in customer. This means examining 10,000 * 200 = 2,000,000 pairs.
2. If we execute this query cleverly, we can cut down the number of block accesses. We use the following method:
for each tuple deposit do begin
for each tuple customer do begin
examine pair (d, c) to see if a tuple should be added to the result
end
end
3.
o We read each tuple of deposit once.
o This could require 10,000 block accesses.
o The total number of block access, if the tuples are not stored together physically, would be 10,000 + 10,000 * 200 = 2,010,000.
o If we put customer in the outer loop, we get 2,000,200 accesses.
o If the tuples of deposit are stored together physically, fewer accesses are required (at 20 per block, 10,000/20 = 500 block accesses).
o We read each tuple of customer once for each tuple of deposit.
o This suggests we read each tuple of customer 10,000 times, giving as many as 2,000,000 accesses to read customer tuples!
o This would give a total of 2,000,500 accesses.
o We can reduce accesses significantly if we store customer tuples together physically.
o At 20 tuples per block, only 10 accesses are required to read the entire relation (as opposed to 200).
o Then we only need 10 * 10,000 = 100,000 block accesses for customer.
o This gives a total of 100,500.
4. if we use customer in the outer loop,
o Now we reference each tuple of deposit once for each tuple of customer.
o If deposit tuples are stored together physically, then since 20 tuples fit on one block, accesses are needed to read the entire relation.
o Since customer has 200 tuples, we read the deposit relation 200 times.
o WRONG! 200 * 500 is 100,000.
o Total cost is then 100,000 for inner loop plus 10 accesses to read the customer relation once for a total of 100,010.
o Compared to previous estimate of 100,500, the savings are small (490).
5. Note that we are considering worst-case number of block reads, where every time a block is needed it is not in the buffer.
Good buffer management can reduce this considerably.
Block-Oriented Iteration
1. If we process tuples on a per-block basis we can save many accesses.
The idea is that, if both relations have tuples stored together physically, we can examine all the tuple pairs for a block of each relation at one time. We still need to read all the tuples of one relation for a block of the other relation.
The block method algorithm is:
for each block Bd of deposit do begin
for each block Bc of customer do begin
for each tuple d in Bd do begin
for each tuple c in Bc do begin
test pair (d, c) to see if a tuple should be added to t the result
end
end
end
end
o Instead of reading the entire customer relation for each tuple of deposit, we read the entire customer relation once for each block of deposit.
o Since there are 500 blocks of deposit tuples and 10 blocks of customer tuples, reading customer once for each block of deposit requires 10 * 500 = 5000 accesses to customer blocks.
o Total cost is then 5000 + 500 (for accesses to deposit blocks) = 5500.
o This is obviously a significant improvement over the non-block method, which required roughly 100,000 or 2,000,000 accesses.
o Choice of customer for the inner loop is arbitrary, but does provide a potential advantage.
o Being the smaller relation, it may be possible to keep it all in main memory.
o If this was the case, we would only require 500 blocks to read deposit plus 10 blocks to read customer into main memory, for a total of 510 block accesses.
No comments:
Post a Comment