Wednesday, September 1, 2010

Estimation of Query-Processing Cost


1. To choose a strategy based on reliable information, the database system may store statistics for each relation r:
o n_r - the number of tuples in r.
o s_r - the size in bytes of a tuple of r (for fixed-length records).
o V(A, r) - the number of distinct values that appear in relation r for attribute A.
2. The first two quantities allow us to estimate accurately the size of a Cartesian product.
o The Cartesian product r x s contains n_r * n_s tuples.
o Each tuple of r x s occupies s_r + s_s bytes.
o The third statistic is used to estimate how many tuples satisfy a selection predicate of the form
A = a
o We need to know how often each value appears in a column.
o If we assume each value appears with equal probability, then the selection sigma_{A=a}(r) is estimated to have n_r / V(A, r) tuples.
o This may not be the case, but it is a good approximation of reality in many relations.
o We assume such a uniform distribution for the rest of this chapter.
o Estimation of the size of a natural join is more difficult.
o Let r1 and r2 be relations on schemes R1 and R2.
o If R1 ∩ R2 = ∅ (no common attributes), then r1 ⋈ r2 is the same as r1 x r2, and we can estimate the size of this accurately.
o If R1 ∩ R2 is a key for R1, then we know that a tuple of r2 will join with exactly one tuple of r1.
o Thus the number of tuples in r1 ⋈ r2 will be no greater than n_r2.
o If R1 ∩ R2 is not a key for R1 or R2, things are more difficult.
o We use the third statistic and the assumption of uniform distribution.
o Assume R1 ∩ R2 = {A}.
o We assume there are n_r1 / V(A, r1) tuples in r1 with an A value of t[A], for a given tuple t in r2.
o So tuple t of r2 produces n_r1 / V(A, r1) tuples in r1 ⋈ r2.
3. Considering all the tuples in r2, we estimate that there are n_r1 * n_r2 / V(A, r1) tuples in total in r1 ⋈ r2.
4. If we reverse the roles of r1 and r2 in this equation, we get a different estimate, n_r1 * n_r2 / V(A, r2), if V(A, r1) ≠ V(A, r2).
o If this occurs, there are likely to be some dangling tuples that do not participate in the join.
o Thus the lower estimate is probably the better one.
o This estimate may still be high if the A values in r1 have few values in common with the A values in r2.
o However, it is unlikely that the estimate is far off, as dangling tuples are likely to be a small fraction of the tuples in a real-world relation.
5. To maintain accurate statistics, it is necessary to update the statistics whenever a relation is modified.
This overhead can be substantial, so most systems update statistics only during periods of light load on the system.
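The size-estimation formulas above can be sketched in a few lines of code. This is only an illustration of the arithmetic under the uniform-distribution assumption; the function names and sample figures are mine, not from the text:

```python
# Size-estimation formulas under the uniform-distribution assumption.
# n_r = number of tuples in r, V(A, r) = distinct values of A in r.

def cartesian_product_size(n_r, n_s):
    """r x s contains n_r * n_s tuples."""
    return n_r * n_s

def selection_size(n_r, v_a_r):
    """sigma_{A=a}(r) is estimated to have n_r / V(A, r) tuples."""
    return n_r / v_a_r

def join_size(n_r1, n_r2, v_a_r1, v_a_r2):
    """Natural join on common attribute A: take the lower of the two
    estimates, since dangling tuples make the higher one too large."""
    return min(n_r1 * n_r2 / v_a_r1, n_r1 * n_r2 / v_a_r2)

# Example with the deposit statistics used later in these notes:
print(selection_size(10_000, 50))   # 200 tuples expected per branch name
```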











Estimation of Access Costs Using Indices
1. So far, we haven't considered the effects of indices and hash functions on the cost of evaluating an expression.
o Indices and hash functions allow fast access to records containing a specific value on the index key.
o Indices (but not most hash functions) also allow the records of a file to be read in sorted order.
o It is efficient to read records of a file in an order corresponding closely to physical order.
o If an index allows this, we call the index a clustering index.
o Such indices allow us to take advantage of the physical clustering of records into blocks.
o The text doesn't distinguish clearly between a clustering index and a primary index. [ELNA89] defines a primary index as one on the primary key of a file sorted on that key, and a clustering index as one on non-primary-key attribute(s) on which the file is sorted.
o With this definition, for a primary index, there is only one tuple per search key value, while for a clustering index there may be many tuples.
o In both cases, only one pointer is needed per search key value. (Why?)
2. The detailed strategy for processing a query is called the access plan. This includes not only the relational operations to be performed, but also the indices to be used, the order in which tuples are to be accessed, and the order in which operations are to be performed.
3. The use of indices imposes some overhead (accesses to blocks containing the index), which we must also take into account when computing the cost of a strategy.
4. We'll look at the query
select account#

from deposit

where bname = ``Perryridge''

and cname = ``Williams''

and balance > 1000

5. We assume the following statistical information is kept about the deposit relation:
o 20 tuples of deposit fit in one block.
o V(bname, deposit) = 50.
o V(cname, deposit) = 200.
o V(balance, deposit) = 5000.
o n_deposit = 10,000 (number of tuples).
6. We also assume the following indices exist on deposit:
o A clustering B+-tree index for bname.
o A nonclustering B+-tree index for cname.
7. We also still assume values are distributed uniformly.
o As V(bname, deposit) = 50, we expect 10,000/50 = 200 tuples of the deposit relation apply to Perryridge branch.
o If we use the index on bname, we will need to read these 200 tuples and check each one for satisfaction of the rest of the where clause.
o Since the index is a clustering index,
200/20 = 10 block reads are required.
o Also several index blocks must be read.
o Assume the B+-tree stores 20 pointers per node.
o Then the B+-tree must have between 3 and 5 leaf nodes (to store the 50 different values of bname).
o So the entire tree has a depth of 2, and at most 2 index blocks must be read.
So the above strategy requires 12 block reads.
8. Note: another way of calculating the number of levels in a B+-tree is to remember that the height is no greater than
ceil( log_ceil(n/2) (K) )
where there are K search key values in the relation, and n is the number of pointers in a node.
9. You can use the change-of-base formula to calculate this value using a log function of base x on your calculator:
log_b(K) = log_x(K) / log_x(b)
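As a quick check of the height bound in items 8 and 9 (a sketch; the function name is mine), using the V(bname, deposit) = 50 and V(cname, deposit) = 200 statistics from item 5:

```python
# Height bound for a B+-tree: height <= ceil(log_{ceil(n/2)} K),
# computed via the change-of-base formula log_b K = ln K / ln b.
import math

def btree_height_bound(num_keys, fanout):
    base = math.ceil(fanout / 2)   # minimum number of pointers per node
    return math.ceil(math.log(num_keys) / math.log(base))

print(btree_height_bound(50, 20))    # bname index: bound is 2
print(btree_height_bound(200, 20))   # cname index: bound is 3 (cf. item 10)
```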
10. If we use the index for cname, we estimate the number of block accesses as follows:
o Since V(cname, deposit)=200, we expect that 10,000/200 = 50 tuples pertain to Williams.
o However, as the index on cname is nonclustering, we can expect that 50 block reads will be required, plus some for the index (as before).
o Assume that 20 pointers fit into one node of a B+-tree index.
o As there are 200 customer names, the tree has between 11 and 20 leaf nodes.
o So the index has a depth of 2 (says the text), and 2 block accesses are required to read the index blocks. (Actually, depth could be 3 -- can you see how?)
o This strategy requires a total of 52 block accesses.
o So we conclude that it is better to use the index on bname.
11. If both indices were non-clustering, which one would we choose?
o We only expect 50 tuples for cname = ``Williams'', versus 200 tuples with bname =``Perryridge''.
o So without clustering, we would choose the cname index as it would require reading and inspecting fewer tuples.
12. Another interesting method is to look at pointers first:
o Use the index for cname to retrieve pointers to records with cname = ``Williams'', rather than the records themselves.
o Let P1 denote this set of pointers.
o Similarly, use the index on bname to obtain P2, the set of pointers to records with bname = ``Perryridge''.
o Then P1 ∩ P2 is the set of pointers to records with bname = ``Perryridge'' and cname = ``Williams''.
o Only these records need to be retrieved and tested to see if balance > 1000.
o The cost is 4 block reads for the two indices, plus one block read for each record whose pointer is in P1 ∩ P2.
o This last quantity can be estimated from our statistics.
o As V(bname, deposit) = 50 and V(cname, deposit) = 200, we can expect one tuple in 50 * 200, or 1 in 10,000, to have both values we are looking for.
o This means that P1 ∩ P2 is estimated to contain only one pointer.
o So we only need to read 1 block, and the total cost is 5 block reads.
13. We didn't use the balance attribute as a starting point because there is no index for balance and also the predicate involves a ``greater than'' comparison (> 1000).
14. Generally, equality predicates are more selective than ``greater than'' predicates, as they return fewer tuples.
15. Estimation of access cost using indices allows us to estimate the complete cost, in terms of block accesses, of a plan. It is often worthwhile for a large number of strategies to be evaluated down to the access plan level before a choice is made.
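The block-read arithmetic for the two single-index strategies above can be reproduced with a small sketch. The constants are the statistics from item 5; the function names are illustrative:

```python
# Block-access estimates for one equality selection on deposit.
import math

N = 10_000              # tuples in deposit
TUPLES_PER_BLOCK = 20
INDEX_DEPTH = 2         # index block reads per B+-tree lookup

def clustering_cost(distinct_values):
    """Matching tuples are packed together, so count blocks, not tuples."""
    matching = N // distinct_values
    return math.ceil(matching / TUPLES_PER_BLOCK) + INDEX_DEPTH

def nonclustering_cost(distinct_values):
    """Worst case: one block read per matching tuple."""
    matching = N // distinct_values
    return matching + INDEX_DEPTH

print(clustering_cost(50))      # bname index: 12 block reads
print(nonclustering_cost(200))  # cname index: 52 block reads
```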



Join Strategies
1. We've seen how to estimate the size of a join. Now we look at estimating the cost of processing a join.
2. Several factors influence the selection of an optimal strategy:
o Physical order of tuples in a relation.
o Presence of indices and type of index (clustering or not).
o Cost of computing a temporary index for the sole purpose of processing one query.
3. We'll look at computing the expression deposit ⋈ customer, assuming no indices exist. We also let
n_deposit = 10,000 (number of deposit tuples)
n_customer = 200 (number of customer tuples)










Simple Iteration
1. If we don't create an index, we must examine every pair of tuples in deposit and in customer. This means examining 10,000 * 200 = 2,000,000 pairs.
2. If we execute this query cleverly, we can cut down the number of block accesses. We use the following method:
for each tuple d in deposit do begin
for each tuple c in customer do begin
examine pair (d, c) to see if a tuple should be added to the result
end
end

3.
o We read each tuple of deposit once.
o This could require 10,000 block accesses.
o The total number of block accesses, if the tuples are not stored together physically, would be 10,000 + 10,000 * 200 = 2,010,000.
o If we put customer in the outer loop, we get 2,000,200 accesses.
o If the tuples of deposit are stored together physically, fewer accesses are required (at 20 per block, 10,000/20 = 500 block accesses).
o We read each tuple of customer once for each tuple of deposit.
o This suggests we read each tuple of customer 10,000 times, giving as many as 2,000,000 accesses to read customer tuples!
o This would give a total of 2,000,500 accesses.
o We can reduce accesses significantly if we store customer tuples together physically.
o At 20 tuples per block, only 10 accesses are required to read the entire relation (as opposed to 200).
o Then we only need 10 * 10,000 = 100,000 block accesses for customer.
o This gives a total of 100,500.
4. If we use customer in the outer loop:
o Now we reference each tuple of deposit once for each tuple of customer.
o If deposit tuples are stored together physically, then since 20 tuples fit on one block, 500 accesses are needed to read the entire relation.
o Since customer has 200 tuples, we read the deposit relation 200 times, for 200 * 500 = 100,000 accesses.
o Total cost is then 100,000 for the inner loop plus 10 accesses to read the customer relation once, for a total of 100,010.
o Compared to the previous estimate of 100,500, the savings are small (490).

5. Note that we are considering worst-case number of block reads, where every time a block is needed it is not in the buffer.
Good buffer management can reduce this considerably.
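The worst-case counts worked out above can be captured in one formula: read the outer relation once (tuple by tuple, or block by block if clustered), and read the entire inner relation once per outer tuple. A sketch, with illustrative names and buffering ignored:

```python
# Worst-case block accesses for tuple-at-a-time nested-loop join.
import math

def nested_loop_cost(n_outer, n_inner, tuples_per_block=20,
                     outer_clustered=False, inner_clustered=False):
    # Blocks read for one full pass over a relation.
    outer_reads = (math.ceil(n_outer / tuples_per_block)
                   if outer_clustered else n_outer)
    inner_per_pass = (math.ceil(n_inner / tuples_per_block)
                      if inner_clustered else n_inner)
    # Outer relation read once; inner relation read once per outer tuple.
    return outer_reads + n_outer * inner_per_pass

print(nested_loop_cost(10_000, 200))                        # 2,010,000
print(nested_loop_cost(10_000, 200,
                       outer_clustered=True,
                       inner_clustered=True))               # 100,500
print(nested_loop_cost(200, 10_000,
                       outer_clustered=True,
                       inner_clustered=True))               # 100,010
```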


Block-Oriented Iteration
1. If we process tuples on a per-block basis we can save many accesses.
The idea is that, if both relations have tuples stored together physically, we can examine all the tuple pairs for a block of each relation at one time. We still need to read all the tuples of one relation for a block of the other relation.
The block method algorithm is:
for each block Bd of deposit do begin
for each block Bc of customer do begin
for each tuple d in Bd do begin
for each tuple c in Bc do begin
test pair (d, c) to see if a tuple should be added to the result
end
end
end
end

o Instead of reading the entire customer relation for each tuple of deposit, we read the entire customer relation once for each block of deposit.
o Since there are 500 blocks of deposit tuples and 10 blocks of customer tuples, reading customer once for each block of deposit requires 10 * 500 = 5000 accesses to customer blocks.
o Total cost is then 5000 + 500 (for accesses to deposit blocks) = 5500.
o This is obviously a significant improvement over the non-block method, which required roughly 100,000 or 2,000,000 accesses.
o Choice of customer for the inner loop is arbitrary, but does provide a potential advantage.
o Being the smaller relation, it may be possible to keep it all in main memory.
o If this was the case, we would only require 500 blocks to read deposit plus 10 blocks to read customer into main memory, for a total of 510 block accesses.
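The block-oriented count above follows the same pattern, but per block rather than per tuple. A sketch of the arithmetic (names are illustrative):

```python
# Block accesses for block-at-a-time nested-loop join: read each block
# of the outer relation once, and the whole inner relation once per
# outer block (both relations assumed physically clustered).
import math

def block_nested_loop_cost(n_outer, n_inner, tuples_per_block=20):
    b_outer = math.ceil(n_outer / tuples_per_block)
    b_inner = math.ceil(n_inner / tuples_per_block)
    return b_outer + b_outer * b_inner

print(block_nested_loop_cost(10_000, 200))   # 500 + 500 * 10 = 5,500
```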








Merge-Join
1. Suppose neither relation fits in main memory, and both are stored in sorted order on the join attributes. (E.g. both deposit and customer sorted by cname.)
2. We can then perform a merge-join, computed like this:
o Associate one pointer with each relation.
o Initially these pointers point to the first record in each relation.
o As the algorithm proceeds, the pointers move through the relations.
o A group of tuples in one relation with the same value on the join attributes is read.
o Then the corresponding tuples (if any) of the other relation are read.
o Since the relations are in sorted order, tuples with same value on the join attributes are in consecutive order. This allows us to read each tuple only once.
3. In the case where tuples of the relations are stored together physically in their sorted order, this algorithm allows us to compute the join by reading each block exactly once.
4.
o For deposit ⋈ customer, this is a total of 510 block accesses.
o This is as good as the block-oriented method with the inner loop relation fitting into main memory.
o The disadvantage is that both relations must be sorted physically.
o It may be worthwhile to do this sort to allow a merge-join.
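A minimal merge-join sketch matching the description above, assuming both inputs arrive already sorted on the join attribute. The relation contents here are made up for illustration:

```python
# Merge-join: advance a cursor through each sorted input; when the join
# values match, pair every left tuple in the group with every right
# tuple in the group. Each input tuple is read only once.
def merge_join(left, right, key):
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            v = left[i][key]
            # Collect the group of right tuples sharing this key value.
            group = []
            while j < len(right) and right[j][key] == v:
                group.append(right[j])
                j += 1
            # Pair each matching left tuple with the whole group.
            while i < len(left) and left[i][key] == v:
                for r in group:
                    result.append({**left[i], **r})
                i += 1
    return result

deposit  = [{"cname": "Hayes", "account": 101},
            {"cname": "Jones", "account": 102}]
customer = [{"cname": "Hayes", "city": "Harrison"},
            {"cname": "Smith", "city": "Rye"}]
print(merge_join(deposit, customer, "cname"))
```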


Use of an Index
1. Frequently the join attributes form a search key for an index on one of the relations being joined.
o In such cases we can consider a join strategy that makes use of such an index.
o Our first join algorithm is more efficient if an index exists on customer (inner loop relation) for cname.
o Then for a tuple d in deposit, we no longer have to read the entire customer relation.
o Instead, we use the index to find tuples in customer with the same value on cname as tuple d of deposit.
o We saw that without an index, we could take as many as 2,010,000 block accesses.
o Using the index, and making no assumptions about physical storage, the join can be performed more efficiently.
o We still need 10,000 accesses to read deposit (outer loop relation).
o However, for each tuple of deposit, we only need an index lookup on the customer relation.
o As n_customer = 200 and we assumed that 20 pointers fit in one block, this lookup requires at most 2 index block accesses (the depth of the B+-tree) plus one block access for the customer tuple itself.
o This means 3 block accesses instead of 200, for each of the 10,000 deposit tuples.
o This gives 40,000 total, which appears high, but is still better than 2,010,000.
o The more efficient strategies seen earlier required tuples to be stored together physically.
o So if tuples are not stored together physically, this index-based strategy is highly desirable.
o The savings (1,970,000 accesses) is enough to justify creation of the index, even if it is used only once.
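The 40,000-access figure above comes from one read per outer tuple plus (index depth + 1) block reads per lookup. A one-line sketch of that arithmetic (the function name is mine):

```python
# Indexed nested-loop join cost: read each outer tuple once, then do one
# index lookup (index_depth blocks) plus one tuple read on the inner side.
def indexed_join_cost(n_outer, index_depth=2):
    return n_outer + n_outer * (index_depth + 1)

print(indexed_join_cost(10_000))   # 10,000 + 10,000 * 3 = 40,000
```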


Hash Join
1. Sometimes it may be useful to construct a ``use once only'' hash structure to assist in the computation of a single join.
o We use a hash function h to hash tuples of both relations on the basis of join attributes.
o The resulting buckets, pointing to tuples in the relations, limit the number of pairs of tuples that must be compared.
o If d is a tuple in deposit and c is a tuple in customer, then d and c must be compared only if h(d) = h(c).
o The comparison must still be done as it is possible that d and c have different customer names that hash to the same value.
o As before, we need our hash function to hash randomly and uniformly.
2. We will now estimate the cost of a hash-join.
o Assume that h is a hash function mapping cname values to the integers 0 through max.
o Let Hc_0, Hc_1, ..., Hc_max denote buckets of pointers to customer tuples, each initially empty.
o Let Hd_0, Hd_1, ..., Hd_max denote buckets of pointers to deposit tuples, each initially empty.
o The first two loops assign pointers to the hash buckets, requiring a complete reading of both relations.
o This requires 510 block accesses if we assume that both relations are stored together physically (i.e. not clustered with other relations).
o As buckets contain only pointers, we assume they fit in main memory, so no disk accesses are required to read the buckets.
o Final loop of the algorithm iterates over the range of hash function h.
o Assume i is a particular value in the range of h.
o The sets rd and rc are assembled from the pointers, where rd is the set of deposit tuples that hash to bucket i, and rc is the set of customer tuples that hash to bucket i.
o Then rd ⋈ rc is computed.
o This join is done using simple iteration, since we expect rd and rc to be small enough to fit in main memory.
o Since a tuple hashes to exactly one bucket, each tuple is read only once by the final outer loop.
o If there are 10,000 deposit tuples and 200 customer tuples, then reading every tuple once could take 10,200 accesses, worst-case.
o Total is then 510 block accesses to create the buckets, plus 10,200 to assemble the buckets' records, giving 10,710.
3. If the query optimizer chooses to do a hash-join, the hash function must be chosen so that
o The range is large enough to ensure that buckets have a small number of pointers, so that rd and rc fit in main memory.
o The range is not so large that many buckets are empty and the algorithm processes many empty buckets.
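A minimal in-memory hash-join sketch following the description above: bucket one relation on the join attribute, then probe with the other, comparing values to guard against hash collisions. Relation contents and names are illustrative:

```python
# Hash join: tuples d and c need comparing only if h(d) = h(c).
def hash_join(deposit, customer, key, num_buckets=8):
    # Build phase: bucket the (smaller) customer relation.
    buckets = [[] for _ in range(num_buckets)]
    for c in customer:
        buckets[hash(c[key]) % num_buckets].append(c)
    # Probe phase: each deposit tuple only visits one bucket.
    result = []
    for d in deposit:
        for c in buckets[hash(d[key]) % num_buckets]:
            if d[key] == c[key]:   # different values can share a bucket
                result.append({**d, **c})
    return result

deposit  = [{"cname": "Hayes", "account": 101}]
customer = [{"cname": "Hayes", "city": "Harrison"}]
print(hash_join(deposit, customer, "cname"))
```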


Three-Way Join
1. We now consider the strategies for computing a 3-way join:
branch ⋈ deposit ⋈ customer
2. We'll assume that
o n_branch = 50
o n_deposit = 10,000
o n_customer = 200
3. Strategy 1:
o Compute deposit ⋈ customer using one of the previous methods.
o As cname is a key for customer, the result of this join has at most 10,000 tuples (the number in deposit).
o If we then build an index on branch for bname, we can compute
branch ⋈ (deposit ⋈ customer)
by considering each tuple t of (deposit ⋈ customer) and looking up tuples in branch with the same bname value using the index.
4. Strategy 2:
o Compute the join without any indices.
o This requires checking 50 * 10,000 * 200 = 100,000,000 possibilities.
o Not a good idea...
5. Strategy 3:
o Perform the pair of joins at once.
o Construct two indices:
 One on branch for bname.
 One on customer for cname.
o Then consider each tuple t in deposit.
o For each t, look up corresponding tuples in customer and in branch with equal values on respective join attributes.
o Thus we examine each tuple of deposit exactly once.
Using strategy 3 it is often possible to perform a three-way join more efficiently than by using two two-way joins.
6. It is hard to calculate exact costs for 3-way joins. Costs depend on how the relations are stored, distribution of values and presence of indices.
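Even without exact costs, a back-of-the-envelope comparison of strategies 2 and 3 using the assumed sizes shows why strategy 3 wins (illustrative arithmetic only):

```python
# Assumed relation sizes from item 2 of this section.
n_branch, n_deposit, n_customer = 50, 10_000, 200

# Strategy 2: no indices, so every (branch, deposit, customer) triple is tested.
triples_checked = n_branch * n_deposit * n_customer
print(triples_checked)     # 100,000,000 combinations

# Strategy 3: each deposit tuple is examined once, with one index lookup
# into customer and one into branch per tuple.
index_lookups = 2 * n_deposit
print(n_deposit, index_lookups)
```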


Structure of the Query Optimizer
1. There are many query-processing strategies used in database systems.
2. Most systems only implement a few strategies.
3. Some systems make a heuristic guess of a good strategy, in order to minimize the number of strategies to be considered.
4. Then the optimizer considers every possible strategy, but abandons a strategy as soon as it determines that its cost is greater than that of the best strategy previously considered.
5. To simplify the strategy selection task, a query may be split into several sub-queries.
6. This simplifies strategy selection and permits recognition of common sub-queries (no need to compute them twice).
7. Examination of a query for common subqueries and the estimation of the cost of a large number of strategies impose a substantial overhead on query processing.
8. However, this is usually more than offset by savings at query execution time.
9. Therefore, most commercial systems include relatively sophisticated optimizers.
