Wednesday, September 1, 2010

Query Interpretation


1. Why do we need to optimize?
o A high-level relational query is generally non-procedural in nature.
o It says ``what'', rather than ``how'' to find it.
o When a query is presented to the system, it is useful to find an efficient method of finding the answer, using the existing database structure.
o Usually worthwhile for the system to spend some time on strategy selection.
o Typically can be done using information in main memory, with little or no disk access.
o Execution of the query will require disk accesses.
o Transfer of data from disk is slow, relative to the speed of main memory and the CPU.
o It is advantageous to spend a considerable amount of processing to save disk accesses.
2. Do we really optimize?
o Optimizing means finding the best of all possible methods.
o The term ``optimization'' is a bit of a misnomer here.
o Usually the system does not calculate the cost of all possible strategies.
o Perhaps ``query improvement'' is a better term.
3. Two main approaches:
1. Rewriting the query in a more effective manner.
2. Estimating the cost of various execution strategies for the query.
Usually both approaches are combined.


o The difference in execution time between a good strategy and a bad one may be huge.
o Thus this is an important issue in any DB system.
o As a relational query can be expressed entirely in a relational query language without the use of a host language, it is possible to optimize queries automatically.
o SQL is suitable for human use, but internally a query should be represented in a more useful form, like the relational algebra.
4. So, first the system must translate the query into its internal form. Then optimization begins:
o Find an equivalent expression that is more efficient to execute.
o Select a detailed strategy for processing the query. (Choose specific indices to use, and order in which tuples are to be processed, etc.)
5. Final choice of a strategy is based primarily on the number of disk accesses required.

Equivalence of Expressions
1. The first step in selecting a query-processing strategy is to find a relational algebra expression that is equivalent to the given query and is efficient to execute.
2. We'll use the following relations as examples:
Customer-scheme = (cname, street, ccity)
Deposit-scheme = (bname, account#, cname, balance)
Branch-scheme = (bname, assets, bcity)
We will use instances customer, deposit, and branch of these schemes.

Selection Operation
1. Consider the query to find the assets and branch-names of all branches that have depositors living in Port Chester. In relational algebra, this is

Π_{bname, assets}(σ_{ccity = ``Port Chester''}(customer ⋈ deposit ⋈ branch))

This expression constructs a huge relation,

customer ⋈ deposit ⋈ branch

of which we are only interested in a few tuples.
We also are only interested in two attributes of this relation.
We can see that we only want tuples for which ccity = ``Port Chester''.
Thus we can rewrite our query as:

Π_{bname, assets}(σ_{ccity = ``Port Chester''}(customer) ⋈ deposit ⋈ branch)

This should considerably reduce the size of the intermediate relation.
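The effect of performing the selection early can be sketched in Python. The relations, operator helpers, and data values below are hypothetical stand-ins for the algebra, not any real DBMS internals:

```python
# Toy relations as lists of dicts (hypothetical data).
customer = [
    {"cname": "Hayes", "street": "Main", "ccity": "Port Chester"},
    {"cname": "Jones", "street": "North", "ccity": "Harrison"},
]
deposit = [
    {"bname": "Downtown", "account#": 101, "cname": "Hayes", "balance": 500},
    {"bname": "Uptown",   "account#": 102, "cname": "Jones", "balance": 900},
]
branch = [
    {"bname": "Downtown", "assets": 9000000, "bcity": "Brooklyn"},
    {"bname": "Uptown",   "assets": 7000000, "bcity": "Queens"},
]

def natural_join(r, s):
    # Join on all attributes the two relations share.
    common = set(r[0]) & set(s[0]) if r and s else set()
    return [{**t1, **t2} for t1 in r for t2 in s
            if all(t1[a] == t2[a] for a in common)]

def select(pred, r):
    return [t for t in r if pred(t)]

def project(attrs, r):
    rows = dict.fromkeys(tuple(t[a] for a in attrs) for t in r)  # dedup, keep order
    return [dict(zip(attrs, row)) for row in rows]

# Naive plan: join everything, then select.
big = natural_join(natural_join(customer, deposit), branch)
naive = project(["bname", "assets"],
                select(lambda t: t["ccity"] == "Port Chester", big))

# Pushed-down plan: select on customer first, then join.
small = natural_join(natural_join(
    select(lambda t: t["ccity"] == "Port Chester", customer), deposit), branch)
pushed = project(["bname", "assets"], small)

assert naive == pushed          # same answer
assert len(small) <= len(big)   # smaller intermediate relation
```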
2. Suggested Rule for Optimization:
o Perform select operations as early as possible.
o If our original query was restricted further to customers with a balance over $1000, the selection cannot be done directly to the customer relation above.
o The new relational algebra query is

Π_{bname, assets}(σ_{ccity = ``Port Chester'' ∧ balance > 1000}(customer ⋈ deposit ⋈ branch))

o The selection cannot be applied to customer, as balance is an attribute of deposit.
o We can still rewrite as

Π_{bname, assets}(σ_{ccity = ``Port Chester'' ∧ balance > 1000}(customer ⋈ deposit) ⋈ branch)

o If we look further at the subquery (the selection over customer ⋈ deposit above), we can split the selection predicate in two:

σ_{ccity = ``Port Chester''}(σ_{balance > 1000}(customer ⋈ deposit))

o This rewriting gives us a chance to use our ``perform selections early'' rule again.
o We can now rewrite our subquery as:

σ_{ccity = ``Port Chester''}(customer) ⋈ σ_{balance > 1000}(deposit)

3. Second Transformational Rule:
o Replace expressions of the form σ_{P1 ∧ P2}(e) by σ_{P1}(σ_{P2}(e)), where P1 and P2 are predicates and e is a relational algebra expression.
o Generally,

σ_{P1 ∧ P2}(e) = σ_{P1}(σ_{P2}(e)) = σ_{P2}(σ_{P1}(e))

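The selection-splitting rule can be spot-checked on a toy relation in Python (hypothetical data; σ is implemented as a plain filter):

```python
# Check that a conjunctive selection equals a cascade of selections,
# applied in either order (toy relation, hypothetical data).
deposit = [
    {"bname": "Downtown", "cname": "Hayes", "balance": 1500},
    {"bname": "Uptown",   "cname": "Jones", "balance": 700},
    {"bname": "Downtown", "cname": "Smith", "balance": 2000},
]

def select(pred, r):
    return [t for t in r if pred(t)]

p1 = lambda t: t["balance"] > 1000
p2 = lambda t: t["bname"] == "Downtown"

conjunct = select(lambda t: p1(t) and p2(t), deposit)   # sigma_{P1 and P2}
cascade1 = select(p1, select(p2, deposit))              # sigma_{P1}(sigma_{P2})
cascade2 = select(p2, select(p1, deposit))              # sigma_{P2}(sigma_{P1})

assert conjunct == cascade1 == cascade2
```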

Projection Operation
1. Like selection, projection reduces the size of relations.
It is advantageous to apply projections early. Consider this form of our example query:

Π_{bname, assets}((σ_{ccity = ``Port Chester''}(customer) ⋈ deposit) ⋈ branch)

2. When we compute the subexpression

σ_{ccity = ``Port Chester''}(customer) ⋈ deposit

we obtain a relation whose scheme is
(cname, ccity, bname, account#, balance)
3. We can eliminate several attributes from this scheme. The only ones we need to retain are those that
o appear in the result of the query or
o are needed to process subsequent operations.
4. By eliminating unneeded attributes, we reduce the number of columns of the intermediate result, and thus its size.
5. In our example, the only attribute we need is bname (to join with branch). So we can rewrite our expression as:

Π_{bname, assets}(Π_{bname}(σ_{ccity = ``Port Chester''}(customer) ⋈ deposit) ⋈ branch)

6. Note that there is no advantage in doing an early project on a relation before it is needed for some other operation:
o We would access every block for the relation to remove attributes.
o Then we access every block of the reduced-size relation when it is actually needed.
o We do more work in total, rather than less.
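The early-projection rewrite can likewise be spot-checked in Python (toy data, hypothetical values; the helpers model the algebra operators):

```python
# Projecting the subexpression down to bname before joining with branch
# preserves the final answer (toy relations, hypothetical data).
customer = [{"cname": "Hayes", "street": "Main", "ccity": "Port Chester"},
            {"cname": "Jones", "street": "North", "ccity": "Harrison"}]
deposit = [{"bname": "Downtown", "account#": 101, "cname": "Hayes", "balance": 500},
           {"bname": "Uptown",   "account#": 102, "cname": "Jones", "balance": 900}]
branch = [{"bname": "Downtown", "assets": 9000000, "bcity": "Brooklyn"},
          {"bname": "Uptown",   "assets": 7000000, "bcity": "Queens"}]

def natural_join(r, s):
    common = set(r[0]) & set(s[0]) if r and s else set()
    return [{**t1, **t2} for t1 in r for t2 in s
            if all(t1[a] == t2[a] for a in common)]

def select(pred, r):
    return [t for t in r if pred(t)]

def project(attrs, r):
    rows = dict.fromkeys(tuple(t[a] for a in attrs) for t in r)  # dedup, keep order
    return [dict(zip(attrs, row)) for row in rows]

sub = natural_join(select(lambda t: t["ccity"] == "Port Chester", customer), deposit)

late  = project(["bname", "assets"], natural_join(sub, branch))
early = project(["bname", "assets"],
                natural_join(project(["bname"], sub), branch))

assert late == early  # same result; `early` carries a narrower intermediate relation
```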
Natural Join Operation
1. Another way to reduce the size of temporary results is to choose an optimal ordering of the join operations.
2. Natural join is associative:

(e1 ⋈ e2) ⋈ e3 = e1 ⋈ (e2 ⋈ e3)

3. Although these expressions are equivalent, the costs of computing them may differ.
Look again at our expression

σ_{ccity = ``Port Chester''}(customer) ⋈ deposit ⋈ branch

o we see that we can compute deposit ⋈ branch first and then join the result with the first part.
o However, deposit ⋈ branch is likely to be a large relation, as it contains one tuple for every account.
o The other part,

σ_{ccity = ``Port Chester''}(customer)

is probably a small relation (comparatively).
o So, if we compute

σ_{ccity = ``Port Chester''}(customer) ⋈ deposit

first, we get a reasonably small relation.
o It has one tuple for each account held by a resident of Port Chester.
o This temporary relation is much smaller than deposit ⋈ branch.

4. Natural join is commutative:

e1 ⋈ e2 = e2 ⋈ e1

Thus we could rewrite our relational algebra expression as:

(σ_{ccity = ``Port Chester''}(customer) ⋈ branch) ⋈ deposit

But there are no common attributes between customer and branch, so this join is really a Cartesian product, which results in many tuples.
If a user entered this expression, we would want to use the associativity and commutativity of natural join to transform this into the more efficient expression we have derived earlier (join with deposit first, then with branch).
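The size argument can be illustrated numerically in Python (toy relations with hypothetical data; sizes are tiny but the relative ordering is the point):

```python
# Compare intermediate sizes for two join orders, and show that joining
# customer with branch directly degenerates to a Cartesian product.
customer = [{"cname": "Hayes", "street": "Main", "ccity": "Port Chester"},
            {"cname": "Jones", "street": "North", "ccity": "Harrison"},
            {"cname": "Smith", "street": "South", "ccity": "Rye"}]
deposit = [{"bname": "Downtown", "account#": 101, "cname": "Hayes", "balance": 500},
           {"bname": "Uptown",   "account#": 102, "cname": "Jones", "balance": 900},
           {"bname": "Downtown", "account#": 103, "cname": "Smith", "balance": 700},
           {"bname": "Uptown",   "account#": 104, "cname": "Hayes", "balance": 300}]
branch = [{"bname": "Downtown", "assets": 9000000, "bcity": "Brooklyn"},
          {"bname": "Uptown",   "assets": 7000000, "bcity": "Queens"}]

def natural_join(r, s):
    common = set(r[0]) & set(s[0]) if r and s else set()
    return [{**t1, **t2} for t1 in r for t2 in s
            if all(t1[a] == t2[a] for a in common)]

def select(pred, r):
    return [t for t in r if pred(t)]

pc = lambda t: t["ccity"] == "Port Chester"

big   = natural_join(deposit, branch)                # one tuple per account
small = natural_join(select(pc, customer), deposit)  # one per Port Chester account
assert len(small) < len(big)

# customer and branch share no attributes, so their natural join is a
# Cartesian product: every pair of tuples qualifies.
cross = natural_join(customer, branch)
assert len(cross) == len(customer) * len(branch)
```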
Other Operations
1. Some other equivalences, for union and set difference (selection distributes over both):

σ_{P}(e1 ∪ e2) = σ_{P}(e1) ∪ σ_{P}(e2)
σ_{P}(e1 − e2) = σ_{P}(e1) − σ_{P}(e2)

2. We can summarize query optimization as combining the implementation of various sets of operations in order to reduce the size of intermediate relations:
o Combine projects and selects with a Cartesian product or natural join.
o The idea is to do the selection and/or projection while computing the join.
o This saves computing a large intermediate relation that is going to be subsequently reduced by the select or project anyway.


Query Cost

The cost of query evaluation can be measured in terms of a number of different resources such as

Disk accesses
CPU time to execute a query
The cost of communication (in distributed systems)

The response time for a query-evaluation plan (assuming no other activity is going on in the computer while the query executes) would account for all of these costs.

Since CPU speeds are much faster than disk access speeds and also estimation of CPU time is relatively hard, disk access cost is considered a reasonable measure of the cost of query evaluation plan.

The disk-access cost can be estimated from:

The number of seek operations performed
The number of blocks read
The number of blocks written

The costs of all the algorithms depend on the size of the buffer in main memory.

In the best case, all data can be read into buffers, and disk does not need to be accessed again.

In the worst case, we assume that the buffer can hold only a few blocks of data, approximately one block per relation.


Join Strategies
1. We look at estimating the cost of processing a join.
2. Several factors influence the selection of an optimal strategy:
o Physical order of tuples in a relation.
o Presence of indices and type of index
o Cost of computing a temporary index for the sole purpose of processing one query.
3. We'll look at computing the expression deposit ⋈ customer, assuming no indices exist. We also let

n_deposit = 10,000 (number of deposit tuples)
n_customer = 200 (number of customer tuples)
Simple Iteration
1. If we don't create an index, we must examine every pair of tuples t1 in deposit and t2 in customer. This means examining 10,000 * 200 = 2,000,000 pairs.
2. If we execute this query cleverly, we can cut down the number of block accesses. We use the following method:
for each tuple d in deposit do begin
    for each tuple c in customer do begin
        examine pair (d, c) to see if a tuple should be added to the result
    end
end
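The tuple-at-a-time loop can be written out as runnable Python (a sketch only; the relations are lists of dicts, hypothetical data, joined on cname):

```python
def nested_loop_join(deposit, customer):
    """Tuple-oriented nested-loop join on the common attribute cname."""
    result = []
    for d in deposit:                        # outer loop: every deposit tuple
        for c in customer:                   # inner loop: every customer tuple
            if d["cname"] == c["cname"]:     # the join condition
                result.append({**c, **d})
    return result

deposit = [{"bname": "Downtown", "account#": 101, "cname": "Hayes", "balance": 500}]
customer = [{"cname": "Hayes", "street": "Main", "ccity": "Port Chester"},
            {"cname": "Jones", "street": "North", "ccity": "Harrison"}]

result = nested_loop_join(deposit, customer)
assert len(result) == 1
```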

3.
o We read each tuple of deposit once; this could require 10,000 block accesses.
o We read each tuple of customer once for each tuple of deposit, i.e. as many as 10,000 * 200 = 2,000,000 accesses to read customer tuples.
o So, if the tuples are not stored together physically, the total number of block accesses would be 10,000 + 2,000,000 = 2,010,000.
o (If we put customer in the outer loop instead, we get 200 + 2,000,000 = 2,000,200 accesses.)
o If the tuples of deposit are stored together physically, fewer accesses are required: at 20 tuples per block, 10,000/20 = 500 block accesses, for a total of 500 + 2,000,000 = 2,000,500.
o We can reduce accesses significantly further if we also store customer tuples together physically.
o At 20 tuples per block, only 200/20 = 10 accesses are required to read the entire customer relation (as opposed to 200).
o Then we need only 10 * 10,000 = 100,000 block accesses for customer, giving a total of 500 + 100,000 = 100,500.
4. If we use customer in the outer loop:
o Now we reference each tuple of deposit once for each tuple of customer.
o If deposit tuples are stored together physically, then since 20 tuples fit on one block, ndeposit/20=500 accesses are needed to read the entire relation.
o Since customer has 200 tuples, we read the deposit relation 200 times.
o Total cost is then 100,000 for inner loop plus 10 accesses to read the customer relation once for a total of 100,010.
o Compared to previous estimate of 100,500, the savings are small (490).

5. Note that we are considering worst-case number of block reads, where every time a block is needed it is not in the buffer.
Good buffer management can reduce this considerably.


Block-Oriented Iteration
1. If we process tuples on a per-block basis we can save many accesses.
The idea is that, if both relations have tuples stored together physically, we can examine all the tuple pairs for a block of each relation at one time. We still need to read all the tuples of one relation for a block of the other relation.

The block method algorithm is:
for each block Bd of deposit do begin
    for each block Bc of customer do begin
        for each tuple d in Bd do begin
            for each tuple c in Bc do begin
                test pair (d, c) to see if a tuple should be added to the result
            end
        end
    end
end

o Instead of reading the entire customer relation for each tuple of deposit, we read the entire customer relation once for each block of deposit.
o Since there are 500 blocks of deposit tuples and 10 blocks of customer tuples, reading customer once for each block of deposit requires 10 * 500 = 5000 accesses to customer blocks.
o Total cost is then 5000 + 500 (for accesses to deposit blocks) = 5500.
o This is obviously a significant improvement over the non-block method, which required roughly 100,000 or 2,000,000 accesses.
o Choice of customer for the inner loop is arbitrary, but does provide a potential advantage.
o Being the smaller relation, it may be possible to keep it all in main memory.
o If this was the case, we would only require 500 blocks to read deposit plus 10 blocks to read customer into main memory, for a total of 510 block accesses.
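A runnable sketch of the block method, modeling each relation as a list of blocks (each block a list of tuples); the data and block size are hypothetical:

```python
def block_nested_loop_join(deposit_blocks, customer_blocks):
    """Block-oriented nested-loop join on cname."""
    result = []
    for bd in deposit_blocks:            # one pass over deposit
        for bc in customer_blocks:       # customer re-read once per deposit *block*
            for d in bd:
                for c in bc:
                    if d["cname"] == c["cname"]:
                        result.append({**c, **d})
    return result

deposit_blocks = [[{"cname": "Hayes", "balance": 500}],
                  [{"cname": "Jones", "balance": 900}]]
customer_blocks = [[{"cname": "Hayes", "ccity": "Port Chester"},
                    {"cname": "Jones", "ccity": "Harrison"}]]
assert len(block_nested_loop_join(deposit_blocks, customer_blocks)) == 2

# The block-access estimates from the text:
b_deposit, b_customer = 500, 10
assert b_deposit + b_deposit * b_customer == 5500  # customer in the inner loop
assert b_deposit + b_customer == 510               # customer held in main memory
```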

Merge-Join
1. Suppose neither relation fits in main memory, and both are stored in sorted order on the join attributes. (E.g. both deposit and customer sorted by cname.)
2. We can then perform a merge-join, computed like this:
o Associate one pointer with each relation.
o Initially these pointers point to the first record in each relation.
o As algorithm proceeds, pointers move through the relation.
o A group of tuples in one relation with the same value on the join attributes is read.
o Then the corresponding tuples (if any) of the other relation are read.
o Since the relations are in sorted order, tuples with same value on the join attributes are in consecutive order. This allows us to read each tuple only once.
3. In the case where tuples of the relations are stored together physically in their sorted order, this algorithm allows us to compute the join by reading each block exactly once.

4.
o For deposit ⋈ customer, this is a total of 510 block accesses (500 blocks of deposit plus 10 blocks of customer).
o This is as good as the block-oriented method with the inner loop relation fitting into main memory.
o The disadvantage is that both relations must be sorted physically.
o It may be worthwhile to do this sort to allow a merge-join.
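The pointer-based procedure above can be sketched as runnable Python; both inputs are lists of dicts pre-sorted on the join attribute cname (toy data, hypothetical values):

```python
def merge_join(r, s, key="cname"):
    """Merge-join of two relations sorted on `key`; reads each tuple once."""
    result, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][key] < s[j][key]:
            i += 1
        elif r[i][key] > s[j][key]:
            j += 1
        else:
            # Gather the group of tuples with this key value from each side.
            k = r[i][key]
            gi = i
            while gi < len(r) and r[gi][key] == k:
                gi += 1
            gj = j
            while gj < len(s) and s[gj][key] == k:
                gj += 1
            for t1 in r[i:gi]:               # pair up the two groups
                for t2 in s[j:gj]:
                    result.append({**t1, **t2})
            i, j = gi, gj
    return result

deposit = sorted([{"cname": "Hayes", "balance": 500},
                  {"cname": "Hayes", "balance": 900},
                  {"cname": "Jones", "balance": 700}], key=lambda t: t["cname"])
customer = sorted([{"cname": "Hayes", "ccity": "Port Chester"},
                   {"cname": "Smith", "ccity": "Harrison"}], key=lambda t: t["cname"])

joined = merge_join(deposit, customer)
assert len(joined) == 2   # the two Hayes accounts match the one Hayes customer
```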

Structure of the Query Optimizer
1. There are many query-processing strategies used in database systems.
2. Most systems only implement a few strategies.
3. Some systems make a heuristic guess of a good strategy, in order to minimize the number of strategies to be considered.
4. Others consider every possible strategy, but quit evaluating a strategy as soon as its estimated cost exceeds that of the best strategy considered so far.
5. To simplify the strategy selection task, a query may be split into several sub-queries.
6. This simplifies strategy selection and permits recognition of common sub-queries (no need to compute them twice).
7. Examination of a query for common subqueries and the estimation of the cost of a large number of strategies impose a substantial overhead on query processing.
8. However, this is usually more than offset by savings at query execution time.
9. Therefore, most commercial systems include relatively sophisticated optimizers.
