Wednesday, September 1, 2010

Relational Database Design

Relational Database Design:--

The goal of relational database design is to generate a set of schemas that allow us to
• Store information without unnecessary redundancy.
• Retrieve information easily (and accurately).

A bad design may have several properties, including:
• Repetition of information.
• Inability to represent certain information.
• Loss of information.


Integrity Constraints
1. Integrity constraints provide a way of ensuring that changes made to the database by authorized users do not result in a loss of data consistency.
2. We are familiar with following forms of integrity constraint in the E-R models:
o key declarations: stipulation that certain attributes form a candidate key for the entity set.
o form of a relationship: mapping cardinalities 1-1, 1-many and many-many.
An integrity constraint can be any arbitrary predicate applied to the database



Domain Constraints
1. A domain of possible values should be associated with every attribute. These domain constraints are the most basic form of integrity constraint.
They are easy to test for when data is entered.
2. Domain types
o Attributes may have the same domain, e.g. cname and employee-name.
o It is not as clear whether bname and cname domains ought to be distinct.
o At the implementation level, they are both character strings.
o At the conceptual level, we do not expect customers to have the same names as branches, in general.
o Strong typing of domains allows us to test for values inserted, and whether queries make sense. Newer systems, particularly object-oriented database systems, offer a rich set of domain types that can be extended easily.
3. The check clause in SQL permits domains to be restricted in powerful ways
The check clause permits schema designer to specify a predicate that must be satisfied by any value assigned to a variable whose type is the domain.

Examples:
create domain hourly-wage numeric(5,2)
constraint wage-value-test check(value >= 4.00)

Note that ``constraint wage-value-test'' is optional (to give a name to the test to signal which constraint is violated).
create domain account-number char(10)
constraint account-number-null-test
check(value not null)

create domain account-type char(10)
constraint account-type-test
check(value in (``Checking'', ``Saving''))





Keys Constraints
Primary Key
A PRIMARY KEY constraint is used to specify the primarykey of a table, which is a column or set of columns that uniquely identifies a row. Because it identifies the row, a primary key column can never be NULL.

CREATE TABLE customer
(
first_name char(20) NOT NULL,
mid_init char(1) NULL,
last_name char(20) NOT NULL,
SSN char(11) PRIMARY KEY,
cust_phone char(10) NULL
)
GO

Foreign Key
A FOREIGN KEY constraint defines a foreignkey, which identifies a relationship between two tables. The foreign key column or columns in one table reference a candidate key in another table. When a row is inserted into the table with the FOREIGN KEY constraint, the values to be inserted into the column or columns defined as the foreign key are checked against the values in the candidate key of the referenced table. If no row in the referenced table matches the values in the foreign key, the new row cannot be inserted. But if the foreign key values to be inserted into the table do exist in the candidate key of the other table, the new row will be inserted.
CREATE TABLE items
(
item_name char(15) NOT NULL,
item_id smallint NOT NULL IDENTITY(1,1),
price smallmoney NULL,
item_desc varchar(30) NOT NULL DEFAULT 'none',
CONSTRAINT PK_item_id PRIMARY KEY (item_id)
)
GO

CREATE TABLE inventory
(
store_id tinyint NOT NULL,
item_id smallint NOT NULL,
item_quantity tinyint NOT NULL,
CONSTRAINT FK_item_id FOREIGN KEY (item_id)
REFERENCES items(item_id)
)
GO

Referential Integrity
Often we wish to ensure that a value appearing in a relation for a given set of attributes also appears for another set of attributes in another relation. This is called referential integrity.

Basic Concepts
1. Dangling tuples.
o Consider a pair of relations r(R) and s(S), and the natural join.
o There may be a tuple in r that does not join with any tuple in s.
o That is, there is no tuple in s such that
.
o We call this a dangling tuple.
o Dangling tuples may or may not be acceptable.
2. Suppose there is a tuple in the account relation with the value ``Lunartown'', but no matching tuple in the branch relation for the Lunartown branch.
This is undesirable, as should refer to a branch that exists.
Now suppose there is a tuple in the branch relation with ``Mokan'', but no matching tuple in the account relation for the Mokan branch.
This means that a branch exists for which no accounts exist. This is possible, for example, when a branch is being opened. We want to allow this situation.


3. Note the distinction between these two situations: bname is the primary key of branch, while it is not for account.
In account, bname is a foreign key, being the primary key of another relation.
o Let and be two relations with primary keys and respectively.
o We say that a subset of is a foreign key referencing in relation if it is required that for every tuple in there must be a tuple in such that

o We call these requirements referential integrity constraints.
o Also known as subset dependencies, as we require


Referential Integrity in the E-R Model
1. These constraints arise frequently. Every relation arising from a relationship set has referential integrity constraints.



An n-ary relationship set
o Figure shows an n-ary relationship set R relating entity sets .
o Let denote the primary key of .
o The attributes of the relation scheme for relationship set R include .
o Each in the scheme for R is a foreign key that leads to a referential integrity constraint.


2. Relation schemes for weak entity sets must include the primary key of the strong entity set on which they are existence dependent. This is a foreign key, which leads to another referential integrity constraint.

Representation of Information
1. Suppose we have a schema, Lending-schema,
Lending-schema = (bname, bcity, assets, cname, loan#, amount)

and suppose an instance of the relation is

Sample lending relation.
2. A tuple t in the new relation has the following attributes:
o t[assets] is the assets for t[bname]
o t[bcity] is the city for t[bname]
o t[loan#] is the loan number made by branch t[bname] to t[cname].
o t[amount] is the amount of the loan for t[loan#]
3. If we wish to add a loan to our database, the original design would require adding a tuple to borrow:
(SFU, L-31, Turner, 1K)

4. In our new design, we need a tuple with all the attributes required for Lending-schema. Thus we need to insert
(SFU, Burnaby, 2M, Turner, L-31, 1K)
5. We are now repeating the assets and branch city information for every loan.
o Repetition of information wastes space.
o Repetition of information complicates updating.
6. Under the new design, we need to change many tuples if the branch's assets change.
7. Let's analyze this problem:
o We know that a branch is located in exactly one city.
o We also know that a branch may make many loans.
o The functional dependency bname bcity holds on Lending-schema.
o The functional dependency bname loan# does not.
o These two facts are best represented in separate relations.
8. Another problem is that we cannot represent the information for a branch (assets and city) unless we have a tuple for a loan at that branch.
9. Unless we use nulls, we can only have this information when there are loans, and must delete it when the last loan is paid off.

Decomposition
1. The previous example might seem to suggest that we should decompose schema as much as possible.
Careless decomposition, however, may lead to another form of bad design.
2. Consider a design where Lending-schema is decomposed into two schemas
Branch-customer-schema = (bname, bcity, assets, cname)
Customer-loan-schema = (cname, loan#, amount)
3. We construct our new relations from lending by:
branch-customer =

customer-loan =

The decomposed lending relation.
4. It appears that we can reconstruct the lending relation by performing a natural join on the two new schemas.
5. Following figure shows what we get by computing branch-customer customer-loan.

Join of the decomposed relations.
6. We notice that there are tuples in branch-customer customer-loan that are not in lending.
7. How did this happen?
o The intersection of the two schemas is cname, so the natural join is made on the basis of equality in the cname.
o If two lendings are for the same customer, there will be four tuples in the natural join.
o Two of these tuples will be spurious - they will not appear in the original lending relation, and should not appear in the database.
o Although we have more tuples in the join, we have less information.
o Because of this, we call this a lossy or lossy-join decomposition.
o A decomposition that is not lossy-join is called a lossless-join decomposition.
o The only way we could make a connection between branch-customer and customer-loan was through cname.
8. When we decomposed Lending-schema into Branch-schema and Loan-info-schema, we will not have a similar problem.
Branch-schema = (bname, bcity, assets)
Branch-loan-schema = (bname, cname, loan#, amount)
The only way we could represent a relationship between tuples in the two relations is through bname.
This will not cause problems.
9. For a given branch name, there is exactly one assets value and exactly one bcity; whereas a similar statement associated with a loan depends on the customer, not on the amount of the loan (which is not unique).
10. We'll make a more formal definition of lossless-join:
o Let R be a relation schema.
o A set of relation schemas is a decomposition of R if

o That is, every attribute in R appears in at least one for .

o Let r be a relation on R, and let

o That is, is the database that results from decomposing R into .
o It is always the case that:

o To see why this is, consider a tuple .
 When we compute the relations , the tuple t gives rise to one tuple in each .
 These n tuples combine together to regenerate t when we compute the natural join of the .
 Thus every tuple in r appears in .
o However, in general,

o We saw an example of this inequality in our decomposition of lending into branch-customer and customer-loan.
o In order to have a lossless-join decomposition, we need to impose some constraints on the set of possible relations.
o Let C represent a set of constraints on the database.

o A decomposition of a relation schema R is a lossless-join decomposition for R if, for all relations r on schema R that are legal under C:

11. In other words, a lossless-join decomposition is one in which, for any legal relation r, if we decompose r and then ``recompose'' r, we get what we started with - no more and no less.

Normalization Using Functional Dependencies
We can use functional dependencies to design a relational database in which most of the problems we have seen do not occur.
Using functional dependencies, we can define several normal forms which represent ``good'' database designs.
Desirable Properties of Decomposition
1. We'll take another look at the schema
Lending-schema = (bname, assets, bcity, loan#, cname, amount)
which we saw was a bad design.
2. The set of functional dependencies we required to hold on this schema was:
bname assets bcity
loan# amount bname
3. If we decompose it into
Branch-schema = (bname, assets, bcity)
Loan-info-schema = (bname, loan#, amount)
Borrow-schema = (cname, loan#)
we claim this decomposition has several desirable properties.

Lossless-Join Decomposition
1. We claim the above decomposition is lossless. How can we decide whether a decomposition is lossless?
o Let R be a relation schema.
o Let F be a set of functional dependencies on R.
o Let and form a decomposition of R.
o The decomposition is a lossless-join decomposition of R if at least one of the following functional dependencies are in :
1.
2.
Why is this true? Simply put, it ensures that the attributes involved in the natural join ( ) are a candidate key for at least one of the two relations.
This ensures that we can never get the situation where spurious tuples are generated, as for any value on the join attributes there will be a unique tuple in one of the relations.
2. We'll now show our decomposition is lossless-join by showing a set of steps that generate the decomposition:
First we decompose Lending-schema into
Branch-schema = (bname, bcity, assets)
Loan-info-schema = (bname, cname, loan#, amount)
Since bname assets bcity, the augmentation rule for functional dependencies implies that
bname bname assets bcity

Since Branch-schema Borrow-schema = bname, our decomposition is lossless join.



Next we decompose Borrow-schema into
Loan-schema = (bname, loan#, amount)
Borrow-schema = (cname, loan#)
As loan# is the common attribute, and
loan# amount bname
This is also a lossless-join decomposition.













Boyce-Codd Normal Form
1. A relation schema R is in Boyce-Codd Normal Form (BCNF) with respect to a set F of functional dependencies if for all functional dependencies in of the form , where and , at least one of the following holds:
o is a trivial functional dependency (i.e. ).
o is a superkey for schema R.
2. A database design is in BCNF if each member of the set of relation schemas is in BCNF.
3. Let's assess our example banking design:
Customer-schema = (cname, street, ccity)
cname street ccity

Branch-schema = (bname, assets, bcity)
bname assets bcity

Loan-info-schema = (bname, cname, loan#, amount)
loan# amount bname

Customer-schema and Branch-schema are in BCNF.


4. Let's look at Loan-info-schema:
o We have the non-trivial functional dependency loan# amount, and
o loan# is not a superkey.
o Thus Loan-info-schema is not in BCNF.
o We also have the repetition of information problem.
o For each customer associated with a loan, we must repeat the branch name and amount of the loan.
o We can eliminate this redundancy by decomposing into schemas that are all in BCNF.

5. If we decompose into
Loan-schema = (bname, loan#, amount)
Borrow-schema = (cname, loan#)
We have a lossless-join decomposition. To see whether these schemas are in BCNF, we need to know what functional dependencies apply to them.
For Loan-schema, we have loan# amount bname applying.
Only trivial functional dependencies apply to Borrow-schema.
Thus both schemas are in BCNF.
We also no longer have the repetition of information problem. Branch name and loan amount information are not repeated for each customer in this design.

6. Some Things To Note About BCNF
o There is sometimes more than one BCNF decomposition of a given schema.
o The BCNF decomposition given produces only one of these possible decompositions.
o Some of the BCNF decompositions may also yield dependency preservation, while others may not.
o Changing the order in which the functional dependencies are considered may change the decomposition.









Third Normal Form
1. When we cannot meet all the design criteria, (i.e. lossless join, dependency preservation and repetition of information) we abandon BCNF and accept a weaker form called third normal form (3NF).
2. It is always possible to find a dependency-preserving lossless-join decomposition that is in 3NF.
3. A relation schema R is in 3NF with respect to a set F of functional dependencies if for all functional dependencies in of the form , where and , at least one of the following holds:
o is a trivial functional dependency.
o is a superkey for schema R.
o Each attribute A in is contained in a candidate key for R.
4. A database design is in 3NF if each member of the set of relation schemas is in 3NF.
5. We now allow functional dependencies satisfying only the third condition. These dependencies are called transitive dependencies, and are not allowed in BCNF.
6. As all relation schemas in BCNF satisfy the first two conditions only, a schema in BCNF is also in 3NF.
7. BCNF is a more restrictive constraint than 3NF.



Comparison of BCNF and 3NF
1. We have seen BCNF and 3NF.
o It is always possible to obtain a 3NF design without sacrificing lossless-join or dependency-preservation.
o If we do not eliminate all transitive dependencies, we may need to use null values to represent some of the meaningful relationships.
o Repetition of information occurs.
2. These problems can be illustrated with Banker-schema.
o As banker-name bname , we may want to express relationships between a banker and his or her branch.

An instance of Banker-schema.
o Figure shows how we must either have a corresponding value for customer name, or include a null.
o Repetition of information also occurs.
o Every occurrence of the banker's name must be accompanied by the branch name.


3. If we must choose between BCNF and dependency preservation, it is generally better to opt for 3NF.
o If we cannot check for dependency preservation efficiently, we either pay a high price in system performance or risk the integrity of the data.
o The limited amount of redundancy in 3NF is then a lesser evil.
4. To summarize, our goal for a relational database design is
o BCNF.
o Lossless-join.
o Dependency-preservation.
5. If we cannot achieve this, we accept
o 3NF
o Lossless-join.
o Dependency-preservation.
6. A final point: there is a price to pay for decomposition. When we decompose a relation, we have to use natural joins or Cartesian products to put the pieces back together. This takes computational time.








Multivalued Dependencies
1. Functional dependencies rule out certain tuples from appearing in a relation.
If A B, then we cannot have two tuples with the same A value but different B values.
2. Multivalued dependencies do not rule out the existence of certain tuples.
Instead, they require that other tuples of a certain form be present in the relation.
Example:

(name, address, car) where and .



Fourth Normal Form (4NF)
1. Even when a schema is in BCNF, it may not be an ideal design as it might contain multivalued dependencies and thus there might be repetition of information.
2. We can use the given multivalued dependencies to improve the database design by decomposing it into fourth normal form.
3. A relation schema R is in 4NF with respect to a set D of functional and multivalued dependencies if for all multivalued dependencies in of the form , where and , at least one of the following hold:
o is a trivial multivalued dependency.
o is a superkey for schema R.
4. A database design is in 4NF if each member of the set of relation schemas is in 4NF.
5. The definition of 4NF differs from the BCNF definition only in the use of multivalued dependencies.Every 4NF schema is also in BCNF.
6. We have seen that if we are given a set of functional and multivalued dependencies, it is best to find a database design that meets the three criteria:
o 4NF.
o Dependency Preservation.
o Lossless-join.
7. If we only have functional dependencies, the first criteria is just BCNF.
8. We cannot always meet all three criteria. When this occurs, we compromise on 4NF, and accept BCNF, or even 3NF if necessary, to ensure dependency preservation.

No comments:

Post a Comment