Exploding interest in the Internet and World Wide Web as application platforms has had a significant effect on the relational database management system (RDBMS) marketplace. Initially, Web hype reduced the profile of the major RDBMS combatants and overshadowed their emphasis on areas such as parallel database operations, data warehousing, and data replication. But users developing Internet/Web applications quickly recognized the need for a scalable, robust environment in which to store, manipulate, and manage the dynamic delivery of multimedia data and other complex datatypes. Connecting the DBMS and Web applications via a Web server also requires effective support for a three-tiered application architecture. In fact, a Web application can represent a microcosm of the full range of business requirements -- scalability, performance, access to integrated data across an organization, transaction management, deployment on multiple platforms, Java support, and so on.
When the Web came along, most RDBMS vendors were already working on extensibility to handle more complex data and applications -- Web applications simply provided an immediate need and a focal point for these efforts. Thus we are seeing a resurgence of interest in RDBMS technology as an underlying platform for complex-data applications, particularly those implemented on the Web.
This article addresses the application requirements that drive the efforts to extend the RDBMS to handle complex data. It also describes the components of an extensible data management architecture and the core features required to achieve RDBMS server extensibility. Part II of this article, which will appear next month, will discuss how the five major RDBMS vendors -- Informix Software Inc. (Menlo Park, Calif.), IBM Corp. (Armonk, N.Y.), Microsoft Corp. (Redmond, Wash.), Oracle Corp. (Redwood Shores, Calif.), and Sybase Inc. (Emeryville, Calif.) -- plan to support data extensibility.
Application Requirements
Users have always wanted a unified view of and integrated access to all data across the organization. A related requirement is integrated content searching. Delivering these capabilities has been difficult if not impossible in the past. Although the RDBMS has successfully met the requirements of a wide range of applications, it natively understands only highly structured data in the form of simple alphanumeric datatypes. Text and images may be stored as BLOBs, but the DBMS doesn't have a clue about the content of these BLOBs. Users who want to build applications that intelligently access data stored as time series, geospatial locations, dynamic Web pages, documents, spreadsheets, mail messages, and other data have had to explore other alternatives, such as specialized servers or logic in the application itself. The RDBMS has not been a viable alternative.
In addition, corporate information systems are becoming more complex as organizations strive to design and integrate operational, data warehousing, and Web information systems. (See Figure 1.) A Web information system, for example, lets users access corporate information using a Web browser and Web-based applications; it essentially Web-enables anything.
To meet these application requirements, many organizations are looking for a single database platform that applies scalability, transaction integrity, proactive enforcement of business rules, and other robust DBMS functionality to complex data as well as traditional data. Because the RDBMS is already widely used for traditional applications, it makes sense to explore the possibility of extending it to manage a wider range of datatypes, application-specific semantics, and complex data relationships intelligently. This evolution of the RDBMS is referred to as "extended relational" (because the underlying data model is extensible) or "object-relational" (because the RDBMS can now understand "rich datatypes" or "objects" that represent complex internal structures, attributes, and behavior and require new search methods). A common term for products in this space is "universal server" or "universal database."
Another factor that is driving efforts to extend the RDBMS is the ever-present desire to increase developer productivity. One aspect of this is providing a single API to all data so that developers don't have to contend with multiple APIs depending on the type and source of data. An example is retrieving structured data from an RDBMS using SQL and retrieving related information on documents using the native API of a text search engine. If the data is related, why not just use one API, such as SQL, and have a single database server manage both types of data? The controversy evolving in this area is whether SQL is the right language for the job. The object-relational camp pushes SQL3 as the single API. Microsoft, on the other hand, would like to see its COM-based OLE DB interface take on this single-API role.
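To make the single-API idea concrete, here is a minimal sketch of the kind of query the object-relational camp has in mind. The table and column names and the contains() text-search function are invented for illustration; each product exposes its own text extension.

    SELECT c.name, d.title
    FROM   customers c, contracts d
    WHERE  c.cust_id = d.cust_id
      AND  contains(d.body, 'termination clause') = 1  -- hypothetical text-search UDF

One SQL statement retrieves structured customer data and searches document content, instead of a SQL query to the RDBMS plus a separate call to a text engine's native API.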
A second aspect of developer productivity is pushing support for object modeling techniques into the database server itself. These techniques include the ability to encapsulate data and its associated methods as objects and to reuse code through features such as inheritance and polymorphism.
There are two important points to be made here. One is that many people in the industry confuse objects with datatypes. Objects encapsulate both data and methods. Adding new datatypes to the RDBMS is only one step toward supporting true objects. Also required is the ability to define new methods and to associate these methods with the appropriate datatypes.
The second point is the need to differentiate the object-relational approach from that of object DBMSs. Object-relational DBMSs deliver some object capabilities, but they do not yet offer the same level of support as object DBMSs for features such as encapsulation and inheritance. In addition, it is unlikely that object-relational DBMSs will fully support features such as pointer navigation and tight integration with object-oriented programming languages to provide persistent storage of native objects created in an application. I expect object-relational products to move in this direction to gain the benefits of the object approach where possible, but object features will be implemented differently in object-relational products, given their underlying data structure of tables and columns.
An Extensible Data Management Architecture
There are three major approaches to creating an extensible data management architecture: the universal server approach, the middleware approach, and the object layer approach, which I describe later in this article. Perhaps a better way to present these approaches is to refer to each one as a component of data extensibility, for they are not mutually exclusive. In fact, a fully extensible DBMS solution should address all of these components. (See Figure 2.)
There are two key differences among these approaches. The first is how and where the data is managed. Is all of the data tightly integrated and managed by a single DBMS server? Or is the data loosely integrated and managed by multiple servers? The second differentiator is where query optimization occurs and how well it performs. Query optimization maps the logical view of data to the physical view and decides the best way to execute the query. Is this handled by the database server (the universal server approach) or in middleware (the OLE DB approach)? Good optimization is particularly important in an environment in which data can be stored in any combination of one table, multiple tables, multiple databases, and/or files outside the DBMS.
The Universal Server Approach
The "universal server" approach extends RDBMS server capabilities to understand, store, and manage complex data natively in the database itself. Informix, IBM, and Oracle are all implementing this approach (with Informix-Universal Server, IBM DB2 Universal Database, and Oracle8, respectively). Sybase will also add limited complex-data support to Sybase SQL Server in the future. This approach assumes that all of the data is physically stored within the database.
An "extended universal server" accommodates the fact that there may be very good reasons (such as performance) for not storing all data in the DBMS. So the DBMS must also be able to efficiently access data stored in external files. Large data values -- images, for example -- can be stored externally, and a pointer to each image file is stored inside the database as a column value. An additional step is enabling the DBMS to also manage and ensure the integrity of this external data. The only RDBMS vendor planning to address the latter is IBM, with its robust file-links. However, I expect efforts in this area to receive much more attention in the industry over the next 12 months.
Object purists already criticize the extended-relational approach because the DBMS must decompose objects into relational tables -- rows and columns -- for storage and then rebuild them into objects before delivery to the user. The vendors extending relational are very much aware of the performance issues and the need to avoid the overhead of joins where possible. My philosophy here is the following: If product performance meets user requirements, it doesn't matter what the product does under the covers (unless, of course, the architecture hampers the ability to add future enhancements). In the real world, user experience will demonstrate whether the extended RDBMS vendors have been successful in adapting the relational model for complex-data support. I discuss specific universal server extensions in more detail later in this article.
The Middleware Approach
Another approach is to use middleware that coordinates and executes requests across multiple, heterogeneous servers (RDBMS, text search engine, image system, and flat files); the data itself is managed within each specialized server. The middleware provides the unified view of the data, executes the global optimization of user queries, and provides global transaction management. There are two types of middleware in an extensible data management architecture. One is database middleware, such as IBM's DataJoiner and Sybase's OmniConnect, for integrated access to heterogeneous data; these products use the SQL API and provide drivers on the back end to access each supported server. Sybase also plans to extend this "federated server" approach in its adaptive server architecture.
Microsoft's OLE DB and DCOM, and other object request brokers (ORBs), on the other hand, represent another type of middleware: application middleware. OLE DB is an interface designed to provide universal access to data. It "componentizes" DBMS functionality, breaking it up into components, such as query processors, optimizers, and transaction managers, that can run in the middleware space or in the operating system. OLE DB will be an integral component of Microsoft's operating systems and servers. In fact, DBMSs were originally developed because of operating-system deficiencies. If these services are implemented at the operating-system level, will we still need the DBMS? This discussion will become particularly interesting if SQL3 doesn't prove to be up to the task of handling real objects. Oracle's Network Computing Architecture also addresses the ability to extend the data environment at the middleware level; Oracle's Web Application Server will evolve into a generic application server that can provide some database functionality such as transaction management.
How well a middleware approach performs will be determined by several factors: the level of integration among components, how smart the middleware is about data stored in supported servers and the native capabilities of each server, whether the user wants to manipulate the data or simply retrieve it, and the mechanism by which components communicate. Some RDBMS vendors claim that using RPCs to execute queries across distributed data will have serious performance implications. However, the comment I just made about performance as it relates to object-relational also applies here.
One issue with OLE DB in particular and ORBs in general is the ability of third-party software vendors to provide competent database functionality such as global query processing and optimization. Given the amount of effort and R&D investment that have gone into optimization algorithms on the part of the major RDBMS vendors, it is not clear that middleware will offer comparable functionality and performance. Another issue is the number of data sources accessible through middleware. Is the user dependent on the DBMS vendor, or can a customer or a third party integrate data to meet specific business requirements?
The Object Layer Approach
The object layer in an extensible data architecture provides integrated object views and object functionality at the application level. This can encompass client cache management, pointer navigation among objects, local execution of functions, and local query optimization. Object DBMSs are clearly focused here, including persistent storage of objects created by the application. In the case of the RDBMS, the object layer could include the ability to map objects in the application to objects in the database so that relational data can be materialized in the form of native C or C++ objects, Java objects, and so on. The benefits of this approach are tighter integration between the data manager and the application development language and the potential for better performance. IBM is planning to address the object layer in the future through its client object-support development effort. Oracle will do some of this in Oracle 8 with its object views of relational data and client-cache management. Microsoft's OLE DB also provides support at the object layer with a unified view of heterogeneous data.
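As one concrete illustration of the object layer, here is a loose sketch of an Oracle8-style object view; the exact clauses vary by release, and the type, view, and column names are invented.

    CREATE TYPE customer_t AS OBJECT (
      cust_id  NUMBER,
      name     VARCHAR2(40)
    );

    -- The object view presents rows of an ordinary relational table as
    -- instances of customer_t, which the client can then cache and
    -- manipulate as native objects
    CREATE VIEW customers_ov OF customer_t
      WITH OBJECT OID (cust_id)
      AS SELECT c.cust_id, c.cust_name FROM customers c;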
Universal Server Extensions
Much of the marketing hype in the industry is focused on the universal server component of the architecture. Here is a brief look at how RDBMS vendors are extending their products to satisfy the user requirements I described previously. Many, but not all, of these features are included in the SQL3 draft standard. (See Table 1 for a summary.)
Extensible Type System. An extended RDBMS must support user-defined datatypes (UDTs) at both the column and the row level. Column-level UDTs are either distinct or abstract datatypes. UDTs enable the extended RDBMS to incorporate new datatypes and understand complex data or business relationships. Distinct types are relatively simple UDTs that extend an existing base datatype for a column. A strongly typed system will not permit the user to make inappropriate direct comparisons between types with different names, even though they share the same base datatype and length. Abstract datatypes define more complex datatypes that have special internal structures and attributes, such as text, geospatial, or time-series data. As with objects, the internal structure of an abstract datatype is hidden from the user; data is accessed and manipulated using a set of external attributes and functions. Abstract datatypes are defined using SQL (the database engine is aware of the attributes and internal structure) or a host language (the type is essentially "opaque" and seen only as a large object by the DBMS).
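A loose sketch in SQL3-flavored syntax (exact keywords differ from product to product) shows the two kinds of column-level UDTs; the type and function names are invented for illustration.

    -- Distinct type: a new name over an existing base type; a strongly
    -- typed server rejects direct comparisons of money_t with plain
    -- DECIMAL values
    CREATE DISTINCT TYPE money_t AS DECIMAL(15,2);

    -- Abstract type: internal structure plus functions that operate on it
    CREATE TYPE point_t (x FLOAT, y FLOAT);
    CREATE FUNCTION distance (a point_t, b point_t) RETURNS FLOAT
      LANGUAGE SQL
      RETURN sqrt((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y));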
A row type describes an entire row or a set of nested columns in a table, providing a way to represent hierarchical "entities" in the database -- customer, employee, and so on -- and identify multiple related columns. Reference types can then define relationships between row types and uniquely identify a row within an entire database. References enable users to replace complex-join definitions in queries with much simpler path expressions. References also give the optimizer an alternative way to navigate data instead of via value-based joins.
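A rough SQL3-style rendering of row types and references follows; the arrow notation for path expressions comes from the draft standard, and shipping products use their own variants.

    CREATE ROW TYPE address_t  (street VARCHAR(40), city VARCHAR(30), zip CHAR(10));
    CREATE ROW TYPE customer_t (cust_id INTEGER, name VARCHAR(40), addr address_t);
    CREATE TABLE customers OF TYPE customer_t;

    CREATE TABLE orders (
      order_id  INTEGER,
      buyer     REF(customer_t)   -- reference to a customer row
    );

    -- A path expression replaces a value-based join
    SELECT o.order_id, o.buyer->name, o.buyer->addr.city
    FROM   orders o;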
Collections are type constructors that are used to define collections of other types, such as arrays, lists, and sets. Collections are used to store multiple values in a single column in a table and can result in nested tables where a column in one table actually contains another table. The result can be a single table that represents multiple master-detail levels. Collections add flexibility to the design of database structures.
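A loose sketch using Informix-flavored collection constructors (the ARRAY, SET, and LIST keywords differ across the SQL3 draft and shipping products):

    CREATE TABLE departments (
      dept_id  INTEGER,
      name     VARCHAR(40),
      -- a collection column stores several values per row ...
      phones   SET(VARCHAR(20) NOT NULL),
      -- ... and a collection of row values yields a nested detail table
      staff    LIST(ROW(emp_name VARCHAR(40), title VARCHAR(30)) NOT NULL)
    );

A single departments table now carries master-detail structure that would otherwise require a separate staff table and a join.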
An important aspect of object modeling is inheritance, in which subtypes inherit the attributes and behavior of their supertypes. Inheritance facilitates code reuse and the ability to maintain logical integrity in the database.
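In the same illustrative syntax, a subtype declared UNDER a supertype inherits its attributes and the functions defined on it:

    CREATE ROW TYPE person_t   (name VARCHAR(40), birthdate DATE);
    -- employee_t adds its own attributes and inherits everything in person_t
    CREATE ROW TYPE employee_t (salary DECIMAL(9,2), hire_date DATE)
      UNDER person_t;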
User-Defined Functions. User-defined functions (UDFs) define methods for manipulating data and are an important adjunct to UDTs. An extended RDBMS should provide significant flexibility in this area, such as allowing UDFs to return complex values that can then be further manipulated (such as tables), execution options so that the user can decide whether performance or security is more important when running UDFs, and support for overloading of function names to simplify application development.
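A loose sketch of function overloading and of a UDF written in an external language; the library entry point and type names are invented, and registration syntax varies widely among products.

    -- Two functions share a name; the server dispatches on the argument type
    CREATE FUNCTION area (shape rectangle_t) RETURNS FLOAT
      LANGUAGE SQL
      RETURN shape.width * shape.height;

    CREATE FUNCTION area (shape circle_t) RETURNS FLOAT
      EXTERNAL NAME 'geomlib:circle_area'   -- hypothetical C library entry point
      LANGUAGE C;

The execution-option question raised above amounts to whether an external function like the second one runs inside the server's address space for speed or in a separate, fenced process for safety.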
Index Structures. Traditional RDBMSs use B-tree indexes to speed access to scalar data. With the ability to define more complex datatypes in the RDBMS, specialized index structures are required for efficient access to data. Some extended RDBMSs are beginning to support additional index types, such as R-trees (region trees) for fast access to two- and three-dimensional data, and the ability to index on the output of a function. A mechanism to plug in any user-defined index structure provides the highest level of flexibility.
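Two illustrative index statements (syntax again varies by product): one indexes the output of a UDF, the other asks for an R-tree access method for spatial data.

    -- Index the result of a function so queries that filter on it need not
    -- evaluate the function for every row
    CREATE INDEX idx_parcel_area ON parcels (area(shape));

    -- Use an R-tree access method for two-dimensional boundary data
    CREATE INDEX idx_parcel_bounds ON parcels (boundary) USING RTREE;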
Optimizer. The query optimizer is the heart of RDBMS performance and must also be extended with knowledge about how to execute UDFs efficiently, take advantage of new index structures, transform queries in new ways, and navigate among data using references. Successfully opening up such a critical and highly tuned DBMS component and educating third parties about optimization techniques is a major challenge for DBMS vendors.
Other Extensions. Other important extensions are support for large-object storage either inside the database or outside in external files, the ability to apply business rules and integrity constraints to new datatypes, recursive queries to support complex-data relationships, and extended language support in the server. This last area is key to improving both flexibility and portability. Extended RDBMSs must support the SQL3 standard (in committee draft status now) plus additional languages for writing UDFs and stored procedures, such as 3GLs and Java. Unfortunately, the SQL3 standard does not address some areas of extensibility, so implementation of features such as the mechanism for giving the optimizer cost information about UDFs and new index structures will vary among products. The lack of a standard way for third-party vendors to integrate their software with multiple RDBMSs demonstrates the need for standards beyond the focus of SQL3.
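One extension the draft does cover is recursive queries. A loose sketch in the WITH RECURSIVE style proposed for SQL3 walks a parts-explosion hierarchy of arbitrary depth; the table name and starting part number are illustrative.

    WITH RECURSIVE subparts (part_id, component_id) AS (
      SELECT part_id, component_id
      FROM   bill_of_materials
      WHERE  part_id = 100
      UNION ALL
      SELECT b.part_id, b.component_id
      FROM   bill_of_materials b, subparts s
      WHERE  b.part_id = s.component_id
    )
    SELECT * FROM subparts;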
We also need application language extensions to "complex-data-enable" applications on the front end. Application development tools must be extended to take advantage of new server functionality.
Getting Customers to Migrate
Vendors must address two major obstacles to getting customers to buy into their universal server solutions. One is providing a substantial portfolio of predefined extensions -- such as DataBlades for Informix-Universal Server -- as building blocks for application development. Most extensions will come from a combination of the DBMS vendor and third-party software vendors who want to integrate their products tightly within the DBMS. The most flexible solution is for DBMS vendors to offer basic built-in extensions but also support a wide variety of (potentially competitive) third-party extensions. Thus the customer can assess the trade-off between one-stop shopping and a best-of-breed solution. Third parties that write DBMS extensions must understand complex database functionality such as transaction management and how to assess the cost of executing a particular function. They also must keep their products in sync with new releases of the database server. This market for third-party database extensions would resemble the way third parties already supply components for application development tools.
The second obstacle is convincing customers that to get extensibility, they don't have to give up what they already have, such as good performance on existing applications. A question to ask each vendor is how integrated complex-data extensions are with existing DBMS functionality such as parallel processing, backup and recovery, data integrity constraints, and data replication. Does the customer have to choose between complex-data support and the ability to use other features such as parallel processing, data replication, or distributed databases? In some cases, the answer is yes.
Upgrade or Perish?
The next generation of RDBMS products has started to roll out in the form of object-relational "universal servers." However, none of the initial product releases from Informix, IBM, or Oracle supports a complete set of object extensions, nor do any of the releases support all of the components of an extensible data management architecture. Taking advantage of a universal server requires customers to upgrade to a new version of the DBMS. It is important to understand what this entails and whether existing applications must be modified. Can the customer take advantage of new extensions in an evolutionary way without losing the benefits of current database features and functions? Stay tuned for a look at the specific vendor strategies and product plans of Informix, IBM, Microsoft, Oracle, and Sybase and how they are addressing data extensibility.