Should the database for the data warehouse be supported by a conventional RDBMS, or a specialized, multidimensional database (MDDB)?
First of all, I need to clarify the difference between an OLAP Server and a Database Management System(DBMS), as well as their roles in a data warehouse architecture.
Oracle Server, Microsoft SQL Server and IBM DB2 are all relational database management systems. The main role of a relational database server is to manage access to data stored in bi-dimensional tables comprised of rows and columns (relational tables¾the table defines a relation between things in each row/record). The DBMS also provides management of computer resources, shared access, security, programming interfaces, and many administrative functions.
An OLAP server is a specialized database engine capable of handling the ethereal multidimensional data cubes. The idea of a cube is a mere abstraction, actually, OLAP products store their data in multidimensional array structures (array data types that are referenced by multiple indexes). "Cubes" are constructed in such a way that each combination of dimensional data attributes with the relevant numeric additive data is either precalculated or easy to compute very quickly. OLAP engines are very good at, and specially designed for, doing analytical processing, i.e., calculating aggregations/summarizations, constraining queries through different paths (slice and dice), etc. There are currently several different implementations of OLAP technologies in the market. The major categories are:
- Desktop OLAP Tools (DOLAP) - Processing on PC or on mid-tier servers, not on data servers. Multidimensional arrays stored on PCs or mid-tier servers. Allows users to have current information that is portable and can be analyzed on their desktops. These tools provide users with the ability to have a personal copy of their multidimensional database or provide access to a central data repository from the desktop.
Representative Products/Vendors: BusinessObjects, Brio, PowerPlay/Cognos and WebFocus/InformationBuilders.
- Relational OLAP Tools (ROLAP) - ROLAP tools support a three tier architecture consisting of a client, a mid-tier server, and a target database (data warehouse). The function of the client is to support a GUI interface for the initiation of queries and the display of query results. The mid-tier server is a dedicated server that performs multidimensional OLAP calculations and accesses data on a relational database through SQL.
Representative Products/Vendors: MicroStrategy, EurekaSuite/Sterling(now CA), MetaCube/Informix, InfoBeacon/CA(formerly Platinum).
- Multidimensional OLAP Tools (MOLAP) - MOLAP tools implement database engines that store data on multidimensional arrays. They are recommended for relatively small data marts (typically less than 50 GB), where performance is critically important, or where complex analytical processing must be supported.
Representative Products/Vendors: Essbase/Hyperion, Oracle Express, OLAP@Work/BusinessObjects, SAS Multidimensional Database Server.
- Hybrid OLAP Tools (HOLAP) - The Hybrid OLAP tools combine the best of both worlds, relational access for low granularity and high volume data, and a multidimensional database engine for OLAP processing.
Representative Products/Vendors: Microsoft SQL Server OLAP Services(Plato), IBM DB2 OLAP Server, Holos 7/Seagate.
Below are descriptions of all the components of a sustainable enterprise data warehouse architecture:
- Source Systems - often referred to as "legacy" systems. All existing corporate data assets currently in use that are relevant for the data warehouse purpose. These source systems include formal and informal data sources, internal and external systems, and structured and unstructured data.
- Data Conversion and Migration Services - these are the processes, programs and tools used to extract the source data from the legacy environment, cleanse it, transform it, and load it into the data warehouse.
- The Data Warehouse - integrates and stores the enterprise data. It is designed to optimize query and analysis performance, end-user understandability and scalability. The data marts are subsets of the data warehouse focused on a specific subject area. The data warehouse can be used to feed multidimensional databases (OLAP servers) for more specialized analytical applications. The central corporate data warehouse is the cornerstone of the data warehousing environment, but this function also includes other special purpose data stores like the ODS (Operational Data Store), Multidimensional databases (MDDB), and specialized downstream data marts.
- Data Access and Analysis Services - those are the front-end tools and applications that provide intuitive access, powerful query and analysis capabilities to end-users across the organization
- Metadata - "it is data about the data". The Metadata Repository is a foundation of the data warehouse. It is essential to integrate all components of the data warehouse. Metadata stores definitions of the source data, data models for target databases, transformation rules that convert source data into target data, and semantic definitions for the end-user tools/applications.
The Central Data Warehouse
should be implemented in a relational database(RDBMS)
. It stores consolidated, detailed, corporate-wide data. It is based on a "star-schema" design, and it is constituted by multiple data marts
integrated through conformed facts and dimensions.
RDBMSs are more suitable for the Central Data Warehouse role than multidimensional databases (OLAP servers) because they are relatively open, they have no database size constraints, they are supported by many third-party vendors, they are compatible with existing standards at the organization, and they are easier to maintain and administrate.
Multidimensional databases (MDDBs) are ideally suited for very specialized multidimensional calculations, involving aggregations, matrix calculations, cross-dimensional calculations, read/write analysis, statistical analysis, what-if-analysis, etc. MDDBs provide faster response and additional capabilities for more sophisticated OLAP processing than RDBMSs, but they have severe size limitations, and are difficult to integrate with the other components of the data warehouse architecture.
Based on the architecture explained above, our recommendation is to build the data warehouse on a relational database like Oracle, MS SQL Server, or IBM DB2. The data model for the warehouse should be based on a dimensional design ("the star-schema framework") to facilitate integration and scalability, and provide greater performance for analytical processing. The data warehouse can then provide data directly to BusinessObjects and/or other OLAP servers.
I will be discussing and comparing in more detail the features of the 3 main databases: Oracle8i, MS SQL Server 7.0, and IBM DB2 UDB.
The 3 aforementioned products are relational database management systems (RDBMSs). The related 3 vendors also have OLAP servers available that are well integrated with the respective RDBMS products. Oracle's OLAP server is Oracle Express, Microsoft has the SQL Server OLAP Services (codename Plato), and IBM's product is DB2 OLAP server.
What to Look for in a Relational Database for your Data Warehouse
Scalability is usually a very critical factor in data warehousing. It is normally hard to predict how much a data warehouse will grow on the mid/long term. Relational databases are mandatory when it is likely that the data warehouse will grow to hundreds of gigabytes, or larger.
However, response times are likely to be slow for ad hoc queries to very large target databases, or queries that generate large results sets. To obtain maximum performance from a RDBMS, it is important to employ all of the advanced features supported by the database, such as parallelism, extended SQL for OLAP, physical partitioning of data, advanced indexing options, support for massively parallel processes, aggregate processing, etc.
Following are some of the main requirements for a relational database to be used to host an enterprise data warehouse:
Representative RDBMSs for Large Data Warehouses:
- Scalability to support very large databases (terabytes) and large numbers of concurrent end-users performing complex analysis.
- Adequate performance for ad hoc queries to any data in the database.
- High-speed query processing using symmetric multiprocessors (SMP), massively parallel processors (MPP), and/or clustered multiprocessors (NUMA).
- Integration with local and central metadata repositories.
- Integration with data extraction and transformation tools.
- Integration with multidimensional databases (OLAP servers).
- Integration with business intelligence tools.
- Supported by large number of third-party tools.
- Support of open systems standards, e.g., SQL, ODBC, OLE DB.
- Support for star join and multidimensional extensions to SQL to support OLAP calculations, variances, moving averages, etc.
- Support for physical partitioning of data.
- Aggregate awareness.
- Portability, security, data integrity, backup/restore.
The following list has the most representative relational databases available in the market today to support the requirements of a large data warehouse:
- Oracle 7.3, 8.x Parallel Server supports shared database, not MPP. (Highlight: the latest Oracle8i)
- Microsoft SQL Server 6.5, 7.0 (SQL Server OLAP Services) ¾ SQL Server 2000 is coming with great data warehousing specific enhancements.
- IBM DB2 Universal Database for MVS, AS/400, AIX, NT, OS/2, HP/UX, Sun Solaris, SCO. DB2 Parallel Edition runs on IBM SP2
- Informix Dynamic Server, Extended Parallel Option (was XPS), Universal Data Option (was Universal Server).
- Teradata from NCR ¾ runs on NCR WorldMark 5100M (MPP)
- Sybase Adaptative Server Enterprise, Adaptative Server IQ.
- RedBrick Warehouse from Informix Software.
- CA-OpenIngress from Computer Associates.
- Hewlett-Packard AllBase/SQL.
In the future, Oracle, IBM, and Microsoft are likely to capture most of the RDBMS market, squeezing out the smaller database vendors, except at the high-end of the market.
Oracle is on most short lists due to its ubiquity and wide range of specialized functions and tools that support data warehousing. The main drawback of Oracle is that it uses a shared-disc architecture, which is not optimized for running on MPPs (except when running on a Sun Starfire 10000 processor).
Another minor disadvantage of Oracle is that it decided to promote its own metadata standard, namely the Common Warehouse Model (CWM). Due to Oracle's market predominance, this standard has become increasingly more important, and it is now supported by many other vendors, such as IBM, Unysis, Sun and Hyperion. It is now the adopted metadata model of the Object Management Group. Its worth saying that there is another competing metadata standard, more widely accepted, managed by the Open Information Model group, that has Microsoft as one of its leading members.
Oracle is definitely one of the stronger contenders as the platform of choice for enterprise data warehousing. The latest version, namely Oracle8i has come with significant improvements to support advanced data warehousing applications.
Building upon the solid base of data warehousing technologies introduced in Oracle8 (like bitmap indexes, advanced parallel query operations, and powerful partitioning capabilities), Oracle8i introduces a number of powerful new features for data warehousing to deliver scalable, high-performance query processing, unlimited growth potential in data volumes and users, and business-critical availability.
A few of these new features include materialized views, summary management, improved parallelism, a new database resource management facility, transportable tablespaces, two new methods of partitioning (hash partitioning and composite partitioning), and analytical functions, such as ranking. These features improve performance, availability, and management of the critical information resources housed in the data warehouses.
Oracle8i is available on more than eighty system platforms, including 40+ UNIX variants, Windows NT, OS/390, and Novell Netware.
Below are listed some of the main features of Oracle8i that are important for data warehousing:
- Optimized star queries.
- Resource management.
- Online index rebuilds.
- Materialized Views and Summary Tables
- Sophisticated query optimization to choose the most efficient data access path.
- Integrated indexing schemes, join methods, and summary management features.
- Dynamic Bitmapped indexes
- Function-based indexes
- Fast Full Index Scans
- Advanced join methods: Partition-wise joins and Hash joins.
- Extended SQL analytic functions such as: ranking, moving aggregates, cumulative aggregates, lag/lead expressions, period-over-period comparisons, and ratio-to-report. The new CUBE and ROLLUP operators.
- Sophisticated SQL Optimizer: dynamically determines the most efficient access paths and join methods for every query. It incorporates powerful query transformation technology that automatically rewrites queries generated by end-user tools, for efficient query execution.
- Very large database support.
- Partitioned tables and indexes.
- Scalable parallel SQL architecture: Oracle8i's parallel SQL architecture provides scalability on Symmetric Multiprocessor (SMP), massively parallel processors (MPP), and hybrid hardware platforms.
- Transportable tablespaces.
- Support for the Rolling-Window Load Scheme
- Direct-Path Load API
- Large user population support.
- Extended backup/recovery subsystem.
- Security. Highlight: Fine-Grained Access Control
- Support for data distribution and data replication
- XML support.
Oracle definitely has many data warehouse specific advanced resources. Besides the features described above offered by the data server (Oracle8i), many applications are available to be used on building and managing the warehouse and data marts.
Tool packages are available from Oracle to help on the modeling and design of the data warehouse and data marts, data movement, transformation and load, metadata creation and management, as well as applications integration. The main tools are:
Microsoft SQL Server
- Oracle Warehouse Builder - is a set of graphical tools used to design the data warehouse, extract and transform data from operational systems.
- Oracle Data Mart Designer, an easy-to-use tool for designing the data mart.
- Oracle Data Mart Builder, a graphical tool to extract and transform data from operational systems
Release 7.0 of SQL Server has enhancements that make it conducive to warehousing, such as sophisticated optimization of complex queries, improvements in parallelism, heterogeneous join capability, utilities for performance tuning, and support for large page sizes to improve the efficiency of large table scans. These and other enhancements give SQL Server the potential to support warehouse databases in excess of a terabyte.
Building on the strengths of SQL Server 6.5, SQL Server 7.0 makes designing, building, and managing data warehouses and data marts even easier. It also provides better tools for gathering data from many enterprise sources, building a powerful query environment for using that data, and distributing it across the enterprise.
Microsoft is very competitive with the Oracle data warehouse offerings. Many of the data warehouse specialized features found in the Oracle Data Server are present within MS SQL Server. Particularly, I think that SQL Server is the easiest to use, with the best graphical interface available for its administrative tools. Following are some of the features that make SQL Server 7.0 an excellent platform for data warehouses and data marts:
- Terabyte-sized database support.
- Scalability - access to central data sources, across the enterprise to departmental servers and even to laptops that use the 100 percent code-compatible Desktop edition.
- Integrated Online Analytical Processing (OLAP) services. Multi-dimensional storage and navigation support.
- Integrated Data Transformation Services (DTS). Enables you to import, export, and transform data between SQL Server 7.0 and text files as well as heterogeneous data sources supported by Open Database Connectivity (ODBC) and OLE DB.
- Data lineage audit support.
- Sophisticated replication services. SQL Server 7.0 supports data consistency in a distributed environment.
- Advanced query processor- the processor optimizes and executes typical complex queries, such as star query joins. Intraquery parallelism enables faster performance by breaking a complex single query into component parts, and then distributing the workload to multiple processors.
- High-performance utilities. Use these utilities for performance tuning, data loading, and index construction.
- Advanced Backup and Restore features. Example: differential backups, file and file group backups, point-in-time restoration.
- Cost-based analysis.
- Microsoft Repository and the Open Information Model (OIM). SQL Server, together with Microsoft Repository and the Open Information Model, coordinates metadata from SQL Server databases, OLAP Services, and third-party tools (e.g., data modeling tools, ETL tools, data cleansing tools).
- Visual Data Tools - these tools make it possible to graphically design and build data warehouse schema.
- Broad third-party tool and application support.
- Tight integration with Microsoft Office 2000. Users can analyze data in Excel
- The Microsoft Alliance for Data Warehousing ensures that users can choose from solutions designed to work together for all phases of the data warehousing process.
The Microsoft Data Warehouse Framework
Microsoft SQL Server 7.0 introduced several components and tools that improve the ability to build data warehouses. Microsoft call it "The Data Warehousing Framework".
The Data Warehousing Framework is a set of components and APIs that implement the data warehousing features of SQL Server. It provides a common interface to be used by various components seeking to build and use a data warehouse or data mart. Following is a short explanation of the 4 main components:
- Data Transformation Services (DTS) - DTS is a tool for extracting data from heterogeneous OLE DB data sources and the summarizing or aggregating of the data to build a data warehouse.
- Microsoft SQL Server 7 - This is the storage environment for the data warehouse, and also implement all related DBMS administrative functions such as: security, monitoring, tuning, controlling shared and distributed access, archiving, etc.
- Microsoft Repository - This is the central metadata repository for the datawarehouse. It consists of a set of Microsoft ActiveX interfaces and information models that define database schema and data transformations as defined by the Microsoft Data Warehousing Framework.
- Online Analytical Processing (OLAP) Support - This is the multidimensional database engine of the Microsoft framework. Microsoft SQL Server OLAP Services provides OLAP processing capabilities against heterogeneous OLE DB data sources. It has efficient algorithms for defining and building multidimensional cubes that can be referenced by applications using the OLE DB 2.0 OLAP extensions or the Microsoft ActiveX Data Objects Multidimensional extensions (ADO MD).
With superior management and scalability features, SQL Server 7.0 reduces cost of ownership while providing enormous flexibility in all levels of the enterprise. Integration with Microsoft Office 2000, Microsoft Windows NT, and Microsoft BackOffice enables users to use tools they already know. Most importantly, the interoperability with relational and non-relational data from a variety of sources protects previous technology investments and puts them to greater use.
Certainly the strongest advantage of IBM DB2 over its competitors, particularly Microsoft SQL Server, is its availability on multiple platforms that reaches from handhelds to Unix servers to mainframes. The main disadvantages are that it is some-what more difficult to use and has less availability of third-party products.
The newest version available for Windows NT and UNIX, namely DB2 7.1, presents specialized features highly recommended for large data warehouse implementations. Some of these features include: support for shared-nothing clustering, Java and SQL99 stored procedures, object-oriented database designs and precalculated summary tables. The sophistication of DB2's core database engine is matched only by Oracle Corp.'s Oracle8i.
DB2 is substantially less expensive than Oracle8i, which doesn't include an online analytical processing server, but DB2 doesn't have a few high-end features found in Oracle, such as bit-map indexes (though DB2 can construct nonpersistent bitmap indexes from its on-disk B-tree indexes on the fly). DB2 also support XML and geographical data.
Like Microsoft and Oracle, DB2 also has a visual tool to help on moving data between databases. The Data Warehouse Center included in DB2 is based on IBM's Visual Warehouse, and it lets the user define source and target databases and then extract, transform and load data between a Microsoft SQL Server 7.0 database and a DB2 7.1 database using a graphical flowchart interface.
IBM adopts the same metadata protocols as Oracle, as defined by the Object Management Group (OMG).
Like Microsoft, DB2's base product ships with an OLAP server, namely DB2 OLAP Server. Actually, this is the Essbase 5 OLAP engine licensed from Hyperion. This is not the full-blown Hyperion multidimensional database, but primarily Essbase with a relational data storage. It has severe limitations compared to Essbase OLAP Server.
The greatest advantage by far of IBM over the other 2 vendors is its ability to run on both symmetric multiprocessors (SMP) and massively parallel processors (MPP) architectures with the DB2 Parallel Edition Option.
The 3 relational database products analyzed in this document, Oracle Server, Microsoft SQL Server and IBM DB2 are very robust and strong data warehouse platforms.
Oracle and IBM have the most sophisticated core database engines. IBM is some-what-less user friendly and lacks the support of as many third-party vendors as Microsoft and Oracle. Microsoft's user-friendly graphical tools and strong integration with the MS Office package is an advantage. Microsoft SQL server has also the best OLAP engine (Plato) integrated with the base database server (although it still lacks front-end functionality). Oracle has the overall most sophisticated and comprehensive product, but it is probably the most expensive as well.
I think that any of the three relational database products could be a good. Besides the technical features, you will have to consider other criteria such as leveraging existing systems and available resources to support a new database environment.
Rodolfo Reopell, Cornerstone Solutions, Inc.