How To Rate a Dimensional Data Warehouse
Ralph Kimball has proposed 20 criteria for what makes a successful dimensional data warehouse. The 20 criteria are divided into three broad groups: architecture, administration, and expression. In most cases, it is fairly clear why a criterion belongs to a particular group. The architecture criteria are fundamental characteristics of the overall system that are not merely “features” but are central to the way the whole system is organized. Architecture criteria usually extend from the back room, through the DBMS, all the way to the front room and the user’s desktop. Administration criteria are more tactical than architecture criteria, but have been chosen because they are “show stoppers” if they are missing from a dimensionally oriented data warehouse. Administration criteria generally affect the IT personnel who build and maintain the data warehouse. Expression criteria are mostly analytic capabilities that are needed in real-life situations. The end-user community experiences all expression criteria directly.
Explicit Declaration: The system provides explicit database declarations that distinguish a dimensional entity from a measurement (fact) entity. These declarations are stored in the system metadata. The declarations are visible to administrators and end users and affect query strategy, query performance, grouping logic, and physical storage. Facts can be declared as fully additive, semi-additive, and nonadditive. Default (automatic) aggregation techniques other than summation can be associated with facts. The default association between dimensions and facts is declared in the metadata so that the user can omit specifying the link between them. A dimension attribute included in a query is automatically the basis of a dynamic aggregation. A fact included in a query is by default summed within the context of all aggregations. Semi-additive facts and nonadditive facts are prohibited from being summed across the wrong dimensions.
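As a rough illustration of how declared additivity could drive grouping logic, the following Python sketch stores a declaration per fact and refuses to sum across the wrong dimensions. The metadata layout, fact names, and the `sum_is_allowed` helper are hypothetical and are not taken from any particular product.

```python
from enum import Enum


class Additivity(Enum):
    FULLY_ADDITIVE = "fully additive"
    SEMI_ADDITIVE = "semi-additive"
    NON_ADDITIVE = "nonadditive"


# Hypothetical metadata: each fact declares its additivity, its default
# aggregation, and the dimensions it must not be summed across.
FACT_METADATA = {
    "sales_amount": {
        "additivity": Additivity.FULLY_ADDITIVE,
        "default_agg": "sum",
        "no_sum_across": set(),
    },
    "account_balance": {
        "additivity": Additivity.SEMI_ADDITIVE,
        "default_agg": "sum",
        "no_sum_across": {"date"},  # balances do not add up over time
    },
    "unit_price": {
        "additivity": Additivity.NON_ADDITIVE,
        "default_agg": "avg",
        "no_sum_across": set(),  # unused: nonadditive facts never sum
    },
}


def sum_is_allowed(fact, collapsed_dimensions):
    """Return True if summing `fact` while collapsing these dimensions is valid."""
    meta = FACT_METADATA[fact]
    if meta["additivity"] is Additivity.FULLY_ADDITIVE:
        return True
    if meta["additivity"] is Additivity.NON_ADDITIVE:
        # A nonadditive fact may not be summed across any dimension.
        return not collapsed_dimensions
    # Semi-additive: summation is fine unless a prohibited dimension collapses.
    return not (meta["no_sum_across"] & set(collapsed_dimensions))
```

A query tool built on such metadata could call `sum_is_allowed("account_balance", {"date"})` before generating a summation and fall back to another aggregation when the answer is False.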
Conformed Dimensions and Facts: The system uses conformed dimensions and facts to implement drill-across queries where answer sets from different databases, different locations, and possibly different technologies can be combined into a higher-level answer set by matching on the row headers supplied by the conformed dimensions. The system detects and warns against the attempted uses of unconformed facts. This is the most fundamental and profound architecture criterion. It is the basis for implementing distributed data warehouses, and especially Webhouses, consisting of far-flung organizations (with no center) sharing data over the Web.
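A drill-across query can be pictured as an outer join of separately computed answer sets on their shared row headers. The sketch below is a minimal Python illustration. The `drill_across` function and the sample data are made up for the example, and it assumes both answer sets were grouped by the same conformed attributes (quarter and product).

```python
from collections import defaultdict


def drill_across(*answer_sets):
    """Merge answer sets keyed by conformed row headers (an outer join)."""
    merged = defaultdict(dict)
    for answer_set in answer_sets:
        for row_header, facts in answer_set.items():
            merged[row_header].update(facts)
    return dict(merged)


# Two answer sets from different (hypothetical) data marts, both grouped
# by the conformed attributes (quarter, product).
sales = {
    ("2024-Q1", "Widgets"): {"sales_amount": 1200.0},
    ("2024-Q1", "Gadgets"): {"sales_amount": 800.0},
}
shipments = {
    ("2024-Q1", "Widgets"): {"units_shipped": 95},
}

combined = drill_across(sales, shipments)
```

The merge only makes sense because the row headers mean the same thing in both marts. If "Widgets" were defined differently in each source, the combined row would be misleading, which is why the dimensions have to be conformed.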
Dimensional Integrity: The system guarantees that the dimensions and the facts maintain referential integrity. In particular, a fact may not exist unless it is in a valid framework of all its dimensions. However, a dimensional entry may exist without any corresponding facts.
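The rule can be demonstrated with an ordinary foreign key constraint. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table and column names are invented for the example, and a production warehouse might enforce the same rule in its load process instead of in the DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    """CREATE TABLE sales_fact (
           product_key INTEGER NOT NULL REFERENCES product_dim(product_key),
           sales_amount REAL)"""
)

conn.execute("INSERT INTO product_dim VALUES (1, 'Widget')")
# A dimension entry with no corresponding facts is allowed.
conn.execute("INSERT INTO product_dim VALUES (2, 'Gadget')")
conn.execute("INSERT INTO sales_fact VALUES (1, 19.99)")


def try_insert_fact(product_key, amount):
    """Return True if the fact row was accepted, False if integrity failed."""
    try:
        conn.execute("INSERT INTO sales_fact VALUES (?, ?)", (product_key, amount))
        return True
    except sqlite3.IntegrityError:
        return False


# Product key 99 does not exist in the dimension, so the fact is rejected.
orphan_accepted = try_insert_fact(99, 5.00)
```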
Open Aggregate Navigation: The system uses physically stored aggregates as a way to enhance performance of common queries. These aggregates, like indexes, are chosen silently by the database if they are physically present. End users and application developers do not need to know what aggregates are available at any point in time, and applications are not required to explicitly code the name of an aggregate. All query processes accessing the data, even those from different application vendors, realize the full benefit of aggregate navigation.
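One way to picture aggregate navigation is a lookup that silently redirects a query to the smallest table able to answer it. The following Python sketch assumes a hypothetical catalog of tables, each listed with the dimension attributes it carries and an approximate row count. Real navigators work on SQL and metadata, not on Python sets.

```python
# Hypothetical catalog: table name -> (dimension attributes available, row count).
AGGREGATE_CATALOG = {
    "sales_fact": ({"day", "month", "product", "category", "store"}, 10_000_000),
    "sales_by_month_category": ({"month", "category"}, 2_400),
    "sales_by_month_product": ({"month", "product", "category"}, 120_000),
}


def choose_table(requested_attributes):
    """Return the smallest table whose attributes cover the query."""
    requested = set(requested_attributes)
    candidates = [
        (rows, name)
        for name, (attrs, rows) in AGGREGATE_CATALOG.items()
        if requested <= attrs
    ]
    if not candidates:
        raise LookupError("no table can answer this query")
    # Tuples compare by row count first, so min() picks the smallest table.
    return min(candidates)[1]
```

The application only names the attributes it needs. If an aggregate is dropped from the catalog, the same request falls back to a larger table without any change to the application.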
Dimensional Symmetry: All dimensions allow comparison calculations that constrain two or more disjoint values of a single attribute from a dimension in computations such as ratios or differences. Also, the underlying database engine supports an indexing scheme that allows a single indexing strategy to efficiently support query constraints on an arbitrary and unpredictable subset of the dimensions in a highly dimensional database.
Dimensional Scalability: The system places no fundamental constraints on either the number of members or the number of attributes within a single dimension. Dimensions with 100 million members or 1,000 textual attributes are practical. Dimensions with a billion members are possible.
Sparsity Tolerance: Any single measurement can exist within a space of many dimensions, which can be viewed as extraordinarily sparse. The system imposes no practical limit on the degree of sparsity. A 20-dimensional database, each of whose dimensions has a million or more members, is practical.
Graceful Modification: The system must allow the following modifications to be made in place without dropping or reloading the primary database: a) adding an attribute to a dimension; b) adding a new kind of fact to a measurement set, possibly beginning at a specific point in time; c) adding a whole new dimension to a set of existing measurements; and d) splitting an existing dimension into two or more new dimensions.
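Modifications a) through c) can be sketched with in-place `ALTER TABLE` statements. The example below uses Python's built-in `sqlite3` module and invented table names. It shows the idea only and does not describe how any particular warehouse platform behaves at scale. Splitting a dimension (case d) is omitted because it needs more than a single statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE sales_fact (product_key INTEGER, sales_amount REAL)")
conn.execute("INSERT INTO product_dim VALUES (1, 'Widget')")
conn.execute("INSERT INTO sales_fact VALUES (1, 19.99)")

# a) Add an attribute to a dimension, in place.
conn.execute("ALTER TABLE product_dim ADD COLUMN package_type TEXT DEFAULT 'Unknown'")

# b) Add a new kind of fact; rows loaded before the start date hold NULL.
conn.execute("ALTER TABLE sales_fact ADD COLUMN discount_amount REAL")

# c) Add a whole new dimension; existing facts point at a
#    'Not applicable' member instead of being reloaded.
conn.execute("CREATE TABLE promotion_dim (promotion_key INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO promotion_dim VALUES (0, 'Not applicable')")
conn.execute("ALTER TABLE sales_fact ADD COLUMN promotion_key INTEGER DEFAULT 0")

row = conn.execute(
    "SELECT product_key, sales_amount, discount_amount, promotion_key FROM sales_fact"
).fetchone()
package_type = conn.execute("SELECT package_type FROM product_dim").fetchone()[0]
```

The fact row loaded before the changes is still present and still queryable, with a NULL discount and the "Not applicable" promotion key.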
Dimensional Replication: The system supports the explicit replication of a conformed dimension outward from a dimension authority to all the client data marts, in such a way that we can only perform drill-across queries on data marts if they have consistent versions of the dimensions. Aggregates that are affected by changes to the content of a dimension are automatically taken offline in each client data mart until we can make them consistent with the revised dimension and the base fact table.
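The consistency condition for drill-across can be reduced to a version comparison. The Python sketch below assumes, purely for illustration, that each data mart records the version number of every replicated dimension it holds. The `can_drill_across` helper and the mart records are hypothetical.

```python
def can_drill_across(marts, dimension):
    """Allow drill-across only if every mart holds the same dimension version."""
    versions = {mart["dimension_versions"][dimension] for mart in marts}
    return len(versions) == 1


# Hypothetical marts: the shipments mart has not yet received version 8
# of the product dimension from the dimension authority.
sales_mart = {"name": "sales", "dimension_versions": {"product": 8, "date": 3}}
shipments_mart = {"name": "shipments", "dimension_versions": {"product": 7, "date": 3}}

product_ok = can_drill_across([sales_mart, shipments_mart], "product")
date_ok = can_drill_across([sales_mart, shipments_mart], "date")
```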
Dimension Notification: The system delivers upon request all the records from a production source of a dimension that have changed since the last such request. In addition, a reason code is supplied with this dimension notification that allows the data warehouse to distinguish between Type 1 and Type 3 slowly changing dimensions (overwrites) and Type 2 slowly changing dimensions (true physical changes at a point in time).
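A dimension notification feed might look like the following Python sketch. The `ChangeNotice` record, the reason codes `"CORRECTION"` and `"TRUE_CHANGE"`, and the helper names are assumptions made for the example. The source only requires that some reason code separates overwrites from true changes.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ChangeNotice:
    natural_key: str
    attribute: str
    new_value: str
    changed_on: date
    reason_code: str  # hypothetical codes: "CORRECTION" or "TRUE_CHANGE"


def changes_since(notices, last_request):
    """Return only the records changed after the previous request."""
    return [n for n in notices if n.changed_on > last_request]


def handling_for(notice):
    """Map the reason code to an overwrite (Type 1 or 3) or a new row (Type 2)."""
    return "new_row" if notice.reason_code == "TRUE_CHANGE" else "overwrite"


notices = [
    ChangeNotice("CUST-7", "name", "Jon Smith", date(2024, 3, 1), "CORRECTION"),
    ChangeNotice("CUST-7", "city", "Boston", date(2024, 3, 5), "TRUE_CHANGE"),
    ChangeNotice("CUST-9", "city", "Austin", date(2024, 1, 10), "TRUE_CHANGE"),
]

pending = changes_since(notices, date(2024, 2, 1))
actions = [handling_for(n) for n in pending]
```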
Surrogate Key Administration: The system implements a surrogate key pipeline process for: a) assigning new keys when the system encounters a Type 2 slowly changing dimension; and b) replacing the natural keys in a fact table record with the correct surrogate keys before loading into the fact table. In other words, the cardinality of a dimension can be made independent from the definition of the original production key. Surrogate keys, by definition, must have no semantics or ordering that makes their individual values relevant to an application. Surrogate keys must support not-applicable, nonexistent, and corrupted measurement data. A surrogate key may not be visible to an end-user application.
International Consistency: The system supports the administration of international language versions of dimensions by guaranteeing that a translated dimension possesses the same grouping cardinality as the original dimension. The system supports the UNICODE character set, as well as all common international numerical punctuation and formatting alternatives. Incompatible, language-specific collating sequences are allowed.
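The two pipeline steps can be sketched in a few lines of Python. The `SurrogateKeyPipeline` class, the reserved key 0 for unknown or not-applicable members, and the column naming are all assumptions for the example. Sequential integers are used only because they are easy to generate, and applications should not rely on their values or order.

```python
import itertools

UNKNOWN_KEY = 0  # hypothetical reserved key for not-applicable or unmatched data


class SurrogateKeyPipeline:
    def __init__(self):
        self._next_key = itertools.count(1)
        self._current = {}  # natural key -> current surrogate key

    def assign_new_key(self, natural_key):
        """Step a: issue a fresh key for a new member or a Type 2 change."""
        key = next(self._next_key)
        self._current[natural_key] = key
        return key

    def to_surrogate(self, fact_row, column):
        """Step b: swap the natural key in a fact row for the surrogate key."""
        row = dict(fact_row)
        natural_key = row.pop(column)
        row[column + "_key"] = self._current.get(natural_key, UNKNOWN_KEY)
        return row


pipeline = SurrogateKeyPipeline()
first_key = pipeline.assign_new_key("SKU-1")
second_key = pipeline.assign_new_key("SKU-1")  # Type 2 change: same SKU, new key

loaded = pipeline.to_surrogate({"product": "SKU-1", "amount": 5}, "product")
unmatched = pipeline.to_surrogate({"product": "SKU-404", "amount": 2}, "product")
```

Because "SKU-1" now maps to two surrogate keys over time, the dimension can hold more rows than the production system has natural keys, which is what makes its cardinality independent of the production key.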
The remaining eight criteria are the expression criteria:
- Multiple-dimension hierarchies
- Ragged-dimension hierarchies
- Multiple valued dimensions
- Slowly changing dimensions
- Roles of a dimension
- Hot-swappable dimensions
- On-the-fly fact range dimensions
- On-the-fly behavior dimensions
These criteria are deliberately tough, and to date no data warehouse can be rated 20 out of 20. But that is the value of a tough rating system: it shows where there is room to improve the quality of the data warehouse.