Data Warehousing Process
A Data Warehouse is not an individual repository product. Rather, it is an overall strategy, or process, for building decision support systems and a knowledge-based applications architecture and environment that supports both everyday tactical decision making and long-term business strategizing. The Data Warehouse environment positions a business to utilize an enterprise-wide data store to link information from diverse sources and make the information accessible for a variety of user purposes, most notably, strategic analysis. Business analysts must be able to use the Warehouse for such strategic purposes as trend identification, forecasting, competitive analysis, and targeted market research.
A D V E R T I S E M E N T
Data Warehouses and Data Warehouse applications are designed primarily to support executives, senior managers, and business analysts in making complex business decisions. Data Warehouse applications provide the business community with access to accurate, consolidated information from various internal and external sources.
The primary objective of Data Warehousing is to bring together information from disparate sources and put the information into a format that is conducive to making business decisions. This objective necessitates a set of activities that are far more complex than just collecting data and reporting against it. Data Warehousing requires both business and technical expertise and involves the following activities:
- Accurately identifying the business information that must be contained in the Warehouse
- Identifying and prioritizing subject areas to be included in the Data Warehouse
- Managing the scope of each subject area which will be implemented into the Warehouse on an iterative basis
- Developing a scaleable architecture to serve as the Warehouse’s technical and application foundation, and identifying and selecting the hardware/software/middleware components to implement it
- Extracting, cleansing, aggregating, transforming and validating the data to ensure accuracy and consistency
- Defining the correct level of summarization to support business decision making
- Establishing a refresh program that is consistent with business needs, timing and cycles
- Providing user-friendly, powerful tools at the desktop to access the data in the Warehouse
- Educating the business community about the realm of possibilities that are available to them through Data Warehousing
- Establishing a Data Warehouse Help Desk and training users to effectively utilize the desktop tools
- Establishing processes for maintaining, enhancing, and ensuring the ongoing success and applicability of the Warehouse
Until the advent of Data Warehouses, enterprise databases were expected to serve multiple purposes, including online transaction processing, batch processing, reporting, and analytical processing. In most cases, the primary focus of computing resources was on satisfying operational needs and requirements. Information reporting and analysis needs were secondary considerations. As the use of PCs, relational databases, 4GL technology and end-user computing grew and changed the complexion of information processing, more and more business users demanded that their needs for information be addressed. Data Warehousing has evolved to meet those needs without disrupting operational processing.
In the Data Warehouse model, operational databases are not accessed directly to perform information processing. Rather, they act as the source of data for the Data Warehouse, which is the information repository and point of access for information processing. There are sound reasons for separating operational and informational databases, as described below.
- The users of informational and operational data are different. Users of informational data are generally managers and analysts; users of operational data tend to be clerical, operational and administrative staff.
- Operational data differs from informational data in context and currency. Informational data contains an historical perspective that is not generally used by operational systems.
- The technology used for operational processing frequently differs from the technology required to support informational needs.
- The processing characteristics for the operational environment and the informational environment are fundamentally different.
The Data Warehouse functions as a Decision Support System (DSS) and an Executive Information System (EIS), meaning that it supports informational and analytical needs by providing integrated and transformed enterprise-wide historical data from which to do management analysis. A variety of sophisticated tools are readily available in the marketplace to provide user-friendly access to the information stored in the Data Warehouse.
Data Warehouses can be defined as subject-oriented, integrated, time-variant, non-volatile collections of data used to support analytical decision making. The data in the Warehouse comes from the operational environment and external sources. Data Warehouses are physically separated from operational systems, even though the operational systems feed the Warehouse with source data.
Data Warehouses are designed around the major subject areas of the enterprise; the operational environment is designed around applications and functions. This difference in orientation (data vs. process) is evident in the content of the database. Data Warehouses do not contain information that will not be used for informational or analytical processing; operational databases contain detailed data that is needed to satisfy processing requirements but which has no relevance to management or analysis.
Integration and Transformation
The data within the Data Warehouse is integrated. This means that there is consistency among naming conventions, measurements of variables, encoding structures, physical attributes, and other salient data characteristics. An example of this integration is the treatment of codes such as gender codes. Within a single corporation, various applications may represent gender codes in different ways: male vs. female, m vs. f, and 1 vs. 0, etc. In the Data Warehouse, gender is always represented in a consistent way, regardless of the many ways by which it may be encoded and stored in the source data. As the data is moved to the Warehouse, it is transformed into a consistent representation as required.
All data in Data Warehouse is accurate as of some moment in time, providing an historical perspective. This differs from the operational environment in which data is intended to be accurate as of the moment of access. The data in the Data Warehouse is, in effect, a series of snapshots. Once the data is loaded into the enterprise data store and data marts, it cannot be updated. It is refreshed on a periodic basis, as determined by the business need. The operational data store, if included in the Warehouse architecture, may be updated.
Data in the Warehouse is static, not dynamic. The only operations that occur in Data Warehouse applications are the initial loading of data, access of data, and refresh of data. For these reasons, the physical design of a Data Warehouse optimizes the access of data, rather than focusing on the requirements of data update and delete processing.
Data Warehouse Configurations
A Data Warehouse configuration, also known as the logical architecture, includes the following components:
- one Enterprise Data Store (EDS) - a central repository which supplies atomic (detail level) integrated information to the whole organization.
- one Operational Data Store - a "snapshot" of a moment in time's enterprise-wide data
- One or more individual Data Mart(s) - summarized subset of the enterprise's data specific to a functional area or department, geographical region, or time period
- one or more Metadata Store(s) or Repository(ies) - catalog(s) of reference information about the primary data. Metadata is divided into two categories: information for technical use, and information for business end-users.
The EDS is the cornerstone of the Data Warehouse. It can be accessed for both immediate informational needs and for analytical processing in support of strategic decision making, and can be used for drill-down support for the Data Marts which contain only summarized data. It is fed by the existing subject area operational systems and may also contain data from external sources. The EDS in turn feeds individual Data Marts that are accessed by end-user query tools at the user's desktop. It is used to consolidate related data from multiple sources into a single source, while the Data Marts are used to physically distribute the consolidated data into logical categories of data, such as business functional departments or geographical regions. The EDS is a collection of daily "snapshots" of enterprise-wide data taken over an extended time period, and thus retains and makes available for tracking purposes the history of changes to a given data element over time. This creates an optimum environment for strategic analysis. However, access to the EDS can be slow, due to the volume of data it contains, which is a good reason for using Data Marts to filter, condense and summarize information for specific business areas. In the absence of the Data Mart layer, users can access the EDS directly.
Metadata is "data about data," a catalog of information about the primary data that defines access to the Warehouse. It is the key to providing users and developers with a road map to the information in the Warehouse. Metadata comes in two different forms: end-user and transformational. End-user metadata serves a business purpose; it translates a cryptic name code that represents a data element into a meaningful description of the data element so that end-users can recognize and use the data. For example, metadata would clarify that the data element "ACCT_CD" represents "Account Code for Small Business." Transformational metadata serves a technical purpose for development and maintenance of the Warehouse. It maps the data element from its source system to the Data Warehouse, identifying it by source field name, destination field code, transformation routine, business rules for usage and derivation, format, key, size, index and other relevant transformational and structural information. Each type of metadata is kept in one or more repositories that service the Enterprise Data Store.
While an Enterprise Data Store and Metadata Store(s) are always included in a sound Data Warehouse design, the specific number of Data Marts (if any) and the need for an Operational Data Store are judgment calls. Potential Data Warehouse configurations should be evaluated and a logical architecture determined according to business requirements.
Data Warehousing Process
Data Warehouse Process does not encompass the analysis and identification of organizational value streams, strategic initiatives, and related business goals, but it is a prescription for achieving such goals through a specific architecture. The Process is conducted in an iterative fashion after the initial business requirements and architectural foundations have been developed with the emphasis on populating the Data Warehouse with "chunks" of functional subject-area information each iteration. The Process guides the development team through identifying the business requirements, developing the business plan and Warehouse solution to business requirements, and implementing the configuration, technical, and application architecture for the overall Data Warehouse. It then specifies the iterative activities for the cyclical planning, design, construction, and deployment of each population project. The following is a description of each stage in the Data Warehouse Process.
Business Case Development
A variety of kinds of strategic analysis, including Value Stream Assessment, have likely already been done by the customer organization at the point when it is necessary to develop a Business Case. The Business Case Development stage launches the Data Warehouse development in response to previously identified strategic business initiatives and "predator" (key) value streams of the organization. The organization will likely have identified more than one important value stream. In the long term it is possible to implement Data Warehouse solutions that address multiple value streams, but it is the predator value stream or highest priority strategic initiative that usually becomes the focus of the short-term strategy and first run population projects resulting in a Data Warehouse.
At the conclusion of the relevant business reengineering, strategic visioning, and/or value stream assessment activities conducted by the organization, a Business Case can be built to justify the use of the Data Warehouse architecture and implementation approach to solve key business issues directed at the most important goals. The Business Case defines the outlying activities, costs, benefits, and critical success factors for a multi-generation implementation plan that results in a Data Warehouse framework of an information storage/access system. The Warehouse is an iterative designed/developed/refined solution to the tactical and strategic business requirements. The Business Case addresses both the short-term and long-term Warehouse strategies (how multiple data stores will work together to fulfill primary and secondary business goals) and identifies both immediate and extended costs so that the organization is better able to plan its short and long-term budget appropriation.
Business Question Assessment
Once a Business Case has been developed, the short-term strategy for implementing the Data Warehouse is mapped out by means of the Business Question Assessment (BQA) stage. The purpose of BQA is to:
- Establish the scope of the Warehouse and its intended use
- Define and prioritize the business requirements and the subsequent information (data) needs the Warehouse will address
- Identify the business directions and objectives that may influence the required data and application architectures
- Determine which business subject areas provide the most needed information; prioritize and sequence implementation projects accordingly
- Drive out the logical data model that will direct the physical implementation model
- Measure the quality, availability, and related costs of needed source data at a high level
- Define the iterative population projects based on business needs and data validation
The prioritized predator value stream or most important strategic initiative is analyzed to determine the specific business questions that need to be answered through a Warehouse implementation. Each business question is assessed to determine its overall importance to the organization, and a high-level analysis of the data needed to provide the answers is undertaken. The data is assessed for quality, availability, and cost associated with bringing it into the Data Warehouse. The business questions are then revisited and prioritized based upon their relative importance and the cost and feasibility of acquiring the associated data. The prioritized list of business questions is used to determine the scope of the first and subsequent iterations of the Data Warehouse, in the form of population projects. Iteration scoping is dependent on source data acquisition issues and is guided by determining how many business questions can be answered in a three to six month implementation time frame. A "business question" is a question deemed by the business to provide useful information in determining strategic direction. A business question can be answered through objective analysis of the data that is available.
Architecture Review and Design
The Architecture is the logical and physical foundation on which the Data Warehouse will be built. The Architecture Review and Design stage, as the name implies, is both a requirements analysis and a gap analysis activity. It is important to assess what pieces of the architecture already exist in the organization (and in what form) and to assess what pieces are missing which are needed to build the complete Data Warehouse architecture.
During the Architecture Review and Design stage, the logical Data Warehouse architecture is developed. The logical architecture is a configuration map of the necessary data stores that make up the Warehouse; it includes a central Enterprise Data Store, an optional Operational Data Store, one or more (optional) individual business area Data Marts, and one or more Metadata stores. In the metadata store(s) are two different kinds of metadata that catalog reference information about the primary data.
Once the logical configuration is defined, the Data, Application, Technical and Support Architectures are designed to physically implement it. Requirements of these four architectures are carefully analyzed so that the Data Warehouse can be optimized to serve the users. Gap analysis is conducted to determine which components of each architecture already exist in the organization and can be reused, and which components must be developed (or purchased) and configured for the Data Warehouse.
The Data Architecture organizes the sources and stores of business information and defines the quality and management standards for data and metadata.
The Application Architecture is the software framework that guides the overall implementation of business functionality within the Warehouse environment; it controls the movement of data from source to user, including the functions of data extraction, data cleansing, data transformation, data loading, data refresh, and data access (reporting, querying).
The Technical Architecture provides the underlying computing infrastructure that enables the data and application architectures. It includes platform/server, network, communications and connectivity hardware/software/middleware, DBMS, client/server 2-tier vs.3-tier approach, and end-user workstation hardware/software. Technical architecture design must address the requirements of scalability, capacity and volume handling (including sizing and partitioning of tables), performance, availability, stability, chargeback, and security.
The Support Architecture includes the software components (e.g., tools and structures for backup/recovery, disaster recovery, performance monitoring, reliability/stability compliance reporting, data archiving, and version control/configuration management) and organizational functions necessary to effectively manage the technology investment.
Architecture Review and Design applies to the long-term strategy for development and refinement of the overall Data Warehouse, and is not conducted merely for a single iteration. This stage develops the blueprint of an encompassing data and technical structure, software application configuration, and organizational support structure for the Warehouse. It forms a foundation that drives the iterative Detail Design activities. Where Design tells you what to do; Architecture Review and Design tells you what pieces you need in order to do it.
The Architecture Review and Design stage can be conducted as a separate project that runs mostly in parallel with the Business Question Assessment stage. For the technical, data, application and support infrastructure that enables and supports the storage and access of information is generally independent from the business requirements of which data is needed to drive the Warehouse. However, the data architecture is dependent on receiving input from certain BQA activities (data source system identification and data modeling), so the BQA stage must conclude before the Architecture stage can conclude.
The Architecture will be developed based on the organization's long-term Data Warehouse strategy, so that future iterations of the Warehouse will have been provided for and will fit within the overall architecture.
The purpose of this stage is to identify the candidate tools for developing and implementing the Data Warehouse data and application architectures, and for performing technical and support architecture functions where appropriate. Select the candidate tools that best meet the business and technical requirements as defined by the Data Warehouse architecture, and recommend the selections to the customer organization. Procure the tools upon approval from the organization.
It is important to note that the process of selecting tools is often dependent on the existing technical infrastructure of the organization. Many organizations feel strongly for various reasons about using tools for the Data Warehouse applications that they already have in their "arsenal" and are reluctant to purchase new application packages. It is recommended that a thorough evaluation of existing tools and the feasibility of their reuse be done in the context of all tool evaluation activities. In some cases, existing tools can be form-fitted to the Data Warehouse; in other cases, the customer organization may need to be convinced that new tools would better serve their needs.
It may even be feasible that this series of activities is skipped altogether, if the organization is insistent that particular tools be used (no room for negotiation), or if tools have already been assessed and selected in anticipation of the Data Warehouse project.
Tools may be categorized according to the following data, technical, application, or support functions:
- Source Data Extraction and Transformation
- Data Cleansing
- Data Load
- Data Refresh
- Data Access
- Security Enforcement
- Version Control/Configuration Management
- Backup and Recovery
- Disaster Recovery
- Performance Monitoring
- Database Management
- Data Modeling
- Metadata Management
Iteration Project Planning
The Data Warehouse is implemented (populated) one subject area at a time, driven by specific business questions to be answered by each implementation cycle. The first and subsequent implementation cycles of the Data Warehouse are determined during the BQA stage. At this point in the Process the first (or next if not first) subject area implementation project is planned. The business requirements discovered in BQA and, to a lesser extent, the technical requirements of the Architecture Design stage are now refined through user interviews and focus sessions to the subject area level. The results are further analyzed to yield the detail needed to design and implement a single population project, whether initial or follow-on. The Data Warehouse project team is expanded to include the members needed to construct and deploy the Warehouse, and a detailed work plan for the design and implementation of the iteration project is developed and presented to the customer organization for approval.
In the Detail Design stage, the physical Data Warehouse model (database schema) is developed, the metadata is defined, and the source data inventory is updated and expanded to include all of the necessary information needed for the subject area implementation project, and is validated with users. Finally, the detailed design of all procedures for the implementation project is completed and documented. Procedures to achieve the following activities are designed:
- Warehouse Capacity Growth
- Data Extraction/Transformation/Cleansing
- Data Load
- Data Refresh
- Data Access
- Backup and Recovery
- Disaster Recovery
- Data Archiving
- Configuration Management
- Transition to Production
- User Training
- Help Desk
- Change Management
Once the Planning and Design stages are complete, the project to implement the current Data Warehouse iteration can proceed quickly. Necessary hardware, software and middleware components are purchased and installed, the development and test environment is established, and the configuration management processes are implemented. Programs are developed to extract, cleanse, transform and load the source data and to periodically refresh the existing data in the Warehouse, and the programs are individually unit tested against a test database with sample source data. Metrics are captured for the load process. The metadata repository is loaded with transformational and business user metadata. Canned production reports are developed and sample ad-hoc queries are run against the test database, and the validity of the output is measured. User access to the data in the Warehouse is established. Once the programs have been developed and unit tested and the components are in place, system functionality and user acceptance testing is conducted for the complete integrated Data Warehouse system. System support processes of database security, system backup and recovery, system disaster recovery, and data archiving are implemented and tested as the system is prepared for deployment. The final step is to conduct the Production Readiness Review prior to transitioning the Data Warehouse system into production. During this review, the system is evaluated for acceptance by the customer organization.
Transition to Production
The Transition to Production stage moves the Data Warehouse development project into the production environment. The production database is created, and the extraction/cleanse/transformation routines are run on the operations system source data. The development team works with the Operations staff to perform the initial load of this data to the Warehouse and execute the first refresh cycle. The Operations staff is trained, and the Data Warehouse programs and processes are moved into the production libraries and catalogs. Rollout presentations and tool demonstrations are given to the entire customer community, and end-user training is scheduled and conducted. The Help Desk is established and put into operation. A Service Level Agreement is developed and approved by the customer organization. Finally, the new system is positioned for ongoing maintenance through the establishment of a Change Management Board and the implementation of change control procedures for future development cycles.