Introduction to Data Warehousing
Data warehousing and business intelligence are key enablers in the drive to harness management information and improve decision-making. They underpin the recent wave of interest in data mining and customer relationship management, and create opportunities for performance improvements based on quantitative analysis at every level in an organisation.
A D V E R T I S E M E N T
A data warehouse is a copy of transaction data specifically structured for querying and reporting.
According to Ralph Kimball, a data warehouse is "a copy of transaction data specifically structured for query and analysis."
Sometimes non-transaction data are stored in a data warehouse - though probably 95-99% of the data usually are transaction data. The main output from data warehouse systems are either tabular listings (queries) with minimal formatting or highly formatted "formal" reports. Queries and reports generated from data stored in a data warehouse may or may not be used for analysis.
The form of the stored data has nothing to do with whether something is a data warehouse.
A data warehouse can be normalized or denormalized. It can be a relational database, multidimensional database, flat file, hierarchical database, object database, etc. Data warehouse data often gets changed. And data warehouses often focus on a specific activity or entity.
Data warehousing is not necessarily for the needs of "decision makers" or used in the process of decision making. The overwhelming uses of data warehouses are for quite mundane, non-decision making purposes rather than for grist for making decisions with wide ranging effects. In fact, most of data warehouses are used for post-decision monitoring of the effects of decisions – or, as some people might say, for "operational" issues. By the way, this is not saying that using data warehousing in the decision making process is not a wonderful, potentially high return effort.
Data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis.
This definition of the data warehouse focuses on data storage. However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition for data warehousing includes business intelligence tools, tools to extract, transform, and load data into the repository, and tools to manage and retrieve metadata.
In contrast to data warehouses are operational databases that support day-to-day transaction processing.
The concept of data warehousing dates back to the late 1980s when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In essence, the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments. The concept attempted to address the various problems associated with this flow - mainly, the high costs associated with it. In the absence of a data warehousing architecture, an enormous amount of redundancy was required to support multiple decision support environments. In larger corporations it was typical for multiple decision support environments to operate independently. Each environment served different users but often required much of the same data. The process of gathering, cleaning and integrating data from various sources, usually long existing operational systems (usually referred to as legacy systems), was typically in part replicated for each environment. Moreover, the operational systems were frequently reexamined as new decision support requirements emerged. Often new requirements necessitated gathering, cleaning and integrating new data from the operational systems that were logically related to prior gathered data.
Based on analogies with real-life warehouses, data warehouses were intended as large-scale collection/storage/staging areas for corporate data. Data could be retrieved from one central point or data could be distributed to "retail stores" or "data marts" that were tailored for ready access by users.