Building a Data Warehouse
Building a data warehouse involves several complicated steps, and populating it can take considerable time if the load routines aren't designed carefully. But the effort is worth it.
A data warehouse, of course, isn't actually an application in itself, but an enabling tool that lets select individuals develop high-powered forecasting and performance-monitoring functions. Think of it as a sophisticated application development kit that users can put directly to work. Building a data warehouse means creating a storage system that consolidates vast amounts of relevant data and stores it in a way that maximizes its convertibility into useful information. Since the whole point of a data warehouse is to consolidate data that is difficult to pull together otherwise, your design must allow for data acquisition from multiple in-house databases, from remote locations on your company WAN, and from external sources.
Building a data warehouse involves several complicated steps. After the data warehousing architect locates all the data elements necessary to support the data warehouse, it is time to build a dimensional model. As data change in transactional systems, the data warehouse needs a way of tracking and reflecting those changes. Populating fact and dimension tables can take a considerable amount of time if the population routines aren't designed carefully.
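One common way to track such changes in a dimension table is the "Type 2" slowly changing dimension: the current row is expired and a new version is inserted. A minimal sketch in Python, with illustrative field names (the in-memory list of dicts stands in for a real dimension table):

```python
from datetime import date

def apply_scd2_update(dimension_rows, product_id, new_price, today=None):
    """Track a price change by expiring the current row and adding a new one.

    dimension_rows: list of dicts with keys product_id, price,
    valid_from, valid_to (None marks the current row).
    Field names here are illustrative, not a fixed schema.
    """
    today = today or date.today()
    for row in dimension_rows:
        if row["product_id"] == product_id and row["valid_to"] is None:
            if row["price"] == new_price:
                return dimension_rows  # nothing changed; keep history as-is
            row["valid_to"] = today  # expire the old version
    dimension_rows.append({
        "product_id": product_id,
        "price": new_price,
        "valid_from": today,
        "valid_to": None,  # new current version
    })
    return dimension_rows
```

Because old rows are kept with their validity dates, facts recorded against the earlier price still join to the version of the dimension that was true at the time.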
Transform the Data for Optimum Usage
As data is extracted from various sources, it needs to be processed in several ways.
- Integration: Common fields used in data structures from different sources must be reconciled, both structurally and in content. Differing units of measurement must be converted, and fields of differing lengths and formats must be negotiated.
- Condensation: Where possible, condense data at the time of extraction and before loading into the warehouse.
- Stabilization: Determine how frequently each data element changes. The name of a product, for example, almost never changes, but its price often does, and quantity-on-hand changes continuously. Data can then be grouped by attribute according to the attribute's stability, and the power to configure data in this way has obvious design benefits.
- Normalization: A data warehouse is a high-I/O environment. When data is fully normalized, related items end up scattered across different locations, which isn't I/O efficient. To avoid this inefficiency, identify cases where the number of occurrences of a particular data item is stable enough that the related items can be stored together and retrieved with a single I/O.
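The integration and condensation steps above can be sketched as a small routine. The two source systems, their field names, and their units are hypothetical; the point is reconciling differing fields and measurements, then condensing line-level records before loading:

```python
def transform_record(raw, source):
    """Integration: reconcile records from two hypothetical source systems.

    Source 'us' reports weight in pounds under 'weight_lb' and names the
    customer field 'customer_name'; source 'eu' uses kilograms and 'cust_nm'.
    The warehouse stores kilograms and a single 'customer' field.
    """
    if source == "us":
        weight_kg = raw["weight_lb"] * 0.45359237  # convert differing measurement
        customer = raw["customer_name"].strip().upper()
    else:
        weight_kg = raw["weight_kg"]
        customer = raw["cust_nm"].strip().upper()  # differing field name reconciled
    return {"customer": customer, "weight_kg": round(weight_kg, 3)}

def condense(records):
    """Condensation: collapse line-level records to one total per customer
    at extraction time, before loading into the warehouse."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["weight_kg"]
    return totals
```

A real ETL job would do this per-field mapping from metadata rather than hard-coded branches, but the reconciliation logic is the same.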
Equip the system with superior tools
In addition to the extra storage, the data warehouse system needs to be equipped with several pieces of key software:
- Extract-Transform-Load (ETL): This software is the workhorse of a data warehouse, and it will wind up costing you more than every other component of the warehouse combined.
- Online Analytical Processing (OLAP): OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information, transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user. OLAP functionality is characterized by dynamic multidimensional analysis of consolidated enterprise data, supporting end-user analytical and navigational activities.
- Data mining and EIS: Data mining, unlike multidimensional analysis (OLAP), is intended to uncover correlations in a large volume of data from the information system in order to detect trends. It is supported by artificial-intelligence techniques such as neural networks that reveal hidden links between data.
An EIS (Executive Information System) is a tool that makes it possible to organize, analyze, and compute indicators in order to build dashboards. This type of easy-to-use tool can only handle queries that the designer has modeled in advance.
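To make the idea of multidimensional analysis concrete, a toy cube can be computed in plain Python: aggregate a measure over every combination of dimensions, including "ALL" rollups. The fact records and dimension names are illustrative assumptions; a real OLAP engine precomputes and indexes these aggregates:

```python
from itertools import product

def cube(facts, dims, measure):
    """Aggregate `measure` over every subset of `dims` (an OLAP-style cube).

    facts: list of dicts; dims: dimension field names; measure: numeric field.
    A key of () is the grand total; (('region', 'N'),) is the rollup over
    all other dimensions for region 'N'.
    """
    results = {}
    # Each combo keeps or drops ('ALL'-rolls-up) each dimension.
    for combo in product(*[[None, d] for d in dims]):
        kept = [d for d in combo if d]
        for f in facts:
            key = tuple((d, f[d]) for d in kept)
            results[key] = results.get(key, 0) + f[measure]
    return results
```

Each key in the result is one "view" of the data; interactive OLAP navigation (drill-down, roll-up) amounts to moving between these keys.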
Major design considerations
- Data granularity : Granularity refers to the data's level of detail. The more detailed a data item is, the more granular it is. One instance of a purchase order, for example, would have high granularity; a summary record of all purchase orders for a sales quarter would have low granularity.
This is the single greatest factor in determining how much storage the warehouse will need over time, because it determines the volume of data required for analysis and reporting. As a rule, incoming data will have very high granularity; it should be broken down at the ETL stage. Granularity versus storage is a trade-off, to be sure, but the more granular the data, the more flexible it is.
- Partitioning : For efficiency reasons, data should be broken into physical units that can be easily handled using partitioning. In general, it is best to partition along subject lines.
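Both considerations can be sketched briefly. Rolling detailed purchase orders up to quarterly summaries trades granularity for storage, and a partition key along subject lines breaks the data into independently manageable physical units. Record shapes and key choices here are illustrative assumptions:

```python
from collections import defaultdict

def quarterly_summary(orders):
    """Granularity trade-off: roll detailed purchase orders (high
    granularity) up to one summary row per quarter (low granularity)."""
    summary = defaultdict(lambda: {"order_count": 0, "total": 0.0})
    for o in orders:  # each o: {"date": "YYYY-MM-DD", "amount": float, ...}
        quarter = f"{o['date'][:4]}-Q{(int(o['date'][5:7]) - 1) // 3 + 1}"
        summary[quarter]["order_count"] += 1
        summary[quarter]["total"] += o["amount"]
    return dict(summary)

def partition(records, key=lambda r: f"{r['subject']}_{r['date'][:4]}"):
    """Partitioning: group records into physical units along subject
    lines (here, subject area plus year) so each partition can be
    loaded, queried, or archived independently."""
    parts = {}
    for r in records:
        parts.setdefault(key(r), []).append(r)
    return parts
```

Note that once the detail rows are summarized away, questions below the quarter level can no longer be answered, which is exactly the flexibility cost of low granularity.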
A data warehouse can't be fully defined, functionally, before it's up and running, because neither executives nor trench-level users really know in advance what they're looking for. For this reason, the data warehouse is never quite finished. The analytics grow in sophistication with time; data gets refined and configured in new and different ways as results improve. A data warehouse is a growing, evolving thing, and building one is an iterative process, heavily dependent on your user community and its creative efforts to pull out information that improves performance.