Tallan's Technology Blog

Tallan's Top Technologists Share Their Thoughts on Today's Technology Challenges

A Warehousing Overview

Data warehousing is a big subject. This overview is intended to cover some of the most representative issues on a high level: the nature of OLAP systems, star schemas, facts and dimensions, and differing perspectives (Inmon vs. Kimball) on warehouse design.

OLTP vs. OLAP

OLTP systems are the operational databases supporting applications. They are highly normalized, and focused on CRUD operations.

OLAP databases are usually arranged in star schemas and are built for speed in retrieving aggregated data.

OLTP vs. OLAP
http://www.rainmakerworks.com/pdfdocs/OLTP_vs_OLAP.pdf

OLAP systems pick up query speed by:

  • Reducing the number of joins required to answer complex queries
  • Increasing the number of non-clustered indexes on tables

The Star Schema

A star schema features a central fact table containing numeric measurements and foreign keys. The foreign keys reference tables containing mostly textual descriptions, which are called dimensions.

Fact tables have these general properties

  • Take up the vast majority of space in an OLAP database
  • Columns fall into two general categories:
    • Foreign keys to related dimensions
    • Measurements, which are often additive. Examples are sales figures or units sold. Semi-additive measures, such as account balances or inventory levels are also common.
  • Are inherently normalized – the foreign keys comprise a composite key, and no dependencies exist in the facts

Dimensional tables

  • Are denormalized
  • Often contain an artificial key as well as a business key
  • Contain mostly textual descriptions used to slice fact data

Sample Star Schema

SELECT

     P.Brand,

     S.Country,

     SUM (F.Units_Sold)

FROM

     Fact_Sales F

INNER JOIN Dim_Date D

     ON F.Date_Id = D.Id

INNER JOIN Dim_Store S

     ON F.Store_Id = S.Id

INNER JOIN Dim_Product P

     ON F.Product_Id = P.Id

WHERE

     D.Year = 1997 AND P.Product_Category = ‘tv’

GROUP BY

     P.Brand,

     S.Country

 Typical star schema and query, courtesy of http://en.wikipedia.org/wiki/Star_schema

The grain of a fact table

The grain of a fact table is the most atomic level at which the data is available. This is often expressed in terms of “Daily Sales, by store, by day, by customer”. The foreign keys which uniquely identify records in the fact table comprise the primary key of the fact table, and define the grain.

Surrogate Keys

While the primary key of a fact table is a composite of foreign keys from related dimensions, each dimensional table contains an artificially generated primary key.

  • Should be created with integer identity columns
  • They do not have a business meaning – date dimension is an exception
  • Surrogate key should be used to represent dimensional record in fact table – this ties into the next topic, Slowly Changing Dimensions.

Slowly Changing Dimensions

Attributes of dimensions change over time. A customer’s name may change, or a store manager may be replaced. There are three general approaches to handling change in a dimension:

  • Type 1
    • If tracking change in the attribute is not important, the new value may be updated in place
    • This is especially appropriate in the case of a correction to bad data
  • Type 2
    • The most commonly used approach
    • A new row is generated containing the most current value of all attributes
    • The row will contain the same business key as the previous record, but a new surrogate key will be automatically generated
    • To enable the ETL process to correctly pick the appropriate surrogate key in time, a common approach is for each slowly changing dimension to contain [Begin Effective Date] and [End Effective Date] columns. When a new row is generated, the previous record is expired and the new record is set with a maximum value.

Customer ID

Customer No

First Name

Last Name

Begin Eff Date

End Eff Date

182

55A-3B

Allison

Hill

-2147483647

2004223

10345

55A-3B

Allison

Chandler

2004224

2147483647

  • Type 3
    • Two columns are used to track each attribute which should maintain change history
    • One column tracks either the original, or previous change
    • The other column stores the current value
    • This approach is less common than Type 2, since it cannot track more than two historical changes, and can result in changes to table structure.

Components of a Data Warehouse

  • Staging Area
    • Landing area for operational data
    • ETL operations are performed in preparation for load into warehouse
      • My personal preference is that transformations occur after staging loads, before dimension/fact loads. In this way, the staging tables should reflect the source systems exactly – this is the only place in the warehouse where this will be the case.
    • Typically truncated between loads
  • ODS (Operational Data Store)
    • In one view (Inmon’s), the ODS is used to store finely-grained, current source data for flash reporting. The ODS is in 3rd normal form. It is loaded separately from the warehouse, and is not used for extracts.
      • I feel this approach is dated by modern storage capacities – it is now feasible to store most fact data on a very granular level. Proactive-caching is a good alternative for real-time reporting.
    • Kimball’s view of the ODS could be described as the base layer of the warehouse. It shares dimensions with aggregated tables, and contains history.
    • In practice, many organizations use the ODS as an intermediary staging area for data cleansing.

 

  • Conformed Dimensions
    • Dimensions which are shared between fact tables.
    • This allows queries to be executed across star schemas – also known as “drill-across”.

 

  • Data Marts
    • Subsets of the total warehouse
    • Usually optimized for a particular department
      • Data will be optimally summarized
      • Unneeded dimensions will not be included
    • May be deployed as Analysis Services cubes

Ralph Kimball vs. Bill Inmon

Areas of agreement

  • Consensus on need for solid business requirements and end-user validation
  • Agreement that it is rarely feasible to build an entire warehouse at once – incremental development with a focus on high-priority elements
  • Conformed dimensions are desirable
  • Warehouse data needs to be tracked as atomically as possible
  • Star schema is most desirable format for data marts

Areas of differences

  • Bill Inmon
    • Approach is known as Top-Down, or Corporate Information Factory
    • Warehouse data should be stored in a centralized relational structure
    • Dependent data marts should be created from central warehouse
      • These data marts will employ star schemas
      • Additional transformations may be employed between the warehouse and the data marts
    • ODS is used for transaction level detail with little to no history

                      Inmon                                                  Kimball

  • Ralph Kimball
    • Approach is known as Bottom-Up, or Kimball Bus Architecture
    • “… The data warehouse is nothing more than the union of all the data marts …”
    • Conformed dimensions are the glue which unite disparate data marts, while Inmon maintains conformed dimensions within centralized relational structure
    • ODS may be integrated directly into warehouse

No comments

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

\\\