Tallan's Technology Blog

Tallan's Top Technologists Share Their Thoughts on Today's Technology Challenges

SCD via SQL Stored Procedure

srichards

We are going to revisit the issue of dealing with Slowly Changing Dimensions in a data warehouse. We have seen a demonstration of using the SCD transformation that is available in SQL Server Integration Services (SSIS); however, this is not always the best option for processing an SCD. When the volume of rows you’re dealing with is substantial, this creates a significant, and usually unacceptable, performance hit. Another valid reason why you may choose not to implement the SSIS transformation is flexibility. The SSIS transformation requires your begin and end dates to be datetime data types, which could prove to be inconvenient when your SCD needs to tie back to a date and/or time dimension.

One alternative we are going to exhibit is using a SQL Server stored procedure. This example demonstrates the implementation of a Type 2 SCD, preserving the change history in the dimension table by creating a new row when there are changes.

Let’s take a look at our Vendor dimension table:


VENDOR_KEY serves as the surrogate or warehouse key in the table. The effective date range columns retain the history of each vendor (VENDOR_IDNT) in the dimension, allowing us to see the column values at any point in time. Fact table rows can be joined to the dimension row on the VENDOR_KEY where the date of the fact is within the effective date range of the dimension. The BEGIN_EFF_DT and END_EFF_DT have an integer data type because they will correspond to key values that tie back to the date dimension, whose primary key has a yyyyddd format (ex: 2008001).
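
The table definition appears in the original post as a screenshot. A minimal sketch of what it might look like, based on the columns described here and in the steps below (the table name, schema, and data types are assumptions):

CREATE TABLE dbo.DIM_VENDOR                                -- hypothetical name and types
(
    VENDOR_KEY   int IDENTITY(1,1) NOT NULL PRIMARY KEY,   -- surrogate/warehouse key
    VENDOR_IDNT  varchar(20)  NOT NULL,                    -- natural vendor identifier
    LONG_NAME    varchar(100) NULL,
    SHORT_NAME   varchar(30)  NULL,
    BEGIN_EFF_DT int          NOT NULL,                    -- yyyyddd key into the date dimension
    END_EFF_DT   int          NOT NULL
);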

The Vendor staging table:

The staging table reflects its source file with the addition of the FILE_DT column. The FILE_DT column will be used as the effective date in the load process should the file contain changes or new vendors.
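
The staging table definition is also shown as a screenshot; a sketch with the same assumed naming and types:

CREATE TABLE dbo.STG_VENDOR                  -- hypothetical name and types
(
    VENDOR_IDNT  varchar(20)  NOT NULL,
    LONG_NAME    varchar(100) NULL,
    SHORT_NAME   varchar(30)  NULL,
    FILE_DT      int          NOT NULL       -- effective date of the source file, yyyyddd
);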

The stored procedure takes the data from the staging table and loads it into the dimension table. This process can be broken down into three steps.

1. Insert new records of vendors that do not exist in the dimension.

2. Insert new records of vendors that do exist in the dimension and contain field values that are different from the previous record.

3. Update the effective dates in changed records to expire the previously active record.

The first step is to insert any completely new vendor records.

The first record for each vendor is pulled from the dimension table and the VENDOR_IDNT is compared to the VENDOR_IDNT values in the staging table. A new record is inserted only when there is no match in the dimension table. The BEGIN_EFF_DT and END_EFF_DT are set to -2147483648 (smallest integer value) and 2147483647 (largest integer value), respectively, for the new record. The idea is that if there is only one record for a vendor, it is always effective.
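
The procedure code itself is shown as a screenshot in the original post. A sketch of what this first step might look like against the hypothetical DIM_VENDOR and STG_VENDOR tables above:

INSERT INTO dbo.DIM_VENDOR (VENDOR_IDNT, LONG_NAME, SHORT_NAME, BEGIN_EFF_DT, END_EFF_DT)
SELECT  s.VENDOR_IDNT
      , s.LONG_NAME
      , s.SHORT_NAME
      , -2147483648           -- smallest int: effective from the beginning of time
      , 2147483647            -- largest int: effective until a later change expires it
FROM dbo.STG_VENDOR s
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.DIM_VENDOR d
                  WHERE d.VENDOR_IDNT = s.VENDOR_IDNT);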

The second step of the procedure is to insert new rows for any items that have changed.


The latest record for each vendor is pulled from the dimension and the values are compared to those in the staging table. The columns compared by the EXCEPT clause are what track changes. A new row is inserted for any differences in the LONG_NAME or SHORT_NAME columns for a matching VENDOR_IDNT. The BEGIN_EFF_DT of the inserted row is set to the FILE_DT from the staging table and the END_EFF_DT is set to 2147483647 because it will be the latest record for that vendor.
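
A sketch of this step, again against the hypothetical tables above and assuming step 1 has already handled brand-new vendors:

INSERT INTO dbo.DIM_VENDOR (VENDOR_IDNT, LONG_NAME, SHORT_NAME, BEGIN_EFF_DT, END_EFF_DT)
SELECT  s.VENDOR_IDNT
      , s.LONG_NAME
      , s.SHORT_NAME
      , s.FILE_DT             -- new row becomes effective on the file date
      , 2147483647            -- and is the latest record for the vendor
FROM dbo.STG_VENDOR s
WHERE EXISTS
(
    -- EXCEPT returns the staging values only when they differ from the active dimension row
    SELECT s.LONG_NAME, s.SHORT_NAME
    EXCEPT
    SELECT d.LONG_NAME, d.SHORT_NAME
    FROM dbo.DIM_VENDOR d
    WHERE d.VENDOR_IDNT = s.VENDOR_IDNT
      AND d.END_EFF_DT = 2147483647     -- latest (active) record for the vendor
);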

The third step for the procedure is to expire all the entries that have been updated.


The END_EFF_DT is updated on the previously valid record and it is set to one day before the BEGIN_EFF_DT on the new record, so there is no overlap.
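
A sketch of the expiration step. Note that simply subtracting 1 from a yyyyddd key breaks on the first day of a year, so a real implementation would more likely look up the previous day's key in the date dimension:

UPDATE prev
SET prev.END_EFF_DT = curr.BEGIN_EFF_DT - 1   -- one "day" before the new row (simplified; see note above)
FROM dbo.DIM_VENDOR prev
JOIN dbo.DIM_VENDOR curr
  ON  curr.VENDOR_IDNT  = prev.VENDOR_IDNT
  AND curr.BEGIN_EFF_DT > prev.BEGIN_EFF_DT
  AND curr.END_EFF_DT   = 2147483647          -- the row just inserted in step 2
WHERE prev.END_EFF_DT = 2147483647;           -- the previously active row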

Here we have an example of a record that changed and what the table will look like when the procedure is finished:
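
The screenshot is not reproduced here, but with made-up values the pair of rows for a changed vendor would look something like this:

VENDOR_KEY  VENDOR_IDNT  LONG_NAME             SHORT_NAME  BEGIN_EFF_DT  END_EFF_DT
----------  -----------  --------------------  ----------  ------------  -----------
        42  V100         Acme Supply Company   ACME         -2147483648      2008099
       157  V100         Acme Supply Co. LLC   ACME             2008100   2147483647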


6 Comments

Very clever with that EXCEPT clause to check for changes… Any idea how it compares performance wise vs binary_checksum() comparison vs the tests all OR’ed together?

“vs the tests” in comment 1 refers to != comparisons (how many ways can I say not equals? The open and close carets disappeared for obvious reasons).

“Fact table rows can be joined to the dimension row where the fact row transaction date is within the effective date range of the dimension row. The BEGIN_EFF_DT and END_EFF_DT have an integer data type because they will correspond to key values that tie back to the date dimension, whose primary key has a yyyyddd format (ex: 2008001).”

Let’s be clear that INCOMING fact table rows can be joined to the dimension for a range scan to determine the _KEY at load time. The rows in the actual fact table should only be joined to the _KEY in the dimension table.
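
That load-time lookup might look like this, with a hypothetical incoming fact staging table (STG_SALES_FACT) whose TRANSACTION_DT uses the same yyyyddd key format:

SELECT  f.*                                -- incoming fact rows, not yet loaded
      , d.VENDOR_KEY                       -- surrogate key to store on the fact row
FROM dbo.STG_SALES_FACT f
JOIN dbo.DIM_VENDOR d
  ON  d.VENDOR_IDNT = f.VENDOR_IDNT
  AND f.TRANSACTION_DT BETWEEN d.BEGIN_EFF_DT AND d.END_EFF_DT;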

My preference is to load the updates first, followed by the inserts. My reason for doing this is that the inserted rows will be added correctly, with no need for update. If these rows are inserted first in the process, the process to update existing rows will have to, at a minimum, pass these rows and verify that no change has been made.

As to the update process, again I think I would prefer to reverse the order of the process suggested above. The reason here is that I would prefer not to have two rows end-dated with high values available simultaneously, which would be the case if you first insert the changed row and then update the row that is changing at a later time. If current fact data happens to be loading during this process (which it should not be, but I never trust that that will be the case), I would rather have a row fail to insert into the fact table and handle it as an exception than have a fact duplicated with no exception raised.

Two additional points…
If the update process is treated as a single transaction, then the concern over two “current” rows goes away, but that means that a larger transaction log must be available.
There might be an argument for a cursor that does both the insert and update, or for batching n updates and inserts (TOP (10000), for example) into a transaction block, so that data consistency is maintained but the transaction log footprint is minimized.

While I understand that the code here is for illustration, best practice says INSERT INTO should always be followed by a column list. You do not want to (necessarily) have your procedure break because someone went and added a column to your table. This argument is the same one that says SELECT * should never be used. (I will use SELECT * if, and only if, I am in complete control of the column list, i.e., I have a derived table/inline view, etc. embedded in the overall SELECT statement, but even then I will usually specify the list.)
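
For example, against the hypothetical DIM_VENDOR table sketched above (values made up):

-- Explicit column list: the insert is not tied to the table's column order,
-- and it keeps working if a nullable column is added to DIM_VENDOR later
INSERT INTO dbo.DIM_VENDOR (VENDOR_IDNT, LONG_NAME, SHORT_NAME, BEGIN_EFF_DT, END_EFF_DT)
VALUES ('V100', 'Acme Supply Company', 'ACME', 2008001, 2147483647);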

binary_checksum()…
http://msdn.microsoft.com/en-us/library/ms173784.aspx
The SQL Server checksum functions can be used to detect changes to records. So if I am looking to see if a value changed on a row, I might code something that looks like this:

UPDATE a
SET …
FROM targetTable a
JOIN sourceTable b
ON a.keys = b.keys
WHERE binary_checksum( a.col1
, a.col2)
<> binary_checksum( b.col1
, b.col2)

This is far more efficient than

UPDATE a
SET …
FROM targetTable a
JOIN sourceTable b
ON a.keys = b.keys
WHERE ((a.col1 <> b.col1)
OR (a.col2 <> b.col2))

It can be made even more efficient by storing the checksum as a column in your fact/dimension table and indexing it.
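
For example, something along these lines (the column and index names are made up):

-- Store the checksum once, at load time, instead of recomputing it on every comparison
ALTER TABLE dbo.DIM_VENDOR
    ADD ROW_CHECKSUM AS binary_checksum(LONG_NAME, SHORT_NAME) PERSISTED;

CREATE INDEX IX_DIM_VENDOR_ROW_CHECKSUM
    ON dbo.DIM_VENDOR (VENDOR_IDNT, ROW_CHECKSUM);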

BUT… MS does not guarantee that rows that have changed will not produce the same checksum. So if you use it, you have to be careful to Test, Test, Test. It is only a small percentage of changes where this can occur, but it can.

Further caveat… NULLs in your expression list are ignored. Let’s say your list looks something like this:

SELECT binary_checksum(1,2,NULL,3)
It will return 291 as the result

Similarly, the following will also produce 291 as the result set…
SELECT binary_checksum(1,NULL,2,3)

Simply casting the columns to a datatype will usually fix this, but I usually coalesce the values with a value unlikely to occur in the dataset. So my update statement above might look something like this:

DECLARE @ValueIfNullStr varchar(10)
, @ValueIfNullNbr int

SELECT @ValueIfNullStr = '#^#@'
, @ValueIfNullNbr = -2000000000

UPDATE a
SET …
FROM targetTable a
JOIN sourceTable b
ON a.keys = b.keys
WHERE binary_checksum( coalesce(a.col1,@ValueIfNullStr)
, coalesce(a.col2,@ValueIfNullNbr))
<> binary_checksum( coalesce(b.col1,@ValueIfNullStr)
, coalesce(b.col2,@ValueIfNullNbr))
