The SCD Transformation in SSIS
In this post, we will demonstrate the Slowly Changing Dimension (SCD) transformation that is available for use in a Data Flow task within SQL Server Integration Services (SSIS). We’ll walk through an example of how to use it, explaining the terms as we go. In addition, I’ll demonstrate some “extra effort” modifications that can be made for a more thorough implementation. After that, we’ll discuss the pros and cons of using it at all. And we’ll wrap up with some tips on how to tune the transformation for better performance.
The Slowly Changing Dimension transformation is an SSIS Data Flow transformation that can be used to update slowly changing dimensions in a data warehouse. To get started with it:
- Open BIDS or Visual Studio
- Create a new SSIS project. Call it SCDDemo.
- Add a Data Flow task to Package.dtsx
- On the Data Flow tab, add an OLE DB Source and a Slowly Changing Dimension transformation. The data source does not have to be OLE DB; it can be almost anything, but most of the time it will be a database of some sort. Notice that the Slowly Changing Dimension transformation does not have an error output. That is because it addresses all possible outputs and does not need one.
- To provide the transformation with input data, connect the output of the OLE DB Source to the Slowly Changing Dimension. The Data Flow tab will look like this:
Before configuring the SCD transformation, you need to create and configure the data source. For this demo, I used a demo database, schemas, and tables based on the AdventureWorksDW database. The demo database is named SlowlyChangingDimensionDemo. The schemas are Staging and Dim. The tables are Staging.Customer and Dim.Customer. The T-SQL used to populate the tables was:
-- Insert some sample data
INSERT INTO Staging.Customer
       (CustomerKey, FirstName, LastName, BirthDate,
        MaritalStatus, Gender, NumberOfCarsOwned)
SELECT TOP 100
       CustomerKey, FirstName, LastName, BirthDate,
       MaritalStatus, Gender, NumberOfCarsOwned
FROM AdventureWorksDW..DimCustomer;

-- Inferred row in dimension
INSERT INTO Dim.Customer (CustomerKey, InferredRow, StartDate)
VALUES (11000, 1, GETDATE());  -- illustrative business key; the original value was not shown
So, at the beginning the Staging table has 100 rows in it and the Dim table has just one. Don’t worry about what an inferred row is just now; it will be explained later.
The processing cycle for the SCD transformation is to accept all the rows of data from the source, analyze each one to determine the type of change(s) it contains, and send a copy of the row to each of the appropriate outputs the transformation can create (there are six in all). Each of these outputs can then be used as input to a step that processes a particular type of change. The SCD transformation determines the type of change for an input row by comparing it to its matching row (based on a business key) in an already existing dimension table.
The next step in preparing to use the SCD is to configure it. There is a wizard to help. But before doing that, there are some important concepts that the wizard uses that need some explanation. The starting point for this is the definition of the dimension table, Dim.Customer in the example.
Notice that the table has its own surrogate key, CustomerWK, which is separate from the “Business Key,” the key the source system uses to uniquely identify the row (here, CustomerKey). Next, note the six data items FirstName, LastName, BirthDate, MaritalStatus, Gender, and NumberOfCarsOwned. These six columns, along with the CustomerKey, make up the data from the source system that will be stored in the dimension table. The StartDate and EndDate are populated by the dimension update process. They allow the table to have different, time-dependent rows for the same customer (identified by the CustomerKey). The CurrentRow column is a bit (Boolean) field that is 1 if the row is the current row for the customer and 0 (zero) otherwise. It is also set by the dimension update process. The last column, InferredRow, is a Boolean field that is set to 1 if the row was added to the dimension table by fact table processing.
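Based on the columns described above, a minimal sketch of the Dim.Customer definition might look like the following. The data types are assumptions modeled on AdventureWorksDW, not taken from the original table definition:

```sql
-- Sketch of the demo dimension table; data types are assumptions
CREATE TABLE Dim.Customer (
    CustomerWK        int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
    CustomerKey       int NOT NULL,          -- business key from the source system
    FirstName         nvarchar(50) NULL,
    LastName          nvarchar(50) NULL,
    BirthDate         date NULL,
    MaritalStatus     nchar(1) NULL,
    Gender            nchar(1) NULL,
    NumberOfCarsOwned tinyint NULL,
    StartDate         datetime NULL,         -- set by the dimension update process
    EndDate           datetime NULL,         -- NULL while the row is current
    CurrentRow        bit NOT NULL DEFAULT 1,
    InferredRow       bit NOT NULL DEFAULT 0
);
```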
That last statement above deserves a little more explanation. It will sometimes happen in the course of joining a newly arrived fact (from Staging) to a dimension table (on the Business Key) that there is no match in the dimension table. That is, the fact contains a Business Key value that cannot be found in the corresponding dimension table. When this happens, there are three possible actions that can be taken:
- The process can ignore the fact and exclude it from the data warehouse (this happens automatically if the join between the fact and the dimension is an “inner join”)
- The process can attempt to enter some value into the data warehouse fact table that indicates there was no match on the dimension when the fact was loaded
- The process can add an “empty” row into the dimension table that contains the unmatched Business Key along with empty/default values for the remaining data columns. The fact then joins to this newly added dimension row.
In this last case, the inserted row is called an Inferred Member or Inferred Row. If, at some later time, a change row for the inferred row arrives from the source system, then the inferred row can be updated and no change is necessary in the fact table.
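The third option above can be implemented outside of SSIS with a simple LEFT JOIN during the fact load. This is only a sketch; Staging.FactSales is a hypothetical staging fact table, not part of the demo:

```sql
-- Add an inferred member for every business key in the incoming facts
-- that has no match in the dimension (Staging.FactSales is hypothetical)
INSERT INTO Dim.Customer (CustomerKey, InferredRow, StartDate)
SELECT DISTINCT f.CustomerKey, 1, GETDATE()
FROM Staging.FactSales AS f
LEFT JOIN Dim.Customer AS d
       ON d.CustomerKey = f.CustomerKey
WHERE d.CustomerKey IS NULL;  -- no matching dimension row found
```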
The final topic that needs to be explained before configuring the Slowly Changing Dimension transformation is the types of changes that it will identify for each column in the dimension table and how each one will be processed. There are three basic types that the wizard addresses: fixed, historical and changing. A column marked as having the “fixed” change type (it does sound a little funny, I know) does not allow changes to the column’s value, even if one arrives from the source system. It is difficult to imagine such data items in most systems, but in the demo the FirstName and Gender columns will be marked this way.
When a change arrives for a column which is marked to allow “changing” changes (that sounds funny, too), the change overwrites the existing values in the column. This type of change is used for error correction or to rewrite history. Ralph Kimball (and others) refers to these as Type 1 changes.
The last change type, “historical”, refers to changes that are meant to be applied only going forward, or for a specific period of time. That name doesn’t seem to match its functionality to me, but that’s what it’s called. Don’t blame me, blame Microsoft. In the case of a “historical” change, a new current row is inserted into the dimension table and the old current row is flagged as expired (or no longer current). Kimball calls these Type 2 changes.
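In T-SQL terms, the two change types behave roughly like this. This is a sketch of the logic, not what the wizard generates verbatim:

```sql
-- Type 1 ("changing"): overwrite the value in place, rewriting history
UPDATE d
SET    d.NumberOfCarsOwned = s.NumberOfCarsOwned
FROM   Dim.Customer AS d
JOIN   Staging.Customer AS s ON s.CustomerKey = d.CustomerKey
WHERE  d.CurrentRow = 1;

-- Type 2 ("historical"): expire the current row...
UPDATE d
SET    d.EndDate = GETDATE(), d.CurrentRow = 0
FROM   Dim.Customer AS d
JOIN   Staging.Customer AS s ON s.CustomerKey = d.CustomerKey
WHERE  d.CurrentRow = 1
  AND  d.LastName <> s.LastName;  -- a historical column changed

-- ...then insert a new current row for that customer
INSERT INTO Dim.Customer
       (CustomerKey, FirstName, LastName, BirthDate,
        MaritalStatus, Gender, NumberOfCarsOwned, StartDate, CurrentRow)
SELECT s.CustomerKey, s.FirstName, s.LastName, s.BirthDate,
       s.MaritalStatus, s.Gender, s.NumberOfCarsOwned, GETDATE(), 1
FROM   Staging.Customer AS s
WHERE  NOT EXISTS (SELECT 1 FROM Dim.Customer AS d
                   WHERE d.CustomerKey = s.CustomerKey
                     AND d.CurrentRow = 1);
```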
With these concepts explained, the Slowly Changing Dimension transformation can be configured. To begin, double click the Slowly Changing Dimension transformation (the one marked with a red circle containing a white “x,” indicating it is not yet configured) and the Slowly Changing Dimension Wizard will be displayed. Click Next to get by the Welcome pane. The Select a Dimension Table and Keys pane will be displayed.
Fill in or select the Connection Manager and the Table or View. The columns of the table or view will be displayed in the pane. The wizard will attempt to match the source columns to the dimension table columns. Make any changes necessary to get the desired mappings. These mappings will be used to determine if the row in the Staging table represents a change or not. Change the Key Type of at least one column pair to “Business key”. More than one pair can be tagged as “Business key” if needed. For the demo, the completed pane looks like this:
Click Next, and the Slowly Changing Dimension Columns pane will appear.
This is the place where you indicate which columns of the dimension are allowed to change and how those changes are to be processed. In the Dimension Columns column of the grid, enter those column names which will allow changes. Those that are “fixed” do not have to be entered, but it doesn’t do any harm to put them in. For the sake of documentation, it may even be a good thing to do. In the Change Type column of the grid, select the desired change type for each of the dimension columns listed. Definitions of the different types, called attributes here, are displayed right on the pane to help you decide. For this demo, this is the completed pane:
Notice that there is at least one of each type of change in the demo.
Click Next. If any Fixed or Changing change types were selected, the Fixed and Changing Attribute Options pane will appear. If no columns of a particular change type were selected, its option will be grayed out (disabled). Check the appropriate box for each option. My personal opinion is that you should not fail the transformation if fixed changes are detected. For this demo, neither box was checked.
Click Next. If there were any columns tagged as “Historical”, the Historical Attribute Options pane will appear. Click the option button that indicates the desired method for identifying current and expired records. Only one of the two buttons can be selected. Once a button is chosen, the fields associated with the option are enabled and those for the other option are disabled. Fill in the fields for the selected option. The wizard loads the drop down list with potential choices for each question. This demo uses start and end dates. This is the completed pane:
Click Next, and the Inferred Dimension Members pane will be displayed. If your dimension will contain Inferred Members, check the box to enable inferred member support. When checked, the supporting option buttons will be enabled. Click on the desired option. The wizard will fill in the possible choices for the selections in the drop down list. This is the completed pane for the demo.
Click Next, and the Finish the Slowly Changing Dimension Wizard pane will appear. Click Finish to complete the configuration. The wizard will clear and calculate which additional transformations and destinations to include in the Data Flow tab. When complete, the finalized transformation will be displayed. I found this layout hard to follow. So I rearranged the boxes a little, changed some names and came up with this:
Even this picture needs some explanation. The box labeled “Derived Column” is where the value for EndDate is generated. The “OLE DB Command” box updates the existing row with the EndDate. The “Derived Column 1” box generates the value for StartDate. “Insert Destination” inserts new rows for two different types of changes. “OLE DB Command 1” updates any inferred members that have changes. “OLE DB Command 2” updates any rows that have Type 1 changes.
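The UPDATE statements those OLE DB Command boxes execute look roughly like the following. This is a sketch; the actual column and parameter lists depend on what you selected in the wizard, and the ? markers are mapped to data flow columns in each command’s configuration:

```sql
-- "OLE DB Command" (expire the old current row), run once per changed row
UPDATE Dim.Customer SET EndDate = ? WHERE CustomerKey = ? AND EndDate IS NULL;

-- "OLE DB Command 2" (Type 1 / changing attributes): overwrite in place
UPDATE Dim.Customer SET NumberOfCarsOwned = ? WHERE CustomerKey = ?;
```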
The Dim.Customer table has one inferred row in it and Staging.Customer contains the first 100 records from the AdventureWorksDW DimCustomer table.
The next step is the first run of the SSIS package we just created.
After working with it for a while, one thing that puzzled me about the wizard was the number of outputs it generated. All the documentation indicated that the transformation created (or could create) six, but the most I could get the wizard to create was four. After a little investigation, I discovered that the two it wouldn’t generate were the one for rows that didn’t have any changes in them and the one for rows that only had changes to fixed columns. I finally added them manually. Just add the Data Flow Destination first, then select Add Path from the Slowly Changing Dimension transformation and connect it to the Data Flow Destination. The interface will make you choose one of the unused outputs. When that’s all done, the Data Flow tab looks like this:
After I had this looking the way I wanted, it dawned on me that the SCD transformation doesn’t necessarily have anything to do with data warehouse dimensions. It has everything to do with comparing the contents of two tables based on a set of column mappings and distributing the contents of one of those tables based on the results of those comparisons. So it could be used in other, non-data-warehousing situations as well. All you have to do is identify them when they occur.
Here are some pros and cons of using the SCD transformation.

Pros:
- It does all the work. You just have to figure out the configuration, which you should do anyway.
- It’s easy for beginners. All you have to do is configure it.
- It’s a great training tool. You learn a lot about Slowly Changing Dimensions by using it. I know I did!
- It gives you a starting point. After you configure it, you can modify what it created to produce the desired result.

Cons:
- It can be slow when applied to large dimensions, i.e. anything over 1,000 rows.
- You can’t configure the implied comparison/lookup between the two tables. It is strictly a .NET comparison for equality. That means, for example, that a varchar(10) may not compare correctly to a char(10), even though to the human eye they are identical.
- If any part of the comparison results in a NULL, the transformation throws an exception, which may or may not cause the entire package to abort, depending on how you have “programmed” the package.
- It can’t handle any “special cases,” such as updating Oracle tables.
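The char/varchar mismatch is easy to see. In the T-SQL comparison below the two values are equal because trailing spaces are ignored, but the SCD transformation compares them as .NET strings, where the padded char(10) value is not equal to the unpadded varchar(10) value, so the row is flagged as changed:

```sql
DECLARE @fixed    char(10)    = 'Smith';  -- stored padded: 'Smith     '
DECLARE @variable varchar(10) = 'Smith';

-- T-SQL says they are equal; the SCD's .NET comparison would not
SELECT CASE WHEN @fixed = @variable
            THEN 'equal in T-SQL'
            ELSE 'not equal' END AS Result;
```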
Listed below are some tuning tips that can make the transformation perform better.
- Modify the DefaultBufferMaxSize and DefaultBufferMaxRows properties of the Data Flow task. These reserve memory, which keeps SQL Server from grabbing it later.
- In a SQL Server environment, use the SQL Server Destination instead of the OLE DB Destination. The SQL Server Destination is really Batch Insert.
- Set the Max Insert Commit Size on the SQL Server Destination. It defaults to 0 (zero), which means all the rows are committed as a single batch. If there is a failure, the entire batch is sent to the error output.
- Disable constraints on the dimension table. It’s an option on the SQL Server Destination.
- Send each type of output to a separate table. Then use stored procedures to maintain the dimension table. One poster indicated this was 10 times faster than having SCD do the updates and inserts.
- Make sure the data types match between the Staging and Dimension tables.
- Hash the values of the change columns and compare the hash totals to determine if changes are necessary.
- Limit logging, both SSIS and SQL Server.
- Use another transformation instead of SCD. There is a freeware one called TableDifference that can be downloaded from sqlbi.eu.
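The hashing tip above can be implemented with HASHBYTES. This sketch assumes SQL Server 2017 or later for CONCAT_WS; the '|' separator keeps adjacent column values from running together:

```sql
-- Find staged rows whose change columns differ from the current
-- dimension row by comparing one hash instead of each column
SELECT s.CustomerKey
FROM   Staging.Customer AS s
JOIN   Dim.Customer     AS d ON d.CustomerKey = s.CustomerKey
WHERE  d.CurrentRow = 1
  AND  HASHBYTES('SHA2_256', CONCAT_WS('|', s.FirstName, s.LastName,
                                            s.MaritalStatus, s.NumberOfCarsOwned))
    <> HASHBYTES('SHA2_256', CONCAT_WS('|', d.FirstName, d.LastName,
                                            d.MaritalStatus, d.NumberOfCarsOwned));
```

In a real package, you would more likely persist the hash in a column on the dimension table so only the incoming rows need to be hashed on each load.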
My bottom line on the SCD is that I would use it as a starting point for updating any Slowly Changing Dimensions in my data warehouse. The analysis it forces you to do (and easily modify) is great. After it is configured, I would modify it for documentation purposes. Then I would test the daylights out of it and make any performance modifications I could. If it still didn’t perform up to expectations, then I would consider doing all the updating in T-SQL.