SSIS data management fundamentals
Loading data into a warehouse for use in business intelligence often involves loading data from multiple dissimilar sources. Often the easiest way to manage the data consistently is to extract from the source systems into a simple text format such as comma-delimited files. Even legacy systems have some form of reporting engine that allows for text file output, and with more current database types there are mechanisms such as BCP for dumping output to a text file. Flat files have a few more advantages: they provide a replicable snapshot in time, a single output from a source system can easily be imported into multiple different warehouses, and the imported data can easily be broken into manageable chunks simply by managing the text files.
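As a rough illustration of that last point, splitting a flat-file extract into manageable chunks is simple to script. The sketch below (names like `split_csv` are my own, not part of SSIS or any tool mentioned here) breaks a CSV extract into fixed-size pieces, repeating the header in each so every chunk can be imported on its own:

```python
import csv
import io

def split_csv(source, chunk_size):
    """Split a CSV stream into chunks of at most chunk_size data rows.

    Each chunk repeats the header row so it can be imported independently.
    Returns a list of CSV strings.
    """
    reader = csv.reader(source)
    header = next(reader)  # keep the header for every chunk
    chunks, current = [], []
    for row in reader:
        current.append(row)
        if len(current) == chunk_size:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)

    outputs = []
    for rows in chunks:
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(header)
        writer.writerows(rows)
        outputs.append(buf.getvalue())
    return outputs

# A 5-row extract split into chunks of 2 rows yields 3 importable files.
extract = "id,amount\n1,10\n2,20\n3,30\n4,40\n5,50\n"
files = split_csv(io.StringIO(extract), 2)
```

In practice each chunk would be written to its own file on disk; keeping the header in each means any chunk can be re-run independently if an import fails partway.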
SSIS has many features to simplify working with data files. It lets you create Control Flows that sequence the tasks involved in importing data, make decisions based on your own criteria, and manage errors robustly. It also lets you easily perform common transformations, set checkpoints to define restart points, and view intermediate results using debug windows and data viewers. Microsoft provides visual designers for building these flows within SSIS, making the process straightforward.
Our example for this week is based on a CodePlex project which leverages the AdventureWorksDW database. The original project is available at http://www.codeplex.com/MSFTISProdSamples/Wiki/View.aspx?title=SS2008%21Lookup%20Transformation%20Sample&referringTitle=Home . The primary modifications to the project are to add more error handling and to show some debugging examples.
This project uses a Lookup transformation based on a cached data table that is also read from a file. In this particular instance the cache provides a structured Time dimension in a reference table, including days of the week, month names in multiple languages, the ability to separate calendar quarters from fiscal quarters, and so on. The same idea can be used to transform any business key into a dimension key, such as a CustomerID or ProductID.
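Conceptually, what the Lookup transformation does with its cache is a keyed join: resolve each incoming row's business key to the dimension's surrogate key. A minimal sketch of the idea, using an in-memory dictionary as the "cache" (the table and column names here are illustrative, not from the sample project):

```python
# A tiny stand-in for the cached Date dimension reference table.
date_dim = [
    {"TimeKey": 101, "FullDate": "2008-01-15"},
    {"TimeKey": 102, "FullDate": "2008-02-15"},
]

# Build the lookup cache keyed on the business key (the date string).
cache = {row["FullDate"]: row["TimeKey"] for row in date_dim}

# Resolve incoming fact rows to the dimension's surrogate key.
fact_rows = [{"FullDate": "2008-01-15", "Amount": 500.0}]
resolved = [
    {"TimeKey": cache[r["FullDate"]], "Amount": r["Amount"]}
    for r in fact_rows
    if r["FullDate"] in cache
]
```

The warehouse then stores the compact surrogate key (`TimeKey`) rather than the raw business key, which is exactly the substitution the package performs when loading the fact table.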
There are two primary pieces in the control flow, loading the reference table and then loading the fact table itself.
Each control flow task has its own set of data flow tasks. The control flow and data flow tasks execute in the order defined by the lines and arrows joining them, with green arrows showing the path taken on success and red arrows the path taken on error.
The data flow for the Load Lookup Cache task is as follows:
It is made up of three components which execute in order. The Flat File Source is a Data Flow Source; it could also be an ADO.NET, Excel, OLE DB, or XML source. Depending on the specific type of data source, different properties are available to set the connection string, query, and so on.
The Derived Column is a transformation that creates new columns from expressions applied to the incoming rows. In this particular case it is used to create the calendar quarter columns, as shown below
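The calendar quarter derivation is simple arithmetic on the month number. A sketch of the equivalent logic (the function names are mine; in the package this is an SSIS expression, not code):

```python
def calendar_quarter(month):
    """Derive the calendar quarter number (1-4) from a month number (1-12)."""
    return (month - 1) // 3 + 1

def calendar_quarter_name(month):
    """Derive a display name like 'Q2' from a month number."""
    return "Q" + str(calendar_quarter(month))
```

A fiscal quarter column would use the same pattern with the month shifted by the fiscal year's starting offset, which is why keeping both in the reference table is convenient.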
And finally, the Cache Transform takes data from the connected data source and writes it to a cache for later use by a Lookup transformation. The Cache Transform lets you select which columns to insert into the cache, and gives you the opportunity to map them to other column names if your destination data set uses different names.
As this is the last data flow task, it completes the Load Lookup Cache control flow task, and control passes to the Load Fact Table task. In that task the flat file is loaded, the previously cached Date dimension table is referenced to determine which date key to use when inserting the fact table data into the warehouse, and each data row is routed to a destination based on the result of the lookup.
The FactBudget source is a flat file data source, similar to the one used to load the cache. The Lookup DimTime using CCM task is a Lookup transformation. In a Lookup transformation, the two most important areas you will deal with are the Columns section and the Error Output. In the Columns pane of the editor, you select how columns in your incoming data stream are matched to your lookup table.
In this case the TimeKey is matched from the inbound data stream to the previously loaded Date dimension's TimeKey, and the row is inserted into the data warehouse with CalendarQtrID, CalendarQtrName, and EnglishMonthName added.
The Error Output pane lets you set what happens when a row in the input data stream either throws a general error or has a truncation error.
As you can see in the overview picture of the data flow for the Load Fact Table control task, each of the possible data flows (Lookup Match, Lookup No Match, Lookup Error) enters its own Row Count data flow task. These tasks count the records that follow each individual path.
On the Lookup Match path, the data in this case flows out to another flat text file, though it could just as easily go directly into the data warehouse. On the Lookup No Match and Lookup Error paths we have added Data Viewers, which provide a simple visualization of the rows going down each path. When the package is run, you can view the data in the viewer.
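The three-way routing the package performs can be sketched as plain code. This is only an analogy for how the Lookup's outputs behave, under the assumption that an error here means a missing business key; in SSIS the error path also catches conversion and truncation failures:

```python
def route_rows(rows, cache):
    """Route each fact row the way the Lookup's outputs do:
    match, no-match, or error (modeled here as a missing key)."""
    matched, no_match, errors = [], [], []
    for row in rows:
        key = row.get("FullDate")
        if key is None:
            errors.append(row)            # Lookup Error path
        elif key in cache:
            matched.append({**row, "TimeKey": cache[key]})  # Lookup Match
        else:
            no_match.append(row)          # Lookup No Match path
    return matched, no_match, errors

cache = {"2008-01-15": 101}
rows = [
    {"FullDate": "2008-01-15", "Amount": 10.0},
    {"FullDate": "2009-06-01", "Amount": 20.0},  # not in the cache
    {"FullDate": None, "Amount": 30.0},          # triggers the error path
]
matched, no_match, errors = route_rows(rows, cache)
counts = (len(matched), len(no_match), len(errors))  # one row per path
```

The `counts` tuple plays the role of the three Row Count tasks, and each of the three lists corresponds to one of the output files.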
The above snapshot is taken while the debugger is operating, and pauses the operation of the SSIS package. The color of the tasks designates their state – yellow is processing, white is not yet processed, green would be complete, and red would be failed.
Note that the Error Output flow is green; that is because the errors are handled. A task turns red at runtime only if it is unable to complete. In this particular example, if one of the destination files were already open for writing and locked by another program, that destination task would fail and end up red.
So at this point our initial FactBudget flat text file has been transformed from using its business date key to using the date dimension's key instead, and rewritten to a new flat file for import into the data warehouse. We also have separate flat files for the rows from the original fact table text file that had no matching rows in the date dimension lookup table, and for the rows that failed with other errors, which may include null values in columns, data conversion failures, and so on. This lets the clean records be imported into the data warehouse easily, while giving the ability to work separately with the failed rows to correct whatever their issues are.