
SSIS data management fundamentals

elizotte

Loading data into a warehouse for use in business intelligence often involves loading data from multiple dissimilar sources. Often the easiest way to manage the data consistently is to extract from the source systems into a simple text format such as comma-delimited files. Even legacy systems have some form of reporting engine that allows for text file output, and with more current database platforms there are mechanisms such as BCP for dumping output to a text file. A few more reasons for using flat files: they provide a replicable snapshot in time, they allow a single output from a source system to be imported easily into multiple different warehouses, and they allow the imported data to be broken into manageable chunks simply by managing the text files.
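As an example of the extract side, a BCP command that dumps a table to a comma-delimited text file might look like the following (the table, output path, and server name are made up for illustration):

    bcp AdventureWorksDW.dbo.FactBudget out C:\Extracts\FactBudget.txt -c -t, -S MyServer -T

Here -c requests character (text) mode, -t, sets a comma field terminator, and -T uses Windows authentication.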

SSIS has many features to simplify working with data files. It allows you to create control flows that sequence the tasks involved in importing data, with decision branching based on your criteria as well as robust options for error management. It also allows you to easily perform common transformations, set checkpoints to define restart points, and view intermediate results using debug windows and data viewers. Microsoft provides visual designers for creating these flows within SSIS, making the process straightforward.

Our example for this week is based on a CodePlex project which leverages the AdventureWorksDW database. The original project is available at http://www.codeplex.com/MSFTISProdSamples/Wiki/View.aspx?title=SS2008%21Lookup%20Transformation%20Sample&referringTitle=Home . The primary modifications to the project are to add more error management and to show some debugging examples.

This project uses a Lookup transformation based on a cached reference table that is itself read from a file. In this particular instance the cache provides a structured Time dimension reference table, including days of the week and month names in multiple languages, the ability to separate calendar quarter from fiscal quarter, and so on. The same idea can be used to transform any business key into a dimension key, such as a CustomerID or ProductID.

There are two primary pieces in the control flow: loading the reference table, and then loading the fact table itself.

Each control flow task has its own set of data flow tasks. The control flow and data flow tasks execute in the order defined by the lines and arrows joining them, with green arrows showing the path taken on success and red arrows the path taken on error.

The data flow for the Load Lookup Cache task is made up of three components which execute in order.

The Flat File source is a Data Flow source; alternatives include ADO.NET, Excel, OLE DB, and XML sources. Depending on the specific type of data source, different properties are available to set the connection string, query, and so on.

The Derived Column component is a transformation used to create new columns computed from existing ones. In this particular case it is used to create columns for the calendar quarter, as sketched below.
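A minimal sketch of the kind of SSIS expressions involved, assuming the source date column is named FullDate (the actual column name in the sample may differ):

    CalendarQtrID:   DATEPART("qq", FullDate)
    CalendarQtrName: "Q" + (DT_WSTR,1)DATEPART("qq", FullDate)

DATEPART returns a four-byte integer, so the second expression casts it to a one-character Unicode string before concatenating.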

Finally, the Cache Transform takes data from the connected data source and writes it to a cache for later use by a Lookup transformation. The Cache Transform allows you to select which columns to insert into the cache, and gives you the opportunity to map them to other column names if your destination data set uses different names.

As this is the last data flow task, it completes the Load Lookup Cache control flow task, and control passes on to the Load Fact Table task. In this task the flat file is loaded, the previously cached Date dimension table is referenced to determine which date key to use when inserting the fact data into the warehouse, and each data row is routed to a destination based on the result of the lookup.

[Screenshot: overview of the Load Fact Table data flow]

The FactBudget source is a flat file data source, similar to the one used to load the cache. The Lookup DimTime using CCM task is a Lookup transformation. In a Lookup transformation, the two most important areas you will deal with are the Columns section and the Error Output. In the Columns pane of the editor, you select how columns in your incoming data stream are matched to your lookup table.

In this case the TimeKey from the inbound data stream is matched to the previously loaded Date dimension's TimeKey, and the row is passed on toward the data warehouse with CalendarQtrID, CalendarQtrName and EnglishMonthName added.
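Conceptually the lookup behaves like a left join against the cached reference table. A rough T-SQL equivalent, with the staging table and measure column names made up for illustration (the key and dimension columns are the ones named above), would be:

    SELECT f.TimeKey,
           d.CalendarQtrID,
           d.CalendarQtrName,
           d.EnglishMonthName,
           f.BudgetAmount               -- hypothetical measure column
    FROM StagingFactBudget AS f         -- hypothetical staging table
    LEFT JOIN DimTime AS d
        ON d.TimeKey = f.TimeKey;
    -- rows where d.TimeKey IS NULL would follow the Lookup No Match path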

The Error Output pane allows you to set what happens when a row in the input data stream either throws a general error or has a truncation error: for each condition the component can fail, the failure can be ignored, or the offending row can be redirected to the error output.

[Screenshot: Lookup transformation Error Output pane]

As you can see in the overview picture of the data flow for the Load Fact Table control task, each of the possible data flows (Lookup Match, Lookup No Match, Lookup Error) enters its own Row Count data flow task. These provide a count of the records that follow each individual path, storing the totals in package variables.

On the Lookup Match flow, the data in this case flows out to another flat text file; it could just as easily go into the data warehouse directly. On the Lookup No Match and Lookup Error flows we have added Data Viewers, which allow simple visualization of the rows going down each path. When the package is run, you can view the data in the Data Viewer.

[Screenshot: the package paused in the debugger with a Data Viewer showing rows on the error path]

The above snapshot was taken while the debugger was operating and the SSIS package was paused. The color of each task designates its state: yellow is processing, white is not yet processed, green is complete, and red is failed.

Note that the Error Output flow is green; that is because the errors on it are handled. A task shows red at runtime only if it is unable to complete. In this particular example, one of the destination files already being open for write and locked by another program would cause that destination task to fail and end up red.

So at this point, what we have is our initial FactBudget flat text file transformed from using its business date key to using the Date dimension's key instead, and rewritten to a new flat file for import into the data warehouse. We also have separate flat files for rows in the original fact file that had no matching rows in the Date dimension lookup table, and for rows that failed for other reasons, ranging from null values in columns to data conversion errors. This allows the clean records to be imported easily into the data warehouse while giving us the ability to work separately with the failed rows to correct whatever their issues are.
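To complete the picture, once the clean file exists it could be loaded with a plain BULK INSERT; the target table and file path here are made up for illustration:

    BULK INSERT dbo.FactBudget
    FROM 'C:\Exports\FactBudget_clean.txt'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');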

Tags: ETL, SQL Server, SSIS

1 Comment

Coming out of Eric’s presentation, we discussed several SSIS “pain points”. Here’s a follow-up with info on the issues we discussed.

Connection Managers are key elements of SSIS packages. This article contains a decent overview and links: http://www.mssqltips.com/tip.asp?tip=1147

In an enterprise setting, it's important to be able to move an SSIS package from Dev, through QA, and into Production without modifications. In the case of Connection Managers, this is accomplished either by creating SSIS configurations or by using the /CONF flag with dtexec on the command line.
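For example, a command line that runs a package with a per-environment configuration file applied might look like this (the package and configuration file paths are made up):

    dtexec /FILE "C:\Packages\LoadFactBudget.dtsx" /CONF "C:\Config\Production.dtsConfig"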

Here's a Brian Knight video on SSIS Configurations: http://www.jumpstarttv.com/Media.aspx?vid=202

Here’s the MSDN article covering dtexec options: http://msdn.microsoft.com/en-us/library/ms162810(SQL.90).aspx

Eric pointed out that Metadata Issues arise frequently in SSIS dataflow tasks. Fixing them is not particularly elegant. Here’s a link describing the errors you can encounter and the solution: http://followtheheard.blogspot.com/2007/10/ssis-external-metadata-refresh.html

Terry did not disappoint. His reputation for the arcane was reinforced when he mentioned IMEX=1. This can be added to Excel connection strings to control how columns containing mixed data types are interpreted (values are read as text). Clear as mud? This may help: http://support.microsoft.com/default.aspx?scid=kb;en-us;194124
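For instance, a Jet connection string with the setting applied looks like this (the file path is made up):

    Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\Data\Budget.xls;Extended Properties="Excel 8.0;HDR=YES;IMEX=1";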

SSIS Data Flows require a destination, which can be problematic when you’re on a deadline and don’t have control of the database environment. Konesans has a useful free component you can download here: http://www.konesans.com/trashdest.aspx

Script Task! I believe it was Kenneth who mentioned how painful they can be. You’d think a Script Task would lend itself to great Visual Studio debugging integration. Sadly, you’d be wrong. Here’s the MSDN breakdown: http://msdn.microsoft.com/en-us/library/ms140033.aspx
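One common fallback, sketched here rather than taken from the linked article, is to raise information events from inside the script so progress shows up in the designer's Progress and Output windows (the checkpoint message is made up):

    ' VB.NET inside an SSIS Script Task
    Dim fireAgain As Boolean = True
    Dts.Events.FireInformation(0, "ScriptTask", _
        "Reached checkpoint: lookup cache loaded", String.Empty, 0, fireAgain)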
