Analyzing the XML of SSIS Packages
What is SSIS?
Migrating data is generally done with a process known as an ETL (Extract, transform, and load). This involves ingesting data from one or more source(s), converting it to the desired format and finally sending it to a destination that is often a data warehouse.
An SSIS project is broken down into smaller units of work called packages that are executed individually. Packages are typically organized and grouped according to their destination, such as the table they are loading data into, rather than their sources. This is because a package may create all of the necessary output for a single table in a data warehouse while coming from multiple sources.
Different parts of an SSIS package include:
- Connection(s): sources and destinations such as databases or files.
- Data Flow(s): containers used for modifying data that are composed of smaller components that can input, output, and alter data.
- Control Flow(s): containers that do not directly modify any data, but manage the flow and organization of a package. One example of a control flow is iterating through multiple spreadsheets that are then sent to a data flow task.
Why Access the XML of an SSIS Package?
Why would anyone want to look through the XML of an SSIS package when there is a perfectly fine user interface in Visual Studio to access every part of it? Nearly anything you do with the XML can be done through the package designer, but this must be done to each package individually. This can become very time consuming, as most ETL projects involve hundreds of packages. By directly accessing the XML, you can programmatically iterate through each package to make add changes or even generate documentation.
XML can be parsed into data mappings for documentation to show where source data is being exported. This normally is done prior to creating an ETL, but it is also helpful after to review that an existing ETL is working correctly.
You can also utilize the XML of the packages to add changes. This is high-risk, as packages are very easy to break if the syntax is not perfect and Visual Studio does not produce helpful error codes when the XML is corrupt or includes mistakes.
Basic XML Structure
The XML formatting of an SSIS package follows the basic logic and organization of how the package design looks in Visual Studio.
Each package can be broken down into Connection Managers, Control Flow Tasks, and Data Flow Tasks. Connection Managers represent any database, file, etc. used as either a source or destination for data in a package. Data Flow and Control Flow Tasks are nested inside “Executables.”
As you can see in the Data Flow in the above XML, there is a grouping of components under “DTS:ObjectData”. These represent each of the tools that are doing work on our data, including the data sources, destinations, and the “Union All” tool.
Once these components are expanded, you can access any specific information about them. This includes their connections, properties, and data.
Let’s take a look at a closer look at an OLE DB Source, in this case, the source titled “UserB Source.”
This component contains properties, connections, and outputs.
The properties contain useful information about the component, such as the “AccessMode”, which signifies how data is being imported. There are multiple ways to import data, the most common of which is to use a SQL statement or by selecting a table or view from a dropdown. This is referred to as “opening” a rowset. The “AccessMode” value denotes each of these methods as 2 and 0 respectively. The mode the component is using is important because it determines what the output sections actually represent.
Consider example 1 shown above, where the access mode is set to 0 and uses a rowset. The “externalMetadataColumns” under “outputs” represent the actual names of the columns from the table imported under the “OpenRowset” value in properties and the “outputColumns” represent the names of the same columns and any changes to them.
Now consider example 2 shown above, where the access mode is set to 2 and uses a SQL statement as a source. In this example, the name of the imported table is in the statement under the “SqlCommand” property and “externalMetadataColumns” may not accurately represent the original column names. In this case, you would have to parse the SQL statement to retrieve original column names because column names from the select statement could have been renamed or possibly be new derived columns. This can become a daunting task as the complexity of a SQL statement, especially when there are tables being joined in the statement.
Once you understand the basic formatting of the XML files that SSIS Packages are composed of, you should be able to iterate through each package and parse them using anything from C# to PowerShell!