Introduction to SAP Business Objects Data Integrator ETL tool
SAP Business Objects Data Integrator XI R2/R3
The purpose of this presentation is to familiarize the team with some of the features of Business Objects Data Integrator (DI). The presentation also aims to provide some insight into how the technology can affect the design and implementation of ETL processes.
While at a high level it is best that an ETL architecture be technology agnostic, the physical implementation stands to benefit from being designed to take advantage of the features the technology provides.
• A project is an overarching container that can hold several jobs within it; it is a way of logically grouping jobs together.
• A job is a complete, end-to-end, independent process. A hierarchy can be built in to determine when a job runs; however, jobs are designed to be independent units.
• A job is the most atomic execution unit, i.e. only jobs can be executed.
• What delineates a job from a project:
– A job actually performs some work; a project, on the other hand, is merely a container.
• A workflow is used to control the order in which the data is processed.
• Workflows are used to logically and physically structure the job so that it runs most efficiently (this helps get around the 2 GB memory limit).
• Looping constructs can be wrapped around a workflow. If necessary, these looping constructs can be used both to break the data down into smaller, more manageable units and to build a logical decision tree based on parameters set within variables (see the sketch below).
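To make the looping idea concrete, here is a minimal Python sketch of the same pattern: a variable drives which slice of the data each pass of the loop processes, keeping each unit small. This is only an analogy to a DI while-loop wrapped around a workflow; the partition keys and the $G_PARTITION-style variable are hypothetical, not DI script.

```python
# Minimal analogy (Python, not DI script) of a loop wrapped around a
# workflow: each pass processes one manageable slice of the data.

PARTITIONS = ["2008Q1", "2008Q2", "2008Q3", "2008Q4"]  # hypothetical keys

def run_workflow(partition_key):
    # Stand-in for the workflow body; in DI a global variable such as
    # $G_PARTITION would be read by the dataflows inside the workflow.
    print(f"processing partition {partition_key}")

i = 0
while i < len(PARTITIONS):        # DI analogue: a While Loop object
    run_workflow(PARTITIONS[i])   # one small, manageable unit per pass
    i += 1
```

The same loop-and-variable shape also supports the decision-tree use: instead of iterating over partitions, the variable checked on each pass steers which workflow branch executes.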
• A dataflow is a discrete/atomic unit of work.
• It is the physical implementation of the job.
• All of the data processing work happens within the context of a dataflow.
• All of the transforms exist within the context of a dataflow.
• It is the smallest reusable object.
– When reusing a dataflow, though, remember that if you change it in one place you change it everywhere, just like C++ objects. So it is often better to duplicate the dataflow.
• You can embed a dataflow within a workflow for the purpose of controlling the logical path the data takes through the job.
• Query Transform
– The most commonly used transform; it handles column mappings, joins, filters, and ordering.
• History Preserving Transform
– Used to populate Type 2 history-preserving (slowly changing dimension) tables.
• Hierarchy Flattening transform.
– Used to turn a vertical hierarchy into a flattened horizontal hierarchy.
– Can deal with a hierarchy of indeterminate depth
– Not the best way to deal with a ragged hierarchy: you are very likely to end up with a sparse table, and you will need to deal with the sparsity caused by the ragged nature of the hierarchy; the level of sparsity will be determined by the level of raggedness. Extremely ragged vertical hierarchies do not lend themselves to being flattened horizontally, and you may be better off normalizing the raggedness in a vertical hierarchy. (See the flattening sketch after this list.)
• Case Transform
– Splits rows across multiple output paths based on conditional expressions, like a CASE statement.
• Lookup function
– Very powerful function that can be used to look up keys when populating fact tables, or to get other lookup values.
• Table Comparison
– Compares two tables; it is best to sort the input to speed up processing.
• Map Operation Transform
– Maps row operation codes from one type to another (e.g. maps updates to inserts).
• Merge Transform
• XML Pipeline Transform
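To make the hierarchy-flattening idea concrete, here is a minimal Python sketch (hypothetical data, not DI's Hierarchy Flattening implementation) that turns a vertical parent-child hierarchy of indeterminate depth into fixed horizontal level columns. Note how the shallow branch produces the padding, i.e. the sparsity the ragged-hierarchy caveat above warns about.

```python
# Minimal sketch: flatten a vertical (parent/child) hierarchy into
# horizontal level columns. Hypothetical data, not DI's implementation.

PARENT = {                 # child -> parent; None marks the root
    "World": None,
    "Europe": "World",
    "Germany": "Europe",
    "Berlin": "Germany",
    "Antarctica": "World", # ragged branch: much shallower than the others
}

def path_from_root(node):
    """Walk up the vertical hierarchy, then return the root-first path."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    path.reverse()
    return path

max_depth = max(len(path_from_root(n)) for n in PARENT)

for node in PARENT:
    levels = path_from_root(node)
    # Pad shallow (ragged) branches with None -- this padding is the
    # sparsity that flattening a ragged hierarchy produces.
    print(levels + [None] * (max_depth - len(levels)))
```

Because the loop simply walks until it reaches the root, the depth does not need to be known in advance, which mirrors the "indeterminate depth" point above.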
• In order to facilitate group development, some versions of DI (mainly Enterprise) come bundled with a central repository. The central repository has a check-in/check-out feature that allows developers to check code in and out. You can have several central repositories: typically a dev central repo, a test central repo, and a prod central repo.
• The central repository also allows for limited version control; however, for it to be useful, developers need to be proactive about labeling each version of the job correctly.
• DI also comes bundled with a very rudimentary scheduler. DI jobs can be called from an external scheduler if necessary; the process is similar to the one we use for Autosys at a client site.
• Since the scheduler provided with DI does not provide any job-dependency capability, you have to build it yourself. The complexity of the job-dependency structures will be determined by the scope and complexity of the project. The way to accomplish job-dependency control is through control tables and job-hierarchy tables: every job gets a script wrapper that reads these control tables to see whether it is clear to execute, and the script also populates the tables with the relevant control information both before and after the job has run (see the sketch below).
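As a rough illustration of that wrapper pattern, the Python sketch below checks a job-hierarchy table for unmet predecessors before running, and writes status rows before and after execution. The table and column names (job_control, job_hierarchy, etc.) are hypothetical, not DI artifacts.

```python
# Hypothetical sketch of a control-table job wrapper; table and column
# names are illustrative, not part of DI.
import sqlite3
from datetime import datetime

def clear_to_run(conn, job):
    """A job may start only when every predecessor listed in the
    job-hierarchy table has a SUCCEEDED row in the control table."""
    unmet = conn.execute(
        """SELECT COUNT(*) FROM job_hierarchy h
           LEFT JOIN job_control c
             ON c.job_name = h.predecessor AND c.status = 'SUCCEEDED'
           WHERE h.job_name = ? AND c.job_name IS NULL""",
        (job,),
    ).fetchone()[0]
    return unmet == 0

def run_job(conn, job):
    if not clear_to_run(conn, job):
        print(f"{job}: predecessors not finished, not clear to execute")
        return
    # Record control information before the job runs ...
    conn.execute("INSERT INTO job_control VALUES (?, 'RUNNING', ?)",
                 (job, datetime.now().isoformat()))
    # ... the actual DI job would execute here ...
    # ... and record control information after it finishes.
    conn.execute("UPDATE job_control SET status = 'SUCCEEDED' "
                 "WHERE job_name = ? AND status = 'RUNNING'", (job,))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript("""
  CREATE TABLE job_hierarchy (job_name TEXT, predecessor TEXT);
  CREATE TABLE job_control  (job_name TEXT, status TEXT, started_at TEXT);
  INSERT INTO job_hierarchy VALUES ('load_facts', 'load_dims');
""")
run_job(conn, "load_facts")  # skipped: 'load_dims' has not succeeded yet
```

In DI terms, the body of run_job corresponds to the script objects placed before and after the job's workflows; the dependency data itself lives in ordinary database tables that any job (or an external scheduler) can inspect.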
• SAP has recently started to sell an MDM product that will have DI bundled with it.