Tallan's Technology Blog

Tallan's Top Technologists Share Their Thoughts on Today's Technology Challenges

Introduction to SAP Business Objects Data Integrator ETL tool

sfernandes

SAP Business Objects Data Integrator XI R2/R3

The purpose of this presentation is to familiarize the team with some of the features of Business Objects Data Integrator (DI). The presentation is also aimed at providing some insight into how technology can affect the design and implementation of the ETL processes.

While at a high level it is best that an ETL architecture be technology agnostic, the physical implementation can stand to benefit by being designed to take advantage of the features provided by the technology.

Structure:

•       Project

•       Job

•       Workflow

•       Dataflow

Project

•       Overarching container that can hold several jobs within it. A project is a way of logically grouping jobs together.

Job

•       A job is a complete, end-to-end, independent process. A hierarchy can be built to determine when a job runs; however, jobs are designed to be independent units.

•       A job is the most atomic execution unit, i.e., only jobs can be executed.

•       What delineates a job from a project?

–      A job actually performs some work; a project, on the other hand, is merely a container.

Workflow

•       A workflow is used to control the order in which the data is processed.

•       They are used to logically and physically structure the job so that it runs most efficiently (this helps get around the 2 GB memory limit).

•       There are looping constructs that can be wrapped around a workflow. If necessary, these constructs can be used both to break the data down into smaller, more manageable units and to build a logical decision tree based on parameters set within variables (the chunking idea is sketched below).
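
Outside of DI, the chunking idea is easy to picture. Below is a minimal Python sketch of a loop that processes a large extract in fixed-size slices so no single pass blows past a memory budget; the table name, columns, and chunk size are made up for illustration and are not part of DI.

```python
import sqlite3

CHUNK_SIZE = 50_000  # rows per pass; tune this to stay well under the memory ceiling

def process_in_chunks(conn: sqlite3.Connection) -> None:
    """Loop over a large source table in fixed-size slices, mimicking a
    workflow that wraps a dataflow in a looping construct."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT id, amount FROM source_orders ORDER BY id LIMIT ? OFFSET ?",
            (CHUNK_SIZE, offset),
        ).fetchall()
        if not rows:
            break  # no more data: the loop's exit condition
        # ... transform and load this slice here (the "dataflow" step) ...
        offset += len(rows)
```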

Dataflow

•       A dataflow is a discrete/atomic unit of work.

•       It is the physical implementation of the job's work.

•       All of the data processing work happens within the context of a dataflow.

•       All of the transforms exist within the context of a dataflow.

•       It is the smallest reusable object.

–      When reusing a dataflow, though, remember that if you change it in one place you change it everywhere, much like a shared C++ object. It is often better to duplicate the dataflow instead.

•       You can embed a workflow within a dataflow for the purpose of controlling the logical path the data takes through the dataflow.
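
As a rough mental model of the containment hierarchy described above, the pieces nest like simple containers. These are plain Python classes for illustration only, not the actual DI object model.

```python
from dataclasses import dataclass, field

@dataclass
class Dataflow:              # smallest reusable unit; where transforms run
    name: str

@dataclass
class Workflow:              # orders and groups dataflows
    name: str
    dataflows: list[Dataflow] = field(default_factory=list)

@dataclass
class Job:                   # the only executable unit
    name: str
    workflows: list[Workflow] = field(default_factory=list)

@dataclass
class Project:               # a logical grouping of jobs; never executed itself
    name: str
    jobs: list[Job] = field(default_factory=list)
```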

 

Transforms

  •       Query Transform

–       Most commonly used transform

  •       History Preserving Transform

–      Used to populate a Type 2 (history-preserving) table.
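
Conceptually, Type 2 history preservation closes out the current version of a row and opens a new one whenever a tracked attribute changes. The sketch below shows that idea in plain Python; the column names (effective_from, effective_to, is_current) are illustrative, not what the transform actually generates.

```python
from datetime import date

def apply_type2_change(current_row: dict, incoming_row: dict, as_of: date) -> list[dict]:
    """Return the rows to write when a tracked attribute changes:
    the old version closed out, plus a new current version."""
    closed = {**current_row, "effective_to": as_of, "is_current": False}
    opened = {**incoming_row, "effective_from": as_of, "effective_to": None, "is_current": True}
    return [closed, opened]
```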

•       Hierarchy Flattening transform.

–      Used to turn a vertical hierarchy into a flattened horizontal hierarchy.

–      Can deal with a hierarchy of indeterminate depth.

–      Not the best way to deal with a ragged hierarchy: you are very likely to end up with a sparse table, and you will need to deal with that sparsity; the level of sparsity is determined by how ragged the hierarchy is. Extremely ragged vertical hierarchies do not lend themselves to being flattened horizontally, and you may be better off normalizing the raggedness in a vertical hierarchy (a toy sketch of flattening follows).
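
For intuition, flattening a vertical (parent-child) hierarchy of unknown depth can be sketched as a simple walk up the tree; ragged branches just produce shorter rows, which is where the sparsity comes from. This is a toy illustration, not the transform's implementation.

```python
from typing import Optional

def flatten_hierarchy(parent_of: dict[str, Optional[str]]) -> list[list[str]]:
    """Turn a parent-child mapping (child -> parent, None for the root)
    into one row per node listing the path from the root down to that node."""
    rows = []
    for node in parent_of:
        path = [node]
        while parent_of.get(path[-1]) is not None:
            path.append(parent_of[path[-1]])
        rows.append(list(reversed(path)))  # root ... node; ragged branches give shorter rows
    return rows

# A ragged example: one branch goes two levels deep, the other only one
print(flatten_hierarchy({"World": None, "US": "World", "NY": "US", "EU": "World"}))
# [['World'], ['World', 'US'], ['World', 'US', 'NY'], ['World', 'EU']]
```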

 

•       Case Transform

–      Works like a case statement, routing rows to different outputs based on conditions.

•       Lookup_ext

–      A very powerful function that can be used to look up keys when populating fact tables, or to retrieve other lookup values (a rough equivalent is sketched below).
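
The key-lookup use case is essentially a keyed join with a default value for misses. A rough Python equivalent is sketched below; the fact and dimension names are invented for the example, and the -1 "unknown" key is just a common convention.

```python
UNKNOWN_KEY = -1  # surrogate key reserved for natural keys that are not found

def attach_customer_key(fact_rows: list[dict], customer_dim: dict[str, int]) -> list[dict]:
    """Resolve each fact row's natural key to the dimension's surrogate key,
    falling back to a default when there is no match."""
    for row in fact_rows:
        row["customer_key"] = customer_dim.get(row["customer_id"], UNKNOWN_KEY)
    return fact_rows

facts = [{"customer_id": "C100", "amount": 25.0}, {"customer_id": "C999", "amount": 9.5}]
dim = {"C100": 1, "C200": 2}
print(attach_customer_key(facts, dim))  # C100 -> key 1, C999 -> -1 (unknown member)
```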

•       Table Comparison

–      Compares two tables; it is best to sort the input to speed up processing (see the sketch below).
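
The benefit of sorted input is easier to see in a sketch: with both sides ordered by the comparison key, the check becomes a single merge pass instead of a lookup per row. This is a simplified illustration, not the transform's actual algorithm.

```python
def compare_sorted(source: list[tuple], target: list[tuple]):
    """Merge-compare two lists of (key, value) rows, both sorted by key,
    and classify each source row as an insert, an update, or unchanged."""
    inserts, updates, unchanged = [], [], []
    i = 0
    for key, value in source:
        while i < len(target) and target[i][0] < key:
            i += 1                                   # skip target rows with smaller keys
        if i < len(target) and target[i][0] == key:
            (updates if target[i][1] != value else unchanged).append((key, value))
        else:
            inserts.append((key, value))
    return inserts, updates, unchanged

print(compare_sorted([(1, "a"), (2, "b"), (4, "d")], [(1, "a"), (2, "x"), (3, "c")]))
# ([(4, 'd')], [(2, 'b')], [(1, 'a')])
```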

•       Map Transform

–      Maps updates to other row operations (for example, inserts or deletes).

•       Merge Transform

•       XML Pipeline

Misc Notes

•       To facilitate group development, some versions of DI (mainly Enterprise) come bundled with a central repository. The central repo has a check-in/check-out feature that allows developers to check code in and out. You can have several central repositories: typically a dev central repo, a test central repo, and a prod central repo.

•       The central repository also allows for limited version control; however, for it to be useful, developers need to be proactive about labeling each version of the job correctly.

•       DI also comes bundled with a very rudimentary scheduler. DI jobs can be called from an external scheduler if necessary; the process is similar to the one we use for AutoSys at a client site.

•       Since the scheduler provided with DI does not offer any job dependency capability, you have to build it yourself. The complexity of the job dependency structures will be determined by the scope and complexity of the project. The way to accomplish job dependency control is through the use of control tables and job hierarchy tables. Every job then has a script wrapper within it that reads these control tables to see whether it is clear to execute; the script also needs to populate the tables with the relevant control information both before and after the job has run (a bare-bones sketch of this wrapper follows).
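
A bare-bones version of that wrapper pattern is sketched below in Python against a SQLite control table. The table layout, status values, and job names are assumptions for the example, not something DI provides.

```python
import sqlite3
from datetime import datetime

def run_with_control(conn: sqlite3.Connection, job_name: str, depends_on: list[str]) -> None:
    """Script-wrapper pattern: run only if all predecessor jobs have succeeded,
    and record our own start and end status in the control table."""
    if depends_on:
        placeholders = ",".join("?" for _ in depends_on)
        done = conn.execute(
            f"SELECT COUNT(DISTINCT job_name) FROM job_control "
            f"WHERE job_name IN ({placeholders}) AND status = 'SUCCEEDED'",
            depends_on,
        ).fetchone()[0]
        if done < len(depends_on):
            print(f"{job_name}: predecessors not finished, not clear to execute")
            return

    conn.execute(
        "INSERT INTO job_control (job_name, status, run_at) VALUES (?, 'RUNNING', ?)",
        (job_name, datetime.now().isoformat()),
    )
    # ... launch the actual DI job here (e.g. via the command line or a web service call) ...
    conn.execute(
        "UPDATE job_control SET status = 'SUCCEEDED' WHERE job_name = ? AND status = 'RUNNING'",
        (job_name,),
    )
    conn.commit()
```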

•       SAP has recently started to sell an MDM product which will have DI bundled with it.

 

 

               


Tags: ETL, Integration

4 Comments

One thing I did not mention in my presentation or my post is that using a tool provides you with an additional benefit: it makes your ETL jobs database-platform agnostic. This allows you to move your code from one database platform to another with minimal changes; typically the only change necessary is to the connection information. I have actually worked on a project where the test environment had a SQL Server backend and production had an Oracle backend. I was able to move my code from test to production by simply changing the connection information. The caveat, though, is that if you choose to write your own stored procs or SQL, you will need to rewrite them when you move to another platform, as DI will not translate this kind of code.

Really practical presentation. I know it's really variable, but I was wondering how long it would take to develop an ETL using BODI to load a 15-column table from formatted flat files, given there's no data transformation needed? I know it's really variable, but I would like a rough estimate.
Thank you!

So will this work the same way in the cloud with the data integrator? I realize that your post is rather old, but I was wondering how much the ETL process would change if you wanted to do this in the cloud. Do you know?

I really like your presentation. If you have any ETL framework document for this tool, could you please share it with me?
