ETL Process Control Architecture
Data volumes are growing at incredible rates across many IT departments. Businesses face new problems every day, and there is constant demand on IT to be flexible and to react to the many issues facing the business. Processes often become obsolete or defunct because they cannot keep up with the rate of change and the increasing demands of the business. In some cases, users lose confidence in the systems and seek alternative methods for delivering reports and data in a timely fashion.
A centralized, standard Process Control Environment provides an inexpensive and efficient auditing mechanism for new and existing data migration processes, whether they run occasionally or very frequently. It enables an IT department to become proactive and aware of issues well before the end users are. The department can then correct the issue within the time specified in the Service Level Agreement, or send a notification or alert to its users, without fear of losing the users' confidence in the data or in the overall performance of the IT department.
A Process Control Environment (PCE) can be a data mart built on a simple star schema in which processes are required to log their activity, typically through "wrapper" scripts that are unobtrusive and easy to add to new or existing processes. Because the wrappers do not alter process logic, they eliminate the need for lengthy (and costly) regression testing. A front-end reporting system can be built on top of this structure, giving the IT group drill-down and other mining capabilities from a single, centralized repository.
Alerts and notifications can be sent out to the IT staff (or directly to the end users) by querying this structure, with no impact on the resources of the existing processes and applications. Because processes log their activity in real time, the structure is always up to date without the need for additional batch processing.
The Process Control Structure would consist of a single fact table which processes log to. This fact table would contain basic information about the Job, Job Type, Job Steward, and Job Status, the start and end times, and aggregated counts of rows processed successfully, with warnings, or with errors. The model is scalable to the point where individual errors can also be logged and tracked for further analysis, tied back to a unique process ID. The unique process ID allows an auditor to view all of the details associated with the process, provides as much historical information as possible, and allows a developer to retrace the process and view any internal or external conditions that may have existed at the time of execution.
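As a concrete sketch of such a fact table (the table and column names below are illustrative assumptions, not a prescribed standard), the core log table might be defined as follows, shown here with SQLite via Python:

```python
import sqlite3

# Illustrative DDL for the central process control fact table.
# All names here are hypothetical; adapt them to local conventions.
DDL = """
CREATE TABLE process_log (
    process_id   INTEGER PRIMARY KEY,   -- unique ID per execution
    job_key      INTEGER NOT NULL,      -- FK to the Job dimension
    type_key     INTEGER NOT NULL,      -- FK to the Job Type dimension
    steward_key  INTEGER NOT NULL,      -- FK to the Job Steward dimension
    status_key   INTEGER NOT NULL,      -- FK to the Job Status dimension
    start_time   TEXT NOT NULL,
    end_time     TEXT,                  -- NULL while the job is running
    rows_ok      INTEGER DEFAULT 0,     -- rows processed successfully
    rows_warning INTEGER DEFAULT 0,     -- rows processed with warnings
    rows_error   INTEGER DEFAULT 0      -- rows rejected with errors
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)
cols = [row[1] for row in conn.execute("PRAGMA table_info(process_log)")]
print(cols)
```

A NULL end_time doubles as a simple "still running" indicator, which an alerting query can compare against the expected run window.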
Design and developer details are pre-loaded at the time a job is introduced into a production environment, minimizing any impact on resources at the time of execution.
Example Process Control Data Model
Viewing Process Control Data
The tables in the Process Control model can be pre-joined in a view to provide a summary of the available log information. These views can be modified to select key performance items, which can then be loaded into a cube for analysis and reporting. An example report would look similar to the following:
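A minimal sketch of such a pre-joined view, using the same illustrative names as before (all table, column, and view names are assumptions), could look like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job (job_key INTEGER PRIMARY KEY, job_name TEXT);
CREATE TABLE status (status_key INTEGER PRIMARY KEY, status_name TEXT);
CREATE TABLE process_log (
    process_id INTEGER PRIMARY KEY,
    job_key INTEGER, status_key INTEGER,
    start_time TEXT, end_time TEXT, rows_ok INTEGER);

-- Pre-joined summary view over the fact and dimension tables,
-- resolving surrogate keys into readable names for reporting.
CREATE VIEW v_process_summary AS
SELECT p.process_id, j.job_name, s.status_name,
       p.start_time, p.end_time, p.rows_ok
FROM process_log p
JOIN job j    ON j.job_key    = p.job_key
JOIN status s ON s.status_key = p.status_key;
""")

conn.execute("INSERT INTO job VALUES (1, 'nightly_load')")
conn.execute("INSERT INTO status VALUES (1, 'Success')")
conn.execute(
    "INSERT INTO process_log VALUES (100, 1, 1, "
    "'2010-01-01 02:00', '2010-01-01 02:15', 5000)")

summary = conn.execute(
    "SELECT job_name, status_name, rows_ok FROM v_process_summary").fetchone()
print(summary)  # ('nightly_load', 'Success', 5000)
```

Because the view is just a stored query against the live fact table, reports built on it reflect process activity without any extra load step.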
Once the Process Control tables are created and the dimensions are properly seeded, all the developers need to do is implement the logic in each of their jobs so that the jobs log themselves at the start of each run. This is best implemented as standard wrapper scripts added to the beginning and end of each process. The approach is unobtrusive to the normal process flow of the application and requires minimal modification, beyond setting the variables defined at the start of the wrapper.
Pre Processing Wrapper Script
The pre-processing wrapper script should be smart enough to detect whether the appropriate entries already exist in the dimension tables. If not, it should create them automatically so the process does not fail. This also simplifies promotion to a production environment, as no additional pre-loading of the dimension tables is required. Every process that runs in production will be logged, provided the developer has set the initial Job, Type, and Status variables accordingly.
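The auto-seeding behavior described above can be sketched as a "get or create" lookup against each dimension, followed by the insert of the log row itself. Function, table, and column names here are hypothetical:

```python
import sqlite3
from datetime import datetime, timezone

def get_or_create(conn, table, key_col, name_col, name):
    """Return the surrogate key for `name`, inserting a new
    dimension row automatically if it does not exist yet."""
    row = conn.execute(
        f"SELECT {key_col} FROM {table} WHERE {name_col} = ?", (name,)
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute(
        f"INSERT INTO {table} ({name_col}) VALUES (?)", (name,))
    return cur.lastrowid

def log_start(conn, job_name, job_type):
    """Pre-processing wrapper: seed dimensions if needed and open
    a log row; returns the unique process_id for the post wrapper."""
    job_key = get_or_create(conn, "job", "job_key", "job_name", job_name)
    type_key = get_or_create(conn, "job_type", "type_key", "type_name", job_type)
    cur = conn.execute(
        "INSERT INTO process_log (job_key, type_key, start_time) "
        "VALUES (?, ?, ?)",
        (job_key, type_key, datetime.now(timezone.utc).isoformat()))
    conn.commit()
    return cur.lastrowid

# Minimal demonstration against empty dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job (job_key INTEGER PRIMARY KEY, job_name TEXT);
CREATE TABLE job_type (type_key INTEGER PRIMARY KEY, type_name TEXT);
CREATE TABLE process_log (
    process_id INTEGER PRIMARY KEY,
    job_key INTEGER, type_key INTEGER,
    start_time TEXT, end_time TEXT);
""")
pid = log_start(conn, "nightly_load", "batch")
seeded = conn.execute("SELECT job_name FROM job").fetchone()[0]
```

Note that the job was never pre-loaded into the dimension table; the wrapper created it on first use, which is exactly what makes go-live painless.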
Post Processing Wrapper Script
The post-processing wrapper script updates the end time of the job, and calculates or queries any row counts from the source and/or target tables as required.
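A sketch of the post-processing side, again with illustrative names, simply closes out the log row opened by the pre-processing wrapper:

```python
import sqlite3
from datetime import datetime, timezone

def log_end(conn, process_id, status_key,
            rows_ok=0, rows_warning=0, rows_error=0):
    """Post-processing wrapper: stamp the end time and record the
    final status and row counts for the given process_id."""
    conn.execute(
        "UPDATE process_log "
        "SET end_time = ?, status_key = ?, "
        "    rows_ok = ?, rows_warning = ?, rows_error = ? "
        "WHERE process_id = ?",
        (datetime.now(timezone.utc).isoformat(), status_key,
         rows_ok, rows_warning, rows_error, process_id))
    conn.commit()

# Minimal demonstration: close out a previously opened log row.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE process_log (
    process_id INTEGER PRIMARY KEY, status_key INTEGER,
    start_time TEXT, end_time TEXT,
    rows_ok INTEGER, rows_warning INTEGER, rows_error INTEGER)
""")
conn.execute(
    "INSERT INTO process_log (process_id, start_time) "
    "VALUES (1, '2010-01-01T02:00:00')")
log_end(conn, 1, status_key=2, rows_ok=5000, rows_warning=3)
result = conn.execute(
    "SELECT status_key, rows_ok, rows_warning "
    "FROM process_log WHERE process_id = 1").fetchone()
```

The row counts themselves would typically come from the ETL tool's own statistics, or from count queries against the source and target tables.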
Tallan Consulting Internal Resources
Code samples, templates, and example Process Control implementations are available to Tallan consultants on the Tallan intranet, under Blog Presentations in the Business Intelligence portal.