Loading XML via EIM

We recently came across a requirement to consume large volumes of data from a system that published a text feed of individually formatted XML “records”. It’s an odd integration mechanism, involving a text based feed of header blocks and individual XML:

There could be in excess of 10 million records published in this way – crazy, I know, but in the Enterprise world you do come across some legacy mechanisms that were built for some purpose or other, often long forgotten.

Anyway, my goal was to consume this data en mass into Siebel.

Looking at the format of what we’re consuming, my first thought was how to get at the structured data. Header lines in text files fill me with dread – as does the use of arbitrary delimiters to separate records. XML is a wonderfully descriptive language, ideally suited to describe payloads and define relationships. A quick Perl script later and I have a list of records in a format that I’m comfortable with.

Working with XML? Got to be EAI, right? Well, not when you’re dealing with millions of records it isn’t. We all know that EIM is the mechanism of choice for bulk loading data, but how do we get from structured XML to EIM staging tables? The answer that I found lies in Microsoft’s Integration Services technology.

SQL Server Integration Services (SSIS) is an integration technology that allows you to extract, transform and load from one data format to another. It used to be called Data Transformation Services (DTS), if you’re old enough to remember. As it is, SSIS allows the specification of an XML file, backed up by an XSD, as input; we can then define either an Oracle schema as output (mapping directly to EIM, if you want to get straight to the point) or to automatically generated MS SQL Server staging tables. I went for the latter, as it gives you an ability to do some transformation and / or reconciliation before loading in to EIM.

Creating a process flow to extract from XML to a structures table schema is surprisingly straightforward. Here’s how I achieved it using MS SQL Server 2014:

  1. Create a new Execute SQL task in your Control Flow. Call this “Truncate Tables” and leave it empty for now
  2. Create a new Data Flow object in the Control Flow. Name this “Load Staging from XML” and double click for more detail
    ControlFlow
  3. Drag an XML source into the flow
  4. Double click this and select an XSD to determine the data types, size and cardinality for the input source. SSIS will ensure that data flowing through the component is validated against the XSD.
  5. Select an XML file for input
  6. Now drag an OLE DB Destination component into the flow. Left click and give it a sensible name, based on the XML node data that it will consume
  7. Drag an output from the XML input into the OLE DB Destination – this will conveniently inform the component of the XSD schema for the input data. Double click, click “New” for the “Name of table or the view”
  8. SISS will automatically create a table schema based on the XSD that is being consumed – great stuff:
    Map
  9. Click the “Mappings” item and SISS will automatically map the XSD schema to your new table
  10. Repeat the above for all node levels of your XSD:DataFlow
  11. Now revisit your Control flow and add some SQL to truncate the tables created as part of your mapping

The beauty of this process is that SISS will maintain the referential integrity of your XML as it maps it out to the table schema. So, if you’ve got “ParentNode” with one or more “ChildNode” records, the process will generate a primary key for the parent and use this as a foreign key in the child table, via a column called “ParentNodeId”. This can then be used when mapping your staging tables out to EIM.

SISS is an excellent ETL tool, if you’re licensed to use it. Mapping large XML data sources out to a structured schema is the first step in getting data into EIM – you can even bypass the staging altogether and map XML directly.

Do you have experience with other ETL tools? Post below and share your experience!

Talend Open Studio and EIM

I discovered an amazing Open Source ETL tool the other day called Talend Open Studio. We’re working on a mini project at the moment to bring Contact and Account data in from Excel spreadsheets and thought this a good chance to try it out.

The tool itself is very intuitive, being based on the well proven Eclipse platform and offers a drag and drop graphical interface for constructing business processes and jobs.

It’s perfectly suited to loading EIM tables: simply set up source and destination metadata (data source) entries for your input data and Siebel database then bring in the EIM schemas for the destination tables that you’re interested in. A number of components can be dragged and linked between the two to do any number of mapping and transformation steps. If you’re familiar with Actuate in the Siebel context, you’ll have a real head start as the concept of ‘rows’ and their flow through the components is key. A little sample of one of the jobs I’m currently working on is below:

Projects can be created in both Java and Perl flavours: Java being somewhat more familiar to me but Perl sneaks ahead with performance and general scalability. A limitation in Java prevents overly complex jobs which made Perl the only choice on this occasion. Mapping expressions can be used, in the language of your choosing, to perform all sorts of complex transformations. You can even create your own functions and components, if you need to do something particular fancy. I found the mapping component (tMap) pretty much suited to everything I needed on this occasion.

I suggest downloading a copy, setting up a local test project and having a mess about – I’d be really interested to hear from anyone who finds the tool useful. I’m considering developing a suit of Siebel components – the tool currently supports Sugar CRM, SAP and a number of others out of the box – and a community approach to this would be excellent.