poor man's ETL using apache camel
If you haven't read about apache camel; here is the short introduction. You can also get free chapter of "Camel in Action" book. Buy it if you are working on enterprise integration project and using java/scala.
Problem: OLTP system's database schema is not good for generating reports and data mining /analytics. Moving data from OLTP database to reporting schema is tricky and not straightforward. You need to
- get rid of unnecessary data
- figure out contextual grouping of data
- take care of correctness of data
- shouldn't hamper performance of source as its also used for OLTP
- graceful recovery on failure of such transformation
ETL is appropriate term for such scenarios and there are many toos available in market but they are costly(considering learning curve vs productivity),cumbersome and bloated with unnecessary features. Ok enough of rant. Lets solve the problem.
Narrowing down the problem: Am assuming source and destination are mysql databases. And this is common scenario. I'll write down how it can be extended to other source/targets.
What one needs:
- ETL Job is single job which can be repeated if it fails. It copies only the new data when it runs. It rollbacks all changes on failure
- Scheduling of such a job
- Transformation/Processing is required on source data before it is stored at destination.
Before going further read this article. So basically apache camel is routing engine. You can route information from source to destination through some processor(optional).
So here is the sample route
In nutshell 1. am firing some query on source database; 2.processing its results(using camel processor; basically it is just a pojo with utility method) 3. saving it in destination database (using jpa)
- inputSql is a spring bean registered in camel-context xml file. camel-context is regular spring beans xml file having entries for all required beans.e.g. sourceDataSource,destinationDataSource,inputSql One can create springCamelContext by passing path to camel-context.xml.
- Processor does the role of Message Translator
- Since we are using jpa in our example; we'll need persistence xml with all required entities declared for each persistence unit. You can refer above mentioned ETL article from camel website.
Now details: InputSql bean and jpa is pretty straightforward. Camel's DSL is so powerful you need not do anything other than providing your sourceQuery,destination entity, correct translation and it does its job really well under the hood.
Below is the way one can use polymorphism and inheritence for creating multiple processors as and when required.
Common processor when you are writing etl jobs for data transformation from source database to target database(jdbc-->process-->jpa) as below. You can see its receiving array of resultsets for processing.
Specific Message Translator aka Processor can be written as below.Array of resultsets can be mapped to jpa entity by passing custom rowmapper in processor implementation. See next code snippet for this.
Workflow of custom ETL
- Write a quartz job and pass the spring-config(camel-context.xml) path to it.
- Create a camel context
- optional Enable tracing for debug information
- Get the source query from quatz job datamap
- optional: Modify the query prior to job by appending where clause info so that we select only new data from source
- Add route to camel context
- Create a producer template
- Pass the source query(could be updated in preETL hook mentioned above) to producer template
- Start the camel context
- optional postETL job hook
This workflow suffices because
- source query is optimized and returns me only the new rows( using preETL hook which add where clause to select new rows)
- on any exception (camel/quartz/logic/runtime) destination remains untouched. So on failure I can restart the job and it'll copy new rows only
- Can add batch-job behavior by adding LIMIT clause in where condition of source-query. It will copy only N rows in on run and re-run if there are still some new data at the source. Running such job in batch of 5k/10k is really performant.
- How I use destination data for reporting is totally decoupled to my ETL jobs application. I can run a portal using jasper reports/BIRT or simply flot for charting purpose.
note: This post is work in progress.
