Posterous theme by Cory Watilo

poor man's ETL using apache camel

If you haven't read about apache camel; here is the short introduction. You can also get free chapter of "Camel in Action" book. Buy it if you are working on enterprise integration project and using java/scala.

Problem:  OLTP system's database schema is not good for generating reports and data mining /analytics. Moving data from OLTP database to reporting schema is tricky and not straightforward. You need to

  • get rid of unnecessary data
  • figure out contextual grouping of data 
  • take care of correctness of data
  • shouldn't hamper performance of source as its also used for OLTP
  • graceful recovery on failure of such transformation

ETL is appropriate term for such scenarios and there are many toos available in market but they are costly(considering learning curve vs productivity),cumbersome and bloated with unnecessary features. Ok enough of rant. Lets solve the problem.

Narrowing down the problem: Am assuming source and destination are mysql databases. And this is common scenario. I'll write down how it can be extended to other source/targets.

What one needs:

  • ETL Job is single job which can be repeated if it fails. It copies only the new data when it runs. It rollbacks all changes on failure
  • Scheduling of such a job
  • Transformation/Processing is required on source data before it is stored at destination.

Before going further read this article. So basically apache camel is routing engine. You can route information from source to destination through some processor(optional).

So here is the sample route 

 

In nutshell 1. am firing some query on source database; 2.processing its results(using camel processor; basically it is just a pojo with utility method) 3. saving it in destination database (using jpa)

  1. inputSql is a spring bean registered in camel-context xml file. camel-context is regular spring beans xml file having entries for all required beans.e.g. sourceDataSource,destinationDataSource,inputSql  One can create springCamelContext by passing path to camel-context.xml. 
  2. Processor does the role of Message Translator
  3. Since we are using jpa in our example; we'll need persistence xml with all required entities declared for each persistence unit. You can refer above mentioned ETL article from camel website.

Now details: InputSql bean and jpa is pretty straightforward. Camel's DSL is so powerful you need not do anything other than providing your sourceQuery,destination entity, correct translation and it does its job really well under the hood.

Below is the way one can use polymorphism and inheritence for creating multiple processors as and when required.

Common processor when you are writing etl jobs for data transformation from source database to target database(jdbc-->process-->jpa) as below. You can see its receiving array of resultsets for processing. 

Specific Message Translator aka Processor can be written as below.Array of resultsets can be mapped to jpa entity by passing custom rowmapper in processor implementation. See next code snippet for this.

Workflow of custom ETL

  • Write a quartz job and pass the spring-config(camel-context.xml) path to it.
  • Create a camel context
  • optional Enable tracing for debug information
  • Get the source query from quatz job datamap
  • optional: Modify the query prior to job by appending where clause info so that we select only new data from source
  • Add route to camel context
  • Create a producer template
  • Pass the source query(could be updated in preETL hook mentioned above) to producer template
  • Start the camel context
  • optional postETL job hook

This workflow suffices because

  1. source query is optimized and returns me only the new rows( using preETL hook which add where clause to select new rows)
  2. on any exception (camel/quartz/logic/runtime) destination remains untouched. So on failure I can restart the job and it'll copy new rows only
  3. Can add batch-job behavior by adding LIMIT clause in where condition of source-query. It will copy only N rows in on run and re-run if there are still some new data at the source. Running such job in batch of 5k/10k is really performant.
  4. How I use destination data for reporting is totally decoupled to my ETL jobs application. I can run a portal using jasper reports/BIRT or simply flot for charting purpose.

note: This post is work in progress. 

Adding audit functionality in spring mvc web applicaion

Many a times one needs to know "who" accessed some critical functionality of your web app at "what" time. Keeping record of such information is called as Auditing. There are lot of things to consider before you implement full-fledged audit-trails in your web application.

Now for a simpler scenarios you can get it done with simple annotation wherever you need "audit" record in your code as shown below. No more code!

Now lets see how it works under the hood. When a method annotated is accessed it transparently saves the audit-information. This is good use-case for AOP and implemented using aspect as below

 

Also the Auditable annotation as below. You can check the usage in the very first snippet of this post.

 

Gems from "The Mythical Man-Month" part 1

The Tar Pit

  • The Joys of the craft
    • Sheer joy of making things
    • pleasure of making things which are useful to other people
    • fascination of fashioning complex puzzles
    • joy of always learning
  • Woes of craft
    • must perform perfectly
    • others provide objectives/resources/information on which you may not have control
    • dependence on other people's programs(hence side-effects of their maldesign,incompleteness and flawed implementation)
    • tedious labor for ironing out working product( though design of grand things is fun)
    • wrong estimation on part of debugging/testing
    • obsoleteness of product when it is complete

Challenge is to avoid obsolescence of implementation with available resources. (sidenote: Whoever says programming is easy has never porgrammed a full system )

Mythical Man-Month

  • Optimism
    • Perhaps, it is merely that computers are young, programmers are younger, and the young are always optimists.
    • For the human makers of things, the incompleteness and incosistencies of our ideas become clear only during implementation
  • The Man-Month
    • Cost does indeed vary as the product of the number of men and the number of months. Progress does not.
  • Gutless estimating
    • We need to develop and publicize productivity figures, bug-incidence figures, estimating rules and so on. The whole profession can only benfit from sharing such data.
    • Adding manpower to later software project makes it later

The Surgical Team

  • The Problem
    • Sheer no of minds to be coordinated affects the cost of the effort, for a major part of the cost is communication and correcting the ill-effects of miscommunication(system debugging)
    • Dilemma: few good minds for efficiency and conceptual integrity vs. timely appearance of large systems product
  • Proposal
    • Transformation of programming "from private art to public practice"
  • How It works
    • No differences of interest
    • Unilateral settlement of differences of judgement
    • works because 1. lack of division of problem 2. superior-subordinate relationship

Aristocracy,Democracy and System Design

  • Conceptual integrity 
    • Purpose = ease of use  =  function / conceptual complexity ; this is ultimate test of system design
  • Aristocracy and Democracy
    • Where architecture tells what happens; implementation tells how it is made to happen

rolling out releases

I have been part of rolling out 1.0 of our application to end customer and this has been fulfilling experience and it brought me lot of shameful moments. Here is the summary.
  • Manage Change : This is I would mark the number 1 thing every developer should be really good at. This means you have to really look farther and figure out when this thing changes because of business requirements; how are you going to manage that.
  • Reject complexities : If you think something is complex; reject that option immediately. 99% there is easy way out. Use online forums/communities often and align your application to what seems to be most recommended approach. Business is different. Technical solutions are more rigorously ironed out. One needs to smart enough to seek help from right people/communities. And in this era I guess that's what people call crowd-sourcing.
  • Technical Debt: I am already buried deep. Trust none or delegate completely some parts of your application to your colleagues. Each person delivers artifacts their own way and that adds to technical debt of team. More rigid measures you have before it gets added to your repository; better. Cleaning mess later is itself a big task. Also everybody on your team is dev+qa combo. There is no other option.
  • Deployment : This is my favorite theme for last couple of weeks. Rinse and repeat. Outlining steps to deploy your software is challenging job if you have multiple systems and multiple point of failures. If developers don't get involved in it; they are going to screw up things in future too. Once you figured out smooth procedure; get it repeated multiple times on your test environment by somebody else on your team. That removes some of the assumptions. Your assumptions and point of failures should be on some whiteboard and make sure everybody understands it completely. Everybody I mean even you customer support team and your managers.
  • Production Issues : Things go wrong. Fixing fast and dirty is the last resort and your stakeholders should approve it. Being a dev/dev manager its not your responsibility to take that decision. Your task is to be ready with options and trade offs for each of them in plain honest way. If you screw things here for short term a** covering; you have already taken few strides back and had set the wrong example. More people are going to follow that in future. Wary of decisions taken in cave.