How to implement incremental load in Pentaho (Spoon)


I want to implement an incremental load in Pentaho. I have two tables in my OLTP database, and I want to left join them and load the result as a single table into my OLAP database. The OLTP and OLAP databases are separate MySQL databases, so there are two different database connections in Pentaho.

I want to build this ETL in Pentaho, including the incremental-load use case: once all the data has been dumped into the OLAP database, future job runs should only insert the new entries not yet present there. The first OLTP table is order, with fields id, orderdate and amount. The second is orderdet, with fields id, orderid and prodname, where orderid is a foreign key to the first table. I need to join them into a single table and load it into the OLAP database.
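The join part can be sketched in plain SQL. Here is a minimal runnable sketch using SQLite as a stand-in for the OLTP MySQL connection (table and column names are taken from the question above; the sample rows are made up). In Pentaho the SELECT would go into a "Table input" step on the OLTP connection:

```python
import sqlite3

# In-memory stand-in for the OLTP source database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "order" (id INTEGER PRIMARY KEY, orderdate TEXT, amount REAL);
    CREATE TABLE orderdet (id INTEGER PRIMARY KEY, orderid INTEGER, prodname TEXT);
    INSERT INTO "order" VALUES (1, '2024-01-05', 99.0), (2, '2024-01-06', 15.5);
    INSERT INTO orderdet VALUES (10, 1, 'widget'), (11, 1, 'gadget');
""")

# Left join: keeps orders even when they have no detail rows yet
# (prodname comes back as NULL for those).
rows = conn.execute("""
    SELECT o.id, o.orderdate, o.amount, d.prodname
    FROM "order" o
    LEFT JOIN orderdet d ON d.orderid = o.id
    ORDER BY o.id, d.id
""").fetchall()
for r in rows:
    print(r)
```

Note that order is a reserved word in SQL, so it has to be quoted; you may want to pick a different name for the combined OLAP table.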

How can I do this task in Pentaho?

I am mainly confused about the incremental load part: how should I design the ETL for this?


Answer by Bert-Jan Stroop

To do incremental loads, you have to understand that Pentaho is really fast at reading data, but (relatively) slower at writing it. The functional idea behind incremental loads is therefore to read the data from system A (the source) and from system B (the target), and to determine what action should be taken for each record. A lot needs to be read, but only a little needs to be written.

For each record, we do one of the following:

  • nothing (record from system A is already "as is" in system B)
  • insert (record from system A is missing in system B)
  • update (record from system A is already in system B, but with some different data)
  • delete (record from system B does not have a corresponding record in system A)

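That compare logic can be sketched in Python. This is only an illustration of the idea, not Pentaho's implementation; the flag names mirror the values that the "Merge rows (diff)" step mentioned below emits, and the sample rows are made up:

```python
def classify(source, target, key="id"):
    """Compare source (system A) rows against target (system B) rows
    and return {key: flag}, flag being one of
    'identical', 'new', 'changed', 'deleted'."""
    src = {row[key]: row for row in source}
    tgt = {row[key]: row for row in target}
    flags = {}
    for k, row in src.items():
        if k not in tgt:
            flags[k] = "new"        # missing in B -> insert
        elif row == tgt[k]:
            flags[k] = "identical"  # already "as is" in B -> nothing
        else:
            flags[k] = "changed"    # present in B with different data -> update
    for k in tgt:
        if k not in src:
            flags[k] = "deleted"    # no corresponding record in A -> delete
    return flags

source = [{"id": 1, "amount": 99.0}, {"id": 2, "amount": 20.0}, {"id": 4, "amount": 5.0}]
target = [{"id": 1, "amount": 99.0}, {"id": 2, "amount": 15.5}, {"id": 3, "amount": 7.0}]
print(classify(source, target))
```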
There are some steps in Pentaho that make this pretty easy to set up, but in general most people start the same way:

  1. Have a "Table input" step for both system A and system B.
  2. Feed both streams into a "Merge rows (diff)" step. This compares the sets from A and B and adds a flag field that tells you which of the four actions has to be taken for each row.
  3. Process the rows based on the flag field, either with separate steps per flag if you want full control to optimize the performance, or by letting a single "Synchronize after merge" step handle it.
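What step 3 does with the flag field can be sketched like this. Again a hedged illustration, not Pentaho internals: SQLite stands in for the OLAP MySQL connection, and the table name fact_order is made up. Each flagged row is routed to an INSERT, UPDATE or DELETE; identical rows are skipped, which is exactly what makes the load incremental:

```python
import sqlite3

# In-memory stand-in for the OLAP target database.
olap = sqlite3.connect(":memory:")
olap.execute("CREATE TABLE fact_order (id INTEGER PRIMARY KEY, amount REAL)")
olap.executemany("INSERT INTO fact_order VALUES (?, ?)",
                 [(1, 99.0), (2, 15.5), (3, 7.0)])

# (row, flag) pairs as they would come out of the compare step.
flagged = [
    ({"id": 1, "amount": 99.0}, "identical"),
    ({"id": 2, "amount": 20.0}, "changed"),
    ({"id": 4, "amount": 5.0}, "new"),
    ({"id": 3, "amount": 7.0}, "deleted"),
]
for row, flag in flagged:
    if flag == "new":
        olap.execute("INSERT INTO fact_order VALUES (:id, :amount)", row)
    elif flag == "changed":
        olap.execute("UPDATE fact_order SET amount = :amount WHERE id = :id", row)
    elif flag == "deleted":
        olap.execute("DELETE FROM fact_order WHERE id = :id", row)
    # 'identical' rows need no write at all
olap.commit()

result = olap.execute("SELECT * FROM fact_order ORDER BY id").fetchall()
print(result)
```

If your use case really only ever inserts new entries (no updates or deletes in the OLAP table), you can ignore the other flags and route only the "new" rows to a table output step.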

More info is in the Pentaho documentation for these steps.