I am using CloverETL tools to read through a large CSV file with over 100,000 rows, translating some lookup values from a DB before inserting them into a DB table.
Basically I used:
- A flat file reader to read in the file
- Followed by a number of 'Lookup Join' components to translate some values from a DB
- Then an 'ExtHashJoin' to link up all the data (since the source is not sorted)
- Then insert the records into a DB table.
Pretty simple process, and it works fine for a small number of records. But when I try to import the whole file, it simply runs out of memory (increasing the memory allocation doesn't help).
I found that the ExtHashJoin waits until it has joined all records before starting to insert them, and that seems to be the problem, since it has to store every record in memory. I don't really need that behaviour: the records are all independent and could be processed in batches, e.g. 1,000 rows at a time, but I cannot figure out a way to make it do that.
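To illustrate, the batched processing I have in mind would look roughly like this in plain Python (not CloverETL code; the lookup table and field names here are made up, and the DB insert is only simulated by a counter):

```python
import csv
import io
from itertools import islice

def batches(rows, size):
    """Yield successive lists of at most `size` rows from an iterator."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Hypothetical in-memory lookup table standing in for the DB lookups.
STATUS_LOOKUP = {"A": "Active", "I": "Inactive"}

def process(csv_text, batch_size=1000):
    """Read the CSV, translate lookup values batch by batch, and
    return how many rows would have been inserted."""
    reader = csv.DictReader(io.StringIO(csv_text))
    inserted = 0
    for batch in batches(reader, batch_size):
        for row in batch:
            # Translate the lookup value before insert; pass unknown
            # codes through unchanged.
            row["status"] = STATUS_LOOKUP.get(row["status"], row["status"])
        # Here each completed batch would be inserted into the DB;
        # in this sketch we only count the rows.
        inserted += len(batch)
    return inserted
```

Only one batch ever sits in memory at a time, which is the behaviour I would like from the CloverETL graph.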
I also tried setting different phase values, but it still joins everything up before inserting the first record.
How can I tell the flat file reader to break the records down and process them in batches?
Thanks.
Eric
There is no easy way to force the FlatFileReader component to break records down and process them in batches. However, a few tips that might help you resolve the memory overflow caused by the ExtHashJoin component, along with more details on this topic, can be found on the CloverETL Forum: http://forum.cloveretl.com/viewtopic.php?p=12926#p12926
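One general way to avoid hashing an entire input in memory is to sort both inputs on the join key and stream a merge join, which only buffers one group of matching records at a time (CloverETL's ExtMergeJoin component is built around this idea). As a generic illustration in plain Python, not CloverETL code:

```python
from itertools import groupby
from operator import itemgetter

def merge_join(left, right, key):
    """Stream an inner join of two record iterators that are already
    sorted on `key`. Unlike a hash join, which stores one whole side
    in memory, this buffers only the current matching right-side group."""
    left_groups = groupby(left, key=itemgetter(key))
    right_groups = groupby(right, key=itemgetter(key))
    try:
        lk, lrows = next(left_groups)
        rk, rrows = next(right_groups)
        while True:
            if lk < rk:
                # Left key has no match yet; advance the left side.
                lk, lrows = next(left_groups)
            elif rk < lk:
                # Right key has no match yet; advance the right side.
                rk, rrows = next(right_groups)
            else:
                # Keys match: buffer just this one right-side group.
                rlist = list(rrows)
                for l in lrows:
                    for r in rlist:
                        yield {**l, **r}
                lk, lrows = next(left_groups)
                rk, rrows = next(right_groups)
    except StopIteration:
        return
```

The sorting step costs disk I/O, but the join itself then runs in constant memory regardless of file size, which is usually the right trade-off for inputs in the 100,000-row range and beyond.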