Data connection - Parallel JDBC extracts failing with OutOfMemoryError


I'm trying to run a few JDBC extracts in parallel, but they fail with: java.lang.OutOfMemoryError: Java heap space.

How does Data Connection memory usage work, and how do I resolve this problem?

1 Answer

The Data Connection agent's memory usage here depends mostly on the value of the fetchSize parameter. Per the JDBC javadoc for java.sql.Statement#setFetchSize, fetchSize:

Gives the JDBC driver a hint as to the number of rows that should be fetched from the database when more rows are needed for this ResultSet object.

So, the agent's memory usage should roughly be:

number of JDBC extracts running in parallel × fetchSize × size of each row
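To make that concrete with purely illustrative numbers: 10 extracts running in parallel, each with a fetchSize of 5000 and rows averaging 2 KB, would buffer about 10 × 5000 × 2 KB ≈ 100 MB of row data at peak, on top of the agent's baseline heap usage.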

Unfortunately, the default value of fetchSize varies widely across JDBC drivers. For instance, some versions of the Hive JDBC driver default to 50, while newer versions default to 1000. The Oracle JDBC driver defaults to 10. The PostgreSQL driver by default fetches the entire ResultSet at once (and it only honors a non-zero fetchSize when autocommit is disabled).
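At the plain-JDBC level, the knob behind all of this is java.sql.Statement#setFetchSize. Here's a minimal sketch of how it's used, assuming a PostgreSQL driver on the classpath; the URL, credentials, and table name are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class FetchSizeExample {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details -- substitute your own.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://host:5432/db", "user", "password")) {
                // The PostgreSQL driver only honors a non-zero fetchSize when
                // autocommit is off; otherwise it loads the whole ResultSet.
                conn.setAutoCommit(false);
                try (Statement stmt = conn.createStatement()) {
                    stmt.setFetchSize(1000); // hint: ~1000 rows per round trip
                    try (ResultSet rs = stmt.executeQuery("SELECT * FROM my_table")) {
                        while (rs.next()) {
                            // Process one row at a time; only roughly fetchSize
                            // rows are buffered in the driver at any moment.
                        }
                    }
                }
            }
        }
    }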

Hence, Data Connection lets you configure the fetchSize value, both per-source and per-extract.

OOM errors aside, tuning fetchSize can significantly improve performance in general. There's no one-size-fits-all value, though; you'll have to experiment to find the best setting for your extracts. It usually lies somewhere in the 500–5000 range.
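One way to experiment, if you can reach the database directly with plain JDBC: time a full scan of a representative table at several candidate fetchSize values and pick the smallest value beyond which throughput stops improving. A rough sketch along those lines (connection details and table name are again placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class FetchSizeBenchmark {
        public static void main(String[] args) throws Exception {
            int[] candidates = {100, 500, 1000, 2000, 5000};
            for (int fetchSize : candidates) {
                try (Connection conn = DriverManager.getConnection(
                        "jdbc:postgresql://host:5432/db", "user", "password")) {
                    conn.setAutoCommit(false); // required for Postgres to stream rows
                    try (Statement stmt = conn.createStatement()) {
                        stmt.setFetchSize(fetchSize);
                        long start = System.nanoTime();
                        long rows = 0;
                        try (ResultSet rs = stmt.executeQuery("SELECT * FROM my_table")) {
                            while (rs.next()) {
                                rows++; // drain the ResultSet without extra work
                            }
                        }
                        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                        System.out.printf("fetchSize=%d: %d rows in %d ms%n",
                                fetchSize, rows, elapsedMs);
                    }
                }
            }
        }
    }

While timing, also watch the agent's heap usage, since the best fetchSize is the one that balances round-trip overhead against memory pressure across all extracts running in parallel.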