increase efficiency of sqoop export from hdfs


I am trying to export data from files stored in HDFS to Vertica using Sqoop. For around 10k records, the files get loaded within a few minutes. But when I try to run tens of millions (crores) of records, only about 0.5% loads within 15 minutes or so. I have tried increasing the number of mappers, but it does not improve efficiency. Even setting the chunk size to increase the number of mappers does not change it.
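For reference, a typical Sqoop export invocation with an explicit mapper count looks roughly like this. The JDBC URL, table name, and paths are placeholders, not values from the question:

```shell
# Hypothetical example: export an HDFS directory into a Vertica table.
# Connection string, credentials, table, and paths are illustrative only.
sqoop export \
  --connect "jdbc:vertica://vertica-host:5433/mydb" \
  --username dbuser \
  --password-file /user/me/.vertica.pass \
  --table target_table \
  --export-dir /data/export/source \
  --num-mappers 8   # same as -m 8; number of parallel export tasks
```

Note that `--num-mappers` only sets an upper bound on parallelism; the target database still has to keep up with the concurrent inserts.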

Please help.

Thanks!

There are 3 answers below.

Answer 1:

Most MPP/RDBMS systems have Sqoop connectors that exploit parallelism and increase the efficiency of data transfer between HDFS and the database. Vertica has taken this approach: http://www.vertica.com/2012/07/05/teaching-the-elephant-new-tricks/ https://github.com/vertica/Vertica-Hadoop-Connector

Answer 2:

Since you are using batch export, try increasing the records-per-statement and records-per-transaction values using the following properties:

sqoop.export.records.per.statement: aggregates multiple rows into a single INSERT statement.

sqoop.export.records.per.transaction: how many INSERT statements are issued per transaction.
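These properties are passed with `-D` before the tool-specific arguments. A minimal sketch, where the values, connection string, table, and path are illustrative placeholders rather than tuned recommendations:

```shell
# Hypothetical values; tune per target database, row width, and memory.
# -D generic options must come before the export-specific flags.
sqoop export \
  -Dsqoop.export.records.per.statement=100 \
  -Dsqoop.export.records.per.transaction=100 \
  --connect "jdbc:vertica://vertica-host:5433/mydb" \
  --table target_table \
  --export-dir /data/export/source
```

Batching rows into fewer, larger statements and transactions reduces per-insert round trips, which is usually the bottleneck in batch exports.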

I hope this solves the issue.

Answer 3:

Is this a "wide" dataset? It might be a Sqoop bug: https://issues.apache.org/jira/browse/SQOOP-2920. When the number of columns is very high (in the hundreds), Sqoop starts choking (very high CPU usage). When the number of fields is small, it's usually the other way around: Sqoop sits idle and the RDBMS can't keep up.