Downloading a Greenplum dump to a local cluster in parallel


Is there a more efficient way to fetch a full Greenplum dump than through multiple JDBC connections to the master node?

I need to download a full dump of Greenplum over JDBC. To speed the job up, I plan to use Spark parallelism, fetching the data in parallel through multiple JDBC connections. As I understand it, all of those connections will go to Greenplum's single master node. I will store the data in HDFS in Parquet format.
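The plan described above can be sketched in PySpark roughly as follows. This is only an illustration under stated assumptions: the host, database, table, and column names are hypothetical, port 5432 is assumed, and the stock PostgreSQL JDBC driver is assumed to be on the Spark classpath (Greenplum accepts PostgreSQL clients). The option names are Spark's standard JDBC data source options.

```python
from typing import Dict


def jdbc_read_options(host: str, database: str, table: str,
                      user: str, password: str,
                      partition_column: str, lower: int, upper: int,
                      num_partitions: int) -> Dict[str, str]:
    """Build Spark JDBC options for a range-partitioned read via the master."""
    return {
        # Assumption: default port 5432 and the PostgreSQL JDBC driver.
        "url": f"jdbc:postgresql://{host}:5432/{database}",
        "driver": "org.postgresql.Driver",
        "dbtable": table,
        "user": user,
        "password": password,
        # Spark splits [lower, upper) on this numeric column into
        # num_partitions ranges and opens one connection per range.
        "partitionColumn": partition_column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
        # Rows fetched per round trip; the value here is an arbitrary choice.
        "fetchsize": "10000",
    }


def export_table(spark, options: Dict[str, str], hdfs_path: str) -> None:
    """Read the table with the given options and write it to HDFS as Parquet."""
    df = spark.read.format("jdbc").options(**options).load()
    df.write.mode("overwrite").parquet(hdfs_path)
```

Note that `lowerBound` and `upperBound` only set the partition stride; rows outside the range still land in the first or last partition. Every connection still goes through the master node, which is the bottleneck the question asks about.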


There are 2 best solutions below


For a parallel export, try a gphdfs writable external table. Greenplum segments can read from and write to external sources in parallel.

http://gpdb.docs.pivotal.io/4340/admin_guide/load/topics/g-gphdfs.html
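As a rough sketch of this approach, the helper below generates the two SQL statements involved: one that creates a writable external table over the `gphdfs` protocol, and one that exports into it. The table name, NameNode address, and HDFS directory are hypothetical, and the `_export` suffix and comma-delimited text format are arbitrary choices for illustration. It also assumes the role has been granted permission to use the gphdfs protocol.

```python
from typing import List


def gphdfs_export_statements(source_table: str, namenode: str,
                             hdfs_dir: str) -> List[str]:
    """Return the SQL to export source_table to HDFS through gphdfs."""
    ext_table = f"{source_table}_export"   # hypothetical naming convention
    path = hdfs_dir.lstrip("/")
    create = (
        # LIKE copies the column definitions from the source table.
        f"CREATE WRITABLE EXTERNAL TABLE {ext_table} (LIKE {source_table})\n"
        f"LOCATION ('gphdfs://{namenode}/{path}')\n"
        "FORMAT 'TEXT' (DELIMITER ',')\n"
        "DISTRIBUTED RANDOMLY;"
    )
    # Each segment writes its own share of the rows directly to HDFS.
    insert = f"INSERT INTO {ext_table} SELECT * FROM {source_table};"
    return [create, insert]


if __name__ == "__main__":
    for stmt in gphdfs_export_statements("sales", "namenode:8020", "/exports/sales"):
        print(stmt, end="\n\n")
```

The output here is delimited text rather than Parquet, so converting the files would be a separate step.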


You can now use the Greenplum-Spark connector to parallelize data transfer between Greenplum segments and Spark executors.

The connector speeds up data transfer because it leverages parallel processing in both the Greenplum segments and the Spark workers. It is faster than the JDBC connector, which transfers all data through the Greenplum master node.

Reference: http://greenplum-spark.docs.pivotal.io/100/index.html
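A minimal PySpark sketch of reading through the connector might look like the following. The data source class name and the option keys reflect my understanding of the 1.x connector and should be checked against the reference above; the host, database, table, and column names are hypothetical, and the connector JAR is assumed to be on the Spark classpath.

```python
from typing import Dict

# Assumption: data source class name used by the 1.x connector.
GSC_FORMAT = "io.pivotal.greenplum.spark.GreenplumRelationProvider"


def connector_options(host: str, database: str, table: str,
                      user: str, password: str,
                      partition_column: str) -> Dict[str, str]:
    """Build the option map passed to the Greenplum-Spark connector."""
    return {
        # The master is contacted for metadata only; rows are expected to
        # flow between the segments and the Spark executors.
        "url": f"jdbc:postgresql://{host}:5432/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        # Column the connector uses to split the table into Spark partitions.
        "partitionColumn": partition_column,
    }


def export_with_connector(spark, options: Dict[str, str], hdfs_path: str) -> None:
    """Load the table through the connector and write it to HDFS as Parquet."""
    df = spark.read.format(GSC_FORMAT).options(**options).load()
    df.write.mode("overwrite").parquet(hdfs_path)
```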