I am trying to migrate data from Postgres to HDFS (ideally stored as Parquet). Apache Sqoop was built for this purpose, but it was moved into the Attic. I have a large number of different tables to migrate, and some tables are huge (billions of rows).
I tried several approaches:
**Use Postgres COPY to CSV and put to HDFS:**
The issue is that `COPY` is very slow for large tables.
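For reference, this is roughly what that approach looks like; the connection string, table name, and paths are placeholders, and it assumes `psycopg2` plus the `hdfs dfs` CLI are available on the machine running it:

```python
import subprocess
import psycopg2

# Placeholder connection settings and table name -- adjust for your environment.
conn = psycopg2.connect("host=pg-host dbname=mydb user=etl password=secret")
table = "public.events"
local_csv = "/tmp/events.csv"

# Stream the table out of Postgres as CSV using COPY ... TO STDOUT.
with conn, conn.cursor() as cur, open(local_csv, "w") as f:
    cur.copy_expert(f"COPY {table} TO STDOUT WITH CSV HEADER", f)

# Upload the CSV to HDFS; this single-threaded export is what becomes painfully
# slow once the table has billions of rows.
subprocess.run(
    ["hdfs", "dfs", "-put", "-f", local_csv, "/staging/events.csv"],
    check=True,
)
```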
**Use Spark:**
I tried to build a general Spark job that reads each table over JDBC and writes it to Parquet on HDFS. This approach works conceptually, but many of my tables don't have an evenly distributed numeric column, so it's hard to partition the read, and many of the jobs fail.
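This is roughly the shape of that Spark job (a sketch, with placeholder connection details and column names, and assuming the Postgres JDBC driver is on the Spark classpath). The `partitionColumn` / `lowerBound` / `upperBound` options are what break down when a table has no evenly distributed numeric column, since skewed partitions end up holding most of the rows:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pg-to-parquet").getOrCreate()

# Placeholder JDBC settings; org.postgresql.Driver must be on the classpath.
jdbc_url = "jdbc:postgresql://pg-host:5432/mydb"

# Parallel read: Spark splits the [lowerBound, upperBound] range of
# partitionColumn into numPartitions slices and issues one query per slice.
# If the column is skewed or missing, a few partitions receive almost all
# rows and those tasks run forever or fail.
df = (spark.read
      .format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "public.events")
      .option("user", "etl")
      .option("password", "secret")
      .option("driver", "org.postgresql.Driver")
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "2000000000")
      .option("numPartitions", "64")
      .load())

df.write.mode("overwrite").parquet("hdfs:///warehouse/events")
```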
I'm wondering if there are any good products or open-source projects that are a great fit for this use case.