Hadoop Distcp - small files issue while copying between different locations


I tried to copy 400+ GB in one distcp job and 35.6 GB in another, but both took nearly 2-3 hours to complete.

We do have enough resources in the cluster.

But when I examined the container logs, I found that copying the small files takes most of the time. The file in the log entries below is one such small file.

2019-10-23 14:49:09,546 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying hdfs://service-namemode-prod-ab/abc/xyz/ava/abc/hello/GRP_part-00001-.snappy.parquet to s3a://bucket-name/Data/abc/xyz/ava/abc/hello/GRP_part-00001-.snappy.parquet
2019-10-23 14:49:09,940 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://bucket-name/Data/.distcp.tmp.attempt_1571566366604_9887_m_000010_0

So what can be done with distcp to make this copy quicker?

Note: a copy from the same cluster to an internal object store (not AWS S3, but S3-like) took 4 minutes for 98.6 GB.

Command:

hadoop distcp \
  -Dmapreduce.task.timeout=0 \
  -Dfs.s3a.fast.upload=true \
  -Dfs.s3a.fast.buffer.size=157286400 \
  -Dfs.s3a.multipart.size=314572800 \
  -Dfs.s3a.multipart.threshold=1073741824 \
  -Dmapreduce.map.memory.mb=8192 \
  -Dmapreduce.map.java.opts=-Xmx7290m \
  -Dfs.s3a.max.total.tasks=1 \
  -Dfs.s3a.threads.max=10 \
  -bandwidth 1024 \
  /abc/xyz/ava/ s3a://bucket-name/Data/

Which of these values can be optimized?

My cluster specs are as follows:

Allocated memory (cumulative) - 1.2T
Available memory - 5.9T
Allocated VCores (cumulative) - 119T
Available VCores - 521T
Configured Capacity - 997T
HDFS Used - 813T
Non-HDFS Used - 2.7T

Can anyone suggest a solution to overcome this issue, and an optimal distcp configuration for transferring 800 GB - 1 TB of files, usually from HDFS to an object store?


There are 3 answers below.

Answer 1

You can write Spark code: list all the files, create a DataFrame over that list, and repartition it by its length so that you end up with one file per partition. Then call mapPartitions on this DataFrame, so if you have 100 cores, 100 files will be copied at once. CPU utilization will be minimal since it is all I/O, and network bandwidth utilization will be good.

If the number of files is much larger than the number of cores you have, a faster approach is to repartition so that each partition holds multiple files. Inside mapPartitions you can then spawn n threads (one per file in the partition), copying n files at once on every core. So with 100 cores you can copy 100 * n files at once. CPU and network bandwidth utilization are as above.
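
A minimal sketch of the simpler one-file-per-partition variant, using an RDD rather than a DataFrame (the paths, app name, and parallelism are illustrative and untested against the asker's cluster; executors are assumed to have s3a credentials on their classpath):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.sql.SparkSession

object ParallelCopy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parallel-copy").getOrCreate()
    val sc = spark.sparkContext

    // Illustrative locations; substitute your own.
    val srcRoot = "hdfs://service-namemode-prod-ab/abc/xyz/ava/"
    val dstRoot = "s3a://bucket-name/Data/"

    // Build the file listing on the driver.
    val srcFs = FileSystem.get(new URI(srcRoot), new Configuration())
    val iter = srcFs.listFiles(new Path(srcRoot), true)
    val paths = scala.collection.mutable.ArrayBuffer.empty[String]
    while (iter.hasNext) paths += iter.next().getPath.toString

    // One file per partition, so each core copies one file at a time.
    sc.parallelize(paths, paths.length).foreachPartition { files =>
      val conf = new Configuration() // picks up core-site.xml / s3a settings on the executor
      val dstFs = FileSystem.get(new URI(dstRoot), conf)
      files.foreach { p =>
        val src = new Path(p)
        val dst = new Path(p.replace(srcRoot, dstRoot))
        FileUtil.copy(FileSystem.get(src.toUri, conf), src, dstFs, dst,
          false /* deleteSource */, conf)
      }
    }
    spark.stop()
  }
}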

If you can answer this, do help: Copy data from an on-prem S3 object storage to another object storage using Hadoop CLI commands.

Answer 2

If you are copying to object stores, you can also use the -direct option of distcp. From the official doc: "-direct: Write directly to destination paths. Useful for avoiding potentially very expensive temporary file rename operations when the destination is an object store."

Before it starts copying, distcp also builds a file listing, so if that is taking time too, you can try the -numListstatusThreads option. It mostly helps when the source is an object store, or when you also use the -delete option, in which case a target listing is built as well.
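
For example, combining both flags with the paths from the question (the thread count is illustrative; 40 is the documented upper limit for listing threads):

hadoop distcp -direct -numListstatusThreads 40 /abc/xyz/ava/ s3a://bucket-name/Data/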

Answer 3

In my project we copied 20 TB to S3A through distcp. It was taking almost 24+ hours; however, by adding two new buckets and using the same distcp command, the copy time dropped to almost 16 hours.
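
One way to read that is to split the source tree across several buckets and run the jobs in parallel. A rough sketch (the subdirectory names and the extra bucket names are hypothetical):

hadoop distcp /abc/xyz/ava/dir1 s3a://bucket-name/Data/dir1 &
hadoop distcp /abc/xyz/ava/dir2 s3a://bucket-name-2/Data/dir2 &
hadoop distcp /abc/xyz/ava/dir3 s3a://bucket-name-3/Data/dir3 &
wait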

One more option is to increase the number of vcores in the cluster.