I tried one distcp job copying 400+ GB and another copying 35.6 GB, but both of them took nearly 2-3 hours to complete.
We do have enough resources in the cluster.
But when I examined the container logs, I found that most of the time is spent copying small files. The file in the log below is one such small file.
2019-10-23 14:49:09,546 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying hdfs://service-namemode-prod-ab/abc/xyz/ava/abc/hello/GRP_part-00001-.snappy.parquet to s3a://bucket-name/Data/abc/xyz/ava/abc/hello/GRP_part-00001-.snappy.parquet
2019-10-23 14:49:09,940 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://bucket-name/Data/.distcp.tmp.attempt_1571566366604_9887_m_000010_0
So what can be done with distcp to make the copy quicker?
Note: copying the same data on the same cluster to an internal Object Store (not AWS S3, but S3-compatible) took 4 minutes for 98.6 GB.
Command:
hadoop distcp -Dmapreduce.task.timeout=0 -Dfs.s3a.fast.upload=true -Dfs.s3a.fast.buffer.size=157286400 -Dfs.s3a.multipart.size=314572800 -Dfs.s3a.multipart.threshold=1073741824 -Dmapreduce.map.memory.mb=8192 -Dmapreduce.map.java.opts=-Xmx7290m -Dfs.s3a.max.total.tasks=1 -Dfs.s3a.threads.max=10 -bandwidth 1024 /abc/xyz/ava/ s3a://bucket-name/Data/
Which of these values can be optimized?
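For reference, here is a variant I could try, with more mappers and higher S3A thread and connection limits; the option values below are only illustrative guesses, not something I have benchmarked:

hadoop distcp -Dmapreduce.task.timeout=0 -Dfs.s3a.fast.upload=true -Dfs.s3a.fast.upload.buffer=disk -Dfs.s3a.multipart.size=104857600 -Dfs.s3a.threads.max=64 -Dfs.s3a.max.total.tasks=64 -Dfs.s3a.connection.maximum=128 -m 100 -numListstatusThreads 40 -strategy dynamic /abc/xyz/ava/ s3a://bucket-name/Data/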
My cluster specs are as follows:
Allocated memory (cumulative) - 1.2T
Available memory - 5.9T
Allocated VCores (cumulative) - 119T
Available VCores - 521T
Configured Capacity - 997T
HDFS Used - 813T
Non-HDFS Used - 2.7T
Can anyone suggest a solution to overcome this issue, and an optimal distcp configuration for transferring 800 GB - 1 TB of files from HDFS to an Object Store?
You can write Spark code. List all the files into a list, create a DataFrame from it, and repartition it by its length, so you essentially have one file per partition. Call mapPartitions on this DataFrame; if you have 100 cores, 100 files will be copied at once. CPU utilization will be minimal since it is all IO, and network bandwidth utilization will be good. A sketch of this is below.
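A minimal sketch of that idea in Scala, assuming the S3A credentials and endpoint are already configured on the executors. The ParallelCopy object name and the source and destination roots (taken from your command) are placeholders, not tested code:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.sql.SparkSession

object ParallelCopy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hdfs-to-s3-copy").getOrCreate()
    import spark.implicits._

    val srcRoot = "hdfs://service-namemode-prod-ab/abc/xyz/ava/"  // placeholder from the question
    val dstRoot = "s3a://bucket-name/Data/"                       // placeholder from the question

    // List every source file on the driver.
    val srcFs = FileSystem.get(new URI(srcRoot), spark.sparkContext.hadoopConfiguration)
    val iter  = srcFs.listFiles(new Path(srcRoot), true)
    val paths = scala.collection.mutable.ArrayBuffer[String]()
    while (iter.hasNext) paths += iter.next().getPath.toString

    // One file per partition, so each core copies one file at a time.
    val df = paths.toDF("path").repartition(paths.length)

    df.as[String].rdd.foreachPartition { files =>
      // Executors pick up core-site.xml / hdfs-site.xml from their classpath.
      val conf = new Configuration()
      val src  = FileSystem.get(new URI(srcRoot), conf)
      val dst  = FileSystem.get(new URI(dstRoot), conf)
      files.foreach { p =>
        // Assumes the listed URIs start with srcRoot; adjust the prefix handling if not.
        val rel = p.stripPrefix(srcRoot)
        FileUtil.copy(src, new Path(p), dst, new Path(dstRoot + rel),
          false /* deleteSource */, conf)
      }
    }

    spark.stop()
  }
}

Package it and run it with spark-submit; the number of files copied in parallel is then bounded by the total executor cores you request.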
If the number of files is much larger than the number of cores you have, a faster approach is to repartition so that there are multiple files in each partition. Then, inside mapPartitions, you can spawn n threads (one per file in the partition), copying n files at once on every core. So if you have 100 cores, you can copy 100 * n files at once. CPU and network bandwidth utilization are as above. A threaded variant is sketched below.
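A hedged sketch of that threaded variant, reusing paths, srcRoot, dstRoot and the imports from the sketch above; filesPerPartition and the Future-based pool are just one illustrative way to run n copies per core:

import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

val filesPerPartition = 8  // n concurrent copies per core (illustrative value)
val numPartitions = math.max(1, paths.length / filesPerPartition)

paths.toDF("path").repartition(numPartitions).as[String].rdd.foreachPartition { files =>
  val conf = new Configuration()
  val src  = FileSystem.get(new URI(srcRoot), conf)
  val dst  = FileSystem.get(new URI(dstRoot), conf)
  // One thread per file in this partition, so each core drives n copies at once.
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(filesPerPartition))
  val copies = files.map { p =>
    Future {
      val rel = p.stripPrefix(srcRoot)
      FileUtil.copy(src, new Path(p), dst, new Path(dstRoot + rel), false, conf)
    }
  }.toList
  // Wait for all copies in this partition; a production job would also shut the pool down.
  Await.result(Future.sequence(copies), Duration.Inf)
}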
If you can answer this, do help: Copy data from an on-prem S3 object storage to another object storage using Hadoop CLI commands.