Transfer files from HDFS dir to SFTP server


I am trying to transfer all the part* files directly from an HDFS directory to an SFTP server. The files in the HDFS directory are pretty large, so I do not want to copy them to the local file system first.

The current setup is

hdfs dfs -text "<HDFS_DIR>/part*" > local_file

curl -u "<sftp_username>:" --key "<private_key_file_path>" --pubkey "<public_key_file_path>" \
    --upload-file local_file "sftp://<SFTP_HOST>/<Upload_dir>"

How can I upload the files directly from HDFS to the SFTP server path without writing them to the local filesystem?

I considered the following options

  1. Sqoop with the SFTP connector (did not find enough resources) - https://sqoop.apache.org/docs/1.99.7/user/connectors/Connector-SFTP.html
  2. Copy each part file to the local filesystem and move it to the SFTP server (inefficient)
  3. hadoop distcp with SFTP, which does not work in CDH 5 (I am using CDH 5.16.2)

Please let me know which is the best way to accomplish this. Thanks!


3 Answers

hanshenrik

Maybe you can pipe the hdfs output directly to curl for the upload, by using --upload-file . or --upload-file -, e.g.

hdfs dfs -text "<HDFS_DIR>/part*" | curl -u "<sftp_username>:" --key "<private_key_file_path>" --pubkey "<public_key_file_path>" \
    --upload-file . "sftp://<SFTP_HOST>/<Upload_dir>/<file_name>"

About the difference between . and -, the curl docs say:

Use the file name "-" (a single dash) to use stdin instead of a given file. Alternately, the file name "." (a single period) may be specified instead of "-" to use stdin in non-blocking mode to allow reading server output while stdin is being uploaded.

which sounds to me like curl may attempt to put the whole file in RAM, or at least in a stdin buffer, before starting the upload, so . sounds safer than - if you expect to deal with large files.
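
Since the question is about all the part* files, here is a minimal sketch of the same idea applied per file, so each part keeps its own name on the SFTP side. The placeholders are the ones from the question, and the awk-based listing is an assumption about your hdfs dfs -ls output format:

# Upload each part file individually, streaming it straight from HDFS (no local copy).
# Placeholders (<HDFS_DIR>, <SFTP_HOST>, <Upload_dir>, key paths) are from the question.
for f in $(hdfs dfs -ls "<HDFS_DIR>" | awk '/part/ {print $NF}'); do
  name=$(basename "$f")
  hdfs dfs -cat "$f" | curl -u "<sftp_username>:" \
      --key "<private_key_file_path>" --pubkey "<public_key_file_path>" \
      --upload-file . "sftp://<SFTP_HOST>/<Upload_dir>/$name"
done

Note that with --upload-file . you'll want to spell out the remote file name in the URL, since there is no local file name for curl to append.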

Piyush Patel

You could probably do it like this.

hdfs dfs -cat <HDFS_DIR>/part* | ssh <sftp_username>@<sftp_hostname> 'cat - > <Upload_dir>/<file_name>'
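
If the part files are really big and the link is slow, a hedged variant of the same pipe with ssh-level compression might help. It assumes key-based auth with the key path from the question; ssh -C decompresses transparently, so the file lands uncompressed on the remote side:

# Same streaming idea, compressed on the wire with ssh -C.
# <private_key_file_path> and the other placeholders are from the question.
hdfs dfs -cat "<HDFS_DIR>/part*" | ssh -C -i "<private_key_file_path>" \
    <sftp_username>@<sftp_hostname> 'cat > <Upload_dir>/<file_name>'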
tsetem

Hate to say it, but you might be hosed if you're stuck on CDH 5.16. This version is really out of date compared to current CDH releases, as well as Apache Hadoop.

SFTP support was added in Hadoop 2.8.0. I'd suggest trying to upgrade your cluster, or see if you can get a Docker image and shoehorn in a distcp job to copy that data using updated libraries more natively.
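
For reference, a minimal sketch of what that distcp job could look like from a Hadoop 2.8+ client. The fs.sftp.impl override and the credentials-in-URI form are assumptions that vary by version, <sftp_password> is a placeholder I'm introducing here, and key-based auth may need extra fs.sftp.* settings, so treat this as a starting point rather than a verified recipe:

# Assumes a Hadoop 2.8+ client where org.apache.hadoop.fs.sftp.SFTPFileSystem is on the classpath.
# Credentials embedded in the URI are placeholders; adjust auth settings for your version.
hadoop distcp \
  -D fs.sftp.impl=org.apache.hadoop.fs.sftp.SFTPFileSystem \
  "<HDFS_DIR>" \
  "sftp://<sftp_username>:<sftp_password>@<SFTP_HOST>/<Upload_dir>"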