How to configure Apache Flume not to rename ingested files with .COMPLETED


We have an AWS S3 bucket that receives new CSV files at 10-minute intervals. The goal is to ingest these files into Hive.

The obvious approach to me is to use Apache Flume with a Spooling Directory source, which keeps watching a landing directory for new files and ingests them into Hive.

We have read-only permissions on the S3 bucket and on the landing directory into which the files will be copied, and Flume marks ingested files by appending a .COMPLETED suffix. So in our case Flume won't be able to mark completed files because of the permission issue.

Now the questions are:

  1. What will happen if Flume is not able to add the suffix to completed files? Will it raise an error, or will it fail silently? (I am actually testing this, but if anyone has already tried it then I don't have to reinvent the wheel.)
  2. Will Flume be able to ingest files without marking them with .COMPLETED?
  3. Is there any other Big Data tool or technology better suited to this use case?


The Flume Spooling Directory source needs write permission on the spool directory, because it either renames or deletes each file once it has been fully read.

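Check the 'fileSuffix' and 'deletePolicy' settings: 'fileSuffix' sets the suffix appended to finished files (.COMPLETED by default), and 'deletePolicy' (never or immediate) controls whether finished files are deleted instead.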

If it doesn't rename or delete the completed files, it can't tell which files have already been processed.

You might want to write a script that copies files from the read-only S3 bucket into a staging folder with write permissions, and point Flume at this staging folder as its source.
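A minimal sketch of such a staging script, in Python, might look like the following. It assumes the boto3 S3 client is used for listing and downloading; the bucket name, prefix, and all paths are hypothetical placeholders. Each object is downloaded into a temporary directory first and then moved into the staging folder, so Flume never sees a half-written file, and a ledger file records which keys were already copied (so files Flume has renamed or deleted are not fetched again).

```python
import os
from pathlib import Path


def load_seen(ledger):
    """Return the set of S3 keys already copied, as recorded in the ledger file."""
    if not ledger.exists():
        return set()
    return set(ledger.read_text().splitlines())


def sync_once(s3, bucket, prefix, staging_dir, tmp_dir, ledger):
    """Copy every not-yet-seen object under prefix into staging_dir.

    s3 is expected to behave like a boto3 S3 client. tmp_dir must be on the
    same filesystem as staging_dir so that the final move is atomic.
    Returns the list of keys copied during this run.
    """
    staging_dir.mkdir(parents=True, exist_ok=True)
    tmp_dir.mkdir(parents=True, exist_ok=True)
    seen = load_seen(ledger)
    copied = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Skip "folder" placeholder keys and anything copied earlier.
            if key.endswith("/") or key in seen:
                continue
            name = key.replace("/", "_")  # flatten the key into a file name
            tmp_path = tmp_dir / name
            s3.download_file(bucket, key, str(tmp_path))
            # Move the finished file into the spool directory in one step.
            os.replace(tmp_path, staging_dir / name)
            with ledger.open("a") as fh:
                fh.write(key + "\n")
            seen.add(key)
            copied.append(key)
    return copied


def main():
    import boto3  # imported here so the helpers above work without boto3

    # All names below are placeholders; replace them with real values.
    sync_once(
        boto3.client("s3"),
        bucket="example-landing-bucket",
        prefix="incoming/",
        staging_dir=Path("/data/flume/spool"),
        tmp_dir=Path("/data/flume/tmp"),
        ledger=Path("/data/flume/copied_keys.txt"),
    )
```

Calling `main()` on a schedule (for example from cron every 10 minutes, matching the arrival interval) would keep the staging folder fed, and the Spooling Directory source's spoolDir would then point at the staging folder rather than at the read-only landing directory.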