We have an AWS S3 bucket that receives new CSV files at 10-minute intervals, and the goal is to ingest these files into Hive. The obvious approach for me is to use Apache Flume with a Spooling Directory source, which keeps watching a landing directory for new files and ingests them into Hive. The problem: we have read-only permissions on the S3 bucket and on the landing directory into which the files are copied, while Flume marks ingested files by renaming them with a .COMPLETED suffix. So in our case Flume won't be able to mark completed files because of the permission issue.
My questions are:

- What will happen if Flume is not able to add the suffix to completed files? Will it raise an error, or fail silently? (I am actually testing this, but if anyone has already tried it, I don't have to reinvent the wheel.)
- Will Flume be able to ingest files without marking them with .COMPLETED?
- Is there any other Big Data tool/technology better suited for this use case?
The Flume Spooling Directory source needs write permission on the spool directory so it can either rename or delete each processed file; see the 'fileSuffix' and 'deletePolicy' settings. If it can't rename or delete completed files, it has no way to tell which files it has already processed.
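For reference, here is a minimal sketch of the relevant agent properties; the agent, source, and channel names and the spoolDir path are illustrative:

```
agent1.sources = src1
agent1.channels = ch1
agent1.sources.src1.type = spooldir
agent1.sources.src1.channels = ch1
agent1.sources.src1.spoolDir = /data/flume/staging
# fileSuffix: appended to fully ingested files (default .COMPLETED);
# renaming requires write permission on spoolDir.
agent1.sources.src1.fileSuffix = .COMPLETED
# deletePolicy: never (default, rename instead) or immediate (delete
# after ingest); deleting also requires write permission.
agent1.sources.src1.deletePolicy = never
```

Either policy ends up modifying the spool directory, which is exactly what your read-only permissions rule out.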
A workaround is to write a script that copies new files from the read-only S3 bucket into a 'staging' folder where Flume does have write permission, and point the Spooling Directory source at that staging folder.
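As a sketch of that workaround (assuming boto3; the bucket, prefix, and staging path are hypothetical), something like this could run from cron every few minutes:

```python
import os
import boto3

# Hypothetical names -- substitute your own bucket, prefix, and
# staging path (the same directory configured as Flume's spoolDir).
BUCKET = "my-readonly-bucket"
PREFIX = "incoming/"
STAGING_DIR = "/data/flume/staging"

s3 = boto3.client("s3")

def sync_new_files():
    # Flume renames files after ingesting them, so we can't just compare
    # S3 keys against the staging dir's contents; instead keep a marker
    # file listing the keys we've already pulled. The spooling source
    # skips hidden (dot-prefixed) files, so Flume ignores this file.
    seen_path = os.path.join(STAGING_DIR, ".synced-keys")
    seen = set()
    if os.path.exists(seen_path):
        with open(seen_path) as f:
            seen = set(f.read().splitlines())

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key in seen or not key.endswith(".csv"):
                continue
            name = os.path.basename(key)
            # Download under a hidden temp name, then rename into place:
            # the spooling source expects files to be complete and
            # immutable once they become visible in the directory.
            tmp = os.path.join(STAGING_DIR, "." + name + ".part")
            s3.download_file(BUCKET, key, tmp)
            os.rename(tmp, os.path.join(STAGING_DIR, name))
            seen.add(key)

    with open(seen_path, "w") as f:
        f.write("\n".join(sorted(seen)))

if __name__ == "__main__":
    sync_new_files()
```

The hidden-temp-name-then-rename step is the important detail: if Flume sees a file while it is still being written, the source will fail when the file changes underneath it.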