Flume not closing all files when adding them successively


Here is my Flume conf:

agent.sinks = s3hdfs
agent.sources = MySpooler
agent.channels = channel

agent.sinks.s3hdfs.type = hdfs
agent.sinks.s3hdfs.hdfs.path = s3a://testbucket/test
agent.sinks.s3hdfs.hdfs.filePrefix = FilePrefix
agent.sinks.s3hdfs.hdfs.writeFormat = Text
agent.sinks.s3hdfs.hdfs.fileType = DataStream
agent.sinks.s3hdfs.channel = channel
agent.sinks.s3hdfs.hdfs.useLocalTimeStamp = true
agent.sinks.s3hdfs.hdfs.rollInterval = 0
agent.sinks.s3hdfs.hdfs.rollSize = 0
agent.sinks.s3hdfs.hdfs.rollCount = 0
agent.sinks.s3hdfs.hdfs.idleTimeout = 15

agent.sources.MySpooler.channels = channel
agent.sources.MySpooler.type = spooldir
agent.sources.MySpooler.spoolDir = /flume_to_aws
agent.sources.MySpooler.fileHeader = false
agent.sources.MySpooler.deserializer.maxLineLength = 110000

agent.channels.channel.type = memory
agent.channels.channel.capacity = 100000000

When I add a single file to /flume_to_aws and wait, it is uploaded to Amazon S3 and the file is closed normally.

[root@de flume_to_aws]# cp /tmp_flume/globalterrorismdb_0522dist.00001.csv .

log:

06 Feb 2023 14:02:11,802 INFO  [hdfs-s3hdfs-roll-timer-0] (org.apache.flume.sink.hdfs.BucketWriter.doClose:438)  - Closing s3a://testbucket/test/FilePrefix.1675699321675.tmp
06 Feb 2023 14:02:13,599 INFO  [hdfs-s3hdfs-call-runner-4] (org.apache.flume.sink.hdfs.BucketWriter$7.call:681)  - Renaming s3a://testbucket/test/FilePrefix.1675699321675.tmp to s3a://testbucket/test/FilePrefix.1675699321675

But when I add several files without waiting, not all of them are uploaded.

For example:

[root@de flume_to_aws]# cp /tmp_flume/globalterrorismdb_0522dist.00001.csv .
[root@de flume_to_aws]# cp /tmp_flume/globalterrorismdb_0522dist.00002.csv .
[root@de flume_to_aws]# cp /tmp_flume/globalterrorismdb_0522dist.00003.csv .

log (only one file):

06 Feb 2023 14:02:27,842 INFO  [hdfs-s3hdfs-roll-timer-0] (org.apache.flume.sink.hdfs.BucketWriter.doClose:438)  - Closing s3a://testbucket/test/FilePrefix.1675699338165.tmp
06 Feb 2023 14:02:31,411 INFO  [hdfs-s3hdfs-call-runner-0] (org.apache.flume.sink.hdfs.BucketWriter$7.call:681)  - Renaming s3a://testbucket/test/FilePrefix.1675699338165.tmp to s3a://testbucket/test/FilePrefix.1675699338165

In S3 I only see one file. Why does this happen?

Answered by Astora:

I misunderstood the concept.

Actually, it is working fine. Flume does something called "rolling". Those three files were rolled together into one, mainly because of these three parameters:

agent.sinks.s3hdfs.hdfs.rollInterval = 0
agent.sinks.s3hdfs.hdfs.rollSize = 0
agent.sinks.s3hdfs.hdfs.rollCount = 0

Since rolling by time (rollInterval), by size (rollSize) and by event count (rollCount) are all disabled by being set to 0, Flume keeps appending events to the same open file and only closes it after agent.sinks.s3hdfs.hdfs.idleTimeout = 15 seconds of inactivity. All three source files therefore end up in a single file in S3.
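For comparison, here is a sketch of how one of those triggers could be re-enabled to roll by time instead (the 60-second value is just illustrative, not a recommendation):

```properties
# Illustrative sketch: close and rename the file every 60 seconds,
# regardless of size or event count
agent.sinks.s3hdfs.hdfs.rollInterval = 60
agent.sinks.s3hdfs.hdfs.rollSize = 0
agent.sinks.s3hdfs.hdfs.rollCount = 0
```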

In my case, I am now using agent.sinks.s3hdfs.hdfs.rollSize = 2097152, so it rolls when the file reaches 2 MB. The sizes of those three files are:

[root@de flume_to_aws]# du -sk /tmp_flume/globalterrorismdb_0522dist.00001.csv
1532    /tmp_flume/globalterrorismdb_0522dist.00001.csv
[root@de flume_to_aws]# du -sk /tmp_flume/globalterrorismdb_0522dist.00002.csv
1040    /tmp_flume/globalterrorismdb_0522dist.00002.csv
[root@de flume_to_aws]# du -sk /tmp_flume/globalterrorismdb_0522dist.00003.csv
908     /tmp_flume/globalterrorismdb_0522dist.00003.csv

1532 KB + 1040 KB + 908 KB = 3480 KB (about 3.4 MB)

Since I set it to roll at 2 MB, it stores two files in S3.
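The arithmetic above can be checked with a small shell snippet (the sizes are the `du -sk` values from my machine; the 2048 KB threshold corresponds to rollSize = 2097152 bytes):

```shell
#!/bin/sh
# Sum the three source file sizes (in KB) and estimate how many
# rolled files a 2048 KB (2 MiB) rollSize would produce.
sizes_kb="1532 1040 908"
total=0
for s in $sizes_kb; do
  total=$((total + s))
done
echo "total: ${total} KB"            # 3480 KB, about 3.4 MB
# expected S3 objects = ceiling(total / 2048)
files=$(( (total + 2047) / 2048 ))
echo "expected S3 objects: ${files}" # 2
```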

[Screenshot: the two resulting files listed in the S3 bucket]

As we can see, the sizes of the files in S3 match the sum above:

2 MB + 1.4 MB = 3.4 MB

I just learned this, so please leave feedback if something is wrong.