Confluent Kafka-to-S3 sink: custom S3 naming for easy partitioning


I'm backing up my Kafka topics to S3 using Confluent's kafka-connect-s3 (https://www.confluent.io/hub/confluentinc/kafka-connect-s3). I want to be able to easily query this data with Athena and have it properly partitioned for cheap/fast reads.

I want to partition by the (year/month/day/topic) tuple. The year/month/day part is already solved by using the daily partitioner (https://docs.confluent.io/kafka-connect-s3-sink/current/index.html#partitioning-records-into-s3-objects): year=YYYY/month=MM/day=DD is worked into the path, so any Hive-based querying is partitioned on time. For how these key=value path segments map to partitions, note the example using userid= in the MSCK REPAIR TABLE docs:

https://docs.aws.amazon.com/athena/latest/ug/msck-repair-table.html
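For reference, here's roughly what my sink config looks like with the daily partitioner. This is a sketch: the bucket name, the env=prod prefix (topics.dir), the topic, and the JSON format are placeholders, and a real connector needs region/credentials settings as well.

    {
      "connector.class": "io.confluent.connect.s3.S3SinkConnector",
      "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
      "topics": "my-topic",
      "s3.bucket.name": "my-backup-bucket",
      "topics.dir": "env=prod",
      "partitioner.class": "io.confluent.connect.storage.partitioner.DailyPartitioner",
      "locale": "en-US",
      "timezone": "UTC",
      "flush.size": "1000"
    }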

However, based on these docs (https://docs.confluent.io/kafka-connect-s3-sink/current/index.html#s3-object-names), I get {topic} in the path, but there's no way to modify it to topic={topic}. I could work it into the prefix instead (env={env}/topic={topic} rather than just env={env}), but that seems redundant with the only-child {topic} directory underneath it, as the sketch below shows.
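That workaround would mean one connector per topic, with config along these lines (hypothetical names), producing the redundant env=prod/topic=my-topic/my-topic/... layout:

    "topics": "my-topic",
    "topics.dir": "env=prod/topic=my-topic"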

I also noticed the topic name appears in the object (file) name, delimited by + along with the partition number and starting offset.
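Putting it together, a full S3 key from the config above would look something like this (assuming JSON output and the default 10-digit offset zero-padding; bucket and topic names are placeholders):

    s3://my-backup-bucket/env=prod/my-topic/year=2021/month=06/day=15/my-topic+0+0000000000.json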

My question: how can I get topic={topic} in my path so Hive-based queries automatically create that partition? Or do I already get that for free by having it in the path (with no topic=) or in the object name (again, with no topic=)?


1 Answer

how can I get topic={topic} in my path so Hive-based queries automatically create that partition?

There isn't a way to do that with the connector's built-in partitioners.

The recommendation would be to create a partitioned table per topic, rather than making the topic itself a partition column.
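For example, a minimal sketch of a per-topic Athena table over the daily-partitioned layout above. The table name, columns, SerDe, and S3 location are hypothetical placeholders; adjust them to the actual record schema and output format:

    CREATE EXTERNAL TABLE my_topic (
      -- placeholder columns; match these to the record schema
      id string,
      payload string
    )
    PARTITIONED BY (year string, month string, day string)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://my-backup-bucket/env=prod/my-topic/';

    -- Registers the existing year=/month=/day= directories as partitions
    MSCK REPAIR TABLE my_topic;

Each topic gets its own table pointed at its own prefix; queries still prune on year/month/day, and the topic never needs to appear as a partition column.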