I am using Confluent's KafkaAvroDeserializer to deserialize Avro objects sent over Kafka. I want to write the received data to a Parquet file, and I need to be able to append data to the same Parquet output and to create a partitioned Parquet dataset.
I managed to create a Parquet file with AvroParquetWriter, but I didn't find how to add partitions or append to the same file:
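Roughly what I have so far (simplified; the schema, records, and output path are placeholders):

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.ParquetWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName

// Simplified example: writes a batch of deserialized records to a single file.
def writeRecords(schema: Schema, records: Seq[GenericRecord]): Unit = {
  val writer: ParquetWriter[GenericRecord] =
    AvroParquetWriter
      .builder[GenericRecord](new Path("/tmp/data.parquet")) // placeholder path
      .withSchema(schema)
      .withCompressionCodec(CompressionCodecName.SNAPPY)
      .build()

  records.foreach(writer.write)
  writer.close()
}
```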
Before using Avro I used Spark to write the Parquet files. With Spark, writing a partitioned Parquet dataset in append mode was trivial. Should I try creating RDDs from my Avro objects and use Spark to write the Parquet?
Personally, I would not use Spark for this.
Rather, I would use the Kafka Connect HDFS Sink Connector. Here is a config file that can get you started:
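Something along these lines for a standalone worker; the topic name, HDFS URL, and Schema Registry URL are placeholders you will need to change:

```properties
name=hdfs-parquet-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=my_avro_topic
hdfs.url=hdfs://namenode:8020
topics.dir=/topics
flush.size=1000

# Write Parquet files instead of the default Avro files
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat

# Read Confluent-serialized Avro using the Schema Registry
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081

# Optional: partition HDFS paths by a record field instead of the Kafka partition
# partitioner.class=io.confluent.connect.hdfs.partitioner.FieldPartitioner
# partition.field.name=my_field
```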
If you want HDFS partitions based on a field rather than the literal Kafka partition number, then refer to the configuration docs on the FieldPartitioner. If you want automatic Hive integration, see the docs on that as well.

Let's say you did want to use Spark, though. You can try AbsaOSS/ABRiS to read the Avro records into a DataFrame, then you should be able to do something like
df.write.format("parquet").save("/some/path")
(not exact code, because I have not tried it)
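A slightly fuller sketch of the write side, assuming `df` is the DataFrame produced by ABRiS; the partition column name ("event_date") and the output path are placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch only: partitionBy creates Hive-style directory partitions,
// and Append mode adds new files to an existing dataset instead of overwriting it.
def writeParquet(df: DataFrame): Unit =
  df.write
    .mode(SaveMode.Append)
    .partitionBy("event_date") // placeholder partition column
    .parquet("/some/path")     // placeholder output path
```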