CSV to Avro without Apache Spark in Scala


Is there a way to convert a CSV file to Avro without using Apache Spark? Most posts I see suggest using Spark, which I cannot do in my case. I have the schema in a separate file. I was thinking of a custom serializer and deserializer that would use the schema to convert CSV to Avro. Any kind of reference would help. Thanks.


There are 2 best solutions below

0
On

Avro is an open format, and many languages support it.

Just pick one. Python, for example, also supports CSV. Go or Java would work as well.

1
On

If you only have strings and primitives, you could put together a crude implementation like this fairly easily:

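import org.apache.avro.Schema
import org.apache.avro.Schema.Type._
import org.apache.avro.generic.{GenericData, GenericRecord}
import scala.io.Source
import scala.jdk.CollectionConverters._ // Scala 2.13+; on 2.12 use scala.collection.JavaConverters._

// Builds one GenericRecord per CSV line, converting each column according to the schema's field type.
def csvToAvro(file: String, schema: Schema): List[GenericRecord] = {
  // (field position, Avro type) pairs, in schema order
  val types = schema
    .getFields
    .asScala
    .map { f => f.pos -> f.schema.getType }

  val source = Source.fromFile(file)
  try {
    source
      .getLines()
      .map(_.split(",").toSeq)
      .map { data =>
        val rec = new GenericData.Record(schema) // a fresh record for every line
        (data zip types)
          .foreach {
            case (str, (idx, STRING)) => rec.put(idx, str)
            case (str, (idx, INT)) => rec.put(idx, str.toInt)
            case (str, (idx, LONG)) => rec.put(idx, str.toLong)
            case (str, (idx, FLOAT)) => rec.put(idx, str.toFloat)
            case (str, (idx, DOUBLE)) => rec.put(idx, str.toDouble)
            case (str, (idx, BOOLEAN)) => rec.put(idx, str.toBoolean)
            case (str, (idx, unknown)) => throw new IllegalArgumentException(s"Don't know how to convert $str to $unknown at $idx")
          }
        rec
      }
      .toList // materialize the records before the file is closed
  } finally source.close()
}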

Note this does not handle nullable fields: for those the type is going to be UNION, and you'll have to look inside the schema to find out the actual data type.
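To illustrate looking inside a UNION, here is a minimal sketch that assumes the common nullable shape of null plus exactly one other type. The helper name effectiveSchema is made up for this example; the Avro calls it uses (getType, getTypes) are part of the Schema API.

```scala
import org.apache.avro.Schema
import scala.jdk.CollectionConverters._ // Scala 2.13+

// Hypothetical helper: returns the single non-null branch of a nullable union,
// or the schema itself when it is not a union at all.
def effectiveSchema(fieldSchema: Schema): Schema =
  if (fieldSchema.getType != Schema.Type.UNION) fieldSchema
  else {
    // Drop the "null" branch and see what is left.
    val nonNull = fieldSchema.getTypes.asScala.filter(_.getType != Schema.Type.NULL).toList
    nonNull match {
      case single :: Nil => single
      case _ =>
        throw new IllegalArgumentException(
          s"Expected a union of null and one other type, got: $fieldSchema")
    }
  }
```

In the conversion you would then match on effectiveSchema(f.schema).getType instead of f.schema.getType. Which CSV value should become null (an empty cell, for instance) is a separate decision that this sketch does not make.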

Also, the CSV "parsing" here is very crude: just splitting at the comma isn't really a good idea, because it breaks if a string field happens to contain a comma in the data, or if fields are quoted with double quotes.
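As a rough idea of what a less crude split could look like, here is a sketch of a line splitter that honours double-quoted fields and treats "" inside quotes as a literal quote. The name splitCsvLine is made up for this example, and it does not handle quoted fields that span several lines; a dedicated CSV library is probably the better choice for real data.

```scala
// Hypothetical helper: splits one CSV line into fields, honouring double quotes.
def splitCsvLine(line: String, delimiter: Char = ','): Vector[String] = {
  val fields = Vector.newBuilder[String]
  val current = new StringBuilder
  var inQuotes = false
  var i = 0
  while (i < line.length) {
    val c = line.charAt(i)
    if (inQuotes) {
      if (c == '"' && i + 1 < line.length && line.charAt(i + 1) == '"') {
        current.append('"') // "" inside quotes is a literal quote
        i += 1              // skip the second quote of the pair
      } else if (c == '"') inQuotes = false // closing quote
      else current.append(c)                // delimiters inside quotes are data
    } else if (c == '"') inQuotes = true    // opening quote
    else if (c == delimiter) {
      fields += current.toString
      current.clear()
    } else current.append(c)
    i += 1
  }
  fields += current.toString // the last field has no trailing delimiter
  fields.result()
}
```

This could replace the _.split(...) call in the conversion, for example .map(splitCsvLine(_)).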

And also, you'll probably want to add some sanity-checking to make sure, for example, that the number of fields in the csv line matches the number of fields in the schema etc.
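One possible shape for such a check is sketched below. The name checkFieldCount and the choice of Either for reporting errors are assumptions made for this example; in the real conversion the expected count would come from schema.getFields.size.

```scala
// Hypothetical helper: accepts a parsed line only if its column count
// matches the number of fields in the schema.
def checkFieldCount(fields: Seq[String], expected: Int, lineNo: Int): Either[String, Seq[String]] =
  if (fields.length == expected) Right(fields)
  else Left(s"line $lineNo: expected $expected fields but found ${fields.length}")

// Usage: separate well-formed lines from malformed ones before converting.
val sampleLines = Seq("1,alice", "2", "3,carol")
val (bad, good) = sampleLines.zipWithIndex
  .map { case (line, i) => checkFieldCount(line.split(",", -1).toSeq, expected = 2, lineNo = i + 1) }
  .partition(_.isLeft)
```

Collecting the errors instead of throwing on the first one makes it easier to report every bad line in a single pass.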

But the above considerations notwithstanding, this should be sufficient to illustrate the approach and get you started.
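To finish the conversion, the records still have to be written out. A minimal sketch, assuming the records have already been built and that an Avro container file is the desired output; the helper name writeAvro and the file names are placeholders, not something from the answer above.

```scala
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}

// Hypothetical helper: writes already-built records to an Avro container file.
def writeAvro(records: Iterator[GenericRecord], schema: Schema, out: File): Unit = {
  val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
  try {
    writer.create(schema, out) // writes the file header, which embeds the schema
    records.foreach(r => writer.append(r))
  } finally writer.close()
}

// The schema kept in a separate file can be loaded with Schema.Parser
// (the path below is a placeholder):
// val schema = new Schema.Parser().parse(new File("user.avsc"))
// writeAvro(records.iterator, schema, new File("users.avro"))
```

Taking an Iterator keeps the writer independent of how the records are produced, so large files could be streamed rather than held in memory.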