We have a large number of Avro files on GCP (total size in the petabytes) that were written with an older schema: in the schema section of the file header, a few 'record' type columns carry "default":"null". When we try to load these files into BigQuery, it cannot interpret them. The fix appears to be changing "default":"null" to "default":null.
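To make the intended change concrete, here is a minimal sketch of the schema edit on the parsed JSON. The helper name `fix_null_defaults` is hypothetical (it is not part of avro or fastavro), and the sketch assumes the string default only needs replacing where the field's type is "null" or a union that lists "null" first; a plain string field whose default really is the text "null" is left alone.

```python
import json


def _is_nullable(field_type):
    """True for the "null" type or a union whose first branch is "null"."""
    if field_type == "null":
        return True
    return isinstance(field_type, list) and bool(field_type) and field_type[0] == "null"


def fix_null_defaults(node):
    """Return a copy of a parsed Avro schema with "default": "null"
    replaced by a JSON null on nullable fields (hypothetical helper)."""
    if isinstance(node, list):
        return [fix_null_defaults(item) for item in node]
    if not isinstance(node, dict):
        return node
    # Recurse first so nested record schemas are fixed as well.
    fixed = {key: fix_null_defaults(value) for key, value in node.items()}
    if fixed.get("default") == "null" and _is_nullable(fixed.get("type")):
        fixed["default"] = None  # serialized by json.dumps as null
    return fixed


if __name__ == "__main__":
    old = json.loads(
        '{"type": "record", "name": "Row", "fields": ['
        '{"name": "details", "type": ["null", "string"], "default": "null"}]}'
    )
    print(json.dumps(fix_null_defaults(old)))
```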

We have written a couple of custom Python scripts (using the avro and fastavro libraries) to convert the header to the newer format, but they take a long time: about 25 minutes to process even a 1 GB file.
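One possible direction, shown only as an unvalidated sketch: in the Avro object container format the schema is stored once in the file header (magic bytes, a metadata map, a 16-byte sync marker), and a field's default does not change how records are encoded. Under that assumption, only the header needs rewriting and the data blocks can be copied byte for byte without decoding any records. The names `read_header`, `write_header`, `rewrite_header` and the `fix_schema` callback are hypothetical; they are not functions from avro or fastavro.

```python
import json
import shutil

MAGIC = b"Obj\x01"  # Avro object container file magic


def read_long(f):
    """Decode one zigzag varint-encoded Avro long."""
    shift = acc = 0
    while True:
        chunk = f.read(1)
        if not chunk:
            raise EOFError("truncated Avro header")
        acc |= (chunk[0] & 0x7F) << shift
        if not chunk[0] & 0x80:
            return (acc >> 1) ^ -(acc & 1)
        shift += 7


def write_long(n):
    """Encode an int as a zigzag varint Avro long."""
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def read_header(f):
    """Return (metadata dict, sync marker); leaves f at the first data block."""
    if f.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(f)
        if count == 0:
            break
        if count < 0:  # a negative count is followed by the block's byte size
            read_long(f)
            count = -count
        for _ in range(count):
            key = f.read(read_long(f)).decode("utf-8")
            meta[key] = f.read(read_long(f))
    return meta, f.read(16)


def write_header(out, meta, sync):
    """Write magic, the metadata map as a single map block, and the sync marker."""
    out.write(MAGIC)
    if meta:
        out.write(write_long(len(meta)))
        for key, value in meta.items():
            raw = key.encode("utf-8")
            out.write(write_long(len(raw)) + raw + write_long(len(value)) + value)
    out.write(write_long(0))  # end of map
    out.write(sync)


def rewrite_header(src, dst, fix_schema):
    """Copy src to dst, replacing only avro.schema with fix_schema(parsed schema)."""
    meta, sync = read_header(src)
    schema = json.loads(meta["avro.schema"])
    meta["avro.schema"] = json.dumps(fix_schema(schema)).encode("utf-8")
    write_header(dst, meta, sync)
    shutil.copyfileobj(src, dst)  # data blocks are copied untouched
```

Here `src` and `dst` are binary file objects, and `fix_schema` is whatever function applies the "default" correction to the parsed schema. The whole object still has to be written again, so this removes the decode and re-encode cost rather than the I/O cost.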

Because the file count is large, the conversion would run for months, even with parallel processing. Is there an easier or faster way to do this?
