How to serialize a large file (more than 5 GB) to Avro?


I want to serialize an XML file of about 15 GB to Avro and store it in Hadoop using Python 3.6. My approach is to load the data with xml.dom.minidom into a dictionary variable and then save it to an Avro file. While this works perfectly for a sample XML file of a few KB, can I still store the whole big XML document in that variable? I guess there is a memory challenge with this approach. What is the best way to handle this situation?


There is 1 answer below.


The whole point of this kind of serialization is to avoid loading or handling the entire file at once. You need to split your file into smaller chunks and then serialize them, as shown in the sketch below.
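
For example, here is a minimal sketch of the chunking idea using xml.etree.ElementTree.iterparse from the standard library. It assumes the XML consists of repeated <record> elements with <id> and <value> children; those tag names are hypothetical, so adjust them to your actual document.

    import xml.etree.ElementTree as ET

    def iter_records(xml_path):
        # Stream the file element by element instead of building the whole DOM.
        for event, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag == "record":  # hypothetical repeating element name
                yield {
                    "id": elem.findtext("id"),
                    "value": elem.findtext("value"),
                }
                elem.clear()  # release the parsed element to keep memory usage flat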

For writing, you can use DataFileWriter from the avro.datafile package or writer from the fastavro package; the corresponding DataFileReader / reader can then read the records back incrementally.
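
A minimal sketch with fastavro, continuing the iter_records generator from the sketch above (the schema, field names, and file names are assumptions for illustration). fastavro's writer consumes the record iterable lazily and flushes blocks as it goes, so the whole file never has to fit in memory.

    from fastavro import parse_schema, writer

    # Hypothetical schema matching the records yielded by iter_records above.
    schema = parse_schema({
        "type": "record",
        "name": "Record",
        "fields": [
            {"name": "id", "type": ["null", "string"]},
            {"name": "value", "type": ["null", "string"]},
        ],
    })

    with open("records.avro", "wb") as out:
        # writer() pulls records from the generator one block at a time,
        # so only a small buffer is held in memory.
        writer(out, schema, iter_records("big_input.xml"))

The resulting records.avro file can then be copied into HDFS, and read back incrementally with fastavro's reader or Avro's DataFileReader.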