Most efficient way to insert a large amount of data (230M entries) into a PySpark table


What is the most efficient way to insert a large amount of data that is generated in a Python script? I am retrieving .grib files containing weather parameters from several sources. These grib files hold grid-based data with shape (1201, 2400, 80), which works out to roughly 230 million values.

I have written a script in which each value is combined with the bounds of its corresponding latitude/longitude cell, resulting in a data structure as follows (a rough sketch of the pairing logic follows the table):

+--------------------+-------+-------+--------+--------+
|               value|lat_min|lat_max| lon_min| lon_max|
+--------------------+-------+-------+--------+--------+
|           0.0011200|-90.075|-89.925|-180.075|-179.925|
|           0.0016125|-90.075|-89.925|-179.925|-179.775|
+--------------------+-------+-------+--------+--------+
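Simplified, the pairing logic in my script is roughly the following. The grib decoding itself (e.g. with pygrib or cfgrib) is left out, and `grid` stands for the decoded (1201, 2400, 80) NumPy array:

```python
import numpy as np

RES = 0.15  # grid resolution in degrees

# cell centres: 1201 latitude rows, 2400 longitude columns
lat_centers = -90.0 + RES * np.arange(1201)
lon_centers = -180.0 + RES * np.arange(2400)

def cell_rows(field_2d):
    """Yield (value, lat_min, lat_max, lon_min, lon_max) for one 1201x2400 field."""
    for i, lat in enumerate(lat_centers):
        for j, lon in enumerate(lon_centers):
            yield (float(field_2d[i, j]),
                   lat - RES / 2, lat + RES / 2,
                   lon - RES / 2, lon + RES / 2)
```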

I have tried looping over each of the 80 time steps and creating a PySpark DataFrame per step, as well as reshaping the whole array into shape (230592000,), but both approaches either take ages to complete or fry the cluster's memory.
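The per-time-step attempt looks roughly like this (sketch only; `weather_grid` is just a placeholder table name, and `cell_rows`/`grid` are from the snippet above). Each pass materialises about 2.9M rows in the driver before handing them to Spark:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
cols = ["value", "lat_min", "lat_max", "lon_min", "lon_max"]

# one pass per time step: build the rows in the driver, then append to the table
for t in range(80):
    pdf = pd.DataFrame(list(cell_rows(grid[:, :, t])), columns=cols)
    spark.createDataFrame(pdf).write.mode("append").saveAsTable("weather_grid")
```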

I have just discovered Resilient Distributed Datasets (RDDs), and I can use the map function to build the full 230M entries as an RDD, but converting this to a DataFrame or writing it to a file is again very slow.
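The RDD attempt is roughly the following (again a sketch, reusing `spark`, `grid`, `RES` and `cols` from the snippets above). Building the RDD itself is quick, but `createDataFrame` / `write` is where it crawls; note that `grid` is captured in the closure, so the full array gets shipped along with the tasks:

```python
from pyspark.sql.types import StructType, StructField, DoubleType

sc = spark.sparkContext
N_LAT, N_LON, N_T = 1201, 2400, 80

def index_to_row(k):
    """Map a flat index 0..230591999 back to (value, lat/lon bounds)."""
    i, rem = divmod(k, N_LON * N_T)   # latitude index
    j, t = divmod(rem, N_T)           # longitude and time-step indices
    lat = -90.0 + i * RES
    lon = -180.0 + j * RES
    return (float(grid[i, j, t]),
            lat - RES / 2, lat + RES / 2,
            lon - RES / 2, lon + RES / 2)

schema = StructType([StructField(c, DoubleType()) for c in cols])

rdd = sc.parallelize(range(N_LAT * N_LON * N_T), numSlices=800).map(index_to_row)
df = spark.createDataFrame(rdd, schema)   # this conversion (and df.write) is the slow part
```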

Is there a way to multithread, distribute, or otherwise optimize this so that it is both fast and does not need large amounts of memory?

Thanks in advance!
