Most efficient way to insert a large amount of data (230M entries) into a PySpark table


What is the most efficient way to insert a large amount of data that is generated in a Python script? I am retrieving .grib files containing weather parameters from several sources. These grib files hold grid-based data with shape (1201, 2400, 80), which works out to roughly 230 million values.

I have written a script in which each value is combined with the bounds of its corresponding latitude/longitude cell, resulting in a data structure as follows (a rough sketch of the pairing logic follows the table):

+--------------------+-------+-------+--------+--------+
|               value|lat_min|lat_max| lon_min| lon_max|
+--------------------+-------+-------+--------+--------+
|           0.0011200|-90.075|-89.925|-180.075|-179.925|
|           0.0016125|-90.075|-89.925|-179.925|-179.775|
+--------------------+-------+-------+--------+--------+
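Simplified, the pairing logic in my script is roughly the following. The grib decoding itself (e.g. with pygrib or cfgrib) is left out, and `grid` stands for the decoded (1201, 2400, 80) NumPy array:

```python
import numpy as np

RES = 0.15  # grid resolution in degrees

# cell centres: 1201 latitude rows, 2400 longitude columns
lat_centers = -90.0 + RES * np.arange(1201)
lon_centers = -180.0 + RES * np.arange(2400)

def cell_rows(field_2d):
    """Yield (value, lat_min, lat_max, lon_min, lon_max) for one 1201x2400 field."""
    for i, lat in enumerate(lat_centers):
        for j, lon in enumerate(lon_centers):
            yield (float(field_2d[i, j]),
                   lat - RES / 2, lat + RES / 2,
                   lon - RES / 2, lon + RES / 2)
```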

I have tried looping over each of the 80 time steps and creating a PySpark DataFrame per step, as well as reshaping the whole array into shape (230592000,), but both approaches either take ages to complete or fry the cluster's memory.
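The per-time-step attempt looks roughly like this (sketch only; `weather_grid` is just a placeholder table name, and `cell_rows`/`grid` are from the snippet above). Each pass materialises about 2.9M rows in the driver before handing them to Spark:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
cols = ["value", "lat_min", "lat_max", "lon_min", "lon_max"]

# one pass per time step: build the rows in the driver, then append to the table
for t in range(80):
    pdf = pd.DataFrame(list(cell_rows(grid[:, :, t])), columns=cols)
    spark.createDataFrame(pdf).write.mode("append").saveAsTable("weather_grid")
```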

I have just discovered Resilient Distributed Datasets (RDDs), and I can use the map function to build the full 230M entries as an RDD, but converting this to a DataFrame or writing it to a file is again very slow.
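The RDD attempt is roughly the following (again a sketch, reusing `spark`, `grid`, `RES` and `cols` from the snippets above). Building the RDD itself is quick, but `createDataFrame` / `write` is where it crawls; note that `grid` is captured in the closure, so the full array gets shipped along with the tasks:

```python
from pyspark.sql.types import StructType, StructField, DoubleType

sc = spark.sparkContext
N_LAT, N_LON, N_T = 1201, 2400, 80

def index_to_row(k):
    """Map a flat index 0..230591999 back to (value, lat/lon bounds)."""
    i, rem = divmod(k, N_LON * N_T)   # latitude index
    j, t = divmod(rem, N_T)           # longitude and time-step indices
    lat = -90.0 + i * RES
    lon = -180.0 + j * RES
    return (float(grid[i, j, t]),
            lat - RES / 2, lat + RES / 2,
            lon - RES / 2, lon + RES / 2)

schema = StructType([StructField(c, DoubleType()) for c in cols])

rdd = sc.parallelize(range(N_LAT * N_LON * N_T), numSlices=800).map(index_to_row)
df = spark.createDataFrame(rdd, schema)   # this conversion (and df.write) is the slow part
```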

Is there a way to multithread, distribute, or otherwise optimize this so that it is both fast and does not need large amounts of memory?

Thanks in advance!
