Using repartition in PySpark for a huge set of data


I have a large amount of data in a few Oracle tables (around 50 GB in total), and these tables have no partitions. I need to join them, perform some calculations, read the result into a PySpark DataFrame, and finally write it out as a CSV file to S3. Running the query against the database, fetching the data, and writing it directly to S3 takes a long time, even though the result of the query is only around 100 MB.
Can using repartition on this DataFrame help improve performance in any way? Or is there another approach that would make this operation faster?
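
For context, this is roughly what my read/write looks like. The connection details, table names, and the partition column (`id`) below are placeholders, and I am not sure whether the JDBC partitioning options are the right fix or whether a repartition on the DataFrame would be better:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-to-s3").getOrCreate()

# Join/calculation pushed down as a subquery so Oracle does the heavy lifting;
# table names, columns, and connection details are placeholders.
query = (
    "(SELECT t1.id, t1.col_a, t2.col_b "
    "FROM table_1 t1 JOIN table_2 t2 ON t1.id = t2.id) q"
)

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//db-host:1521/service_name")
    .option("dbtable", query)
    .option("user", "db_user")
    .option("password", "db_password")
    .option("driver", "oracle.jdbc.OracleDriver")
    # Parallel read: Spark issues numPartitions range queries split on partitionColumn
    .option("partitionColumn", "id")
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "8")
    .load()
)

# The result is only ~100 MB, so collapse to one partition to write a single CSV file
df.coalesce(1).write.mode("overwrite").option("header", "true") \
    .csv("s3a://my-bucket/output/")
```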
