I am running an application on PySpark. Below is a snapshot of the distribution of executors for this application. It looks non-uniformly distributed. Can someone have a look and tell me where the problem is?
Description and my problem:
I am running my application on a huge amount of data, in which I am filtering and joining 3 datasets. After that, I am caching the joined dataset to generate and aggregate features for different time periods (meaning the cached dataset is used to generate features in a loop). After this, I am trying to store these features in a Parquet file, and writing this Parquet file is taking too much time.
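Roughly, my pipeline looks like the sketch below (dataset paths, column names, and time periods are simplified placeholders, not my real code):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("feature-generation").getOrCreate()

# Load and filter the 3 datasets (paths and filters are placeholders)
df1 = spark.read.parquet("/data/dataset1").filter(F.col("active") == 1)
df2 = spark.read.parquet("/data/dataset2")
df3 = spark.read.parquet("/data/dataset3")

# Join and cache, since the joined dataset is reused in the loop below
joined = df1.join(df2, "id").join(df3, "id").cache()

# Generate and aggregate features for each time period
for name, days in {"7d": 7, "30d": 30, "90d": 90}.items():
    features = (joined
                .filter(F.col("event_date") >= F.date_sub(F.current_date(), days))
                .groupBy("id")
                .agg(F.count("*").alias(f"cnt_{name}")))
    # This Parquet write is the step that takes too long
    features.write.mode("overwrite").parquet(f"/output/features_{name}")
```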
Can anyone help me solve this? Let me know if you need further information.

As you stated (emphasis mine):

"I am filtering and *joining* 3 datasets. After that, I am caching the joined dataset to generate and *aggregate* features for different time periods."
Both joins and, to a lesser extent, aggregations can result in a skewed distribution of data if the join key or grouping columns are not uniformly distributed; this is a natural consequence of the required shuffles.
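If you want to confirm that skew is the cause, inspecting the key distribution before the join is a cheap first check. Something along these lines (the path and the `id` column are placeholders for one side of your join):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.read.parquet("/data/dataset1")  # placeholder for one input of the join

# Count rows per join key; a few keys with counts orders of magnitude
# above the rest means the shuffle for the join will be skewed.
(df1
 .groupBy("id")              # the join/grouping column
 .count()
 .orderBy(F.desc("count"))
 .show(20))
```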
In the general case there is very little you can do about it. In specific cases you can gain a little with broadcasting or salting, but it doesn't look like the problem is particularly severe in your case.
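For completeness, here is a minimal sketch of both techniques; the paths, the `id` column, and the salt factor `N` are assumptions for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
large = spark.read.parquet("/data/large")  # placeholder paths
small = spark.read.parquet("/data/small")

# 1. Broadcasting: only viable when one side is small enough to fit in
#    memory on every executor; it avoids shuffling the large side entirely.
broadcast_join = large.join(F.broadcast(small), "id")

# 2. Salting: spread a hot key over N artificial sub-keys so that no
#    single task receives all of its rows.
N = 10  # assumed salt factor; tune to your skew
salted_large = large.withColumn("salt", (F.rand() * N).cast("int"))
salted_small = small.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(N)])))

salted_join = (salted_large
               .join(salted_small, ["id", "salt"])
               .drop("salt"))
```

Note that salting replicates the smaller side N times, so it trades extra data volume for a more even distribution of work across tasks.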