I have broadcast a variable in Spark (Scala), but because of the size of the data it gives this output:
WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, 10.240.0.33): java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
When run on a smaller database, it works fine. I want to know the size of this broadcast variable (in MB/GB). Is there a way to find this?
This is because the driver runs out of memory. By default the driver gets 1g of memory; this can be increased with --driver-memory 4g on spark-submit. By default Spark will broadcast a DataFrame in a join when it is smaller than 10 MB (the spark.sql.autoBroadcastJoinThreshold setting), although I found that broadcasting bigger DataFrames is also not a problem. Broadcasting might significantly speed up the join, but when the DataFrame becomes too big it can even slow the join down, because of the overhead of shipping all the data to the different executors.

What is your data source? When the table is read into Spark, the SQL tab of the UI (open the DAG diagram of the query you are executing) should give some metadata about the number of rows and size.
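As a rough sketch of where those two knobs live (the 100 MB value and the app name are just examples, not recommendations; driver memory is fixed when the JVM starts, so it cannot be changed from code):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Driver memory must be set before the driver JVM starts, so it goes
// on the command line: spark-submit --driver-memory 4g ...
val spark = SparkSession.builder()
  .appName("broadcast-join-demo")
  // Raise the auto-broadcast threshold from the 10 MB default to 100 MB
  // (value in bytes); set it to -1 to disable automatic broadcasting.
  .config("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)
  .getOrCreate()

// You can also force a broadcast join explicitly, regardless of the
// threshold (largeDf/smallDf are hypothetical DataFrames):
// largeDf.join(broadcast(smallDf), "key")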
Otherwise you could also check the actual size on HDFS using:

hdfs dfs -du /path/to/table/

Hope this helps.
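If you want to measure the size of the broadcast object itself from code, Spark ships org.apache.spark.util.SizeEstimator, which estimates an object's footprint on the JVM heap. A minimal sketch (the lookup map is made-up sample data; note this is the deserialized in-memory size, which can differ from the serialized size Spark ships over the network):

import org.apache.spark.util.SizeEstimator

// Sample data standing in for whatever object you broadcast.
val lookup: Map[String, Int] = (1 to 1000000).map(i => s"key$i" -> i).toMap

// Estimate its footprint on the JVM heap, in bytes.
val bytes = SizeEstimator.estimate(lookup)
println(f"Estimated size: ${bytes / 1024.0 / 1024.0}%.1f MB")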