Knowing the size of a broadcast variable in Spark

I have broadcast a variable in Spark (Scala), but because of the size of the data, it fails with the following output:

WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, 10.240.0.33): java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)

When run on a smaller dataset, it works fine. I want to know the size of this broadcast variable (in MB/GB). Is there a way to find this?

There are 2 answers below.

Answer 1:

Assuming you are trying to broadcast obj, you can estimate its size as follows:

    import org.apache.spark.util.SizeEstimator

    // Estimate of obj's in-memory size, in bytes
    val objSize = SizeEstimator.estimate(obj)

Note that this is an estimate, so it will not be 100% exact.
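
For a quick end-to-end check, here is a minimal sketch (myLookupMap is a hypothetical stand-in for whatever you are broadcasting) that prints the estimated size in MB before calling broadcast:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.util.SizeEstimator

    object BroadcastSizeCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("BroadcastSizeCheck").getOrCreate()

        // Hypothetical stand-in for the data you intend to broadcast
        val myLookupMap: Map[String, Int] =
          (1 to 100000).map(i => s"key$i" -> i).toMap

        // SizeEstimator.estimate returns bytes; convert to MB for readability
        val bytes = SizeEstimator.estimate(myLookupMap)
        println(f"Estimated broadcast size: ${bytes / 1024.0 / 1024.0}%.2f MB")

        val bc = spark.sparkContext.broadcast(myLookupMap)
        // ... use bc.value inside your transformations ...

        spark.stop()
      }
    }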

Answer 2:

This is because the driver runs out of memory. By default driver memory is 1g; it can be increased with --driver-memory 4g. By default Spark will automatically broadcast a DataFrame in a join when it is smaller than 10 MB (controlled by spark.sql.autoBroadcastJoinThreshold), although I found that broadcasting bigger DataFrames is also not a problem. This can significantly speed up the join, but when the DataFrame becomes too big, it can even slow the join down because of the overhead of shipping all the data to every executor.
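
As a rough sketch of how those knobs can be set (the 50 MB threshold and the file paths are illustrative assumptions, not recommendations):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder()
      .appName("BroadcastJoinExample")
      // Raise the auto-broadcast threshold from the default 10 MB to 50 MB
      .config("spark.sql.autoBroadcastJoinThreshold", 50L * 1024 * 1024)
      .getOrCreate()

    // Hypothetical input paths
    val large = spark.read.parquet("/path/to/large")
    val small = spark.read.parquet("/path/to/small")

    // Force a broadcast of `small` regardless of the threshold
    val joined = large.join(broadcast(small), Seq("id"))
    joined.explain() // look for BroadcastHashJoin in the physical plan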

What is your data source? When the table is read into Spark, the SQL tab of the Spark UI (open the DAG diagram of the query you are executing) should give some metadata about the number of rows and the size. Otherwise you could also check the actual size on HDFS using hdfs dfs -du /path/to/table/.
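
If you prefer to read that size estimate programmatically rather than from the UI, Catalyst exposes its own statistic on the query plan; a small sketch (Spark 2.3+, and the path is a hypothetical example):

    val df = spark.read.parquet("/path/to/table")
    // Catalyst's estimate of the plan's output size, in bytes (a BigInt)
    val sizeInBytes = df.queryExecution.optimizedPlan.stats.sizeInBytes
    println(s"Estimated table size: ${sizeInBytes / 1024 / 1024} MB")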

Hope this helps.