I have broadcast a variable in Spark (Scala), but because of the size of the data it gives this output:
WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, 10.240.0.33): java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
When run on a smaller database, it works fine. I want to know the size of this broadcast variable (in MB/GB). Is there a way to find this?
This is because the driver runs out of memory. By default the driver gets 1g of memory; this can be increased with --driver-memory 4g on spark-submit. By default Spark will broadcast a DataFrame in a join when it is smaller than 10 MB (the spark.sql.autoBroadcastJoinThreshold setting), although I found that broadcasting bigger DataFrames is also not a problem. Broadcasting might significantly speed up the join, but when the DataFrame becomes too big it can even slow the join down, because of the overhead of shipping all the data to the different executors.

What is your data source? When the table is read into Spark, the SQL tab of the UI (open the DAG diagram of the query you are executing) should give some metadata about the number of rows and size.
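As a rough sketch of where those two knobs live (the 100 MB value and the app name are just examples, not recommendations; driver memory is fixed when the JVM starts, so it cannot be changed from code):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Driver memory must be set before the driver JVM starts, so it goes
// on the command line: spark-submit --driver-memory 4g ...
val spark = SparkSession.builder()
  .appName("broadcast-join-demo")
  // Raise the auto-broadcast threshold from the 10 MB default to 100 MB
  // (value in bytes); set it to -1 to disable automatic broadcasting.
  .config("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)
  .getOrCreate()

// You can also force a broadcast join explicitly, regardless of the
// threshold (largeDf/smallDf are hypothetical DataFrames):
// largeDf.join(broadcast(smallDf), "key")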
Otherwise you could also check the actual size on HDFS using:

hdfs dfs -du /path/to/table/

Hope this helps.
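If you want to measure the size of the broadcast object itself from code, Spark ships org.apache.spark.util.SizeEstimator, which estimates an object's footprint on the JVM heap. A minimal sketch (the lookup map is made-up sample data; note this is the deserialized in-memory size, which can differ from the serialized size Spark ships over the network):

import org.apache.spark.util.SizeEstimator

// Sample data standing in for whatever object you broadcast.
val lookup: Map[String, Int] = (1 to 1000000).map(i => s"key$i" -> i).toMap

// Estimate its footprint on the JVM heap, in bytes.
val bytes = SizeEstimator.estimate(lookup)
println(f"Estimated size: ${bytes / 1024.0 / 1024.0}%.1f MB")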