I am trying to understand the spark ui and hdfs ui while using pyspark. Following are my properties for the Session that I am running
pyspark --master yarn --num-executors 4 --executor-memory 6G --executor-cores 3 --conf spark.dynamicAllocation.enabled=false --conf spark.executor.memoryOverhead=2G --conf spark.memory.offHeap.size=2G --conf spark.pyspark.memory=2G
I ran a simple piece of code that reads a file (~9 GB on disk) into memory twice, joins the two DataFrames, persists the result, and runs a count action.
#Reading the same file twice
df_sales = spark.read.format("parquet").load("gs://monsoon-credittech.appspot.com/spark_datasets/sales_parquet")
df_sales_copy = spark.read.format("parquet").load("gs://monsoon-credittech.appspot.com/spark_datasets/sales_parquet")
#caching one
from pyspark import StorageLevel
df_sales = df_sales.persist(StorageLevel.MEMORY_AND_DISK)
#merging the two read files
df_merged = df_sales.join(df_sales_copy,df_sales.order_id==df_sales_copy.order_id,'inner')
df_merged = df_merged.persist(StorageLevel.MEMORY_AND_DISK)
#calling an action to trigger the transformations
df_merged.count()
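For reference, a quick way to confirm which storage level actually got attached to each DataFrame, independent of what the Storage tab shows, is the standard storageLevel attribute (this is just an illustrative check, not part of the run above):
#check the storage level attached to each DataFrame
print(df_sales.storageLevel)       #should report both memory and disk enabled
print(df_merged.storageLevel)
print(df_sales_copy.storageLevel)  #never persisted, so no storage flags set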
I expect:
- The data to be persisted in memory first, spilling to disk only once memory is full
- The HDFS capacity to be used at least to the extent that the persist spilled data to disk
Both of these expectations are failing in the monitoring that follows:
Expectation 1: Failed. The data actually appears to be persisted on disk first and then perhaps in memory; I'm not sure. The following image should help. It is definitely not memory-first, unless I'm missing something.
Expectation 2: Failed. The HDFS capacity is hardly used at all (only 1.97 GB).
Can you please help me reconcile my understanding and tell me where I'm wrong in expecting the mentioned behaviour and what it actually is that I'm looking at in those images?
Don't persist to disk unless you have a really good reason (you should only performance-tune once you have identified a bottleneck). Writing to disk takes far longer than simply proceeding with the computation, so persisting to disk should be reserved for cases where the cached data will be reused; reading data in just to count it is not one of them.
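As a rough sketch of what I mean (same bucket path as in your question, no persist calls and no tuning), the same count can be done like this:
#same job without any persist calls
df_sales = spark.read.parquet("gs://monsoon-credittech.appspot.com/spark_datasets/sales_parquet")
df_sales_copy = spark.read.parquet("gs://monsoon-credittech.appspot.com/spark_datasets/sales_parquet")
df_merged = df_sales.join(df_sales_copy, "order_id", "inner")
df_merged.count()   #Spark streams the data through the shuffle; nothing is cached to disk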
I would also humbly suggest not altering the Spark launch parameters unless you have a reason to (and you understand them). With your current launch configuration you are not going to fit the data in memory: you have split the space into 2 GB allocations, which means you will never fit 9 GB into the 6 GB of executor heap you have. Consider removing all of that configuration and seeing how it changes what ends up in memory. Experimenting with these launch parameters one at a time will help you learn what each one does.
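If you want to see which settings the session actually picked up while you experiment, you can print the memory-related keys from inside pyspark (a sketch; spark.conf.get falls back to the supplied default when a key is unset):
#inspect the effective memory-related settings of the running session
for key in ["spark.executor.memory",
            "spark.executor.memoryOverhead",
            "spark.memory.offHeap.size",
            "spark.pyspark.memory",
            "spark.memory.fraction"]:
    print(key, spark.conf.get(key, "<not set / default>"))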
It is hard to give much more advice than that, because there is a lot here to learn and explain. Perhaps you'll get lucky and someone else will post a more detailed answer.