I am trying to find complete documentation on the internal architecture of Apache Spark, but I have had no luck so far.
For example, I am trying to understand the following: assume we have a 1 TB text file on HDFS (3 nodes in the cluster, replication factor of 1). The file will be split into 128 MB blocks, and each block will be stored on only one node. We run Spark Workers on these nodes. I know that Spark tries to process data stored in HDFS on the same node (to avoid network I/O). As a concrete task, say I want to do a word count on this 1 TB text file, roughly like the sketch below.
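Here is roughly what I mean, as a minimal sketch using the RDD API (the HDFS paths are placeholders I made up for the example):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    // Each HDFS block becomes (roughly) one partition of the RDD.
    val lines = sc.textFile("hdfs:///data/big.txt")

    val counts = lines
      .flatMap(_.split("\\s+"))   // split lines into words
      .map(word => (word, 1))     // pair each word with a count of 1
      .reduceByKey(_ + _)         // shuffle: combine counts per word across partitions

    counts.saveAsTextFile("hdfs:///data/wordcount-output")
    sc.stop()
  }
}
```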
Here are my questions:
- Will Spark load a chunk (128 MB) into RAM, count the words, drop it from memory, and then move on to the next chunk sequentially? What happens if there is not enough available RAM?
- When will Spark read non-local data from HDFS?
- What if I need to do a more complex task, where the results of each iteration on each Worker have to be transferred to all other Workers (shuffling?)? Do I need to write them to HDFS myself and then read them back? For example, I can't understand how K-means clustering or gradient descent works on Spark (see the sketch after this list).
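To make the last point concrete, here is the kind of iterative computation I mean, written the naive way I would attempt it (a rough sketch only; the input path, feature count, and step size are made-up assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GradientDescentSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GDSketch"))

    // Each record: (label, features); features parsed into a plain Array[Double].
    val data = sc.textFile("hdfs:///data/points.txt")
      .map { line =>
        val parts = line.split(",").map(_.toDouble)
        (parts.head, parts.tail)
      }
      .cache() // keep the parsed data in memory across iterations

    val numFeatures = 10          // assumption for the sketch
    var weights = Array.fill(numFeatures)(0.0)
    val stepSize = 0.01

    for (_ <- 1 to 100) {
      // Ship the current weights to every executor.
      val w = sc.broadcast(weights)

      // Each partition computes partial gradients locally; reduce sums them
      // on the driver. Do the per-iteration results really stay off HDFS here?
      val gradient = data.map { case (label, features) =>
        val prediction = features.zip(w.value).map { case (x, wi) => x * wi }.sum
        val error = prediction - label
        features.map(_ * error)
      }.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })

      weights = weights.zip(gradient).map { case (wi, g) => wi - stepSize * g }
      w.unpersist()
    }

    println(weights.mkString(", "))
    sc.stop()
  }
}
```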
I would appreciate any link to an Apache Spark architecture guide.
Adding to the other answers, I would like to include the Spark core architecture diagram here, as it was mentioned in the question.
The Master is the entry point here.
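For example, in standalone mode a driver application registers with the Master through its URL when the SparkContext is created (a minimal sketch; the host name and port are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ConnectToMaster {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("MyApp")
      .setMaster("spark://spark-master:7077") // the Master's URL is the entry point

    val sc = new SparkContext(conf)
    println(s"Default parallelism: ${sc.defaultParallelism}")
    sc.stop()
  }
}
```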