I am conducting a study comparing the execution time of a Bloom filter join in two environments: an Apache Spark cluster and a single Apache Spark instance. I have compared the overall run time of the two environments, but I also want to compare the individual tasks in each stage to see which computation shows the most significant difference.
I have taken screenshots of the DAG of Stage 0 and of the list of tasks executed in Stage 0.
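(The same per-task breakdown can also be pulled programmatically from Spark's monitoring REST API instead of from screenshots; below is a minimal sketch, assuming the driver UI is reachable on port 4040 — the host name and application ID are placeholders for my setup:)

import scala.io.Source

// Fetch the list of tasks for attempt 0 of Stage 0 as JSON from the
// monitoring REST API exposed by the driver UI while the app is running.
val driverHost = "driver-host"                 // placeholder: my driver's host
val appId = "app-XXXXXXXXXXXXXX-XXXX"          // placeholder: my application ID
val url = s"http://$driverHost:4040/api/v1/applications/$appId/stages/0/0/taskList"
println(Source.fromURL(url).mkString)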
Here is my program:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val appName = "scenario4-study2b"
val spark = SparkSession.builder()
  .appName(appName)
  .getOrCreate()
val sc = spark.sparkContext

val s3accessKeyAws = "**"
val s3secretKeyAws = "**"
val connectionTimeOut = "600000"
val s3endPointLoc: String = "http://192.168.1.100"
// Apply the S3A settings (assuming the readers use the default Hadoop configuration)
sc.hadoopConfiguration.set("fs.s3a.access.key", s3accessKeyAws)
sc.hadoopConfiguration.set("fs.s3a.secret.key", s3secretKeyAws)
sc.hadoopConfiguration.set("fs.s3a.endpoint", s3endPointLoc)
sc.hadoopConfiguration.set("fs.s3a.connection.timeout", connectionTimeOut)
// Read files from S3, about 70 GB in total
var rddL: RDD[String] = sc.emptyRDD[String]
for (index <- 16 to 29) {
  val fileName = f"$index%02d"
  rddL = Tools.readS3A(sc, fileName).union(rddL)
}
// Build the Bloom filter from the 70 GB side
val BF = Tools.rdd2BF(rddL)
// Read files from S3, about 80 GB in total, and probe the Bloom filter
var countRS: Long = 0
for (index <- 0 to 15) {
  val fileName = f"$index%02d"
  countRS = countRS + Tools.readS3A(sc, fileName).filter(item => BF.contains(item)).count()
}
print("\nResult: " + countRS + "\n\n")
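To try to correlate task IDs with the work they perform, I am considering registering a SparkListener right after creating sc; this is only a minimal sketch (the log format is my own choice):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Log, for every finished task, its stage, task ID, partition index and run time.
// Must be registered before the jobs run (i.e. right after sc is created).
sc.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val info = taskEnd.taskInfo
    // taskMetrics can be null for failed tasks, so guard with Option
    val runTime = Option(taskEnd.taskMetrics).map(_.executorRunTime).getOrElse(-1L)
    println(s"stage=${taskEnd.stageId} task=${info.taskId} partition=${info.index} runTimeMs=$runTime")
  }
})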
My questions are:
- Can we determine which tasks are responsible for executing each step scheduled in the DAG during processing?
- Is it possible to know the function of each task (e.g., what is task ID 0 responsible for? What is task ID 1 responsible for? ...)?
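In case it is relevant to an answer: one thing I am experimenting with for attributing stages to code regions is labelling each phase with SparkContext.setJobDescription, so the jobs that phase spawns carry the description in the Spark UI. A sketch of how I would annotate my two phases:

// Label the jobs triggered by each phase so the Spark UI shows which
// code region a stage (and its tasks) belongs to.
sc.setJobDescription("build Bloom filter from the 70 GB input")
val BF = Tools.rdd2BF(rddL)

sc.setJobDescription("probe the Bloom filter against the 80 GB input")
// ... the counting loop from above goes here ...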