Let's create a DataFrame and a Dataset with the same data:
val personDF = Seq(("Max", 33), ("Adam", 32), ("John", 62)).toDF("name", "age")
case class Person(name: String, age: Int)
val personDS = Seq(Person("Max", 33), Person("Adam", 32), Person("John", 62)).toDS()
personDF.select("name").explain // DataFrame
// == Physical Plan ==
// LocalTableScan [name#14]
personDS.map(_.name).explain // Dataset
// == Physical Plan ==
// *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#29]
// +- *MapElements <function1>, obj#28: java.lang.String
// +- *DeserializeToObject newInstance(class $line129.$read$$iwC$$iwC$Person), obj#27: $line129.$read$$iwC$$iwC$Person
// +- LocalTableScan [name#2, age#3]
The Dataset physical plan incurs additional DeserializeToObject, MapElements, and SerializeFromObject steps. What are the performance implications of this?
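For context, a typed column projection seems to avoid the object round-trip entirely; a minimal sketch, assuming the same personDS as above (the exact plan text varies by Spark version):

// Typed projection: Catalyst can keep the data in its internal binary format,
// so the plan should contain no DeserializeToObject/MapElements/SerializeFromObject nodes.
personDS.select($"name".as[String]).explain()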
EDIT:
Are there any experiments or benchmarks available that compare the following? (A sketch of the kind of comparison I mean follows the list.)
- Execution speed
- Memory utilization
- Garbage Collection overhead
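This is a rough timing sketch of what I have in mind, not a rigorous harness; the row count, the use of count() to force execution, and the System.nanoTime timing are my own assumptions, and a proper comparison would use something like JMH and inspect GC logs (e.g. with -verbose:gc) rather than wall-clock time alone:

// Rough sketch: compares a Catalyst-only projection with a typed lambda over N rows.
val n = 10000000L
val df = spark.range(n).selectExpr("cast(id as string) as name", "cast(id % 100 as int) as age")
val ds = df.as[Person]

// Tiny wall-clock timer; run each branch several times to warm up the JVM.
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
  result
}

time("DataFrame select")(df.select("name").count()) // column pruning, no per-row object allocation
time("Dataset map")(ds.map(_.name).count())         // deserialize, apply lambda, serialize per row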