Spark Scala DataFrame vs Dataset - Performance analysis


Let's create a DataFrame and a Dataset with the same data:

// In spark-shell, spark.implicits._ is already in scope; otherwise import it first:
// import spark.implicits._
val personDF = Seq(("Max", 33), ("Adam", 32), ("John", 62)).toDF("name", "age")

case class Person(name: String, age: Int)
val personDS = Seq(Person("Max", 33), Person("Adam", 32), Person("John", 62)).toDS()

personDF.select("name").explain  // DataFrame
// == Physical Plan ==
// LocalTableScan [name#14]

personDS.map(_.name).explain     // Dataset
// == Physical Plan ==
// *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#29]
// +- *MapElements <function1>, obj#28: java.lang.String
//    +- *DeserializeToObject newInstance(class $line129.$read$$iwC$$iwC$Person), obj#27: $line129.$read$$iwC$$iwC$Person
//       +- LocalTableScan [name#2, age#3]

The Dataset physical plan incurs additional DeserializeToObject, MapElements, and SerializeFromObject steps. What are the performance implications of this?
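For comparison, the object round-trip can be avoided even on a Dataset by staying in the untyped Column API. The sketch below (assuming a local SparkSession; in spark-shell one already exists as `spark`) contrasts the typed `map` path, which triggers the DeserializeToObject / SerializeFromObject steps, with a `select(...).as[String]` path that keeps the plan relational:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Assumed setup for a standalone run; in spark-shell `spark` is provided.
val spark = SparkSession.builder().master("local[*]").appName("ds-vs-df").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Int)
val personDS = Seq(Person("Max", 33), Person("Adam", 32), Person("John", 62)).toDS()

// Typed map: deserializes each row to a Person, applies the lambda,
// then re-serializes the result back into Spark's internal format.
val viaMap: Dataset[String] = personDS.map(_.name)

// Untyped select cast back to a typed Dataset: Catalyst can prune columns
// and the data never leaves the internal binary row format, so the plan
// stays a simple LocalTableScan, like the DataFrame's.
val viaSelect: Dataset[String] = personDS.select($"name").as[String]

viaSelect.explain()
```

Both `viaMap` and `viaSelect` return the same three names; only their physical plans differ.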

EDIT:

Are there any experiments/benchmarks available that compare:

  • Execution speed
  • Memory utilization
  • Garbage Collection overhead
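Absent a published benchmark, one can sketch a rough micro-benchmark along these lines (illustrative only; a serious comparison would use JMH-style warm-up and repetition, and the absolute numbers depend entirely on the environment — the dataset size and `time` helper here are assumptions, not part of the original question):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("bench").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Int)

// A larger synthetic Dataset so the (de)serialization cost is measurable;
// cache and materialize it first so timing excludes data generation.
val people = spark.range(0, 1000000).map(i => Person(s"name-$i", (i % 100).toInt)).cache()
people.count()

// Crude wall-clock timer; good enough to show an order-of-magnitude gap.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
  result
}

// Typed path: per-row object deserialization and re-serialization.
val typedCount = time("typed map")      { people.map(_.name).count() }
// Untyped path: column pruning on the internal binary format, no objects.
val untypedCount = time("untyped select") { people.select($"name").as[String].count() }
```

Memory and GC behaviour can be observed in the same run via the Spark UI or by enabling GC logging (`-verbose:gc`); the typed path allocates one `Person` and one `String` per row, which is where the extra GC pressure comes from.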