Spark: pruning nested columns/fields


I have a question about the possibility of pruning nested fields.

I'm developing a Spark data source for a High Energy Physics data format (ROOT).

Below is the schema of a file read with the DataSource that I'm developing:

 root
 |-- EventAuxiliary: struct (nullable = true)
 |    |-- processHistoryID_: struct (nullable = true)
 |    |    |-- hash_: string (nullable = true)
 |    |-- id_: struct (nullable = true)
 |    |    |-- run_: integer (nullable = true)
 |    |    |-- luminosityBlock_: integer (nullable = true)
 |    |    |-- event_: long (nullable = true)
 |    |-- processGUID_: string (nullable = true)
 |    |-- time_: struct (nullable = true)
 |    |    |-- timeLow_: integer (nullable = true)
 |    |    |-- timeHigh_: integer (nullable = true)
 |    |-- luminosityBlock_: integer (nullable = true)
 |    |-- isRealData_: boolean (nullable = true)
 |    |-- experimentType_: integer (nullable = true)
 |    |-- bunchCrossing_: integer (nullable = true)
 |    |-- orbitNumber_: integer (nullable = true)
 |    |-- storeNumber_: integer (nullable = true)
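
For reference, this is how I produce the schema above (a minimal sketch; the format name follows the experimental package linked below, and the file path is just a placeholder for my setup):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-root-schema")
  .master("local[*]")
  .getOrCreate()

// Read a ROOT file through the DataSource under development
val df = spark.read
  .format("org.dianahep.sparkroot.experimental")  // assumed format name
  .load("path/to/some-events.root")               // placeholder path

df.printSchema()  // prints the EventAuxiliary tree shown above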

The DataSource is here: https://github.com/diana-hep/spark-root/blob/master/src/main/scala/org/dianahep/sparkroot/experimental/package.scala#L62

When building a reader using the buildReaderWithPartitionValues method of the FileFormat:

override def buildReaderWithPartitionValues(
    sparkSession: SparkSession,
    dataSchema: StructType,
    partitionSchema: StructType,
    requiredSchema: StructType,
    filters: Seq[Filter],
    options: Map[String, String],
    hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
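
As a quick check (a debugging sketch, not part of the real reader), the first thing I do inside this method is dump the schemas Spark hands me:

// Compare the file's full schema with what Spark actually asks for
println(s"dataSchema:\n${dataSchema.treeString}")
println(s"requiredSchema:\n${requiredSchema.treeString}")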

I see that requiredSchema always contains all of the fields/members of the top-level column being read. That is, when I select a single nested field with df.select("EventAuxiliary.id_.run_"), requiredSchema is still the full struct for that top-level column ("EventAuxiliary"). I would expect the schema to be something like this:

root
|-- EventAuxiliary: struct...
|  |-- id_: struct ...
|  |    |-- run_: integer

since this is the only part of the schema required by the select statement.
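
Here is how I reproduce the observation from the DataFrame side (a sketch; the comments describe what I see in my runs):

// Select a single nested field...
val slim = df.select("EventAuxiliary.id_.run_")

// ...yet the scan's ReadSchema in the physical plan still lists the
// entire EventAuxiliary struct, matching the requiredSchema I receive.
slim.explain(true)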

Basically, I want to know how I can prune nested fields at the data source level. I had assumed that requiredSchema would contain only the fields coming from the df.select.
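
In the meantime I'm experimenting with pruning the schema myself. Below is a hypothetical helper (the name pruneSchema and the approach are mine, not a Spark API) that intersects a StructType with the dotted paths from the select:

import org.apache.spark.sql.types._

// Keep only the fields of `schema` that lie on one of the dotted `paths`,
// e.g. Seq("EventAuxiliary.id_.run_"). Sketch only: it ignores the corner
// case where both a struct and one of its children are requested.
def pruneSchema(schema: StructType, paths: Seq[String]): StructType = {
  def prune(struct: StructType, paths: Seq[List[String]]): StructType = {
    val byHead = paths.collect { case head :: tail => (head, tail) }.groupBy(_._1)
    StructType(struct.fields.flatMap { field =>
      byHead.get(field.name).map { entries =>
        val tails = entries.map(_._2).filter(_.nonEmpty)
        field.dataType match {
          case s: StructType if tails.nonEmpty =>
            field.copy(dataType = prune(s, tails))
          case _ => field  // a leaf, or the whole sub-struct was requested
        }
      }
    })
  }
  prune(schema, paths.map(_.split('.').toList))
}

// pruneSchema(df.schema, Seq("EventAuxiliary.id_.run_")).treeString
// should print the pruned tree I expected above.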

I'm trying to see what Avro/Parquet do and found this: https://github.com/apache/spark/pull/14957/files

Any suggestions/comments would be appreciated!

Thanks!

VK
