How to stop a Dataset from renaming columns to "value" while mapping?


While mapping a Dataset I keep running into the problem that columns are renamed from _1, _2, etc. to value.

What is causing the rename?
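For example, a minimal sketch reproducing the rename (the SparkSession setup and the Person case class are hypothetical, added for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical input type, just to have a named column to start from.
case class Person(name: String, age: Int)

object ValueRenameDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("value-rename-demo")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Alice", 30), Person("Bob", 25)).toDS()

    // Mapping to a primitive (String) drops the original column name:
    val names = people.map(_.name)
    names.printSchema()
    // root
    //  |-- value: string (nullable = true)

    spark.stop()
  }
}
```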


That's because map on a Dataset causes the data to be serialized and deserialized by Spark.

To serialize it, Spark must know the Encoder. That's why there is an ExpressionEncoder object with an apply method. Its Javadoc says:

 A factory for constructing encoders that convert objects and primitives to and from the
  internal row format using catalyst expressions and code generation.  By default, the
  expressions used to retrieve values from an input row when producing an object will be created as
  follows:
   - Classes will have their sub fields extracted by name using [[UnresolvedAttribute]] expressions
     and [[UnresolvedExtractValue]] expressions.
   - Tuples will have their subfields extracted by position using [[BoundReference]] expressions.
   - Primitives will have their values extracted from the first ordinal with a schema that defaults
     to the name `value`.

Note the last point: your query maps to a primitive, so Catalyst uses the default name "value".

If you add .select('value.as("MyPropertyName")).as[CaseClass], the field names will be correct.
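A minimal sketch of that fix (the case class MyRow and the column name MyPropertyName are hypothetical, chosen for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical target type whose field name we want in the schema.
case class MyRow(MyPropertyName: String)

object RenameFixDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rename-fix-demo")
      .getOrCreate()
    import spark.implicits._

    // A Dataset of primitives, whose single column is named "value":
    val mapped = Seq("a", "b").toDS()

    val fixed = mapped
      .select('value.as("MyPropertyName")) // rename the generated column
      .as[MyRow]                           // back to a typed Dataset

    fixed.printSchema()
    // root
    //  |-- MyPropertyName: string (nullable = true)

    spark.stop()
  }
}
```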

Types that will have column name "value":

  • Option[_]
  • Array
  • Collection types like Seq and Map
  • Types like String, Timestamp, Date, BigDecimal
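Conversely, mapping to a case class avoids the rename entirely, because (per the first point of the Javadoc above) Catalyst extracts class fields by name. A sketch with hypothetical types:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)
case class NameOnly(name: String) // hypothetical target type

object KeepNamesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("keep-names-demo")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Alice", 30)).toDS()

    // Mapping to a case class keeps the field name instead of "value":
    val names = people.map(p => NameOnly(p.name))
    names.printSchema()
    // root
    //  |-- name: string (nullable = true)

    spark.stop()
  }
}
```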