Value Type is binary after Spark Dataset mapGroups operation even return a String in the function

Question

Value Type is binary after Spark Dataset mapGroups operation even return a String in the function

1.1k Views Asked by DeepNightTwo At 12 March 2020 at 16:54

Environment:

Spark version: 2.3.0
Run Mode: Local
Java version: Java 8

The spark application trys to do the following

1) Convert input data into a Dataset[GenericRecord]

2) Group by the key propery of the GenericRecord

3) Using mapGroups after group to iterate the value list and get some result in String format

4) Output the result as String in text file.

The error happens when writing to text file. Spark deduced that the Dataset generated in step 3 has a binary column, not a String column. But actually it returns a String in the mapGroups function.

Is there a way to do the column data type convertion or let Spark knows that it is actually a string column not binary?


    val dslSourcePath = args(0)
    val filePath = args(1)
    val targetPath = args(2)
    val df = spark.read.textFile(filePath)

    implicit def kryoEncoder[A](implicit ct: ClassTag[A]): Encoder[A] = Encoders.kryo[A](ct)

    val mapResult = df.flatMap(abc => {
      JavaConversions.asScalaBuffer(some how return a list of Avro GenericRecord using a java library).seq;
    })

    val groupResult = mapResult.groupByKey(result => String.valueOf(result.get("key")))
      .mapGroups((key, valueList) => {
        val result = StringBuilder.newBuilder.append(key).append(",").append(valueList.count(_=>true))
        result.toString()
      })

    groupResult.printSchema()

    groupResult.write.text(targetPath + "-result-" + System.currentTimeMillis())

And the output said it is a bin

root
 |-- value: binary (nullable = true)

Spark gives out an error that it can't write binary as text:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Text data source supports only a string column, but you have binary.;
    at org.apache.spark.sql.execution.datasources.text.TextFileFormat.verifySchema(TextFileFormat.scala:55)
    at org.apache.spark.sql.execution.datasources.text.TextFileFormat.prepareWrite(TextFileFormat.scala:78)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:140)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
    at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:595)

Original Q&A

There are 1 best solutions below

**DeepNightTwo** · Answer 1 · 2020-03-26T01:39:28.490000

As @user10938362 said, the reason is the following code will encode all data to bytes

implicit def kryoEncoder[A](implicit ct: ClassTag[A]): Encoder[A] = Encoders.kryo[A](ct)

Replacing it with the following code will just enable this encoding for GenericRecord

implicit def kryoEncoder: Encoder[GenericRecord] = Encoders.kryo

Value Type is binary after Spark Dataset mapGroups operation even return a String in the function

There are 1 best solutions below

Related Questions in APACHE-SPARK

Related Questions in APACHE-SPARK-DATASET

Related Questions in SPARK-AVRO

Related Questions in APACHE-SPARK-ENCODERS

Trending Questions

Popular # Hahtags

Popular Questions