Environment:
Spark version: 2.3.0
Run Mode: Local
Java version: Java 8
The spark application trys to do the following
1) Convert input data into a Dataset[GenericRecord]
2) Group by the key propery of the GenericRecord
3) Using mapGroups after group to iterate the value list and get some result in String format
4) Output the result as String in text file.
The error happens when writing to text file. Spark deduced that the Dataset generated in step 3 has a binary column, not a String column. But actually it returns a String in the mapGroups function.
Is there a way to do the column data type convertion or let Spark knows that it is actually a string column not binary?
val dslSourcePath = args(0)
val filePath = args(1)
val targetPath = args(2)
val df = spark.read.textFile(filePath)
implicit def kryoEncoder[A](implicit ct: ClassTag[A]): Encoder[A] = Encoders.kryo[A](ct)
val mapResult = df.flatMap(abc => {
JavaConversions.asScalaBuffer(some how return a list of Avro GenericRecord using a java library).seq;
})
val groupResult = mapResult.groupByKey(result => String.valueOf(result.get("key")))
.mapGroups((key, valueList) => {
val result = StringBuilder.newBuilder.append(key).append(",").append(valueList.count(_=>true))
result.toString()
})
groupResult.printSchema()
groupResult.write.text(targetPath + "-result-" + System.currentTimeMillis())
And the output said it is a bin
root
|-- value: binary (nullable = true)
Spark gives out an error that it can't write binary as text:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Text data source supports only a string column, but you have binary.;
at org.apache.spark.sql.execution.datasources.text.TextFileFormat.verifySchema(TextFileFormat.scala:55)
at org.apache.spark.sql.execution.datasources.text.TextFileFormat.prepareWrite(TextFileFormat.scala:78)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:140)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:595)
As @user10938362 said, the reason is the following code will encode all data to bytes
Replacing it with the following code will just enable this encoding for GenericRecord