JavaPairRDD to Dataset<Row> in SPARK

1k Views Asked by Jack At 28 July 2025 at 06:34

I have data in a JavaPairRDD of the following type:

JavaPairRDD<String, Tuple2<String, String>>

I tried the following code:

 Encoder<Tuple2<String, Tuple2<String,String>>> encoder2 =
 Encoders.tuple(Encoders.STRING(), Encoders.tuple(Encoders.STRING(),Encoders.STRING()));
 Dataset<Row> userViolationsDetails = spark.createDataset(JavaPairRDD.toRDD(MY_RDD),encoder2).toDF("value1","value2");

But how do I generate a Dataset with 3 columns? The code above gives me the data in only 2 columns. Any pointers or suggestions?


There is 1 best solution below

T. Gawęda On 13 June 2018 at 10:20

Try running printSchema; you will see that value2 is a complex type (a struct).
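To make that check concrete, here is a minimal sketch that rebuilds the two-column DataFrame from the question using one made-up sample record and prints its schema. The class name NestedTupleSchema, the sample values, and the local[1] master are assumptions for illustration only; pairs.rdd() is used in place of JavaPairRDD.toRDD(MY_RDD).

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of the original question.
public class NestedTupleSchema {

    // Builds the same two-column DataFrame as in the question from one sample record.
    public static Dataset<Row> build(SparkSession spark) {
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Made-up sample data: (key, (first, second)).
        List<Tuple2<String, Tuple2<String, String>>> sample = Arrays.asList(
                new Tuple2<String, Tuple2<String, String>>(
                        "user1", new Tuple2<String, String>("a", "b")));
        JavaPairRDD<String, Tuple2<String, String>> pairs = jsc.parallelizePairs(sample);

        Encoder<Tuple2<String, Tuple2<String, String>>> encoder =
                Encoders.tuple(Encoders.STRING(),
                        Encoders.tuple(Encoders.STRING(), Encoders.STRING()));

        // Two top-level columns; the second one holds the nested tuple.
        return spark.createDataset(pairs.rdd(), encoder).toDF("value1", "value2");
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[1]")            // assumption: local run for illustration
                .appName("nested-tuple-schema")
                .getOrCreate();
        // Prints the schema tree; value2 appears as a struct with nested fields.
        build(spark).printSchema();
        spark.stop();
    }
}
```

The names of the fields nested under value2 depend on the Spark version, so it is worth checking the printed schema before writing expressions against them.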

With that information, you can write:

Dataset<Row> uvd = userViolationsDetails.selectExpr("value1", "value2._1 as value2", "value2._2 as value3");

value2._1 refers to the first element of the tuple stored in the current "value2" field. We overwrite value2 so that it holds a single value, and the second element of the tuple becomes value3.

Note that this will only work once https://issues.apache.org/jira/browse/SPARK-24548 is merged into the master branch. Currently there is a bug in Spark: the tuple is converted to a struct with two fields that are both named value.
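If the nested field names cannot be relied on, one alternative (a different technique from the selectExpr approach above) is to flatten each pair into a Row and supply an explicit three-column schema. The following is only a sketch under assumptions: the class name PairRddToRows, the method name toThreeColumns, the sample record, and the column names value1, value2, value3 are chosen for illustration.

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import scala.Tuple2;

import java.util.Arrays;

// Hypothetical helper class, not part of the original answer.
public class PairRddToRows {

    // Flattens (k, (a, b)) into a three-column Row and attaches an explicit schema.
    public static Dataset<Row> toThreeColumns(
            SparkSession spark, JavaPairRDD<String, Tuple2<String, String>> pairs) {
        JavaRDD<Row> rows =
                pairs.map(t -> RowFactory.create(t._1(), t._2()._1(), t._2()._2()));

        StructType schema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("value1", DataTypes.StringType, true),
                DataTypes.createStructField("value2", DataTypes.StringType, true),
                DataTypes.createStructField("value3", DataTypes.StringType, true)
        });

        return spark.createDataFrame(rows, schema);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[1]")            // assumption: local run for illustration
                .appName("pair-rdd-to-rows")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Made-up sample record.
        JavaPairRDD<String, Tuple2<String, String>> pairs = jsc.parallelizePairs(
                Arrays.asList(new Tuple2<String, Tuple2<String, String>>(
                        "user1", new Tuple2<String, String>("a", "b"))));

        toThreeColumns(spark, pairs).show();
        spark.stop();
    }
}
```

This trades the typed encoder for a hand-written schema, so the column names and types are fixed in code instead of being derived from the tuple.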
