How to access array type value and set in two different columns spark?

Question

How to access array type value and set in two different columns spark?

468 Views Asked by happy At 21 September 2020 at 13:30

I am learning Spark, I have below xml from which I want to read 2 values and create two different columns

<appRoot>
    <applist>
      <app type="test">
        <code>8.52544</code>
      </app>
      <app type="dev">
        <code>8.52537</code>
      </app>
    </applist>
</appRoot>

I want

if type="test" then it should set the value (i.e. 8.52544) in new column "app_test" and

if type="dev" then it should set the value (i.e. 8.52537) in new column "app_dev"

I tried below

df.select(
functions.when($"applist.app._type" === "test", $"applist.app.code").as("app_test"),
functions.when($"applist.app._type" === "dev", $"applist.app.code").as("app_dev"))

but it returns

app_test with value [8.52544, 8.52537]
app_dev with value [8.52544, 8.52537]

How can i differentiate while setting the value to the column?

Update:

val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rootTag", "appRoot")
      .option("rowTag", "applist")
      .load("test.xml")

    df.printSchema()



root
 |-- app: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _type: string (nullable = true)
 |    |    |-- code: double (nullable = true)

Original Q&A

There are 2 best solutions below

**Shrey Jakhmola** · Answer 1 · 2020-09-21T14:06:49.740000

You can do this by:

Read the data into a dataframe.

val df = spark.read.format("com.databricks.spark.xml").option("rootTag", "appRoot").option("rowTag","applist").load("/home/shrey/Downloads/mydoc.xml")

df.printSchema()
df.show()

Finally, you can select the appropriate columns in when condition by exploding the array:

df.select(explode($"app").as("app")).select(functions.when($"app._type" === "test", $"app.code").as("app_test"),functions.when($"app._type" === "dev", $"app.code").as("app_dev")).show(false)

If you want the output in one line can you can aggregate it using max() functions.

df.select(explode($"app").as("app")).select(functions.when($"app._type" === "test", $"app.code").as("app_test"),functions.when($"app._type" === "dev", $"app.code").as("app_dev")).agg(max($"app_test").as("app_test"),max($"app_dev").as("app_dev")).show(false)

**werner** · Answer 2 · 2020-09-21T18:57:01.133000

Using map_from_entries the array can be transformed into a map using _type as keys and code as values. Then the map can be used to create the two new columns.

val df = spark.read.format("com.databricks.spark.xml")
   .option("rootTag", "appRoot").option("rowTag","applist")
   .load(...)

df.withColumn("map", map_from_entries('app))
   .withColumn("app_test", 'map("test"))
   .withColumn("app_dev", 'map("dev"))
   .drop('map)
   .show(false)

prints (with some additional test data)

+--------------------------------------------+--------+-------+
|app                                         |app_test|app_dev|
+--------------------------------------------+--------+-------+
|[[test, 8.52544], [dev, 8.52537], [x, 1.23]]|8.52544 |8.52537|
|[[test, 1.2], [dev, 3.4], [X, 9.9]]         |1.2     |3.4    |
|[[dev, -5.6], [Z, -12.9]]                   |null    |-5.6   |
+--------------------------------------------+--------+-------+

How to access array type value and set in two different columns spark?

There are 2 best solutions below

Related Questions in APACHE-SPARK

Related Questions in APACHE-SPARK-SQL

Related Questions in APACHE-SPARK-XML

Trending Questions

Popular # Hahtags

Popular Questions