Nesting dataframe using pyspark


I am new to pyspark and I am trying to keep the data for multiple countries in a single row. I don't know in advance how many country entries I will get, so I want each row to hold several (country name, country capital) pairs, following the schema below. Is it possible to do this with pyspark?

```
StructType([
    StructField('id', LongType()),
    StructField('country', StructType([
        StructField('name', StringType()),
        StructField('capital', StringType())
    ])),
    StructField('review', StringType())
])
```
```
data = [
    (1, [('Japan', 'Tokyo'), ('France', 'Paris'), ('UK', 'London')], 'nice'),
    (2, [('Japan', 'Tokyo'), ('France', 'Paris'), ('UK', 'London'),
         ('US', 'Washington')], 'not good')
]
```
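
For reference, here is a minimal sketch of how I imagine the nested setup could look, assuming the country field is wrapped in an ArrayType so each id can hold a variable number of (name, capital) structs (variable names are just placeholders, and I'm not sure this is the right approach):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, LongType,
                               StringType, ArrayType)

spark = SparkSession.builder.getOrCreate()

# country is an array of structs, so each id can carry any number of entries
schema = StructType([
    StructField('id', LongType()),
    StructField('country', ArrayType(StructType([
        StructField('name', StringType()),
        StructField('capital', StringType())
    ]))),
    StructField('review', StringType())
])

data = [
    (1, [('Japan', 'Tokyo'), ('France', 'Paris'), ('UK', 'London')], 'nice'),
    (2, [('Japan', 'Tokyo'), ('France', 'Paris'), ('UK', 'London'),
         ('US', 'Washington')], 'not good')
]

df = spark.createDataFrame(data, schema)
df.printSchema()
```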
I am dealing with hierarchical data: I want all of the countries and capitals present in the list for id = 1 to appear under the single row with id = 1. Converting the tuples into separate columns of countries and capitals is not an option, because the number of tuples differs for every record.

Expected dataframe -
+----+---------+------------+----------+
| id | name    | capital    | review   |
+----+---------+------------+----------+
| 1  | Japan   | Tokyo      | Nice     |
|    | France  | Paris      |          |
|    | UK      | London     |          |
+----+---------+------------+----------+
| 2  | Japan   | Tokyo      | Not Good |
|    | France  | Paris      |          |
|    | UK      | London     |          |
|    | US      | Washington |          |
+----+---------+------------+----------+
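
If the nested layout sketched above works, I assume something like explode plus selecting the struct fields would give the per-country rows shown in the expected output, e.g.:

```python
from pyspark.sql import functions as F

# one output row per (name, capital) entry, keeping id and review
flat = df.select(
    'id',
    F.explode('country').alias('c'),
    'review'
).select(
    'id',
    F.col('c.name').alias('name'),
    F.col('c.capital').alias('capital'),
    'review'
)

flat.show()
```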
