I am new to PySpark and I am trying to store data for multiple countries in a single row. I don't know in advance how many country entries I will get, so I want a row that can hold any number of country name and country capital pairs, according to the following schema. Is it possible to do this using PySpark?
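```python
from pyspark.sql.types import LongType, StringType, StructField, StructType

schema = StructType([
    StructField('id', LongType()),
    StructField('country', StructType([
        StructField('name', StringType()),
        StructField('capital', StringType())
    ])),
    StructField('review', StringType())
])
```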
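```python
# Each record: (id, list of (country name, capital) pairs, review)
data = [
    (1, [('Japan', 'Tokyo'), ('France', 'Paris'), ('UK', 'London')], 'nice'),
    (2, [('Japan', 'Tokyo'), ('France', 'Paris'), ('UK', 'London'),
         ('US', 'Washington')], 'not good'),
]
```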
I am dealing with hierarchical data. I want all the countries and capitals in the list for id = 1 to appear in the single row for id = 1. Splitting the tuples into separate country and capital columns is not an option, because the number of tuples differs from record to record.
Expected DataFrame:
+----+---------+------------+----------+
| id | name    | capital    | review   |
+----+---------+------------+----------+
| 1  | Japan   | Tokyo      | Nice     |
|    | France  | Paris      |          |
|    | UK      | London     |          |
+----+---------+------------+----------+
| 2  | Japan   | Tokyo      | Not Good |
|    | France  | Paris      |          |
|    | UK      | London     |          |
|    | US      | Washington |          |
+----+---------+------------+----------+