How to change a column type inside an array of structs in PySpark


How can I change a column type inside an array of structs in PySpark? For example, I would like to change userid from int to long:

root
 |-- id: string (nullable = true)
 |-- numbers: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- m1: long (nullable = true)
 |    |    |-- m2: long (nullable = true)
 |    |    |-- m3: struct (nullable = true)
 |    |    |    |-- userid: integer (nullable = true)
2 Answers

Accepted Answer

It would have been useful if you had provided a reproducible DataFrame as well.

Following your comments below, see the following code.

from pyspark.sql.types import StructType, StructField, StringType, LongType, IntegerType, ArrayType

sch = StructType([
    StructField('id', StringType(), False),
    StructField('numbers', ArrayType(StructType([
        StructField('m1', LongType(), True),
        StructField('m2', LongType(), True),
        StructField('m3', StructType([StructField('userid', IntegerType(), True)]), True)
    ])), True)
])

df = spark.createDataFrame([
    ('21', [(1234567, 9876543, (1,))]),
    ('34', [(63467892345, 19523789, (2,))])
], schema=sch)

df.printSchema()

root
 |-- id: string (nullable = false)
 |-- numbers: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- m1: long (nullable = true)
 |    |    |-- m2: long (nullable = true)
 |    |    |-- m3: struct (nullable = true)
 |    |    |    |-- userid: integer (nullable = true)

Solution

df1 = df.selectExpr(
    "id",
    "CAST(numbers AS array<struct<m1:long,m2:long,m3:struct<userid:double>>>) numbers"
)

df1.printSchema()

root
 |-- id: string (nullable = false)
 |-- numbers: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- m1: long (nullable = true)
 |    |    |-- m2: long (nullable = true)
 |    |    |-- m3: struct (nullable = true)
 |    |    |    |-- userid: double (nullable = true)
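
The question asked for int to long specifically; the CAST above targets double, but the same syntax works for any target type. A minimal variant for long (df_long is just an illustrative name):

df_long = df.selectExpr(
    "id",
    "CAST(numbers AS array<struct<m1:long,m2:long,m3:struct<userid:long>>>) numbers"
)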
Second Answer

I would like to point out another solution, making use of higher-order functions, in this case the transform function (available in Spark SQL since version 2.4, and as the PySpark function pyspark.sql.functions.transform since 3.1.0).

I find it safer and more robust, since columns are referenced by name, so their names or data types can be changed safely. With the previous method we are subject to a risk of inconsistency when the input schema changes over time, because CAST references columns by position (i.e. it will call the first column it gets m1, regardless of its content or name).

The new proposed solution is implemented with df2 below, reusing sch, df, and df1 exactly as defined in the accepted answer:

df2 = df.selectExpr(
    "id", 
    """
    transform(numbers, s -> struct(                    
        cast(s.m1 as long) as m1,
        cast(s.m2 as long) as m2,
        struct(cast(s.m3.userid as double) as userid) as m3
    )) as numbers
    """)

df.printSchema()
df1.printSchema()
df2.printSchema()

This yields the following schema in each case:

df:
root
 |-- id: string (nullable = false)
 |-- numbers: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- m1: long (nullable = true)
 |    |    |-- m2: long (nullable = true)
 |    |    |-- m3: struct (nullable = true)
 |    |    |    |-- userid: integer (nullable = true)

df1:
root
 |-- id: string (nullable = false)
 |-- numbers: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- m1: long (nullable = true)
 |    |    |-- m2: long (nullable = true)
 |    |    |-- m3: struct (nullable = true)
 |    |    |    |-- userid: double (nullable = true)

df2:
root
 |-- id: string (nullable = false)
 |-- numbers: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- m1: long (nullable = true)
 |    |    |-- m2: long (nullable = true)
 |    |    |-- m3: struct (nullable = false)
 |    |    |    |-- userid: double (nullable = true)
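
For completeness, the same rewrite can be expressed through the DataFrame API using pyspark.sql.functions.transform (Spark 3.1+). A minimal sketch (df2_api is just an illustrative name):

from pyspark.sql import functions as F

# Same rewrite as the selectExpr version, referencing struct fields
# by name through Column expressions instead of a SQL string.
df2_api = df.select(
    "id",
    F.transform(
        "numbers",
        lambda s: F.struct(
            s["m1"].cast("long").alias("m1"),
            s["m2"].cast("long").alias("m2"),
            F.struct(s["m3"]["userid"].cast("double").alias("userid")).alias("m3"),
        ),
    ).alias("numbers"),
)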

Note, however, that when creating structs this way there is no way yet in PySpark to explicitly mark the resulting struct fields as nullable (nullable = true); nullability is inferred from the expressions, which should be fine in most practical cases.
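
One possible workaround, sketched below under the assumption that m3 should stay nullable: branch on NULL explicitly inside the lambda, since a conditional with a NULL branch yields a nullable result (df3 is just an illustrative name).

df3 = df.selectExpr(
    "id",
    """
    transform(numbers, s -> struct(
        cast(s.m1 as long) as m1,
        cast(s.m2 as long) as m2,
        if(s.m3 is null,
           null,
           struct(cast(s.m3.userid as double) as userid)) as m3
    )) as numbers
    """)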