How to change a column type inside an array of structs in PySpark


How can I change a column type inside an array of structs in PySpark? For example, I would like to change userid from int to long:

root
 |-- id: string (nullable = true)
 |-- numbers: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- m1: long (nullable = true)
 |    |    |-- m2: long (nullable = true)
 |    |    |-- m3: struct (nullable = true)
 |    |    |    |-- userid: integer (nullable = true)
2 Answers

Accepted Answer

It would have been useful if you had provided a reproducible DataFrame as well.

Following your comments below, see the following code.

from pyspark.sql.types import StructType, StructField, StringType, LongType, IntegerType, ArrayType

sch = StructType([
    StructField('id', StringType(), False),
    StructField('numbers', ArrayType(StructType([
        StructField('m1', LongType(), True),
        StructField('m2', LongType(), True),
        StructField('m3', StructType([StructField('userid', IntegerType(), True)]), True)
    ])), True)
])

df = spark.createDataFrame([
    ('21', [(1234567, 9876543, (1,))]),
    ('34', [(63467892345, 19523789, (2,))])
], schema=sch)

df.printSchema()

root
 |-- id: string (nullable = false)
 |-- numbers: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- m1: long (nullable = true)
 |    |    |-- m2: long (nullable = true)
 |    |    |-- m3: struct (nullable = true)
 |    |    |    |-- userid: integer (nullable = true)

Solution

df1 = df.selectExpr(
    "id",
    "CAST(numbers AS array<struct<m1:long,m2:long,m3:struct<userid:double>>>) numbers"
)

df1.printSchema()

root
 |-- id: string (nullable = false)
 |-- numbers: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- m1: long (nullable = true)
 |    |    |-- m2: long (nullable = true)
 |    |    |-- m3: struct (nullable = true)
 |    |    |    |-- userid: double (nullable = true)
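
The question asked for int to long specifically; the CAST above targets double, but the same syntax works for any target type. A minimal variant for long (df_long is just an illustrative name):

df_long = df.selectExpr(
    "id",
    "CAST(numbers AS array<struct<m1:long,m2:long,m3:struct<userid:long>>>) numbers"
)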
Second Answer

I would like to point out another solution, making use of higher-order functions, in this case the transform function (available in Spark SQL since version 2.4, and as the PySpark function pyspark.sql.functions.transform since 3.1.0).

I find it safer and more robust, since columns are referenced by name, so their names or data types can be changed safely. With the previous method we are subject to a risk of inconsistency when the input schema changes over time, because CAST references columns by position (i.e. it will call the first column it gets m1, regardless of its content or name).

The new proposed solution is implemented with df2 below, reusing sch, df, and df1 exactly as defined in the accepted answer:

df2 = df.selectExpr(
    "id", 
    """
    transform(numbers, s -> struct(                    
        cast(s.m1 as long) as m1,
        cast(s.m2 as long) as m2,
        struct(cast(s.m3.userid as double) as userid) as m3
    )) as numbers
    """)

df.printSchema()
df1.printSchema()
df2.printSchema()

This yields the following schema in each case:

df:
root
 |-- id: string (nullable = false)
 |-- numbers: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- m1: long (nullable = true)
 |    |    |-- m2: long (nullable = true)
 |    |    |-- m3: struct (nullable = true)
 |    |    |    |-- userid: integer (nullable = true)

df1:
root
 |-- id: string (nullable = false)
 |-- numbers: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- m1: long (nullable = true)
 |    |    |-- m2: long (nullable = true)
 |    |    |-- m3: struct (nullable = true)
 |    |    |    |-- userid: double (nullable = true)

df2:
root
 |-- id: string (nullable = false)
 |-- numbers: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- m1: long (nullable = true)
 |    |    |-- m2: long (nullable = true)
 |    |    |-- m3: struct (nullable = false)
 |    |    |    |-- userid: double (nullable = true)
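
For completeness, the same rewrite can be expressed through the DataFrame API using pyspark.sql.functions.transform (Spark 3.1+). A minimal sketch (df2_api is just an illustrative name):

from pyspark.sql import functions as F

# Same rewrite as the selectExpr version, referencing struct fields
# by name through Column expressions instead of a SQL string.
df2_api = df.select(
    "id",
    F.transform(
        "numbers",
        lambda s: F.struct(
            s["m1"].cast("long").alias("m1"),
            s["m2"].cast("long").alias("m2"),
            F.struct(s["m3"]["userid"].cast("double").alias("userid")).alias("m3"),
        ),
    ).alias("numbers"),
)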

Note, however, that when creating structs this way there is no way yet in PySpark to explicitly mark the resulting struct fields as nullable (nullable = true); nullability is inferred from the expressions, which should be fine in most practical cases.
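
One possible workaround, sketched below under the assumption that m3 should stay nullable: branch on NULL explicitly inside the lambda, since a conditional with a NULL branch yields a nullable result (df3 is just an illustrative name).

df3 = df.selectExpr(
    "id",
    """
    transform(numbers, s -> struct(
        cast(s.m1 as long) as m1,
        cast(s.m2 as long) as m2,
        if(s.m3 is null,
           null,
           struct(cast(s.m3.userid as double) as userid)) as m3
    )) as numbers
    """)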