How to sort a PySpark DataFrame's rows by the order of a list?


I have a PySpark DataFrame with multiple columns, and a list containing values from one of those columns. I want to sort the rows by the order of the given list.

col_A col_B col_C
a1 b1 c1
a2 b2 c2
a3 b3 c3

col_A_itm_order = ['a2', 'a3', 'a1']

Expected output

col_A col_B col_C
a2 b2 c2
a3 b3 c3
a1 b1 c1

I found a similar question for a Pandas DataFrame, but not for PySpark.
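For reference, the usual Pandas idiom for this (the approach those similar questions tend to suggest) is to make the column an ordered categorical and then sort; a sketch with the sample data above:

```python
import pandas as pd

df = pd.DataFrame({
    "col_A": ["a1", "a2", "a3"],
    "col_B": ["b1", "b2", "b3"],
    "col_C": ["c1", "c2", "c3"],
})

order = ["a2", "a3", "a1"]

# Treat col_A as an ordered categorical so sort_values follows the list order
df["col_A"] = pd.Categorical(df["col_A"], categories=order, ordered=True)
res = df.sort_values("col_A").reset_index(drop=True)
print(res)
```

This has no direct PySpark equivalent, which is why the question needs a different technique.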

1 Answer

Answer by Suramuthu R:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName("Sortdf").getOrCreate()

# Sample DataFrame
data = [("a1", "b1", "c1"),
        ("a2", "b2", "c2"),
        ("a3", "b3", "c3")]

df = spark.createDataFrame(data, ["col_A", "col_B", "col_C"])

def sortdf(cl, order):
    # Rank each row by the position of its `cl` value in the given
    # list; values not in the list sort last.
    rank = F.lit(len(order))
    for i, v in enumerate(order):
        rank = F.when(F.col(cl) == v, F.lit(i)).otherwise(rank)

    # Sort the DataFrame by that rank
    return df.orderBy(rank)

# Example
r = sortdf("col_A", ["a2", "a3", "a1"])
r.show()