How to sort a PySpark DataFrame's rows by the order of a list?


I have a PySpark DataFrame with multiple columns, and a list containing values from one of those columns. I want to sort the rows by the order of the given list.

col_A col_B col_C
a1 b1 c1
a2 b2 c2
a3 b3 c3

col_A_itm_order = ['a2', 'a3', 'a1']

Expected output

col_A col_B col_C
a2 b2 c2
a3 b3 c3
a1 b1 c1

I found a similar question for a Pandas DataFrame, but not for PySpark.
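For reference, the usual Pandas idiom for this (the approach those similar questions tend to suggest) is to make the column an ordered categorical and then sort; a sketch with the sample data above:

```python
import pandas as pd

df = pd.DataFrame({
    "col_A": ["a1", "a2", "a3"],
    "col_B": ["b1", "b2", "b3"],
    "col_C": ["c1", "c2", "c3"],
})

order = ["a2", "a3", "a1"]

# Treat col_A as an ordered categorical so sort_values follows the list order
df["col_A"] = pd.Categorical(df["col_A"], categories=order, ordered=True)
res = df.sort_values("col_A").reset_index(drop=True)
print(res)
```

This has no direct PySpark equivalent, which is why the question needs a different technique.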

1 Answer

Answer by Suramuthu R:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName("Sortdf").getOrCreate()

# Sample DataFrame
data = [("a1", "b1", "c1"),
        ("a2", "b2", "c2"),
        ("a3", "b3", "c3")]

df = spark.createDataFrame(data, ["col_A", "col_B", "col_C"])

def sortdf(cl, order):
    # Rank each row by the position of its `cl` value in the given
    # list; values not in the list sort last.
    rank = F.lit(len(order))
    for i, v in enumerate(order):
        rank = F.when(F.col(cl) == v, F.lit(i)).otherwise(rank)

    # Sort the DataFrame by that rank
    return df.orderBy(rank)

# Example
r = sortdf("col_A", ["a2", "a3", "a1"])
r.show()