Handle different date formats in Pyspark

51 Views Asked by At

I am trying to handle different date data formats using pyspark. I tried using to_timestamp but I am not getting the expected output. Any help would be highly appreciated.

following is my input and expected output

enter image description here

code i tried

# Input data
input_data = [
    "2021-06-16T13:24:44.240-05:00",
    "2011-10-20-05:00",
    "1982-09-27-06:00",
    "20/11/2021"
]

# Create a DataFrame with the input data
df = spark.createDataFrame([(i,) for i in input_data], ["input"])

dfWithDate = df.withColumn("output", F.to_date(F.to_timestamp(col("input"), "M/d/yyyy H:mm")))

dfWithDate.show()
2

There are 2 best solutions below

0
user2704177 On

You don't need to use to_timestamp, you can use to_date immediately. If omitting the format argument doesn't make pyspark infer the format correctly, you'll need to create a when/otherwise for the different formats.

E.g.

f.when(f.col(input). contains("/"), f.to_date(input, format=y/m/d).when(...
0
Andrew On

If you're sure it's only the 3, I would do something like this:

dfWithDate = df.withColumn("a", to_timestamp("input")).withColumn("b",to_timestamp("input","yyyy-MM-ddXXX")).withColumn("c",to_timestamp("input","dd/MM/yyyy"))

Which will give us something like:

>>> dfWithDate.show()
+--------------------+--------------------+-------------------+-------------------+
|               input|                   a|                  b|                  c|
+--------------------+--------------------+-------------------+-------------------+
|2021-06-16T13:24:...|2021-06-16 13:24:...|               null|               null|
|    2011-10-20-05:00|                null|2011-10-20 00:00:00|               null|
|    1982-09-27-06:00|                null|1982-09-27 01:00:00|               null|
|          20/11/2021|                null|               null|2021-11-20 00:00:00|

Then we can coalesce those 3 together:

>>> dfWithDate.select("input",coalesce("a","b","c")).show()
+--------------------+--------------------+
|               input|   coalesce(a, b, c)|
+--------------------+--------------------+
|2021-06-16T13:24:...|2021-06-16 13:24:...|
|    2011-10-20-05:00| 2011-10-20 00:00:00|
|    1982-09-27-06:00| 1982-09-27 01:00:00|
|          20/11/2021| 2021-11-20 00:00:00|
+--------------------+--------------------+