Add current timestamp to Spark dataframe but partition it by the current date without adding it to the dataframe

I understand we can add the current timestamp to a dataframe by doing this:

import org.apache.spark.sql.functions.current_timestamp    
df.withColumn("time_stamp", current_timestamp())

However, is it possible to partition by the current date at the point of saving the parquet file, deriving it from the timestamp without adding it to the dataframe? What I am trying to achieve would be something like this:

df.write.partitionBy(date("time_stamp")).parquet("/path/to/file")
There are 2 answers below.

Answer 1:
You can't do that. partitionBy must be given the name of one or more existing columns of the dataset; it does not accept an expression. In addition, when reading the data back, Spark performs partition discovery based on the directory structure, so the partition column is reconstructed from the paths rather than read out of the data files.
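
A minimal sketch of the partition-discovery point (the path /tmp/demo and the column name dt are illustrative, not from the question): writing with a date partition encodes the value in directory names, and reading the base path rebuilds the column from those names.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_date

val spark = SparkSession.builder().appName("partition-demo").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

// Writing with a date partition produces directories such as
//   /tmp/demo/dt=2024-05-01/part-....parquet
df.withColumn("dt", current_date())
  .write
  .partitionBy("dt")
  .parquet("/tmp/demo")

// Partition discovery rebuilds dt from the directory names, so it
// reappears in the schema even though it is not stored in the files.
spark.read.parquet("/tmp/demo").printSchema()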

Answer 2:

As explained by @过过招, partitionBy takes column names, so you cannot supply a calculated field.

You can, however, create the column with current_date and pass its name to partitionBy. The current_date column you create will not end up inside your data dump anyway: partition columns are encoded in the directory structure of the parquet output.

import org.apache.spark.sql.functions.current_date

// Add the date column, partition on it by name, then write.
df.withColumn("current_date", current_date())
  .write
  .partitionBy("current_date")
  .parquet("/path/to/file")
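
To confirm the partition column is not stored inside the data files themselves, one check (the date in the path is illustrative, and spark is assumed to be an active SparkSession) is to read a single partition directory directly; without the base path, Spark skips partition discovery and the column is absent.

// Reading one partition directory directly bypasses partition discovery:
// current_date is absent from the schema because the value lives only in
// the directory name, not inside the parquet files.
spark.read.parquet("/path/to/file/current_date=2024-05-01").printSchema()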