Write dataframe without column names as part of the file path


I have to write a Spark dataframe to a path of the format base_path/{year}/{month}/{day}/{hour}/. If I do something like below:

pc = ["year", "month", "day", "hour"]
df.write.partitionBy(*pc).parquet("base_path/", mode = 'append')

it creates the location as base_path/year=2022/month=04/day=25/hour=10/. I do not want the column names year, month, day, and hour to be part of the path; I want something like base_path/2022/04/25/10/ instead. Is there a solution for this?

1 Answer

The column names are written as part of the path because the partition column values are not stored in the data files themselves; the Hive-style column=value segments in the path are what allow Spark to reconstruct those columns when reading the data back.
For more information about this see here.
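To illustrate the convention, here is a minimal sketch of how column values are encoded in a Hive-style path. parse_hive_path is a hypothetical helper for illustration, not a Spark API; Spark performs this partition discovery internally when it reads a partitioned directory tree.

```python
def parse_hive_path(relative_path):
    """Recover partition column values from segments like 'year=2022/month=04'."""
    parts = {}
    for segment in relative_path.strip("/").split("/"):
        if "=" in segment:
            column, value = segment.split("=", 1)
            parts[column] = value
    return parts

print(parse_hive_path("year=2022/month=04/day=25/hour=10/"))
# {'year': '2022', 'month': '04', 'day': '25', 'hour': '10'}
```

With a plain path like 2022/04/25/10/ there is no column name to recover, which is why Spark cannot rebuild the partition columns from it.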

If you still want to write the data with the path layout above, you can issue multiple write commands, each with an explicit path and a filter on the corresponding partition values.
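A sketch of that multi-write approach, under the question's base_path and column names. partition_path is a hypothetical helper that builds the desired layout; the Spark loop is shown in comments because it needs a running Spark session, and it mirrors the question's write but filters each partition and drops the partition columns (their values live only in the path, so they are lost unless re-added on read).

```python
def partition_path(base_path, year, month, day, hour):
    """Build a base_path/2022/04/25/10/ style path with no column names."""
    return f"{base_path.rstrip('/')}/{year}/{month:02}/{day:02}/{hour:02}/"

# With a Spark session available, the per-partition writes would look like:
#
# pc = ["year", "month", "day", "hour"]
# for row in df.select(*pc).distinct().collect():
#     path = partition_path("base_path", row.year, row.month, row.day, row.hour)
#     (df.filter((df.year == row.year) & (df.month == row.month) &
#                (df.day == row.day) & (df.hour == row.hour))
#        .drop(*pc)  # values are encoded in the path, not in the files
#        .write.parquet(path, mode="append"))

print(partition_path("base_path/", 2022, 4, 25, 10))
# base_path/2022/04/25/10/
```

One write per distinct partition combination is noticeably slower than a single partitionBy write, so this trades performance for the custom path layout.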
The current logic for determining the partition path is located here, and there doesn't seem to be a way to replace it in a pluggable fashion (you could technically load a different implementation into the JVM or write your own writer implementation, but I would not recommend that).