pyspark - will the partitionBy option in Auto Loader -> writeStream partition existing table data?

498 Views Asked by peace At 28 June 2025 at 01:12

I used Auto Loader to read data files and write them to a table periodically (without partitioning at first) using the code below:

# df is the streaming DataFrame read by Auto Loader (name assumed for illustration)
df.writeStream \
  .option("checkpointLocation", "path") \
  .format("delta") \
  .outputMode("append") \
  .start("table")

Now the data size is growing, and I want to partition the data by adding the option .partitionBy("col1"):

# df is the streaming DataFrame read by Auto Loader (name assumed for illustration)
df.writeStream \
  .option("checkpointLocation", "path") \
  .partitionBy("col1") \
  .format("delta") \
  .outputMode("append") \
  .start("table")

Will the option partitionBy("col1") also partition the data that already exists in the table? If not, how can I partition all the data (both the existing data and the newly ingested data)?


There is 1 best solution below

Alex Ott On 08 January 2023 at 10:12 BEST ANSWER

No, it won't partition existing data automatically; you will need to do it explicitly. Something like this (test it first on a small dataset):

1. Stop the stream if it's running continuously.
2. Read the existing data and overwrite the table with the new partitioning schema:

spark.read.table("table") \
  .write.mode("overwrite") \
  .partitionBy("col1") \
  .option("overwriteSchema", "true") \
  .saveAsTable("table")

3. Start the stream again.
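
The stop, rewrite and restart sequence above could be wrapped in small helpers, roughly as in the sketch below. This is only an illustration under assumptions: the function names and the query name "ingest" are hypothetical, the stream is assumed to have been started with .queryName("ingest"), and spark is the active SparkSession on a Delta-enabled cluster. The last helper uses Delta's DESCRIBE DETAIL to check which partition columns the table ended up with.

```python
def stop_streams_named(spark, query_name):
    """Stop every active streaming query whose name matches query_name.

    Assumes the stream was started with .queryName(query_name);
    returns the names of the queries that were stopped.
    """
    stopped = []
    for query in spark.streams.active:
        if query.name == query_name:
            query.stop()
            stopped.append(query.name)
    return stopped


def rewrite_partitioned(spark, table_name, partition_col):
    """Overwrite a Delta table in place with a new partition layout.

    Same calls as in the answer: read the whole table, then overwrite it
    with partitionBy and overwriteSchema enabled.
    """
    (
        spark.read.table(table_name)
        .write.mode("overwrite")
        .partitionBy(partition_col)
        .option("overwriteSchema", "true")
        .saveAsTable(table_name)
    )


def partition_columns(spark, table_name):
    """Return the table's partition columns as reported by DESCRIBE DETAIL."""
    detail = spark.sql(f"DESCRIBE DETAIL {table_name}")
    return list(detail.select("partitionColumns").first()[0])
```

A possible usage, with the stream stopped before the rewrite and restarted afterwards with the same checkpoint location:

```python
stop_streams_named(spark, "ingest")          # hypothetical query name
rewrite_partitioned(spark, "table", "col1")
print(partition_columns(spark, "table"))     # inspect the new layout
# then start the writeStream again
```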
