PySpark - will the partitionBy option in Auto Loader -> writeStream partition existing table data?


I used Auto Loader to read data files and write them to a table periodically (without partitioning at first) using the code below:

.writeStream \
  .option("checkpointLocation", "path") \
  .format("delta") \
  .outputMode("append") \
  .start("table")

Now the data size is growing, and I want to partition the data by adding the option .partitionBy("col1"):

.writeStream \
  .option("checkpointLocation", "path") \
  .partitionBy("col1") \
  .format("delta") \
  .outputMode("append") \
  .start("table")

Will the partitionBy("col1") option also partition the data that already exists in the table? If not, how can I partition all of the data (both the existing data and the newly ingested data)?

Best answer

No, it won't partition the existing data automatically; you need to do that explicitly. Something like the following should work, but test it on a small dataset first:

  • Stop the stream if it's running continuously
  • Read the existing data and overwrite the table with the new partitioning scheme:
      spark.read.table("table") \
        .write.mode("overwrite") \
        .partitionBy("col1") \
        .option("overwriteSchema", "true") \
        .saveAsTable("table")
  • Start the stream again
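
The steps above could be wrapped in a small helper, roughly like the sketch below. It rests on a few assumptions that are not in the original post: the function names `repartition_existing_table` and `partition_columns` are hypothetical, `query` is assumed to be the StreamingQuery handle returned by `.start(...)`, and the table is assumed to be a Delta table registered under a name the metastore can resolve. The second function uses Delta's `DESCRIBE DETAIL` to check which partition columns the table reports after the rewrite.

```python
def repartition_existing_table(spark, table_name, partition_col, query=None):
    """Rewrite an existing Delta table in place, partitioned by partition_col.

    `spark` is an active SparkSession; `query` (optional, assumed) is the
    StreamingQuery that writes to the table and must not run during the rewrite.
    """
    # Step 1: stop the stream so nothing is appended while the table is rewritten.
    if query is not None:
        query.stop()

    # Step 2: read all existing rows and overwrite the table with the new layout.
    # overwriteSchema lets Delta replace the table's partitioning metadata.
    (
        spark.read.table(table_name)
        .write
        .format("delta")
        .mode("overwrite")
        .partitionBy(partition_col)
        .option("overwriteSchema", "true")
        .saveAsTable(table_name)
    )


def partition_columns(spark, table_name):
    """Return the partition columns Delta reports for the table."""
    detail = spark.sql(f"DESCRIBE DETAIL {table_name}")
    return list(detail.select("partitionColumns").first()[0])
```

A possible way to use it: call `repartition_existing_table(spark, "table", "col1", query=query)`, confirm that `partition_columns(spark, "table")` contains `col1`, and only then start the writeStream from the question again with `.partitionBy("col1")`. As with the answer itself, try this on a small copy of the table first, since the overwrite rewrites every file.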