By default, when your data uses a Hive-style partition directory structure, Auto Loader infers the partition columns from the paths and adds them to your schema automatically via schema inference (the cloudFiles.partitionColumns option lets you name those columns explicitly).
This is the code:
checkpoint_path = "s3://dev-bucket/_checkpoint/dev_table"
(
    spark.readStream
        # Auto Loader: incrementally ingest JSON files, inferring the schema
        # (including Hive-style partition columns) at the schema location
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", checkpoint_path)
        .load("s3://autoloader-source/json-data")
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # process all files available now, then stop
        .toTable("dev_catalog.dev_database.dev_table")
)
But can we also create partitions on the target table, like you can do with a simple CREATE TABLE? (E.g. if you have the classical /year=xxxx/month=xxx/day=xx structure.)
You can use the .partitionBy() function on the writeStream, before toTable().
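A minimal sketch reusing the stream above; the year, month, and day column names are an assumption based on the directory layout in your example:

checkpoint_path = "s3://dev-bucket/_checkpoint/dev_table"

(
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", checkpoint_path)
        .load("s3://autoloader-source/json-data")
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        # Partition the target table by the Hive-style columns inferred
        # from the /year=.../month=.../day=... paths (assumed column names)
        .partitionBy("year", "month", "day")
        .trigger(availableNow=True)
        .toTable("dev_catalog.dev_database.dev_table")
)

Note that the partition spec only takes effect when the stream creates the table; if dev_catalog.dev_database.dev_table already exists, the columns passed to .partitionBy() have to match its existing partitioning.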