Can Databricks Auto Loader infer partitions?

By default, when your source uses a Hive-style partition directory structure, the Auto Loader option cloudFiles.partitionColumns adds these columns to your schema automatically (via schema inference).
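For illustration, a minimal sketch, with assumed column names, of declaring the partition columns explicitly via cloudFiles.partitionColumns instead of relying on inference:

checkpoint_path = "s3://dev-bucket/_checkpoint/dev_table"

df = (
    spark
    .readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    # schema inference needs a schema location
    .option("cloudFiles.schemaLocation", checkpoint_path)
    # hypothetical column names matching the directory layout
    .option("cloudFiles.partitionColumns", "year,month,day")
    .load("s3://autoloader-source/json-data")
)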

This is my current code:

checkpoint_path = "s3://dev-bucket/_checkpoint/dev_table"
(
    spark
    .readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load("s3://autoloader-source/json-data")
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)
    .toTable("dev_catalog.dev_database.dev_table")
)

But is there an option to also create partitions on the target table, as you can with a plain CREATE TABLE? (E.g. if you have the classic /year=xxxx/month=xxx/day=xx structure.)

1 Answer

Best answer:

You can use the .partitionBy() method on the stream writer:

checkpoint_path = "s3://dev-bucket/_checkpoint/dev_table"
(
    spark
    .readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load("s3://autoloader-source/json-data")
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    # partition the target table by the chosen columns
    .partitionBy("col1", "col2")
    .trigger(availableNow=True)
    .toTable("dev_catalog.dev_database.dev_table")
)
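
For the question's /year=xxxx/month=xxx/day=xx layout, a sketch (assuming those column names end up in the schema, either via inference or by declaring them with cloudFiles.partitionColumns) could look like:

checkpoint_path = "s3://dev-bucket/_checkpoint/dev_table"
(
    spark
    .readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    # declare the directory-derived columns explicitly (inference can also pick them up)
    .option("cloudFiles.partitionColumns", "year,month,day")
    .load("s3://autoloader-source/json-data")
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    # physically partition the target table by the same columns
    .partitionBy("year", "month", "day")
    .trigger(availableNow=True)
    .toTable("dev_catalog.dev_database.dev_table")
)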