Is it possible to avoid merging Parquet files while using Auto Loader in Databricks? The thing is, I want to replicate data from an S3 bucket directly to Azure Blob Storage without merging it — just a 1:1 copy from S3 to Azure Blob.
This is my code:
cda_path = f'{planet}/{center}/{deployment}'

# Build one Auto Loader (cloudFiles) stream per table
streams = list()
for table_name in tables:
    streams.append(
        (table_name, spark.readStream.format("cloudFiles")\
            .option("cloudFiles.format", "parquet")\
            .option('cloudFiles.schemaLocation', f'dbfs:/FileStore/shared_uploads/checkpints/stream_{table_name}')\
            .option('cloudFiles.schemaEvolutionMode', 'rescue')\
            .load(f'dbfs:/mnt/gwcp/{cda_path}/{table_name}/*'))
    )

# Write each stream out as Parquet to the mounted Azure Blob container
for table_name, stream in streams:
    blob_container = "databricks-container"
    blob_output_path = f"/mnt/test_mount_databricks/{planet}/{center}/{deployment}/{table_name}_test"
    stream.writeStream\
        .format("parquet")\
        .outputMode("append")\
        .option("checkpointLocation", f"dbfs:/FileStore/shared_uploads/checkpoints/blob_{table_name}_test")\
        .option("mergeSchema", "false")\
        .start(blob_output_path)
This code works, but the output Parquet files contain many rows, for example 100, while the source Parquet files have only 1 to 2 rows each. As you can see, I've tried setting the "mergeSchema" option to false, but it didn't help. I couldn't find any topic about this, nor anything in the Databricks docs or on Google.
Thanks!
Yes, I think you can achieve a 1:1 copy from S3 to Azure Blob Storage with Auto Loader in Databricks, but mergeSchema is not what controls this: what you are seeing isn't schema merging. The streaming Parquet sink writes one file per partition per micro-batch, and Auto Loader batches many new source files into each micro-batch by default, so the rows from many small files end up combined in one output file. To keep the output files close to the source files in structure and size, you need to limit how many files go into each micro-batch. Below is an updated version of your code to try.
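This is a minimal sketch rather than a drop-in fix: it reuses the variables and paths from your question (planet, center, deployment, tables, the dbfs:/mnt mounts) and only adds the cloudFiles.maxFilesPerTrigger option plus an availableNow trigger. How closely the output matches the source still depends on how Spark partitions each micro-batch, so treat the option values as a starting point.

cda_path = f'{planet}/{center}/{deployment}'

for table_name in tables:
    # Read with Auto Loader, but cap each micro-batch at a single new file
    # so one source file maps to (roughly) one output file.
    # Schema location kept the same as in your question.
    stream = spark.readStream.format("cloudFiles")\
        .option("cloudFiles.format", "parquet")\
        .option("cloudFiles.maxFilesPerTrigger", 1)\
        .option("cloudFiles.schemaLocation", f"dbfs:/FileStore/shared_uploads/checkpints/stream_{table_name}")\
        .option("cloudFiles.schemaEvolutionMode", "rescue")\
        .load(f"dbfs:/mnt/gwcp/{cda_path}/{table_name}/*")

    blob_output_path = f"/mnt/test_mount_databricks/{planet}/{center}/{deployment}/{table_name}_test"

    # availableNow (Databricks Runtime 10.4+ / Spark 3.3+) processes everything
    # currently in the source in multiple batches, respecting maxFilesPerTrigger,
    # and then stops.
    stream.writeStream\
        .format("parquet")\
        .outputMode("append")\
        .option("checkpointLocation", f"dbfs:/FileStore/shared_uploads/checkpoints/blob_{table_name}_test")\
        .trigger(availableNow=True)\
        .start(blob_output_path)

With one source file per micro-batch, the rows from each source file land in their own output file, which keeps the copy close to 1:1. The trade-off is throughput, because every file becomes its own batch.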