Is it possible to avoid merging Parquet files while using Auto Loader in Databricks? The thing is, I want to replicate data from an S3 bucket directly to Azure Blob Storage without merging it — just a 1:1 copy from S3 to Azure Blob.
This is my code:
cda_path = f'{planet}/{center}/{deployment}'

# Build one Auto Loader (cloudFiles) stream per table
streams = list()
for table_name in tables:
    streams.append(
        (table_name, spark.readStream.format("cloudFiles")\
            .option("cloudFiles.format", "parquet")\
            .option('cloudFiles.schemaLocation', f'dbfs:/FileStore/shared_uploads/checkpints/stream_{table_name}')\
            .option('cloudFiles.schemaEvolutionMode', 'rescue')\
            .load(f'dbfs:/mnt/gwcp/{cda_path}/{table_name}/*'))
    )

# Write each stream out as Parquet to the mounted Azure Blob container
for table_name, stream in streams:
    blob_container = "databricks-container"
    blob_output_path = f"/mnt/test_mount_databricks/{planet}/{center}/{deployment}/{table_name}_test"
    stream.writeStream\
        .format("parquet")\
        .outputMode("append")\
        .option("checkpointLocation", f"dbfs:/FileStore/shared_uploads/checkpoints/blob_{table_name}_test")\
        .option("mergeSchema", "false")\
        .start(blob_output_path)
This code works, but the output Parquet files contain many rows, for example 100, while the source Parquet files have only 1 to 2 rows each. As you can see, I've tried setting the "mergeSchema" option to false, but it didn't help. I couldn't find any topic about this, nor anything in the Databricks docs or on Google.
Thanks!
Yes, I think you can achieve a 1:1 copy from S3 to Azure Blob Storage with Auto Loader in Databricks, but mergeSchema is not what controls this: what you are seeing isn't schema merging. The streaming Parquet sink writes one file per partition per micro-batch, and Auto Loader batches many new source files into each micro-batch by default, so the rows from many small files end up combined in one output file. To keep the output files close to the source files in structure and size, you need to limit how many files go into each micro-batch. Below is an updated version of your code to try.
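This is a minimal sketch rather than a drop-in fix: it reuses the variables and paths from your question (planet, center, deployment, tables, the dbfs:/mnt mounts) and only adds the cloudFiles.maxFilesPerTrigger option plus an availableNow trigger. How closely the output matches the source still depends on how Spark partitions each micro-batch, so treat the option values as a starting point.

cda_path = f'{planet}/{center}/{deployment}'

for table_name in tables:
    # Read with Auto Loader, but cap each micro-batch at a single new file
    # so one source file maps to (roughly) one output file.
    # Schema location kept the same as in your question.
    stream = spark.readStream.format("cloudFiles")\
        .option("cloudFiles.format", "parquet")\
        .option("cloudFiles.maxFilesPerTrigger", 1)\
        .option("cloudFiles.schemaLocation", f"dbfs:/FileStore/shared_uploads/checkpints/stream_{table_name}")\
        .option("cloudFiles.schemaEvolutionMode", "rescue")\
        .load(f"dbfs:/mnt/gwcp/{cda_path}/{table_name}/*")

    blob_output_path = f"/mnt/test_mount_databricks/{planet}/{center}/{deployment}/{table_name}_test"

    # availableNow (Databricks Runtime 10.4+ / Spark 3.3+) processes everything
    # currently in the source in multiple batches, respecting maxFilesPerTrigger,
    # and then stops.
    stream.writeStream\
        .format("parquet")\
        .outputMode("append")\
        .option("checkpointLocation", f"dbfs:/FileStore/shared_uploads/checkpoints/blob_{table_name}_test")\
        .trigger(availableNow=True)\
        .start(blob_output_path)

With one source file per micro-batch, the rows from each source file land in their own output file, which keeps the copy close to 1:1. The trade-off is throughput, because every file becomes its own batch.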