I currently have a DataFrame of around 44000 rows. My goal is to write each row to a blob container (let's name it output) as a separate JSON file, meaning I will end up with 44000 JSON files, each containing one row of the DataFrame.
I'm currently using the below piece of code:
# my function to write:
def write_output_file(file_name, storageName, output_df):
    # Setting the final file location
    outputName = "output-" + file_name.split('.')[0]
    output_container_path = f"wasbs://[email protected]"
    output_blob_folder = output_container_path #+ "/" + outputName + "_Final"
    # Write the result
    output_df.write.mode("overwrite").option("ignoreNullFields", "false").option("maxRecordsPerFile", 1).json(output_blob_folder)
    return output_blob_folder
# command calling this function:
chunk_size = 5000
num_chunks = final_df.count() // chunk_size + 1
# Split final_df into chunks of 5000 rows
dfs = final_df.randomSplit([1.0] * num_chunks)
# Write each chunk to output file
for i, df in enumerate(dfs):
    write_output_file(fileName, storageName, df)
My problem is that I end up with only one chunk's worth of output, i.e. around 5000 JSON files in the blob container.
Each chunk is written with mode("overwrite") to the same folder, so every call overwrites the previous chunk's output and only the last ~5000 files survive. I think you need to use the foreach function. I also modified your code a bit:
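A minimal sketch of the foreach idea, assuming you can call the azure-storage-blob SDK from the executors (the connection string, container name, and blob-naming scheme below are placeholders, not taken from your setup):

import json
import uuid
from azure.storage.blob import BlobServiceClient

# Placeholders -- replace with your own storage account details
CONNECTION_STRING = "<your-storage-connection-string>"
CONTAINER_NAME = "output"

def write_row_as_json(row):
    # Runs on the executors: upload one JSON blob per DataFrame row
    service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    container = service.get_container_client(CONTAINER_NAME)
    blob_name = f"row-{uuid.uuid4()}.json"
    container.upload_blob(name=blob_name, data=json.dumps(row.asDict(), default=str), overwrite=True)

# Apply the function to every row of the DataFrame
final_df.foreach(write_row_as_json)

Note that creating a client per row is expensive; foreachPartition lets you create one client per partition and reuse it for every row in that partition.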