How to export data from a Delta table in Databricks and write to .txt files


I have a Delta table with 20 columns and around 5.23 million rows; some of the columns are complex data types. I want to export the data from the table and write it to .txt files using Python, with a header row, a tab (\t) field delimiter, and 50,000 rows of data in each file. I am pretty new to Databricks and Python and need a solution. Please write the full code, not just the logic.

Thanks in advance.

I tried searching but found no results.

There is 1 best solution below.

ARCrow (Best Answer)

Does the 50,000-record count per file have to be exact? If not, 5.23e6 / 50,000 is roughly 105, so repartitioning the data into that many partitions gives us files with roughly 50,000 records each:

import pyspark.sql.functions as f
from pyspark.sql.types import StringType

df = spark.read.format('delta').load('<path to table>')
record_count = df.count()
(
    df
    .select(*[f.col(element).cast(StringType()).alias(element) for element in df.columns]) #To take care of complex data types
    .repartition(int(record_count/50000))
    .write.option('delimiter', '\t').option('header', True).csv('<write destination>')
)

# Since you're using Databricks, and if you absolutely care about the file
# extension being .txt, rename the CSV part files after the write completes
files = dbutils.fs.ls('<write destination>')
for element in files:
    if element.path.endswith('.csv'):
        dbutils.fs.mv(element.path, element.path.replace('.csv', '.txt'))
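If each file really does need exactly 50,000 rows (except for the last one), one alternative is to bucket rows with a window function and partition the write by that bucket. This is only a sketch along the lines of the answer above, not something the answerer posted: the bucket column name and the placeholder paths are illustrative, and the un-partitioned window pulls every row through a single task, so it will be slower than plain repartitioning.

import pyspark.sql.functions as f
from pyspark.sql.types import StringType
from pyspark.sql.window import Window

df = spark.read.format('delta').load('<path to table>')

# Global row number -> integer bucket: rows 1-50,000 get bucket 0, and so on.
# An un-partitioned window funnels all rows through one task, which is slow
# for millions of rows, but the per-file counts come out exact.
w = Window.orderBy(f.monotonically_increasing_id())
bucketed = (
    df
    .select(*[f.col(c).cast(StringType()).alias(c) for c in df.columns])  # handle complex types
    .withColumn('bucket', f.floor((f.row_number().over(w) - 1) / 50000))
)

(
    bucketed
    .repartition('bucket')   # co-locate each bucket's rows so each bucket writes one file
    .write
    .partitionBy('bucket')   # one bucket=N/ subfolder per 50,000-row chunk
    .option('delimiter', '\t')
    .option('header', True)
    .csv('<write destination>')
)

The part files land in bucket=N/ subfolders and still carry the .csv extension, so the same dbutils.fs rename loop applies; you would just list and rename inside each subfolder.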