How do I ensure consistent file sizes in datasets built in Foundry Python Transforms?
My Foundry transform produces a different amount of data on each run, but I want each output file to contain a similar number of rows. I could call DataFrame.count() and then coalesce/repartition, but that requires computing the full dataset and then either caching it or recomputing it. Is there a way to have Spark take care of this?

proggeo's answer is useful if the only thing you care about is the number of records per file. However, it is sometimes also useful to bucket your data so that Foundry can optimize downstream operations such as Contour analyses or other transforms.
In those cases you can use something like:
bucket_column = 'equipment_number'
num_files = 8

# Repartition by the bucket column so each of the num_files partitions
# (and therefore each output file) holds one hash bucket.
output_df = output_df.repartition(num_files, bucket_column)

# Record the matching bucket metadata when writing the output dataset.
output.write_dataframe(
    output_df,
    bucket_cols=[bucket_column],
    bucket_count=num_files,
)
If your bucket column is well distributed, this keeps a similar number of rows in each dataset file.
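For context, here is a minimal end-to-end sketch of how this might look inside a Foundry Python transform. The dataset paths and the equipment_number column are placeholders; adapt them to your own datasets.

from transforms.api import transform, Input, Output

BUCKET_COLUMN = 'equipment_number'  # placeholder bucketing key
NUM_FILES = 8

@transform(
    output=Output('/path/to/output_dataset'),  # placeholder path
    source=Input('/path/to/input_dataset'),    # placeholder path
)
def compute(output, source):
    output_df = source.dataframe()

    # One partition per desired file, hashed on the bucket column.
    output_df = output_df.repartition(NUM_FILES, BUCKET_COLUMN)

    # Writing with matching bucket metadata lets Foundry exploit the
    # layout in downstream transforms and Contour analyses.
    output.write_dataframe(
        output_df,
        bucket_cols=[BUCKET_COLUMN],
        bucket_count=NUM_FILES,
    )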
You can use the spark.sql.files.maxRecordsPerFile configuration option and set it per output of your @transform:
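A minimal sketch of one way to apply this, assuming the option is set at runtime on the transform context's Spark session; the dataset paths and the records-per-file limit are placeholders.

from transforms.api import transform, Input, Output

@transform(
    output=Output('/path/to/output_dataset'),  # placeholder path
    source=Input('/path/to/input_dataset'),    # placeholder path
)
def compute(ctx, output, source):
    # Cap the number of records Spark writes into any single file.
    # 1_000_000 is a placeholder; tune it to your target file size.
    ctx.spark_session.conf.set('spark.sql.files.maxRecordsPerFile', 1_000_000)

    output.write_dataframe(source.dataframe())

With this setting, Spark splits each task's output into multiple files once the record limit is reached, which bounds file size without requiring a full count() of the dataset.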