Does a count() over a DataFrame materialize the data to the driver / increase a risk of OOM?


I want to run df.count() on my DataFrame, but I know my total dataset size is pretty large. Does this run the risk of materializing the data back to the driver / increasing my risk of driver OOM?

BEST ANSWER

This will not materialize your entire dataset to the driver, nor will it necessarily increase your risk of OOM. It does force evaluation of the incoming DataFrame, so if that evaluation would OOM anyway, the failure will surface at the point you call .count(); the .count() itself didn't cause it, it only made it visible.
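As a rough illustration (a minimal PySpark sketch; the dataset and its size are made up), the only thing .count() sends back to the driver is a single number, whereas .collect() is the kind of call that actually pulls rows into driver memory:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("count-example").getOrCreate()

    # An illustrative large, distributed dataset.
    df = spark.range(0, 1_000_000_000)

    # count() computes partial counts on the executors and returns one long
    # to the driver, so only a single number crosses the network.
    n = df.count()

    # collect(), by contrast, ships every row into driver memory and is the
    # call that really risks a driver OOM on data this size.
    # rows = df.collect()  # avoid on large datasets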

What this will do, however, is halt the execution of your job at the point you call .count(). This is because the value must be returned to the driver before it can submit any of the rest of your work, so it's not a particularly efficient use of Spark / distributed compute. Use .count() only when necessary, e.g. when choosing partition counts or making other such dynamic sizing decisions, as sketched below.
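A hedged sketch of that dynamic-sizing use case (the rows-per-partition target is an arbitrary illustrative figure, not a recommendation):

    # Assumed target; tune for your row width and cluster.
    rows_per_partition = 1_000_000

    total_rows = df.count()  # action: the job blocks here until the count comes back

    num_partitions = max(1, total_rows // rows_per_partition)
    df = df.repartition(num_partitions)  # resize before an expensive write or join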