How to read parquet files using only one thread on a worker/task node?


In Spark, if we execute the following command:

spark.sql("select * from parquet.`/Users/MyUser/TEST/testcompression/part-00009-asdfasdf-e829-421d-b14f-asdfasdf.c000.snappy.parquet`")
  .show(5,false)

Spark distributes the read across all threads on a worker/task node. How can we execute this command limited to just one thread? Is this even possible?

1 Answer (accepted)

If you want to do this for the whole Spark session, you can limit both the shuffle partitions (the number of partitions used for wide/reduce operations) and the default parallelism (the default number of partitions of RDDs produced by transformations) to 1. Note that spark.default.parallelism is normally read when the SparkContext is created, so setting it from an already-running session may not take effect:

spark.conf.set("spark.sql.shuffle.partitions",1)
spark.conf.set("spark.default.parallelism",1)
spark.sql("select * from parquet.`/Users/MyUser/TEST/testcompression/part-00009-asdfasdf-e829-421d-b14f-asdfasdf.c000.snappy.parquet`")
  .show(5,false)

If not, you can repartition the DataFrame down to a single partition before calling an action (the initial file scan may still run in parallel, but all downstream work then happens on one partition):

spark.sql("select * from parquet.`/Users/MyUser/TEST/testcompression/part-00009-asdfasdf-e829-421d-b14f-asdfasdf.c000.snappy.parquet`")
  .repartition(1)
  .show(5,false)