How to read parquet files using only one thread on a worker/task node?


In Spark, if we execute the following command:

spark.sql("select * from parquet.`/Users/MyUser/TEST/testcompression/part-00009-asdfasdf-e829-421d-b14f-asdfasdf.c000.snappy.parquet`")
  .show(5,false)

Spark distributes the read across all threads on a worker/task node. How can we execute this command limited to just one thread? Is this even possible?

1 Answer (accepted)

If you want to do this for the whole Spark session, you can limit both the shuffle partitions (the number of partitions used for wide/reduce operations) and the default parallelism (the default number of partitions of RDDs produced by transformations) to 1. Note that spark.default.parallelism is normally read when the SparkContext is created, so setting it from an already-running session may not take effect:

spark.conf.set("spark.sql.shuffle.partitions",1)
spark.conf.set("spark.default.parallelism",1)
spark.sql("select * from parquet.`/Users/MyUser/TEST/testcompression/part-00009-asdfasdf-e829-421d-b14f-asdfasdf.c000.snappy.parquet`")
  .show(5,false)

If not, you can repartition the DataFrame down to a single partition before calling an action (the initial file scan may still run in parallel, but all downstream work then happens on one partition):

spark.sql("select * from parquet.`/Users/MyUser/TEST/testcompression/part-00009-asdfasdf-e829-421d-b14f-asdfasdf.c000.snappy.parquet`")
  .repartition(1)
  .show(5,false)