Hadoop config settings through spark-shell seem to have no effect

I'm trying to edit the Hadoop block size configuration through the Spark shell so that the parquet part files generated are of a specific size. I tried setting several properties this way:

val blocksize: Int = 1024 * 1024 * 1024  // 1 GiB
sc.hadoopConfiguration.setInt("dfs.blocksize", blocksize)      // also tried dfs.block.size
sc.hadoopConfiguration.setInt("parquet.block.size", blocksize)

val df = spark.read.csv("/path/to/testfile3.txt")
df.write.parquet("/path/to/output/")

The test file is a large text file of almost 3.5 GB. However, no matter what block size I specify or which approach I take, the number of part files created and their sizes stay the same. I can change the number of part files with the repartition and coalesce functions (as in the sketch below), but I have to use an approach that does not shuffle the data in the DataFrame in any way!
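To illustrate, this is roughly the repartition/coalesce workaround I mean (just a sketch; the partition count of 4 and the output paths are placeholders I made up), which I would rather avoid:

// Workaround I want to avoid: changing the partition count before the write.
// repartition(4) shuffles the full dataset into 4 partitions -> 4 part files
df.repartition(4).write.parquet("/path/to/output_repartitioned/")

// coalesce(4) merges existing partitions (no full shuffle), but it still
// changes how rows are laid out across the resulting part files
df.coalesce(4).write.parquet("/path/to/output_coalesced/")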

I have also tried specifying

df.write.option("parquet.block.size", 1048576).parquet("/path/to/output")

but with no luck. Can someone point out what I am doing wrong? Also, is there any other approach I can use to alter the size of the parquet blocks written to HDFS?
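For what it's worth, reading the values back in the same shell session suggests the properties are at least being set on the driver-side configuration (a quick sanity check, assuming sc.hadoopConfiguration is the configuration the write actually uses):

// Sanity check: read back the values set earlier in the session
println(sc.hadoopConfiguration.get("dfs.blocksize"))       // prints 1073741824 (1024*1024*1024)
println(sc.hadoopConfiguration.get("parquet.block.size"))  // prints 1073741824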
