Hadoop config settings through spark-shell seem to have no effect

I'm trying to edit the Hadoop block size configuration through the Spark shell so that the parquet part files generated are of a specific size. I tried setting several properties this way:

val blocksize: Int = 1024 * 1024 * 1024  // 1 GiB
sc.hadoopConfiguration.setInt("dfs.blocksize", blocksize)      // also tried dfs.block.size
sc.hadoopConfiguration.setInt("parquet.block.size", blocksize)

val df = spark.read.csv("/path/to/testfile3.txt")
df.write.parquet("/path/to/output/")

The test file is a large text file of almost 3.5 GB. However, no matter what block size I specify or which approach I take, the number of part files created and their sizes stay the same. I can change the number of part files with the repartition and coalesce functions (as in the sketch below), but I have to use an approach that does not shuffle the data in the DataFrame in any way!
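To illustrate, this is roughly the repartition/coalesce workaround I mean (just a sketch; the partition count of 4 and the output paths are placeholders I made up), which I would rather avoid:

// Workaround I want to avoid: changing the partition count before the write.
// repartition(4) shuffles the full dataset into 4 partitions -> 4 part files
df.repartition(4).write.parquet("/path/to/output_repartitioned/")

// coalesce(4) merges existing partitions (no full shuffle), but it still
// changes how rows are laid out across the resulting part files
df.coalesce(4).write.parquet("/path/to/output_coalesced/")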

I have also tried specifying

df.write.option("parquet.block.size", 1048576).parquet("/path/to/output")

but with no luck. Can someone point out what I am doing wrong? Also, is there any other approach I can use to alter the size of the parquet blocks written to HDFS?
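For what it's worth, reading the values back in the same shell session suggests the properties are at least being set on the driver-side configuration (a quick sanity check, assuming sc.hadoopConfiguration is the configuration the write actually uses):

// Sanity check: read back the values set earlier in the session
println(sc.hadoopConfiguration.get("dfs.blocksize"))       // prints 1073741824 (1024*1024*1024)
println(sc.hadoopConfiguration.get("parquet.block.size"))  // prints 1073741824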
