How does spark reads text format files

140 Views Asked by braj At 18 August 2025 at 21:42

I have a dataset in S3 in text format(.gz) and I am using spark.read.csv to read the file into spark.

This is about 100GB of data but it contains 150 columns. I am using only 5 columns (so I reduce the breadth of the data) and I have selecting only 5 columns.

For this kind of scenario, does spark scans the complete 100GB of data or it smartly filters only these 5 columns without scanning all the columns(like in columnar formats)?

Any help on this would be appreciated.

imp_feed = spark.read.csv('s3://mys3-loc/input/', schema=impressionFeedSchema, sep='\t').where(col('dayserial_numeric').between(start_date_imp,max_date_imp)).select("col1","col2","col3","col4")

Original Q&A

There are 1 best solutions below

stevel On 06 January 2017 at 14:25

make step 1 of your workflow the process of reading in the CSV file and saving it as snappy compressed ORC or parquet files.

then go to whoever creates those files and tell them to stop it. At the very least, they should switch to Avro + Snappy, as that's easier to split up the initial parse and migration to a columnar format.

How does spark reads text format files

There are 1 best solutions below

Related Questions in APACHE-SPARK

Related Questions in PYSPARK

Related Questions in APACHE-SPARK-SQL

Related Questions in SPARK-CSV

Trending Questions

Popular # Hahtags

Popular Questions