Let's say I've got a simple pipe delimited file, with missing values:
A|B||D
I read that into a dataframe:
val foo = spark.read.format("csv").option("delimiter","|").load("/path/to/my/file.txt")
The missing third column, instead of being a null value, has the string null:
+---+---+----+---+
|_c0|_c1| _c2|_c3|
+---+---+----+---+
|  A|  B|null|  D|
+---+---+----+---+
I would like to be able to control what Spark does with that missing value. According to the CSV documentation, there is a nullValue option:
nullValue: string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame.
But since I don't have any value there, I can't seem to control it that way.
Short of testing every single field of every single file I read in, is there a way to control what Spark does with these blanks? For example, replace it with 'N/A'.
We're using Spark 2.1, if it's relevant.
The nullValue and emptyValue options do the opposite of what you're expecting: they let you specify values that, if encountered in the source data, should be turned into null or "" (respectively) in the resulting DataFrame. For instance, if your input were instead A|B|N/A|D and you set option("nullValue", "N/A"), you would once again wind up with a null in the third column.
More to the point, Spark provides functions for dealing with null values in the DataFrameNaFunctions class, reachable through a DataFrame's .na property. For example, df.na.fill("N/A") will replace all null values in string columns with "N/A".
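Putting the pieces together on your original file (a sketch under the same assumptions, not tested against your data):

val foo = spark.read
  .format("csv")
  .option("delimiter", "|")
  .load("/path/to/my/file.txt")  // the file contains: A|B||D

// replace null in every string column with "N/A"
val filled = foo.na.fill("N/A")

filled.show()
// +---+---+---+---+
// |_c0|_c1|_c2|_c3|
// +---+---+---+---+
// |  A|  B|N/A|  D|
// +---+---+---+---+

fill also has an overload that restricts the replacement to particular columns, e.g. foo.na.fill("N/A", Seq("_c2")).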