Spark - handle blank values in CSV file


Let's say I've got a simple pipe delimited file, with missing values:

a|b||d

I read that into a dataframe:

val foo = spark.read.format("csv").option("delimiter", "|").load("/path/to/my/file.txt")

The missing third column, instead of being a null value, has the string null:

+---+---+----+---+
|_c0|_c1| _c2|_c3|
+---+---+----+---+
|  a|  b|null|  d|
+---+---+----+---+

I would like to be able to control what Spark does with that missing value. According to the CSV Documentation, there is a nullValue option:

nullValue: string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame.

But since I don't have any value there, I can't seem to control it that way.

Short of testing every single field of every single file I read in, is there a way to control what Spark does with these blanks? For example, replacing them with 'N/A'.

We're using Spark 2.1, if it's relevant.

Best answer (Charlie Flowers):

The nullValue and emptyValue options do the opposite of what you're expecting: they let you specify values that, if encountered in the source data, should be turned into null or "" (respectively) in the resulting dataframe. For instance, if your input were instead

a|b|N/A|d

and you set

option("nullValue", "N/A")

you would wind up once again with

+---+---+----+---+
|_c0|_c1| _c2|_c3|
+---+---+----+---+
|  a|  b|null|  d|
+---+---+----+---+
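
For reference, here is a minimal sketch of the full read with that option set; the path and delimiter come from the question, the rest is standard DataFrameReader usage:

// Read the pipe-delimited file, turning the literal string "N/A" into null
val foo = spark.read
  .format("csv")
  .option("delimiter", "|")
  .option("nullValue", "N/A")
  .load("/path/to/my/file.txt")

foo.show()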

More to the point, Spark includes functions for dealing with null values in the class DataFrameNaFunctions. For example, df.na.fill("N/A") will replace all null instances in string columns with "N/A".
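
As a rough sketch against the dataframe from the question (the column name _c2 comes from the output above):

// Replace nulls in every string column
val filled = foo.na.fill("N/A")

// Or restrict the replacement to particular columns
val filledC2 = foo.na.fill("N/A", Seq("_c2"))

filled.show()

Run against the original a|b||d input, this shows _c2 as N/A rather than null.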