Let's say I've got a simple pipe delimited file, with missing values:
A|B||D
I read that into a dataframe:
val foo = spark.read.format("csv").option("delimiter","|").load("/path/to/my/file.txt")
The missing third column, instead of being a null value, has the string null:
+---+---+----+---+
|_c0|_c1| _c2|_c3|
+---+---+----+---+
|  A|  B|null|  D|
+---+---+----+---+
I would like to be able to control what Spark does with that missing value. According to the CSV documentation, there is a nullValue option:
nullValue: string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame.
But since I don't have any value there, I can't seem to control it that way.
Short of testing every single field of every single file I read in, is there a way to control what Spark does with these blanks? For example, replace it with 'N/A'.
We're using Spark 2.1, if it's relevant.
The nullValue and emptyValue options do the opposite of what you're expecting: they let you specify values that, if encountered in the source data, should be turned into null or "" (respectively) in the resulting DataFrame. For instance, if your input were instead A|B|N/A|D and you set option("nullValue", "N/A"), you would once again wind up with a null in the third column.
More to the point, Spark provides functions for dealing with null values in the DataFrameNaFunctions class, reachable through a DataFrame's .na property. For example, df.na.fill("N/A") will replace all null values in string columns with "N/A".
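Putting the pieces together on your original file (a sketch under the same assumptions, not tested against your data):

val foo = spark.read
  .format("csv")
  .option("delimiter", "|")
  .load("/path/to/my/file.txt")  // the file contains: A|B||D

// replace null in every string column with "N/A"
val filled = foo.na.fill("N/A")

filled.show()
// +---+---+---+---+
// |_c0|_c1|_c2|_c3|
// +---+---+---+---+
// |  A|  B|N/A|  D|
// +---+---+---+---+

fill also has an overload that restricts the replacement to particular columns, e.g. foo.na.fill("N/A", Seq("_c2")).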