I have a SparkSQL DataFrame.
Some entries in this data are empty but they don't behave like NULL or NA. How could I remove them? Any ideas?
In R I can easily remove them, but in SparkR it says there is a problem with the S4 system/methods.
Thanks.
SparkR Column provides a long list of useful methods including isNull and isNotNull:
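(A sketch only: it assumes a SparkR 2.x+ session started with sparkR.session(), and the DataFrame df with columns x and y is purely illustrative.)

```r
library(SparkR)
sparkR.session()  # older releases use sparkR.init() / sparkRSQL.init() instead

# Illustrative data; R's NA values become nulls in the Spark DataFrame
df <- createDataFrame(data.frame(x = c(1, NA, 3),
                                 y = c("a", NA, "c"),
                                 stringsAsFactors = FALSE))

# Keep rows where x is not null
head(where(df, isNotNull(df$x)))

# Or select the rows where x is null
head(where(df, isNull(df$x)))
```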
Please keep in mind that there is no distinction between NA and NaN in SparkR.

If you prefer operations on a whole data frame, there is a set of NA functions including fillna and dropna:
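(Again a sketch, reusing the illustrative df from above; the replacement values are arbitrary.)

```r
# Drop every row containing at least one null (how = "any" is the default)
dropna(df)

# Drop only rows where all values are null
dropna(df, how = "all")

# Replace nulls instead of dropping rows: a single numeric fills numeric columns,
# a named list targets specific columns
fillna(df, value = 0)
fillna(df, value = list(x = 0, y = "missing"))
```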
Both can be adjusted to consider only a subset of columns (cols), and dropna has some additional useful parameters. For example, you can specify the minimal number of non-null columns:
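(A final sketch with the same hypothetical df and column names.)

```r
# Keep rows with at least 2 non-null values; when minNonNulls is set,
# the how argument is ignored
dropna(df, minNonNulls = 2)

# Both dropna and fillna can be restricted to a subset of columns via cols
dropna(df, how = "any", cols = c("x"))
fillna(df, value = 0, cols = c("x"))
```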