Filter out null strings and empty strings in hivecontext.sql


I'm using pyspark and hivecontext.sql and I want to filter out all null and empty values from my data.

So I used a simple SQL query to first filter out the null values, but it doesn't work.

My code:

hiveContext.sql("select column1 from table where column2 is not null")

but it works without the "where column2 is not null" clause.

Error:

Py4JJavaError: An error occurred while calling o577.showString

I think my select statement is wrong.

Data example:

column 1 | column 2
null     |   1
null     |   2
1        |   3
2        |   4
null     |   2
3        |   8

Objective:

column 1 | column 2
1        |   3
2        |   4
3        |   8

Thanks

There are 4 answers below.

BEST ANSWER

This works for me:

df.na.drop(subset=["column1"])
ANSWER 2
Have you entered the null values manually?
If so, they are treated as ordinary strings.
I tried the following two use cases.

dbname.person table in hive

name    age

aaa     null   // this "null" was entered manually - case 1
Andy    30
Justin  19
okay    NULL   // this NULL appeared because the field was left blank - case 2

---------------------------------
hiveContext.sql("select * from dbname.person").show();
+------+----+
|   name| age|
+------+----+
|  aaa |null|
|  Andy|  30|
|Justin|  19|
|  okay|null|
+------+----+

-----------------------------
case 2 
hiveContext.sql("select * from dbname.person where age is not null").show();
+------+----+
|  name|age |
+------+----+
|  aaa |null|
|  Andy| 30 |
|Justin| 19 |
+------+----+
------------------------------------
case 1
hiveContext.sql("select * from dbname.person where age != 'null'").show();
+------+----+
|  name| age|
+------+----+
|  Andy|  30|
|Justin|  19|
|  okay|null|
+------+----+
------------------------------------

I hope the use cases above clear up your doubts about filtering out null values. And if you are querying a table registered in Spark (rather than in Hive), use sqlContext.

ANSWER 3

You have to qualify the table as database_name.table and run the same query; then it will work. Please let me know if that helps.

ANSWER 4

We cannot pass the Hive table name directly to the HiveContext sql method, since it does not recognize the Hive table name on its own. One way to read a Hive table is through the pyspark shell.

We need to register the DataFrame we get from reading the Hive table as a temporary view. Then we can run the SQL query against it.