Below is my script to use sql in dataframe with python:
pyspark --packages com.databricks:spark-csv_2.10:1.4.0
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file:///root/Downloads/data/flight201601short.csv')
df.show(5) shows result below:
then I register the dataframe to a temp table:
df.registerTempTable("flight201601")
and tried to run some sql query like below:
sqlContext.sql("select distinct CARRIER from flight201601")
It doesn't produce expected result, instead:
I also tried:
sqlContext.sql("select * from flight201601")
and it gives me:
So it seems the registerTempTable method only create the table schema and the table is NOT populated, what am I missing?
You will have to call
show()
method on Dataframe returned bysqlContext.sql
to get the result of query. Check the spark document which says