What is the difference between using .format("org.apache.phoenix.spark") and .format("jdbc") when loading an HBase table (through Phoenix) into a Spark DataFrame?
val tracesDF = spark.sqlContext.read
  .format("org.apache.phoenix.spark")
  .option("table", hbaseTblName)
  .option("zkUrl", appConf.getString("zookeeper_url"))
  .load()
vs
val tracesDF = spark.sqlContext.read
  .format("jdbc")
  .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
  .option("url", appConf.getString("hbasedb_url"))
  .option("dbtable", hbaseTblName)
  .load()
I also noticed the following, which seems related:
- I create the HBase table through a JDBC statement: hbaseCon.createStatement().execute("CREATE TABLE ...")
- The DataFrame loaded with .format("org.apache.phoenix.spark") is empty, while the one loaded with .format("jdbc") returns the data properly.
- I need to specify the column family when using .format("org.apache.phoenix.spark") [tracesDF.select(..., "`B.SAMPLES_BINARY`")], but not when using .format("jdbc") [tracesDF.select(..., "SAMPLES_BINARY")].