dataframe or sqlctx (sqlcontext) generated "Trying to call a package" error

735 Views Asked by At

I am using spark 1.3.1. In PySpark, I have created a DataFrame from a RDD and registered the schema, something like this :

dataLen=sqlCtx.createDataFrame(myrdd, ["id", "size"])
dataLen.registerTempTable("tbl")

at this point everything is fine I can make a "select" query from "tbl", for example "select size from tbl where id='abc'".

Then in a Python function, I define something like :

def  getsize(id):
    total=sqlCtx.sql("select size from tbl where id='" + id + "'")
    return total.take(1)[0].size

at this point still no problem, I can do getsize("ab") and it does return a value.

The problem occurred when I invoked getsize within a rdd, say I have a rdd named data which is of (key, value) list, when I do

data.map(lambda x: (x[0], getsize("ab"))

this generated an error which is

py4j.protocol.Py4JError: Trying to call a package

Any idea?

1

There are 1 best solutions below

0
On

Spark doesn't support nested actions or transformations and SQLContext is not accessible outside the driver. So what you're doing here simply cannot work. It is not exactly clear what you want here but most likely a simple join, either on RDDs or DataFrames should do the trick.