I am using a built-in function in Impala like:
select id, parse_url(my_table.url, "QUERY", "extensionId") from my_table
Now I am migrating to SparkSQL (using pyspark in Jupyter Notebook):
my_table.select(my_table.id.cast('string'), parse_url(my_table.url.cast('string'), "QUERY", "extensionId")).show()
However, I got the following error:
NameError: name 'parse_url' is not defined
I also tried the following:
my_table.registerTempTable("my_table")
sqlContext.sql("select id, url, parse_url(url, 'QUERY', 'extensionId') as new_url from my_table").show(100)
But every new_url value becomes null.
Any idea what I missed here? Also, how would people handle such a problem? Thanks!
Some missing parts: HiveContext / SparkSession with Hive support.
In general it should work just fine, and a NULL output means that the given part cannot be matched.
You could achieve a similar result using a UDF, but it will be significantly slower.
With data defined as:
it could be used as follows:
with result being:
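A minimal sketch of the UDF route mentioned above, not the answer's original code. The names query_value and add_query_value are hypothetical, and the url column and extensionId key are taken from the question. The lookup itself is plain Python, so it can be checked without a cluster; the Spark wiring assumes pyspark is installed and is only imported when that helper is called.

```python
from urllib.parse import parse_qs, urlsplit


def query_value(url, key):
    """Return the first value of `key` in the URL's query string, or None.

    None plays the role of SQL NULL: it is returned when the URL is missing,
    cannot be parsed, or does not contain the requested key.
    """
    if url is None:
        return None
    try:
        query = urlsplit(url).query
    except ValueError:
        # urlsplit rejects some malformed URLs (e.g. a broken IPv6 host).
        return None
    values = parse_qs(query).get(key)
    return values[0] if values else None


def add_query_value(df, url_col="url", key="extensionId"):
    """Add a new_url column to a Spark DataFrame using a Python UDF.

    Hypothetical helper: pyspark is imported lazily so the pure-Python
    function above stays usable without Spark.
    """
    from pyspark.sql.functions import col, lit, udf
    from pyspark.sql.types import StringType

    extract = udf(query_value, StringType())
    # Constant arguments must be wrapped in lit() to be passed as columns.
    return df.withColumn("new_url", extract(col(url_col), lit(key)))


if __name__ == "__main__":
    print(query_value("http://example.com/path?extensionId=abc&x=1", "extensionId"))
    print(query_value("http://example.com/path?x=1", "extensionId"))
```

With a DataFrame like the question's my_table, add_query_value(my_table).show() would be the equivalent call. Note that parse_qs percent-decodes values and ignores blank ones, so results may differ from parse_url in edge cases.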