Our team has a bunch of Hive QL, so when migrating to spark, we want to reuse the existng HQL which uses Hive Regex Column Specification like SELECT `(ds)?+.+` FROM.
This could be done simply be simply enable the following configuration:
spark.conf.set('spark.sql.parser.quotedRegexColumnNames', 'true')
However, with the above configuration enabled, querying any Hive view using Spark SQL failed and Spark SQL Analyzer will complain
pyspark.sql.utils.AnalysisException: u"Invalid usage of '*' in expression 'unresolvedextractvalue';"
A simple pyspark script to reproduce the issue is like the following:
from pyspark.sql import SparkSession
def main():
spark = SparkSession.builder.appName('test_spark').enableHiveSupport().getOrCreate()
spark.conf.set('hive.exec.dynamic.partition.mode', 'nonstrict')
spark.conf.set('spark.sql.sources.partitionOverwriteMode','dynamic')
spark.conf.set('spark.sql.parser.quotedRegexColumnNames', 'true')
spark_sql = r'''
SELECT
id_listing
FROM
<A Hive View>
WHERE
ds = '2019-03-09'
'''
result = spark.sql(spark_sql)
print(result.count())
if __name__ == '__main__':
main()
I was wondering if there is a way to make Regex Column specification and Hive View coexist in Spark.
I observed this behavior in both Spark 2.3.0 and 2.4.0