Spark SQL does not work with Hive View when Hive Regex Column support is enabled

525 Views Asked by At

Our team has a bunch of Hive QL, so when migrating to spark, we want to reuse the existng HQL which uses Hive Regex Column Specification like SELECT `(ds)?+.+` FROM.

This could be done simply be simply enable the following configuration:

spark.conf.set('spark.sql.parser.quotedRegexColumnNames', 'true')

However, with the above configuration enabled, querying any Hive view using Spark SQL failed and Spark SQL Analyzer will complain

pyspark.sql.utils.AnalysisException: u"Invalid usage of '*' in expression 'unresolvedextractvalue';"

A simple pyspark script to reproduce the issue is like the following:

from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName('test_spark').enableHiveSupport().getOrCreate()
    spark.conf.set('hive.exec.dynamic.partition.mode', 'nonstrict')
    spark.conf.set('spark.sql.sources.partitionOverwriteMode','dynamic')
    spark.conf.set('spark.sql.parser.quotedRegexColumnNames', 'true')

    spark_sql = r'''
    SELECT
        id_listing
    FROM
        <A Hive View>
    WHERE
        ds = '2019-03-09'
'''
    result = spark.sql(spark_sql)
    print(result.count())

if __name__ == '__main__':
    main()

I was wondering if there is a way to make Regex Column specification and Hive View coexist in Spark.

I observed this behavior in both Spark 2.3.0 and 2.4.0

0

There are 0 best solutions below