Getting error while reading BIGNUMERIC data type from a BigQuery table using Apache Spark


I have a Dataproc Spark job that reads data from a BigQuery table. The BigQuery table has a column of type BIGNUMERIC. Spark is able to read from the table successfully, but the problem arises when I try to get the column names from the Spark DataFrame, i.e. while executing the code below:

df = spark.read.format('bigquery').load('project_id.dataset_id.table_id')
columns = df.columns
print(f'*********Columns - {columns}**********')
df.show()
df.printSchema()

The error I get is as follows:

columns = df.columns()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 939, in columns
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 256, in schema
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 871, in _parse_datatype_json_string
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 888, in _parse_datatype_json_value
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 577, in fromJson
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 577, in
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 434, in fromJson
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 890, in _parse_datatype_json_value
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 736, in fromJson
ModuleNotFoundError: No module named 'google.cloud.spark'

But if I omit df.columns and only execute show() and printSchema(), it works fine. The DataFrame schema from printSchema() is as follows:

root
|-- col1: string (nullable = true)
|-- col2: bignumeric (nullable = true)

I have used the Spark BigQuery connector to read from BigQuery. Any help and possible solution is highly appreciated. I am happy to provide any additional details if needed.

There are 3 answers below.

Answer 1:

There is an issue discussed in the spark-bigquery-connector GitHub repository that appears to match this question.

Basically, BigNumeric support is provided through a Spark UserDefinedType. When launching your PySpark job, the required Python class file needs to be provided on the command line via '--py-files':

# use the appropriate jar version depending on the Scala version

pyspark --jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.11-0.29.0.jar --py-files gs://spark-lib/bigquery/spark-bigquery-support-0.29.0.zip --files gs://spark-lib/bigquery/spark-bigquery-support-0.29.0.zip

or at runtime via spark.sparkContext.addPyFile.
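
For example, a minimal sketch of the runtime approach (using the same support zip version as the jar above; adjust the versions to match your cluster):

# Register the BigNumeric support classes at runtime instead of via --py-files.
# The support zip version should match the connector jar version.
spark.sparkContext.addPyFile("gs://spark-lib/bigquery/spark-bigquery-support-0.29.0.zip")

# After this, parsing the schema of a table with a BIGNUMERIC column
# should no longer fail with ModuleNotFoundError.
df = spark.read.format('bigquery').load('project_id.dataset_id.table_id')
print(df.columns)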

Answer 2:

As mentioned above, there are multiple issues with reading and writing BigNumeric values from BigQuery. At least for reading, the solution has already been mentioned in the README of the spark-bigquery-connector.

Here is the link: https://github.com/GoogleCloudDataproc/spark-bigquery-connector#bignumeric-support

Also, please find the code block for the solution below. If the code throws ModuleNotFoundError, add the following code before reading the BigNumeric data:

try:
    # Prefer setuptools' namespace-package support when pkg_resources is available
    import pkg_resources

    pkg_resources.declare_namespace(__name__)
except ImportError:
    # Fall back to pkgutil-style namespace-package handling
    import pkgutil

    __path__ = pkgutil.extend_path(__path__, __name__)

Also, please make sure that you have included the connector's jar in the cluster (using the connectors init action) or by using the --jars option. Also verify that gs://spark-lib/bigquery/spark-bigquery-support-0.26.0.zip is configured in spark.submit.pyfiles or add it in runtime

spark.sparkContext.addPyFile("gs://spark-lib/bigquery/spark-bigquery-support-0.26.0.zip")

Answer 3:

Starting with version 0.31.1 of the spark-bigquery-connector, BIGNUMERIC values are converted directly to Spark decimals, so the --py-files workaround is no longer needed.
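
As a rough illustration, assuming the job is launched with a connector jar at or above that version (the jar path below is only an example and may need adjusting for your Scala version), the read from the original question should then work without any extra support zip:

# Assumes a connector jar >= 0.31.1 was passed via --jars, for example:
#   gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.31.1.jar
df = spark.read.format('bigquery').load('project_id.dataset_id.table_id')

df.printSchema()   # the BIGNUMERIC column should now appear as a Spark decimal type
print(df.columns)  # no ModuleNotFoundError expected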