I have a Dataproc Spark job which reads data from a BigQuery table. The BigQuery table has a column of type BIGNUMERIC. Spark is able to read from the table successfully, but the problem arises when I try to get the column names from the Spark DataFrame, i.e. while executing the code below:
df = spark.read.format('bigquery').load('project_id.dataset_id.table_id')
columns = df.columns
print(f'*********Columns - {columns}**********')
df.show()
df.printSchema()
The error I get is as below:
columns = df.columns()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 939, in columns
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 256, in schema
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 871, in _parse_datatype_json_string
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 888, in _parse_datatype_json_value
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 577, in fromJson
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 577, in <listcomp>
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 434, in fromJson
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 890, in _parse_datatype_json_value
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 736, in fromJson
ModuleNotFoundError: No module named 'google.cloud.spark'
But if I omit the df.columns line and only execute show() and printSchema(), it works fine. The DataFrame schema from printSchema() is as below:
root
|-- col1: string (nullable = true)
|-- col2: bignumeric (nullable = true)
I have used the Spark BigQuery connector to read from BigQuery. Any help or possible solution is highly appreciated. Happy to provide any additional details if needed.
There is an issue discussed in the spark-bigquery-connector GitHub repo that looks to match this question.
Basically, BigNumeric support is provided through a Spark UserDefinedType. When launching your PySpark job, the required Python class files need to be provided on the command line via --py-files:
pyspark --jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.11-0.29.0.jar --py-files gs://spark-lib/bigquery/spark-bigquery-support-0.29.0.zip --files gs://spark-lib/bigquery/spark-bigquery-support-0.29.0.zip
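Since this is a Dataproc job, the same artifacts can be passed when submitting through gcloud; a sketch, assuming the cluster name, region, and script name are placeholders and the connector versions match the ones above:

gcloud dataproc jobs submit pyspark my_job.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.11-0.29.0.jar \
    --py-files=gs://spark-lib/bigquery/spark-bigquery-support-0.29.0.zip \
    --files=gs://spark-lib/bigquery/spark-bigquery-support-0.29.0.zip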
or at runtime via spark.sparkContext.addPyFile, as sketched below.
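A minimal sketch of the runtime approach, assuming the same connector jar and support zip as above (the project/dataset/table IDs are placeholders):

from pyspark.sql import SparkSession

# Build a session with the BigQuery connector jar on the classpath
spark = SparkSession.builder \
    .appName('bignumeric-read') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.11-0.29.0.jar') \
    .getOrCreate()

# Register the Python support classes for the BigNumeric UserDefinedType
# before the DataFrame schema is parsed on the driver
spark.sparkContext.addPyFile('gs://spark-lib/bigquery/spark-bigquery-support-0.29.0.zip')

df = spark.read.format('bigquery').load('project_id.dataset_id.table_id')
print(df.columns)   # should no longer raise ModuleNotFoundError
df.printSchema()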