How to write a dataframe column with json data (STRING type) to BigQuery table as JSON type using pyspark?


I have a PySpark DataFrame with a column containing a JSON string (the column type is string). I would like to write this DataFrame to a BigQuery table with the column type as JSON. I found the following information at https://github.com/GoogleCloudDataproc/spark-bigquery-connector:

Spark has no JSON type. The values are read as String. In order to write JSON back to BigQuery, the following conditions are REQUIRED:

  • Use the INDIRECT write method
  • Use the AVRO intermediate format
  • The DataFrame field MUST be of type String and has an entry of sqlType=JSON in its metadata

I am not sure how to set an entry of sqlType=JSON in the DataFrame field metadata. Can someone please help?

I am using the code below to write the DataFrame to the BigQuery table:

df.write \
  .format("bigquery") \
  .option("temporaryGcsBucket","some-bucket") \
  .save("dataset.table")

1 Answer


Have a look at the DataFrame.withMetadata method, which lets you attach metadata to a column:

pyspark.sql.DataFrame.withMetadata

df_meta = df.withMetadata('age', {'foo': 'bar'})
df_meta.schema['age'].metadata
# {'foo': 'bar'}
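
Combining that with the connector's requirements, a minimal sketch might look like the following (assuming PySpark 3.3+ for withMetadata, a hypothetical JSON string column named payload, and the dataset.table / some-bucket names from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-bigquery").getOrCreate()

# Hypothetical DataFrame with a JSON string column named "payload"
df = spark.createDataFrame([('{"a": 1}',)], ["payload"])

# Tag the column with sqlType=JSON so the connector can write it as BigQuery JSON
df_json = df.withMetadata("payload", {"sqlType": "JSON"})

# Use the INDIRECT write method with the AVRO intermediate format, as required
df_json.write \
    .format("bigquery") \
    .option("writeMethod", "indirect") \
    .option("intermediateFormat", "avro") \
    .option("temporaryGcsBucket", "some-bucket") \
    .save("dataset.table")

On PySpark versions before 3.3, where withMetadata is not available, the same metadata can be attached via an alias, e.g. df.withColumn("payload", col("payload").alias("payload", metadata={"sqlType": "JSON"})).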