Google Cloud Datastream failing with reason code: BIGQUERY_UNSUPPORTED_PRIMARY_KEY_CHANGE


I am getting an error when trying to partition the destination table in BigQuery while working with Datastream.

Steps to reproduce:

  1. Start a Datastream stream from Cloud SQL (MySQL) to BigQuery
  2. Once the stream has replicated all tables to BigQuery, pause the stream
  3. Partition one of the tables (see the sketch after this list)
  4. Resume the stream
  5. The following error is logged:
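
For context on step 3: a common way to partition an existing table is to rebuild it from a query and swap it in for the original. Below is a hypothetical sketch of that step (the column created_at, the suffixed table name, and the exact method used are assumptions, not details from the question). A table rebuilt this way loses its unenforced primary key, which is what the stream trips over on resume:

# Hypothetical reconstruction of step 3: rebuild the destination table with
# partitioning. A table created from a plain SELECT carries no primary key
# constraint and no Datastream change-ordering metadata.
CREATE TABLE `Test_Membership_1.internal_Membership_p`
PARTITION BY DATE(created_at)  # assumed timestamp column
AS SELECT * FROM `Test_Membership_1.internal_Membership`;

# Swap the rebuilt table in place of the original.
DROP TABLE `Test_Membership_1.internal_Membership`;
ALTER TABLE `Test_Membership_1.internal_Membership_p` RENAME TO internal_Membership;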

====================================================

Discarded 97 unsupported events for BigQuery destination: 833537404433.Test_Membership_1.internal_Membership, with reason code: BIGQUERY_UNSUPPORTED_PRIMARY_KEY_CHANGE, details: Failed to write to BigQuery due to an unsupported primary key change: adding primary keys to existing tables is not supported..

{
  insertId: "65ad79ec-0000-24c7-a66e-14223bbf970a@a1"
  jsonPayload: {
    context: "CDC"
    event_code: "UNSUPPORTED_EVENTS_DISCARDED"
    message: "Discarded 97 unsupported events for BigQuery destination: 833537404433.Test_Membership_1.internal_Membership, with reason code: BIGQUERY_UNSUPPORTED_PRIMARY_KEY_CHANGE, details: Failed to write to BigQuery due to an unsupported primary key change: adding primary keys to existing tables is not supported.."
    read_method: ""
  }
  logName: "projects/gcp-everwash-wh-dw/logs/datastream.googleapis.com%2Fstream_activity"
  receiveTimestamp: "2022-11-22T22:08:38.620495835Z"
  resource: {...}
  severity: "WARNING"
  timestamp: "2022-11-22T22:08:37.726075Z"
}


What I expected to happen:

I expected to be able to partition certain tables that Datastream is replicating into BigQuery.


1 Answer


Adding partitioning to an existing BigQuery table is not supported; partitioning has to be defined on a net-new table. You can create a new partitioned table from the result of a query, as mentioned in this document, but that approach won't work for existing Datastream-sourced tables: the query result would not carry the _CHANGE_SEQUENCE_NUMBER field that is required to apply UPSERT operations in the correct order. So the only option is to pre-create the table with the desired partitioning/clustering/primary keys before starting the Datastream stream, as in the DDL sample below.

CREATE TABLE `project.dataset.new_table`
 (
   `Primary_key_field` INT64 PRIMARY KEY NOT ENFORCED,
   `time_field` TIMESTAMP,
   `field1` STRING
   # Example fields only; add the remaining columns of the base table here.
 )
PARTITION BY
 DATE(time_field)
CLUSTER BY
 Primary_key_field  # This must be an exact match of the specified primary key fields
OPTIONS (max_staleness = INTERVAL 15 MINUTE);  # Or whatever the desired max_staleness value is
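
If the source table has a composite primary key, BigQuery's table-level constraint form can be used instead of the inline column constraint. A sketch with placeholder column names (not from the original question):

CREATE TABLE `project.dataset.new_table`
 (
   `key_part1` INT64,
   `key_part2` STRING,
   `time_field` TIMESTAMP,
   # Table-level form of the unenforced primary key constraint.
   PRIMARY KEY (key_part1, key_part2) NOT ENFORCED
 )
PARTITION BY
 DATE(time_field)
CLUSTER BY
 key_part1, key_part2  # Again, must match the primary key fields
OPTIONS (max_staleness = INTERVAL 15 MINUTE);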

For more information, you can check this issue tracker.
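
As a sanity check before starting the stream, you can confirm that the pre-created table actually carries the unenforced primary key by querying INFORMATION_SCHEMA.TABLE_CONSTRAINTS (using the placeholder names from the DDL above):

# Confirm the pre-created table declares an (unenforced) primary key
# before the Datastream stream is started.
SELECT
  table_name,
  constraint_type,
  enforced
FROM `project.dataset`.INFORMATION_SCHEMA.TABLE_CONSTRAINTS
WHERE table_name = 'new_table';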