The Google-provided Dataflow streaming template for data masking/tokenization from Cloud Storage to BigQuery using Cloud DLP is giving inconsistent output for each source file.
We have roughly 100 files with 1M records each in a GCS bucket, and we are calling the Dataflow streaming template to tokenize the data using DLP and load it into BigQuery.
While loading the files sequentially, we saw that the results are inconsistent.
For a few files the full 1M rows were loaded, but for most of them the loaded row count varied between 0.98M and 0.99M. Is there any reason for such behaviour?
I am not sure, but it may be due to the BigQuery best-effort deduplication mechanism used for streaming data to BigQuery.

From the Beam documentation:

> Note: Streaming inserts by default enables BigQuery best-effort deduplication mechanism. You can disable that by setting ignoreInsertIds. The quota limitations are different when deduplication is enabled vs. disabled.

The Google Cloud documentation describes the different quota limits for each case.

This mechanism can be disabled with `ignoreInsertIds`. You can test with this mechanism disabled and check whether all the rows are inserted.
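For reference, here is a minimal sketch of how insert IDs could be disabled in a custom Beam (Java) pipeline that writes with streaming inserts. This is not the Google-provided template itself (the template would need to expose such an option through its own parameters), and the project, dataset, table name, and input rows below are placeholders:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class IgnoreInsertIdsExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder input; in a real pipeline this would be the tokenized rows.
    PCollection<TableRow> rows =
        p.apply(
            Create.of(new TableRow().set("id", "1").set("value", "example"))
                .withCoder(TableRowJsonCoder.of()));

    rows.apply(
        "WriteToBigQuery",
        BigQueryIO.writeTableRows()
            // Hypothetical table spec; replace with your project:dataset.table.
            .to("my-project:my_dataset.my_table")
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
            // Drops the per-row insert IDs, which disables BigQuery's
            // best-effort streaming deduplication.
            .ignoreInsertIds()
            // Table is assumed to already exist, so no schema is needed here.
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```

Keep in mind that with insert IDs disabled, BigQuery no longer deduplicates retried streaming inserts, so duplicate rows become possible.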