The Google-provided Dataflow streaming template for data masking/tokenization from Cloud Storage to BigQuery using Cloud DLP is giving inconsistent output across source files.
We have around 100 files with 1M records each in a GCS bucket, and we are calling the Dataflow streaming template to tokenize the data using DLP and load it into BigQuery.
While loading the files sequentially, we saw that the results are inconsistent.
For a few files the full 1M records got loaded, but for most of them the loaded row count varied between 0.98M and 0.99M. Is there any reason for such behaviour?
After adjusting the value of the batch size in the template, all files of 1M records each got loaded successfully.
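
For anyone trying to reproduce this, here is a minimal sketch of how the per-file shortfall described above could be checked once the loaded row counts are known. The file names and counts are hypothetical and only for illustration, and `find_short_loads` is a helper made up for this example, not part of the template or any Google library.

```python
# Each source file is expected to contain 1M records.
EXPECTED_ROWS = 1_000_000


def find_short_loads(loaded_counts, expected=EXPECTED_ROWS):
    """Return {file_name: missing_rows} for files that loaded fewer rows than expected."""
    return {
        name: expected - count
        for name, count in loaded_counts.items()
        if count < expected
    }


# Hypothetical counts, e.g. collected from a per-file row count query in BigQuery.
loaded = {
    "file_001.csv": 1_000_000,
    "file_002.csv": 985_000,
    "file_003.csv": 990_000,
}

for name, missing in sorted(find_short_loads(loaded).items()):
    print(f"{name}: {missing} rows missing ({missing / EXPECTED_ROWS:.1%})")
```

Running the same check before and after changing the batch size makes it easy to confirm whether every file reaches the full 1M rows.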