We have a PubSub topic with events sinking into BigQuery (though the particular DB is almost irrelevant here). Events can arrive with new, previously unseen properties that should eventually end up as separate BigQuery columns.
So, basically I have two questions here:
- What is the right way to maintain global state within a Pipeline (in my case, the set of properties encountered so far)?
- What would be a good strategy for buffering/holding the stream of events from the moment a new property is encountered until the `ALTER TABLE` has been executed?
Right now I've tried the following (I'm using Spotify's scio):
```scala
rows
  .withFixedWindows(Duration.millis(duration))
  .withWindow[IntervalWindow]
  .swap                        // key each row by its window
  .groupByKey                  // collect all rows of a window together
  .map { case (window, rowsIterable) =>
    val newRows = findNewProperties(rowsIterable)
    mutateTableWith(newRows)   // ALTER TABLE before the rows are written
    rowsIterable
  }
  .flatMap(identity)           // flatten back to individual rows
  .saveAsBigQuery()
```
But this is terribly inefficient: at the very least we need to load the whole `rowsIterable` into memory and traverse it.
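The `groupByKey` above can often be avoided by checking each element individually against the set of already-known columns instead of materializing whole windows. A minimal, self-contained sketch of that per-row check (the `Row` model as a flat `Map` and the function names are hypothetical, not from the pipeline above):

```scala
// Model a row as a flat map of property name -> value (hypothetical).
type Row = Map[String, Any]

// Return the properties of `row` that do not yet have a BigQuery column.
def newProperties(known: Set[String], row: Row): Set[String] =
  row.keySet.diff(known)

// Example: with columns (id, name) known, a row carrying `country`
// reports exactly one new property.
val pending: Set[String] =
  newProperties(Set("id", "name"),
    Map("id" -> 1, "name" -> "a", "country" -> "PL"))
```

Only rows with a non-empty result would need to be held back (or routed to a side output) until the `ALTER TABLE` completes; everything else can stream through untouched.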
We're building the very same project and are following this approach with a refreshing side input containing the schemas (refreshed at intervals from BigQuery). So basically:
I have an example of a job using that refreshing side input approach here
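A rough sketch of what such a refreshing side input could look like in scio, built on Beam's `GenerateSequence` tick pattern. This is an assumption about the shape of the approach, not the linked job itself; `fetchSchemaFromBQ` and the 5-minute refresh interval are hypothetical:

```scala
import com.spotify.scio.ScioContext
import com.spotify.scio.values.SideInput
import org.apache.beam.sdk.io.GenerateSequence
import org.apache.beam.sdk.transforms.windowing.{AfterProcessingTime, GlobalWindows, Repeatedly, Window}
import org.joda.time.Duration

// Sketch: a side input that re-reads the table schema every few minutes.
// `fetchSchemaFromBQ()` is a hypothetical helper returning the current
// set of column names (e.g. via the BigQuery tables.get API).
def schemaSideInput(sc: ScioContext): SideInput[Set[String]] =
  sc.customInput("SchemaTicks",
      GenerateSequence.from(0).withRate(1, Duration.standardMinutes(5)))
    .applyTransform(
      Window.into[java.lang.Long](new GlobalWindows())
        .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes())
    .map(_ => fetchSchemaFromBQ()) // re-read the schema on every tick
    .asSingletonSideInput
```

The main stream would then consult the side input per element, e.g. via `rows.withSideInputs(schemaSideInput(sc))`, writing rows whose properties are all known straight to BigQuery and routing the rest for schema mutation.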