For development purposes, I'd like to cache the results of BigQuery queries made by the `beam.io.ReadFromBigQuery` connector, so that I can load them quickly from the local file system the next time the exact same query runs.
The problem is that I cannot run any PTransform before `beam.io.ReadFromBigQuery` to check whether a cache exists and, if it does, skip reading from BigQuery.
So far I've come up with two possible solutions:
- Writing a custom `beam.DoFn` for reading from BigQuery. It would include the caching mechanism, but might underperform compared to the existing connector. A variation would be to inherit from the existing connector, but that requires "under the hood" knowledge of Beam, which might be overwhelming.
- Implementing the caching while building the pipeline, so that the resulting step is chosen according to whether the cache exists (`apache_beam.io.textio.ReadAllFromText` if it does, `beam.io.ReadFromBigQuery` if it doesn't).
I found out that preceding `beam.io.ReadFromBigQuery` with any other PTransform is impossible by design. Unfortunately this is currently not well reflected in the Python SDK docs, but according to Apache Beam's official Java documentation, "a root PTransform conceptually has no input"; and since the equivalent Java transform, `BigQueryIO.Read`, takes `PBegin` (the pipeline start) as its input type, that puts the seal on preceding it with anything else.

However, I found a workaround that resembles the second approach I suggested in the question: implementing a `beam.PTransform` (not a `beam.DoFn`) that dynamically returns the appropriate transform when the pipeline is built, according to whether the cache exists. It looks like this:
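(A minimal sketch of the idea: the class name `ReadFromBigQueryWithCache`, the shard-file cache layout, and the JSON-lines serialization are illustrative choices of mine, and it assumes the query results are JSON-serializable.)

```python
import json

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class ReadFromBigQueryWithCache(beam.PTransform):
    """Returns cached rows when a cache exists for this query;
    otherwise reads from BigQuery and writes the cache as a side effect.

    The cache check runs at pipeline-construction time, which is what
    lets us swap in a different root transform.
    """

    def __init__(self, query, cache_prefix):
        super().__init__()
        self.query = query
        # Cached rows are stored as WriteToText shards, e.g.
        # '<cache_prefix>-00000-of-00001', one JSON object per line.
        self.cache_prefix = cache_prefix

    def _cache_exists(self):
        # FileSystems.match returns one MatchResult per pattern; a
        # non-empty metadata_list means cached shards are on disk.
        match = FileSystems.match([self.cache_prefix + '*'])[0]
        return len(match.metadata_list) > 0

    def expand(self, pbegin):
        if self._cache_exists():
            # Cache hit: feed the shard pattern to ReadAllFromText and
            # deserialize each line back into a row dict.
            return (pbegin
                    | 'CreatePattern' >> beam.Create([self.cache_prefix + '*'])
                    | 'ReadFromCache' >> beam.io.ReadAllFromText()
                    | 'Deserialize' >> beam.Map(json.loads))
        # Cache miss: read from BigQuery and persist the rows. Assumes
        # the row values are JSON-serializable (no datetime/Decimal
        # columns, or a custom encoder would be needed).
        rows = pbegin | 'ReadFromBigQuery' >> beam.io.ReadFromBigQuery(
            query=self.query, use_standard_sql=True)
        _ = (rows
             | 'Serialize' >> beam.Map(json.dumps)
             | 'WriteToCache' >> beam.io.WriteToText(self.cache_prefix))
        return rows
```

Because the branch is chosen inside `expand`, the check happens exactly once, on the machine that builds the pipeline, which is what you want for local development:

```python
with beam.Pipeline() as pipeline:
    rows = pipeline | ReadFromBigQueryWithCache(
        'SELECT name FROM my_dataset.my_table',  # hypothetical query
        '/tmp/bq_cache/my_query')
```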