Google Cloud Dataflow consume external source

924 Views Asked by At

So I am having a bit of a issue with the concepts behind Dataflow. Especially regarding the way the pipelines are supposed to be structured.

I am trying to consume an external API that delivers an index XML file with links to separate XML files. Once I have the contents of all the XML files I need to split those up into separate PCollections so additional PTransforms can be done.

It is hard to wrap my head around the fact that the first xml file needs to be downloaded and read, before the product XML's can be downloaded and read. As the documentation states that a pipeline starts with a Source and ends with a Sink.

So my questions are:

  • Is Dataflow even the right tool for this kind of task?
  • Is a custom Source meant to incorporate this whole process, or is it supposed to be done in separate steps/pipelines?
  • Is it ok to handle this in a pipeline and let another pipeline read the files?
  • How would a high-level overview of this process look like?

Things to note: I am using the Python SDK for this, but that probably isn't really relevant as this is more a architectural problem.

1

There are 1 best solutions below

1
On BEST ANSWER

Yes, this can absolutely be done. Right now, it's a little klutzy at the beginning, but upcoming work on a new primitive called SplittableDoFn should make this pattern much easier in the future.

  1. Start by using Create to make a dummy PCollection with a single element.
  2. Process that PCollection with a DoFn that downloads the file, reads out the subfiles, and emits those.
  3. [Optional] At this point, you'll likely want work to proceed in parallel. To allow the system to easily parallelize, you'll want to do a semantically unnecessary GroupByKey followed by a ParDo to 'undo' the grouping. This materializes these filenames into temporary storage, allowing the system to have different workers process each element.
  4. Process each subfile by reading its contents and emit into PCollections. If you want different file contents to be processed differently, use Partition to sort them into different PCollections.
  5. Do the relevant processing.