So I am having a bit of an issue with the concepts behind Dataflow, especially regarding the way pipelines are supposed to be structured.
I am trying to consume an external API that delivers an index XML file with links to separate XML files. Once I have the contents of all those XML files, I need to split them up into separate PCollections so that additional PTransforms can be applied.
It is hard to wrap my head around the fact that the index XML file needs to be downloaded and read before the product XMLs can be downloaded and read, since the documentation states that a pipeline starts with a Source and ends with a Sink.
So my questions are:
- Is Dataflow even the right tool for this kind of task?
- Is a custom Source meant to incorporate this whole process, or is it supposed to be done in separate steps/pipelines?
- Is it OK to handle this in one pipeline and let another pipeline read the resulting files?
- What would a high-level overview of this process look like?
Things to note: I am using the Python SDK for this, but that probably isn't really relevant, as this is more of an architectural problem.
Yes, this can absolutely be done. Right now, it's a little klutzy at the beginning, but upcoming work on a new primitive called SplittableDoFn should make this pattern much easier in the future.
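A minimal sketch of that shape with the current Python SDK might look like the following. The index URL, the `<link>`/`<product>` element names, and the `fetch`, `extract_product_links`, and `parse_products` helpers are placeholders I made up for illustration; the key idea is simply to seed the pipeline with `beam.Create` and use `FlatMap` steps to fan out from the index file to the product files.

```python
import urllib.request
import xml.etree.ElementTree as ET

import apache_beam as beam


def fetch(url):
    """Download the contents of a URL as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def extract_product_links(index_xml):
    """Yield the product-file URLs referenced in the index XML.

    Assumes the index looks like <index><link>http://...</link>...</index>;
    adjust the parsing to the real schema.
    """
    root = ET.fromstring(index_xml)
    for link in root.iter("link"):
        yield link.text


def parse_products(product_xml):
    """Yield one dict per <product> element in a downloaded product file."""
    root = ET.fromstring(product_xml)
    for product in root.iter("product"):
        yield {child.tag: child.text for child in product}


with beam.Pipeline() as pipeline:
    products = (
        pipeline
        # Seed the pipeline with the single index URL instead of a file Source.
        | "IndexUrl" >> beam.Create(["http://example.com/index.xml"])
        # Download the index, then fan out to one element per product-file URL.
        | "FetchIndex" >> beam.Map(fetch)
        | "ExtractLinks" >> beam.FlatMap(extract_product_links)
        # Download each product file, then fan out to one element per product.
        | "FetchProducts" >> beam.Map(fetch)
        | "ParseProducts" >> beam.FlatMap(parse_products)
    )
    # Further PTransforms (filtering, grouping, writing to a Sink) go here.
```

Because `Create` plus a couple of `FlatMap` steps stands in for a custom Source, everything stays in a single pipeline, and the downstream PTransforms just see an ordinary PCollection of parsed products.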