I am wondering whether Google Dataflow can do something that is equivalent of like SQL
SELECT * FROM A INNER JOIN B ON A.a = B.b **LIMIT 1000**
I know that Dataflow has very standard programming paradigm to do join. However, the part I am interested in. is this LIMIT 1000
. Since I don't need all of the joined result but only any 1000 of them. I am wondering whether I can utilize this use case to speed up my job (assuming the join are between very expansive tables and will produce very large result on a fully join)
So I assume that a very naive way to achieve the above SQL result is some template code as follows:
PCollection A = ...
PCollection B = ...
PCollection result = KeyedPCollectionTuple.of(ATag, A).and(BTag, B)
.apply(CoGroupByKey.create())
.apply(ParDo.of(new DoFn<KV<...,CoGbkResult>, ...>() {
})
.apply(Sample.any(1000))
However my concern is that how is this Sample
transformation hooking up with ParDo
internally handled by dataflow. Will dataflow able to optimize in the way that it will stop processing join as long as it know it will definitely have enough output? Or there is simply no optimization in this use case that dataflow will just compute the full join result and then select 1000 from the result? (In this way, Sample
transform is will only be an overhead)
Or long question short, it is possible for me to utilize this use case to do partial join in dataflow?
EDIT:
Or in essentially, I am wondering does Sample.any()
transform will able to hint any optimization to upstream PCollection? For example if I do
pipeline.apply(TextTO.Read.from("gs://path/to/my/file*"))
.apply(Sample.any(N))
Will dataflow first load all data in and then select N or will it able to take advantage of Sample.any()
and do some optimization and prune out some useless read.
Currently neither Cloud Dataflow, nor any of the other Apache Beam runners (as far as I'm aware) implement such an optimization.