I'm trying to find connected components using BigQuery recursive SQL and I have solved it almost, but not getting the expected result I wanted.
For example
SELECT 1 AS from_node, 2 AS to_node UNION ALL
SELECT 1, 3 UNION ALL
SELECT 2, 3 UNION ALL
SELECT 3, 4 UNION ALL
SELECT 5, 6 UNION ALL
SELECT 4, 7 UNION ALL
SELECT 6, 8 UNION ALL
SELECT 5, 9 UNION ALL
SELECT 6, 10 UNION ALL
SELECT 7, 1 UNION ALL
SELECT 1, 14
If you see I have a closed loop - 1 -> 3 -> 4 -> 7 -> 1
I have written this query
WITH RECURSIVE
GraphData AS (
SELECT 1 AS from_node, 2 AS to_node UNION ALL
SELECT 1, 3 UNION ALL
SELECT 2, 3 UNION ALL
SELECT 3, 4 UNION ALL
SELECT 5, 6 UNION ALL
SELECT 4, 7 UNION ALL
SELECT 6, 8 UNION ALL
SELECT 5, 9 UNION ALL
SELECT 6, 10 UNION ALL
SELECT 7, 1 UNION ALL
SELECT 1, 14
),
R AS (
select from_node, to_node, from_node as cluster_id
from GraphData
union all
select b.to_node, a.from_node,
case when b.to_node = a.from_node then b.cluster_id else a.from_node end as cluster_id
from GraphData as a
join
R as b
on b.from_node = a.to_node
and b.cluster_id <> b.from_node
)
SELECT cluster_id, array_agg(to_node) as clusters from R
group by cluster_id
If you see the results, I have a cluster_id 5 ID which is related to cluster [6, 9], and again I have cluster_id 6 which is related to [10, 8], I wanted to assign all these related IDs to the minimum ID root link that I have. That means [6, 9, 10, 8] ID should get 5 as the ID. and 6 should be removed from the cluster_id column.
Expected results -
[
{
"cluster_id": 1
"clusters": [3, 14, 2, 4, 7]
},
{
"cluster_id": 5
"clusters": [6, 9, 10, 8]
},
]
parent of each cluster is the minimum ID among all related IDs. The above-expected results are as per my assumptions. order of actual results can be different. But this can give us an idea. It is the same as BFS.
Can anyone suggest a fix to my logical error? Or if this isn't possible to do using BigQuery as of now?
