finding connected components using BigQuery SQL

56 Views Asked by At

I'm trying to find connected components using BigQuery recursive SQL and I have solved it almost, but not getting the expected result I wanted.

For example

    SELECT 1 AS from_node, 2 AS to_node UNION ALL
    SELECT 1, 3 UNION ALL
    SELECT 2, 3 UNION ALL
    SELECT 3, 4 UNION ALL
    SELECT 5, 6 UNION ALL
    SELECT 4, 7 UNION ALL
    SELECT 6, 8 UNION ALL
    SELECT 5, 9 UNION ALL
    SELECT 6, 10 UNION ALL
    SELECT 7, 1 UNION ALL
    SELECT 1, 14

If you see I have a closed loop - 1 -> 3 -> 4 -> 7 -> 1

I have written this query

WITH RECURSIVE
  GraphData AS (
    SELECT 1 AS from_node, 2 AS to_node UNION ALL
    SELECT 1, 3 UNION ALL
    SELECT 2, 3 UNION ALL
    SELECT 3, 4 UNION ALL
    SELECT 5, 6 UNION ALL
    SELECT 4, 7 UNION ALL
    SELECT 6, 8 UNION ALL
    SELECT 5, 9 UNION ALL
    SELECT 6, 10 UNION ALL
    SELECT 7, 1 UNION ALL
    SELECT 1, 14
  ),
  R AS (
    select from_node, to_node, from_node as cluster_id
    from GraphData
    union all
    select b.to_node, a.from_node, 
    case when b.to_node = a.from_node then b.cluster_id else a.from_node end as cluster_id
    from GraphData as a
    join 
    R as b
    on b.from_node = a.to_node
    and b.cluster_id <> b.from_node
  )
SELECT cluster_id, array_agg(to_node) as clusters from R
group by cluster_id

If you see the results, I have a cluster_id 5 ID which is related to cluster [6, 9], and again I have cluster_id 6 which is related to [10, 8], I wanted to assign all these related IDs to the minimum ID root link that I have. That means [6, 9, 10, 8] ID should get 5 as the ID. and 6 should be removed from the cluster_id column.

enter image description here

Expected results -

[
  {
    "cluster_id": 1
    "clusters": [3, 14, 2, 4, 7]
  },
  {
    "cluster_id": 5
    "clusters": [6, 9, 10, 8]
  },
]

parent of each cluster is the minimum ID among all related IDs. The above-expected results are as per my assumptions. order of actual results can be different. But this can give us an idea. It is the same as BFS.

Can anyone suggest a fix to my logical error? Or if this isn't possible to do using BigQuery as of now?

0

There are 0 best solutions below