Cluster rectangles on a grid


I'm trying to cluster web page content based on visual proximity. A visual display of the blocks is available at the link below: https://i.stack.imgur.com/qzGKE.png

I tried DBSCAN clustering with scikit-learn using the features below, without much success:

- left X coordinate of the block (because content is frequently left-aligned)
- right X coordinate of the block (because content is frequently right-aligned)
- top Y coordinate of the block (to keep vertically close blocks together)

Do you have any ideas for better features?


There is 1 solution below


Have a look at Generalized DBSCAN (not available in scipy, though).

How about clustering objects together when they overlap or almost overlap (by 1 pixel)?
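As a rough sketch of such an "overlap or almost overlap" test, assuming each block is an axis-aligned rectangle given by its pixel edges: the `Rect` type, the `gap` and `near_overlap` helpers, and the `margin` parameter are hypothetical names made up for illustration, not part of any library.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    # Hypothetical block representation: pixel coordinates of the four edges.
    left: float
    top: float
    right: float
    bottom: float


def gap(a: Rect, b: Rect) -> float:
    """Largest per-axis gap between two rectangles (0 if they touch or overlap)."""
    dx = max(a.left - b.right, b.left - a.right, 0)
    dy = max(a.top - b.bottom, b.top - a.bottom, 0)
    return max(dx, dy)


def near_overlap(a: Rect, b: Rect, margin: float = 1) -> bool:
    """True when the rectangles overlap or are at most `margin` pixels apart."""
    return gap(a, b) <= margin


a = Rect(0, 0, 10, 10)
print(near_overlap(a, Rect(5, 5, 15, 15)))   # overlapping blocks
print(near_overlap(a, Rect(11, 0, 20, 10)))  # 1 pixel apart
print(near_overlap(a, Rect(13, 0, 20, 10)))  # 3 pixels apart
```

Raising `margin` merges blocks separated by wider gutters, so it plays roughly the role that `eps` plays for point data.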

Note that DBSCAN doesn't really use the distance itself. It is based only on a binary "is close enough to" decision.

Also note that DBSCAN is not restricted to vectors. DBSCAN can work with anything for which you can define a "similar enough" predicate.

So you might not need to "extract features" at all. Instead, consider when you want two objects to be in the same cluster.
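To illustrate the predicate-based view, here is a minimal pure-Python sketch of a DBSCAN-style loop that takes any "similar enough" function instead of feature vectors. It is not the Generalized DBSCAN implementation itself; `dbscan_by_predicate`, `touches`, `min_pts`, and the sample blocks are hypothetical names and data chosen for the example, and the all-pairs neighbor search is quadratic, which is probably acceptable for the number of blocks on one page.

```python
from collections import deque


def dbscan_by_predicate(items, is_neighbor, min_pts=1):
    """Label items 0, 1, 2, ... per cluster, or -1 for noise."""
    n = len(items)
    # Neighborhoods include the item itself, as in standard DBSCAN.
    neighbors = [
        [j for j in range(n) if i == j or is_neighbor(items[i], items[j])]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # noise for now; may become a border item later
            continue
        labels[i] = cluster
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core item: border
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # only core items expand the cluster
        cluster += 1
    return labels


def touches(a, b, margin=1):
    # Blocks are (left, top, right, bottom) tuples; True if they overlap
    # or are at most `margin` pixels apart on both axes.
    return (a[0] <= b[2] + margin and b[0] <= a[2] + margin
            and a[1] <= b[3] + margin and b[1] <= a[3] + margin)


blocks = [(0, 0, 100, 20), (0, 21, 100, 40), (0, 41, 60, 60),
          (300, 0, 400, 20), (300, 200, 400, 220)]
labels = dbscan_by_predicate(blocks, touches, min_pts=1)
print(labels)  # -> [0, 0, 0, 1, 2]
```

With `min_pts=1` every block is a core item, so the result is simply the connected components of the "touches" graph. With `min_pts=2`, the two isolated blocks would be labeled `-1` (noise) instead of forming singleton clusters.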