DBSCAN in hadoop

1.6k Views Asked by Girjesh At 16 December 2013 at 12:12

Actually I don't know what should be the key and value for map() and what should be the input format and output format. If I read one point at a time by map() then how the neighbors can be computed using one point because remaining points are not read yet.

Original Q&A

There are 2 best solutions below

Has QUIT--Anony-Mousse On 16 December 2013 at 14:01

DBSCAN is not an embarrassingly parallel algorithm.

It will not be trivial to represent it as map-reduce.

You will need to either employ some hacks (such as partitioning your data, and mapping each value to the partition), or completely redesign the algorithm.

There are a number of articles on parallel DBSCAN. You will likely be able to run some of these in a map-reduce like framework, or at least on a custom (non-map-reduce) YARN engine.

Viktor Tóth On 21 March 2017 at 05:56

Check this paper out: https://www.researchgate.net/publication/261212964_A_new_scalable_parallel_DBSCAN_algorithm_using_the_disjoint-set_data_structure

The following is my solution, might be easier to understand than the solution in the paper:

First I would compute your distance matrix - that could be a sparse matrix containing only those distances, that are less than the DBSCAN epsilon parameter - find a way to implement it map-reduce.

You can map that distance matrix to multiple devices and cluster points. You realize that parallelized clustering in this case breaks up the input space and you get a cluster id in one instance that might correspond to another id at another instances.

To remedy that, gather all core points in a reduce step, then check each neighbor of every core point (map, doesn't have to be O(n^2), be clever about it). If you can find other core elements close, create an entry of the 2 cluster ids of the 2 neighboring cores; gather these id pairs (reduce). Using these pairs, derive the correct, global clustering ids.

The above description might sound a bit abstract, but it should give you the idea.

DBSCAN in hadoop

There are 2 best solutions below

Related Questions in HADOOP

Related Questions in MAPREDUCE

Related Questions in DATA-MINING

Related Questions in CLUSTER-ANALYSIS

Related Questions in DBSCAN

Trending Questions

Popular # Hahtags

Popular Questions