Algorithms for building a peer-to-peer search engine with a distributed database


I'm looking to build a distributed torrent search engine.

I'm aware of distributed hash tables (DHTs) for addressing nodes in peer-to-peer networks. I don't fully understand how each node acquires a globally unique ID though.

I'm not sure what algorithms and data structures I need to build the distributed database, though. It would obviously need a high degree of redundancy and be as efficiently searchable as possible.

What I really need is a pointer in the direction of some resources and preferably some code examples.

BEST ANSWER

I don't fully understand how each node acquires a globally unique ID though.

I'd say that's not really relevant to the title of your question, and it's implementation-specific anyway. But generally it's either done at random or based on a hash of the node's public IP plus some random sub-part, with some adjustments for subnets. Have a look at BitTorrent's secure node ID generation algorithm (BEP 42) for an example.
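To make the hash-of-IP-plus-random idea concrete, here's a minimal Python sketch. Note it is not the actual BEP 42 algorithm (which uses a CRC32C over a masked IP and only constrains a prefix of the ID); it only illustrates deriving a 160-bit node ID from the external IP and a random nonce:

    import hashlib
    import os
    import socket

    def generate_node_id(external_ip: str) -> bytes:
        # Simplified stand-in for BEP 42: hash the node's external
        # IPv4 address together with a random component to get a
        # 160-bit ID in the same keyspace as SHA-1 infohashes.
        ip_bytes = socket.inet_aton(external_ip)  # 4-byte packed IPv4
        nonce = os.urandom(4)                     # random sub-part
        return hashlib.sha1(ip_bytes + nonce).digest()

    node_id = generate_node_id("203.0.113.7")
    print(node_id.hex())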

I'm not sure what algorithms and data structures I need to build the distributed database.

This is a non-trivial topic that I don't think can be answered in a few paragraphs. At their base, DHTs do not allow enumeration of stored values or any complex operations coordinated across multiple nodes; a direct key-value lookup is all they do. To implement keyword search on top of that you'll have to do quite some algorithmic and language-processing gymnastics, and add extensions to the basic DHT protocol to accommodate these requirements.
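To see why this takes work, consider what the bare primitive looks like. A hypothetical sketch of an inverted index layered on plain get/put (the dht object and its methods are assumed here, not a real library):

    import hashlib

    def keyword_to_dht_key(word: str) -> bytes:
        # Map a normalized keyword onto the 160-bit DHT keyspace;
        # the nodes closest to this key store the posting list.
        return hashlib.sha1(word.lower().encode("utf-8")).digest()

    def publish(dht, infohash: bytes, title: str) -> None:
        # Announce the torrent under every word of its title.
        for word in title.lower().split():
            dht.put(keyword_to_dht_key(word), infohash)

    def search(dht, word: str) -> list:
        # A single-term lookup is the only query the DHT gives you;
        # everything beyond this (ranking, multi-term queries,
        # spam resistance) is up to your protocol extensions.
        return dht.get(keyword_to_dht_key(word))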

Here's an incomplete list of problems to solve:

  • uneven word distributions place more load on some parts of the DHT keyspace than others; this can be mitigated to some extent by nodes relocating themselves, by target-address failover, by widening the set of nodes responsible for a hot key, or simply by dropping extremely common words
  • performing union or intersection operations across multiple search terms; this can be done with Bloom filters to some extent (see the first sketch after this list)
  • segmenting scripts that lack whitespace into search terms, a problem that non-distributed indexing engines such as Lucene also have to solve; as far as I know they use character N-grams (see the second sketch after this list)
  • preventing popular content containing a particular word from drowning out all other results sharing that word
  • trust, i.e. preventing keyword-spamming attacks
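
To illustrate the Bloom-filter point from the list above: two nodes, each responsible for one search term, can exchange compact filters instead of full posting lists and AND them together to approximate an intersection. A self-contained sketch (filter size and hash count are arbitrary choices here):

    import hashlib

    class BloomFilter:
        def __init__(self, m: int = 1 << 16, k: int = 4):
            self.m, self.k = m, k
            self.bits = bytearray(m // 8)

        def _positions(self, item: bytes):
            # k hash positions derived from salted SHA-1 digests
            for i in range(self.k):
                h = hashlib.sha1(bytes([i]) + item).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, item: bytes) -> None:
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item: bytes) -> bool:
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))

        def intersect(self, other: "BloomFilter") -> "BloomFilter":
            # Bitwise AND approximates set intersection: no false
            # negatives, but occasional false positives remain.
            out = BloomFilter(self.m, self.k)
            out.bits = bytearray(x & y for x, y in zip(self.bits, other.bits))
            return out

    a, b = BloomFilter(), BloomFilter()
    a.add(b"infohash-1"); a.add(b"infohash-2")
    b.add(b"infohash-2"); b.add(b"infohash-3")
    both = a.intersect(b)
    print(b"infohash-2" in both)  # True
    print(b"infohash-1" in both)  # False (barring false positives)

The node holding term A's posting list would ship its filter to the node holding term B's list, which then returns only entries testing positive, trading bandwidth for a small false-positive rate.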
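And for the segmentation point: the usual trick is to index overlapping character N-grams instead of trying to find word boundaries, which is roughly what Lucene's CJK analysis does. A trivial sketch:

    def char_ngrams(text: str, n: int = 2) -> list:
        # Break a string with no word boundaries (e.g. CJK text)
        # into overlapping character n-grams for indexing.
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(char_ngrams("東京タワー"))  # ['東京', '京タ', 'タワ', 'ワー']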

I'm not sure a DHT is even the right approach here. I vaguely recall other overlay metrics based on the language / the keywords themselves, where nodes move through the keyspace to gravitate towards the words actually in use and thus provide network capacity where it's needed.

I recommend hitting Google Scholar and looking for keyword-search-related modifications or alternative overlays to DHTs.