Create document clustering based on the text of the document

In Elasticsearch, is it possible to group documents that share the most similar text, without giving an initial query to compare against?

I know it is possible to query and get MLT ("more like this") results, but is it possible to cluster the documents within an index according to the values of a field?

For instance:

document 1: The quick brown fox jumps over the lazy dog

document 2: Barcelona is a great city

document 3: The fast orange fox jumps over the lazy dog

document 4: Lotus loft Room - Bear Mountains Neighbourhood

document 5: I do not like to eat fish

document 6: "Lotus Loft" Condo From $160.00 CAD/night, sleeps up to 4

document 7: Lotus Loft

Now, perform some kind of aggregation that, without being given a search query, can group them as:

Group 1: document 1 and document 3

Group 2: document 2 

Group 3: document 4 and document 6 and document 7

Group 4: document 5

OR

Alternatively, please let me know of other ways to cluster documents, e.g. using Apache Spark, KNN, unsupervised learning, or any other algorithm that finds near-duplicate documents or clusters similar documents.

I just want to cluster my documents based on fields of my Elasticsearch documents such as country, city, latlng, property name, or description.

Basically, I want to know:

How can I cluster similar documents (e.g. JSON/CSV) or find duplicate documents using Python text analysis, unsupervised learning with KNN, PySpark with MLlib, or any other document clustering algorithm? Hints, open source projects, or other resource links are welcome. I just need some concrete examples or tutorials for this task.

There is 1 best solution below

derek.z On

Yes, it's possible. There is an Elasticsearch plugin based on Carrot2. The clustering plugin automatically groups similar "documents" together and assigns human-readable labels to these groups. It has 4 built-in clustering algorithms (3 free, 1 requiring a license). The plugin clusters the hits returned by a search request, so you can use a match_all query with a large enough size if you want to cluster all documents in an ES index.

Here is my ES 6.6.2 client code example for clustering in Python 3:

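import json
import requests

# Search endpoint added by the clustering plugin; b2c_index is the index to cluster.
REQUEST_URL = 'http://localhost:9200/b2c_index/_search_with_clusters'
HEADER = {'Content-Type': 'application/json; charset=utf-8'}

requestDict = {
  # Ordinary search request. Only the returned hits (here up to 100) are clustered.
  "search_request": {
    "_source": ["title", "content", "lang"],
    "query": {"match_all": {}},
    "size": 100
  },

  # No query terms to pass as a hint, because this is a match_all search.
  "query_hint": "",
  # Map the logical fields used for clustering to fields of each hit.
  "field_mapping": {
    "title": ["_source.title"],
    "content": ["_source.content"],
    "language": ["_source.lang"],
  }
}

resp = requests.post(REQUEST_URL, data=json.dumps(requestDict), headers=HEADER)
resp.raise_for_status()  # fail loudly on an HTTP error instead of printing an error body
print(resp.json())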
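
To read the groups out of the response rather than printing the whole JSON, something like the following may help. It is only a sketch: it assumes the response carries a top-level "clusters" list whose entries have a "label" and a "documents" list of hit IDs, which should be checked against the output of your plugin version. The helper name summarize_clusters and the sample data are made up for illustration.

```python
def summarize_clusters(response):
    """Return (label, document_ids) pairs from a clustering response dict.

    Assumed layout (verify against your plugin version):
    {"clusters": [{"label": "...", "documents": ["id1", "id2"]}, ...]}
    """
    summary = []
    for cluster in response.get("clusters", []):
        label = cluster.get("label", "")
        doc_ids = list(cluster.get("documents", []))
        summary.append((label, doc_ids))
    return summary


# Hypothetical response fragment, not real plugin output.
sample_response = {
    "clusters": [
        {"label": "Lotus Loft", "documents": ["4", "6", "7"]},
        {"label": "Lazy Dog", "documents": ["1", "3"]},
    ]
}

for label, doc_ids in summarize_clusters(sample_response):
    print(label, doc_ids)
```

With a live cluster you would pass resp.json() instead of the sample dictionary.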

By the way, Solr also uses Carrot2 to cluster documents.
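
For the near-duplicate part of the question, a small pure-Python baseline outside Elasticsearch can also be tried. The sketch below is not what Carrot2 does: it compares the word sets of each pair of documents with Jaccard similarity and merges every pair at or above a threshold into one group (using union-find, so groups are transitive). The function names and the 0.15 threshold are assumptions chosen for the seven example documents from the question; real data would need its own tuning, and the pairwise comparison only suits small collections.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_similar(docs, threshold):
    """Group document IDs whose word sets are similar, directly or via a chain."""
    ids = list(docs)
    parent = {i: i for i in ids}

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    token_sets = {i: tokens(docs[i]) for i in ids}
    for a, b in combinations(ids, 2):
        if jaccard(token_sets[a], token_sets[b]) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


DOCS = {
    1: "The quick brown fox jumps over the lazy dog",
    2: "Barcelona is a great city",
    3: "The fast orange fox jumps over the lazy dog",
    4: "Lotus loft Room - Bear Mountains Neighbourhood",
    5: "I do not like to eat fish",
    6: '"Lotus Loft" Condo From $160.00 CAD/night, sleeps up to 4',
    7: "Lotus Loft",
}

# The threshold is an assumption tuned to this toy data set.
print(group_similar(DOCS, threshold=0.15))
```

Documents 4 and 6 share too few words to match directly, but both overlap enough with document 7, which is why transitive grouping matters here. For larger collections, TF-IDF vectors with cosine similarity or MinHash-based techniques are the usual next step.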