Significant Terms Aggregation of "flat" structures

I am currently trying to prototype a product recommendation system using the Elasticsearch Significant Terms aggregation. So far, I haven't found a good example that deals with "flat" JSON structures of sales (each sale identified by its itemId) coming from a relational database, such as mine:

Document 1

{
    "lineItemId": 1,
    "lineNo": 1,
    "itemId": 1,
    "productId": 1234,
    "userId": 4711,
    "salesQuantity": 2,
    "productPrice": 0.99,
    "salesGross": 1.98,
    "salesTimestamp": 1234567890
}

Document 2

{
    "lineItemId": 1,
    "lineNo": 2,
    "itemId": 1,
    "productId": 1235,
    "userId": 4711,
    "salesQuantity": 1,
    "productPrice": 5.99,
    "salesGross": 5.99,
    "salesTimestamp": 1234567890
}

I have around 1.5 million of these documents in my Elasticsearch index. A lineItem is part of a sale (identified by itemId), and a sale can consist of one or more lineItems. What I would like to get back is, say, the 5 most "uncommonly common" products that were bought together with one specific productId.

The MovieLens example (https://www.elastic.co/guide/en/elasticsearch/guide/current/_significant_terms_demo.html) deals with data in the structure of

{
    "movie": [122,185,231,292,
              316,329,355,356,362,364,370,377,420,
              466,480,520,539,586,588,589,594,616
    ],
    "user": 1
}

so unfortunately it isn't really useful to me. I'd be very glad for an example or a suggestion that works with my "flat" structures. Thanks a lot in advance.

There are 3 best solutions below

Answer 1

I don't have as much data as you do, so I couldn't try this at your scale, but give this a go:

  1. Get the list of itemIds for the sales ("bundles") that contain the productId you want to find related products for:
{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "productId": 1234
        }
      }
    }
  },
  "fields": [
    "itemId"
  ]
}

  2. Then, using this list, create this query:
GET /sales/sales/_search?search_type=count
{
  "query": {
    "filtered": {
      "filter": {
        "terms": {
          "itemId": [1,2,3,4,5,6,7,11]
        }
      }
    }
  },
  "aggs": {
    "most_sig": {
      "significant_terms": {
        "field": "productId",
        "size": 0
      }
    }
  }
}
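
The two steps above could be glued together on the client side. The following is a minimal Python sketch, not something tested against a live cluster: the helper names (build_item_lookup, extract_item_ids, build_sig_terms_query) are made up for illustration, the query bodies mirror the Elasticsearch 1.x syntax used above, and the response shape (each hit carrying a "fields" object whose values are arrays) is an assumption about 1.x behaviour. A search returns only 10 hits by default, so step 1 takes an explicit size here.

```python
def build_item_lookup(product_id, size=1000):
    """Step 1: find line items of the seed product, returning only itemId.

    `size` is an assumption; pick a value large enough for your data
    (the default of 10 hits would silently truncate the list).
    """
    return {
        "size": size,
        "query": {"filtered": {"filter": {"term": {"productId": product_id}}}},
        "fields": ["itemId"],
    }


def extract_item_ids(response):
    """Collect the distinct itemIds from a step-1 search response.

    Assumes each hit looks like {"fields": {"itemId": [1]}}.
    """
    item_ids = set()
    for hit in response["hits"]["hits"]:
        item_ids.update(hit.get("fields", {}).get("itemId", []))
    return sorted(item_ids)


def build_sig_terms_query(item_ids, size=5):
    """Step 2: significant productIds among the line items of those sales."""
    return {
        "query": {"filtered": {"filter": {"terms": {"itemId": item_ids}}}},
        "aggs": {
            "most_sig": {
                "significant_terms": {"field": "productId", "size": size}
            }
        },
    }
```

Here size defaults to 5 because the question asks for the top 5; the query above uses 0 to return all significant terms. With very many matching sales the terms filter in step 2 can become large, so this is probably only practical for products with a moderate number of sales.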

Answer 2

It sounds like you're trying to build an item-based recommender. Apache Mahout has tools to help with collaborative filtering (formerly the Taste project).

There is also a Taste plugin for Elasticsearch 1.5.x which I believe can work with data like yours to produce item-based recommendations.

(Note: This plugin uses Rivers, which were deprecated in Elasticsearch 1.5, so I'd check with the authors about their plans to support more recent versions of Elasticsearch before adopting this suggestion.)

Answer 3

If I understand correctly, you have one doc per order line item. What you want instead is a single doc per order. The order doc should have an array of productIds (or an array of line item objects that each include a productId field).
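
The restructuring could be done before indexing. Below is a minimal Python sketch, assuming the flat documents from the question are available as a list of dicts; the function name group_line_items and the field name productIds are hypothetical, and it assumes that userId and salesTimestamp are identical for all line items of one sale (as in the two sample documents).

```python
def group_line_items(line_items):
    """Fold flat line-item docs into one order doc per itemId."""
    orders = {}
    for li in line_items:
        order = orders.setdefault(
            li["itemId"],
            {
                "itemId": li["itemId"],
                # Assumed constant per sale; taken from the first line item seen.
                "userId": li["userId"],
                "salesTimestamp": li["salesTimestamp"],
                "productIds": [],
            },
        )
        # Keep each product once per order, in first-seen order.
        if li["productId"] not in order["productIds"]:
            order["productIds"].append(li["productId"])
    return list(orders.values())


docs = [
    {"lineItemId": 1, "lineNo": 1, "itemId": 1, "productId": 1234,
     "userId": 4711, "salesTimestamp": 1234567890},
    {"lineItemId": 1, "lineNo": 2, "itemId": 1, "productId": 1235,
     "userId": 4711, "salesTimestamp": 1234567890},
]
orders = group_line_items(docs)
```

For the two sample documents this yields a single order doc for itemId 1 whose productIds array holds 1234 and 1235, which is the same shape as the MovieLens example.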

That way, when you query for orders containing product X, the significant_terms aggregation should find that product Y is uncommonly common in these orders.
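
As a rough illustration, the query against such per-order docs might be built like this. It is only a sketch under assumptions: the array field is called productIds here (a hypothetical name), the helper names are made up, and the response is assumed to have the usual aggregations/buckets layout. The seed product itself appears in every matching order, so the sketch drops it from the result on the client side.

```python
def build_co_purchase_query(product_id, size=5):
    """Orders containing product_id, with significant co-purchased products."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {"term": {"productIds": product_id}},
        "aggs": {
            "co_purchased": {
                # Ask for one extra bucket because the seed is removed later.
                "significant_terms": {"field": "productIds", "size": size + 1}
            }
        },
    }


def top_co_purchased(response, product_id, size=5):
    """Return bucket keys in the order given by the response, minus the seed."""
    buckets = response["aggregations"]["co_purchased"]["buckets"]
    keys = [b["key"] for b in buckets if b["key"] != product_id]
    return keys[:size]
```

The request body would be sent to the search endpoint of whatever index holds the order docs, and top_co_purchased applied to the JSON response.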