Elasticsearch significant terms on nested objects


For my master's thesis I am using Elasticsearch to measure how significant sentences, paragraphs and documents are relative to the rest of the index. I use three separate indexes to keep querying fast. Everything works fine, but I want to find out whether the same is even possible with nested objects or parent-child relationships.

Here I try to set up and query significant terms with nested objects:

PUT /test_nested
{
    "settings": { 
      "analysis": {
        "filter": {
          "german_stop": {
            "type":       "stop",
            "stopwords":  "_german_" 
          },
          "german_keywords": {
            "type":       "keyword_marker",
            "keywords":   [""] 
          },
          "german_stemmer": {
            "type":       "stemmer",
            "language":   "light_german"
          },
          "shingle_bigram": {
            "type":       "shingle",
            "max_shingle_size": 2,
            "min_shingle_size": 2,
            "output_unigrams": false
          },
          "shingle_trigram": {
            "type":       "shingle",
            "max_shingle_size": 3,
            "min_shingle_size": 3,
            "output_unigrams": false
          }
        },
        "analyzer": {
          "unigram": {
            "tokenizer":  "standard",
            "filter": [
              "lowercase",
              "german_stop",
              "german_keywords",
              "german_normalization",
              "german_stemmer"
            ]
          }
        }
      }
    },  
  "mappings": {
    "document": {
      "properties": {
        "tags" : {
          "type" : "string",
          "analyzer" : "unigram",
          "index" : "analyzed"
        },
        "publishDate" : {
          "type" : "date" 
        },
        "paragraphs": {
          "type": "nested",
          "properties": {
            "sentences" :{
              "type" : "nested",
              "properties": {
                "textBody": {
                  "type": "string",
                  "analyzer" : "unigram",
                  "index" : "analyzed",
                  "term_vector" : "with_positions_offsets",
                "term_statistics" : true
                }
              }
            }
          }
        }
      }
    }
  }
}

And two test documents:

PUT /test_nested/document/1
{
  "tags" : "DerSpiegel, Frankfurt",
  "publishDate" : "2005-12-11",
  "paragraphs" : [
    {
      "sentences" : [ 
        {"textBody" : "Größter anzunehmender Einschlag"},
        {"textBody": "Es gibt ziemlich blöde Vorurteile über Fußball."},
        {"textBody": "Eines lautet: Der Ball ist rund."},
        {"textBody": "Freitagabend, Messehalle 1 in Leipzig, die Auslosung zur Fußballweltmeisterschaft: Der Ball ist gar nicht rund."}
        ]
    }
  ]
}

PUT /test_nested/document/2
{
  "tags" : "DerSpiegel, Frankfurt",
  "publishDate" : "2005-12-11",
  "paragraphs" : [
    {
      "sentences" : [ 
        {"textBody" : "Dafür aber kann man mit so einem Ball auch viel mehr anstellen als mit diesen runden, kleinen Dingern, die früher aus Leder waren und heute aus Polyurethan sind."},
        {"textBody": "Zum Beispiel die gigantischste Fußball-WM-Auslosungsshow aller Zeiten zelebrieren."},
        {"textBody": "Eine Show, die zum globalen Fußball passt."},
        {"textBody": "Hauptsache riesig - wen interessiert schon rund?"}
        ]
    },
    {
      "sentences" : [
        {"textBody" : "Mit der Verteilung der 32 Teams auf ihre acht Gruppen bekamen die Deutschen damit erstmals auch einen Vorgeschmack auf das Gewicht und die Wucht der WM im nächsten Jahr." },
        {"textBody" : "Mag die Nachricht des Abends auch gewesen sein, dass Deutschland gegen Costa Rica, Polen und Ecuador spielt und dass im Achtelfinale die Engländer drohen, die Botschaft des Spektakels von Leipzig heißt, dass die WM mit einer Opulenz über das Land kommen wird, von der sich die Deutschen bisher noch gar keine rechte Vorstellung gemacht haben." },
        {"textBody" : "Die Stimme von 1974 gehörte Wolfhard Kuhlins, Sportchef des HR, und das Weltereignis war nach 45 Minuten ausgestrahlt, nicht nach 150." },
        {"textBody" : "Zwar kam auch schon Franz Beckenbauer zum Interview ins Studio, aber selbst der Kaiser war noch nicht, was er mal wurde: Zum schwarzen Anzug trug er weiße Socken." }
        ]
    }
  ]
}

Unfortunately I don't get any significant terms for the following query:

GET test_nested/document/_search?search_type=count
{
  "query" : {
    "match_all" :{}
  },
  "aggs" :{
      "sentences":{
        "nested" :{
          "path" : "paragraphs.sentences"
        }
      },
      "aggs" : {

            "significant_terms" : { 
              "chi_square": {}, 
              "field" : "paragraphs.sentences.textBody"
            }

      }
  }
}

There are 2 best solutions below

Best answer

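You just had a syntax error, basically: the inner "aggs" block has to sit inside the "sentences" aggregation, next to "nested", and the significant_terms aggregation needs a name of its own. This seems to do what you want: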

POST test_nested/document/_search?search_type=count
{
   "query": {
      "match_all": {}
   },
   "aggs": {
      "sentences": {
         "nested": {
            "path": "paragraphs.sentences"
         },
         "aggs": {
            "sentances_sig_terms": {
               "significant_terms": {
                  "chi_square": {},
                  "field": "paragraphs.sentences.textBody"
               }
            }
         }
      }
   }
}
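
To make the difference between the two request bodies easier to see, here is a minimal Python sketch that compares them as plain dicts. As far as I can tell, the original body is read as two sibling aggregations, one of which is literally named "aggs" and runs on the root documents instead of inside the nested context. The helper `misplaced_sub_aggs` is a hypothetical name for this illustration and not part of any Elasticsearch client.

```python
# Hypothetical helper: flags aggregations that are literally *named* "aggs",
# which usually means a sub-aggregation block sits one level too high.
RESERVED = {"aggs", "aggregations"}

def misplaced_sub_aggs(aggs, trail=()):
    """Return slash-separated paths of aggregations whose name is a reserved key."""
    found = []
    for name, body in aggs.items():
        here = trail + (name,)
        if name in RESERVED:
            found.append("/".join(here))
        for key in RESERVED:
            # Correctly placed sub-aggregations live inside the named body.
            if isinstance(body, dict) and isinstance(body.get(key), dict):
                found.extend(misplaced_sub_aggs(body[key], here))
    return found

sig_terms = {
    "significant_terms": {
        "chi_square": {},
        "field": "paragraphs.sentences.textBody",
    }
}

# Contents of the top-level "aggs" object in the question.
broken = {
    "sentences": {"nested": {"path": "paragraphs.sentences"}},
    "aggs": sig_terms,
}

# Contents of the top-level "aggs" object in the corrected query.
fixed = {
    "sentences": {
        "nested": {"path": "paragraphs.sentences"},
        "aggs": {"sentances_sig_terms": sig_terms},
    }
}

print(misplaced_sub_aggs(broken))  # ['aggs']
print(misplaced_sub_aggs(fixed))   # []
```

The check is only a heuristic for this kind of mistake; it does not validate a request body in general.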

Here's some code I used to test it:

http://sense.qbox.io/gist/e53122ea5887bf48a9bab570ad1c63546494026d

Very well-written question, by the way.

Second answer

Your issue is that you have a nested object inside a nested object. I'm not sure whether this is intended or whether you only gave us a small piece of your data for minimal testing.

Why am I pointing this out? Because there's just one nested aggregation in your query, and the two levels should be handled separately. Also, your significant_terms aggregation has no name. To sum up:

  1. Split your nested aggregation into two
  2. Gave the significant_terms aggregation a name
  3. Profit??

Here's your query:

POST test_nested/document/_search?search_type=count
{
  "aggs": {
    "paragraphs": {
      "nested": {
        "path": "paragraphs"
      },
      "aggs": {
        "sentences": {
          "nested": {
            "path": "paragraphs.sentences"
          },
          "aggs": {
            "Significants": {
              "significant_terms": {
                "chi_square": {},
                "field": "paragraphs.sentences.textBody"
              }
            }
          }
        }
      }
    }
  }
}
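
If you need the same one-aggregation-per-level pattern for other paths, it can be generated instead of written by hand. The following Python sketch builds the request body above from a dotted path. It assumes every segment of the path is mapped as "nested", which holds for this mapping but not in general, and `nested_chain` is a hypothetical helper name, not a client API.

```python
def nested_chain(path, leaf_name, leaf_agg):
    """Wrap leaf_agg in one nested aggregation per level of a dotted path.

    Assumption: every segment of `path` is mapped with "type": "nested".
    """
    parts = path.split(".")
    aggs = {leaf_name: leaf_agg}
    # Build from the innermost level outwards.
    for depth in range(len(parts), 0, -1):
        aggs = {
            parts[depth - 1]: {
                "nested": {"path": ".".join(parts[:depth])},
                "aggs": aggs,
            }
        }
    return {"aggs": aggs}

body = nested_chain(
    "paragraphs.sentences",
    "Significants",
    {
        "significant_terms": {
            "chi_square": {},
            "field": "paragraphs.sentences.textBody",
        }
    },
)
```

The resulting `body` dict has the same structure as the query above and could be serialized with `json.dumps` and sent as the search request body.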

And here's your result (I used the test data you provided):

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "paragraphs": {
         "doc_count": 3,
         "sentences": {
            "doc_count": 12,
            "Significants": {
               "doc_count": 12,
               "buckets": [
                  {
                     "key": "rund",
                     "doc_count": 4,
                     "score": 2.1794871794871793,
                     "bg_count": 4
                  },
                  {
                     "key": "ball",
                     "doc_count": 3,
                     "score": 1.5178571428571428,
                     "bg_count": 3
                  },
                  {
                     "key": "wm",
                     "doc_count": 3,
                     "score": 1.5178571428571428,
                     "bg_count": 3
                  },
                  {
                     "key": "fussball",
                     "doc_count": 3,
                     "score": 1.5178571428571428,
                     "bg_count": 3
                  }
               ]
            }
         }
      }
   }
}
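
A note on the counts in that response: hits.total is 2 because it counts the root documents, while the doc_count values under "paragraphs" and "sentences" count nested objects, each of which is indexed as its own hidden document. The short Python sketch below recomputes both numbers from the shape of the two test documents; the sentence text is replaced by placeholders because only the counts matter here.

```python
# Shape of the two test documents; sentence text is replaced by placeholders.
docs = [
    {"paragraphs": [{"sentences": ["s"] * 4}]},
    {"paragraphs": [{"sentences": ["s"] * 4}, {"sentences": ["s"] * 4}]},
]

# One nested document per paragraph, one per sentence.
paragraph_count = sum(len(d["paragraphs"]) for d in docs)
sentence_count = sum(len(p["sentences"]) for d in docs for p in d["paragraphs"])

print(paragraph_count)  # 3
print(sentence_count)   # 12
```

So the significant_terms aggregation is evaluated over the 12 sentence documents, not over the 2 articles.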

Let me know if this is what you needed.