I am searching for synonyms of a particular phrase in a dataset. I have two JSON files that store synonyms for "yes" and "no". If I query for "not interested", I get both the yes and the no phrases/synonyms back; the expected result is only the no phrases/synonyms.
en-gen-yes.json
{
"tag":"en-gen-yes",
"phrases": [
"yes",
"yeah",
"sure",
"suits me",
"interested"
]
}
en-gen-no.json
{
"tag":"en-gen-no",
"phrases": [
"no",
"nope",
"not sure",
"does not suits me",
"not interested"
]
}
query code
query := bleve.NewMatchPhraseQuery("not interested")
req := bleve.NewSearchRequest(query)
req.Fields = []string{"phrases"}
searchResults, err := paraphraseIndex.Search(req)
if err != nil {
	log.Fatal(err)
}
if searchResults.Hits.Len() == 0 {
	fmt.Println("No matches found")
} else {
	for _, hit := range searchResults.Hits {
		fmt.Printf("%s\n", hit.Fields["phrases"])
	}
}
The result is:
[no nope not sure does not suits me not interested] [yes yeah sure suits me interested]
The expected result is only:
[no nope not sure does not suits me not interested]
It matches both because the MatchPhraseQuery you are using analyzes the search terms before matching. You didn't show the IndexMapping here, so I can't be sure, but I'll assume you're using the "standard" analyzer. This analyzer removes English stop words, and the English stop word list is defined here:
https://github.com/blevesearch/bleve/blob/master/analysis/lang/en/stop_words_en.go#L281
This means that when you run a MatchPhraseQuery for "not interested", you end up searching for just "interested", and that term also appears in your "yes" list of synonyms.
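To make the effect concrete, here is a minimal sketch of the analysis step using only the Go standard library. It is not bleve's implementation: the `analyze` helper and the three-word `stopWords` map are hypothetical stand-ins for the real tokenizer and stop word list, and the only entry taken from the discussion above is "not".

```go
package main

import (
	"fmt"
	"strings"
)

// stopWords is an illustrative stand-in for an English stop word list.
// It is NOT bleve's actual list, just enough entries to show the effect.
var stopWords = map[string]bool{"not": true, "does": true, "me": true}

// analyze is a simplified stand-in for an analyzer: it lowercases the
// text, splits it on whitespace, and drops every token found in stop.
func analyze(text string, stop map[string]bool) []string {
	var out []string
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		if !stop[tok] {
			out = append(out, tok)
		}
	}
	return out
}

func main() {
	// With stop word removal, both phrases reduce to the same single term.
	fmt.Println(analyze("not interested", stopWords)) // [interested]
	fmt.Println(analyze("interested", stopWords))     // [interested]

	// Without a stop word list, "not" survives and the phrases differ.
	fmt.Println(analyze("not interested", nil)) // [not interested]
}
```

Since the same analysis runs on the indexed phrases and on the query, the "yes" and "no" documents both end up containing the term the query is reduced to.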
It is worth noting that there is a variant called PhraseQuery (without Match) that does exact matching. While that wouldn't remove the word "not" at search time, it still wouldn't find the match. The reason is that the word "not" has been removed at index time as well, so an exact match of "not interested" would not find any matches (neither yes nor no).
The solution is to configure a custom analyzer that either doesn't remove any stop words, or that uses a custom stop word list that doesn't contain the word "not". If you do this, and use it for both indexing and searching, the query you're using should start to work correctly.