Analyse single tags per post in Elasticsearch dsl


I have an ES instance running with data from travel.stackexchange.

# Example Data
first = ["""This was one of our definition questions, but also one that interests me personally:
          How can I find a guide that will take me safely through the Amazon jungle? I'd love
          to explore the Amazon but would not attempt it without a guide, at least not the first
          time. I'd prefer a guide that wasn't going to ambush me or anything.I don't want to go
          anywhere touristy.  Start and end points are open, but the trip should take me places
          where I am not likely to see other travelers/tourists and where I will definitely
          require a good guide in order to be safe.""", # content
          '2011-06-21T20:22:33.760', # date of creation
          '39', # votes
          '2799', # views
          '8', # answers
          '4', # comments
          'How can I find a guide that will take me safely through the Amazon jungle?', # title
          '"guides", "extreme-tourism", "amazon-river", "amazon-jungle"'] # TAGS

I connect to it using

from elasticsearch_dsl import Search, connections

connections.create_connection(alias='es', hosts=['localhost'], timeout=60)

As you can see, the post has several tags ("guides", "amazon-river", ...). When I index my data into ES, the tags of each post are formatted as a single string.
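A minimal sketch of splitting such a tag string into individual values, assuming the tags are always comma-separated and double-quoted as in the example data. `split_tags` is a hypothetical helper name, not something from the code in this post.

```python
def split_tags(raw):
    """Split a string like '"visas", "uk"' into ['visas', 'uk'].

    Assumes tags are comma-separated and optionally wrapped in double quotes.
    A placeholder such as 'no tags' is returned as a single value.
    """
    parts = [part.strip().strip('"') for part in raw.split(",")]
    # Drop empty fragments left behind by stray commas.
    return [part for part in parts if part]


example = '"guides", "extreme-tourism", "amazon-river", "amazon-jungle"'
print(split_tags(example))
```

If the resulting list is indexed into a keyword field instead of the single string, a terms aggregation should produce one bucket per tag, since Elasticsearch treats a list as multiple values of the same field.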

Now I query my index (with a larger dataset, of course)

s = Search(using="es", index=current_index)

and aggregate the number of times each tag was mentioned.

s.aggs.bucket("per_tag", "terms", field="tags", size=5)
r = s.execute()

However, when I look at the results, they look like

r.aggregations.per_tag.buckets
>>> [{'key': 'no tags', 'doc_count': 70672},
>>>  {'key': '"visas", "uk"', 'doc_count': 330}, 
>>>  {'key': '"visas", "schengen"', 'doc_count': 264}, 
>>>  {'key': '"visas"', 'doc_count': 253},
>>>  {'key': '"air-travel"', 'doc_count': 182}]

This works, but it is not what I wanted. As you can see, the tag "visas" shows up in three separate buckets instead of one. What I'd like to get back looks like this:

>>> [{'key': 'no tags', 'doc_count': 70672},
>>>  {'key': 'visas', 'doc_count': XXX}, 
>>>  {'key': 'uk', 'doc_count': YYY}, 
>>>  {'key': 'schengen', 'doc_count': ZZZ},
>>>  {'key': 'air-travel', 'doc_count': AAA}]

What I have tried so far is to input the tags in different formats: with and without the quotation marks, and separated only by spaces instead of commas. However, I suspect that I need to define the aggregation more precisely rather than change the input. Any help would be appreciated.
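To make the desired per-tag counts concrete, here is a minimal client-side sketch in plain Python. It does not touch Elasticsearch; the `raw_tags` sample is made up in the same format as the bucket keys above, and the splitting rule (comma-separated, double-quoted) is an assumption.

```python
from collections import Counter

# Hypothetical sample of tag strings, one per post, in the format of the keys above.
raw_tags = [
    '"visas", "uk"',
    '"visas", "schengen"',
    '"visas"',
    '"air-travel"',
    'no tags',
]

per_tag = Counter()
for raw in raw_tags:
    # A set counts each tag at most once per post, like doc_count does.
    tags = {part.strip().strip('"') for part in raw.split(",") if part.strip()}
    per_tag.update(tags)

# Shape the counts like the buckets of a terms aggregation with size=5.
buckets = [{"key": key, "doc_count": count} for key, count in per_tag.most_common(5)]
print(buckets)
```

Here "visas" ends up in a single bucket with a count of 3, which is the behaviour wanted from the aggregation itself.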
