I have an ES instance running with data from travel.stackexchange.
# Example Data
first = ["This was one of our definition questions, but also one that interests me personally: "
"How can I find a guide that will take me safely through the Amazon jungle? I'd love "
"to explore the Amazon but would not attempt it without a guide, at least not the first "
"time. I'd prefer a guide that wasn't going to ambush me or anything. I don't want to go "
"anywhere touristy. Start and end points are open, but the trip should take me places "
"where I am not likely to see other travelers/tourists and where I will definitely "
"require a good guide in order to be safe.", # content
'2011-06-21T20:22:33.760', # date of creation
'39', # votes
'2799', # views
'8', # answers
'4', # comments
'How can I find a guide that will take me safely through the Amazon jungle?', # title
'"guides", "extreme-tourism", "amazon-river", "amazon-jungle"'] # TAGS
I connect to it using
from elasticsearch_dsl import Search, connections  # assuming the elasticsearch-dsl package
connections.create_connection(alias='es', hosts=['localhost'], timeout=60)
As you can see, the post has several tags ("guides", "amazon-river", ...). When I input my data into ES, I have all the tags of a post formatted as a single string.
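For illustration, a combined tag string of this shape can be split into individual tags in plain Python. This is only a sketch: `split_tags` is a hypothetical helper, and it assumes that tags are comma-separated, optionally wrapped in double quotes, and never contain commas themselves.

```python
def split_tags(raw):
    """Split a combined tag string into a list of individual tag names.

    Hypothetical helper: assumes comma-separated tags that may be wrapped
    in double quotes, as in the example post above.
    """
    return [part.strip().strip('"') for part in raw.split(",") if part.strip()]


raw_tags = '"guides", "extreme-tourism", "amazon-river", "amazon-jungle"'
print(split_tags(raw_tags))
```

A value without commas, such as 'no tags', comes back as a one-element list.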
Now, when I query my index (with a larger dataset of course)
s = Search(using="es", index=current_index)
and aggregate the number of times each tag was mentioned.
s.aggs.bucket("per_tag", "terms", field="tags", size=5)
r = s.execute()
However, when I look at the results, they look like
r.aggregations.per_tag.buckets
>>> [{'key': 'no tags', 'doc_count': 70672},
>>> {'key': '"visas", "uk"', 'doc_count': 330},
>>> {'key': '"visas", "schengen"', 'doc_count': 264},
>>> {'key': '"visas"', 'doc_count': 253},
>>> {'key': '"air-travel"', 'doc_count': 182}]
This is a valid result, but not what I wanted. As you can see, the tag "visas" appears in three separate buckets instead of being counted once across all posts. What I'd like is a result that looks like
>>> [{'key': 'no tags', 'doc_count': 70672},
>>> {'key': 'visas', 'doc_count': XXX},
>>> {'key': 'uk', 'doc_count': YYY},
>>> {'key': 'schengen', 'doc_count': ZZZ},
>>> {'key': 'air-travel', 'doc_count': AAA}]
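The difference between the two results can be reproduced without Elasticsearch. The following sketch uses `collections.Counter` as a stand-in for the terms aggregation. The four sample strings are made up for illustration and are not taken from the real index, so the counts say nothing about the actual data.

```python
from collections import Counter

# Made-up sample of tag strings, shaped like the values stored in the index.
docs = ['"visas", "uk"', '"visas", "schengen"', '"visas"', '"air-travel"']

# Counting whole strings mirrors the current result: one bucket per combination.
per_string = Counter(docs)

# Counting individual tags mirrors the desired result: one bucket per tag.
per_tag = Counter(
    tag.strip().strip('"')
    for doc in docs
    for tag in doc.split(",")
)

print(per_string.most_common())
print(per_tag.most_common(1))  # [('visas', 3)]
```

In this toy sample, "visas" ends up in a single bucket with a count of 3, which is the behaviour I am after in the aggregation.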
What I have tried so far is to input the tags in different ways: once with the quotation marks, once without them but keeping the commas, and once separated only by spaces. However, I feel that I need to define the aggregation more precisely, rather than change the input. Any help would be appreciated.