I have found the frequency of bigrams in certain sentences using:
import nltk
from nltk import ngrams
mydata = "xxxxx"
mylist = mydata.split()
mybigrams = list(ngrams(mylist, 2))
fd = nltk.FreqDist(mybigrams)
print(fd.most_common())
When I print the bigrams by frequency, one occurs 7 times, whereas the other 95 bigrams each occur only once. However, when I compare the output to my sentences, I can see no logical order in the way the bigrams with frequency 1 are printed. Does anyone know whether there is any logic to the order in which .most_common() prints bigrams of equal frequency, or is it effectively random?
Thanks in advance
Short answer, based on the documentation of collections.Counter.most_common:
In current versions of NLTK, nltk.FreqDist is based on nltk.compat.Counter. On Python 2.7 and 3.x, collections.Counter will be imported from the standard library. On Python 2.6, NLTK provides its own implementation. For details, look at the source code:
https://github.com/nltk/nltk/blob/develop/nltk/compat.py
In conclusion, without checking all possible version configurations, you cannot rely on any particular order among items with equal frequency.
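If you need a reproducible order for the ties, one option is to sort the counts yourself instead of relying on .most_common(). The following is only a minimal sketch: it uses collections.Counter as a stand-in for nltk.FreqDist (on the assumption that FreqDist exposes the same .items() interface), and the sample sentence and the helper names bigrams and most_common_stable are made up for illustration.

```python
from collections import Counter


def bigrams(tokens):
    """Return adjacent token pairs (stand-in for list(ngrams(tokens, 2)))."""
    return list(zip(tokens, tokens[1:]))


def most_common_stable(counts):
    """Sort by descending count, breaking ties alphabetically by item."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample text; replace with your own data.
tokens = "the cat sat on the mat and the cat ran".split()
fd = Counter(bigrams(tokens))

ranked = most_common_stable(fd)
for bigram, count in ranked:
    print(bigram, count)
```

Here the most frequent bigram still comes first, and all bigrams with the same count are listed alphabetically, so the output is the same on every run and every Python version. If you would rather keep ties in order of first appearance, you would need to record each bigram's first position yourself and use that as the secondary sort key.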