I was trying semcor corpus in nltk.
I found this code here:
>>> list(map(str, semcor.tagged_chunks(tag='both')[:3]))
['(DT The)', "(Lemma('group.n.01.group') (NE (NNP Fulton County Grand Jury)))", "(Lemma('state.v.01.say') (VB said))"]
I tried the same on colab (check last cell in this notebook):
>>> list(map(str, semcor.tagged_chunks(tag='both')[:3]))
['(DT The)',
'(group.n.01 (NE (NNP Fulton County Grand Jury)))',
'(say.v.01 (VB said))']
Here is the screenshot from colab:
The problem
Note that on nltk page, for Fulton County Grand Jury
output is given as Lemma('group.n.01.group')
, but on colab, I am getting group.n.01
. So I am not getting sense / synset lemma.
- In
group.n.01.group
- first
group
is a "stem for sense word" - last
group
is "stem for input"
- first
- In
group.n.01
- (first and only)
group
is "stem for input" - no "stem for sense word" is returned
- (first and only)
Weird thing is that it was giving me correct output yesterday. This notebook will clear the doubt as it has same two lines executed today and yesterday. Yesterday (2/9/2021), I was getting tags in format group.n.01.group
, but today I am getting tags in group.n.01
format (NOTICE RED AND BLUE COMMENTS):
What I am missing here?
I knew that
semcor
useswordnet
senses to tag to subset ofbrown
corpus. But I was not aware thatsemcor
APIs can work with or withoutwordnet
predownloaded and it will give tags in different format in these different scenarios. I honestly feel, at leastsemcor
API documentation should have some mention of this.So, without
wordnet
predownloaded, it does not return sense stems:With
wordnet
pre-downloaded, it does return sense stems: