Why can't I pass wn.ADJ_SAT as a pos when requesting synsets?

I know that WordNet has a "satellite adjective" synset type, and I know that it is in the synset type constants in NLTK:

>>> from nltk.corpus import wordnet as wn
>>> wn.ADJ_SAT
u's'
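
For comparison, the other POS constants are single-character tags as well, and ADJ_SAT's u's' is distinct from ADJ's u'a' (values from the same NLTK install as the traceback below):

>>> wn.NOUN, wn.VERB, wn.ADJ, wn.ADV, wn.ADJ_SAT
(u'n', u'v', u'a', u'r', u's')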

Why can't I pass it as the pos argument to synsets()?

>>> wn.synsets('dog', wn.ADJ_SAT)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/wordnet.py", line 1413, in synsets
    for form in self._morphy(lemma, p)
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/wordnet.py", line 1627, in _morphy
    substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]
KeyError: u's'

Starting from the same error:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('able')
[Synset('able.a.01'), Synset('able.s.02'), Synset('able.s.03'), Synset('able.s.04')]
>>> wn.synsets('able', pos=wn.ADJ)
[Synset('able.a.01'), Synset('able.s.02'), Synset('able.s.03'), Synset('able.s.04')]
>>> wn.synsets('able', pos=wn.ADJ_SAT)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1413, in synsets
    for form in self._morphy(lemma, p)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1627, in _morphy
    substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]
KeyError: u's'

From https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1397 , we see that when you try to retrieve synsets from the NLTK wordnet API, the POS restriction appears in the returned list comprehension, which calls the self._morphy(lemma, p) function:

def synsets(self, lemma, pos=None, lang='en'):
    """Load all synsets with a given lemma and part of speech tag.
    If no pos is specified, all synsets for all parts of speech
    will be loaded. 
    If lang is specified, all the synsets associated with the lemma name
    of that language will be returned.
    """
    lemma = lemma.lower()

    if lang == 'en':
        get_synset = self._synset_from_pos_and_offset
        index = self._lemma_pos_offset_map
        if pos is None:
            pos = POS_LIST
        return [get_synset(p, offset)
                for p in pos
                for form in self._morphy(lemma, p)
                for offset in index[form].get(p, [])]
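
Note that when a single POS tag such as wn.ADJ_SAT is passed, the for p in pos loop simply iterates over the characters of that one-character string, so p is exactly u's' and that value is handed straight to self._morphy(). A minimal check of that iteration behaviour:

>>> [p for p in wn.ADJ_SAT]
[u's']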

Next, let's look at the _morphy() function, from https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1573:

def _morphy(self, form, pos):
    # from jordanbg:
    # Given an original string x
    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    # 2. Return all that are in the database
    # 3. If there are no matches, keep applying rules until you either
    #    find a match or you can't go any further

    exceptions = self._exception_map[pos]
    substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]

    def apply_rules(forms):
        return [form[:-len(old)] + new
                for form in forms
                for old, new in substitutions
                if form.endswith(old)]

    def filter_forms(forms):
        result = []
        seen = set()
        for form in forms:
            if form in self._lemma_pos_offset_map:
                if pos in self._lemma_pos_offset_map[form]:
                    if form not in seen:
                        result.append(form)
                        seen.add(form)
        return result

    # 0. Check the exception lists
    if form in exceptions:
        return filter_forms([form] + exceptions[form])

    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    forms = apply_rules([form])

    # 2. Return all that are in the database (and check the original too)
    results = filter_forms([form] + forms)
    if results:
        return results

    # 3. If there are no matches, keep applying rules until we find a match
    while forms:
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results

    # Return an empty list if we can't find anything
    return []

We see that it retrieves substitution rules from substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos] to perform morphological reduction before it retrieves the Synsets, which are stored under the "base"/"root" form. E.g.

>>> from nltk.corpus import wordnet as wn
>>> wn._morphy('dogs', 'n')
[u'dog']
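
Correspondingly, calling _morphy() directly with the satellite tag reproduces the KeyError (output from the same NLTK version as the tracebacks above; newer releases may behave differently):

>>> wn._morphy('able', wn.ADJ_SAT)
Traceback (most recent call last):
  ...
KeyError: u's'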

And if we look at MORPHOLOGICAL_SUBSTITUTIONS, we see that ADJ_SAT is missing (see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1609):

MORPHOLOGICAL_SUBSTITUTIONS = {
    NOUN: [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'),
           ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'),
           ('men', 'man'), ('ies', 'y')],
    VERB: [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''),
           ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')],
    ADJ: [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')],
    ADV: []}
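
A quick check against the installed copy confirms the missing key (again, with the NLTK version from the tracebacks above; newer releases may already include it):

>>> wn.ADJ_SAT in wn.MORPHOLOGICAL_SUBSTITUTIONS
False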

Thus, to prevent this from happening, a simple fix is to add this line after line 1609 of https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1609:

MORPHOLOGICAL_SUBSTITUTIONS[ADJ_SAT] = MORPHOLOGICAL_SUBSTITUTIONS[ADJ]
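
If you would rather not edit the installed wordnet.py, the same one-line idea can be applied at runtime by patching the class attribute; this is just a sketch (it makes the KeyError go away, though whether synsets() then returns the satellite synsets depends on how the lemma index stores them):

from nltk.corpus import wordnet as wn
from nltk.corpus.reader.wordnet import WordNetCorpusReader

# Reuse the ADJ ('a') substitution rules for ADJ_SAT ('s'),
# mirroring the one-line source fix above.
WordNetCorpusReader.MORPHOLOGICAL_SUBSTITUTIONS[wn.ADJ_SAT] = \
    WordNetCorpusReader.MORPHOLOGICAL_SUBSTITUTIONS[wn.ADJ]

# After this, wn.synsets('dog', wn.ADJ_SAT) no longer raises a KeyError.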

As a proof of concept (using the integers 1 to 5 to stand in for the POS tags), the new key simply aliases the ADJ rule list:

>>> MORPHOLOGICAL_SUBSTITUTIONS = {
...     1: [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'),
...            ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'),
...            ('men', 'man'), ('ies', 'y')],
...     2: [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''),
...            ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')],
...     3: [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')],
...     4: []}
>>> 
>>> MORPHOLOGICAL_SUBSTITUTIONS[5] = MORPHOLOGICAL_SUBSTITUTIONS[3]
>>> MORPHOLOGICAL_SUBSTITUTIONS
{1: [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'), ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'), ('men', 'man'), ('ies', 'y')], 2: [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''), ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')], 3: [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')], 4: [], 5: [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')]}
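
In the meantime, a practical workaround (assuming NLTK 3's method-style accessors, where Synset.pos() returns the single-letter tag) is to query with wn.ADJ and keep only the satellite synsets:

>>> from nltk.corpus import wordnet as wn
>>> [s for s in wn.synsets('able', pos=wn.ADJ) if s.pos() == wn.ADJ_SAT]
[Synset('able.s.02'), Synset('able.s.03'), Synset('able.s.04')]

This works because the satellite synsets already come back from the regular adjective query, as the wn.synsets('able', pos=wn.ADJ) output above shows.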