In NLTK, can I do morphological analysis for a specific language?


I am trying to add some Arabic features to NLTK, but some tasks, such as stemming, need morphological analysis. Is there any way to define the morphological features of a specific language such as Arabic in NLTK, or must I write a custom analyzer?

There are 2 best solutions below


If you're looking for Arabic processing, there's the ISRI stemmer that @alexis pointed to:

>>> from nltk.stem.isri import ISRIStemmer
>>> isri = ISRIStemmer()
>>> s = 'حركات'
>>> isri.stem(s)
'حرك'

See Python ISRIStemmer for Arabic text

If you're asking for a generic tool, NLTK doesn't really have one. If you're after customized stemming, you can try the customizable LancasterStemmer rules, updated in NLTK v3.2.3 (see https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L50), but you would first have to understand how the Lancaster algorithm works.
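To give a feel for what "customized stemming" means in the simplest case, here is a rough, pure-Python sketch of a rule-based affix stripper. It is not NLTK's API and not the ISRI or Lancaster algorithm; the names `PREFIXES`, `SUFFIXES`, `MIN_STEM_LEN` and `light_stem` are made up for illustration, and the affix lists are tiny toy assumptions, nowhere near a real Arabic inventory.

```python
# Toy affix lists: illustrative assumptions only, not a complete Arabic inventory.
PREFIXES = ("ال", "و")
SUFFIXES = ("ات", "ون", "ين", "ة")

# Assumed lower bound on what is left after stripping, to avoid over-stemming.
MIN_STEM_LEN = 3


def light_stem(word):
    """Strip at most one listed prefix and one listed suffix, longest first."""
    # Try longer prefixes before shorter ones and stop at the first match.
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break

    # Same idea for suffixes.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[:-len(suffix)]
            break

    return word


if __name__ == "__main__":
    for w in ("حركات", "الكتاب", "بيت"):
        print(w, "->", light_stem(w))
```

Something like this only handles concatenative affixes. It knows nothing about roots and patterns, which is a large part of why real Arabic morphology is hard. If it turns out to be useful, it could be wrapped in a small class with a `stem(word)` method so it can be called the same way as the NLTK stemmers.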

Morfessor might also be what you're looking for, if you have training data of morphologically split words/sentences.


Forget it. Creating a morphological analyzer, especially for a language with complex morphology like Arabic, is extremely difficult. Look around for existing solutions you can install and interface with NLTK. That said, NLTK does come with an Arabic stemmer, see here. You'll have to decide whether it's good enough for your needs.