I have a .conll output file from MaltParser, produced with the engmalt.linear-1.7.mco pre-trained model. My original input was a large text file of sentences. How can I use this file for feature selection?
I am using Python with scikit-learn (currently using a TF-IDF bag of words to select features). However, I want to make use of NLP information, for example by selecting only adjectives. How can I do this with a CoNLL file?
The output of a parser in the CoNLL-X format provides a separate column for the part-of-speech tags. For example, if you parse the sentence
the output might be as follows:
Columns 4 and 5 show the coarse- and fine-grained part-of-speech tags, respectively. If you only want to select adjectives, you just need to pick the words that have JJ as their coarse-grained tag in column 4.
Once you have selected the words according to whatever your selection criteria are, you can proceed to construct the vectors in the usual way.
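As a rough illustration of the selection step, here is a minimal sketch that reads CoNLL-X rows, splits them on tabs, and keeps the word form (column 2) whenever the coarse tag (column 4) starts with JJ. The function name `adjectives_per_sentence`, the `tag_prefix` parameter, and the two sample sentences are made up for the example and are not output from your parser. Matching on the prefix is an assumption that also lets JJR and JJS through if they appear in that column.

```python
import io


def adjectives_per_sentence(lines, tag_prefix="JJ"):
    """Yield one space-joined string of adjectives per parsed sentence.

    `lines` is any iterable of CoNLL-X rows (for example an open file).
    Sentences are assumed to be separated by blank lines.
    """
    current = []        # adjectives collected for the current sentence
    seen_token = False  # True once the current sentence has at least one row
    for line in lines:
        if not line.strip():
            # A blank line closes the current sentence.
            if seen_token:
                yield " ".join(current)
            current, seen_token = [], False
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # skip malformed rows
        seen_token = True
        form, coarse_tag = cols[1], cols[3]  # columns 2 and 4, zero-indexed
        if coarse_tag.startswith(tag_prefix):
            current.append(form.lower())
    if seen_token:
        # The last sentence may not be followed by a blank line.
        yield " ".join(current)


# Hypothetical sample in CoNLL-X layout (10 tab-separated columns).
sample = (
    "1\tThe\t_\tDT\tDT\t_\t3\tdet\t_\t_\n"
    "2\tquick\t_\tJJ\tJJ\t_\t3\tamod\t_\t_\n"
    "3\tfox\t_\tNN\tNN\t_\t4\tnsubj\t_\t_\n"
    "4\tjumps\t_\tVBZ\tVBZ\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tA\t_\tDT\tDT\t_\t3\tdet\t_\t_\n"
    "2\tlazy\t_\tJJ\tJJ\t_\t3\tamod\t_\t_\n"
    "3\tdog\t_\tNN\tNN\t_\t4\tnsubj\t_\t_\n"
    "4\tsleeps\t_\tVBZ\tVBZ\t_\t0\troot\t_\t_\n"
)

docs = list(adjectives_per_sentence(io.StringIO(sample)))
print(docs)
```

For a real file you would pass an open file object instead of the `io.StringIO` wrapper. The resulting list of strings can then go to scikit-learn's `TfidfVectorizer().fit_transform(docs)` like any other list of documents. Here each sentence is treated as one document; if your documents span several sentences, you would need to group the rows differently.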
P.S. I assumed your query was mostly to do with the CoNLL format, and not about how to extract the adjectives (which, of course, can be done by tab-splitting rows or regex matching -- there are several questions and answers on SO pertaining to the pythonic ways of doing those).