How to assign multiple tags to a token using OpenNLP?


I'm using OpenNLP, and it works fine for part-of-speech tagging when I do this:

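import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

try (InputStream modelIn = new FileInputStream("en-pos-maxent.bin")) {
  POSModel model = new POSModel(modelIn);
  POSTaggerME tagger = new POSTaggerME(model);
  String[] tags = tagger.tag(tokenList);
}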

So if tokens = [Test, Recipe, of, Incredible, Goodness, .], then tags = [ADJ, NOUN, ADP, ADJ, NOUN, PUNCT].

Can I add more tags than just the ones defined as parts of speech? For example, what if I want to add a tag for short words, products, food, and so on?

Would I need to create a custom POS model with my own definitions, run it in addition to the English POS model, and keep an additional tag array for each POS model that I run the sentence through?

I have tried what I described: defining my own model and running it so that I have multiple arrays. I was just wondering whether there is some other way to do this that might be better than what I tried.


There are 1 best solutions below

Jonathan On

I decided to tackle it this way. Whatever limited knowledge I have seems to be no limitation here.

I was using POSSample as the object that stores tags and tokens together. I created a different object, similar to POSSample, that stores the same tags and tokens in a HashMap but can also be expanded to hold whatever other data I want to put in there, such as lemmas (lems), custom tags, etc.

import java.util.HashMap;
import java.util.List;

public class TaggedSentence {
    private String sentence;
    // One list of values per tag type (tokens, POS tags, lems, ...)
    private HashMap<TagTypes, List<String>> tagHash = new HashMap<>();
}

The tag types I use as the hash key are just an enum whose values I can use for bitwise operations, so I can easily flag whether or not I want these extra tags to be scanned for and populated in my sentence tagger.

public enum TagTypes {
    TOKENS(0B001),
    TAGS(0B010),
    LEMS(0B100);
    // expand this list later as i need
    
    /**
     * Values are set for bitwise operator checks
     */
    public final int value;

    private TagTypes(int value) {
        this.value = value;
    }
}
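
The bitwise checks these values are meant for could look roughly like the sketch below. The `TagFlags` class and its `isSet` and `toSet` methods are hypothetical names chosen for illustration; they are not part of the code above or of OpenNLP. The enum is repeated only so the sketch compiles on its own.

```java
import java.util.EnumSet;

// Same enum as above, repeated so this sketch is self-contained.
enum TagTypes {
    TOKENS(0b001),
    TAGS(0b010),
    LEMS(0b100);

    final int value;

    TagTypes(int value) {
        this.value = value;
    }
}

// Hypothetical helper class (an assumption, not from the original answer).
final class TagFlags {
    private TagFlags() {}

    // True when the bit for `type` is set in the combined `flags` mask.
    static boolean isSet(int flags, TagTypes type) {
        return (flags & type.value) != 0;
    }

    // Expands a bitmask into the set of tag types whose bits are set.
    static EnumSet<TagTypes> toSet(int flags) {
        EnumSet<TagTypes> result = EnumSet.noneOf(TagTypes.class);
        for (TagTypes type : TagTypes.values()) {
            if (isSet(flags, type)) {
                result.add(type);
            }
        }
        return result;
    }
}
```

Because each constant occupies its own bit, masks combine with `|` and can be tested with `&` without the constants interfering with one another. Every new constant added to the enum needs the next unused power of two (0b1000, 0b10000, and so on).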

Done this way, LEMS, for instance, normally does not get populated unless my method call specifies it:

// only populate my object with tokens
TaggedSentence pos = nlpService.tokenizeSentence(sentence);

// populate my object with tokens, tags (since lems requires tags), and lems 
TaggedSentence pos = nlpService.tokenizeSentence(sentence, TagTypes.LEMS.value);

// populate my object with lems and any other tag type I later add to my enum (SOMEOTHERTYPE is a placeholder)
TaggedSentence pos = nlpService.tokenizeSentence(sentence, TagTypes.LEMS.value | TagTypes.SOMEOTHERTYPE.value);
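
The service behind these calls is not shown, so the following is only a minimal sketch of how it might honor the flags, including the rule that lems require tags. The `NlpService` class, the `resolve`, `put`, `get`, and `has` methods, and the use of `EnumMap` are all assumptions. The tokenizer, tagger, and lemmatizer are trivial placeholders marking where the real OpenNLP calls would go.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Same enum as above, repeated so this sketch is self-contained.
enum TagTypes {
    TOKENS(0b001),
    TAGS(0b010),
    LEMS(0b100);

    final int value;

    TagTypes(int value) {
        this.value = value;
    }
}

// Expanded version of the container above; the accessors are hypothetical.
class TaggedSentence {
    private final String sentence;
    private final Map<TagTypes, List<String>> tagHash = new EnumMap<>(TagTypes.class);

    TaggedSentence(String sentence) {
        this.sentence = sentence;
    }

    String getSentence() {
        return sentence;
    }

    void put(TagTypes type, List<String> values) {
        tagHash.put(type, values);
    }

    // Returns null when the given tag type was not populated.
    List<String> get(TagTypes type) {
        return tagHash.get(type);
    }

    boolean has(TagTypes type) {
        return tagHash.containsKey(type);
    }
}

// Hypothetical service; the real one would wrap the OpenNLP models.
class NlpService {

    // Adds the flags that the requested ones depend on:
    // lems need tags, and everything needs tokens.
    static int resolve(int flags) {
        if ((flags & TagTypes.LEMS.value) != 0) {
            flags |= TagTypes.TAGS.value;
        }
        return flags | TagTypes.TOKENS.value;
    }

    TaggedSentence tokenizeSentence(String sentence) {
        return tokenizeSentence(sentence, TagTypes.TOKENS.value);
    }

    TaggedSentence tokenizeSentence(String sentence, int flags) {
        int resolved = resolve(flags);
        TaggedSentence result = new TaggedSentence(sentence);

        // Placeholder: a whitespace split stands in for a real tokenizer.
        List<String> tokens = Arrays.asList(sentence.trim().split("\\s+"));
        result.put(TagTypes.TOKENS, tokens);

        if ((resolved & TagTypes.TAGS.value) != 0) {
            // Placeholder: a real implementation would call the POS tagger here.
            result.put(TagTypes.TAGS,
                    tokens.stream().map(t -> "X").collect(Collectors.toList()));
        }
        if ((resolved & TagTypes.LEMS.value) != 0) {
            // Placeholder: lowercasing stands in for a real lemmatizer.
            result.put(TagTypes.LEMS,
                    tokens.stream().map(String::toLowerCase).collect(Collectors.toList()));
        }
        return result;
    }
}
```

Resolving dependencies in one place keeps callers from having to remember that requesting lems also means requesting tags. An `EnumMap` is used here instead of a `HashMap` because the keys are enum constants, but either would work.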