Dealing with integer-valued features for CRF in mallet


I am just starting to use the SimpleTagger class in MALLET. My impression is that it expects binary features. The model I want to implement has positive integer-valued features, and I wonder how to implement this in MALLET. I have also heard that non-binary features need to be normalized for the model to make sense. I would appreciate any suggestions on how to do this.

P.S. Yes, I know there is a dedicated MALLET mailing list, but I have been waiting nearly a day for my subscription to be approved so that I can post there. I'm simply in a hurry.


Well, it's 6 years later now. If you're not in a hurry anymore, you could use the Java API to create your instances directly. A minimal example:

import cc.mallet.pipe.TokenSequence2FeatureVectorSequence;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;
import cc.mallet.types.LabelAlphabet;
import cc.mallet.types.LabelSequence;
import cc.mallet.types.Token;
import cc.mallet.types.TokenSequence;

// The class name is just a placeholder for this example
public class IntegerFeatureExample {

  private static Instance createInstance(LabelAlphabet labelAlphabet) {
    // observations and labels should be equal size for linear chain CRFs
    TokenSequence observations = new TokenSequence();
    LabelSequence labels = new LabelSequence(labelAlphabet);

    // One token, one label
    observations.add(createToken());
    labels.add("idk, some target or something");

    // Arguments: data, target, name, source
    return new Instance(observations, labels, "myInstance", null);
  }

  private static Token createToken() {
    Token token = new Token("exampleToken");

    // Note: properties are not used for computing (I think)
    token.setProperty("SOME_PROPERTY", "hello");

    // Any old double value; "SOME_FEATURE" is a placeholder feature name
    token.setFeatureValue("SOME_FEATURE", 666.0);

    // etc for more features ...

    return token;
  }

  public static void main(String[] args) {
    // Note the first arg is false to denote we *do not* deal with binary features
    InstanceList instanceList = new InstanceList(new TokenSequence2FeatureVectorSequence(false, false));

    LabelAlphabet labelAlphabet = new LabelAlphabet();
    // Converts our tokens to feature vectors
    instanceList.addThruPipe(createInstance(labelAlphabet));
  }
}
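
On the normalization point raised in the question: one common option (not something MALLET requires, as far as I know) is to min-max scale each integer feature to [0, 1] before passing it to setFeatureValue. Here is a minimal sketch in plain Java. The class and method names are made up for illustration, and it assumes you have all raw values for one feature available up front:

```java
import java.util.Arrays;

// Hypothetical helper, not part of MALLET
final class FeatureScaling {

    /** Min-max scales the raw integer values of one feature to the range [0, 1]. */
    static double[] minMaxScale(int[] rawValues) {
        double[] scaled = new double[rawValues.length];
        if (rawValues.length == 0) {
            return scaled;
        }
        int min = Arrays.stream(rawValues).min().getAsInt();
        int max = Arrays.stream(rawValues).max().getAsInt();
        if (max == min) {
            // All values are identical, so the feature carries no information: leave zeros
            return scaled;
        }
        double range = (double) max - min;
        for (int i = 0; i < rawValues.length; i++) {
            scaled[i] = ((double) rawValues[i] - min) / range;
        }
        return scaled;
    }

    public static void main(String[] args) {
        // Example: word lengths 1, 3 and 5 for three tokens
        System.out.println(Arrays.toString(minMaxScale(new int[] {1, 3, 5})));
    }
}
```

Each scaled value would then go into token.setFeatureValue in place of the raw count. Keep the min and max computed on the training data and reuse them at test time, so both sets are scaled the same way.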

Or, if you want to keep using SimpleTagger, just define one binary feature per integer value, like HAS_1_LETTER, HAS_2_LETTER, etc., though this seems tedious.
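
Generating those names in code takes some of the tedium out. A rough sketch in plain Java follows. It assumes the usual SimpleTagger input layout of one token per line, with whitespace-separated features followed by the label. The class, the cap parameter, and the HAS_n_OR_MORE_LETTER name are my own inventions for illustration:

```java
// Hypothetical helper, not part of MALLET
final class BinaryLengthFeatures {

    /**
     * Turns an integer value into a single binary feature name.
     * Values at or above the cap share one name, which keeps the feature set bounded.
     */
    static String lengthFeature(int length, int cap) {
        if (length < 1 || cap < 1) {
            throw new IllegalArgumentException("length and cap must be positive");
        }
        if (length >= cap) {
            return "HAS_" + cap + "_OR_MORE_LETTER";
        }
        return "HAS_" + length + "_LETTER";
    }

    /** Builds one input line: the token itself as a feature, the length feature, then the label. */
    static String taggerLine(String token, String label, int cap) {
        return token + " " + lengthFeature(token.length(), cap) + " " + label;
    }

    public static void main(String[] args) {
        System.out.println(taggerLine("cat", "NOUN", 5));
        System.out.println(taggerLine("elephant", "NOUN", 5));
    }
}
```

The cap is a design choice: without it, every distinct integer becomes its own feature, and rare large values get weights estimated from very few examples.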