"? I'm a scala newbie and, in fact, I only have to use one sc..." /> "? I'm a scala newbie and, in fact, I only have to use one sc..." /> "? I'm a scala newbie and, in fact, I only have to use one sc..."/>

Filtering out numbers when using ScalaNLP tokenizer


Is there a command in Scala to ignore all kinds of numbers, something like "IgnoreNumbers() ~>"?

I'm a Scala newbie and, in fact, I only need to use a single script in this language.

Thanks a lot for any help!

It's for the tokenizer in this example script, http://nlp.stanford.edu/software/tmt/tmt-0.4/examples/example-1-dataset.scala:

val tokenizer = {
  SimpleEnglishTokenizer() ~>            // Tokenize on spaces and punctuation
  CaseFolder() ~>                        // Lowercase everything
  WordsAndNumbersOnlyFilter() ~>         // Ignore non-words and non-numbers
  MinimumLengthFilter(3)                 // Take terms with >=3 characters
}

There is 1 best solution below


I've never used ScalaNLP, but it looks trivial to modify WordsAndNumbersOnlyFilter (or, better, to create a new type based on it) by simply dropping the TokenType.Number check, e.g.

case class WordsOnlyFilter() extends Transformer {
  // original from WordsAndNumbersOnlyFilter
  // override def apply(terms : Iterable[String]) =
  //   terms.filter(term => TokenType.Word.matches(term) || TokenType.Number.matches(term));

  // Modification that doesn't use/accept TokenType.Number
  override def apply(terms: Iterable[String]) =
    terms.filter(term => TokenType.Word.matches(term))
}

Then:

val tokenizer = {
  // ..
  WordsOnlyFilter() ~>         // Ignore non-words
  // ..
}
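
If you want to try the filtering idea on its own, without the ScalaNLP classes, here is a minimal sketch that uses only the Scala standard library. The names NumberFilterSketch, NumberLike and dropNumbers are made up for illustration and are not part of ScalaNLP or TMT. The regular expression is my own rough guess at what counts as a number, not the pattern behind TokenType.Number, so adjust it to your data.

```scala
object NumberFilterSketch {
  // Assumed definition of "a number": optional sign, digits, and optional
  // groups like ".5" or ",000". This is NOT ScalaNLP's TokenType.Number.
  private val NumberLike = """[+-]?\d+([.,]\d+)*""".r

  // Drop every term that consists entirely of a number-like string.
  // Terms that merely contain digits (e.g. "year2010") are kept.
  def dropNumbers(terms: Iterable[String]): Iterable[String] =
    terms.filterNot(term => NumberLike.pattern.matcher(term).matches())

  def main(args: Array[String]): Unit = {
    val terms = List("topic", "2010", "model", "3.14", "year2010")
    println(dropNumbers(terms).mkString(" ")) // prints "topic model year2010"
  }
}
```

Inside the TMT script itself you would still wrap this kind of check in a Transformer, as in WordsOnlyFilter above, so that it can be chained with ~>.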