I was wondering whether it is possible to tokenize text in MALLET into n-grams of sizes 1 and 2 (unigrams and bigrams)?
This is the code that I have used so far:
bin\mallet import-dir --input sample-data\web\en --output sample.txt --keep-sequence-bigrams --remove-stopwords
bin\mallet train-topics --input sample.txt --num-topics 20 --optimize-interval 10 --output-doc-topics sample_composition.txt --output-topic-keys sample_keys.txt
Thank you in advance.
The topic model trainer doesn't use the bigrams feature; supporting it directly would make the code much more complicated. One workaround is to modify the input data file before importing it, appending each adjacent word pair as a single joined token alongside the unigrams, so that, for example, a line like

the quick brown fox

would become

the quick brown fox the_quick quick_brown brown_fox

(the underscore-joined form here is just one convention for the bigram tokens).
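A minimal preprocessing sketch of this idea, assuming whitespace tokenization and an underscore-joining convention (neither is anything MALLET itself prescribes):

```python
def add_bigrams(line):
    """Append underscore-joined bigram tokens to a whitespace-tokenized line."""
    tokens = line.split()
    # Pair each token with its successor to form the bigrams.
    bigrams = ["_".join(pair) for pair in zip(tokens, tokens[1:])]
    return " ".join(tokens + bigrams)

if __name__ == "__main__":
    print(add_bigrams("the quick brown fox"))
    # → the quick brown fox the_quick quick_brown brown_fox
```

Running this over every line of each input file before `import-dir` gives the trainer both unigram and bigram "words" to assign to topics.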
The other option is a post-hoc report: passing --xml-topic-phrase-report FILENAME to train-topics identifies pairs of words that frequently occur together and get assigned to the same topic, for example:

bin\mallet train-topics --input sample.txt --num-topics 20 --xml-topic-phrase-report sample_phrases.xml