I was wondering whether it is possible to tokenize text in MALLET into n-grams of sizes 1 and 2 (unigrams and bigrams)?
This is the code that I have used so far:
bin\mallet import-dir --input sample-data\web\en --output sample.txt --keep-sequence-bigrams --remove-stopwords
bin\mallet train-topics --input sample.txt --num-topics 20 --optimize-interval 10 --output-doc-topics sample_composition.txt --output-topic-keys sample_keys.txt
Thank you in advance.
The topic model trainer doesn't use the bigrams feature; supporting it directly would make the code much more complicated. One workaround is to modify the input data file before importing it, appending each adjacent word pair as a single joined token alongside the unigrams, so that, for example, a line like

the quick brown fox

would become

the quick brown fox the_quick quick_brown brown_fox

(the underscore-joined form here is just one convention for the bigram tokens).
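A minimal preprocessing sketch of this idea, assuming whitespace tokenization and an underscore-joining convention (neither is anything MALLET itself prescribes):

```python
def add_bigrams(line):
    """Append underscore-joined bigram tokens to a whitespace-tokenized line."""
    tokens = line.split()
    # Pair each token with its successor to form the bigrams.
    bigrams = ["_".join(pair) for pair in zip(tokens, tokens[1:])]
    return " ".join(tokens + bigrams)

if __name__ == "__main__":
    print(add_bigrams("the quick brown fox"))
    # → the quick brown fox the_quick quick_brown brown_fox
```

Running this over every line of each input file before `import-dir` gives the trainer both unigram and bigram "words" to assign to topics.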
The other option is a post-hoc report: passing --xml-topic-phrase-report FILENAME to train-topics identifies pairs of words that frequently occur together and get assigned to the same topic, for example:

bin\mallet train-topics --input sample.txt --num-topics 20 --xml-topic-phrase-report sample_phrases.xml