I am subscribed to a mailing list where some of the messages are in languages other than English, which I cannot understand.
How do I filter the non-English messages to /dev/null using procmail and/or command line tools?
I use procmail to filter my email, so ideally any alternative tool could also be driven from a procmail recipe.
I'd prefer not to have to train my own language models.
One way is to use the Perl TextCat package from Gertjan van Noord.
The text_cat script outputs the most likely language for the mail. This recipe assumes text_cat has been installed under /usr/local/bin.
A simple procmail recipe can call the text_cat script on each message.
I've been running text_cat for a few years. There haven't been any non-English messages classified as English, that is, no false positives. I've not been rigorous about checking for false negatives.
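A minimal sketch of what such a recipe could look like; it is not necessarily the recipe described above. It assumes text_cat reads the text on standard input and prints a language name such as english (possibly several names when it is unsure). The variable name TEXTCAT_LANG is made up for illustration.

```
# Capture text_cat's guess for the message body only (flag b).
# TEXTCAT_LANG is a hypothetical variable name.
:0 b
TEXTCAT_LANG=| /usr/local/bin/text_cat

# Discard the message when text_cat printed a guess that does not
# mention english. The first condition keeps the mail when text_cat
# printed nothing at all, for example because it is not installed.
:0
* TEXTCAT_LANG ?? [a-z]
* ! TEXTCAT_LANG ?? english
/dev/null
```

While trying this out, it is probably safer to deliver to a spare mailbox instead of /dev/null, since any output that does not contain english (including a "don't know" answer from text_cat) causes the message to be dropped.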
A second way, as mentioned by tripleee in a comment, is to use the language categorisation provided by spamassassin, which also uses the text_cat script. Spamassassin will unwrap any MIME transfer encodings, which the vanilla text_cat version above won't.
An incompletely tested procmail recipe can filter on the spamassassin X-Spam-Languages header.
Warning: spamassassin will occasionally report more than one language for a message, which a recipe that expects a single language code does not account for.
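A minimal sketch of a recipe along these lines, equally untested against a live setup. It assumes spamassassin has already processed the message earlier in the procmailrc and added a header such as `X-Spam-Languages: en`, and that multiple guesses appear as space-separated codes on that header; both points are assumptions, not something confirmed above.

```
# Only act on mail that spamassassin has tagged with a language guess.
# Discard it unless "en" appears as one of the listed codes, either on
# its own or alongside other codes (assumed to be space-separated).
:0
* ^X-Spam-Languages:
* ! ^X-Spam-Languages:(.*[ ])?en([ ]|$)
/dev/null
```

Matching en as a whole word anywhere in the header means a message tagged with several languages is kept as long as English is among them. To drop such mixed results as well, the second condition would need to be tightened.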
Spamassassin Language Categorisation Configuration
Edit /etc/spamassassin/v310.pre and uncomment the line that loads the TextCat plugin, then configure the plugin in /etc/spamassassin/local.cf.
This recipe was incompletely tested with spamassassin version 3.4.2.
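For orientation, a sketch of what the two edits typically look like. The loadplugin line is the one shipped commented out in v310.pre. The local.cf settings are an assumption about a reasonable setup rather than the exact configuration used above.

```
# /etc/spamassassin/v310.pre: enable the language guesser plugin
loadplugin Mail::SpamAssassin::Plugin::TextCat

# /etc/spamassassin/local.cf: treat English as the expected language
# and report the guessed languages in an X-Spam-Languages header
ok_languages en
add_header all Languages _LANGUAGES_
```

After changing either file, spamassassin (or spamd, if that is what procmail calls) needs to be restarted or re-run for the new header to appear.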
To adapt these answers to a different language, substitute that language's name for english in the first case and its two-character language code for en in the second case.