How do I categorise non-English email using procmail and command line tools?

I am subscribed to a mailing list where some of the messages are in languages other than English, which I cannot read.

How do I filter the non-English messages to /dev/null using procmail and/or command line tools?

I use procmail to filter my email, so ideally any other tool should be usable from a procmail recipe.

I'd prefer not to have to train my own language models.

10
makeyourownmaker On BEST ANSWER

One way is to use the Perl TextCat package from Gertjan van Noord.

The text_cat script outputs the most likely language for the mail. This recipe assumes text_cat has been installed under /usr/local/bin.

Here is a simple procmail recipe to call the text_cat script:

:0
* ^Subject.*Jobs.*Board
{
    LANG_=`/usr/local/bin/text_cat`

    :0
    * ! LANG_ ?? ^english$
    /dev/null

    :0
    jobs/
}

I have been running text_cat this way for a few years. In that time no non-English message has been classified as English, that is, there have been no false positives. I have not rigorously checked for false negatives, i.e. English messages wrongly discarded.


A second way, as mentioned by tripleee in a comment, is to use the language categorisation provided by SpamAssassin, which is also based on text_cat. SpamAssassin unwraps any MIME transfer encodings, which the vanilla text_cat recipe above does not.

Here is an only partially tested procmail recipe for filtering on the SpamAssassin X-Spam-Languages header:

:0
* ^Subject.*Jobs.*Board
{
    # File non-English emails in the foreign/ folder using the SpamAssassin header
    # Test for a header other than exactly "X-Spam-Languages: en"
    :0
    * !^X-Spam-Languages: en$
    foreign/

    # Save english language mails in folder
    :0
    jobs/
}

Warning: spamassassin will occasionally provide multiple language categorisations like so:

X-Spam-Languages: en da ro

which the above recipe does not account for.
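
One way to cope with a multi-valued header is to look for the wanted code as a separate word instead of matching the whole line. The sketch below is a hypothetical shell helper (the name lists_language is made up for illustration) that reads a message on stdin and exits 0 if its X-Spam-Languages header lists the given code anywhere. Treating a message tagged "en da ro" as English is an assumption about the desired policy, not something SpamAssassin prescribes.

```shell
# Hypothetical helper, not part of procmail or SpamAssassin.
# Usage: lists_language CODE < message
# Exits 0 if the X-Spam-Languages header lists CODE, 1 otherwise.
lists_language() {
    awk -v want="$1" '
        # A blank line ends the header block, so stop reading there.
        /^$/ { exit }
        # Compare every word after the field name with the wanted code.
        tolower($1) == "x-spam-languages:" {
            for (i = 2; i <= NF; i++)
                if ($i == want) found = 1
        }
        # Exit status 0 when the code was seen, 1 when it was not.
        END { exit !found }
    '
}
```

Saved in an executable script that ends by calling lists_language "$@", this could be run from a procmail condition that tests a program's exit status (the "* ? command" form). Passing a different two-letter code keeps a different language.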

Spamassassin Language Categorisation Configuration

Edit /etc/spamassassin/v310.pre and uncomment the following line:

loadplugin Mail::SpamAssassin::Plugin::TextCat

Configure the plugin in /etc/spamassassin/local.cf:

ok_languages en       # I understand english
inactive_languages '' # Enable all languages
add_header all Languages _LANGUAGES_
# score UNWANTED_LANGUAGE_BODY 5 # Increase score - not necessary and not recommended 

This recipe was only partially tested, with SpamAssassin version 3.4.2.


To adapt these recipes to a different language, substitute that language's name for english in the first recipe and its two-character language code for en in the second.

3
tripleee On

Many modern email clients identify the character set of the email message, though not usually its language. If you want to discard Japanese, Chinese, Korean, and Russian messages, you could try something like

:0HB
* ^Content-type:[  ]*text/[^/;]*;[  ]*charset="?(iso-2022|ks-c|gb|koi|cp-1251)
foreign

Because some clients forget to change the character set when they write in English, this is likely to produce some false positives, so I recommend saving to a folder and reviewing it periodically. The opposite problem is harder; many foreign languages use the same character set as English, and thus can't be identified like this with any reliability.
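
When reviewing the folder, it can help to see which character sets a saved message actually declares. This is a minimal sketch, assuming a grep that supports -o (GNU, BSD and BusyBox grep do). The function name declared_charsets is hypothetical, and the match is naive: it picks up any charset= text, including text that happens to appear in the body.

```shell
# Hypothetical review helper: print each charset declared in the message
# on stdin, lower-cased, one per line.
declared_charsets() {
    # Pull out every charset=value or charset="value" occurrence ...
    grep -Eio 'charset="?[^";[:space:]]+' |
        # ... strip the parameter name and an opening quote ...
        sed -e 's/^[^=]*=//' -e 's/^"//' |
        # ... and normalise case so KOI8-R and koi8-r compare equal.
        tr '[:upper:]' '[:lower:]'
}
```

For example, running declared_charsets on a message whose Content-Type header says charset="KOI8-R" prints koi8-r.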