Approximate regex in Python with TRE: strange Unicode behavior


I am trying to use the TRE library in Python to match misspelled input.
It is important that it handles UTF-8 encoded strings well.

An example:
The German capital's name is Berlin, but from the pronunciation it would be the same if people wrote "Bärlin".

It is working so far, but if a non-ASCII character is at the first or second position of the detected string, neither the range nor the detected string itself is correct.

# -*- coding: utf-8 -*-
import tre

def apro_match(word, lst):
    fz = tre.Fuzzyness(maxerr=3)   # allow up to three errors
    pt = tre.compile(word)
    for s in lst:                  # use the parameter, not the global 'l'
        m = pt.search(s, fz)
        if m:
            print m.groups()[0], ' ', m[0]

if __name__ == '__main__':
    string1 = u'Berlín'.encode('utf-8')
    string2 = u'Bärlin'.encode('utf-8')    
    string3 = u'B\xe4rlin'.encode('utf-8')
    string4 = u'Berlän'.encode('utf-8')
    string5 = u'London, Paris, Bärlin'.encode('utf-8')
    string6 = u'äerlin'.encode('utf-8')
    string7 = u'Beälin'.encode('utf-8')

    l = ['Moskau', string1, string2, string3, string4, string5, string6, string7]

    print '\n'*2
    print "apro_match('Berlin', l)"
    print "="*20
    apro_match('Berlin', l)
    print '\n'*2

    print "apro_match('.*Berlin', l)"
    print "="*20
    apro_match('.*Berlin', l)

Output

apro_match('Berlin', l)
====================
(0, 7)   Berlín
(1, 7)   ärlin
(1, 7)   ärlin
(0, 7)   Berlän
(16, 22)   ärlin
(1, 7)   ?erlin
(0, 7)   Beälin



apro_match('.*Berlin', l)
====================
(0, 7)   Berlín
(0, 7)   Bärlin
(0, 7)   Bärlin
(0, 7)   Berlän
(0, 22)   London, Paris, Bärlin
(0, 7)   äerlin
(0, 7)   Beälin

Note that for the regex '.*Berlin' everything works fine, while for the regex 'Berlin'

u'Bärlin'.encode('utf-8')    
u'B\xe4rlin'.encode('utf-8')
u'äerlin'.encode('utf-8')

are not working, while

u'Berlín'.encode('utf-8')
u'Berlän'.encode('utf-8')
u'London, Paris, Bärlin'.encode('utf-8')
u'Beälin'.encode('utf-8')

work as expected.
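
For reference, the encoded test strings are 7 bytes long even though they are only 6 characters, which matches the suspicious (0, 7) ranges:

>>> len(u'Bärlin'.encode('utf-8'))  # ä takes two bytes in UTF-8
7
>>> len(u'Bärlin')
6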

Is there something I am doing wrong with the encoding? Do you know any trick?

3 Answers

BEST ANSWER

You could use the new regex library; it supports Unicode 6.0 and fuzzy matching:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from itertools import ifilter, imap
import regex as re

def apro_match(word_re, lines, fuzzy='e<=1'):
    # Wrap the pattern in a fuzzy group; 'e<=1' allows at most one error.
    search = re.compile(ur'('+word_re+'){'+fuzzy+'}').search
    for m in ifilter(None, imap(search, lines)):
        print m.span(), m[0]

def main():
    lst = u'Moskau Berlín Bärlin B\xe4rlin Berlän'.split()
    lst += [u'London, Paris, Bärlin']
    lst += u'äerlin Beälin'.split()
    print
    print "apro_match('Berlin', lst)"
    print "="*25
    apro_match('Berlin', lst)
    print 
    print "apro_match('.*Berlin', lst)"
    print "="*27
    apro_match('.*Berlin', lst)

if __name__ == '__main__':
    main()

'e<=1' means that at most one error of any kind is permitted. There are three types of errors:

  • Insertion, indicated by "i"
  • Deletion, indicated by "d"
  • Substitution, indicated by "s"

Output

apro_match('Berlin', lst)
=========================
(0, 6) Berlín
(0, 6) Bärlin
(0, 6) Bärlin
(0, 6) Berlän
(15, 21) Bärlin
(0, 6) äerlin
(0, 6) Beälin

apro_match('.*Berlin', lst)
===========================
(0, 6) Berlín
(0, 6) Bärlin
(0, 6) Bärlin
(0, 6) Berlän
(0, 21) London, Paris, Bärlin
(0, 6) äerlin
(0, 6) Beälin
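
The error types can also be constrained individually. A minimal sketch, assuming the fuzzy-constraint syntax from the regex module's documentation (the pattern and count here are arbitrary examples):

# -*- coding: utf-8 -*-
import regex

# Allow at most one substitution; unspecified error types
# (insertion, deletion) are not permitted at all.
m = regex.search(ur'(Berlin){s<=1}', u'Bärlin')
if m:
    print m.span(), m.group()  # -> (0, 6) Bärlin
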
ANSWER

Internally, TRE works at the byte level and returns byte positions. I had the same issue a while ago - there is no trick!

I modified the Python bindings: I added a utf8 function, a function that builds a map from byte position to character position, and a small wrapper. Your test case works as expected with this wrapper. I have not released the modifications - it was more of a quick hack while testing TRE - but if you want them, just let me know.
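
For illustration, such a byte-to-character map can be built by counting UTF-8 lead bytes. This is a sketch of the idea, not the actual unreleased patch (the helper name is made up):

# -*- coding: utf-8 -*-

def byte_to_char_map(utf8_bytes):
    # Map every byte offset (including the end offset) to the index of
    # the character it belongs to. UTF-8 continuation bytes all have the
    # bit pattern 10xxxxxx, so they never start a new character.
    mapping = {}
    char_index = -1
    for byte_index in range(len(utf8_bytes)):
        if ord(utf8_bytes[byte_index]) & 0xC0 != 0x80:
            char_index += 1  # a lead byte: a new character starts here
        mapping[byte_index] = char_index
    mapping[len(utf8_bytes)] = char_index + 1
    return mapping

s = u'Bärlin'.encode('utf-8')
bmap = byte_to_char_map(s)
print bmap[1], bmap[7]  # TRE's byte range (1, 7) -> character range (1, 6)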

AFAIK TRE hasn't been updated for quite a while, and there are still unfixed bugs in the current release (0.8.0) relating to pattern matching towards the end of a string (e.g. searching "2004 " with the pattern "2004$" gives a cost of 2, while the expected cost is 1).

As others have pointed out, for Python the new regex module seems quite interesting!

ANSWER

The link that you gave points to a blog article, which in turn references another blog article about the most recent release; the latter has many grumbly comments, including one suggesting that the package doesn't work with "non-Latin" (whatever that means) encodings. What leads you to believe that TRE works with UTF-8-encoded text (by working at the character level rather than the byte level)?

You don't tell us how many errors (insertion, deletion, replacement) are accepted as a fuzzy match, nor whether it is using the char routines or the wchar routines. Do you really expect potential answerers to download the package and read the code of the Python interface?

One would expect that if there are wchar C++ routines available, a Python interface would include bindings that did Python unicode <-> Python str (encoded in UTF-16LE) <-> C++ wchar -- not so?

Given that "working" matches for the 6-character test cases come back with (0, 7) (their UTF-8 encodings are 7 bytes long), and one not-working case (string6) splits up a two-byte character (it prints as a ? because the result is not valid UTF-8), it seems that TRE is working in byte (char) encoding-agnostic mode -- not a very good idea at all.
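
Concretely, the (1, 7) reported for string6 slices off the first byte of the two-byte 'ä':

>>> s = u'äerlin'.encode('utf-8')  # 'ä' encodes to the two bytes '\xc3\xa4'
>>> s[1:7]                         # the byte range TRE reported
'\xa4erlin'
>>> s[1:7].decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa4 in position 0: invalid start byte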

Note that if all else fails and all your input data is in German, you could try using latin1 or cp1252 encoding with the byte mode.
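
For instance, a minimal sketch reusing the calls from the question's own snippet (assuming the same TRE bindings); in latin1 every German character is a single byte, so byte offsets and character offsets coincide:

# -*- coding: utf-8 -*-
import tre

fz = tre.Fuzzyness(maxerr=3)
pt = tre.compile('Berlin')
m = pt.search(u'Bärlin'.encode('latin1'), fz)  # one byte per character
if m:
    print m.groups()[0], m[0].decode('latin1')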

Some further remarks:

Your string3 is redundant -- it is the same as string2.

Your assertion that string5 "works" seems to be inconsistent with your assertions that string2 and string3 "work".

Your test coverage is sparse; it needs several don't-match cases that are much closer to matching than "Moskau"!

You should ensure that it is "working" with ASCII-only data first; here are some test cases:

Berlxn Berlxyn
Bxrlin Bxyrlin
xerlin xyerlin
Bexlin Bexylin
xBerlin xyBerlin
Bxerlin Bxyerlin
Berlinx Berlinxy
erlin Brlin Berli

Then run it with non-ASCII characters instead of each of x and y in the above list.
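
A quick way to generate those variants (a sketch; the substitute characters ä and ö are arbitrary choices):

# -*- coding: utf-8 -*-
# Substitute non-ASCII characters for the x/y placeholders above.
templates = ['Berlxn', 'Bxrlin', 'xerlin', 'Bexlin', 'xBerlin',
             'Bxerlin', 'Berlinx', 'Berlxyn', 'Bxyrlin', 'xyerlin',
             'Bexylin', 'xyBerlin', 'Bxyerlin', 'Berlinxy']
for t in templates:
    print t.replace('x', u'ä'.encode('utf-8')).replace('y', u'ö'.encode('utf-8'))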

Using a pattern like ".*Berlin" is not very useful for diagnostic purposes, especially when you have no meaningful "should not match" test cases.