How can strings with non-ASCII characters be retrieved with OptParse?


I'm using the OptParse module to retrieve a string value. OptParse only supports str typed strings, not unicode ones.

So let's say I start my script with:

./someScript --some-option ééééé

Because the value comes back as a str, French characters such as 'é' trigger a UnicodeDecodeError when the code reads it:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 99: ordinal not in range(128)

I played around a bit with the unicode built-in function, but either I get an error, or the character disappears:

>>> unicode('é');
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> unicode('é', errors='ignore');
u''

Is there anything I can do to use OptParse to retrieve unicode/utf-8 strings?

It seems that the string can be retrieved and printed OK, but then I try to use that string with SQLite (using the APSW module), and it tries to convert to unicode somehow with cursor.execute("..."), and then the error occurs.

Here is a sample program that causes the error:

#!/usr/bin/python
# coding: utf-8

import os, sys, optparse
parser = optparse.OptionParser()
parser.add_option("--some-option")
(opts, args) = parser.parse_args()
print unicode(opts.some_option)

There are 4 best solutions below

Mark Tolonen On BEST ANSWER

Input is returned in the console encoding, so based on your updated example, use:

print opts.some_option.decode(sys.stdin.encoding)

unicode(opts.some_option) fails because, without an explicit encoding, unicode() decodes with the default codec, which is ASCII in Python 2.
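One caveat: under Python 2, sys.stdin.encoding can be None when standard input is not a terminal (for example when the script is run from a pipe). A minimal sketch of a fallback follows; console_encoding and decode_option are hypothetical helper names of mine, not part of optparse, and the UTF-8 default is an assumption you may need to change:

```python
import locale
import sys


def console_encoding(default="utf-8"):
    """Best guess at the encoding used for command-line input."""
    # sys.stdin.encoding may be None (e.g. piped stdin under Python 2),
    # so fall back to the locale's preferred encoding, then to a default.
    return (getattr(sys.stdin, "encoding", None)
            or locale.getpreferredencoding()
            or default)


def decode_option(value, encoding=None):
    """Return value as text; byte strings are decoded, anything else passes through."""
    if isinstance(value, bytes):  # bytes is an alias of str in Python 2
        return value.decode(encoding or console_encoding())
    return value
```

With this, the call in the question would become decode_option(opts.some_option), and an option that was not supplied (None) is returned unchanged.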

Woot4Moo On

I believe your error is related to the following:

For example, to write Unicode literals including the Euro currency symbol, the ISO-8859-15 encoding can be used, with the Euro symbol having the ordinal value 164. This script will print the value 8364 (the Unicode codepoint corresponding to the Euro symbol) and then exit:

# -*- coding: iso-8859-15 -*-

currency = u"€"
print ord(currency)

jro On

You could decode the arguments before the parser handles them. Taking your example:

#!/usr/bin/python
# coding: utf-8
import os, sys, optparse
parser = optparse.OptionParser()
parser.add_option("--some-option")

# Decode the command line arguments to unicode
for i, a in enumerate(sys.argv):
    sys.argv[i] = a.decode('ISO-8859-15')

(opts, args) = parser.parse_args()
print type(opts.some_option), opts.some_option

This gives the following output:

C:\workspace>python file.py --some-option préférer
<type 'unicode'> préférer

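I've chosen the ISO/IEC 8859-15 code page because it seemed the most appropriate for French text. Adapt if needed: the 0xc3 byte in your traceback suggests your terminal sends UTF-8, in which case decode with 'utf-8' instead.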
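If you would rather not modify sys.argv in place, the same idea can be sketched by decoding a copy of the arguments and passing it to parse_args explicitly. Here decode_args is a hypothetical helper, and raw_args is made-up input standing in for what a UTF-8 terminal would pass; in a real script you would pass sys.argv[1:] and the encoding that matches your terminal:

```python
import optparse


def decode_args(argv, encoding):
    """Return a copy of argv with byte strings decoded to text."""
    return [a.decode(encoding) if isinstance(a, bytes) else a for a in argv]


parser = optparse.OptionParser()
parser.add_option("--some-option")

# Hypothetical input: the bytes a UTF-8 terminal would send for "préférer".
raw_args = [b"--some-option", b"pr\xc3\xa9f\xc3\xa9rer"]

# parse_args accepts an explicit argument list (without the program name).
opts, args = parser.parse_args(decode_args(raw_args, "utf-8"))
```

Arguments that are already text pass through untouched, so the helper is safe to call on a list that has been decoded before.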

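lionyue On

#!/usr/bin/python
# coding: utf-8

import os, sys, optparse

# Python 2 only: reload(sys) restores setdefaultencoding, which site.py deletes at startup.
reload(sys)
# Make implicit str -> unicode conversions decode as UTF-8 instead of ASCII.
sys.setdefaultencoding('utf-8')

parser = optparse.OptionParser()
parser.add_option(u"--some-option")
(opts, args) = parser.parse_args()
# The parsed options object has no print_help(); print the decoded option value instead.
print unicode(opts.some_option)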