I've written a GUI that accepts Japanese input; when you choose File > Parse, it writes the input to a text file. That text file is then run through MeCab, which puts spaces between the words. After that, the result is supposed to be written back to the text file so it can be displayed in another GUI window.
The problem is that it won't write the parsed data to the text file. It has no problem writing the file the first time, and it prints the parsed output to IDLE without trouble. Here are the parser and the error:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import MeCab
import codecs
read_from = open("pholder.txt").read()
mecab = MeCab.Tagger("-Owakati")
output = mecab.parse(read_from)
print output
text = output
write_to = codecs.open("pholder.txt", "w", "utf-8")
write_to.write(text)
write_to.close()
Traceback (most recent call last):
  File "C:\...\mecabSpaces.py", line 16, in <module>
    write_to.write(text)
  File "C:\...\codecs.py", line 691, in write
    return self.writer.write(data)
  File "C:\...\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
The parsed data isn't unicode; it's a byte string.
So when you try to write the data to the file, the codecs writer first tries to decode it to unicode before encoding it to utf-8.
Since the default codec is ascii, but the data is actually utf-8, it chokes on the first byte with a value of 128 or above.
You should .decode('utf-8') the returned data, or else use a mecab method that returns unicode data.
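As a rough illustration of the decode step, here is a minimal sketch that does not call MeCab at all. It assumes, as described above, that the parser hands back a UTF-8 byte string; `parsed_bytes` is a hypothetical stand-in for the value of `mecab.parse(read_from)`, and the file is written to a temporary directory rather than the real `pholder.txt`.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

# Hypothetical stand-in for the parser output: UTF-8 bytes, not unicode.
# The first byte is 0xe4, the same byte named in the traceback.
parsed_bytes = u"\u4eca\u65e5 \u306f\n".encode("utf-8")

# Decoding UTF-8 bytes with the ascii codec fails on the first byte >= 128.
# This is the decode that Python 2 attempts implicitly inside the writer.
try:
    parsed_bytes.decode("ascii")
    implicit_decode_failed = False
except UnicodeDecodeError:
    implicit_decode_failed = True

# The fix: decode explicitly with the encoding the bytes are really in.
text = parsed_bytes.decode("utf-8")

# Temporary path used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "pholder.txt")

# The writer now receives unicode, so it only has to encode to utf-8.
with codecs.open(path, "w", "utf-8") as write_to:
    write_to.write(text)

# Read the file back as unicode to check the round trip.
with codecs.open(path, "r", "utf-8") as read_from_file:
    read_back = read_from_file.read()
```

In the original script the equivalent change would presumably be `text = output.decode('utf-8')` just before the write, provided the output really is UTF-8.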