UnicodeDecodeError: 'ascii' codec can't decode, with gensim, python3.5


I am using Python 3.5 on both Windows and Linux but get the same error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 0: ordinal not in range(128). The error log is the following:

Reloaded modules: lazylinker_ext
Traceback (most recent call last):

  File "<ipython-input-2-d60a2349532e>", line 1, in <module>
    runfile('C:/Users/YZC/Google     Drive/sunday/data/RA/data_20100101_20150622/w2v_coherence.py',     wdir='C:/Users/YZC/Google Drive/sunday/data/RA/data_20100101_20150622')

  File "C:\Users\YZC\Anaconda3\lib\site-    packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
    execfile(filename, namespace)

  File "C:\Users\YZC\Anaconda3\lib\site-    packages\spyderlib\widgets\externalshell\sitecustomize.py", line 88, in execfile
    exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)

  File "C:/Users/YZC/Google     Drive/sunday/data/RA/data_20100101_20150622/w2v_coherence.py", line 70, in     <module>
    model = gensim.models.Word2Vec.load('model_all_no_lemma')

  File "C:\Users\YZC\Anaconda3\lib\site-packages\gensim\models\word2vec.py",     line 1485, in load
    model = super(Word2Vec, cls).load(*args, **kwargs)

  File "C:\Users\YZC\Anaconda3\lib\site-packages\gensim\utils.py", line 248,     in load
    obj = unpickle(fname)

  File "C:\Users\YZC\Anaconda3\lib\site-packages\gensim\utils.py", line 912, in unpickle
    return _pickle.loads(f.read())

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 0:     ordinal not in range(128)

  1. I checked and found the default encoding is utf-8:

         import sys
         sys.getdefaultencoding()
         Out[2]: 'utf-8'

  2. When reading the file, I also added .decode('utf-8').
  3. I did add a shebang line at the beginning of the file and declared utf-8 there, so I really don't know why Python couldn't read the file.

Can anybody help me out?

Here is the code:

# -*- coding: utf-8 -*-
import gensim
import csv
import numpy as np
import math
import string
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob, Word



class SpeechParser(object):

    def __init__(self, filename):
        self.filename = filename
        self.lemmatize = WordNetLemmatizer().lemmatize
        self.cached_stopwords = stopwords.words('english')

    def __iter__(self):
        # Text mode with an explicit encoding: in Python 3, binary mode
        # ('rb') does not accept an encoding argument.
        with open(self.filename, 'r', encoding='utf-8', newline='') as csvfile:
            file_reader = csv.reader(csvfile, delimiter=',', quotechar='|')
            headers = next(file_reader)  # Python 3: next(reader), not reader.next()
            for row in file_reader:
                parsed_row = self.parse_speech(row[-2])
                yield parsed_row

    def parse_speech(self, row):
        # Python 3: str.translate() takes a translation table (not the
        # Python 2 (None, deletechars) form), and str has no .decode().
        speech_words = (row.replace('\r\n', ' ').strip().lower()
                        .translate(str.maketrans('', '', string.punctuation)))
        return speech_words.split()

    # -- source: https://github.com/prateekpg2455/U.S-Presidential-Speeches/blob/master/speech.py --
    def pos(self, tag):
        if tag.startswith('J'):
            return wordnet.ADJ
        elif tag.startswith('V'):
            return wordnet.VERB
        elif tag.startswith('N'):
            return wordnet.NOUN
        elif tag.startswith('R'):
            return wordnet.ADV
        else:
            return ''

if __name__ == '__main__':

    # instantiate object
    sentences = SpeechParser("sample.csv")

    # load an existing model
    model = gensim.models.Word2Vec.load('model_all_no_lemma')



    print('\n-----------------------------------------------------------')
    print('MODEL:\t{0}'.format(model))

    vocab = model.vocab

    # print log-probability of first 10 sentences
    row_count = 0
    print('\n------------- Scores for first 10 documents: -------------')
    for doc in sentences: 
        print(sum(model.score(doc))/len(doc))
        row_count += 1
        if row_count >= 10:  # stop after the first 10 documents
            break
    print('\n-----------------------------------------------------------')

There is 1 best solution below.


It looks like a bug in Gensim that you hit when loading, under Python 3, a pickle file written by Python 2 that contains non-ASCII byte strings.

The unpickle is happening when you call:

model = gensim.models.Word2Vec.load('model_all_no_lemma')

In Python 3, the unpickler must convert legacy Python 2 byte strings to (Unicode) strings, and its default action is to decode them as 'ASCII' in strict mode.
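You can reproduce this without gensim. The bytes below are a minimal hand-built example of what Python 2's pickle produces at protocol 0 for a UTF-8-encoded byte string (here, the two bytes for 'é'); only the encoding argument changes the outcome:

    import pickle

    # Equivalent to Python 2's pickle.dumps('\xc3\xa9'): a legacy
    # byte-string opcode holding the UTF-8 bytes for 'é'.
    py2_data = b"S'\\xc3\\xa9'\np0\n."

    pickle.loads(py2_data)
    # UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0

    pickle.loads(py2_data, encoding='utf-8')   # -> 'é'
    pickle.loads(py2_data, encoding='bytes')   # -> b'\xc3\xa9'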

The fix will be dependent on the encoding in your original pickle file, and it will require you to patch the gensim code.

I'm not familiar with gensim so you will have to try the following two options:

Force UTF-8

Chances are, your non-ASCII data is in UTF-8 format. (If UTF-8 also fails to decode, 'latin1' accepts any byte sequence and can serve as a fallback.)

  1. Edit C:\Users\YZC\Anaconda3\lib\site-packages\gensim\utils.py
  2. Go to line 912
  3. Change the line to read:

    return _pickle.loads(f.read(), encoding='utf-8')
    

Byte mode

Gensim in Python 3 may happily work with byte strings:

  1. Edit C:\Users\YZC\Anaconda3\lib\site-packages\gensim\utils.py
  2. Go to line 912
  3. Change the line to read:

    return _pickle.loads(f.read(), encoding='bytes')
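Note that with encoding='bytes' the strings inside the loaded model (e.g. its vocabulary keys) come back as bytes objects, so downstream code that expects str may need adjusting.

If you would rather not edit the installed package, the same UTF-8 fix can be applied as a runtime monkey-patch. This is only a sketch, assuming the gensim version from your traceback, where Word2Vec.load() ends up calling the module-level gensim.utils.unpickle():

    import pickle
    import gensim
    import gensim.utils

    def unpickle_utf8(fname):
        # Replacement for gensim.utils.unpickle that decodes legacy
        # Python 2 byte strings as UTF-8 instead of strict ASCII.
        # Plain open() is enough for a local model file.
        with open(fname, 'rb') as f:
            return pickle.loads(f.read(), encoding='utf-8')

    gensim.utils.unpickle = unpickle_utf8  # patch before calling load()
    model = gensim.models.Word2Vec.load('model_all_no_lemma')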