weird characters in utf-8 encoded file


I used tweepy to download tweets in Spanish and then write them into a CSV file. I used the code below to do this:

while True:
    try:
        for tweet in tweets:
            print tweet.created_at, tweet.text.encode('utf-8')
            csvWriter.writerow([tweet.created_at, tweet.id_str, tweet.author.name.encode('utf-8'), tweet.author.screen_name.encode('utf-8'),
                tweet.user.location.encode('utf-8'), tweet.coordinates, tweet.text.encode('utf-8'), tweet.retweet_count, tweet.favorite_count])
    except tweepy.TweepError:
        pass  # placeholder: the body of this handler is not shown in the post

Now the text fields in the file contain weird characters. For example, México, D.F. appears as Mí©xico, D.F. I tried exporting the file as UTF-8 from Numbers, but this changes the same string to: Mí©xico, D.F.

For other tweets I also get something like this: RT @taniarin: _ôÖ‰_ôÖ‰_ôÖ‰_ôÖ‰ #UberSeQueda.
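
To illustrate the kind of corruption shown above, here is a minimal sketch (Python 3, not the code I ran) of one way such characters can arise: the text is encoded as UTF-8 but then decoded with a single-byte codec. Latin-1 and Mac Roman are assumptions chosen for illustration; I have not confirmed that either is what Numbers or pandas is actually doing.

```python
# Sketch: UTF-8 bytes misread with a single-byte codec produce mojibake.
# The two single-byte codecs below are assumptions, not a confirmed diagnosis.
text = "M\u00e9xico, D.F."            # "México, D.F."
utf8_bytes = text.encode("utf-8")     # "é" becomes the two bytes C3 A9

# Each of the two bytes is turned into its own character.
as_latin1 = utf8_bytes.decode("latin-1")
as_mac_roman = utf8_bytes.decode("mac_roman")

print(repr(utf8_bytes))
print(as_latin1)
print(as_mac_roman)

# If this is the cause, reversing the wrong decode recovers the original text.
repaired = as_latin1.encode("latin-1").decode("utf-8")
print(repaired)
```

In both cases the single character "é" turns into two characters, the second of which is "©", which resembles what I see in the file.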

I am using pandas to read the file with this:

pd.read_csv("uber_dataFULL_utf8.csv", encoding='utf-8')

but it doesn't seem to work.

I don't know exactly what the problem is. I ran chardet on the file and it detects the encoding as UTF-8.
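
As a cross-check on chardet's guess, the file can also be decoded strictly as UTF-8. The sketch below (Python 3) uses a hypothetical helper, `is_strict_utf8`, and two throwaway temporary files instead of my real CSV. Note that passing this check only shows the bytes are valid UTF-8; text that was already garbled and then saved again as UTF-8 would pass too.

```python
import os
import tempfile


def is_strict_utf8(path):
    """Return True if the whole file decodes as UTF-8 without errors."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def write_temp(data):
    """Write bytes to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


sample = "M\u00e9xico, D.F."  # "México, D.F."
utf8_path = write_temp(sample.encode("utf-8"))
latin1_path = write_temp(sample.encode("latin-1"))

utf8_ok = is_strict_utf8(utf8_path)      # file written as UTF-8
latin1_ok = is_strict_utf8(latin1_path)  # file written as Latin-1
print(utf8_ok, latin1_ok)

# Remove the temporary files again.
os.remove(utf8_path)
os.remove(latin1_path)
```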

Thank you!
