Why doesn't my text-cleaning function work without decoding to UTF-8?

82 Views Asked by Ahmed Alashrafy At 05 January 2017 at 08:03

I wrote the following function in Python 2.7 to clean text, but it doesn't work unless I first decode the tweet variable from UTF-8:

# -*- coding: utf-8 -*-
import re
def clean_tweet(tweet):
    # Replace every character outside U+0622..U+064A (Arabic letters) with a space.
    tweet = re.sub(u"[^\u0622-\u064A]", ' ', tweet, flags=re.U)
    return tweet
if __name__ == "__main__":
      s="sadfas    سيبس sdfgsdfg/dfgdfg ffeee منت   منشس      يت??بمنشس//تبي منشكسميكمنشسكيمنك ٌاإلا رًاٌااًٌَُ"
      print "not working "+clean_tweet(s)
      print "working "+clean_tweet(s.decode("utf-8"))

Could anyone explain why? I don't want to decode, because it makes manipulating the text in an SFrame in GraphLab too slow.
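To make the difference between the two calls concrete, here is a small sketch of what the pattern sees in each case. It assumes the regex engine walks an undecoded byte string one byte at a time. The `latin-1` round trip is only a stand-in used here to expose each byte as its own character, and the names `NOT_ARABIC`, `bytes_as_chars`, `cleaned_bytes` and `cleaned_text` are made up for illustration.

```python
# -*- coding: utf-8 -*-
import re

# Same character class as in clean_tweet: keep only U+0622..U+064A.
NOT_ARABIC = re.compile(u"[^\u0622-\u064A]")

text = u"ab \u0633\u064a"   # "ab " followed by two Arabic letters
raw = text.encode("utf-8")  # the undecoded UTF-8 bytes

# View each byte as its own character (code points 0..255). This is an
# assumption about how an undecoded byte string is matched, not a quote
# from the re documentation.
bytes_as_chars = raw.decode("latin-1")

max_byte = max(bytearray(raw))  # largest single byte value in the input

# No byte value can reach 0x0622, so every position gets replaced.
cleaned_bytes = NOT_ARABIC.sub(u" ", bytes_as_chars)

# On decoded text the Arabic letters are single code points inside the range.
cleaned_text = NOT_ARABIC.sub(u" ", text)

print(len(text), len(raw), hex(max_byte))
print(repr(cleaned_bytes))
print(repr(cleaned_text))
```

If that assumption holds, each Arabic letter arrives as two bytes that both fall below U+0622, so the "not working" call blanks out the whole string, while the decoded call keeps the Arabic letters.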
