Extract all possible emoticons from a Python list


Objective

I am trying to extract all possible emoticons from a Unicode word list. I am using Python 3 with an Anaconda installation, so I cannot use a package such as emoji.py.

Here is a sample bag-of-words (BoW) word list:

lst = ['✅','türkçe','Çile','ısp','İst','ğ','some','#','@','@one','#thing','','1','41','ç','ö','⏱','⏱','','₺','€',':)',':/']

Expected output is like this:

out = ['✅','⏱', '⏱','']

Attempt 1

List comprehension that keeps every word containing at least one non-ASCII character:

[w for w in lst if len(w) != len(w.encode())]

However, this does not give the desired output, because the text contains non-ASCII letters. Currency symbols are also kept, and they are not emoticons.

['✅', 'türkçe', 'Çile', 'ısp', 'İst', 'ğ', 'ç', 'ö', '⏱', '⏱', '', '₺', '€']
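To see why this check is too broad, here is a small sketch that compares the character count with the UTF-8 byte count for a few words. The sample words and the names `samples` and `byte_lengths` are chosen only for illustration.

```python
# Compare character count with UTF-8 byte count for a few sample words.
# "\u20ac" is the euro sign, "\u2705" is the check mark from the list above.
samples = ["some", "ç", "\u20ac", "\u2705"]

for w in samples:
    n_chars = len(w)
    n_bytes = len(w.encode("utf-8"))
    # The last column is True whenever the word contains a non-ASCII character.
    print(repr(w), n_chars, n_bytes, n_chars != n_bytes)

# Byte length per word, kept in a dict so it can be inspected afterwards.
byte_lengths = {w: len(w.encode("utf-8")) for w in samples}
```

Any character outside ASCII takes more than one byte in UTF-8, so the test cannot tell a Turkish letter or a currency sign apart from an emoticon.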

Attempt 2

Using the NLTK emoticons regular expression:

from nltk.tokenize.casual import EMOTICON_RE
EMOTICON_RE.findall(' '.join(lst))

However, EMOTICON_RE can only extract expressions such as :) :/ :(

Here is the list of what I consider to be emoticons.

I tried to build a list of emoticons to see if my word exists in that list, but I could not build a list of emoticons from unicode character codes.

Can you please suggest an approach?

Best answer

I think that all of those characters are in the Unicode general category "Symbol, other" (So). Therefore you can keep every word that contains at least one such character:

import unicodedata
[w for w in lst if any(unicodedata.category(c) == 'So' for c in w)]
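As a self-contained sketch of the category check, the snippet below runs it on a shortened version of the sample list. The helper name `has_symbol_other` is hypothetical, and the emoji are written as escapes so the example does not depend on file encoding.

```python
import unicodedata


def has_symbol_other(word):
    """Return True if any character in word has Unicode category 'So'."""
    return any(unicodedata.category(ch) == "So" for ch in word)


# Shortened sample list: "\u2705" is the check mark, "\u23f1" the stopwatch,
# "\u20ba" the lira sign and "\u20ac" the euro sign (both category 'Sc').
lst = ["\u2705", "türkçe", "Çile", "#thing", "41",
       "\u23f1", "\u20ba", "\u20ac", ":)", ":/"]

out = [w for w in lst if has_symbol_other(w)]
print(out)
```

Letters, digits, punctuation and currency symbols fall into other categories, so only the check mark and the stopwatch remain. Note that 'So' is broader than emoticons: it also contains symbols such as ©, so some extra filtering may be needed depending on the data.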