Exclude Japanese Stopwords from File

I am trying to remove Japanese stopwords from a text corpus from Twitter. Unfortunately, the frequently used nltk does not include Japanese stopwords, so I had to figure out a different way.

This is my MWE:

import urllib
from urllib.request import urlopen
import MeCab
import re

# slothlib
slothlib_path = "http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt"
sloth_file = urllib.request.urlopen(slothlib_path)

# stopwordsiso
iso_path = "https://raw.githubusercontent.com/stopwords-iso/stopwords-ja/master/stopwords-ja.txt"
iso_file = urllib.request.urlopen(iso_path)
stopwords = [line.decode("utf-8").strip() for line in iso_file]

stopwords = [ss for ss in stopwords if not ss==u'']
stopwords = list(set(stopwords))

text = '日本語の自然言語処理は本当にしんどい、と彼は十回言った。'
tagger = MeCab.Tagger("-Owakati")
tok_text = tagger.parse(text)

ws = re.compile(" ")
words = [word for word in ws.split(tok_text)]
if words[-1] == u"\n":
    words = words[:-1]
ws = [w for w in words if w not in stopwords]

print(words)
print(ws)

Successfully completed: it prints both the original tokenized text and the version without stopwords:

['日本語', 'の', '自然', '言語', '処理', 'は', '本当に', 'しんどい', '、', 'と', '彼', 'は', '十', '回', '言っ', 'た', '。']
['日本語', '自然', '言語', '処理', '本当に', 'しんどい', '、', '十', '回', '言っ', '。']

There are still two issues I am facing, though:

a) Is it possible to take two stopword lists into account, namely iso_file and sloth_file, so that a word is removed if it is a stopword in either of them? (I tried replacing the stopwords line with stopwords = [line.decode("utf-8").strip() for line in zip('iso_file','sloth_file')], but received an error because tuple attributes cannot be decoded.)

b) The ultimate goal would be to generate a new text file in which all stopwords are removed.

For that, I created this MWE:

### first clean twitter csv
import pandas as pd
import re
import emoji

df = pd.read_csv("input.csv")

def cleaner(tweet):
    tweet = re.sub(r"@[^\s]+","",tweet) #Remove @username 
    tweet = re.sub(r"(?:\@|http?\://|https?\://|www)\S+|\\n","", tweet) #Remove http links & \n
    tweet = " ".join(tweet.split())
    tweet = ''.join(c for c in tweet if c not in emoji.UNICODE_EMOJI) #Remove Emojis
    tweet = tweet.replace("#", "").replace("_", " ") #Remove hashtag sign but keep the text
    return tweet
df['text'] = df['text'].map(lambda x: cleaner(x))
df['text'].to_csv(r'cleaned.txt', header=None, index=None, sep='\t', mode='a')

### remove stopwords

import urllib
from urllib.request import urlopen
import MeCab
import re

# slothlib
slothlib_path = "http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt"
sloth_file = urllib.request.urlopen(slothlib_path)

#stopwordsiso
iso_path = "https://raw.githubusercontent.com/stopwords-iso/stopwords-ja/master/stopwords-ja.txt"
iso_file = urllib.request.urlopen(iso_path)
stopwords = [line.decode("utf-8").strip() for line in iso_file]

stopwords = [ss for ss in stopwords if not ss==u'']
stopwords = list(set(stopwords))

with open("cleaned.txt",encoding='utf8') as f:
    cleanedlist = f.readlines()
    cleanedlist = list(set(cleanedlist))

tagger = MeCab.Tagger("-Owakati")
tok_text = tagger.parse(cleanedlist)

ws = re.compile(" ")
words = [word for word in ws.split(tok_text)]
if words[-1] == u"\n":
    words = words[:-1]
ws = [w for w in words if w not in stopwords]

print(words)
print(ws)

While this works for the simple input text in the first MWE, for the second MWE I get the error

in method 'Tagger_parse', argument 2 of type 'char const *'
Additional information:
Wrong number or type of arguments for overloaded function 'Tagger_parse'.
  Possible C/C++ prototypes are:
    MeCab::Tagger::parse(MeCab::Lattice *) const
    MeCab::Tagger::parse(char const *)

for this line: tok_text = tagger.parse(cleanedlist). So I assume I will need to make amendments to cleanedlist?

I have uploaded cleaned.txt to GitHub for reproducing the issue: [txt on github][1]

Also: how would I be able to get the tokenized list that excludes stopwords back into a text format like cleaned.txt? Would it be possible to create a df of ws for this purpose? Or might there even be a simpler way?

Sorry for the long request; I tried a lot and wanted to make it as easy as possible to understand what I am driving at :-)

Thank you very much!

[1]: https://gist.github.com/yin-ori/1756f6236944e458fdbc4a4aa8f85a2c

1 Answer

Answered by polm23:

It sounds like you want to:

  1. combine two lists of stopwords
  2. save text that has had stopwords removed

For problem 1, if you have two lists you can make them into one list with full_list = list1 + list2. You can then make them into a set after that.
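
A minimal sketch of how that could look with the two sources from the question; the fetch_stopwords helper is just for illustration, and the SlothLib file is decoded the same way as the iso file above:

import urllib.request

def fetch_stopwords(url):
    # download one stopword file and return its decoded, stripped lines
    with urllib.request.urlopen(url) as f:
        return [line.decode("utf-8").strip() for line in f]

slothlib_path = "http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt"
iso_path = "https://raw.githubusercontent.com/stopwords-iso/stopwords-ja/master/stopwords-ja.txt"

# concatenate both lists, then deduplicate with a set
stopwords = set(fetch_stopwords(slothlib_path) + fetch_stopwords(iso_path))
stopwords.discard("")  # drop empty lines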

The reason you are getting the MeCab error is probably that you are passing a list to parse, which expects a string. (What MeCab wrapper are you using? I have never seen that particular error.) As a note, you should pass each individual tweet to MeCab, instead of the combined text of all tweets, something like:

tokenized = [tagger.parse(tweet) for tweet in cleanedlist]

That should resolve your problem.

Saving the text with stopwords removed works the same as writing any other text file.
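
As a sketch of that step, assuming tagger (with -Owakati), stopwords, and cleanedlist are defined as in the question; the output filename is just an example:

with open("cleaned_no_stopwords.txt", "w", encoding="utf-8") as out:
    for tweet in cleanedlist:
        tokens = tagger.parse(tweet).split()              # -Owakati output is space-separated
        kept = [t for t in tokens if t not in stopwords]  # drop stopwords per tweet
        out.write(" ".join(kept) + "\n")                  # one tweet per line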


As a separate point...

Stopword lists are not very useful in Japanese, because if you are using something like MeCab you already have part-of-speech information. You should use that instead to throw out verb endings, function words, and so on; see the sketch below.
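
For example, a minimal sketch assuming the mecab-python3 wrapper, where the first comma-separated feature field of each node is the coarse part of speech; the set of POS tags kept here is just an illustration:

import MeCab

tagger = MeCab.Tagger()
text = '日本語の自然言語処理は本当にしんどい、と彼は十回言った。'

content_words = []
node = tagger.parseToNode(text)
while node:
    pos = node.feature.split(",")[0]       # coarse POS, e.g. 名詞 / 動詞 / 助詞
    if pos in ("名詞", "動詞", "形容詞"):    # keep nouns, verbs, adjectives
        content_words.append(node.surface)
    node = node.next

print(content_words)  # function words and punctuation are dropped via POS, not a stopword list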

Also, removing stopwords is probably actively unhelpful if you are using any modern NLP methods; see the spaCy preprocessing FAQ.