I'm in the process of cleaning up data for text mining. This involves removing numbers, punctuation, and stopwords (common words that would just be noise in the data mining), and later doing word stemming.
Using the tm package in R, you can remove stopwords, for example with tm_map(myCorpus, removeWords, stopwords('english')). The tm manual itself demonstrates using stopwords("english"). This word list contains contractions such as "I'd" and "I'll", as well as the very common word "I":
> library(tm)
> which(stopwords('english') == "i")
[1] 1
> which(stopwords('english') == "i'd")
[1] 69
(Text is assumed to be lowercase before removing stopwords.)
But (presumably) because "i" comes first in the list, the contractions are never removed:
> removeWords("i'd like a soda, please", stopwords('english'))
[1] "'d like soda, please"
A quick hack is to reverse the wordlist:
> removeWords("i'd like a soda, please", rev.default(stopwords('english')))
[1] " like soda, please"
Another solution is to find/make a better wordlist.
Is there a better/correct way to use stopwords('english')?
The problem here comes from the underdetermined workflow made possible by the tools you are using. Simply put, removing stopwords means filtering tokens, but the text you are removing the stopwords from has not yet been tokenised.
Specifically, the "i" is removed from "i'd" because the tokeniser splits on the apostrophe. In the text analysis package quanteda, you are required to tokenise the text first and only then remove features based on token matches, so the contraction is matched (and removed) as a whole token. quanteda also has a built-in list of the most common stopwords, so that works too, and punctuation can be removed at the same time. Both steps are sketched below.
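A minimal sketch of the token-first approach, written with the current tokens() and tokens_remove() functions (the quanteda API has changed across versions, and the output noted in the comments is what I would expect from a recent release rather than a verified transcript):

library(quanteda)

# tokenise first: the contraction survives as a single token
toks <- tokens("I'd like a soda, please")
## toks now contains: "I'd" "like" "a" "soda" "," "please"

# removal matches whole tokens (case-insensitive by default),
# so "I'd" is dropped in one piece instead of losing only the "i"
tokens_remove(toks, pattern = "i'd")
## leaves: "like" "a" "soda" "," "please"

And the same idea using quanteda's built-in stopword list, which includes the contractions, with remove_punct = TRUE to drop the comma as well:

toks <- tokens("I'd like a soda, please", remove_punct = TRUE)
tokens_remove(toks, pattern = stopwords("english"))
## leaves: "like" "soda" "please"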
In my opinion (biased, admittedly, since I designed quanteda) this is a better way to remove stopwords in English and most other languages.
Update, January 2021, for a more modern version of quanteda:
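A sketch of the same pipeline with quanteda v2 or later (the native |> pipe requires R >= 4.1; the commented output is approximate):

library("quanteda")

# tokenise, drop punctuation, then remove tokens that match the stopword list
tokens("I'd like a soda, please", remove_punct = TRUE) |>
  tokens_remove(stopwords("english"))
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "like"   "soda"   "please"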
Created on 2021-02-01 by the reprex package (v1.0.0)