I'm preprocessing text data at the moment (the text is in French).
Here's my code so far:
import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from spacy.lang.fr import French

df = pd.read_csv('file.csv', sep=';')
stop_words = set(stopwords.words('french'))
tokenizer = RegexpTokenizer(r'\w+')
lemmatizer = French.Defaults.create_lemmatizer()
def clean_text(text):
    text = text.lower()
    text = tokenizer.tokenize(text)
    text = [word for word in text if word not in stop_words]
    text = [lemmatizer.lemmatize(word) for word in text]
    final_text = ' '.join([w for w in text if len(w) > 2])
    return final_text

df['comms_clean'] = df['comms'].apply(clean_text)
But I get this error:
TypeError: lemmatize() missing 3 required positional arguments: 'index', 'exceptions', and 'rules'
I'm used to working with English data, so this is the first time I've used these packages and I'm quite lost. What should I do to fix this?
The error is telling you exactly what is wrong: the lemmatize() method requires the positional arguments 'index', 'exceptions', and 'rules', but you are passing it only one argument (the word itself). Here is the official documentation: https://spacy.io/api/lemmatizer#_title
And here is the implementation of the lemmatizer object, where you can find the lemmatize method and see the arguments it expects: https://github.com/explosion/spaCy/blob/master/spacy/lemmatizer.py
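In practice, rather than calling the lemmatizer's internal lemmatize method directly, it's usually easier to restructure clean_text so the lemma lookup is a single replaceable step. The sketch below is self-contained: the LEMMAS dict and the tiny stop-word set are hypothetical stand-ins so the example runs without NLTK data or a spaCy French model installed; in your real code you would keep stopwords.words('french') and swap the LEMMAS.get(...) line for your spaCy lemma lookup.

```python
import re

# Hypothetical stand-in lemma table -- replace with a real spaCy lemma lookup.
LEMMAS = {"livres": "livre", "sont": "être", "mangé": "manger"}
# Tiny stand-in stop-word set -- replace with set(stopwords.words('french')).
STOP_WORDS = {"le", "la", "les", "de", "un", "une"}

def clean_text(text):
    # Lowercase, then tokenize on word characters (same effect as RegexpTokenizer(r'\w+')).
    tokens = re.findall(r"\w+", text.lower())
    # Drop stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Look up each token's lemma, falling back to the token itself.
    tokens = [LEMMAS.get(t, t) for t in tokens]
    # Keep only words longer than 2 characters and rejoin.
    return " ".join(w for w in tokens if len(w) > 2)

print(clean_text("Les livres sont mangé"))  # → livre être manger
```

The point of the structure is that the lemmatization step is just one list comprehension, so once you have a working French lemmatizer you only need to change that single line.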