I created a model that predicts the type of a website from its text, but it does not seem to work. I stored the model, the vectorizer, and the label encoder in pickle files, and I am loading them here.
Code:
import pandas as pd
import sklearn.metrics as sm
import nltk
import string
from nltk.tokenize import word_tokenize
import re
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import pickle
import os
def clean_text(text):
    #### cleaning the text
    ###1. convert the text to lower case
    text= text.lower()
    ###2. tokenize the sentences to words
    text_list= word_tokenize(text)
    ###3. remove the special characters
    special_char_non_text= [re.sub(f'[{string.punctuation}]+','',i) for i in text_list]
    ###4. remove stopwords
    non_stopwords_text= [i for i in special_char_non_text if i not in stopwords.words('english')]
    ###5. lemmatize the words
    lemmatizer= WordNetLemmatizer()
    lemmatized_words= [lemmatizer.lemmatize(i) for i in non_stopwords_text]
    cleaned_text= ' '.join(lemmatized_words)
    return cleaned_text
text_input= input('Please enter the text: ')
cleaned_text= clean_text(text_input)
temp_df= pd.DataFrame({'input_text':[cleaned_text.strip()]})
vectorizer_filepath= 'tf_idf_vectorizer.pkl'
tf_idf_vectorizer= pickle.load(open(vectorizer_filepath,'rb'))
temp_df_1= tf_idf_vectorizer.transform(temp_df)
input_df= pd.DataFrame(temp_df_1.toarray(),columns=tf_idf_vectorizer.get_feature_names())
### load the model
model_path='multinomial_clf.pkl'
model_clf= pickle.load(open(model_path,'rb'))
y_pred= model_clf.predict(input_df)
#print(y_pred)
### load the label encoder
label_encoder_file= 'label_encoder.pkl'
label_encoder= pickle.load(open(label_encoder_file,'rb'))
label_class= label_encoder.inverse_transform(y_pred.ravel())
print(f'{label_class} is the predicted class')
I am getting an error:
KeyError Traceback (most recent call last)
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in _encode_python(values, uniques, encode)
65 try:
---> 66 encoded = np.array([table[v] for v in values])
67 except KeyError as e:
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in <listcomp>(.0)
65 try:
---> 66 encoded = np.array([table[v] for v in values])
67 except KeyError as e:
KeyError: 'website booking flight bus ticket'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-21-b92cbf8dfe74> in <module>
5 vectorizer_filepath= 'tf_idf_vectorizer.pkl'
6 tf_idf_vectorizer= pickle.load(open(vectorizer_filepath,'rb'))
----> 7 temp_df_1= tf_idf_vectorizer.transform(temp_df)
8 input_df= pd.DataFrame(temp_df_1.toarray(),columns=tf_idf_vectorizer.get_feature_names())
9
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in transform(self, y)
275 return np.array([])
276
--> 277 _, y = _encode(y, uniques=self.classes_, encode=True)
278 return y
279
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in _encode(values, uniques, encode, check_unknown)
111 if values.dtype == object:
112 try:
--> 113 res = _encode_python(values, uniques, encode)
114 except TypeError:
115 types = sorted(t.__qualname__
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in _encode_python(values, uniques, encode)
66 encoded = np.array([table[v] for v in values])
67 except KeyError as e:
---> 68 raise ValueError("y contains previously unseen labels: %s"
69 % str(e))
70 return uniques, encoded
ValueError: y contains previously unseen labels: 'website booking flight bus ticket'
The input text I used was: This is the website for booking flight,bus tickets
I am not sure why this is happening. Could anyone help me solve the issue?
Can't really tell exactly without having your data and trained model, but a few things I'm noticing:

1. In ###3, empty strings can stay behind (when a token consists only of punctuation) and you don't seem to remove them anywhere afterwards. strip() on the joined text only removes leading and trailing whitespace, not the double (or longer) runs of spaces inside the text; you can see that in the error message too.
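For illustration, a minimal sketch of how those leftover empty strings could be filtered out before joining (the token list here is made up):

tokens = ['website', '', 'booking', '', 'flight']  # '' left over from punctuation-only tokens
cleaned = ' '.join(t for t in tokens if t)         # dropping them avoids double spaces
print(cleaned)                                     # website booking flight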
2. You hand the entire DataFrame to tf_idf_vectorizer.transform(), but it expects an iterable of documents. Iterating over a whole DataFrame like this iterates over the column labels, not the rows. Try tf_idf_vectorizer.transform(temp_df['input_text']).
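You can see the difference directly (reusing the cleaned example text from your traceback):

import pandas as pd

temp_df = pd.DataFrame({'input_text': ['website booking flight bus ticket']})
print(list(temp_df))                # ['input_text'], iterating yields the column labels
print(list(temp_df['input_text']))  # ['website booking flight bus ticket'], the actual document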
3. You call transform() and not fit_transform(), so all the vocabulary needs to be known by the fitted vectorizer already. Is that the case?
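For reference, a minimal self-contained sketch of that behaviour with made-up toy documents: a fitted TfidfVectorizer keeps its vocabulary fixed, and transform() simply assigns no weight to out-of-vocabulary words rather than raising an error:

from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
vec.fit(['booking a flight ticket'])        # vocabulary is fixed at fit time
X = vec.transform(['website booking bus'])  # 'website' and 'bus' were never seen
print(X.toarray())                          # unseen words get zero weight, no error is raised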
4. To my knowledge TfidfVectorizer already has a preprocessor built in; did you overwrite it with your cleaning method in the pickled object? If so, why do you manually clean the text again? The fact that the error message shows a non-tokenized string suggests that the built-in tokenizer did not run as it should: it tries to look up the non-tokenized string 'website booking flight bus ticket' in the vocabulary and fails. You should either let TfidfVectorizer do the preprocessing or properly use the preprocessor parameter and hand (a modified version of) your cleaning method to it. Check out this thread: How can I pass a preprocessor to TfidfVectorizer? - sklearn - python.
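For illustration, a minimal sketch of that second option, with a simplified stand-in for your clean_text (the real vectorizer, and therefore the model, would have to be re-fitted this way):

from sklearn.feature_extraction.text import TfidfVectorizer

def clean_text(text):
    # simplified stand-in for the cleaning function from the question
    return text.lower()

# The vectorizer calls the preprocessor on every raw document itself, so the
# exact same cleaning runs at fit time and at prediction time.
vectorizer = TfidfVectorizer(preprocessor=clean_text)
X = vectorizer.fit_transform(['This is the website for booking flight,bus tickets'])
print(sorted(vectorizer.vocabulary_))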