Issue while deploying a model locally


I created a model that predicts the type of a website from its text.

However, it does not seem to be working. I stored the model, vectorizer, and label encoder in pickle files and load them here.

Code:

import pandas as pd
import sklearn.metrics as sm
import nltk
import string
from nltk.tokenize import word_tokenize
import re
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import pickle
import os

def clean_text(text):
    #### cleaning the text 
    ###1. Convert the text to lower case
    text= text.lower()

    ###2. tokenize the sentences to words
    text_list= word_tokenize(text)

    ###3. Remove the special characters
    special_char_non_text= [re.sub(f'[{string.punctuation}]+','',i) for i in text_list]

    ###4.  remove stopwords
    non_stopwords_text= [i for i in special_char_non_text if i not in stopwords.words('english')]

    ###5. lemmatize the words
    lemmatizer= WordNetLemmatizer()
    lemmatized_words= [lemmatizer.lemmatize(i) for i in non_stopwords_text]

    cleaned_text= ' '.join(lemmatized_words)

    return cleaned_text

text_input= input('Please enter the text: ')
cleaned_text= clean_text(text_input)

temp_df= pd.DataFrame({'input_text':[cleaned_text.strip()]})
vectorizer_filepath= 'tf_idf_vectorizer.pkl'
tf_idf_vectorizer= pickle.load(open(vectorizer_filepath,'rb'))
temp_df_1= tf_idf_vectorizer.transform(temp_df)
input_df= pd.DataFrame(temp_df_1.toarray(),columns=tf_idf_vectorizer.get_feature_names())

### load the model

model_path='multinomial_clf.pkl'
model_clf= pickle.load(open(model_path,'rb'))

y_pred= model_clf.predict(input_df)

#print(y_pred)
### load the label encoder
label_encoder_file= 'label_encoder.pkl'
label_encoder= pickle.load(open(label_encoder_file,'rb'))

label_class= label_encoder.inverse_transform(y_pred.ravel())
print(f'{label_class} is the predicted class')

I am getting an error:

KeyError                                  Traceback (most recent call last)
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in _encode_python(values, uniques, encode)
     65         try:
---> 66             encoded = np.array([table[v] for v in values])
     67         except KeyError as e:

~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in <listcomp>(.0)
     65         try:
---> 66             encoded = np.array([table[v] for v in values])
     67         except KeyError as e:

KeyError: 'website booking flight  bus ticket'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-21-b92cbf8dfe74> in <module>
      5 vectorizer_filepath= 'tf_idf_vectorizer.pkl'
      6 tf_idf_vectorizer= pickle.load(open(vectorizer_filepath,'rb'))
----> 7 temp_df_1= tf_idf_vectorizer.transform(temp_df)
      8 input_df= pd.DataFrame(temp_df_1.toarray(),columns=tf_idf_vectorizer.get_feature_names())
      9 

~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in transform(self, y)
    275             return np.array([])
    276 
--> 277         _, y = _encode(y, uniques=self.classes_, encode=True)
    278         return y
    279 

~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in _encode(values, uniques, encode, check_unknown)
    111     if values.dtype == object:
    112         try:
--> 113             res = _encode_python(values, uniques, encode)
    114         except TypeError:
    115             types = sorted(t.__qualname__

~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in _encode_python(values, uniques, encode)
     66             encoded = np.array([table[v] for v in values])
     67         except KeyError as e:
---> 68             raise ValueError("y contains previously unseen labels: %s"
     69                              % str(e))
     70         return uniques, encoded

ValueError: y contains previously unseen labels: 'website booking flight  bus ticket'

The input text I used was: This is the website for booking flight,bus tickets

I am not sure why this is happening.

Could anyone help me solve the issue?

There is 1 best solution below

Can't really tell exactly without having your data and trained model, but a few things I'm noticing:

  1. In ###3, tokens that consist only of punctuation become empty strings, and nothing removes them afterwards. strip() on the joined text only removes leading and trailing whitespace, not the repeated spaces that the empty tokens leave inside the text. You can see this in the error message too: there is a double space between 'flight' and 'bus'.

  2. You pass the entire DataFrame to tf_idf_vectorizer.transform(), but it expects an iterable of documents. Iterating over a DataFrame yields its column labels, not its rows. Try tf_idf_vectorizer.transform(temp_df['input_text']).

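  3. You call transform() and not fit_transform(), so the vocabulary must already have been learned when the pickled object was fitted. Is that the case? (A fitted TfidfVectorizer simply ignores words it has not seen, whereas the error above complains about unseen labels.)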

  4. To my knowledge, TfidfVectorizer already has a built-in preprocessor. Did you override it with your clean method in the pickled object? If so, why do you clean the text manually again? The fact that the error message shows a non-tokenized string suggests that the built-in tokenizer did not run as it should: it tries to look up the whole string 'website booking flight bus ticket' in the vocabulary and fails. You should either let TfidfVectorizer do the preprocessing, or pass (a modified version of) your cleaning method to it through the preprocessor parameter. Check out this thread: How can I pass a preprocessor to TfidfVectorizer? - sklearn - python.
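
To illustrate point 1, here is a minimal sketch of a cleaning step that drops the empty strings left behind by punctuation-only tokens. It uses only the standard library so that it runs without NLTK data: the regex tokenizer and the small STOPWORDS set are simplified stand-ins (assumptions) for word_tokenize and stopwords.words('english'), and lemmatization is left out.

```python
import re
import string

# Hypothetical stand-in for nltk's stopwords.words('english').
STOPWORDS = {"this", "is", "the", "for", "a", "an"}

# Matches runs of punctuation; re.escape keeps characters like ']' literal.
_PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]+")


def clean_tokens(text):
    """Lowercase, split off punctuation, and drop empty or stopword tokens."""
    # Simplified tokenizer: words and punctuation runs become separate tokens.
    tokens = re.findall(r"\w+|[^\w\s]+", text.lower())
    # Removing punctuation turns tokens such as ',' into empty strings.
    stripped = [_PUNCT_RE.sub("", tok) for tok in tokens]
    # Filter out the empty strings together with the stopwords.
    return [tok for tok in stripped if tok and tok not in STOPWORDS]


def clean_text(text):
    # Joining only non-empty tokens avoids double spaces in the result.
    return " ".join(clean_tokens(text))


print(clean_text("This is the website for booking flight,bus tickets"))
# website booking flight bus tickets
```

Without the `if tok` filter, the empty string produced by ',' would be joined in as well and yield the double space seen in the error message.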
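
Points 2 to 4 can be sketched together as follows. The toy corpus and the simple_clean function are hypothetical placeholders, since the real training data and the fitted vectorizer are not shown in the question; the point is only to show the preprocessor parameter and the difference between passing a DataFrame and passing one of its columns.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def simple_clean(text):
    # Hypothetical, simplified cleaner standing in for the question's clean_text.
    return re.sub(r"[^\w\s]+", " ", text.lower())


# Hypothetical toy corpus; the real training data is not shown in the question.
train_docs = pd.Series([
    "Website for booking flight and bus tickets",
    "Online store for shoes and clothes",
])

# Let the vectorizer call the cleaning function itself on every document.
vectorizer = TfidfVectorizer(preprocessor=simple_clean)
vectorizer.fit(train_docs)

# Raw, uncleaned input: the vectorizer applies simple_clean during transform().
temp_df = pd.DataFrame(
    {"input_text": ["This is the website for booking flight,bus tickets"]}
)

# Iterating over a DataFrame yields its column labels, not its rows.
print(list(temp_df))  # ['input_text']

# Passing the DataFrame vectorizes the column name 'input_text' instead of
# the text; it is not in the vocabulary, so the result is an all-zero row.
wrong = vectorizer.transform(temp_df)
print(wrong.nnz)

# Passing the column treats each row as one document. Words unseen during
# fit ('this', 'is', 'the') are ignored rather than raising an error.
right = vectorizer.transform(temp_df["input_text"])
print(right.shape, right.nnz)
```

If the cleaning function is passed as preprocessor, the same function has to be importable wherever the pickled vectorizer is loaded, because pickle stores functions by reference.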
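
Related to point 4: the traceback in the question goes through sklearn\preprocessing\_label.py, the module that defines LabelEncoder, and "y contains previously unseen labels" is a LabelEncoder message. One possibility, which cannot be confirmed without the pickle files, is that tf_idf_vectorizer.pkl does not actually hold a TfidfVectorizer. A minimal sketch for checking this at load time, where load_expected is a hypothetical helper and not part of scikit-learn:

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder


def load_expected(path, expected_type):
    """Unpickle `path` and fail early if the object has the wrong type."""
    # Hypothetical helper, not part of scikit-learn.
    with open(path, "rb") as fh:
        obj = pickle.load(fh)
    if not isinstance(obj, expected_type):
        raise TypeError(
            f"{path} holds a {type(obj).__name__}, "
            f"expected {expected_type.__name__}"
        )
    return obj


# Demo: simulate a LabelEncoder saved under the vectorizer's filename.
message = None
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tf_idf_vectorizer.pkl")
    with open(path, "wb") as fh:
        pickle.dump(LabelEncoder().fit(["news", "travel"]), fh)
    try:
        load_expected(path, TfidfVectorizer)
    except TypeError as err:
        message = str(err)
        print(message)
```

In the question's script, the same check could be applied to each of the three pickle files, or simply replaced by print(type(tf_idf_vectorizer)) right after loading.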