I am trying to calculate SHAP values for a random forest text classifier. Here is my code for model training and evaluation:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# split the dataset into train and test
train_text, val_text, train_labels, val_labels = train_test_split(
    messages["text"].tolist(), messages["label"].tolist(),
    test_size=0.3, random_state=42)
# Vectorize the text data
print('starting tfidf vectorizer')
vectorizer = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1,2))
X_train_vec = vectorizer.fit_transform(train_text).toarray()
X_val_vec = vectorizer.transform(val_text).toarray()
# class balance: total labels vs. truthy (positive) labels
print(len(train_labels), len([t for t in train_labels if t]))
print(len(val_labels), len([t for t in val_labels if t]))
# Train model on the training set
rand_fore_max_feat = 'sqrt'
rand_fore_n_est = 1000
RandomForestClassifier_model = RandomForestClassifier(
    max_features=rand_fore_max_feat, n_estimators=rand_fore_n_est)
# rename model for ease of use
model = RandomForestClassifier_model
# fit model
print('starting model fit')
model.fit(X_train_vec, train_labels)
print('finished model fit')
# make predictions on the validation set
val_pred = model.predict(X_val_vec)
# display a classification report (y_true comes first in sklearn's API)
print(classification_report(val_labels, val_pred))
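For context, the two .toarray() calls above materialize fully dense matrices, and with unigrams plus bigrams the vocabulary gets large. A quick way to see the actual memory footprint (a sketch using numpy's nbytes attribute on the arrays above):
# how much memory do the dense matrices actually take?
print(X_train_vec.shape, X_val_vec.shape)
print(f"dense train matrix: {X_train_vec.nbytes / 1e9:.2f} GB")
print(f"dense val matrix: {X_val_vec.nbytes / 1e9:.2f} GB")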
When I run the code below to calculate SHAP values, the kernel crashes.
import shap

feature_names = vectorizer.get_feature_names_out()
subset_size = 10
try:
    explainer = shap.Explainer(model, X_train_vec, feature_names=feature_names)
    shap_values = explainer(X_val_vec[:subset_size])
    print(shap_values.values.shape)
except Exception as e:
    print(f"An error occurred: {e}")
I tried running each line of code in the try block individually, and "shap_values = explainer(X_val_vec[:subset_size])" is the line that seems to make the kernel crash.
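To rule out the size of the result itself, the expected output array can be estimated up front (a sketch; the factor of 2 assumes the explainer returns one set of values per class for a binary model, which is an assumption on my part):
# rough estimate of the SHAP output array size
n_samples, n_features = X_val_vec[:subset_size].shape
est_mb = n_samples * n_features * 2 * 8 / 1e6  # float64 values, 2 classes assumed
print(f"expected shap_values size: ~{est_mb:.1f} MB")
With a vocabulary in the tens of thousands (judging by the indices in the sparse output below), this comes out to only a few megabytes, so I suspect the crash happens during the computation rather than when storing the result.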
I tried switching my Python version to 3.9.10 and updating Jupyter, ipywidgets, and ipykernel. I also tried reducing the subset size from 10 to 5 to 1, so I don't think the issue is the sample size.
I also tried truncating each text to 10 words and using a subset size of 3 with the following code:
# truncate each text to its first 10 words
X_val_text_subset = [" ".join(text.split()[:10]) for text in messages["text"]]
# Vectorize the modified text data
X_val_vec_subset = vectorizer.transform(X_val_text_subset)
# Choose a subset of instances (let's say 3)
subset_size = 3
X_val_vec_subset = X_val_vec_subset[:subset_size]
print(X_val_vec_subset)
This code outputs the sparse matrix representation:
  (0, 56681)  0.26027819722272955
  (0, 56625)  0.1740880007667988
  (0, 55480)  0.18384870639744058
  (0, 55457)  0.15617249149572865
  (0, 47186)  0.3415585648445225
  (0, 47183)  0.2530088534451547
  (0, 42503)  0.12703787267515596
  (0, 23941)  0.24448002523220852
  ...
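As far as I can tell, truncating the texts does not actually shrink the feature space: transform() still produces one column per vocabulary term learned during fit. A quick check of that (a sketch on the variables above):
# the column count should match the full vocabulary, regardless of document length
print(X_val_vec_subset.shape)
print(len(vectorizer.get_feature_names_out()))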
Any ideas as to what I can do to stop this code from crashing?
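Would something along these lines, using TreeExplainer with a much smaller background sample, be the right direction? This is an untested sketch; shap.sample, the data argument, and check_additivity=False are my assumptions about the shap API, so please correct me if they don't apply here.
import shap

# untested sketch: use a small background sample instead of the full
# training matrix (shap.sample and check_additivity are assumptions)
background = shap.sample(X_train_vec, 100, random_state=42)
explainer = shap.TreeExplainer(model, data=background)
shap_values = explainer.shap_values(X_val_vec[:subset_size], check_additivity=False)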