Packaging keras tokenizer/word index for use in google-cloud-ml-engine


I've created a text classifier in Keras, and I can train the Keras model on Cloud ML just fine; the model is subsequently deployed on Cloud ML. However, when I pass along text to classify, it returns the wrong classifications. I strongly suspect that it's not using the same tokenizer/word index that I used when creating the Keras classifier, and which was used to tokenise the new text.

I'm unsure how to pass along the tokeniser/word index to Cloud ML when training: there is a previous SO question, but will

gcloud ml-engine jobs submit training

pick up a pickle or text file containing the word index mapping? And if so, how should I configure the setup.py file?
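For concreteness, the kind of setup.py change I have in mind would be something like this (an untested sketch; the file name and package layout are placeholders from my project, and I don't know whether ml-engine would actually ship the bundled file along with the trainer):

from setuptools import setup, find_packages

setup(
    name='trainer',
    version='0.1',
    packages=find_packages(),
    # Placeholder: bundle the pickled tokenizer inside the trainer package
    # so it is uploaded together with the training code.
    package_data={'trainer': ['keras_tokenizer_embeddings_002.pickle']},
    install_requires=['keras', 'h5py'],
)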


EDIT:

So, I'm using Keras to tokenise input text like so:

tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(X_train)
sequences = tokenizer.texts_to_sequences(X_train)

word_index = tokenizer.word_index

When I'm just running the Keras model locally, I save it like so:

model.save('model_embeddings_20epochs_v2.h5')

I also save the tokenizer, so that I can use it to tokenize new data:

with open("../saved_models/keras_tokenizer_embeddings_002.pickle", "wb") as f:
   pickle.dump(tokenizer, f)

On new data, I restore the model and tokenizer.

model = load_model('../../saved_models/model_embeddings_20epochs_v2.h5')
with open("../../saved_models/keras_tokenizer_embeddings_002.pickle", "rb") as f:
   tokenizer = pickle.load(f)

I then use the tokenizer to convert text to sequences on the new data, classify it, etc.
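Concretely, that classification step looks roughly like this (a sketch; new_texts is a placeholder list of strings and MAX_SEQUENCE_LENGTH matches the value used at training time):

from keras.preprocessing.sequence import pad_sequences

# Tokenize and pad the new texts exactly as during training.
new_sequences = tokenizer.texts_to_sequences(new_texts)
new_data = pad_sequences(new_sequences, maxlen=MAX_SEQUENCE_LENGTH)
predictions = model.predict(new_data)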

The script for the Cloud ML job does not save the tokenizer; I had presumed that the Keras script would basically use the same word index.

....
X_train = [x.encode('UTF8') for x in X_train]
X_test = [x.encode('UTF8') for x in X_test]

# finally, vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(X_train)
sequences = tokenizer.texts_to_sequences(X_train)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

.....

# prepare embedding matrix
num_words = min(MAX_NB_WORDS, len(word_index))
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= MAX_NB_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

# load pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

At the moment, I'm just training it locally.

gcloud ml-engine local train \
    --job-dir $JOB_DIR \
    --module-name trainer.multiclass_glove_embeddings_v1 \
    --package-path ./trainer \
    -- \
    --train-file ./data/corpus.pkl

There is 1 answer below.

BEST ANSWER

From what I can tell from the source code, even TensorFlow's Keras-compatible library does tokenization in Python, i.e., not as part of the TensorFlow graph.

At this time, CloudML Engine only supports serving TensorFlow models where all of the logic is encoded in a TensorFlow graph. That means you'll have to do the tokenization client-side and pass the results on to the server for prediction. This would involve coding the client to deserialize the Tokenizer and call tokenizer.texts_to_sequences for the inputs for which predictions are desired.
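A minimal client-side sketch of that flow might look like the following (the tokenizer file name, project and model names, and MAX_SEQUENCE_LENGTH are placeholders, and the instances format depends on the serving signature of your exported graph):

import pickle
from keras.preprocessing.sequence import pad_sequences
from googleapiclient import discovery

MAX_SEQUENCE_LENGTH = 1000  # placeholder; must match the value used at training time

# Deserialize the tokenizer that was fitted at training time.
with open("keras_tokenizer_embeddings_002.pickle", "rb") as f:
    tokenizer = pickle.load(f)

# Tokenize and pad on the client, exactly as during training.
sequences = tokenizer.texts_to_sequences(["some new text to classify"])
instances = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH).tolist()

# Send the integer sequences to the deployed model for prediction.
service = discovery.build('ml', 'v1')
name = 'projects/{}/models/{}'.format('my-project', 'my-model')  # placeholders
response = service.projects().predict(name=name,
                                       body={'instances': instances}).execute()
print(response)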

We recognize that this is not always ideal (a non-starter for non-Python clients and inconvenient, at least, even for Python clients) and are actively investigating solutions for allowing arbitrary Python code to be run as part of prediction.