I'm trying to use Google Cloud Platform to deploy a model to support prediction.
I train the model (locally) with the following command:
~/$ gcloud ml-engine local train --module-name trainer.task --package-path trainer
and everything works fine (...):
INFO:tensorflow:Restoring parameters from gs://my-bucket1/test2/model.ckpt-45000
INFO:tensorflow:Saving checkpoints for 45001 into gs://my-bucket1/test2/model.ckpt.
INFO:tensorflow:loss = 17471.6, step = 45001
[...]
Loss: 144278.046875
average_loss: 1453.68
global_step: 50000
loss: 144278.0
INFO:tensorflow:Restoring parameters from gs://my-bucket1/test2/model.ckpt-50000
Mean Square Error of Test Set = 593.1018482
But when I run the following command to create a version,
gcloud ml-engine versions create Mo1 --model mod1 --origin gs://my-bucket1/test2/ --runtime-version 1.3
I get the following error:
ERROR: (gcloud.ml-engine.versions.create) FAILED_PRECONDITION: Field: version.deployment_uri
Error: SavedModel directory gs://my-bucket1/test2/ is expected to contain exactly one of: [saved_model.pb, saved_model.pbtxt].
- '@type': type.googleapis.com/google.rpc.BadRequest
  fieldViolations:
  - description: 'SavedModel directory gs://my-bucket1/test2/ is expected to contain exactly one of: [saved_model.pb, saved_model.pbtxt].'
    field: version.deployment_uri
Here is a screenshot of my bucket. I have a saved model in 'pbtxt' format.
Finally, here is the piece of code where I save the model to the bucket:
regressor = tf.estimator.DNNRegressor(feature_columns=feature_columns,
                                      hidden_units=[40, 30, 20],
                                      model_dir="gs://my-bucket1/test2",
                                      optimizer='RMSProp')
You'll notice that the file in your screenshot is graph.pbtxt, whereas saved_model.pb{txt} is needed.

Note that just renaming the file generally will not be sufficient. The training process outputs checkpoints periodically in case restarts happen and recovery is needed. However, those checkpoints (and corresponding graphs) are the training graph. Training graphs tend to have things like file readers, input queues, dropout layers, etc. which are not appropriate for serving.
Instead, TensorFlow requires you to explicitly export a separate graph for serving. You can do this in one of two ways:
During/After Training
For this, I'll refer you to the Census sample.
First, you'll need a "Serving Input Function", such as:
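A minimal sketch in the tf.estimator (TF 1.3) style; the feature names and dtypes below are placeholders and must match the feature_columns you trained with, and any preprocessing you applied during training should be repeated here:

```python
import tensorflow as tf

def serving_input_fn():
    # One placeholder per feature the deployed model will receive at
    # prediction time; these names are examples only.
    feature_placeholders = {
        'feature_1': tf.placeholder(tf.float32, [None]),
        'feature_2': tf.placeholder(tf.float32, [None]),
    }
    # Reshape each [batch] tensor to [batch, 1] so it matches what the
    # numeric feature columns expect.
    features = {
        key: tf.expand_dims(tensor, -1)
        for key, tensor in feature_placeholders.items()
    }
    return tf.estimator.export.ServingInputReceiver(features, feature_placeholders)
```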
Then you can simply call:
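(A sketch; the export base path below is just an example.)

```python
# Writes a SavedModel (saved_model.pb plus a variables/ folder) into a
# timestamped subdirectory under the given base path.
export_dir = regressor.export_savedmodel(
    "gs://my-bucket1/test2/export",   # example base path; any GCS directory works
    serving_input_fn
)
```

Note that export_savedmodel creates a timestamped subdirectory; when you create the version, point --origin at that subdirectory (the one that actually contains saved_model.pb), not at the model_dir full of checkpoints.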
Or, if you're using learn_runner/Experiment, you'll need to pass an ExportStrategy like the following to the constructor of Experiment:
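A sketch of the contrib.learn path, following the Census sample; train_input_fn and eval_input_fn stand in for your own input functions, and note that for tf.contrib.learn the serving input function should return an input_fn_utils.InputFnOps rather than a ServingInputReceiver:

```python
import tensorflow as tf
from tensorflow.contrib.learn.python.learn.utils import saved_model_export_utils

# The Experiment exports a SavedModel using your serving input function in
# addition to running training and evaluation.
experiment = tf.contrib.learn.Experiment(
    estimator=regressor,
    train_input_fn=train_input_fn,   # your existing training input function
    eval_input_fn=eval_input_fn,     # your existing evaluation input function
    export_strategies=[
        saved_model_export_utils.make_export_strategy(
            serving_input_fn,        # should return InputFnOps for contrib.learn
            exports_to_keep=1,
            default_output_alternative_key=None,
        )
    ],
)
```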
After Training
Almost exactly the same steps as above, but just in a separate Python script you can run after training is over (in your case, this is beneficial because you won't have to retrain). The basic idea is to construct the Estimator with the same model_dir used in training, then to call export as above, something like:
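(A sketch, assuming the feature_columns definition and the serving_input_fn from above are available to the script; the export path is just an example.)

```python
import tensorflow as tf

# Re-create the estimator pointed at the model_dir that already holds the
# trained checkpoints; no retraining happens here.
regressor = tf.estimator.DNNRegressor(
    feature_columns=feature_columns,   # same definition used for training
    hidden_units=[40, 30, 20],
    model_dir="gs://my-bucket1/test2",
)

# Reads the latest checkpoint from model_dir and writes the serving SavedModel.
regressor.export_savedmodel(
    "gs://my-bucket1/test2/export",    # example export location
    serving_input_fn
)
```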
EDIT 09/12/2017
One slight change is needed to your training code. You are using tf.estimator.DNNRegressor, but that was introduced in TensorFlow 1.3; CloudML Engine only officially supports TensorFlow 1.2, so you'll need to use tf.contrib.learn.DNNRegressor instead. They are very similar, but one notable difference is that you'll need to use the fit method instead of train.
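For illustration, the corresponding change to your snippet might look like this (train_input_fn and the step count are placeholders):

```python
import tensorflow as tf

# TensorFlow 1.2-compatible estimator from tf.contrib.learn.
regressor = tf.contrib.learn.DNNRegressor(
    feature_columns=feature_columns,
    hidden_units=[40, 30, 20],
    model_dir="gs://my-bucket1/test2",
    optimizer='RMSProp',
)

# contrib.learn estimators use fit() rather than train().
regressor.fit(input_fn=train_input_fn, steps=50000)
```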