I'm using the tf.estimator API with TensorFlow 2.1 on Google AI Platform to build a DNN Regressor. To use AI Platform Training hyperparameter tuning, I followed Google's docs.
I used the following configuration parameters:
config.yaml:
trainingInput:
  scaleTier: BASIC
  hyperparameters:
    goal: MINIMIZE
    maxTrials: 2
    maxParallelTrials: 2
    hyperparameterMetricTag: rmse
    enableTrialEarlyStopping: True
    params:
    - parameterName: batch_size
      type: DISCRETE
      discreteValues:
      - 100
      - 200
      - 300
    - parameterName: lr
      type: DOUBLE
      minValue: 0.0001
      maxValue: 0.1
      scaleType: UNIT_LOG_SCALE
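For context, the tuning service passes each of these parameters to the trainer as a command-line flag (--batch_size and --lr, matching parameterName), which the trainer parses and forwards into train_and_evaluate. Below is a simplified sketch of that parsing, not my exact task.py; the defaults and the extra --output_dir flag are just illustrative:

import argparse

# Sketch: read the flags AI Platform appends for each trial.
# The flag names must match parameterName in config.yaml.
parser = argparse.ArgumentParser()
parser.add_argument('--batch_size', type=int, default=100)
parser.add_argument('--lr', type=float, default=0.001)
parser.add_argument('--output_dir', required=True)
args = parser.parse_args()

hparams = vars(args)  # e.g. {'batch_size': 200, 'lr': 0.01, 'output_dir': '...'}
train_and_evaluate(hparams)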
And to add the metric to my summary, I used the following code for my DNNRegressor:
def rmse(labels, predictions):
    pred_values = predictions['predictions']
    rmse = tf.keras.metrics.RootMeanSquaredError(name='root_mean_squared_error')
    rmse.update_state(labels, pred_values)
    return {'rmse': rmse}
def train_and_evaluate(hparams):
    ...
    estimator = tf.estimator.DNNRegressor(
        model_dir=output_dir,
        feature_columns=get_cols(),
        hidden_units=[max(2, int(FIRST_LAYER_SIZE * SCALE_FACTOR ** i))
                      for i in range(NUM_LAYERS)],
        optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
        config=run_config)
    estimator = tf.estimator.add_metrics(estimator, rmse)
According to Google's documentation, the add_metrics function creates a new estimator with the specified metric, which is then used as the hyperparameter metric. However, the AI Platform Training service doesn't recognise this metric:
[Screenshot: Job details on AI Platform]
When I run the code locally, the rmse metric does get written to the logs. So how do I make the metric available to the Training job on AI Platform when using Estimators?
Additionally, there is the option of reporting metrics through the cloudml-hypertune Python package, but it requires the metric's value as one of its input arguments. How do I extract the metric from tf.estimator.train_and_evaluate (since that's the function I use to train and evaluate my estimator) so I can pass it to report_hyperparameter_tuning_metric?
hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='rmse',
    metric_value=??,
    global_step=1000
)
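As far as I can tell from the docs, tf.estimator.train_and_evaluate returns a tuple of the final evaluation metrics and the export results (and the return value is undefined for distributed training), so I imagine something along these lines might work, but I'm not sure it's the intended approach. Rough, untested sketch, where train_spec and eval_spec stand in for my actual TrainSpec/EvalSpec:

import hypertune

# Sketch: take the metrics dict returned by the final evaluate call
# and report the 'rmse' entry to the tuning service.
eval_result, _ = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='rmse',
    metric_value=eval_result['rmse'],
    global_step=eval_result['global_step'])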
ETA: The logs show no errors. The job is reported as completed successfully, even though the hyperparameter tuning fails to pick up the metric.