I use tf.contrib.tpu.keras_to_tpu_model to run my code on a TPU, but one epoch takes 170 hours, the same as on CPU, while on GPU it takes only 40 hours. I tried adjusting the batch size, but nothing changed. I also measured the input function: it takes up about 20% of the run time on GPU, so I don't think it is the main cause.

Here is my code:

Run on colab:

  TPU:
  GPU:

The model:

def build_model(self):
    self.inputs = [Input(shape=(self.options.dim_feature[i], ), name='input_{}'.format(i), dtype='float') for i in range(3)]

    self.encodeds = [Dense(self.options.embedding_size[i], activation='tanh', name='encode_{}'.format(i))(self.inputs[i]) for i in range(3)]
    self.decodeds = [Dense(self.options.dim_feature[i], activation='sigmoid', name='decode_{}'.format(i),
                    activity_regularizer = regularizers.l2(0.0))(self.encodeds[i]) for i in range(3)]

    self.merged = concatenate(self.encodeds, axis=1)
    self.hidden_layer = Dense(self.options.hidden_size, activation='tanh', name='full_connected_layer')(self.merged)
    self.ouput_layer = Dense(1, activation='sigmoid', name='classify_layer')(self.hidden_layer)

    self.model = Model(inputs=self.inputs, outputs=self.decodeds+[self.ouput_layer])

                          metrics=dict([('decode_{}'.format(i), 'mse') for i in range(3)]+[('classify_layer', 'accuracy')]))
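    self.model = tf.contrib.tpu.keras_to_tpu_model(
                self.model,
                strategy=tf.contrib.tpu.TPUDistributionStrategy(
                    tf.contrib.cluster_resolver.TPUClusterResolver(
                        tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])))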

As of 2019-02-20, tf.contrib.tpu.keras_to_tpu_model is deprecated. You should therefore convert your model again using the newer Distribution Strategy API. An in-depth guide to distributed training is available in the TensorFlow documentation.
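A minimal sketch of what the Distribution Strategy route could look like, assuming TensorFlow 2.3 or later (where tf.distribute.TPUStrategy is available) and a Colab runtime that still exposes COLAB_TPU_ADDR as in the question. The helper names colab_tpu_address, make_strategy and build_in_scope are hypothetical, chosen only for illustration.

```python
import os


def colab_tpu_address(environ=os.environ):
    """Return 'grpc://<host:port>' for the attached Colab TPU, or None."""
    addr = environ.get("COLAB_TPU_ADDR")
    return "grpc://" + addr if addr else None


def make_strategy(address):
    """Return a TPUStrategy for `address`, or the default strategy if None."""
    # Imported here so the address helper above works without TensorFlow.
    import tensorflow as tf

    if address is None:
        return tf.distribute.get_strategy()  # default strategy (no TPU)
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=address)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    return tf.distribute.TPUStrategy(resolver)


def build_in_scope(strategy, build_fn):
    """Create (and compile) the model inside the strategy scope.

    `build_fn` is any zero-argument callable that returns a compiled
    Keras model; variables must be created inside the scope so that
    they are placed on the TPU replicas.
    """
    with strategy.scope():
        return build_fn()
```

With these helpers, the conversion call in the question would be replaced by something like build_in_scope(make_strategy(colab_tpu_address()), build_fn), followed by an ordinary model.fit call. No keras_to_tpu_model step is needed.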

I also noticed that you declare your inputs with the data type float. In CPython, a float is 64 bits wide. TPUs currently perform best with 16-bit floats, so it is worth checking how wide the values produced by your input pipeline actually are and reducing them to 16-bit, or to 8-bit where the data allows. In general, narrower values mean less data to move and faster processing for your model.
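To make the widths concrete, here is a small standard-library sketch (no TensorFlow involved). It shows the storage size of 64-, 32- and 16-bit floats and emulates bfloat16, the 16-bit format used by TPU matrix units, by keeping the top 16 bits of a float32. Note the assumption: this emulation truncates, whereas real hardware and libraries normally round, so it only illustrates the layout and the loss of precision.

```python
import math
import struct

# Bytes per value for three IEEE 754 float formats.
WIDTHS = {
    "float64": struct.calcsize("<d"),
    "float32": struct.calcsize("<f"),
    "float16": struct.calcsize("<e"),
}


def to_bfloat16_bits(x):
    """Return the 16-bit bfloat16 pattern for x, obtained by truncation."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16  # keep sign, 8 exponent bits, top 7 mantissa bits


def from_bfloat16_bits(bits16):
    """Expand a bfloat16 bit pattern back to a Python float."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


print(WIDTHS)  # {'float64': 8, 'float32': 4, 'float16': 2}
pi_bf16 = from_bfloat16_bits(to_bfloat16_bits(math.pi))
print(pi_bf16)  # 3.140625
```

bfloat16 keeps the full exponent range of float32 and gives up mantissa bits, which is why pi survives only to about three significant digits.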

It is therefore also recommended to take advantage of quantization, which converts float weights to 8-bit integers. There are two approaches: post-training quantization and quantization-aware training.
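As an illustration of the idea behind post-training quantization, here is a simplified pure-Python sketch of mapping float weights onto 8-bit integers and back. It is not the TensorFlow implementation; the functions quantize and dequantize and the sample weights are hypothetical, and the input list is assumed to be non-empty.

```python
def quantize(values, num_bits=8):
    """Map floats linearly onto integers in [0, 2**num_bits - 1].

    Returns the integer codes plus the scale and offset needed to
    recover approximate float values later.
    """
    qmax = 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax
    if scale == 0.0:
        scale = 1.0  # all values equal: any non-zero scale works
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from integer codes."""
    return [c * scale + lo for c in codes]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes, scale, lo = quantize(weights)
restored = dequantize(codes, scale, lo)
# Each restored weight differs from the original by at most half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now fits in one byte instead of four or eight, at the cost of a rounding error bounded by half the step size (scale / 2). Quantization-aware training differs in that the model is exposed to this rounding during training, so it can adapt to it.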

For more information about TPUs on Google Cloud Platform, refer to the Cloud TPU documentation. For the TPU system architecture, refer to Google's documentation on the subject, which explains how TPUs are designed.