I'd appreciate help understanding the situation below. Thanks in advance. My setup: OS: Ubuntu 16.04, 2 Titan X GPUs. TensorFlow (version 0.12.1) installed in a conda environment using pip, as per the TF docs. Python 3.5.
Code:
I ran the following code to test my 2-GPU setup, once with random_matrix = tf.zeros(...) and once with random_matrix = tf.random_uniform(...). The outputs are shown below.
Questions:
1) When I run with tf.zeros, the timings on CPU and GPU are identical, but with tf.random_uniform the GPU is faster (as I had expected). Why does tf.zeros get no GPU speedup? What am I missing?
2) I have fixed both the graph-level seed and the op-level seed. Why do the outputs differ between devices in the tf.random_uniform case?
Thanks a lot for any insights in advance.
import sys
import numpy as np
import tensorflow as tf
from datetime import datetime

device_names = ["/cpu:0", "/gpu:0", "/gpu:1"]
shapes = [(3000, 3000), (6000, 6000), (9000, 9000), (12000, 12000)]
all_timings = []

tf.set_random_seed(1234)  # graph-level seed

for device_name in device_names:
    device_timings = []
    for shape in shapes:
        print("device_name:::::::::{}".format(device_name))
        with tf.device(device_name):
            # random_matrix = tf.zeros(shape)
            random_matrix = tf.random_uniform(shape=shape,
                                              minval=0,
                                              maxval=1,
                                              seed=1234)  # op-level seed
            result_op = tf.reduce_sum(tf.matmul(random_matrix, tf.transpose(random_matrix)))
        start_time = datetime.now()
        result = -1.0
        with tf.Session(config=tf.ConfigProto(log_device_placement=False)) as session:
            result = session.run(result_op)
        time_diff = datetime.now() - start_time
        device_timings.append((device_name,
                               shape,
                               "time_taken (secs): {}".format(time_diff.total_seconds()),
                               "result: {}".format(result)))
        print("++++++++++++++++++++++++++++++++++++++++++++++++++++++\n\n")
    all_timings.append(device_timings)

print("\n\n")
for device_timings in all_timings:
    for t in device_timings:
        print(t)
    print("---------------------------------------------------------\n\n")
Timings with tf.random_uniform():
('/cpu:0', (3000, 3000), 'time_taken (secs): 1.146831', 'result: 6754431488.0')
('/cpu:0', (6000, 6000), 'time_taken (secs): 2.816985', 'result: 54023852032.0')
('/cpu:0', (9000, 9000), 'time_taken (secs): 9.372665', 'result: 184425938944.0')
('/cpu:0', (12000, 12000), 'time_taken (secs): 21.718614', 'result: 439655661568.0')
---------------------------------------------------------
('/gpu:0', (3000, 3000), 'time_taken (secs): 0.39667', 'result: 6754406912.0')
('/gpu:0', (6000, 6000), 'time_taken (secs): 0.085984', 'result: 54006796288.0')
('/gpu:0', (9000, 9000), 'time_taken (secs): 0.221407', 'result: 182251880448.0')
('/gpu:0', (12000, 12000), 'time_taken (secs): 0.444187', 'result: 431996174336.0')
---------------------------------------------------------
('/gpu:1', (3000, 3000), 'time_taken (secs): 0.399159', 'result: 6754401792.0')
('/gpu:1', (6000, 6000), 'time_taken (secs): 0.102889', 'result: 54006857728.0')
('/gpu:1', (9000, 9000), 'time_taken (secs): 0.262842', 'result: 182251585536.0')
('/gpu:1', (12000, 12000), 'time_taken (secs): 0.469139', 'result: 431996141568.0')
---------------------------------------------------------
Timings with tf.zeros():
('/cpu:0', (3000, 3000), 'time_taken (secs): 1.040602', 'result: 0.0')
('/cpu:0', (6000, 6000), 'time_taken (secs): 2.760587', 'result: 0.0')
('/cpu:0', (9000, 9000), 'time_taken (secs): 9.134257', 'result: 0.0')
('/cpu:0', (12000, 12000), 'time_taken (secs): 21.410583', 'result: 0.0')
---------------------------------------------------------
('/gpu:0', (3000, 3000), 'time_taken (secs): 0.394707', 'result: 0.0')
('/gpu:0', (6000, 6000), 'time_taken (secs): 2.750311', 'result: 0.0')
('/gpu:0', (9000, 9000), 'time_taken (secs): 9.141721', 'result: 0.0')
('/gpu:0', (12000, 12000), 'time_taken (secs): 21.441183', 'result: 0.0')
---------------------------------------------------------
('/gpu:1', (3000, 3000), 'time_taken (secs): 0.390197', 'result: 0.0')
('/gpu:1', (6000, 6000), 'time_taken (secs): 2.788815', 'result: 0.0')
('/gpu:1', (9000, 9000), 'time_taken (secs): 9.335516', 'result: 0.0')
('/gpu:1', (12000, 12000), 'time_taken (secs): 21.654866', 'result: 0.0')
I suspect this is related to GPU kernel optimization. If you "pre-warm" your GPU by first running a computation of the same shape, the next execution is much faster. There is PTX compilation that adds a couple of seconds to the first use of a kernel on a GPU within a process, but it is peculiar that your runtime increases with the size of the matrix; perhaps there is some profiling going on as well.
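One way to separate those one-time startup costs (session creation, kernel/PTX compilation) from steady-state execution time is to do an untimed warm-up run and then time several repeated runs. A minimal sketch of that pattern, using a hypothetical run_once() as a stand-in for your session.run(result_op) call:

```python
import time

def run_once():
    # Hypothetical stand-in for `session.run(result_op)`; substitute the
    # real TensorFlow call when measuring on a GPU.
    return sum(i * i for i in range(10000))

run_once()  # untimed warm-up: absorbs one-time setup/compilation costs

# Time only the steady-state executions.
timings = []
for _ in range(5):
    start = time.perf_counter()
    run_once()
    timings.append(time.perf_counter() - start)

best = min(timings)  # min (or median) is more robust than a single sample
print("best of 5 runs: {:.6f} s".format(best))
```

Timing this way should make the PTX-compilation overhead show up only in the warm-up, not in the reported numbers.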
Note that without setting tf.OptimizerOptions.L0 it becomes implausibly fast, so there is some constant-folding/caching happening as well: tf.zeros produces a constant, so the optimizer can evaluate the whole expression once and reuse the result.