Naive model partitioning across several GPUs moves the workload from GPU to GPU during the forward and backward passes, so at any instant only one GPU is busy. Here's the naive version:
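# Imports assume the Keras API bundled with TensorFlow.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

# input_shape and num_classes are assumed to be defined elsewhere.
model = Sequential()
with tf.device('/gpu:0'):
    model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))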
with tf.device('/gpu:1'):
    # input_shape is only needed on the first layer of the model.
    model.add(Conv2D(128, kernel_size=(3, 3),
                 activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
with tf.device('/gpu:2'):
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(1500, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation='softmax'))
We need sample code (a template) that pipelines the work and keeps all GPUs busy by sending waves of batches through the model and coordinating the work on each GPU (forward pass, gradient calculation, parameter updates).
A hint is provided here via the use of a data_flow_ops.StagingArea, but a concrete example would be helpful.
I understand that data parallelism (partitioning the data rather than the model) is usually the way to go, but there are use cases where the model needs to be partitioned across CPU+GPUs.
I'd be grateful for any pointers or sample (pseudo)code.
 
                        
Model parallelism is a technique used when the parameters of a model don't fit into a single GPU's memory. This happens when the model is either very deep (many layers) or when some of its layers are huge. Treat model parallelism as a last resort, because it is usually quite slow: activations have to be moved between devices, and without pipelining only one device works at a time.
Your model looks quite simple, so I am not sure you really need model parallelism (was it just an example?). If the whole model fits on a single GPU, and only one GPU would be active at a time anyway, I wouldn't recommend model parallelism.
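To make "fits on a single GPU" concrete, here is a rough back-of-the-envelope sketch that counts the parameters of the layers in the question. The 28x28x1 input and the 10 classes are hypothetical values picked for illustration (the question does not state input_shape or num_classes), the helper names are made up, and the estimate assumes float32 weights and Keras' default 'valid' padding. It covers parameters only, not activations, gradients, or optimizer state.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter for a 2D convolution."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit for a fully connected layer."""
    return in_units * out_units + out_units


# Hypothetical input: 28x28 grayscale images, 10 output classes.
side, num_classes = 28, 10

# Track the spatial size through the network ('valid' 3x3 convs, 2x2 pooling).
side -= 2      # Conv2D(32)  -> 26
side -= 2      # Conv2D(64)  -> 24
side //= 2     # MaxPooling  -> 12
side -= 2      # Conv2D(128) -> 10
side //= 2     # MaxPooling  -> 5
side -= 2      # Conv2D(64)  -> 3
side //= 2     # MaxPooling  -> 1
flat_units = side * side * 64  # size of the Flatten output

layer_params = [
    conv2d_params(3, 3, 1, 32),
    conv2d_params(3, 3, 32, 64),
    conv2d_params(3, 3, 64, 128),
    conv2d_params(3, 3, 128, 64),
    dense_params(flat_units, 1500),
    dense_params(1500, num_classes),
]

total_params = sum(layer_params)
param_bytes = total_params * 4  # float32 = 4 bytes per parameter
print(f"{total_params} parameters, {param_bytes / 2**20:.2f} MiB")
```

Under these assumptions the parameters take on the order of a megabyte, far below the memory of any current GPU, which is why data parallelism is the more natural fit for a model of this size.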
If you are sure you need model parallelism, then refer to this example, which implements it using Apache MXNet.
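As for keeping all GPUs busy, the scheduling idea can be sketched independently of any framework. The following is a minimal, hypothetical simulation (plain Python, no GPU code; the function names are made up for illustration): the batch is split into micro-batches, and stage s starts micro-batch m at time step m + s, which is the idea behind micro-batch pipelining schedules such as GPipe. It models only the forward pass and assumes every stage takes one time step; a real implementation also has to schedule the backward pass and the parameter updates.

```python
def pipeline_schedule(num_stages, num_microbatches):
    """Build a forward-pass timetable.

    schedule[t][s] is the index of the micro-batch that stage (GPU) s
    processes at time step t, or None if that stage is idle.
    """
    num_steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(num_steps):
        row = []
        for s in range(num_stages):
            mb = t - s  # micro-batch mb reaches stage s after s earlier steps
            row.append(mb if 0 <= mb < num_microbatches else None)
        schedule.append(row)
    return schedule


def utilization(schedule):
    """Fraction of (time step, stage) slots in which a stage is busy."""
    busy = sum(cell is not None for row in schedule for cell in row)
    return busy / (len(schedule) * len(schedule[0]))


# One big batch: the naive case, where only one GPU works at a time.
naive = pipeline_schedule(num_stages=3, num_microbatches=1)
# The same batch split into 8 micro-batches that flow through in waves.
piped = pipeline_schedule(num_stages=3, num_microbatches=8)

for t, row in enumerate(piped):
    cells = ["idle" if mb is None else f"mb{mb}" for mb in row]
    print(f"t={t}: " + "  ".join(f"gpu{s}={c}" for s, c in enumerate(cells)))

print("naive utilization:", utilization(naive))
print("pipelined utilization:", utilization(piped))
```

In this toy model, three stages with a single batch are busy one third of the time, while eight micro-batches raise that to 80%; the remaining idle slots are the fill and drain phases at the start and end of each wave.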