How to understand the cache mechanism in TensorFlow


The paper TensorFlow: A System for Large-Scale Machine Learning says in Section 3.3:

We optimized TensorFlow for executing large subgraphs repeatedly with low latency. Once the graph for a step has been pruned, placed, and partitioned, its subgraphs are cached in their respective devices. A client session maintains the mapping from step definitions to cached subgraphs, so that a distributed step on a large graph can be initiated with one small message to each participating task. This model favours static, reusable graphs, but it can support dynamic computations using dynamic control flow, as the next subsection describes.
  1. How should 'cached in their respective devices' be understood here? Many APIs have a 'caching_device' parameter, but its default value is None, so how does this cache feature actually work? (A minimal usage sketch follows this list.)

  2. A cache mechanism usually comes with a cache-invalidation policy, so what is the invalidation policy here?

  3. If we create multiple clones of the model graph for multiple GPUs with between-graph replication, i.e. the clones all refer to shared variables on the ps tasks, how does each clone read the remote variables? Are the variables cached on some local device by default to reduce network communication?
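
A minimal sketch of how the 'caching_device' parameter can be used (assuming TF1 graph-mode APIs; two local CPU devices stand in for a ps task and a worker, which in a real cluster would be device strings like "/job:ps/task:0"):

    import tensorflow as tf

    # Minimal sketch of the `caching_device` parameter (TF1 graph mode).
    # The variable lives on one device but its reads are cached on another;
    # /cpu:0 stands in for a ps task and /cpu:1 for the worker CPU.
    config = tf.ConfigProto(device_count={"CPU": 2})

    with tf.device("/cpu:0"):                          # "ps" device
        w = tf.get_variable("w", shape=[4, 4],
                            initializer=tf.ones_initializer(),
                            caching_device="/cpu:1")   # cache reads on the "worker"

    y = tf.matmul(tf.ones([1, 4]), w)                  # uses the cached copy of `w`

    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(y))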

More details:

A Tour of TensorFlow
https://arxiv.org/pdf/1610.01178.pdf

Finally, an important optimization made by TensorFlow at this step is “canonicalization” of (send,receive) pairs. In the setup displayed in Figure 5b, the existence of each recv node on device B would imply allocation and management of a separate buffer to store ν’s output tensor, so that it may then be fed to nodes α and β, respectively. However, an equivalent and more efficient transformation places only one recv node on device B, streams all output from ν to this single node, and then to the two dependent nodes α and β. This last and final evolution is given in Figure 5c.

The documentation above says that the optimization shown in Figure 5c is applied automatically to reduce implicit read actions. If the same thing happens in a distributed system, network traffic would be reduced automatically, which is what we want.
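
As an illustration of that scenario (a hedged sketch, not code from the paper), one producer op on one device feeds two consumer ops on another device; TensorFlow's partitioner inserts the send/recv pair, and canonicalization means both consumers share a single recv. Two local CPU devices stand in for devices A and B:

    import tensorflow as tf

    # Sketch of Figure 5b/5c (assumed TF1 graph mode, local CPUs standing in
    # for devices A and B): one producer on A, two consumers on B. After
    # partitioning, the two consumers share a single recv node and buffer.
    config = tf.ConfigProto(device_count={"CPU": 2})

    with tf.device("/cpu:0"):                          # device A
        nu = tf.constant([1.0, 2.0, 3.0], name="nu")   # producer ν

    with tf.device("/cpu:1"):                          # device B
        alpha = tf.reduce_sum(nu, name="alpha")        # consumer α
        beta = tf.reduce_mean(nu, name="beta")         # consumer β

    with tf.Session(config=config) as sess:
        print(sess.run([alpha, beta]))                 # ν is sent to /cpu:1 once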

On the other hand, /model/slim/deployment/model_deploy.py tries to create cached variables as follows:

    def caching_device(self):
      """Returns the device to use for caching variables.

      Variables are cached on the worker CPU when using replicas.

      Returns:
        A device string or None if the variables do not need to be cached.
      """
      if self._num_ps_tasks > 0:
        return lambda op: op.device
      else:
        return None

in order to optimize network traffic, I think.
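
A hedged sketch of how a callable like the one returned above might be applied (assuming TF1 graph mode; the job/task device strings are placeholders for a real cluster, so this only builds a graph and does not run a session):

    import tensorflow as tf

    # Hedged sketch: applying a caching-device callable like the one above.
    # Using `op.device` as the caching device means each op that reads the
    # variable keeps a local copy on its own device, so within a step the
    # value is fetched from the ps task only once per reading device.
    def caching_device_fn(op):
        return op.device

    with tf.variable_scope("model", caching_device=caching_device_fn):
        with tf.device("/job:ps/task:0/cpu:0"):        # variable lives on the ps
            weights = tf.get_variable("weights", shape=[256, 256])

    with tf.device("/job:worker/task:0/gpu:0"):        # a clone reads it locally
        logits = tf.matmul(tf.ones([1, 256]), weights)

When `_num_ps_tasks` is 0 the method returns None, and reads go straight to the variable's own device without any caching.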

What is the recommended way to optimize communication in a distributed TensorFlow setup?

We would also appreciate more clarification on this, and I will update this question once I have more experimental tuning results.
