How to understand the padding rules on cloud TPU?


Cloud TPU has two padding rules for the batch size and feature dimensions of convolution operations, meant to minimize memory overhead and maximize computational efficiency (from here).

  • The total batch size should be a multiple of 64 (8 per TPU core), and feature dimensions should be a multiple of 128,

or

  • The total batch size should be a multiple of 1024 (128 per TPU core), and feature dimensions should be a multiple of 8.

If the batch size and feature dimensions don't conform to these rules, padding occurs. According to my profiling results, the second rule (batch size padded to 128 per core, feature dimensions padded to a multiple of 8) is used.
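To make the padding cost concrete, here is a small Python sketch of how the overhead of each rule could be estimated. The helper names `pad_up` and `padding_overhead` are my own for illustration, not from the TPU docs; the multiples come from the two rules above.

```python
import math

def pad_up(x, multiple):
    # Round x up to the nearest multiple (what padding effectively does).
    return math.ceil(x / multiple) * multiple

def padding_overhead(batch, feature, batch_mult, feature_mult):
    # Fraction of wasted (padded) elements relative to the real data.
    padded = pad_up(batch, batch_mult) * pad_up(feature, feature_mult)
    return padded / (batch * feature) - 1.0

# The two rules (on an 8-core TPU: 64 = 8 per core, 1024 = 128 per core).
rule_a = padding_overhead(batch=900, feature=100, batch_mult=64, feature_mult=128)
rule_b = padding_overhead(batch=900, feature=100, batch_mult=1024, feature_mult=8)
print(f"rule A overhead: {rule_a:.2%}")  # rule A overhead: 36.53%
print(f"rule B overhead: {rule_b:.2%}")  # rule B overhead: 18.33%
```

For this particular (900, 100) shape the second rule wastes less memory, which is consistent with the profiler picking it.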

I want to ask about the rationale for these rules. As far as I know, the MXU has been a 128x128 systolic array since TPUv2. Why not pad both the per-core batch size and the feature dimensions to 128?


Answer (Aireen Mei):

It is correct that the MXU is 128x128, and padding both the per-core batch size and the feature dimensions to 128 would achieve the best efficiency. In fact, the last paragraph of the page you linked says:

Using a batch size of 1024 and feature dimensions that
are a multiple of 128 results in the best efficiency, 
although this may not be possible for all models.

This, together with the two rules you mentioned, can be interpreted as: if possible, set the batch size to a multiple of 1024 (128 per core) and the feature dimensions to a multiple of 128. Otherwise, satisfy at least one of the two conditions: make the batch size a multiple of 1024 (then feature dimensions only need to be multiples of 8), or make the feature dimensions a multiple of 128 (then the batch size only needs to be a multiple of 64).
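This interpretation can be sketched as a small decision function. The function name `efficiency_tier` and the tier labels are hypothetical, for illustration only; the divisibility checks are the ones stated above.

```python
def efficiency_tier(batch, feature):
    # Best case from the docs: batch multiple of 1024 AND feature multiple of 128.
    if batch % 1024 == 0 and feature % 128 == 0:
        return "best"
    # Otherwise, meeting either rule avoids the worst padding:
    #   rule 1: batch % 64 == 0 and feature % 128 == 0
    #   rule 2: batch % 1024 == 0 and feature % 8 == 0
    if (batch % 64 == 0 and feature % 128 == 0) or \
       (batch % 1024 == 0 and feature % 8 == 0):
        return "good"
    return "padding expected"

print(efficiency_tier(1024, 256))  # best
print(efficiency_tier(128, 256))   # good
print(efficiency_tier(100, 100))   # padding expected
```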