input for torch.nn.functional.gumbel_softmax

Say I have a tensor named attn_weights of size [1, a], whose entries are the attention weights between a given query and a keys. I want to select the key with the largest weight using torch.nn.functional.gumbel_softmax.

I find that the docs describe the parameter as logits - […, num_features] unnormalized log probabilities. I wonder whether I should take the log of attn_weights before passing it into gumbel_softmax. Also, Wikipedia defines the logit as logit(p) = log(p / (1 − p)), which is different from a plain logarithm. Which of the two should I pass to the function?

Further, I wonder how to choose tau in gumbel_softmax. Are there any guidelines?
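
For concreteness, a minimal sketch of the setup (the shape [1, 5], the random values, and tau=1.0 below are placeholders for illustration):

```python
import torch
import torch.nn.functional as F

# Toy setup: raw attention scores for one query and a = 5 keys,
# normalized with softmax so that each row sums to 1.
scores = torch.randn(1, 5)
attn_weights = F.softmax(scores, dim=-1)   # shape [1, a]

# What I currently do: pass attn_weights directly and ask for a one-hot sample.
# It is unclear to me whether the input should instead be log(attn_weights)
# or logit(attn_weights).
sample = F.gumbel_softmax(attn_weights, tau=1.0, hard=True, dim=-1)
print(sample)  # one-hot vector of shape [1, 5]
```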

Best answer:

I wonder whether I should take the log of attn_weights before passing it into gumbel_softmax.

If attn_weights are probabilities (sum to 1; e.g., output of a softmax), then yes. Otherwise, no.
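
A minimal sketch of that rule, assuming attn_weights comes out of a softmax over raw attention scores (the shapes, eps, and tau values below are illustrative choices, not prescribed ones):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

scores = torch.randn(1, 5)                 # raw, unnormalized attention scores
attn_weights = F.softmax(scores, dim=-1)   # probabilities: non-negative, rows sum to 1

# attn_weights are probabilities, so take the log to get (unnormalized)
# log-probabilities; the small eps guards against log(0).
logits = torch.log(attn_weights + 1e-10)

# hard=True returns a one-hot sample in the forward pass (selecting one key),
# while gradients flow through the soft sample (straight-through estimator).
one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=-1)

# "Otherwise, no": if only the raw scores are available, pass them directly.
# log(softmax(scores)) differs from scores by a per-row constant, and
# gumbel_softmax is invariant to adding a constant to every logit in a row.
one_hot_from_scores = F.gumbel_softmax(scores, tau=1.0, hard=True, dim=-1)
```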

I wonder how to choose tau in gumbel_softmax. Are there any guidelines?

Usually, it requires tuning. The references provided in the docs can help you with that.

From Categorical Reparameterization with Gumbel-Softmax:

  • Figure 1, caption:

    ... (a) For low temperatures (τ = 0.1, τ = 0.5), the expected value of a Gumbel-Softmax random variable approaches the expected value of a categorical random variable with the same logits. As the temperature increases (τ = 1.0, τ = 10.0), the expected value converges to a uniform distribution over the categories.

  • Section 2.2, 2nd paragraph (emphasis mine; see the annealing sketch after this list):

    While Gumbel-Softmax samples are differentiable, they are not identical to samples from the corresponding categorical distribution for non-zero temperature. For learning, there is a tradeoff between small temperatures, where samples are close to one-hot but the variance of the gradients is large, and large temperatures, where samples are smooth but the variance of the gradients is small (Figure 1). **In practice, we start at a high temperature and anneal to a small but non-zero temperature.**

  • Lastly, they remind the reader that tau can be learned:

    If τ is a learned parameter (rather than annealed via a fixed schedule), this scheme can be interpreted as entropy regularization (Szegedy et al., 2015; Pereyra et al., 2016), where the Gumbel-Softmax distribution can adaptively adjust the "confidence" of proposed samples during the training process.
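
A minimal sketch of the "start high and anneal" advice quoted above, using a generic exponential-decay schedule; the initial temperature, floor, decay rate, and update interval are placeholder values, not ones taken from the paper:

```python
import math
import torch
import torch.nn.functional as F

# Illustrative annealing schedule: start at a high temperature and decay
# toward a small but non-zero floor. All constants are example values.
TAU_INIT, TAU_MIN, DECAY_RATE = 1.0, 0.5, 1e-4
ANNEAL_EVERY = 100          # update tau every this many training steps

tau = TAU_INIT
logits = torch.randn(1, 5)  # placeholder logits (e.g., log of attention weights)

for step in range(1, 10_001):
    if step % ANNEAL_EVERY == 0:
        tau = max(TAU_MIN, TAU_INIT * math.exp(-DECAY_RATE * step))
    sample = F.gumbel_softmax(logits, tau=tau, hard=False, dim=-1)
    # ... compute a loss from `sample` and take an optimizer step as usual ...
```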