I've been using DeepMind's Acme library for some of my reinforcement learning projects, and I've come across an issue that I'm hoping someone can shed light on.
Libraries and Environment
- Python 3.9
- TensorFlow 2.8
- Acme by DeepMind
The Code
I am working with learning.py, where I've noticed that the variables q_tm1 and q_t are plain tensors:
q_tm1 = self._critic_network(o_tm1, transitions.action)
q_t = self._target_critic_network(o_t, self._target_policy_network(o_t))
These are later passed to a function losses.categorical in the distributional.py file, which is implemented as follows:
def categorical(q_tm1: networks.DiscreteValuedDistribution,
                r_t: tf.Tensor,
                d_t: tf.Tensor,
                q_t: networks.DiscreteValuedDistribution) -> tf.Tensor:
  z_t = tf.reshape(r_t, (-1, 1)) + tf.reshape(d_t, (-1, 1)) * q_t.values
  p_t = tf.nn.softmax(q_t.logits)
  ...
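To make the expected input concrete, here is a minimal, hypothetical stand-in for DiscreteValuedDistribution (not Acme's actual class), showing the two attributes the loss relies on, with a plain-Python softmax mirroring what tf.nn.softmax does on one row:

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class FakeDiscreteValuedDistribution:
    """Hypothetical stand-in for networks.DiscreteValuedDistribution.

    The real class represents a distribution over a fixed support;
    the loss above only touches these two attributes.
    """
    values: List[float]  # fixed support (atoms) of the distribution
    logits: List[float]  # unnormalised log-probabilities over the atoms


def softmax(logits: List[float]) -> List[float]:
    """Plain-Python softmax over a single row of logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# categorical() reads q_t.values and q_t.logits, so it needs an object
# exposing both attributes, not a raw tensor of Q-values.
q_t = FakeDiscreteValuedDistribution(values=[-1.0, 0.0, 1.0],
                                     logits=[0.1, 0.2, 0.3])
p_t = softmax(q_t.logits)
```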
The Problem
The function expects both q_tm1 and q_t to be instances of DiscreteValuedDistribution. However, they are tensors as generated in learning.py. Consequently, the program crashes when trying to access the values and logits attributes of q_t and q_tm1:
z_t = tf.reshape(r_t, (-1, 1)) + tf.reshape(d_t, (-1, 1)) * q_t.values
p_t = tf.nn.softmax(q_t.logits)
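The failure mode can be reproduced without TensorFlow at all: accessing .values on any object that doesn't define it raises an AttributeError, which is what happens when a plain tensor reaches categorical. A minimal illustration (FakeTensor is my own stand-in, not a real class):

```python
# Bare object standing in for the plain tensor returned by the critic
# network; like a plain tf.Tensor, it defines no `values` or `logits`.
class FakeTensor:
    pass


q_t = FakeTensor()

try:
    _ = q_t.values  # the same attribute access that crashes in categorical()
except AttributeError as e:
    error = str(e)
```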
Questions
Is this behavior intentional, or perhaps an oversight? If it's intentional, what would be the workaround to ensure proper functioning? If it's not intentional, how could one go about fixing it?
What I've Tried
Tensor to DiscreteValuedDistribution Conversion: My initial thought was to convert the q_tm1 and q_t tensors into instances of DiscreteValuedDistribution. However, this approach was impractical because I don't have access to the corresponding logits or probs, which are required to initialize a DiscreteValuedDistribution.
# The following wouldn't work without logits or probs
q_tm1_as_dvd = networks.DiscreteValuedDistribution(values=q_tm1, logits=???)
q_t_as_dvd = networks.DiscreteValuedDistribution(values=q_t, logits=???)
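Part of why this conversion seems fundamentally impractical: even if the tensor holds the distribution's expected value, the logits cannot be recovered from it, since many different distributions over the same support share the same mean. A small plain-Python illustration with hypothetical numbers:

```python
def mean(values, probs):
    """Expected value of a discrete distribution over a fixed support."""
    return sum(v * p for v, p in zip(values, probs))


values = [-1.0, 0.0, 1.0]

# Two very different distributions over the same support...
probs_a = [0.5, 0.0, 0.5]  # mass split between the extremes
probs_b = [0.0, 1.0, 0.0]  # all mass on the middle atom

# ...collapse to the same scalar mean, so a scalar Q-value tensor
# cannot be inverted back into logits/probs.
m_a = mean(values, probs_a)
m_b = mean(values, probs_b)
```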
Modifying categorical Implementation: Another potential fix was to alter the implementation of the losses.categorical function in distributional.py. However, this could create ripple effects in other parts of the codebase, including:
- Breaking other algorithms that may rely on this function.
- Introducing new bugs, especially if the original design was intentional.
- Invalidating existing tests written against the function's current behavior.
Given these constraints, I'm hesitant to make large-scale changes without a better understanding of the implications and design choices behind the existing code. Any help or insights are greatly appreciated!
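For completeness, one lower-risk direction I've considered (a sketch only; checked_categorical is my own name, not part of Acme's API): rather than editing losses.categorical itself, wrap it with a guard that fails fast with a clear message when handed a raw tensor, so the mismatch surfaces at the call site instead of as a cryptic AttributeError deep in the loss:

```python
def checked_categorical(q_tm1, r_t, d_t, q_t, categorical_fn):
    """Hypothetical wrapper: validate inputs, then delegate to the real
    loss function (passed in as categorical_fn) untouched."""
    for name, q in (("q_tm1", q_tm1), ("q_t", q_t)):
        # The loss only needs .values and .logits, so duck-type on those.
        if not (hasattr(q, "values") and hasattr(q, "logits")):
            raise TypeError(
                f"{name} must expose .values and .logits "
                f"(e.g. a DiscreteValuedDistribution), "
                f"got {type(q).__name__}")
    return categorical_fn(q_tm1, r_t, d_t, q_t)


# Tiny demo with stand-in objects (not Acme classes).
class FakeDist:
    values = [0.0]
    logits = [0.0]


result = checked_categorical(FakeDist(), None, None, FakeDist(),
                             lambda *args: "delegated")
```

This keeps the original function and its tests untouched while making the type requirement explicit, though it obviously doesn't answer whether the tensor/distribution mismatch is a bug upstream.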