MLP A2C policy complaining that 0 isn't greater than 0, or that infinity isn't greater than 0?


I'm getting the following error while training some torch models:

ValueError('Expected parameter scale (Tensor of shape (1, 4)) of distribution Normal(loc: torch.Size([1, 4]), scale: torch.Size([1, 4])) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:\ntensor([[inf, inf, 0., 0.]])').

My actions are of shape (4,) and observations (3,).
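
For what it's worth, the same ValueError can be reproduced with plain torch; a minimal sketch, where the scale values are copied from the error above and the mean is made up:

    import torch as th
    from torch.distributions import Normal

    mean = th.zeros(1, 4)                                        # made-up means
    scale = th.tensor([[float("inf"), float("inf"), 0.0, 0.0]])  # values from the error

    # Raises the same ValueError: scale must satisfy GreaterThan(lower_bound=0.0)
    dist = Normal(mean, scale, validate_args=True)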

Does it think that infinity isn't > 0, or that 0 is not greater than 0? And I don't know why this is appearing in the first place: it comes from simply training a model with model.learn in stable baselines 3. It learns for a while, but then fails at this step:

~\anaconda3\envs\\lib\site-packages\stable_baselines3\common\on_policy_algorithm.py in learn(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps, progress_bar)
    257 
    258         while self.num_timesteps < total_timesteps:
--> 259             continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
    260 
    261             if continue_training is False:

~\anaconda3\envs\\lib\site-packages\stable_baselines3\common\on_policy_algorithm.py in collect_rollouts(self, env, callback, rollout_buffer, n_rollout_steps)
    167                 # Convert to pytorch tensor or to TensorDict
    168                 obs_tensor = obs_as_tensor(self._last_obs, self.device)
--> 169                 actions, values, log_probs = self.policy(obs_tensor)
    170             actions = actions.cpu().numpy()
    171 

~\anaconda3\envs\\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
   1192         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194             return forward_call(*input, **kwargs)
   1195         # Do not call functions when jit is used
   1196         full_backward_hooks, non_full_backward_hooks = [], []

~\anaconda3\envs\\lib\site-packages\stable_baselines3\common\policies.py in forward(self, obs, deterministic)
    624         # Evaluate the values for the given observations
    625         values = self.value_net(latent_vf)
--> 626         distribution = self._get_action_dist_from_latent(latent_pi)
    627         actions = distribution.get_actions(deterministic=deterministic)
    628         log_prob = distribution.log_prob(actions)

~\anaconda3\envs\\lib\site-packages\stable_baselines3\common\policies.py in _get_action_dist_from_latent(self, latent_pi)
    654 
    655         if isinstance(self.action_dist, DiagGaussianDistribution):
--> 656             return self.action_dist.proba_distribution(mean_actions, self.log_std)
    657         elif isinstance(self.action_dist, CategoricalDistribution):
    658             # Here mean_actions are the logits before the softmax

~\anaconda3\envs\\lib\site-packages\stable_baselines3\common\distributions.py in proba_distribution(self, mean_actions, log_std)
    162         """
    163         action_std = th.ones_like(mean_actions) * log_std.exp()
--> 164         self.distribution = Normal(mean_actions, action_std)
    165         return self
    166 

~\anaconda3\envs\\lib\site-packages\torch\distributions\normal.py in __init__(self, loc, scale, validate_args)
     54         else:
     55             batch_shape = self.loc.size()
---> 56         super(Normal, self).__init__(batch_shape, validate_args=validate_args)
     57 
     58     def expand(self, batch_shape, _instance=None):

~\anaconda3\envs\\lib\site-packages\torch\distributions\distribution.py in __init__(self, batch_shape, event_shape, validate_args)
     55                 if not valid.all():
     56                     raise ValueError(
---> 57                         f"Expected parameter {param} "
     58                         f"({type(value).__name__} of shape {tuple(value.shape)}) "
     59                         f"of distribution {repr(self)} "

Keep in mind my actions are bounded 0 <= a <= 1. Do I need to make it 0 < a <= 1 for this to work? It doesn't seem so, because it trains fine for a while, but then, once the trial is up and it's updating the weights, it dies. What could be the fix and/or explanation for this? Much appreciated.

It's hard for me to know what the hell it's even complaining about, because this code is deep inside stable baselines 3. Could it possibly be a bug in their package? I expect it to update the weights and keep running, but instead it complains that 0 isn't greater than 0. I don't see why that should even matter, though; shouldn't it just keep going?
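
The only thing I can piece together from the frames above is that the scale being rejected is just log_std.exp() broadcast to the action shape, so presumably log_std itself has gone to +/-inf somewhere. A tiny illustration of what I mean (the log_std values are made up to match the error):

    import torch as th

    # Mirrors distributions.py line 163 from the traceback:
    #   action_std = th.ones_like(mean_actions) * log_std.exp()
    mean_actions = th.zeros(1, 4)
    log_std = th.tensor([float("inf"), float("inf"), float("-inf"), float("-inf")])

    action_std = th.ones_like(mean_actions) * log_std.exp()
    print(action_std)  # tensor([[inf, inf, 0., 0.]]) -- the invalid scale from the error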

Thanks for taking a look.

There is 1 answer below.

Answer from Snared:

I solved the issue. My problem was that the gym environment I was using was rewarding constant behavior, and since the action has no standard deviation when it is constant, that was what was producing the error. The code is working now!
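
For anyone who hits this later, here is a rough sketch of a callback that stops training as soon as the policy's log_std stops being finite, which is what eventually produces the invalid scale. This assumes an on-policy algorithm like A2C/PPO whose policy exposes a log_std parameter; the class name is just illustrative:

    import torch as th
    from stable_baselines3.common.callbacks import BaseCallback

    class StopOnBadLogStd(BaseCallback):
        """Illustrative sketch: stop training once the policy's log_std
        contains NaN/inf, before it turns into an invalid Normal scale."""

        def _on_step(self) -> bool:
            log_std = getattr(self.model.policy, "log_std", None)
            if log_std is not None and not th.isfinite(log_std).all():
                print("log_std is no longer finite:", log_std.detach().cpu())
                return False  # returning False tells SB3 to stop training
            return True

    # usage: model.learn(total_timesteps=100_000, callback=StopOnBadLogStd())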