I'm getting the following error while training some PyTorch models:
ValueError('Expected parameter scale (Tensor of shape (1, 4)) of distribution Normal(loc: torch.Size([1, 4]), scale: torch.Size([1, 4])) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:\ntensor([[inf, inf, 0., 0.]])').
My actions are of shape (4,) and observations (3,).
Does it think that infinity isn't > 0, or that 0 is not greater than 0? And I don't know why this is appearing in the first place. It comes from simply training a model with model.learn in Stable Baselines 3. It learns for a while, but then fails at this step:
~\anaconda3\envs\\lib\site-packages\stable_baselines3\common\on_policy_algorithm.py in learn(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps, progress_bar)
257
258 while self.num_timesteps < total_timesteps:
--> 259 continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
260
261 if continue_training is False:
~\anaconda3\envs\\lib\site-packages\stable_baselines3\common\on_policy_algorithm.py in collect_rollouts(self, env, callback, rollout_buffer, n_rollout_steps)
167 # Convert to pytorch tensor or to TensorDict
168 obs_tensor = obs_as_tensor(self._last_obs, self.device)
--> 169 actions, values, log_probs = self.policy(obs_tensor)
170 actions = actions.cpu().numpy()
171
~\anaconda3\envs\\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
~\anaconda3\envs\\lib\site-packages\stable_baselines3\common\policies.py in forward(self, obs, deterministic)
624 # Evaluate the values for the given observations
625 values = self.value_net(latent_vf)
--> 626 distribution = self._get_action_dist_from_latent(latent_pi)
627 actions = distribution.get_actions(deterministic=deterministic)
628 log_prob = distribution.log_prob(actions)
~\anaconda3\envs\\lib\site-packages\stable_baselines3\common\policies.py in _get_action_dist_from_latent(self, latent_pi)
654
655 if isinstance(self.action_dist, DiagGaussianDistribution):
--> 656 return self.action_dist.proba_distribution(mean_actions, self.log_std)
657 elif isinstance(self.action_dist, CategoricalDistribution):
658 # Here mean_actions are the logits before the softmax
~\anaconda3\envs\\lib\site-packages\stable_baselines3\common\distributions.py in proba_distribution(self, mean_actions, log_std)
162 """
163 action_std = th.ones_like(mean_actions) * log_std.exp()
--> 164 self.distribution = Normal(mean_actions, action_std)
165 return self
166
~\anaconda3\envs\\lib\site-packages\torch\distributions\normal.py in __init__(self, loc, scale, validate_args)
54 else:
55 batch_shape = self.loc.size()
---> 56 super(Normal, self).__init__(batch_shape, validate_args=validate_args)
57
58 def expand(self, batch_shape, _instance=None):
~\anaconda3\envs\\lib\site-packages\torch\distributions\distribution.py in __init__(self, batch_shape, event_shape, validate_args)
55 if not valid.all():
56 raise ValueError(
---> 57 f"Expected parameter {param} "
58 f"({type(value).__name__} of shape {tuple(value.shape)}) "
59 f"of distribution {repr(self)} "
Keep in mind that my actions satisfy 0 <= a <= 1. Do I need to make it 0 < a <= 1 for this to work? It doesn't seem so, because it trains happily for a while, but then once the trial is up and it's updating the weights, it dies. What could be the fix and/or explanation for this? Much appreciated.
It's hard for me to know what it's even complaining about, because this code is deep inside Stable Baselines 3. Could it possibly be a bug in their package? I expect it to update the weights and continue running, but instead it complains that 0 isn't greater than 0. I don't see why that should matter, though; shouldn't it just keep going?
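To make sure I understood what the check itself is complaining about, here is a minimal standalone snippet (illustration only, not my actual model) that reproduces the same ValueError outside of Stable Baselines 3. If I'm reading the constraint right, the 0 entries are what fail GreaterThan(lower_bound=0.0), since 0 is not strictly greater than 0, while the inf entries technically pass it:

import torch
from torch.distributions import Normal

# Illustration only: a scale tensor like the one in the error message.
mean = torch.zeros(1, 4)
scale = torch.tensor([[float("inf"), float("inf"), 0.0, 0.0]])

try:
    Normal(mean, scale, validate_args=True)
except ValueError as e:
    print(e)  # same "Expected parameter scale ... GreaterThan(lower_bound=0.0)" complaint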
Thanks for taking a look.
I solved the issue. My problem was that the Gym environment I was using was rewarding constant behavior, and since a constant action has no standard deviation, that was what produced the error. The code is working now!
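In case it helps anyone who hits this: as far as I can tell, SB3's DiagGaussianDistribution builds the action std as log_std.exp() (you can see that line in the traceback above), so once training drives log_std to extreme values the std overflows to inf or underflows to 0, which is exactly the tensor in the error. A rough sketch of that, with made-up numbers:

import torch

# Hypothetical diverged log_std values, just to show where the inf/0 come from.
log_std = torch.tensor([1000.0, 1000.0, -1000.0, -1000.0])
print(log_std.exp())  # tensor([inf, inf, 0., 0.])

If you want to catch this kind of divergence earlier, Stable Baselines 3 also ships a VecCheckNan wrapper (with raise_exception=True) that flags NaN/inf values in the observations and actions passing through the env; I haven't gone back and tested it on my setup, so treat that as a pointer rather than a verified fix.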