I'm trying to train an agent on my custom Gymnasium environment through stable-baselines3, and it keeps crashing seemingly at random, throwing the following ValueError:
Traceback (most recent call last):
File "C:\Users\bo112\PycharmProjects\ecocharge\code\Simulation Env\prototype_visu.py", line 684, in <module>
model.learn(total_timesteps=time_steps, tb_log_name=log_name)
File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\ppo\ppo.py", line 315, in learn
return super().learn(
File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\on_policy_algorithm.py", line 277, in learn
continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\on_policy_algorithm.py", line 218, in collect_rollouts
terminal_obs = self.policy.obs_to_tensor(infos[idx]["terminal_observation"])[0]
File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\policies.py", line 256, in obs_to_tensor
vectorized_env = vectorized_env or is_vectorized_observation(obs_, obs_space)
File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\utils.py", line 399, in is_vectorized_observation
return is_vec_obs_func(observation, observation_space) # type: ignore[operator]
File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\utils.py", line 266, in is_vectorized_box_observation
raise ValueError(
ValueError: Error: Unexpected observation shape () for Box environment, please use (1,) or (n_env, 1) for the observation shape.
I don't understand why the observation shape or content would change, though, since nothing about how the state gets its values ever changes.
I figured out that it crashes whenever the agent 'survives' a whole episode for the first time, i.e. when truncation is used instead of termination. Is there some quirk to returning truncated and terminated that I don't know about? I can't find the error in my step function:
def step(self, action):
    ...  # handling the action etc.
    reward = 0
    truncated = False
    terminated = False

    # Check if time is over/score too low - else reward function
    if self.n_step >= self.max_steps:
        truncated = True
        print('truncated')
    elif self.score < -1000:
        terminated = True
        # print('terminated')
    else:
        reward = self.reward_fnc_distance()

    self.score += reward
    self.d_score.append(self.score)
    self.n_step += 1

    # state: [current power, peak power, fridge 1 temp, fridge 2 temp, [...], fridge n temp]
    self.state['current_power'] = self.d_power_sum[-1]
    self.state['peak_power'] = self.peak_power
    for i in range(self.n_fridges):
        self.state[f'fridge{i}_temp'] = self.d_fridges_temp[i][-1]
        self.state[f'fridge{i}_on'] = self.fridges[i].on

    if self.logging:
        print(f'score: {self.score}')
    if (truncated or terminated) and self.logging:
        self.save_run()

    return self.state, reward, terminated, truncated, {}
This is the general setup for training my models:
hidden_layer = [64, 64, 32]
time_steps = 1000_000
learning_rate = 0.003
log_name = f'PPO_{int(time_steps/1000)}k_lr{str(learning_rate).replace(".", "_")}'
vec_env = make_vec_env(env_id=ChargeEnv, n_envs=4)
model = PPO('MultiInputPolicy', vec_env, verbose=1, tensorboard_log='tensorboard_logs/',
            policy_kwargs={'net_arch': hidden_layer, 'activation_fn': th.nn.ReLU}, learning_rate=learning_rate,
            device=th.device("cuda" if th.cuda.is_available() else "cpu"), batch_size=128)
model.learn(total_timesteps=time_steps, tb_log_name=log_name)
model.save(f'models/{log_name}')
vec_env.close()
As mentioned above, the ValueError is thrown exactly when an episode gets truncated (and vice versa), so I'm pretty sure the truncation handling has to be the cause.
EDIT:
From the answer below, I found that the fix is simply to wrap all the float/Box values of self.state in numpy arrays before returning them, like so:
self.state['current_power'] = np.array([self.d_power_sum[-1]], dtype='float32')
self.state['peak_power'] = np.array([self.peak_power], dtype='float32')
for i in range(self.n_fridges):
    self.state[f'fridge{i}_temp'] = np.array([self.d_fridges_temp[i][-1]], dtype='float32')
    self.state[f'fridge{i}_on'] = self.fridges[i].on
(Note: the dtype specification is not strictly necessary by itself; it only matters when using SubprocVecEnv from stable_baselines3.)
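To illustrate why the wrapping matters, here is a small illustrative check (the bounds are placeholders, not taken from my environment): a Box declared with shape (1,) accepts shape-(1,) arrays, while a bare Python/numpy scalar is treated as shape () and produces exactly the shape mismatch from the traceback.

import numpy as np
from gymnasium import spaces

# Placeholder Box, similar to the 'current_power' entry of the observation space.
power_space = spaces.Box(low=0.0, high=np.inf, shape=(1,), dtype=np.float32)

print(power_space.contains(np.array([42.0], dtype=np.float32)))  # True: shape (1,) matches the space
print(power_space.contains(np.array(42.0, dtype=np.float32)))    # False: a scalar has shape ()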
The problem is most likely in your custom environment definition (ChargeEnv). The error says that the observation has an unexpected shape (it is empty). You should check your ChargeEnv.observation_space. If you want to create a custom environment, make sure to read the documentation to set it up correctly (https://gymnasium.farama.org/tutorials/gymnasium_basics/environment_creation/#declaration-and-initialization, https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html).
Below is an example implementation of a ChargeEnv with a correctly defined observation space. It is a minimal sketch: the bounds, the action space, and the reward/dynamics are placeholders inferred from the step() function in your question.
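import numpy as np
import gymnasium as gym
from gymnasium import spaces


class ChargeEnv(gym.Env):
    """Illustrative sketch only; bounds, action space and dynamics are placeholders."""

    def __init__(self, n_fridges=2, max_steps=1000):
        super().__init__()
        self.n_fridges = n_fridges
        self.max_steps = max_steps

        # Placeholder action space: one on/off switch per fridge.
        self.action_space = spaces.MultiBinary(n_fridges)

        # Every scalar observation is a Box of shape (1,), so the values returned
        # from reset()/step() must be numpy arrays of shape (1,), not plain floats.
        # 'fridge{i}_on' is assumed to be a Discrete(2) flag, matching the
        # unwrapped self.fridges[i].on value in your step().
        obs = {
            'current_power': spaces.Box(low=0.0, high=np.inf, shape=(1,), dtype=np.float32),
            'peak_power': spaces.Box(low=0.0, high=np.inf, shape=(1,), dtype=np.float32),
        }
        for i in range(n_fridges):
            obs[f'fridge{i}_temp'] = spaces.Box(low=-np.inf, high=np.inf, shape=(1,), dtype=np.float32)
            obs[f'fridge{i}_on'] = spaces.Discrete(2)
        self.observation_space = spaces.Dict(obs)

    def _get_obs(self):
        # Placeholder values; in the real environment these come from the simulation.
        obs = {
            'current_power': np.array([0.0], dtype=np.float32),
            'peak_power': np.array([0.0], dtype=np.float32),
        }
        for i in range(self.n_fridges):
            obs[f'fridge{i}_temp'] = np.array([5.0], dtype=np.float32)
            obs[f'fridge{i}_on'] = 0
        return obs

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.n_step = 0
        self.score = 0.0
        return self._get_obs(), {}

    def step(self, action):
        self.n_step += 1
        reward = 0.0  # placeholder reward
        terminated = self.score < -1000
        truncated = self.n_step >= self.max_steps
        return self._get_obs(), reward, terminated, truncated, {}

The key point is that every Box entry of the Dict space has shape (1,), and the observations returned from reset() and step() are shape-(1,) float32 arrays that match it exactly.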