I'm trying to train an agent on my custom Gymnasium environment through stable-baselines3, and it keeps crashing seemingly at random, throwing the following ValueError:
Traceback (most recent call last):
File "C:\Users\bo112\PycharmProjects\ecocharge\code\Simulation Env\prototype_visu.py", line 684, in <module>
model.learn(total_timesteps=time_steps, tb_log_name=log_name)
File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\ppo\ppo.py", line 315, in learn
return super().learn(
File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\on_policy_algorithm.py", line 277, in learn
continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\on_policy_algorithm.py", line 218, in collect_rollouts
terminal_obs = self.policy.obs_to_tensor(infos[idx]["terminal_observation"])[0]
File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\policies.py", line 256, in obs_to_tensor
vectorized_env = vectorized_env or is_vectorized_observation(obs_, obs_space)
File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\utils.py", line 399, in is_vectorized_observation
return is_vec_obs_func(observation, observation_space) # type: ignore[operator]
File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\utils.py", line 266, in is_vectorized_box_observation
raise ValueError(
ValueError: Error: Unexpected observation shape () for Box environment, please use (1,) or (n_env, 1) for the observation shape.
I don't understand why the observation shape or content would change, though, since nothing about how the state gets its values ever changes.
I figured out that it crashes whenever the agent 'survives' a whole episode for the first time, i.e. when truncation is used instead of termination. Is there some quirk to returning truncated and terminated that I don't know about? I can't find the error in my step function:
def step(self, action):
    ...  # handling the action etc.
    reward = 0
    truncated = False
    terminated = False

    # Check if time is over/score too low - else reward function
    if self.n_step >= self.max_steps:
        truncated = True
        print('truncated')
    elif self.score < -1000:
        terminated = True
        # print('terminated')
    else:
        reward = self.reward_fnc_distance()

    self.score += reward
    self.d_score.append(self.score)
    self.n_step += 1

    # state: [current power, peak power, fridge 1 temp, fridge 2 temp, [...], fridge n temp]
    self.state['current_power'] = self.d_power_sum[-1]
    self.state['peak_power'] = self.peak_power
    for i in range(self.n_fridges):
        self.state[f'fridge{i}_temp'] = self.d_fridges_temp[i][-1]
        self.state[f'fridge{i}_on'] = self.fridges[i].on

    if self.logging:
        print(f'score: {self.score}')
    if (truncated or terminated) and self.logging:
        self.save_run()

    return self.state, reward, terminated, truncated, {}
This is the general setup for training my models:
hidden_layer = [64, 64, 32]
time_steps = 1000_000
learning_rate = 0.003
log_name = f'PPO_{int(time_steps/1000)}k_lr{str(learning_rate).replace(".", "_")}'
vec_env = make_vec_env(env_id=ChargeEnv, n_envs=4)
model = PPO('MultiInputPolicy', vec_env, verbose=1, tensorboard_log='tensorboard_logs/',
            policy_kwargs={'net_arch': hidden_layer, 'activation_fn': th.nn.ReLU}, learning_rate=learning_rate,
            device=th.device("cuda" if th.cuda.is_available() else "cpu"), batch_size=128)
model.learn(total_timesteps=time_steps, tb_log_name=log_name)
model.save(f'models/{log_name}')
vec_env.close()
As mentioned above, the ValueError is thrown exactly when an episode gets truncated (and vice versa), so I'm pretty sure the truncation handling has to be the cause.
EDIT:
From the answer below, I found that the fix is simply to wrap all the float/Box values of self.state in numpy arrays before returning them, like so:
self.state['current_power'] = np.array([self.d_power_sum[-1]], dtype='float32')
self.state['peak_power'] = np.array([self.peak_power], dtype='float32')
for i in range(self.n_fridges):
    self.state[f'fridge{i}_temp'] = np.array([self.d_fridges_temp[i][-1]], dtype='float32')
    self.state[f'fridge{i}_on'] = self.fridges[i].on
(Note: the dtype specification is not strictly necessary by itself; it only matters when using SubprocVecEnv from stable_baselines3.)
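To illustrate why the wrapping matters, here is a small illustrative check (the bounds are placeholders, not taken from my environment): a Box declared with shape (1,) accepts shape-(1,) arrays, while a bare Python/numpy scalar is treated as shape () and produces exactly the shape mismatch from the traceback.

import numpy as np
from gymnasium import spaces

# Placeholder Box, similar to the 'current_power' entry of the observation space.
power_space = spaces.Box(low=0.0, high=np.inf, shape=(1,), dtype=np.float32)

print(power_space.contains(np.array([42.0], dtype=np.float32)))  # True: shape (1,) matches the space
print(power_space.contains(np.array(42.0, dtype=np.float32)))    # False: a scalar has shape ()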
The problem is most likely in your custom environment definition (ChargeEnv). The error says that the observation has an unexpected shape (it is empty). You should check your ChargeEnv.observation_space. If you want to create a custom environment, make sure to read the documentation to set it up correctly (https://gymnasium.farama.org/tutorials/gymnasium_basics/environment_creation/#declaration-and-initialization, https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html).
Below is an example implementation of a ChargeEnv with a correctly defined observation space. It is a minimal sketch: the bounds, the action space, and the reward/dynamics are placeholders inferred from the step() function in your question.
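import numpy as np
import gymnasium as gym
from gymnasium import spaces


class ChargeEnv(gym.Env):
    """Illustrative sketch only; bounds, action space and dynamics are placeholders."""

    def __init__(self, n_fridges=2, max_steps=1000):
        super().__init__()
        self.n_fridges = n_fridges
        self.max_steps = max_steps

        # Placeholder action space: one on/off switch per fridge.
        self.action_space = spaces.MultiBinary(n_fridges)

        # Every scalar observation is a Box of shape (1,), so the values returned
        # from reset()/step() must be numpy arrays of shape (1,), not plain floats.
        # 'fridge{i}_on' is assumed to be a Discrete(2) flag, matching the
        # unwrapped self.fridges[i].on value in your step().
        obs = {
            'current_power': spaces.Box(low=0.0, high=np.inf, shape=(1,), dtype=np.float32),
            'peak_power': spaces.Box(low=0.0, high=np.inf, shape=(1,), dtype=np.float32),
        }
        for i in range(n_fridges):
            obs[f'fridge{i}_temp'] = spaces.Box(low=-np.inf, high=np.inf, shape=(1,), dtype=np.float32)
            obs[f'fridge{i}_on'] = spaces.Discrete(2)
        self.observation_space = spaces.Dict(obs)

    def _get_obs(self):
        # Placeholder values; in the real environment these come from the simulation.
        obs = {
            'current_power': np.array([0.0], dtype=np.float32),
            'peak_power': np.array([0.0], dtype=np.float32),
        }
        for i in range(self.n_fridges):
            obs[f'fridge{i}_temp'] = np.array([5.0], dtype=np.float32)
            obs[f'fridge{i}_on'] = 0
        return obs

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.n_step = 0
        self.score = 0.0
        return self._get_obs(), {}

    def step(self, action):
        self.n_step += 1
        reward = 0.0  # placeholder reward
        terminated = self.score < -1000
        truncated = self.n_step >= self.max_steps
        return self._get_obs(), reward, terminated, truncated, {}

The key point is that every Box entry of the Dict space has shape (1,), and the observations returned from reset() and step() are shape-(1,) float32 arrays that match it exactly.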