Deep Reinforcement Learning: how to make an agent that controls many machines


Good morning, I'm facing an RL problem with many constraints. The main idea is that my agent will control many different machines, for example by ordering them to go out on their missions (the mission itself doesn't matter here), or by ordering them back into the depot and choosing the right place for each of them to sit (depending on the constraints). The catch is that the agent takes decisions at predefined periods of time, and for each period we know which actions (go out, go in) are allowed. For example, at 08:00 it might order 4 machines to go out, and at 14:00 decide to bring 2 machines back (choosing the right place for each of them).

In the literature I came across many ideas referring to BDQ (Branching Dueling Q-Networks), but is it required for my problem? I'm thinking about having actions like [chooseMachine1, chooseMachine2, chooseMachine3, ..., chooseMachineN, goOut, goInPlace1, goInPlace2, goInPlace3, goInPlace4], and specifying in the code the logic that, depending on the current period, only a number M <= N of the machines can be chosen (giving 0 probability to the actions that aren't possible at the moment; at 14:00, for example, only the machines that are currently out are concerned by the agent's decision). Once the agent chooses Machine1, it only has access to the actions that are possible from there; a masking sketch follows below.
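To make the masking idea concrete, here is a minimal sketch of invalid-action masking for a DQN: disallowed actions get a Q-value of -inf so the greedy policy can never select them. The action layout, mask, and sizes are illustrative assumptions, not taken from your setup.

```python
import numpy as np

def masked_greedy_action(q_values: np.ndarray, action_mask: np.ndarray) -> int:
    """Pick the greedy action among the currently allowed ones.

    q_values    : Q-value estimates for all actions, shape (num_actions,)
    action_mask : 1 for allowed actions, 0 for disallowed, shape (num_actions,)
    """
    # Set Q-values of disallowed actions to -inf so argmax never picks them.
    masked_q = np.where(action_mask.astype(bool), q_values, -np.inf)
    return int(np.argmax(masked_q))

# Hypothetical example: 3 "chooseMachine" actions + 5 operation actions.
# At 14:00, say only machines 1 and 3 are out and may be chosen.
q = np.random.randn(8)
mask = np.array([1, 0, 1, 0, 0, 0, 0, 0])
print(masked_greedy_action(q, mask))
```

The same mask can also be applied during training, e.g. when computing the max over next-state Q-values, so the target never bootstraps from impossible actions.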

So, my question is: do you think my ideas are right? (I'm a beginner.) My idea is to build a DQN and add the logic for possible/impossible actions myself. Do you think a BDQ would be a better fit for my problem, i.e. having N branches for the N machines, where each branch has the same set of possible actions (branch1 (Machine1): goOut, goInPlace1, goInPlace2, ...)? If so, are there any implementation examples?
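For reference, the branching idea would look roughly like the sketch below: a shared state encoder with one Q-value head per machine, all heads sharing the same per-machine action set. This is a PyTorch sketch under assumed layer sizes and names; it shows the branching structure only, not a full BDQ (which also adds dueling streams and a shared advantage aggregation).

```python
import torch
import torch.nn as nn

class BranchingQNetwork(nn.Module):
    """BDQ-style network: shared trunk, one Q-value branch per machine.

    Each branch outputs Q-values over the same per-machine action set,
    e.g. [goOut, goInPlace1, goInPlace2, goInPlace3, goInPlace4].
    """
    def __init__(self, state_dim: int, num_machines: int, actions_per_branch: int):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.branches = nn.ModuleList(
            [nn.Linear(128, actions_per_branch) for _ in range(num_machines)]
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        features = self.trunk(state)
        # Shape: (batch, num_machines, actions_per_branch)
        return torch.stack([branch(features) for branch in self.branches], dim=1)

# Usage: 4 machines, each with 5 possible actions, 10-dimensional state.
net = BranchingQNetwork(state_dim=10, num_machines=4, actions_per_branch=5)
q = net(torch.randn(2, 10))   # -> shape (2, 4, 5)
greedy = q.argmax(dim=-1)     # one action index per machine
```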

If you have resources to recommend, I would be glad to check them.

Thank you.


1 Answer


What would an agent navigating a maze do if the chosen action would run it into a wall?

I think the usual approach in RL is to allow the move and then handle the result in the environment. That way the environment can simply make nothing happen, or even give a negative reward, when an action is "disallowed".

At training convergence the agent will hopefully have learned not to choose ineffective actions.
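As an illustration of that pattern, here is a minimal, self-contained toy environment in your setting; the state encoding, action set, and reward values are all made up for the sketch.

```python
class ToyDepotEnv:
    """Toy environment showing the 'allow the move, penalize if invalid' pattern.

    State: how many of the N machines are currently out (illustrative).
    Actions: 0 = send one machine out, 1 = bring one machine back in.
    """
    INVALID_PENALTY = -1.0

    def __init__(self, num_machines: int = 4):
        self.num_machines = num_machines
        self.machines_out = 0

    def step(self, action: int):
        if action == 0 and self.machines_out < self.num_machines:
            self.machines_out += 1
            reward = 0.1  # placeholder reward for a mission departure
        elif action == 1 and self.machines_out > 0:
            self.machines_out -= 1
            reward = 0.1  # placeholder reward for a successful return
        else:
            # Invalid at this moment: state is unchanged, agent is penalized.
            reward = self.INVALID_PENALTY
        done = False
        return self.machines_out, reward, done, {}
```

In practice you can combine both approaches: penalize invalid actions early on, and additionally mask them at action-selection time so the agent never wastes exploration on moves you already know are impossible.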