How does one handle a variable (state-dependent) action set in Reinforcement Learning, specifically with the Actor-Critic method? I have found a similar question (Reinforcement Learning With Variable Actions), but it offers no complete answer I can use.
The problem is that the neural network that models the policy function has a fixed number of outputs (corresponding to the maximum possible set of actions). It can't help but compute probabilities for all actions, including ones that are impossible in the current state. This becomes an especially big problem when there are states where only one or two actions out of, say, the initial 50 are possible.
I see two possibilities:
1) Ignore impossible actions and choose an action among the possible ones, re-normalizing the probability of each one by the sum of their probabilities.
2) Let the action selector choose impossible actions, but penalize the network for doing so, until it learns to never choose impossible actions.
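To make option 1 concrete, this is roughly what I mean by re-normalizing. It is only a minimal sketch: the names `renormalize`, `select_action` and the boolean list `valid` are made up for illustration, and it assumes that at least one action is possible in every state.

```python
import random

def renormalize(probs, valid):
    """Zero out impossible actions and rescale the rest so they sum to 1."""
    masked = [p if ok else 0.0 for p, ok in zip(probs, valid)]
    total = sum(masked)
    if total <= 0.0:
        # Assumption: if the network put all mass on impossible actions,
        # fall back to a uniform choice over the possible ones.
        n_valid = sum(1 for ok in valid if ok)
        return [1.0 / n_valid if ok else 0.0 for ok in valid]
    return [p / total for p in masked]

def select_action(probs, valid, rng=random):
    """Sample an action index using the re-normalized probabilities."""
    weights = renormalize(probs, valid)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

# Three actions, the third (and most probable) one is impossible here.
probs = [0.01, 0.0001, 0.9899]
valid = [True, True, False]
print(renormalize(probs, valid))    # first entry ends up close to 0.99
print(select_action(probs, valid))  # never returns index 2
```

With these numbers, an action the network rated at 0.01 is selected about 99% of the time.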
I have tried both, and both have problems:
1) Ignoring impossible actions may lead to an action with an output probability of 0.01 having a re-normalized probability of 0.99, depending on the other actions. During the back-propagation step, this will lead to a big gradient because of the log(probability) factor (the network will use the original, non-normalized probability in the calculation). I'm not entirely sure whether this is desirable, and I don't seem to get particularly good results with this approach.
2) Penalizing the choice of bad actions is even more problematic. If I penalize them only slightly, action selection can get stuck for a long time while the network adjusts the over-inflated probabilities of impossible actions, until the previously small probabilities of possible actions become viable. But if I penalize bad actions heavily, it leads to bad outcomes, such as a network where one action has probability 1.0 and the rest have 0.0. What's worse, the network can even get stuck in an endless loop because of precision problems.
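To put numbers on the first problem, here is a small illustration of the mismatch I am worried about. It uses the example values from point 1 and only looks at the scalar log(probability) term and its derivative 1/p, not at a full policy-gradient update, so treat it as a rough sketch rather than what the network actually computes.

```python
import math

raw_p = 0.01      # network output for the sampled action
renorm_p = 0.99   # probability the action was actually sampled with

# The derivative of log(p) with respect to p is 1/p.
grad_raw = 1.0 / raw_p        # scale when the raw output is used
grad_renorm = 1.0 / renorm_p  # scale if the re-normalized value were used

# Corresponding negative log-probabilities.
nll_raw = -math.log(raw_p)
nll_renorm = -math.log(renorm_p)

print(grad_raw, grad_renorm)
print(nll_raw, nll_renorm)
```

The 1/p factor for the raw output is roughly 99 times larger than for the re-normalized probability, even though the action was almost certain to be picked.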
I haven't seen a proper discussion of this question anywhere, but maybe I just can't think of the proper search terms. I would be very thankful if anyone could direct me to a paper or blog post that deals with this question, or give me an extended answer on the proper handling of this case.