Reinforcement Learning with an MDP for revenue optimization


I want to model the service of selling seats on an airplane as an MDP (Markov decision process) in order to use reinforcement learning for airline revenue optimization. For that I need to define the states, actions, policy, value, and reward. I have thought about it a little, but I think something is still missing.

I model my system this way:

  • States = (r, c), where r is the number of passengers and c is the number of seats bought, so r >= c.
  • Actions = (p1, p2, p3), the three possible prices; the objective is to decide which of them yields the most revenue.
  • Reward: the revenue.

Could you please tell me what you think and help me?

After the modeling, I have to implement all of that with reinforcement learning. Is there a package that does the work?


There are 2 answers below.

BEST ANSWER

I think the biggest thing missing in your formulation is the sequential part. Reinforcement learning is useful for sequential problems, where the next state depends on the current state (hence "Markovian"). In this formulation, you have not specified any Markovian behavior at all. Also, the reward is a scalar that depends on either the current state or the combination of the current state and the action. In your case, the revenue depends on the price (the action), but it has no relation to the state (the seats). These are the two big problems I see with your formulation; there are others as well. I suggest you go through the RL theory (online courses and such) and work a few sample problems before trying to formulate your own.
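For illustration, here is a minimal Python sketch of what the missing sequential structure could look like. The demand model and all numbers are assumptions made up for this example, not part of the question: the point is only that the next state depends on the current state, and the reward depends on both the state and the chosen price.

```python
import random

# Hypothetical sketch of a Markovian transition for the seat-selling
# problem. State = (seats_remaining, steps_remaining); the demand model
# and all constants are illustrative assumptions.

PRICES = [50.0, 100.0, 150.0]   # the three candidate prices p1, p2, p3
BUYERS_PER_STEP = 4             # assumed prospective buyers per time step

def purchase_prob(price):
    """Assumed demand model: higher prices sell with lower probability."""
    return max(0.0, 1.0 - price / 200.0)

def step(state, price):
    """One sequential transition: returns (next_state, reward)."""
    seats, steps = state
    demand = sum(random.random() < purchase_prob(price)
                 for _ in range(BUYERS_PER_STEP))
    sold = min(demand, seats)                # cannot sell more seats than remain
    reward = price * sold                    # reward depends on state AND action
    next_state = (seats - sold, steps - 1)   # next state depends on current state
    return next_state, reward

state = (10, 5)   # 10 unsold seats, 5 selling periods left
state, reward = step(state, PRICES[1])
print(state, reward)
```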

ANSWER

Adding this here for anyone stumbling across this topic and looking for an answer:

The sequential part should be different time steps (e.g. days/hours) at which a certain pricing action is implemented. The reward is the revenue achieved in that time step (price * quantity), and the future rewards will be based on the number of seats remaining unsold and the potential prices at which they can be sold.

State: the current number of seats remaining unsold and the passengers looking to purchase.

Actions: the potential seat prices, with probabilities of different numbers of seats being sold at each price (the transition probabilities).

Rewards: the revenue from seats sold in the current state.

To then optimise this, the Bellman equation is a common approach.
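As a hedged sketch of that optimisation step, the finite-horizon value iteration below applies the Bellman backup to compute the expected-revenue-maximising price for every (seats remaining, steps remaining) state. The demand model (a binomial number of purchases from a fixed pool of prospective buyers per step) and all constants are illustrative assumptions, not part of this answer.

```python
from math import comb

# Illustrative assumptions (not from the original question/answer).
PRICES = [50.0, 100.0, 150.0]
BUYERS_PER_STEP = 4
SEATS = 10
STEPS = 5

def purchase_prob(price):
    """Assumed demand model: higher prices sell with lower probability."""
    return max(0.0, 1.0 - price / 200.0)

def binom_pmf(k, n, p):
    """P(exactly k of n prospective buyers purchase at this price)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# V[t][s] = maximum expected revenue with s seats left and t steps remaining.
V = [[0.0] * (SEATS + 1) for _ in range(STEPS + 1)]
policy = [[None] * (SEATS + 1) for _ in range(STEPS + 1)]

for t in range(1, STEPS + 1):
    for s in range(SEATS + 1):
        best_value, best_price = 0.0, None   # price stays None when s == 0
        for price in PRICES:
            p = purchase_prob(price)
            q = 0.0
            # Bellman backup: expected immediate revenue + value of next state.
            for d in range(BUYERS_PER_STEP + 1):
                sold = min(d, s)
                q += binom_pmf(d, BUYERS_PER_STEP, p) * (
                    price * sold + V[t - 1][s - sold])
            if q > best_value:
                best_value, best_price = q, price
        V[t][s] = best_value
        policy[t][s] = best_price

print("Optimal expected revenue:", V[STEPS][SEATS])
print("Price to charge now:", policy[STEPS][SEATS])
```

Because the horizon is finite and the state space is small, exact dynamic programming suffices here; for larger state spaces, sampled methods such as Q-learning would be the usual substitute.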