Vowpal Wabbit Contextual Bandit correct usage


I am currently using the Vowpal Wabbit package to simulate a Contextual Bandit. I have a couple of questions regarding the usage of the library:

  1. I have multiple contexts/categories whose action sets overlap. For example, let's say I have jerseys of Team A, Team B, and Team C. These jerseys come in sizes S, M, and L. Based on past demand, I want to recommend a size of jersey to produce.

Contexts - Team A, Team B, Team C
Actions - S, M, and L

Each context has the same set of actions to choose from. I want Vowpal Wabbit to understand that each context is different and to maintain a separate distribution over the action space for each one. Currently, Vowpal Wabbit uses the same distribution/PMF over actions across all contexts.

So if Team A is the context, the distribution is [0.1, 0.8, 0.1] after several runs. Team B then gets the same distribution [0.1, 0.8, 0.1] even though VW has never seen it as an input; ideally, I would want Team B to start from [0.33, 0.33, 0.33].

Is there a way I can utilize VW to differentiate contexts and give them separate distributions?

I am simulating the Contextual Bandit with Vowpal Wabbit using the following settings: "--cb_explore_adf --save_resume --quiet --epsilon 0.1". A sketch of the simulation loop follows.
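For concreteness, here is a minimal sketch of such a simulation loop, assuming the vowpalwabbit Python package (the constructor is Workspace in recent releases, pyvw.vw in older ones). The namespace letters C and A, the team=/size= feature names, and the cost value are illustrative choices, not anything the library mandates:

    import random
    import vowpalwabbit

    vw = vowpalwabbit.Workspace(
        "--cb_explore_adf --save_resume --quiet --epsilon 0.1"
    )

    SIZES = ["S", "M", "L"]

    def to_adf(team, chosen=None, cost=None, prob=None):
        # One "shared" line for the context, then one line per action.
        # Namespace C holds context features, namespace A action features.
        lines = [f"shared |C team={team}"]
        for i, size in enumerate(SIZES):
            label = ""
            if chosen is not None and i == chosen:
                # Label on the chosen action's line: action:cost:probability
                label = f"0:{cost}:{prob} "
            lines.append(f"{label}|A size={size}")
        return "\n".join(lines)

    # Predict returns the pmf over the three actions for this context.
    pmf = vw.predict(to_adf("team_a"))
    chosen = random.choices(range(len(SIZES)), weights=pmf)[0]

    # Observe a cost (negative cost = reward) and update the policy.
    vw.learn(to_adf("team_a", chosen=chosen, cost=-1.0, prob=pmf[chosen]))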

  2. I was also wondering if there is a way to access/view the underlying learnt policy. Where are the different distributions or learnt policies stored?

Thanks

There is 1 answer below.

For VW to understand that each context is different, you need to add "-q CA" to create feature interactions between the context features (namespace C) and the action features (namespace A). Since you already trained the model on Team A, the model weights have already been updated by the time Team B arrives, so its distribution won't be uniform random anymore. Maybe you can try adding "--ignore_linear C" and "--ignore_linear A", which drop the standalone linear terms so that only the interaction features are learned. Also, I'm curious why you would want the action distribution to be uniform random for Team B.
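As a rough sketch of that suggestion, reusing the hypothetical Python setup from the question, only the argument string changes:

    # -q CA crosses the C (context) and A (action) namespaces, so each team
    # gets its own weights per size; --ignore_linear removes the standalone
    # linear terms so only the interactions remain.
    vw = vowpalwabbit.Workspace(
        "--cb_explore_adf -q CA --ignore_linear C --ignore_linear A "
        "--save_resume --quiet --epsilon 0.1"
    )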

To access/view the learnt policy you can try "--readable_model READABLE_MODEL_PATH". To save the different distributions you can do "-p PREDICTION_FILE_PATH", to save the learnt policy "-f MODEL_PATH". For more options about learnt policy: https://vowpalwabbit.org/docs/vowpal_wabbit/python/latest/command_line_args.html#output-model-options