I have trained a contextual bandit model on some production policy data with

```
--cb <nb_labels> --cb_type ips -b 25 -d /tmp/data/train.vw -f candidate-model.vw -c
```
and got the following logging output:

```
final_regressor = candidate-model.vw
using cache_file = /tmp/data/train.vw.cache
ignoring text input in favor of cache input
num sources = 1
Num weight bits = 25
learning rate = 0.1
initial_t = 0
power_t = 0.5
cb_type = ips
Enabled learners: gd, scorer-identity, csoaa_ldf-rank, cb_adf, shared_feature_merger, cb_to_cbadf
Input label = CB
Output pred = MULTICLASS
average since example example current current current
loss last counter weight label predict features
0.000000 0.000000 1 1.0 38:1:0.9 0:0 988
0.000000 0.000000 2 2.0 38:1:0.9 0:0 1824
0.000000 0.000000 4 4.0 28:1:0.81 0:0 380
0.000000 0.000000 8 8.0 38:1:0.9 0:0 380
0.000000 0.000000 16 16.0 0:1:0.1 21:-0.09 988
0.205863 0.411726 32 32.0 50:1:0.15 50:-0 836
0.258559 0.311255 64 64.0 24:1:0.0076 33:-0.1 380
0.129279 0.000000 128 128.0 21:1:0.59 13:-0.18 912
0.130960 0.132640 256 256.0 21:1:0.59 15:-0.21 380
0.098640 0.066320 512 512.0 21:1:0.59 36:-0.55 380
0.309922 0.521204 1024 1024.0 28:1:0.81 35:-0.2 380
0.286929 0.263936 2048 2048.0 38:1:0.9 74:-1.3 1520
0.286034 0.285139 4096 4096.0 21:1:0.59 17:-0.03 380
0.177546 0.069058 8192 8192.0 28:1:0.81 2:0 380
0.144739 0.111932 16384 16384.0 21:1:0.59 51:-0.02 380
0.080961 0.017183 32768 32768.0 21:1:0.59 51:-0.01 380
0.113071 0.145181 65536 65536.0 50:1:0.15 66:-0.13 988
0.082870 0.052669 131072 131072.0 38:1:0.9 2:0 304
0.080738 0.078605 262144 262144.0 38:1:0.9 66:-0.08 304
0.096056 0.111375 524288 524288.0 21:1:0.6 27:-0 380
0.075760 0.055464 1048576 1048576.0 38:1:0.89 44:-0.11 380
0.135795 0.195831 2097152 2097152.0 6:1:0.79 25:-0.12 988
finished run
number of examples = 2472547
weighted example sum = 2472547.000000
weighted label sum = 0.000000
average loss = 0.125687
total feature number = 1735872072
```
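For context, my working assumption (pieced together from the general IPS literature rather than from the VW docs, so please correct me if it's wrong) is that with `--cb_type ips` the progressive `average loss` above is the standard inverse-propensity-score estimate of the learned policy's cost:

```latex
% IPS estimate of the candidate policy's expected cost over n logged
% examples, where for example i:
%   a_i = logged action, c_i = observed cost, p_i = logging probability,
%   \pi(x_i) = action the learned policy picks for context x_i.
% The logged cost counts only when the policy agrees with the log,
% reweighted by 1/p_i to keep the estimate unbiased.
\hat{L}_{\mathrm{IPS}} = \frac{1}{n} \sum_{i=1}^{n}
    \frac{c_i \, \mathbb{1}\!\left[\pi(x_i) = a_i\right]}{p_i}
```

Under that reading, the final average loss of 0.125687 would be the estimated per-example cost of the candidate policy, which is partly why the negative per-action scores surprise me.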
A few questions about this:

- I don't understand why I get negative scores in the `current predict` column. Does anyone have a possible explanation for this? I can't find any documentation about these scores, so any help is very much appreciated.
- Rows of the training and test data contain explicit available actions, e.g.

  ```
  19 21:1:0.8088 33 40 60 72 |User feat1=<feat1> feat2=<feat2> ...
  ```

  but the predictions contain labels different from the available ones. Is this normal, expected behavior? If I want to restrict the prediction strictly to some given available actions, do I need to switch to `cb_adf` (even when I do not have rich features associated with the actions)? See the format sketch after this list.
- On the test set I get a quite low `average loss`, which would suggest that the optimization worked fine (the average loss is significantly lower than the sum of the costs over the length of the test set, which translates to better performance for the candidate policy) in spite of the negative scores. This puzzles me, since I can't yet properly evaluate the quality of the new optimized policy.
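In case it helps clarify the second question, this is the kind of `cb_adf` input I have in mind, a minimal sketch based on my understanding of the multiline format (the `Action` namespace and the `id` feature are placeholders I made up, and the `0:cost:probability` label goes on the line of the chosen action, with the first field ignored in ADF):

```
shared |User feat1=<feat1> feat2=<feat2>
|Action id=19
0:1:0.8088 |Action id=21
|Action id=33
|Action id=40
|Action id=60
|Action id=72
```

With this layout the candidate actions are exactly the listed lines, so, as far as I understand, the prediction could only ever be one of them.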
Note: training on the same data with `cb_explore` runs normally, with no issues and no negative scores/probabilities.
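For completeness, that exploration run was roughly the same invocation as above with `--cb` swapped for `--cb_explore` (exploration parameters left at their defaults):

```
--cb_explore <nb_labels> --cb_type ips -b 25 -d /tmp/data/train.vw -c
```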