In the description of BERT's output it says:
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
I have trouble understanding what this output means. My aim is to compare human response data with BERT data in a fill-mask experiment. I use fill-mask with the option topk and group the predicted fillers by linguistic properties (here, number). For BERT I can either use the raw scores (logits) of the predictions or normalize them first by applying softmax. The two methods give me different correlations with the human data. So here are my questions:
- When looking for correlations between BERT responses and human responses, which BERT output should be used: raw logits or softmax values?
- Is it ok to add up raw logits to get a group score?
- If softmax is used, should it be applied to the logit scores of the single predictions or to the summed group score?
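The three options in these questions can be made concrete with a small sketch. The fillers, logits, and number labels below are made-up toy values, not real BERT output, and `softmax` and `group_sum` are hypothetical helper names. Note also that softmax is taken here over the listed candidates only; taken over the full vocabulary, the probabilities would differ.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def group_sum(values, labels):
    """Add up the values that share the same group label."""
    totals = {}
    for v, g in zip(values, labels):
        totals[g] = totals.get(g, 0.0) + v
    return totals

# Toy stand-ins for the top-k fillers of one masked sentence (assumed values).
fillers = ["dog", "cat", "dogs", "cats"]
logits = [5.0, 4.0, 3.0, -2.0]
groups = ["sg", "sg", "pl", "pl"]

# Option A: add up the raw logits per group.
logit_sums = group_sum(logits, groups)

# Option B: softmax over the single predictions, then add up per group.
prob_sums = group_sum(softmax(logits), groups)

# Option C: softmax applied to the summed group logits.
softmax_of_sums = dict(zip(["sg", "pl"],
                           softmax([logit_sums["sg"], logit_sums["pl"]])))

print("A:", logit_sums)
print("B:", prob_sums)
print("C:", softmax_of_sums)
```

With these toy numbers, options B and C give clearly different group scores, and only option B yields group values that sum to 1 by construction, which is one way to see why the choice can change the correlations.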
My main problem is that I do not know how to interpret the raw logit scores. Are they probabilities transformed into log space? But then, why does the quote above emphasize that the scores are "before SoftMax"? Any suggestions on what I should read? I know the architecture of BERT and the training procedure. Thanks!
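For reference, the standard relationship between logits and log-probabilities can be checked numerically. This is a minimal sketch with made-up logits, and `log_softmax` is a hypothetical helper written out by hand: each log-probability equals the logit minus one shared constant (the log of the summed exponentials), so logits are log-probabilities only up to an unknown offset.

```python
import math

def log_softmax(logits):
    """Log-probabilities: each logit minus log(sum(exp(logits)))."""
    m = max(logits)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

logits = [5.0, 4.0, 3.0, -2.0]  # toy values, not real BERT output
log_probs = log_softmax(logits)

# The gap between a logit and its log-probability is the same for every entry.
offsets = [z - lp for z, lp in zip(logits, log_probs)]

# Adding a constant to all logits leaves the log-probabilities unchanged.
shifted = log_softmax([z + 10.0 for z in logits])

print(offsets)
print(log_probs)
print(shifted)
```

Because of this shared offset, differences between logits for the same masked position are meaningful, while their absolute values and sums are not directly comparable across positions.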