Analyzing BERT-models -- Using raw output logits or softmax values?


In the description of BERT's output it says:

Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

I have trouble understanding what this output means. My aim is to compare human response data with BERT predictions in a fill-mask experiment. I use fill-mask with the topk option and group the predicted fillers by a linguistic property (here, grammatical number). For BERT, I can either use the raw scores (logits) of the predictions or first normalize them by applying softmax. The two methods give different correlations. So here are my questions:

  1. When correlating BERT responses with human responses, which BERT output should be used: raw logits or softmax values?
  2. Is it OK to add up raw logits to get a group score?
  3. If softmax is used, should it be applied to the logits of the individual predictions or to the summed group score?
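To make the three options concrete, here is a minimal sketch in plain Python. The logits, the six-token toy vocabulary, and the grouping by number are all made up for illustration; they are not values from BERT or from the experiment.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 6-token toy vocabulary at the [MASK] position.
logits = [4.0, 3.5, 1.0, 0.5, -2.0, -3.0]
# Hypothetical grouping of token indices by grammatical number.
groups = {"singular": [0, 2, 4], "plural": [1, 3, 5]}

probs = softmax(logits)

# Option A: sum the raw logits within each group.
logit_sums = {g: sum(logits[i] for i in idx) for g, idx in groups.items()}

# Option B: softmax over all tokens first, then sum probabilities per group.
prob_sums = {g: sum(probs[i] for i in idx) for g, idx in groups.items()}

# Option C: sum the logits per group, then apply softmax to the group sums.
softmax_of_sums = dict(zip(logit_sums, softmax(list(logit_sums.values()))))

print("A (summed logits):      ", logit_sums)
print("B (summed probabilities):", prob_sums)
print("C (softmax of sums):     ", softmax_of_sums)
```

In this toy case options B and C give clearly different group scores, which is one way the choice of method could change a correlation.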

My main problem is that I do not know how to interpret the raw logit scores. Are they probabilities transformed into log space? If so, why does the quote above emphasize that the scores are "before SoftMax"? Any suggestions for what I should read? I am familiar with BERT's architecture and training procedure. Thanks!
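For reference, the standard softmax definition relates logits to probabilities as follows. The symbols $z_i$ (logit of token $i$), $p_i$ (its probability), and $V$ (vocabulary size) are notation introduced here, not taken from the BERT documentation.

```latex
p_i = \frac{\exp(z_i)}{\sum_{j=1}^{V} \exp(z_j)},
\qquad
\log p_i = z_i - \log \sum_{j=1}^{V} \exp(z_j)
```

Under this definition, a logit differs from the corresponding log-probability by a normalization term that is the same for every token at one masked position but can change from one position or sentence to the next.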
