Analyzing BERT-models -- Using raw output logits or softmax values?

338 Views Asked by Joan C At 20 June 2025 at 04:33

In the description of BERT's output it says:

Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

I have trouble understanding what this output means. My aim is to compare human response data with BERT predictions in a fill-mask experiment. I use fill-mask with the option topk and group the predicted fillers by linguistic properties (here, number). For BERT, I can either use the raw scores (logits) of the predictions or first normalize them by applying softmax. The two methods yield different correlations. So here are my questions:

1. When looking for correlations between BERT responses and human responses, which BERT output should be used: raw logits or softmax values?
2. Is it OK to add up raw logits to get a group score?
3. If softmax is used, should it be applied to the logit scores of the individual predictions or to the summed group score?
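
To make the second and third questions concrete, here is a minimal sketch in plain Python. It assumes a hypothetical toy vocabulary of five tokens with made-up logits and made-up singular/plural index sets; none of these values come from BERT. It only illustrates one general property: softmax probabilities are unchanged when every logit is shifted by the same constant, whereas a sum of raw logits is not, so a group score built from raw logits depends on an arbitrary offset.

```python
import math

def log_softmax(logits):
    """Return log-probabilities: each logit minus log-sum-exp of all logits."""
    m = max(logits)  # subtract the maximum for numerical stability
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def group_probability(logits, indices):
    """Softmax over the whole vocabulary first, then sum the group's probabilities."""
    logp = log_softmax(logits)
    return sum(math.exp(logp[i]) for i in indices)

# Hypothetical toy data: five vocabulary tokens with invented logits.
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
singular = [0, 2]  # assumed indices of singular fillers
plural = [1, 3]    # assumed indices of plural fillers

p_sing = group_probability(logits, singular)
p_plur = group_probability(logits, plural)

# Shift every logit by the same constant: the probabilities stay the same ...
shifted = [z + 10.0 for z in logits]
p_sing_shifted = group_probability(shifted, singular)

# ... but the sum of raw logits moves by (constant * group size).
raw_sum = sum(logits[i] for i in singular)
raw_sum_shifted = sum(shifted[i] for i in singular)

print(p_sing, p_plur)
print(p_sing_shifted)
print(raw_sum, raw_sum_shifted)
```

In this sketch the softmax is taken over all vocabulary scores before grouping. Applying it only to the top-k scores or to already summed group scores would normalize over a different set, so the resulting values would probably not be comparable to the ones above.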

My main problem is that I do not know how to interpret the raw logit scores. Are they probabilities transformed into log space? If so, why does the quote above emphasize that the scores are "before SoftMax"? Any suggestions on what I should read? I know the architecture of BERT and its training procedure. Thanks!
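
On the interpretation of raw scores, a short derivation may help. It assumes the standard softmax definition, where $z_i$ is the logit of vocabulary token $i$ at the masked position, $V$ is the vocabulary size, and $G$ is a group of tokens (this notation is introduced here and is not taken from the BERT documentation):

```latex
p_i = \frac{\exp(z_i)}{\sum_{j=1}^{V} \exp(z_j)}
\qquad\Longrightarrow\qquad
\log p_i = z_i - \log \sum_{j=1}^{V} \exp(z_j)

P(G) = \sum_{i \in G} p_i
\qquad\text{whereas}\qquad
\sum_{i \in G} z_i = \log \prod_{i \in G} \exp(z_i)
```

Under this reading, a logit equals a log-probability only up to an additive term, $\log \sum_j \exp(z_j)$, which depends on the input sentence. That would explain the "before SoftMax" wording: the scores are unnormalized, so raw logits from different sentences may not be directly comparable. It also suggests that adding logits corresponds to multiplying unnormalized scores rather than to the probability that the filler belongs to the group.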
