How to relate the language model score of a whole sentence to those of the sentence's constituents


I trained a KenLM language model on around 5,000 English sentences/paragraphs. I want to query this ARPA model with two or more segments and see whether they can be concatenated to form a longer, hopefully more "grammatical," sentence. Below is the Python code I used to get the log scores (and the corresponding power of ten, i.e. the probability) of the segments and of the "sentence." I give two examples. Obviously, the sentence in the first example is more grammatical than the one in the second. However, my question is not about this, but about how to relate the language model score of a whole sentence to the scores of its constituents, that is, how to tell whether the sentence is grammatically better than its constituents.

import math
import kenlm as kl

model = kl.LanguageModel(r'D:\seg.arpa.bin')

def report(sentence):
    # Print the sentence, its log10 score, and the corresponding probability.
    log_score = model.score(sentence)
    print(sentence)
    print(log_score)
    print(math.pow(10, log_score))

print('************')
# Example 1: two segments, then their concatenation.
report('Mr . Yamada was elected Chairperson of')
report('the Drafting Committee by acclamation .')
report('Mr . Yamada was elected Chairperson of the Drafting Committee by acclamation .')
print('-------------')
# Example 2: two segments, then their concatenation.
report('Cases cited in the present volume ix')
report('Multilateral instruments cited in the present volume xiii')
report('Cases cited in the present volume ix Multilateral instruments cited in the present volume xiii')
Output:

************
Mr . Yamada was elected Chairperson of
-34.0706558228
8.49853715087e-35
the Drafting Committee by acclamation .
-28.3745193481
4.22163470933e-29
Mr . Yamada was elected Chairperson of the Drafting Committee by acclamation .
-55.5128440857
3.07012398337e-56
-------------
Cases cited in the present volume ix
-27.7353248596
1.83939558773e-28
Multilateral instruments cited in the present volume xiii
-34.4523620605
3.52888852435e-35
Cases cited in the present volume ix Multilateral instruments cited in the present volume xiii
-60.7075233459
1.9609957573e-61
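
One rough way to relate the whole-sentence score to the segment scores is to compare it with the sum of the segment log scores, which is what you would get if the two segments were scored independently. The sketch below does this with the log10 values printed above. The names `examples` and `concatenation_gain` are made up for illustration, and the "gain" is only a heuristic, not something KenLM defines. Also note that, as far as I know, `model.score()` adds `<s>` and `</s>` by default (`bos=True, eos=True`), so each segment score includes boundary tokens that do not occur in the middle of the concatenated sentence. Part of the gain comes from that.

```python
# log10 scores copied from the output above:
# (first segment, second segment, concatenated sentence)
examples = {
    'Yamada': (-34.0706558228, -28.3745193481, -55.5128440857),
    'Cases': (-27.7353248596, -34.4523620605, -60.7075233459),
}


def concatenation_gain(log_a, log_b, log_whole):
    """Return log10 P(whole) - (log10 P(a) + log10 P(b)).

    Hypothetical helper: a larger value means the model prefers the
    concatenation more strongly over the two separately scored segments.
    """
    return log_whole - (log_a + log_b)


gains = {name: concatenation_gain(*scores) for name, scores in examples.items()}
for name, gain in gains.items():
    print(name, round(gain, 2))
```

With these numbers the first pair gains about 6.93 log10 units and the second about 1.48, which agrees with the intuition that the first concatenation is the more natural one.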

There is 1 solution below


Use

list(model.full_scores(sent))

to get the details of the sentence's constituents, i.e. the words.

This returns a list, which you can iterate over to access the details for each word. Each list item contains the log-probability, the n-gram length, and whether the word is OOV (out-of-vocabulary).
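
As a minimal sketch of iterating over that list: the helper names `per_word_details` and `total_log_score` are hypothetical, and the code assumes that, with the default `bos=True, eos=True`, `full_scores` yields one entry per whitespace-separated word plus a final entry for `</s>`.

```python
def per_word_details(model, sentence):
    """Pair each token with its (log10 prob, n-gram length, OOV flag).

    `model` is expected to be a kenlm.LanguageModel (or anything with a
    compatible full_scores method).
    """
    # Assumption: one entry per word, then one extra entry for </s>.
    tokens = sentence.split() + ['</s>']
    return [
        (token, log_prob, ngram_length, oov)
        for token, (log_prob, ngram_length, oov)
        in zip(tokens, model.full_scores(sentence))
    ]


def total_log_score(details):
    """Sum the per-word log10 probabilities of a per_word_details result."""
    return sum(log_prob for _, log_prob, _, _ in details)


# Usage with a real model (path taken from the question):
#   import kenlm
#   model = kenlm.LanguageModel(r'D:\seg.arpa.bin')
#   details = per_word_details(model, 'Mr . Yamada was elected Chairperson of')
#   for row in details:
#       print(row)
#   print(total_log_score(details))  # should be close to model.score(...)
```

Looking at the per-word rows around the point where two segments are joined shows which words the model finds likely or unlikely there, and whether a low score is simply due to OOV words.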