I have a Pandas DataFrame loaded from an Excel file. It contains text data, and I need to calculate the BLEU score row by row.
import evaluate
import pandas as pd
sacrebleu = evaluate.load("sacrebleu")
testset = pd.read_excel(xlsx_filename)
# Select the rows where all three columns are non-null
valid_rows = testset['col1'].notna() & testset['col2'].notna() & testset['col3'].notna()
for i in range(len(testset)):  # or... for i in range(len(testset.loc[valid_rows, 'col2']))
    score = sacrebleu.compute(predictions=[testset.loc[valid_rows, 'col1'][i], testset.loc[valid_rows, 'col2'][i]], references=[testset.loc[valid_rows, 'col3'][i]])
It raises KeyError: 139.
The lengths of valid_rows and testset are both 13700, while the length of testset.loc[valid_rows, 'col2'] is 12208.
I know that looping over rows with a for-loop is an anti-pattern, but how can I fit a Series into the sacrebleu.compute() function? It accepts only [string, string], string as input.
How can I solve this problem?
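For illustration, here is a minimal sketch of what seems to be going on, under a few assumptions. Filtering with a boolean mask keeps the original index labels, so the filtered Series has gaps (presumably label 139 is one of the dropped rows), and `series[i]` looks up the label `i` rather than the i-th position. The toy DataFrame and the `score_row` helper below are hypothetical stand-ins for the Excel data and the real `sacrebleu.compute(...)` call; only the indexing pattern is the point.

```python
import pandas as pd

# Toy stand-in for the Excel data (assumption); row 1 has a missing value.
testset = pd.DataFrame({
    "col1": ["the cat sat", None, "a dog ran"],
    "col2": ["the cat sits", "hello", "a dog runs"],
    "col3": ["the cat sat down", "hello there", "a dog ran away"],
})

def score_row(col1, col2, col3):
    # Hypothetical placeholder: replace the body with the real
    # sacrebleu.compute(...) call. Here it just counts shared words.
    ref = set(col3.split())
    return len(set(col1.split()) & ref) + len(set(col2.split()) & ref)

# Keep only complete rows. The surviving rows keep their original labels
# (0 and 2 here), so valid["col1"][1] would raise a KeyError.
valid = testset.dropna(subset=["col1", "col2", "col3"])

# Iterate over the values themselves instead of over range(len(...)),
# so no label lookup is involved.
scores = [
    score_row(c1, c2, c3)
    for c1, c2, c3 in zip(valid["col1"], valid["col2"], valid["col3"])
]

# Attach the scores to the original frame, aligned on the index;
# rows that were skipped end up as NaN.
testset["score"] = pd.Series(scores, index=valid.index)
```

If positional access is preferred, calling `.reset_index(drop=True)` on the filtered frame, or using `.iloc[i]` instead of `[i]`, should avoid the label mismatch as well.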