I would like to use k-fold cross-validation with FastText to validate my dataset.
For reference, here is the script I wrote.
I'm using two .csv files to train and test:
import gzip
from time import time

import fasttext
import numpy as np

train_file = 'train.csv'
test_file = 'test.csv'
The .csv files are then processed with fasttext.train_supervised as follows:
start = time()
model = fasttext.train_supervised(input=train_file,
                                  lr=1.0, epoch=100,
                                  wordNgrams=2,
                                  bucket=200000,
                                  dim=300,
                                  loss='hs')
print("Training time: {}".format(time() - start))
start = time()
with open(test_file, 'r', encoding="utf8") as f:
    test_desc = f.readlines()
listPred = []
listLabel = []
for line in test_desc:
    if line.startswith("__label__Good "):
        desc = line[len("__label__Good "):]
        label = 1
    elif line.startswith("__label__Bad "):
        desc = line[len("__label__Bad "):]
        label = 0
    elif not line.strip():
        # Blank line: nothing to predict, so skip it.
        print("<EMPTY?")
        print(line)
        print(">")
        continue
    else:
        # Line without a recognised label prefix: skip it.
        print("<ERROR reading test")
        print(line)
        print(">")
        continue
    predLabel = model.predict(desc.rstrip("\n\r"))[0][0]
    if predLabel == "__label__Good":
        pred = 1
    elif predLabel == "__label__Bad":
        pred = 0
    else:
        print("ERROR in prediction")
        continue
    listPred.append(pred)
    listLabel.append(label)
print("Testing time: {}".format(time() - start))
final_pred.extend(listPred)
final_test.extend(listLabel)
print(final_test)
print(final_pred)
Then, as output, I create an .eval.gz file so that another script can compute the final evaluation:
output = "descriptions" + ".eval.gz"
with gzip.open(output, "wb") as f:
    np.savetxt(f, (final_test, final_pred), fmt='%i')
This works, but I have no clue how to integrate StratifiedKFold validation into the script I wrote.
I then tried something like this:
X = train_file['description']
Y = train_file['label']
final_pred = list()
final_test = list()
print("training model...")
sk = StratifiedKFold(n_splits=10, random_state=0, shuffle=False)
for folds in sk.split(X, Y):
    model = fasttext.train_supervised(input=train_file,
                                      lr=1.0, epoch=100,
                                      wordNgrams=2,
                                      bucket=200000,
                                      dim=300,
                                      loss='hs')
................
output = "descriptions" + ".eval.gz"
with gzip.open(output, "wb") as f:
    np.savetxt(f, (final_test, final_pred), fmt='%i')
As a result, I get this error:
X = train_file['description']
TypeError: string indices must be integers
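If I read the traceback correctly, train_file is still just the string 'train.csv' at that point, so train_file['description'] indexes a str rather than a table. A minimal sketch of reading the file into two lists instead, assuming the training file uses the same "__label__Good some text" line format as my test file (load_labeled_lines is a hypothetical helper name, not part of fastText):

```python
# Sketch only: load_labeled_lines is a hypothetical helper, not a fastText API.
def load_labeled_lines(path):
    """Split fastText-style lines ("__label__Good some text") into texts and labels."""
    texts, labels = [], []
    with open(path, "r", encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if not line.startswith("__label__"):
                continue  # skip blank lines and lines without a label prefix
            # Everything up to the first space is the label, the rest is the text.
            label, _, text = line.partition(" ")
            texts.append(text)
            labels.append(label)
    return texts, labels


# Assumed usage with the file name from my script:
# X, Y = load_labeled_lines('train.csv')
```

X and Y would then be plain lists of equal length, which is the kind of input a splitter can work with.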
Any idea how to deal with this issue?
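To make the question concrete, this is roughly the shape I think the loop should have. It is only a sketch under assumptions: texts and labels are lists already read from the training file, labels carry the "__label__" prefix, cross_validate and train_fasttext are hypothetical helper names, and each fold's training lines are written to a temporary file because fasttext.train_supervised takes a file path. I used shuffle=True here because, as far as I know, recent scikit-learn versions reject random_state combined with shuffle=False.

```python
import os
import tempfile

from sklearn.model_selection import StratifiedKFold


def train_fasttext(path):
    """Train with the same hyperparameters as in my script (needs the fasttext package)."""
    import fasttext  # imported lazily so the splitting logic can run without it
    return fasttext.train_supervised(input=path, lr=1.0, epoch=100, wordNgrams=2,
                                     bucket=200000, dim=300, loss='hs')


def cross_validate(texts, labels, n_splits=10, train_fn=train_fasttext):
    """Hypothetical helper: returns (true_labels, predicted_labels) over all folds."""
    final_test, final_pred = [], []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(texts, labels):
        # Write this fold's training lines to disk in fastText format.
        with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                         encoding="utf8") as tmp:
            for i in train_idx:
                tmp.write("{} {}\n".format(labels[i], texts[i]))
        try:
            model = train_fn(tmp.name)
        finally:
            os.remove(tmp.name)  # the temporary file is only needed for training
        # Predict on the held-out fold and collect labels and predictions.
        for i in test_idx:
            final_pred.append(model.predict(texts[i])[0][0])
            final_test.append(labels[i])
    return final_test, final_pred
```

The returned lists hold label strings, so they would still need mapping to 1 and 0 before being saved with fmt='%i'.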
Thanks.