How to use StratifiedKFold with fastText?


I would like to use stratified k-fold cross-validation with fastText to validate a classifier on my dataset.

For reference, here is the script I wrote.

I'm using two .csv files to train and test:

train_file = 'train.csv'
test_file = 'test.csv'

The training file is then passed to fasttext.train_supervised, and the test file is read line by line for prediction:

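import gzip
from time import time

import fasttext
import numpy as np

# Time the training step
start = time()
model = fasttext.train_supervised(input=train_file,
                                  lr=1.0, epoch=100,
                                  wordNgrams=2,
                                  bucket=200000,
                                  dim=300,
                                  loss='hs')
print("Training time: {}".format(time() - start))

# Time the testing step
start = time()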
with open(test_file, 'r',encoding="utf8") as f:
    test_desc = f.readlines()

listPred = []
listLabel = []
for line in test_desc:
    if line.startswith("__label__Good "):
        desc = line[len("__label__Good "):]
        label = 1
    elif line.startswith("__label__Bad "):
        desc = line[len("__label__Bad "):]
        label = 0
    elif line.strip():
        # Non-empty line without a known label prefix: report and skip it
        print("<ERROR reading test")
        print(line)
        print(">")
        continue
    else:
        # Blank line: report and skip it
        print("<EMPTY?")
        print(line)
        print(">")
        continue

    predLabel = model.predict(desc.rstrip("\n\r"))[0][0]
    if predLabel == "__label__Good":
        pred = 1
    elif predLabel == "__label__Bad":
        pred = 0
    else:
        # Unknown predicted label: skip so no stale value is appended
        print("ERROR in prediction")
        continue

    listPred.append(pred)
    listLabel.append(label)

print("Testing time: {}".format(time() - start))

final_pred = []
final_test = []
final_pred.extend(listPred)
final_test.extend(listLabel)

print(final_test)
print(final_pred)

As output, I then create an .eval.gz file, which another script uses for the final evaluation:

output = "descriptions" + ".eval.gz"
with gzip.open(output, "wb") as f:
    np.savetxt(f, (final_test, final_pred), fmt='%i')
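The evaluation script itself is not shown here. As a minimal sketch of how such a file could be read back, assuming it holds two rows (true labels first, predictions second) exactly as written by np.savetxt above, something like the following might work. The function name accuracy_from_eval is hypothetical, and accuracy is only one of the metrics the other script might compute.

```python
import numpy as np


def accuracy_from_eval(path):
    """Return the fraction of predictions that match the true labels.

    Assumes the file has two rows: true labels, then predicted labels.
    """
    # np.loadtxt decompresses .gz files transparently based on the extension.
    # ndmin=2 keeps the (2, N) shape even when the file holds a single sample.
    data = np.loadtxt(path, dtype=int, ndmin=2)
    true_labels, predictions = data[0], data[1]
    return float(np.mean(true_labels == predictions))
```

Calling accuracy_from_eval("descriptions.eval.gz") would then return a value between 0 and 1.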

This works; however, I have no idea how StratifiedKFold validation could be integrated into the script.

I then tried something like this:


X = train_file['description']
Y = train_file['label']

final_pred = list()
final_test = list()

print("training model...")

sk = StratifiedKFold(n_splits=10, random_state=0, shuffle=False)

for folds in sk.split(X,Y):
    model = fasttext.train_supervised(input=train_file,
                                    lr=1.0, epoch=100,
                                    wordNgrams=2, 
                                    bucket=200000, 
                                    dim=300, 
                                    loss='hs')

            ................

output = "descriptions" + ".eval.gz"
with gzip.open(output, "wb") as f:
    np.savetxt(f, (final_test, final_pred), fmt='%i')

As a result, I get this error:

X = train_file['description']
TypeError: string indices must be integers

Any idea how to deal with this issue?

Thanks.
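One possible direction, offered as a sketch rather than a tested answer: the TypeError arises because train_file is the string 'train.csv', so train_file['description'] indexes a string, not a table. Since fasttext.train_supervised reads its input from a file path, each fold's training lines can be written to a temporary file and the held-out lines predicted one by one. The sketch assumes the data file is in fastText format, with every line starting with __label__Good or __label__Bad as the test loop above implies. The helper names read_labelled_lines, train_fasttext and cross_validate are hypothetical. shuffle=True is used because recent scikit-learn versions raise an error when random_state is set while shuffle=False.

```python
import os
import tempfile

from sklearn.model_selection import StratifiedKFold


def read_labelled_lines(path):
    """Read a fastText-format file into parallel lists of lines and labels."""
    lines, labels = [], []
    with open(path, "r", encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # The label is the first whitespace-separated token of the line
            labels.append(line.partition(" ")[0])
            lines.append(line)
    return lines, labels


def train_fasttext(train_path):
    """Train with the hyperparameters from the question."""
    import fasttext  # imported lazily so the rest of the sketch runs without it
    return fasttext.train_supervised(input=train_path, lr=1.0, epoch=100,
                                     wordNgrams=2, bucket=200000,
                                     dim=300, loss='hs')


def cross_validate(path, n_splits=10, train_fn=train_fasttext):
    """Run stratified k-fold CV and return (true labels, predicted labels)."""
    lines, labels = read_labelled_lines(path)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    final_test, final_pred = [], []
    for train_idx, test_idx in skf.split(lines, labels):
        # fastText trains from a file, so write this fold's training lines out
        with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                         encoding="utf8") as tmp:
            tmp.writelines(lines[i] + "\n" for i in train_idx)
        try:
            model = train_fn(tmp.name)
        finally:
            os.remove(tmp.name)
        # Predict every held-out line of this fold
        for i in test_idx:
            label, _, text = lines[i].partition(" ")
            pred = model.predict(text)[0][0]
            final_test.append(1 if label == "__label__Good" else 0)
            final_pred.append(1 if pred == "__label__Good" else 0)
    return final_test, final_pred
```

With this in place, final_test, final_pred = cross_validate('train.csv') would yield the two lists that the np.savetxt call above writes to the .eval.gz file. Every sample is predicted exactly once, by a model that never saw it during training.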
