Retraining an ML.NET model does not work properly

I have created a project that classifies text entries by category. For this I created a model and trained it on a small dataset (~1000 entries).

Now I want to train the model further with new entries.

This is how the model is created:

    using Microsoft.ML;

    var mlContext = new MLContext();
    var dataSet = mlContext.Data.LoadFromTextFile<Cx_Kategorie_Model.ModelInput>(@"pathToDataset.csv", separatorChar: ',', hasHeader: true);
    
    IEstimator<ITransformer> dataPrepPipeline = mlContext.Transforms.Text.FeaturizeText(@"Eintrag", @"Eintrag")
        .Append(mlContext.Transforms.Concatenate(@"Features", @"Eintrag"))
        .Append(mlContext.Transforms.Conversion.MapValueToKey(@"iKategorie", @"iKategorie"))
        .Append(mlContext.Transforms.NormalizeMinMax(@"Features", @"Features"));

    var trainingPipeline =
        (mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy(labelColumnName: @"iKategorie",
            featureColumnName: @"Features"));
    var postprocessPipeline = (mlContext.Transforms.Conversion.MapKeyToValue(@"PredictedLabel", @"PredictedLabel"));
    var pipeline = dataPrepPipeline.Append(trainingPipeline).Append(postprocessPipeline);
    var model = pipeline.Fit(dataSet);
    mlContext.Model.Save(model,dataSet.Schema,"modelPath.zip");

This is the code that retrains the existing model:

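    using System.Collections.Generic;
    using System.Linq;
    using Microsoft.ML;
    using Microsoft.ML.Trainers;

    var mlContext = new MLContext();
    DataViewSchema inputSchema;
    // The loaded model is treated as a chain of fitted transformers: preprocessing, trainer, postprocessing
    IEnumerable<ITransformer> trainedModel = mlContext.Model.Load("modelPath.zip", out inputSchema) as IEnumerable<ITransformer>;
    var preprocessingPipeline = trainedModel.ElementAt(0);
    var trainPipeline = trainedModel.ElementAt(1);
    var postProcessingPipeline = trainedModel.ElementAt(2);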
    
    var trainDataView = mlContext.Data.LoadFromTextFile<Cx_Kategorie_Model.ModelInput>(@"pathToNewData.csv",separatorChar:',',hasHeader:true);
    
    // Extract the model parameters from the trained model
    ISingleFeaturePredictionTransformer<MaximumEntropyModelParameters> predictionTransformerMulti = trainPipeline as ISingleFeaturePredictionTransformer<MaximumEntropyModelParameters>;
    MaximumEntropyModelParameters modelParameters = predictionTransformerMulti.Model;
    
    var preprocessedData = preprocessingPipeline.Transform(trainDataView);
    var retrained = mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy(
        labelColumnName: @"iKategorie", featureColumnName: @"Features").Fit(preprocessedData,modelParameters);
    var model = preprocessingPipeline.Append(retrained).Append(postProcessingPipeline);
    mlContext.Model.Save(model,trainDataView.Schema,"modelPath.zip");

This changes the biases slightly, but NumberOfFeatures, which I understand to be the number of distinct words (i.e. features), remains exactly the same, although the new dataset contains new words and completely new text. When I then apply the retrained model, it behaves as if the data from "pathToNewData.csv" were the only data it had ever been trained on.

Is my training code wrong or is it not possible that way?
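
A possible reading of the unchanged NumberOfFeatures, sketched with a toy featurizer in plain C#. This is not ML.NET code: `FitVocabulary`, `Transform`, `Tokenize` and the sample strings are hypothetical names and data used only for illustration. The assumption is that a featurizer which has already been fitted keeps the vocabulary it learned, so running new text through it cannot add feature slots; only fitting again can.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy stand-in for a text featurizer. NOT ML.NET; all names are hypothetical.
var originalTexts = new[] { "invoice for office chair", "train ticket to berlin" };
var newTexts = new[] { "hotel booking in hamburg", "invoice for hotel" };

// Vocabulary learned once from the original data (the "fitted" state).
var fittedVocabulary = FitVocabulary(originalTexts);

// Reusing the fitted vocabulary on new text: the feature count cannot change,
// and words that were never seen during fitting contribute nothing.
var reused = Transform(fittedVocabulary, newTexts[0]);
Console.WriteLine($"Reused:   {reused.Length} features, {reused.Sum()} words counted");

// Fitting again on old + new text: the vocabulary (and the feature count) grows.
var refittedVocabulary = FitVocabulary(originalTexts.Concat(newTexts));
var refitted = Transform(refittedVocabulary, newTexts[0]);
Console.WriteLine($"Refitted: {refitted.Length} features, {refitted.Sum()} words counted");

// "Fit": assign each distinct word a feature slot, in order of first appearance.
static Dictionary<string, int> FitVocabulary(IEnumerable<string> texts)
{
    var vocabulary = new Dictionary<string, int>();
    foreach (var text in texts)
        foreach (var word in Tokenize(text))
            if (!vocabulary.ContainsKey(word))
                vocabulary.Add(word, vocabulary.Count);
    return vocabulary;
}

// "Transform": count words using a vocabulary that is already fixed.
// Words without a slot are silently dropped.
static float[] Transform(Dictionary<string, int> vocabulary, string text)
{
    var features = new float[vocabulary.Count];
    foreach (var word in Tokenize(text))
        if (vocabulary.TryGetValue(word, out var slot))
            features[slot] += 1f;
    return features;
}

// Lower-case and split on spaces; deliberately simplistic.
static IEnumerable<string> Tokenize(string text) =>
    text.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);
```

With these sample strings the reused vocabulary keeps its 8 slots, while the refitted one has 12. If the same holds for the fitted preprocessing transformer loaded from "modelPath.zip", picking up new words would presumably require fitting the featurizing estimator again on the old and new data together, and the feature vector would then no longer match the size of the previously trained parameters.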
