spaCy 3.0 - Fine-tuning only the NER component while keeping the rest intact

I have some training data for a new set of NER labels that are not currently covered by spaCy's default NER model. I have prepared a training_data.spacy file, which exclusively contains annotated examples with the new labels. I am able to train a blank model from scratch following the instructions in the spaCy training docs - basically using the quickstart widget to create a basic_config.cfg and then filling it in to create a config.cfg.
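The filling step is, I believe, the init fill-config command, with the file names above:

python -m spacy init fill-config basic_config.cfg config.cfg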

However, I am not sure how to fine-tune only the NER component of an existing model while keeping all the other components intact. Basically, I would like to freeze all the other components during training. I tried something like the following:

import spacy

spacy.require_gpu()
nlp = spacy.load('en_core_web_sm')

# freeze every component except ner
frozen_components = [name for name in nlp.component_names if name not in ['ner']]
max_steps = 20000
eval_frequency = 200
patience = 1600

# start from the loaded pipeline's config and override the training settings
config = nlp.config
config['training']['max_steps'] = max_steps
config['training']['patience'] = patience
config['training']['eval_frequency'] = eval_frequency
config['training']['frozen_components'] = frozen_components
config['training']['annotating_components'] = nlp.component_names

with open('./ner_config.cfg', 'w') as f:
    f.write(config.to_str())

After this, I run:

python -m spacy train ner_config.cfg --output ./output/$(date +%s) --paths.train ./training_data.spacy --paths.dev ./training_data.spacy --gpu-id 0

I get the following error:

✔ Created output directory: output/1647965025
ℹ Saving to output directory: output/1647965025
ℹ Using GPU: 0

=========================== Initializing pipeline ===========================
[2022-03-22 21:33:47,498] [INFO] Set up nlp object from config
[2022-03-22 21:33:47,511] [INFO] Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
[2022-03-22 21:33:47,571] [INFO] Added vocab lookups: lexeme_norm
[2022-03-22 21:33:47,571] [INFO] Created vocabulary
[2022-03-22 21:33:47,572] [INFO] Finished initializing nlp object
[2022-03-22 21:34:04,376] [INFO] Initialized pipeline components: ['ner']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler',
'lemmatizer', 'ner']
ℹ Frozen components: ['tok2vec', 'tagger', 'parser', 'senter',
'attribute_ruler', 'lemmatizer']
ℹ Set annotations on update for: ['tok2vec', 'tagger', 'parser',
'senter', 'attribute_ruler', 'lemmatizer', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS NER  TAG_ACC  DEP_UAS  DEP_LAS  SENTS_F  LEMMA_ACC  ENTS_F  ENTS_P  ENTS_R  SPEED   SCORE 
---  ------  --------  -------  -------  -------  -------  ---------  ------  ------  ------  ------  ------
⚠ Aborting and saving the final best model. Encountered exception:
KeyError("Parameter 'E' for model 'hashembed' has not been allocated yet.")
...
    vectors = cast(Floats2d, model.get_param("E"))
  File "/home/abhinav/miniconda3/envs/spacy/lib/python3.8/site-packages/thinc/model.py", line 216, in get_param
    raise KeyError(
KeyError: "Parameter 'E' for model 'hashembed' has not been allocated yet."

What am I missing?

Thanks!

1 Answer

There is a demo project that shows how to do this:

https://github.com/explosion/projects/tree/v3/pipelines/ner_demo_update

The key point is that, in your config, you need to source the components from en_core_web_sm instead of defining them from scratch. Your generated config only describes the component architectures, so spacy train builds fresh, uninitialized models for them; since the frozen components are never initialized (note the log line "Initialized pipeline components: ['ner']"), their hashembed parameters are still unallocated when they run, which is what the KeyError is telling you. You also don't need any annotating components in this scenario.
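Concretely, the relevant parts of the generated config end up looking roughly like this (an illustrative sketch rather than the complete file; the frozen list matches the log output above, and the other frozen components get the same source line as tagger):

[components.ner]
source = "en_core_web_sm"
replace_listeners = ["model.tok2vec"]

[components.tagger]
source = "en_core_web_sm"

[training]
frozen_components = ["tok2vec","tagger","parser","senter","attribute_ruler","lemmatizer"]
annotating_components = []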

The generic version looks like this (copied from a script in the project above):

import spacy
from pathlib import Path

def create_config(model_name: str, component_to_update: str, output_path: Path):
    nlp = spacy.load(model_name)

    # create a new config as a copy of the loaded pipeline's config
    config = nlp.config.copy()

    # revert most training settings to the current defaults
    default_config = spacy.blank(nlp.lang).config
    config["corpora"] = default_config["corpora"]
    config["training"]["logger"] = default_config["training"]["logger"]

    # copy tokenizer and vocab settings from the base model, which includes
    # lookups (lexeme_norm) and vectors, so they don't need to be copied or
    # initialized separately
    config["initialize"]["before_init"] = {
        "@callbacks": "spacy.copy_from_base_model.v1",
        "tokenizer": model_name,
        "vocab": model_name,
    }
    config["initialize"]["lookups"] = None
    config["initialize"]["vectors"] = None

    # source all components from the loaded pipeline and freeze all except the
    # component to update; replace the listener for the component that is
    # being updated so that it can be updated independently
    config["training"]["frozen_components"] = []
    for pipe_name in nlp.component_names:
        if pipe_name != component_to_update:
            config["components"][pipe_name] = {"source": model_name}
            config["training"]["frozen_components"].append(pipe_name)
        else:
            config["components"][pipe_name] = {
                "source": model_name,
                "replace_listeners": ["model.tok2vec"],
            }

    # save the config
    config.to_disk(output_path)
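
To tie it together, a minimal usage sketch (the model, component, and file names are just the ones from the question):

from pathlib import Path

# write a config that sources every component from en_core_web_sm,
# freezes everything except ner, and replaces ner's tok2vec listener
create_config("en_core_web_sm", "ner", Path("ner_config.cfg"))

Then train exactly as before:

python -m spacy train ner_config.cfg --output ./output/$(date +%s) --paths.train ./training_data.spacy --paths.dev ./training_data.spacy --gpu-id 0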