I have some training data for a new set of NER labels that are not covered by spaCy's default NER model. I have prepared a training_data.spacy file, which exclusively contains annotated examples with the new labels. I am able to train a blank model from scratch following the instructions listed here - basically using the GUI tool to create a basic_config.cfg and then filling it in to create a config.cfg.
However, I am not sure how to fine-tune the NER component of an existing model while keeping all the other components intact. Basically, I would like to freeze all the other components during training. I tried to do something like the following:
import spacy
spacy.require_gpu()
nlp = spacy.load('en_core_web_sm')
frozen_components = [name for name in nlp.component_names if name not in ['ner']]
max_steps = 20000
eval_frequency = 200
patience = 1600
config = nlp.config
config['training']['max_steps'] = max_steps
config['training']['patience'] = patience
config['training']['eval_frequency'] = eval_frequency
config['training']['frozen_components'] = frozen_components
config['training']['annotating_components'] = nlp.component_names
with open('./ner_config.cfg', 'w') as f:
    f.write(config.to_str())
After this, I run
python -m spacy train ner_config.cfg --output ./output/$(date +%s) --paths.train ./training_data.spacy --paths.dev ./training_data.spacy --gpu-id 0
I get the following error:
✔ Created output directory: output/1647965025
ℹ Saving to output directory: output/1647965025
ℹ Using GPU: 0
=========================== Initializing pipeline ===========================
[2022-03-22 21:33:47,498] [INFO] Set up nlp object from config
[2022-03-22 21:33:47,511] [INFO] Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
[2022-03-22 21:33:47,571] [INFO] Added vocab lookups: lexeme_norm
[2022-03-22 21:33:47,571] [INFO] Created vocabulary
[2022-03-22 21:33:47,572] [INFO] Finished initializing nlp object
[2022-03-22 21:34:04,376] [INFO] Initialized pipeline components: ['ner']
✔ Initialized pipeline
============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler',
'lemmatizer', 'ner']
ℹ Frozen components: ['tok2vec', 'tagger', 'parser', 'senter',
'attribute_ruler', 'lemmatizer']
ℹ Set annotations on update for: ['tok2vec', 'tagger', 'parser',
'senter', 'attribute_ruler', 'lemmatizer', 'ner']
ℹ Initial learn rate: 0.001
E # LOSS NER TAG_ACC DEP_UAS DEP_LAS SENTS_F LEMMA_ACC ENTS_F ENTS_P ENTS_R SPEED SCORE
--- ------ -------- ------- ------- ------- ------- --------- ------ ------ ------ ------ ------
⚠ Aborting and saving the final best model. Encountered exception:
KeyError("Parameter 'E' for model 'hashembed' has not been allocated yet.")
...
vectors = cast(Floats2d, model.get_param("E"))
File "/home/abhinav/miniconda3/envs/spacy/lib/python3.8/site-packages/thinc/model.py", line 216, in get_param
raise KeyError(
KeyError: "Parameter 'E' for model 'hashembed' has not been allocated yet."
What am I missing?
Thanks!
There is a demo project that shows how to do this:
https://github.com/explosion/projects/tree/v3/pipelines/ner_demo_update
The key point is that you need to source the components from en_core_web_sm in your config instead. You also don't need any components as annotating components in this scenario. The generic version looks like this (copied from a script in the project above):
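In config terms, the result of that approach can be sketched as follows: every component is sourced from the installed pipeline, everything except ner is frozen, and ner's tok2vec listener is replaced so it can be updated independently of the frozen shared tok2vec. The fragment below assumes en_core_web_sm and an ner-only update; it is illustrative, not a verbatim copy of the project's config:

```ini
[components.tok2vec]
source = "en_core_web_sm"

[components.tagger]
source = "en_core_web_sm"

[components.parser]
source = "en_core_web_sm"

[components.attribute_ruler]
source = "en_core_web_sm"

[components.lemmatizer]
source = "en_core_web_sm"

[components.ner]
source = "en_core_web_sm"
# give ner its own copy of the tok2vec layer so it can be updated
# while the shared tok2vec stays frozen
replace_listeners = ["model.tok2vec"]

[training]
frozen_components = ["tok2vec","tagger","parser","attribute_ruler","lemmatizer"]
annotating_components = []
```

With a config along these lines, spacy train only updates ner, and the sourced, frozen components are carried over into the output pipeline unchanged.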