integrate `openai-whisper` tokenizer with spaCy


I am trying to use the tokenizer from openai-whisper as the tokenizer for spaCy.

I'm doing the following, but it gives errors. What is the correct way to use the whisper tokenizer as a custom tokenizer in spaCy?

import spacy
import en_core_web_sm
from whisper.tokenizer import Tokenizer, get_tokenizer

nlp = en_core_web_sm.load()
# this assigns the Tokenizer class itself, not an instance, and a whisper
# Tokenizer is not a callable that returns a spacy.tokens.Doc, which is
# what spaCy expects nlp.tokenizer to be
nlp.tokenizer = Tokenizer
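
From the spaCy docs I understand a custom tokenizer has to be a callable that takes a string and returns a spacy.tokens.Doc, so I suspect the whisper tokenizer needs a wrapper roughly like the sketch below. This is untested on my side: WhisperSpacyTokenizer is just a name I made up, and I'm assuming get_tokenizer(multilingual=False) matches the model I transcribe with.

import en_core_web_sm
from spacy.tokens import Doc
from whisper.tokenizer import get_tokenizer

class WhisperSpacyTokenizer:
    def __init__(self, vocab, whisper_tokenizer):
        self.vocab = vocab
        self.wt = whisper_tokenizer

    def __call__(self, text):
        # decode each BPE id on its own so the piece boundaries survive
        pieces = [self.wt.decode([i]) for i in self.wt.encode(text)]
        words, spaces = [], []
        for piece in pieces:
            if piece.startswith(" ") and len(piece) > 1:
                if words:
                    # record the space on the previous token
                    spaces[-1] = True
                else:
                    # a space at the very start of the text becomes its
                    # own whitespace token, as in spaCy's own tokenizer
                    words.append(" ")
                    spaces.append(False)
                words.append(piece[1:])
                spaces.append(False)
            else:
                words.append(piece)
                spaces.append(False)
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = en_core_web_sm.load()
nlp.tokenizer = WhisperSpacyTokenizer(nlp.vocab, get_tokenizer(multilingual=False))

One thing I'm unsure about: BPE pieces don't always line up with words (e.g. "Toby" could be split across pieces), and en_core_web_sm was trained on spaCy's own tokenization, so I'd expect the NER to degrade on sub-word tokens. The sketch also only handles single leading spaces, not tabs or runs of spaces.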

Beyond that error, there is a whitespace problem. When openai-whisper tokenizes " Toby makes fun", it treats the leading space as part of the first token. spaCy's tokenizer also keeps the leading space (as a whitespace token), but entity spans discard surrounding whitespace, so the entity text comes out as "Toby" when I need it to be " Toby".
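
For example, this is roughly what I see on the whisper side (assuming get_tokenizer(multilingual=False); the exact BPE split may differ by model):

from whisper.tokenizer import get_tokenizer

wt = get_tokenizer(multilingual=False)
ids = wt.encode(" Toby makes fun")
# the leading space is folded into the first BPE piece, so decoding
# the first id on its own gives a string that starts with " "
print([wt.decode([i]) for i in ids])

And on the spaCy side: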

nlp = en_core_web_sm.load()
ref_doc = nlp(" Toby makes fun")
print([(X.text, X.label_) for X in ref_doc.ents])
# ref_doc.text is " Toby makes fun",
# but ref_doc.ents[0].text is "Toby"
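
The closest workaround I have found is to widen the entity span myself so it pulls in a preceding whitespace token (a sketch, not necessarily the intended way):

for ent in ref_doc.ents:
    start = ent.start
    # include the whitespace token just before the entity, if any
    if start > 0 and ref_doc[start - 1].is_space:
        start -= 1
    expanded = ref_doc[start:ent.end]
    print(repr(expanded.text), ent.label_)  # expect something like (' Toby', 'PERSON')

Is there a cleaner way to do this, or a correct way to plug the whisper tokenizer into spaCy directly?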