Since many Thinc layers require a Floats2d as input, I've been struggling to understand how to pass a batch of tokenized text, where each batch has the dimensions [batch_size, max_seq_length, embedding_size]. I've noticed that the Transformer pipe has a mandatory mean pooling step, but that doesn't work for token-level classification like NER. Looking at Tok2Vec, I notice a 3D value is getting returned.
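To make the pooling point concrete, here is a minimal sketch of how a pooling layer like reduce_mean (which I believe is the default pooling in the transformer listener config) collapses per-token vectors into one vector per doc; the shapes below are made up for illustration:

```python
# Sketch: reduce_mean maps a Ragged (per-token vectors for each doc) to a
# Floats2d (one pooled vector per doc), so token-level information is lost.
import numpy
from thinc.api import reduce_mean
from thinc.types import Ragged

pooling = reduce_mean()
# Two "docs" of 5 and 7 tokens, embedding size 96, concatenated row-wise.
data = numpy.random.rand(12, 96).astype("float32")
lengths = numpy.asarray([5, 7], dtype="int32")
pooled = pooling.predict(Ragged(data, lengths))
print(pooled.shape)  # (2, 96): one vector per doc, the token axis is gone
```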
To look into this, I examined how a List[Doc] gets modified when passed through the Tok2Vec pipe in en_core_web_sm. You can access a pipe's underlying Thinc model through the spaCy pipe object. Most trainable spaCy v3 pipes "listen" in on a shared Tok2Vec pipe, so I started investigating how a List[Doc] passed into the Tok2Vec pipe produces the output consumed by the next pipe in the pipeline.
import spacy
import numpy as np

nlp = spacy.load("en_core_web_sm")
model = nlp.get_pipe("tok2vec").model
model  # verify it is a Thinc Model object
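For context, the Thinc model can also be inspected directly (a quick sketch; the exact layer names and dimensions depend on the pipeline version):

```python
# Poke at the tok2vec Thinc model to see what it expects and returns.
print(model.name)                 # the composed model's name
print(model.maybe_get_dim("nO"))  # output width; 96 for en_core_web_sm, I believe
for layer in model.layers:        # sub-layers (roughly: embedding + encoding)
    print(" ", layer.name)
```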
When I pass a list of one Doc, I get a 3D floating-point array:
np.array(model([nlp("This is a sentence.")], is_train=False)[0]).shape
(1, 5, 96)
However, when I pass a list of two Docs, I get an error:
np.array(model([nlp("This is a sentence."), nlp("And this is a second sentence.")], is_train=False)[0]).shape
Produces:
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
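For comparison, here is the raw output without forcing it into a single array (a sketch; I'm assuming the output is a per-Doc list):

```python
# Inspect the output per Doc instead of stacking it with np.array.
out = model(
    [nlp("This is a sentence."), nlp("And this is a second sentence.")],
    is_train=False,
)[0]
print(type(out))               # list-like, one entry per Doc (I assume)
print([v.shape for v in out])  # e.g. [(5, 96), (7, 96)] -- different lengths
```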
When I pass the same sentence twice, however, I get the following expected behavior:
np.array(model([nlp("This is a sentence."), nlp("This is a sentence.")], is_train=False)[0]).shape
(2, 5, 96)
spaCy pipes are supposed to take a List[Doc] as input, so why does this Tok2Vec pipe not accept a List[Doc] when the input sentences/Docs have different lengths? Does Tok2Vec not take care of padding? Thinc has several padding options (like the helper sketched below the component listing). Tok2Vec is the first pipe in the entire spaCy pipeline, so nothing else is doing padding as a preprocessing step.
print(nlp.components)
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7f39b4597280>),
('tagger', <spacy.pipeline.tagger.Tagger at 0x7f39b4597ca0>),
('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7f3a172d83c0>),
('senter', <spacy.pipeline.senter.SentenceRecognizer at 0x7f39b4597ee0>),
('attribute_ruler',
<spacy.pipeline.attributeruler.AttributeRuler at 0x7f39b21e4740>),
('lemmatizer',
<spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7f39b19c9ec0>),
('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7f39b45f4dd0>)]
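For reference, this is the kind of Thinc padding helper I mean (a sketch, assuming two docs of 5 and 7 tokens with width 96):

```python
# Ops.pad takes a list of differently-sized arrays and zero-pads them into one
# batch array; Ops.unflatten/unpad go the other way. Shapes are made up here.
import numpy
from thinc.api import get_current_ops

ops = get_current_ops()
seqs = [
    numpy.random.rand(5, 96).astype("float32"),  # doc 1: 5 tokens
    numpy.random.rand(7, 96).astype("float32"),  # doc 2: 7 tokens
]
padded = ops.pad(seqs)
print(padded.shape)  # (2, 7, 96): zero-padded to the longest sequence
```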
I did the same exercise for the Transformer pipe, using `en_core_web_trf`, and I noticed that the output of the Transformer pipe is a TransformerModelOutput, where TransformerModelOutput.all_outputs is a List[Ragged]. This basically drops the batch_size dimension and sends the data to any downstream Thinc layers one doc at a time, in Ragged form.
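For what it's worth, a Ragged itself seems to be just a 2D data array plus a lengths array, with no padding and no explicit batch axis (a small sketch with made-up shapes):

```python
# A Ragged pairs one concatenated 2D array with per-group lengths -- no
# padding and no separate batch dimension. Shapes here are for illustration.
import numpy
from thinc.api import get_current_ops
from thinc.types import Ragged

ops = get_current_ops()
data = numpy.random.rand(12, 768).astype("float32")  # 5 + 7 rows of vectors
lengths = numpy.asarray([5, 7], dtype="int32")
ragged = Ragged(data, lengths)
print(ragged.data.shape, ragged.lengths)              # (12, 768) [5 7]

# Split back into one 2D array per group/doc.
per_doc = ops.unflatten(ragged.data, ragged.lengths)
print([a.shape for a in per_doc])                     # [(5, 768), (7, 768)]
```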
Summary Of Questions:
- How does Tok2Vec handle padding when Docs are of different lengths?
- Do Thinc layers that require Floats2d (e.g., a Transformer feeding into a Linear layer) essentially drop the batch_size dimension, or do I understand this incorrectly? (See the small sketch below for what I mean.)
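To clarify the second question, here's a small sketch using Thinc's with_array wrapper, which I understand maps a Floats2d layer over per-doc arrays; the dims below are made up:

```python
# Sketch: with_array lets a layer that wants Floats2d (here Linear) run over a
# list of per-doc arrays -- the "batch" is just each doc's token rows, not a
# separate axis.
import numpy
from thinc.api import Linear, with_array

model = with_array(Linear(nO=4, nI=96))
model.initialize()

docs = [
    numpy.random.rand(5, 96).astype("float32"),  # doc 1: 5 tokens
    numpy.random.rand(7, 96).astype("float32"),  # doc 2: 7 tokens
]
Y = model.predict(docs)
print([y.shape for y in Y])  # [(5, 4), (7, 4)]
```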