In the Word2Vec skip-gram setup that follows, what is the data setup for the output layer? Is it a matrix that is zero everywhere except for a single "1" in each of the C rows, with each row representing one of the C context words?
Edit, to clarify the data setup question:
By this I mean: what would the dataset presented to the NN look like? Let's frame this as "what does a single training example look like?". I assume the total input is a matrix in which each row is a one-hot encoded word: there is one column per vocabulary word, and every cell is zero except the one for the specific word. Thus, a single training example is 1xV as shown below (all zeros except for the specific word, whose value is 1). This aligns with the picture above in that the input is V-dim. I expected, however, that the total input matrix would have duplicated rows, with the same one-hot encoded vector repeated each time the word occurs in the corpus (since the output, or target, would differ each time).
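To make the input side concrete, here is a minimal sketch in plain Python of the layout described above. The toy sentence and all names (`corpus`, `word_to_index`, `one_hot`, `inputs`) are assumptions chosen for illustration, not part of any word2vec implementation.

```python
# Toy corpus; the sentence and variable names are illustrative only.
corpus = "the quick brown fox jumps over the lazy dog".split()

# Vocabulary: one index per distinct word (V = 8 here, since "the" occurs twice).
vocab = sorted(set(corpus))
word_to_index = {word: i for i, word in enumerate(vocab)}
V = len(vocab)


def one_hot(index, size):
    """Return a length-`size` list that is 0 everywhere except a 1 at `index`."""
    row = [0] * size
    row[index] = 1
    return row


# One input row per token occurrence, so a repeated word gives a repeated row.
inputs = [one_hot(word_to_index[word], V) for word in corpus]
```

In this sketch `inputs` has one 1xV row per token, and the two occurrences of "the" produce identical rows, which is the duplication expected above.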
The output (target) is more confusing to me. I expected it to mirror the input exactly: a single training example would have a "multi"-hot encoded vector that is zero except for a "1" in C of the cells, denoting that a particular word was in the context of the input word (C = 5 if we look, for example, 2 words behind and 3 words ahead of the given input word instance). The picture doesn't seem to agree with this, though. I don't understand what appear to be C different output layers that share the same W' weight matrix.
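The two candidate target layouts can be written out side by side. This is a minimal sketch in plain Python, assuming a toy sentence and the asymmetric window from the example (2 words behind, 3 ahead); the names `context_indices`, `targets_per_position`, and `multi_hot` are hypothetical.

```python
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word_to_index = {word: i for i, word in enumerate(vocab)}
V = len(vocab)

BEHIND, AHEAD = 2, 3  # asymmetric window from the example, so C = 5


def context_indices(position):
    """Vocabulary indices of the words in the window around corpus[position]."""
    start = max(0, position - BEHIND)
    stop = min(len(corpus), position + AHEAD + 1)
    return [word_to_index[corpus[j]] for j in range(start, stop) if j != position]


def one_hot(index):
    """Length-V list with a single 1 at `index`."""
    row = [0] * V
    row[index] = 1
    return row


position = 3  # the centre word "fox"
context = context_indices(position)

# Layout A: C separate one-hot targets, one per context position (a C x V matrix).
targets_per_position = [one_hot(i) for i in context]

# Layout B: a single "multi"-hot vector, the element-wise sum of the rows above.
# Assumes the context words are distinct; a repeated word would give a count > 1.
multi_hot = [sum(column) for column in zip(*targets_per_position)]
```

When the C context words are all distinct, layout B is just the column-wise sum of layout A, so the two carry the same information; they differ only in whether the C targets are kept as separate rows or collapsed into one.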
The skip-gram architecture has word embeddings as its output (and its input). Depending on the precise implementation, the network may therefore produce two embeddings per word or just one. The basic skip-gram architecture with the traditional softmax function yields two: one embedding for the word as an input word and one for the word as an output word. A setup that uses hierarchical softmax as an approximation to the full softmax, for example, yields one embedding per word.
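As a rough illustration of the two-embedding case with a full softmax, here is a minimal plain-Python sketch. It assumes toy sizes and random weights, and the names `W`, `W_prime`, `forward`, and `loss` are illustrative rather than taken from the word2vec code. It also follows one common reading of the diagram, in which the C output panels share `W_prime` and so all compute the same distribution; the C context words then enter only through the loss.

```python
import math
import random

random.seed(0)
V, N = 8, 3  # vocabulary size and embedding dimension (toy values)

# W (V x N) holds the input embeddings; W_prime (N x V) holds the output embeddings.
W = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W_prime = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]


def forward(center):
    """Softmax distribution over the vocabulary for one centre-word index."""
    hidden = W[center]  # a one-hot input times W just selects row `center`
    scores = [sum(hidden[k] * W_prime[k][j] for k in range(N)) for j in range(V)]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def loss(center, context):
    """Sum of the C cross-entropy terms, one per context-word index."""
    probs = forward(center)  # the same for every output panel: W_prime is shared
    return -sum(math.log(probs[j]) for j in context)


example_loss = loss(center=2, context=[6, 0, 3, 5, 7])
```

Under these assumptions, row `i` of `W` is word `i`'s embedding as an input word and column `i` of `W_prime` is its embedding as an output word, which is where the two embeddings per word come from.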
You can find more information about these architectures in the original word2vec papers, such as "Distributed Representations of Words and Phrases and their Compositionality" by Mikolov et al.