Going through the word2vec tutorial on Udacity, and from the paper it seems that there are separate matrices for the input word vectors and the output word vectors.
E.g., given ['the','cat','sat','on','mat'], the input vectors $w_i$ for 'the', 'cat', 'on', 'mat' will predict the output vector $w_o$ for 'sat'. It does this via a sampled softmax, as shown below, where $|context|$ is the number of context words (4 in this case).
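Roughly, writing $v_w$ for row $w$ of embeddings and $u_w$, $b_w$ for row $w$ of softmax_weights and softmax_biases, the full softmax being sampled is

$$h = \frac{1}{|context|}\sum_{i=1}^{|context|} v_{w_i}, \qquad p(w_o \mid context) = \frac{\exp(u_{w_o}^\top h + b_{w_o})}{\sum_{w=1}^{V} \exp(u_w^\top h + b_w)},$$

where $V$ is the vocabulary size; the sampled softmax evaluates the denominator over only a small random sample of negative words instead of all $V$.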
Hence, once training is done, there could potentially be two vectors for 'sat': one as an input vector and another as an output vector. The question is: why not have one matrix? That would ensure that the input and output vectors for the same word are identical.
If it helps, the TensorFlow code is attached below (why not set softmax_weights = embeddings and softmax_biases = 0?):
import math
import tensorflow as tf

# Variables (vocabulary_size, embedding_size, train_dataset, train_labels
# and num_sampled are defined earlier in the tutorial).
embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
softmax_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)))
softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

# Model.
# Look up embeddings for the input words.
embed = tf.nn.embedding_lookup(embeddings, train_dataset)
# Compute the softmax loss, using a sample of the negative labels each time.
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=softmax_weights,
                               biases=softmax_biases,
                               labels=train_labels,
                               inputs=embed,
                               num_sampled=num_sampled,
                               num_classes=vocabulary_size))
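Concretely, the tied variant I have in mind would be something like this (just a sketch; the zero biases are a constant rather than a trained variable):

# Tied variant: reuse the input embedding matrix as the softmax (output)
# weights, and hold the biases at zero.
zero_biases = tf.zeros([vocabulary_size])  # constant, not trained
tied_loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=embeddings,  # same matrix as the input lookup
                               biases=zero_biases,
                               labels=train_labels,
                               inputs=embed,
                               num_sampled=num_sampled,
                               num_classes=vocabulary_size))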
Update:
I implemented it without a separate output matrix and the results still look good: https://github.com/sachinruk/word2vec_alternate . I suppose the question now should be: is there a mathematical reason why the output matrix should be different?