logits and labels must have the same first dimension, got logits shape [100,5930] and labels shape [1900]

Visual Representation of Model

I am working on a machine translation task (English to Urdu) and I have used an attention mechanism for better results. My longest English sequence has length 14 and my longest Urdu sequence has length 19; I have padded both so that all sequences are of equal length. X, the English training set, has shape (19620, 14) and y, the Urdu target sequences, has shape (19620, 19). I use an embedding layer for the input, and my target sequences are not one-hot encoded because my target vocab size is 5930, so there is no benefit in producing vectors that sparse. One more thing: the output layer has 5930 neurons, which equals the number of classes, since that is the size of my target vocab.
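
Just to show why one-hot targets are not practical here, this is the rough size comparison I have in mind (plain NumPy, numbers taken from my shapes above):

import numpy as np

m, Ty, urdu_vocab = 19620, 19, 5930

# Sparse integer targets: one class index per target token
sparse_entries = m * Ty                 # 372,780 entries
# One-hot targets: a full vocab-sized vector per target token
one_hot_entries = m * Ty * urdu_vocab   # 2,210,585,400 entries
print(sparse_entries, one_hot_entries)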

Now the issue is that I am using the sparse_categorical_crossentropy loss and I am getting this error:

logits and labels must have the same first dimension, got logits shape [100,5930] and labels shape [1900]

I also get a shape mismatch error with categorical cross entropy. When I change the number of units in the output layer to 19, which is the length of my target sequence, it runs, but the loss is far too high and overshoots into the thousands. If I one-hot encode my target sequences to 5930 classes it also runs, but with the same loss problem. The documentation says categorical cross entropy expects one-hot targets, but I can't do that here.

With the correct output size (5930 classes), no loss function works for me.
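
As far as I can tell, the numbers in the error line up with my batch size and target sequence length (this is just my own reasoning, so it may be off):

batch_size = 100   # from model.fit(..., batch_size=100)
Ty = 19            # padded Urdu sequence length

# Each model output produces logits of shape (batch_size, 5930) = [100, 5930] per batch,
# while my single (m, 19) target array contributes batch_size * Ty label entries per batch,
# which matches the "labels shape [1900]" in the error message.
print(batch_size * Ty)  # 1900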

Here is the whole code:

English vocab Size:  5679
Urdu vocab Size:  5930
Max English sequence:  14
Max Urdu sequence:  19
X.shape=(19620,14)
Y.shape=(19620,19)
# Preprocessing of Training Data
train_eng_seq,train_eng_vocab,train_eng_tok=Tokenize_fn(train_data['English-Sentences'])
train_urdu_seq,train_urdu_vocab,train_urdu_tok=Tokenize_fn(train_data['Urdu-Sentences'])
# Padding
train_eng_seq=pad_fn(train_eng_seq,length=english_length)
train_urdu_seq=pad_fn(train_urdu_seq,length=urdu_length)

# Preprocessing of Testing Data
test_eng_seq,test_eng_vocab,test_eng_tok=Tokenize_fn(test_data['English-Sentences'])
test_urdu_seq,test_urdu_vocab,test_urdu_tok=Tokenize_fn(test_data['Urdu-Sentences'])
# Padding
test_eng_seq=pad_fn(test_eng_seq,length=english_length)
test_urdu_seq=pad_fn(test_urdu_seq,length=urdu_length)

# Each English sequence has a max length of 14 and each Urdu sequence has a max length of 19
Tx=english_length
Ty=urdu_length
repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation = "tanh")
densor2 = Dense(1, activation = "relu")
activator = Activation(softmax, name='attention_weights') # We are using a custom softmax(axis = 1) loaded in this notebook
dotor = Dot(axes = 1)


def one_step_attention(a,s_prev):
  # Repeat s_prev to shape (m, Tx, n_s) so it can be concatenated with a
  s_prev=repeator(s_prev)
  # Concatenate a and s_prev along the last axis
  concat=concatenator([a,s_prev])
  # Compute the energies with two dense layers
  e=densor1(concat)
  energies=densor2(e)
  # The attention weights alpha are the softmax of these energies
  alpha=activator(energies)
  # The context vector is the dot product of alpha and a
  context_vector=dotor([alpha,a])
  return context_vector


n_a = 32 # number of units for the pre-attention, bi-directional LSTM's hidden state 'a'
n_s = 64 # number of units for the post-attention LSTM's hidden state "s"

# this is the post attention LSTM cell.
post_activation_LSTM_cell = LSTM(n_s, return_state = True) 
output_layer = Dense(total_urdu_vocab, activation='softmax')


def modelf(Tx,Ty,n_a, n_s, total_eng_vocab, total_urdu_vocab):
  X=Input(shape=(english_length,)) # The Embedding layer only needs the sequence length; if I passed the full shape (m, Tx),
  # its output would be 4D, which cannot be fed into the BiLSTM
  # hidden state for post LSTM
  s0 = Input(shape=(n_s,), name='s0')
  # cell state for post lstm
  # the hidden state and cell state of an LSTM have the same shape (as I learned from CampusX)
  c0 = Input(shape=(n_s,), name='c0')
  s=s0
  c=c0
  outputs = []
  embedding_layer=tf.keras.layers.Embedding(total_eng_vocab,64,input_length=english_length)(X)
  a = Bidirectional(LSTM(n_a,return_sequences=True))(embedding_layer)

  for t in range(Ty):
    context=one_step_attention(a,s)
    _,s,c=post_activation_LSTM_cell(context,initial_state = [s,c] )
    out = output_layer(s)
    outputs.append(out)

    ''' What happens above:
        First we initialize the hidden and cell state of the post-attention LSTM with zeros. The input goes through the
        embedding layer and then the BiLSTM, which returns a, the sequence of all BiLSTM hidden states. The attention mechanism
        takes the hidden state s (initialized to 0) and the BiLSTM hidden states, concatenates them, and computes the context
        vector as in one_step_attention. We pass this context vector to one step of the post-attention LSTM to get the hidden state s,
        which goes through the output layer to give y1; the same applies to the 2nd word, 3rd word, and so on.'''
  print(outputs)
  model=tf.keras.models.Model(inputs=[X,s0,c0],outputs=outputs)

  return model


model = modelf(Tx, Ty, n_a, n_s, total_eng_vocab, total_urdu_vocab)
opt = tf.keras.optimizers.Adam(learning_rate=0.005,beta_1=0.9,beta_2=0.999) # Adam(...)
model.compile(loss = 'sparse_categorical_crossentropy', optimizer = opt, metrics = ['accuracy'])
m=train_eng_seq.shape[0]
s0 = np.zeros((m, n_s))
c0 = np.zeros((m, n_s))
model.fit([train_eng_seq, s0, c0], train_urdu_seq, epochs=50, batch_size=100)
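
Since the model returns a list of Ty outputs, I suspect the targets also have to be passed as a list of Ty arrays (one column of train_urdu_seq per output) instead of a single (m, 19) array. This is what I was planning to try next, but I am not sure it is the right fix:

# Untested idea: split the (m, Ty) target matrix into Ty arrays of shape (m, 1),
# one per model output, so each output gets its own column of sparse labels
target_list = list(np.expand_dims(train_urdu_seq, axis=-1).swapaxes(0, 1))
model.fit([train_eng_seq, s0, c0], target_list, epochs=50, batch_size=100)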