I use a BiLSTM-CRF architecture to assign a label to each sentence in a paper. I have 150 papers, each containing 380 sentences; each sentence is represented by a float vector of size 11 with values in (0, 1), and there are 11 class labels.
from keras.models import Model
from keras.layers import Input, Masking, Bidirectional, LSTM, Dropout, TimeDistributed, Dense
from keras_contrib.layers import CRF

input = Input(shape=(None, 11))
mask = Masking(mask_value=0)(input)
lstm = Bidirectional(LSTM(50, return_sequences=True))(mask)
lstm = Dropout(0.3)(lstm)
lstm = TimeDistributed(Dense(50, activation="relu"))(lstm)
crf = CRF(11, sparse_target=False, learn_mode='join')  # CRF layer
out = crf(lstm)
model = Model(input, out)
model.summary()
model.compile('adam', loss=crf.loss_function, metrics=[crf.accuracy])
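For reference, data matching this model would be shaped as follows. This is a numpy sketch with random values; the variable names are my own, not from any library:

```python
import numpy as np

n_papers, max_sents, n_feats, n_classes = 150, 380, 11, 11

# Sentence features in (0, 1). If some papers had fewer sentences, the
# missing timesteps would be all-zero rows so Masking(mask_value=0)
# skips them.
X = np.random.uniform(0.01, 1.0, size=(n_papers, max_sents, n_feats))

# One-hot labels of shape (papers, sentences, classes), which is what
# the CRF layer expects when sparse_target=False.
labels = np.random.randint(0, n_classes, size=(n_papers, max_sents))
y = np.eye(n_classes)[labels]
```

With these arrays, training would be a plain `model.fit(X, y, ...)` call.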
I use the keras-contrib package to implement the CRF layer. The CRF layer has two learning modes: join mode and marginal mode. As I understand it, join mode is a real CRF that uses the Viterbi algorithm to predict the best path, while marginal mode is not a real CRF and uses categorical cross-entropy as its loss function. When I use marginal mode, the output looks like this:
Epoch 4/250: - 6s - loss: 1.2289 - acc: 0.5657 - val_loss: 1.3459 - val_acc: 0.5262
But in join mode, the loss becomes nan:
Epoch 2/250 : - 5s - loss: nan - acc: 0.1880 - val_loss: nan - val_acc: 0.2120
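For context, here is a toy numpy sketch of the quantity join mode optimizes: the negative log-likelihood of the whole tag path under a linear-chain CRF. This is my own illustration, not keras-contrib's implementation, and `crf_nll` / `crf_nll_bruteforce` are made-up names:

```python
import numpy as np
from itertools import product

def crf_nll(emissions, transitions, tags):
    """Join-mode style loss: negative log-likelihood of one tag path
    under a linear-chain CRF (toy sketch, not keras-contrib's code).

    emissions:   (T, K) per-timestep unary scores
    transitions: (K, K) score of moving from tag i to tag j
    tags:        length-T gold tag sequence
    """
    T, K = emissions.shape
    # Score of the gold path: emission scores plus pairwise transitions.
    score = emissions[0, tags[0]]
    for t in range(1, T):
        score += transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    # Log-partition over all K**T paths via the forward algorithm,
    # kept in log space so the sums do not overflow or underflow.
    alpha = emissions[0].copy()
    for t in range(1, T):
        m = alpha.max()
        alpha = m + np.log(np.exp(alpha - m) @ np.exp(transitions)) + emissions[t]
    m = alpha.max()
    log_z = m + np.log(np.exp(alpha - m).sum())
    return log_z - score

def crf_nll_bruteforce(emissions, transitions, tags):
    """Same quantity by enumerating every path (sanity check only)."""
    T, K = emissions.shape
    def path_score(p):
        s = emissions[0, p[0]]
        for t in range(1, T):
            s += transitions[p[t - 1], p[t]] + emissions[t, p[t]]
        return s
    scores = np.array([path_score(p) for p in product(range(K), repeat=T)])
    m = scores.max()
    log_z = m + np.log(np.exp(scores - m).sum())
    return log_z - path_score(tags)
```

On small examples the two functions agree; the point of the sketch is that the join-mode loss involves a log-sum-exp over all paths, which must be computed in log space to stay numerically stable, unlike the per-timestep cross-entropy used in marginal mode.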
I do not understand why this happens and would be grateful to anybody who could give me a hint.