Keras LSTM input dimensions with one hot text embedding

I have 70k text samples that I have encoded using Keras 'one hot' preprocessing. This gives me an array like [40, 20, 142...], which I then pad to a length of 28 (the longest sample length). All I am trying to do is map these values to a categorical label (0 to 5, let's say). When I train the model I cannot get anything beyond -.13% accuracy (this is my current error; I have tried many ways of passing the input).

This is my current data, and I am just trying to create a simple LSTM. Again, my data is X -> [28 integer values, the embeddings] and Y -> [1 integer of length 3 (100, 143, etc.)]. Any idea what I am doing wrong? I have asked many people and no one has been able to help. Here is the code for my model:

optimizer = RMSprop(lr=0.01) #saw this online, no idea
model = Sequential()
model.add(Embedding(input_dim=28,output_dim=1,init='uniform')) #28 features, 1 dim output?
model.add(LSTM(150)) #just adding my LSTM nodes
model.add(Dense(1)) #since I want my output to be 1 integer value

model.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
print(model.summary())

Edit:

Using model.add(Embedding(input_dim=900,output_dim=8,init='uniform')) seems to work; however, the accuracy still never improves. I am at a loss as to what to do.

Best answer:

I have two suggestions.

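  1. Use a one-hot representation for the target variable (y) as well, and give the output layer one softmax unit per class. With Dense(1) and an integer Y, the network is effectively set up as a regression problem rather than a classification problem. (Integer targets also work with sparse_categorical_crossentropy, but only if they run from 0 to nb_classes-1 and the final layer still has nb_classes softmax units.)
  2. When you have a large amount of text, try a word2vec embedding instead of the one-hot encoding.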
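
To make the first suggestion concrete, here is a minimal sketch of one-hot encoding integer labels in plain Python. The helper name `one_hot_labels` and the sample labels are hypothetical and only for illustration; in practice Keras's `keras.utils.to_categorical` does the same job and returns a NumPy array.

```python
def one_hot_labels(labels, nb_classes):
    """Turn integer class labels into one-hot rows (hypothetical helper)."""
    encoded = []
    for label in labels:
        if not 0 <= label < nb_classes:
            raise ValueError("label %d is outside 0..%d" % (label, nb_classes - 1))
        row = [0] * nb_classes
        row[label] = 1  # mark the position of the true class
        encoded.append(row)
    return encoded


# Made-up labels for illustration; the real ones come from your data set.
y = [0, 3, 5]
y_one_hot = one_hot_labels(y, nb_classes=6)
for row in y_one_hot:
    print(row)
```

Each row has length nb_classes, which must match the number of units in the final Dense layer. Raw label values such as 100 or 143 would first have to be remapped to the range 0..nb_classes-1.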

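# Imports for the model below
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.optimizers import RMSprop

optimizer = RMSprop(learning_rate=0.01)  # the argument is named 'lr' in older Keras versions
embedding_vector_length = 32
max_review_length = 28  # length of the padded sequences
nb_classes = 8  # set this to the number of distinct labels in your data
model = Sequential()
# input_dim is the vocabulary size (largest word index + 1), not the sequence length.
# Newer Keras versions may warn that input_length is deprecated; it can then be dropped.
model.add(Embedding(input_dim=900, output_dim=embedding_vector_length,
                    input_length=max_review_length))

model.add(LSTM(150))

# The output is a categorical variable with nb_classes classes: one softmax unit per class
model.add(Dense(nb_classes, activation='softmax'))

# categorical_crossentropy expects one-hot targets of shape (n_samples, nb_classes)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
model.summary()

# X_train, y_train, X_test, y_test are your own padded inputs and one-hot labels
model.fit(X_train, y_train, epochs=3, batch_size=64)  # 'nb_epoch' in Keras 1

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))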
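
One more thing worth checking, since the question first used input_dim=28: the Embedding layer's input_dim is the vocabulary size (the largest integer index plus one), not the sequence length. A rough sketch of that check in plain Python follows; `X_sample` is made-up data and `min_input_dim` is a hypothetical helper, not part of Keras.

```python
def min_input_dim(sequences):
    """Smallest valid Embedding input_dim: largest index + 1 (hypothetical helper)."""
    return max(max(seq) for seq in sequences if seq) + 1


# Made-up padded sequences (0 is the padding value), shortened for readability.
X_sample = [
    [0, 0, 40, 20, 142],
    [0, 7, 899, 3, 15],
]

needed = min_input_dim(X_sample)
print("Embedding input_dim must be at least", needed)
```

If the value computed on your real X is larger than the input_dim passed to Embedding, some word indices fall outside the embedding table, so the size of the hashing space used in the 'one hot' step and input_dim should agree.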