How can I speed up an autoencoder written in Python's Theano package to use on text data?


I'm new to Theano and I'm trying to adapt the autoencoder script here to work on text data. That script uses the MNIST dataset as its training data, supplied as a dense 2D numpy array.

My data is a CSR sparse matrix of about 100,000 instances with about 50,000 features, produced by fitting and transforming the text with sklearn's TfidfVectorizer. Since I'm working with sparse matrices, I modified the code to represent my input with the theano.sparse package. My training set is the symbolic variable:

train_set_x = theano.sparse.shared(train_set)
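
For context, this is roughly how the training set is built (the documents list below is just a stand-in for my actual corpus):

from sklearn.feature_extraction.text import TfidfVectorizer
import theano.sparse

# toy corpus standing in for the real ~100,000 documents
documents = ["first example document", "second example document"]

# fit_transform returns a scipy.sparse csr_matrix of shape (n_docs, n_features)
vectorizer = TfidfVectorizer()
train_set = vectorizer.fit_transform(documents)

# wrap the csr matrix in a sparse shared variable
train_set_x = theano.sparse.shared(train_set)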

However, theano.sparse matrices cannot perform all of the operations used in the original script (there is a list of the supported sparse operations here). The original code calls the tensor methods dot and sum on the input. I have changed the dot to theano.sparse.dot, but I can't find a sparse replacement for the sum, so I am converting the training batches to dense matrices and using the original tensor methods, as shown in this cost function (T is theano.tensor and SP is theano.sparse):

def get_cost(self):
    tilde_x = self.get_corrupted_input(self.x, self.corruption)
    y = self.get_hidden_values(tilde_x)
    z = self.get_reconstructed_input(y)
    # make dense, must be a better way to do this
    L = - T.sum(SP.dense_from_sparse(self.x) * T.log(z)
                + (1 - SP.dense_from_sparse(self.x)) * T.log(1 - z), axis=1)
    cost = T.mean(L)
    return cost

def get_hidden_values(self, input):
    # use theano.sparse.dot instead of T.dot
    return T.nnet.sigmoid(theano.sparse.dot(input, self.W) + self.b)

The get_corrupted_input and get_reconstructed_input methods remain as they are in the link above. My question is: is there a faster way to do this?

Converting the matrices to dense makes training very slow: it currently takes 20.67 minutes to run one training epoch with a batch size of 20 instances.

Any help or tips you could give would be greatly appreciated!

1 Answer

In the most recent master branch of Theano there is an sp_sum method listed in theano.sparse (see here).

If you're not using the bleeding-edge version, I'd install it and see whether calling sp_sum works and whether doing so speeds things up:

pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git
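
Once that's installed, a quick sanity check that the op is actually exposed:

import theano
import theano.sparse

print(theano.__version__)
print(hasattr(theano.sparse, 'sp_sum'))  # should print True if the op is available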

(If it does help, noting that here would be nice; it's not always clear that the sparse functionality is much faster than using dense calculations all the way through, especially on the GPU.)
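
For what it's worth, here is a rough, untested sketch of how the cost in the question might avoid dense_from_sparse. It assumes sp_sum and theano.sparse.mul (elementwise multiply where one operand is sparse) behave as documented, and it rearranges the cross-entropy as x*log(z) + (1-x)*log(1-z) = x*(log(z) - log(1-z)) + log(1-z) so the sparse input only appears in one term:

import theano.sparse as SP
import theano.tensor as T

def get_cost(self):
    tilde_x = self.get_corrupted_input(self.x, self.corruption)
    y = self.get_hidden_values(tilde_x)
    z = self.get_reconstructed_input(y)
    # rearranged cross-entropy: the sparse x only multiplies (log z - log(1 - z));
    # SP.mul keeps that product sparse and sp_sum does the row sum
    log_ratio = T.log(z) - T.log(1 - z)                    # dense
    sparse_term = SP.sp_sum(SP.mul(self.x, log_ratio), axis=1)
    dense_term = T.sum(T.log(1 - z), axis=1)
    L = -(sparse_term + dense_term)
    return T.mean(L)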