Converting tokens to word vectors effectively with TensorFlow Transform

Asked by Tony Yotto on 31 July 2018 at 05:40 · 1.3k views

I would like to use TensorFlow Transform to convert tokens to word vectors during my training, validation and inference phases.

I followed this StackOverflow post and implemented the initial conversion from tokens to vectors. The conversion works as expected and I obtain a vector of size EMB_DIM for each token.

import numpy as np
import tensorflow as tf

tf.reset_default_graph()
EMB_DIM = 10

def load_pretrained_glove():
    tokens = ["a", "cat", "plays", "piano"]
    return tokens, np.random.rand(len(tokens), EMB_DIM)

# sample string 
string_tensor = tf.constant(["plays", "piano", "unknown_token", "another_unknown_token"])


pretrained_vocab, pretrained_embs = load_pretrained_glove()

vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
    mapping = tf.constant(pretrained_vocab),
    default_value = len(pretrained_vocab))
string_tensor = vocab_lookup.lookup(string_tensor)

# define the word embedding
pretrained_embs = tf.get_variable(
    name="embs_pretrained",
    initializer=tf.constant_initializer(np.asarray(pretrained_embs), dtype=tf.float32),
    shape=pretrained_embs.shape,
    trainable=False)

unk_embedding = tf.get_variable(
    name="unk_embedding",
    shape=[1, EMB_DIM],
    initializer=tf.random_uniform_initializer(-0.04, 0.04),
    trainable=False)

embeddings = tf.cast(tf.concat([pretrained_embs, unk_embedding], axis=0), tf.float32)
word_vectors = tf.nn.embedding_lookup(embeddings, string_tensor)

with tf.Session() as sess:
    tf.tables_initializer().run()
    tf.global_variables_initializer().run()
    print(sess.run(word_vectors))

When I refactor the code to run as a TensorFlow Transform graph, I get the TypeError shown under "Error Message" below.

import pprint
import tempfile
import numpy as np
import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam.impl as beam_impl
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema

tf.reset_default_graph()

EMB_DIM = 10

def load_pretrained_glove():
    tokens = ["a", "cat", "plays", "piano"]
    return tokens, np.random.rand(len(tokens), EMB_DIM)


def embed_tensor(string_tensor, trainable=False):
    """
    Convert List of strings into list of indices then into EMB_DIM vectors
    """

    pretrained_vocab, pretrained_embs = load_pretrained_glove()

    vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(pretrained_vocab),
        default_value=len(pretrained_vocab))
    string_tensor = vocab_lookup.lookup(string_tensor)

    pretrained_embs = tf.get_variable(
        name="embs_pretrained",
        initializer=tf.constant_initializer(np.asarray(pretrained_embs), dtype=tf.float32),
        shape=pretrained_embs.shape,
        trainable=trainable)
    unk_embedding = tf.get_variable(
        name="unk_embedding",
        shape=[1, EMB_DIM],
        initializer=tf.random_uniform_initializer(-0.04, 0.04),
        trainable=False)

    embeddings = tf.cast(tf.concat([pretrained_embs, unk_embedding], axis=0), tf.float32)
    return tf.nn.embedding_lookup(embeddings, string_tensor)

def preprocessing_fn(inputs):
    input_string = tf.string_split(inputs['sentence'], delimiter=" ") 
    return {'word_vectors': tft.apply_function(embed_tensor, input_string)}


raw_data = [{'sentence': 'This is a sample sentence'},]
raw_data_metadata = dataset_metadata.DatasetMetadata(dataset_schema.Schema({
  'sentence': dataset_schema.ColumnSchema(
      tf.string, [], dataset_schema.FixedColumnRepresentation())
}))

with beam_impl.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, transform_fn = (  # pylint: disable=unused-variable
        (raw_data, raw_data_metadata) | beam_impl.AnalyzeAndTransformDataset(
            preprocessing_fn))

    transformed_data, transformed_metadata = transformed_dataset  # pylint: disable=unused-variable
    pprint.pprint(transformed_data)

Error Message

TypeError: Failed to convert object of type <class 
'tensorflow.python.framework.sparse_tensor.SparseTensor'> to Tensor. 
Contents: SparseTensor(indices=Tensor("StringSplit:0", shape=(?, 2), 
dtype=int64), values=Tensor("hash_table_Lookup:0", shape=(?,), 
dtype=int64), dense_shape=Tensor("StringSplit:2", shape=(2,), 
dtype=int64)). Consider casting elements to a supported type.

Questions

Why would the TF Transform step require an additional conversion/casting?
Is this approach of converting tokens to word vectors feasible? The word vectors might take up multiple gigabytes of memory. How does Apache Beam handle the vectors? If Beam runs in a distributed setup, would it require N times the vector memory, where N is the number of workers?

**Kester Tong** · Answer 1 · 2018-08-09T18:11:52.380000

The SparseTensor-related error occurs because you are calling tf.string_split, which returns a SparseTensor. Your test code does not call tf.string_split, which is why the error only appears in your Transform code.
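To make the sparse result concrete, here is a pure-Python sketch (not TensorFlow code) that mimics the (indices, values, dense_shape) triple a string split produces and then fills it into a rectangular list before an embedding lookup. The helper names split_to_sparse and to_dense are hypothetical, and padding short rows with the unknown-token id is an assumption made for illustration. In TF 1.x the dense conversion itself would be done with tf.sparse_tensor_to_dense.

```python
def split_to_sparse(sentences, delimiter=" "):
    """Mimic a 2-D sparse split result: (indices, values, dense_shape)."""
    indices, values = [], []
    max_len = 0
    for row, sentence in enumerate(sentences):
        # Empty tokens are skipped, so repeated delimiters add no entries.
        tokens = [t for t in sentence.split(delimiter) if t]
        for col, token in enumerate(tokens):
            indices.append((row, col))
            values.append(token)
        max_len = max(max_len, len(tokens))
    return indices, values, (len(sentences), max_len)


def to_dense(indices, values, dense_shape, default_value):
    """Fill a rectangular list of lists; missing cells get default_value."""
    rows, cols = dense_shape
    dense = [[default_value] * cols for _ in range(rows)]
    for (r, c), v in zip(indices, values):
        dense[r][c] = v
    return dense


vocab = ["a", "cat", "plays", "piano"]
unk_id = len(vocab)  # same convention as default_value in the question

indices, tokens, shape = split_to_sparse(["a cat plays piano", "cat sleeps"])
# Vocabulary lookup on the sparse values only; the indices stay unchanged.
ids = [vocab.index(t) if t in vocab else unk_id for t in tokens]
# Assumption: pad short sentences with the unknown-token id.
dense_ids = to_dense(indices, ids, shape, default_value=unk_id)
print(dense_ids)
```

The sentences have different lengths, which is why the split result is sparse. An embedding lookup that expects a plain tensor of ids needs the rectangular form, or a lookup variant that accepts sparse ids.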

Regarding memory, you are correct: the embedding matrix must be loaded into each worker.
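As a rough back-of-the-envelope sketch of what that means, the snippet below estimates the memory for one copy of a float32 embedding matrix and for N worker copies. The vocabulary size, dimension and worker count are hypothetical numbers chosen for illustration, and the estimate ignores any overhead from the runtime or the serialized graph.

```python
def embedding_bytes(vocab_size, emb_dim, bytes_per_value=4):
    """Size of one dense embedding matrix (float32 uses 4 bytes per value)."""
    return vocab_size * emb_dim * bytes_per_value


def total_bytes(num_workers, vocab_size, emb_dim, bytes_per_value=4):
    """Each worker holds its own copy, so memory scales linearly with workers."""
    return num_workers * embedding_bytes(vocab_size, emb_dim, bytes_per_value)


# Hypothetical setup: 2 million tokens, 300 dimensions, 8 workers.
per_worker = embedding_bytes(2_000_000, 300)
cluster = total_bytes(8, 2_000_000, 300)
print(f"per worker: {per_worker / 1e9:.1f} GB, all workers: {cluster / 1e9:.1f} GB")
```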

**Michael Simbirsky** · Answer 2 · 2018-08-11T15:50:40.717000

One cannot put a SparseTensor into the dictionary returned by TensorFlow Transform, which in your case is the dictionary returned by the function "preprocessing_fn". The reason is that a SparseTensor is not a Tensor; it is actually a small subgraph.

To fix your code, you can convert your SparseTensor into a Tensor. There are a number of ways to do so. I would recommend using tf.serialize_sparse for a regular SparseTensor and tf.serialize_many_sparse for a batched one.

To consume such a serialized Tensor in the Trainer, you can call the function tf.deserialize_many_sparse.
