Error while reducing the dimension of Universal Sentence Encoder embeddings


Below is the code for generating the embeddings and reducing their dimension:

import numpy as np
import tensorflow_hub as hub
from sklearn.decomposition import IncrementalPCA

# module_url points at the Universal Sentence Encoder module on TF Hub
embed_fn = None

def generate_embeddings(text):
    global embed_fn
    if embed_fn is None:
        embed_fn = hub.load(module_url)  # load the encoder only once
    embedding = embed_fn(text).numpy()
    return embedding


def pca():
    pca = IncrementalPCA(n_components=64, batch_size=1024)
    pca.fit(generate_embeddings(df))
    features_train = pca.transform(generate_embeddings(df))
    return features_train

When I run it on 100,000 records it throws this error:

ResourceExhaustedError:  OOM when allocating tensor with shape[64338902,512] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
     [[{{node StatefulPartitionedCall/StatefulPartitionedCall/EncoderDNN/EmbeddingLookup/EmbeddingLookupUnique/GatherV2}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_restored_function_body_15375]

Function call stack:
restored_function_body


There are 2 best solutions below


This shows you are hitting the memory limit of your device (the allocation in the traceback is on the CPU). Either reduce the batch size you feed to the encoder at once or the size of the network layers.
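
For the embedding step specifically, a minimal sketch of feeding the encoder smaller chunks might look like the following. embed_in_chunks and chunk_size are names introduced here for illustration; module_url and df come from the question.

import numpy as np
import tensorflow_hub as hub

def embed_in_chunks(texts, chunk_size=1000):
    # hypothetical helper: encode chunk_size strings at a time so the
    # encoder never has to hold embeddings for all rows at once
    embed_fn = hub.load(module_url)
    parts = []
    for start in range(0, len(texts), chunk_size):
        parts.append(embed_fn(texts[start:start + chunk_size]).numpy())
    return np.vstack(parts)

features = embed_in_chunks(df['text'].tolist())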


Since the data is larger than the system memory and cannot be loaded all at once, pass it in chunks or batches. That way only one batch of data is loaded into memory at a time.

import numpy as np
import tensorflow_hub as hub
from sklearn.decomposition import PCA

def generate_embeddings(text):
    embed_fn = hub.load(module_url)
    embedding = embed_fn(text).numpy()
    return embedding

def gen_pca(batch):
    # embed one batch and reduce it to 64 dimensions
    gen = generate_embeddings(batch)
    pca = PCA(n_components=64)
    pca.fit(gen)
    features_train = pca.transform(gen)
    return features_train


def run():
    ex = []
    # split the text column into 100 chunks so only one chunk is embedded at a time
    for batch in np.array_split(df['text'], 100):
        ex.extend(gen_pca(batch.tolist()))  # pass a plain list of strings to the encoder
    return ex
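
Note that the code above fits a fresh PCA on every batch, so each batch gets its own projection. If a single, consistent projection across all batches is needed, the IncrementalPCA from the question can be combined with the same chunking idea. A minimal sketch, assuming df, module_url and generate_embeddings as defined above; run_incremental is a name introduced here:

import numpy as np
from sklearn.decomposition import IncrementalPCA

def run_incremental():
    ipca = IncrementalPCA(n_components=64)
    batches = np.array_split(df['text'], 100)
    # first pass: update the shared components one batch at a time
    for batch in batches:
        ipca.partial_fit(generate_embeddings(batch.tolist()))
    # second pass: project every batch with the same fitted components
    features = [ipca.transform(generate_embeddings(batch.tolist())) for batch in batches]
    return np.vstack(features)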