Below is the code for generating embeddings and reducing their dimensionality:
import tensorflow_hub as hub
from sklearn.decomposition import IncrementalPCA

embed_fn = None

def generate_embeddings(text):
    # Lazily load the hub module on first use, then embed the text.
    global embed_fn
    if embed_fn is None:
        embed_fn = hub.load(module_url)
    embedding = embed_fn(text).numpy()
    return embedding

def pca():
    # Fit IncrementalPCA on the embeddings and reduce them to 64 dimensions.
    pca = IncrementalPCA(n_components=64, batch_size=1024)
    pca.fit(generate_embeddings(df))
    features_train = pca.transform(generate_embeddings(df))
    return features_train
When I run this on 100,000 records, it throws this error:
ResourceExhaustedError: OOM when allocating tensor with shape[64338902,512] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
[[{{node StatefulPartitionedCall/StatefulPartitionedCall/EncoderDNN/EmbeddingLookup/EmbeddingLookupUnique/GatherV2}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_restored_function_body_15375]
Function call stack:
restored_function_body
This shows you are hitting a memory limit (in your trace the allocation is on the CPU, not the GPU, but the cause is the same). Either reduce the batch_size or the size of the network layers.
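Since the OOM happens while embedding the whole dataset in one call, one way to apply that advice is to embed the text in chunks and feed IncrementalPCA via partial_fit, so the full [N, 512] embedding matrix is never held in memory at once. Below is a minimal sketch of that idea, assuming module_url is already defined and df holds the raw strings; embed_in_batches is an illustrative helper name, not part of the original code.

import numpy as np
import tensorflow_hub as hub
from sklearn.decomposition import IncrementalPCA

# module_url is assumed to be defined elsewhere (e.g. a Universal Sentence Encoder URL).
embed_fn = hub.load(module_url)  # load the encoder once

def embed_in_batches(texts, batch_size=1024):
    # Yield embeddings for manageable slices instead of the whole dataset.
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        yield embed_fn(batch).numpy()

texts = list(df)
pca = IncrementalPCA(n_components=64)

# First pass: fit the PCA chunk by chunk.
# Note: each chunk passed to partial_fit must contain at least n_components rows.
for chunk in embed_in_batches(texts):
    pca.partial_fit(chunk)

# Second pass: transform each chunk and stack the reduced features.
features_train = np.vstack([pca.transform(chunk) for chunk in embed_in_batches(texts)])

The chunk size (1024 here) is the main knob: smaller chunks lower the peak memory used by the hub module at the cost of more calls.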