I am working on a project where I use a pre-trained transformer model to generate embeddings for DNA sequences (each sequence is labeled either '1' or '0'). I'm trying to map these embeddings back to their corresponding labels in my dataset, but I get an IndexError when I try. I suspect it is related to the batching I introduced because the full dataset does not fit in memory.
Here is the code I'm working with:
from datasets import Dataset
from transformers import AutoTokenizer, AutoModel
import torch
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref")
model = AutoModel.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref")
# Load the dataset
ds1 = Dataset.from_file('training.arrow') #this is already tokenized
# Convert tokenized sequences to tensor
inputs = torch.tensor(ds1['input_ids']).to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))
# Reduce batch size
batch_size = 4
# Pass tokenized sequences through the model with reduced batch size
with torch.no_grad():
    outputs = model(input_ids=inputs[:batch_size], output_hidden_states=True)
# Extract embeddings
hidden_states = outputs.hidden_states
embeddings1 = hidden_states[-1]
Here is the information about the size of the output embeddings and the original dataset:
embeddings1.shape
torch.Size([4, 86, 1280])
ds1
Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 22535512
})
I'm having a hard time figuring out how to map the labels back to the output embeddings, especially given the large discrepancy between the sizes. As you can see, I have 22 million sequences, and I would like an embedding for each sequence.
My plan is to use these embeddings for downstream prediction with another model. I have already split my data into train, test, and validation sets, but would it make more sense to compute embeddings separately for a label-1 dataset and a label-0 dataset, then combine them and split into train/test, so I don't have to worry about mapping the labels?
You can use the dataset's .map function to append the embeddings as a new column. I also suggest running this on a GPU instead of the CPU, since the number of rows is very high.
Please try running the code below.
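Something along these lines should work. It is a sketch, not a drop-in solution: it mean-pools the last hidden state over tokens to get one vector per sequence (the pooling choice and the "embedding" column name are my own), and it assumes every row is already padded to the same length, which your [4, 86, 1280] output suggests.

import torch
from datasets import Dataset
from transformers import AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model once and move it to the GPU if one is available
model = AutoModel.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref")
model.to(device)
model.eval()

# Already-tokenized dataset with 'labels', 'input_ids', 'attention_mask'
ds1 = Dataset.from_file('training.arrow')

def embed_batch(batch):
    # Build tensors for this batch only, so the 22M rows never sit in GPU memory at once
    input_ids = torch.tensor(batch["input_ids"]).to(device)
    attention_mask = torch.tensor(batch["attention_mask"]).to(device)
    with torch.no_grad():
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
    hidden = outputs.hidden_states[-1]      # (batch, seq_len, 1280)
    # Mean-pool over tokens, ignoring padding, to get one 1280-d vector per sequence
    # (pooling choice is an assumption; use whatever your downstream model expects)
    mask = attention_mask.unsqueeze(-1)     # (batch, seq_len, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    # Returning a new column keeps every row aligned with its existing 'labels' value
    return {"embedding": pooled.cpu().numpy()}

# batched=True feeds batch_size rows at a time; raise batch_size as far as memory allows
ds1 = ds1.map(embed_batch, batched=True, batch_size=4)

Because .map only adds a column, each embedding stays on the same row as its 'labels' value, so there is no need to build separate label-1/label-0 datasets and re-split. You can run the same map over your existing train/test/val splits and keep them as they are.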