I am working on a project where I use a pre-trained transformer model to generate embeddings for DNA sequences (each sequence is labeled either '1' or '0'). I'm trying to map these embeddings back to their corresponding labels in my dataset, but I get an IndexError when I try. I suspect it is related to the batching I introduced because the full dataset does not fit in memory.
Here is the code I'm working with:
from datasets import Dataset
from transformers import AutoTokenizer, AutoModel
import torch
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref")
model = AutoModel.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref")
# Load the dataset
ds1 = Dataset.from_file('training.arrow') #this is already tokenized
# Convert tokenized sequences to tensor
inputs = torch.tensor(ds1['input_ids']).to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))
# Reduce batch size
batch_size = 4
# Pass tokenized sequences through the model with reduced batch size
with torch.no_grad():
    outputs = model(input_ids=inputs[:batch_size], output_hidden_states=True)
# Extract embeddings
hidden_states = outputs.hidden_states
embeddings1 = hidden_states[-1]
Here is the information about the size of the output embeddings and the original dataset:
embeddings1.shape
torch.Size([4, 86, 1280])
ds1
Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 22535512
})
I'm having a hard time figuring out how to map the labels back to the output embeddings, especially given the large discrepancy between the sizes. As you can see, I have 22 million sequences, and I would like an embedding for each sequence.
My plan is to use these embeddings for downstream prediction with another model. I have already split my data into train, test, and validation sets, but would it make more sense to compute embeddings separately for a label-1 dataset and a label-0 dataset, then combine them and split into train/test, so I don't have to worry about mapping the labels?
You can use the dataset's .map function to append the embeddings as a new column. I also suggest running this on a GPU instead of the CPU, since the number of rows is very high.
Please try running the code below.
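Something along these lines should work. It is a sketch, not a drop-in solution: it mean-pools the last hidden state over tokens to get one vector per sequence (the pooling choice and the "embedding" column name are my own), and it assumes every row is already padded to the same length, which your [4, 86, 1280] output suggests.

import torch
from datasets import Dataset
from transformers import AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model once and move it to the GPU if one is available
model = AutoModel.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref")
model.to(device)
model.eval()

# Already-tokenized dataset with 'labels', 'input_ids', 'attention_mask'
ds1 = Dataset.from_file('training.arrow')

def embed_batch(batch):
    # Build tensors for this batch only, so the 22M rows never sit in GPU memory at once
    input_ids = torch.tensor(batch["input_ids"]).to(device)
    attention_mask = torch.tensor(batch["attention_mask"]).to(device)
    with torch.no_grad():
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
    hidden = outputs.hidden_states[-1]      # (batch, seq_len, 1280)
    # Mean-pool over tokens, ignoring padding, to get one 1280-d vector per sequence
    # (pooling choice is an assumption; use whatever your downstream model expects)
    mask = attention_mask.unsqueeze(-1)     # (batch, seq_len, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    # Returning a new column keeps every row aligned with its existing 'labels' value
    return {"embedding": pooled.cpu().numpy()}

# batched=True feeds batch_size rows at a time; raise batch_size as far as memory allows
ds1 = ds1.map(embed_batch, batched=True, batch_size=4)

Because .map only adds a column, each embedding stays on the same row as its 'labels' value, so there is no need to build separate label-1/label-0 datasets and re-split. You can run the same map over your existing train/test/val splits and keep them as they are.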