I am trying to adapt a pretrained ViT to work with 3D images, using a naive approach with a max-pooling layer to aggregate the extracted slice features before the MLP head. I want to use the Trainer class to train the model, so I am using HF's Dataset class with a transform that processes each slice of the 3D image. However, I cannot return the whole set of processed slices: the transform keeps returning only one processed slice.
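For context, the model itself looks roughly like this (a minimal sketch of the naive approach; the class name, checkpoint, and output format are simplified placeholders for my actual code):

```python
import torch.nn as nn
from transformers import ViTModel

class SliceViT(nn.Module):
    def __init__(self, num_classes, num_slices=28):
        super().__init__()
        self.num_slices = num_slices
        self.vit = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k')
        self.head = nn.Linear(self.vit.config.hidden_size, num_classes)

    def forward(self, pixel_values, labels=None):
        # pixel_values: (batch_size * num_slices, 3, 224, 224)
        feats = self.vit(pixel_values=pixel_values).last_hidden_state[:, 0]  # [CLS] per slice
        # Group the slice features back per volume and max-pool over the slice axis
        pooled = feats.view(-1, self.num_slices, feats.size(-1)).max(dim=1).values
        logits = self.head(pooled)
        if labels is not None:
            return {'loss': nn.functional.cross_entropy(logits, labels), 'logits': logits}
        return {'logits': logits}
```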
Here, processor is an instance of ViTImageProcessor.
```python
import numpy as np

def preprocess_data(ds, num_slices=28):
    # Reshape each flattened sample into a (1, 28, 28, 28) volume
    reshaped = [np.array(sample).reshape(1, 28, 28, 28) for sample in ds['image']]
    # Repeat the single channel 3 times and split each volume into
    # num_slices 2D images of shape (3, 28, 28) for the processor
    inputs = processor(
        [np.repeat(sample, 3, axis=0)[:, :, :, i] for sample in reshaped for i in range(num_slices)],
        return_tensors='pt'
    )
    inputs['labels'] = list(ds['labels'])
    return inputs
```
The 'pixel_values' tensor should have a shape of (28, 3, 224, 224), and I verified this by printing its shape inside the transform function. But when I get a sample from the dataset with the transform applied, I receive a tensor of shape (3, 224, 224). I also tried stacking the slice information along another dimension, but then the first dimension gets ignored. Why does the transform behave like this?
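To make the behaviour concrete, here is a minimal reproduction of what I am doing (the dummy data and checkpoint name are placeholders; `preprocess_data` is the function above):

```python
import numpy as np
from datasets import Dataset
from transformers import ViTImageProcessor

processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224-in21k')

# Dummy stand-in for my data: 4 volumes of 28x28x28 voxels
ds = Dataset.from_dict({
    'image': [np.random.rand(28 * 28 * 28).tolist() for _ in range(4)],
    'labels': [0, 1, 0, 1],
})
ds.set_transform(preprocess_data)

sample = ds[0]
# Inside preprocess_data the printed shape is (28, 3, 224, 224),
# but what I get back here is a single slice:
print(sample['pixel_values'].shape)  # torch.Size([3, 224, 224])
```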
I tried to use my own training loop as well, but my model keeps failing to converge, so I really want to try using a Trainer instance.
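For reference, this is roughly how I want to wire it into Trainer once the transform works (a sketch; the argument values are placeholders, and `model`/`ds` are the model and dataset from above):

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir='vit3d-out',
    per_device_train_batch_size=2,
    num_train_epochs=5,
    # keep the raw 'image' column so the on-the-fly transform still sees it
    remove_unused_columns=False,
)

trainer = Trainer(model=model, args=args, train_dataset=ds)
trainer.train()
```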
Thanks.