Training Vision Transformer on Custom dataset

333 Views Asked by At

i am trying to use a pre-trained ViT pytorch model. It is pre-trained on imagenet with image size 384x384. Now i want to fine tune this model on my own dataset. But each time when i load the pre-trained ViT model and try to fine tune it i got an error on the positional_embedding layer. My images are 512x512 and i do not want to downscale my images. Can anyone help me with the code, how can i use the pretreind ViT model on my own dataset. Help will be appreciated.

Thank you

The following is the code to load the pytorch pre-trained model

from pytorch_pretrained_vit import ViT
model_name = 'B_16_imagenet1k'
model = ViT(model_name, pretrained=True)

and the Following is the loaded model:

ViT(
  (patch_embedding): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
  (positional_embedding): PositionalEmbedding1D()
  (transformer): Transformer(
    (blocks): ModuleList(
      (0-11): 12 x Block(
        (attn): MultiHeadedSelfAttention(
          (proj_q): Linear(in_features=768, out_features=768, bias=True)
          (proj_k): Linear(in_features=768, out_features=768, bias=True)
          (proj_v): Linear(in_features=768, out_features=768, bias=True)
          (drop): Dropout(p=0.1, inplace=False)
        )
        (proj): Linear(in_features=768, out_features=768, bias=True)
        (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (pwff): PositionWiseFeedForward(
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
        )
        (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (drop): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
  (fc): Linear(in_features=768, out_features=1000, bias=True)
)
0

There are 0 best solutions below