Generating captions from image embeddings

I want to pass image embeddings to an LLM (e.g. GPT-2) and generate a caption for the image. However, I am having difficulty passing these embeddings directly to the LLM.

I am using CLIP to generate an embedding for each image. For example, one image gives the following 512-dimensional tensor:

tensor([[-5.7324e-01,  1.0089e-01,  4.0649e-01, -8.7830e-02,  1.8726e-01,
          1.6980e-01,  1.4966e-01,  1.5173e-01,  4.3896e-01, -1.3843e-01,
         -1.3867e-01,  4.8584e-01,  2.7490e-01, -6.5369e-02,  2.8735e-01,
          1.2103e-01, -7.1945e-03,  2.3279e-01, -3.5059e-01, -3.8452e-01,
          1.2390e-01, -6.4270e-02,  5.5908e-02,  3.1104e-01, -3.1525e-02,
          7.3486e-01, -7.5928e-02, -3.4644e-01, -3.3618e-01,  1.7715e-02,
          3.4302e-01,  4.8291e-01,  2.5482e-03, -9.6069e-02,  1.8463e-02,
         -3.7036e-01, -3.4180e-02, -2.1301e-01, -2.3178e-02, -1.0364e-01,
         -2.5415e-01, -1.0933e-02,  4.2651e-01, -1.4258e-01,  2.0923e-01,
         -1.6895e+00, -1.7810e-01,  3.6938e-01, -3.1885e-01,  1.1340e-01,
         -1.8542e-01,  9.6924e-02,  4.7461e-01,  1.0541e-01, -3.7964e-01,
          2.6074e-01, -3.2544e-01, -3.1616e-02,  4.2432e-01,  8.9795e-01,
         -3.0347e-01, -2.0703e-01,  1.6125e-01,  1.2549e-01,  1.9684e-02,
         -3.4790e-01,  4.7852e-01, -2.4750e-02, -4.0283e-01,  1.7281e-03,
         -9.6664e-03,  1.2073e-01, -3.6060e-01,  4.3579e-02, -3.9185e-02,
         -5.0079e-02,  3.3252e-01, -9.4482e-02,  2.3340e-01, -1.5723e-01,
         -4.2188e-01,  1.5540e-01, -3.7866e-01,  7.4951e-01, -1.8457e-01,
         -4.6265e-02,  1.2422e+00, -2.2656e-01,  1.6235e-01,  1.2537e-01,
          5.3271e-01, -1.1273e-01, -6.3906e+00,  8.2275e-01, -8.2458e-02,
          3.7354e-01, -3.6353e-01, -2.2314e-01,  2.1387e-01,  8.0518e-01,
         -4.7510e-01, -4.3604e-01,  1.5698e-01, -8.1177e-02, -4.5068e-01,
          9.4666e-02, -1.6387e+00, -8.2581e-02, -1.1176e-01,  6.7017e-02,
          3.4937e-01, -3.3630e-02,  5.1422e-02,  2.1411e-01, -3.9917e-01,
         -3.7769e-01, -2.7075e-01,  1.4636e-01, -1.0388e-01, -3.1738e-01,
         -7.2975e-03,  2.7637e-01, -4.3262e-01,  1.6443e-01, -6.4026e-02,
          4.5557e-01, -4.7046e-01, -1.0272e-01, -3.3008e-01,  8.6487e-02,
         -1.4587e-01,  6.1157e-02,  4.0137e-01,  7.9932e-01,  1.0490e-02,
         -3.9825e-02,  8.6365e-02,  1.9730e-02,  8.3557e-02, -1.9824e-01,
         -2.8885e-02, -6.5552e-02,  1.8884e-01, -6.7688e-02, -2.4242e-03,
          7.4951e-02, -4.0356e-01, -2.5122e-01,  1.6785e-01, -3.6328e-01,
         -2.9938e-02, -1.3855e-01,  1.1787e+00, -7.1875e-01,  6.9824e-01,
         -2.5952e-01, -5.0537e-01,  1.7807e-02,  2.4072e-01, -1.1224e-01,
          2.8784e-01, -3.3032e-01,  4.5679e-01, -2.3242e-01,  6.1951e-02,
         -1.4856e-01,  1.3293e-01,  2.5269e-01,  2.8564e-01, -2.7856e-01,
         -1.6724e-01, -5.9265e-02,  2.3117e-02,  2.4910e-03, -1.8762e-01,
         -3.3057e-01,  6.7676e-01, -1.2988e-01,  1.3391e-01, -1.6846e-01,
          3.4912e-01, -1.9165e-01,  6.3721e-02,  2.4756e-01,  3.1372e-01,
          2.2009e-01,  5.9875e-02,  6.1963e-01, -1.2535e-02, -1.3879e-01,
          2.7344e-01,  2.0850e-01, -1.1487e-01, -2.6343e-01, -1.6919e-01,
         -3.8013e-01, -3.5522e-01,  8.8318e-02, -4.9341e-01,  2.6685e-01,
          1.9189e-01, -2.5665e-02, -1.1490e-02, -2.1033e-01, -1.0059e-01,
          4.5337e-01,  2.3608e-01, -3.2007e-01, -3.2031e-01,  9.9243e-02,
         -6.1816e-01,  3.5059e-01, -1.6101e-01,  5.2856e-02, -4.0576e-01,
          8.9340e-03,  3.0289e-02,  5.5566e-01,  5.0781e-01, -3.0121e-02,
          2.1802e-01, -7.5073e-02, -6.8066e-01, -2.6489e-01, -1.1395e-01,
         -5.7220e-03, -9.1125e-02,  3.1128e-01, -2.0386e-01, -2.0581e-01,
         -7.3608e-02,  2.0981e-02, -2.8857e-01,  3.7305e-01,  3.9764e-02,
         -6.0889e-01,  3.1128e-01, -3.2056e-01, -4.0820e-01, -3.5938e-01,
          6.5430e-02,  1.7944e-01, -1.9928e-02, -5.5664e-01,  2.0837e-01,
          9.3701e-01, -9.2041e-02,  5.5859e-01, -9.2529e-02,  6.3672e-01,
          4.5947e-01,  1.9995e-01, -4.0234e-01, -2.1692e-01,  2.9517e-01,
          1.7090e-01, -1.2915e-01,  1.5417e-01, -8.8721e-01,  2.2266e-01,
         -3.5339e-02, -1.7383e-01, -2.3706e-01,  3.5547e-01,  5.0995e-02,
         -4.6655e-01, -2.7124e-01,  2.7344e-01,  3.5864e-01, -2.2937e-01,
         -3.7817e-01,  8.1116e-02,  4.5483e-01,  1.2683e-01,  9.8572e-02,
         -1.1975e-01, -1.6907e-01, -3.3252e-01,  3.7079e-02,  1.8188e-01,
          1.2311e-01,  3.7769e-01, -3.9648e-01,  1.4172e-01,  3.2056e-01,
          1.5845e-01,  8.4668e-01, -5.5084e-02,  2.0911e-01, -1.8384e-01,
          1.1847e-01, -3.7964e-01, -1.9165e-02,  2.7344e-01, -3.6621e-01,
          2.4561e-01, -3.7036e-01, -6.5193e-03, -1.9385e-01,  1.0144e-01,
          4.6167e-01,  3.3008e-01, -9.4910e-02,  2.5586e-01, -1.8799e-01,
         -2.6318e-01,  1.1536e-01,  2.0959e-01,  1.4539e-01,  9.1614e-02,
         -4.0497e-02,  5.4297e-01,  7.9932e-01, -2.2534e-01,  3.7109e-02,
         -1.2585e-01,  1.1592e+00,  1.6956e-01, -1.1224e-01, -3.4399e-01,
          7.0508e-01,  2.2949e+00, -2.0508e-01,  1.9568e-01,  1.1981e-01,
          2.8030e-02,  4.2529e-01,  1.1371e-01, -1.7932e-01, -1.8359e-01,
          1.2286e-01,  3.3154e-01,  5.3833e-02, -6.6504e-01, -7.6782e-02,
          3.9966e-01,  3.1543e-01, -2.9953e-02, -2.7563e-01,  2.1790e-01,
         -4.4482e-01,  3.3130e-01,  1.8848e-01,  4.6338e-01,  2.5439e-01,
         -3.7720e-01, -3.9429e-02,  1.7920e-01, -1.6089e-01,  1.4136e-01,
          5.4199e-02, -1.6907e-01, -6.2061e-01, -3.9209e-01, -3.0518e-02,
         -6.1816e-01, -1.6931e-01, -7.4524e-02,  2.0667e-01,  5.6183e-02,
         -3.1836e-01,  9.9792e-02, -4.4531e-01,  4.4769e-02, -7.2266e-02,
          1.9885e-01,  2.5806e-01,  2.3633e-01,  3.0371e-01,  5.6152e-01,
          1.4587e-02, -5.9113e-02,  2.0862e-01, -1.3329e-02,  1.6821e-01,
         -1.3354e-01,  4.0576e-01, -1.7822e-01, -4.1931e-02, -4.9634e-01,
          1.3867e-01, -5.5371e-01, -1.4771e-02, -2.5952e-01,  4.1162e-01,
         -1.6882e-01, -4.8294e-03,  5.4779e-02, -4.5135e-02,  2.3999e-01,
         -2.5513e-01,  6.6797e-01,  1.1755e-01, -2.2913e-01, -2.0422e-01,
         -2.6978e-01,  8.0948e-03, -7.3242e-02,  5.3857e-01, -2.8271e-01,
         -1.2421e-01,  3.6646e-01,  1.1982e+00, -3.8940e-01,  7.9468e-02,
          3.2104e-01,  3.5742e-01,  1.5039e-01, -1.0352e-01, -4.3311e-01,
         -5.2704e-02, -1.6223e-01,  4.5996e-01, -2.3291e-01, -1.8188e-01,
          2.3633e-01, -1.6370e-01, -2.2827e-01,  4.7455e-02, -1.7102e-01,
          1.2573e-01,  1.2451e-01,  3.2178e-01,  2.4951e-01,  3.5059e-01,
         -2.2144e-01, -4.7559e-01,  1.8713e-01, -2.1094e+00,  3.4204e-01,
         -1.8713e-01,  2.5684e-01,  7.1582e-01, -6.0400e-01, -6.5771e-01,
         -3.5303e-01,  5.9509e-02, -2.0532e-01,  6.5674e-01,  1.3870e-02,
          4.1840e-02,  4.3042e-01, -1.2042e-01,  5.4443e-01,  3.8306e-01,
         -5.6396e-01,  7.2449e-02,  2.0715e-01, -1.7383e-01,  8.1421e-02,
          4.5972e-01,  3.7744e-01,  6.2793e-01, -5.5908e-01,  3.6865e-01,
          5.1074e-01,  5.2393e-01, -1.0968e-01,  3.0688e-01,  5.9814e-02,
          8.3374e-02, -1.1902e-01,  4.8975e-01,  1.6174e-01, -1.0474e-01,
          2.8000e-02, -2.4719e-01, -2.6050e-01,  6.9702e-02, -3.2715e-01,
          5.2148e-01,  2.6587e-01,  1.3940e-01, -1.4453e-01, -1.8250e-01,
          1.7908e-01,  6.1989e-03,  2.8418e-01, -1.1188e-01, -3.5547e-01,
         -2.3669e-01, -3.0615e-01, -1.7371e-01, -4.2725e-02,  2.8976e-02,
         -3.0380e-02, -4.9927e-01, -9.2697e-03,  2.3743e-01,  1.2891e-01,
         -1.3115e-02, -2.5488e-01,  3.9380e-01, -6.0986e-01,  2.8305e-02,
         -1.8616e-01,  4.3506e-01, -1.0712e-01,  2.9443e-01, -3.0762e-01,
         -6.6064e-01, -2.1286e-02,  2.0911e-01,  2.4280e-01, -5.2551e-02,
         -1.3733e-01, -8.2336e-02,  5.1660e-01, -4.7705e-01,  4.5850e-01,
         -1.6211e-01, -4.0009e-02]], device='cuda:0', dtype=torch.float16)

Is there a way to pass these embeddings to any LLM to generate a caption?
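
One common way to approach this (used by ClipCap-style models, as far as I know) is to train a small mapping network that turns the CLIP vector into a short sequence of "prefix" embeddings in the language model's input space, and then feed those through the `inputs_embeds` argument instead of token ids. The sketch below is only an illustration under stated assumptions: `PrefixMapper` and `greedy_caption` are hypothetical helper names, the sizes assume a 512-dimensional CLIP embedding and GPT-2's 768-dimensional hidden size, `prefix_len=10` is an arbitrary choice, and `lm` is assumed to be a Hugging Face causal LM such as `GPT2LMHeadModel`.

```python
import torch
import torch.nn as nn


class PrefixMapper(nn.Module):
    """Hypothetical helper: maps one CLIP image embedding to `prefix_len`
    pseudo-token embeddings in the language model's input space."""

    def __init__(self, clip_dim=512, lm_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        hidden = (lm_dim * prefix_len) // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, lm_dim * prefix_len),
        )

    def forward(self, clip_embedding):
        # CLIP often returns float16 on GPU; cast to float32 for the mapper.
        out = self.mlp(clip_embedding.float())
        # (batch, clip_dim) -> (batch, prefix_len, lm_dim)
        return out.view(-1, self.prefix_len, self.lm_dim)


@torch.no_grad()
def greedy_caption(lm, tokenizer, prefix, max_new_tokens=20):
    """Greedy decoding that starts from prefix embeddings (batch size 1).

    Assumes `lm` accepts `inputs_embeds` and returns an object with `.logits`,
    as Hugging Face causal LMs such as GPT2LMHeadModel do.
    """
    embed = lm.get_input_embeddings()
    inputs = prefix
    token_ids = []
    for _ in range(max_new_tokens):
        logits = lm(inputs_embeds=inputs).logits
        next_id = logits[:, -1, :].argmax(dim=-1)  # shape: (1,)
        if next_id.item() == tokenizer.eos_token_id:
            break
        token_ids.append(next_id.item())
        # Append the embedding of the chosen token and continue.
        inputs = torch.cat([inputs, embed(next_id).unsqueeze(1)], dim=1)
    return tokenizer.decode(token_ids)


mapper = PrefixMapper()
image_embedding = torch.randn(1, 512)  # stand-in for the CLIP output above
prefix = mapper(image_embedding)       # shape: (1, 10, 768)
```

With the `transformers` library installed, `greedy_caption(lm, tokenizer, prefix)` could be called with `GPT2LMHeadModel.from_pretrained("gpt2")` and `GPT2Tokenizer.from_pretrained("gpt2")`, after moving the mapper, the language model, and the embedding to the same device. Note that an untrained mapper will produce meaningless text: the mapper (and optionally the language model) first has to be trained on image-caption pairs, with the usual cross-entropy loss on the caption tokens that follow the prefix.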
