How can we apply masked language modelling (MLM) to paired text and image inputs using multimodal models like LXMERT? For example, given the text "This is a [MASK]" with one word masked out, and an image (say, of a cat), how can the model be made to predict the masked word as "cat"? How can I implement this and get MLM scores out of it using the HuggingFace Transformers API? A code snippet illustrating this would be great and would help my understanding.
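Here is a rough sketch of what I have in mind, assuming the image has already been turned into region features with an external Faster R-CNN detector (as in the original LXMERT setup — HuggingFace does not extract these for you). The random `visual_feats`/`visual_pos` tensors below are only placeholders showing the expected shapes; with real detector features, would reading the MLM scores off `prediction_logits` like this be the right approach?

```python
import torch
from transformers import LxmertTokenizer, LxmertForPreTraining

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertForPreTraining.from_pretrained("unc-nlp/lxmert-base-uncased")
model.eval()

# Placeholder visual inputs: in practice these come from a Faster R-CNN
# feature extractor (36 region features of dim 2048 + normalized box coords).
num_boxes = 36
visual_feats = torch.randn(1, num_boxes, 2048)  # (batch, boxes, feat_dim)
visual_pos = torch.rand(1, num_boxes, 4)        # (batch, boxes, x1/y1/x2/y2)

text = "This is a [MASK]"
inputs = tokenizer(text, return_tensors="pt")
mask_index = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    outputs = model(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        visual_feats=visual_feats,
        visual_pos=visual_pos,
    )

# prediction_logits has shape (batch, seq_len, vocab_size): the MLM head scores.
mlm_logits = outputs.prediction_logits[0, mask_index]
probs = mlm_logits.softmax(dim=-1)
top_scores, top_ids = probs.topk(5, dim=-1)
for ids, scores in zip(top_ids, top_scores):
    print([(tokenizer.decode([i]), round(s.item(), 4)) for i, s in zip(ids, scores)])
```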
