caption_images
Predict image captions using pretrained DL models.
In its current form, the step predicts image captions using ClipCap. ClipCap first embeds images using the CLIP model, which has been pre-trained on 400M image/text pairs to pick out an image's correct caption from a list of candidates. The resulting image embeddings are then projected into the embedding space of the GPT-2 language model, using a mapping network trained specifically for this task. Finally, using this projection as a prefix, the pretrained GPT-2 model generates the caption as the text that follows the prefix.
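The sketch below illustrates the data flow just described, using Hugging Face CLIP and GPT-2 models. The `ClipCapMapper` mapping network, its dimensions, and the prefix length are illustrative placeholders rather than the step's actual implementation; without the pretrained mapping weights the generated text will not be a meaningful caption.

```python
# Minimal ClipCap-style captioning sketch (assumed component names; not the step's real code).
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

PREFIX_LENGTH = 10  # number of GPT-2 "prefix" tokens produced from one image


class ClipCapMapper(nn.Module):
    """Projects a CLIP image embedding into a sequence of GPT-2 token embeddings."""

    def __init__(self, clip_dim: int = 512, gpt2_dim: int = 768, prefix_length: int = PREFIX_LENGTH):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt2_dim = gpt2_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt2_dim * prefix_length // 2),
            nn.Tanh(),
            nn.Linear(gpt2_dim * prefix_length // 2, gpt2_dim * prefix_length),
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        # (batch, clip_dim) -> (batch, prefix_length, gpt2_dim)
        return self.mlp(clip_embedding).view(-1, self.prefix_length, self.gpt2_dim)


@torch.no_grad()
def caption_image(image: Image.Image, max_tokens: int = 30) -> str:
    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    mapper = ClipCapMapper()  # in practice, load pretrained mapping-network weights here

    # 1. Embed the image with CLIP.
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    image_embedding = clip.get_image_features(pixel_values=pixel_values)

    # 2. Project the CLIP embedding into GPT-2's embedding space as a prefix.
    prefix_embeds = mapper(image_embedding)

    # 3. Greedily decode the caption token by token, conditioned on the prefix.
    generated = prefix_embeds
    token_ids = []
    for _ in range(max_tokens):
        logits = gpt2(inputs_embeds=generated).logits[:, -1, :]
        next_id = logits.argmax(dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
        token_ids.append(next_id.item())
        next_embed = gpt2.transformer.wte(next_id).unsqueeze(1)
        generated = torch.cat([generated, next_embed], dim=1)

    return tokenizer.decode(token_ids, skip_special_tokens=True)
```

The key design point is step 2: because the image is mapped into GPT-2's own embedding space, the language model can be kept frozen and simply treats the projected image as a prompt it continues in natural language.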