Caption images


Predict image captions using pretrained deep learning models.

In its current form, the step predicts image captions using ClipClap. ClipClap first embeds images using the Clip model, which has been pre-trained on 400M image/text pairs to pick out an image's correct caption from a list of candidates. These image embeddings are then projected into the embedding space of the GPT-2 language model, using a custom model trained for the task. Finally, using this projection as a prefix, the pretrained GPT-2 is asked to predict the next sentence, i.e. the one following the image.
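The projection stage can be pictured with a toy sketch. Everything in it is an assumption made for illustration: the dimensions are tiny, the weights are random rather than trained, a single linear layer stands in for the real projection model, and the names `project_to_prefix`, `CLIP_DIM`, `GPT_DIM` and `PREFIX_LEN` are hypothetical. It only shows the shape of the idea: one image embedding goes in, a short sequence of language-model-sized vectors (the prefix) comes out.

```python
import random

CLIP_DIM = 8     # assumption: toy size of one image embedding
GPT_DIM = 4      # assumption: toy size of one language-model token embedding
PREFIX_LEN = 3   # assumption: number of prefix vectors handed to the language model


def make_linear(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) standing in for trained parameters."""
    return [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]


def project_to_prefix(image_embedding, weights):
    """Map one image embedding to PREFIX_LEN vectors of size GPT_DIM."""
    # One matrix-vector product yields a flat vector of PREFIX_LEN * GPT_DIM values.
    flat = [sum(w * x for w, x in zip(row, image_embedding)) for row in weights]
    # Cut the flat vector into PREFIX_LEN consecutive chunks of GPT_DIM values.
    return [flat[i * GPT_DIM:(i + 1) * GPT_DIM] for i in range(PREFIX_LEN)]


rng = random.Random(0)
image_embedding = [rng.uniform(-1, 1) for _ in range(CLIP_DIM)]  # stand-in for an image embedding
weights = make_linear(CLIP_DIM, PREFIX_LEN * GPT_DIM, rng)
prefix = project_to_prefix(image_embedding, weights)

print(len(prefix), len(prefix[0]))  # prints "3 4"
```

In the real step the language model would then continue generating text from such a prefix; that part is not modelled here.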


The following are the step's expected inputs and outputs and their specific types.

Step signature
caption_images(images: url, {
    "param": value
}) -> (caption: text)

where the object {"param": value} is optional in most cases and if present may contain any of the parameters described in the corresponding section below.
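As a rough illustration of how the optional parameter object behaves, the following Python sketch fills in the documented defaults and rejects values outside the documented options. It is not part of the step or of the recipe editor; the helper name `resolve_params` is hypothetical, and the error handling is an assumption about what sensible validation might look like.

```python
# Allowed values and defaults as listed in the parameter descriptions of this page.
ALLOWED = {
    "model": {"TRF", "MLP"},
    "weights": {"coco", "concept"},
}
DEFAULTS = {"model": "TRF", "beam_search": False}
KNOWN = {"model", "weights", "beam_search"}


def resolve_params(params=None):
    """Return the parameters with defaults filled in, or raise ValueError."""
    resolved = dict(DEFAULTS)
    for name, value in (params or {}).items():
        if name not in KNOWN:
            raise ValueError(f"unknown parameter: {name!r}")
        if name in ALLOWED and value not in ALLOWED[name]:
            raise ValueError(f"{name} must be one of {sorted(ALLOWED[name])}")
        if name == "beam_search" and not isinstance(value, bool):
            raise ValueError("beam_search must be a boolean")
        resolved[name] = value
    return resolved


print(resolve_params())                   # {'model': 'TRF', 'beam_search': False}
print(resolve_params({"model": "MLP"}))   # {'model': 'MLP', 'beam_search': False}
```

Since no default is documented for `weights`, the sketch simply leaves it out unless it is given.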


The step has no required parameters, so the simplest call is

Example call (in recipe editor)
caption_images(ds.image_url) -> (ds.caption)


images: column:url

A column of URLs to images to predict captions for.


caption: column:text

A column containing the predicted caption for each input image.

model: string = "TRF"

Which projection model to use. The projection model maps embeddings from the pretrained Clip image model to the embedding space of the pretrained GPT-2 language model. Choose between a multi-layer perceptron ("MLP") and the faster transformer ("TRF").

Must be one of: "TRF", "MLP"

weights: string

Select the parameter set for the model. The ClipClap authors provide weights for models trained on either the COCO dataset ("coco") or the Conceptual Captions dataset ("concept").

Must be one of: "coco", "concept"

beam_search: boolean = False

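Whether to use beam search instead of greedy word prediction. Greedy prediction always picks the single most likely next word. When beam search is enabled, the step uses a more expensive but "smarter" algorithm that keeps several candidate captions in play while predicting the words.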
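The difference between the two strategies can be seen in a toy sketch. The probability table below is invented for illustration and has nothing to do with the real model; the function names `greedy` and `beam_search` and the `beam_width` argument are likewise hypothetical and not part of the step. The point is only that picking the best word at each position can miss the best caption overall.

```python
import math

# Hypothetical next-word probabilities, keyed by the words generated so far.
NEXT_WORD_PROBS = {
    (): {"a": 0.6, "two": 0.4},
    ("a",): {"dog": 0.4, "cat": 0.3, "bird": 0.3},
    ("two",): {"dogs": 0.9, "cats": 0.1},
}


def greedy(max_len=2):
    """Always take the single most likely next word."""
    seq, logp = (), 0.0
    for _ in range(max_len):
        probs = NEXT_WORD_PROBS[seq]
        word = max(probs, key=probs.get)
        logp += math.log(probs[word])
        seq += (word,)
    return seq, math.exp(logp)


def beam_search(beam_width=2, max_len=2):
    """Keep the beam_width most likely partial captions at every position."""
    beams = [((), 0.0)]  # (words so far, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            for word, p in NEXT_WORD_PROBS[seq].items():
                candidates.append((seq + (word,), logp + math.log(p)))
        # Sort all extensions by score and keep only the best beam_width of them.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    best_seq, best_logp = beams[0]
    return best_seq, math.exp(best_logp)


g_seq, g_p = greedy()
b_seq, b_p = beam_search()
print(g_seq, round(g_p, 2))  # ('a', 'dog') 0.24
print(b_seq, round(b_p, 2))  # ('two', 'dogs') 0.36
```

Greedy prediction commits to "a" because it is the likeliest first word, while beam search also keeps "two" alive and ends up with the more probable two-word sequence. The extra cost comes from scoring several candidates at every position.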