caption_images
Predict image captions using pretrained DL models.
In its current form, the step predicts image captions using ClipCap. ClipCap first embeds images using the CLIP model, which has been pre-trained on 400M image/text pairs to pick out an image's correct caption from a list of candidates. The resulting image embeddings are then projected into the embedding space of the GPT-2 language model, using a mapping network trained specifically for this task. Finally, using this projection as a prefix, the pretrained GPT-2 model generates the caption as the text that follows the prefix.
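The sketch below illustrates the data flow just described, using Hugging Face CLIP and GPT-2 models. The `ClipCapMapper` mapping network, its dimensions, and the prefix length are illustrative placeholders rather than the step's actual implementation; without the pretrained mapping weights the generated text will not be a meaningful caption.

```python
# Minimal ClipCap-style captioning sketch (assumed component names; not the step's real code).
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

PREFIX_LENGTH = 10  # number of GPT-2 "prefix" tokens produced from one image


class ClipCapMapper(nn.Module):
    """Projects a CLIP image embedding into a sequence of GPT-2 token embeddings."""

    def __init__(self, clip_dim: int = 512, gpt2_dim: int = 768, prefix_length: int = PREFIX_LENGTH):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt2_dim = gpt2_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt2_dim * prefix_length // 2),
            nn.Tanh(),
            nn.Linear(gpt2_dim * prefix_length // 2, gpt2_dim * prefix_length),
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        # (batch, clip_dim) -> (batch, prefix_length, gpt2_dim)
        return self.mlp(clip_embedding).view(-1, self.prefix_length, self.gpt2_dim)


@torch.no_grad()
def caption_image(image: Image.Image, max_tokens: int = 30) -> str:
    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    mapper = ClipCapMapper()  # in practice, load pretrained mapping-network weights here

    # 1. Embed the image with CLIP.
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    image_embedding = clip.get_image_features(pixel_values=pixel_values)

    # 2. Project the CLIP embedding into GPT-2's embedding space as a prefix.
    prefix_embeds = mapper(image_embedding)

    # 3. Greedily decode the caption token by token, conditioned on the prefix.
    generated = prefix_embeds
    token_ids = []
    for _ in range(max_tokens):
        logits = gpt2(inputs_embeds=generated).logits[:, -1, :]
        next_id = logits.argmax(dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
        token_ids.append(next_id.item())
        next_embed = gpt2.transformer.wte(next_id).unsqueeze(1)
        generated = torch.cat([generated, next_embed], dim=1)

    return tokenizer.decode(token_ids, skip_special_tokens=True)
```

The key design point is step 2: because the image is mapped into GPT-2's own embedding space, the language model can be kept frozen and simply treats the projected image as a prompt it continues in natural language.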