In its current form the step predicts image captions using ClipCap. ClipCap first embeds images using the CLIP model, which has been pre-trained on 400M image/text pairs to pick out an image's correct caption from a list of candidates. These image embeddings are then projected into the embedding space of the GPT-2 language model, using a custom mapping model trained for the task. Finally, using this projection as a prefix, the pre-trained GPT-2 is asked to predict the text that follows it, i.e. a caption for the image.

Usage

The following example shows how the step can be used in a recipe.
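A minimal sketch of such a recipe, assuming the step is exposed as infer_caption and that the dataset has a column of image links named image_url (both names are illustrative, not taken from this section):

infer_caption(ds.image_url) -> (ds.caption)

Here ds.caption would receive one predicted caption per image referenced in ds.image_url.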

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds, or a selection of columns such as ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
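Purely for illustration (the step and column names below are hypothetical), the three kinds of references would look as follows:

step(ds.image_url) -> (ds.caption)
step(ds[["image_url", "title"]]) -> (ds.caption)
step(ds.text, "churn-clf") -> (ds.prediction)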

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last "input" to the step, i.e. step(..., {"param": "value", ...}) -> (output).
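As a sketch only, using a hypothetical use_beam_search parameter (the actual parameter names are not listed in this section), a configured call could look like this:

infer_caption(ds.image_url, {"use_beam_search": true}) -> (ds.caption)

ClipCap itself can decode captions either greedily or with beam search, which is the kind of behaviour such a parameter would typically control.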