
# caption_images

> Predict image captions using pretrained deep learning models.

In its current form, the step predicts image captions using [ClipClap](https://github.com/rmokady/CLIP_prefix_caption).
ClipClap first embeds images using the [Clip](https://huggingface.co/docs/transformers/model_doc/clip) model,
which has been pre-trained on 400M image/text pairs to pick out an image's correct caption from a list of candidates. These
image embeddings are then projected into the embedding space of the [GPT-2](https://huggingface.co/gpt2) language model, using a
custom model trained for the task. Finally, using this projection as a prefix, the pretrained GPT-2 is asked to predict the
next sentence, i.e. the caption that follows the image prefix.

## Usage

The following example shows how the step can be used in a recipe.

<Accordion title="Examples" icon="code" defaultOpen="true">
  <Tabs>
    <Tab title="Example 1">
      The step has no required parameters, so the simplest call is:

      ```stan theme={null}
      caption_images(ds.image_url) -> (ds.caption)
      ```
    </Tab>

    <Tab title="Signature">
      General syntax for using the step in a recipe. Shows the inputs the step expects and the outputs it produces. For further details see the sections below.

      ```stan theme={null}
      caption_images(images: url, {
          "param": value,
          ...
      }) -> (caption: text)
      ```
    </Tab>
  </Tabs>
</Accordion>

## Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally
columns (`ds.first_name`), datasets (`ds` or `ds[["first_name", "last_name"]]`) or models (referenced
by name e.g. `"churn-clf"`).

<Accordion title="Inputs" icon="right-to-bracket">
  <ParamField path="images" type="column[url]" required>
    A column of URLs to images to predict captions for.
  </ParamField>
</Accordion>

<Accordion title="Outputs" icon="right-from-bracket">
  <ParamField path="caption" type="column[text]" required />
</Accordion>

## Configuration

The following parameters can be used to configure the behaviour of the step by including them in
a json object as the last "input" to the step, i.e. `step(..., {"param": "value", ...}) -> (output)`.

<Accordion title="Parameters" defaultOpen="true" icon="sliders">
  <ParamField path="model" type="string" default="TRF">
    Which projection model to use.
    The projection model maps embeddings from the pretrained Clip image model into the embedding space
    of the pretrained GPT-2 language model. Select between a multi-layer perceptron ("MLP") or the
    faster transformer ("TRF").

    Values must be one of the following:

    * `TRF`
    * `MLP`
  </ParamField>

  <ParamField path="weights" type="string">
    Select the pretrained weights to load for the model.
    The ClipClap authors provide weights for models trained on either the
    [COCO dataset](https://cocodataset.org/#home) ("coco") or the [ConceptualCaptions](https://ai.google.com/research/ConceptualCaptions/)
    dataset ("concept").

    Values must be one of the following:

    * `coco`
    * `concept`
  </ParamField>

  <ParamField path="beam_search" type="boolean" default="false">
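    Whether to use beam search instead of greedy word prediction.
    When enabled, the step uses beam search, a more expensive but "smarter" algorithm that keeps several
    candidate word sequences at each step rather than only the single most likely next word.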
  </ParamField>
</Accordion>
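
As an illustration of how these parameters combine, the following sketch selects the MLP projection model with the COCO weights and enables beam search. The column names `ds.image_url` and `ds.caption` are placeholders carried over from the usage example above, and the parameter values are just one combination of the documented options, not a recommendation.

```stan theme={null}
caption_images(ds.image_url, {
    "model": "MLP",
    "weights": "coco",
    "beam_search": true
}) -> (ds.caption)
```

Any parameter left out of the JSON object falls back to its default, so a call that only sets `"beam_search": true` would still use the default `TRF` projection model.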
