> ## Documentation Index
>
> Fetch the complete documentation index at: https://docs.graphext.com/llms.txt
> Use this file to discover all available pages before exploring further.

# caption_images

> Predict image captions using pretrained DL models.

In its current form the step predicts image captions using [ClipClap](https://github.com/rmokady/CLIP_prefix_caption). ClipClap first embeds images using the [CLIP](https://huggingface.co/docs/transformers/model_doc/clip) model, which has been pre-trained on 400M image/text pairs to pick out an image's correct caption from a list of candidates. These image embeddings are then projected into the embedding space of the [GPT-2](https://huggingface.co/gpt2) language model, using a custom model trained for the task. Finally, using this projection as a prefix, the pretrained GPT-2 is asked to predict the next sentence, i.e. the one following the image.

## Usage

The following example shows how the step can be used in a recipe.

The step has no required parameters, so the simplest call is

```stan theme={null}
caption_images(ds.image_url) -> (ds.caption)
```

The general syntax for using the step in a recipe is shown below, together with the inputs the step expects and the outputs it produces. For further details, see the sections below.

```stan theme={null}
caption_images(images: url, {
    "param": value,
    ...
}) -> (caption: text)
```

## Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (`ds.first_name`), datasets (`ds` or `ds[["first_name", "last_name"]]`) or models (referenced by name e.g. `"churn-clf"`).

A column of URLs to images to predict captions for.

## Configuration

The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last "input" to the step, i.e. `step(..., {"param": "value", ...}) -> (output)`.

Which projection model to use.
The projection model maps embeddings from the pretrained CLIP image model to the pretrained GPT-2 language model. Select between a multi-layer perceptron ("MLP") or the faster transformer ("TRF").

Values must be one of the following:

* `TRF`
* `MLP`

Select the parameter set for the model.
The ClipClap authors provide weights for models trained on either the [COCO dataset](https://cocodataset.org/#home) ("coco") or the [ConceptualCaptions](https://ai.google.com/research/ConceptualCaptions/) dataset ("concept").

Values must be one of the following:

* `coco`
* `concept`

Whether to use beam search or greedy word prediction.
When enabled, the step uses beam search, a more expensive but "smarter" algorithm, to predict the words in the captions. When disabled, it uses greedy prediction.
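
To illustrate the difference between greedy prediction and beam search, here is a minimal Python sketch. It is not the step's actual implementation: `TOY_MODEL` is a hypothetical table of next-word probabilities standing in for GPT-2, and the function names and the beam width of 2 are assumptions made for the example. Greedy prediction always takes the most likely next word, while beam search keeps several partial captions alive and returns the one with the highest overall probability.

```python
import math

# Hypothetical toy "language model": next-word probabilities given the previous word.
# This stands in for GPT-2 purely for illustration.
TOY_MODEL = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.4, "cat": 0.3, "bird": 0.3},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
    "bird": {"</s>": 1.0},
}


def greedy_decode(model, start="<s>", end="</s>", max_len=10):
    """Pick the single most likely next word at every step."""
    tokens, score = [start], 0.0
    while tokens[-1] != end and len(tokens) < max_len:
        token, prob = max(model[tokens[-1]].items(), key=lambda kv: kv[1])
        tokens.append(token)
        score += math.log(prob)
    return tokens, score


def beam_search_decode(model, beam_width=2, start="<s>", end="</s>", max_len=10):
    """Keep the `beam_width` best partial sequences by cumulative log-probability."""
    beams = [([start], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end:
                # Finished sequences are carried over unchanged.
                candidates.append((tokens, score))
                continue
            for token, prob in model[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(tokens[-1] == end for tokens, _ in beams):
            break
    return beams[0]


greedy_tokens, greedy_score = greedy_decode(TOY_MODEL)
beam_tokens, beam_score = beam_search_decode(TOY_MODEL, beam_width=2)

print(" ".join(greedy_tokens), round(math.exp(greedy_score), 2))
print(" ".join(beam_tokens), round(math.exp(beam_score), 2))
```

In this toy table, greedy prediction commits to "a" because it is the most likely first word, and ends up with a sequence of probability 0.24. Beam search also keeps "the" as a candidate and finds the sequence with probability 0.36. The extra cost comes from expanding several candidates at every step.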