
Vision LLM Component (Legacy)

This is a legacy component, and its use is not recommended.

Use the Vision LLM component to analyze and interpret images. This component allows your agent to "see" and understand visual content, answering questions or performing tasks based on what it detects in the provided images.

Why this matters

An agent that can see is fundamentally more capable. The Vision LLM unlocks powerful use cases, from reading text in a photo and describing a product to identifying objects in a scene, making your agent more aware and interactive.

What You’ll Configure

Step 1: Select a Model

Choose the vision-capable model that will analyze the images.

| Field | Description |
| --- | --- |
| Model | Select from available models such as OpenAI (GPT-4 Vision). You can add keys for other providers, like Claude, in the Vault. |
API Keys: By default, some models are available. To use the full range of models or additional providers, you may need to add your own API key in the Vault.

Step 2: Define Inputs (Prompt and Images)

You need to provide both a text prompt and at least one image.

| Input | Required? | Description |
| --- | --- | --- |
| Prompt | Yes | The question or instruction for the AI (e.g., "What's in this image?"). Can include dynamic variables. |
| Images | Yes | Accepts one or more images via URL or as a Base64-encoded string. Provide multiple images as an array. |
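
For reference, here is a minimal Python sketch of preparing these two inputs: a text prompt plus an images array that mixes a public URL with a Base64-encoded local file. The helper name, file path, and URL are placeholders, and the exact field names your flow expects may differ.

```python
import base64
from pathlib import Path

def to_data_uri(path: str) -> str:
    """Read a local image and return it as a Base64 data URI."""
    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return f"data:image/jpeg;base64,{encoded}"

# Inputs for the Vision LLM component (names are illustrative).
prompt = "Compare the two attached images and list the differences."
images = [
    "https://example.com/product-photo.jpg",  # image by URL
    to_data_uri("local/receipt.jpg"),         # image as a Base64 string
]
```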

Step 3: Configure Advanced Settings

Fine-tune the model's output to control its length.

| Setting | Description |
| --- | --- |
| Max Output Tokens | Limits the maximum length of the generated text response. This is useful for keeping replies concise. |

Step 4: Handle the Output

The component produces a single output containing the text response from the model.

| Output | Description | Data Structure |
| --- | --- | --- |
| Reply | The text response generated by the model based on its analysis of the image(s) and your prompt. | String |
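
To see what this looks like end to end, the sketch below makes a roughly equivalent request directly with the OpenAI Python SDK (an assumption used for illustration, not the component's actual implementation). It sends a prompt and an image, caps the response with max_tokens (the Max Output Tokens setting), and reads the Reply as a plain string; the model name and image URL are examples only.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.chat.completions.create(
    model="gpt-4o",   # any vision-capable model
    max_tokens=300,   # comparable to the Max Output Tokens setting
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)

reply = response.choices[0].message.content  # the Reply output (String)
print(reply)
```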

Best Practices

  • Ask Specific Questions: Instead of a generic "What's in this image?", ask specific questions like "How many people are in this photo?" or "What color is the car?" for more accurate results.
  • Provide High-Quality Images: Clear, high-resolution images will yield much better analysis than blurry or low-quality ones.
  • Use for OCR: This component is excellent for Optical Character Recognition (OCR). Provide an image of text and ask the model to "Extract all the text from this image."
  • Handle Multiple Images: When providing an array of images, make sure your prompt is clear about how to treat them (e.g., "Compare the two attached images" or "Describe each image in this series"), as in the sketch below.
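
Putting the last two tips together, here is a sketch of a multi-image request, again using the OpenAI Python SDK as a stand-in for the component: two image parts travel with a single prompt that states how to treat them. Swap in a single image and the prompt "Extract all the text from this image" for the OCR use case. The URLs are placeholders.

```python
from openai import OpenAI

client = OpenAI()

# Two images in one request, with a prompt that says how to treat them.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Compare the two attached images and describe what changed."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/before.jpg"}},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/after.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```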

Troubleshooting Tips

If your vision analysis fails...
  • No Reply is generated: Check that your model is selected and your API key (if required) is valid in the Vault. Also, ensure the image input is receiving a valid image URL or Base64 string.
  • The analysis is inaccurate: Your prompt might be too vague, or the image quality may be too low for the model to interpret correctly. Try a more specific prompt or a clearer image.
  • Error with multiple images: Make sure you are passing the images as a properly formatted array.

What to Try Next

  • Combine this with a GenAI LLM. Use the Vision LLM to get a description, then use the GenAI LLM to write a story or an advertisement based on that description (sketched below).
  • Create an Agent Skill that allows a user to upload an image and ask questions about it, using this component as the core engine.
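
As a rough illustration of the first suggestion, this sketch chains two calls: a vision request that produces a description, then a plain text-generation request that turns the description into an advertisement. In an actual agent, the Vision LLM and GenAI LLM components would replace these direct SDK calls; the model name and URL are assumptions.

```python
from openai import OpenAI

client = OpenAI()

# Step 1: Vision LLM — describe the image.
vision = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this product in two sentences."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/product.jpg"}},
        ],
    }],
)
description = vision.choices[0].message.content

# Step 2: GenAI LLM — write an advertisement from the description.
ad = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"Write a short, upbeat advertisement based on this description:\n{description}",
    }],
)
print(ad.choices[0].message.content)
```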