Vision LLM Component (legacy; not recommended for new projects)
Use the Vision LLM component to analyze and interpret images. This component allows your agent to "see" and understand visual content, answering questions or performing tasks based on what it detects in the provided images.
Why this matters
Many agent workflows involve visual content such as screenshots, product photos, or scanned documents. This component lets your agent answer questions about those images directly instead of relying on text descriptions alone.
What You’ll Configure
- Model Selection
- Define Inputs
- Advanced Settings
- Handle the Output
- Best Practices
- Troubleshooting Tips
- What to Try Next
Step 1: Select a Model
Choose the vision-capable model that will analyze the images.
| Field | Description |
| --- | --- |
| Model | Select from available models like OpenAI (GPT-4 Vision). You can add keys for other providers like Claude in the Vault. |
Step 2: Define Inputs (Prompt and Images)
You need to provide both a text prompt and at least one image.
| Input | Required? | Description |
| --- | --- | --- |
| Prompt | Yes | The question or instruction for the AI (e.g., "What's in this image?"). Can include dynamic variables. |
| Images | Yes | Accepts one or more images, each as a URL or a Base64-encoded string. Provide multiple images as an array. |
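As a concrete sketch, the Python below assembles these two inputs, Base64-encoding any local files. The `build_vision_inputs` helper and the `Prompt`/`Images` keys mirror the table above but are illustrative, not a fixed platform contract:

```python
import base64

def build_vision_inputs(prompt: str, image_sources: list[str]) -> dict:
    """Assemble the Prompt and Images inputs for the Vision LLM component.

    Each entry in image_sources may be a public URL or a local file path;
    local files are Base64-encoded. The dict keys mirror the input names
    above and are illustrative, not an exact API contract.
    """
    images = []
    for src in image_sources:
        if src.startswith(("http://", "https://")):
            images.append(src)  # pass URLs through unchanged
        else:
            with open(src, "rb") as f:
                images.append(base64.b64encode(f.read()).decode("ascii"))
    return {"Prompt": prompt, "Images": images}

inputs = build_vision_inputs(
    "What's in this image?",
    ["https://example.com/photo.jpg", "local_scan.png"],
)
```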
Step 3: Configure Advanced Settings
Optionally limit the length of the model's response.
| Setting | Description |
| --- | --- |
| Max Output Tokens | Limits the maximum length of the generated text response. This is useful for keeping replies concise. |
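The component makes the provider call for you, but it can help to see where this setting lands. Below is a minimal direct call to OpenAI's Chat Completions API, where `max_tokens` plays the same role as Max Output Tokens (the model name and image URL are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
    max_tokens=150,  # corresponds to the Max Output Tokens setting
)
print(response.choices[0].message.content)
```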
Step 4: Handle the Output
The component produces a single output containing the text response from the model.
| Output | Description | Data Structure |
| --- | --- | --- |
| Reply | The text response generated by the model based on its analysis of the image(s) and your prompt. | String |
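Because Reply is a plain string, downstream steps can use it directly. A small guard like this sketch catches empty replies early; the commented `run_vision_llm` call is a hypothetical stand-in for however your platform invokes the component:

```python
def handle_reply(reply: str) -> str:
    """Validate the Reply output before passing it downstream."""
    if not reply or not reply.strip():
        raise ValueError("Vision LLM returned an empty reply; check inputs and API keys.")
    return reply.strip()

# reply = run_vision_llm(inputs)   # hypothetical component invocation
# description = handle_reply(reply)
```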
Best Practices
- Ask Specific Questions: Instead of a generic "What's in this image?", ask specific questions like "How many people are in this photo?" or "What color is the car?" for more accurate results.
- Provide High-Quality Images: Clear, high-resolution images will yield much better analysis than blurry or low-quality ones.
- Use for OCR: This component is excellent for Optical Character Recognition (OCR). Provide an image of text and ask the model to "Extract all the text from this image."
- Handle Multiple Images: When providing an array of images, make sure your prompt is clear about how to treat them (e.g., "Compare the two attached images" or "Describe each image in this series."), as shown in the sketch after this list.
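To make the multi-image case concrete, here is how two images and an explicit comparison prompt travel together in one OpenAI request; the component builds an equivalent request from its Images array (URLs are placeholders):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare the two attached images and list the differences."},
            {"type": "image_url", "image_url": {"url": "https://example.com/before.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/after.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```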
Troubleshooting Tips
If your vision analysis fails, work through these checks first:
- Confirm the API key for your selected provider is saved in the Vault.
- Make sure image URLs are publicly accessible and point directly to an image file, not a web page.
- Check that Base64 strings are complete and valid; truncated strings are rejected.
- Verify that the selected model supports vision input.
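A quick script like this Python sketch (using the `requests` library; function names are illustrative) can rule out bad image inputs before you re-run the agent:

```python
import base64
import requests

def check_image_url(url: str) -> bool:
    """Confirm the URL is reachable and actually serves an image."""
    resp = requests.head(url, allow_redirects=True, timeout=10)
    content_type = resp.headers.get("Content-Type", "")
    return resp.ok and content_type.startswith("image/")

def check_base64_image(data: str) -> bool:
    """Confirm the string is valid Base64 (strip any data-URL prefix first)."""
    if data.startswith("data:"):
        data = data.split(",", 1)[1]
    try:
        base64.b64decode(data, validate=True)
        return True
    except ValueError:
        return False
```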
What to Try Next
- Combine this with a GenAI LLM: use the Vision LLM to get a description, then use a GenAI LLM to write a story or an advertisement based on that description (see the sketch below).
- Create an Agent Skill that allows a user to upload an image and ask questions about it, using this component as the core engine.
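As a rough sketch of that first idea, the chain below approximates both components with direct OpenAI calls; `describe_image` and `write_ad` are illustrative names, and in practice each step would be its own component in your agent:

```python
from openai import OpenAI

client = OpenAI()

def describe_image(image_url: str) -> str:
    """Step 1 (Vision LLM): get a detailed description of the image."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content

def write_ad(description: str) -> str:
    """Step 2 (GenAI LLM): turn the description into ad copy."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Write a short advertisement based on this description:\n{description}",
        }],
    )
    return resp.choices[0].message.content

print(write_ad(describe_image("https://example.com/product.jpg")))
```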