Multimodal Support

The Toolpack SDK supports multimodal input (text plus images) across all vision-capable providers. You can send images alongside text prompts to models such as GPT-4o, Claude Sonnet, Gemini Pro Vision, and LLaVA.

Image Input Formats

Images can be provided in three formats:

1. Local File Path

import { Toolpack } from 'toolpack-sdk';

const toolpack = await Toolpack.init({ provider: 'openai' });

const response = await toolpack.generate({
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'What is in this image?' },
      {
        type: 'image_file',
        image_file: {
          path: '/path/to/image.png',
          detail: 'high' // 'auto' | 'low' | 'high'
        }
      }
    ]
  }],
  model: 'gpt-4o',
});

2. Base64 Data

const response = await toolpack.generate({
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Describe this diagram' },
      {
        type: 'image_data',
        image_data: {
          data: 'iVBORw0KGgo...', // base64 string
          mimeType: 'image/png',
          detail: 'auto'
        }
      }
    ]
  }],
  model: 'gpt-4o',
});

3. HTTP URL

const response = await toolpack.generate({
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'What breed is this dog?' },
      {
        type: 'image_url',
        image_url: {
          url: 'https://example.com/dog.jpg',
          detail: 'low'
        }
      }
    ]
  }],
  model: 'gpt-4o',
});

Multiple Images

Send multiple images in a single request:

const response = await toolpack.generate({
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Compare these two images' },
      { type: 'image_file', image_file: { path: './image1.png' } },
      { type: 'image_file', image_file: { path: './image2.png' } }
    ]
  }],
  model: 'gpt-4o',
});

Provider Behavior

Different providers handle image inputs differently. The SDK normalizes this automatically:

| Provider  | File Path           | Base64   | URL                 |
|-----------|---------------------|----------|---------------------|
| OpenAI    | Converted to base64 | ✓ Native | ✓ Native            |
| Anthropic | Converted to base64 | ✓ Native | Downloaded → base64 |
| Gemini    | Converted to base64 | ✓ Native | Downloaded → base64 |
| Ollama    | Converted to base64 | ✓ Native | Downloaded → base64 |

Notes

  • File paths are always read and converted to base64 before sending
  • URLs are passed directly to OpenAI, but downloaded and converted for other providers
  • Detail level controls image resolution/token usage (OpenAI-specific, ignored by others)

TypeScript Types

import { ImageFilePart, ImageDataPart, ImageUrlPart } from 'toolpack-sdk';

const filePart: ImageFilePart = {
  type: 'image_file',
  image_file: { path: '/path/to/image.png', detail: 'high' }
};

const dataPart: ImageDataPart = {
  type: 'image_data',
  image_data: { data: 'base64...', mimeType: 'image/png', detail: 'auto' }
};

const urlPart: ImageUrlPart = {
  type: 'image_url',
  image_url: { url: 'https://example.com/image.png', detail: 'low' }
};

Streaming with Images

Multimodal requests work with streaming too:

const stream = toolpack.stream({
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Describe this image in detail' },
      { type: 'image_file', image_file: { path: './photo.jpg' } }
    ]
  }],
  model: 'gpt-4o',
});

for await (const chunk of stream) {
  process.stdout.write(chunk.delta);
}

Vision-Capable Models

Not all models support vision. Check the model's capabilities:

const providers = await toolpack.listProviders();
for (const provider of providers) {
  for (const model of provider.models) {
    if (model.capabilities.vision) {
      console.log(`${model.id} supports vision`);
    }
  }
}

Common Vision Models

| Provider  | Models                                                                     |
|-----------|----------------------------------------------------------------------------|
| OpenAI    | gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-4-vision-preview                     |
| Anthropic | claude-sonnet-4-*, claude-3-5-sonnet-*, claude-3-opus-*, claude-3-haiku-*  |
| Gemini    | gemini-1.5-pro, gemini-1.5-flash, gemini-2.0-flash                         |
| Ollama    | llava, llava-llama3, bakllava, and other vision models                     |
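The capability check above can be factored into a small helper that collects every vision-capable model ID up front, so you can fail fast before attaching images. This is a sketch: the `visionModels` helper and the interface names are assumptions; only the `model.capabilities.vision` shape comes from the `listProviders()` example above.

```typescript
// Shapes assumed to mirror the listProviders() result used earlier.
interface ModelInfo {
  id: string;
  capabilities: { vision: boolean };
}

interface ProviderInfo {
  name: string;
  models: ModelInfo[];
}

// Flatten the provider list down to the IDs of vision-capable models.
function visionModels(providers: ProviderInfo[]): string[] {
  return providers.flatMap(p =>
    p.models.filter(m => m.capabilities.vision).map(m => m.id)
  );
}
```

A caller could then verify `visionModels(await toolpack.listProviders()).includes(modelId)` before building a multimodal request.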