Multimodal Support

The Toolpack SDK supports multimodal input (text plus images) across all vision-capable providers. You can send images alongside text prompts to models such as GPT-4o, Claude Sonnet, Gemini Pro Vision, and LLaVA.

Image Input Formats

Images can be provided in three formats:

1. Local File Path

import { Toolpack } from 'toolpack-sdk';

const toolpack = await Toolpack.init({ provider: 'openai' });

const response = await toolpack.generate({
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'What is in this image?' },
      {
        type: 'image_file',
        image_file: {
          path: '/path/to/image.png',
          detail: 'high' // 'auto' | 'low' | 'high'
        }
      }
    ]
  }],
  model: 'gpt-4o',
});

2. Base64 Data

const response = await toolpack.generate({
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Describe this diagram' },
      {
        type: 'image_data',
        image_data: {
          data: 'iVBORw0KGgo...', // base64 string
          mimeType: 'image/png',
          detail: 'auto'
        }
      }
    ]
  }],
  model: 'gpt-4o',
});

3. HTTP URL

const response = await toolpack.generate({
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'What breed is this dog?' },
      {
        type: 'image_url',
        image_url: {
          url: 'https://example.com/dog.jpg',
          detail: 'low'
        }
      }
    ]
  }],
  model: 'gpt-4o',
});

Multiple Images

Send multiple images in a single request:

const response = await toolpack.generate({
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Compare these two images' },
      { type: 'image_file', image_file: { path: './image1.png' } },
      { type: 'image_file', image_file: { path: './image2.png' } }
    ]
  }],
  model: 'gpt-4o',
});

Provider Behavior

Different providers handle image inputs differently. The SDK normalizes this automatically:

| Provider  | File Path           | Base64   | URL                 |
|-----------|---------------------|----------|---------------------|
| OpenAI    | Converted to base64 | ✓ Native | ✓ Native            |
| Anthropic | Converted to base64 | ✓ Native | Downloaded → base64 |
| Gemini    | Converted to base64 | ✓ Native | Downloaded → base64 |
| Ollama    | Converted to base64 | ✓ Native | Downloaded → base64 |

Notes

  • File paths are always read and converted to base64 before sending
  • URLs are passed directly to OpenAI, but downloaded and converted for other providers
  • Detail level controls image resolution/token usage (OpenAI-specific, ignored by others)

TypeScript Types

import { ImageFilePart, ImageDataPart, ImageUrlPart } from 'toolpack-sdk';

const filePart: ImageFilePart = {
  type: 'image_file',
  image_file: { path: '/path/to/image.png', detail: 'high' }
};

const dataPart: ImageDataPart = {
  type: 'image_data',
  image_data: { data: 'base64...', mimeType: 'image/png', detail: 'auto' }
};

const urlPart: ImageUrlPart = {
  type: 'image_url',
  image_url: { url: 'https://example.com/image.png', detail: 'low' }
};

Streaming with Images

Multimodal requests work with streaming too:

const stream = toolpack.stream({
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Describe this image in detail' },
      { type: 'image_file', image_file: { path: './photo.jpg' } }
    ]
  }],
  model: 'gpt-4o',
});

for await (const chunk of stream) {
  process.stdout.write(chunk.delta);
}

Vision-Capable Models

Not all models support vision. Check the model's capabilities:

const providers = await toolpack.listProviders();
for (const provider of providers) {
  for (const model of provider.models) {
    if (model.capabilities.vision) {
      console.log(`${model.id} supports vision`);
    }
  }
}

Common Vision Models

| Provider  | Models                                                                     |
|-----------|----------------------------------------------------------------------------|
| OpenAI    | gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-4-vision-preview                     |
| Anthropic | claude-sonnet-4-*, claude-3-5-sonnet-*, claude-3-opus-*, claude-3-haiku-*  |
| Gemini    | gemini-1.5-pro, gemini-1.5-flash, gemini-2.0-flash                         |
| Ollama    | llava, llava-llama3, bakllava, and other vision models                     |
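The capability check above can be factored into a small helper that collects every vision-capable model ID up front, so you can fail fast before attaching images. This is a sketch: the `visionModels` helper and the interface names are assumptions; only the `model.capabilities.vision` shape comes from the `listProviders()` example above.

```typescript
// Shapes assumed to mirror the listProviders() result used earlier.
interface ModelInfo {
  id: string;
  capabilities: { vision: boolean };
}

interface ProviderInfo {
  name: string;
  models: ModelInfo[];
}

// Flatten the provider list down to the IDs of vision-capable models.
function visionModels(providers: ProviderInfo[]): string[] {
  return providers.flatMap(p =>
    p.models.filter(m => m.capabilities.vision).map(m => m.id)
  );
}
```

A caller could then verify `visionModels(await toolpack.listProviders()).includes(modelId)` before building a multimodal request.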