Skip to main content

Testing Agents

@toolpack-sdk/agents ships testing utilities that let you unit-test agents in complete isolation — no API keys, no live channels, no network calls.

Import path

import { createTestAgent, MockChannel, captureEvents, createMockKnowledge } from '@toolpack-sdk/agents/testing';

The testing utilities live in the ./testing sub-path export, not in the main package root.

Contents


createTestAgent()

The primary testing factory. Creates an agent instance wired to a MockChannel and a mock Toolpack that returns scripted responses.

import { createTestAgent } from '@toolpack-sdk/agents/testing';

function createTestAgent<TAgent extends BaseAgent>(
AgentClass: new (options: BaseAgentOptions) => TAgent,
options?: CreateTestAgentOptions,
): TestAgentResult<TAgent>

Options

interface CreateTestAgentOptions {
mockResponses?: MockResponse[]; // scripted LLM responses
defaultResponse?: string; // fallback when no trigger matches (default: 'Mock AI response')
provider?: string; // mock provider name
model?: string; // mock model name
}

interface MockResponse {
trigger: string | RegExp; // matched against user message
response: string; // what the mock LLM returns
usage?: {
prompt_tokens: number;
completion_tokens: number;
total_tokens: number;
};
}

Return value

interface TestAgentResult<TAgent extends BaseAgent> {
agent: TAgent; // the agent instance
channel: MockChannel; // mock channel wired to the agent
toolpack: Toolpack; // mock toolpack instance
addMockResponse: (response: MockResponse) => void; // add responses after creation
}

Example

import { describe, it, expect } from 'vitest';
import { createTestAgent } from '@toolpack-sdk/agents/testing';
import { SupportAgent } from './support-agent.js';

describe('SupportAgent', () => {
it('handles a refund request', async () => {
const { agent, channel } = createTestAgent(SupportAgent, {
mockResponses: [
{ trigger: 'refund', response: 'Your refund has been approved.' },
],
});

const result = await agent.invokeAgent({
message: 'I need a refund for order #12345',
conversationId: 'test-conv-1',
participant: { kind: 'user', id: 'user-1', displayName: 'Alice' },
});

expect(result.output).toBe('Your refund has been approved.');
});

it('returns default response for unmatched messages', async () => {
const { agent } = createTestAgent(SupportAgent, {
defaultResponse: 'How can I help you today?',
});

const result = await agent.invokeAgent({
message: 'Hello',
conversationId: 'test-conv-2',
});

expect(result.output).toBe('How can I help you today?');
});
});

MockChannel

MockChannel implements ChannelInterface and records all inputs and outputs. Wired automatically by createTestAgent, but can also be used standalone.

import { MockChannel } from '@toolpack-sdk/agents/testing';

const channel = new MockChannel();

Properties

class MockChannel implements ChannelInterface {
name = 'mock-channel';
isTriggerChannel = false;

// Inspection
get inputs(): AgentInput[]; // normalized messages received (inbound)
get outputs(): AgentOutput[]; // messages sent (outbound)
get lastInput(): AgentInput | undefined;
get lastOutput(): AgentOutput | undefined;
get receivedCount(): number;
get sentCount(): number;
get isListening(): boolean;

// Simulation
async receive(incoming: unknown): Promise<void>; // normalize + invoke handler
async receiveMessage(
message: string,
conversationId?: string,
intent?: string,
context?: Record<string, unknown>,
): Promise<void>;
async send(output: AgentOutput): Promise<void>; // record outbound message
clear(): void; // reset captured inputs/outputs

// Assertion helpers
assertOutputContains(text: string): void;
assertLastOutput(expected: string): void;

// ChannelInterface compliance
listen(): void;
stop(): void;
normalize(incoming: unknown): AgentInput;
onMessage(handler: (input: AgentInput) => Promise<void>): void;
}

Simulating messages

receive() accepts unknown and runs it through normalize() — it does not take an AgentInput directly. Use receiveMessage() for the most common case:

// Simple text message
await channel.receiveMessage(
'What is my account balance?',
'conv-abc', // conversationId (default: 'test-conversation-1')
'balance_inquiry', // intent (optional)
{ userId: 'user-42' }, // context (optional)
);

expect(channel.lastOutput?.output).toContain('balance');
expect(channel.sentCount).toBe(1);

Or use receive() with a raw object (normalized via normalize()):

await channel.receive({
message: 'Check order #789',
conversationId: 'conv-1',
intent: 'order_lookup',
});

Full channel-driven test

it('processes message through full interceptor chain', async () => {
const { agent, channel } = createTestAgent(MyAgent, {
mockResponses: [{ trigger: /order/i, response: 'Order found.' }],
});

await agent.start(); // binds channel handlers + interceptor chain

await channel.receive({ message: 'Check order #789', conversationId: 'conv-1' });

expect(channel.outputs).toHaveLength(1);
expect(channel.lastOutput?.output).toBe('Order found.');

await agent.stop();
});

Built-in assertions

channel.assertOutputContains('approved');   // throws if no output contains this text
channel.assertLastOutput('Exact match'); // throws if last output !== expected

MockResponse matching

The mock toolpack checks responses in order. First match wins.

  • String trigger: checks if the user message contains the trigger string (case-sensitive).
  • RegExp trigger: tests the user message against the regex.
  • defaultResponse: returned when no trigger matches.
const { agent } = createTestAgent(MyAgent, {
mockResponses: [
{ trigger: /cancel.*order/i, response: 'Order cancellation initiated.' },
{ trigger: 'cancel', response: 'What would you like to cancel?' },
{ trigger: /refund/i, response: 'Refund request received.' },
],
defaultResponse: 'I can help with that.',
});

Add responses dynamically:

const { agent, addMockResponse } = createTestAgent(MyAgent);
addMockResponse({ trigger: 'shipping', response: 'Your package ships in 2 days.' });

captureEvents()

Captures agent lifecycle events emitted during a test run. Returns a rich EventCapture object with assertion helpers.

import { captureEvents } from '@toolpack-sdk/agents/testing';

const events = captureEvents(agent); // no options argument

// ... run agent ...

events.stop(); // detach listeners

EventCapture API

type AgentEventName = 'agent:start' | 'agent:complete' | 'agent:error';
// Note: 'agent:step' is NOT an event name — only the three above are captured.

interface CapturedEvent {
name: AgentEventName;
data: unknown; // event payload
timestamp: number; // Date.now() value (number, not Date)
}

interface EventCapture {
readonly events: CapturedEvent[];
readonly count: number;

clear(): void;
stop(): void; // remove listeners

hasEvent(name: AgentEventName): boolean;
getEvents(name: AgentEventName): CapturedEvent[];
getFirstEvent(name: AgentEventName): CapturedEvent | undefined;
getLastEvent(name: AgentEventName): CapturedEvent | undefined;
assertEvent(name: AgentEventName): void; // throws if event not found
assertNoEvent(name: AgentEventName): void; // throws if event was found
}

Example

it('emits start and complete events', async () => {
const { agent } = createTestAgent(MyAgent, { defaultResponse: 'Done.' });
const events = captureEvents(agent);

await agent.invokeAgent({ message: 'Hello', conversationId: 'c1' });

events.assertEvent('agent:start');
events.assertEvent('agent:complete');
events.assertNoEvent('agent:error');

events.stop();
});

Custom Vitest/Jest matchers

import { registerEventMatchers } from '@toolpack-sdk/agents/testing';
import { expect } from 'vitest';

// In your test setup file:
registerEventMatchers(expect);

// Then in tests:
expect(events).toContainEvent('agent:start');
expect(events).not.toContainEvent('agent:error');
expect(events).toContainEventTimes('agent:complete', 1);

createMockKnowledge()

Provides an in-memory Knowledge instance pre-populated with test data. Useful for testing agents that query a knowledge base without needing a real embedder or vector store.

import { createMockKnowledge, createMockKnowledgeSync } from '@toolpack-sdk/agents/testing';

createMockKnowledge (async)

Returns a real Knowledge instance from @toolpack-sdk/knowledge backed by a MemoryProvider and a deterministic mock embedder.

interface MockKnowledgeOptions {
initialChunks?: Array<{
content: string;
metadata?: Record<string, unknown>;
}>;
dimensions?: number; // embedding dimensions (default: 384)
description?: string; // tool description exposed to LLM
}

const knowledge = await createMockKnowledge({
initialChunks: [
{ content: 'Lead: Acme Corp, score: 85', metadata: { source: 'crm' } },
{ content: 'Lead: TechStart, score: 70', metadata: { source: 'crm' } },
],
});

createMockKnowledgeSync (sync)

Returns a MockKnowledge class instance — not a full Knowledge object, but suitable for testing agents that use knowledge queries. Supports query(), add(), getAllChunks(), clear(), and toTool().

const knowledge = createMockKnowledgeSync({
initialChunks: [
{ content: 'Refund policy: 30-day no-questions-asked return' },
],
});

// Use knowledge.toTool() to wire it as a tool into a mock Toolpack
const tool = knowledge.toTool(); // returns a RequestToolDefinition

Uses simple keyword matching (not semantic similarity) for queries, which is sufficient for most test assertions.


createMockToolpackSimple()

Creates a minimal mock Toolpack that returns the same fixed response for every generate() call. Useful when the agent's AI response content is not the focus of the test.

import { createMockToolpackSimple } from '@toolpack-sdk/agents/testing';

function createMockToolpackSimple(response?: string): Toolpack
  • response — the string returned by every generate() call. Defaults to 'Mock AI response'.
const toolpack = createMockToolpackSimple('Hello!');
const agent = new MyAgent({ toolpack });

const result = await agent.run('Hi');
expect(result.output).toBe('Hello!');

createMockToolpackSequence()

Creates a mock Toolpack that returns responses from a fixed array, one per generate() call. After the array is exhausted, subsequent calls return 'No more mock responses'.

import { createMockToolpackSequence } from '@toolpack-sdk/agents/testing';

function createMockToolpackSequence(responses: string[]): Toolpack

Useful for testing multi-turn conversations or stateful interactions where the AI output changes across turns.

const toolpack = createMockToolpackSequence([
'First response',
'Second response',
'Third response',
]);

// First generate() call → 'First response'
// Second generate() call → 'Second response'
// ...

MockKnowledge class

A lightweight in-memory knowledge store for synchronous test setup. Returned by createMockKnowledgeSync(). Implements query(), add(), getAllChunks(), clear(), and toTool().

import { MockKnowledge, createMockKnowledgeSync } from '@toolpack-sdk/agents/testing';

class MockKnowledge {
// Query using simple keyword matching (not semantic similarity)
async query(text: string, options?: QueryOptions): Promise<QueryResult[]>

// Add a chunk to the in-memory store; returns the generated chunk ID
async add(content: string, metadata?: Record<string, unknown>): Promise<string>

// Return a copy of all stored chunks
getAllChunks(): Chunk[]

// Remove all stored chunks
clear(): void

// Return a RequestToolDefinition-compatible object for wiring into a mock Toolpack
toTool(): { name: string; execute: (params: { query: string; ... }) => Promise<...>; ... }
}

MockKnowledge is not a full Knowledge instance — it uses simple keyword matching instead of vector similarity, which is sufficient for most test assertions. Use createMockKnowledge() (async) when you need real embedding behaviour.

const knowledge = createMockKnowledgeSync({
initialChunks: [
{ content: 'Refund policy: 30-day no-questions-asked return' },
],
});

// Wire into a test agent via toolpack knowledge option
const results = await knowledge.query('refund');
expect(results[0].chunk.content).toContain('Refund policy');

// Add more content at any point
await knowledge.add('Shipping time: 3-5 business days');

Testing patterns

Testing intent routing

it('routes billing intent correctly', async () => {
const { agent } = createTestAgent(SupportAgent, {
mockResponses: [
{ trigger: 'billing', response: 'Here is your billing summary.' },
],
});

const result = await agent.invokeAgent({
intent: 'billing',
message: 'Show me my bills',
conversationId: 'c1',
});

expect(result.output).toBe('Here is your billing summary.');
});

Testing delegation

import { AgentRegistry } from '@toolpack-sdk/agents';
import { createTestAgent } from '@toolpack-sdk/agents/testing';

it('delegates to data agent', async () => {
const { agent: mainAgent } = createTestAgent(OrchestratorAgent);
const { agent: dataAgent } = createTestAgent(DataAgent, {
defaultResponse: 'Data analysis complete.',
});

const registry = new AgentRegistry([mainAgent, dataAgent]);
await registry.start();

const result = await registry.invoke('orchestrator-agent', {
message: 'Analyse sales',
conversationId: 'c1',
});

expect(result.output).toContain('complete');
});

Testing conversation history

it('remembers previous messages', async () => {
const { agent } = createTestAgent(MyAgent, {
mockResponses: [
{ trigger: 'name is Bob', response: 'Nice to meet you, Bob.' },
{ trigger: 'remember', response: 'You told me your name is Bob.' },
],
});

await agent.invokeAgent({ message: 'My name is Bob', conversationId: 'conv-1' });

const result = await agent.invokeAgent({
message: 'Do you remember my name?',
conversationId: 'conv-1', // same conversation
});

expect(result.output).toContain('Bob');
});

Testing lifecycle hooks

it('calls onComplete after successful run', async () => {
const { agent } = createTestAgent(MyAgent, { defaultResponse: 'Done.' });

let completedWith: AgentResult | null = null;
agent.onComplete = async (result) => { completedWith = result; };

await agent.invokeAgent({ message: 'test', conversationId: 'c1' });

expect(completedWith?.output).toBe('Done.');
});

Testing error handling

it('emits agent:error on failure', async () => {
const { agent } = createTestAgent(MyAgent);
const events = captureEvents(agent);

// Override invokeAgent to force an error
const original = agent.invokeAgent.bind(agent);
agent.invokeAgent = async () => { throw new Error('boom'); };

try {
await agent.invokeAgent({ message: 'test', conversationId: 'c1' });
} catch {
// expected
}

events.assertEvent('agent:error');
events.stop();
});

---

## Evals — LLM quality evaluation

Unit tests verify agent wiring; evals verify agent **quality** — does the agent give correct, helpful answers on real inputs? The eval primitives let you build regression suites and track quality over time.

### Import path

```typescript
import {
EvalDataset,
EvalRunner,
ExactMatchScorer,
ContainsScorer,
LLMJudgeScorer,
CustomScorer,
compareEvalRuns,
formatEvalReport,
} from '@toolpack-sdk/agents';

Quick start

import { EvalDataset, EvalRunner, ContainsScorer } from '@toolpack-sdk/agents';

const dataset = new EvalDataset([
{
id: 'greet-1',
input: 'Say hello',
expectedOutput: 'hello',
},
{
id: 'summarise-1',
input: 'Summarise: The sky is blue.',
expectedOutput: 'blue',
},
]);

const runner = new EvalRunner({
agent: myAgent,
dataset,
scorers: [new ContainsScorer()],
});

const run = await runner.run();
console.log(`Score: ${run.averageScore * 100}%`);

EvalDataset

Holds a list of EvalCase objects.

interface EvalCase {
id: string; // unique identifier
input: string; // message sent to the agent
expectedOutput: string; // used by scorers
metadata?: Record<string, unknown>;
}

Datasets can be loaded from arrays, JSON files, or built programmatically:

// From array
const dataset = new EvalDataset(cases);

// Add cases
dataset.add({ id: 'c3', input: 'test', expectedOutput: 'expected' });

// Filter
const subset = dataset.filter(c => c.id.startsWith('greet'));

EvalRunner

Runs each case through the agent and collects scored results.

const runner = new EvalRunner({
agent, // BaseAgent instance
dataset, // EvalDataset
scorers, // EvalScorer[]
concurrency?: 1, // parallel cases (default: 1)
conversationId?: string, // fixed or per-run generated
});

const run: EvalRun = await runner.run();

Scorers

Four scorers are built in:

ScorerDescription
ExactMatchScorerScore 1.0 if output === expectedOutput (trimmed, case-insensitive by default)
ContainsScorerScore 1.0 if output contains expectedOutput
LLMJudgeScorerAsk an LLM to score the output on a 0–1 scale
CustomScorerYour own scoring function

LLMJudgeScorer

import { LLMJudgeScorer } from '@toolpack-sdk/agents';

const judge = new LLMJudgeScorer({
sdk: myToolpack,
prompt: 'Is this response factually correct and helpful? Score 0-1.',
// Optional: pass a specific model name
});

const runner = new EvalRunner({ agent, dataset, scorers: [judge] });

CustomScorer

import { CustomScorer } from '@toolpack-sdk/agents';

const lengthScorer = new CustomScorer({
name: 'brevity',
score: async ({ output, expectedOutput }) => {
// Score 1.0 if shorter than expected, 0 if longer
return output.length <= expectedOutput.length ? 1.0 : 0.0;
},
});

Regression reports

Compare two eval runs to catch regressions before merging:

import { compareEvalRuns, formatEvalReport } from '@toolpack-sdk/agents';

const baselineRun = await runnerV1.run();
const currentRun = await runnerV2.run();

const report = compareEvalRuns(baselineRun, currentRun);
console.log(formatEvalReport(report));
// Prints a human-readable table of improvements and regressions

compareEvalRuns returns an EvalReport with regressions and improvements arrays so you can assert in CI:

expect(report.regressions).toHaveLength(0);

EvalRun shape

interface EvalRun {
id: string;
startedAt: Date;
completedAt: Date;
results: EvalCaseResult[];
averageScore: number;
scoredResults: EvalScoredResult[];
}