Knowledge Package
The @toolpack-sdk/knowledge package provides Retrieval-Augmented Generation (RAG) capabilities for your AI agents. Build knowledge bases from documentation, code, or any text source and enable semantic search within your agent conversations.
Installation
npm install @toolpack-sdk/knowledge
Quick Start
Development Setup (Memory Provider)
Perfect for prototyping and development with zero configuration:
import { Knowledge, MemoryProvider, MarkdownSource, OllamaEmbedder } from '@toolpack-sdk/knowledge';
const kb = await Knowledge.create({
provider: new MemoryProvider(),
sources: [new MarkdownSource('./docs/**/*.md')],
embedder: new OllamaEmbedder({ model: 'nomic-embed-text' }),
description: 'Documentation for search queries',
});
// Search your knowledge base
const results = await kb.query('how to configure authentication');
console.log(results[0].chunk.content);
Production Setup (Persistent Provider)
For CLI tools and production applications with persistent storage:
import { Knowledge, PersistentKnowledgeProvider, MarkdownSource, OpenAIEmbedder } from '@toolpack-sdk/knowledge';
const kb = await Knowledge.create({
provider: new PersistentKnowledgeProvider({
namespace: 'my-app',
reSync: false, // Use existing index if available
}),
sources: [new MarkdownSource('./docs/**/*.md')],
embedder: new OpenAIEmbedder({
model: 'text-embedding-3-small',
apiKey: process.env.OPENAI_API_KEY!,
}),
description: 'My application documentation',
onEmbeddingProgress: (event) => {
console.log(`Embedding: ${event.percent}% (${event.current}/${event.total})`);
},
});
Providers
Providers handle vector storage and similarity search. Choose based on your use case:
MemoryProvider
In-memory storage ideal for development:
import { MemoryProvider } from '@toolpack-sdk/knowledge';
const provider = new MemoryProvider({
maxChunks: 10000, // Optional: limit memory usage
});
Best for: Development, prototyping, short-lived processes
Limitations: Data lost on process exit, memory constraints
PersistentKnowledgeProvider
SQLite-backed persistent storage:
import { PersistentKnowledgeProvider } from '@toolpack-sdk/knowledge';
const provider = new PersistentKnowledgeProvider({
namespace: 'my-app', // Creates ~/.toolpack/knowledge/my-app.db
storagePath: './custom/path', // Optional: override default location
reSync: false, // Skip re-indexing if DB exists
});
Best for: CLI tools, desktop apps, production workloads
Features: WAL mode, transactions, metadata filtering
Sources
Sources extract and chunk content from various formats:
MarkdownSource
Chunks Markdown files by heading hierarchy:
import { MarkdownSource } from '@toolpack-sdk/knowledge';
const source = new MarkdownSource('./docs/**/*.md', {
maxChunkSize: 2000, // Max tokens per chunk
chunkOverlap: 200, // Overlap between chunks
minChunkSize: 100, // Merge small sections
namespace: 'docs', // Prefix for chunk IDs
metadata: { type: 'documentation' }, // Added to all chunks
});
Features:
- Heading-based chunking (preserves document structure)
- YAML frontmatter extraction
- Code block detection (
hasCode: truemetadata) - Deterministic chunk IDs for deduplication
WebUrlSource
Crawl and index websites automatically:
import { WebUrlSource } from '@toolpack-sdk/knowledge';
const source = new WebUrlSource(['https://example.com/docs'], {
maxDepth: 2, // Follow links up to this many levels (default: 1)
maxPagesPerDomain: 20, // Cap pages crawled per domain (default: 10)
sameDomainOnly: true, // Only follow links on the same domain (default: true)
delayMs: 1000, // Delay between requests in ms (default: 1000)
userAgent: 'MyApp/1.0', // Custom user agent
timeoutMs: 30000, // Request timeout in ms (default: 30000)
maxChunkSize: 2000, // Max tokens per chunk (default: 2000)
chunkOverlap: 200, // Overlap between chunks (default: 200)
minChunkSize: 100, // Min chunk size (default: 100)
namespace: 'web', // Chunk ID prefix (default: 'web')
metadata: { source: 'web' }, // Added to all chunks
});
Constructor: new WebUrlSource(urls: string[], options?: WebUrlSourceOptions)
Features:
- Recursive website crawling with depth control
- Automatic HTML text extraction (removes scripts, styles, nav, header, footer)
- Link discovery and following
- Per-domain rate limiting and page caps
- Metadata includes title, URL, chunk index
ApiDataSource
Index data from REST APIs with pagination support:
import { ApiDataSource } from '@toolpack-sdk/knowledge';
const source = new ApiDataSource('https://api.github.com/repos/org/repo/issues', {
method: 'GET', // 'GET' | 'POST' (default: 'GET')
headers: {
'Authorization': `Bearer ${process.env.GITHUB_TOKEN}`,
'Accept': 'application/vnd.github.v3+json',
},
body: undefined, // Request body for POST requests
pagination: {
param: 'page', // Query parameter name for the page number
start: 1, // Starting page number
step: 1, // Page increment per request
maxPages: 5, // Max pages to fetch
},
dataPath: '', // Dot-separated JSON path to data array (e.g. 'data.items')
contentExtractor: (item: any) => // Custom content extraction (optional)
`${item.title}\n\n${item.body}`,
metadataExtractor: (item: any) => ({ // Custom metadata extraction (optional)
id: item.id,
state: item.state,
}),
timeoutMs: 30000, // Request timeout in ms (default: 30000)
maxChunkSize: 2000, // Max tokens per chunk (default: 2000)
chunkOverlap: 200, // Overlap between chunks (default: 200)
namespace: 'api', // Chunk ID prefix (default: 'api')
metadata: { source: 'github' }, // Added to all chunks
});
Constructor: new ApiDataSource(url: string, options?: ApiDataSourceOptions)
When no contentExtractor is provided, the source automatically looks for common fields (content, text, description, body, message) in each item and falls back to JSON serialization. When no metadataExtractor is provided, it extracts common fields (id, title, name, created_at, updated_at, author, tags).
Features:
- REST API data ingestion (GET/POST)
- Automatic pagination handling with configurable start, step, and max pages
- Custom content and metadata extractors
- JSON path support for nested data via
dataPath - Configurable timeout and chunking
JSONSource
Index data from local JSON files:
import { JSONSource } from '@toolpack-sdk/knowledge';
const source = new JSONSource('./data/products.json', {
toContent: (item: any) => `${item.name}\n\n${item.description}`, // Required
filter: (item: any) => item.active === true, // Optional: filter items
chunkSize: 100, // Items per chunk (default: 100)
namespace: 'products', // Chunk ID prefix (default: 'json')
metadata: { source: 'products-db' }, // Added to all chunks
});
Constructor: new JSONSource(filePath: string, options: JSONSourceOptions)
The toContent callback is required. The source handles both JSON arrays (with optional item-level filter) and single JSON objects.
Features:
- Parses JSON arrays or single objects
- Optional item-level filtering
- Configurable items per chunk
- Metadata includes source filename, chunk index range, and total item count
SQLiteSource
Index rows from a SQLite database. Requires the better-sqlite3 package:
npm install better-sqlite3
import { SQLiteSource } from '@toolpack-sdk/knowledge';
const source = new SQLiteSource('./data/app.db', {
query: 'SELECT id, title, body FROM articles WHERE published = 1', // Optional SQL query
toContent: (row) => `${row.title}\n\n${row.body}`, // Required
chunkSize: 50, // Rows per chunk (default: 100)
namespace: 'articles', // Chunk ID prefix (default: 'sqlite')
metadata: { source: 'sqlite' }, // Added to all chunks
preLoadCSV: { // Optional: load a CSV into the DB before querying
tableName: 'articles',
csvPath: './data/articles.csv',
delimiter: ',', // Column delimiter (default: ',')
headers: true, // First row is a header (default: true)
},
});
Constructor: new SQLiteSource(dbPath: string, options: SQLiteSourceOptions)
The toContent callback is required. If query is not provided it defaults to SELECT * FROM sqlite_master WHERE type = "table".
Features:
- Arbitrary SQL queries
- Required
toContentcallback to control what gets embedded - Optional CSV/TSV pre-loading into the database before querying
- Metadata includes source filename, query, and row index range
PostgresSource
Index rows from a PostgreSQL database. Requires the pg package:
npm install pg
import { PostgresSource } from '@toolpack-sdk/knowledge';
// Using a connection string
const source = new PostgresSource({
connectionString: process.env.DATABASE_URL,
query: 'SELECT id, title, content FROM docs WHERE status = $1',
toContent: (row) => `${row.title}\n\n${row.content}`, // Required
chunkSize: 50,
namespace: 'docs',
metadata: { source: 'postgres' },
ssl: true,
});
// Using individual connection parameters
const source2 = new PostgresSource({
host: 'localhost',
port: 5432,
database: 'mydb',
user: 'admin',
password: process.env.DB_PASSWORD,
query: 'SELECT title, body FROM posts',
toContent: (row) => `${row.title}\n\n${row.body}`,
});
Constructor: new PostgresSource(options: PostgresSourceOptions)
The query and toContent fields are required. Use either connectionString or the individual host/port/database/user/password fields.
Features:
- Arbitrary SQL queries
- Connection via connection string or individual parameters
- SSL support
- Required
toContentcallback to control what gets embedded - Metadata includes query and row index range
Embedders
Embedders convert text to vector embeddings for semantic search:
OllamaEmbedder
Local embeddings using Ollama (zero API cost):
import { OllamaEmbedder } from '@toolpack-sdk/knowledge';
const embedder = new OllamaEmbedder({
model: 'nomic-embed-text', // or 'mxbai-embed-large', 'all-minilm', etc.
baseUrl: 'http://localhost:11434', // default
retries: 3,
retryDelay: 1000,
});
Supported models:
nomic-embed-text(768 dimensions)mxbai-embed-large(1024 dimensions)all-minilm(384 dimensions)- Custom models with
dimensionsoverride
OpenAIEmbedder
OpenAI text-embedding models:
import { OpenAIEmbedder } from '@toolpack-sdk/knowledge';
const embedder = new OpenAIEmbedder({
model: 'text-embedding-3-small', // or 'text-embedding-3-large'
apiKey: process.env.OPENAI_API_KEY!,
retries: 3,
retryDelay: 1000,
timeout: 30000,
});
Supported models:
text-embedding-3-small(1536 dimensions)text-embedding-3-large(3072 dimensions)text-embedding-ada-002(1536 dimensions)
OpenRouterEmbedder
Embeddings via OpenRouter, giving access to OpenAI embedding models through a single API key:
import { OpenRouterEmbedder } from '@toolpack-sdk/knowledge';
const embedder = new OpenRouterEmbedder({
model: 'openai/text-embedding-3-small', // Required
apiKey: process.env.OPENROUTER_API_KEY!, // Required
dimensions: 1536, // Optional: override auto-detected dimensions
retries: 3, // Default: 3
retryDelay: 1000, // ms, default: 1000
timeout: 30000, // ms, default: 30000
});
Known models (dimensions auto-detected):
openai/text-embedding-3-small(1536 dimensions)openai/text-embedding-3-large(3072 dimensions)openai/text-embedding-ada-002(1536 dimensions)nvidia/llama-nemotron-embed-vl-1b-v2(4096 dimensions)
For any model not in the list above, pass dimensions explicitly.
VertexAIEmbedder
Google Cloud Vertex AI embedding models. Authenticates via Application Default Credentials (ADC):
import { VertexAIEmbedder } from '@toolpack-sdk/knowledge';
const embedder = new VertexAIEmbedder({
projectId: 'my-gcp-project', // Required (or set VERTEX_AI_PROJECT / GOOGLE_CLOUD_PROJECT env var)
location: 'us-central1', // GCP region (default: 'us-central1')
model: 'gemini-embedding-001', // Embedding model (default: 'gemini-embedding-001')
outputDimensionality: 3072, // Optional: override output dimensions
retries: 3, // Default: 3
retryDelay: 1000, // ms, default: 1000
});
If projectId is not set in options, the embedder falls back to the VERTEX_AI_PROJECT, TOOLPACK_VERTEXAI_PROJECT, or GOOGLE_CLOUD_PROJECT environment variables. If none are set, it throws at construction time.
Supported models:
gemini-embedding-001(3072 dimensions, default)text-embedding-005(768 dimensions)text-multilingual-embedding-002(768 dimensions)
For any other model pass outputDimensionality explicitly.
Important: Changing outputDimensionality after initial indexing requires wiping the knowledge store, as the dimension count must be consistent.
Integration with Toolpack SDK
Connect your knowledge base to Toolpack SDK agents:
import { Toolpack } from 'toolpack-sdk';
import { Knowledge, MemoryProvider, MarkdownSource, OllamaEmbedder } from '@toolpack-sdk/knowledge';
const kb = await Knowledge.create({
provider: new MemoryProvider(),
sources: [new MarkdownSource('./docs/**/*.md')],
embedder: new OllamaEmbedder({ model: 'nomic-embed-text' }),
description: 'Search this when users ask about setup, configuration, or API usage.',
});
const toolpack = await Toolpack.init({
provider: 'anthropic',
knowledge: kb, // Automatically registers knowledge_search tool
});
// The agent can now search your knowledge base
const response = await toolpack.chat('How do I configure authentication?');
Enhanced Knowledge Tools
The knowledge base provides a tool with advanced search capabilities:
const tool = kb.toTool();
// Semantic search (default)
const semanticResults = await tool.execute({
query: 'machine learning',
limit: 5,
});
// Hybrid search (semantic + keyword)
const hybridResults = await tool.execute({
query: 'machine learning algorithms',
searchType: 'hybrid',
keywordWeight: 0.4,
semanticWeight: 0.6,
filter: { category: 'tutorial' },
});
Tool parameters:
query: Search query stringsearchType:'semantic'or'hybrid'(default:'semantic')keywordWeight: Weight for keyword matching (0-1, default: 0.3)semanticWeight: Weight for semantic matching (0-1, default: 0.7)limit: Maximum results (default: 10)threshold: Minimum similarity score (default: 0.7)filter: Metadata filters
Querying
Basic Semantic Search
const results = await kb.query('authentication setup');
// Returns: Array of { chunk, score, distance }
Hybrid Search (Semantic + Keyword)
Combine semantic similarity with keyword matching for better results:
const results = await kb.hybridQuery('machine learning algorithms', {
keywordWeight: 0.3, // 30% keyword relevance
semanticWeight: 0.7, // 70% semantic relevance
keywordFields: ['content', 'title'], // Fields to search
limit: 10,
threshold: 0.7,
});
Hybrid search advantages:
- Better precision for technical terms and proper nouns
- Improved ranking for exact matches
- Balanced semantic understanding with keyword accuracy
Advanced Query Options
const results = await kb.query('authentication setup', {
limit: 5, // Max results (default: 10)
threshold: 0.8, // Minimum similarity 0-1 (default: 0.7)
filter: { // Metadata filters
hasCode: true,
category: { $in: ['api', 'guide'] },
source: 'web', // Filter by source type
},
includeMetadata: true, // Include chunk metadata (default: true)
includeVectors: false, // Include embedding vectors (default: false)
});
Metadata Filters
// Exact match
{ category: 'api' }
// In array
{ category: { $in: ['api', 'guide', 'tutorial'] } }
// Numeric comparisons
{ priority: { $gt: 5 } }
{ priority: { $lt: 10 } }
// Source-specific filters
{ source: 'web', url: { $in: ['https://docs.example.com'] } }
{ source: 'api', statusCode: 200 }
Utility Functions
The package exports two utility functions for building custom search pipelines:
keywordSearch
import { keywordSearch } from '@toolpack-sdk/knowledge';
const score = keywordSearch(text, query);
// Returns a number between 0 and 1.
// Returns 1.0 on exact substring match.
// Returns a fractional score based on the proportion of query words found in text.
// Returns 0.0 when no query words (longer than 2 characters) are found.
combineScores
import { combineScores } from '@toolpack-sdk/knowledge';
const combined = combineScores(semanticScore, keywordScore, semanticWeight);
// semanticWeight defaults to 0.7 (70% semantic, 30% keyword).
// Returns: semanticScore * semanticWeight + keywordScore * (1 - semanticWeight)
Use these functions when implementing custom hybrid search logic outside of hybridQuery.
Streaming Ingestion
Process large datasets with real-time progress tracking:
// Traditional sync (blocks until complete)
await kb.sync();
// Streaming sync (progress updates)
for await (const progress of kb.syncStream()) {
switch (progress.type) {
case 'count':
console.log(`Total chunks to process: ${progress.total}`);
break;
case 'progress':
console.log(`Processed ${progress.processed}/${progress.total} chunks (${progress.percent}%)`);
break;
case 'complete':
console.log(`Sync complete! Processed ${progress.total} chunks`);
break;
case 'error':
console.error('Sync failed:', progress.error);
break;
}
}
Streaming benefits:
- Real-time progress feedback
- Memory-efficient batch processing (100 chunks per batch)
- Non-blocking UI updates
- Early error detection
- Better user experience for large datasets
Error Handling
Handle embedding failures gracefully:
const kb = await Knowledge.create({
provider,
sources,
embedder,
description: 'Knowledge base',
onError: (error, context) => {
console.error(`Failed: ${context.file} — ${error.message}`);
if (error instanceof EmbeddingError) {
return 'skip'; // Skip this chunk, continue with others
}
return 'abort'; // Stop the entire process
},
});
Error Types:
KnowledgeError— Base error classEmbeddingError— Embedding API failuresIngestionError— Source parsing failuresDimensionMismatchError— Vector dimension mismatchKnowledgeProviderError— Provider operation failuresChunkTooLargeError— Chunk exceeds max size
API Reference
Knowledge.create()
interface KnowledgeOptions {
provider: KnowledgeProvider;
sources: KnowledgeSource[];
embedder: Embedder;
description: string; // Required: used as tool description
reSync?: boolean; // default: true
streamingBatchSize?: number; // Chunks processed per batch (default: 100)
onError?: ErrorHandler;
onSync?: SyncEventHandler;
onEmbeddingProgress?: EmbeddingProgressHandler;
}
Knowledge Methods
query(text, options?)
Standard semantic search using vector similarity.
async query(text: string, options?: QueryOptions): Promise<QueryResult[]>
hybridQuery(text, options?)
Advanced search combining semantic similarity with keyword matching.
async hybridQuery(text: string, options?: HybridQueryOptions): Promise<QueryResult[]>
interface HybridQueryOptions extends QueryOptions {
keywordWeight?: number; // Weight for keyword matching (default: 0.3)
semanticWeight?: number; // Weight for semantic matching (default: 0.7)
keywordFields?: string[]; // Fields to search (default: ['content'])
}
sync()
Synchronous ingestion of all sources.
async sync(): Promise<void>
syncStream()
Streaming ingestion with progress updates.
async *syncStream(): AsyncIterable<SyncProgress>
interface SyncProgress {
type: 'count' | 'progress' | 'complete' | 'error';
total?: number;
processed?: number;
percent?: number;
error?: Error;
}
toTool()
Convert knowledge base to a Toolpack SDK tool.
toTool(): KnowledgeTool
Sync Events
onSync: (event) => {
// event.type: 'start' | 'file' | 'chunk' | 'complete' | 'error'
// event.file?: string
// event.chunksAffected?: number
// event.error?: Error
}
Embedding Progress
onEmbeddingProgress: (event) => {
// event.source: string
// event.current: number
// event.total: number
// event.percent: number
}
Best Practices
-
Choose the right provider:
- Development:
MemoryProvider - Production CLI:
PersistentKnowledgeProvider
- Development:
-
Use appropriate chunk sizes:
- Small docs: 1000-1500 tokens
- Large docs: 2000-3000 tokens
- Code: 1500-2000 tokens (with
hasCodemetadata) - Web content: 1500-2000 tokens (after HTML cleaning)
- API data: 1000-1500 tokens (depends on content structure)
-
Handle embedding failures:
- Always provide
onErrorfor production - Use
skipfor transient failures - Use
abortfor critical errors
- Always provide
-
Leverage metadata filtering:
- Tag chunks with
category,hasCode,version - Use source-specific metadata (
source,url,statusCode) - Filter by relevance in queries
- Tag chunks with
-
Monitor progress:
- Use
onEmbeddingProgressfor large knowledge bases - Use
syncStream()for real-time progress in UIs - Show loading indicators in CLI apps
- Use
-
Web crawling best practices:
- Set appropriate
delayMs(1000ms+ for respectful crawling) - Use
sameDomainOnly: true(the default) to stay within your site - Limit
maxDepthandmaxPagesPerDomainto avoid excessive crawling - Check
robots.txtcompliance manually
- Set appropriate
-
API data ingestion:
- Use
contentExtractorfor complex data structures - Use
dataPathto target nested arrays in API responses - Configure
paginationfor large datasets
- Use
-
Hybrid search optimization:
- Use hybrid search for technical content with proper nouns
- Adjust
keywordWeighthigher for exact term matching - Use
semanticWeighthigher for conceptual searches
Troubleshooting
Common Issues
"Dimension mismatch" error:
// Ensure embedder dimensions match provider
// OllamaEmbedder with nomic-embed-text = 768 dimensions
// PersistentKnowledgeProvider persists dimensions in DB
"No files found" with MarkdownSource:
// Check glob pattern - use forward slashes even on Windows
new MarkdownSource('./docs/**/*.md') // correct
new MarkdownSource('.\\docs\\**\\*.md') // incorrect
Slow embedding:
// Use embedBatch() when possible
// OllamaEmbedder.embedBatch is optimized
// OpenAIEmbedder.embedBatch makes single API call
Web crawling blocked:
// Check robots.txt and respect site policies
// Add user agent and delay between requests
// Use sameDomainOnly to limit crawling scope
const source = new WebUrlSource(['https://example.com'], {
userAgent: 'MyApp/1.0 (contact@example.com)',
delayMs: 2000,
sameDomainOnly: true,
});
VertexAI authentication errors:
// Ensure Application Default Credentials are configured
// Run: gcloud auth application-default login
// Or set GOOGLE_APPLICATION_CREDENTIALS env var to a service account key file
const embedder = new VertexAIEmbedder({
projectId: process.env.GOOGLE_CLOUD_PROJECT,
location: 'us-central1',
});
SQLiteSource or PostgresSource peer dependency missing:
# SQLiteSource requires better-sqlite3
npm install better-sqlite3
# PostgresSource requires pg
npm install pg
Hybrid search not returning expected results:
// Adjust weights based on content type
// Technical docs: higher keywordWeight
// General content: higher semanticWeight
const results = await kb.hybridQuery(query, {
keywordWeight: 0.4, // Increase for technical terms
semanticWeight: 0.6, // Increase for concepts
});
Streaming sync memory issues:
// Streaming processes in batches automatically
// For very large datasets, consider:
// - Smaller batch sizes (not currently configurable)
// - More frequent progress updates
// - Error handling for partial failures