Parallel LLM Requests
Parallel LLM requests reduce latency by sending the same request multiple times and using whichever response arrives first. Under variable network conditions, this can cut response times by 30-50%.
Quick Start
The easiest way to enable parallel requests is through your minded.json configuration:
{
  "flows": ["./src/flows"],
  "tools": ["./src/tools"],
  "agent": "./src/agent.ts",
  "llm": {
    "name": "MindedChatOpenAI",
    "properties": {
      "model": "gpt-4o",
      "numParallelRequests": 3,
      "logTimings": true
    }
  }
}
Your agent will automatically use parallel requests for all LLM calls:
import { Agent } from '@minded-ai/mindedjs';
import memorySchema from './agentMemorySchema';
import config from '../minded.json';
import tools from './tools';

const agent = new Agent({
  memorySchema,
  config, // Parallel configuration is automatically applied
  tools,
});
Configuration Options
MindedChatOpenAI (Recommended)
For agents running on the Minded platform, use MindedChatOpenAI with parallel configuration:
{
  "llm": {
    "name": "MindedChatOpenAI",
    "properties": {
      "model": "gpt-4o",
      "numParallelRequests": 3,
      "logTimings": true,
      "temperature": 0.7
    }
  }
}
AzureChatOpenAI
For Azure OpenAI deployments:
{
  "llm": {
    "name": "AzureChatOpenAI",
    "properties": {
      "model": "gpt-4o",
      "numParallelRequests": 3,
      "logTimings": true,
      "azureOpenAIApiVersion": "2024-02-01"
    }
  }
}
Required environment variables:
AZURE_OPENAI_API_KEY=your_azure_key
AZURE_OPENAI_API_INSTANCE_NAME=your_instance_name
AZURE_OPENAI_API_DEPLOYMENT_NAME=your_deployment_name
ChatOpenAI
For standard OpenAI API:
{
  "llm": {
    "name": "ChatOpenAI",
    "properties": {
      "model": "gpt-4o",
      "numParallelRequests": 3,
      "logTimings": true,
      "openAIApiKey": "${OPENAI_API_KEY}"
    }
  }
}
Configuration Parameters
numParallelRequests (number, default: 1): Number of parallel requests to send (2-5 recommended)
logTimings (boolean, default: false): Enable detailed timing logs
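If you configure the wrapper programmatically (see Manual Instantiation under Advanced Usage), the same two options can be described with a small TypeScript shape. The interface name below is ours for illustration; the library may export its own type under a different name:

// Hypothetical shape for the parallel-request options documented above; the
// actual exported type (if any) may differ.
interface ParallelRequestOptions {
  // Number of parallel requests to send; default 1, 2-5 recommended.
  numParallelRequests?: number;
  // Emit detailed timing logs; default false.
  logTimings?: boolean;
}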
Performance Notes
Optimal Range: 2-3 parallel requests usually provide the best latency/cost balance
Cost Impact: You pay for all parallel requests made
Best Use Cases: Variable network conditions, consistency requirements
Latency Reduction: Typically 30-50% faster response times (see the sketch below for the intuition)
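The intuition behind the latency numbers above: taking the fastest of N requests means taking the minimum of N latency samples, so the more spread there is in individual request times, the more the fastest one beats a single request. The sketch below simulates this with made-up latency numbers and makes no API calls:

// Simulate the "fastest of N" effect with synthetic latencies (no API calls).
// The uniform 1-3 second latency model is invented purely for illustration.
function sampleLatency(): number {
  return 1.0 + Math.random() * 2.0;
}

function fastestOf(n: number): number {
  return Math.min(...Array.from({ length: n }, sampleLatency));
}

const trials = 10_000;
const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
const avgSingle = mean(Array.from({ length: trials }, () => fastestOf(1)));
const avgOfThree = mean(Array.from({ length: trials }, () => fastestOf(3)));
console.log({ avgSingle, avgOfThree }); // avgOfThree comes out noticeably lower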
Monitoring Performance
When logTimings: true is enabled, you'll see detailed performance logs:
[Model] Fastest request completed { requestTime: 1.234, numParallelRequests: 3 }
[Model] Time saved using parallel requests {
  fastestRequestTime: 1.234,
  secondFastestRequestTime: 1.567,
  allFinishTime: 2.345,
  timeSaved: 1.111,
  timeSavedFromSecond: 0.333
}
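Judging from the sample values, timeSaved is the gap between the moment all parallel requests finished and the fastest response, and timeSavedFromSecond is the gap between the second-fastest and the fastest response. This reading is inferred from the numbers above rather than stated by the library, so treat it as an assumption:

// Relationship between the sample log fields above (inferred from the numbers,
// not an official definition of the fields).
const fastestRequestTime = 1.234;
const secondFastestRequestTime = 1.567;
const allFinishTime = 2.345;

const timeSaved = allFinishTime - fastestRequestTime; // 1.111
const timeSavedFromSecond = secondFastestRequestTime - fastestRequestTime; // 0.333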
Advanced Usage
Dynamic Configuration
You can adjust parallel requests based on environment:
{
  "llm": {
    "name": "MindedChatOpenAI",
    "properties": {
      "model": "gpt-4o",
      "numParallelRequests": "${NODE_ENV === 'production' ? 3 : 1}",
      "logTimings": "${NODE_ENV === 'development'}"
    }
  }
}
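If expression interpolation inside minded.json is not available in your setup (we have not verified that it is), you can get the same effect by building the configuration object in code before handing it to the Agent. A minimal sketch, reusing the structure from the Quick Start example:

import { Agent } from '@minded-ai/mindedjs';
import memorySchema from './agentMemorySchema';
import baseConfig from '../minded.json';
import tools from './tools';

// Derive parallel-request settings from NODE_ENV at startup.
const isProduction = process.env.NODE_ENV === 'production';

const config = {
  ...baseConfig,
  llm: {
    name: 'MindedChatOpenAI',
    properties: {
      model: 'gpt-4o',
      numParallelRequests: isProduction ? 3 : 1,
      logTimings: !isProduction,
    },
  },
};

const agent = new Agent({ memorySchema, config, tools });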
Manual Instantiation with createParallelWrapper
For advanced use cases where you need direct control over the LLM instance, you can manually apply parallel wrapping:
import { Agent, createParallelWrapper } from '@minded-ai/mindedjs';
import { ChatOpenAI, AzureChatOpenAI } from '@langchain/openai';
import memorySchema from './agentMemorySchema';

// Manual wrapping for ChatOpenAI
const parallelOpenAI = createParallelWrapper(
  new ChatOpenAI({
    openAIApiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o',
    temperature: 0.7,
  }),
  {
    numParallelRequests: 3,
    logTimings: true,
  },
);

// Manual wrapping for AzureChatOpenAI
const parallelAzure = createParallelWrapper(
  new AzureChatOpenAI({
    azureOpenAIApiKey: process.env.AZURE_OPENAI_API_KEY,
    azureOpenAIApiInstanceName: process.env.AZURE_OPENAI_INSTANCE!,
    azureOpenAIApiDeploymentName: process.env.AZURE_OPENAI_DEPLOYMENT!,
    azureOpenAIApiVersion: '2024-02-01',
  }),
  {
    numParallelRequests: 2,
    logTimings: false,
  },
);

// Use directly with Agent (bypassing configuration)
const agent = new Agent({
  memorySchema,
  config: {
    flows: ['./flows'],
    tools: [],
    llm: parallelOpenAI as any, // Direct LLM instance
  },
  tools: [],
});
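Assuming the wrapper keeps the standard LangChain chat-model interface (an assumption we have not verified against the library), it can also be invoked directly rather than through an Agent:

// Sketch: calling the wrapped model directly. Assumes createParallelWrapper
// preserves the LangChain chat-model invoke() interface; verify against your
// installed version.
const response = await parallelOpenAI.invoke('Summarize the ticket in one sentence.');
console.log(response.content);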
Note: The configuration-based approach is recommended for most use cases as it's simpler and more maintainable. Use manual instantiation only when you need specific control over LLM creation or are integrating with existing LangChain workflows.
How It Works
MindedChatOpenAI (Backend Processing)
Parallel requests are handled on the Minded platform backend
Multiple requests sent to Azure OpenAI from the backend
Fastest response returned to your agent
Optimal for production deployments
Other LLM Providers (Client-Side Processing)
Parallel requests handled in your application (see the conceptual sketch below)
Multiple requests sent directly to the LLM provider
Good for development and custom deployments
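Conceptually, the client-side path boils down to sending the same request several times and keeping whichever one succeeds first. The sketch below illustrates that pattern with Promise.any over plain LangChain calls; it is an illustration only, not the library's actual implementation, which also handles the timing logs described earlier:

import { ChatOpenAI } from '@langchain/openai';

// Conceptual illustration of client-side parallel requests: fire n identical
// calls and resolve with the first one that succeeds. Not the library's code.
async function fastestOfN(model: ChatOpenAI, prompt: string, n: number) {
  const attempts = Array.from({ length: n }, () => model.invoke(prompt));
  // Promise.any resolves with the first fulfilled attempt and rejects only if
  // all attempts fail. The slower requests are not cancelled, so all n calls
  // are still billed.
  return Promise.any(attempts);
}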
Best Practices
Start Small: Begin with 2-3 parallel requests
Monitor Costs: Each parallel request counts toward your usage
Enable Logging: Use logTimings: true during development to measure improvements
Environment-Specific: Use fewer parallel requests in development
Configuration Over Code: Prefer minded.json configuration over manual instantiation
Troubleshooting
No Performance Improvement
Check network latency variability
Ensure numParallelRequests > 1
Verify timing logs are showing multiple requests
Increased Costs
Reduce numParallelRequests
Consider cost vs. latency trade-offs
Monitor usage patterns
Rate Limiting
Lower numParallelRequests
Implement backoff strategies (see the sketch below)
Contact your provider about rate limits
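For backoff, one generic option is to retry the call with exponentially increasing delays. The wrapper below is a sketch with made-up delay values and is not part of the SDK; adapt the retry condition to the error shapes your provider actually returns:

// Generic exponential-backoff retry (illustrative defaults, not part of the SDK).
async function withBackoff<T>(call: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      const delayMs = 500 * 2 ** attempt; // 500ms, 1s, 2s, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage: const result = await withBackoff(() => parallelOpenAI.invoke(prompt));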