Parallel LLM Requests

Parallel LLM requests reduce latency by sending multiple identical requests and using the fastest response. In scenarios with variable network conditions, this can cut response times by 30-50%.

Quick Start

The easiest way to enable parallel requests is through your minded.json configuration:

{
  "flows": ["./src/flows"],
  "tools": ["./src/tools"],
  "agent": "./src/agent.ts",
  "llm": {
    "name": "MindedChatOpenAI",
    "properties": {
      "model": "gpt-4o",
      "numParallelRequests": 3,
      "logTimings": true
    }
  }
}

Your agent will automatically use parallel requests for all LLM calls:

import { Agent } from '@minded-ai/mindedjs';
import memorySchema from './agentMemorySchema';
import config from '../minded.json';
import tools from './tools';

const agent = new Agent({
  memorySchema,
  config, // Parallel configuration is automatically applied
  tools,
});

Configuration Options

MindedChatOpenAI

For agents running on the Minded platform, use MindedChatOpenAI with parallel configuration:

{
  "llm": {
    "name": "MindedChatOpenAI",
    "properties": {
      "model": "gpt-4o",
      "numParallelRequests": 3,
      "logTimings": true,
      "temperature": 0.7
    }
  }
}

AzureChatOpenAI

For Azure OpenAI deployments:

{
  "llm": {
    "name": "AzureChatOpenAI",
    "properties": {
      "model": "gpt-4o",
      "numParallelRequests": 3,
      "logTimings": true,
      "azureOpenAIApiVersion": "2024-02-01"
    }
  }
}

Required environment variables:

AZURE_OPENAI_API_KEY=your_azure_key
AZURE_OPENAI_API_INSTANCE_NAME=your_instance_name
AZURE_OPENAI_API_DEPLOYMENT_NAME=your_deployment_name

ChatOpenAI

For standard OpenAI API:

{
  "llm": {
    "name": "ChatOpenAI",
    "properties": {
      "model": "gpt-4o",
      "numParallelRequests": 3,
      "logTimings": true,
      "openAIApiKey": "${OPENAI_API_KEY}"
    }
  }
}

Configuration Parameters

Parameter            Type     Default  Description
numParallelRequests  number   1        Number of parallel requests (2-5 recommended)
logTimings           boolean  false    Enable detailed timing logs

Performance Notes

  • Optimal Range: 2-3 parallel requests usually provide the best latency/cost balance

  • Cost Impact: You pay for all parallel requests made

  • Best Use Cases: Variable network conditions, consistency requirements

  • Latency Reduction: Typically 30-50% faster response times

Monitoring Performance

When logTimings: true is enabled, you'll see detailed performance logs:

[Model] Fastest request completed { requestTime: 1.234, numParallelRequests: 3 }
[Model] Time saved using parallel requests {
  fastestRequestTime: 1.234,
  secondFastestRequestTime: 1.567,
  allFinishTime: 2.345,
  timeSaved: 1.111,
  timeSavedFromSecond: 0.333
}
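
The derived fields follow from the raw per-request timings. A minimal sketch of that arithmetic, inferred from the example values above (the helper below is illustrative, not part of the SDK):

interface ParallelTimings {
  fastestRequestTime: number; // seconds until the winning request finished
  secondFastestRequestTime: number; // seconds until the runner-up finished
  allFinishTime: number; // seconds until every parallel request finished
}

// Illustrative helper (not part of @minded-ai/mindedjs): how the derived
// log fields relate to the raw timings.
function summarizeSavings(t: ParallelTimings) {
  return {
    // Saved versus waiting for all parallel requests: 2.345 - 1.234 = 1.111
    timeSaved: t.allFinishTime - t.fastestRequestTime,
    // Saved versus the second-fastest request: 1.567 - 1.234 = 0.333
    timeSavedFromSecond: t.secondFastestRequestTime - t.fastestRequestTime,
  };
}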

Advanced Usage

Dynamic Configuration

You can adjust parallel requests based on environment:

{
  "llm": {
    "name": "MindedChatOpenAI",
    "properties": {
      "model": "gpt-4o",
      "numParallelRequests": "${NODE_ENV === 'production' ? 3 : 1}",
      "logTimings": "${NODE_ENV === 'development'}"
    }
  }
}
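
Alternatively, you can keep minded.json static and apply the environment-specific override in code, since the Agent constructor accepts a plain configuration object (as in the Quick Start example). A minimal sketch; the spread-based override is illustrative:

import { Agent } from '@minded-ai/mindedjs';
import memorySchema from './agentMemorySchema';
import baseConfig from '../minded.json';
import tools from './tools';

const isProd = process.env.NODE_ENV === 'production';

// Override only the parallel-request settings, keeping the rest of minded.json intact
const config = {
  ...baseConfig,
  llm: {
    ...baseConfig.llm,
    properties: {
      ...baseConfig.llm.properties,
      numParallelRequests: isProd ? 3 : 1, // fewer parallel requests in development
      logTimings: !isProd,                 // timing logs outside production only
    },
  },
};

const agent = new Agent({ memorySchema, config, tools });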

Manual Instantiation with createParallelWrapper

For advanced use cases where you need direct control over the LLM instance, you can manually apply parallel wrapping:

import { createParallelWrapper } from '@minded-ai/mindedjs';
import { ChatOpenAI, AzureChatOpenAI } from '@langchain/openai';

// Manual wrapping for ChatOpenAI
const parallelOpenAI = createParallelWrapper(
  new ChatOpenAI({
    openAIApiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o',
    temperature: 0.7,
  }),
  {
    numParallelRequests: 3,
    logTimings: true,
  },
);

// Manual wrapping for AzureChatOpenAI
const parallelAzure = createParallelWrapper(
  new AzureChatOpenAI({
    azureOpenAIApiKey: process.env.AZURE_OPENAI_API_KEY,
    azureOpenAIApiInstanceName: process.env.AZURE_OPENAI_INSTANCE!,
    azureOpenAIApiDeploymentName: process.env.AZURE_OPENAI_DEPLOYMENT!,
    azureOpenAIApiVersion: '2024-02-01',
  }),
  {
    numParallelRequests: 2,
    logTimings: false,
  },
);

// Use directly with Agent (bypassing configuration)
const agent = new Agent({
  memorySchema,
  config: {
    flows: ['./flows'],
    tools: [],
    llm: parallelOpenAI as any, // Direct LLM instance
  },
  tools: [],
});

Note: The configuration-based approach is recommended for most use cases as it's simpler and more maintainable. Use manual instantiation only when you need specific control over LLM creation or are integrating with existing LangChain workflows.

How It Works

MindedChatOpenAI (Backend Processing)

  • Parallel requests are handled on the Minded platform backend

  • Multiple requests sent to Azure OpenAI from the backend

  • Fastest response returned to your agent

  • Optimal for production deployments

Other LLM Providers (Client-Side Processing)

  • Parallel requests handled in your application (see the sketch after this list)

  • Multiple requests sent directly to the LLM provider

  • Good for development and custom deployments
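
Conceptually, client-side parallelism amounts to firing several identical calls and keeping whichever succeeds first. A simplified sketch of the idea (not the library's actual implementation):

// Conceptual sketch only; not the SDK's internal code.
// Fire n identical requests and resolve with the first one that succeeds.
async function fastestOf<T>(makeRequest: () => Promise<T>, n: number) {
  const attempts = Array.from({ length: n }, () => makeRequest());
  // Promise.any resolves with the first fulfilled attempt and only rejects
  // if every attempt fails.
  return Promise.any(attempts);
}

// Example: three identical chat completions, fastest response wins.
// const response = await fastestOf(() => llm.invoke(messages), 3);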

Best Practices

  1. Start Small: Begin with 2-3 parallel requests

  2. Monitor Costs: Each parallel request counts toward your usage

  3. Enable Logging: Use logTimings: true during development to measure improvements

  4. Environment-Specific: Use fewer parallel requests in development

  5. Configuration Over Code: Prefer minded.json configuration over manual instantiation

Troubleshooting

No Performance Improvement

  • Check network latency variability

  • Ensure numParallelRequests > 1

  • Verify timing logs are showing multiple requests

Increased Costs

  • Reduce numParallelRequests

  • Consider cost vs. latency trade-offs

  • Monitor usage patterns

Rate Limiting

  • Lower numParallelRequests

  • Implement backoff strategies (see the sketch after this list)

  • Contact provider about rate limits
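
Because each parallel request counts against your provider's rate limits, retrying with exponential backoff can absorb transient rate-limit errors. A provider-agnostic sketch; the retry parameters and wrapped call are illustrative:

// Generic exponential backoff around any async LLM call (illustrative only).
async function withBackoff<T>(
  call: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      // Wait 500ms, 1s, 2s, ... before retrying
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Example: retry a wrapped parallel call when the provider rate-limits it.
// const response = await withBackoff(() => parallelOpenAI.invoke(messages));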
