Overview

By default, Pipecat Cloud auto-scales your agents to reduce the likelihood of cold starts. For many applications, setting the --min-agents parameter to 1 is enough to avoid scaling from zero; Pipecat Cloud handles the rest.

However, for applications where traffic can fluctuate, you may need to plan for additional warm capacity to ensure your agents are always ready to respond immediately. For those cases, this guide will help you understand how warm capacity works in Pipecat Cloud, when you need to plan to use reserves, and how you can optimize your plan for both performance and cost.

Agent Types

Before discussing capacity planning, it’s important to understand the different types of agent instances in Pipecat Cloud:

  • Active Agents: Agent instances currently running and handling user sessions.
  • Idle Agents: When you start an active agent, Pipecat Cloud automatically provisions an additional idle agent to help with scaling. These take approximately 30 seconds to become available, though this time may vary based on system load and image size. You are not charged for these idle agents.
  • Reserved Agents: Agent instances maintained according to your --min-agents deployment setting, ensuring immediate availability regardless of current traffic.

Understanding Warm Capacity

Your “warm capacity” represents agent instances that are immediately available to handle new sessions without a cold start. Pipecat Cloud automatically manages this based on:

  1. Your configured reserved agents (min-agents)
  2. Your current active sessions
  3. Automatically provisioned idle agents

The system ensures you always have the following warm capacity available:

  • When Reserved > Active: Your warm capacity equals the number of Reserved Agents
  • When Active ≥ Reserved: Your warm capacity equals the number of Active Agents (through automatically provisioned idle agents)

To illustrate the point, here are a few scenarios showing how warm capacity works:

Reserved  Active  Warm Capacity
10        1       10
10        10      10
1         10      10

This shows how:

  • Reserved agents provide a guaranteed minimum warm capacity
  • As active sessions increase beyond your reserved count, your warm capacity grows to match through automatically provisioned idle agents
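The two rules above reduce to taking the larger of the reserved and active counts. A minimal sketch of that rule follows; the function name warm_capacity is hypothetical and is not part of any Pipecat Cloud API.

```python
def warm_capacity(reserved: int, active: int) -> int:
    """Warm capacity under the two rules above.

    Reserved > Active  -> warm capacity equals the reserved count.
    Active >= Reserved -> warm capacity equals the active count.
    Both cases are the same as taking the larger of the two values.
    """
    if reserved < 0 or active < 0:
        raise ValueError("agent counts must be non-negative")
    return max(reserved, active)


# Reproduce the three (reserved, active) scenarios from the table.
for reserved, active in [(10, 1), (10, 10), (1, 10)]:
    print(reserved, active, warm_capacity(reserved, active))
```

Each of the three scenarios yields a warm capacity of 10, matching the table.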

Agent Cooldown

When an active session ends, the agent instance behavior is governed by a cooldown period:

  • The agent instance remains available in your warm capacity pool for a 5-minute cooldown period
  • During this time, it can immediately serve another request without any cold start
  • After the 5-minute cooldown expires, if the agent instance hasn’t been used, it will be terminated
  • This cooldown provides a buffer in your warm capacity pool, helping to smooth transitions between traffic peaks
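The cooldown rule can be pictured as a simple time-window check. The sketch below is illustrative only: the names COOLDOWN and is_in_cooldown are hypothetical, and treating the instance as expired at exactly the 5-minute mark is an assumption, since the text does not define the boundary.

```python
from datetime import datetime, timedelta

# 5-minute cooldown period, as described above.
COOLDOWN = timedelta(minutes=5)


def is_in_cooldown(session_ended_at: datetime, now: datetime) -> bool:
    """Return True while an unused instance is still inside its cooldown window.

    Assumption: the window is half-open, so the instance counts as expired
    at exactly 5 minutes after its session ended.
    """
    elapsed = now - session_ended_at
    return timedelta(0) <= elapsed < COOLDOWN


ended = datetime(2025, 1, 1, 12, 0, 0)
print(is_in_cooldown(ended, ended + timedelta(minutes=2)))  # still warm
print(is_in_cooldown(ended, ended + timedelta(minutes=6)))  # terminated
```

An instance that picks up a new session during the window becomes an active agent again; that transition is outside the scope of this sketch.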

Planning for Traffic Patterns

To determine the optimal reserved agent count for your deployment, consider:

  1. Peak Concurrent Sessions: How many simultaneous sessions do you expect during peak periods?
  2. Growth Rate: How quickly do new sessions start during peak periods?
    • If new sessions start faster than the ~30-second warm-up time for auto-scaled agent instances, you’ll need more reserved capacity
  3. Cold Start Tolerance: How important is immediate response for your use case?

Completely avoiding cold starts may not be practical for all applications. We strongly recommend designing your application to tolerate longer or variable startup times. For phone use cases, consider playing a hold message; for web apps, consider showing a waiting state or message in the UI.

Cost-Efficient Scaling Strategies

  • Development/Testing: Use min-agents: 0 to minimize costs during development
  • Production Voice AI: Set min-agents to cover your baseline traffic to avoid cold starts
  • Time-Based Scaling: Consider modifying your reserved count for known high-traffic periods
  • Monitoring: Regularly review your warm capacity utilization to optimize your configuration

Note: A cold start typically takes around 10 seconds.

Calculating Optimal Reserved Agents

For production deployments where immediate response is essential, you can calculate your optimal reserved agent count using a growth rate approach:

Optimal Reserved = MAX(Baseline Sessions, CPS × Idle Creation Delay)

Where:

  • Baseline Sessions: Minimum concurrent sessions you typically maintain
  • CPS (Calls Per Second): Your expected session growth rate during peak periods
  • Idle Creation Delay: Time for new idle agents to become available (~30 seconds)

This formula addresses the fundamental challenge: “Can our warm capacity creation keep pace with our call growth rate?” By reserving capacity based on your growth rate and the idle creation delay, you ensure sufficient capacity is available during the critical period before auto-scaling can catch up.
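As a rough sketch, the formula can be written as a small helper. The function name optimal_reserved and its parameter names are hypothetical. Rounding the growth term up to a whole agent is an assumption, since the formula itself does not specify how to handle fractional results.

```python
import math


def optimal_reserved(
    baseline_sessions: int,
    cps: float,
    idle_creation_delay_s: float = 30.0,
) -> int:
    """Optimal Reserved = MAX(Baseline Sessions, CPS x Idle Creation Delay).

    Assumption: the growth term is rounded up to a whole agent. The product
    is rounded to 6 decimals first so floating-point noise (for example
    0.1 * 30 evaluating slightly above 3) does not add an extra agent.
    """
    if baseline_sessions < 0 or cps < 0 or idle_creation_delay_s < 0:
        raise ValueError("inputs must be non-negative")
    growth_buffer = math.ceil(round(cps * idle_creation_delay_s, 6))
    return max(baseline_sessions, growth_buffer)


# Baseline of 10 sessions with three different growth rates.
for cps in (1.0, 0.5, 0.1):
    print(cps, optimal_reserved(10, cps))
```

With a baseline of 10 and a 30-second delay, growth rates of 1.0, 0.5, and 0.1 calls per second give 30, 15, and 10 reserved agents respectively.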

Growth Rate Examples

Here’s how the formula works with different growth rates:

Scenario       Baseline  CPS                  Idle Creation Delay  Calculation        Optimal Reserved
High volume    10        1.0 (1 call/sec)     30s                  MAX(10, 1.0 × 30)  30
Medium volume  10        0.5 (1 call/2 sec)   30s                  MAX(10, 0.5 × 30)  15
Low volume     10        0.1 (1 call/10 sec)  30s                  MAX(10, 0.1 × 30)  10

Call Center Example

Consider a voice AI call center that:

  • Normally handles 10 concurrent calls (baseline)
  • During promotions, receives new calls at a rate of 1 call per second
  • Has a 30-second idle creation delay

Applying our formula:

Optimal Reserved = MAX(10, 1 × 30)
                 = MAX(10, 30)
                 = 30

With 30 reserved agents:

  • You can handle the growth rate of 1 call per second for the full 30 seconds until new idle agents start becoming available
  • This prevents any cold starts during the critical catch-up period
  • Auto-scaling will create additional idle agents to handle continued growth

Understanding the Growth Rate Approach

This approach works because:

  1. When your call volume starts increasing, you immediately begin consuming your warm capacity
  2. At the same time, each new active call triggers the creation of a new idle agent
  3. However, these new idle agents take ~30 seconds to become available
  4. Your reserved capacity must be sufficient to handle all calls during this 30-second “catch-up period”
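The four steps above can be checked with a toy simulation. This is a deliberately simplified model, not a description of how Pipecat Cloud schedules agents: it assumes a whole number of calls arrives every second, every call triggers exactly one idle agent that becomes warm after a fixed delay, and sessions never end during the simulated window. The function name count_cold_starts is hypothetical.

```python
from collections import deque


def count_cold_starts(
    reserved: int,
    calls_per_second: int,
    idle_creation_delay_s: int = 30,
    duration_s: int = 120,
) -> int:
    """Count calls that find no warm agent in a simplified one-second-tick model."""
    warm = reserved      # agents that can take a call immediately
    pending = deque()    # times at which newly created idle agents become warm
    cold_starts = 0

    for t in range(duration_s):
        # Idle agents whose creation delay has elapsed join the warm pool.
        while pending and pending[0] <= t:
            pending.popleft()
            warm += 1

        for _ in range(calls_per_second):
            if warm > 0:
                warm -= 1          # call is served from warm capacity
            else:
                cold_starts += 1   # no warm agent: this call cold-starts
            # Each new active call triggers creation of one idle agent.
            pending.append(t + idle_creation_delay_s)

    return cold_starts


print(count_cold_starts(reserved=30, calls_per_second=1))
print(count_cold_starts(reserved=10, calls_per_second=1))
```

Under these assumptions, 30 reserved agents absorb the whole 30-second catch-up period at 1 call per second, while 10 reserved agents leave the calls arriving between seconds 10 and 29 without warm capacity.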

Trading Cost for Performance

The formula provides a starting point, which you can adjust based on your specific needs:

  • Cost-sensitive: Use a lower reserved count and accept some cold starts during traffic spikes
  • Performance-sensitive: Use the calculated reserved count to avoid cold starts at your expected growth rate
  • Hybrid approach: Monitor your actual traffic patterns and adjust based on real-world performance

For most production voice AI applications, we recommend using at least the calculated optimal reserved agent count during business hours or peak usage periods, and then scaling down during off-hours to optimize costs.
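One way to reason about this recommendation is a small schedule helper that picks a reserved count by time of day. The sketch below is hypothetical throughout: the function name, the 09:00 to 18:00 business-hours window, and the example counts are illustrative assumptions. It only computes a target value; applying that value means updating your deployment's min-agents setting, which is not shown here.

```python
from datetime import time


def reserved_count_for(
    now: time,
    peak_reserved: int,
    off_hours_reserved: int,
    peak_start: time = time(9, 0),   # assumed start of business hours
    peak_end: time = time(18, 0),    # assumed end of business hours
) -> int:
    """Return the reserved agent count to target at the given local time.

    Assumes the peak window does not cross midnight.
    """
    if peak_start <= now < peak_end:
        return peak_reserved
    return off_hours_reserved


# Example: 30 reserved agents during business hours, 10 otherwise.
print(reserved_count_for(time(10, 30), peak_reserved=30, off_hours_reserved=10))
print(reserved_count_for(time(22, 0), peak_reserved=30, off_hours_reserved=10))
```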

Summary and Next Steps

Proper capacity planning ensures your agents are always ready to respond immediately, providing the best possible user experience while optimizing costs:

  1. Understand your traffic patterns: Baseline sessions, peak sessions, and growth rate
  2. Calculate your optimal reserved agents: Use the formula to determine the right level of warm capacity
  3. Monitor and adjust: Refine your capacity planning based on real-world performance and costs