Capacity Planning
Optimize your agent deployments to minimize cold starts
Overview
By default, Pipecat Cloud auto-scales your agents to reduce the likelihood of cold starts. For many applications, you can simply set the --min-agents parameter to 1 to avoid scaling from zero and let Pipecat Cloud handle the rest.
However, for applications where traffic can fluctuate, you may need to plan for additional warm capacity to ensure your agents are always ready to respond immediately. For those cases, this guide will help you understand how warm capacity works in Pipecat Cloud, when you need to plan to use reserves, and how you can optimize your plan for both performance and cost.
Agent Types
Before discussing capacity planning, it’s important to understand the different types of agent instances in Pipecat Cloud:
- Active Agents: Agent instances currently running and handling user sessions.
- Idle Agents: When you start an active agent, Pipecat Cloud automatically provisions an additional idle agent to help with scaling. These take approximately 30 seconds to become available, though this time may vary based on system load and image size. You are not charged for these idle agents.
- Reserved Agents: Agent instances maintained according to your --min-agents deployment setting, ensuring immediate availability regardless of current traffic.
Understanding Warm Capacity
Your “warm capacity” represents agent instances that are immediately available to handle new sessions without a cold start. Pipecat Cloud automatically manages this based on:
- Your configured reserved agents (min-agents)
- Your current active sessions
- Automatically provisioned idle agents
The system ensures you always have the following warm capacity available:
- When Reserved > Active: Your warm capacity equals the number of Reserved Agents
- When Active ≥ Reserved: Your warm capacity equals the number of Active Agents (through automatically provisioned idle agents)
To illustrate the point, here are a few scenarios showing how warm capacity works:
Reserved | Active | Warm Capacity |
---|---|---|
10 | 1 | 10 |
10 | 10 | 10 |
1 | 10 | 10 |
This shows how:
- Reserved agents provide a guaranteed minimum warm capacity
- As active sessions increase beyond your reserved count, your warm capacity grows to match through automatically provisioned idle agents
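The two rules above reduce to taking the larger of the reserved and active counts. A minimal sketch of that rule follows; the function name `warm_capacity` is hypothetical and is not part of any Pipecat Cloud API.

```python
def warm_capacity(reserved: int, active: int) -> int:
    """Warm capacity as described above: the larger of reserved and active agents.

    Hypothetical helper for illustration only, not a Pipecat Cloud API.
    """
    if reserved < 0 or active < 0:
        raise ValueError("agent counts must be non-negative")
    return max(reserved, active)


# Reproduce the three scenarios from the table above.
for reserved, active in [(10, 1), (10, 10), (1, 10)]:
    print(f"reserved={reserved} active={active} warm={warm_capacity(reserved, active)}")
```

All three scenarios evaluate to a warm capacity of 10, matching the table.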
Agent Cooldown
When an active session ends, the agent instance behavior is governed by a cooldown period:
- The agent instance remains available in your warm capacity pool for a 5-minute cooldown period
- During this time, it can immediately serve another request without any cold start
- After the 5-minute cooldown expires, if the agent instance hasn’t been used, it will be terminated
- This cooldown provides a buffer in your warm capacity pool, helping to smooth transitions between traffic peaks
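As a rough illustration of the cooldown rule, the sketch below checks whether an unused agent instance would still count toward warm capacity at a given moment. The function name and the treatment of the exact 5-minute boundary as expired are assumptions made for this example, not documented platform behavior.

```python
from datetime import datetime, timedelta

# Cooldown length taken from the text above.
COOLDOWN = timedelta(minutes=5)


def is_still_warm(session_ended_at: datetime, now: datetime,
                  cooldown: timedelta = COOLDOWN) -> bool:
    """Return True if an unused agent is still inside its cooldown window.

    Hypothetical helper: the instance is treated as terminated once the
    full cooldown has elapsed (the boundary handling is an assumption).
    """
    elapsed = now - session_ended_at
    return timedelta(0) <= elapsed < cooldown


ended = datetime(2025, 1, 1, 12, 0, 0)
print(is_still_warm(ended, ended + timedelta(minutes=2)))  # within the cooldown
print(is_still_warm(ended, ended + timedelta(minutes=6)))  # cooldown has expired
```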
Planning for Traffic Patterns
To determine the optimal reserved instance count for your deployment, consider:
- Peak Concurrent Sessions: How many simultaneous sessions do you expect during peak periods?
- Growth Rate: How quickly do new sessions start during peak periods?
  - If new sessions start faster than the ~30 second warm-up time for auto-scaled agent instances, you’ll need more reserved capacity
- Cold Start Tolerance: How important is immediate response for your use case?
Completely avoiding cold starts may not be practical for all applications. We strongly recommend designing your application to tolerate longer or variable startup times. For phone use cases, consider a hold message; for web apps, consider a waiting UX or message.
Cost-Efficient Scaling Strategies
- Development/Testing: Use min-agents: 0 to minimize costs during development
- Production Voice AI: Set min-agents to cover your baseline traffic to avoid cold starts
- Time-Based Scaling: Consider modifying your reserved count for known high-traffic periods
- Monitoring: Regularly review your warm capacity utilization to optimize your configuration
Calculating Optimal Reserved Agents
For production deployments where immediate response is essential, you can calculate your optimal reserved agent count using a growth rate approach:
Optimal Reserved Agents = MAX(Baseline Sessions, CPS × Idle Creation Delay)
Where:
- Baseline Sessions: Minimum concurrent sessions you typically maintain
- CPS (Calls Per Second): The rate at which new sessions start during peak periods
- Idle Creation Delay: Time for new idle agents to become available (~30 seconds)
This formula addresses the fundamental challenge: “Can our warm capacity creation keep pace with our call growth rate?” By reserving capacity based on your growth rate and the idle creation delay, you ensure sufficient capacity is available during the critical period before auto-scaling can catch up.
Growth Rate Examples
Here’s how the formula works with different growth rates:
Scenario | Baseline | CPS | Idle Creation Delay | Calculation | Optimal Reserved |
---|---|---|---|---|---|
High volume | 10 | 1.0 (1 call/sec) | 30s | MAX(10, 1.0 × 30) | 30 |
Medium volume | 10 | 0.5 (1 call/2sec) | 30s | MAX(10, 0.5 × 30) | 15 |
Low volume | 10 | 0.1 (1 call/10sec) | 30s | MAX(10, 0.1 × 30) | 10 |
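The calculation column in the table can be reproduced with a short helper. This is a sketch only: the function name is hypothetical, and rounding the growth term up to a whole agent is an assumption, since the table only shows whole-number results.

```python
import math


def optimal_reserved_agents(baseline_sessions: int,
                            calls_per_second: float,
                            idle_creation_delay_s: float = 30.0) -> int:
    """MAX(baseline, CPS x idle creation delay), rounded up to a whole agent.

    Hypothetical helper for illustration; not a Pipecat Cloud API.
    """
    # Round first to strip float noise (0.1 * 30 == 3.0000000000000004),
    # then round up, because a fraction of an agent cannot be reserved.
    growth_need = math.ceil(round(calls_per_second * idle_creation_delay_s, 6))
    return max(baseline_sessions, growth_need)


# The three scenarios from the table above.
for label, cps in [("High volume", 1.0), ("Medium volume", 0.5), ("Low volume", 0.1)]:
    print(label, optimal_reserved_agents(10, cps))
```

For the three rows above this yields 30, 15, and 10 respectively.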
Call Center Example
Consider a voice AI call center that:
- Normally handles 10 concurrent calls (baseline)
- During promotions, receives new calls at a rate of 1 call per second
- Has a 30-second idle creation delay
Applying our formula:
Optimal Reserved Agents = MAX(10, 1 × 30) = 30
With 30 reserved agents:
- You can handle the growth rate of 1 call per second for the full 30 seconds until new idle agents start becoming available
- This prevents any cold starts during the critical catch-up period
- Auto-scaling will create additional idle agents to handle continued growth
Understanding the Growth Rate Approach
This approach works because:
- When your call volume starts increasing, you immediately begin consuming your warm capacity
- At the same time, each new active call triggers the creation of a new idle agent
- However, these new idle agents take ~30 seconds to become available
- Your reserved capacity must be sufficient to handle all calls during this 30-second “catch-up period”
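The catch-up period can be explored with a small simulation. The sketch below is a deliberately simplified model, not a description of Pipecat Cloud internals: it steps in one-second ticks, assumes no sessions end during the window, assumes every new call triggers exactly one idle agent that becomes warm after the idle creation delay, and counts a call as a cold start when no warm agent is available. The function name and parameters are hypothetical.

```python
import math
from collections import defaultdict


def count_cold_starts(reserved: int,
                      calls_per_second: float,
                      idle_creation_delay_s: int = 30,
                      duration_s: int = 120) -> int:
    """Count calls that find no warm agent under the simplified model above."""
    warm = reserved               # warm agents available right now
    ready_at = defaultdict(int)   # second -> idle agents that become warm then
    cold_starts = 0

    for t in range(duration_s):
        # Idle agents whose creation delay has elapsed join the warm pool.
        warm += ready_at.pop(t, 0)

        # Whole calls arriving during this second (supports fractional rates).
        arrivals = (math.floor((t + 1) * calls_per_second)
                    - math.floor(t * calls_per_second))

        for _ in range(arrivals):
            if warm > 0:
                warm -= 1
            else:
                cold_starts += 1
            # Each new call triggers one idle agent, ready after the delay.
            ready_at[t + idle_creation_delay_s] += 1

    return cold_starts


print(count_cold_starts(reserved=30, calls_per_second=1.0))
print(count_cold_starts(reserved=15, calls_per_second=1.0))
```

Under these assumptions, 30 reserved agents absorb a sustained rate of 1 call per second with no cold starts, while 15 reserved agents leave the 15 calls arriving in the second half of the catch-up period without a warm agent.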
Trading Cost for Performance
The formula provides a starting point, which you can adjust based on your specific needs:
- Cost-sensitive: Use a lower reserved count and accept some cold starts during traffic spikes
- Performance-sensitive: Use the calculated reserved count to avoid cold starts at your expected growth rate
- Hybrid approach: Monitor your actual traffic patterns and adjust based on real-world performance
For most production voice AI applications, we recommend using at least the calculated optimal reserved agent count during business hours or peak usage periods, and then scaling down during off-hours to optimize costs.
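One way to reason about this schedule is a small lookup that returns the reserved count to target for a given hour. The sketch below is illustrative only: the function name, the 09:00 to 17:00 peak window, and the example counts are all assumptions, and actually applying the value (by updating your deployment's min-agents setting) is not shown.

```python
def reserved_for_hour(hour: int,
                      peak_reserved: int,
                      off_peak_reserved: int,
                      peak_start: int = 9,
                      peak_end: int = 17) -> int:
    """Pick a reserved-agent target for an hour of the day (0-23).

    Hypothetical helper: the peak window defaults are placeholder values.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    in_peak = peak_start <= hour < peak_end
    return peak_reserved if in_peak else off_peak_reserved


# Example: 30 reserved agents during the peak window, 10 otherwise.
for hour in (3, 9, 16, 17):
    print(hour, reserved_for_hour(hour, peak_reserved=30, off_peak_reserved=10))
```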
Summary and Next Steps
Proper capacity planning ensures your agents are always ready to respond immediately, providing the best possible user experience while optimizing costs:
- Understand your traffic patterns: Baseline sessions, peak sessions, and growth rate
- Calculate your optimal reserved agents: Use the formula to determine the right level of warm capacity
- Monitor and adjust: Refine your capacity planning based on real-world performance and costs