Capacity Planning

Overview

By default, Pipecat Cloud auto-scales your agent to reduce the likelihood of cold starts. For many applications, setting the --min-agents parameter to 1 is enough to avoid scaling from zero, and Pipecat Cloud handles the rest.

However, for applications where traffic can fluctuate, you may need to plan for additional warm capacity to ensure your agents are always ready to respond immediately. For those cases, this guide will help you understand how warm capacity works in Pipecat Cloud, when you need to plan to use reserves, and how you can optimize your plan for both performance and cost.

Agent Types

Before discussing capacity planning, it’s important to understand the different types of agent instances in Pipecat Cloud:

Active Agents: Agent instances currently running and handling user sessions.
Idle Agents: Agent instances that Pipecat Cloud automatically provisions to help with scaling. Each time you start an active agent, one additional idle agent is provisioned. Idle agents take approximately 30 seconds to become available, though this time may vary based on system load and image size. You are not charged for idle agents.
Reserved Agents: Agent instances maintained according to your --min-agents deployment setting, ensuring immediate availability regardless of current traffic.

Understanding Warm Capacity

Your “warm capacity” represents agent instances that are immediately available to handle new sessions without a cold start. Pipecat Cloud automatically manages this based on:

Your configured reserved agents (min-agents)
Your current active sessions
Automatically provisioned idle agents

The system ensures you always have the following warm capacity available:

When Reserved > Active: Your warm capacity equals the number of Reserved Agents
When Active ≥ Reserved: Your warm capacity equals the number of Active Agents (through automatically provisioned idle agents)

To illustrate the point, here are a few scenarios showing how warm capacity works:

Reserved	Active	Warm Capacity
10	1	10
10	10	10
1	10	10

This shows how:

Reserved agents provide a guaranteed minimum warm capacity
As active sessions increase beyond your reserved count, your warm capacity grows to match through automatically provisioned idle agents
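
The two rules above reduce to taking the larger of the two counts. Here is a minimal Python sketch of that rule. The function name `warm_capacity` is illustrative only and is not part of any Pipecat Cloud API.

```python
def warm_capacity(reserved: int, active: int) -> int:
    """Warm capacity per the rules above: the larger of reserved and active."""
    if reserved < 0 or active < 0:
        raise ValueError("agent counts must be non-negative")
    return max(reserved, active)


# Rows from the scenario table above, as (reserved, active) pairs.
for reserved, active in [(10, 1), (10, 10), (1, 10)]:
    print(reserved, active, warm_capacity(reserved, active))
```

Each printed row should match the Warm Capacity column of the table.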

Agent Cooldown

When an active session ends, the agent instance behavior is governed by a cooldown period:

The agent instance remains available in your warm capacity pool for a 5-minute cooldown period
During this time, it can immediately serve another request without any cold start
After the 5-minute cooldown expires, if the agent instance hasn’t been used, it will be terminated
This cooldown provides a buffer in your warm capacity pool, helping to smooth transitions between traffic peaks
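
The cooldown rules can be sketched as a small state check. This is an illustration of the behavior described above, not Pipecat Cloud's implementation. The function name, the state labels, and the treatment of the exact 5-minute boundary are assumptions.

```python
COOLDOWN_SECONDS = 5 * 60  # the 5-minute cooldown described above


def instance_state(seconds_since_session_end: float, reused: bool = False) -> str:
    """Classify an agent instance after its session has ended.

    Assumption: the instance counts as terminated once the full
    cooldown has elapsed (the boundary case is not specified).
    """
    if seconds_since_session_end < 0:
        raise ValueError("elapsed time must be non-negative")
    if reused:
        # The instance picked up a new session during the cooldown.
        return "active"
    if seconds_since_session_end < COOLDOWN_SECONDS:
        # Still in the warm pool: it can serve a request with no cold start.
        return "warm"
    return "terminated"


print(instance_state(60))                # within the cooldown
print(instance_state(60, reused=True))   # reused before the cooldown expired
print(instance_state(301))               # cooldown expired without reuse
```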

Planning for Traffic Patterns

To determine the optimal reserved instance count for your deployment, consider:

Peak Concurrent Sessions: How many simultaneous sessions do you expect during peak periods?
Growth Rate: How quickly do new sessions start during peak periods?
- If new sessions start faster than the ~30 second warm-up time for auto-scaled agent instances, you’ll need more reserved capacity
Cold Start Tolerance: How important is immediate response for your use case?

Completely avoiding cold starts may not be practical for all applications. We strongly recommend designing your application to tolerate longer or variable startup times. For phone use cases, consider a hold message. For web apps, consider a waiting UX or message.

Cost-Efficient Scaling Strategies

Development/Testing: Use min-agents: 0 to minimize costs during development
Production Voice AI: Set min-agents to cover your baseline traffic to avoid cold starts
Time-Based Scaling: Consider modifying your reserved count for known high-traffic periods
Monitoring: Regularly review your warm capacity utilization to optimize your configuration

A cold start typically takes around 10 seconds.

Calculating Optimal Reserved Agents

For production deployments where immediate response is essential, you can calculate your optimal reserved agent count using a growth rate approach:

Optimal Reserved = MAX(Baseline Sessions, CPS × Idle Creation Delay)

Where:

Baseline Sessions: Minimum concurrent sessions you typically maintain
CPS (Calls Per Second): The rate at which new sessions start during peak periods, in sessions per second
Idle Creation Delay: Time for new idle agents to become available (~30 seconds)

This formula addresses the fundamental challenge: “Can our warm capacity creation keep pace with our call growth rate?” By reserving capacity based on your growth rate and the idle creation delay, you ensure sufficient capacity is available during the critical period before auto-scaling can catch up.
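
As a rough sketch, the formula can be written as a small Python helper. The function and parameter names are illustrative. Rounding a fractional result up to a whole agent is an assumption, since the formula above does not say how to handle fractions.

```python
import math


def optimal_reserved(
    baseline_sessions: int,
    cps: float,
    idle_creation_delay_s: float = 30.0,
) -> int:
    """Optimal Reserved = MAX(Baseline Sessions, CPS x Idle Creation Delay)."""
    if baseline_sessions < 0 or cps < 0 or idle_creation_delay_s < 0:
        raise ValueError("inputs must be non-negative")
    # Sessions that start before the first new idle agents become available.
    # round() guards against floating-point noise such as 0.1 * 30 = 3.0000000000000004.
    catch_up_sessions = math.ceil(round(cps * idle_creation_delay_s, 9))
    return max(baseline_sessions, catch_up_sessions)


print(optimal_reserved(10, 1.0))  # baseline 10, 1 call/sec
print(optimal_reserved(10, 0.5))  # baseline 10, 1 call every 2 sec
print(optimal_reserved(10, 0.1))  # baseline 10, 1 call every 10 sec
```

The default delay of 30 seconds reflects the approximate figure given above. If your image takes longer to become available, pass your own measured value.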

Growth Rate Examples

Here’s how the formula works with different growth rates:

Scenario	Baseline	CPS	Idle Creation Delay	Calculation	Optimal Reserved
High volume	10	1.0 (1 call/sec)	30s	MAX(10, 1.0 × 30)	30
Medium volume	10	0.5 (1 call/2sec)	30s	MAX(10, 0.5 × 30)	15
Low volume	10	0.1 (1 call/10sec)	30s	MAX(10, 0.1 × 30)	10

Call Center Example

Consider a voice AI call center that:

Normally handles 10 concurrent calls (baseline)
During promotions, receives new calls at a rate of 1 call per second
Has a 30-second idle creation delay

Applying our formula:

Optimal Reserved = MAX(10, 1 × 30)
                 = MAX(10, 30)
                 = 30

With 30 reserved agents:

You can handle the growth rate of 1 call per second for the full 30 seconds until new idle agents start becoming available
This prevents any cold starts during the critical catch-up period
Auto-scaling will create additional idle agents to handle continued growth

Understanding the Growth Rate Approach

This approach works because:

When your call volume starts increasing, you immediately begin consuming your warm capacity
At the same time, each new active call triggers the creation of a new idle agent
However, these new idle agents take ~30 seconds to become available
Your reserved capacity must be sufficient to handle all calls during this 30-second “catch-up period”
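
To make the catch-up period concrete, here is a deliberately simplified simulation of the four steps above. It is a sketch under stated assumptions, not a model of Pipecat Cloud's actual scheduler. It assumes calls arrive at a perfectly steady rate, no calls end during the simulated window, the pool starts with only the reserved agents warm, and every new call (including one that cold-starts) triggers exactly one idle agent that becomes ready after a fixed delay. All names are hypothetical.

```python
from collections import deque


def simulate_cold_starts(
    reserved: int,
    cps: float,
    idle_delay_s: float = 30.0,
    duration_s: float = 120.0,
) -> int:
    """Count the calls that find no warm agent during a steady traffic ramp."""
    if cps <= 0:
        raise ValueError("cps must be positive")
    warm = reserved       # agents ready to take a call right now
    pending = deque()     # times at which provisioned idle agents become ready
    cold_starts = 0

    for k in range(int(duration_s * cps)):
        t = k / cps       # arrival time of the k-th call
        # Move idle agents that have finished provisioning into the warm pool.
        while pending and pending[0] <= t:
            pending.popleft()
            warm += 1
        if warm > 0:
            warm -= 1     # the call is served by a warm agent
        else:
            cold_starts += 1
        # Each new active call triggers the creation of one idle agent.
        pending.append(t + idle_delay_s)

    return cold_starts


print(simulate_cold_starts(reserved=30, cps=1.0))
print(simulate_cold_starts(reserved=10, cps=1.0))
```

Under these assumptions, reserving CPS × Idle Creation Delay agents covers the whole catch-up window. A smaller reserve leaves the shortfall to cold-start before the first idle agents become ready.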

Trading Cost for Performance

The formula provides a starting point, which you can adjust based on your specific needs:

Cost-sensitive: Use a lower reserved count and accept some cold starts during traffic spikes
Performance-sensitive: Use the calculated reserved count to avoid cold starts at your expected growth rate
Hybrid approach: Monitor your actual traffic patterns and adjust based on real-world performance

For most production voice AI applications, we recommend using at least the calculated optimal reserved agent count during business hours or peak usage periods, and then scaling down during off-hours to optimize costs.
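
To see what scaling down during off-hours saves, you can compare reserved agent-hours per day. The sketch below counts agent-hours only and does not compute a price, since this guide does not state billing rates. The function name and the 10-hour peak window are hypothetical, while the counts of 30 and 10 come from the call center example above.

```python
def reserved_agent_hours(
    peak_reserved: int,
    off_peak_reserved: int,
    peak_hours_per_day: float,
) -> float:
    """Reserved agent-hours per day for a two-level schedule."""
    if not 0 <= peak_hours_per_day <= 24:
        raise ValueError("peak_hours_per_day must be between 0 and 24")
    off_peak_hours = 24 - peak_hours_per_day
    return peak_reserved * peak_hours_per_day + off_peak_reserved * off_peak_hours


# Keep 30 reserved agents around the clock.
always_on = reserved_agent_hours(30, 30, 24)
# Keep 30 reserved for a hypothetical 10-hour peak window, 10 otherwise.
scheduled = reserved_agent_hours(30, 10, 10)

print(always_on, scheduled, always_on - scheduled)
```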

Summary and Next Steps

Proper capacity planning ensures your agents are always ready to respond immediately, providing the best possible user experience while optimizing costs:

Understand your traffic patterns: Baseline sessions, peak sessions, and growth rate
Calculate your optimal reserved agents: Use the formula to determine the right level of warm capacity
Monitor and adjust: Refine your capacity planning based on real-world performance and costs

Explore Scaling Options

Learn more about scaling configuration options and deployment strategies.

Deploy Your Agent

Apply your capacity planning knowledge to your agent deployment.
