Optimize your agent deployments to minimize cold starts
By default, Pipecat Cloud auto-scales your agent to help minimize the likelihood that your agents experience a cold start. For many applications, you can simply set the --min-agents parameter to 1 to avoid scaling from zero and let Pipecat Cloud handle the rest.
However, for applications where traffic fluctuates, you may need to plan for additional warm capacity so that your agents are ready to respond immediately. For those cases, this guide explains how warm capacity works in Pipecat Cloud, when you need to plan for reserved agents, and how to optimize your plan for both performance and cost.
Before discussing capacity planning, it’s important to understand the different types of agent instances in Pipecat Cloud:
Reserved agents: agent instances kept running according to your --min-agents deployment setting, ensuring immediate availability regardless of current traffic.
Your “warm capacity” represents agent instances that are immediately available to handle new sessions without a cold start. Pipecat Cloud automatically manages this based on:
Your reserved agent count (min-agents)
The system ensures you always have the following warm capacity available:
To illustrate the point, here are a few scenarios showing how warm capacity works:
Reserved Agents | Active Sessions | Warm Capacity |
---|---|---|
10 | 1 | 10 |
10 | 10 | 10 |
1 | 10 | 10 |
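In every row of the table, warm capacity equals the larger of the reserved and active counts. The sketch below encodes that reading so you can sanity-check your own numbers. Treat MAX(reserved, active) as our interpretation of these examples, not a documented Pipecat Cloud rule. The `warm_capacity` helper is hypothetical.

```python
def warm_capacity(reserved: int, active: int) -> int:
    """Warm capacity, ASSUMING warm = MAX(reserved, active).

    This rule is inferred from the example table, not taken from
    Pipecat Cloud's documentation or API.
    """
    if reserved < 0 or active < 0:
        raise ValueError("agent counts must be non-negative")
    return max(reserved, active)


# (reserved, active, warm capacity) rows from the example table
EXAMPLES = [(10, 1, 10), (10, 10, 10), (1, 10, 10)]

for reserved, active, expected in EXAMPLES:
    assert warm_capacity(reserved, active) == expected
```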
This shows how:
When an active session ends, the agent instance behavior is governed by a cooldown period:
To determine the optimal reserved instance count for your deployment, consider:
Peak Concurrent Sessions: How many simultaneous sessions do you expect during peak periods?
Growth Rate: How quickly do new sessions start during peak periods?
Cold Start Tolerance: How important is immediate response for your use case?
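To turn "how quickly do new sessions start" into a number, you can measure the busiest stretch of session starts in your own traffic history. This minimal sketch assumes you have session start timestamps, in seconds, from your own logs. Both the input format and the `peak_cps` helper are hypothetical. It returns the highest average number of starts per second over any window of the given length. We suggest a window equal to the idle creation delay used in the growth-rate formula, since that is the period your reserves must cover.

```python
from bisect import bisect_left


def peak_cps(start_times_s: list[float], window_s: float = 30.0) -> float:
    """Highest average session-start rate (calls/sec) over any window_s window.

    start_times_s: session start timestamps in seconds (hypothetical input,
    e.g. exported from your own call logs).
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    times = sorted(start_times_s)
    busiest = 0
    for i, t in enumerate(times):
        # Number of starts in the half-open window [t, t + window_s).
        j = bisect_left(times, t + window_s, lo=i)
        busiest = max(busiest, j - i)
    return busiest / window_s


# Example: a burst of one call per second for 30 s, then sparse traffic.
burst = [float(s) for s in range(30)] + [100.0, 200.0, 300.0]
print(peak_cps(burst, window_s=30.0))  # 1.0
```

The busiest window always starts at some session's start time, so checking only those starting points is enough.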
Completely avoiding cold starts may not be practical for all applications. We strongly recommend accounting for longer or variable start-up times when building your application. For phone use cases, consider playing a hold message; for web apps, consider a waiting UX or message.
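During development, set min-agents: 0 to minimize costs.
Set min-agents to cover your baseline traffic to avoid cold starts.
For production deployments where immediate response is essential, you can calculate your optimal reserved agent count using a growth rate approach:
Optimal Reserved = MAX(Baseline, CPS × Idle Creation Delay)
Where:
Baseline: the reserved agent count that covers your normal, baseline traffic
CPS: calls per second, the rate at which new sessions start during peak periods
Idle Creation Delay: the time it takes for newly created warm capacity to become available (30 seconds in the examples below)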
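This formula addresses the fundamental challenge: “Can our warm capacity creation keep pace with our call growth rate?” If new calls arrive at CPS calls per second and new warm capacity takes Idle Creation Delay seconds to come online, roughly CPS × Idle Creation Delay calls can arrive before that capacity is ready, so your reserves need to cover at least that many. Reserving capacity based on your growth rate and the idle creation delay ensures sufficient capacity is available during the critical period before auto-scaling can catch up.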
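The question above can be checked with a toy model. For illustration only, this sketch assumes that calls arrive at a steady rate. It also assumes that every new session triggers one replacement warm agent, which becomes available Idle Creation Delay seconds later. Pipecat Cloud's real auto-scaler is not modeled here, and `count_cold_starts` is a hypothetical helper. Under these assumptions, calls find no warm agent only when reserved agents fall short of CPS × Idle Creation Delay.

```python
import heapq
from fractions import Fraction


def count_cold_starts(reserved: int, cps: Fraction, delay_s: int, duration_s: int) -> int:
    """Toy model: count calls that arrive when no warm agent is available.

    ASSUMPTIONS (not Pipecat Cloud's actual scheduler):
    - calls arrive every 1/cps seconds, starting at t = 0
    - each call, warm or cold, triggers one replacement warm agent
      that becomes available delay_s seconds later
    """
    idle = reserved                 # warm agents ready right now
    ready_at: list[Fraction] = []   # min-heap of replacement ready times
    cold = 0
    interval = 1 / cps              # exact Fraction, avoids float drift
    t = Fraction(0)
    while t < duration_s:
        # Bring online replacements whose creation delay has elapsed.
        while ready_at and ready_at[0] <= t:
            heapq.heappop(ready_at)
            idle += 1
        if idle > 0:
            idle -= 1               # warm start
        else:
            cold += 1               # no warm agent: cold start
        heapq.heappush(ready_at, t + delay_s)
        t += interval
    return cold


# High-volume scenario: 1 call/sec, 30 s idle creation delay.
print(count_cold_starts(10, Fraction(1), 30, 120))  # 20
print(count_cold_starts(30, Fraction(1), 30, 120))  # 0
```

With 10 reserved agents, the 20 calls arriving between seconds 10 and 29 find no warm agent. From second 30 onward, replacements arrive as fast as calls do. With 30 reserved agents, the pool never runs dry.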
Here’s how the formula works with different growth rates:
Scenario | Baseline | CPS | Idle Creation Delay | Calculation | Optimal Reserved |
---|---|---|---|---|---|
High volume | 10 | 1.0 (1 call/sec) | 30s | MAX(10, 1.0 × 30) | 30 |
Medium volume | 10 | 0.5 (1 call/2sec) | 30s | MAX(10, 0.5 × 30) | 15 |
Low volume | 10 | 0.1 (1 call/10sec) | 30s | MAX(10, 0.1 × 30) | 10 |
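The Calculation column can be reproduced in a few lines. This minimal sketch implements MAX(Baseline, CPS × Idle Creation Delay) as `optimal_reserved`, which is a hypothetical helper name. Rounding the growth term up to a whole agent is our addition. It matters only when CPS × Idle Creation Delay is fractional; every example in the table produces a whole number.

```python
import math


def optimal_reserved(baseline: int, cps: float, idle_creation_delay_s: float) -> int:
    """Growth-rate formula: MAX(Baseline, CPS × Idle Creation Delay).

    Rounds the growth term up to a whole agent (our choice).
    """
    # Calls expected to arrive before new warm capacity comes online.
    # round() first so float noise such as 0.1 * 30 == 3.0000000000000004
    # is not rounded up to 4.
    growth_term = math.ceil(round(cps * idle_creation_delay_s, 6))
    return max(baseline, growth_term)


# Scenarios from the table (baseline 10, 30 s idle creation delay).
for name, cps in [("High volume", 1.0), ("Medium volume", 0.5), ("Low volume", 0.1)]:
    print(name, optimal_reserved(10, cps, 30))
# High volume 30
# Medium volume 15
# Low volume 10
```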
Consider a voice AI call center that:
Applying our formula:
With 30 reserved agents:
This approach works because:
The formula provides a starting point, which you can adjust based on your specific needs:
For most production voice AI applications, we recommend using at least the calculated optimal reserved agent count during business hours or peak usage periods, and then scaling down during off-hours to optimize costs.
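One way to apply this is to compute a target min-agents value for the current time, then feed it into however you manage deployments. This is a sketch under stated assumptions. The business hours (09:00 to 17:00, Monday through Friday) and the `reserved_for` helper are hypothetical placeholders. The function only computes the number; it does not call any Pipecat Cloud API or CLI.

```python
from datetime import datetime, time


def reserved_for(
    now: datetime,
    peak_reserved: int,
    off_hours_reserved: int,
    open_at: time = time(9, 0),
    close_at: time = time(17, 0),
) -> int:
    """Return a min-agents target: peak count in business hours, lower otherwise.

    Business hours (Mon-Fri, open_at to close_at) are placeholder assumptions.
    """
    is_weekday = now.weekday() < 5  # Monday == 0 ... Friday == 4
    in_hours = is_weekday and open_at <= now.time() < close_at
    return peak_reserved if in_hours else off_hours_reserved


# 2024-01-08 was a Monday; 2024-01-06 was a Saturday.
print(reserved_for(datetime(2024, 1, 8, 10, 0), 30, 10))  # 30
print(reserved_for(datetime(2024, 1, 8, 20, 0), 30, 10))  # 10
print(reserved_for(datetime(2024, 1, 6, 10, 0), 30, 10))  # 10
```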
Proper capacity planning helps keep your agents ready to respond immediately, providing the best possible user experience while optimizing costs.