Quotas and limits reference
Azure uses quotas and limits to prevent budget overruns due to fraud and to honor Azure capacity constraints. Consider these limits as you scale for production workloads. The following sections provide a quick guide to the default quotas and limits that apply to the Azure AI model inference service in Azure AI services.

Resource limits
| Limit name | Limit value |
|---|---|
| Azure AI services resources per region per Azure subscription | 30 |
| Max deployments per resource | 32 |
Rate limits
| Limit name | Applies to | Limit value |
|---|---|---|
| Tokens per minute | Azure OpenAI models | Varies per model and SKU. See limits for Azure OpenAI. |
| Requests per minute | Azure OpenAI models | Varies per model and SKU. See limits for Azure OpenAI. |
| Tokens per minute | DeepSeek-R1, DeepSeek-V3-0324 | 5,000,000 |
| Requests per minute | DeepSeek-R1, DeepSeek-V3-0324 | 5,000 |
| Concurrent requests | DeepSeek-R1, DeepSeek-V3-0324 | 300 |
| Tokens per minute | Rest of models | 400,000 |
| Requests per minute | Rest of models | 1,000 |
| Concurrent requests | Rest of models | 300 |
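To see how these defaults interact, a quick back-of-the-envelope check of whether a planned workload stays under the "rest of models" defaults might look like the following sketch. The average tokens per request and target request rate are illustrative numbers, not measured values:

```python
# Defaults for "rest of models" from the rate limits table above.
TPM_LIMIT = 400_000   # tokens per minute
RPM_LIMIT = 1_000     # requests per minute

# Illustrative workload assumptions (replace with your own measurements).
avg_tokens_per_request = 1_200      # prompt + completion tokens
target_requests_per_minute = 250

# Token throughput is requests/minute times average tokens per request.
needed_tpm = avg_tokens_per_request * target_requests_per_minute

print(f"Requests/min: {target_requests_per_minute} (limit {RPM_LIMIT})")
print(f"Tokens/min:   {needed_tpm} (limit {TPM_LIMIT})")
print("Fits within default limits:",
      target_requests_per_minute <= RPM_LIMIT and needed_tpm <= TPM_LIMIT)
```

Note that both limits apply at once: a workload can stay well under the request limit and still be throttled on tokens if individual requests are large.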
Other limits
| Limit name | Limit value |
|---|---|
| Max number of custom headers in API requests | 10 |
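As an illustration of the custom-header limit above, the sketch below prepares (but does not send) a request carrying several custom headers. The endpoint URL, header names, API key header, and payload shape are assumptions for illustration, not the documented API surface:

```python
import requests

MAX_CUSTOM_HEADERS = 10  # limit from the table above

# Hypothetical custom headers; stay at or under the limit of 10.
custom_headers = {f"x-my-header-{i}": str(i) for i in range(MAX_CUSTOM_HEADERS)}

# Hypothetical endpoint, deployment name, and auth header for illustration.
req = requests.Request(
    "POST",
    "https://example-resource.services.ai.azure.com/models/chat/completions",
    headers={"api-key": "<your-key>", **custom_headers},
    json={"model": "<deployment-name>",
          "messages": [{"role": "user", "content": "Hi"}]},
).prepare()  # prepared locally; nothing is sent

print(len(custom_headers), "custom headers attached")
```

Requests that exceed the header limit may be rejected, so it is worth counting custom headers in client code before sending.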
Usage tiers
Global Standard deployments use Azure's global infrastructure, dynamically routing customer traffic to the data center with the best availability for the customer's inference requests. This enables more consistent latency for customers with low to medium levels of traffic. Customers with high sustained levels of usage might see more variability in response latency. The usage limit determines the level of usage above which customers might see larger variability in response latency. A customer's usage is defined per model and is the total tokens consumed across all deployments in all subscriptions in all regions for a given tenant.

Request increases to the default limits
Limit increase requests can be submitted and are evaluated per request. Open an online customer support request. When requesting an endpoint limit increase, provide the following information:
- When opening the support request, select Service and subscription limits (quotas) as the Issue type.
- Select the subscription of your choice.
- Select Cognitive Services as Quota type.
- Select Next.
- On the Additional details tab, provide detailed reasons for the limit increase so that your request can be processed. Be sure to add the following information to the reason for limit increase:
- Model name, model version (if applicable), and deployment type (SKU).
- Description of your scenario and workload.
- Rationale for the requested increase.
  - The target throughput: tokens per minute, requests per minute, and so on.
  - Your planned timeline (by when you need the increased limits).
- Finally, select Save and continue to submit your request.
General best practices to remain within rate limits
To minimize issues related to rate limits, it's a good idea to use the following techniques:
- Implement retry logic in your application.
- Avoid sharp changes in the workload. Increase the workload gradually.
- Test different load increase patterns.
- Increase the quota assigned to your deployment. Move quota from another deployment, if necessary.
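The retry technique above can be sketched as a helper with exponential backoff and jitter. `RateLimitError` and the `send` callable are placeholders for whatever your client library actually raises and calls when a request is throttled (HTTP 429):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for whatever 429 (too many requests) error your client raises."""

def call_with_retries(send, max_retries=5, base_delay=1.0):
    """Call `send` (a zero-argument callable that performs the inference
    request) and retry on RateLimitError with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries; surface the throttling error
            # Backoff grows 1x, 2x, 4x, ... of base_delay; the random
            # jitter spreads out concurrent clients so they don't all
            # retry in lockstep and hit the limit again together.
            time.sleep(base_delay * (2 ** attempt + random.random()))
```

For example, `call_with_retries(lambda: client.complete(payload))` would retry a throttled call a few times before giving up, which pairs well with increasing the workload gradually rather than in sharp bursts.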
Next steps
- Learn more about the models available in Azure AI Foundry Models