04/02/2026 | Press release | Distributed by Public on 04/02/2026 13:48
Today, we are adding two new service tiers to the Gemini API: Flex and Priority. These new options give you granular control over cost and reliability through a single, unified interface.
As AI evolves from simple chat into complex, autonomous agents, developers typically have to manage two distinct types of logic: latency-tolerant background jobs and latency-sensitive interactive requests.
Until now, supporting both meant splitting your architecture between standard synchronous serving and the asynchronous Batch API. Flex and Priority help to bridge this gap. You can now route background jobs to Flex and interactive jobs to Priority, both using standard synchronous endpoints. This eliminates the complexity of async job management while giving you the economic and performance benefits of specialized tiers.
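The routing described above can be sketched as a small helper that picks a tier per request instead of maintaining separate synchronous and batch pipelines. This is an illustrative sketch: the tier names come from this announcement, but the helper and the workload flag are hypothetical.

```python
# Illustrative routing sketch: choose a service tier per request rather
# than splitting the architecture between sync serving and a batch API.
# The tier names ("flex", "priority") are from this announcement; the
# helper itself and the latency_sensitive flag are hypothetical.

def service_tier_for(latency_sensitive: bool) -> str:
    """Choose the Gemini API service tier for a synchronous request."""
    # Interactive traffic goes to Priority; background jobs go to Flex.
    return "priority" if latency_sensitive else "flex"

print(service_tier_for(False))  # a background job routes to "flex"
```

Both paths hit the same synchronous endpoint; only the tier label differs.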
Flex Inference is our new cost-optimized tier, designed for latency-tolerant workloads without the overhead of batch processing.
Get started fast by simply configuring the service_tier parameter in your request:
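A minimal sketch of such a request against the public REST `GenerateContent` endpoint. The `service_tier` field name comes from this announcement; the model name, the prompt, and exactly where the field sits in the request body are illustrative assumptions.

```python
# Hedged sketch of a synchronous GenerateContent request on the Flex tier.
# Assumptions: the model name, the prompt, and the placement of
# `service_tier` in the body (the field name itself is from this
# announcement).
import json
import os
import urllib.request

MODEL = "gemini-2.5-flash"  # illustrative model name
API_KEY = os.environ.get("GEMINI_API_KEY", "")

# Request body for a latency-tolerant background job.
body = {
    "contents": [{"parts": [{"text": "Summarize this archived report."}]}],
    "service_tier": "flex",  # cost-optimized, latency-tolerant tier
}

def generate(payload: dict) -> dict:
    """POST the payload to the GenerateContent endpoint and return JSON."""
    url = (
        "https://generativelanguage.googleapis.com/v1beta/"
        f"models/{MODEL}:generateContent?key={API_KEY}"
    )
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

print(json.dumps(body, indent=2))
```

With a valid `GEMINI_API_KEY` set, calling `generate(body)` performs the request synchronously; no batch job submission or polling is involved.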
The Flex tier is available on all paid tiers for `GenerateContent` and Interactions API requests.
The new Priority Inference tier offers our highest level of assurance at a premium price point, helping ensure your most important traffic is not preempted, even during peak platform usage.
To use Priority Inference, simply set the service_tier parameter accordingly:
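A minimal sketch of the request body with the Priority tier selected. The `service_tier` field name is from this announcement; its exact position in the body, and the prompt, are assumptions.

```python
# Hedged sketch: the same synchronous request shape, routed to Priority.
# The `service_tier` field name comes from this announcement; its exact
# position in the request body and the prompt are assumptions.
priority_body = {
    "contents": [{"parts": [{"text": "Answer this live customer query."}]}],
    "service_tier": "priority",  # highest assurance; not preempted at peak
}
print(priority_body["service_tier"])
```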
Priority Inference is available to users with Tier 2 and Tier 3 paid projects on the `GenerateContent` and Interactions API endpoints.
Visit the Gemini API documentation to see the full pricing breakdown and start optimizing your production tiers today. To see it in action, check out the cookbook for runnable code examples.