Google has released Gemini 3.1 Flash-Lite, the most cost-efficient entry in the Gemini 3 model series. Designed for ‘intelligence at scale,’ this model is optimized for high-volume tasks where low latency and cost-per-token are the primary engineering constraints. It is currently available in Public Preview via the Gemini API (Google AI Studio) and Vertex AI.
https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite/
Core Feature: Variable ‘Thinking Levels’
A significant architectural update in the 3.1 series is the introduction of Thinking Levels. This feature lets developers programmatically adjust the model’s reasoning depth to match the complexity of each request.
By selecting between Minimal, Low, Medium, or High thinking levels, you can optimize the trade-off between latency and logical accuracy.
- Minimal/Low: Ideal for high-throughput, low-latency tasks such as classification, basic sentiment analysis, or simple data extraction.
- Medium/High: Utilizes Deep Think Mini logic to handle complex instruction-following, multi-step reasoning, and structured data generation.
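As a sketch of how level selection might look in practice (the article shows no code, and the `thinkingLevel` field name used below is an assumption about the preview API schema, not something the announcement specifies), each task can be routed to a level when the request payload is built:

```python
# Sketch: route each task to a thinking level when building a
# generateContent-style request payload. The "thinkingLevel" field name
# is an assumption; check the Gemini API preview docs for the exact schema.
VALID_LEVELS = {"minimal", "low", "medium", "high"}

def build_request(prompt: str, thinking_level: str = "low") -> dict:
    """Build a request payload with an explicit reasoning depth."""
    if thinking_level not in VALID_LEVELS:
        raise ValueError(f"unknown thinking level: {thinking_level}")
    return {
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "generationConfig": {
            # Maps to the Minimal/Low/Medium/High levels described above.
            "thinkingConfig": {"thinkingLevel": thinking_level},
        },
    }

# Low latency for classification, deeper reasoning for multi-step planning:
fast = build_request("Classify sentiment: 'Great battery life.'", "minimal")
deep = build_request("Plan a three-step data migration.", "high")
```

The point of the pattern is that the level is a per-request knob, so a single deployment can serve both high-throughput extraction traffic and occasional multi-step reasoning without switching models.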
Performance and Efficiency Benchmarks
Gemini 3.1 Flash-Lite is designed to replace Gemini 2.5 Flash for production workloads that require faster inference without sacrificing output quality. The model achieves a 2.5x faster Time to First Token (TTFT) and a 45% increase in overall output speed compared to its predecessor.
On the GPQA Diamond benchmark—a measure of expert-level reasoning—Gemini 3.1 Flash-Lite scored 86.9%, matching or exceeding the quality of larger models in the previous generation while operating at a significantly lower computational cost.
Comparison Table: Gemini 3.1 Flash-Lite vs. Gemini 2.5 Flash
| Metric | Gemini 2.5 Flash | Gemini 3.1 Flash-Lite |
| --- | --- | --- |
| Input Cost (per 1M tokens) | Higher | $0.25 |
| Output Cost (per 1M tokens) | Higher | $1.50 |
| TTFT Speed | Baseline | 2.5x Faster |
| Output Throughput | Baseline | 45% Faster |
| Reasoning (GPQA Diamond) | Competitive | 86.9% |
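At the listed rates ($0.25 per 1M input tokens, $1.50 per 1M output tokens), workload cost is straightforward to estimate; for example, a batch job consuming 10M input and 2M output tokens:

```python
# Cost estimate at the Gemini 3.1 Flash-Lite preview rates quoted above.
INPUT_RATE_PER_M = 0.25   # USD per 1M input tokens
OUTPUT_RATE_PER_M = 1.50  # USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for a given token count."""
    return (input_tokens / 1_000_000 * INPUT_RATE_PER_M
            + output_tokens / 1_000_000 * OUTPUT_RATE_PER_M)

# 10M input + 2M output tokens -> $2.50 + $3.00 = $5.50
print(estimate_cost(10_000_000, 2_000_000))
```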
Technical Use Cases for Production
The 3.1 Flash-Lite model is specifically tuned for workloads that involve complex structures and long-sequence logic:
- UI and Dashboard Generation: The model is optimized for generating hierarchical code (HTML/CSS, React components) and structured JSON required to render complex data visualizations.
- System Simulations: It maintains logical consistency over long contexts, making it suitable for creating environment simulations or agentic workflows that require state-tracking.
- Synthetic Data Generation: Due to the low input cost ($0.25/1M tokens), it serves as an efficient engine for distilling knowledge from larger models like Gemini 3.1 Ultra into smaller, domain-specific datasets.
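For the dashboard and structured-JSON use cases above, the natural pattern is to constrain output with the Gemini API’s structured-output options (`responseMimeType` and `responseSchema`); the dashboard schema below is purely illustrative, not from the announcement:

```python
# Sketch: request a JSON dashboard spec. "responseMimeType" and
# "responseSchema" follow the Gemini API's structured-output options;
# the schema itself is an illustrative example.
DASHBOARD_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "widgets": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "kind": {"type": "string"},    # e.g. "line_chart"
                    "metric": {"type": "string"},  # e.g. "daily_revenue"
                },
                "required": ["kind", "metric"],
            },
        },
    },
    "required": ["title", "widgets"],
}

def dashboard_request(description: str) -> dict:
    """Build a request that forces the model to emit schema-valid JSON."""
    return {
        "contents": [{"role": "user", "parts": [{"text": description}]}],
        "generationConfig": {
            "responseMimeType": "application/json",
            "responseSchema": DASHBOARD_SCHEMA,
        },
    }
```

Constraining the response to a schema is what makes the output directly renderable by a dashboard frontend, rather than requiring a post-hoc parsing step.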
Key Takeaways
- Superior Price-to-Performance Ratio: Gemini 3.1 Flash-Lite is the most cost-efficient model in the Gemini 3 series, priced at $0.25 per 1M input tokens and $1.50 per 1M output tokens. It outperforms Gemini 2.5 Flash with a 2.5x faster Time to First Token (TTFT) and 45% higher output speed.
- Introduction of ‘Thinking Levels’: A new architectural feature allows developers to programmatically toggle between Minimal, Low, Medium, and High reasoning intensities. This provides granular control to balance latency against reasoning depth depending on the task’s complexity.
- High Reasoning Benchmark: Despite its ‘Lite’ designation, the model maintains high-tier logic, scoring 86.9% on the GPQA Diamond benchmark. This makes it suitable for expert-level reasoning tasks that previously required larger, more expensive models.
- Optimized for Structured Workloads: The model is specifically tuned for ‘intelligence at scale,’ excelling at generating complex UI/dashboards, creating system simulations, and maintaining logical consistency across long-sequence code generation.
- Seamless API Integration: Currently available in Public Preview, the model uses the gemini-3.1-flash-lite-preview endpoint via the Gemini API and Vertex AI. It supports multimodal inputs (text, image, video) while maintaining a standard 128k context window.
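Since the takeaways above mention the `gemini-3.1-flash-lite-preview` endpoint and multimodal input, here is a minimal sketch of a text-plus-image payload; the `inlineData` part shape follows the documented Gemini REST API, while the model name is taken from this article:

```python
import base64

MODEL = "gemini-3.1-flash-lite-preview"  # endpoint name from the article

def multimodal_request(prompt: str, image_bytes: bytes) -> dict:
    """Build a text + image payload; inlineData follows the Gemini REST shape."""
    return {
        "model": MODEL,
        "contents": [{
            "role": "user",
            "parts": [
                {"text": prompt},
                {"inlineData": {
                    "mimeType": "image/png",
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                }},
            ],
        }],
    }
```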
Check out the Public Preview via the Gemini API (Google AI Studio) and Vertex AI.

