Just three months after the release of its state-of-the-art model Gemini 3 Pro, Google DeepMind is back with its latest iteration: Gemini 3.1 Pro.
A substantial upgrade in capability and safety, Gemini 3.1 Pro strives to be accessible and usable by everyone. Whatever your preferred platform or budget, the model has plenty to offer.
In this article, I test the capabilities of Gemini 3.1 Pro and walk through its key features, from how to access the model to how it fares on benchmarks.
Gemini 3.1 Pro: What’s new?
Gemini 3.1 Pro is the latest member of the Gemini model family. As usual, the model ships with a long list of features and improvements over its predecessor. Some of the most notable ones are:
- 1 Million Context Window: Maintains the industry-leading 1 million token input capacity, allowing it to process over 1,500 pages of text or entire code repositories in a single prompt.
- Advanced Reasoning Performance: It delivers more than double the reasoning performance of Gemini 3 Pro, scoring 77.1% on the ARC-AGI-2 benchmark.
- Enhanced Agentic Reliability: Specifically optimized for autonomous workflows, including a dedicated API endpoint (gemini-3.1-pro-preview-customtools) for high-precision tool orchestration and bash execution.
- Pricing: The cost per token of the latest model is the same as its predecessor's, so anyone accustomed to the Pro variant is effectively getting a free upgrade.
- Advanced Vibe Coding: The model handles visual coding exceptionally well. It can generate website-ready, animated SVGs purely through code, meaning crisp scaling and tiny file sizes.
- Hallucinations: Gemini 3.1 Pro tackles the hallucination problem head-on, cutting its hallucination rate from 88% to 50% on the AA-Omniscience: Knowledge and Hallucination benchmark.
- Granular Thinking: The model adds more granularity to the thinking option offered by its predecessor. Now the users can choose between high, medium and low thinking parameters.
| Thinking Level | Gemini 3.1 Pro | Gemini 3 Pro | Gemini 3 Flash | Description |
|---|---|---|---|---|
| Minimal | Not supported | Not supported | Supported | Matches the no-thinking setting for most queries; the model may think minimally for complex coding tasks. Minimizes latency for chat or high-throughput applications. |
| Low | Supported | Supported | Supported | Minimizes latency and cost. Best for simple instruction following or high-throughput applications. |
| Medium | Supported | Not supported | Supported | Balanced reasoning for most tasks. |
| High | Supported (Default, Dynamic) | Supported (Default, Dynamic) | Supported (Default, Dynamic) | Maximizes reasoning depth. May increase latency, but outputs are more carefully reasoned. |
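Developers pick between these levels through the API's thinking configuration. The sketch below just builds a generateContent-style request body; the field names follow the Gemini API's `thinkingConfig` convention, but treat the exact schema here as an assumption rather than verified documentation:

```python
import json

def build_request(prompt: str, thinking_level: str = "high") -> str:
    """Assemble a JSON request body that selects a thinking level.

    Field names are my reading of the Gemini API's thinkingConfig
    convention, not a verified schema.
    """
    levels = {"low", "medium", "high"}  # the levels listed for 3.1 Pro above
    if thinking_level not in levels:
        raise ValueError(f"Unsupported thinking level: {thinking_level}")
    body = {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "thinkingConfig": {"thinkingLevel": thinking_level}
        },
    }
    return json.dumps(body)
```

Note that "minimal" is deliberately rejected: per the table above, Gemini 3.1 Pro does not support it.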
Hands-On: Let’s have some fun
All the talk in the world wouldn’t amount to anything if the performance falls flat in practice. To evaluate Gemini 3.1 Pro properly, I tested it across three categories:
- Complex reasoning
- Code generation & debugging
- Long-context synthesis
Task 1: Multi-Step Logical Reasoning
What this tests: Chain-of-thought reasoning, constraint handling, and hallucination resistance.
Prompt:
“You are given the following scenario:
Five analysts — A, B, C, D, and E — are assigned to three projects: Alpha, Beta, and Gamma.
Rules:
1. Each project must have at least one analyst.
2. A cannot work with C.
3. B must be assigned to the same project as D.
4. E cannot be on Alpha.
5. No project can have more than three analysts.
Question: List all valid assignment combinations. Show your reasoning clearly and ensure no rule is violated.”
Response:
Gemini 3.1 Pro handled constraint-heavy logic without collapsing into contradictions, which is where most models stumble. The consistency and clarity in enumerating valid combinations showed serious reasoning depth.
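The puzzle is small enough to verify mechanically. Here is a short brute-force checker (my own verification script, not the model's output) that enumerates all 3^5 assignments and filters by the five rules:

```python
from itertools import product

ANALYSTS = ["A", "B", "C", "D", "E"]
PROJECTS = ["Alpha", "Beta", "Gamma"]

def valid_assignments() -> list:
    """Enumerate every analyst-to-project assignment satisfying all five rules."""
    solutions = []
    for combo in product(PROJECTS, repeat=len(ANALYSTS)):
        a = dict(zip(ANALYSTS, combo))
        counts = {p: combo.count(p) for p in PROJECTS}
        if min(counts.values()) < 1:   # Rule 1: every project staffed
            continue
        if a["A"] == a["C"]:           # Rule 2: A cannot work with C
            continue
        if a["B"] != a["D"]:           # Rule 3: B must be with D
            continue
        if a["E"] == "Alpha":          # Rule 4: E cannot be on Alpha
            continue
        if max(counts.values()) > 3:   # Rule 5: at most 3 per project
            continue
        solutions.append(a)
    return solutions

print(len(valid_assignments()))  # 20 rule-satisfying assignments
```

Running this confirms there are exactly 20 valid combinations, which gives a ground truth to grade the model's enumeration against.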
Task 2: Code Generation & Refactoring
What this tests: Structured thinking, edge-case awareness, code correctness, and production realism.
Prompt:
“You are given the following Python function:
def process_orders(orders):
    total = 0
    for order in orders:
        if order["status"] == "completed":
            total += order["amount"]
    return total
Refactor this function to:
1. Handle missing keys safely.
2. Ignore negative amounts.
3. Log skipped invalid entries.
4. Be production-ready with type hints and docstrings.
5. Include unit tests for edge cases.
Return clean, well-structured Python code only.”
Response:
import logging
import unittest
from typing import List, Dict, Any

logger = logging.getLogger(__name__)

def process_orders(orders: List[Dict[str, Any]]) -> float:
    """
    Process a list of order dictionaries and calculate
    the total amount of completed orders.
    """
    total: float = 0.0
    if not isinstance(orders, list):
        logger.error("'orders' must be a list.")
        return total
    for index, order in enumerate(orders):
        if not isinstance(order, dict):
            logger.warning("Skipping index %d: Not a dictionary.", index)
            continue
        if order.get("status") != "completed":
            continue
        if "amount" not in order:
            logger.warning("Skipping index %d: Missing 'amount'.", index)
            continue
        amount = order["amount"]
        if not isinstance(amount, (int, float)) or isinstance(amount, bool):
            logger.warning("Skipping index %d: 'amount' must be numeric.", index)
            continue
        if amount < 0:
            logger.warning("Skipping index %d: Negative amount (%.2f).", index, amount)
            continue
        total += float(amount)
    return total

class TestProcessOrders(unittest.TestCase):
    def setUp(self):
        logging.disable(logging.WARNING)

    def tearDown(self):
        logging.disable(logging.NOTSET)

    def test_valid_completed_orders(self):
        orders = [
            {"status": "completed", "amount": 100},
            {"status": "completed", "amount": 50.5},
        ]
        self.assertEqual(process_orders(orders), 150.5)

    def test_ignore_pending_orders(self):
        orders = [
            {"status": "pending", "amount": 100},
            {"status": "completed", "amount": 50},
        ]
        self.assertEqual(process_orders(orders), 50.0)

    def test_missing_keys_skipped(self):
        orders = [
            {"amount": 100},
            {"status": "completed"},
            {"status": "completed", "amount": 20},
        ]
        self.assertEqual(process_orders(orders), 20.0)

    def test_negative_amounts_ignored(self):
        orders = [
            {"status": "completed", "amount": -10},
            {"status": "completed", "amount": 3},
        ]
        self.assertEqual(process_orders(orders), 3.0)
The refactored code felt production-aware, not toy-level. It anticipated edge cases, enforced type safety, and included meaningful tests. This is the kind of output that actually respects real-world development standards.
Task 3: Long-Context Analytical Synthesis
What this tests: Information compression, structured summarization, and reasoning across context.
Prompt:
“Below is a synthetic business report:
Company: NovaGrid AI
2022 Revenue: $12M
2023 Revenue : $28M
2024 Revenue: $46M
Customer churn increased from 4% to 11% in 2024.
R&D spending increased by 70% in 2024.
Operating margin dropped from 18% to 9%.
Enterprise customers grew by 40%.
SMB customers declined by 22%.
Cloud infrastructure costs doubled.
Task:
1. Diagnose the most likely root causes of margin decline.
2. Identify strategic risks.
3. Recommend 3 data-backed actions.
4. Present your answer in a structured executive memo format.”
Response:
It connected financial signals, operational shifts, and strategic risks into a coherent executive narrative. The ability to diagnose margin pressure while balancing growth signals shows strong business reasoning. It read like something a sharp strategy consultant would draft, not a generic summary.
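The diagnosis is easy to sanity-check against the report's own numbers. A quick back-of-the-envelope calculation (figures taken straight from the prompt) shows that absolute operating profit fell even as revenue grew:

```python
# Figures from the synthetic NovaGrid AI report, in $M.
revenue = {2022: 12.0, 2023: 28.0, 2024: 46.0}

# Year-over-year revenue growth: still fast, but decelerating.
growth_2023 = revenue[2023] / revenue[2022] - 1  # about 133%
growth_2024 = revenue[2024] / revenue[2023] - 1  # about 64%

# Operating profit = margin x revenue; margin fell from 18% to 9%.
op_profit_2023 = 0.18 * revenue[2023]  # roughly $5.0M
op_profit_2024 = 0.09 * revenue[2024]  # roughly $4.1M
```

Despite roughly 64% top-line growth, operating profit shrank in absolute terms, which is exactly the tension a good executive memo has to resolve.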
Note: I skipped the standard “create a dashboard” task, since most recent models, such as Sonnet 4.6 and Kimi K 2.5, can handle it with ease; it wouldn’t offer much of a challenge to a model this capable.
How to access Gemini 3.1 Pro?
Unlike the previous Pro models, Gemini 3.1 Pro is freely accessible to all users on the platform of their choice.
Now that you’ve made up your mind about using Gemini 3.1 Pro, let’s see how you can access the model.
- Gemini Web UI: Free and Gemini Advanced users now have 3.1 Pro available under the model section option.
- API: Available via Google AI Studio for developers (models/Gemini-3.1-pro).
| Model | Base Input Tokens | 5m Cache Writes | 1h Cache Writes | Cache Hits & Refreshes | Output Tokens |
|---|---|---|---|---|---|
| Gemini 3.1 Pro (≤200K tokens) | $2 / 1M tokens | ~$0.20–$0.40 / 1M tokens | ~$4.50 / 1M tokens per hour of storage | Not formally documented | $12 / 1M tokens |
| Gemini 3.1 Pro (>200K tokens) | $4 / 1M tokens | ~$0.20–$0.40 / 1M tokens | ~$4.50 / 1M tokens per hour of storage | Not formally documented | $18 / 1M tokens |
- Cloud Platforms: Being rolled out to NotebookLM, Google Cloud’s Vertex AI, and Microsoft Foundry.
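To make the pricing table concrete, here is a minimal cost estimator using the base input and output rates above. Caching is ignored, and switching both rates at the 200K-input-token boundary is my reading of the table, so treat that as an assumption:

```python
def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Rough request cost in USD from the published per-1M-token rates."""
    if input_tokens <= 200_000:
        input_rate, output_rate = 2.0, 12.0   # <=200K-token tier
    else:
        input_rate, output_rate = 4.0, 18.0   # >200K-token tier
    return (input_tokens / 1e6) * input_rate + (output_tokens / 1e6) * output_rate

# A 100K-token prompt with a 2K-token reply costs roughly $0.22.
print(estimate_cost(100_000, 2_000))
```

A long-context call is charged at the higher tier: the same reply on a 300K-token prompt costs several times as much, which matters when you routinely fill the 1M-token window.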
Benchmarks
To quantify how good this model is, let’s turn to the benchmarks.
There is a lot to decipher here, but the most striking improvement by far is in abstract reasoning puzzles.
To put things into perspective: Gemini 3 Pro launched with an ARC-AGI-2 score of 31.1%, the highest at the time and considered a breakthrough by LLM standards. Fast forward just three months, and its own successor has more than doubled that score, reaching 77.1%.
This is the rapid pace at which AI models are improving.
If you’re unfamiliar with what these benchmarks test, read this article: AI Benchmarks.
Conclusion: Powerful and Accessible
Gemini 3.1 Pro proves it’s more than a flashy multimodal model. Across reasoning, code, and analytical synthesis, it demonstrates real capability with production relevance. It’s not flawless and still demands structured prompting and human oversight. But as a frontier model embedded in Google’s ecosystem, it’s powerful, competitive, and absolutely worth serious evaluation.
Frequently Asked Questions
Q1. What is Gemini 3.1 Pro designed for?
A. It is built for advanced reasoning, long-context processing, multimodal understanding, and production-grade AI applications.
Q2. How can developers access Gemini 3.1 Pro?
A. Developers can access it via Google AI Studio for prototyping or Vertex AI for scalable, enterprise deployments.
Q3. Is Gemini 3.1 Pro reliable for high-stakes tasks?
A. It performs strongly but still requires structured prompting and human oversight to ensure accuracy and reduce hallucinations.