GLM-4.7-Flash is a new member of the GLM-4.7 family and targets developers who want strong coding and reasoning performance in a model that is practical to run locally. Zhipu AI (Z.ai) describes GLM-4.7-Flash as a 30B-A3B MoE model and presents it as the strongest model in the 30B class, designed for lightweight deployment where performance and efficiency both matter.
Model class and position inside the GLM-4.7 family
GLM-4.7-Flash is a text generation model with 31B total parameters, BF16 and F32 tensor types, and the architecture tag glm4_moe_lite. It supports English and Chinese, and it is configured for conversational use. GLM-4.7-Flash sits in the GLM-4.7 collection next to the larger GLM-4.7 and GLM-4.7-FP8 models.
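As a concrete starting point, here is a minimal sketch of loading the model with Hugging Face Transformers. The repo id comes from the model card; the dtype and device_map choices are ours, and your installed Transformers version must ship the glm4_moe_lite architecture for this to work as written.

```python
# Minimal sketch: loading GLM-4.7-Flash with Transformers.
# Assumes a recent transformers release that includes the
# glm4_moe_lite architecture; repo id taken from the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zai-org/GLM-4.7-Flash"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # BF16 tensor type per the model card
    device_map="auto",           # spread the 31B parameters across available GPUs
)
```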
Z.ai positions GLM-4.7-Flash as a free-tier, lightweight-deployment option relative to the full GLM-4.7 model, while still targeting coding, reasoning, and general text generation tasks. This makes it attractive for developers who cannot deploy a 358B-class model but still want a modern MoE design and strong benchmark results.
Architecture and context length
In a Mixture of Experts architecture of this type, the model stores more parameters than it activates for each token. That allows specialization across experts while keeping the effective compute per token closer to a smaller dense model.
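To make the stored-versus-activated distinction concrete, the sketch below implements a generic top-k routed MoE feed-forward layer in PyTorch. This illustrates the general technique, not Z.ai's implementation; the expert count, layer sizes, and top_k value are made-up.

```python
# Generic top-k Mixture-of-Experts feed-forward layer (illustrative only;
# expert count, sizes, and top_k are invented, not GLM-4.7-Flash's values).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x)                # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the chosen experts
        out = torch.zeros_like(x)
        # Each token runs only top_k experts, so compute per token stays
        # close to a small dense model even though all experts are stored.
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out
```

Because each token is dispatched to only top_k of the n_experts networks, the parameters stored grow with the number of experts while the compute per token grows only with top_k.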
GLM-4.7-Flash supports a context length of 128k tokens and achieves strong performance on coding benchmarks among models of similar scale. This context size is suitable for large codebases, multi-file repositories, and long technical documents, where many existing models would need aggressive chunking.
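A 128k window is also easy to sanity-check in practice. The short sketch below estimates whether a project's Python files fit in the context, assuming the tokenizer from the zai-org/GLM-4.7-Flash repo and a hypothetical my_repo directory.

```python
# Rough check of whether a set of source files fits in the 128k context.
# Assumes the tokenizer from the zai-org/GLM-4.7-Flash repo; "my_repo"
# is a hypothetical project directory.
from pathlib import Path
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

total = 0
for path in Path("my_repo").rglob("*.py"):
    total += len(tokenizer.encode(path.read_text(errors="ignore")))

print(f"{total} tokens; fits in 128k context: {total < 128_000}")
```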
GLM-4.7-Flash uses a standard causal language modeling interface and a chat template, which allows integration into existing LLM stacks with minimal changes.
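Reusing the tokenizer and model from the loading sketch above, a minimal generation call through the chat template might look like the following; the prompt and the generation budget are arbitrary examples.

```python
# Building a prompt with the model's chat template, then generating.
# Standard Transformers chat-template flow; the prompt and token budget
# here are arbitrary examples.
messages = [
    {"role": "user", "content": "Refactor this function to avoid global state."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```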
Benchmark performance in the 30B class
The Z.ai team compares GLM-4.7-Flash with Qwen3-30B-A3B-Thinking-2507 and GPT-OSS-20B. GLM-4.7-Flash leads or is competitive across a mix of math, reasoning, long-horizon, and coding agent benchmarks.
The full benchmark comparison table is available on the model card: https://huggingface.co/zai-org/GLM-4.7-Flash

That comparison shows why GLM-4.7-Flash is one of the strongest models in the 30B class, at least among the models included. The important point is that GLM-4.7-Flash is not only a compact deployment of GLM but also a high-performing model on established coding and agent benchmarks.
Evaluation parameters and thinking mode
For most tasks, the default settings are temperature 1.0, top-p 0.95, and max new tokens 131072. This defines a relatively open sampling regime with a large generation budget.
For Terminal Bench and SWE-bench Verified, the configuration uses temperature 0.7, top-p 1.0, and max new tokens 16384. For τ²-Bench, the configuration uses temperature 0 (greedy decoding) and max new tokens 16384. These stricter settings reduce randomness for tasks that need stable tool use and multi-step interaction.
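These documented settings map directly onto standard generation parameters. Below is one way to organize them as Transformers generate kwargs; the dictionary grouping and key names are ours, while the values come from the model card.

```python
# The evaluation settings above, expressed as generation kwargs.
# The dict structure and keys are ours; the values come from the model card.
EVAL_CONFIGS = {
    "default":            dict(do_sample=True, temperature=1.0, top_p=0.95, max_new_tokens=131072),
    "terminal_bench":     dict(do_sample=True, temperature=0.7, top_p=1.0, max_new_tokens=16384),
    "swe_bench_verified": dict(do_sample=True, temperature=0.7, top_p=1.0, max_new_tokens=16384),
    "tau2_bench":         dict(do_sample=False, max_new_tokens=16384),  # temperature 0 == greedy
}

output = model.generate(inputs, **EVAL_CONFIGS["swe_bench_verified"])
```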
The Z.ai team also recommends turning on Preserved Thinking mode for multi-turn agentic tasks such as τ²-Bench and Terminal Bench 2. This mode preserves internal reasoning traces across turns, which is useful when you build agents that need long chains of function calls and corrections.
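The exact way to enable Preserved Thinking depends on the serving stack, so the following is only a hypothetical sketch of the underlying idea: keeping the assistant's reasoning traces in the message history across turns. The client object and the reasoning_content field are illustrative assumptions, not a documented API.

```python
# Hypothetical sketch of a multi-turn loop that preserves reasoning traces.
# The "client" object and the "reasoning_content" field are illustrative
# assumptions; consult your serving stack's docs for the actual way to
# enable Preserved Thinking mode.
history = []

def run_turn(client, user_msg):
    history.append({"role": "user", "content": user_msg})
    reply = client.chat(model="glm-4.7-flash", messages=history)  # hypothetical client API
    history.append({
        "role": "assistant",
        "content": reply["content"],
        # Keep the reasoning trace in the history so later turns can
        # build on earlier chains of tool calls and corrections.
        "reasoning_content": reply.get("reasoning_content", ""),
    })
    return reply["content"]
```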
How GLM-4.7-Flash fits developer workflows
GLM-4.7-Flash combines several properties that are relevant for agentic, coding-focused applications:
- A 30B-A3B MoE architecture with 31B total parameters and a 128k-token context length.
- Strong benchmark results on AIME 25, GPQA, SWE-bench Verified, τ²-Bench, and BrowseComp compared to the other models in the same comparison.
- Documented evaluation parameters and a Preserved Thinking mode for multi-turn agent tasks.
- First-class support for vLLM, SGLang, and Transformers-based inference, with ready-to-use commands (a minimal vLLM sketch follows this list).
- A growing set of finetunes and quantizations, including MLX conversions, in the Hugging Face ecosystem.
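As one concrete example of that inference support, here is a minimal offline-inference sketch using vLLM's Python API. The repo id comes from the model card; max_model_len, the parallelism setting, and the sampling values are our choices, and the ready-to-use commands on the model card should take precedence for real deployments.

```python
# Minimal vLLM offline-inference sketch (parameter choices are ours;
# prefer the ready-to-use commands on the model card for serving).
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-4.7-Flash",
    max_model_len=131072,          # full 128k-token context window
    tensor_parallel_size=1,        # raise for multi-GPU deployments
)

params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=1024)
outputs = llm.generate(["Write a Python function that merges two sorted lists."], params)
print(outputs[0].outputs[0].text)
```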

