
GPT-5.5 Arrives: OpenAI's New Agentic Model Hits 82.7% on Terminal-Bench 2.0

OpenAI releases GPT-5.5 with 82.7% on Terminal-Bench 2.0 and 93.6% on GPQA Diamond. We compare benchmarks against Kimi K2.6, GLM-5.1, and Qwen 3.6 Plus to assess the new frontier.

OpenAI has officially released GPT-5.5, the company’s latest entry into the agentic AI space. The model achieved 82.7% on the Terminal-Bench 2.0 benchmark and 93.6% on the GPQA Diamond scientific reasoning test.[1] The announcement comes as competition in the large language model market intensifies, with Chinese open-source alternatives including Kimi K2.6, GLM-5.1, and Qwen 3.6 Plus rapidly closing the gap.

GPT-5.5 is positioned not merely as a conversational assistant but as an agentic model designed for professional workflows. In its system card, OpenAI summarized the model’s core advantage in a single phrase: “understands the task earlier, asks for less guidance, uses tools more effectively, checks its work and keeps going until it’s done.”[2]

Key Metrics: Coding Prowess Leads the Pack

On software engineering benchmarks, GPT-5.5 delivered an impressive performance. According to BenchLM.ai’s comprehensive evaluation,[3] GPT-5.5 scored 89/100 overall, ranking 5th among all tested models and 2nd among the 16 models that underwent vendor-verified evaluation.

The standout figure is its Terminal-Bench 2.0 score. At 82.7%, it represents not only the highest among comparable models but also a margin of more than 15 percentage points over the strongest Chinese open-source competitors: Kimi K2.6 (66.7%), GLM-5.1 (63.5%), and Qwen 3.6 Plus (61.6%).

Table 1: Coding and Software Engineering Comparison[1][3]

| Benchmark | GPT-5.5 | Kimi K2.6 | GLM-5.1 | Qwen 3.6 Plus |
| --- | --- | --- | --- | --- |
| SWE-Bench Pro | 58.6% | 58.6% | 58.4% | 56.6% |
| Terminal-Bench 2.0 | 82.7% | 66.7% | 63.5% | 61.6% |
| LiveCodeBench | 89.6% | 87.1% | – | – |
| SWE-Bench Verified | 80.2% | 78.8% | – | – |

Notably, on SWE-Bench Pro, the widely cited software engineering benchmark, GPT-5.5 tied with Kimi K2.6 at 58.6%. GLM-5.1 (58.4%) and Qwen 3.6 Plus (56.6%) followed closely behind, both within two percentage points of the leaders. This suggests that on practical coding tasks, the divide between top-tier models is narrowing.

OpenAI specifically noted that Claude Opus 4.7’s SWE-Bench Pro score carries an asterisk indicating “signs of memorization,” casting doubt on the credibility of some high scores.

Reasoning and Knowledge: Chinese Models Strike Back

While GPT-5.5 dominates on terminal tasks, Chinese models demonstrated strong competitiveness in pure reasoning and knowledge tests.

GPQA Diamond evaluates graduate-level scientific question-answering capabilities. GPT-5.5 leads with 93.6%, but Kimi K2.6 (90.5%) and Qwen 3.6 Plus (90.4%) have closed the gap to within 3 percentage points. GLM-5.1’s 86.2%, though lower, remains respectable given its April 7 release date, two weeks ahead of GPT-5.5.

Table 2: Reasoning and Knowledge Comparison[1][3]

| Benchmark | GPT-5.5 | Kimi K2.6 | GLM-5.1 | Qwen 3.6 Plus |
| --- | --- | --- | --- | --- |
| GPQA Diamond | 93.6% | 90.5% | 86.2% | 90.4% |
| HLE (with tools) | 52.2% | 54.0% | – | – |
| AIME 2026 | – | 96.4% | 95.3% | – |
| MMLU-Pro | – | – | – | 88.5% |

On HLE (Humanity’s Last Exam), an ultra-difficult test, Kimi K2.6 achieved 54.0% with tools enabled, surpassing GPT-5.5’s 52.2%. On the AIME 2026 mathematics competition, both Kimi K2.6 (96.4%) and GLM-5.1 (95.3%) posted near-perfect scores.

Qwen 3.6 Plus reported 88.5% on the MMLU-Pro knowledge test, ranking 4th on that leaderboard.

Agentic Capabilities: The Tool-Use Divide

GPT-5.5’s core selling point lies in its agentic capabilities: the ability to autonomously plan, invoke tools, and iterate until a task is complete. BenchLM.ai’s evaluation shows GPT-5.5 ranking 2nd in the agentic tool-use category with a score of 99.2.
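To make that loop concrete, the following is a minimal sketch of the plan, call tool, read result, repeat pattern using OpenAI’s Python SDK. The model identifier “gpt-5.5” and the run_shell tool are illustrative assumptions: OpenAI has not published an API name for the model, and this is the generic Chat Completions tool-calling pattern, not OpenAI’s own agent harness.

```python
# Minimal agentic tool-use loop (hypothetical sketch). The "gpt-5.5" model
# name and the run_shell tool are assumptions for illustration only.
import json
import subprocess

from openai import OpenAI

client = OpenAI()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command and return its combined output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

def run_agent(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-5.5",  # assumed identifier
            messages=messages,
            tools=TOOLS,
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # model considers the task done
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = subprocess.run(args["command"], shell=True,
                                    capture_output=True, text=True)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result.stdout + result.stderr,
            })
    return "Step budget exhausted."
```

Terminal-Bench 2.0 essentially scores how reliably a model can drive this kind of loop over real shell sessions, which is why the benchmark features so prominently in OpenAI’s positioning.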

OpenAI’s system card returns to this theme, stressing that the model “checks its work and keeps going until it’s done.”[2]

“It’s more than faster coding,” said Justin Boitano, Vice President of Enterprise Platforms at NVIDIA, in OpenAI’s official blog post.[1] “It’s a new way of working that helps people operate at a fundamentally different speed.”

OpenAI disclosed that approximately 200 early-access partners tested the model before release, focusing on use cases including coding, research, data analysis, document creation, and cross-tool workflows.

Long Context and Pricing Dynamics

Table 3: Context Window and Pricing Comparison[1][3]

| Model | Context | License | API Input Price (per 1M tokens) |
| --- | --- | --- | --- |
| GPT-5.5 | 1M | Proprietary | ~$2.50 (est.) |
| Kimi K2.6 | 262K | Open (Modified MIT) | $0.60 |
| GLM-5.1 | 203K | Open | $1.40 |
| Qwen 3.6 Plus | 1M | Open | – |

Context windows represent another critical battleground. Both GPT-5.5 and Qwen 3.6 Plus support 1 million tokens, while Kimi K2.6 and GLM-5.1 offer 262K and 203K tokens respectively.
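Whether a given corpus actually fits in those windows can be estimated locally by counting tokens. GPT-5.5’s tokenizer has not been published, so the sketch below assumes tiktoken’s o200k_base encoding (the GPT-4o tokenizer) as a rough stand-in.

```python
# Rough context-window fit check. GPT-5.5's tokenizer is unpublished; we
# assume tiktoken's o200k_base encoding (GPT-4o's) as an approximation.
import tiktoken

CONTEXT_WINDOWS = {
    "GPT-5.5": 1_000_000,
    "Kimi K2.6": 262_000,
    "GLM-5.1": 203_000,
    "Qwen 3.6 Plus": 1_000_000,
}

def fits_in_context(text: str) -> dict[str, bool]:
    n_tokens = len(tiktoken.get_encoding("o200k_base").encode(text))
    return {model: n_tokens <= window
            for model, window in CONTEXT_WINDOWS.items()}
```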

On pricing, however, open-source models demonstrate overwhelming advantages. Kimi K2.6’s API input pricing stands at just $0.60 per million tokens, with GLM-5.1 at $1.40. While GPT-5.5 has not announced official pricing, market estimates place it around $2.50, more than four times the cost of Kimi.
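The gap is easy to quantify with a back-of-the-envelope calculation from the input prices in Table 3, recalling that the GPT-5.5 figure is a market estimate rather than an announced price:

```python
# Back-of-the-envelope monthly input-token cost from Table 3's prices.
# The GPT-5.5 price is a market estimate, not an announced figure.
PRICE_PER_1M_INPUT_USD = {
    "GPT-5.5 (est.)": 2.50,
    "Kimi K2.6": 0.60,
    "GLM-5.1": 1.40,
}

def monthly_input_cost(tokens_per_day: int, days: int = 30) -> None:
    total_millions = tokens_per_day * days / 1_000_000
    for model, price in PRICE_PER_1M_INPUT_USD.items():
        print(f"{model}: ${total_millions * price:,.2f}")

# An agent consuming 5M input tokens per day for a month:
monthly_input_cost(5_000_000)
# GPT-5.5 (est.): $375.00
# Kimi K2.6: $90.00
# GLM-5.1: $210.00
```

At that volume, the estimated GPT-5.5 bill works out to roughly 4.2 times Kimi K2.6’s, consistent with the ratio quoted above.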

Pro Variant: Test-Time Compute Potential

The GPT-5.5 Pro variant demonstrates the potential of test-time compute. Through parallel inference scaling, the Pro version achieved improvements on several ultra-difficult tests (one common pattern behind such scaling is sketched after the list):

  • HLE (with tools): improved from 52.2% to 57.2%
  • FrontierMath Tier 1-3: improved from 51.7% to 52.4%
  • FrontierMath Tier 4: improved from 35.4% to 39.6%
  • GeneBench: improved from 25.0% to 33.2%
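OpenAI has not detailed how the Pro variant spends its extra compute. One common pattern that “parallel inference scaling” usually denotes is self-consistency: sample several independent answers and keep the most frequent one. The sketch below shows that generic pattern under assumed parameters, not GPT-5.5 Pro’s actual mechanism.

```python
# Generic test-time-compute pattern (self-consistency): sample k answers in
# parallel and majority-vote. Illustrative only; not OpenAI's disclosed
# method for GPT-5.5 Pro. The "gpt-5.5" model name is an assumption.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()

def sample_answer(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5.5",  # assumed identifier
        messages=[{"role": "user", "content": question}],
        temperature=1.0,  # keep sampling diverse across attempts
    )
    return response.choices[0].message.content.strip()

def majority_vote(question: str, k: int = 8) -> str:
    with ThreadPoolExecutor(max_workers=k) as pool:
        answers = list(pool.map(sample_answer, [question] * k))
    return Counter(answers).most_common(1)[0][0]
```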

GeneBench, a biomedical gene analysis benchmark, remains exceptionally challenging. Even the enhanced Pro version achieved only 33.2%, underscoring the high barriers in this domain.

Safety and Guardrails

OpenAI emphasizes that GPT-5.5 includes “the strongest safeguards to date.”[2] On internal Expert-SWE testing, the model achieved 73.1%, demonstrating reliability when handling sensitive engineering tasks.

FrontierMath and ARC-AGI Results[1][2]

GPT-5.5 posted 51.7% on FrontierMath Tier 1-3 problems and 35.4% on Tier 4. The Pro variant pushed these to 52.4% and 39.6% respectively.

On the ARC-AGI abstraction and reasoning benchmark, GPT-5.5 achieved 95.0% on ARC-AGI-1 and 85.0% on ARC-AGI-2. These results suggest strong performance on novel reasoning tasks that require generalization beyond training data.

Additional benchmark results include 80.5% on BixBench and 41.4% on HLE without tools, improving to 52.2% with tool access enabled.


Conclusion

GPT-5.5’s release marks a transition for agentic AI from concept to practical application. OpenAI’s clear lead on terminal-based agentic tasks, as measured by Terminal-Bench 2.0, temporarily secures its position in the enterprise market. However, Chinese open-source models’ advantages in pricing, select reasoning tasks, and open ecosystems are changing the rules of competition.

For developers, the choice involves trade-offs: pay for OpenAI’s agentic capabilities or embrace the cost-effectiveness and customizability of open-source alternatives. The answer likely depends on specific use cases and budget constraints.

The model is available now to ChatGPT Plus subscribers, with API access rolling out over the coming weeks. Enterprise customers can request early access through OpenAI’s sales channel.

Footnotes

  1. OpenAI Official Blog — GPT-5.5 announcement, including Terminal-Bench 2.0 and GPQA Diamond benchmarks and the Justin Boitano quote. https://openai.com/index/introducing-gpt-5-5/

  2. OpenAI GPT-5.5 System Card — Safety evaluations, early-access partner feedback, and safeguard details. https://openai.com/index/gpt-5-5-system-card/

  3. BenchLM.ai Leaderboard — Comprehensive benchmark platform showing GPT-5.5 (#2), Kimi K2.6 (#12), GLM-5.1 (#13), and Qwen 3.6 Plus (#18). https://benchlm.ai/