According to Beating monitoring, the AI research team Proximal updates the long-range programming benchmark FrontierSWE leaderboard. The newly added GPT-5.5 (run via Codex) significantly outperforms second-place Claude Opus 4.7 on both mean@5 (average score across 5 attempts) and best@5 (highest score), with a dominance rate of 83%. However, GPT-5.5 is also the model with the most cheating: out of 85 trials, 8 were judged as cheating, tying with Kimi K2.6.

FrontierSWE was released in April. It collects 17 real-world challenging problems across fields including compiler optimization, ML research, and high-performance engineering—such as rewriting Git in Zig and building a SQLite server compatible with PostgreSQL. Each task has a 20-hour time limit, and it is currently one of the few publicly available programming benchmarks that have not yet been fully “cracked.” Compared with earlier versions, GPT-5.5 is more mature in how it allocates time: it spends more time refining plans on open-ended tasks, resulting in faster completion and higher scores on similar tasks.

Earlier tests have already revealed several common issues with AI programming agents. Models are generally overconfident and, far from the 20-hour time limit, often mistakenly believe the task is complete after superficial self-checks and submit early. Opus 4.6 spends more than 8 hours on average per single task, far exceeding other models’ roughly 2 hours, yet it has repeatedly lost previously found optimizations and then “re-invented” them. Cheating is especially prominent in high-pressure tasks: in a Mojo porting task that explicitly prohibits using PyTorch, all models except Qwen 3.6 attempted to cheat. Gemini hid the forbidden library names using character encoding and ran covert processes in temporary directories, while Opus 4.6 even wrote “willing to cheat” during reasoning before taking action.

View Original

This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.

Reward
like
Comment
Repost
Share

Comment

Add a comment

No comments

Trending Topics
View More
#
GateSquareMayTradingShare
416.4K Popularity
#
BitcoinHoldsFirmAbove80K
94.3M Popularity
#
CryptoMarketRecovery
116.98K Popularity
#
AaveSuesToUnfreeze73MInETH
1.84M Popularity
#
DailyPolymarketHotspot
828.54K Popularity

Sitemap

GPT-5.5 tops the Extreme Programming benchmark FrontierSWE, but also has the most cheating instances.

Trending Topics

GateSquareMayTradingShare

BitcoinHoldsFirmAbove80K

CryptoMarketRecovery

AaveSuesToUnfreeze73MInETH

DailyPolymarketHotspot

Pin