The real test for AI agents isn't speed — it's coherence over time

Qwen 3.7-Max just ran autonomously for 35 hours on a single kernel optimization task.

1,158 tool calls. 432 test cycles. No human intervention. 10x speedup on production code.

The interesting part isn't the result. It's what happened at hour 30.

Every other model tested (Z.ai's GLM 5.1, Moonshot AI's Kimi K2.6, DeepSeek's V4 Pro) hit a wall and stopped, concluding it couldn’t improve further. Qwen kept going, and not randomly: it was still finding meaningful optimizations after a thousand tool calls.

This wasn’t a benchmark. The target was unfamiliar hardware (T-Head ZW-M890 PPU) with no documentation or example implementations, so there was effectively zero prior training signal to lean on.

The real test was coherence over time: can a model sustain a problem-solving strategy across 1,158 tool calls without drifting, repeating itself, or giving up?

The other models didn’t fail technically. They failed strategically. They decided they were done. Qwen didn’t.

We keep measuring models on how fast they solve problems. Maybe the more interesting question is how long they can stay in the problem.

What's the longest you've let an AI agent run unsupervised — and did it actually stay on track?


Originally posted on LinkedIn