A recent paper from Penn State, UCSC, and Amazon, titled "Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents," reports that the ability of AI agents to update their own harness is largely "flat" across models. In cross-testing, harness updates written by different models produced performance gains that differed by only 3.1%, and even the 9B-scale Qwen3.5-9B produced updates structurally equivalent to those of the flagship Claude Opus 4.6.
The ability to benefit from an updated harness, by contrast, follows a non-monotonic trend. Weaker models such as Qwen3-32B show two critical failure modes: "harness activation failure," with a skill loading rate of only 25.1% versus 96% for stronger models, and "harness compliance failure," in which instruction adherence drops sharply from 0.52 to 0.13 over extended execution. AI researcher Elvis Sar noted similar patterns in his own coding-agent experiments, suggesting that compute budgets should prioritize the execution agent over the evolution engine.
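The cross-testing comparison described above can be sketched as follows: hold the executing agent fixed, swap in updates written by different models, and compare the resulting gains over a no-update baseline. This is a minimal illustration, not the paper's evaluation code. The model labels, the success rates, and the helper names `mean_gain` and `gain_spread` are all hypothetical placeholders and do not come from the paper.

```python
from statistics import mean

# Hypothetical success rates, NOT figures from the paper.
# baseline[executor] = success rate with no harness update applied.
baseline = {"executor_a": 0.40, "executor_b": 0.55}

# scores[evolver][executor] = success rate when `executor` runs with
# a harness update written by `evolver`.
scores = {
    "evolver_small": {"executor_a": 0.46, "executor_b": 0.61},
    "evolver_large": {"executor_a": 0.48, "executor_b": 0.62},
}


def mean_gain(evolver_scores, baseline):
    """Average improvement over baseline across all executors."""
    return mean(evolver_scores[ex] - baseline[ex] for ex in baseline)


def gain_spread(scores, baseline):
    """Return per-evolver mean gains and the gap between best and worst."""
    gains = {name: mean_gain(s, baseline) for name, s in scores.items()}
    return gains, max(gains.values()) - min(gains.values())


gains, spread = gain_spread(scores, baseline)
for name, gain in gains.items():
    print(f"{name}: mean gain {gain:.3f}")
print(f"spread across evolvers: {spread:.3f}")
```

A small spread in this kind of table means the choice of updating model matters little, which is the sense in which update capability is described as flat. Reading down a single column instead (one set of updates, different executors) is what would expose the differences in how much each executor benefits.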