Over the past decade, we have grown accustomed to measuring the progress of artificial intelligence by how accurate its answers are: posing questions to the model, comparing its responses to reference answers, and grading accordingly. But this logic is rapidly becoming obsolete, because the role of AI has changed. It is no longer just a passive tool that answers questions, but an active agent that "starts doing things on its own." From automatically planning itineraries and calling external tools to making a chain of decisions in complex tasks, the new generation of AI is gradually taking over workflows that were once performed by humans.
In a world without standard answers, why do exams no longer work?
This raises a question: if AI is not just generating a single reply but completing an entire task, can we still evaluate it with traditional right-or-wrong testing standards? When a task has no single solution, and the AI might reach its goal through unanticipated but more effective methods, a traditional evaluation can mistakenly judge success as failure. This is not just a technical detail but a systemic challenge: how we evaluate determines whether AI learns to solve problems or merely learns to game the rules.
The focus of evaluation is shifting from results to process
To address this issue, the AI research community has recently reached a consensus: evaluating AI cannot rely on results alone; it must also examine how those results were achieved. In recent research and practical experience, the focus of assessment is gradually shifting from a single answer to the entire process: how the AI understands the task, how it breaks down the steps, when it calls external tools, and whether it can adapt its strategy when the environment changes. In other words, AI is no longer just a test-taker being graded but more like an assistant executing a task, and the evaluation system must be able to judge whether it is truly progressing toward the correct goal rather than blindly following instructions. This shift also means that "evaluation" itself is becoming a critical threshold for AI to safely move toward real-world applications.
An AI evaluation is actually an action experiment
In this context, research teams including Anthropic have started to view "an AI evaluation" as a complete action experiment rather than a single question. In practice, researchers design a multi-step decision-making scenario that requires the AI to coordinate with external tools, then let it complete the task from start to finish while fully recording every judgment, action, and strategy adjustment. The process resembles a practical exam that is recorded in its entirety.
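As a rough illustration of what recording "every judgment, action, and strategy adjustment" might look like in code, here is a minimal sketch; the field names and the trip-planning example are assumptions for illustration, not Anthropic's actual format.

```python
# Minimal sketch of a recorded agent trajectory (illustrative; field names are assumed).
from dataclasses import dataclass, field

@dataclass
class Step:
    thought: str       # the model's stated reasoning at this point
    action: str        # e.g. a tool call such as "search_flights(...)"
    observation: str   # what the tool or environment returned

@dataclass
class Trajectory:
    goal: str                      # the real objective the run will be graded against
    steps: list[Step] = field(default_factory=list)

    def record(self, thought: str, action: str, observation: str) -> None:
        self.steps.append(Step(thought, action, observation))

# Example: every decision in a trip-planning run is captured, not just the final answer.
run = Trajectory(goal="Book the cheapest direct flight to Berlin next Friday")
run.record("Need candidate flights first", "search_flights(dest='BER')", "3 results found")
run.record("Result 2 is cheapest and direct", "book_flight(id=2)", "booking confirmed")
```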
True assessment happens after the task is completed
The evaluation system reviews this complete record of actions to determine whether the AI achieved the real goal, rather than merely whether it followed the pre-designed process. To avoid relying on a single standard, the assessment usually combines multiple methods: parts that rules can verify are handled automatically, parts that require understanding semantics and strategic intent are evaluated with the help of another model, and human experts are brought in for calibration when necessary. This design responds to a practical reality: as the AI's solutions become more flexible than the processes originally designed for it, the evaluation system itself must recognize that success can take more than one form.
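A hedged sketch of how such hybrid grading might be wired together follows. The rule checks, the 0-to-1 judge score, and the disagreement threshold are all assumptions made for illustration, not a description of any specific evaluation suite.

```python
# Minimal sketch of hybrid grading: rules where rules suffice, a judge model for intent,
# and a flag for human calibration when the two disagree. All thresholds are assumed.
from typing import Callable

def rule_score(final_state: dict) -> float:
    """Checks that rules can verify automatically, e.g. the booking really exists."""
    booked = bool(final_state.get("booking_confirmed"))
    on_budget = final_state.get("total_cost", float("inf")) <= final_state.get("budget", 0)
    return (int(booked) + int(on_budget)) / 2

def judge_score(goal: str, transcript: str, ask_judge: Callable[[str], float]) -> float:
    """Asks a judge model whether the strategy truly served the goal, even if unanticipated."""
    prompt = (f"Goal: {goal}\nTranscript:\n{transcript}\n"
              "Did the agent genuinely achieve the goal? Reply with a score from 0 to 1.")
    return ask_judge(prompt)

def grade(goal: str, transcript: str, final_state: dict,
          ask_judge: Callable[[str], float]) -> dict:
    r = rule_score(final_state)
    j = judge_score(goal, transcript, ask_judge)
    # Large disagreement between rules and judge is routed to human experts for calibration.
    return {"rule": r, "judge": j, "needs_human_review": abs(r - j) > 0.5}
```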
Evaluation is not a ruler but a way to shape AI behavior
However, evaluation design carries its own risks, because evaluation essentially teaches the AI what it should become. If the standards overemphasize process compliance, the AI may learn to produce lengthy but safe solutions; if only results count and the process is ignored, the system may exploit loopholes, take shortcuts, or adopt strategies humans would find unacceptable. Evaluation is never a neutral measure but an implicit set of value guidelines; if the direction is off, it can push AI toward behavior that scores high but is uncontrollable.
Error optimization: AI doesn't become dumber, it becomes better at doing the wrong thing
This is also why the research community has become highly alert to the problem of "error amplification" in recent years: when models are repeatedly reinforced against a flawed scoring target, they do not become dumber; they become ever more skilled at carrying out the wrong behavior. Such biases often do not show up immediately; they surface only once the AI is deployed in the real world and bears greater responsibility. At that point, the issue is no longer just product quality but safety, accountability, and trustworthiness.
Why this is not just an engineer’s issue
For most people, AI evaluation might seem like a technical detail for engineers, but it ultimately determines whether we will one day be at the mercy of systems that seem smart but have been steered toward the wrong goals. As AI begins to arrange schedules, filter information, execute transactions, and even intervene in public and personal decision-making, the way we assess "how well it performs" is no longer just a matter of model rankings but the foundation of reliability, predictability, and trust. Whether AI becomes a dependable assistant or just a black box that only follows rules is often decided at the moment the evaluation standards are set. Therefore, as AI starts to act on its own, how to evaluate it is no longer an internal question for the tech community but a public concern that everyone who will coexist with AI cannot avoid.
This article, "AI Starts Doing Things on Its Own: Anthropic Explains How Humans Should Evaluate Whether It Does Well or Poorly," was originally published on Chain News ABMedia.