Sakana AI and KPMG Japan Azsa have unveiled CoffeeBench, a multi-agent, long-horizon economics benchmark accepted to the Failure Modes in Agentic AI workshop at ICML 2026. The benchmark simulates a coffee supply chain of two farmers, two roasters, and two retailers. Each AI model under test must operate a roasting business over a 90-day period, negotiating prices, handling order transactions, and settling payments.
A side-by-side evaluation of mainstream models revealed distinct trading behaviors. GPT-5.5 and Claude Opus 4.7 communicated actively, negotiating prices frequently and executing trades to maximize sales, while Gemini 3.1 Pro was passive, mostly reacting rather than initiating. Notably, Kimi K2.6 made numerous tool calls but failed to maintain pricing discipline, ending with high transaction volume but zero profit. Claude Haiku 4.5 showed a gap between planning and execution: it formulated solid strategies but repeatedly chose inaction, and ultimately incurred massive losses as fixed costs accumulated.
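The arithmetic behind the last two outcomes can be illustrated with a minimal sketch. It is not CoffeeBench's actual accounting: the function `roaster_profit`, the linear cost model, and every number except the 90-day horizon are hypothetical, chosen only to show how inaction turns fixed costs into losses and how high volume at thin prices can still net zero.

```python
def roaster_profit(days, units_per_day, sell_price, unit_cost, daily_fixed_cost):
    """Cumulative profit under a simple (assumed) linear model:
    each day earns the trading margin and pays the fixed cost."""
    daily_margin = units_per_day * (sell_price - unit_cost)
    return days * (daily_margin - daily_fixed_cost)


DAYS = 90  # simulation horizon reported in the article

# All figures below are hypothetical, for illustration only.
# Inaction: no trades, so fixed costs accumulate every day.
inaction = roaster_profit(DAYS, units_per_day=0, sell_price=0,
                          unit_cost=0, daily_fixed_cost=50)

# High volume without pricing discipline: the margin only covers fixed costs.
undisciplined = roaster_profit(DAYS, units_per_day=25, sell_price=12,
                               unit_cost=10, daily_fixed_cost=50)

# Lower volume at a firmer price: margin exceeds fixed costs.
disciplined = roaster_profit(DAYS, units_per_day=20, sell_price=14,
                             unit_cost=10, daily_fixed_cost=50)

print(inaction, undisciplined, disciplined)  # -4500 0 2700
```

Under these assumed numbers, the agent that sells fewer units at a firmer price is the only one of the three that ends the 90 days with a profit.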