Sakana AI and KPMG Unveil CoffeeBench, 90-Day AI Agent Trading Simulation; GPT-5.5 and Claude Show Contrasting Strategies

Sakana AI and KPMG Japan Azsa have unveiled CoffeeBench, a multi-agent, long-horizon economics benchmark that, according to the companies, has been accepted to the Failure Modes in Agentic AI workshop at ICML 2026. The framework simulates a coffee supply chain with two farmers, two roasters, and two retailers. Each AI model under evaluation operates a roasting business over a 90-day period, handling price negotiations, order transactions, and payment settlements.

A side-by-side evaluation of mainstream models revealed distinct trading behaviors. GPT-5.5 and Claude Opus 4.7 communicated actively, frequently negotiating prices and executing trades to maximize sales, while Gemini 3.1 Pro was more passive and mostly responded to other agents. Kimi K2.6 made numerous tool calls but failed to maintain pricing discipline, ending with high transaction volume but zero profit. Claude Haiku 4.5 showed a gap between planning and execution: it repeatedly chose inaction despite formulating sound strategies, and ultimately incurred heavy losses as fixed costs accumulated.
