The Universal Evaluation Framework for AI Agents across CEX and Web3: 66 real-world tasks, 6 core dimensions, and a reproducible scoring methodology.
Covering the full Crypto user journey — from basic CEX operations to complex on-chain investigations, all based on real-world scenarios.
CEX Trading: Spot trading, futures open/close, yield farming queries, grid strategies, account transfers, and portfolio analysis.
DEX Trading: On-chain swaps, cross-chain bridge pricing, slippage control, multi-hop route optimization, and contract risk assessment.
Wallet Operations: Multi-chain transfers, gas estimation, address format validation, wrong-chain prevention, and conditional transfers.
Market Analysis: Real-time market data, RSI/K-line technical analysis, volume-price relationships, multi-asset comparison, and volatility analysis.
Project Research: Tokenomics analysis, narrative cycle assessment, rug pull detection, competitive analysis, and research reports.
On-Chain Tracking: Address profiling and P&L analysis, whale tracking, smart money signals, and protocol security monitoring.
Single-step instructions with clear intent. Examples: balance queries, price checks, simple orders.
Includes pre-checks or exception handling. Examples: insufficient balance detection, parameter completion, wrong-chain risk identification.
Multi-step, multi-constraint tasks requiring reasoning and trade-offs. Examples: optimal cross-chain paths, full transfers with gas reserves.
Weighted scores across 6 dimensions. All evaluations use dual-model consensus with human arbitration for disputes.
| # | Agent | Type | Total Score | CEX | DEX | Wallet | Market Analysis | Project Research | On-Chain Tracking |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Gate AI Agent | General AI | 83.1 | 89.7 | 82.4 | 61.5 | 86.8 | 92.3 | 83.5 |
| 2 | Claude Agent (Gate for AI installed) | General AI | 82.8 | 79.2 | 81.6 | 82.2 | 83.2 | 89.6 | 79.9 |
| 3 | Codex Agent (Gate for AI installed) | General AI | 81.2 | 80.6 | 72.8 | 79.0 | 81.5 | 86.8 | 84.4 |
| 4 | AskSurf Agent | Crypto AI | 77.5 | 75.8 | 75.8 | 57.5 | 83.7 | 95.4 | 83.0 |
| 5 | Manus (Gate for AI installed) | General AI | 74.3 | 74.5 | 74.5 | 77.3 | 73.7 | 78.4 | 68.1 |
| 6 | Binance Agent | Crypto AI | 70.1 | 59.7 | 72.3 | 63.9 | 69.4 | 80.3 | 72.6 |
| 7 | Claude Agent | General AI | 68.2 | 59.4 | 58.6 | 59.0 | 73.1 | 80.9 | 73.6 |
| 8 | Bitget Agent | Crypto AI | 62.2 | 66.1 | 44.5 | 48.9 | 72.0 | 80.3 | 57.2 |
| 9 | Codex Agent | General AI | 52.2 | 51.4 | 46.5 | 55.0 | 60.4 | 57.0 | 42.4 |
Gate AI Agent ranks first overall in this evaluation. As a native Agent deeply integrated with the exchange, it also leads three core dimensions: CEX Trading, DEX Trading, and Market Analysis. The evaluation covers 9 Agents across 6 major scenarios: CEX Trading, DEX Trading, Wallet Operations, Market Analysis, On-Chain Tracking, and Project Research. Scoring uses dual-model consensus with human arbitration for disputes, and Gate AI Agent's performance under this framework is a strong validation of its Web3-native capabilities.
Each task is scored on 2-3 dimensions independently, using dual-model consensus review. All benchmarks and weights are fully transparent.
Does the Agent correctly understand user intent? Are parameters like amount, direction, and trading pair accurately parsed? Are there misunderstandings (e.g., confusing 10U with 10 SOL)?
Does the Agent provide correct results? Are API calls, calculations, and outputs accurate and complete? Are there fabricated data or false execution claims?
Can the Agent identify wrong-chain transfers, insufficient gas, rug tokens, and other dangerous operations? Does it correctly block when conditions aren't met rather than forcing execution?
When encountering permission issues, zero balance, API errors, etc., can the Agent clearly explain the reason and provide next steps?
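The safety-handling checks above can be illustrated with a short sketch. This is a hypothetical pre-flight routine, not the framework's actual test harness: the chain names, address patterns, and the `preflight_transfer` function are all assumptions chosen to show how an Agent could block a wrong-chain transfer or an insufficient-balance transfer instead of forcing execution.

```python
import re

# Hypothetical per-chain address formats; production agents would use
# checksum and on-chain validation, not regexes alone.
CHAIN_PATTERNS = {
    "ethereum": re.compile(r"^0x[0-9a-fA-F]{40}$"),
    "tron":     re.compile(r"^T[1-9A-HJ-NP-Za-km-z]{33}$"),
}

def preflight_transfer(chain: str, address: str, balance: float,
                       amount: float, gas_reserve: float):
    """Return (ok, reason); block the transfer and explain why on failure."""
    pattern = CHAIN_PATTERNS.get(chain)
    if pattern is None:
        return False, f"unsupported chain: {chain}"
    if not pattern.match(address):
        return False, (f"address does not match the {chain} format "
                       "(possible wrong-chain transfer)")
    if amount + gas_reserve > balance:
        return False, "insufficient balance once the gas reserve is included"
    return True, "ok"

# A Tron-style address submitted for an Ethereum transfer is blocked:
ok, reason = preflight_transfer(
    "ethereum", "TNPeeaaFB7K9cmo4uQpcU32zGK8G1NYqeL", 1.0, 0.5, 0.01)
```

An Agent that returns the `reason` string to the user, rather than silently failing or fabricating success, would score well on both the security-handling and error-explanation dimensions described above.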
Each task is scored independently by GPT-5.4 and Claude Sonnet 4.6, with scoring benchmarks fixed before testing and independent of Agent identity. Average scores are taken to avoid single-model bias.
Each scoring dimension has explicit weights (e.g., intent alignment 35%, execution correctness 45%, security handling 20%), aggregated into task scores, then consolidated by dimension for Agent composite scores.
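The aggregation described above can be sketched as follows. The weights and dimension names come from the example in the text; the two per-model score dictionaries are illustrative, and `task_score` is a hypothetical helper, not the framework's actual code.

```python
# Example weights from the text: intent alignment 35%,
# execution correctness 45%, security handling 20%.
WEIGHTS = {
    "intent_alignment": 0.35,
    "execution_correctness": 0.45,
    "security_handling": 0.20,
}

def task_score(model_a: dict, model_b: dict, weights: dict = WEIGHTS) -> float:
    """Average the two reviewer models per dimension, then apply weights."""
    consensus = {d: (model_a[d] + model_b[d]) / 2 for d in weights}
    return sum(weights[d] * consensus[d] for d in weights)

# Illustrative per-dimension scores from the two reviewer models:
score = task_score(
    {"intent_alignment": 90, "execution_correctness": 80, "security_handling": 70},
    {"intent_alignment": 80, "execution_correctness": 90, "security_handling": 60},
)
# 0.35*85 + 0.45*85 + 0.20*65 = 81.0
```

Averaging before weighting means a single reviewer model cannot dominate any dimension, which is the point of the dual-model consensus step.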
Gate's native AI assistant with full access to Gate MCP and AI Skills capabilities
Mainstream AI platforms' general Agents (e.g., Claude, ChatGPT) with Gate MCP installed
Industry's other Crypto-specific AI Agents