AI-ABC

AI Agent Benchmark for Crypto

The Universal Evaluation Framework for AI Agents across CEX and Web3: 66 real-world tasks, 6 core dimensions, and a reproducible scoring framework.

66+ Benchmark Tasks

6 Core Dimensions

9 Participating Agents

Monthly Updates

Dimensions

Evaluation Dimensions

Covering the full Crypto user journey — from basic CEX operations to complex on-chain investigations, all based on real-world scenarios.

10 tasks

CEX

Spot trading, futures open/close, yield farming queries, grid strategies, account transfers, and portfolio analysis.

10 tasks

DEX

On-chain swaps, cross-chain bridge pricing, slippage control, multi-hop route optimization, and contract risk assessment.

10 tasks

Wallet

Multi-chain transfers, gas estimation, address format validation, wrong-chain prevention, and conditional transfers.

12 tasks

Market Analysis

Real-time market data, RSI/K-line technical analysis, volume-price relationships, multi-asset comparison, and volatility analysis.

12 tasks

Project Research

Tokenomics analysis, narrative cycle assessment, rug pull detection, competitive analysis, and research reports.

12 tasks

On-Chain Tracking

Address profiling and P&L analysis, whale tracking, smart money signals, and protocol security monitoring.

L1: Basic Operations

Single-step instructions with clear intent. Examples: balance queries, price checks, simple orders.

L2: Conditional Operations

Tasks that include pre-checks or exception handling. Examples: insufficient balance detection, parameter completion, wrong-chain risk identification.

L3: Complex Tasks

Multi-step, multi-constraint tasks requiring reasoning and trade-offs. Examples: optimal cross-chain paths, full transfers with gas reserves.

March 2026 Results

Leaderboard

Weighted scores across 6 dimensions. All evaluations use dual-model consensus with human arbitration for disputes.

#	Agent	Type	Total Score	CEX	DEX	Wallet	Market Analysis	Project Research	On-Chain Tracking
1	GateAI Agent	General AI	83.1	89.7	82.4	61.5	86.8	92.3	83.5
2	Claude Agent（Gate for AI installed）	General AI	82.8	79.2	81.6	82.2	83.2	89.6	79.9
3	Codex Agent（Gate for AI installed）	General AI	81.2	80.6	72.8	79	81.5	86.8	84.4
4	AskSurf Agent	Crypto AI	77.5	75.8	75.8	57.5	83.7	95.4	83
5	Manus（Gate for AI installed）	General AI	74.3	74.5	74.5	77.3	73.7	78.4	68.1
6	Binance Agent	Crypto AI	70.1	59.7	72.3	63.9	69.4	80.3	72.6
7	Claude Agent	General AI	68.2	59.4	58.6	59	73.1	80.9	73.6
8	Bitget Agent	Crypto AI	62.2	66.1	44.5	48.9	72	80.3	57.2
9	Codex Agent	General AI	52.2	51.4	46.5	55	60.4	57	42.4
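
The Total Score column appears consistent with a task-count-weighted average of the six dimension scores, using the per-dimension task counts listed under Evaluation Dimensions (10, 10, 10, 12, 12, 12). The sketch below checks that reading against the GateAI Agent row. The weighting rule is an inference from the published numbers, not a documented formula, the function name is hypothetical, and other rows are not guaranteed to match exactly.

```python
# Task counts per dimension, taken from the Evaluation Dimensions section.
TASK_COUNTS = {
    "CEX": 10,
    "DEX": 10,
    "Wallet": 10,
    "Market Analysis": 12,
    "Project Research": 12,
    "On-Chain Tracking": 12,
}


def task_weighted_total(dimension_scores):
    """Average dimension scores, weighting each by its task count (assumed rule)."""
    total_tasks = sum(TASK_COUNTS.values())
    weighted_sum = sum(
        dimension_scores[name] * TASK_COUNTS[name] for name in TASK_COUNTS
    )
    return weighted_sum / total_tasks


# GateAI Agent row, copied from the leaderboard.
gate_ai = {
    "CEX": 89.7,
    "DEX": 82.4,
    "Wallet": 61.5,
    "Market Analysis": 86.8,
    "Project Research": 92.3,
    "On-Chain Tracking": 83.5,
}

print(round(task_weighted_total(gate_ai), 1))  # 83.1
```

A plain mean of the same six scores gives 82.7, which is why the task-count weighting is the more plausible reading.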

Gate AI Agent ranks first overall in this evaluation. As a native Agent deeply integrated with the exchange, it also ranks first in three of the six dimensions: CEX, DEX, and Market Analysis. The evaluation covers 9 Agents, with tasks spanning 6 scenarios: CEX, DEX, Wallet, Market Analysis, Project Research, and On-Chain Tracking. Scoring uses dual-model consensus with human arbitration for disputes. Under this framework, Gate AI Agent's total score of 83.1 is 0.3 points ahead of the second-placed Claude Agent（Gate for AI installed）.

Scoring Framework

Evaluation Methodology

Each task is scored on 2-3 dimensions independently, using dual-model consensus review. All benchmarks and weights are fully transparent.

Intent & Parameter Alignment

Does the Agent correctly understand user intent? Are parameters like amount, direction, and trading pair accurately parsed? Are there misunderstandings (e.g., confusing 10U with 10 SOL)?

Execution Result Correctness

Does the Agent provide correct results? Are API calls, calculations, and outputs accurate and complete? Are there fabricated data or false execution claims?

Risk Identification & Prevention

Can the Agent identify wrong-chain transfers, insufficient gas, rug tokens, and other dangerous operations? Does it correctly block when conditions aren't met rather than forcing execution?

Exception Compatibility & Expression

When encountering permission issues, zero balance, API errors, etc., can the Agent clearly explain the reason and provide next steps?

PASS (1.0): Fully meets all scoring criteria

PARTIAL (0.6): Correct direction but incomplete execution

FAIL (0.0): Error, fabrication, or security risk

Dual-Model Consensus Review

Each task is scored independently by GPT-5.4 and Claude Sonnet 4.6, with scoring benchmarks fixed before testing and independent of Agent identity. Average scores are taken to avoid single-model bias.
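
The averaging step can be sketched as follows. The function name, the 0-100 scale of the inputs, and the dispute threshold are all hypothetical. The source only states that the two models score independently, that their scores are averaged, and that disputes go to human arbitration.

```python
def consensus_score(score_a, score_b, dispute_gap=30.0):
    """Average two judges' scores and flag large disagreements for human review.

    dispute_gap is an illustrative threshold, not a published value.
    """
    average = (score_a + score_b) / 2
    needs_arbitration = abs(score_a - score_b) > dispute_gap
    return average, needs_arbitration


print(consensus_score(100.0, 90.0))  # (95.0, False)
print(consensus_score(100.0, 40.0))  # (70.0, True)
```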

Weighted Composite Scoring

Each scoring dimension has explicit weights (e.g., intent alignment 35%, execution correctness 45%, security handling 20%), aggregated into task scores, then consolidated by dimension for Agent composite scores.
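
As a worked illustration, the sketch below combines the PASS / PARTIAL / FAIL values (1.0 / 0.6 / 0.0) with the example weights quoted above (35% / 45% / 20%) into a single task score. The dictionary keys, the function name, and the 0-100 scaling are assumptions made for the example, and actual weights may differ from task to task.

```python
GRADE_VALUES = {"PASS": 1.0, "PARTIAL": 0.6, "FAIL": 0.0}

# Example weights from the text; real tasks may use different weights.
EXAMPLE_WEIGHTS = {"intent": 0.35, "execution": 0.45, "security": 0.20}


def task_score(grades, weights=EXAMPLE_WEIGHTS):
    """Weighted sum of per-dimension grade values, scaled to 0-100 (assumed scale)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return 100 * sum(weights[name] * GRADE_VALUES[grades[name]] for name in weights)


grades = {"intent": "PASS", "execution": "PARTIAL", "security": "PASS"}
print(round(task_score(grades), 1))  # 82.0
```

Here the partial execution grade costs 0.45 × 0.4 = 0.18 of the total, so the task lands at 82 rather than 100.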

Participating Agent Categories

Gate AI Agent

Gate's native AI assistant with full access to Gate MCP and AI Skills capabilities

General-Purpose AI Agent

Mainstream AI platforms' general Agents (e.g., Claude, ChatGPT) with Gate MCP installed

Third-Party Crypto AI Agent

Other Crypto-specific AI Agents from across the industry

Detailed Scores

Task Details

Each task below lists every Agent's score together with the scoring dimensions used.

CEX

cex_001 (L1): Help me check how much USDT I have left in my spot account.

Agent	Score
GateAI Agent	100
Claude Agent（Gate for AI installed）	95
Codex Agent（Gate for AI installed）	82.5
AskSurf Agent	36.5
Manus（Gate for AI installed）	94
Binance Agent	87.5
Claude Agent	36.5
Bitget Agent	77.5
Codex Agent	36.5

Scoring Dimensions

Account intent understanding: Is the request correctly identified as a spot account balance inquiry, rather than a total-assets query, a contract balance query, or a deposit operation?

Balance inquiry accuracy: Does the Agent return the available spot USDT balance, with clear values and units, and distinguish available from frozen funds?

Error Handling and Explanation: When issues such as a missing login or an authorization failure occur, does the Agent give a clear reason and next steps?

cex_002 (L1): Buy 10U of SOL at market price.

Agent	Score
GateAI Agent	89
Claude Agent（Gate for AI installed）	72.5
Codex Agent（Gate for AI installed）	87.5
AskSurf Agent	77.5
Manus（Gate for AI installed）	90
Binance Agent	67.5
Claude Agent	77.5
Bitget Agent	42.5
Codex Agent	36.5

Scoring Dimensions

Instruction parsing accuracy: Does the Agent correctly read 10U as an amount in USDT, rather than a quantity of 10 SOL?

Transaction execution integrity: Does the Agent return the transaction result, confirmation steps, or a clear explanation of the order status?

Risk identification and blocking: When the balance is insufficient or permissions are restricted, does the Agent block the order correctly and tell the user what to do next?

cex_003 (L1): What is the annualized yield of USDT financial products?

Agent	Score
GateAI Agent	95
Claude Agent（Gate for AI installed）	87.5
Codex Agent（Gate for AI installed）	91
AskSurf Agent	77.5
Manus（Gate for AI installed）	72.5
Binance Agent	65
Claude Agent	77.5
Bitget Agent	69
Codex Agent	42.5

Scoring Dimensions

Product scope identification: Does the Agent focus on USDT wealth management and earn products, rather than drifting toward trading or lending?

Result Validity: Does the Agent return at least one valid USDT financial product together with its annualized yield?

Description of Earnings and Restrictions: Does the Agent explain that the rate of return changes over time, or note eligibility or regional restrictions?

cex_004 (L1): Help me find a seller who supports Alipay to buy 5,000 USDT.

Agent	Score
GateAI Agent	100
Claude Agent（Gate for AI installed）	47.5
Codex Agent（Gate for AI installed）	60
AskSurf Agent	77.5
Manus（Gate for AI installed）	55
Binance Agent	40
Claude Agent	36.5
Bitget Agent	42.5
Codex Agent	71.5

Scoring Dimensions

P2P Scenario Recognition: Is the request correctly identified as a P2P fiat purchase, with Alipay, 5000 yuan, and USDT extracted as the three parameters?

Matching result quality: Does the Agent return a list of ads that meet the criteria, or an executable purchase plan?

Blocking and Risk Explanation: When there are no matching ads or the user lacks the required qualifications, does the Agent give a clear reason and next steps?

cex_005 (L2): Short ETH

Agent	Score
GateAI Agent	90
Claude Agent（Gate for AI installed）	92.5
Codex Agent（Gate for AI installed）	82.5
AskSurf Agent	36.5
Manus（Gate for AI installed）	75
Binance Agent	71.5
Claude Agent	52.5
Bitget Agent	52.5
Codex Agent	36.5

Scoring Dimensions

Understanding Trading Direction: Is "short ETH" correctly identified as opening a short position in a perpetual contract, rather than selling spot ETH?

Parameter completion and plan: Does the Agent ask for missing parameters, and does the final plan include direction, leverage, and margin?

Closed-loop execution and blocking: Once all parameters are complete, does the Agent provide an executable plan, and is blocking accurate under time constraints?

cex_006 (L2): Help me close the long position on BTC.

Agent	Score
GateAI Agent	72.5
Claude Agent（Gate for AI installed）	96
Codex Agent（Gate for AI installed）	95
AskSurf Agent	52.5
Manus（Gate for AI installed）	82.5
Binance Agent	51.5
Claude Agent	36.5
Bitget Agent	89
Codex Agent	61.5

Scoring Dimensions

Position closing semantic recognition: Is the request correctly identified as closing a long position (selling to close), rather than opening a short position?

Position verification and results: Does the Agent check the BTC long position first, before returning the closing result or the next confirmation step?

Risk and Exception Handling: In scenarios such as no open position or insufficient permissions, does the Agent give an accurate explanation?

cex_007 (L2): Transfer 10 USDT from the spot account to the perpetual contract account.

Agent	Score
GateAI Agent	90
Claude Agent（Gate for AI installed）	94
Codex Agent（Gate for AI installed）	92.5
AskSurf Agent	71.5
Manus（Gate for AI installed）	92.5
Binance Agent	71.5
Claude Agent	67.5
Bitget Agent	69
Codex Agent	52.5

Scoring Dimensions

Transfer path correctness: Is the request correctly identified as an internal transfer from the spot account to the perpetual contract account?

Execution or blocking results: Does the Agent report the status when the transfer succeeds, and block accurately when the balance is insufficient?

Clarity of Information: Are the account direction, the amount, and the reason for any anomaly clearly stated?

cex_008 (L2): Buy $100 when ETH drops to $2500.

Agent	Score
GateAI Agent	75
Claude Agent（Gate for AI installed）	62.5
Codex Agent（Gate for AI installed）	70
AskSurf Agent	62.5
Manus（Gate for AI installed）	59
Binance Agent	37.5
Claude Agent	77.5
Bitget Agent	62.5
Codex Agent	62.5

Scoring Dimensions

Order Type Identification: Is the request recognized as a limit buy order at the target price, rather than a market order for immediate execution?

Parameter Accuracy: Are the three core parameters (asset ETH, target price 2500, amount 100U) all accurate?

Closed-loop execution: Does the Agent provide a confirmation or execution status, and is blocking accurate under time constraints?

cex_009 (L3): Please help me analyze whether my total account over the last 30 days has outperformed BTC, and also check the win rate and profit-loss ratio of USDT perpetual contracts.

Agent	Score
GateAI Agent	90
Claude Agent（Gate for AI installed）	85
Codex Agent（Gate for AI installed）	77.5
AskSurf Agent	77.5
Manus（Gate for AI installed）	49
Binance Agent	27.5
Claude Agent	62.5
Bitget Agent	77.5
Codex Agent	77.5

Scoring Dimensions

Analysis scope coverage: Does the analysis cover both whether the account outperformed BTC and the perpetual trading behavior?

Results and Metric Accuracy: Does the Agent give a conclusion on whether the account outperformed BTC, along with win rate and profit-loss ratio data?

Measurement Basis and Exception Handling: Are the two analysis bases clearly distinguished, and are limitations stated separately for each when data is unavailable?

cex_010 (L3): Open a BTC spot grid with 100 USDT.

Agent	Score
GateAI Agent	95
Claude Agent（Gate for AI installed）	60
Codex Agent（Gate for AI installed）	67.5
AskSurf Agent	77.5
Manus（Gate for AI installed）	75
Binance Agent	77.5
Claude Agent	69
Bitget Agent	79
Codex Agent	36.5

Scoring Dimensions

Strategy Type Identification: Is the request correctly identified as a BTC spot grid, rather than a contract grid or another quantitative strategy?

Plan parameter correctness: Does the plan accurately reflect the three key elements: BTC, 100 USDT, and spot grid?

Blocking and Limitation Explanation: When the balance is insufficient or the strategy is unavailable, does the Agent give a clear reason?
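
The ten CEX task scores above can be averaged to cross-check the CEX column of the leaderboard. The sketch assumes a dimension score is the plain mean of its task scores, which is an inference from the numbers rather than a documented rule. For GateAI Agent it gives 89.65, in line with the published 89.7.

```python
from statistics import mean

# GateAI Agent scores for cex_001 through cex_010, copied from the task details.
gate_ai_cex = [100, 89, 95, 100, 90, 72.5, 90, 75, 90, 95]

# Assumed rule: dimension score = unweighted mean of the task scores.
dimension_score = mean(gate_ai_cex)
print(dimension_score)
```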

FAQ

Frequently Asked Questions

What is AI-ABC?

AI-ABC (AI Agent Benchmark for Crypto) is the industry's first standardized evaluation framework specifically designed for AI Agents in Crypto scenarios. It covers 6 dimensions: CEX trading, DEX operations, wallet management, market analysis, project research, and on-chain tracking. Using 66+ real-world tasks based on actual user scenarios, it employs reproducible scoring mechanisms to benchmark various AI Agents across CEX and Web3.

How is this different from GAIA and AgentBench?

Existing evaluation frameworks like GAIA and AgentBench focus on general scenarios without Crypto-specific tasks. AI-ABC's tasks are all based on real Crypto operations — from 'buy $10 of SOL at market price' to 'bridge 1000 USDC and swap to ETH with slippage control' — including many operation-based tasks requiring real API calls to exchanges, wallet interfaces, and on-chain data. This is completely beyond the scope of general benchmarks.

How is this benchmark scored?

Each Agent is scored on 66+ real-world tasks for task completion, accuracy, and execution efficiency. Every task is graded on 2-3 weighted scoring dimensions (PASS 1.0, PARTIAL 0.6, FAIL 0.0) by two models independently, and the averaged task scores are consolidated by dimension into the Agent's composite score.

How often is the evaluation data updated?

Monthly updates. As Agents iterate and improve, and new Agents join, we continuously run evaluations and update the leaderboard. The task set also expands based on industry developments and new scenarios.

Is the scoring objective? Is it fair that Gate AI participates?

Scoring benchmarks are fixed before testing and independent of Agent identity. Evaluations use dual-model consensus (GPT-5.4 and Claude Sonnet 4.6 score independently), with average scores taken to avoid single-model bias. All scoring dimensions, weights, and task benchmarks are publicly available on GitHub for anyone to reproduce.

How are task difficulties classified?

Three levels: L1 (Basic Operations: single-step, clear intent), L2 (Conditional Operations: includes pre-checks or exception handling), L3 (Complex Tasks: multi-step, multi-constraint, requires reasoning and trade-offs). Higher difficulty better reflects an Agent's comprehensive decision-making in real Crypto scenarios.

What are the main differences between general-purpose and Crypto-specific AI Agents?

General-purpose AI Agents (like Claude and ChatGPT) perform comparably to specialized Agents in information-retrieval tasks (market analysis, project research) but show significant gaps in execution-based tasks (trading, transfers, gas estimation, wrong-chain prevention). This is why specialized AI Agent infrastructure is essential for Crypto.