AIResearch

Wallet V Benchmarks 688 AI Trading Agents by LLM Family

Wallet V's new public benchmark compares 688 live AI trading agents by underlying language model, covering perpetual futures across crypto, equities, commodities, and FX.

Forty-two percent of the AI trading agents tracked in a new live-market benchmark generated flat or positive returns over a two-month period. That headline figure comes from a public dataset covering 688 agents that Wallet V users deployed on decentralized derivatives platforms, and it is the first dataset of this kind to break out performance by underlying large language model.

Wallet V, a self-custody Web3 wallet, published the benchmark this week on its website. According to Crypto Briefing, each agent was configured by an individual user who also selected which LLM would generate the trading signals. Execution happened on Hyperliquid and Aster, two third-party decentralized derivatives platforms. Wallet V neither holds funds nor routes orders; it only aggregates the on-platform performance data.

The dataset spans seven LLM families. Agent-level returns at the extremes tell the starkest story: the worst-performing model family produced a peak agent return of negative 30 percent, while the best reached positive 307 percent at the agent level. Those figures are peaks, not medians, and they reveal more about the variance within each cohort than about the typical outcome.
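To make the peak-versus-median distinction concrete, here is a minimal Python sketch. The function name `summarize_cohort` and the return values are hypothetical and chosen only for illustration; none of the numbers come from the Wallet V dataset.

```python
from statistics import median

def summarize_cohort(returns):
    """Summarize one cohort of agent returns (fractions, so 0.10 means +10%)."""
    return {
        "n": len(returns),
        "peak": max(returns),          # best single agent in the cohort
        "median": median(returns),     # the typical agent
        "share_flat_or_positive": sum(r >= 0 for r in returns) / len(returns),
    }

# Hypothetical returns, NOT from the benchmark: one outlier, mostly losers.
example_returns = [-0.40, -0.25, -0.10, -0.05, 0.00, 0.02, 3.07]
summary = summarize_cohort(example_returns)
print(summary)
```

In this made-up cohort the peak is strongly positive while the median agent lost money, which is why a peak figure alone says little about what a typical user experienced.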

The benchmark infrastructure

Agents operated across four asset classes: major digital assets including BTC, ETH, and SOL; equity instruments that include pre-IPO exposure; commodity benchmarks covering gold, silver, and oil; and foreign exchange pairs. All positions were opened as perpetual futures through Hyperliquid or Aster. The scope reflects how crypto derivatives platforms have expanded well beyond digital assets, introducing more diverse market dynamics for LLM agents to navigate.

One methodological caveat is built into the design: model families represented by fewer than 10 agents are reported as directional, not statistically conclusive. That flag matters more than it might appear. With 688 agents across seven families, the average cohort is about 98 agents, but the distribution is unlikely to be uniform, so several LLMs probably sit close to or below the 10-agent threshold. The benchmark at least signals this explicitly rather than burying it.
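The arithmetic behind that caveat can be sketched in a few lines of Python. The totals (688 agents, seven families, a 10-agent cutoff) come from the benchmark as reported; the per-family counts, the family names, and the "reportable" label are hypothetical, since the actual cohort sizes are not given here.

```python
TOTAL_AGENTS = 688   # reported by the benchmark
NUM_FAMILIES = 7     # reported by the benchmark
MIN_COHORT = 10      # cohorts below this size are reported as directional

average_cohort = TOTAL_AGENTS / NUM_FAMILIES
print(f"average cohort size: {average_cohort:.1f}")

def flag_cohorts(counts, min_cohort=MIN_COHORT):
    """Label each family by whether it clears the minimum cohort size."""
    return {
        family: "directional" if n < min_cohort else "reportable"
        for family, n in counts.items()
    }

# Hypothetical, deliberately skewed split that sums to 688 (an assumption).
hypothetical_counts = {
    "family_a": 300, "family_b": 200, "family_c": 100, "family_d": 60,
    "family_e": 15, "family_f": 8, "family_g": 5,
}
flags = flag_cohorts(hypothetical_counts)
print(flags)
```

Under this assumed split the average is still about 98 agents per family, yet two of the seven families fall below the cutoff, which is the scenario the caveat is guarding against.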

Performance is refreshed as users deploy new agents, which means the dataset will evolve. Whether that evolution trends toward larger cohorts and narrower confidence intervals, or stays lumpy as users gravitate toward a handful of preferred models, will determine how practically useful the leaderboard becomes.
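How much cohort size matters can be illustrated with a standard Wilson score interval for a proportion. This is a rough sketch, not anything Wallet V has published: it assumes agents are independent, and it applies the overall 42 percent share to hypothetical cohort sizes purely to show how the interval narrows as the cohort grows.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Same observed share, three cohort sizes (10 and 100 are hypothetical).
for n in (10, 100, 688):
    low, high = wilson_interval(0.42, n)
    print(f"n={n:>3}: [{low:.2f}, {high:.2f}]")
```

At a cohort of 10 the interval spans roughly half of the possible range, which is consistent with the benchmark's decision to treat small cohorts as directional only.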

What this actually measures

Leaderboards for algorithmic trading strategies exist across many platforms, but they typically aggregate all strategies together or compare human traders. Breaking out results by LLM is a different proposition: it treats the choice of foundation model as an empirically testable variable in live markets.

This framing connects to a broader push in artificial intelligence research to move from synthetic benchmarks toward evaluations in real environments. In robotics, USA Today recently covered ACE ROBOTICS topping four embodied-intelligence benchmarks with its Kairos world model, but those benchmarks are simulated. A trading benchmark with actual capital at risk is something different: the environment cannot be gamed by the benchmark designers, and the feedback signal is financially unambiguous.

Significant confounds remain, though. Users self-select their LLM and their strategy, so a user who runs an aggressive leveraged long on SOL during a favorable period inflates the numbers for whichever model they happened to choose. There is no control group. Causation between LLM choice and return cannot be established from this data alone. What Crypto Briefing reported is aggregate observed performance, not a controlled evaluation of model capabilities under matched conditions.

The LLM selection dynamic also introduces a fast-moving variable. The market for foundation models has expanded sharply, with Price Per Token tracking dozens of new model releases each month across providers. A benchmark that captures seven LLM families today may look quite different in six months as users adopt newer options. Wallet V's commitment to refresh the data as agents are deployed is necessary precisely because the model landscape does not hold still.

Looking forward

The 42 percent flat-or-positive figure will get cited out of context. It sounds better than random, but the right baseline is not a coin flip; it is the return on holding the underlying assets or running a systematic momentum strategy over the same period. Wallet V has not published that comparison, and until it does, the benchmark measures something real but incomplete.
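The comparison described above amounts to subtracting a baseline return from each agent's return. A minimal Python sketch follows, with hypothetical function names and made-up prices; it ignores leverage, funding rates, and fees, all of which a real perpetual-futures comparison would need to account for.

```python
def buy_and_hold_return(start_price, end_price):
    """Simple return from holding the underlying asset over the period."""
    return end_price / start_price - 1.0

def excess_return(agent_return, baseline_return):
    """Agent return minus the baseline; positive means the agent beat holding."""
    return agent_return - baseline_return

# Hypothetical numbers, NOT from the benchmark.
baseline = buy_and_hold_return(start_price=100.0, end_price=120.0)
agent = 0.05  # the agent finished the period up 5 percent
excess = excess_return(agent, baseline)
print(f"baseline: {baseline:+.2%}, agent: {agent:+.2%}, excess: {excess:+.2%}")
```

In this example the agent counts toward the flat-or-positive share even though it trailed a passive holder of the asset by a wide margin, which is exactly the information the headline figure leaves out.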

Still, at 688 agents and growing, this is the largest public record of LLM-driven live trading performance that currently exists. When cohort sizes grow large enough to produce statistically conclusive results across all seven model families, practitioners will have something genuinely novel: not a synthetic test of artificial intelligence, but a live performance record from markets where the cost of being wrong is immediate and measurable.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
