Grok and Elite AIs Lose Big on Premier League Soccer Bets

AI's Betting Blunder in Soccer

A recent study has laid bare the limitations of even the most sophisticated AI models when it comes to navigating the unpredictable world of sports betting. In a simulated Premier League season, systems from Google, OpenAI, Anthropic, and xAI (maker of Grok) all ended up in the red. The KellyBench report, published this week by London-based AI startup General Reasoning, puts hard numbers on this shortfall.

Researchers fed the AIs comprehensive datasets: team histories, player stats, past match outcomes, and more. The task was straightforward: develop betting strategies using the Kelly criterion to optimize returns while controlling risk. Yet across the 380 matches of a recreated 2023-24 season, not one model turned a profit. This isn't a fluke; it signals deeper issues in how AIs handle extended, real-world scenarios.

The KellyBench Breakdown

KellyBench isn't just another leaderboard. It takes its name from the Kelly criterion, a mathematical formula for bet sizing that maximizes long-run bankroll growth while limiting the risk of ruin, published by John L. Kelly Jr. in 1956. General Reasoning's setup forced AIs to update their predictions week by week, adapting to results without hindsight. Grok, touted for its reasoning prowess, performed particularly poorly, as noted in the underlying Ars Technica coverage.
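For readers who want the arithmetic, the Kelly stake for a single win-or-lose bet can be sketched in a few lines of Python. This is the textbook single-bet form, not necessarily the exact formulation used in KellyBench; the function name `kelly_fraction` and the decimal-odds convention are assumptions made for illustration.

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Fraction of bankroll to stake on one win-or-lose bet (textbook Kelly).

    p_win        -- the bettor's estimated probability of winning (0..1)
    decimal_odds -- total payout per unit staked, e.g. 2.0 for even money
    """
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be between 0 and 1")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must be greater than 1")

    b = decimal_odds - 1.0               # net profit per unit staked on a win
    f = (b * p_win - (1.0 - p_win)) / b  # Kelly formula: (b*p - q) / b
    return max(f, 0.0)                   # no edge (or a negative one): stake nothing


# Illustrative inputs only; these are not figures from the report.
stake_with_edge = kelly_fraction(p_win=0.60, decimal_odds=2.0)
stake_without_edge = kelly_fraction(p_win=0.50, decimal_odds=2.0)
```

With a 60% win probability at even money, the formula suggests staking about a fifth of the bankroll. When the estimated edge is zero or negative, it suggests staking nothing, which is why the quality of the probability estimate matters so much.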

The eight models tested spanned the frontier: from GPT-4 variants to Claude and Gemini, plus Grok. They had access to every scrap of relevant data up to each matchday. Still, systematic biases emerged: overconfidence in favorites, underestimation of variance, and failure to adjust for injuries or slumps in form. Over a full season, these compounded into consistent losses.
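How a small, systematic bias compounds over a season can be illustrated with the expected log growth of a Kelly bettor whose probability estimate is too optimistic. The sketch below is a simplified model, not the report's methodology: it assumes one identical bet per match, and every probability and price in it is a made-up illustrative value. The helper names `kelly_fraction` and `expected_log_growth` are hypothetical.

```python
import math


def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Textbook Kelly stake for one win-or-lose bet, floored at zero."""
    b = decimal_odds - 1.0
    return max((b * p_win - (1.0 - p_win)) / b, 0.0)


def expected_log_growth(p_true: float, p_believed: float, decimal_odds: float) -> float:
    """Expected log bankroll growth per bet when the stake is sized from
    p_believed but outcomes actually occur with probability p_true."""
    b = decimal_odds - 1.0
    f = kelly_fraction(p_believed, decimal_odds)
    win_term = p_true * math.log(1.0 + f * b)
    if p_true == 1.0:
        return win_term            # the bet can never lose
    if f >= 1.0:
        return float("-inf")       # whole bankroll staked on a bet that can lose
    return win_term + (1.0 - p_true) * math.log(1.0 - f)


# Assumed scenario: a true coin-flip priced at 1.90 by the bookmaker,
# which an overconfident model rates as a 60% chance.
MATCHES = 380  # matches in a Premier League season
g = expected_log_growth(p_true=0.50, p_believed=0.60, decimal_odds=1.90)

# Geometric-mean bankroll multiplier after a season of such bets.
typical_multiplier = math.exp(MATCHES * g)
```

In this toy setting the per-bet growth rate `g` is negative, so the typical bankroll shrinks with every round. A bettor with the correct 50% estimate would see no edge at those odds and stake nothing, ending the season flat.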

Key Findings from the Models

All models lost money, with average returns deeply negative.
Grok showed especially weak performance in risk management.
AIs excelled at short-term predictions but faltered over seasons.
No model beat a simple baseline of betting on home teams.
Historical data overload led to overfitting rather than generalization.
Human bettors with basic stats would have outperformed the field.
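To make the home-team baseline concrete, here is a minimal sketch of one way such a strategy could be scored. The report's exact staking rule is not described here, so the flat one-unit stake, the function name `flat_home_baseline`, and the four toy matches are all assumptions for illustration, not data from the study.

```python
from typing import Iterable, Tuple


def flat_home_baseline(matches: Iterable[Tuple[float, str]], stake: float = 1.0) -> float:
    """Net profit from staking the same amount on the home side of every match.

    Each match is (home_decimal_odds, result), where result is
    "H" (home win), "D" (draw), or "A" (away win).
    """
    profit = 0.0
    for home_odds, result in matches:
        if result == "H":
            profit += stake * (home_odds - 1.0)  # winnings net of the stake
        else:
            profit -= stake                      # draw or away win loses the stake
    return profit


# Hypothetical toy fixtures, invented purely to exercise the function.
toy_matches = [(2.0, "H"), (1.5, "A"), (3.0, "H"), (2.5, "D")]
toy_profit = flat_home_baseline(toy_matches)
```

A baseline like this needs no model at all, which is what makes it a useful yardstick: any strategy that cannot beat it is adding cost without adding insight.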

Implications for AI Development

This experiment underscores a persistent gap: AIs dominate benchmarks in coding, math, and trivia, but real-world domains with uncertainty expose their brittleness. Soccer's chaotic elements, such as weather, referee calls, and morale, mirror the complexity of everyday life. General Reasoning argues that true general intelligence demands proficiency here, not just parlor tricks.

For companies like xAI pushing Grok as a 'maximum truth-seeking' AI, the results are sobering. Betting requires fusing stats, intuition, and adaptation, hallmarks of human intelligence that AIs haven't cracked. As capabilities scale, such benchmarks will be crucial for measuring progress beyond the hype.

Looking Ahead

KellyBench sets a template for future tests: longitudinal, data-rich, high-stakes simulations. Will next-generation models, including future versions of Grok, fare better? History suggests incremental gains, but soccer's black swans will keep testing the limits. For now, if you're betting on the Premier League, stick to your gut: AI isn't ready to take your bankroll.