Universität Paderborn

06/15/2026 | Press release | Distributed by Public on 06/15/2026 06:18

When ChatGPT predicts the World Cup: A new live benchmark for language models

Can a language model like ChatGPT predict whether the German national team will win the World Cup? What at first sounds like a question for football fans offers a unique opportunity from a scientific perspective: The predictions of large language models (LLMs) can be evaluated in a real-world decision-making context. Unlike many abstract test tasks, football predictions make it clear later on whether a forecast was correct. In the "LLM-SoccerArena" project, a team from Paderborn University, Ludwig Maximilian University of Munich (LMU) and the University of Cologne is investigating how well large language models can process real-world information and make predictions under conditions of uncertainty.

Many established benchmarks for large language models test abstract tasks in highly simplified or static environments. Such tests are important, but are increasingly reaching their limits. Medical exam questions, legal tasks or MBA tests are now frequently solved very well by many modern models. At the same time, such tasks provide only limited insight into how reliably models perform in real-world decision-making situations under uncertainty. This is where the idea behind "LLM-SoccerArena" comes in. Much like a prediction game, models make predictions that can only be verified later against real-world results.

The project investigates how well large language models such as ChatGPT, Claude or Mistral can predict the results of real football matches. The platform includes a live leaderboard, which is updated daily, as well as an overview of all model predictions. "We're not just interested in which model gets it right in the end," says Prof. Dr Stefan Feuerriegel of LMU. "What matters is how a model arrives at its prediction: what information does it look for? Which information does it take into account? And can it distinguish relevant signals from mere popularity patterns?"

The findings are also relevant to management research. Executives are increasingly using large language models to structure market information, evaluate scenarios or prepare forecasts, for example regarding strategic decisions, competitors, product launches or risks. In such cases, the quality of the response depends not only on logical reasoning. Models must also capture and contextualise information about the real world, such as the relevance and timeliness of information, as well as the reliability of sources and derived assessments.

A similar challenge arises with "LLM-SoccerArena". For a good football prediction, it is not enough to simply draw on general football knowledge. A model must assess information on current form, injuries, managerial decisions, past encounters, squad quality or betting odds and derive a verifiable prediction from it. The platform thus uses football matches as a realistic testing ground to assess how well large language models perform in real-world decision-making situations. "The major advantage of LLMs compared to statistical prediction models is that LLMs can react flexibly to new and even unstructured information, for example rumours from social media," explains Prof. Dr Oliver Müller, Professor of Business Informatics at Paderborn University and Director of the "Artificial Intelligence" competence area at the Software Innovation Lab of the SICP - Software Innovation Campus Paderborn.

Football offers a particularly interesting field of research for these questions. The sport is frequently used in management research because it allows real-world decisions to be made and evaluated under comparatively structured conditions. Matches take place at clearly defined times, decisions and events are publicly visible, and the result - that is, a team's success - is clearly measurable shortly afterwards. This allows for a systematic investigation of whether predictions actually hold true and which models perform better under which conditions.
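To illustrate how predictions can be checked against results once matches have been played, here is a minimal sketch in Python. It is not the project's actual scoring method, which this release does not describe. The sketch assumes that each model submits probabilities for a home win, a draw and an away win, and it ranks models by the mean Brier score (lower is better) alongside simple hit rate. The model names, match IDs and function names are hypothetical.

```python
from collections import defaultdict

OUTCOMES = ("home", "draw", "away")


def brier_score(probs, actual):
    """Multiclass Brier score: squared error between forecast and realised outcome."""
    return sum((probs[o] - (1.0 if o == actual else 0.0)) ** 2 for o in OUTCOMES)


def leaderboard(predictions, results):
    """Rank models by mean Brier score over matches whose result is already known.

    predictions: list of (model_name, match_id, {"home": p, "draw": p, "away": p})
    results:     dict mapping match_id to the realised outcome
    """
    totals = defaultdict(lambda: {"brier": 0.0, "hits": 0, "n": 0})
    for model, match_id, probs in predictions:
        if match_id not in results:
            continue  # match not played yet, so the prediction cannot be verified
        actual = results[match_id]
        row = totals[model]
        row["brier"] += brier_score(probs, actual)
        # A "hit" means the outcome with the highest probability actually occurred.
        row["hits"] += (max(OUTCOMES, key=lambda o: probs[o]) == actual)
        row["n"] += 1
    table = [(m, r["brier"] / r["n"], r["hits"] / r["n"]) for m, r in totals.items()]
    return sorted(table, key=lambda entry: entry[1])


if __name__ == "__main__":
    # Hypothetical data for illustration only.
    predictions = [
        ("model_a", "match_1", {"home": 0.6, "draw": 0.3, "away": 0.1}),
        ("model_b", "match_1", {"home": 0.2, "draw": 0.3, "away": 0.5}),
        ("model_a", "match_2", {"home": 0.4, "draw": 0.3, "away": 0.3}),
    ]
    results = {"match_1": "home"}  # match_2 has not been played yet
    for model, mean_brier, hit_rate in leaderboard(predictions, results):
        print(f"{model}: mean Brier {mean_brier:.2f}, hit rate {hit_rate:.0%}")
```

A probability-based score such as this rewards well-calibrated forecasts and not only correct tips, which is one way to compare models "under which conditions" they perform better.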

The project compares two approaches: on the one hand, models make predictions based on their internal knowledge. On the other hand, it tests how well models can retrieve and process additional external information from the internet. This is no trivial matter, because not everything on the internet is reliable or up to date. In this so-called agentic search, the first question that arises is therefore whether a language model is actually retrieving the right information at all. Does it check for current injuries, starting line-ups, recent form, managerial changes, head-to-head records, tournament context or betting odds? It must then weight this information appropriately. High betting odds, a high-profile team or a single strong performance by a team can be misleading.
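Betting odds are one of the signals named above. As a minimal sketch, assuming decimal odds and simple proportional normalisation (one of several common ways to remove the bookmaker's margin), the following Python function converts odds into implied probabilities that can be compared with a model's own forecast. The function name and the example odds are hypothetical and are not taken from the project.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal betting odds into probabilities that sum to 1.

    decimal_odds: dict mapping an outcome to its decimal odds, e.g. {"home": 1.8, ...}
    """
    if any(odds <= 1.0 for odds in decimal_odds.values()):
        raise ValueError("decimal odds must be greater than 1.0")
    # Raw implied probability is the reciprocal of the decimal odds.
    raw = {outcome: 1.0 / odds for outcome, odds in decimal_odds.items()}
    # The raw values sum to more than 1 because of the bookmaker's margin;
    # dividing by that sum rescales them to a proper probability distribution.
    total = sum(raw.values())
    return {outcome: p / total for outcome, p in raw.items()}


if __name__ == "__main__":
    # Hypothetical odds for illustration only.
    odds = {"home": 1.8, "draw": 3.6, "away": 4.5}
    for outcome, p in implied_probabilities(odds).items():
        print(f"{outcome}: {p:.3f}")
```

Odds converted this way are only one input: as the text notes, a model still has to weigh them against other information such as injuries or line-ups.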

So which team will win the World Cup? In the models' current predictions, OpenAI's GPT-5.5 and Claude Opus 4.8 forecast Spain as world champions, whilst Mistral Large predicts France. Once the first actual results are in, it will be possible to assess which models are more reliable in this task.

The differences between the predictions are also of scientific interest. "One possible explanation for differing forecasts lies in the models' training data," says Prof. Dr Markus Weinmann, Professor of Business Analytics at the University of Cologne. "Models can tend to reproduce widespread internet opinions or patterns from their training data." Teams with high global visibility or a particularly large number of mentions on the internet could thus be systematically favoured. The language and origin of the training data can also play a role. If a model has processed a particularly large number of texts from a specific linguistic region, this can influence its assessments. This could be one reason why Mistral, as a language model developed by a French company, tends to favour France.

This makes "LLM-SoccerArena" more than just a light-hearted comparison of football predictions. The project offers a live benchmark for real-world decision-making and forecasting tasks: it is specifically not a betting recommendation, but rather demonstrates how well large language models can search for and evaluate information under conditions of uncertainty and translate it into verifiable predictions.

This text was translated automatically.

Universität Paderborn published this content on June 15, 2026, and is solely responsible for the information contained herein. Distributed via Public Technologies (PUBT), unedited and unaltered, on June 15, 2026 at 12:19 UTC.