Google LLC

01/14/2026 | Press release

Introducing Community Benchmarks on Kaggle


Today, Kaggle is launching Community Benchmarks, which lets the global AI community design, run and share their own custom benchmarks for evaluating AI models. This builds on last year's launch of Kaggle Benchmarks, which provides trustworthy, transparent access to evaluations from top-tier research groups, such as Meta's MultiLoKo and Google's FACTS suite.

Why community-driven evaluation matters

AI capabilities have evolved so rapidly that it's become difficult to evaluate model performance. Not long ago, a single accuracy score on a static dataset was enough to determine model quality. But today, as LLMs evolve into reasoning agents that collaborate, write code and use tools, those static metrics and simple evaluations are no longer sufficient.

Kaggle Community Benchmarks provide developers with a transparent way to validate their specific use cases and bridge the gap between experimental code and production-ready applications.

These real-world use cases demand a more flexible and transparent evaluation framework. Kaggle's Community Benchmarks provide a more dynamic, rigorous and continuously evolving approach to AI model evaluation - one shaped by the users building and deploying these systems every day.

How to build your own benchmarks on Kaggle

Benchmarks start with building tasks, which can range from evaluating multi-step reasoning and code generation to testing tool use or image recognition. Once you have tasks, you can add them to a benchmark, which evaluates and ranks selected models by how they perform across those tasks.

Here's how you can get started:

  1. Create a task: Tasks test an AI model's performance on a specific problem. They allow you to run reproducible tests across different models to compare their accuracy and capabilities.
  2. Create a benchmark: Once you have created one or more tasks, you can group them into a benchmark. A benchmark allows you to run tasks across a suite of leading AI models and generate a leaderboard to track and compare their performance (a sketch of how these two steps fit together follows this list).
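
To make these two steps concrete, here is a minimal illustrative sketch in Python. The Task and Benchmark classes, the scorer functions and the model callables below are hypothetical stand-ins, not the actual kaggle-benchmarks API; they only show the shape of the workflow.

```python
# Illustrative sketch only -- these classes are hypothetical stand-ins,
# not the real kaggle-benchmarks SDK API.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Task:
    """A single reproducible test: a prompt plus a scorer for the model's reply."""
    name: str
    prompt: str
    scorer: Callable[[str], float]  # maps a model response to a score in [0, 1]


@dataclass
class Benchmark:
    """Groups tasks and ranks models by their mean score across those tasks."""
    name: str
    tasks: List[Task] = field(default_factory=list)

    def run(self, models: Dict[str, Callable[[str], str]]) -> Dict[str, float]:
        leaderboard = {}
        for model_name, generate in models.items():
            scores = [task.scorer(generate(task.prompt)) for task in self.tasks]
            leaderboard[model_name] = sum(scores) / len(scores)
        # Highest average score first
        return dict(sorted(leaderboard.items(), key=lambda kv: kv[1], reverse=True))


# Step 1: create a task that tests a specific capability.
arithmetic = Task(
    name="two-digit-addition",
    prompt="What is 47 + 35? Answer with just the number.",
    scorer=lambda response: 1.0 if "82" in response else 0.0,
)

# Step 2: group tasks into a benchmark and run it across models.
bench = Benchmark(name="toy-math", tasks=[arithmetic])
results = bench.run({"model-a": lambda p: "82", "model-b": lambda p: "I think 83"})
print(results)  # {'model-a': 1.0, 'model-b': 0.0}
```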

Once you build your benchmark, here are the benefits you'll see:

  • Broad model access: Free access (within quota limits) to state-of-the-art models from labs like Google, Anthropic, DeepSeek and more.
  • Reproducibility: Benchmarks capture exact outputs and model interactions so results can be audited and verified.
  • Complex interactions: They support testing for multi-modal inputs, code execution, tool use and multi-turn conversations.
  • Rapid prototyping: They allow you to quickly design and iterate on creative new tasks.

These capabilities are all powered by the new kaggle-benchmarks SDK, and Kaggle provides resources to help you get started.
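
As one illustration of the reproducibility point above, here is a small hedged sketch of how a benchmark run might capture exact prompts, outputs and scores so a result can be audited later. The run_and_record helper and the record format are assumptions made for illustration, not the kaggle-benchmarks SDK.

```python
# Illustrative sketch of an auditable run record -- the helper and format
# below are assumptions for illustration, not the kaggle-benchmarks SDK.
import hashlib
import json
from datetime import datetime, timezone


def run_and_record(model_name, generate, tasks, path):
    """Run each task, capture the exact prompt/output pair, and write an audit log."""
    records = []
    for task in tasks:
        output = generate(task["prompt"])
        records.append({
            "task": task["name"],
            "prompt": task["prompt"],
            "output": output,                     # exact model interaction, kept verbatim
            "score": task["scorer"](output),
        })
    run = {
        "model": model_name,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "records": records,
        # A digest over the captured interactions lets anyone check the log wasn't altered.
        "digest": hashlib.sha256(json.dumps(records, sort_keys=True).encode()).hexdigest(),
    }
    with open(path, "w") as f:
        json.dump(run, f, indent=2)
    return run


# Example usage with a toy task and a stubbed model.
tasks = [{"name": "addition", "prompt": "What is 47 + 35?", "scorer": lambda out: float("82" in out)}]
run_and_record("model-a", lambda p: "82", tasks, "run_model_a.json")
```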

How we're shaping the future of AI evaluation

The future of AI progress depends on how models are evaluated. With Kaggle Community Benchmarks, Kagglers are no longer just testing models; they're helping shape the next generation of intelligence.

Ready to build? Try Community Benchmarks today.
