09/25/2025 | Press release
Samsung Electronics today unveiled TRUEBench (Trustworthy Real-world Usage Evaluation Benchmark), a proprietary benchmark developed by Samsung Research to evaluate AI productivity.
TRUEBench provides a comprehensive set of metrics to measure how large language models (LLMs) perform in real-world workplace productivity applications. To ensure realistic evaluation, it incorporates diverse dialogue scenarios and multilingual conditions.
Drawing on Samsung's in-house use of AI for productivity, TRUEBench evaluates commonly used enterprise tasks - such as content generation, data analysis, summarization and translation - across 10 categories and 46 sub-categories. The benchmark ensures reliable scoring with AI-powered automatic evaluation based on criteria that are collaboratively designed and refined by both humans and AI.
"Samsung Research brings deep expertise and a competitive edge through its real-world AI experience," said Paul (Kyungwhoon) Cheun, CTO of the DX Division at Samsung Electronics and Head of Samsung Research. "We expect TRUEBench to establish evaluation standards for productivity and solidify Samsung's technological leadership."
As companies increasingly adopt AI for everyday work tasks, demand has grown for ways to measure the productivity of LLMs. However, existing benchmarks primarily measure overall model capability, are mostly English-centric and are limited to single-turn question-and-answer structures, which restricts their ability to reflect actual work environments.
To address these limitations, TRUEBench comprises 2,485 test sets across 10 categories and 12 languages¹, and also supports cross-lingual scenarios. The test sets examine what AI models can actually solve: inputs range from as short as 8 characters to more than 20,000 characters, reflecting tasks from simple requests to lengthy document summarization.
Evaluating the performance of AI models requires clear criteria for judging whether a response is correct. In real-world situations, however, not every user intent is explicitly stated in the instruction; a user asking for a summary of a long report, for instance, typically expects it in the report's language even if that is never said outright. TRUEBench is designed to enable realistic evaluation by considering not only the accuracy of answers but also detailed conditions that capture these implicit user needs.
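To make this concrete, the sketch below shows how a single TRUEBench-style test item might be represented, with explicit and implicit conditions recorded side by side. The press release does not publish a data schema, so every field name and condition here is a hypothetical illustration, not the benchmark's actual format.

```python
# Hypothetical representation of one productivity test item.
# All field names and conditions are illustrative assumptions;
# TRUEBench's actual data format is not described in the release.
test_item = {
    "id": "summarization-0001",
    "category": "summarization",          # one of the 10 top-level categories
    "language": "ko",                     # source language of the dialogue
    "turns": [                            # multi-turn dialogue scenario
        {"role": "user", "content": "이 회의록을 요약해 주세요. ..."},
    ],
    "conditions": {
        "explicit": [
            "Response is a summary of the supplied meeting notes",
        ],
        "implicit": [                     # intents the user never stated outright
            "Response is written in Korean, matching the source",
            "Response adds no facts absent from the source",
        ],
    },
}
```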
Samsung Research verified the evaluation items through collaboration between humans and AI. First, human annotators create the evaluation criteria, and the AI then reviews them for errors, contradictions or unnecessary constraints. Human annotators refine the criteria again, and the process repeats to produce increasingly precise evaluation standards. Automatic evaluation of AI models is then conducted against these cross-verified criteria, minimizing subjective bias and ensuring consistency. In addition, a model passes a given test only if it satisfies every condition, which enables more detailed and precise scoring across tasks.
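The release states the pass rule but not its implementation. The sketch below is one minimal way to apply it, assuming each condition has already been judged true or false by the automatic evaluator; the function names are hypothetical.

```python
from typing import Dict, List

def item_passes(condition_results: Dict[str, bool]) -> bool:
    """Strict rule from the release: a test passes only if
    every evaluation condition is satisfied."""
    return all(condition_results.values())

def benchmark_score(results: List[Dict[str, bool]]) -> float:
    """Overall score: the fraction of test items the model passed."""
    return sum(item_passes(r) for r in results) / len(results)

# Example: a single unmet condition fails the whole test item.
judged = [
    {"summary_of_source": True, "korean_output": True, "no_invented_facts": True},
    {"summary_of_source": True, "korean_output": False, "no_invented_facts": True},
]
print(benchmark_score(judged))  # 0.5
```

Under this all-or-nothing rule, a model gets no partial credit for a mostly correct response, which is what makes the scoring "detailed and precise" across tasks.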
TRUEBench's data samples and leaderboard are available on the open-source platform Hugging Face, where users can compare up to five models and see AI model performance at a glance. Average response lengths are also published, enabling performance and efficiency to be compared side by side. Detailed information can be found on the TRUEBench Hugging Face page at https://huggingface.co/spaces/SamsungResearch/TRUEBench.
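For readers who want to inspect the published data samples programmatically, a minimal sketch follows. It assumes the samples are exposed as a Hugging Face dataset; the repository ID and split name mirror the Space URL and are illustrative, since the release only links to the Space itself.

```python
# A minimal sketch, assuming the data samples are downloadable as a
# Hugging Face dataset. The repo ID and split below are assumptions
# mirroring the Space URL, not confirmed identifiers.
from datasets import load_dataset

samples = load_dataset("SamsungResearch/TRUEBench", split="train")
for row in samples.select(range(3)):   # peek at the first few items
    print(row)
```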
¹ Chinese, English, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish and Vietnamese