07/27/2025 | News release | Archived content
During the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025) this week in Vienna, Austria, researchers from Bloomberg's AI Engineering group in London are showcasing their expertise in large language models (LLMs) and tool-based agentic AI with their paper "A Joint Optimization Framework for Enhancing Efficiency of Tool Utilization in LLM Agents."
In the paper, which is published in "Findings of the Association for Computational Linguistics: ACL 2025," Bin Wu, a Bloomberg Data Science Ph.D. Fellow and Ph.D. student at University College London, Edgar Meij, Head of AI Platforms in Bloomberg's AI Engineering group, and Emine Yilmaz, a professor and EPSRC Fellow at University College London's Department of Computer Science - where she also leads the Web Intelligence Group at the UCL Centre for Artificial Intelligence - demonstrate the crucial role of the instructions provided in agent prompts and tool descriptions, collectively referred to as context. Incomplete or suboptimal context significantly increases the number of tool calls an LLM must make to generate an adequate response, leading to computational overhead. The authors propose a new methodology for automatically improving agent prompts and tool descriptions, and demonstrate that it substantially reduces the number of tool calls the LLM agent needs to make.
In addition, two members of Bloomberg's AI Strategy & Research team in the company's CTO Office - Sebastian Gehrmann, Head of Responsible AI, and Enrico Santus, Principal Technical Strategist for Human-AI Interaction and Academic Engagement - are among the organizers of the fourth iteration of the Generation, Evaluation & Metrics Workshop (GEM2), which will be held as part of ACL on July 31, 2025. In light of the broad accessibility of LLMs, this workshop will serve as a forum for researchers and practitioners from both the natural language processing and machine learning communities to explore potential approaches and research directions for addressing broader natural language generation (NLG) challenges - in particular, the evaluation of model-generated outputs. While these advanced models can generate fluent text, ensuring the usefulness, quality, and fairness of their output is essential to help bridge the gap between research and real-world applications.
We asked the paper's lead author and one of the workshop organizers to explain why their work is notable in advancing the state-of-the-art with regard to LLMs and agentic AI:
Session 12: IP-Posters (Findings Posters - In-Person 4)
11:00-12:30 CEST
A Joint Optimization Framework for Enhancing Efficiency of Tool Utilization in LLM Agents
Bin Wu (Centre for Artificial Intelligence, University College London), Edgar Meij (Bloomberg), Emine Yilmaz (Centre for Artificial Intelligence, University College London)
Please summarize your research. Why are your results notable?
Bin Wu: This research proposes a joint optimization framework that aims to improve the efficiency of tool-augmented LLM agents by systematically refining both agent instructions and tool descriptions. Traditional approaches have either focused on enhancing tool use effectiveness through reasoning strategies - like chain-of-thought (CoT) or tree-of-thoughts (ToT) prompting - or optimized only a single aspect (i.e., either the instructions or the tool documentation). However, these prior methods incur high computational costs and often overlook efficiency, particularly under conditions where context is incomplete.
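As a rough illustration of the general idea - not the exact algorithm from the paper - a joint refinement loop over the agent prompt and the tool descriptions might look like the sketch below, where `run_agent` and `propose_revisions` are hypothetical placeholders for an agent executor and an LLM-based rewriting step:

```python
# Illustrative sketch only -- not the paper's exact method.
# Hypothetical callables supplied by the caller:
#   run_agent(prompt, tool_docs, task) -> trajectory (e.g., list of tool calls + final answer)
#   propose_revisions(prompt, tool_docs, trajectories) -> (new_prompt, new_tool_docs),
#   e.g., an LLM call that rewrites the context based on observed inefficiencies.

def refine_context(agent_prompt, tool_docs, tasks, run_agent, propose_revisions, rounds=3):
    """Jointly refine the agent prompt and tool descriptions over a few rounds."""
    for _ in range(rounds):
        # Collect trajectories showing which tools were called and how often.
        trajectories = [run_agent(agent_prompt, tool_docs, task) for task in tasks]
        # Rewrite both the prompt and the tool documentation based on that evidence.
        agent_prompt, tool_docs = propose_revisions(agent_prompt, tool_docs, trajectories)
    return agent_prompt, tool_docs
```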
Our proposed framework introduces a three-stage process:
Notable results include the following:
Why is it important to optimize context to improve the efficiency of agentic tool calling?
In practice, incomplete context is very common. Agent instructions are typically written by hand through considerable trial and error, and tool descriptions are likewise human-authored, which makes it especially difficult to capture the full behavior of complex tools. Our empirical analysis shows that incomplete context is one of the main sources of computational overhead. Optimizing context is therefore essential to improving the efficiency of end-to-end agentic LLMs that use tools.
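To make the efficiency question concrete, one can compare the average number of tool calls per task under different context variants. The sketch below is purely illustrative and assumes a hypothetical `run_agent` executor that returns the list of tool calls it made:

```python
# Illustrative sketch of a tool-call efficiency metric.
# `run_agent(prompt, tool_docs, task)` is a hypothetical executor returning
# the list of tool calls made while answering the task.

def average_tool_calls(agent_prompt, tool_docs, tasks, run_agent):
    """Average number of tool calls per task under a given context."""
    counts = [len(run_agent(agent_prompt, tool_docs, task)) for task in tasks]
    return sum(counts) / len(counts) if counts else 0.0

# Hypothetical comparison of a hand-written context against an optimized one:
# baseline  = average_tool_calls(raw_prompt, raw_tool_docs, tasks, run_agent)
# optimized = average_tool_calls(refined_prompt, refined_tool_docs, tasks, run_agent)
```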
How does your research advance the state-of-the-art in the field of agentic/generative AI?
This work advances the field in the following key ways:
Were there any surprising or unexpected outcomes from your research?
Yes, several findings were unexpected and noteworthy:
Read more about Bloomberg's agentic AI infrastructure here.
GEM2 Workshop: Generation, Evaluation & Metrics
Sebastian Gehrmann (Bloomberg), Gabriel Stanovsky (Hebrew University of Jerusalem), Simon Mille (Dublin City University), Enrico Santus (Bloomberg), Miruna Clinciu (Heriot-Watt University), Kaustubh Dhole (Emory University), Yotam Perlitz (IBM Research), Rotem Dror (University of Haifa), Itay Itzhak (Hebrew University of Jerusalem), Ofir Arviv (IBM Research), Eliya Habba (Hebrew University of Jerusalem), Michal Shmueli-Scheuer (IBM Research), João Sedoc (New York University) and Oyvind Tafjord (Allen Institute for Artificial Intelligence)
Please explain the goal of this workshop. Why are you helping to organize it?
Enrico Santus: This is the fourth edition of the Generation, Evaluation & Metrics Workshop (GEM). My colleague, Sebastian Gehrmann, originally started it in 2020, when evaluation of generated text first started becoming incredibly important. Now that GenAI is ubiquitous, GEM has grown into one of the largest workshops held at any NLP conference, and we couldn't be more excited to help lead it together with the outstanding organizing team.
As GenAI is increasingly used for high-impact applications - from healthcare to robotics and finance - the stakes for evaluation have never been higher. Yet, many of today's benchmarks are brittle, hard to reproduce, or fail to reflect real-world complexity. That's why we believe GEM2 will help shift the field toward more meaningful, efficient, and robust evaluation practices.
This year, more than 86 scientific publications will be presented at GEM2, alongside three keynotes and a panel. Moreover, for the second year, we have also built a space where industry and academia can meet each other through a dedicated Industrial Track. That conversation will be catalyzed in a panel with leading voices from DeepMind, Contextual AI, and aiXplain, during which the speakers will share what it means to evaluate generative models in real-world production environments.
How do you expect or hope that this workshop will help advance the state-of-the-art in terms of the evaluation of LLMs?
We hope GEM2 helps change how our community thinks about evaluation. Right now, much of the focus in LLM benchmarking is on leaderboards, but they don't tell the full story. Models are sensitive to prompting, few-shot formatting, and even punctuation. Reproducibility is a challenge, and many current metrics don't reflect how models behave under pressure or in production. GEM2 encourages the field to go deeper, to explore robustness, fairness, instruction-following variance, and real-world generalization.
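As a simple illustration of the brittleness described above, one can probe how much a model's score moves under trivial reformulations of the same input. The sketch below is illustrative only, with hypothetical `generate` (model call) and `score` (comparison against a reference answer) callables:

```python
# Illustrative sketch of prompt-sensitivity probing.
# `generate(prompt)` and `score(output, reference)` are hypothetical callables;
# `score` is assumed to return a float.

def robustness_gap(question, reference, generate, score):
    """Spread of scores across trivial reformattings of the same question."""
    variants = [
        question,
        question.strip() + "\n",        # trailing newline
        "Question: " + question,        # added template prefix
        question.replace("?", " ?"),    # punctuation spacing
    ]
    scores = [score(generate(v), reference) for v in variants]
    return max(scores) - min(scores)    # a large gap signals brittle benchmark behavior
```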
We're incredibly fortunate to have three invited speakers who each bring powerful perspectives:
Most importantly, GEM2 is about community. Over the past four years, the GEM community has grown into a vibrant global network, bringing together hundreds of contributors from across continents, disciplines, and institutions. Through their work, the GEM community is shaping the future of NLP evaluation, and we are excited to be among its hosts.