09/16/2025 | News release
Imagine this: you've been tasked with writing a report on computing technology trends in 2050 by tomorrow. You open your favorite GenAI assistant app, draft a prompt and hit send. A few milliseconds later, words pop up on your screen. All that for free, right?
In fact, it's more complicated than that. Every AI query has a cost - in dollars, power and water - and not just when you're trying to be polite. Consumers and businesses see subscription fees racked up on their credit cards, but they may not see the environmental cost of relying on massive data centers for all their computing. And with approximately 163 million users expected in the US by 2029,1 the impact on our wallets and the environment will continue to grow as GenAI apps become more ubiquitous in our lives.
But there is a way forward: shifting a portion of AI inference from cloud servers to devices like your smartphone can be much more resource-efficient than relying solely on the cloud.
This isn't groundbreaking: many everyday GenAI use cases can already be handled on devices - the latest OpenAI gpt-oss model, for example, can run on Snapdragon. Models are getting smaller while becoming more capable and more efficient,2 and application developers are looking for ways to cut cloud inferencing costs and respond to growing demand for privacy and personalization, especially with the rise of agentic AI. In parallel, the performance of Neural Processing Units (NPUs) continues to increase, making on-device AI an increasingly viable and attractive option.
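To give a sense of how little plumbing local inference needs, here is a minimal sketch using the open-source llama-cpp-python runtime as one example; the model file name, context size and generation parameters are placeholders for illustration, not a specific supported configuration.

```python
# Minimal sketch: running a small GenAI model entirely on-device.
# Assumes the open-source llama-cpp-python runtime and a quantized
# model file already present in local storage (placeholder path).
from llama_cpp import Llama

llm = Llama(
    model_path="models/small-chat-model.q4.gguf",  # placeholder file name
    n_ctx=2048,  # context window; tune for the device's memory budget
)

result = llm(
    "Summarize the key computing technology trends expected by 2050.",
    max_tokens=256,
)

# The prompt never leaves the device: no network round trip, no cloud
# GPU time and no data-center cooling are involved in this query.
print(result["choices"][0]["text"])
```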
A recent study computed these hidden costs - in liters of water and in joules, the standard unit of energy - for common AI prompts.3
The researchers ran the same queries, using the same AI models, on both a Samsung Galaxy S244 and Google Colab cloud servers.5 They found that on-device AI inference uses less energy and water, and reduces carbon dioxide emissions.6
Electricity and water are critical to data center operations: electricity provides the racks and racks of GPUs, TPUs and other AI accelerators with the power they need to operate, and water, just as importantly, keeps them cool.
The study found that running AI inference on a Samsung Galaxy S24 can reduce energy consumption by up to 95% and carbon footprint by up to 88% compared with running the same workloads on Google Colab cloud servers. For water consumption,7 the average savings reach 96%.8
While the study has a narrow scope - a limited number of experiments and non-optimized cloud inference - and its conclusions warrant further research, it points to a promising path forward: moving from solely cloud-based AI inference to a hybrid approach that processes some routine workloads locally. This approach could relieve pressure on power grids and help reduce the environmental impact of large data centers.
Here is what happens under the hood: today, most AI inference is cloud-based. Your prompt is sent to a model hosted on a server in a data center; once the prompt is processed, the model sends the output back to your application. Here is a more visual walkthrough of your AI prompt's journey - and energy consumption - by WSJ's Joanna Stern.
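From the application's point of view, that round trip boils down to a network call. The sketch below illustrates the idea; the endpoint URL, API key variable, model name and response field are hypothetical placeholders standing in for whichever hosted LLM API a developer actually uses.

```python
# Minimal sketch of a cloud-based inference call: the prompt travels over
# the network to a hosted model and the generated text travels back.
# The endpoint, headers and payload shape are hypothetical placeholders.
import os
import requests

API_URL = "https://api.example-llm-provider.com/v1/generate"  # placeholder
API_KEY = os.environ.get("LLM_API_KEY", "")                   # placeholder

payload = {
    "model": "hosted-large-model",  # placeholder model name
    "prompt": "Summarize the key computing technology trends expected by 2050.",
    "max_tokens": 256,
}

# Every call like this consumes data-center compute - and the power and
# water behind it - and is typically metered and billed to the developer.
response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json()["text"])  # response field name depends on the provider
```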
LLM providers bill application developers and end users for the infrastructure that processes your prompt: AI acceleration hardware, storage and network bandwidth, plus operational costs such as maintenance, technical support, and utilities like water and power. This makes it more expensive for developers to deploy GenAI applications, and makes those applications more expensive for end users. Those ongoing cloud costs are part of why many GenAI products charge a monthly fee.
Study's scope: The study compares the cost and environmental footprint of running generative AI models in the cloud versus on edge devices. For the cloud, the models run on servers equipped with either Nvidia A100 or L4 GPUs, hosted on Google Colab. For edge devices, the models run on Samsung Galaxy S24 devices powered by Snapdragon 8 Gen 3 processors.
While on-device AI inference is ideal for some workloads, this isn't a black-and-white situation. AI inference occurs on a spectrum, from processing on the device closest to the user to processing on servers further away in the cloud.
As we mentioned in this white paper, hybrid AI architectures can distribute AI inference across cloud and edge devices depending on the complexity of the model and the query. If a model can handle a given prompt on the device without compromising accuracy, latency or generation length, the inference should run on the edge device. If the model is more complex, the inference can be split between the device and the cloud, with the device running a 'light' version of the model while the cloud processes the 'full' model concurrently and corrects the device's answer if needed.
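To make the routing idea concrete, here is a minimal sketch of such a dispatcher. The complexity heuristic and thresholds are deliberately naive illustrations of the concept; a production scheduler would also weigh measured latency, accuracy, battery and connectivity.

```python
# Minimal sketch of hybrid AI routing: simple prompts stay on the device,
# more demanding ones are shared with or sent to the cloud.
# The heuristic and thresholds below are placeholders for illustration.
from dataclasses import dataclass


@dataclass
class Route:
    target: str   # "device", "hybrid" or "cloud"
    reason: str


def estimate_complexity(prompt: str, expected_tokens: int) -> float:
    # Placeholder heuristic: longer prompts and longer expected answers
    # are treated as more demanding.
    return len(prompt.split()) / 100 + expected_tokens / 512


def route_query(prompt: str, expected_tokens: int) -> Route:
    score = estimate_complexity(prompt, expected_tokens)
    if score < 1.0:
        return Route("device", "fits the on-device model's capability budget")
    if score < 2.0:
        return Route("hybrid", "device drafts with a light model, cloud verifies")
    return Route("cloud", "too demanding for the on-device model")


if __name__ == "__main__":
    print(route_query("Translate 'good morning' into French.", 16))
    print(route_query("Write a 2,000-word market analysis of edge AI.", 2048))
```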
As GenAI models become smaller and on-device processing capabilities continue to grow, we believe that distributing AI processing from the cloud to the edge can bring major benefits in cost, energy, performance, privacy, security and personalization. We are designing efficient AI inference solutions that leverage both the edge and the cloud. This hybrid setup distributes AI workloads between the cloud and the edge as appropriate, allowing our customers' and partners' smartphones, PCs, IoT devices, vehicles and data centers to deliver more intuitive, productive and efficient user experiences around the globe.