09/16/2025 | News release
Imagine this: you've been tasked with writing a report on computing technology trends in 2050 by tomorrow. You open your favorite GenAI assistant app, draft a prompt and hit send. A few milliseconds later, words pop up on your screen. All that for free, right?
In fact, it's more complicated than that. Every AI query has a cost - in dollars, power and water - and not just when you're trying to be polite. Consumers and businesses see subscription fees racked up on their credit cards, but they may not see the environmental cost of relying on massive data centers for all their computing. And with approximately 163 million users expected in the US by 2029,1 the impact on our wallets and the environment will continue to grow as GenAI apps become more ubiquitous in our lives.
But there is a way forward: shifting a portion of AI inference from cloud servers to devices like your smartphone can be much more resource-efficient than relying solely on the cloud.
This isn't groundbreaking: many everyday GenAI use cases can already be handled on devices - the latest OpenAI gpt-oss model, for example, can run on Snapdragon. Models are getting smaller while becoming more capable and more efficient,2 and application developers are looking for ways to cut cloud inferencing costs and respond to growing demand for privacy and personalization, especially with the rise of agentic AI. In parallel, the performance of Neural Processing Units (NPUs) continues to increase, making on-device AI an increasingly viable and attractive option.
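To give a sense of how little plumbing local inference needs, here is a minimal sketch using the open-source llama-cpp-python runtime as one example; the model file name, context size and generation parameters are placeholders for illustration, not a specific supported configuration.

```python
# Minimal sketch: running a small GenAI model entirely on-device.
# Assumes the open-source llama-cpp-python runtime and a quantized
# model file already present in local storage (placeholder path).
from llama_cpp import Llama

llm = Llama(
    model_path="models/small-chat-model.q4.gguf",  # placeholder file name
    n_ctx=2048,  # context window; tune for the device's memory budget
)

result = llm(
    "Summarize the key computing technology trends expected by 2050.",
    max_tokens=256,
)

# The prompt never leaves the device: no network round trip, no cloud
# GPU time and no data-center cooling are involved in this query.
print(result["choices"][0]["text"])
```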
A recent study computed these hidden costs - in liters of water and in joules, the standard unit of energy - for common AI prompts.3
The researchers ran the same queries, using the same AI models, on both a Samsung Galaxy S244 and Google Colab cloud servers.5 They found that on-device AI inference uses less energy and water, and reduces carbon dioxide emissions.6
Electricity and water are critical to data center operations: electricity provides the racks and racks of GPUs, TPUs and other AI accelerators with the power they need to operate, and water, just as importantly, keeps them cool.
The study found that running AI inference on a Samsung Galaxy S24 can reduce energy consumption by up to 95% and carbon footprint by up to 88% compared with running the same workloads on Google Colab cloud servers. For water consumption,7 the average savings reach 96%.8
While the study has a narrow scope - a limited number of experiments and non-optimized cloud inference - and its conclusions warrant further research, it points to a promising path forward: moving from solely cloud-based AI inference to a hybrid approach that processes some routine workloads locally. This approach could relieve pressure on power grids and help reduce the environmental impact of large data centers.
Here is what happens under the hood: today, most AI inference is cloud-based. Your prompt is sent to a model hosted on a server in a data center; once the prompt is processed, the model sends the output back to your application. Here is a more visual walkthrough of your AI prompt's journey - and energy consumption - by WSJ's Joanna Stern.
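From the application's point of view, that round trip boils down to a network call. The sketch below illustrates the idea; the endpoint URL, API key variable, model name and response field are hypothetical placeholders standing in for whichever hosted LLM API a developer actually uses.

```python
# Minimal sketch of a cloud-based inference call: the prompt travels over
# the network to a hosted model and the generated text travels back.
# The endpoint, headers and payload shape are hypothetical placeholders.
import os
import requests

API_URL = "https://api.example-llm-provider.com/v1/generate"  # placeholder
API_KEY = os.environ.get("LLM_API_KEY", "")                   # placeholder

payload = {
    "model": "hosted-large-model",  # placeholder model name
    "prompt": "Summarize the key computing technology trends expected by 2050.",
    "max_tokens": 256,
}

# Every call like this consumes data-center compute - and the power and
# water behind it - and is typically metered and billed to the developer.
response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json()["text"])  # response field name depends on the provider
```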
LLM providers bill application developers and end users for the infrastructure that processes your prompt: AI acceleration hardware, storage and network bandwidth, plus operational costs such as maintenance, technical support, and utilities like water and power. This makes it more expensive for developers to deploy GenAI applications, and makes those applications more expensive for end users. Those ongoing cloud costs are part of why many GenAI products charge a monthly fee.
Study's scope: The study compares the cost and environmental footprint of running generative AI models in the cloud versus on edge devices. For the cloud, the models run on servers equipped with either Nvidia A100 or L4 GPUs, hosted on Google Colab. For edge devices, the models run on Samsung Galaxy S24 devices powered by Snapdragon 8 Gen 3 processors.
While on-device AI inference is ideal for some workloads, this isn't a black-and-white situation. AI inference occurs on a spectrum, from processing on the device closest to the user to processing on servers further away in the cloud.
As we mentioned in this white paper, hybrid AI architectures can distribute AI inference across cloud and edge devices depending on the complexity of the model and the query. If a model can handle a given prompt on the device without compromising accuracy, latency or generation length, the inference should run on the edge device. If the model is more complex, the inference can be split between the device and the cloud, with the device running a 'light' version of the model while the cloud processes the 'full' model concurrently and corrects the device's answer if needed.
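To make the routing idea concrete, here is a minimal sketch of such a dispatcher. The complexity heuristic and thresholds are deliberately naive illustrations of the concept; a production scheduler would also weigh measured latency, accuracy, battery and connectivity.

```python
# Minimal sketch of hybrid AI routing: simple prompts stay on the device,
# more demanding ones are shared with or sent to the cloud.
# The heuristic and thresholds below are placeholders for illustration.
from dataclasses import dataclass


@dataclass
class Route:
    target: str   # "device", "hybrid" or "cloud"
    reason: str


def estimate_complexity(prompt: str, expected_tokens: int) -> float:
    # Placeholder heuristic: longer prompts and longer expected answers
    # are treated as more demanding.
    return len(prompt.split()) / 100 + expected_tokens / 512


def route_query(prompt: str, expected_tokens: int) -> Route:
    score = estimate_complexity(prompt, expected_tokens)
    if score < 1.0:
        return Route("device", "fits the on-device model's capability budget")
    if score < 2.0:
        return Route("hybrid", "device drafts with a light model, cloud verifies")
    return Route("cloud", "too demanding for the on-device model")


if __name__ == "__main__":
    print(route_query("Translate 'good morning' into French.", 16))
    print(route_query("Write a 2,000-word market analysis of edge AI.", 2048))
```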
As GenAI models become smaller and on-device processing capabilities continue to grow, we believe that distributing AI processing from the cloud to the edge can bring major benefits in cost, energy, performance, privacy, security and personalization. We are designing efficient AI inference solutions that leverage both the edge and the cloud. This hybrid setup distributes AI workloads between the cloud and the edge as appropriate, allowing our customers' and partners' smartphones, PCs, IoT devices, vehicles and data centers to deliver more intuitive, productive and efficient user experiences around the globe.