
Alibaba cutting NVIDIA GPU use for AI models by 82%
NewsBytes | October 18, 2025 11:39 PM CST

Alibaba Group Holding has unveiled Aegaeon, a GPU pooling system that reduces the number of NVIDIA graphics processing units (GPUs) needed to serve its artificial intelligence (AI) models by 82%.

The system was beta-tested in Alibaba Cloud's model marketplace for over three months. During this period, it cut the number of NVIDIA GPUs required from 1,192 to just 213.


Aegaeon can serve dozens of models simultaneously
Innovation details


The research paper on Aegaeon was presented at the 31st Symposium on Operating Systems Principles (SOSP) in Seoul, South Korea.

The study highlights that Alibaba Cloud's system can serve dozens of models with up to 72 billion parameters each.

This marks a significant advance in resource management for AI workloads, especially given the intense demand for NVIDIA GPUs in the field.


System improves efficiency by pooling GPU power
Cost efficiency


The researchers from Peking University and Alibaba Cloud, including Alibaba's Chief Technology Officer Zhou Jingren, have highlighted Aegaeon's role in tackling the high costs of serving concurrent large language model (LLM) workloads.

The system is a major step toward improving efficiency by pooling GPU power.

It allows a single GPU to serve multiple models at once, reducing the resource-allocation inefficiencies that often plague cloud service providers such as Alibaba Cloud and ByteDance's Volcano Engine.
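
To make the pooling idea concrete, here is a minimal toy sketch of token-level GPU multiplexing, the general technique the paper describes. It is a pure-Python simulation rather than Alibaba's implementation, and every name in it (PooledGPU, token_level_schedule, the model labels) is hypothetical.

```python
# Toy simulation of GPU pooling with token-level scheduling.
# Hypothetical sketch only -- not Aegaeon's actual code or API.
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    model: str              # which model this request targets
    remaining_tokens: int   # tokens still to be generated


class PooledGPU:
    """Simulates one GPU that holds one model's weights at a time."""

    def __init__(self):
        self.loaded_model = None
        self.swap_count = 0  # how often weights were swapped in

    def generate(self, req: Request, budget: int) -> None:
        # Swap in the right model's weights if a different one is loaded.
        if self.loaded_model != req.model:
            self.loaded_model = req.model
            self.swap_count += 1
        req.remaining_tokens -= min(budget, req.remaining_tokens)


def token_level_schedule(requests, gpu, slice_tokens=8):
    """Round-robin requests a few tokens at a time, so one GPU serves
    many models instead of sitting idle behind a rarely used one."""
    queue = deque(requests)
    while queue:
        req = queue.popleft()
        gpu.generate(req, slice_tokens)
        if req.remaining_tokens > 0:
            queue.append(req)


gpu = PooledGPU()
reqs = [Request("qwen-72b", 20), Request("deepseek-v2", 5), Request("qwen-7b", 12)]
token_level_schedule(reqs, gpu)
print(f"served {len(reqs)} requests on one GPU with {gpu.swap_count} weight swaps")
```

In a real deployment the weight swap is the costly step; the gains the article reports come from amortizing each GPU across many models through this kind of sharing.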


Aegaeon addresses resource inefficiency issue
Model demand


The researchers also noted that a handful of models, such as Alibaba's Qwen and DeepSeek's models, receive the bulk of inference requests, while many others are rarely invoked.

This leads to resource inefficiency, with 17.7% of GPUs allocated to serve only 1.35% of requests in Alibaba Cloud's marketplace.

Aegaeon could address this imbalance by dynamically sharing GPUs across models according to their actual workloads.
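
As a quick sanity check, the article's own figures can be reproduced with simple arithmetic. The sketch below assumes the 17.7% figure applies to the same 1,192-GPU pool, which the article implies but does not state.

```python
# Arithmetic check on the figures quoted in the article.
before, after = 1192, 213                           # GPUs before and after the beta
print(f"GPU reduction: {1 - after / before:.1%}")   # ~82.1%, the "82%" claim

# Marketplace skew: 17.7% of GPUs served only 1.35% of requests.
cold_gpus = 0.177 * before                          # assumes the same 1,192-GPU pool
print(f"~{cold_gpus:.0f} of {before} GPUs handled 1.35% of requests")
```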

