Today, artificial intelligence (AI) and deep learning (DL) methods are exploited in a wide gamut of products. DL models are trained on infrastructures based on heterogeneous systems (GPU-powered clusters), achieving speed-ups in the range of 5-40x with respect to CPU-only servers.
The ability to optimize the use of the infrastructure and to execute the workload efficiently under power constraints is critical. Local operators/data centers are subject to power consumption quotas, while their economic results depend strongly on how efficiently the infrastructure is used. While advanced solutions exist for managing virtual servers or containers, the growing demand for ML on GPUs represents a business opportunity for cloud/data center operators; optimizing revenue, however, requires keeping power consumption under hard quotas while maximizing the effectiveness of high-value assets like GPU-based systems.
The requirement to achieve lower power consumption and better cost management is critical for innovative SMEs and startups that build their competitive advantage on AI/ML and need to fit power constraints. The same requirements apply to entities that build on private clouds to satisfy their needs.
The ability to manage the energy footprint of new infrastructure projects, such as
distributed GPU computing located at the edge in 5G networks, is critical for
telecommunication companies owning their infrastructure, companies managing others'
infrastructure (including taking over ownership, a growing trend in the EU), and SMEs
providing innovative solutions.
In all these cases, managing the energy footprint and optimizing performance are of
critical importance to limit the carbon footprint and to adopt renewable energy sources,
which impose stricter constraints.
Despite the clear performance advantages of GPU-powered clusters, these systems are characterized by high costs and a significant energy footprint: e.g., high-end GPU servers like the NVIDIA DGX-2 cost about $400k, with a power consumption that is not always proportional to their workload.
To increase the possibility of sharing expensive and specialized resources, data centers
are shifting from traditional technologies (GPUs installed locally in individual servers)
toward a SW-based architecture built on a smart, self-pacing paradigm for resource allocation. In this way, resource utilization can be maximized by allocating the proper amount of resources to a larger number of remote training jobs, while concurrently minimizing energy consumption.
The product
Objective: ANDREAS provides an advanced scheduling solution that optimizes DL training run-time workloads and their energy consumption in accelerated clusters.
As reference figures, depending on the actual DL models, a speed-up of 2x and energy savings of 50% are expected.
ANDREAS addresses 3 requirements (the sketch after this list illustrates the corresponding metrics):
- reduce the energy consumption of AI/ML training jobs submitted to a GPU-accelerated infrastructure
- minimize the turnaround time of these jobs, as well as of the whole workload
- optimize the overall efficiency of the GPU-accelerated infrastructure
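For illustration only, the following minimal Python sketch shows how these three quantities could be measured on a finished workload. The job-record fields (submit, end, energy_kwh, gpu_seconds) are assumptions made for the example, not ANDREAS data structures.

```python
from dataclasses import dataclass

@dataclass
class JobRecord:
    submit: float       # submission time (s)
    end: float          # completion time (s)
    energy_kwh: float   # measured energy consumed by the job
    gpu_seconds: float  # GPU-seconds the job actually used

def workload_metrics(jobs, available_gpu_seconds):
    """Compute the three quantities ANDREAS targets, for a finished workload."""
    total_energy = sum(j.energy_kwh for j in jobs)                      # requirement 1
    mean_turnaround = sum(j.end - j.submit for j in jobs) / len(jobs)   # requirement 2 (per job)
    makespan = max(j.end for j in jobs) - min(j.submit for j in jobs)   # requirement 2 (whole workload)
    utilization = sum(j.gpu_seconds for j in jobs) / available_gpu_seconds  # requirement 3
    return total_energy, mean_turnaround, makespan, utilization
```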
Architecture:
1) a SLURM queue manager,
2) a pool of servers,
3) a pool of GPUs,
4) an intelligent module performing application energy consumption and performance prediction, connected with the job scheduling (a sketch of its possible interface follows the list).
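As an illustration of component 4, the sketch below shows one plausible shape for such a predictor: a regression model trained on historical runs that maps job features and a candidate GPU count to a predicted runtime and energy. The use of scikit-learn, the class name, and the feature set are assumptions for the example, not the ANDREAS implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class PerfEnergyPredictor:
    """Hypothetical predictor: (job features, #GPUs) -> (runtime, energy)."""

    def __init__(self):
        # One multi-output regressor for the two targets [runtime_s, energy_kwh].
        self.model = RandomForestRegressor(n_estimators=100)

    def fit(self, features, targets):
        """features: (n_runs, n_features), e.g. model size, batch size, #GPUs.
        targets: (n_runs, 2) measured runtime (s) and energy (kWh)."""
        self.model.fit(features, targets)

    def predict(self, job_features, n_gpus):
        """Estimate runtime and energy if the job were given n_gpus GPUs."""
        x = np.append(job_features, n_gpus).reshape(1, -1)
        runtime_s, energy_kwh = self.model.predict(x)[0]
        return runtime_s, energy_kwh
```

Retraining the model continuously as applications run (as described below) would amount to periodically calling fit on the accumulated run history.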
Usage: Training jobs are submitted to SLURM and are characterized by a deadline
and a priority (i.e., a weight). Jobs are never rejected but can possibly be delayed. The
final goal is to minimize the weighted job tardiness given the power budget established by
the SysAdmin. To fully achieve resource efficiency, an intelligent system orchestrating
the resources with a global view of the system is included in the product. This 'intelligence' is provided by the ML model and the advanced scheduler. The former is continuously trained as applications run and provides the means to predict job execution time and energy consumption. The latter solves a joint capacity allocation problem (i.e., how many GPUs to assign to a job) and scheduling problem (i.e., determining the job ordering leading to optimal resource use) and decides which resources (servers/GPUs) to set in a low-power state.
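To make the joint problem concrete, the following simplified sketch shows a greedy heuristic in the same spirit: jobs are ordered by a weight/deadline urgency rule, each job is given the largest GPU count compatible with the power budget, and the remaining GPUs become candidates for a low-power state. All names, the per-GPU power figure, and the linear speed-up assumption are illustrative; the actual ANDREAS scheduler solves the joint problem with predictor-driven estimates rather than this toy rule.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    deadline: float   # seconds from now
    weight: float     # priority
    work: float       # GPU-seconds of work (assumed known via the predictor)

GPU_POWER_W = 300.0      # assumed per-GPU power draw
POWER_BUDGET_W = 2400.0  # power budget set by the SysAdmin
TOTAL_GPUS = 16

def greedy_schedule(jobs):
    """Pick a job order and a GPU count per job under the power budget."""
    max_busy = int(POWER_BUDGET_W // GPU_POWER_W)  # GPUs usable at once
    t, tardiness, plan = 0.0, 0.0, []
    # Urgency rule: tighter deadlines and heavier weights go first.
    for job in sorted(jobs, key=lambda j: j.deadline / j.weight):
        gpus = min(max_busy, TOTAL_GPUS)       # capacity allocation decision
        t += job.work / gpus                   # assumes ideal linear speed-up
        tardiness += job.weight * max(0.0, t - job.deadline)
        plan.append((job.name, gpus, t))
    # The TOTAL_GPUS - max_busy GPUs left idle can be set to a low-power state.
    return plan, tardiness

if __name__ == "__main__":
    demo = [Job("resnet", 3600, 2.0, 14400), Job("bert", 7200, 1.0, 28800)]
    print(greedy_schedule(demo))
```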
Target users
Users of GPU-powered clusters who train DL models are the natural targets of ANDREAS, because of the benefits coming from optimizing the time-to-solution and the power consumption according to the workload variability.
TRL
ANDREAS is released at TRL 7 (and will achieve TRL 8 after a thorough field test).