E4 Computer Engineering

HPC and Enterprise Solutions


ANDREAS: Artificial intelligence traiNing scheDuler foR accElerAted resource clusterS

Today, artificial intelligence (AI) and deep learning (DL) methods are exploited in a wide gamut of products. DL models are trained on infrastructures based on heterogeneous systems (GPU-powered clusters), achieving speedups in the range of 5-40x with respect to CPU-only servers.

The ability to optimize the use of the infrastructure and execute the workload efficiently under power constraints is critical. Data center operators are subject to power consumption quotas, while their economic results depend strongly on how efficiently the infrastructure is used. Advanced solutions already exist for managing virtual servers and containers; the growing demand for ML on GPUs represents a business opportunity for cloud and data center operators, but optimizing revenue requires keeping power consumption under hard quotas while maximizing the effectiveness of high-value assets such as GPU-based systems.

The requirement to achieve lower power consumption and better cost management is critical for innovative SMEs and startups that build their competitive advantage on AI/ML and need to fit within power constraints. The same requirements apply to entities that build on private clouds to satisfy their needs.

The ability to manage the energy footprint in new infrastructure projects, such as distributed GPU computing located at the edge in 5G networks, is critical for telecommunication companies owning their infrastructure, companies managing others’ infrastructure (including taking over ownership, a growing trend in the EU), and SMEs providing innovative solutions.

In all these cases, managing the energy footprint and optimizing performance are critically important to limit the carbon footprint and to adopt renewable energy sources, which impose stricter constraints.

Despite the clear advantages of GPU-powered clusters in terms of performance, these systems are characterized by high costs and a significant energy footprint: e.g., high-end GPU servers such as the NVIDIA DGX-2 cost about $400k, with a power consumption that is not always proportional to their workload.

To increase the possibility of sharing expensive and specialized resources, data centers are shifting from traditional technologies (GPUs installed locally in individual servers) toward a software-based architecture built on a smart, self-pacing paradigm for resource allocation. In this way, resource utilization can be maximized by allocating the proper amount of resources to a larger number of remote training jobs while minimizing energy consumption.

The product

Objective: ANDREAS provides an advanced scheduling solution that optimizes DL training workloads and their energy consumption in accelerated clusters.

As a reference figure, depending on the actual DL models, a speed-up of 2x and energy savings of 50% are expected.

ANDREAS addresses three requirements:

  1. reduce the energy consumption of AI/ML training jobs submitted to a GPU-accelerated infrastructure
  2. minimize the turnaround time for these jobs as well as of the whole workload
  3. optimize the overall efficiency of the GPU-accelerated infrastructure

Architecture:

  1. a SLURM queue manager
  2. a pool of servers
  3. a pool of GPUs
  4. an intelligent module performing application energy consumption and performance prediction, connected with job scheduling
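The prediction half of the intelligent module can be pictured as a model refined every time a job finishes. A minimal sketch, assuming a hypothetical running-average predictor keyed by model type and GPU count (the real module is described only at this level of detail here, and would use a proper ML model):

```python
class OnlinePredictor:
    """Hypothetical sketch of the prediction module: keeps a running
    average of observed runtime and energy per (model, gpu_count) pair,
    updated whenever a training job completes."""

    def __init__(self):
        # (model, gpus) -> (observations, mean runtime [s], mean energy [J])
        self.stats = {}

    def observe(self, model, gpus, runtime_s, energy_j):
        """Fold one completed job's measurements into the running means."""
        n, rt, en = self.stats.get((model, gpus), (0, 0.0, 0.0))
        n += 1
        rt += (runtime_s - rt) / n   # incremental mean update
        en += (energy_j - en) / n
        self.stats[(model, gpus)] = (n, rt, en)

    def predict(self, model, gpus):
        """Return a (runtime, energy) estimate, or None if unseen."""
        entry = self.stats.get((model, gpus))
        return None if entry is None else (entry[1], entry[2])
```

The scheduler would query such a predictor before deciding how many GPUs to grant each job and which servers to power down.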

Usage: Training jobs are submitted to SLURM and are characterized by a deadline and a priority (i.e., a weight). Jobs are never rejected but may be delayed. The final goal is to minimize the weighted job tardiness given the power budget established by the SysAdmin. To fully achieve resource efficiency, the product includes an intelligent system that orchestrates the resources with a global view of the system. This ‘intelligence’ is provided by the ML model and the advanced scheduler. The former is continuously trained as applications run and provides the means to predict job execution time and energy consumption. The latter solves a joint capacity allocation problem (i.e., how many GPUs to assign to a job) and scheduling problem (i.e., determining the job ordering leading to optimal resource use) and decides which resources (servers/GPUs) to set in a low-power state.
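The objective sketched above — minimizing the sum over jobs of w_j * max(0, C_j - d_j) under a power budget — can be illustrated with a toy example. The greedy ordering, the `Job` fields, and the per-GPU-count runtime/power tables below are hypothetical simplifications for illustration, not the ANDREAS algorithm:

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    weight: float    # priority weight w_j
    deadline: float  # deadline d_j (seconds from submission)
    runtime: dict    # hypothetical predicted runtime [s] per GPU count
    power: dict      # hypothetical predicted power draw [W] per GPU count

def weighted_tardiness(completion, deadline, weight):
    """The per-job term the scheduler minimizes: w_j * max(0, C_j - d_j)."""
    return weight * max(0.0, completion - deadline)

def greedy_schedule(jobs, power_budget):
    """Toy greedy sketch: run jobs back to back, ordered by deadline/weight
    urgency; for each job pick the GPU count whose predicted power fits the
    budget and whose predicted completion minimizes weighted tardiness."""
    t, total, plan = 0.0, 0.0, []
    for job in sorted(jobs, key=lambda j: j.deadline / j.weight):
        feasible = [(g, rt) for g, rt in job.runtime.items()
                    if job.power[g] <= power_budget]
        g, rt = min(feasible,
                    key=lambda gr: weighted_tardiness(t + gr[1],
                                                      job.deadline, job.weight))
        t += rt
        tard = weighted_tardiness(t, job.deadline, job.weight)
        plan.append((job.name, g, tard))
        total += tard
    return plan, total
```

In a real deployment the deadline would arrive with the submission itself (SLURM's `sbatch` accepts a `--deadline` option), jobs would run concurrently, and the predicted runtimes and powers would come from the continuously trained ML model rather than fixed tables.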

Target users

The users of GPU-powered clusters running the training of DL models are the natural targets of ANDREAS, thanks to the benefits of optimizing the Time-To-Solution and the power consumption according to workload variability.

TRL

ANDREAS is released at TRL 7 and will achieve TRL 8 after a thorough field test.
