Artificial Intelligence: introducing ANDREAS!
On May 1st 2020, the virtual kick-off meeting of the TETRAMAX-VALUECHAIN-TTX-3:ANDREAS proposal was held.
E4 is one of the main partners in the ANDREAS project, together with Politecnico di Milano and the Polish company 7bulls. ANDREAS is part of TETRAMAX (a Digital Innovation Hub), which has been carrying out a Horizon 2020 project since 2017 within the Smart Anything Everywhere (SAE) initiative, focused on energy-efficient computing for Cyber-Physical Systems and the Internet of Things.
ANDREAS (Artificial intelligence traiNing scheDuler foR disaggrEgAted resource clusterS) aims to meet two key market needs: efficient use of resources and reduced energy consumption. Today, artificial intelligence (AI) and deep learning (DL) are used for a wide range of applications and are supported by different GPU-based hardware and software platforms. The widespread and increasingly sophisticated use of AI models creates an opportunity to better manage the energy footprint of training and retraining operations across a range of deployments: from on-premises systems, to medium-sized infrastructures (such as European cloud operators and large HPC centers), to large suppliers of edge/fog systems.
Deep learning and machine learning models are trained on GPU-based systems, which consistently achieve a speedup of 5-40x compared to CPU-based ones. The ability to optimize infrastructure efficiency within strict energy consumption limits is essential for cloud providers, data centers and HPC centers that provide computing power for these purposes.
Although advanced solutions exist for managing servers or virtual containers, the growing complexity of machine learning models makes it necessary to keep energy consumption within the quotas set by the system administrator while, at the same time, optimizing the allocation of GPUs, which are high-value resources. In this case, giving up some of the flexibility offered by scalable solutions is a conscious and preferable choice.
ANDREAS aims to develop advanced scheduling solutions that optimize run-time performance and minimize energy consumption in deep learning training processes, in both aggregated and disaggregated GPU clusters.
The architecture designed for the ANDREAS project is composed of: a pool of CPU-based servers, a pool of GPUs that can be accessed via a switch, the SLURM workload manager, and a set of intelligent modules that interact with the job scheduler to predict performance and energy consumption.
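To make the idea of a scheduler guided by performance and consumption predictions more concrete, here is a minimal sketch in Python. It is not the ANDREAS algorithm and it does not use any SLURM interface: the `Option` class, the `pick_allocations` function, the job names and all numbers are hypothetical, chosen only for illustration. It assumes that a predictor has already estimated, for each training job, the run time and power draw for a few possible GPU counts, and it greedily picks the lowest-energy option that fits the free GPUs and a power cap set by the administrator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """One candidate allocation for a job (all values are hypothetical predictions)."""
    gpus: int         # number of GPUs assigned to the job
    runtime_h: float  # predicted training time, in hours
    power_w: float    # predicted average power draw, in watts

    @property
    def energy_wh(self) -> float:
        # Predicted energy for the whole training run, in watt-hours.
        return self.runtime_h * self.power_w


def pick_allocations(jobs, free_gpus, power_cap_w):
    """Greedy sketch: visit jobs in submission order and give each one the
    lowest-energy option that still fits the remaining GPUs and power budget.
    Jobs with no feasible option are left out of the plan (they would wait)."""
    plan = {}
    for name, options in jobs.items():
        feasible = [o for o in options
                    if o.gpus <= free_gpus and o.power_w <= power_cap_w]
        if not feasible:
            continue
        # Prefer lower energy; break ties in favour of the shorter run time.
        best = min(feasible, key=lambda o: (o.energy_wh, o.runtime_h))
        plan[name] = best
        free_gpus -= best.gpus
        power_cap_w -= best.power_w
    return plan


# Hypothetical predictions for two training jobs.
jobs = {
    "resnet": [Option(1, 10.0, 300.0), Option(2, 5.5, 600.0), Option(4, 3.0, 1200.0)],
    "bert": [Option(2, 8.0, 650.0), Option(4, 4.5, 1300.0)],
}

plan = pick_allocations(jobs, free_gpus=4, power_cap_w=1500.0)
for name, option in plan.items():
    print(name, option.gpus, "GPU(s),", option.energy_wh, "Wh")
```

A greedy pass like this is only a starting point: it ignores deadlines, job priorities and the cost of reaching disaggregated GPUs through the switch, all of which a real scheduler would have to weigh.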
ANDREAS is a 10-month project and the team plans to build the first prototypes of the solution by autumn 2020.