2-phase cooling for next-generation HPC challenges
Thermal management is a critical concern in HPC centers that house numerous computing elements. The temperature directly impacts reliability and energy efficiency, as a significant portion of energy is dedicated to the cooling system. The approach used in the TEXTAROSSA project combines advanced two-phase cooling with multi-level thermal control strategies to effectively tackle thermal challenges at both the system and node levels. The objective of the TEXTAROSSA EU project is to contribute towards the strategic objectives outlined in the EuroHPC Strategic Research and Innovation Agenda and the ETP4HPC Strategic Research Agenda.
THE TEXTAROSSA SERVER SYSTEM
The solution utilizes the Ampere® Mt.Collins 2U system featuring the Ampere® Altra® Max processor. This processor delivers exceptional performance while maintaining impressive power efficiency per core. This system offers several benefits, including:
- Utilization of the ARM architecture, which aligns with the EPI project’s compatibility goals.
- Ample physical space available for integrating the cooling system.
- Support for a significant number of PCIe slots, allowing for the addition of FPGA boards or other expansion cards as required.
- A close match between the cooling system’s design point and the system’s heat dissipation requirements.
In terms of connectivity, this versatile platform boasts 160 PCIe Gen4 lanes, allowing for flexible I/O connectivity through PCIe slots, along with an additional 16 PCIe Gen4 lanes for OCP 3.0 networking. The Mt.Collins system supports up to thirty-two DDR4 3200 MT/s DIMMs, providing a maximum memory capacity of 8 TB.
Top view of the Ampere® Mt.Collins system. On the left is the space for the disks (covered by the metal plate of the case); in the center are the two sockets, surrounded by the memory modules; at the top right are the system's two power supplies; at the center left is a black enclosure for one of the optional accelerator boards.
The notable features of the system include:
- Support for up to 2 x U280 FPGAs for acceleration purposes.
- Dual-socket configuration with 128 cores per socket operating at a frequency of 3 GHz.
- A total of 32 DIMMs, utilizing an 8-channel 2DPC configuration per CPU.
- Maximum memory capacity of up to 8 TB.
- Maximum socket TDP (Thermal Design Power) of up to 250 W.
- Utilization of a 2-Phase Cooling Solution for efficient thermal management.
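The memory figures in the list above can be cross-checked with a little arithmetic. The following is a minimal sketch, not a vendor tool: the socket, channel, DIMM, and capacity numbers come from the text, while the 8-byte data bus per DDR4 channel is an assumption based on the standard 64-bit DDR4 channel width (ECC bits excluded). The bandwidth it reports is a theoretical peak, not a measured value.

```python
# Figures taken from the text.
SOCKETS = 2
CHANNELS_PER_SOCKET = 8      # "8-channel"
DIMMS_PER_CHANNEL = 2        # "2DPC"
TRANSFER_RATE_MT_S = 3200    # DDR4 3200 MT/s
MAX_CAPACITY_TB = 8

# Assumption: standard 64-bit (8-byte) DDR4 data bus per channel.
BYTES_PER_TRANSFER = 8


def dimm_count() -> int:
    """Total DIMM slots implied by the channel layout."""
    return SOCKETS * CHANNELS_PER_SOCKET * DIMMS_PER_CHANNEL


def dimm_size_gb() -> float:
    """DIMM size needed to reach the maximum capacity (1 TB = 1024 GB)."""
    return MAX_CAPACITY_TB * 1024 / dimm_count()


def peak_bandwidth_gb_s_per_socket() -> float:
    """Theoretical peak memory bandwidth per socket in GB/s (10^9 bytes/s)."""
    return CHANNELS_PER_SOCKET * TRANSFER_RATE_MT_S * BYTES_PER_TRANSFER / 1000


if __name__ == "__main__":
    print("DIMM slots:", dimm_count())
    print("DIMM size for max capacity (GB):", dimm_size_gb())
    print("Peak bandwidth per socket (GB/s):", peak_bandwidth_gb_s_per_socket())
```

Under these assumptions, the 32 DIMM slots follow directly from 2 sockets × 8 channels × 2 DIMMs per channel, and reaching 8 TB requires 256 GB modules in every slot.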
THE PROCESSOR TECHNOLOGY
The processor is built on a 7 nm FinFET process and has a TDP of 250 W.
The Ampere Altra Max processor offers high memory bandwidth and a memory capacity of up to 4 TB per socket. It provides 128 lanes of PCIe Gen4 per socket, and in a 2P configuration it supports up to 192 PCIe Gen4 lanes that can be bifurcated down to x4. This enables integration with a variety of off-chip devices, such as network cards capable of reaching speeds of 200 GbE or higher, as well as storage/NVMe devices.
Furthermore, the Ampere Altra Max processor supports cache coherent connectivity with off-chip accelerators. Specifically, 64 out of the 128 PCIe Gen4 lanes support Cache Coherent Interconnect for Accelerators (CCIX), which can be utilized for network, storage, or accelerator connectivity.
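To put the lane counts above in perspective, the sketch below estimates per-link PCIe Gen4 bandwidth and the link width a 200 GbE card would need. The lane counts come from the text; the 16 GT/s signalling rate and 128b/130b encoding are the standard PCIe 4.0 figures. Protocol overhead (packet headers, flow control) is ignored, so the results are upper bounds, and the function names are illustrative only.

```python
# Figures taken from the text.
LANES_PER_SOCKET = 128
LANES_2P = 192

# Standard PCIe 4.0 parameters.
GEN4_GT_S = 16.0            # raw transfer rate per lane, per direction
ENCODING = 128 / 130        # 128b/130b line encoding efficiency


def lane_bandwidth_gb_s() -> float:
    """Usable bandwidth of one Gen4 lane in GB/s, per direction."""
    return GEN4_GT_S * ENCODING / 8


def link_bandwidth_gb_s(width: int) -> float:
    """Usable bandwidth of an xN link in GB/s, per direction."""
    return width * lane_bandwidth_gb_s()


def max_x4_devices(lanes: int = LANES_2P) -> int:
    """Number of x4 devices that fit if every lane is bifurcated to x4."""
    return lanes // 4


def min_link_width_for_ethernet(gbit_s: float):
    """Smallest standard link width that can carry the given line rate."""
    needed_gb_s = gbit_s / 8
    for width in (1, 2, 4, 8, 16):
        if link_bandwidth_gb_s(width) >= needed_gb_s:
            return width
    return None  # would need more than a single x16 Gen4 link


if __name__ == "__main__":
    print(f"x16 Gen4 link: {link_bandwidth_gb_s(16):.1f} GB/s per direction")
    print("x4 devices in a 2P system:", max_x4_devices())
    print("Link width for 200 GbE: x%d" % min_link_width_for_ethernet(200))
    print("Lanes not exposed in 2P:", 2 * LANES_PER_SOCKET - LANES_2P)
```

The last line shows that 64 of the 256 physical lanes are not available for I/O in a 2P system; presumably these carry the socket-to-socket link, although the text does not say so explicitly.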
In terms of reliability, the Ampere® Altra® Max processor offers comprehensive server-class enterprise RAS (Reliability, Availability, and Serviceability) capabilities. Advanced ECC (Error Correction Code) protects data in memory, alongside standard DDR4 RAS features. Additionally, end-to-end data poisoning ensures that any corrupted data is flagged as an error when accessed. The SLC (System Level Cache) is also ECC protected, and the processor supports background scrubbing of the SLC and DRAM (Dynamic Random-Access Memory) to identify and correct single-bit errors.
THE COOLING SYSTEM
The node designed by E4 incorporates InQuattro’s two-phase thermal management solution, which specifically targets the CPUs, the most thermally sensitive components.
Implementing evaporative cooling systems in HPC environments offers noteworthy benefits. It has the potential to enhance computing system performance by minimizing the need for frequency throttling. Additionally, it can lead to improved Power Usage Effectiveness (PUE), resulting in enhanced profitability and sustainability due to increased energy efficiency.
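PUE is defined as total facility energy divided by the energy delivered to the IT equipment, so a value closer to 1.0 means less overhead. The sketch below shows how reducing cooling energy lowers PUE. The energy figures in the usage example are purely illustrative placeholders, not measurements from the TEXTAROSSA project.

```python
def pue(it_energy_kwh: float,
        cooling_energy_kwh: float,
        other_overhead_kwh: float = 0.0) -> float:
    """Power Usage Effectiveness: total facility energy / IT energy."""
    if it_energy_kwh <= 0:
        raise ValueError("IT energy must be positive")
    total = it_energy_kwh + cooling_energy_kwh + other_overhead_kwh
    return total / it_energy_kwh


if __name__ == "__main__":
    # Hypothetical numbers chosen only to illustrate the formula.
    it_load = 1000.0
    print("PUE with 400 kWh of cooling:", pue(it_load, 400.0))
    print("PUE with 100 kWh of cooling:", pue(it_load, 100.0))
```

With the same IT load, every kilowatt-hour removed from the cooling budget reduces the numerator only, which is why a more efficient cooling loop translates directly into a better PUE.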
Fluid flow animation
The technology uses a cold plate (aluminum or copper) as a heat sink, a direct on-chip evaporative heat exchanger. The cold plate is a multi-micro-channel evaporator: a metal plate with micro-fins machined on it. The evaporator is placed in direct contact with the processor (CPU or GPU) case with the help of a thermal paste to obtain a stable and low thermal contact resistance. The coolant flows through microchannels in the evaporator to capture heat from the processor through evaporation processes and then flows to a condenser where the heat is dissipated to the environment via water or air. Coolant from the condenser flows back through the pump, and the cycle repeats. The loop is a hermetically sealed, closed system, so the processors and any electronic components are not in direct contact with the fluid. Other important features of two-phase cooling technology include the use of dielectric fluids, which eliminates the risk of electrical damage from accidental coolant spills, and virtually zero maintenance, which eliminates the need for skilled personnel. These dielectric fluids are non-flammable, non-toxic, environmentally friendly, and have extremely low Global Warming Potential (GWP) and Ozone Depletion Potential (ODP).
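The evaporator described above removes heat mainly through the latent heat of the coolant. A first-order energy balance gives the coolant mass flow needed for a given heat load: heat load = mass flow × latent heat × change in vapor quality. The sketch below applies this to the 250 W socket TDP quoted earlier. The latent heat and outlet vapor quality are hypothetical placeholder values, since the text does not name the fluid or the operating point, and the model assumes the coolant enters the evaporator as saturated liquid (no subcooling).

```python
def required_mass_flow_kg_s(heat_load_w: float,
                            latent_heat_j_per_kg: float,
                            outlet_vapor_quality: float) -> float:
    """Coolant mass flow needed to absorb heat_load_w by evaporation.

    Assumes saturated liquid at the evaporator inlet (vapor quality 0),
    so all heat goes into evaporating a fraction of the flow.
    """
    if not 0.0 < outlet_vapor_quality <= 1.0:
        raise ValueError("outlet vapor quality must be in (0, 1]")
    if latent_heat_j_per_kg <= 0:
        raise ValueError("latent heat must be positive")
    return heat_load_w / (latent_heat_j_per_kg * outlet_vapor_quality)


if __name__ == "__main__":
    SOCKET_TDP_W = 250.0          # from the text
    LATENT_HEAT_J_KG = 150e3      # hypothetical dielectric fluid property
    OUTLET_QUALITY = 0.5          # hypothetical: half the flow evaporates

    flow = required_mass_flow_kg_s(SOCKET_TDP_W, LATENT_HEAT_J_KG, OUTLET_QUALITY)
    print(f"Mass flow per socket: {flow * 1000:.2f} g/s")
```

Because the heat is absorbed at a nearly constant saturation temperature, this balance also suggests why a two-phase loop can keep the cold plate temperature uniform even as the load changes: a higher load mainly raises the outlet vapor quality rather than the coolant temperature.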
In Quattro's two-phase cooling system will manage the electronic components of the selected E4 system, which was originally air-cooled.
The rack configuration of the two-phase loop incorporates a liquid-to-liquid condensing unit.
High-performance computing (HPC) plays a critical role for countries and large enterprises, supporting diverse applications in fields such as finance, weather forecasting, oil and gas, and many others. Additionally, emerging domains such as security, surveillance, bioinformatics, and medicine increasingly rely on HPC, specifically in the areas of High-Performance Data Analytics (HPDA) and High-Performance Computing for Artificial Intelligence (HPC-AI).
To meet the demands of large-scale HPC centers, achieving high efficiency while staying within power and energy limitations is a significant challenge. Researchers need to consider various factors across the HPC hardware/software stack, including the use of specialized and highly efficient hardware accelerators, effective software resource management, and efficient cooling systems for optimal performance.
The TEXTAROSSA project addresses these challenges by prioritizing thermal control, energy efficiency, performance, and the seamless integration of new accelerators based on reconfigurable fabrics. The development of a new 2-phase cooling system aims to provide a cost-effective and efficient solution for next-generation HPC demands.