2-phase cooling for next-generation HPC challenges
Thermal management is a critical concern in HPC centers that house numerous computing elements. Temperature directly affects reliability, and because a significant portion of a center's energy goes to the cooling system, it also affects energy efficiency. The approach used in the TEXTAROSSA project combines advanced two-phase cooling with multi-level thermal control strategies to tackle thermal challenges at both the system and node levels. The objective of the TEXTAROSSA EU project is to contribute towards the strategic objectives outlined in the EuroHPC Strategic Research and Innovation Agenda and the ETP4HPC Strategic Research Agenda.
THE TEXTAROSSA SERVER SYSTEM
The solution utilizes the Ampere® Mt.Collins 2U system featuring the Ampere® Altra® Max processor. This processor delivers exceptional performance while maintaining impressive power efficiency per core. This system offers several benefits, including:
- Utilization of the ARM architecture, which aligns with the EPI project’s compatibility goals.
- Ample physical space available for integrating the cooling system.
- Support for a significant number of PCIe slots, allowing for the addition of FPGA boards or other expansion cards as required.
- A good match between the cooling system’s design point and the system’s heat dissipation requirements.
In terms of connectivity, this versatile platform boasts 160 PCIe Gen4 lanes, allowing for flexible I/O connectivity through PCIe slots, along with an additional 16 PCIe Gen4 lanes for OCP 3.0 networking. The Mt.Collins system supports up to thirty-two DDR4 3200 MT/s DIMMs, providing a maximum memory capacity of 8 TB.
Top view of the Ampere® Mt.Collins system. On the left is the space for the disks (covered by the metal plate of the case); in the center are the two sockets surrounded by the memory modules; at the top right are the system's two power supplies; at center left is a black enclosure for one of the possible accelerator boards.
The notable features of the system include:
- Support for up to 2 x U280 FPGAs for acceleration purposes.
- Dual-socket configuration with 128 cores per socket operating at a frequency of 3 GHz.
- A total of 32 DIMMs, utilizing an 8-channel 2DPC configuration per CPU.
- Maximum memory capacity of up to 8 TB.
- Maximum socket TDP (Thermal Design Power) of up to 250 W.
- Utilization of a 2-Phase Cooling Solution for efficient thermal management.
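The figures in this list are consistent with one another, which a few lines of arithmetic can confirm. The sketch below is only illustrative: the class and field names are hypothetical, it assumes 1 TB = 1024 GB and identical DIMMs in every slot, and it treats the socket TDP as an upper bound on the heat from the two sockets alone (memory, FPGAs and other components are not included).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeConfig:
    """Figures quoted in the feature list; names are illustrative only."""
    sockets: int = 2
    cores_per_socket: int = 128
    channels_per_socket: int = 8
    dimms_per_channel: int = 2   # "2DPC" = two DIMMs per channel
    max_memory_tb: int = 8
    socket_tdp_w: int = 250

    @property
    def total_cores(self) -> int:
        return self.sockets * self.cores_per_socket

    @property
    def total_dimms(self) -> int:
        return self.sockets * self.channels_per_socket * self.dimms_per_channel

    @property
    def gb_per_dimm(self) -> int:
        # Assumption: 1 TB = 1024 GB and every slot holds the same DIMM size.
        return self.max_memory_tb * 1024 // self.total_dimms

    @property
    def max_socket_heat_w(self) -> int:
        # Upper bound on socket heat only; other components are excluded.
        return self.sockets * self.socket_tdp_w


if __name__ == "__main__":
    cfg = NodeConfig()
    print(f"cores: {cfg.total_cores}")
    print(f"DIMM slots: {cfg.total_dimms}")
    print(f"GB per DIMM at maximum capacity: {cfg.gb_per_dimm}")
    print(f"socket heat to remove (W): {cfg.max_socket_heat_w}")
```

Under these assumptions, 2 sockets x 8 channels x 2 DIMMs per channel gives the 32 DIMM slots quoted above, reaching 8 TB requires 256 GB modules, and the cooling loop has to handle up to 500 W from the two sockets.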
THE PROCESSOR TECHNOLOGY
The processor is built on a 7 nm FinFET process and has a TDP of 250 W.
The Ampere Altra Max processor offers high memory bandwidth and a memory capacity of up to 4 TB per socket. It provides 128 lanes of PCIe Gen4 per socket, and in a 2P configuration it supports up to 192 PCIe Gen4 lanes that can be bifurcated down to x4. This enables integration with various off-chip devices, such as network cards capable of reaching speeds of 200 GbE or higher, as well as storage/NVMe devices.
Furthermore, the Ampere Altra Max processor supports cache coherent connectivity with off-chip accelerators. Specifically, 64 out of the 128 PCIe Gen4 lanes support Cache Coherent Interconnect for Accelerators (CCIX), which can be utilized for network, storage, or accelerator connectivity.
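To put the lane counts in perspective, the sketch below estimates the raw bandwidth of a PCIe Gen4 link and the narrowest link that could carry a 200 GbE card. It uses the standard Gen4 figures (16 GT/s per lane, 128b/130b encoding) and ignores packet and protocol overhead, so real throughput will be lower. The function names are hypothetical. The last helper simply subtracts the quoted 2P lane count from twice the per-socket count; the text does not say where those lanes go, and attributing them to the socket-to-socket link is an assumption.

```python
GEN4_GT_PER_S = 16.0               # PCIe Gen4 raw signalling rate per lane
ENCODING_EFFICIENCY = 128 / 130    # 128b/130b line encoding


def lane_gbytes_per_s() -> float:
    """Raw one-direction bandwidth of a single Gen4 lane, in GB/s."""
    return GEN4_GT_PER_S * ENCODING_EFFICIENCY / 8


def link_gbytes_per_s(width: int) -> float:
    """Raw one-direction bandwidth of a link with `width` lanes, in GB/s."""
    return width * lane_gbytes_per_s()


def min_link_width(gbit_per_s: float, widths=(4, 8, 16)):
    """Narrowest link width whose raw bandwidth covers the given rate.

    Widths start at x4 because that is the finest bifurcation quoted in
    the text. Returns None if no listed width is sufficient.
    """
    needed_gbytes = gbit_per_s / 8
    for width in sorted(widths):
        if link_gbytes_per_s(width) >= needed_gbytes:
            return width
    return None


def lanes_not_exposed(per_socket: int = 128, sockets: int = 2,
                      exposed_2p: int = 192) -> int:
    """Lanes present on the sockets but not available for I/O in 2P mode."""
    return per_socket * sockets - exposed_2p


if __name__ == "__main__":
    print(f"x16 Gen4 link: {link_gbytes_per_s(16):.1f} GB/s per direction")
    print(f"minimum width for 200 GbE: x{min_link_width(200)}")
    print(f"lanes not exposed in 2P: {lanes_not_exposed()}")
```

With these numbers, a 200 GbE card (25 GB/s) does not fit in an x8 link (about 15.8 GB/s) and needs an x16 slot (about 31.5 GB/s), and 64 of the 256 physical lanes are not exposed for I/O in the 2P configuration.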
In terms of reliability, the Ampere® Altra® Max processor offers comprehensive server-class enterprise RAS (Reliability, Availability, and Serviceability) capabilities. Advanced ECC (Error Correction Code) protects data in memory, alongside standard DDR4 RAS features. Additionally, end-to-end data poisoning ensures that any corrupted data is flagged as an error when accessed. The SLC (System Level Cache) is also ECC protected, and the processor supports background scrubbing of the SLC and DRAM (Dynamic Random-Access Memory) to identify and correct single-bit errors.
High-performance computing (HPC) plays a critical role for countries and large enterprises, supporting diverse applications in fields such as finance, weather forecasting, oil and gas, and many others. Additionally, emerging domains such as security, surveillance, bioinformatics and medicine are increasingly relying on HPC, specifically in the areas of High-Performance Data Analytics (HPDA) and High-Performance Computing for Artificial Intelligence (HPC-AI).
For large-scale HPC centers, achieving high efficiency while staying within power and energy limits is a significant challenge. Researchers need to consider factors across the whole HPC hardware/software stack, including specialized and highly efficient hardware accelerators, effective software resource management, and efficient cooling systems.
The TEXTAROSSA project addresses these challenges by prioritizing thermal control, energy efficiency, performance, and the seamless integration of new accelerators based on reconfigurable fabrics. The new 2-phase cooling system under development aims to provide a cost-effective and efficient solution that meets the demands of next-generation HPC.