Medooza: the Parallel Computing infrastructure
The ideal HPC Cluster
Medooza is an HPC cluster designed specifically to meet each client's particular requirements. It is characterised by carefully engineered optimisation of hardware and software components, with custom or standardised configurations for x86 CPUs, GPUs, storage and high-speed interconnects.
Engineered, not assembled
To guarantee maximum reliability, Medooza is put through extended stress testing during production.
Prior to delivery, we produce an output report with test and real-time performance data for our clients. Medooza is based on a suite of hardware and software components suitably harmonised by our expert engineers, lending itself to the continuous integration of new modules and modern features.
On both the hardware and software sides, Medooza supports the widest possible range of high-performance CPUs (ARM, x86_64), accelerators (GPU, FPGA), networks (InfiniBand, RoCE) and disks (NVMe, SSD), as well as the most popular operating systems (Red Hat, CentOS Stream).
The various development workspaces accessible to users are flexible and reconfigurable. They are managed through environment modules created by E4 according to the client's needs.
Medooza is natively multicluster. The same infrastructure can be used to serve multiple clusters within the same local network.
The HPC system can be expanded by adding new nodes (scale-out); it is also possible to check the state of the hardware and, where appropriate, recondition old clusters.
Discover the advantages
Worker nodes are configured at the back end of the cluster and are only accessible to the end user via the workload manager (Slurm), which can be installed and configured so that batch workloads, submitted from the login nodes, run with maximum resource-use efficiency. The solution is released with only the basic functions of Slurm activated (partition management, worker-node management, computational-resource allocation, etc.), whilst more advanced functions can be configured on specific service request and are quoted as add-ons.
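As an illustration, batch use of the worker nodes typically goes through a Slurm job script like the minimal sketch below; the partition, module and application names are hypothetical and depend on how the site is configured.

```shell
#!/bin/bash
# Minimal Slurm batch script sketch. Partition and module names
# are assumptions, not values shipped with the solution.
#SBATCH --job-name=example
#SBATCH --partition=compute       # hypothetical partition name
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --time=01:00:00

module load openmpi               # environment-module name is an assumption
srun ./my_mpi_app                 # run the application on the allocated tasks
```

The script is submitted from a login node with `sbatch job.sh` and monitored with `squeue -u $USER`; Slurm allocates the back-end resources and releases them when the job ends.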
All nodes are interconnected via a low-latency network, configured to guarantee high performance when accessing parallel storage resources and MPI applications, whilst an Ethernet network is dedicated to their centralised servicing and management.
The high-performance storage area is implemented with BeeGFS.
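As a hedged example, a BeeGFS deployment can be inspected from a client node with the standard `beegfs-ctl` utility; the `/beegfs` mount point shown here is an assumption, not necessarily the path used in a given installation.

```shell
# Sketch of basic BeeGFS health checks from a client node
# (requires the beegfs-utils package).
beegfs-ctl --listnodes --nodetype=storage   # storage servers in the file system
beegfs-ctl --listnodes --nodetype=meta      # metadata servers
df -h /beegfs                               # mount point name is an assumption
```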
HEAD NODE CONFIGURATION
Head nodes are configured in a hyperconverged virtualisation infrastructure, to dependably house both the Login Node, for end-user access to the infrastructure through a Command Line Interface (using the SSH protocol), and the Service Nodes, used in the deployment of the following components necessary for the perfect functioning of the cluster:
- Cluster network services (DHCP, TFTP, IP forwarding…)
- Centralised user management service (LDAP)
- Dynamic user workspace configuration (environment modules)
- Workload manager service (SLURM)
- NFS service for the storage area dedicated to the centralised installation of software components (which need to be made accessible to all Compute Nodes)
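The dynamic workspace configuration mentioned above is based on environment modules; a typical session on the login node might look like the following sketch, where the module names are illustrative examples rather than the stacks E4 actually publishes.

```shell
# Sketch of an environment-modules session on the login node.
# Module names (gcc/12, openmpi) are hypothetical examples.
module avail                  # list the software stacks made available
module load gcc/12 openmpi    # load a compiler and an MPI stack
module list                   # show what is currently loaded
module purge                  # return to a clean environment
```

Because the modules are created centrally by E4 and served to all compute nodes over NFS, loading one on the login node guarantees the same environment inside batch jobs.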
• Allows access to applications and data saved on the shared file system of the back-end nodes
• Command line interface for the cluster user
• Enables batch use of backend processing resources through Resource Management/Scheduling services
• User authentication
• File system shared between compute nodes
• Resource Manager/Scheduler
• Infrastructure alarm management services/Service monitoring
HPC resource manager & scheduler configured to obtain maximum efficiency from the cluster and have full control over workload execution.
Centralised monitoring and alerting via a single, intuitive control point that lets you check the status of the cluster and receive alerts in case of anomalies.
The software stack includes the main open-source tools for developing and running high-performance applications: compilers, scientific and numerical libraries, MPI, OpenMP.
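With that stack in place, the typical build-and-run cycle can be sketched as follows, assuming GCC and Open MPI are the installed open-source tools; the source file name and task counts are illustrative.

```shell
# Hedged sketch of building and launching a hybrid MPI + OpenMP
# application with the open-source stack (GCC + Open MPI).
mpicc -O2 -fopenmp hello.c -o hello   # MPI compiler wrapper, OpenMP enabled
export OMP_NUM_THREADS=4              # threads per MPI rank
srun --ntasks=8 ./hello               # launch 8 ranks through Slurm
```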
Medooza can be enriched with a series of additional features according to the specific needs of each project:
- Optimized compilers and libraries
- High reliability HA services
- Parallel and/or ephemeral file system
Since the architecture is extremely flexible and adaptable to any environment, here are some guidelines to keep in mind when designing a Medooza solution.
The E4 HPC cluster is built using open-source technology, validated both by our R&D laboratories and by years of experience with clients from the academic and corporate spheres.
Medooza's design is based on hardware building blocks validated by our engineers, configured and managed by internal tools built on open source.
Medooza's scale-out architecture allows it to respond transparently and natively to growing needs.
With a simple but appropriate reconfiguration of the environment, a variety of new nodes can be added whilst leaving the system architecture unchanged.
Why choose the E4 solution
E4 engineers always design a solution by first understanding the client’s specific requirements and expectations.
Here is a summary of the principles we uphold throughout the design process:
• LISTENING: understand the client's needs
• EXPLORATION: collaborate with domain-specific ISVs/partners to optimise every aspect of the design
• UNRESTRICTED: research and select the best infrastructure and implementation processes for the solution
• VERSATILE: provide on-site and cloud-based solutions for production, testing and evaluation
All the systems included in a solution must first pass firmware, homogeneity, sanity and setup checks before configuration. The following general performance tests are then performed:
- HPL (High Performance Linpack), to test the computing power of a machine, measured in FLOPS;
- STREAM, to test memory-access bandwidth, measured in MB/s;
- FIO, to test disk-access speed, measured in MB/s and IOPS.
Once the solution is ready, further tests are carried out in agreement with the client to verify system function and compare against expected performance.
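For orientation, the acceptance benchmarks named above can be invoked roughly as follows; the parameters shown are illustrative, not the values E4 uses in its production reports.

```shell
# Illustrative invocations of the acceptance benchmarks.
# STREAM: memory bandwidth (compile the reference source, then run)
gcc -O3 -fopenmp stream.c -o stream && ./stream
# FIO: disk throughput and IOPS, sequential write with 1 MiB blocks
fio --name=seqwrite --rw=write --bs=1M --size=4G --ioengine=libaio --direct=1
# HPL: distributed Linpack run, typically launched via MPI/Slurm
srun --ntasks=64 ./xhpl
```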
Individual components are tested through burn-in tests (developed by E4) lasting up to 120 hours to guarantee a unique, perfectly engineered and functioning system, reducing both the DoA (Dead on Arrival) and the "early failure" rate after release. This significantly improves the overall reliability of E4's HPC solutions.
E4 is currently among only a few companies that deliver high-end services not only to the academic and industrial sectors but also to research centres of national and international importance, with which it has collaborated on the design, configuration and deployment of solutions comprising thousands of nodes for high-performance, complex processing.
E4 is therefore able to provide clients with whatever level of support and advice they need, regardless of the size or complexity of their system, guaranteeing lasting efficiency:
• Level 1 & Level 2 Support, onsite service, 24/7 support
• Onsite specialists, infrastructure evolution, on-demand performance tuning.
New features coming soon
In addition to the features currently offered by its basic configuration, Medooza can be enriched with a series of add-ons. Some can be integrated immediately, others are available on demand. Contact us for release dates.
- UBUNTU support
- Centralized Web interface for managing Nodes and applications
- The “plug & play” addition of computing nodes to the existing solution
- SLURM ADVANCED: accounting, billing, QoS, FairShare, etc.
- BACKUP/DR: Implementation of solutions based on FOG/Bacula
Interactive interface for web-access to computing resources via workspaces with notebook technology configured for the HPC field. Does not support GPU computing.
ICE4AI for HPC
Supports GPU computing and includes the main AI and scalable data-analytics frameworks such as Dask, RAPIDS, TensorFlow and PyTorch. Includes a graphical interface to run SLURM jobs.