To make the hardware run at peak performance, the systems must be optimally configured, the networks made as efficient as possible and the whole platform kept highly secure. The system administrators are constantly fixing, optimising and improving things. Below is an overview of the tasks the HPC team works on daily.
Operating systems and associated services must be kept up to date and secured at all times. New hardware must be integrated into the HPC platform rapidly. Whenever possible, maintenance operations are performed live on the systems without interfering with running jobs.
Multiple levels of security are implemented: 1) traffic control from public networks to the login nodes, 2) network separation and control between the user space, the systems and the infrastructure, 3) activity monitoring and alerting. Users and their data are well protected!
Multiple solutions have been implemented to collect activity data at the system, network and user levels. If something goes wrong, we can quickly identify and fix it. This is mandatory for a responsive and agile management of the HPC platform.
Based on monitoring data, systems, services and networks are tuned to offer the highest efficiency for the compute jobs. There are hundreds of parameters that can be tweaked to optimise data flows, job execution times, handling of workload spikes, etc. The SysAdmins review and adapt these parameters continuously to keep the systems at peak performance.
We cannot afford to spend time on repetitive tasks. Tools and workflows are deployed to turn every repeated task into a fully automated procedure, from system deployment to self-healing solutions.
System administrators are fully involved in user support. All HPC team members have scientific backgrounds, allowing them to find solutions for most, if not all, researcher needs.
On Hydra, queues are managed by the Torque software and jobs are scheduled on the compute nodes with the Moab software. We have defined a single default queue which routes jobs to sub-queues based on the requested resources: number of cores and/or nodes, and memory.
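To give an idea of that routing, here is a minimal sketch of the decision logic in Python. It is only an illustration of the principle: the actual routing is done inside Torque/Moab, and the sub-queue names and thresholds below are hypothetical.

```python
def route_job(cores: int, nodes: int, mem_gb: float) -> str:
    """Pick a sub-queue from the resources requested at submission time.

    Illustrative only: the real routing is configured in Torque/Moab,
    and the queue names and thresholds here are made up for the example.
    """
    if nodes > 1:
        return "mpi"            # multi-node jobs
    if mem_gb > 240:
        return "himem"          # large-memory single-node jobs
    if cores == 1:
        return "single_core"    # serial jobs
    return "smp"                # multi-core, single-node jobs

# Example: a 16-core, single-node job asking for 64 GB of memory
print(route_job(cores=16, nodes=1, mem_gb=64))   # -> "smp"
```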
The tricky part is mastering the vast number of parameters in Moab and occasionally making Torque behave the way we want. Parameters, and therefore cluster usage policies, are regularly updated.
On Vega, queues are managed and jobs are scheduled by the Slurm software. Since the cluster is very homogeneous, only one default queue is available.
Slurm parameters are fixed and define the cluster usage policies.
Make sure you check the cluster pages where user limits are provided.
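For a quick look at what the scheduler itself enforces, you can also query Slurm directly from a login node. A minimal sketch, assuming the Slurm client tools are in your PATH (the exact fields shown depend on the cluster configuration):

```python
import subprocess

# Ask Slurm for the limits attached to the partitions.
# "scontrol show partition" prints, among other things, MaxTime,
# MaxNodes and the default/maximum memory per CPU for each partition.
partitions = subprocess.run(
    ["scontrol", "show", "partition"],
    capture_output=True, text=True, check=True,
).stdout

print(partitions)
```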
An HPC system is constantly under stress: compute jobs keep CPUs at maximum load and data continuously flows in and out of the compute nodes. As a consequence, hardware fails, programs crash and systems can hang. With a very heterogeneous workload, ranging from floating-point intensive to I/O intensive, we must keep an eye on all levels. We use Zabbix as our monitoring solution; it can collect a very large amount of data efficiently and trigger alerts whenever something goes wrong. The ability to create advanced/complex triggers (combining multiple probes) helps in defining intelligent alerts.
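As a sketch of what such a combined trigger can look like, the snippet below uses the pyzabbix client to create a trigger that fires only when two probes misbehave at the same time. The server URL, credentials, host name, item keys and thresholds are hypothetical, and the expression uses the older (pre-5.4) Zabbix trigger syntax.

```python
from pyzabbix import ZabbixAPI

zapi = ZabbixAPI("https://zabbix.example.org")   # hypothetical server
zapi.login("api_user", "secret")

# Alert only when the 5-minute CPU iowait is high *and* the scratch
# filesystem is almost full: neither probe alone is worth an alert.
zapi.trigger.create(
    description="Node under I/O pressure on {HOST.NAME}",
    expression=(
        "{node042:system.cpu.util[,iowait].avg(5m)}>40 and "
        "{node042:vfs.fs.size[/scratch,pfree].last()}<5"
    ),
    priority=4,  # "High"
)
```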
The monitoring system tracks a large set of probes, but more and more often we need to aggregate multiple sources of data to detect unusual behavioural patterns such as drifts or spikes in resource usage, peaks of specific error types, or sudden drops in efficiency. The Elasticsearch - Logstash - Kibana software stack is an excellent tool for this. All sorts of logs are streamed from the HPC platform, parsed, indexed and aggregated on timelines to spot such behavioural patterns. If a system goes out of control or a user starts running crazy jobs, we'll spot it and fix it!
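The kind of timeline aggregation described above can be reproduced with a few lines of the official Elasticsearch Python client. This is only a sketch: the index pattern, field names and query string below are hypothetical, and the keyword-argument style assumes a recent version of the client.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://elastic.example.org:9200")   # hypothetical endpoint

# Count "I/O error" log lines per hour over the last day: a sudden spike
# on the timeline is exactly the kind of pattern we want to catch early.
resp = es.search(
    index="hpc-syslog-*",
    size=0,
    query={"bool": {"must": [
        {"match_phrase": {"message": "I/O error"}},
        {"range": {"@timestamp": {"gte": "now-24h"}}},
    ]}},
    aggs={"per_hour": {"date_histogram": {
        "field": "@timestamp", "fixed_interval": "1h"}}},
)

for bucket in resp["aggregations"]["per_hour"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])
```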
With 150+ compute nodes, management nodes, storage system(s) and other services integrated in a high performance computing platform, manual work is avoided at all costs. We rely instead on automation, testing, continuous integration, agile management and other trendy words. After using a commercial solution for a few years, we are moving to open source for the automation part. We are now experimenting with Quattor, a heavyweight and poorly documented solution, but one with a large community that includes our colleagues from UGent and IIHE. Quattor allows us to deploy large and complex systems starting from bare metal. For centralised configuration management, we are using the excellent SaltStack, whose bidirectional communication layer allows us to push configurations to large numbers of systems and to implement self-healing processes.
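A minimal sketch of that push model, using Salt's Python API from the master (the target pattern, state name and service name are hypothetical; in practice Salt's beacon/reactor system is the more natural fit for real self-healing):

```python
import salt.client

local = salt.client.LocalClient()

# Push a configuration state to all compute nodes in one shot...
results = local.cmd("compute*", "state.apply", ["ssh_hardening"])

# ...and restart a monitoring agent wherever it is found dead,
# a very small example of the self-healing idea.
status = local.cmd("compute*", "service.status", ["node_exporter"])
for minion, running in status.items():
    if not running:
        local.cmd(minion, "service.restart", ["node_exporter"])
```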