Due to the popularity of the training sessions (all seats booked in less than 24 hours), we have decided to organise another one: a hands-on session on Linux is planned for January 14th and a hands-on session on HPC for January 15th 2019. See our Events page for more information.
Works on the ULB high voltage power grid are completed. Hydra and Vega HPC clusters are back online and jobs are running.
Works will be carried out on the ULB high voltage power grid on Monday 19 and Tuesday 20 November. They involve two power cuts that will impact the Hydra and Vega HPC clusters. We have therefore planned a downtime for both clusters from 19/11 15:00 until the morning of 20/11. As soon as the works on the power grid are completed, we'll bring the clusters back online and send a notification. A reservation has been placed to ensure that no jobs are running during the maintenance window. This also implies that jobs with a walltime running into the maintenance window will be started only after the maintenance.
The HPC team is pleased to announce a new series of training sessions, in collaboration with the Vlaams SupercomputerCentrum (VSC) and the Consortium des Équipements de Calcul Intensif (CECI), focussing on young and/or early stage researchers (doctoral students, Master thesis students). This time we'll propose the usual Linux and HPC hands-on sessions plus a new session on Grid computing. See our Events page for more information and registration.
A planned firmware upgrade of one of the Hydra switches led to a short interruption of access from the compute nodes to the Hydra storage. The outage took place between 8:40 and 8:50 this morning. The redundancy at the network and storage levels clearly did not work. Some jobs were impacted and lost. Users should check the output of terminated jobs for possible errors. The issue will be investigated and hopefully a fix will be found. We apologize for any inconvenience caused.
Works on the power grid of the ULB/VUB SISC datacenter are over. Hydra is now back at 100% capacity. Some nodes that were supposed to remain powered were stopped, which caused the loss of some jobs. Users with jobs that completed this Friday morning should check the outputs for possible interruptions. Sorry for any inconvenience caused.
Works on the power grid of the ULB/VUB SISC datacenter are over. Vega is now back online.
Works will take place on the power grid of the ULB/VUB SISC datacenter on Friday 3rd of August between 8:00 and 12:00. Some electric lines powering Hydra will be offline. We have therefore prepared the cluster to run with a reduced number of nodes. Access to Hydra and the data will remain possible and jobs will continue to run. The works should be completed by the end of Friday morning and Hydra will again be fully operational in the afternoon.
Works will take place on the power grid of the ULB/VUB SISC datacenter on Friday 3rd of August between 8:00 and 12:00. Electric lines powering Vega will be offline. We have therefore prepared the cluster for a shutdown. No job will be running at the time of the shutdown (a reservation on all the nodes is in place), so no job loss is expected. Queued jobs will be kept as well. Once the works are completed, Vega will be powered back on and will be online at the beginning of the afternoon, or sooner.
This morning at 9:20 a routine operation performed by VUBNET on the Hydra network took down connectivity between the Hydra storage and the compute nodes. The problem was quickly identified and connectivity restored. Some operations followed to recover pending data not yet written to the storage (no data loss is expected) and the Hydra storage was fully recovered at 10:00. Unfortunately, almost all jobs crashed during the storage outage. Please check the jobs that completed this morning for possible error messages.
We have completed the maintenance works on Hydra. It took slightly longer than expected. Our apologies for that. The cluster is accessible again and is running jobs. The nodes' operating system was updated, data was reorganised on the Hydra storage (nothing changed on the users' side), Torque was updated to the latest version and the maximum walltime was reduced to 5 days.
We have completed the maintenance works: Vega is accessible again and queued jobs are now running. The nodes' operating system was updated and Slurm was upgraded to version 17.11.7.
Vega will be on maintenance from 25/06 to 27/06 with a scheduled downtime. The maintenance will be used to perform several updates/upgrades on Vega core services. The maintenance will not kill jobs on the cluster or remove queued jobs. We have placed a reservation on the cluster two weeks before the maintenance date, aligned with the maximum walltime. Jobs that can complete before the 25/06 will (continue to) run and those submitted with a walltime bumping into the maintenance window will be kept pending until the maintenance is completed.
Hydra will be on maintenance from 25/06 to 29/06 with a scheduled downtime. The maintenance will be used to perform several updates/upgrades on Hydra core services. We will change the maximum walltime from 14 days to 5 days. Pending jobs still in the queue and with a walltime above 5 days will be updated accordingly. The maintenance will not kill jobs on the cluster or remove queued jobs. As for previous planned Hydra downtimes, we have placed a reservation on the cluster two weeks before the maintenance date, aligned with the maximum walltime. Jobs that can complete before the 25/06 will (continue to) run and those submitted with a walltime bumping into the maintenance window will be kept pending until the maintenance is completed.
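The effect of the reservation and of the new walltime limit can be illustrated with a small sketch. This is not the scheduler's actual logic: the function names are hypothetical, the year and the 00:00 start of the maintenance window are assumptions, and capping is only one reading of "updated accordingly".

```python
from datetime import datetime, timedelta

# Assumption: the maintenance window opens at 00:00 on 25/06 (year assumed).
MAINTENANCE_START = datetime(2018, 6, 25)
# New maximum walltime announced for Hydra.
MAX_WALLTIME = timedelta(days=5)


def can_start(now, walltime, maintenance_start=MAINTENANCE_START):
    """True if a job started at `now` would end before the maintenance begins."""
    return now + walltime <= maintenance_start


def cap_walltime(walltime, max_walltime=MAX_WALLTIME):
    """Reduce a requested walltime to the new maximum if it exceeds it."""
    return min(walltime, max_walltime)


now = datetime(2018, 6, 18, 9, 0)
print(can_start(now, timedelta(days=5)))   # True: ends on 23/06 at 09:00
print(can_start(now, timedelta(days=7)))   # False: would end on 25/06 at 09:00
print(cap_walltime(timedelta(days=14)))    # 5 days, 0:00:00
```

In practice, requesting a walltime close to what the job really needs gives it the best chance of being scheduled before the maintenance starts.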
We have identified the jobs on Hydra involved in the crash of the central storage NFS service. There were ~100 identical jobs running a short loop with an open-write-close action on the same file located in the Hydra home directory. This led to 1) massive IOPS on the NFS server and 2) an NFS lock race condition that eventually led to the NFS service crash. We will continue to work with the storage vendor to help them figure out why their NFS service crashed. This incident is also a good reminder for all to avoid using the Hydra home directory for IO-intensive jobs. We have deployed dedicated storage on Hydra capable of sustaining massive IO loads and you can use it via your work directory. The Hydra and Vega clusters are fully operational again.
After further investigation of the NFS issue on the SISC central storage, it turned out that when specific private networks are given access to the central storage, the crash follows a few hours later. Hydra and Vega access the NFS service via private networks involved in this crash phenomenon. To preserve other critical services relying on, but not impacting, the central storage (like the email service), the Hydra and Vega private networks have been cut off from the central storage. At this stage we still don't know whether it is a rogue process or a network-related issue. We continue to investigate in collaboration with SISC colleagues and the network teams.
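As an illustration of the IO pattern described above, the sketch below contrasts reopening a file on every iteration with opening it once and writing to a per-job file. The code of the actual jobs is not known, so both functions are hypothetical, and a temporary directory stands in for the work directory.

```python
import os
import tempfile


def write_reopening(path, values):
    """Pattern to avoid: open, write and close the file on every iteration."""
    for value in values:
        with open(path, "a") as handle:
            handle.write(f"{value}\n")


def write_once(path, values):
    """Gentler pattern: open the file once and let Python buffer the writes."""
    with open(path, "w") as handle:
        for value in values:
            handle.write(f"{value}\n")


# Assumption: a temporary directory stands in for the user's work directory.
with tempfile.TemporaryDirectory() as workdir:
    # One output file per process, so concurrent jobs never share a file.
    out_path = os.path.join(workdir, f"results_{os.getpid()}.txt")
    write_once(out_path, range(1000))
    with open(out_path) as handle:
        n_lines = sum(1 for _ in handle)

print(n_lines)  # 1000
```

Both functions produce the same file content; the second simply issues far fewer open and close operations, which are the costly part on a network filesystem.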
The SISC central storage is encountering issues making the Hydra home directories unavailable. Login to Hydra will be impossible and running jobs relying on files stored on the home directories will fail. The job scheduler has been stopped to prevent new jobs starting. The issue is being investigated and further information will be posted once the issue has been resolved.
The VSC Users Day 2018 will take place on 22nd of May 2018 at the Koninklijke Vlaamse Academie van België voor Wetenschappen en Kunsten. See https://www.vscentrum.be/en/education-and-trainings/detail/vsc-user-day-22052018
The 10th CÉCI Scientific Meeting will take place on 4th of May 2018 at UNamur. See http://www.ceci-hpc.be/scientificmeeting.html
PRACE has issued the 17th call for Proposals.
Deadline: 2nd May 2018, 10:00 CET.
Stake: single-year and multi-year proposals starting 2nd October 2018.
Resources: Joliot-Curie, Hazel Hen, JUWELS, Marconi, MareNostrum IV, Piz Daint and SuperMUC.
We replaced the dying switch with a new one (thanks VUBNET team!). The cluster is back online. Note that jobs that were running at the time of the switch failure have been lost. We kept Slurm stopped to prevent further job losses. Jobs that were in the queue are now running. New jobs can be submitted now.
It seems that one of the Gbps switches on Vega is dying (ports going down and random restarts). We are investigating the issue and will replace the switch if this is a hardware issue. The cluster is currently offline and will stay so a bit longer if the switch must be replaced.
Broken disks have been replaced on Vega (no impact on data availability). A deep cleaning also recovered 20 TB of storage space.
We have installed Gaussian version 16 on Hydra. To use this version, simply load the right module:
module load gaussian16
For Gaussian 09 users: we recommend rerunning your latest jobs with G16 and comparing the results.
We have completed the maintenance works: Hydra is accessible again and queued jobs are now running. Works summary: 1) The entire Hydra Ethernet network has been rebuilt from scratch with new switches and new cabling. All network communications have therefore been improved, for a globally increased performance in data management & transfers. 2) Four new GPGPU nodes have been added, each with 2x Tesla P100. 3) The storage capacity has been increased to 800 TB. 4) The usual OS/software updates & upgrades have been made, including the installation of security patches.
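One possible way to compare a G09 and a G16 run is to extract the final SCF energy from each log file. The sketch below assumes the usual "SCF Done" line of Gaussian outputs; the two sample lines are made up for illustration, and the tolerance is an arbitrary assumption to adapt to your own needs.

```python
import re

# Matches lines such as: " SCF Done:  E(RB3LYP) =  -76.4089533  A.U. after 10 cycles"
SCF_RE = re.compile(r"SCF Done:\s+E\(([^)]+)\)\s+=\s+(-?\d+\.\d+)")


def last_scf_energy(log_text):
    """Return the last 'SCF Done' energy (Hartree) in a log, or None if absent."""
    matches = SCF_RE.findall(log_text)
    return float(matches[-1][1]) if matches else None


def energies_agree(log_a, log_b, tol=1e-6):
    """True if both logs contain a final SCF energy and they differ by at most tol."""
    energy_a, energy_b = last_scf_energy(log_a), last_scf_energy(log_b)
    if energy_a is None or energy_b is None:
        return False
    return abs(energy_a - energy_b) <= tol


# Made-up sample lines, for illustration only.
g09_log = " SCF Done:  E(RB3LYP) =  -76.4089533000     A.U. after   10 cycles"
g16_log = " SCF Done:  E(RB3LYP) =  -76.4089534000     A.U. after    9 cycles"

print(last_scf_energy(g09_log))
print(energies_agree(g09_log, g16_log))
```

With real jobs, read each log file into a string first, and compare any other quantities that matter for your work (geometries, frequencies) in the same spirit.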
We are planning a maintenance window on Hydra from 11/12 to 15/12 with a scheduled downtime. The maintenance will be used to perform several updates/upgrades on Hydra core services including a complete network upgrade (physical and logical levels), 4 new GPGPU nodes and an increase of the storage capacity to 750 TB. The maintenance will not kill jobs on the cluster or remove queued jobs. As for previous planned Hydra downtimes, we have placed a reservation on the cluster two weeks before the maintenance date, aligned with the maximum walltime. Jobs that can complete before the 11/12 will (continue to) run and those submitted with a walltime bumping into the maintenance window will be kept pending until the maintenance is completed.
We are planning a maintenance window on Hydra on 25/09 with a scheduled downtime. The downtime should last a few hours. A global reservation on Hydra has been created to make sure that no job will be running on Sept. 25. Jobs that can complete before that date will be executed and those with a walltime running beyond Sept. 25 will be kept in the queue.
We are redirecting all the Vega logs to our ELK platform. The ELK analytic capabilities will help us better spot issues with the compute nodes and with submitted or running jobs. Our first objective: spot jobs wasting CPU or memory resources.
Long-standing hardware issues with some Vega compute nodes are being tackled. We'll remove the problematic nodes from Vega, fix them and place them back in Vega. The total number of available compute nodes will vary in the coming days.
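To give an idea of what "wasting CPU" can mean, the sketch below computes a simple CPU efficiency per job: the CPU time actually used divided by the CPU time reserved (walltime multiplied by the number of cores). The field names, the sample records and the 50% threshold are all hypothetical; they do not reflect the actual log format or the criteria used on Vega.

```python
def cpu_efficiency(cpu_seconds, wall_seconds, n_cores):
    """Fraction of the reserved CPU time that the job actually used."""
    if wall_seconds <= 0 or n_cores <= 0:
        return 0.0
    return cpu_seconds / (wall_seconds * n_cores)


def wasteful_jobs(jobs, threshold=0.5):
    """Return the ids of jobs whose CPU efficiency is below the threshold."""
    return [
        job["id"]
        for job in jobs
        if cpu_efficiency(job["cpu_seconds"], job["wall_seconds"], job["n_cores"]) < threshold
    ]


# Hypothetical job records, for illustration only.
jobs = [
    {"id": "a", "cpu_seconds": 7200, "wall_seconds": 3600, "n_cores": 2},   # efficiency 1.0
    {"id": "b", "cpu_seconds": 3600, "wall_seconds": 3600, "n_cores": 16},  # efficiency 0.0625
]

print(wasteful_jobs(jobs))  # ['b']
```

Job "b" reserves 16 cores but keeps only one busy, which is the typical signature of a serial program submitted with a multi-core request.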
We are happy to announce the hire of a new CECI logisticien: Ariel Lozano. Welcome Ariel!
We are planning a maintenance window on Hydra from 03/07 to 06/07 with a scheduled downtime. A reservation will be placed on the cluster from 19/06. Jobs that can complete before the 03/07 will (continue to) run and those submitted with a walltime bumping into the maintenance window will be kept pending until the maintenance is completed.
The cooling system is now back to normal. The Vega and Hydra clusters are fully operational too.
This morning at ~9:30 the datacentre cooling system failed. Given the temperature increase in the datacentre, we had to switch off the Hydra and Vega compute nodes. Once the cooling system has been repaired, we'll switch the compute nodes back on. Jobs that were running will be lost but those in the queue have been kept. All data are safe on the Hydra storage. We'll send another notification once the clusters are back online.
The VSC has a vacancy for an account manager VSC and industry. You will be employed by the Fonds Wetenschappelijk Onderzoek - Vlaanderen. The closing date for applications is June 12, 2017. Given the nature of the job, the vacancy is only available in Dutch. See the detailed announcement at https://www.vscentrum.be/en/news/detail/vacancy-account-manager-vsc-and-industry
This user day will take place on June 2, 2017 in Brussels. For more information please see https://www.vscentrum.be/events/userday-2017
UCL professors Gian-Marco Rignanese and Jean-Christophe Charlier have been granted 20 million core-hours on Marconi KNL (CINECA, Italy). Congratulations to them!
The ninth CÉCI scientific meeting day is organised in Louvain-la-Neuve on April 21st. More information and registration: see http://www.ceci-hpc.be/scientificmeeting.html
The CÉCI common filesystem is fully available on the 6 CÉCI clusters. For example, the partition /CECI/home/ is directly accessible from all the login and compute nodes on all the clusters. Make sure to try it out! More information on https://support.ceci-hpc.be/doc/_contents/ManagingFiles/TheCommonFilesystem.html
For the next four years, KU Leuven will host the VSC Tier-1 supercomputer. We would like to present this supercomputer to you. See https://www.vscentrum.be/en/news/detail/breniac-the-new-tier-1-supercomputer-in-flanders