Frequently asked questions (FAQ)

Hydra

1. How can I connect to Hydra?

First check the page Creating an account, to make sure that your NetID is activated for Hydra or that you have a valid VSC account.

Next, follow the instructions in the HPC tutorial.

If you cannot connect after following the instructions, there are a few things you can check:

  • Is your password correct?

    You can reset your password in your personal account manager (PAM). Note that it can take up to 1 hour for our password database to be updated, so your new password may not work immediately.

  • Are you using an SSH key?

    Make sure that your private key file is in the right place and has the right permissions.
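A common cause of rejected key logins is overly permissive file modes. As a quick sketch (assuming the default key name id_rsa; adjust if yours differs), the expected permissions can be set with:

```shell
# SSH refuses keys that other users can read: the directory must be
# accessible only by you, and the private key readable only by you.
chmod 700 ~/.ssh
chmod 600 ~/.ssh/id_rsa
```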

If you still cannot connect, please contact us at hpc@vub.ac.be.

2. Where can I find installed software?

Most end-user software on Hydra is available via modules. To obtain a full list of installed software, type:

module av

Modules can be loaded as follows (in this example Python 2.7 is loaded):

module load Python/2.7.14-intel-2017b

To get a list of available versions of a given software package (for example, Python), type:

module spider Python

You can check which modules are currently loaded with:

module list

To unload all currently loaded modules, type:

module purge

Note: when combining modules, always make sure that they are built with the same compiler toolchain (for example, foss or intel) and version (for example, 2017b or 2018a).
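For example, the following two modules share the same toolchain version (intel-2017b) and can safely be combined. The exact module versions shown are illustrative; check `module av` for what is actually installed:

```shell
# Both modules are built with the intel-2017b toolchain,
# so they can be loaded together without conflicts:
module load Python/2.7.14-intel-2017b
module load HDF5/1.10.1-intel-2017b
```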

If you need software that is not yet installed, please contact us at hpc@vub.ac.be.

More information on module usage can be found in the HPC tutorial and in the HPC training.

3. How can I check my disk quota?

To prevent your account from becoming unusable, you should regularly check your disk quota and clean up your $HOME and $WORKDIR.

You can check your disk quota with the command:

myquota
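If myquota shows that you are close to your limit, you can find out which directories take up the most space. A quick sketch using standard tools:

```shell
# List the largest items in your $HOME, sorted smallest to largest
# (hidden files/directories included; errors are suppressed):
du -sh $HOME/* $HOME/.[!.]* 2>/dev/null | sort -h | tail -n 10
```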

4. I have exceeded my $HOME disk quota, what now?

When your $HOME is full your account will become unusable.

In that case, you have to delete some files from your $HOME ($VSC_HOME), such as temporary files and checkpoint files, until there is enough free space.

If you need large temporary storage, you can use $WORKDIR ($VSC_SCRATCH). Files will not be deleted there, but there is also no backup of your data on $WORKDIR.

Remember that Hydra is not intended for data storage. You should regularly back up your data and clean up your $HOME and $WORKDIR.

If you need backup storage, or if you need larger temporary storage than is available on $WORKDIR, please contact us at hpc@vub.ac.be.

5. How can I check my resource usage?

Making efficient use of the HPC cluster is important for you and your colleagues:

  • Your jobs will start faster.
  • Better usage means we can buy more/faster hardware.

The 3 main resources to consider are:

  • memory usage
  • wall time usage
  • core usage (CPU time): how many cores are doing actual work?

You can check resource usage of running and recently finished jobs with the command:

myresources

Remarks:

  • Core usage is only reported for jobs that have been running longer than 5 minutes.
  • If you requested 1GiB of memory or less, this is always considered good (you get 1GiB ‘for free’).

You can also check resource usage of finished jobs at the end of your job output file. The last few lines of this file show the requested and used resources, for example:

Resources Requested: walltime=00:05:00,nodes=1:ppn=1,mem=1gb,neednodes=1:ppn=1
Resources Used: cput=00:01:19,vmem=105272kb,walltime=00:01:22,mem=17988kb,energy_used=0
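As a rough sanity check, core usage can be estimated from the epilogue as cput divided by walltime times the number of cores. Using the example values above (79 s cput, 82 s walltime, 1 core):

```shell
# CPU efficiency = cput / (walltime * cores); values taken from the
# example job epilogue above:
awk -v cput=79 -v wall=82 -v cores=1 'BEGIN { printf "%.2f\n", cput / (wall * cores) }'
# prints 0.96, i.e. the job kept its core ~96% busy
```

Values close to 1.0 indicate the requested cores were well used; much lower values suggest you requested more cores than the job can exploit.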

6. I have accidentally deleted some data, how can I restore it?

First of all, remember that you are responsible for your own backups: we do not guarantee persistence of your data on Hydra.

If your deleted data was on your $WORKDIR ($VSC_SCRATCH), the data is permanently lost, as we do not make any backups there.

If you are using your VSC account to access Hydra, the data is also permanently lost (for the time being, we are working to change this).

If you are using your netID and your data was on your $HOME ($VSC_HOME), you might be lucky: we make daily snapshots of the $HOME directory, and we keep them for 7 days.

To access the snapshots, you first need to find out the directory your $HOME is mounted on:

df $HOME

The above command will return output similar to this:

Filesystem                       1K-blocks      Used Available Use% Mounted on
172.31.244.238:/export/fs-svub1 1073741824 814717952 259023872  76% /svub1

In the example output above, $HOME is mounted on ‘/svub1’. Go to the snapshot directory (replace ‘/svub1’ with the directory your $HOME is mounted on):

cd /svub1/.zfs/snapshot

List all snapshots currently available:

ls -ahl

Each directory name contains a timestamp (i.e. the date and time when the snapshot was created). For example, the snapshot ‘.auto-20181118T020500UTC’ was created on November 18th at 02:05 UTC. To access one of the snapshots, type (replace the timestamp with the one you need):

cd .auto-20181118T020500UTC/<netID>

In this directory you will find all your files and directories in the state they were in at the time of the snapshot. You can browse this directory and copy back the data you need to your $HOME.
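For example, to copy a deleted directory back from a snapshot to your $HOME (the snapshot timestamp and the directory name ‘my_project’ are illustrative):

```shell
# -a preserves permissions, ownership and timestamps during the copy:
cp -a /svub1/.zfs/snapshot/.auto-20181118T020500UTC/<netID>/my_project $HOME/
```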

7. How can I share files with other users?

On Hydra, you can grant a group or an individual user access to some of your files or directories with an access control list (ACL). Note however that ACLs only work in your $WORKDIR, not in your $HOME.

Example 1: to give user ‘andrius’ read and write permissions to ‘my_file’:

setfacl -m u:andrius:rw my_file

Example 2: to give group ‘flash’ read access to ‘my_dir’:

setfacl -m g:flash:r my_dir

Example 3: to remove all permissions from user ‘andrius’ to ‘my_file’:

setfacl -x u:andrius my_file
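To inspect the ACL entries currently set on a file or directory, use getfacl:

```shell
# Show the owner, group, and all ACL entries for my_file:
getfacl my_file
```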

For more information on ACLs, see: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/storage_administration_guide/acls-setting.

8. Why is my job not starting?

First of all, do not panic if your job does not start immediately. Depending on the load on Hydra, the number of jobs you have submitted recently, and the resources requested in your job script, waiting times can range from hours to days. Usually, the load on Hydra is higher on weekdays and lower during weekends and holidays. Remember that the more resources you request, the longer you may have to wait in the queue.

To get an idea of the current load on Hydra, you can issue the command:

qstat -q

9. Why has my job crashed?

There are many possible causes for a job crash. Here are just a few of the more common ones:

  • Requested resources (memory or walltime) exceeded

    Check the last few lines of your job output file. If the used and requested memory or walltime are very similar, you have probably exceeded your requested resource. Increase the resource in your job script and resubmit.

  • Wrong module(s) loaded

    Check the syntax (case sensitive) and version of the modules. If you are using more than one module, check that they are compiled with the same toolchain.

  • Wrong input filetype

    Convert from dos to unix with the command:

    dos2unix <input_file>
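You can check whether a file actually has DOS (CRLF) line endings before converting it, using the standard `file` utility (replace ‘input_file’ with your file):

```shell
# 'file' reports "CRLF line terminators" for DOS-formatted text files:
file input_file
```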
    

10. How can I install additional software/packages?

You should first check whether the software or package you need is already available on Hydra, either as part of an existing module or as a separate module.

If the software or package you need is not available, we recommend that you kindly ask us to install it for you at hpc@vub.ac.be. This has several advantages:

  1. We optimize compilation for each architecture to make sure that your software/package runs efficiently (and usually much faster than when you install it yourself).
  2. The package will be available for everybody.
  3. The package will be built reproducibly with EasyBuild: important for scientific reproducibility.
  4. Different versions of the software can be installed alongside each other.

If you still want to install additional software/packages yourself, there are several resources available online.

11. How can I run MATLAB on Hydra?

MATLAB is available on Hydra as a module. First check which MATLAB versions are available:

module av matlab

Next, load a suitable version (preferably the most recent one), for example:

module load MATLAB/2017a

For very short calculations (<10 min), you can run MATLAB in console mode. For example, if your MATLAB script is called ‘testmatlab.m’, type:

matlab -nodisplay -r testmatlab

For longer running calculations, we highly recommend first compiling your script with the MATLAB compiler mcc:

mcc -m testmatlab.m

This will generate a ‘testmatlab’ binary, as well as a ‘run_testmatlab.sh’ shell script (and a few other files). You can ignore the ‘run_testmatlab.sh’ file.

Now you can submit your MATLAB calculation as a batch job. Your job script should look like this:

#!/bin/bash -l
#PBS -l walltime=01:00:00
#PBS -l mem=1gb

module load MCR/R2017a

cd $PBS_O_WORKDIR
./testmatlab >testmatlab.out 2>&1

The advantage of running a compiled MATLAB binary is that it does not require a license. We only have a limited number of MATLAB licenses that can be used at the same time, so this way you can run your simulation even if all licenses are in use.

More information on using the MATLAB compiler can be found here:

https://nl.mathworks.com/help/mps/ml_code/mcc.html

12. How can I use the GPUs to run my jobs?

We currently have three types of GPGPUs on Hydra:

  • 12x Tesla K20Xm (6GB) (6 nodes, 2 GPUs each): ivybridge, 20 cores, 128 GB RAM
  • 8x Tesla P100 (16GB) (4 nodes, 2 GPUs each): broadwell, 24 cores, 256 GB RAM
  • 4x GeForce 1080Ti (11GB) (1 node, 4 GPUs): broadwell, 32 cores, 512 GB RAM

If you want to run a GPU job, submit with the following PBS directives:

#PBS -q gpu
#PBS -l nodes=1:ppn=1:gpus=1:gpgpu

If you want to run your job on a specific GPU type, add one of the following features:

#PBS -l feature=kepler  (for the K20Xm)
#PBS -l feature=pascal  (for the P100)
#PBS -l feature=geforce (for the 1080Ti)

Note however that if you make more specific job requests you may have to wait longer in the queue.

Important: before you submit a GPU job, make sure that you use software that is optimized for running on GPUs with support for CUDA. You can check this by looking at the name of the module. If the module name contains one of the words CUDA, goolfc or gompic, the software is built with CUDA support.
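Putting the directives together, a minimal single-GPU job script could look like this. The module name is an example only; pick a CUDA-enabled module from `module av`, and replace ‘my_gpu_program’ with your own executable:

```shell
#!/bin/bash -l
#PBS -q gpu
#PBS -l nodes=1:ppn=1:gpus=1:gpgpu
#PBS -l feature=pascal
#PBS -l walltime=01:00:00
#PBS -l mem=4gb

# Load a CUDA-enabled module (example name; check 'module av'):
module load CUDA/9.1.85

cd $PBS_O_WORKDIR
./my_gpu_program
```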

13. How can I run a job longer than the maximum allowed wall time?

The maximum wall time for calculations on Hydra is 5 days. If your job requires a longer wall time, there are a few options to consider.

Long running calculations can often be split into restartable chunks that can be submitted one after the other. First check the documentation of the software itself for restarting options.

If the software does not support restarting, you can use checkpointing. Checkpointing means making a snapshot of the current state of your calculation and saving it to disk. The snapshot can then be used to restore the state and continue your calculation. Checkpointing on Hydra can be done conveniently with csub, a tool that automates the process of:

  1. halting the job
  2. checkpointing the job
  3. killing the job
  4. re-submitting the job to the queue
  5. reloading the checkpointed state into memory

For example, to submit the job script ‘myjob.pbs’ with checkpointing and re-submitting every 24 hours, type:

csub -s myjob.pbs --shared --job_time=24:0:0

This checkpointing and re-submitting cycle will be repeated until your calculation has completed.

Notes:

  • Checkpointing and reloading is done as part of the job and typically takes up to 15 minutes, depending on the amount of RAM in use. Thus, in the example above you should specify the following PBS directive in your job script:

    #PBS -l walltime=24:15:00
    
  • Job output and error files are written in your directory $VSC_SCRATCH/chkpt (along with checkpoint files and csub log files). Other output files created by your job may also be written there.

  • Internally, csub uses DMTCP (Distributed MultiThreaded CheckPointing). Users who want full control can also use DMTCP directly. Example launch/restart job scripts can be downloaded here:

  • csub/DMTCP is not yet tested with all installed software on Hydra. It has been successfully used with software written in Python, R, and with Gaussian. For more information on DMTCP-supported applications, see http://dmtcp.sourceforge.net/supportedApps.html

If you run into issues with checkpointing/restarting, please contact us at hpc@vub.ac.be.

14. How can I use GaussView?

GaussView is a graphical interface for the computational chemistry program Gaussian. If your SSH session is configured with X11 forwarding, you can use GaussView directly on Hydra after loading the module:

ml GaussView/6.0.16

However, using a graphical interface on Hydra is slow, so for regular use we recommend installing GaussView locally. Binary packages of GaussView are available for Linux, Mac, and Windows users and are provided upon request.

Installation of GaussView on Mac:

  1. Untar G16 and GaussView to /Applications (two new directories, g16 and gv, will be created)

  2. Create a file ~/Library/LaunchAgents/env.GAUSS_EXEDIR.plist and paste the following content into it:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
    <plist version="1.0">
    <dict>
    <key>Label</key>
    <string>env.GAUSS_EXEDIR</string>
    <key>ProgramArguments</key>
    <array>
    <string>launchctl</string>
    <string>setenv</string>
    <string>GAUSS_EXEDIR</string>
    <string>/Applications/g16/bsd:/Applications/g16</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    </dict>
    </plist>
    
  3. Issue the following command (only once) (or, alternatively, restart the machine):

    launchctl load ~/Library/LaunchAgents/env.GAUSS_EXEDIR.plist
    

Vega

(TODO)

VSC Tier-1

see the VSC website: https://www.vscentrum.be/en/user-portal/some-faqs

CECI Tier-1

see the CECI website: http://www.ceci-hpc.be/faq.html