BAZIS HPC

The BAZIS is a compute cluster for research at the Vrije Universiteit Amsterdam. It provides a service between the general facilties at the SURF HPC center and the desktop. It is a heterogenious system composed of clusters and servers from research departments and a general partition.

In this topic you will find information to get you started and best practices. Clustercomputing can be very powerfull and usefull skill to add to your toolbox and get more science done.

Also take a look at the SURF wiki about Snellius, it contains a lot of information which applies to Bazis as well.

Information and how to get an account can be found on VU Service Portal under IT > Research > HPC Cluster Computing

BAZIS is maintained by IT for Research.

Connect

ssh

Use your favourite SSH client to login at <your login>@bazis.labs.vu.nl. On windows we recommend MobaXterm or MS Terminal; Mac users can use iterm. Direct access is only possible from the Campus or from SURF. From other network locations first connect to the stepstone <vu net id>@ssh.data.vu.nl, or use eduVPN Institutional Access (below).

eduVPN

Install the client. Start eduVPN and choose Institute Access to connect. You will need MFA to authenticate. Students can activate MFA at the servicedesk ( kb-item 11809 )

Data transfer

MobaXterm has an integrated FTP file browser. Once you have logged in to the HPC system, you will see the file browser to the left of the terminal window, where it shows the contents of your home folder. You can browse through these folders, and drag-and-drop files and folders between this FTP file browser and the Windows File Explorer. Alternatively, you can use the download/upload buttons at the top of the FTP file browser window. A green refresh button is also located there to refresh the contents of the current folder. You can also open files in the FTP file browser to edit them directly. Upon saving, you’ll be asked if you want change these files on the HPC system.

Other SFTP browsers There are a large number of free FTP browser out there. Some examples are

Filezilla (Windows, MacOS, Linux)
Cyberduck (Windows, MacOS)
WinSCP (Windows)

Advanced: Using SSH keys

SSH keys are an alternative method for authentication to obtain access to remote computing systems. They can also be used for authentication when transferring files or for accessing version control systems like github.

The cluster uses ssh keys to manage batch jobs.

On your workstation create ssh key pair ssh-keygen -t ed25519 -a 100

-a (default is 16): number of rounds of passphrase derivation; increase to slow down brute force attacks.
-t (default is rsa): specify the cryptographic algorithm. ed25519 is faster and shorter than RSA for comparable strength.
-f (default is /home/user/.ssh/id_algorithm): filename to store your keys. If you already have SSH keys, make sure you specify a different name: ssh-keygen will overwrite the default key if you don’t specify!

If ed25519 is not available, use the older (but strong and trusted) RSA cryptography: ssh-keygen -a 100 -t rsa -b 4096

When prompted, enter a strong password that you will remember.

Note: on windows you can use MobaKeyGen from MobaXterm, but on Windows 11 Powershell or Command Prompt works as well.

In your ~/.ssh directory you will find a public and private key. Make sure to keep the private key safe as anyone with the private key has access.

Now, when you add your public key to the ~/.ssh/authorized_keys file in a remote system, your key will be used to login.

You can either use copy-paste or the ssh-copy-id command:

$ ssh-copy-id user@remote-host
The authenticity of host 'remote-host (192.168.111.135)' can't be established.
ECDSA key fingerprint is SHA256:hXGpY0ALjXvDUDF1cDs2N8WRO9SuJZ/lfq+9q99BPV0.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 2 key(s) remain to be installed -- if you are prompted now it is to install the new keys
user@remote-host's password:

Number of key(s) added: 2

Now try logging into the machine, with: “ssh ‘user@remote-host’” and check to make sure that only the key(s) you wanted were added.

Advanced: Forwarding X

Fix Warning: Remote Host Identification Has Changed

If you are sure that it is harmless and the remote host key has been changed in a legitimate way, you can skip the host key checking by sending the key to a null known_hosts file:

ssh -o "UserKnownHostsFile=/dev/null" -o "StrictHostKeyChecking=no" username@bazis.labs.vu.nl

To make the change permanent remove the offending host key from your ~/.ssh/known_hosts file with:

ssh-keygen -R "hostname"

References:

Using Slurm

Usefull commands

See jobs in the queue for a given user

   squeue -u username

Show available node features

   sinfo -o "%20N  %10c  %10m  %25f  %10G "

Submit a job

   sbatch script

Show the status of a currently running job

   sstat -j jobID

Show the final status of a finished job

   sacct -j jobID

Cancel a job

   scancel jobid

Best practices

Organise your input, output and temporary data. Make use of fast scratch directory ($TMPDIR).

Don’t run large computation on the login nodes! It negatively impacts all cluster users. Grab a compute node with srun –pty bash option.

Constraints

The SLURM constraint option allows for further control over which nodes your job can be scheduled on in a particular parition/queue. You may require a specific processor family or network interconnect. The features that can be used with the sbatch constraint option are defined by the system administrator and thus vary among HPC sites.

Constraints available on BAZIS are cpu architecture and gpu. Example (singole constraint):

#SBATCH --constraint=zen2

Example combining constraints:

#SBATCH --constraint="zen2|haswell"

Computer architecture

The parts of a modern computer we need to understand to apply to running jobs are listed here. (Note: This is way oversimplified and intended to give a basic overview for the purposes of understanding how to request resources from Slurm, there are a lot of resources out there to dig deeper into computer architecture.)

Board

A physical motherboard which contains one or more of each of Socket, Memory bus and PCI bus.

Socket

A physical socket on a motherboard which accepts a physical CPU part.

CPU

A physical part that is plugged into a socket.

Core

A physical CPU core, one of many possible cores, that are part of a CPU.

HyperThread

A virtual CPU thread, associated with a specific Core. This can be enabled or disabled on a system. On BAZIS hyperthreading is typically enabled. Compute intensive workloads will benefit to disable hyperthreading.

Memory Bus

A communication bus between system memory and a Socket/CPU.

PCI Bus

A communication bus between a Socket/CPU and I/O controllers (disks, networking, graphics,…) in the server.

Slurm complicates this, however, by using the terms core and cpu interchangeably depending on the context and Slurm command. –cpus-per-taks= for example is actually specifying the number of cores per task.

Slurm example jobs

simple job

    #!/bin/bash -l
    #SBATCH -J MyTestJob
    #SBATCH -N 1
    #SBATCH -p defq
    
    echo "== Starting run at $(date)"
    echo "== Job ID: ${SLURM_JOBID}"
    echo "== Node list: ${SLURM_NODELIST}"
    echo "== Submit dir. : ${SLURM_SUBMIT_DIR}"
    echo "== Scratch dir. : ${TMPDIR}"
    
    cd $TMPDIR
    # Your more useful application can be started below!
    hostname

Workspace

scratch tmpdir

Each slurm job will have a fast scratch dir allocated on the nodes which is deleted after finishing the job. use the $TMPDIR virable to use this space for example to store intermediate results or work on many files.

scratch-shared (experimental)

Use the workspace tools on the beegfs parallel filesystem to create a project space. The project space is a directory, with an associated expiration date, created on behalf of a user, to prevent disks from uncontrolled filling. The beegeefs parallel filesystem is faster than your usual NFS home space, but not backed up, so ideal for data which is easily recreated.

Your project space lives in the filesystem under: /scratch-shared/ws

The project space is managed with the hpc-workspace tooling You can add them to your environment with: module load hpc-workspace

Example: setup a workspace “MyData” in a batchjob for 10 days.

SCR=$(ws_allocate MyData 10)
cd $SCR

Check your workspaces

$ ws_list 
id: MyData
     workspace directory  : /scratch-shared/ws/username-MyData
     remaining time       : 9 days 23 hours
     creation time        : Wed Mar 13 23:51:57 2013
     expiration date      : Sat Mar 23 23:51:57 2013
     available extensions : 15

Release the project space with

ws_release MyData

For user guide see https://github.com/holgerBerger/hpc-workspace/blob/master/user-guide.md

Python virtual environments

Python has many powerfull packages. In scientific computing many packages may be used in a single project. To manage many python packages often a package manager as conda is used.

On a HPC system we do not prefer conda as it does not use optimised binaries and the cache can take up a lot of space, but we understand it is usefull in some cases and try to help.

Working with virtual environments further makes the python environment better to manage

A virtual environment is a named, isolated, working copy of Python that that maintains its own files,
directories, and paths so that you can work with specific versions of libraries or Python itself without affecting other Python projects.
Virtual environmets make it easy to cleanly separate different projects and avoid problems with different dependencies and version requirements across components.

In short:

use virtualenv (preferred) or conda
create an isolated environment
Install packages
Activate a virtual environment
Deactivate a virtual environment
Delete a virtual environment

Adding a requirements file

Python requirements files are a great way to keep track of the Python modules. It is a simple text file that saves a list of the modules and packages required by your project. By creating a Python requirements.txt file, you save yourself the hassle of having to track down and install all of the required modules manually.

A reuirements file is a simple text file, which looks like this.

tensorflow==2.3.1
uvicorn==0.12.2
fastapi==0.63.0

Installing modules from a requirements file is easy as.

pip install -r requirements.txt

A requirements file can also be generated with:

pip freeze > requirements.txt

See the article referenced below for more information.

Portable scripts

The first line in a script usually starts the interpreter and is called the Shebang. It is recommended to use /usr/bin/env, which can interpret your $PATH. This makes scripts more portable than hard coded paths..

#!/usr/local/bin/python
Will only run your script if python is installed in /usr/local/bin.

#!/usr/bin/env python
Will interpret your $PATH, and find python in any directory in your $PATH.

So your script is more portable, and will work without modification on systems where python is installed as /usr/bin/python, or /usr/local/bin/python, or even custom directories (that have been added to $PATH), like /opt/local/bin/python.

R environment

R has many powerfull scientific packages and a strong community. Installing and maintaining packages for R can be hard. On BAZIS the Bioconductor suite is installed and can be loaded with the appropiate module environment.

When first running R on a Cluster some changes in the workflow are required making the transition from working interactively from a terminal to scripts in batchmode.

Errors

Pretty much all the time we get errors. Errors can be simple e.g.syntax error, R/python version error or more complex e.g. a problem in our data. In either case, please pay attention to what the error says carefully, because often the solution is in that message or at least it is the starting point of the solution while debugging your code. If it is an error you have not seen before, simply google it. Often you will find a solution in websites like stackoverflow.

Tips and Caveats

Producing PNG graphics and X11 related plotting errors

The png() default device used the X11 driver, which is not avaialble in batch mode or remote operation. Adding the type=“cairo” option to your code solves this issue.

Example:

pdf(file = "testR.pdf", width = 4, height = 4)
plot(x = 1:10, y = 1:10)
abline(v = 0 )
text(x=0, y=1, labels = "random text")
dev.off()

png(file = "testR.png", type="cairo", width = 4, height = 4)
plot(x = 1:10, y = 1:10)
abline(v = 0 )
text(x=0, y=1, labels = "random text")
dev.off()

References

How do I produce PNG graphics in batch mode?

Matlab

Matlab has several features to work in batch mode on a HPC cluster. Assuming you know how to create matlab scripts we start simply by executing matlab interactively on a compute node

Interactive

Request resources (1 node, 1 cpu) in a partition

srun -N 1 -p defq --pty /bin/bash
module load matlab/R2023a
cd your/data/

Here is an example of a trivial MATLAB script (hello_world.m):

fprintf('Hello world.\n')

Run with matlab using only one computational thread.

$ matlab -nodisplay -singleCompThread -r hello_world
Hello world.
>>

Matlab waits at the end of the script if there is no exit. In an compute job this would keep the job running untill the wallclocklimit so we add an exit at the end. The convenient “-batch” option combines these options.

-batch MATLAB_command   - Start MATLAB and execute the MATLAB command(s) with no desktop
                              and certain interactive capabilities disabled. Terminates
                              upon successful completion of the command and returns exit
                              code 0. Upon failure, MATLAB terminates with a non-zero exit.
                              Cannot be combined with -r.

matlab -batch hello_world

Batch mode

Combining this in a slurm script we can queue matlab workloads.

    #!/bin/bash -l
    #SBATCH -J MyMatlab
    #SBATCH -N 1
    #SBATCH --cpus-per-task=1
    #SBATCH -p defq   
    #SBATCH --output=%x_%j.out
    #SBATCH --error=%x_%j.err
    #SBATCH --mail-type=END,FAIL
    #SBATCH --mail-user=<YOUR EMAIL>
    
    # Note: for parallel operations increase cpus-per-task above
    # Note 2: output and error logs can be given absolute paths 
    
    echo "== Starting run at $(date)"
    echo "== Job ID: ${SLURM_JOBID}"
    echo "== Node list: ${SLURM_NODELIST}"
    echo "== Submit dir. : ${SLURM_SUBMIT_DIR}"
    echo "== Scratch dir. : ${TMPDIR}"
    
    # cd $TMPDIR
    # or change to a project folder with matlab file e.g. hello_World.m
    # cd your/data

    # Load matlab module
    module load 2022 matlab/R2023a
    
    # execute
    matlab -batch hello_world

Parpool

Reading

mathworks-parpool

Connect to VU Yoda/iRODS with icommands

iRODS/Yoda is a middleware and webinterface to store enriched data. On BAZIS the icommands are available to allow direct access.

To use the icommands a configuration file in your home folder is required. https://yoda.vu.nl/site/getting-started/icommands.html#configuration

After configuration use iinit to connect, use a Data Access Password, similar to a WebDAV connection.

Find an overview of the icommands here: https://docs.irods.org/master/icommands/user/ Check the ipwd, ils, icd, iput, iget en irsync commands for navigation and data transfer.

Your project folder is located at /vu/home

icommands Yoda-VU