Regor

regor is the name of the computing cluster hosted at the Observatoire. It is designed to perform intensive computation as well as interactive tasks.

Short but important information

Anyone with an account at the Observatoire may connect to the cluster via its master node regor (or regor2). However, before using it for the first time, please pay attention to the following points:

  • Please inform the system administrator by email.
  • Before running a job, move into /scratch or /SCRATCH (see regor File Systems for more details). Running jobs in your home directory is not a good idea: it may slow down your tasks and cause the file system to hang.
  • You can run short jobs directly on the master as long as they do not last more than a few minutes and do not consume a lot of memory. It is however recommended to submit your job to a queue, as explained below.
  • To run a longer job, you must submit it to the queue system. Interactive jobs, even those requiring the X facility, may be submitted to the queue.
  • Additional software and libraries are provided by the module environment (see below).

Login

Once in the Observatoire network, simply type in a terminal:

ssh regor

Note that you still have access to your home directory.
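
If you plan to run graphical programs through the queue (see srunx below), it helps to enable X forwarding already at login. A minimal sketch, using standard OpenSSH options:

    # -Y enables trusted X11 forwarding (-X is the untrusted variant)
    ssh -Y regor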

First steps on regor

Once you are logged in:

  1. move to your directory on /scratch or /SCRATCH (see regor File Systems for more details)

    cd /scratch/myusername
    

    if you don’t have any directory on /scratch or /SCRATCH, ask the system administrator.

  2. It is useful to load some additional tools using the module facility. First, check whether the dios module is already loaded:

    module list
    Currently Loaded Modulefiles:
      1) /dios
    

    If dios is not on the list, add it:

    module add dios
    

    To avoid typing this command at every login, the line may be added to your profile file (a minimal example is sketched after this list).

  3. To run a simple interactive job type:

    srunx
    

    This transparently logs you into a free compute node where you can work without interfering with anyone else.

  4. To submit jobs in a more specific way, please consult: regor queue system
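
If you use bash as your login shell (an assumption; adapt the file name for other shells), the dios module mentioned in step 2 can be loaded automatically by appending the command to your profile file:

    # load the dios module at every login by adding the command to your bash profile
    echo "module add dios" >> ~/.bashrc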

regor File Systems

Two large storage systems (directories) are provided on regor:

Serial File System

If you don’t really know what parallel means, this storage system is for you. It is dedicated to serial tasks, or to tasks that access a lot of small files (< 1 MB). The available space on this file system is currently 10 TB.

The path to this file system is

/scratch/myusername

where of course, myusername stands for your user name. If you don’t have any directory there, ask the system administrator.

Warning

This file system is considered as scratch space. Any loss due to software or hardware failure is not guaranteed to be recoverable.

Parallel File System

If you need to access very big files in parallel, this storage system is for you. Currently the available disk space is about 100 TB.

The path to this file system is

/SCRATCH/myusername

where of course, myusername stands for your user name. If you don’t have any directory there, ask the system administrator.

Warning

This file system is considered as scratch space. Any loss due to software or hardware failure is not guaranteed to be recoverable.
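
Before starting a large run, it can be useful to check how much space is left on the two scratch file systems and to prepare a working directory there; a minimal sketch (the directory name is only an example):

    # show the available space on both scratch file systems
    df -h /scratch /SCRATCH

    # create a working directory for a run
    mkdir -p /scratch/$USER/my_run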

Long-term storage

Once the computation is done and data have been produced, if the amount of data is large (> 100 GB), avoid storing it in your home directory. The /archive directory is more appropriate for that. Please ask the system administrator for access to this directory.
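
Once access has been granted, results can be copied from scratch to the archive with rsync; a sketch, assuming your archive directory is /archive/myusername (the exact path may differ):

    # copy a results directory from the parallel scratch to the archive
    rsync -av /SCRATCH/myusername/my_results /archive/myusername/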

How to copy files into /scratch or /SCRATCH from outside the observatory?

For security reasons, /scratch and /SCRATCH are accessible only from inside the cluster. However, there are plenty of good reasons to access them from a remote computer, i.e., a computer located outside the observatory. In this case, use the ssh tunnel facility (an alternative based on an SSH jump-host configuration is sketched after the list):

  1. from the remote host (first terminal):

    ssh -L 4022:regor:22 your_user_name@login01.astro.unige.ch -N
    
  2. using scp : from the remote host (second terminal):

    scp   -P 4022  my_files_*  your_user_name@localhost:/SCRATCH/your_user_name/where_you_want_to_copy_these_files/.
    
  3. using rsync : from the remote host (second terminal):

    rsync -av -e 'ssh -p 4022'  my_files_*  your_user_name@localhost:/SCRATCH/your_user_name/where_you_want_to_copy_these_files/.
    
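As an alternative to the manual tunnel, recent OpenSSH clients support the ProxyJump directive, which lets scp and rsync reach regor through login01 transparently. A minimal ~/.ssh/config sketch on the remote host, reusing the host names from the example above:

    # ~/.ssh/config on the remote host
    Host regor-ext
        HostName regor
        User your_user_name
        ProxyJump your_user_name@login01.astro.unige.ch

With this entry in place, the copy commands above reduce to, for example:

    scp my_files_* regor-ext:/SCRATCH/your_user_name/where_you_want_to_copy_these_files/.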

regor queue system

A queue is a group of computing servers (nodes) that share similar properties and are intended for a specific type of use.

List of queues

Queue name | Cores (per node)                    | Memory (per node) | Interconnect | Comments                     | # of nodes | # of cores
r3         | 12 AMD Opteron (2432, 2.4GHz)       | 32-6GB            | IB 40Gb/s    | serial jobs or parallel jobs | 12         | 144
r4         | 16 Intel Xeon (E5-2640 v3, 2.6GHz)  | 128GB (2133 MHz)  | IB 56Gb/s    | parallel jobs preferred      | 32         | 512
bm         | 16 AMD Opteron (6134, 2.4GHz)       | 64GB (1066 MHz)   | IB 40Gb/s    | 64 GB memory                 | 2          | 32
gpu        | 8 Intel Xeon (E5-620, 2.40GHz)      | 48GB (1333 MHz)   | IB 40Gb/s    | 2x nVidia Tesla M2070        | 2          | 16
daceg      | 12 Intel Xeon (X5660, 2.80GHz)      | 48GB (1333 MHz)   | IB 40Gb/s    | Planet group + GPUs          | 2          | 24
phi        | 16 Intel Xeon (E5-2650 v2, 2.60GHz) | 128GB (1600 MHz)  | IB 40Gb/s    | Xeon Phi accelerator         | 2          | 32

Submit a job on the queue

regor uses a queue system called Slurm. General user documentation may be found here.

An extensive list of examples on how to compile and submit specific jobs on regor is given in the following directory:

/dios/shared/examples

Each subdirectory contains a specific example (serial jobs, parallel jobs with MPI, parallel jobs with OpenMP, CUDA jobs, Phi jobs, etc.). They are summarized in the following table; a short usage sketch follows the table.

directory    | comments                           | suggested queue
serial       | simple serial job                  | bm, r3
mpi          | parallel using the MPI library     | r3, r4, phi
openmp       | shared-memory parallel programming | r3, r4, bm
py_multiproc | python multiprocessing             | r3, r4, bm
cuda         | NVIDIA parallel programming model  | gpu
phi          | Xeon Phi accelerator               | phi
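
A typical way to try one of these examples, assuming the chosen subdirectory contains a submission script named job.s (the actual file name may differ), is to copy it to your scratch directory and submit it from there:

    # copy the serial example to scratch and submit it
    cp -r /dios/shared/examples/serial /scratch/$USER/
    cd /scratch/$USER/serial
    sbatch job.s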

Serial interactive job

This is the simplest way to use the queue system if you need only one CPU. Simply type:

srunx

This transparently logs you into a free compute node where you can work without interfering with anyone else. X windows are automatically forwarded. Note that the CPU is booked for 24 hours; your job will be automatically killed afterwards.
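
If you need a limit other than the default 24 hours, the underlying srun command accepts the standard Slurm --time option (whether longer limits are accepted depends on the partition configuration, so treat this as a sketch):

    # interactive shell on the r3 partition with an explicit 12-hour limit
    srun -p r3 --time=12:00:00 --pty $SHELL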

Serial job

Check the status of the queues and submit an interactive job; a few additional status commands are sketched after this list.

  • get the status of all queues:

    sinfo
    
  • get the status of a specific queue:

    sinfo -p r3
    
  • get the list of running jobs:

    squeue -p r3
    
  • get a processor allocated interactively:

    srun -p r3 --pty  $SHELL
    

    or with the X11 facilities:

    srun -p r3 --x11=all -n1 -N1 --pty --preserve-env --mpi=none /bin/bash
    
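A couple of further status commands (standard Slurm commands, not specific to regor) that are often useful:

    # list only your own jobs
    squeue -u $USER

    # show the details of a given job (replace 12345 by the job id)
    scontrol show job 12345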

Submit a job using a script

  1. prepare a batch script (called for example job.s) which contains:

    #!/bin/bash
    #SBATCH -p r3 -o my.stdout -e my.stderr --mail-user=prenom.nom@epfl.ch --mail-type=ALL
    # run the simulation
    python my.py
    

    -o defines the standard output file, -e defines the standard error file. A slightly more complete script is sketched after this list.

  2. submit it to the queue:

    sbatch job.s
    
  3. cancel a job:

    scancel job_id
    
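The script above contains only the minimal options. The following sketch adds a few commonly used #SBATCH directives (job name, number of tasks, time limit); the values are placeholders to adapt:

    #!/bin/bash
    # partition (queue), job name, number of tasks and wall-time limit
    #SBATCH -p r3
    #SBATCH -J my_job
    #SBATCH -n 1
    #SBATCH --time=02:00:00
    # standard output / error files and mail notifications
    #SBATCH -o my.stdout -e my.stderr
    #SBATCH --mail-user=prenom.nom@epfl.ch --mail-type=ALL
    # run the simulation
    python my.py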

Parallel jobs

Check the queue

  • get the status of the queue:

    sinfo -p r3
    
  • get the list of running jobs:

    squeue -p r3
    

Submit a batch job running your MPI code

  1. prepare a batch script (called for example job.s) which contains:

    #!/bin/bash
    #SBATCH -p r3 -o my.stdout -e my.stderr --mail-user=prenom.nom@epfl.ch --mail-type=ALL
    # run the simulation
    mpirun your_mpi_executable
    
    -o defines the standard output file, -e defines the standard error file. A sketch requesting several MPI tasks is given after this list.

  2. submit it to the queue:

    sbatch job.s
    
  3. cancel a job:

    scancel job_id
    
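The script above does not specify how many MPI tasks to run. A sketch that explicitly requests a number of tasks with the standard Slurm -n/--ntasks option (the value 24 is only an example and should match the resources of the chosen queue):

    #!/bin/bash
    # request 24 MPI tasks on the r3 partition
    #SBATCH -p r3 -n 24
    #SBATCH -o my.stdout -e my.stderr
    #SBATCH --mail-user=prenom.nom@epfl.ch --mail-type=ALL
    # run the simulation; mpirun normally picks up the allocation from Slurm
    mpirun your_mpi_executable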

Login without password

Auto-login configuration is described here: Setup ssh for auto login

Additional software

Additional software is provided by the module environment: click here
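
The most common module commands (standard commands of the environment-modules system; the module names actually available on regor may differ) are:

    # list the modules currently loaded
    module list

    # list all modules available on the cluster
    module avail

    # load and unload a module
    module add dios
    module rm dios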

Frequently Asked Questions (FAQ)

  • Question: How can I copy files from my home to the regor2 /scratch directory?

    Answer: Use rsync. Assuming you are in your home directory:

    rsync -av my_file_on_my_home.dat regor2:/scratch/my_username/.
    
  • Question: How can I copy files from the regor2 /scratch directory to my home?

    Answer: Use rsync. Assuming you are in your home directory:

    rsync -av regor2:/scratch/my_username/my_file_on_regor2.dat .
    
  • Question: How can I choose the node I run on when using srunx?

    Answer: srunx is an alias for the command:

    srun -p r3 --x11=all -n1 -N1 --pty --preserve-env --mpi=none /bin/bash
    

    If you want to specify the node, add the option -w, for example:

    srun -p r3 --x11=all -n1 -N1 --pty --preserve-env --mpi=none -w node081 /bin/bash
    

    will send you to the node node081.

  • Question: Sometimes I get the following message when using srunx, and the display doesn’t work:

    user@regor2:~$ srunx
    srun: error: x11: unable to connect node node078
    user@node078:~$ nedit
    NEdit: Can't open display
    

    Answer: type:

    ssh-keygen -R  node078
    

    and try srunx again.