Regor¶
regor is the name of the computing cluster hosted at the Observatoire. It is designed to perform intensive computation as well as interactive tasks.
Short but important information¶
Anyone having an account at the Observatoire may connect to the cluster via its master node regor (or regor2). However, before using it for the first time, it is important to pay attention to the following points:
- Please inform the system administrator by e-mail.
- Before running a job, it is important to move either into /scratch or /SCRATCH (see regor File Systems for more details). Running jobs in your home directory is not a good idea: it may slow down your tasks and cause the file system to hang.
- You can run short jobs directly on the master, as long as the jobs do not last more than a few minutes and do not consume a lot of memory. It is however recommended to submit your job to a queue, as explained below.
- To run a longer job, you must submit it to the queue system. Interactive jobs, even those requiring the X facility, may be submitted to the queue.
- Additional software and libraries are provided by the module environment (see below).
Login¶
Once in the Observatoire network, simply type in a terminal:
ssh regor
Note that you still have access to your home directory.
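If your user name on the cluster differs from the one on your local machine, you can record it once in your ~/.ssh/config instead of typing it every time. A minimal sketch, assuming your cluster account is myusername (standard OpenSSH configuration, nothing specific to regor):

Host regor
    HostName regor
    User myusername

After that, ssh regor logs you in with the correct account.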
First steps on regor¶
Once you are logged in:
- Move to your directory on /scratch or /SCRATCH (see regor File Systems for more details):
cd /scratch/myusername
If you don't have any directory on /scratch or /SCRATCH, ask the system administrator.
- It is useful to load some additional tools using the module facility. First, check if you have the module dios already loaded:
module list
Currently Loaded Modulefiles:
1) /dios
If dios is not on the list, add it:
module add dios
To avoid typing this at every login, the line may be included into your profile file.
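For example, assuming your shell reads ~/.profile at login (which file is read depends on your shell setup), appending the line once is enough (a minimal sketch):

echo 'module add dios' >> ~/.profile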
To run a simple interactive job type:
srunx
This transparently logs you in to a free computer where you can work without being trampled by someone else.
To submit jobs in a more specific way, please consult: regor queue system
regor File Systems¶
Two large storage systems (directories) are provided on regor:
Serial File System¶
If you don't really know what parallel means, this storage system is for you. This file system is dedicated to serial tasks, or tasks that access a lot of small files (< 1 MB). Currently, the available space on this file system is 10 TB.
The path to this file system is /scratch/myusername where, of course, myusername stands for your user name.
If you don’t have any directory there, ask the system administrator.
Warning
This file system is considered as scratch. Data lost due to software or hardware failure are not guaranteed to be recoverable.
Parallel File System¶
If you need to access very big files in parallel, this storage system is for you. Currently the disk space is about 100 TB.
The path to this file system is /SCRATCH/myusername where, of course, myusername stands for your user name.
If you don’t have any directory there, ask the system administrator.
Warning
This file system is considered as scratch. Data lost due to software or hardware failure are not guaranteed to be recoverable.
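Before launching a production run it can be useful to check how much space is left on either file system; standard disk tools work here (a minimal sketch using the paths documented above):

# free space on the two scratch file systems
df -h /scratch /SCRATCH
# size of your own scratch directory
du -sh /scratch/myusername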
Long-term storage¶
Once the computation is done and data have been produced, if the amount of data is big (> 100 GB), it is better not to store them in your home. We have an /archive directory that is more appropriate for that.
Please ask the system administrator to get an access to this directory.
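Once access has been granted, a transfer could look like the sketch below. Note that the exact layout under /archive (here assumed to be /archive/myusername) is an assumption; the system administrator will tell you your actual path.

# copy finished results from scratch to long-term storage
# (/archive/myusername is an assumed path; ask the administrator for yours)
rsync -av /scratch/myusername/myrun/ /archive/myusername/myrun/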
How to copy files into /scratch or /SCRATCH from outside the observatory?¶
For security reasons, /scratch and /SCRATCH are accessible only from inside the cluster. However, there are plenty of good reasons to access them from a remote computer, i.e., a computer situated outside the observatory. In this case, use the ssh tunnel facility:
- From the remote host (first terminal), open the tunnel:
ssh -L 4022:regor:22 your_user_name@login01.astro.unige.ch -N
- Using scp, from the remote host (second terminal):
scp -P 4022 my_files_* your_user_name@localhost:/SCRATCH/your_user_name/where_you_want_to_copy_these_files/.
- Using rsync, from the remote host (second terminal):
rsync -av -e 'ssh -p 4022' my_files_* your_user_name@localhost:/SCRATCH/your_user_name/where_you_want_to_copy_these_files/.
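To check that the tunnel works before transferring large files, you can open an interactive session through it from the second terminal (standard ssh usage; the tunnel from the first terminal must still be running):

ssh -p 4022 your_user_name@localhost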
regor queue system¶
A queue is a group of computing servers (nodes) that share similar properties and are designed for a specific kind of use.
List of queues¶
Queue name | Cores (per node) | Memory (per node) | Interconnect | Comments | # of nodes | # of cores |
---|---|---|---|---|---|---|
r3 | 12 AMD Opteron (2432, 2.4GHz) | 32-64GB | IB 40Gb/s | serial jobs or parallel jobs | 12 | 144 |
r4 | 16 Intel Xeon (E5-2640 v3, 2.6GHz) | 128GB (2133 MHz) | IB 56Gb/s | parallel jobs preferred | 32 | 512 |
bm | 16 AMD Opteron (6134, 2.4GHz) | 64GB (1066 MHz) | IB 40Gb/s | 64 GB memory | 2 | 32 |
gpu | 8 Intel Xeon (E5-620, 2.40GHz) | 48GB (1333 MHz) | IB 40Gb/s | 2x nVidia Tesla M2070 | 2 | 16 |
daceg | 12 Intel Xeon (X5660, 2.80GHz) | 48GB (1333 MHz) | IB 40Gb/s | Planet group + GPUs | 2 | 24 |
phi | 16 Intel Xeon (E5-2650 v2, 2.60GHz) | 128GB (1600 MHz) | IB 40Gb/s | Xeon Phi accelerator | 2 | 32 |
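The queue is selected with the -p option of srun or sbatch. For example, following the same pattern as used for r3 below, an interactive shell on the r4 queue could be requested like this (a sketch; whether a node is free depends on the current load):

srun -p r4 --pty $SHELL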
Submit a job on the queue¶
regor uses a queue system called slurm.
General documentation for users may be found here.
An extensive list of examples on how to compile and submit specific jobs on regor is given in the following directory:
/dios/shared/examples
Each subdirectory contains a particular example (serial jobs, parallel jobs with MPI, parallel jobs with OpenMP, CUDA jobs, Phi jobs, etc.). They are summarized in the following table; a copy-and-run sketch follows the table.
Directory | Comments | Suggested queue |
---|---|---|
serial | simple serial job | bm, r3 |
mpi | parallel using the MPI library | r3, r4, phi |
openmp | shared-memory parallel programming | r3, r4, bm |
py_multiproc | Python multiprocessing | r3, r4, bm |
cuda | NVIDIA parallel programming model | gpu |
phi | Xeon Phi accelerator | phi |
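To try one of these examples, a simple approach is to copy its directory to your scratch space and submit it from there. A minimal sketch, assuming the mpi example ships a submission script named job.s (check the actual file names in the subdirectory):

# copy the MPI example to your scratch directory and submit it
cp -r /dios/shared/examples/mpi /scratch/myusername/
cd /scratch/myusername/mpi
sbatch job.s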
Serial interactive job¶
This is the simplest way to use the queue system, if you need only one CPU. Simply type:
srunx
This transparently logs you in to a free compute node where you can work without being trampled by someone else. X windows are automatically forwarded. Note that the CPU is booked for 24 hours; your job will be automatically killed afterwards.
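To see how long your interactive job has been running (and thus how much of the 24-hour booking is left), you can list your own jobs from another terminal (standard slurm command):

# list your jobs; the TIME column shows the elapsed time
squeue -u $USER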
Serial job¶
Check the status of the queue and submit an interactive job¶
get the status of all queues:
sinfo
get the status of a specific queue:
sinfo -p r3
get the list of running jobs:
squeue -p r3
get a processor allocated interactively:
srun -p r3 --pty $SHELL
or with the X11 facilities:
srun -p r3 --x11=all -n1 -N1 --pty --preserve-env --mpi=none /bin/bash
Submit a job using a script¶
prepare a batch script (called for example job.s) which contains:
#!/bin/bash
#SBATCH -p r3 -o my.stdout -e my.stderr --mail-user=prenom.nom@epfl.ch --mail-type=ALL

# run the simulation
python my.py
-o defines the standard output, -e defines the standard error output
send it to the queue:
sbatch job.s
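sbatch prints the id of the submitted job; standard slurm commands let you follow it (the job id 1234 below is a placeholder):

# list your queued and running jobs
squeue -u $USER
# detailed information on a single job
scontrol show job 1234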
cancel a job:
scancel job_id
Parallel jobs¶
Check the queue¶
get the status of the queue:
sinfo -p r3
get the list of running jobs:
squeue -p r3
Submit a batch job running your MPI code¶
prepare a batch script (called for example job.s) which contains:
#!/bin/bash
#SBATCH -p r3 -o my.stdout -e my.stderr --mail-user=prenom.nom@epfl.ch --mail-type=ALL

# run the simulation
mpirun your_mpi_executable
-o defines the standard output, -e defines the standard error output
send it to the queue:
sbatch job.s
cancel a job:
scancel job_id
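The script above lets slurm choose the resources. To size a parallel job explicitly, resource directives can be added to the batch script. A sketch for two full r3 nodes (2 x 12 cores, matching the table above; the executable name is a placeholder):

#!/bin/bash
#SBATCH -p r3 -o my.stdout -e my.stderr
#SBATCH -N 2                    # two nodes
#SBATCH --ntasks-per-node=12    # one MPI task per core on r3

# run the simulation on 24 cores
mpirun your_mpi_executable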
Login without password¶
Auto-login configuration is described here: Setup ssh for auto login
Additional software¶
Additional software is provided by the module environment: click here
Frequently Asked Questions (FAQ)¶
Question: How can I copy files from my home to the regor2 /scratch directory?
Answer: Using rsync. Assuming you are in your home directory:
rsync -av my_file_on_my_home.dat regor2:/scratch/my_username/.
Question: How can I copy files from the regor2 /scratch directory to my home?
Answer: Using rsync. Assuming you are in your home directory:
rsync -av regor2:/scratch/my_username/my_file_on_regor2.dat .
Question: How can I set the node I want to run on when using srunx?
Answer: srunx is an alias for the command:
srun -p r3 --x11=all -n1 -N1 --pty --preserve-env --mpi=none /bin/bash
If you want to specify the node, you need to add the option -w, for example:
srun -p r3 --x11=all -n1 -N1 --pty --preserve-env --mpi=none -w node081 /bin/bash
will send you to the node node081.
Question: Sometimes I get the following message when using srunx, and the display doesn't work:
user@regor2:~$ srunx
srun: error: x11: unable to connect node node078
user@node078:~$ nedit
NEdit: Can't open display
Answer: type:
ssh-keygen -R node078
and try srunx again.
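ssh-keygen -R removes the stale host key for the named node from your ~/.ssh/known_hosts, which is typically what prevents the X11 forwarding from being set up. The same fix applies to any node; just replace node078 with the node named in the error message, for example:

ssh-keygen -R node081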