Usage rules
Garnatxa is a multiuser system that shares resources among its users. While much effort is made to ensure that you can do your work in relative isolation, some rules must be followed to avoid interfering with other users' work.
Garnatxa is an HPC cluster, which means that you must not use it like your personal computer or workstation. The resources in Garnatxa are expensive and optimized for running resource-intensive tasks. If you only have to run a small set of jobs that already run well on your workstation, then you probably don't need Garnatxa.
There is no backup of the data stored on the cluster. Any accidentally removed file is lost forever. It is the user's responsibility to keep a copy of the contents of their personal storage space in a safe place. We provide a tape library to archive data. Contact us if you need to move data from Garnatxa to a storage tape.
Avoid performing computations on the login nodes. Once you've logged in, you must either submit a batch processing script or start an interactive session (see description below). Any significant processing (high memory requirements, long running time, etc.) attempted on the login nodes may be killed. Also avoid using screen, tmux or similar programs; open an interactive session or submit an sbatch job to do the same task instead.
Garnatxa uses a distributed file system called CEPH, which provides a replicated global storage system. This means that you can perform the same kind of file operations as on your laptop or desktop computer. However, you must avoid running certain commands over large sets of files (more than about 10,000 files per directory) because they produce a heavy load on the system. So avoid running commands like find, ls or du on such extensive directories. Use du_ or checkdiskspace instead of du.
Remove unused or obsolete data from the storage system. The storage space available in Garnatxa is limited and should be used with tight control. We will warn you if we detect any improper use. Remember that you can use the tape library to move data out of your account.
Do not use Nautilus or similar file browsers to mount remote disk volumes from Garnatxa (it produces a heavy load on the storage system).
Use the scp or rsync (recommended) commands to transfer data between your personal computer and Garnatxa.
SSH sessions in Garnatxa are closed after 8 hours of inactivity (you can avoid this by submitting interactive sessions with the memory, CPU and time you require; we describe how to do it below).
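For example, to copy a local directory to your home directory in Garnatxa with rsync, you could run the following from your own computer (the hostname and paths are only illustrative; adapt them to your account):
rsync -avP my_dataset/ USERNAME@garnatxa.uv.es:/home/USERNAME/my_dataset/
The -a flag preserves permissions and timestamps, -v prints what is being transferred, and -P shows progress and lets interrupted transfers be resumed.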
Attention
Your code must be parallelized if you want to achieve high performance. IF YOUR JOB IS NOT PARALLELIZED, THEN NO MATTER HOW MUCH CPU OR MEMORY YOU REQUEST FROM THE RESOURCE MANAGER, YOUR JOB WILL KEEP RUNNING WITH THE SAME PERFORMANCE.
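For example, if your program is multithreaded, request only as many CPUs as it can actually use and pass that number to the program. A minimal sketch (my_tool and its --threads option are hypothetical placeholders for your own program):
#!/bin/bash
#SBATCH -c 8                 # Request 8 cores only because the program can really use 8 threads
#SBATCH --mem=16G            # Memory the job actually needs
#SBATCH -t 04:00:00
srun my_tool --threads $SLURM_CPUS_PER_TASK input.data   # my_tool is a hypothetical program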
Interactive jobs
If you need to execute tasks such as data compression, file transfers, code compilation or testing, or any other similar task, you have to use an interactive session in Garnatxa. The idea behind interactive sessions is to execute tasks in a controlled environment limited by a maximum number of CPUs, a memory limit and an execution time limit.
To start an interactive session in Garnatxa, execute the interactive command:
[USERNAME@master ~]$ interactive
srun: job 5745 queued and waiting for resources
srun: job 5745 has been allocated resources
This will open an interactive session in Garnatxa and will reserve an environment of 2 cores, 4GB of RAM and 12 hours of execution by default.
To change the amount of CPU, RAM or time that your session requires, add: -c (cpu number), -m (total memory) or -t (execution time). For example, to request 6 cpus, 30GB of memory and 24 hours of execution:
[USERNAME@master ~]$ interactive -c 6 -m 30G -t 24:00:00
Advanced usage:
You can use the srun command to open an interactive session on an internal node. For example, the following execution returns an interactive session on an internal node of the cluster:
[USERNAME@master ~]$ srun --partition=interactive --qos=interactive --nodes=1 --ntasks=1 --cpus-per-task=2 --pty --export=ALL --mem=30G --time=12:00:00 /bin/bash
[USERNAME@osd00 test]$
srun is a SLURM command that runs synchronously (i.e. it does not return until the job is finished). The example starts a job on the "interactive" partition and QoS, with pseudo-terminal mode on (--pty), a memory allocation of 30 GB of RAM (--mem=30G) and 12 hours of execution (--time in D-HH:MM:SS format). It also requests 2 cores on one node. The final argument is the command that you want to run; in this case you'll just get a shell prompt.
Review the maximum amount of resources that you can request in the interactive queue.
Attention
Any process running outside of the interactive partition may be killed if it consumes too many resources: more than 30 minutes of execution or more than 8GB of RAM.
Submitting jobs to Garnatxa
The Garnatxa cluster is managed by a batch job control system called SLURM (Simple Linux Utility for Resource Management). Tools that you want to run are embedded in a command script, and the script is submitted to the job control system using the appropriate SLURM command.
The next script is a simple example that just prints the hostname of a compute node and then waits 120 seconds. Both standard output and standard error are saved to separate files. Write a file called hostname.sbatch with the following content:
#!/bin/bash
#SBATCH -n 1 # Number of cores requested
#SBATCH -N 1 # Number of nodes requested
#SBATCH -t 15:00 # Runtime limit in MM:SS (here, 15 minutes)
#SBATCH --qos short # The QoS to submit the job.
#SBATCH --mem=10G # Total memory for the job (see also --mem-per-cpu)
#SBATCH -o hostname_%j.out # Standard output goes to this file
#SBATCH -e hostname_%j.err # Standard error goes to this file
srun hostname
srun sleep 120
Then submit this job script to SLURM:
[USERNAME@master ~]$ sbatch hostname.sbatch
When command scripts are submitted (using the sbatch command), SLURM looks at the resources you've requested and waits until an acceptable compute node is available on which to run it. Once the resources are available, it runs the script as a background process (i.e., you don't need to keep your terminal open while it is running), returning the output and error streams to the locations designated by the script.
You can monitor the progress of your job using the squeue -j JOBID command, where JOBID is the ID returned by SLURM when you submit the script. The output of this command will indicate if your job is PENDING, RUNNING, COMPLETED, FAILED, etc. If the job is completed, you can get the output from the file specified by the -o option. If there are errors, they should appear in the file specified by the -e option.
[USERNAME@master ~]$ squeue -j 5829
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5829 short hostname USERNAME R 0:01 1 cn0
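If the job has already completed, you can inspect the files created by the -o and -e options of the script; for this example the output file will simply contain the hostname of the compute node that ran the job:
[USERNAME@master ~]$ cat hostname_5829.out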
If you need to terminate a job, the scancel command can be used (JOBID is the number returned when the job is submitted):
[USERNAME@master ~]$ scancel JOBID
You will find complete documentation about how to submit jobs in Garnatxa in the next sections.
System limits
SLURM-managed resources are divided into partitions (known as queues in other batch processing systems). Garnatxa uses 2 partitions to submit your jobs: interactive and global. The interactive partition is used to start interactive sessions on the Garnatxa node, while the global partition is used to submit batch jobs with the sbatch command. Below are the limits assigned to the interactive and global partitions.
PARTITIONS LIMITS:

| Partition   | TIME LIMIT            | DEFAULT TIME          | MEMORY LIMIT     | DEFAULT MEMORY | NODELIST  |
|-------------|-----------------------|-----------------------|------------------|----------------|-----------|
| interactive | 1-00:00:00 (1 day)    | 0-12:00:00 (12 hours) | 30GB             | 4GB            | garnatxa  |
| global      | 15-00:00:00 (15 days) | 0-6:00:00 (6 hours)   | (see QoS limits) | 2GB            | cn[00-08] |
Take into account that each partition is assigned a default memory and execution time. This means that you should always specify the estimated execution time (--time parameter) and memory size (--mem or --mem-per-cpu) to be used by your job. Otherwise your job will be terminated as soon as it exceeds the default values.
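For example, a job header that states both limits explicitly (the values shown are only illustrative estimates):
#SBATCH --time=2-00:00:00    # Estimated execution time: 2 days (D-HH:MM:SS)
#SBATCH --mem=64G            # Estimated total memory for the job (or use --mem-per-cpu)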
The global partition is assigned a set of QoS (Quality of Service) levels that describe the limits under which a job can run. Each QoS is assigned a set of limits to be applied to the job, dictating the resources and partitions that the job is entitled to request. The table below shows the available QoS levels in Garnatxa and their resource limits.
LIMITS PER QOS AND USER:
These are the maximum resources that an individual user can use within each specific QoS.
| QOS               | TIME LIMIT            | DEFAULT TIME          | MEMORY LIMIT | DEFAULT MEMORY | MAX CPU   | PRIORITY |
|-------------------|-----------------------|-----------------------|--------------|----------------|-----------|----------|
| interactive       | 1-00:00:00 (1 day)    | 0-12:00:00 (12 hours) | 30GB         | 4GB            | 20        | 1000     |
| short             | 1-00:00:00 (1 day)    | 0-6:00:00 (6 hours)   | 1300GB       | 2GB            | 200       | 1000     |
| medium            | 7-00:00:00 (7 days)   | 0-6:00:00 (6 hours)   | 700GB        | 2GB            | 150       | 750      |
| long              | 15-00:00:00 (15 days) | 0-6:00:00 (6 hours)   | 360GB        | 2GB            | 100       | 500      |
| long-mem          | 15-00:00:00 (15 days) | 0-6:00:00 (6 hours)   | 1300GB       | 2GB            | 80        | 250      |
| extra (ON DEMAND) | 15-00:00:00 (15 days) | 0-6:00:00 (6 hours)   | ON DEMAND    | ON DEMAND      | ON DEMAND | 4000     |
Attention
The extra QoS should be used only exceptionally, for urgent and time-bound workloads. Access must be requested with sufficient advance notice and with a reasoned justification of the resources to be used. To request the use of the extra QoS you must open a ticket through: https://garnatxadoc.uv.es/support
TOTAL LIMITS PER USER:
These are the total resources that an individual user can use combining jobs of any QoS.
| MAX CPU | MAX MEMORY | MAX JOBS RUNNING | ALLOWED QOS                                |
|---------|------------|------------------|--------------------------------------------|
| 200     | 1300GB     | 1000             | interactive, short, medium, long, long-mem |
MAX ARRAY SIZE: 5000
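To run under a specific QoS, add the --qos directive to your job script according to the limits in the tables above; for instance, a job expected to run for about 5 days fits the medium QoS (the values below are only illustrative):
#SBATCH --qos medium         # Allows up to 7 days of execution (see the QoS table)
#SBATCH --time=5-00:00:00    # Estimated execution time: 5 days
#SBATCH --mem=100G           # Must stay below the 700GB limit of the medium QoS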
Example 1: A user submits 3 distinct jobs: 1 QoS medium job (100 cpus) + 1 QoS long job (90 cpus) + 1 QoS short job (20 cpus). Checking the status of all the jobs, we can see that one job was queued because the sum of the requested resources exceeds the total per-user CPU limit in the system: 100 + 90 + 20 = 210 cpus > 200 cpus.
$ squeue -u USERNAME
JOBID NAME USER ACCOUNT PARTITION QOS START_TIME TIME TIME_LEFT NODES CPU MIN_M NODELIST ST REASON
11125 seqJobTest USERNAME admin global short N/A 0:00 5:00 1 20 2G PD AssocGrpCpuLimit
11123 seqJobTest USERNAME admin global medium 2023-01-17 0:09 4:51 5 100 2G cn[00-04] R None
11124 seqJobTest USERNAME admin global long 2023-01-17 0:36 4:24 4 90 2G cn[04-07] R None
Example 2: A user submits 5 distinct jobs: 1 QoS long job (50 cpus) + 1 QoS long job (55 cpus) + 1 QoS short job (26 cpus) + 1 QoS short job (26 cpus) + 1 QoS medium job (40 cpus). Checking the status of all the jobs, we can see that the total requested CPUs are below the per-user maximum: 50 + 55 + 26 + 26 + 40 = 197 cpus < 200 cpus. However, one of the long jobs was queued because the long QoS is restricted to a maximum of 100 cpus: 50 + 55 = 110 cpus > 100 cpus.
$ squeue -u USERNAME
JOBID NAME USER ACCOUNT PARTITION QOS START_TIME TIME TIME_LEFT NODES CPU MIN_M NODELIST ST REASON
11136 seqJobTest USERNAME admin global long N/A 0:00 5:00 4 50 2G PD QOSMaxCpuPerUserLimit
11135 seqJobTest USERNAME admin global long 2023-01-17 0:19 4:41 1 55 2G cn08 R None
11137 seqJobTest USERNAME admin global short 2023-01-17 0:07 4:53 1 26 2G cn04 R None
11138 seqJobTest USERNAME admin global short 2023-01-17 0:04 4:56 1 26 2G cn05 R None
11139 seqJobTest USERNAME admin global medium 2023-01-17 0:01 4:59 3 40 2G cn[00-02] R None
Priority policy
Jobs are scheduled by Slurm according to a multi-factor policy. Initially all users in the cluster have the same opportunities to access the resources, but once their jobs have been running for some time, their priority with respect to other users for new resources may vary. Each job is assigned a number that marks its priority in the queue. This integer is the sum of these parameters:
Job’s priority = AGE + FAIRSHARE + JOB SIZE + QOS PRIORITY
AGE = The time the job has been queued in the system. The longer the job has been waiting, the more it contributes to the priority total.
FAIRSHARE = Each user is assigned a fair share of cluster usage. When a user exceeds their allocated share because they have already consumed a lot of resources, the priority of their jobs is decreased. In this way the use of the cluster is balanced among all users.
JOB SIZE = It depends on the resources requested by the job in terms of CPU and memory.
QOS PRIORITY = A static number that depends on the QoS under which the job was submitted.
You can review the priority of all jobs in the system by executing the sprio command.
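For example, to list the priority components of your own pending jobs (sprio only reports jobs that are still waiting in the queue):
[USERNAME@master ~]$ sprio -l -u USERNAME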
In summary, jobs are ordered by their calculated total priority and are executed in strict priority order. Note that the priority of a job changes dynamically over time.
Backfill: the waiting time for a job is not always related to the length of the queue; even if the queue has 100 pending jobs, your job could start right away if it is small enough to be backfilled and scheduled in the shadow of a larger job. But of course, if you do not submit your job, it has zero chance of starting. For example, if a higher-priority job needs 30 cores and will have to wait 20 hours for those resources to become available, and a lower-priority job only needs a couple of cores for an hour, Slurm will run the shorter job in the meantime. This GREATLY enhances the utilization of the cluster.
Jobs in Garnatxa
The next section explains in detail how you can submit jobs to the queue system.