This online training covers various aspects of accessing and using the Garnatxa cluster.
Although we will touch on the documentation listed below, the goal of this training is to provide a unified synthesis of this material to help you understand how you can use the compute resources effectively & efficiently.
Knowledge of basic Unix skills will be of great use as you learn how to use the cluster. While we do not offer beginning Unix classes, there are many resources available on the internet. We link to some of them here.
If you want to practice on a Linux system without having to install anything, you can use this online simulator:
Basic concepts about Garnatxa
Before describing how to access the Garnatxa file system, it is important to know some basic concepts about HPC clusters.
Research problems that use computing can sometimes outgrow the desktop or laptop computer where they started.
There might not be enough memory, not enough disk space or it may take too long to run computations.
These days, a typical desktop or laptop computer has up to 16GB of memory, a small number of CPU cores and several TB of disk space. When you run your computations on a computer, the program and data get loaded from the disk to the memory and the program’s commands are executed on the CPU cores. Results of the computations are usually written back to the disk. If too much data needs to be loaded to memory, the computer will become unresponsive and may even crash.
Typical cases of when it may be beneficial to request access to an HPC cluster:
Computations need much more memory than what is available on your computer
The same program needs to be run multiple times (usually on different input data)
The program that you use takes too long to run, but it can be run faster in parallel (usually using MPI or OpenMP)
One should not expect that any program will automatically run faster on a cluster. In fact, a serial program that does not need much memory may run slower on an HPC cluster than on a newer laptop or desktop computer. However, on a cluster you can take advantage of multiple cores by running several instances of a program at once, or by using a parallel version of the program.
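A minimal sketch of the "several instances at once" idea, using a hypothetical `work` function in place of a real program (not a tool from the cluster): each instance runs in the background, one per input, and `wait` blocks until all of them finish.

```shell
# Sketch: use multiple cores by running independent instances concurrently.
# 'work' is a stand-in for any serial program that processes one input.
work() { echo "processed $1"; }

for input in 1 2 3 4; do
    work "$input" &    # launch one background instance per input
done
wait                   # block until every instance has finished
```

On a cluster the same pattern is usually expressed as several jobs (or job steps) submitted to the scheduler rather than shell background processes.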
An HPC cluster such as Garnatxa is a collection of many separate servers (computers), called nodes, which are connected via a dedicated fast network.
The cluster has different types of nodes, depending on the function they provide. From the user's point of view, a cluster such as Garnatxa has:
A headnode or login node, where users log in
A set of compute nodes (where the majority of computations are run).
A set of storage nodes (each node exports a set of disks to the global storage; a distributed file system such as CEPH ‘joins’ all these disks so that users can access their files as a single disk).
A set of switches to connect all nodes (we use ).
All cluster nodes have the same components as a laptop or desktop: CPU cores, memory and disk space. The difference between a personal computer and a cluster node is in quantity, quality and power of the components.
The next diagram describes the main components of the Garnatxa cluster. Below we show some of the steps that a user might carry out when connecting to Garnatxa.
A user needs to connect to Garnatxa in order to process some files. Since the user is connecting from home, a VPN connection must be established before connecting to Garnatxa. A VPN connection encrypts all the data that a user sends to or receives from the cluster. After that, the user executes ssh garnatxa.uv.es and the system asks for a username and password.
All users can access three pools of data to store or retrieve data. The /home directory is a personal folder for code, sources and the general files you work with on the cluster; /storage contains files, sources, applications or databases shared by all members of the same project (there is one folder per research group); /scr is a directory for temporary data used to stage the input and output of jobs on the cluster. Data produced here should be deleted after a job completes.
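The typical staging workflow between these three areas can be sketched as follows. The paths here are stand-ins (temporary directories) so the sketch runs anywhere; on Garnatxa they would correspond to /home/USERNAME, a /storage project folder and a /scr job directory.

```shell
# Sketch of the staging pattern between the three storage pools.
# The variables are hypothetical stand-ins for the real cluster paths.
HOME_DIR=$(mktemp -d)     # stands in for /home/USERNAME
SHARED=$(mktemp -d)       # stands in for /storage/<research_group>
SCRATCH=$(mktemp -d)      # stands in for a /scr job directory

echo ">MT" > "$SHARED/reference.fa"      # shared project data lives in /storage
cp "$SHARED/reference.fa" "$SCRATCH/"    # stage input into /scr before a job
echo "result" > "$SCRATCH/out.txt"       # the job writes its output in /scr
cp "$SCRATCH/out.txt" "$HOME_DIR/"       # copy results you want to keep to /home
rm -rf "$SCRATCH"/*                      # clean /scr once the job is done
```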
The cluster provides a total of 784 CPUs, 14TB of RAM and 2.6 PB of disk space. These resources are shared by all users of the cluster. A resource manager (SLURM) is in charge of distributing the resources (CPUs, memory) among all the users. Each internal node exports a set of disks, and a distributed file system (CEPH) abstracts the access to the data. The cluster is presented to the user as a single hard disk plus a pool of CPUs and memory to which jobs are submitted.
Garnatxa implements 4 types of queues to which jobs can be submitted. Each queue (partition) is limited to a set of resources (number of CPUs, memory and execution time). A user must write a template (batch script) in which some parameters of the job are specified. SLURM reads this template and allocates a suitable set of resources on which to execute the job. Jobs in Garnatxa are queued and served in order of receipt.
Jobs are assigned to internal nodes (depending on the amount of CPUs and memory requested by the user). Once a job is submitted to the queue system, the user can check its status. When the job finishes, the user can access the output data.
Figure 1. General diagram showing the components of Garnatxa.
Start the course
The objective of this course is to learn some of the usual activities you will perform when working on Garnatxa. The course deals with the resolution of a typical bioinformatics problem: mapping a simulated sequence of a human mitochondrial genome onto the reference genome. Please follow all the steps below and try to understand how Garnatxa is used to solve this type of problem.
Connect to Garnatxa and create a new directory in your home directory, then change to this directory. In the example code below, replace USERNAME with your login name.
[USERNAME@localhost ~]$ ssh USERNAME@garnatxa.uv.es
#############################################################
## ##
## Welcome to I2SysBio ##
## _____ _ ##
## / ____| | | ##
## | | __ __ _ _ __ _ __ __ _| |___ ____ _ ##
## | | |_ |/ _` | '__| '_ \ / _` | __\ \/ / _` | ##
## | |__| | (_| | | | | | | (_| | |_ > < (_| | ##
## \_____|\__,_|_| |_| |_|\__,_|\__/_/\_\__,_| ##
## ##
## Supercomputing facility ##
## for bioinformatics & computational biology ##
## ##
#############################################################
Usage statistics: /doc/statistics
Last login: Fri Nov 4 15:23:33 2022 from 10.1.0.2
[USERNAME@master ~]$ mkdir course
[USERNAME@master ~]$ cd course
[USERNAME@master course]$
Initialize an interactive session.
In order to test all the tasks in the course in a safe way, we need to initialize an interactive session. An interactive session reserves a set of resources (CPUs, memory and a time limit of execution) in Garnatxa so you can run commands and tasks without losing sight of the results obtained.
Type the command interactive. By default this reserves 2 CPUs, 4GB of RAM and 12 hours of execution time.
[USERNAME@master course]$ interactive
You can check your running session with: squeue or the extended command squeue_
[USERNAME@master course]$ squeue -u USERNAME
JOBID PARTITION NAME USER ACCOUNT TIME TIME_LEFT START_TIME NODES CPU MIN_M NODELIST ST REASON
6197 interacti bash USERNAME grp1 11:01 11:48:59 2022-11-07 1 2 4G garnatxa R None
Note
You can change the amount of resources to reserve in an interactive session. Use interactive -h to get more information.
Example: interactive -t 20:00:00 -m 30G will open an interactive session with a reservation of 30GB of RAM and 20 hours of execution time.
Copying files
The aim of this first stage is to simulate random reads of a human mitochondrial genome. Since we do not have a sequencing machine to produce reads, we will use a sample file, mt_2020_pro.fa, to simulate multiple reads as a sequencing machine would do in reality. This file is a sample of the human mitochondrial genome in FASTA format. The first thing to do is to copy the file mt_2020_pro.fa from /doc/training_course/ to the recently created directory /home/USERNAME/course.
[USERNAME@master course]$ cp /doc/training_course/mt_2020_pro.fa .
Download the human mitochondrial genome sequence from Ensembl (this file will be used later as the reference genome). You can use the wget command to download any file from the Internet. The downloaded file is in FASTA format and is compressed.
[USERNAME@master course]$ wget http://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.MT.fa.gz
--2022-11-04 16:29:47-- http://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.MT.fa.gz
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.139
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.139|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5400 (5,3K) [application/x-gzip]
Saving to: ‘Homo_sapiens.GRCh38.dna.chromosome.MT.fa.gz’
Homo_sapiens.GRCh38.dna.chromosome.MT.fa.gz 100%[========================================================================================================================================>] 5,27K --.-KB/s in 0s
2022-11-04 16:29:47 (637 MB/s) - ‘Homo_sapiens.GRCh38.dna.chromosome.MT.fa.gz’ saved [5400/5400]
Unzip the downloaded file. Change the name of the unzipped file to mt.fa (use the command mv to do this). Now you should see two FASTA files in the course directory. Use the ls command to list the files.
[USERNAME@master course]$ gunzip Homo_sapiens.GRCh38.dna.chromosome.MT.fa.gz
[USERNAME@master course]$ mv Homo_sapiens.GRCh38.dna.chromosome.MT.fa mt.fa
[USERNAME@master course]$ ls
mt_2020_pro.fa mt.fa
Generating reads
Now it is time to install the program that we will use to simulate reads from the copied sequence. wgsim is an application that we can configure to produce the number of reads and the coverage we want. The application can be downloaded from its official site; we use the wget command to do it. Since the downloaded package contains the source code of wgsim, we will have to compile it with gcc to generate an executable binary.
[USERNAME@master course]$ wget https://github.com/lh3/wgsim/archive/master.zip
--2022-11-04 17:05:21-- https://github.com/lh3/wgsim/archive/master.zip
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/lh3/wgsim/zip/refs/heads/master [following]
--2022-11-04 17:05:29-- https://codeload.github.com/lh3/wgsim/zip/refs/heads/master
Resolving codeload.github.com (codeload.github.com)... 140.82.121.10
Connecting to codeload.github.com (codeload.github.com)|140.82.121.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘master.zip’
master.zip [ <=> ] 12,85K --.-KB/s in 0,03s
2022-11-04 17:05:29 (373 KB/s) - ‘master.zip’ saved [13160]
[USERNAME@master course]$ unzip master.zip
Archive: master.zip
a12da3375ff3b51a5594d4b6fa35591173ecc229
creating: wgsim-master/
extracting: wgsim-master/.gitignore
inflating: wgsim-master/README
inflating: wgsim-master/kseq.h
inflating: wgsim-master/wgsim.c
inflating: wgsim-master/wgsim_eval.pl
[USERNAME@master course]$ cd wgsim-master/
[USERNAME@master wgsim-master]$ gcc -g -O2 -Wall -o wgsim wgsim.c -lz -lm
[USERNAME@master wgsim-master]$ ll
total 102
-rw-r--r-- 1 USERNAME cpd 8084 oct 17 2011 kseq.h
-rw-r--r-- 1 USERNAME cpd 4070 oct 17 2011 README
-rwxr-xr-x 1 USERNAME cpd 67224 nov 4 17:10 wgsim
-rw-r--r-- 1 USERNAME cpd 15480 oct 17 2011 wgsim.c
-rwxr-xr-x 1 USERNAME cpd 8677 oct 17 2011 wgsim_eval.pl
Now you can display the help for wgsim:
[USERNAME@master wgsim-master]$ ./wgsim --help
./wgsim: invalid option -- '-'
Program: wgsim (short read simulator)
Version: 0.3.1-r13
Contact: Heng Li <lh3@sanger.ac.uk>
Usage: wgsim [options] <in.ref.fa> <out.read1.fq> <out.read2.fq>
Options: -e FLOAT base error rate [0.000]
-d INT outer distance between the two ends [500]
-s INT standard deviation [50]
-N INT number of read pairs [1000000]
-1 INT length of the first read [70]
-2 INT length of the second read [70]
-r FLOAT rate of mutations [0.0010]
-R FLOAT fraction of indels [0.15]
-X FLOAT probability an indel is extended [0.30]
-S INT seed for random generator [-1]
-A FLOAT disgard if the fraction of ambiguous bases higher than FLOAT [0.05]
-h haplotype mode
For this example we will simulate reads of length 150+100 bases, separated by 300 bases, with a coverage of about 25x: N ≈ (16500/150) × 25 ≈ 2750, which we round up to 2800. The result of the simulation is three files: mt01.fq and mt02.fq (containing the simulated paired reads) and mt_sal.txt (containing the mutations produced by wgsim).
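The read-pair arithmetic above can be checked with shell integer arithmetic (16500 and 150 are the approximate genome length and read length used in the text):

```shell
# approximate number of read pairs for ~25x coverage:
# N = (genome_length / read_length) * coverage
echo $(( 16500 / 150 * 25 ))   # prints 2750
```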
First we move back to the /home/$USERNAME/course directory with cd .. and execute wgsim from there. Remember that the application was unzipped into the wgsim-master directory, so we must use the relative path ./wgsim-master/wgsim from the current directory (/home/$USERNAME/course).
[USERNAME@master course]$ cd ..
[USERNAME@master course]$ ./wgsim-master/wgsim -N 2800 -d 300 -1 150 -2 100 mt_2020_pro.fa mt01.fq mt02.fq > mt_sal.txt
[wgsim] seed = 1667579313
[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 1 sequences, total length: 17949
[USERNAME@master course]$
[USERNAME@master course]$ ls
master.zip mt01.fq mt02.fq mt_2020_pro.fa mt.fa mt_sal.txt wgsim-master
You can inspect the mutations simulated by wgsim by reading the file mt_sal.txt. Use the cat command to print the content of a file.
[USERNAME@master course]$ cat mt_sal.txt
MT 476 C M +
MT 3641 C M +
MT 3799 C S +
MT 5623 C Y +
MT 6804 C T -
MT 7587 - C -
MT 8622 A - -
MT 11485 C A -
MT 12319 C S +
MT 12795 A R +
MT 13550 - A +
MT 13607 T G -
MT 13853 G K +
MT 13903 A G -
MT 13984 A M +
MT 15528 - C -
MT 17484 C Y +
MT 17798 T W +
Now that we have all the files and tools installed in your /home, we can start the alignment. Below we summarize all the steps needed to obtain the three result files: mt01.bam, statistics.txt and coverage.txt. Slurm (the resource manager in Garnatxa) treats all these stages together as a single job; each stage within the job is called a job step (task). You can see a scheme of the whole process in Figure 1. The job is a pipeline of processes in which the output files of each job step are the input files of the next one.
Index the reference genome mt.fa. This is needed to achieve quick searches in the alignment process.
Align (or map) the simulated reads to the indexed file mt.fa.
Convert the format of the produced files.
Sort the mapped reads.
Index the sorted reads.
Extract statistics.
Extract coverage.
Figure 1. Alignment of a human mitochondrial genome.
Testing the pipeline
In order to align the generated reads we need the algorithm BWA (Burrows-Wheeler Aligner). Garnatxa provides multiple modules of software that are ready to be used. You can review the installed modules in Garnatxa with the command: module.
To list the available software do module avail:
[USERNAME@master course]$ module avail
---------------------------------------------------------- /storage/apps/modulefiles --------------------------------------------------------------
R/4.2.1 (D) bcftools/1.16 blast/2.13 fastqc/0.11.9 intel/compiler/2021.5.0 (D) iqtree2/2.2.0 matlab/R2019b roary/3.7.0 sra-tools/3.0.0
anaconda/anaconda3_2021.11 bedtools/2.30.0 bowtie2/2.4.5 freebayes/1.3.6 intel/debugger/2021.5.0 kmc/3.2.1 metagenomics/1 samstats/1.5.1
anaconda/anaconda3 (D) biotools/1 bwa/0.7.17 intel/2019.4.243 intel/mpi/2021.5.0 mafft/7.505 openmpi4/4.1.4 (L,D) samtools/1.16
bbmap/39.01 biotools/2 (D) fastp/0.23 intel/compiler/2021.5.0_RT intel/tbb/2021.5.0 mathematica/12.1 python/3.8 spades/3.15
----------------------------------------------------- /opt/ohpc/pub/moduledeps/gnu9 -------------------------------------------------------------------
R/4.1.2 gsl/2.7 mpich/3.4.2-ucx openblas/0.3.7 openmpi4/4.1.1
------------------------------------------------ /opt/ohpc/pub/modulefiles ---------------------------------------------------------------------------
autotools (L) charliecloud/0.15 cmake/3.21.3 gnu9/9.4.0 (L) hwloc/2.5.0 libfabric/1.13.0 ohpc (L) os prun/2.2 (L) singularity/3.7.1 ucx/1.11.2 valgrind/3.18.1
To list the loaded modules in your session you can type: module list
[USERNAME@master course]$ module list
Currently Loaded Modules:
1) autotools 2) prun/2.2 3) gnu9/9.4.0 4) openmpi4/4.1.4 5) ohpc
Garnatxa provides a specific module that integrates the most used bioinformatics tools, ready to be loaded. You can review the list of software it contains by running module help biotools:
[USERNAME@master course]$ module help biotools
You could also load separate modules, but it is more convenient to have all the tools available at once. We will load biotools and check the bwa application:
[USERNAME@master course]$ module load biotools
[USERNAME@master course]$ bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1188
Contact: Heng Li <lh3@sanger.ac.uk>
Usage: bwa <command> [options]
Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
fastmap identify super-maximal exact matches
pemerge merge overlapping paired ends (EXPERIMENTAL)
aln gapped/ungapped alignment
samse generate alignment (single ended)
sampe generate alignment (paired ended)
bwasw BWA-SW for long queries
shm manage indices in shared memory
fa2pac convert FASTA to PAC format
pac2bwt generate BWT from PAC
pac2bwtgen alternative algorithm for generating BWT
bwtupdate update .bwt to the new format
bwt2sa generate SA from BWT and Occ
Note: To use BWA, you need to first index the genome with `bwa index'.
There are three alignment algorithms in BWA: `mem', `bwasw', and
`aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
first. Please `man ./bwa.1' for the manual.
Now we use bwa to index the reference genome file mt.fa. Execute bwa index mt.fa and list the generated files.
[USERNAME@master course]$ bwa index mt.fa
[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index mt.fa
[main] Real time: 0.561 sec; CPU: 0.007 sec
[USERNAME@master course]$ ls -l
total 1648
-rw-r--r-- 1 USERNAME cpd 13160 nov 4 17:05 master.zip
-rw-r--r-- 1 USERNAME cpd 942744 nov 4 17:28 mt01.fq
-rw-r--r-- 1 USERNAME cpd 662744 nov 4 17:28 mt02.fq
-rw-r--r-- 1 USERNAME cpd 18303 nov 4 16:45 mt_2020_pro.fa
-rw-r--r-- 1 USERNAME cpd 16900 oct 19 2020 mt.fa
-rw-r--r-- 1 USERNAME cpd 19 nov 7 15:36 mt.fa.amb
-rw-r--r-- 1 USERNAME cpd 76 nov 7 15:36 mt.fa.ann
-rw-r--r-- 1 USERNAME cpd 16648 nov 7 15:36 mt.fa.bwt
-rw-r--r-- 1 USERNAME cpd 4144 nov 7 15:36 mt.fa.pac
-rw-r--r-- 1 USERNAME cpd 8336 nov 7 15:36 mt.fa.sa
-rw-r--r-- 1 USERNAME cpd 262 nov 4 17:28 mt_sal.txt
drwxr-xr-x 2 USERNAME cpd 9 nov 4 17:27 wgsim-master
It is time to align the reads to the reference genome. We have paired-read files, but for simplicity we will align only one of them, mt01.fq. Submit: bwa mem mt.fa mt01.fq > mt01.sam
[USERNAME@master course]$ bwa mem mt.fa mt01.fq > mt01.sam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 2800 sequences (420000 bp)...
[M::mem_process_seqs] Processed 2800 reads in 0.212 CPU sec, 0.212 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem mt.fa mt01.fq
[main] Real time: 0.283 sec; CPU: 0.218 sec
We can check that the alignment is produced successfully. Use the head command to read only the first lines of the result file:
[USERNAME@master course]$ head mt01.sam
@SQ SN:MT LN:16569
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem mt.fa mt01.fq
MT_13359_13571_2:0:0_1:0:1_0 16 MT 12223 60 126M1I23M * 0 0 ACTGCTAACTCATGCCCCCATGTCTAACAACATGGCTTTCTCAACTTTTAAAGGATAACAGCTATCCATTGCTCTTAGGCCCCAAAAATTTTGGTGCAACTCCAAATAAAAGTAATAACCATGCACAACTACTATAACCACCCTAACCCT 222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 NM:i:2 MD:Z:71G77 AS:i:137 XS:i:0
MT_10701_11069_4:0:0_1:0:0_1 16 MT 8280 60 121M29S * 0 0 ACCCCCTCTAGAGCCCACTGTAAAGCTAACTTAGCATTAACCTTTTAAGTTAAAGATTAAGAGAACCAACACATCTTTACAGTGAAATGCCCCAACTAAATACTACCGTATGGCCCACCATCAACTTTCCTCACTATCTGCTTCATCCGC 222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 NM:i:1 MD:Z:72C48 AS:i:116 XS:i:0 SA:Z:MT,9840,-,120S30M,60,0;
MT_10701_11069_4:0:0_1:0:0_1 2064 MT 9840 60 120H30M * 0 0 TCAACTTTCCTCACTATCTGCTTCATCCGC 222222222222222222222222222222 NM:i:0 MD:Z:30 AS:i:30 XS:i:0 SA:Z:MT,8280,-,121M29S,60,1;
MT_6882_7201_2:0:0_4:0:0_2 0 MT 5862 60 150M * 0 0 CAGTCCAATGCTTCACTCAGCCATTTTACCTCACCCCCACTGATTTTCGCCGACCGTTTACTATTCTCTACAAACCACAAAGACATTGGAACACTATACCTATTATTCGGCGCATGAGCTGGAGTCCTAGGCACAGCTCTAAGCCTCCTT 222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 NM:i:2 MD:Z:44G13G91 AS:i:140 XS:i:0
MT_3425_3722_0:0:0_1:1:0_3 0 MT 3425 60 150M * 0 0 TAGGCCCCTACGGGCTACTACAACCCTTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCC 222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 NM:i:0 MD:Z:150 AS:i:150 XS:i:0
MT_6006_6368_3:0:0_4:0:0_4 16 MT 5199 60 150M * 0 0 ATTCCATCCTCCCTCCTATCCCTAGCAGGCCTCCCCCCGCTAACCGGCTTTTTGCCCAAATGGGCCATTATCGAAGAATTCACAAAAAACAATAGCCTCATCATCCCCACCATCATAGCCACCATCACCCTCCTTAACCTCTACTTCTAC 222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 NM:i:4 MD:Z:9A7C7G6G117 AS:i:130 XS:i:0
MT_13074_13356_2:0:0_2:0:0_5 0 MT 11874 60 150M * 0 0 CTATTAACCTACTGGGAGAACTCTCTGTGCTAGTAACCACGTTCTCCTGCTCAAATATCACTCTCCTACTTACAGGACTCAACATACTAGTCACAGCCCTAAACTCCCTCTACATATTTACCACAACACAATGGGGCTCACTCACCCACC 222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 NM:i:2 MD:Z:49A51T48 AS:i:140 XS:i:0
MT_10266_10569_2:0:0_3:0:0_6 0 MT 9246 60 150M * 0 0 AGCCCATGACCCCTAACAGGGGCCCTCTCAGCCCTCCTAATGACCTCCGGCCTAGCCATGTGATTTCACTTCCACTGCATAACGCTCCTCATACTAGGCGTACTAACCAACACACTAACCATATACCAATGATGGCGCGATGTAACACGA 222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 NM:i:2 MD:Z:76C22C50 AS:i:140 XS:i:0
To extract statistics and the coverage we need to convert the SAM file to BAM format. We can use the samtools software to convert files. samtools is loaded with the biotools module, so you only have to execute: samtools view -bS mt01.sam > mt01.bam
[USERNAME@master course]$ samtools view -bS mt01.sam > mt01.bam
Now we need to sort the reads by mapping position: samtools sort mt01.bam -o mt01.sorted.bam
[USERNAME@master course]$ samtools sort mt01.bam -o mt01.sorted.bam
Finally we need to index the generated bam file: samtools index mt01.sorted.bam
[USERNAME@master course]$ samtools index mt01.sorted.bam
The statistics results can be dumped to a file:
[USERNAME@master course]$ samtools idxstats mt01.sorted.bam > statistics.txt
[USERNAME@master course]$ cat statistics.txt
MT 16569 2678 0
* 0 0 161
The genome length is 16569 bases; in this case 2678 reads were successfully mapped and 161 reads could not be aligned.
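As a self-contained sketch, the idxstats output can be parsed with awk to get the fraction of mapped reads. The printf simply reproduces the two lines of statistics.txt shown above; on the cluster you would feed awk the file itself.

```shell
# Fraction of mapped reads from 'samtools idxstats' output
# (columns: name, length, mapped reads, unmapped reads).
printf 'MT\t16569\t2678\t0\n*\t0\t0\t161\n' |
awk 'NR == 1   { mapped = $3 }       # mapped count from the first contig line
     $1 == "*" { unmapped = $4 }     # unmapped count from the "*" line
     END { printf "%.1f%% mapped\n", 100 * mapped / (mapped + unmapped) }'
# prints: 94.3% mapped
```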
The last step is to extract the coverage achieved with respect to the reference sample. Type: samtools depth -a mt01.sorted.bam > mt01.only_cover.txt
[USERNAME@master course]$ samtools depth -a mt01.sorted.bam > mt01.only_cover.txt
[USERNAME@master course]$ cat mt01.only_cover.txt
MT 1 0
MT 2 0
MT 3 0
MT 4 1
MT 5 1
MT 6 1
MT 7 1
MT 8 1
MT 9 1
MT 10 1
MT 11 1
MT 12 1
MT 13 2
MT 14 2
MT 15 2
MT 16 2
MT 17 2
MT 18 2
MT 19 3
MT 20 3
.........
The result file shows the coverage obtained for each base in the reference genome.
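A common follow-up is to compute the mean coverage across all positions with awk. The sketch below feeds in a three-line sample so it runs anywhere; on the cluster you would point awk at mt01.only_cover.txt instead of the printf.

```shell
# Mean coverage from 'samtools depth -a' output
# (columns: chromosome, position, depth).
printf 'MT\t1\t0\nMT\t2\t2\nMT\t3\t4\n' |
awk '{ sum += $3 }                                  # accumulate depth column
     END { printf "mean coverage: %.2f\n", sum / NR }'
# prints: mean coverage: 2.00
```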
Submitting jobs to the batch system
Access to compute resources on Garnatxa is managed through a job scheduler. Users submit jobs as scripts that tell the scheduler which resources are needed, which commands to run and how to run them. The scheduler launches the script on compute resources when they become available. The script consists of two parts: instructions for the scheduler and the commands the user wants to run. This script describes the whole pipeline of instructions that we executed sequentially in the interactive session. You will find more information about how to submit jobs to the batch system in later sections of this guide.
Use a text editor such as vi to write the file below. To start writing in the file with vi, press Esc and then i.
[USERNAME@master course]$ vim align_reads.sbatch
#!/bin/bash
#SBATCH --job-name=align_reads # Job name (showed with squeue)
#SBATCH --output=aligns_reads_%j.out # Standard output and error log
#SBATCH --nodes=1 # Required only 1 node
#SBATCH --ntasks=1 # Required only 1 task
#SBATCH --cpus-per-task=1 # Required only 1 cpu
#SBATCH --mem=10G # Required 10GB of memory
#SBATCH --qos=short # Required short qos (maximum 24 hours and 180GB of memory)
#SBATCH --time=00:05:00 # Required 5 minutes of execution.
module load biotools
echo "Indexing the reference genome."
srun bwa index mt.fa
echo
echo "Mapping reads to reference genome"
srun bwa mem mt.fa mt01.fq > mt01.sam
echo
echo "Converting files from SAM to BAM"
srun samtools view -bS mt01.sam > mt01.bam
echo
echo "Sorting mapped reads"
srun samtools sort mt01.bam -o mt01.sorted.bam
echo
echo "Indexing sorted mapped reads"
srun samtools index mt01.sorted.bam
echo
echo "Extracting statistics"
srun samtools idxstats mt01.sorted.bam > statistics.txt
echo
echo "Extracting coverage"
srun samtools depth -a mt01.sorted.bam > mt01.only_cover.txt
echo
echo "The reference genome has: `head -n 1 statistics.txt | awk '{print $2}'` bases."
echo "Number of reads aligned: `head -n 1 statistics.txt | awk '{print $3}'` bases."
echo "Number of reads not aligned: `tail -n 1 statistics.txt | awk '{print $4}'` bases."
echo
sleep 60
echo "Finished job"
exit 0
Save the changes by pressing Esc and then typing :wq
The lines that begin with #SBATCH are instructions to the scheduler. In order:
#SBATCH --job-name=align_reads -> The name of your job; it is important to use a descriptive name for each job.
#SBATCH --output=aligns_reads_%j.out -> The output file name. Errors during the execution are saved in the same output file (you can specify a separate file with --error). Every time you submit a new job the scheduler assigns it a sequential number (the job_id). You can use %j to add the job_id to the name of the output file.
#SBATCH --nodes=1 -> We require only one internal node for the execution.
#SBATCH --ntasks=1 -> Each job step is sequential, so we only need one task per job step. If your job step works in parallel, consider requesting more tasks.
#SBATCH --cpus-per-task=1 -> Job steps are sequential, so only one CPU per task will be used.
#SBATCH --mem=10G -> The total memory the job will consume. This is only a first estimate and should be tuned in later executions. It is very important to adjust the total memory your job needs, because it affects the global availability of the cluster. We give some instructions on how to do this in later sections.
#SBATCH --qos=short -> Garnatxa has four queues or partitions. Each partition defines a set of limits (execution time and memory) that your jobs must comply with. For example, with the short partition, jobs must request less than 180GB of RAM and an execution time of up to 24 hours. If your job exceeds these limits, the scheduler will kill it.
#SBATCH --time=00:05:00 -> The maximum time the job will be in execution, in this example five minutes. This time must be less than the maximum established for the partition (24 hours in the case of the short partition). The format of this parameter is dd-hh:mm:ss (dd: days, hh: hours, mm: minutes, ss: seconds).
module load biotools -> Remember that you need to load this module in order to execute commands like bwa or samtools. You must add all the modules you need before executing any other process.
After the #SBATCH directives we see lines starting with the srun command; each one is a job step that will be executed sequentially within the job. The last command, sleep 60, is not necessary in your scripts; it is used here only to extend the execution time.
The script is launched using the sbatch command.
[USERNAME@master course]$ sbatch align_reads.sbatch
Submitted batch job 6266
The job was assigned job_id 6266.
While the job is running you can see the status of the job in the queue system with: squeue or squeue_
[USERNAME@master course]$ squeue_ -u USERNAME
JOBID PARTITION NAME USER ACCOUNT TIME TIME_LEFT START_TIME NODES CPU MIN_M NODELIST ST REASON
6266 short align_re USERNAME admin 0:13 4:47 2022-11-08 1 2 10G cn06 R None
TIME -> time elapsed from the start of execution.
TIME_LEFT -> time left until the job is killed. In this case we requested 5 minutes and 13 seconds have elapsed, so the remaining time is 4:47.
MIN_M -> memory requested by node.
NODELIST -> the internal node where your job is running.
ST -> status of the job. R means the job is running; PD means the job is queued, waiting for resources.
After waiting about a minute you can repeat the same command and check that the job is no longer shown in the output.
[USERNAME@master course]$ squeue -u USERNAME
You can see more information about a finished job by executing the command sacct or sacct_ ; use the job_id to specify the job to monitor.
[USERNAME@master course]$ sacct -j 6266
Start End User Account JobID JobName Partition Elapsed ReqCPUS AllocCPUS ReqMem MaxRSS TotalCPU State ExitCode
------------------- ------------------- --------- ---------- ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- ---------- ---------- --------
2022-11-08T13:17:26 2022-11-08T13:18:29 USERNAME admin 6266 align_rea+ short 00:01:03 1 2 10Gn 00:01.032 COMPLETED 0:0
2022-11-08T13:17:26 2022-11-08T13:18:29 admin 6266.batch batch 00:01:03 2 2 10Gn 4584K 00:00.655 COMPLETED 0:0
2022-11-08T13:17:27 2022-11-08T13:17:28 admin 6266.0 bwa 00:00:01 2 2 10Gn 768K 00:00.011 COMPLETED 0:0
2022-11-08T13:17:28 2022-11-08T13:17:28 admin 6266.1 bwa 00:00:00 2 2 10Gn 776K 00:00.241 COMPLETED 0:0
2022-11-08T13:17:28 2022-11-08T13:17:29 admin 6266.2 samtools 00:00:01 2 2 10Gn 772K 00:00.051 COMPLETED 0:0
2022-11-08T13:17:29 2022-11-08T13:17:29 admin 6266.3 samtools 00:00:00 2 2 10Gn 772K 00:00.036 COMPLETED 0:0
2022-11-08T13:17:29 2022-11-08T13:17:29 admin 6266.4 samtools 00:00:00 2 2 10Gn 776K 00:00.012 COMPLETED 0:0
2022-11-08T13:17:29 2022-11-08T13:17:29 admin 6266.5 samtools 00:00:00 2 2 10Gn 772K 00:00.008 COMPLETED 0:0
2022-11-08T13:17:29 2022-11-08T13:17:29 admin 6266.6 samtools 00:00:00 2 2 10Gn 776K 00:00.014 COMPLETED 0:0
The output shows all the job steps executed in the job and the resources consumed by each of them.
AllocCPUs -> the number of cores allocated. Each CPU is associated with 2 cores (for better performance), so even if you request a single CPU the system will use 2.
ReqMem -> the memory required per node.
MaxRSS -> maximum amount of memory consumed per job step.
TotalCPU -> execution time per job step.
State -> the job step state: COMPLETED, FINISHED or FAILED.
ExitCode -> the error code if your job failed. Exit code 0 means your job finished correctly.
Now that the job has finished successfully, we can show the output file with the results:
[USERNAME@master course]$ cat aligns_reads_6266.out
Indexing the reference genome.
[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.17-r1188
[main] CMD: /storage/apps/BWA/0.7.17/bin/bwa index mt.fa
[main] Real time: 0.785 sec; CPU: 0.012 sec
Mapping reads to reference genome
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 2800 sequences (420000 bp)...
[M::mem_process_seqs] Processed 2800 reads in 0.231 CPU sec, 0.234 real sec
[main] Version: 0.7.17-r1188
[main] CMD: /storage/apps/BWA/0.7.17/bin/bwa mem mt.fa mt01.fq
[main] Real time: 0.309 sec; CPU: 0.241 sec
Converting files from SAM to BAM
Sorting mapped reads
Indexing sorted mapped reads
Extracting statistics
Extracting coverage
The reference genome has: 16569 bases.
Number of reads aligned: 2678 bases.
Number of reads not aligned: 161 bases.
Finished job
In the next sections you will find more extended information about how to work on Garnatxa. Please read all this documentation before trying to submit jobs to the system.