Getting Started
The document will work through setting up a pipeline to process Plasmodium vivax sequence data using PvP01_v1 reference sequence and perform some processing of P vivax sequence data.
Note
Run this tutorial under a terminal multiplexer such as tmux or screen
when using a remote system to avoid termination of a running program if the
network is disrupted.
Quick Installation
If vivaxGEN NGS-Pipeline has not been installed in your system, run the following command to install it:
"${SHELL}" <(curl -L https://raw.githubusercontent.com/vivaxgen/ngs-pipeline/main/install.sh)
Enter the target directory at the prompt (or just press Enter to use the default). The installation will take about 5-15 minutes depending on the internet speed and will use about 5-6 GB of storage. Take a note of the directory where the pipeline is installed and the full path of its activation script.
Warning
If installing under WSL/WSL2, please make sure that the target directory
is under Linux (WSL) filesystems.
The installation will fail if the target directory is under Windows
filesystems, such as /mnt/c or /mnt/d.
Try to activate the NGS-Pipeline environment by executing its activation script:
YOUR_INSTALATION_DIRECTORY/bin/activate
Note
When an environment is activated, the shell prompt will change to reflect the environment name.
Checking and Setting the Profile
The installation process checks and sets the correct cluster profile for Snakemake workflow if a workload manager/job scheduler is installed in a HPC/cluster system.
If SLURM or PBSPro is installed in the system, the installation script will
try to set up a default setting.
Some HPC settings require additional flags or arguments for job submission, in
which case the flags/arguments can be supplied with environment variable
SNAKEMAKE_CLUSTER_EXTRA_FLAGS, by following instructions in
Adding Additional Flags for Job Submission.
If the installer can not detect existing workload manager, please follow
Manual Setup for Snakemake Cluster Profile to manually set the cluster profile.
If the installer could not detect any workload managers/job schedulers, it will set the profiler based on the available cores and memory of the system.
To see which profile set by the installer, activate the NGS-Pipeline environment and check the link to the Snakefile profile by running the following command:
ls -l $VVG_BASEDIR/etc/bashrc.d/99-snakemake-profile
The link should point to the correct profile. If the link is not correct, adjust the link accordingly following the instruction in Snakemake Profile Setting.
Preparing Base Environment Directory
For normal workflows, vivaxGEN NGS-Pipeline requires that all necessary files, with the exception of the pipeline and all supporting software themselves, to reside in a single hierarchical directory (unless the pipeline has been set up with specific settings). The base environment directory can be prepared with automatic method (with preset settings) or manual method (with custom settings).
Regardless of whether using automatic or manual method, the following are the mandatory steps to prepare the base environment directory:
Activate the vivaxGEN NGS-Pipeline environment by running its activation script, as noted after the automatic installation finished, eg:
NGS-PIPELINE_INSTALL_DIR/activateSetup the base working directory, eg:
/data/Pv-wgs/PvP01_v1:ngs-pl setup-base-directory /data/Pv-wgs/PvP01_v1 cd /data/Pv-wgs/PvP01_v1
Tip
To easily identify and differentiate between several base environment directory, it is recommended to use the reference genome name as part of the directory name.
Warning
If running under WSL/WSL2, please make sure that the base environment directory is under Linux (WSL) filesystems. If the base environment directory is under Windows filesystems (such as
/mnt/cor/mnt/d), the pipeline might fail to work.Exit the current environment and activate the new environment using the new
activatefile:exit /data/Pv-wgs/PvP01_v1/activate
Once activated, the environment directory can be accessed using environment variable
NGSENV_BASEDIR.
To continue preparing the base enviroment directory with automatic method using preset settings for P vivax with PvP01_v1 reference sequence, change to base environment directory:
cd $NGSENV_BASEDIR
If running in an HPC/cluster system or workstation/server with 16-core or more, use the following command to setup the base environment directory with full version of PvP01_v1 setting:
bash <(curl -L https://raw.githubusercontent.com/vivaxgen/vgnpc-plasmodium-spp/main/Pvivax/PvP01_v1/setup.sh)
If running in a laptop or desktop with less than 16-core, use the following command to setup the base enviroment directory with lite version of PvP01_v1 setting:
bash <(curl -L https://raw.githubusercontent.com/vivaxgen/vgnpc-plasmodium-spp/main/Pvivax/PvP01_v1/setup-lite.sh)
The full version setup will take some time as it needs to download both the
PvP01_v1 genome sequence (~ 23MB), human GRCh38.p14 genome (~ 928MB),
uncompress the human genome, and generate index file for both PvP01 and the
human genome sequences using bwa-mem2.
The lite version setup will take less time as it only needs to download the PvP01_v1 genome sequence (~ 23MB) and generate index file for only PvP01 sequences.
Note
The vivaxGEN github repository provides the list of available preset settings. However, if none of the preset settings are suitable, then the setup can be continued using manual method following steps described in Setting Up Base Enviroment Directory.
Running the Multi-Step Mode
This section of the tutorial shows the use of run-multistep-variant-caller
single command, which provides the simple and quick way to perform multi-step
mode of the variant calling.
In this section, 2 samples of P vivax WGS data will be processed to get the
final result as a concatenated VCF file (a single VCF file containing all
chromosomes).
Tip
For larger number of samples, it is advisable to have the final result as multiple VCF files, each contains a specific chromosome, since the downstream analysis then can be performed individually on each chromosome in parallel to speed up the analysis.
Activate the environment by executing the
activatescript if the environment has not been activated:/data/Pv-wgs/PvP01_v1/activateEnter the directory for containing data sets, and create a new directory, and enter to the new directory:
cd $NGSENV_BASEDIR/sets mkdir my-tutorial cd my-tutorial
Create a directory to hold the FASTQ read files:
mkdir reads-1Download read files related to 2 P. vivax sequence data from ENA (note that for working with public SRA read files, consider using SRA-Repo to manage and automatically download the read files):
cd reads-1 wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR111/ERR111714/ERR111714_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR111/ERR111714/ERR111714_2.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR113/004/ERR1138854/ERR1138854_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR113/004/ERR1138854/ERR1138854_2.fastq.gz cd ..
Run the multi-step mode variant calling process by executing this single command:
ngs-pl run-multistep-variant-caller -o batch-1 --paired reads-1/*.fastq.gzWait until the process finishes.
Inspect the
batch-1directory by performing directorylisting:ls batch-1The following is the layout of the output directory:
analysis/This directory contains sample directory, eg. each sample and their associated files (input/output/log) are in their own directory.
completed_samples/This directory contains symbolic links to samples in
analysisdirectory that have been successfully called. The joint variant calling is performed only on samples in this directory.failed_samples/This directory contains symbolic links to samples in
analysisthat are failed during individual sample calling process.joint/This directory contains all files pertinent to joint variant calling process.
joint/concatenated.vcf.gzThis is the concatenated VCF file from chromosome-based VCF files inside
joint/vcfsdirectory. This file is only available with--target concatenated_vcfoption.joint/vcfs/The final output of the joint variant calling is the per-chromosome VCF files in this directory.
metafile/This directory contains metafiles necessary for performing the whole variant calling process. Currently it holds the manifest file describing the sample name and its associated read files.
reports/This directory contains consolidated report files from completed samples in the
completed_samplesdirectory. Currently, it holdsmaps/directory (which links to BAM files of each samples) anddepth-base/directory (which links to depth files generated by sambamba).stats.tsvThis file contains the statistics of each step of the process.
Congratulation!
You have just performed your first variant calling analysis with the pipeline.
The main output file(s) of this whole variant calling process are VCF files
inside joint/vcfs and concatenated.vcf.gz.
Performing Joint Calling with New Sample Batches
Now let assume that another batch of samples are available. The following steps provide instructions to perform sample variant calling and then do joint variant calling with the previous batch:
Download read files related to another 2 of P vivax sequence data from SRA database:
mkdir reads-2 cd reads-2 wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR527/ERR527357/ERR527357_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR527/ERR527357/ERR527357_2.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR152/ERR152414/ERR152414_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR152/ERR152414/ERR152414_2.fastq.gz cd ..
Run the multi-step variant calling with the new data, but only to the step of sample variant calling:
ngs-pl run-multistep-variant-caller -o batch-2 --target sample_variant_calling reads-2/*.fastq.gzWait until the process finishes.
Run the joint-variant calling by combining the completed samples of
batch-1andbatch-2together:ngs-pl run-joint-variant-caller -o joint-batches --target concatenated_vcf batch-1/completed_samples batch-2/completed_samplesWait until the joint variant calling finishes.
Inspect the directory
joint-batches. The per-chromosome VCF files would be in thejoint-batches/vcfsdirectory, while the concatenated VCF file containing all chromosomes in a single file would bejoint-batches/concatenated.vcf.gz.
Congratulation! You now have sucessfully perform joint variant calling between 2 sample batches.
Working with SRA Data
For working with many published FASTQ read files from SRA databases (NCBI SRA or EMBL ENA), SRA-Repo can be used to help downloading and managing SRA read files.
This part of tutorial requires SRA-Repo to be installed.
Follow the installation step in SRA-Repo github repository to install it
properly.
Open a new terminal/shell and change to the the tutorial directory.
Generate a tab-delimited sample file named my-samples.tsv with the content
as follow:
SAMPLE COUNTRY SRA
PH0098-C C1 ERR216478,ERR490276
PY0074-C C2 ERR1138883
Note
The sample PH0098-C has 2 SRA ids (meaning it will be called using two
FASTQ paired reads).
When a sample has been sequenced in more than one lane/flowcell, or been
re-sequenced, all FASTQ pairs can be used within single analysis directly.
Activate SRA-Repo by activating its activation script, and fetch the SRA read
files in my-samples.tsv above:
<YOUR_SRA_REPO_INSTALLATION>/bin/activate
sra-repo.py fetch --ntasks 6 --samplefile my-samples.tsv:SAMPLE,SRA
The above command will download the SRA read files and store it inside the
SRA-Repo installation directory.
After the download finishes, link the SRA read files to a new directory and
generate a manifest file:
sra-repo.py link -o manifest-3.tsv --outdir reads-3 --samplefile my-samples.tsv:SAMPLE,SRA
Note
Inspect the content of my-samples.tsv, and familiarize with the format.
Each FASTQ read file is separated by comma, and different pairs for the same
sample are sepeareted by semi-colon.
In the terminal/shell with active NGS-Pipeline environment, perform sample variant calling:
ngs-pl run-multistep-variant-caller -o batch-3 --target sample_variant_calling -i manifest-3.tsv
Once the sample variant calling finishes, perform joint variant calling with the previous batches:
ngs-pl run-joint-variant-caller -o new-joint --target concatenated_vcf batch-1/completed_samples batch-2/completed_samples batch-3/completed_samples
Once the joint variant calling process finishes, inspect the result in the ``new-joint``directory.
Expoloring Further
To read more about NGS-Pipeline features, please consult the rest of the
documentation.