Getting Started

The document will work through setting up a pipeline to process Plasmodium vivax sequence data using PvP01_v1 reference sequence and perform some processing of P vivax sequence data.

Note

Run this tutorial under a terminal multiplexer such as tmux or screen when using a remote system to avoid termination of a running program if the network is disrupted.

Quick Installation

If vivaxGEN NGS-Pipeline has not been installed in your system, run the following command to install it:

"${SHELL}" <(curl -L https://raw.githubusercontent.com/vivaxgen/ngs-pipeline/main/install.sh)

Enter the target directory at the prompt (or just press Enter to use the default). The installation will take about 5-15 minutes depending on the internet speed and will use about 5-6 GB of storage. Take a note of the directory where the pipeline is installed and the full path of its activation script.

Warning

If installing under WSL/WSL2, please make sure that the target directory is under Linux (WSL) filesystems. The installation will fail if the target directory is under Windows filesystems, such as /mnt/c or /mnt/d.

Try to activate the NGS-Pipeline environment by executing its activation script:

YOUR_INSTALATION_DIRECTORY/bin/activate

Note

When an environment is activated, the shell prompt will change to reflect the environment name.

Checking and Setting the Profile

The installation process checks and sets the correct cluster profile for Snakemake workflow if a workload manager/job scheduler is installed in a HPC/cluster system.

If SLURM or PBSPro is installed in the system, the installation script will try to set up a default setting. Some HPC settings require additional flags or arguments for job submission, in which case the flags/arguments can be supplied with environment variable SNAKEMAKE_CLUSTER_EXTRA_FLAGS, by following instructions in Adding Additional Flags for Job Submission. If the installer can not detect existing workload manager, please follow Manual Setup for Snakemake Cluster Profile to manually set the cluster profile.

If the installer could not detect any workload managers/job schedulers, it will set the profiler based on the available cores and memory of the system.

To see which profile set by the installer, activate the NGS-Pipeline environment and check the link to the Snakefile profile by running the following command:

ls -l $VVG_BASEDIR/etc/bashrc.d/99-snakemake-profile

The link should point to the correct profile. If the link is not correct, adjust the link accordingly following the instruction in Snakemake Profile Setting.

Preparing Base Environment Directory

For normal workflows, vivaxGEN NGS-Pipeline requires that all necessary files, with the exception of the pipeline and all supporting software themselves, to reside in a single hierarchical directory (unless the pipeline has been set up with specific settings). The base environment directory can be prepared with automatic method (with preset settings) or manual method (with custom settings).

Regardless of whether using automatic or manual method, the following are the mandatory steps to prepare the base environment directory:

  1. Activate the vivaxGEN NGS-Pipeline environment by running its activation script, as noted after the automatic installation finished, eg:

    NGS-PIPELINE_INSTALL_DIR/activate
    
  2. Setup the base working directory, eg: /data/Pv-wgs/PvP01_v1:

    ngs-pl setup-base-directory /data/Pv-wgs/PvP01_v1
    cd /data/Pv-wgs/PvP01_v1
    

    Tip

    To easily identify and differentiate between several base environment directory, it is recommended to use the reference genome name as part of the directory name.

    Warning

    If running under WSL/WSL2, please make sure that the base environment directory is under Linux (WSL) filesystems. If the base environment directory is under Windows filesystems (such as /mnt/c or /mnt/d), the pipeline might fail to work.

  3. Exit the current environment and activate the new environment using the new activate file:

    exit
    /data/Pv-wgs/PvP01_v1/activate
    

    Once activated, the environment directory can be accessed using environment variable NGSENV_BASEDIR.

To continue preparing the base enviroment directory with automatic method using preset settings for P vivax with PvP01_v1 reference sequence, change to base environment directory:

cd $NGSENV_BASEDIR

If running in an HPC/cluster system or workstation/server with 16-core or more, use the following command to setup the base environment directory with full version of PvP01_v1 setting:

bash <(curl -L https://raw.githubusercontent.com/vivaxgen/vgnpc-plasmodium-spp/main/Pvivax/PvP01_v1/setup.sh)

If running in a laptop or desktop with less than 16-core, use the following command to setup the base enviroment directory with lite version of PvP01_v1 setting:

bash <(curl -L https://raw.githubusercontent.com/vivaxgen/vgnpc-plasmodium-spp/main/Pvivax/PvP01_v1/setup-lite.sh)

The full version setup will take some time as it needs to download both the PvP01_v1 genome sequence (~ 23MB), human GRCh38.p14 genome (~ 928MB), uncompress the human genome, and generate index file for both PvP01 and the human genome sequences using bwa-mem2.

The lite version setup will take less time as it only needs to download the PvP01_v1 genome sequence (~ 23MB) and generate index file for only PvP01 sequences.

Note

The vivaxGEN github repository provides the list of available preset settings. However, if none of the preset settings are suitable, then the setup can be continued using manual method following steps described in Setting Up Base Enviroment Directory.

Running the Multi-Step Mode

This section of the tutorial shows the use of run-multistep-variant-caller single command, which provides the simple and quick way to perform multi-step mode of the variant calling. In this section, 2 samples of P vivax WGS data will be processed to get the final result as a concatenated VCF file (a single VCF file containing all chromosomes).

Tip

For larger number of samples, it is advisable to have the final result as multiple VCF files, each contains a specific chromosome, since the downstream analysis then can be performed individually on each chromosome in parallel to speed up the analysis.

  1. Activate the environment by executing the activate script if the environment has not been activated:

    /data/Pv-wgs/PvP01_v1/activate
    
  2. Enter the directory for containing data sets, and create a new directory, and enter to the new directory:

    cd $NGSENV_BASEDIR/sets
    mkdir my-tutorial
    cd my-tutorial
    
  3. Create a directory to hold the FASTQ read files:

    mkdir reads-1
    
  4. Download read files related to 2 P. vivax sequence data from ENA (note that for working with public SRA read files, consider using SRA-Repo to manage and automatically download the read files):

    cd reads-1
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR111/ERR111714/ERR111714_1.fastq.gz
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR111/ERR111714/ERR111714_2.fastq.gz
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR113/004/ERR1138854/ERR1138854_1.fastq.gz
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR113/004/ERR1138854/ERR1138854_2.fastq.gz
    cd ..
    
  5. Run the multi-step mode variant calling process by executing this single command:

    ngs-pl run-multistep-variant-caller -o batch-1 --paired reads-1/*.fastq.gz
    

    Wait until the process finishes.

  6. Inspect the batch-1 directory by performing directorylisting:

    ls batch-1
    

    The following is the layout of the output directory:

    analysis/

    This directory contains sample directory, eg. each sample and their associated files (input/output/log) are in their own directory.

    completed_samples/

    This directory contains symbolic links to samples in analysis directory that have been successfully called. The joint variant calling is performed only on samples in this directory.

    failed_samples/

    This directory contains symbolic links to samples in analysis that are failed during individual sample calling process.

    joint/

    This directory contains all files pertinent to joint variant calling process.

    joint/concatenated.vcf.gz

    This is the concatenated VCF file from chromosome-based VCF files inside joint/vcfs directory. This file is only available with --target concatenated_vcf option.

    joint/vcfs/

    The final output of the joint variant calling is the per-chromosome VCF files in this directory.

    metafile/

    This directory contains metafiles necessary for performing the whole variant calling process. Currently it holds the manifest file describing the sample name and its associated read files.

    reports/

    This directory contains consolidated report files from completed samples in the completed_samples directory. Currently, it holds maps/ directory (which links to BAM files of each samples) and depth-base/ directory (which links to depth files generated by sambamba).

    stats.tsv

    This file contains the statistics of each step of the process.

Congratulation! You have just performed your first variant calling analysis with the pipeline. The main output file(s) of this whole variant calling process are VCF files inside joint/vcfs and concatenated.vcf.gz.

Performing Joint Calling with New Sample Batches

Now let assume that another batch of samples are available. The following steps provide instructions to perform sample variant calling and then do joint variant calling with the previous batch:

  1. Download read files related to another 2 of P vivax sequence data from SRA database:

    mkdir reads-2
    cd reads-2
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR527/ERR527357/ERR527357_1.fastq.gz
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR527/ERR527357/ERR527357_2.fastq.gz
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR152/ERR152414/ERR152414_1.fastq.gz
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR152/ERR152414/ERR152414_2.fastq.gz
    cd ..
    
  2. Run the multi-step variant calling with the new data, but only to the step of sample variant calling:

    ngs-pl run-multistep-variant-caller -o batch-2 --target sample_variant_calling reads-2/*.fastq.gz
    

    Wait until the process finishes.

  3. Run the joint-variant calling by combining the completed samples of batch-1 and batch-2 together:

    ngs-pl run-joint-variant-caller -o joint-batches --target concatenated_vcf batch-1/completed_samples batch-2/completed_samples
    

    Wait until the joint variant calling finishes.

  4. Inspect the directory joint-batches. The per-chromosome VCF files would be in the joint-batches/vcfs directory, while the concatenated VCF file containing all chromosomes in a single file would be joint-batches/concatenated.vcf.gz.

Congratulation! You now have sucessfully perform joint variant calling between 2 sample batches.

Working with SRA Data

For working with many published FASTQ read files from SRA databases (NCBI SRA or EMBL ENA), SRA-Repo can be used to help downloading and managing SRA read files.

This part of tutorial requires SRA-Repo to be installed. Follow the installation step in SRA-Repo github repository to install it properly.

Open a new terminal/shell and change to the the tutorial directory. Generate a tab-delimited sample file named my-samples.tsv with the content as follow:

SAMPLE      COUNTRY   SRA
PH0098-C    C1        ERR216478,ERR490276
PY0074-C    C2        ERR1138883

Note

The sample PH0098-C has 2 SRA ids (meaning it will be called using two FASTQ paired reads). When a sample has been sequenced in more than one lane/flowcell, or been re-sequenced, all FASTQ pairs can be used within single analysis directly.

Activate SRA-Repo by activating its activation script, and fetch the SRA read files in my-samples.tsv above:

<YOUR_SRA_REPO_INSTALLATION>/bin/activate
sra-repo.py fetch --ntasks 6 --samplefile my-samples.tsv:SAMPLE,SRA

The above command will download the SRA read files and store it inside the SRA-Repo installation directory. After the download finishes, link the SRA read files to a new directory and generate a manifest file:

sra-repo.py link -o manifest-3.tsv --outdir reads-3 --samplefile my-samples.tsv:SAMPLE,SRA

Note

Inspect the content of my-samples.tsv, and familiarize with the format. Each FASTQ read file is separated by comma, and different pairs for the same sample are sepeareted by semi-colon.

In the terminal/shell with active NGS-Pipeline environment, perform sample variant calling:

ngs-pl run-multistep-variant-caller -o batch-3 --target sample_variant_calling -i manifest-3.tsv

Once the sample variant calling finishes, perform joint variant calling with the previous batches:

ngs-pl run-joint-variant-caller -o new-joint --target concatenated_vcf batch-1/completed_samples batch-2/completed_samples batch-3/completed_samples

Once the joint variant calling process finishes, inspect the result in the ``new-joint``directory.

Expoloring Further

To read more about NGS-Pipeline features, please consult the rest of the documentation.