Getting Started
===============

The document will work through setting up a pipeline to process *Plasmodium
vivax* sequence data using PvP01_v1 reference sequence and perform some
processing of *P vivax* sequence data.

.. note::

  Run this tutorial under a terminal multiplexer such as ``tmux`` or ``screen``
  when using a remote system to avoid termination of a running program if the
  network is disrupted.


Quick Installation
------------------

If vivaxGEN NGS-Pipeline has not been installed in your system, run the
following command to install it:

.. prompt:: bash

    "${SHELL}" <(curl -L https://raw.githubusercontent.com/vivaxgen/ngs-pipeline/main/install.sh)

Enter the target directory at the prompt (or just press Enter to use the
default).
The installation will take about 5-15 minutes depending on the internet speed
and will use about 5-6 GB of storage.
Take a note of the directory where the pipeline is installed and the full path
of its activation script.

.. warning::

  If installing under WSL/WSL2, please make sure that the target directory
  is under Linux (WSL) filesystems.
  The installation will fail if the target directory is under Windows
  filesystems, such as ``/mnt/c`` or ``/mnt/d``.

Try to activate the NGS-Pipeline environment by executing its activation script:

.. prompt:: bash

  YOUR_INSTALATION_DIRECTORY/bin/activate

.. note::

  When an environment is activated, the shell prompt will change to reflect
  the environment name.


Checking and Setting the Profile
--------------------------------

The installation process checks and sets the correct cluster profile for
Snakemake workflow if a workload manager/job scheduler is installed in
a HPC/cluster system.

If SLURM or PBSPro is installed in the system, the installation script will
try to set up a default setting.
Some HPC settings require additional flags or arguments for job submission, in
which case the flags/arguments can be supplied with environment variable
``SNAKEMAKE_CLUSTER_EXTRA_FLAGS``, by following instructions in
:ref:`cluster_extra_flags`.
If the installer can not detect existing workload manager, please follow
:ref:`profile_manual_setup` to manually set the cluster profile.

If the installer could not detect any workload managers/job schedulers, it will
set the profiler based on the available cores and memory of the system.

To see which profile set by the installer, activate the NGS-Pipeline environment
and check the link to the Snakefile profile by running the following command:

.. prompt:: bash

  ls -l $VVG_BASEDIR/etc/bashrc.d/99-snakemake-profile

The link should point to the correct profile.
If the link is not correct, adjust the link accordingly following the
instruction in :ref:`snakemake_profile_setting`.


Preparing Base Environment Directory
------------------------------------

For normal workflows, vivaxGEN NGS-Pipeline requires that all necessary files,
with the exception of the pipeline and all supporting software themselves, to
reside in a single hierarchical directory (unless the pipeline has been set up
with specific settings).
The base environment directory can be prepared with automatic method (with
preset settings) or manual method (with custom settings).

Regardless of whether using automatic or manual method, the following are
the mandatory steps to prepare the base environment directory:

#.  Activate the vivaxGEN NGS-Pipeline environment by running its activation
    script, as noted after the automatic installation finished, eg:

    .. prompt:: bash

          NGS-PIPELINE_INSTALL_DIR/activate

#.  Setup the base working directory, eg: ``/data/Pv-wgs/PvP01_v1``:

    .. prompt:: bash

          ngs-pl setup-base-directory /data/Pv-wgs/PvP01_v1
          cd /data/Pv-wgs/PvP01_v1

    .. tip::

      To easily identify and differentiate between several base environment
      directory, it is recommended to use the reference genome name as part
      of the directory name.

    .. warning::

      If running under WSL/WSL2, please make sure that the base environment
      directory is under Linux (WSL) filesystems.
      If the base environment directory is under Windows filesystems (such as
      ``/mnt/c`` or ``/mnt/d``), the pipeline might fail to work.

#.  Exit the current environment and activate the new environment using
    the new ``activate`` file:

    .. prompt:: bash

          exit
          /data/Pv-wgs/PvP01_v1/activate

    Once activated, the environment directory can be accessed using environment
    variable ``NGSENV_BASEDIR``.

To continue preparing the base enviroment directory with automatic method
using preset settings for *P vivax* with PvP01_v1 reference sequence, change to
base environment directory:

.. prompt:: bash
  
      cd $NGSENV_BASEDIR

If running in an HPC/cluster system or workstation/server with 16-core or more,
use the following command to setup the base environment directory with full
version of PvP01_v1 setting:

.. prompt:: bash

      bash <(curl -L https://raw.githubusercontent.com/vivaxgen/vgnpc-plasmodium-spp/main/Pvivax/PvP01_v1/setup.sh)

If running in a laptop or desktop with less than 16-core, use the following 
command to setup the base enviroment directory with lite version of PvP01_v1
setting:

.. prompt:: bash

      bash <(curl -L https://raw.githubusercontent.com/vivaxgen/vgnpc-plasmodium-spp/main/Pvivax/PvP01_v1/setup-lite.sh)

The full version setup will take some time as it needs to download both the
PvP01_v1 genome sequence (~ 23MB), human GRCh38.p14 genome (~ 928MB),
uncompress the human genome, and generate index file for both PvP01 and the
human genome sequences using ``bwa-mem2``.

The lite version setup will take less time as it only needs to download the
PvP01_v1 genome sequence (~ 23MB) and generate index file for only PvP01
sequences.

.. note::

  The vivaxGEN github repository provides the list of available preset
  settings.
  However, if none of the preset settings are suitable, then the setup can be
  continued using manual method following steps described in
  :doc:`setup-base-env-dir`.


Running the Multi-Step Mode
---------------------------

This section of the tutorial shows the use of ``run-multistep-variant-caller``
single command, which provides the simple and quick way to perform multi-step
mode of the variant calling.
In this section, 2 samples of *P vivax* WGS data will be processed to get the
final result as a concatenated VCF file (a single VCF file containing all
chromosomes).

.. tip::

  For larger number of samples, it is advisable to have the final result as
  multiple VCF files, each contains a specific chromosome, since the downstream
  analysis then can be performed individually on each chromosome in parallel to
  speed up the analysis.

#.  Activate the environment by executing the ``activate`` script if the
    environment has not been activated:

    .. prompt:: bash
      
      /data/Pv-wgs/PvP01_v1/activate

#.  Enter the directory for containing data sets, and create a new directory,
    and enter to the new directory:

    .. prompt:: bash

      cd $NGSENV_BASEDIR/sets
      mkdir my-tutorial
      cd my-tutorial

#.  Create a directory to hold the FASTQ read files:

    .. prompt:: bash

	    mkdir reads-1

#.  Download read files related to 2 *P. vivax* sequence data from ENA (note
    that for working with public SRA read files, consider using
    `SRA-Repo <https://github.com/vivaxgen/sra-repo>`_ to manage and
    automatically download the read files):

    .. prompt:: bash

      cd reads-1
      wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR111/ERR111714/ERR111714_1.fastq.gz
      wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR111/ERR111714/ERR111714_2.fastq.gz
      wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR113/004/ERR1138854/ERR1138854_1.fastq.gz
      wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR113/004/ERR1138854/ERR1138854_2.fastq.gz
      cd ..

#.  Run the multi-step mode variant calling process by executing this single
    command:

    .. prompt:: bash

      ngs-pl run-multistep-variant-caller -o batch-1 --paired reads-1/*.fastq.gz

    Wait until the process finishes.

#.  Inspect the ``batch-1`` directory by performing directorylisting:

    .. prompt:: bash

      ls batch-1

    The following is the layout of the output directory:

    ``analysis/``
      This directory contains sample directory, eg. each sample and their
      associated files (input/output/log) are in their own directory.

    ``completed_samples/``
      This directory contains symbolic links to samples in ``analysis``
      directory that have been successfully called.
      The joint variant calling is performed only on samples in this
      directory.

    ``failed_samples/``
      This directory contains symbolic links to samples in ``analysis``
      that are failed during individual sample calling process.

    ``joint/``
      This directory contains all files pertinent to joint variant calling
      process.

    ``joint/concatenated.vcf.gz``
      This is the concatenated VCF file from chromosome-based VCF files
      inside ``joint/vcfs`` directory.
      This file is only available with ``--target concatenated_vcf`` option.

    ``joint/vcfs/``
      The final output of the joint variant calling is the per-chromosome
      VCF files in this directory.

    ``metafile/``
      This directory contains metafiles necessary for performing the whole
      variant calling process.
      Currently it holds the manifest file describing the sample name and its
      associated read files.

    ``reports/``
      This directory contains consolidated report files from completed samples
      in the ``completed_samples`` directory.
      Currently, it holds ``maps/`` directory (which links to BAM files of each
      samples) and ``depth-base/`` directory (which links to depth files
      generated by sambamba).

    ``stats.tsv``
      This file contains the statistics of each step of the process.

Congratulation!
You have just performed your first variant calling analysis with the pipeline.
The main output file(s) of this whole variant calling process are VCF files
inside ``joint/vcfs`` and ``concatenated.vcf.gz``.


Performing Joint Calling with New Sample Batches
-----------------------------------------------

Now let assume that another batch of samples are available.
The following steps provide instructions to perform sample variant calling
and then do joint variant calling with the previous batch:

#.  Download read files related to another 2 of *P vivax* sequence data from
    SRA database:

    .. prompt:: bash

      mkdir reads-2
      cd reads-2
      wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR527/ERR527357/ERR527357_1.fastq.gz
      wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR527/ERR527357/ERR527357_2.fastq.gz
      wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR152/ERR152414/ERR152414_1.fastq.gz
      wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR152/ERR152414/ERR152414_2.fastq.gz
      cd ..

#.  Run the multi-step variant calling with the new data, but only to the step
    of sample variant calling:

    .. prompt:: bash

      ngs-pl run-multistep-variant-caller -o batch-2 --target sample_variant_calling reads-2/*.fastq.gz

    Wait until the process finishes.

#.  Run the joint-variant calling by combining the completed samples of
    ``batch-1`` and ``batch-2`` together:

    .. prompt:: bash

      ngs-pl run-joint-variant-caller -o joint-batches --target concatenated_vcf batch-1/completed_samples batch-2/completed_samples

    Wait until the joint variant calling finishes.

#.  Inspect the directory ``joint-batches``.
    The per-chromosome VCF files would be in the ``joint-batches/vcfs``
    directory, while the concatenated VCF file containing all chromosomes in
    a single file would be ``joint-batches/concatenated.vcf.gz``.

Congratulation!
You now have sucessfully perform joint variant calling between 2 sample batches.


Working with SRA Data
---------------------

For working with many published FASTQ read files from SRA databases (NCBI SRA
or EMBL ENA), `SRA-Repo <https://github.com/vivaxgen/sra-repo>`_ can be used to
help downloading and managing SRA read files.

This part of tutorial requires ``SRA-Repo`` to be installed.
Follow the installation step in ``SRA-Repo`` github repository to install it
properly.

Open a new terminal/shell and change to the the tutorial directory.
Generate a tab-delimited sample file named ``my-samples.tsv`` with the content
as follow:

.. code-block:: console

    SAMPLE      COUNTRY   SRA
    PH0098-C    C1        ERR216478,ERR490276
    PY0074-C    C2        ERR1138883

.. note::

  The sample ``PH0098-C`` has 2 SRA ids (meaning it will be called using two
  FASTQ paired reads).
  When a sample has been sequenced in more than one lane/flowcell, or been
  re-sequenced, all FASTQ pairs can be used within single analysis directly.

Activate SRA-Repo by activating its activation script, and fetch the SRA read
files in ``my-samples.tsv`` above:

.. prompt:: bash

    <YOUR_SRA_REPO_INSTALLATION>/bin/activate
    sra-repo.py fetch --ntasks 6 --samplefile my-samples.tsv:SAMPLE,SRA

The above command will download the SRA read files and store it inside the
``SRA-Repo`` installation directory.
After the download finishes, link the SRA read files to a new directory and
generate a manifest file:

.. prompt:: bash

    sra-repo.py link -o manifest-3.tsv --outdir reads-3 --samplefile my-samples.tsv:SAMPLE,SRA

.. note::

  Inspect the content of ``my-samples.tsv``, and familiarize with the format.
  Each FASTQ read file is separated by comma, and different pairs for the same
  sample are sepeareted by semi-colon.

In the terminal/shell with active NGS-Pipeline environment, perform sample
variant calling:

.. prompt:: bash

    ngs-pl run-multistep-variant-caller -o batch-3 --target sample_variant_calling -i manifest-3.tsv

Once the sample variant calling finishes, perform joint variant calling with the
previous batches:

.. prompt:: bash

    ngs-pl run-joint-variant-caller -o new-joint --target concatenated_vcf batch-1/completed_samples batch-2/completed_samples batch-3/completed_samples

Once the joint variant calling process finishes, inspect the result in the 
``new-joint``directory.


Expoloring Further
------------------

To read more about ``NGS-Pipeline`` features, please consult the rest of the
documentation.