Settings and Configuration
==========================


Environment Variables
---------------------

The activation script of NGS-Pipeline sets several environement variables
in the activated shell as following:

- VVG_BASEDIR
- VVG_REPODIR
- VVGBIN
- NGS_PIPELINE_BASE
- NGSENV_BASEDIR

Users can set some environment variables either in a profile file under
``$VVG_BASEDIR/etc/bashrc.d/`` directory, in ``$NGSENV_BASEDIR/profile``, or
directly from the shell using notation ``ENVVAR_NAME=value``.

The following are the list of the environment variables:

.. list-table:: Environment Variables
    :header-rows: 1

    * - Variable name
      - Remarks
    * - NGS_PIPELINE_LOGLEVEL
      -
    * - NGS_PIPELINE_LOGFILE
      -
    * - NGS_PIPELINE_JOBS
      -
    * - NGS_PIPELINE_FORCE
      -
    * - NGS_PIPELINE_NO_CONFIG_CASCADE
      -
    * - NGS_PIPELINE_CMD_MODS
      -
    * - SNAKEMAKE_CLUSTER_EXTRA_FLAGS
      -


Configuration Files
-------------------

User can set custom configuration in YAML-format config files, usually named
as ``config.yaml``.


Cascading Configuration
~~~~~~~~~~~~~~~~~~~~~~~

In default mode, NGS-Pipeline will perform cascading configuration setting.
Essentially, when a snakemake workflow process is being executed, the process
will read for a ``config.yaml`` file in current working directory if the file
exists.
The workflow process will try to read another ``config.yaml`` file in the
parent directory, and the parent of the parent directory, and so on until it
reaches the NGSENV_BASEDIR as the root of the hieararchy.
Settings from ``config.yaml`` in the directory farther to the current working
directory will be overridden by the settings from ``config.yaml`` in the
directory closer to the working directory (up to the working directory itself).
With this scheme, it is easy to arrange configurations applied to whole
project, but then customized for certain sample sets down to individual sample.

To illustrate how the cascading configuration works, assume that we have the
following directory layout:

.. code-block:: console

    NGSENV_BASEDIR
    ├── config.yaml (1)
    └── sets
        ├── clinical-samples
        │   ├── analysis
        │   │   └── samples
        │   │       ├── patient-001
        │   │       │   └── config.yaml (2)
        │   │       ├── patient-002
        │   │       └── patient-003
        │   └── config.yaml (3)
        ├── joint-varcall
        │   └── config.yaml (4)
        ├── public-samples
        │   └── analysis
        │       └── samples
        │           ├── P0001
        │           └── P0002
        └── study-A
            ├── analysis
            │   └── samples
            │       ├── A001
            │       └── A002
            └── config.yaml (5)

When ``run-sample-variant-caller`` command is executed to perform per sample
processing (mapping, genotyping) to samples in the ``clinical-samples``,
``public-samples`` and ``study-A`` directories, a snakemake workflow is being
run for each sample with the respective sample directory as working directory.
For ``patient-001`` sample, the workflow will encounter ``config.yaml (2)``,
then ``config.yaml (3)``, and the base ``config.yaml (1)``.
The settings in the ``config.yaml (1)`` will be overridden with any settings in
``config.yaml (3)``, which then will be overridden by any settings in
``config.yaml (2)``.

Likewise, for sample ``A001``, the applied settings will be those from
``config.yaml (1)`` which will then be overriden by ``config.yaml (5)``.
For sample ``P0001``, the settings will only use the ones from
``config.yaml (1)``.

With this scheme, it is easy to setup general settings for all samples in
``config.yaml (1)``, set some custom settings for in-house-sequenced
``study-A`` samples (such as keeping the proper-paired bam files for SRA
submission) in ``config.yaml (5)``, set some custom settings for all clinical
samples (such as keeping the final bam files for manual inspection) in
``config.yaml (3)`` and set specific settings for just sample ``patient-001``
(such as lowering some thresholds as the sample is of lower quality) in
``config.yaml (2)``.

The cascading configuration can be opted out by using ``--no-config-cascade``
argument in most of NGS-Pipeline commands.


Configurations
~~~~~~~~~~~~~~

.. list-table:: Configurations to select workflows
    :header-rows: 1

    * - Config Name
      - Remarks
      - Default Value
      - Available Values
    * - read_trimmer_wf
      -
      - ssf_trimmer_null.smk
      - ssf_trimmer_fastp.smk ssf_trimmer_cutadapt.smk
    * - reads_mapper_wf
      -
      - ssf_mapper_bwa.smk
      - ssf_mapper_minimap2.smk ssf_mapper_bowtie2.smk
    * - variant_caller_wf
      -
      - ssf_varcall_gatk.smk
      -
    * - joint_variant_caller_wf
      -
      - jointvarcall_gatk.smk
      - jointvarcall_freebayes.smk jointvarcall_bcftools.smk jointvarcall_clair3.smk

.. list-table:: Configuration for map processing
    :header-rows: 1

    * - Config Name
      - Remarks
      - Default Value
      - Available Values
    * - refseq_file
      -
      -
      -
    * - refmap_file
      -
      -
      -
    * - deduplicate
      -
      - True
      - False
    * - keep_paired_bam
      -
      - False
      - True
    * - keep_final_bam
      -
      - False
      - True