Settings and Configuration

Environment Variables

The NGS-Pipeline activation script sets the following environment variables in the activated shell:

  • VVG_BASEDIR

  • VVG_REPODIR

  • VVGBIN

  • NGS_PIPELINE_BASE

  • NGSENV_BASEDIR

Users can set some environment variables in a profile file under the $VVG_BASEDIR/etc/bashrc.d/ directory, in $NGSENV_BASEDIR/profile, or directly from the shell using the notation ENVVAR_NAME=value.

The environment variables are listed below:

Environment Variables

  Variable name                     Remarks
  --------------------------------  -------
  NGS_PIPELINE_LOGLEVEL
  NGS_PIPELINE_LOGFILE
  NGS_PIPELINE_JOBS
  NGS_PIPELINE_FORCE
  NGS_PIPELINE_NO_CONFIG_CASCADE
  NGS_PIPELINE_CMD_MODS
  SNAKEMAKE_CLUSTER_EXTRA_FLAGS

Configuration Files

Users can set custom configuration in YAML-format config files, usually named config.yaml.

Cascading Configuration

By default, NGS-Pipeline applies cascading configuration. When a snakemake workflow process is executed, it reads the config.yaml file in the current working directory, if that file exists. It then looks for a config.yaml file in the parent directory, then in the parent of the parent directory, and so on until it reaches NGSENV_BASEDIR, the root of the hierarchy. Settings from a config.yaml farther from the current working directory are overridden by settings from a config.yaml closer to it (up to the working directory itself). This scheme makes it easy to define configuration for a whole project and then customize it for particular sample sets, down to individual samples.
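The lookup described above can be sketched in Python as follows. This is a minimal illustration, not the NGS-Pipeline implementation: the function name cascade_config_paths and its parameters are hypothetical, and the sketch only collects the config.yaml files that exist between the working directory and the base directory.

```python
from pathlib import Path


def cascade_config_paths(workdir, basedir, name="config.yaml"):
    """Return existing config files ordered from basedir down to workdir.

    Hypothetical helper: the list is ordered farthest-first, so that
    later entries are meant to override earlier ones when merged.
    """
    workdir = Path(workdir).resolve()
    basedir = Path(basedir).resolve()
    if workdir != basedir and basedir not in workdir.parents:
        raise ValueError(f"{workdir} is not inside {basedir}")

    found = []
    current = workdir
    while True:
        candidate = current / name
        if candidate.is_file():
            found.append(candidate)
        if current == basedir:
            # Stop at the root of the hierarchy (NGSENV_BASEDIR in the text).
            break
        current = current.parent

    # The walk went closest-first; reverse so the base config comes first.
    return list(reversed(found))
```

Directories without a config.yaml are simply skipped, which matches the "if the file exists" condition in the description.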

To illustrate how the cascading configuration works, assume that we have the following directory layout:

NGSENV_BASEDIR
├── config.yaml (1)
└── sets
    ├── clinical-samples
    │   ├── analysis
    │   │   └── samples
    │   │       ├── patient-001
    │   │       │   └── config.yaml (2)
    │   │       ├── patient-002
    │   │       └── patient-003
    │   └── config.yaml (3)
    ├── joint-varcall
    │   └── config.yaml (4)
    ├── public-samples
    │   └── analysis
    │       └── samples
    │           ├── P0001
    │           └── P0002
    └── study-A
        ├── analysis
        │   └── samples
        │       ├── A001
        │       └── A002
        └── config.yaml (5)

When the run-sample-variant-caller command is executed to perform per-sample processing (mapping, genotyping) on samples in the clinical-samples, public-samples and study-A directories, a snakemake workflow is run for each sample, with the respective sample directory as the working directory. For the patient-001 sample, the workflow encounters config.yaml (2), then config.yaml (3), and finally the base config.yaml (1). The settings in config.yaml (1) are overridden by any settings in config.yaml (3), which in turn are overridden by any settings in config.yaml (2).

Likewise, for sample A001, the applied settings are those from config.yaml (1), overridden by any settings in config.yaml (5). For sample P0001, only the settings from config.yaml (1) apply.

With this scheme, it is easy to set up general settings for all samples in config.yaml (1), custom settings for the in-house-sequenced study-A samples (such as keeping the proper-paired bam files for SRA submission) in config.yaml (5), custom settings for all clinical samples (such as keeping the final bam files for manual inspection) in config.yaml (3), and specific settings for sample patient-001 alone (such as lowering some thresholds because the sample is of lower quality) in config.yaml (2).
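The override order for the samples in the layout above can be traced with a short Python sketch. The file contents here are illustrative assumptions, not the real configuration: only the setting names deduplicate, keep_paired_bam and keep_final_bam come from this document, and the override of deduplicate in config.yaml (2) merely stands in for a per-sample tweak. The sketch also assumes a simple top-level merge in which a closer file replaces whole keys; the actual merge behavior of NGS-Pipeline may differ.

```python
def merge_cascade(configs):
    """Merge config dicts given farthest-first; later dicts win.

    Hypothetical helper assuming a shallow, top-level key override.
    """
    merged = {}
    for cfg in configs:
        merged.update(cfg)
    return merged


# Illustrative contents only; the real files may hold other settings.
config_1 = {"deduplicate": True, "keep_paired_bam": False, "keep_final_bam": False}
config_3 = {"keep_final_bam": True}    # all clinical samples
config_2 = {"deduplicate": False}      # patient-001 only (stand-in override)
config_5 = {"keep_paired_bam": True}   # study-A samples

# patient-001: (1) is overridden by (3), which is overridden by (2).
patient_001 = merge_cascade([config_1, config_3, config_2])

# A001: (1) is overridden by (5).
a001 = merge_cascade([config_1, config_5])

# P0001: only (1) applies.
p0001 = merge_cascade([config_1])

print(patient_001)
print(a001)
print(p0001)
```

Passing the dictionaries farthest-first keeps the rule easy to check: whichever file sits closest to the sample directory is applied last and therefore wins.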

Cascading configuration can be disabled with the --no-config-cascade argument, which is available in most NGS-Pipeline commands.

Configurations

Configurations to select workflows

  Config Name              Remarks  Default Value          Available Values
  -----------------------  -------  ---------------------  --------------------------
  read_trimmer_wf                   ssf_trimmer_null.smk   ssf_trimmer_fastp.smk
                                                           ssf_trimmer_cutadapt.smk
  reads_mapper_wf                   ssf_mapper_bwa.smk     ssf_mapper_minimap2.smk
                                                           ssf_mapper_bowtie2.smk
  variant_caller_wf                 ssf_varcall_gatk.smk
  joint_variant_caller_wf           jointvarcall_gatk.smk  jointvarcall_freebayes.smk
                                                           jointvarcall_bcftools.smk
                                                           jointvarcall_clair3.smk

Configuration for map processing

  Config Name      Remarks  Default Value  Available Values
  ---------------  -------  -------------  ----------------
  refseq_file
  refmap_file
  deduplicate               True           False
  keep_paired_bam           False          True
  keep_final_bam            False          True