Configuration

The general config file (yaml)

This is a YAML file. It may also be JSON formatted, but its name must not be changed. YAML was preferred because users requested a more human-readable configuration file.

This configuration file contains the following sections (in any order):

ref

This is a very simple section with two self-explanatory values: the path to a FASTA-formatted genome sequence, and a list of known SNP sites used by GATK to recalibrate base scores. The known sites can be given as one or multiple files.

All of these files have to be indexed, with each index stored alongside its file. The FASTA file has to sit next to BOTH its dictionary AND its index.

Example:

ref:
  fasta: path/to/fasta.fa
  known:
    - /path/to/known.vcf
    - /path/to/other.vcf

These paths can be either relative or absolute.
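As a quick pre-flight check, the requirement that the FASTA file sits next to its index and dictionary can be sketched in Python. This is a minimal sketch, not part of the pipeline: the function names are hypothetical, and it assumes the usual naming conventions (`<fasta>.fai` as written by samtools faidx, and `<fasta stem>.dict` for the Picard sequence dictionary). Adjust the names if your files follow another convention.

```python
from pathlib import Path


def expected_sidecars(fasta):
    """Return the index and dictionary paths expected next to a FASTA file.

    Assumption: the index is '<fasta>.fai' and the dictionary is
    '<fasta stem>.dict'; the pipeline itself may expect other names.
    """
    fasta = Path(fasta)
    return [Path(str(fasta) + ".fai"), fasta.with_suffix(".dict")]


def missing_sidecars(fasta):
    """List the expected sidecar files that do not exist on disk."""
    return [path for path in expected_sidecars(fasta) if not path.exists()]
```

Calling `missing_sidecars("path/to/fasta.fa")` returns an empty list when both files are in place, and the list of missing paths otherwise.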

params

This section contains additional command line arguments. Once and for all: do not try to modify threading options here, and do not try to modify logging or temporary directories either. These options are already handled.

We have the following values:

copy_extra: Extra parameters for bash cp
bwa_index_extra: Extra parameters for bwa index
bwa_map_extra: Extra parameters for bwa mem
picard_sort_sam_extra: Extra parameters for picard sort sam/bam
picard_group_extra: Extra parameters for picard add or replace groups
picard_dedup_extra: Extra parameters for picard mark duplicates
picard_isize_extra: Extra parameters for picard insert size summary
picard_summary_extra: Extra parameters for picard alignment summary
gatk_bqsr_extra: Extra parameters for GATK base quality score recalibration
samtools_view: Extra parameters for samtools view
samtools_fixmate_extra: Extra parameters for samtools fixmate
picard_sequence_dict_extra: Extra parameters for picard create sequence dictionary
samtools_faidx_extra: Extra parameters for samtools fasta index

Example:

params:
  copy_extra: "--parents --verbose"
  bwa_index_extra: ""
  bwa_map_extra: "-T 20 -M"
  picard_sort_sam_extra: ""
  picard_group_extra: "RGLB=standard RGPL=illumina RGPU={sample} RGSM={sample}"
  picard_dedup_extra: "REMOVE_DUPLICATES=true"
  picard_isize_extra: "METRIC_ACCUMULATION_LEVEL=SAMPLE"
  gatk_bqsr_extra: ""
  picard_summary_extra: ""
  samtools_view: "-b -h -F 12"
  samtools_fixmate_extra: "-c -m"
  picard_sequence_dict_extra: "GENOME_ASSEMBLY=GRCH38 SPECIES=HSA URI=https://www.gencodegenes.org/human/"
  samtools_faidx_extra: ""
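Since threading, logging and temporary directories must not be set in this section, it can help to scan the extra strings before launching the pipeline. The sketch below is illustrative only: the `RESERVED` table and the function name are assumptions, built from the well-known thread flags of bwa mem (`-t`) and samtools (`-@`, `--threads`) and Picard's `TMP_DIR` argument. The pipeline may reserve a different set of options.

```python
import shlex

# Illustrative only: options assumed to be managed by the pipeline itself.
# The real list of reserved options may differ.
RESERVED = {
    "bwa_map_extra": ("-t",),
    "samtools_view": ("-@", "--threads"),
    "samtools_fixmate_extra": ("-@", "--threads"),
    "picard_dedup_extra": ("TMP_DIR",),
}


def reserved_hits(params):
    """Return (key, token) pairs where an extra string sets a reserved option."""
    hits = []
    for key, flags in RESERVED.items():
        for token in shlex.split(params.get(key, "")):
            # Picard-style arguments look like NAME=value; keep the name only.
            name = token.split("=", 1)[0]
            if name in flags:
                hits.append((key, token))
    return hits
```

With the example above, `reserved_hits` returns an empty list, because `-T` (bwa mem score threshold) is not the same flag as `-t`. The check is case sensitive and only catches flags written as separate tokens.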

workflow

This part is used to activate or deactivate sections of the pipeline. For instance, if you only want the core analysis and no quality control (you may already have done it outside of the wes-mapping-bwa-gatk pipeline), then set these flags to false instead of true.

Warning: this is case sensitive!

Example:

workflow:
  fastqc: true
  multiqc: true
  mapping_quality: true
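Once the configuration is loaded into a Python dictionary, the flags can be checked to be real booleans rather than strings such as "True". This is a minimal sketch under assumptions: the function name is hypothetical, and treating a missing flag as an error is a choice made here, not documented pipeline behaviour.

```python
WORKFLOW_KEYS = ("fastqc", "multiqc", "mapping_quality")


def check_workflow(workflow):
    """Return error messages for missing or non-boolean workflow flags."""
    errors = []
    for key in WORKFLOW_KEYS:
        if key not in workflow:
            # Assumption: every flag is expected to be present.
            errors.append(f"missing flag: {key}")
        elif not isinstance(workflow[key], bool):
            errors.append(f"{key} must be true or false, got {workflow[key]!r}")
    return errors
```

An empty list means all three flags are present and boolean.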

General

The following parameters do not belong to any section; keep them at the top level of the YAML file:

design: The path to the design file
workdir: The path to the working directory
threads: The maximum number of threads used
singularity_docker_image: The image used within singularity
cold_storage: A list of cold storage mount points

Example:

design: design.tsv
workdir: .
threads: 1
singularity_docker_image: docker://continuumio/miniconda3:4.4.10
cold_storage:
  - /media

Conclusion

A complete config.yaml file would look like this:

design: design.tsv
workdir: .
threads: 1
singularity_docker_image: docker://continuumio/miniconda3:4.4.10
cold_storage:
  - /media
ref:
  fasta: /path/to/genome/sequence.fa
  known:
    - /path/to/known.vcf
    - /path/to/other.known.vcf
workflow:
  fastqc: true
  multiqc: true
  mapping_quality: true
params:
  copy_extra: "--parents --verbose"
  bwa_index_extra: ""
  bwa_map_extra: "-T 20 -M"
  picard_sort_sam_extra: ""
  picard_group_extra: "RGLB=standard RGPL=illumina RGPU={sample} RGSM={sample}"
  picard_dedup_extra: "REMOVE_DUPLICATES=true"
  picard_isize_extra: "METRIC_ACCUMULATION_LEVEL=SAMPLE"
  gatk_bqsr_extra: ""
  picard_summary_extra: ""
  samtools_view: "-b -h -F 12"
  samtools_fixmate_extra: "-c -m"
  picard_sequence_dict_extra: "GENOME_ASSEMBLY=GRCH38 SPECIES=HSA URI=https://www.gencodegenes.org/human/"
  samtools_faidx_extra: ""

Remember, there is a Python script here to build this file from the command line!
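Independently of that script, the same structure can be produced from a Python dictionary. The sketch below is not the pipeline's own tool: it relies on the statement at the top of this page that the file may be JSON formatted as long as its name stays unchanged, and the function name `write_config` is hypothetical. The params section is shortened for brevity.

```python
import json
from pathlib import Path

# Same keys as the complete example; params is shortened for brevity.
config = {
    "design": "design.tsv",
    "workdir": ".",
    "threads": 1,
    "singularity_docker_image": "docker://continuumio/miniconda3:4.4.10",
    "cold_storage": ["/media"],
    "ref": {
        "fasta": "/path/to/genome/sequence.fa",
        "known": ["/path/to/known.vcf", "/path/to/other.known.vcf"],
    },
    "workflow": {"fastqc": True, "multiqc": True, "mapping_quality": True},
    "params": {"copy_extra": "--parents --verbose", "bwa_map_extra": "-T 20 -M"},
}


def write_config(config, directory="."):
    """Write the configuration as JSON text under the fixed name config.yaml."""
    path = Path(directory) / "config.yaml"
    path.write_text(json.dumps(config, indent=2) + "\n")
    return path
```

Calling `write_config(config)` creates `config.yaml` in the current directory.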

The design file (tsv)

At this stage, we need a TSV file describing our analysis.

It must contain the following columns:

Sample_id: the name of each sample
Upstream_file: path to the upstream fastq file

The optional columns are:

Downstream_file: path to the downstream fastq file (usually your R2 in paired-end libraries)
Any other information

Remember, there is a Python script here to build this file from the command line!
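To illustrate the layout described above, here is a minimal sketch that parses a design file and checks its required columns with Python's standard csv module. The sample name, the file names and the function name `read_design` are made up for the example.

```python
import csv
import io

REQUIRED = ("Sample_id", "Upstream_file")


def read_design(handle):
    """Parse a tab-separated design file and return its rows as dictionaries."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [col for col in REQUIRED if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required column(s): {', '.join(missing)}")
    return list(reader)


# Hypothetical paired-end sample; Downstream_file is optional.
example = (
    "Sample_id\tUpstream_file\tDownstream_file\n"
    "S1\tS1_R1.fastq.gz\tS1_R2.fastq.gz\n"
)
rows = read_design(io.StringIO(example))
```

For a real file, pass an open handle instead, for example `read_design(open("design.tsv", newline=""))`.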
