This repository was archived by the owner on Nov 29, 2021. It is now read-only.

Configuration

tdayris-perso edited this page Nov 21, 2019 · 1 revision

The general config file (yaml)

This is a YAML file. It could also be JSON formatted; however, its name must not be changed. YAML was preferred because users requested a more human-readable configuration file.

This configuration file contains the following sections (in any order):

ref

This is a very simple section with two self-explanatory values: the path to a fasta-formatted genome sequence, and a list of known SNP sites used by GATK to recalibrate base scores. The list can contain one or multiple files.

All of these files have to be indexed, with each index located alongside its file. The fasta file has to sit next to BOTH its dictionary AND its index.

Example:

ref:
  fasta: path/to/fasta.fa
  known:
    - /path/to/known.vcf
    - /path/to/other.vcf

These paths can be either relative or absolute.
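The layout requirement above can be checked before launching the pipeline. The following is a minimal sketch, not part of the pipeline itself: the helper names `expected_companions` and `missing_companions` are hypothetical, and the file naming is an assumption based on common conventions (`samtools faidx` writes `<fasta>.fai`, and the sequence dictionary usually replaces the fasta extension with `.dict`). Adjust it if your files are named differently.

```python
from pathlib import Path


def expected_companions(fasta):
    """Return the index and dictionary paths expected next to a fasta file.

    Assumed conventions (not stated on this page):
    - samtools faidx index: <fasta>.fai
    - sequence dictionary: fasta extension replaced by .dict
    """
    fasta = Path(fasta)
    return [
        fasta.with_name(fasta.name + ".fai"),
        fasta.with_suffix(".dict"),
    ]


def missing_companions(fasta):
    """List the expected companion files that do not exist on disk."""
    return [path for path in expected_companions(fasta) if not path.exists()]


if __name__ == "__main__":
    # Placeholder path taken from the example above.
    for path in missing_companions("path/to/fasta.fa"):
        print(f"Missing: {path}")
```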

params

This section contains additional command line arguments. Above all, do not try to modify threading options here. Do not try to modify logging or temporary directories either. These options are already handled by the pipeline.

We have the following values:

  • copy_extra: Extra parameters for bash cp
  • bwa_index_extra: Extra parameters for bwa index
  • bwa_map_extra: Extra parameters for bwa mem
  • picard_sort_sam_extra: Extra parameters for picard sort sam/bam
  • picard_group_extra: Extra parameters for picard add or replace groups
  • picard_dedup_extra: Extra parameters for picard mark duplicates
  • picard_isize_extra: Extra parameters for picard insert size summary
  • picard_summary_extra: Extra parameters for picard alignment summary
  • gatk_bqsr_extra: Extra parameters for GATK base score recalibration
  • samtools_view: Extra parameters for samtools view
  • samtools_fixmate_extra: Extra parameters for samtools fixmate
  • picard_sequence_dict_extra: Extra parameters for picard create sequence dictionary
  • samtools_faidx_extra: Extra parameters for samtools fasta index

Example:

params:
  copy_extra: "--parents --verbose"
  bwa_index_extra: ""
  bwa_map_extra: "-T 20 -M"
  picard_sort_sam_extra: ""
  picard_group_extra: "RGLB=standard RGPL=illumina RGPU={sample} RGSM={sample}"
  picard_dedup_extra: "REMOVE_DUPLICATES=true"
  picard_isize_extra: "METRIC_ACCUMULATION_LEVEL=SAMPLE"
  gatk_bqsr_extra: ""
  picard_summary_extra: ""
  samtools_view: "-b -h -F 12"
  samtools_fixmate_extra: "-c -m"
  picard_sequence_dict_extra: "GENOME_ASSEMBLY=GRCH38 SPECIES=HSA URI=https://www.gencodegenes.org/human/"
  samtools_faidx_extra: ""
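Since threading, logging and temporary directories must not be set in this section, it can help to scan the extra strings before a run. The snippet below is only a sketch under stated assumptions: the `RESERVED` table is hypothetical and lists just two well-known options (`-t` for bwa mem threads, `TMP_DIR` for Picard temporary directories). It is not the list actually enforced by the pipeline.

```python
import shlex

# Hypothetical table of options the pipeline already sets itself.
# Only two well-known examples are listed; this is not the pipeline's own list.
RESERVED = {
    "bwa_map_extra": {"-t"},            # bwa mem thread count
    "picard_dedup_extra": {"TMP_DIR"},  # Picard temporary directory
}


def reserved_options_used(key, extra):
    """Return the reserved options found in the extra string of a params key."""
    found = set()
    for token in shlex.split(extra):
        # Picard arguments look like NAME=value; keep only the name part.
        name = token.split("=", 1)[0]
        if name in RESERVED.get(key, set()):
            found.add(name)
    return found


print(reserved_options_used("bwa_map_extra", "-T 20 -M"))       # set()
print(reserved_options_used("bwa_map_extra", "-T 20 -M -t 8"))  # {'-t'}
```

The comparison is case sensitive, so the `-T 20` score threshold from the example above is not confused with the `-t` thread option.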

workflow

This part is used to activate or deactivate sections of the pipeline. For example, if you only want the mapping and no quality controls (you may have run them outside of the wes-mapping-bwa-gatk pipeline), then set these flags to false instead of true.

Warning: this section is case sensitive!

Example:

workflow:
  fastqc: true
  multiqc: true
  mapping_quality: true

General

The following parameters do not belong to any section. Keep them at the top level of the YAML file:

  • design: The path to the design file
  • workdir: The path to the working directory
  • threads: The maximum number of threads used
  • singularity_docker_image: The image used within singularity
  • cold_storage: A list of cold storage mount points

Example:

design: design.tsv
workdir: .
threads: 1
singularity_docker_image: docker://continuumio/miniconda3:4.4.10
cold_storage:
  - /media

Conclusion

A complete config.yaml file would look like this:

design: design.tsv
workdir: .
threads: 1
singularity_docker_image: docker://continuumio/miniconda3:4.4.10
cold_storage:
  - /media
ref:
  fasta: /path/to/genome/sequence.fa
  known:
    - /path/to/known.vcf
    - /path/to/other.known.vcf
workflow:
  fastqc: true
  multiqc: true
  mapping_quality: true
params:
  copy_extra: "--parents --verbose"
  bwa_index_extra: ""
  bwa_map_extra: "-T 20 -M"
  picard_sort_sam_extra: ""
  picard_group_extra: "RGLB=standard RGPL=illumina RGPU={sample} RGSM={sample}"
  picard_dedup_extra: "REMOVE_DUPLICATES=true"
  picard_isize_extra: "METRIC_ACCUMULATION_LEVEL=SAMPLE"
  gatk_bqsr_extra: ""
  picard_summary_extra: ""
  samtools_view: "-b -h -F 12"
  samtools_fixmate_extra: "-c -m"
  picard_sequence_dict_extra: "GENOME_ASSEMBLY=GRCH38 SPECIES=HSA URI=https://www.gencodegenes.org/human/"
  samtools_faidx_extra: ""
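As noted at the top of this page, the same content could also be JSON formatted. The sketch below shows an abbreviated JSON rendering of this example (the `params` section is shortened to one entry) and a simple check that the top-level keys described on this page are present. The names `EXPECTED_KEYS` and `missing_keys` are illustrative, and treating every key as mandatory is an assumption.

```python
import json

# Top-level keys and sections described on this page.
EXPECTED_KEYS = {
    "design", "workdir", "threads", "singularity_docker_image",
    "cold_storage", "ref", "workflow", "params",
}

# Abbreviated JSON rendering of the YAML example (params shortened).
CONFIG_JSON = """
{
  "design": "design.tsv",
  "workdir": ".",
  "threads": 1,
  "singularity_docker_image": "docker://continuumio/miniconda3:4.4.10",
  "cold_storage": ["/media"],
  "ref": {
    "fasta": "/path/to/genome/sequence.fa",
    "known": ["/path/to/known.vcf", "/path/to/other.known.vcf"]
  },
  "workflow": {"fastqc": true, "multiqc": true, "mapping_quality": true},
  "params": {"bwa_map_extra": "-T 20 -M"}
}
"""


def missing_keys(config):
    """Return the expected top-level keys absent from a parsed config."""
    return sorted(EXPECTED_KEYS - set(config))


config = json.loads(CONFIG_JSON)
print(missing_keys(config))  # []
```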

Remember, there is a Python script here to build this file from the command line!

The design file (tsv)

At this stage, we need a TSV file describing our analysis.

It must contain the following columns:

  • Sample_id: the name of each sample
  • Upstream_file: path to the upstream fastq file

The optional columns are:

  • Downstream_file: path to the downstream fastq file (usually your R2 in paired-end libraries)
  • Any other information
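To illustrate the columns listed above, here is a minimal sketch that parses a design file and fails early if a required column is missing. The sample names and fastq paths are placeholders, and `read_design` is a hypothetical helper, not the script shipped with the pipeline.

```python
import csv
import io

REQUIRED_COLUMNS = {"Sample_id", "Upstream_file"}

# Hypothetical design content; sample names and paths are placeholders.
DESIGN_TSV = (
    "Sample_id\tUpstream_file\tDownstream_file\n"
    "sampleA\tpath/to/sampleA_R1.fastq\tpath/to/sampleA_R2.fastq\n"
    "sampleB\tpath/to/sampleB_R1.fastq\tpath/to/sampleB_R2.fastq\n"
)


def read_design(handle):
    """Parse a design file and fail early if a required column is missing."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing column(s): {', '.join(sorted(missing))}")
    return list(reader)


rows = read_design(io.StringIO(DESIGN_TSV))
print([row["Sample_id"] for row in rows])  # ['sampleA', 'sampleB']
```

For a file on disk, the same function can be called with `open("design.tsv", newline="")` in place of the `io.StringIO` wrapper.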

Remember, there is a Python script here to build this file from the command line!
