Configuration
This is a YAML file. It could also be JSON-formatted, but its name must not be changed. YAML was preferred because users requested a more human-readable configuration file.
This configuration file contains the following sections, in any order: ref, params and workflow, plus a few top-level parameters.
The ref section is very simple, with two self-explanatory values: fasta, the path to a FASTA-formatted genome sequence, and known, a list of known SNP sites used by GATK to recalibrate base scores. The list can contain one or multiple files.
All of these files must be indexed, with each index stored alongside the file it describes. The FASTA file must sit next to BOTH its sequence dictionary AND its index.
Example:
ref:
  fasta: path/to/fasta.fa
  known:
    - /path/to/known.vcf
    - /path/to/other.vcf
These paths can be either relative or absolute.
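The "indexes alongside" requirement can be checked before launching the pipeline. The sketch below is not part of the pipeline; the function name `missing_companions` is hypothetical, and the companion file names (`.fai` appended to the FASTA name, `.dict` replacing its extension, `.idx` or `.tbi` appended to each VCF name) are assumptions based on the usual samtools, Picard, GATK and tabix conventions.

```python
from pathlib import Path


def missing_companions(fasta, known):
    """Return the expected index/dictionary files that do not exist."""
    fasta = Path(fasta)
    expected = [
        Path(str(fasta) + ".fai"),   # assumed name of the samtools faidx index
        fasta.with_suffix(".dict"),  # assumed name of the Picard dictionary
    ]
    for vcf in map(Path, known):
        # Assumed index names: .tbi for compressed VCF, .idx otherwise.
        suffix = ".tbi" if vcf.name.endswith(".gz") else ".idx"
        expected.append(Path(str(vcf) + suffix))
    return [str(path) for path in expected if not path.exists()]
```

An empty list means every expected companion file was found next to its reference file.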
The params section contains additional command-line arguments. Once and for all: do not try to modify threading options here, and do not try to modify logging or temporary directories either. These options are already handled by the pipeline.
We have the following values:
- copy_extra: Extra parameters for bash cp
- bwa_index_extra: Extra parameters for bwa index.
- bwa_map_extra: Extra parameters for bwa mem
- picard_sort_sam_extra: Extra parameters for picard sort sam/bam
- picard_group_extra: Extra parameters for picard add or replace groups
- picard_dedup_extra: Extra parameters for picard mark duplicates
- picard_isize_extra: Extra parameters for picard insert size summary
- picard_summary_extra: Extra parameters for picard alignment summary
- gatk_bqsr_extra: Extra parameters for GATK base quality score recalibration
- samtools_view: Extra parameters for samtools view
- samtools_fixmate_extra: Extra parameters for samtools fixmate
- picard_sequence_dict_extra: Extra parameters for picard create sequence dictionary
- samtools_faidx_extra: Extra parameters for samtools fasta index
Example:
params:
  copy_extra: "--parents --verbose"
  bwa_index_extra: ""
  bwa_map_extra: "-T 20 -M"
  picard_sort_sam_extra: ""
  picard_group_extra: "RGLB=standard RGPL=illumina RGPU={sample} RGSM={sample}"
  picard_dedup_extra: "REMOVE_DUPLICATES=true"
  picard_isize_extra: "METRIC_ACCUMULATION_LEVEL=SAMPLE"
  gatk_bqsr_extra: ""
  picard_summary_extra: ""
  samtools_view: "-b -h -F 12"
  samtools_fixmate_extra: "-c -m"
  picard_sequence_dict_extra: "GENOME_ASSEMBLY=GRCH38 SPECIES=HSA URI=https://www.gencodegenes.org/human/"
  samtools_faidx_extra: ""
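A misspelled key in this section is easy to miss, since the intended extra arguments would then simply not be found under the expected name. As a minimal sketch, assuming the section has already been loaded into a Python dictionary, the keys can be compared against the list above. The names `KNOWN_PARAMS` and `check_params` are hypothetical and not part of the pipeline.

```python
# Keys copied from the list of values documented above.
KNOWN_PARAMS = {
    "copy_extra", "bwa_index_extra", "bwa_map_extra",
    "picard_sort_sam_extra", "picard_group_extra", "picard_dedup_extra",
    "picard_isize_extra", "picard_summary_extra", "gatk_bqsr_extra",
    "samtools_view", "samtools_fixmate_extra",
    "picard_sequence_dict_extra", "samtools_faidx_extra",
}


def check_params(params):
    """Return a list of problems found in the `params` section."""
    problems = []
    for key, value in params.items():
        if key not in KNOWN_PARAMS:
            problems.append(f"unknown key: {key}")
        elif not isinstance(value, str):
            # Extra arguments are expected to be plain strings.
            problems.append(f"{key} should be a string, got {type(value).__name__}")
    return problems
```

For instance, `check_params({"bwa_extra": ""})` reports the unknown key `bwa_extra`, while a section using only the documented keys with string values yields an empty list.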
The workflow section is used to activate or deactivate parts of the pipeline. For example, if you do not want any quality control (you may already have run it outside of the wes-mapping-bwa-gatk pipeline), set these flags to false instead of true.
Warning: these values are case-sensitive!
Example:
workflow:
  fastqc: true
  multiqc: true
  mapping_quality: true
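One reason the spelling of these flags matters: depending on the parser, a quoted or unusually cased value may be loaded as a string rather than a boolean, and in Python any non-empty string, including "false", counts as true. The sketch below assumes the section has already been loaded into a dictionary; `check_workflow` is a hypothetical helper, not a function of the pipeline.

```python
def check_workflow(workflow):
    """Return the names of flags whose value is not a real boolean."""
    return [name for name, value in workflow.items()
            if not isinstance(value, bool)]


# A string such as "false" is truthy, so it would not disable anything.
flags = {"fastqc": True, "multiqc": "false", "mapping_quality": False}
suspicious = check_workflow(flags)
```

Here `suspicious` contains only `multiqc`, the flag that was loaded as a string.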
The following parameters do not belong to any section; keep them at the top level of the YAML file:
- design: The path to the design file
- workdir: The path to the working directory
- threads: The maximum number of threads used
- singularity_docker_image: The image used within singularity
- cold_storage: A list of cold storage mount points
Example:
design: design.tsv
workdir: .
threads: 1
singularity_docker_image: docker://continuumio/miniconda3:4.4.10
cold_storage:
  - /media
A complete config.yaml file would look like this:
design: design.tsv
workdir: .
threads: 1
singularity_docker_image: docker://continuumio/miniconda3:4.4.10
cold_storage:
  - /media
ref:
  fasta: /path/to/genome/sequence.fa
  known:
    - /path/to/known.vcf
    - /path/to/other.known.vcf
workflow:
  fastqc: true
  multiqc: true
  mapping_quality: true
params:
  copy_extra: "--parents --verbose"
  bwa_index_extra: ""
  bwa_map_extra: "-T 20 -M"
  picard_sort_sam_extra: ""
  picard_group_extra: "RGLB=standard RGPL=illumina RGPU={sample} RGSM={sample}"
  picard_dedup_extra: "REMOVE_DUPLICATES=true"
  picard_isize_extra: "METRIC_ACCUMULATION_LEVEL=SAMPLE"
  gatk_bqsr_extra: ""
  picard_summary_extra: ""
  samtools_view: "-b -h -F 12"
  samtools_fixmate_extra: "-c -m"
  picard_sequence_dict_extra: "GENOME_ASSEMBLY=GRCH38 SPECIES=HSA URI=https://www.gencodegenes.org/human/"
  samtools_faidx_extra: ""
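Since the configuration may also be JSON-formatted, a quick completeness check needs nothing beyond the Python standard library. This is only a sketch: it assumes every section and top-level parameter shown above is required, which this page does not state, and `missing_entries` is a hypothetical name.

```python
import json

SECTIONS = ("ref", "params", "workflow")
TOP_LEVEL = ("design", "workdir", "threads",
             "singularity_docker_image", "cold_storage")


def missing_entries(config):
    """Return the sections and top-level parameters absent from config."""
    return [key for key in SECTIONS + TOP_LEVEL if key not in config]


# A deliberately incomplete JSON configuration, for illustration only.
example = json.loads('{"design": "design.tsv", "threads": 1, "ref": {}}')
absent = missing_entries(example)
```

With this incomplete example, `absent` lists `params`, `workflow`, `workdir`, `singularity_docker_image` and `cold_storage`.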
Remember, there is a Python script here to build this file from the command line!
The design file is a TSV file describing the analysis.
It must contain the following columns:
- Sample_id: the name of each sample
- Upstream_file: path to the upstream fastq file
The optional columns are:
- Downstream_file: path to downstream fastq files (usually your R2 in paired-end libraries)
- Any other information
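The column requirements above can be verified with the standard `csv` module. The following is a minimal sketch, not the pipeline's own reader; `read_design` and `REQUIRED` are hypothetical names, and the file is assumed to have a header line and tab-separated fields.

```python
import csv

REQUIRED = ("Sample_id", "Upstream_file")


def read_design(path):
    """Read a tab-separated design file into a list of row dictionaries."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = reader.fieldnames or []
        missing = [column for column in REQUIRED if column not in header]
        if missing:
            raise ValueError(f"missing column(s): {', '.join(missing)}")
        return list(reader)
```

Each returned row maps column names to values, so optional columns such as Downstream_file are available whenever the header declares them.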
Remember, there is a Python script here to build this file from the command line!