bhavithakandru/validation-deseq2-pipeline

validated-deseq2-pipeline

A DESeq2 RNA-Seq differential expression pipeline packaged with a full Computer System Validation (CSV) lifecycle — URS → FRS → DS → IQ → OQ → PQ — and compliance mappings to 21 CFR Part 11 and ALCOA+ data integrity principles.

Built as a reference implementation of how a research bioinformatics tool gets validated to a level appropriate for GAMP 5 Category 5 (custom application) software in a regulated biopharmaceutical environment.


Why this project exists

In life-sciences environments, an analysis script doesn't ship just because it runs. It ships when:

  • Requirements are documented (URS)
  • Design is specified (FRS, DS)
  • Installation is qualified (IQ)
  • Operation is qualified (OQ)
  • Performance is qualified (PQ)
  • Every requirement is traceable to a test
  • Risk is assessed
  • Change control is defined
  • An audit trail exists

This repo demonstrates that full lifecycle around a small but real DESeq2 analysis — exactly the kind of artifact a Bioinformatics + Validation hybrid role is expected to produce.


Architecture

          ┌──────────────────────────────┐
          │  config/samples.csv          │
          │  data/test_counts.tsv        │
          └──────────────┬───────────────┘
                         │
                         ▼
      ┌──────────────────────────────────────┐
      │       scripts/run_deseq2.R           │
      │                                      │
      │   ┌──────────────┐  ┌─────────────┐  │
      │   │log_versions()│  │validate_    │  │
      │   │              │  │  inputs()   │  │
      │   └──────┬───────┘  └──────┬──────┘  │
      │          │                 │         │
      │          ▼                 ▼         │
      │      ┌─────────────────────────┐     │
      │      │     run_analysis()      │     │
      │      │   (DESeq2 Wald test)    │     │
      │      └─────────────────────────┘     │
      └──────────────────┬───────────────────┘
                         │
                         ▼
          ┌──────────────────────────────┐
          │  deseq2_results.tsv          │
          │  dds.rds                     │
          │  software_versions.log       │
          └──────────────────────────────┘

Validation package

| Artifact | File | What it proves |
| --- | --- | --- |
| User Requirements Spec | validation/01_URS.md | What the user needs |
| Functional Requirements Spec | validation/02_FRS.md | How the user needs translate to functions |
| Design Specification | validation/03_DS.md | How the system is built |
| Installation Qualification | validation/04_IQ_protocol.md | The software is correctly installed |
| Operational Qualification | validation/05_OQ_protocol.md | The software operates as designed |
| Performance Qualification | validation/06_PQ_protocol.md | The software produces scientifically correct results |
| Traceability Matrix | validation/07_traceability_matrix.csv | Every requirement maps to a test |
| Risk Assessment | validation/08_risk_assessment.md | Known failure modes and mitigations |
| Revalidation Triggers | validation/09_revalidation_triggers.md | When the validated state becomes invalid |
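
The "every requirement maps to a test" property of the traceability matrix can be checked mechanically. The following is a minimal sketch, not the repository's own tooling: it assumes a two-column layout (`requirement_id,test_case_id`) with a header row, which may differ from the real columns of validation/07_traceability_matrix.csv, and the `URS-00x` identifiers and the `check_traceability` helper are hypothetical.

```shell
set -eu

# Hypothetical layout: requirement_id,test_case_id (header row first).
# The real validation/07_traceability_matrix.csv may use different columns.

# check_traceability FILE: print every requirement whose test column is
# empty; exit status is 1 if at least one such requirement exists.
check_traceability() {
  awk -F',' 'NR > 1 && $2 == "" { print $1; missing = 1 } END { exit missing }' "$1"
}

# Demo on a throwaway file so the sketch runs without the repository.
demo="$(mktemp)"
printf '%s\n' \
  'requirement_id,test_case_id' \
  'URS-001,TC-DI-001' \
  'URS-002,' \
  'URS-003,TC-PQ-001' > "$demo"

if untraced="$(check_traceability "$demo")"; then
  echo "PASS: every requirement maps to a test"
else
  echo "FAIL: untraced requirements: $untraced"
fi
rm -f "$demo"
```

Against the real matrix, the same function would be called as `check_traceability validation/07_traceability_matrix.csv` once the field numbers are adjusted to the actual column order.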

Compliance mappings

| Standard | File | What it shows |
| --- | --- | --- |
| 21 CFR Part 11 | docs/21cfr11_mapping.md | Per-clause mapping of FDA controls to implementation |
| ALCOA+ | docs/alcoa_plus_assessment.md | Data integrity assessment across all 9 principles |

Operating procedures

| SOP | File | Purpose |
| --- | --- | --- |
| Script Execution | sops/SOP-001-script-execution.md | How an analyst runs the validated tool |
| Change Control | sops/SOP-002-change-control.md | How modifications are managed post-validation |

Audit trail

| File | What it tracks |
| --- | --- |
| audit-trail/change-log.md | Chronological record of every controlled change to the system |

Repository layout

├── README.md                      this file
├── .gitignore
│
├── scripts/
│   └── run_deseq2.R               the validated analysis script
│
├── tests/
│   ├── test_data_integrity.R      ALCOA+ input checks (4 cases)
│   ├── test_reproducibility.R     bit-identical re-runs
│   └── test_known_results.R       true-positive + true-negative DE detection
│
├── data/
│   ├── test_counts.tsv            40-gene synthetic count matrix
│   └── test_samples.csv           sample sheet (3 control, 3 treated)
│
├── validation/                    URS → FRS → DS → IQ → OQ → PQ + traceability + risk
├── docs/                          21 CFR Part 11 + ALCOA+ control mappings
├── sops/                          Standard Operating Procedures
└── audit-trail/                   Change log


Running the pipeline

Prerequisites

  • R ≥ 4.0
  • R packages: DESeq2 (install from Bioconductor) and readr (install from CRAN)

Direct execution

    Rscript scripts/run_deseq2.R \
      data/test_counts.tsv \
      data/test_samples.csv \
      results \
      control

Outputs land in results/:

  • deseq2_results.tsv: gene-level DE table sorted by adjusted p-value
  • dds.rds: serialized DESeqDataSet for downstream re-analysis
  • software_versions.log: runtime audit log (R version, package versions, timestamp, user)

SOP-compliant execution

For runs that follow SOP-001 (staging, hashing, audit trail):

    RUN_DIR=runs/$(date +%Y-%m-%d)-001

    mkdir -p "$RUN_DIR/inputs" "$RUN_DIR/outputs"
    cp data/test_counts.tsv  "$RUN_DIR/inputs/counts.tsv"
    cp data/test_samples.csv "$RUN_DIR/inputs/samples.csv"

    shasum -a 256 "$RUN_DIR"/inputs/*.tsv "$RUN_DIR"/inputs/*.csv > "$RUN_DIR/input_hashes.txt"

    Rscript scripts/run_deseq2.R \
      "$RUN_DIR/inputs/counts.tsv" \
      "$RUN_DIR/inputs/samples.csv" \
      "$RUN_DIR/outputs" \
      control

The runs/ directory is gitignored: analysis outputs are artifacts, not source.


Running the validation tests

The three test scripts correspond to test cases in OQ-DESEQ2-001 and PQ-DESEQ2-001:

    Rscript tests/test_data_integrity.R     # TC-DI-001..004 (input validation)
    Rscript tests/test_reproducibility.R    # TC-RP-001     (deterministic re-runs)
    Rscript tests/test_known_results.R      # TC-PQ-001     (correct DE detection)

All three should print PASS lines and exit with status 0.
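
When the tests are run as a batch, it helps to aggregate the exit statuses so that a single failure is not lost among passing output. The wrapper below is a minimal sketch under that assumption; `run_suite` is a hypothetical helper, not something shipped in the repository, and it relies only on each script exiting with status 0 on success.

```shell
set -u

# run_suite RUNNER SCRIPT...: run each script with RUNNER, print one
# PASS/FAIL summary line per script, and return non-zero if any failed.
run_suite() {
  runner="$1"
  shift
  failed=0
  for script in "$@"; do
    if "$runner" "$script"; then
      echo "PASS $script"
    else
      echo "FAIL $script"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}

# Intended use (requires R and a checkout of the repository):
# run_suite Rscript \
#   tests/test_data_integrity.R \
#   tests/test_reproducibility.R \
#   tests/test_known_results.R
```

The function's return status can be used directly as the exit status of a wrapper script, so a calling job fails whenever any of the three test scripts does.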


What this project is and isn't

Is:

  • A complete, working CSV lifecycle example around a real DESeq2 analysis
  • A reference for how to map bioinformatics work onto 21 CFR Part 11 + ALCOA+
  • Reviewable end-to-end in under an hour

Isn't:

  • An FDA-cleared system
  • A production GxP deployment
  • A substitute for a regulated organization's own QMS and validation framework

For deployment in an actual GxP environment, the gaps to close are documented in docs/21cfr11_mapping.md (§ Gap Summary).


Companion project

This pipeline pairs naturally with cloud-rnaseq-aws, an AWS reference architecture (S3 versioning, IAM least-privilege, lifecycle archiving, EC2 Spot with self-termination) for running bioinformatics workflows in a cost-aware, compliance-aware cloud environment.

Together they demonstrate regulated bioinformatics + cloud end-to-end.


Author

Bhavitha Kandru (kandru.b@northeastern.edu)
M.S. Bioinformatics, Northeastern University · Pharm-D


License

This project is provided as-is for educational and reference purposes.
