A DESeq2 RNA-Seq differential expression pipeline packaged with a full Computer System Validation (CSV) lifecycle — URS → FRS → DS → IQ → OQ → PQ — and compliance mappings to 21 CFR Part 11 and ALCOA+ data integrity principles.
Built as a reference implementation of how a research bioinformatics tool gets validated to a level appropriate for GAMP 5 Category 5 (custom application) software in a regulated biopharmaceutical environment.
In life-sciences environments, an analysis script doesn't ship just because it runs. It ships when:
- Requirements are documented (URS)
- Design is specified (FRS, DS)
- Installation is qualified (IQ)
- Operation is qualified (OQ)
- Performance is qualified (PQ)
- Every requirement is traceable to a test
- Risk is assessed
- Change control is defined
- An audit trail exists
This repo demonstrates that full lifecycle around a small but real DESeq2 analysis — exactly the kind of artifact a Bioinformatics + Validation hybrid role is expected to produce.
```
    ┌──────────────────────────────┐
    │  config/samples.csv          │
    │  data/test_counts.tsv        │
    └──────────────┬───────────────┘
                   │
                   ▼
┌──────────────────────────────────────┐
│  scripts/run_deseq2.R                │
│                                      │
│  ┌──────────────┐  ┌─────────────┐   │
│  │log_versions()│  │validate_    │   │
│  │              │  │  inputs()   │   │
│  └──────┬───────┘  └──────┬──────┘   │
│         │                 │          │
│         ▼                 ▼          │
│     ┌─────────────────────────┐      │
│     │     run_analysis()      │      │
│     │   (DESeq2 Wald test)    │      │
│     └─────────────────────────┘      │
└──────────────────┬───────────────────┘
                   │
                   ▼
    ┌──────────────────────────────┐
    │  deseq2_results.tsv          │
    │  dds.rds                     │
    │  software_versions.log       │
    └──────────────────────────────┘
```
| Artifact | File | What it proves |
|---|---|---|
| User Requirements Spec | `validation/01_URS.md` | What the user needs |
| Functional Requirements Spec | `validation/02_FRS.md` | How the user needs translate to functions |
| Design Specification | `validation/03_DS.md` | How the system is built |
| Installation Qualification | `validation/04_IQ_protocol.md` | The software is correctly installed |
| Operational Qualification | `validation/05_OQ_protocol.md` | The software operates as designed |
| Performance Qualification | `validation/06_PQ_protocol.md` | The software produces scientifically correct results |
| Traceability Matrix | `validation/07_traceability_matrix.csv` | Every requirement maps to a test |
| Risk Assessment | `validation/08_risk_assessment.md` | Known failure modes and mitigations |
| Revalidation Triggers | `validation/09_revalidation_triggers.md` | When the validated state becomes invalid |

| Standard | File | What it shows |
|---|---|---|
| 21 CFR Part 11 | `docs/21cfr11_mapping.md` | Per-clause mapping of FDA controls to implementation |
| ALCOA+ | `docs/alcoa_plus_assessment.md` | Data integrity assessment across all 9 principles |

| SOP | File | Purpose |
|---|---|---|
| Script Execution | `sops/SOP-001-script-execution.md` | How an analyst runs the validated tool |
| Change Control | `sops/SOP-002-change-control.md` | How modifications are managed post-validation |

| File | What it tracks |
|---|---|
| `audit-trail/change-log.md` | Chronological record of every controlled change to the system |
```
├── README.md                     this file
├── .gitignore
│
├── scripts/
│   └── run_deseq2.R              the validated analysis script
│
├── tests/
│   ├── test_data_integrity.R     ALCOA+ input checks (4 cases)
│   ├── test_reproducibility.R    bit-identical re-runs
│   └── test_known_results.R      true-positive + true-negative DE detection
│
├── data/
│   ├── test_counts.tsv           40-gene synthetic count matrix
│   └── test_samples.csv          sample sheet (3 control, 3 treated)
│
├── validation/                   URS → FRS → DS → IQ → OQ → PQ + traceability + risk
├── docs/                         21 CFR Part 11 + ALCOA+ control mappings
├── sops/                         Standard Operating Procedures
└── audit-trail/                  Change log
```
- R ≥ 4.0
- R packages: `DESeq2` (from Bioconductor) and `readr` (from CRAN)
```shell
Rscript scripts/run_deseq2.R \
  data/test_counts.tsv \
  data/test_samples.csv \
  results \
  control
```

Outputs land in `results/`:

- `deseq2_results.tsv`: gene-level DE table sorted by adjusted p-value
- `dds.rds`: serialized DESeqDataSet for downstream re-analysis
- `software_versions.log`: runtime audit log (R version, package versions, timestamp, user)
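As a quick sanity check on the results table, the sketch below counts genes whose adjusted p-value falls under a threshold. It assumes the TSV keeps DESeq2's standard `padj` column name in an unquoted, tab-separated header, which this README does not confirm. The function name `count_significant` and the 0.05 default are illustrative and not part of the validated script.

```shell
# count_significant FILE [ALPHA]
# Hypothetical helper: prints how many rows have padj < ALPHA (default 0.05).
# Assumes a tab-separated file whose header contains a column named "padj".
count_significant() {
  file="$1"
  alpha="${2:-0.05}"
  awk -F '\t' -v alpha="$alpha" '
    NR == 1 {
      # Locate the padj column by name rather than by position.
      for (i = 1; i <= NF; i++) if ($i == "padj") col = i
      if (!col) { print "no padj column found" > "/dev/stderr"; exit 2 }
      next
    }
    # Rows with a missing adjusted p-value (NA or empty) are skipped.
    $col != "NA" && $col != "" && ($col + 0) < alpha { n++ }
    END { if (col) print n + 0 }
  ' "$file"
}
```

After a run, `count_significant results/deseq2_results.tsv` prints the count at the default threshold, and a second argument such as `0.01` tightens it.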
## SOP-compliant execution

For runs that follow SOP-001 (staging, hashing, audit trail):

```shell
RUN_DIR=runs/$(date +%Y-%m-%d)-001
mkdir -p "$RUN_DIR/inputs" "$RUN_DIR/outputs"
cp data/test_counts.tsv "$RUN_DIR/inputs/counts.tsv"
cp data/test_samples.csv "$RUN_DIR/inputs/samples.csv"
shasum -a 256 "$RUN_DIR"/inputs/*.tsv "$RUN_DIR"/inputs/*.csv > "$RUN_DIR/input_hashes.txt"
Rscript scripts/run_deseq2.R \
  "$RUN_DIR/inputs/counts.tsv" \
  "$RUN_DIR/inputs/samples.csv" \
  "$RUN_DIR/outputs" \
  control
```

The `runs/` directory is gitignored: analysis outputs are artifacts, not source.
## Running the validation tests

The three test scripts correspond to test cases in OQ-DESEQ2-001 and PQ-DESEQ2-001:

```shell
Rscript tests/test_data_integrity.R    # TC-DI-001..004 (input validation)
Rscript tests/test_reproducibility.R   # TC-RP-001 (deterministic re-runs)
Rscript tests/test_known_results.R     # TC-PQ-001 (correct DE detection)
```

All three should print PASS lines and exit with status 0.
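To make the exit-status check explicit, the three scripts can be run through a small wrapper that keeps going after a failure and reports one overall status. This is a minimal sketch: `run_suite` is a hypothetical helper, not something shipped in this repository, and it relies only on each script's exit code.

```shell
# run_suite RUNNER SCRIPT...
# Runs every SCRIPT with RUNNER, prints one status line per script, and
# returns non-zero if any script exited with a non-zero status.
run_suite() {
  runner="$1"
  shift
  failed=0
  for script in "$@"; do
    if "$runner" "$script"; then
      echo "[ok]     $script"
    else
      echo "[failed] $script"
      failed=1
    fi
  done
  return "$failed"
}
```

For this repository the call would be `run_suite Rscript tests/test_data_integrity.R tests/test_reproducibility.R tests/test_known_results.R`. Running all scripts instead of stopping at the first failure gives a complete picture of which test cases need attention.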
## What this project is, and isn't

Is:

- A complete, working CSV lifecycle example around a real DESeq2 analysis
- A reference for how to map bioinformatics work onto 21 CFR Part 11 + ALCOA+
- Reviewable end-to-end in under an hour

Isn't:

- An FDA-cleared system
- A production GxP deployment
- A substitute for a regulated organization's own QMS and validation framework

For deployment in an actual GxP environment, the gaps to close are documented in docs/21cfr11_mapping.md (§ Gap Summary).
## Companion project

This pipeline pairs naturally with cloud-rnaseq-aws — an AWS reference architecture (S3 versioning, IAM least-privilege, lifecycle archiving, EC2 Spot with self-termination) for running bioinformatics workflows in a cost-aware, compliance-aware cloud environment.
Together they demonstrate regulated bioinformatics + cloud end-to-end.
## Author

Bhavitha Kandru — kandru.b@northeastern.edu
M.S. Bioinformatics, Northeastern University · Pharm-D
## License

This project is provided as-is for educational and reference purposes.