Commit 5472840

GiaoLee and claude committed
chore: standardize SKILL.md metadata and refresh eval reports
Update SKILL.md files across Data Analysis and Other categories with consistent frontmatter (skill-author, quoted descriptions, category), add Prerequisites, Agent Response Contract, and Input Validation sections, and regenerate matching eval_report JSON files. Also adds project-structure reference for external-model-validation and readability advisories to sample-group-sankey-plot.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 0206bcf, commit 5472840

41 files changed

Lines changed: 3828 additions & 6106 deletions

awesome-med-research-skills/Data Analysis/batch-effect-correction/SKILL.md

Lines changed: 49 additions & 10 deletions
@@ -1,13 +1,26 @@
 ---
 name: batch-effect-correction
-description: Use when correcting batch effects in merged bulk expression matrices with sample-level batch metadata while preserving biological group structure and generating before-and-after QC plots. NOT for: single-cell integration, raw FASTQ processing, differential expression without batch labels, or datasets without biological groups.
+description: "Use when correcting batch effects in merged bulk expression matrices with sample-level batch metadata while preserving biological group structure and generating before-and-after QC plots. NOT for: single-cell integration, raw FASTQ processing, differential expression without batch labels, or datasets without biological groups."
 license: MIT
-author: AIPOCH
+skill-author: AIPOCH
 ---
-> **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills)
 
 # Batch Effect Correction
 
+## Prerequisites
+
+Run the following before the first analysis to install all required R packages:
+
+```bash
+Rscript -e "if (!require('BiocManager', quietly=TRUE)) install.packages('BiocManager'); BiocManager::install(c('sva', 'limma')); install.packages('ggplot2', repos='https://cloud.r-project.org')"
+```
+
+> Note: `sva` and `limma` are Bioconductor packages and require `BiocManager` for installation. `ggplot2` is a standard CRAN package.
+
+**The skill cannot run until these packages are installed.** In new or bare R environments, always run the prerequisite step first.
+
+---
+
 ## When to Read External Files
 
 | Situation | File to Read | Purpose |
@@ -99,8 +112,8 @@ Requirements:
 | `matched_sample_info.csv` | Standardized metadata used in the analysis |
 | `batch_before_boxplot.pdf` | Sample distribution boxplot before correction |
 | `batch_after_boxplot.pdf` | Sample distribution boxplot after correction |
-| `batch_before_pca.pdf` | PCA scatter plot before correction with batch-colored points; batch ellipses are added when batch size/covariance supports stable fitting |
-| `batch_after_pca.pdf` | PCA scatter plot after correction with batch-colored points; batch ellipses are added when batch size/covariance supports stable fitting |
+| `batch_before_pca.pdf` | PCA scatter plot before correction with batch-colored points |
+| `batch_after_pca.pdf` | PCA scatter plot after correction with batch-colored points |
 | `batch_before_clustering.pdf` | Hierarchical clustering before correction |
 | `batch_after_clustering.pdf` | Hierarchical clustering after correction |
 | `session_info.txt` | R session and package version info |
@@ -118,7 +131,7 @@ Requirements:
 ### Step 2: Align and Prepare Matrix
 - Reorder expression columns to match metadata sample order
 - Keep only metadata-matched samples; warn if the expression matrix contains extra samples absent from metadata
-- Decide whether log transformation is needed
+- Decide whether log transformation is needed (`auto`, `yes`, or `no`)
 - Apply `log2(x + 1)` only when required
 
 ### Step 3: Run Batch Correction
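
A note on the Step 2 hunk above: the diff names the three modes (`auto`, `yes`, `no`) but does not say how `auto` decides. One possible reading, sketched here in shell to match the document's own fences, is a simple magnitude check. The function name `needs_log2`, the default threshold of 50, and the assumption of a plain CSV with gene IDs in the first column are all illustrative and are not taken from `scripts/main.R`.

```bash
# Hypothetical helper (not part of the skill): decide whether log2(x + 1)
# should be applied to an expression matrix.
# Usage: needs_log2 <auto|yes|no> <expression.csv> [threshold]
needs_log2() {
  local mode="$1" matrix="$2" threshold="${3:-50}"
  case "$mode" in
    yes) echo "yes" ;;   # forced on
    no)  echo "no" ;;    # forced off
    auto)
      # Assumed heuristic: values already on a log2 scale rarely exceed ~50,
      # so a larger maximum suggests raw-scale data that still needs the transform.
      awk -F',' -v t="$threshold" '
        BEGIN { max = 0 }
        NR > 1 {                       # skip the header row
          for (i = 2; i <= NF; i++)    # skip the gene ID column
            if ($i + 0 > max) max = $i + 0
        }
        END { if (max > t) print "yes"; else print "no" }
      ' "$matrix"
      ;;
    *) echo "unknown mode: $mode" >&2; return 2 ;;
  esac
}
```

Under this heuristic, `needs_log2 auto expression.csv` prints `yes` for raw-scale values and `no` for values that already look log-transformed; the forced modes ignore the file entirely.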
@@ -149,6 +162,18 @@ Generates paired boxplots, PCA scatter plots with conditional batch ellipses, an
 
 ---
 
+## Agent Response Contract
+
+After a successful run, report:
+
+1. **Sample count** retained after metadata matching and any subset filtering
+2. **Batch count** and **group count** used in the ComBat design matrix
+3. **Log transformation** applied (auto-detected, forced yes, or skipped)
+4. **QC assessment**: describe whether before/after PCA plots show reduced batch clustering
+5. **Artifact paths**: `corrected_expression_matrix.csv`, `batch_after_pca.pdf`, `batch_after_clustering.pdf`
+
+---
+
 ## Examples
 
 ### Basic Usage
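
A note on the Agent Response Contract hunk above: items 1 and 2 ask for sample, batch, and group counts. A minimal sketch of how those numbers could be cross-checked directly from a metadata file follows. It assumes a simple CSV without quoted commas and header names `sample`, `group`, and `batch`; the helper name `summarize_metadata` and those column names are hypothetical. The counts the skill itself reports are taken after metadata matching and subset filtering, which this sketch does not reproduce.

```bash
# Hypothetical helper (not part of the skill): count samples, batches, and
# groups in a metadata CSV. Assumes header columns named "sample", "group",
# and "batch", and no quoted fields containing commas.
summarize_metadata() {
  awk -F',' '
    { sub(/\r$/, "") }                 # tolerate Windows line endings
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("sample" in col) || !("group" in col) || !("batch" in col)) {
        print "missing required column" > "/dev/stderr"
        bad = 1
        exit 1
      }
      next
    }
    NF > 0 {                           # ignore blank lines
      n++
      batches[$(col["batch"])] = 1
      groups[$(col["group"])] = 1
    }
    END {
      if (bad) exit 1
      nb = 0; for (b in batches) nb++
      ng = 0; for (g in groups) ng++
      printf "samples=%d batches=%d groups=%d\n", n, nb, ng
    }
  ' "$1"
}
```

The one-line summary can be compared against what the agent reports; a mismatch would suggest that samples were dropped during matching or filtering.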
@@ -193,22 +218,36 @@ Rscript scripts/main.R \
 | `SKILL_EMPTY_FILE` | Input file exists but contains no data | Recreate or re-export the file |
 | `SKILL_MISSING_COLUMNS` | Metadata file is missing sample, group, or batch columns | Check header names or pass custom column names |
 | `SKILL_SAMPLE_MISMATCH` | Metadata sample IDs do not match expression matrix columns | Verify sample names between files |
-| `SKILL_INVALID_DATA` | Dataset fails minimum design checks | Review group counts, batch counts, and ID validity |
+| `SKILL_INVALID_DATA` | Dataset fails minimum design checks (< 2 batches, < 2 groups, < 2 samples per batch/group) | Review group counts, batch counts, and ID validity |
 | `SKILL_INVALID_TYPE` | Expression values are non-numeric or non-finite | Clean matrix values before running |
 | `SKILL_TIMEOUT` | Run exceeded the configured time limit | Increase `--timeout_seconds` or set it to `0` |
-| `SKILL_DEPENDENCY_MISSING` | Required R package is not installed | Install missing dependencies |
+| `SKILL_DEPENDENCY_MISSING` | Required R package is not installed | Install with: `Rscript -e "BiocManager::install(c('sva','limma')); install.packages('ggplot2')"` |
 | `SKILL_RUNTIME_ERROR` | Runtime I/O or filesystem error occurred | Check read/write permissions and environment |
 
 **IF error persists**, READ: `references/troubleshooting.md`
 
+**Troubleshooting note:** In environments where packages are not yet installed, `SKILL_DEPENDENCY_MISSING` will fire before file-validation or `--help`. Install dependencies first, then re-run to expose file-related errors or access `--help`.
+
+---
+
+## Input Validation
+
+This skill accepts:
+1. A bulk RNA-seq or microarray expression matrix (CSV, genes as rows, samples as columns)
+2. A sample metadata file (CSV) with sample ID, biological group, and batch columns; at least 2 batches and 2 biological groups are required
+
+If the user's request does not involve batch effect correction on merged bulk expression matrices — for example, asking to integrate single-cell RNA-seq data, process raw FASTQ files, run differential expression without batch labels, or analyze datasets with only one batch — do not proceed with the workflow. Instead respond:
+
+> "Batch Effect Correction is designed to remove batch-driven variation from merged bulk expression matrices using ComBat, while preserving biological group structure. Your request appears to be outside this scope. Please provide a multi-batch expression matrix with sample-level batch metadata, or use a more appropriate tool for single-cell integration, differential expression, or raw sequencing processing."
+
 ---
 
 ## Testing
 
 ### Test with Sample Data
 
 ```bash
-# Check help
+# Check help (requires packages installed)
 Rscript scripts/main.R --help
 
 # Run with bundled test data
@@ -249,4 +288,4 @@ ls -la tests/output/batch_after_pca.pdf
 
 ---
 
-*Last updated: 2026-04-20 | Version: 1.0.0*
+*Last updated: 2026-04-27 | Version: 1.1.0*
