chore: standardize SKILL.md metadata and refresh eval reports
Update SKILL.md files across Data Analysis and Other categories with
consistent frontmatter (skill-author, quoted descriptions, category),
add Prerequisites, Agent Response Contract, and Input Validation
sections, and regenerate matching eval_report JSON files. Also adds
project-structure reference for external-model-validation and
readability advisories to sample-group-sankey-plot.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
awesome-med-research-skills/Data Analysis/batch-effect-correction/SKILL.md
Lines changed: 49 additions & 10 deletions
@@ -1,13 +1,26 @@
 ---
 name: batch-effect-correction
-description: Use when correcting batch effects in merged bulk expression matrices with sample-level batch metadata while preserving biological group structure and generating before-and-after QC plots. NOT for: single-cell integration, raw FASTQ processing, differential expression without batch labels, or datasets without biological groups.
+description: "Use when correcting batch effects in merged bulk expression matrices with sample-level batch metadata while preserving biological group structure and generating before-and-after QC plots. NOT for: single-cell integration, raw FASTQ processing, differential expression without batch labels, or datasets without biological groups."
> Note: `sva` and `limma` are Bioconductor packages and require `BiocManager` for installation. `ggplot2` is a standard CRAN package.
+
+**The skill cannot run until these packages are installed.** In new or bare R environments, always run the prerequisite step first.
+
+---
+
 ## When to Read External Files
 
 | Situation | File to Read | Purpose |
@@ -99,8 +112,8 @@ Requirements:
 |`matched_sample_info.csv`| Standardized metadata used in the analysis |
 |`batch_before_boxplot.pdf`| Sample distribution boxplot before correction |
 |`batch_after_boxplot.pdf`| Sample distribution boxplot after correction |
-|`batch_before_pca.pdf`| PCA scatter plot before correction with batch-colored points; batch ellipses are added when batch size/covariance supports stable fitting|
-|`batch_after_pca.pdf`| PCA scatter plot after correction with batch-colored points; batch ellipses are added when batch size/covariance supports stable fitting|
+|`batch_before_pca.pdf`| PCA scatter plot before correction with batch-colored points |
+|`batch_after_pca.pdf`| PCA scatter plot after correction with batch-colored points |
 |`batch_before_clustering.pdf`| Hierarchical clustering before correction |
 |`batch_after_clustering.pdf`| Hierarchical clustering after correction |
 |`session_info.txt`| R session and package version info |
@@ -118,7 +131,7 @@ Requirements:
 ### Step 2: Align and Prepare Matrix
 - Reorder expression columns to match metadata sample order
 - Keep only metadata-matched samples; warn if the expression matrix contains extra samples absent from metadata
-- Decide whether log transformation is needed
+- Decide whether log transformation is needed (`auto`, `yes`, or `no`)
 - Apply `log2(x + 1)` only when required
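
Reviewer sketch for the `auto`/`yes`/`no` log decision in Step 2. The diff does not show how `auto` is decided, so the rule below (treat the matrix as unlogged when its 99th percentile exceeds 100) is an assumed heuristic, and `maybe_log2` is a hypothetical helper name, not a function from `scripts/main.R`.

```r
# Hypothetical helper; the skill's real `auto` rule is not shown in this diff.
maybe_log2 <- function(expr, mode = c("auto", "yes", "no")) {
  mode <- match.arg(mode)
  if (mode == "auto") {
    # Assumed heuristic: log-scale expression values rarely exceed ~100,
    # so a large 99th percentile suggests raw (unlogged) intensities or counts.
    q99 <- stats::quantile(as.numeric(expr), probs = 0.99,
                           na.rm = TRUE, names = FALSE)
    mode <- if (q99 > 100) "yes" else "no"
  }
  # Apply log2(x + 1) only when required, as Step 2 states.
  if (mode == "yes") log2(expr + 1) else expr
}

# Toy matrix (genes as rows, samples as columns) with raw-scale values.
raw_counts <- matrix(c(0, 10, 500, 2000, 35, 800), nrow = 2)
logged <- maybe_log2(raw_counts)
```

Passing the already transformed matrix back through `maybe_log2(logged)` leaves it unchanged under this heuristic, which is the property an `auto` mode needs in order to avoid double transformation.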
 
 ### Step 3: Run Batch Correction
@@ -149,6 +162,18 @@ Generates paired boxplots, PCA scatter plots with conditional batch ellipses, an
 
 ---
 
+## Agent Response Contract
+
+After a successful run, report:
+
+1. **Sample count** retained after metadata matching and any subset filtering
+2. **Batch count** and **group count** used in the ComBat design matrix
+3. **Log transformation** applied (auto-detected, forced yes, or skipped)
|`SKILL_EMPTY_FILE`| Input file exists but contains no data | Recreate or re-export the file |
 |`SKILL_MISSING_COLUMNS`| Metadata file is missing sample, group, or batch columns | Check header names or pass custom column names |
 |`SKILL_SAMPLE_MISMATCH`| Metadata sample IDs do not match expression matrix columns | Verify sample names between files |
-|`SKILL_INVALID_DATA`| Dataset fails minimum design checks | Review group counts, batch counts, and ID validity |
+|`SKILL_INVALID_DATA`| Dataset fails minimum design checks (< 2 batches, < 2 groups, < 2 samples per batch/group) | Review group counts, batch counts, and ID validity |
 |`SKILL_INVALID_TYPE`| Expression values are non-numeric or non-finite | Clean matrix values before running |
 |`SKILL_TIMEOUT`| Run exceeded the configured time limit | Increase `--timeout_seconds` or set it to `0`|
-|`SKILL_DEPENDENCY_MISSING`| Required R package is not installed | Install missing dependencies|
+|`SKILL_DEPENDENCY_MISSING`| Required R package is not installed | Install with: `Rscript -e "BiocManager::install(c('sva','limma')); install.packages('ggplot2')"`|
 |`SKILL_RUNTIME_ERROR`| Runtime I/O or filesystem error occurred | Check read/write permissions and environment |
**Troubleshooting note:** In environments where packages are not yet installed, `SKILL_DEPENDENCY_MISSING` will fire before file-validation or `--help`. Install dependencies first, then re-run to expose file-related errors or access `--help`.
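
Reviewer sketch of an up-front dependency check that would produce the ordering described in the troubleshooting note. It uses only base R's `requireNamespace`; the function name `missing_packages` is hypothetical, and how `scripts/main.R` actually raises `SKILL_DEPENDENCY_MISSING` is not shown in this diff.

```r
# Hypothetical helper: returns the subset of `pkgs` that cannot be loaded.
missing_packages <- function(pkgs = c("sva", "limma", "ggplot2")) {
  # requireNamespace() returns FALSE instead of erroring when a package
  # is absent, so it is safe to call before anything else is loaded.
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Report missing packages before any argument parsing or file validation.
absent <- missing_packages()
if (length(absent) > 0) {
  message("SKILL_DEPENDENCY_MISSING: ", paste(absent, collapse = ", "))
}
```

Running a check like this first is what makes the dependency error surface ahead of file-related errors and `--help`.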
+
+---
+
+## Input Validation
+
+This skill accepts:
+1. A bulk RNA-seq or microarray expression matrix (CSV, genes as rows, samples as columns)
+2. A sample metadata file (CSV) with sample ID, biological group, and batch columns; at least 2 batches and 2 biological groups are required
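
Reviewer sketch of the minimum design checks named in this diff (at least 2 batches, at least 2 groups, at least 2 samples per batch and per group). The helper name `check_design` and the default column names `group` and `batch` are assumptions for illustration; the skill accepts custom column names, and its real validation code is not shown here.

```r
# Hypothetical helper: returns a character vector of design problems
# (empty when the metadata passes every check).
check_design <- function(meta, group_col = "group", batch_col = "batch") {
  batch_sizes <- table(meta[[batch_col]])
  group_sizes <- table(meta[[group_col]])
  problems <- character(0)
  if (length(batch_sizes) < 2) problems <- c(problems, "fewer than 2 batches")
  if (length(group_sizes) < 2) problems <- c(problems, "fewer than 2 groups")
  if (any(batch_sizes < 2)) problems <- c(problems, "a batch has fewer than 2 samples")
  if (any(group_sizes < 2)) problems <- c(problems, "a group has fewer than 2 samples")
  problems
}

# Toy metadata: 2 batches x 2 groups, 2 samples in each batch and group.
meta <- data.frame(
  sample = paste0("S", 1:4),
  group  = c("case", "control", "case", "control"),
  batch  = c("B1", "B1", "B2", "B2")
)
problems <- check_design(meta)
```

Collecting every problem rather than stopping at the first lets a single `SKILL_INVALID_DATA` message list all the counts the user needs to review.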
+
+If the user's request does not involve batch effect correction on merged bulk expression matrices — for example, asking to integrate single-cell RNA-seq data, process raw FASTQ files, run differential expression without batch labels, or analyze datasets with only one batch — do not proceed with the workflow. Instead respond:
+
+> "Batch Effect Correction is designed to remove batch-driven variation from merged bulk expression matrices using ComBat, while preserving biological group structure. Your request appears to be outside this scope. Please provide a multi-batch expression matrix with sample-level batch metadata, or use a more appropriate tool for single-cell integration, differential expression, or raw sequencing processing."
+
 ---
 
 ## Testing
 
 ### Test with Sample Data
 
 ```bash
-# Check help
+# Check help (requires packages installed)
 Rscript scripts/main.R --help
 
 # Run with bundled test data
@@ -249,4 +288,4 @@ ls -la tests/output/batch_after_pca.pdf