|
1 | 1 | --- |
2 | 2 | name: gsea |
3 | | -description: 对按统计量排序的基因列表执行 GSEA 分析,输出富集结果表、运行分数表和绘图结果。 |
| 3 | +description: Run GSEA on a ranked gene list and produce the enrichment table, running-score table, and enrichment plots. |
4 | 4 | license: MIT |
5 | 5 | author: AIPOCH |
6 | 6 | --- |
7 | 7 | > **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills) |
8 | 8 |
|
9 | | -## 何时读取外部文件 |
| 9 | +## When to read external files |
10 | 10 |
|
11 | | -| 情况 | 读取文件 | 目的 | |
| 11 | +| Situation | Read | Purpose | |
12 | 12 | |---|---|---| |
13 | | -| 需要了解算法细节 | `references/algorithm.md` | 统计方法与公式 | |
14 | | -| 需要执行分析 | `scripts/main.R` | 获取完整命令 | |
15 | | -| 遇到报错 | `references/troubleshooting.md` | 查找解决方案 | |
16 | | -| 需要 CLI 示例 | `references/cli-guide.md` | 参数用法示例 | |
| 13 | +| Need algorithm details | `references/algorithm.md` | Statistical method and formulas | |
| 14 | +| Need to run an analysis | `scripts/main.R` | Full command reference | |
| 15 | +| Hit an error | `references/troubleshooting.md` | Look up error codes and fixes | |
| 16 | +| Need CLI examples | `references/cli-guide.md` | Worked argument examples | |
17 | 17 |
|
18 | | -## 适用场景 |
| 18 | +## Scope |
19 | 19 |
|
20 | | -适用于: |
21 | | -- 对按统计量排序的基因列表执行 GSEA 分析 |
22 | | -- 基于已有 `enrichGSEA.csv` 和 `gsea_running_scores.csv` 生成富集曲线图 |
23 | | -- 使用 `tests/data/sample_deg_results.csv` 做最小可运行验证 |
| 20 | +Use this skill for: |
| 21 | +- Running GSEA on a gene list ranked by a statistic |
| 22 | +- Generating enrichment curve plots from existing `enrichGSEA.csv` and `gsea_running_scores.csv` |
| 23 | +- Smoke-testing the pipeline with `tests/data/sample_deg_results.csv` |
24 | 24 |
|
25 | | -不适用于: |
26 | | -- 原始表达矩阵的差异分析 |
27 | | -- 单样本 ssGSEA |
28 | | -- 网络分析或多组学整合分析 |
| 25 | +Do not use it for: |
| 26 | +- Differential expression on raw expression matrices |
| 27 | +- Single-sample ssGSEA |
| 28 | +- Network analysis or multi-omics integration |
29 | 29 |
|
30 | | -## 使用方法 |
| 30 | +## Usage |
31 | 31 |
|
32 | | -分析模式: |
| 32 | +Analysis mode: |
33 | 33 | `Rscript scripts/main.R --input tests/data/sample_deg_results.csv --outdir ./GSEA_analysis --type KEGG --species human --seed 42 --timeout 300` |
34 | 34 |
|
35 | | -绘图模式: |
| 35 | +Plot mode: |
36 | 36 | `Rscript scripts/main.R --running_file ./GSEA_analysis/Table/gsea_running_scores.csv --enrich_file ./GSEA_analysis/Table/enrichGSEA.csv --plot_output ./GSEA_analysis/plot/gsea_plot.pdf --top_n 5 --plot_format pdf --seed 42 --timeout 300` |
37 | 37 |
|
38 | | -说明:详见 `references/cli-guide.md`。 |
| 38 | +See `references/cli-guide.md` for more. |
39 | 39 |
|
40 | | -模式选择说明: |
41 | | -- 仅提供 `--input` 时进入分析模式 |
42 | | -- 同时提供 `--running_file` 和 `--enrich_file` 时进入绘图模式 |
43 | | -- 若同时提供分析参数与绘图参数,则绘图模式优先,分析模式会被跳过,并输出警告信息 |
| 40 | +Mode selection: |
| 41 | +- Passing only `--input` runs analysis mode |
| 42 | +- Passing both `--running_file` and `--enrich_file` runs plot mode |
| 43 | +- If both sets of arguments are provided, plot mode takes precedence; analysis mode is skipped and a warning is logged |
44 | 44 |
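The three mode-selection rules can be sketched as a small decision function. This is an illustration in Python, not the logic in `scripts/main.R`: the function name `select_mode` and the warning text are hypothetical. The document does not say what happens when only one of `--running_file` / `--enrich_file` is given, so raising an error for that case is an assumption.

```python
from typing import Optional, Tuple


def select_mode(input_path: Optional[str] = None,
                running_file: Optional[str] = None,
                enrich_file: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Return (mode, warning) following the documented mode-selection rules."""
    has_plot_args = running_file is not None and enrich_file is not None
    if has_plot_args:
        warning = None
        if input_path is not None:
            # Documented: plot mode wins, analysis is skipped, a warning is logged.
            warning = "plot mode takes precedence; analysis mode is skipped"
        return "plot", warning
    if input_path is not None and running_file is None and enrich_file is None:
        return "analysis", None
    # Assumption: the source does not describe partial or empty argument sets.
    raise ValueError("cannot determine mode from the given arguments")


print(select_mode(input_path="tests/data/sample_deg_results.csv"))
```

Returning the warning instead of printing it keeps the decision easy to test separately from logging.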
|
45 | | -## 参数说明 |
| 45 | +## Arguments |
46 | 46 |
|
47 | | -### 分析模式参数 |
| 47 | +### Analysis-mode arguments |
48 | 48 |
|
49 | | -| 短参数 | 长参数 | 类型 | 默认值 | 是否必填 | 说明 | |
| 49 | +| Short | Long | Type | Default | Required | Description | |
50 | 50 | |---|---|---|---|---|---| |
51 | | -| `-i` | `--input` | character | `NULL` | 是 | 输入 CSV 文件 | |
52 | | -| `-o` | `--outdir` | character | `GSEA_analysis` | 否 | 输出目录 | |
53 | | -| `-g` | `--gene_col` | character | `name` | 否 | 基因列名 | |
54 | | -| `-f` | `--fc_col` | character | `logFC` | 否 | 排序统计量列名 | |
55 | | -| `-t` | `--type` | character | `KEGG` | 否 | 基因集类型:`KEGG`、`HALLMARKS`、`GO_BP`、`GO_MF`、`GO_CC`;预载 RDS 中会自动将 `HALLMARKS` 映射到资产键 `Hallmarks` | |
56 | | -| `-s` | `--species` | character | `human` | 否 | 物种:`human`、`mouse`、`rat` | |
57 | | -| `-p` | `--pvalue_cutoff` | numeric | `0.05` | 否 | 显著性阈值 | |
58 | | -| `-m` | `--method` | character | `fgsea` | 否 | GSEA 方法:`fgsea` 或 `DOSE` | |
59 | | -| `-c` | `--chunk_size` | numeric | `1000` | 否 | 大基因集转换时的分块大小 | |
60 | | -| `-r` | `--rds_path` | character | `NULL` | 否 | 预存基因集 RDS 路径 | |
61 | | -| `-v` | `--verbose` | logical | `FALSE` | 否 | 输出详细日志 | |
62 | | -| | `--seed` | integer | `42` | 否 | 随机种子 | |
63 | | -| | `--timeout` | integer | `300` | 否 | 超时秒数,`<=0` 表示不限制 | |
64 | | -| `-h` | `--help` | logical | `FALSE` | 否 | 显示帮助 | |
65 | | - |
66 | | -### 绘图模式参数 |
67 | | - |
68 | | -| 短参数 | 长参数 | 类型 | 默认值 | 是否必填 | 说明 | |
| 51 | +| `-i` | `--input` | character | `NULL` | yes | Input CSV file | |
| 52 | +| `-o` | `--outdir` | character | `GSEA_analysis` | no | Output directory | |
| 53 | +| `-g` | `--gene_col` | character | `name` | no | Gene column name | |
| 54 | +| `-f` | `--fc_col` | character | `logFC` | no | Ranking-statistic column name | |
| 55 | +| `-t` | `--type` | character | `KEGG` | no | Gene-set type: `KEGG`, `HALLMARKS`, `GO_BP`, `GO_MF`, `GO_CC`. With a preloaded RDS, `HALLMARKS` is automatically mapped to the asset key `Hallmarks` | |
| 56 | +| `-s` | `--species` | character | `human` | no | Species: `human`, `mouse`, `rat` | |
| 57 | +| `-p` | `--pvalue_cutoff` | numeric | `0.05` | no | Significance threshold | |
| 58 | +| `-m` | `--method` | character | `fgsea` | no | GSEA backend: `fgsea` or `DOSE` | |
| 59 | +| `-c` | `--chunk_size` | numeric | `1000` | no | Chunk size for large gene-set conversion | |
| 60 | +| `-r` | `--rds_path` | character | `NULL` | no | Path to a pre-stored gene-set RDS | |
| 61 | +| `-v` | `--verbose` | logical | `FALSE` | no | Verbose logging | |
| 62 | +| | `--seed` | integer | `42` | no | Random seed | |
| 63 | +| | `--timeout` | integer | `300` | no | Timeout in seconds; `<=0` disables it | |
| 64 | +| `-h` | `--help` | logical | `FALSE` | no | Show help | |
| 65 | + |
| 66 | +### Plot-mode arguments |
| 67 | + |
| 68 | +| Short | Long | Type | Default | Required | Description | |
69 | 69 | |---|---|---|---|---|---| |
70 | | -| | `--running_file` | character | `NULL` | 是 | `gsea_running_scores.csv` 路径 | |
71 | | -| | `--enrich_file` | character | `NULL` | 是 | `enrichGSEA.csv` 路径 | |
72 | | -| | `--plot_output` | character | `gsea_plot.pdf` | 否 | 输出图文件路径 | |
73 | | -| | `--plot_width` | numeric | `8` | 否 | 图宽 | |
74 | | -| | `--plot_height` | numeric | `6` | 否 | 图高 | |
75 | | -| | `--plot_format` | character | `pdf` | 否 | 输出格式:`pdf` 或 `png` | |
76 | | -| | `--top_n` | numeric | `1` | 否 | 未指定 `geneSetID` 时绘制前 N 条通路 | |
77 | | -| | `--rank_by` | character | `p.adjust` | 否 | 通路排序列 | |
78 | | -| | `--geneSetID` | character | `""` | 否 | 逗号分隔的通路 ID | |
79 | | -| | `--plot_title` | character | `""` | 否 | 图标题 | |
80 | | -| | `--colors` | character | `#4DBBD5,#E64B35,#00A087,#F39B7F,#3C5488,#8491B4` | 否 | 颜色列表 | |
81 | | -| | `--base_size` | numeric | `11` | 否 | 基础字号 | |
82 | | -| | `--subplots` | character | `1,2,3` | 否 | 显示子图编号 | |
83 | | -| | `--rel_heights` | character | `1.5,0.8,1` | 否 | 子图高度比例 | |
84 | | -| | `--NES_table` | logical | `TRUE` | 否 | 显示 NES 注释 | |
85 | | -| | `--no_NES_table` | logical | `FALSE` | 否 | 关闭 NES 注释 | |
86 | | -| | `--NES_label_size` | numeric | `4` | 否 | NES 注释字号 | |
87 | | -| | `--NES_label_x` | numeric | `0.75` | 否 | NES 注释横向位置 | |
88 | | -| | `--NES_label_y` | numeric | `0.75` | 否 | NES 注释纵向位置 | |
89 | | -| | `--NES_label_color` | character | `black` | 否 | NES 注释颜色 | |
90 | | -| | `--NES_label_hjust` | numeric | `0` | 否 | NES 注释水平对齐 | |
91 | | -| | `--NES_label_vjust` | numeric | `1` | 否 | NES 注释垂直对齐 | |
92 | | -| | `--line_width` | numeric | `1` | 否 | ES 线宽 | |
93 | | -| | `--dot_size` | numeric | `1.2` | 否 | ES 点大小 | |
94 | | -| | `--legend_position` | character | `auto` | 否 | 图例位置 | |
95 | | -| | `--legend_x` | numeric | `0.02` | 否 | 内嵌图例横坐标 | |
96 | | -| | `--legend_y` | numeric | `0.02` | 否 | 内嵌图例纵坐标 | |
97 | | -| | `--legend_just_x` | numeric | `0` | 否 | 图例横向对齐 | |
98 | | -| | `--legend_just_y` | numeric | `0` | 否 | 图例纵向对齐 | |
99 | | -| | `--legend_text_size` | numeric | `9` | 否 | 图例文字大小 | |
100 | | -| | `--legend_key_size` | numeric | `0.6` | 否 | 图例键大小 | |
101 | | -| | `--legend_bg_alpha` | numeric | `0` | 否 | 图例背景透明度 | |
102 | | -| | `--grid_major_color` | character | `grey92` | 否 | 主网格颜色 | |
103 | | -| | `--grid_minor_color` | character | `grey92` | 否 | 次网格颜色 | |
104 | | -| | `--ylab_es` | character | `Enrichment Score` | 否 | ES 面板纵轴标题 | |
105 | | -| | `--ylab_rank` | character | `Ranked List Metric` | 否 | 排名面板纵轴标题 | |
106 | | -| | `--xlab_rank` | character | `Rank in Ordered Dataset` | 否 | 排名面板横轴标题 | |
107 | | -| | `--hit_height` | numeric | `1` | 否 | 命中条高度 | |
108 | | -| | `--hit_gap` | numeric | `0` | 否 | 命中条间距 | |
109 | | -| | `--hit_linewidth` | numeric | `0.5` | 否 | 命中条线宽 | |
110 | | -| | `--rank_bar_alpha` | numeric | `0.9` | 否 | 排名条透明度 | |
111 | | -| | `--rank_bar_height_ratio` | numeric | `0.3` | 否 | 排名条高度比例 | |
112 | | -| | `--rank_metric_segment_color` | character | `grey` | 否 | 排名线颜色 | |
113 | | -| | `--rank_metric_segment_width` | numeric | `0.3` | 否 | 排名线宽 | |
114 | | -| | `--rank_metric_segment_alpha` | numeric | `1` | 否 | 排名线透明度 | |
115 | | -| | `--pvalue_table` | logical | `FALSE` | 否 | 显示 P 值表 | |
116 | | -| | `--ES_geom` | character | `line` | 否 | ES 绘制方式:`line` 或 `dot` | |
117 | | -| | `--verbose` | logical | `FALSE` | 否 | 输出详细日志 | |
118 | | -| | `--seed` | integer | `42` | 否 | 随机种子 | |
119 | | -| | `--timeout` | integer | `300` | 否 | 超时秒数,`<=0` 表示不限制 | |
120 | | -| `-h` | `--help` | logical | `FALSE` | 否 | 显示帮助 | |
121 | | - |
122 | | -## 输入格式 |
123 | | - |
124 | | -分析模式输入为 CSV 文件,至少包含两列: |
125 | | -- 基因列,默认列名为 `name` |
126 | | -- 排序统计量列,默认列名为 `logFC` |
127 | | - |
128 | | -示例: |
| 70 | +| | `--running_file` | character | `NULL` | yes | Path to `gsea_running_scores.csv` | |
| 71 | +| | `--enrich_file` | character | `NULL` | yes | Path to `enrichGSEA.csv` | |
| 72 | +| | `--plot_output` | character | `gsea_plot.pdf` | no | Output plot path | |
| 73 | +| | `--plot_width` | numeric | `8` | no | Plot width | |
| 74 | +| | `--plot_height` | numeric | `6` | no | Plot height | |
| 75 | +| | `--plot_format` | character | `pdf` | no | Output format: `pdf` or `png` | |
| 76 | +| | `--top_n` | numeric | `1` | no | Number of top pathways to plot when `geneSetID` is not given | |
| 77 | +| | `--rank_by` | character | `p.adjust` | no | Column used to rank pathways | |
| 78 | +| | `--geneSetID` | character | `""` | no | Comma-separated pathway IDs | |
| 79 | +| | `--plot_title` | character | `""` | no | Plot title | |
| 80 | +| | `--colors` | character | `#4DBBD5,#E64B35,#00A087,#F39B7F,#3C5488,#8491B4` | no | Color list | |
| 81 | +| | `--base_size` | numeric | `11` | no | Base font size | |
| 82 | +| | `--subplots` | character | `1,2,3` | no | Sub-panel indices to display | |
| 83 | +| | `--rel_heights` | character | `1.5,0.8,1` | no | Relative panel heights | |
| 84 | +| | `--NES_table` | logical | `TRUE` | no | Show NES annotation | |
| 85 | +| | `--no_NES_table` | logical | `FALSE` | no | Disable NES annotation | |
| 86 | +| | `--NES_label_size` | numeric | `4` | no | NES label font size | |
| 87 | +| | `--NES_label_x` | numeric | `0.75` | no | NES label x position | |
| 88 | +| | `--NES_label_y` | numeric | `0.75` | no | NES label y position | |
| 89 | +| | `--NES_label_color` | character | `black` | no | NES label color | |
| 90 | +| | `--NES_label_hjust` | numeric | `0` | no | NES label horizontal justification | |
| 91 | +| | `--NES_label_vjust` | numeric | `1` | no | NES label vertical justification | |
| 92 | +| | `--line_width` | numeric | `1` | no | ES line width | |
| 93 | +| | `--dot_size` | numeric | `1.2` | no | ES dot size | |
| 94 | +| | `--legend_position` | character | `auto` | no | Legend position | |
| 95 | +| | `--legend_x` | numeric | `0.02` | no | Inset legend x coordinate | |
| 96 | +| | `--legend_y` | numeric | `0.02` | no | Inset legend y coordinate | |
| 97 | +| | `--legend_just_x` | numeric | `0` | no | Legend horizontal justification | |
| 98 | +| | `--legend_just_y` | numeric | `0` | no | Legend vertical justification | |
| 99 | +| | `--legend_text_size` | numeric | `9` | no | Legend text size | |
| 100 | +| | `--legend_key_size` | numeric | `0.6` | no | Legend key size | |
| 101 | +| | `--legend_bg_alpha` | numeric | `0` | no | Legend background alpha | |
| 102 | +| | `--grid_major_color` | character | `grey92` | no | Major grid color | |
| 103 | +| | `--grid_minor_color` | character | `grey92` | no | Minor grid color | |
| 104 | +| | `--ylab_es` | character | `Enrichment Score` | no | ES panel y-axis title | |
| 105 | +| | `--ylab_rank` | character | `Ranked List Metric` | no | Rank panel y-axis title | |
| 106 | +| | `--xlab_rank` | character | `Rank in Ordered Dataset` | no | Rank panel x-axis title | |
| 107 | +| | `--hit_height` | numeric | `1` | no | Hit-bar height | |
| 108 | +| | `--hit_gap` | numeric | `0` | no | Hit-bar gap | |
| 109 | +| | `--hit_linewidth` | numeric | `0.5` | no | Hit-bar line width | |
| 110 | +| | `--rank_bar_alpha` | numeric | `0.9` | no | Rank-bar alpha | |
| 111 | +| | `--rank_bar_height_ratio` | numeric | `0.3` | no | Rank-bar height ratio | |
| 112 | +| | `--rank_metric_segment_color` | character | `grey` | no | Rank-line color | |
| 113 | +| | `--rank_metric_segment_width` | numeric | `0.3` | no | Rank-line width | |
| 114 | +| | `--rank_metric_segment_alpha` | numeric | `1` | no | Rank-line alpha | |
| 115 | +| | `--pvalue_table` | logical | `FALSE` | no | Show p-value table | |
| 116 | +| | `--ES_geom` | character | `line` | no | ES geometry: `line` or `dot` | |
| 117 | +| | `--verbose` | logical | `FALSE` | no | Verbose logging | |
| 118 | +| | `--seed` | integer | `42` | no | Random seed | |
| 119 | +| | `--timeout` | integer | `300` | no | Timeout in seconds; `<=0` disables it | |
| 120 | +| `-h` | `--help` | logical | `FALSE` | no | Show help | |
| 121 | + |
| 122 | +## Input format |
| 123 | + |
| 124 | +Analysis-mode input is a CSV with at least: |
| 125 | +- a gene column (default name `name`) |
| 126 | +- a ranking-statistic column (default name `logFC`) |
| 127 | + |
| 128 | +Example: |
129 | 129 | ```csv |
130 | 130 | name,logFC,pvalue,padj |
131 | 131 | TP53,2.5,0.001,0.01 |
132 | 132 | BRCA1,1.8,0.005,0.02 |
133 | 133 | EGFR,-1.2,0.01,0.05 |
134 | 134 | ``` |
135 | 135 |
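A pre-flight check of the input CSV might look like the sketch below. It is written in Python for illustration only; `load_ranked_genes` is a hypothetical helper, and reusing the error-code strings `SKILL_MISSING_COLUMNS` and `SKILL_EMPTY_DATA` in the messages is an assumption about when the real script raises them. Dropping rows with a blank gene or statistic and sorting by descending statistic are also assumptions (the latter is the usual way to prepare a GSEA ranking).

```python
import csv
import io


def load_ranked_genes(csv_text, gene_col="name", fc_col="logFC"):
    """Parse CSV text and return [(gene, statistic), ...] sorted high to low."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [col for col in (gene_col, fc_col) if col not in header]
    if missing:
        raise ValueError("SKILL_MISSING_COLUMNS: " + ", ".join(missing))

    ranked = []
    for row in reader:
        gene = (row[gene_col] or "").strip()
        value = (row[fc_col] or "").strip()
        if not gene or not value:
            continue  # assumption: incomplete rows are skipped
        ranked.append((gene, float(value)))

    if not ranked:
        raise ValueError("SKILL_EMPTY_DATA: no usable rows")
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


sample = (
    "name,logFC,pvalue,padj\n"
    "TP53,2.5,0.001,0.01\n"
    "BRCA1,1.8,0.005,0.02\n"
    "EGFR,-1.2,0.01,0.05\n"
)
print(load_ranked_genes(sample))
```

Passing `gene_col` and `fc_col` mirrors the `--gene_col` and `--fc_col` arguments, so a file with differently named columns can be checked the same way.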
|
136 | | -取值约束: |
137 | | -- `type` 支持 `KEGG`、`HALLMARKS`、`GO_BP`、`GO_MF`、`GO_CC` |
138 | | -- 当使用预载 RDS 时,`HALLMARKS` 会自动匹配资产中的 `Hallmarks` 键名 |
139 | | -- `species` 支持 `human`、`mouse`、`rat` |
| 136 | +Value constraints: |
| 137 | +- `type` accepts `KEGG`, `HALLMARKS`, `GO_BP`, `GO_MF`, `GO_CC` |
| 138 | +- When using a preloaded RDS, `HALLMARKS` is automatically matched to the asset key `Hallmarks` |
| 139 | +- `species` accepts `human`, `mouse`, `rat` |
140 | 140 |
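The value constraints can be expressed as a small validator. The sketch below is Python and purely illustrative: `check_constraints` is a hypothetical name, case-sensitive matching is an assumption, and so is the idea that every type other than `HALLMARKS` uses its own name as the asset key. Only the `HALLMARKS` to `Hallmarks` mapping comes from the document.

```python
ALLOWED_TYPES = ("KEGG", "HALLMARKS", "GO_BP", "GO_MF", "GO_CC")
ALLOWED_SPECIES = ("human", "mouse", "rat")


def check_constraints(gene_set_type, species, using_rds=False):
    """Validate type and species; return (asset_key, species)."""
    if gene_set_type not in ALLOWED_TYPES:
        raise ValueError(
            "SKILL_INVALID_PARAMETER: type must be one of " + ", ".join(ALLOWED_TYPES)
        )
    if species not in ALLOWED_SPECIES:
        raise ValueError(
            "SKILL_INVALID_PARAMETER: species must be one of " + ", ".join(ALLOWED_SPECIES)
        )
    # Documented: with a preloaded RDS, HALLMARKS maps to the key "Hallmarks".
    # Assumption: all other types are looked up under their own name.
    if using_rds and gene_set_type == "HALLMARKS":
        asset_key = "Hallmarks"
    else:
        asset_key = gene_set_type
    return asset_key, species


print(check_constraints("HALLMARKS", "human", using_rds=True))
```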
|
141 | | -## 输出文件 |
| 141 | +## Output files |
142 | 142 |
|
143 | | -| 文件名 | 格式 | 内容说明 | |
| 143 | +| File | Format | Description | |
144 | 144 | |---|---|---| |
145 | | -| `data/GSEA_list.rda` | RDA | 完整 GSEA 结果对象 | |
146 | | -| `Table/enrichGSEA.csv` | CSV | 富集结果表 | |
147 | | -| `Table/gsea_running_scores.csv` | CSV | 运行分数表;若无富集结果则输出空表头文件 | |
148 | | -| `plot/` | directory | 绘图输出目录 | |
149 | | -| `session_info.txt` | TXT | R 版本与包版本信息 | |
| 145 | +| `data/GSEA_list.rda` | RDA | Full GSEA result object | |
| 146 | +| `Table/enrichGSEA.csv` | CSV | Enrichment result table | |
| 147 | +| `Table/gsea_running_scores.csv` | CSV | Running-score table; if no enrichment passes, a header-only file is still written | |
| 148 | +| `plot/` | directory | Plot output directory | |
| 149 | +| `session_info.txt` | TXT | R version and package versions | |
150 | 150 |
|
151 | | -`enrichGSEA.csv` 主要包含:`ID`、`Description`、`NES`、`pvalue`、`p.adjust`、`core_enrichment`。 |
| 151 | +`enrichGSEA.csv` mainly contains: `ID`, `Description`, `NES`, `pvalue`, `p.adjust`, `core_enrichment`. |
152 | 152 |
|
153 | | -## 错误处理 |
| 153 | +## Error handling |
154 | 154 |
|
155 | | -常见错误码: |
156 | | -- `SKILL_FILE_NOT_FOUND`:输入文件不存在 |
157 | | -- `SKILL_MISSING_COLUMNS`:缺少必要列 |
158 | | -- `SKILL_EMPTY_DATA`:输入数据为空或过滤后为空 |
159 | | -- `SKILL_INVALID_PARAMETER`:参数值不合法 |
160 | | -- `SKILL_PACKAGE_NOT_FOUND`:依赖包未安装 |
161 | | -- `SKILL_ANALYSIS_FAILED`:分析重试后仍失败 |
| 155 | +Common error codes: |
| 156 | +- `SKILL_FILE_NOT_FOUND`: input file does not exist |
| 157 | +- `SKILL_MISSING_COLUMNS`: required columns are missing |
| 158 | +- `SKILL_EMPTY_DATA`: input is empty, or empty after filtering |
| 159 | +- `SKILL_INVALID_PARAMETER`: an argument has an invalid value |
| 160 | +- `SKILL_PACKAGE_NOT_FOUND`: a required package is not installed |
| 161 | +- `SKILL_ANALYSIS_FAILED`: GSEA still failed after retries |
162 | 162 |
|
163 | | -排查文档:`references/troubleshooting.md` |
| 163 | +Triage doc: `references/troubleshooting.md` |
164 | 164 |
|
165 | | -退出状态码: |
166 | | -- `0`:运行成功 |
167 | | -- `1`:运行失败 |
| 165 | +Exit codes: |
| 166 | +- `0`: success |
| 167 | +- `1`: failure |
168 | 168 |
|
169 | | -## 测试方法 |
| 169 | +## Testing |
170 | 170 |
|
171 | | -最小测试数据集:`tests/data/sample_deg_results.csv` |
| 171 | +Minimal test dataset: `tests/data/sample_deg_results.csv` |
172 | 172 |
|
173 | | -最小运行命令: |
| 173 | +Minimal command: |
174 | 174 | `Rscript scripts/main.R --input tests/data/sample_deg_results.csv --outdir ./test_output --type KEGG --species human --seed 42 --timeout 300 --verbose` |
175 | 175 |
|
176 | | -预期输出: |
| 176 | +Expected output: |
177 | 177 | - `./test_output/data/GSEA_list.rda` |
178 | 178 | - `./test_output/Table/enrichGSEA.csv` |
179 | 179 | - `./test_output/Table/gsea_running_scores.csv` |
180 | 180 | - `./test_output/session_info.txt` |
181 | | -- 若无显著富集结果,`gsea_running_scores.csv` 仍会生成,但只包含表头 |
182 | | -- 退出状态码为 `0` |
| 181 | +- If no significant enrichment is found, `gsea_running_scores.csv` is still written but contains only the header |
| 182 | +- Exit code `0` |
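After the minimal command finishes, the expected outputs can be checked with a short script. The sketch below is Python and only inspects the file system; the helper names `missing_outputs` and `is_header_only` are hypothetical, and treating "header only" as "exactly one CSV row" is an assumption about how the empty table is written.

```python
import csv
from pathlib import Path

# Relative paths listed under "Expected output".
EXPECTED_FILES = [
    "data/GSEA_list.rda",
    "Table/enrichGSEA.csv",
    "Table/gsea_running_scores.csv",
    "session_info.txt",
]


def missing_outputs(outdir):
    """Return the expected files that are absent from outdir."""
    root = Path(outdir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


def is_header_only(csv_path):
    """True when the CSV has a header row and no data rows."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.reader(handle))
    return len(rows) == 1


if __name__ == "__main__":
    absent = missing_outputs("./test_output")
    print("missing:", absent if absent else "none")
```

An empty `missing` list together with exit code `0` from `Rscript` matches the documented success case; `is_header_only` distinguishes a run with no significant enrichment from one that produced running scores.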