This repository validates whether Hermes Multi-Agent Kanban Board can run a long-running research workflow in a traceable, self-recovering, and human-gated way.
The concrete case study is Korean K-12 essay auto-scoring. The primary goal of this repository is not to productize an essay scoring model, but to use a realistic machine learning research task to evaluate the reliability, traceability, and quality-evolution potential of a Hermes-based autonomous research workflow.
This is a workflow validation project, not just a model performance experiment.
| Validation target | Description |
|---|---|
| Long-running execution | Verifies whether Hermes workers can keep a multi-step research chain moving over extended runs |
| Kanban-native dependency | Connects AUDIT, SPLIT, FEATURE, MODEL, HPO, EVAL, REVIEW, SYNTH, and DECIDE through board dependencies |
| Traceability | Links task bodies, artifact paths, MLflow runs, Optuna studies, and commit evidence |
| Self-recovery | Preserves evidence and recovers from split failures, environment limits, and interrupted long-running jobs |
| Human gate | Uses DECIDE tasks with [Continue], [Phase-up], and [Stop] for explicit cycle control |
| Quality evolution | Tests whether model quality improves from baselines to Transformer, HPO, and ensemble stages |
Korean K-12 essay auto-scoring is used as a realistic ML benchmark for validating the Hermes workflow.
| Item | Description |
|---|---|
| Domain | Korean K-12 essay auto-scoring |
| Data | 5,003 stratified samples from AI Hub Training data |
| Task | Multi-task regression for rubric-level scores and overall score |
| Models | M1 dummy, M2 length, M3 TF-IDF+Ridge, M4 LightGBM, M5 KLUE-RoBERTa, M6 ensemble |
| Optimization | Optuna Hyperparameter Optimization |
| Tracking | MLflow + SQLite |
| Evaluation | QWK, RMSE, MAE, rubric-level metrics, score-band fairness |
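QWK (quadratic weighted kappa) is the least familiar metric in the Evaluation row, so a minimal pure-Python sketch follows. It assumes scores are integers on a known scale, so regression outputs would need rounding to a score level first. The function name and signature are illustrative and are not taken from `pipelines/evaluate.py`.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_score, max_score):
    """Agreement between two integer ratings, penalising large disagreements more."""
    n_cat = max_score - min_score + 1
    if n_cat < 2:
        raise ValueError("need at least two score levels")
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and the same length")
    n = len(y_true)

    # Observed confusion matrix: rows are true scores, columns are predictions.
    observed = [[0] * n_cat for _ in range(n_cat)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_score][p - min_score] += 1

    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(col) for col in zip(*observed)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two ratings were independent.
            weighted_expected += weight * hist_true[i] * hist_pred[j] / n

    if weighted_expected == 0:
        # Both ratings use one and the same score level: kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1.0 means perfect agreement, 0 means chance-level agreement, and negative values mean systematic disagreement.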
Each stage is registered as a Hermes Kanban task. Parent dependencies promote the next stage to ready state automatically. Long-running jobs are tracked through external execution and progress polling rather than keeping workers blocked in the foreground.
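The promotion rule described above can be pictured with a small in-memory model. This is only a sketch and not the Hermes API: the `Task` class, the status names `todo`, `ready`, and `done`, and the strictly linear stage chain are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Stage names from the validation table; the linear ordering is an assumption.
STAGES = ["AUDIT", "SPLIT", "FEATURE", "MODEL", "HPO", "EVAL", "REVIEW", "SYNTH", "DECIDE"]


@dataclass
class Task:
    name: str
    parents: list = field(default_factory=list)
    status: str = "todo"  # hypothetical lifecycle: todo -> ready -> done


def build_chain(stages):
    """Create one task per stage, each depending on the previous stage."""
    tasks = {}
    previous = None
    for name in stages:
        tasks[name] = Task(name, parents=[previous] if previous else [])
        previous = name
    return tasks


def promote(tasks):
    """Move every 'todo' task whose parents are all 'done' to 'ready'."""
    promoted = []
    for task in tasks.values():
        if task.status == "todo" and all(tasks[p].status == "done" for p in task.parents):
            task.status = "ready"
            promoted.append(task.name)
    return promoted


def complete(tasks, name):
    """Mark a task done and return the tasks that became ready as a result."""
    tasks[name].status = "done"
    return promote(tasks)


board = build_chain(STAGES)
promote(board)            # AUDIT has no parents, so it becomes ready first
complete(board, "AUDIT")  # finishing AUDIT promotes SPLIT
```

In this model a worker never picks the next stage itself; it only completes its own task, and readiness follows from the dependency graph.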
| Item | Current value |
|---|---|
| Active board | essay-auto-scoring-research-phase3 |
| Phase | Phase 3 Mid Multi-task |
| Primary data | dataset/sample_5k/ |
| Active cycle | M2R recovery chain |
| Models | M1-M4 CPU baseline, M5 multi-task KLUE-RoBERTa, M6 multi-output ensemble |
| Tracking | sqlite:///mlflow.db, sqlite:///optuna.db |
| Human gate | [Continue], [Phase-up], [Stop] in DECIDE-* tasks |
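One way to keep the human gate explicit is to accept a decision only when exactly one marker is present. The sketch below assumes the decision arrives as free text containing the bracketed marker; `parse_gate` and the returned action names are hypothetical and not part of Hermes.

```python
import re

# Markers from the Human gate row; the action names on the right are illustrative.
GATE_ACTIONS = {"Continue": "continue", "Phase-up": "phase_up", "Stop": "stop"}
_GATE_RE = re.compile(r"\[(Continue|Phase-up|Stop)\]")


def parse_gate(decision_text):
    """Return the single gate action in the text, or raise if it is missing or ambiguous."""
    found = set(_GATE_RE.findall(decision_text))
    if len(found) != 1:
        raise ValueError(f"expected exactly one gate marker, found {sorted(found)}")
    return GATE_ACTIONS[found.pop()]
```

Rejecting text with zero or several markers means an ambiguous comment cannot advance the cycle.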
Source of truth:

- Project rules: `AGENTS.md`
- Phase goal: `MILESTONE_v3.md`
- Gates: `ACCEPTANCE_CRITERIA.yaml`
- Docs index: `docs/README.md`
```bash
# Check the Python environment
python3 -c "import pandas, sklearn, lightgbm, mlflow, transformers, datasets, accelerate, optuna"

# Inspect the active board
hermes kanban boards list
hermes kanban stats
hermes kanban list --sort created

# Inspect a task
hermes kanban show <task_id>
hermes kanban runs <task_id>
```

Dashboard: http://localhost:9119/kanban
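The one-line import check stops at the first missing package. As a convenience, the sketch below reports every missing package at once; it uses only the standard library, and the helper name `missing_packages` is an illustration, not something shipped in this repository.

```python
import importlib.util

# Same top-level module names as the one-line check.
REQUIRED = [
    "pandas", "sklearn", "lightgbm", "mlflow",
    "transformers", "datasets", "accelerate", "optuna",
]


def missing_packages(names):
    """Return the names that cannot be located, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```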
| Path | Purpose |
|---|---|
| `dataset/1.Training/라벨링데이터/` | Original AI Hub Training data, read-only |
| `dataset/2.Validation/라벨링데이터/` | Candidate final holdout, not used in current training folds |
| `dataset/sample_5k/` | Phase 3 primary sample, stratified from Training with seed 42 |
| `dataset/sample/` | Phase 1 toy evidence, read-only |
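To illustrate what a seeded stratified draw looks like, here is a minimal sketch with proportional allocation. It is not the implementation in `pipelines.extract_5k`: the stratification key, the rounding rule, and the function name are assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, target_n, seed=42):
    """Draw roughly target_n records, keeping each stratum's share of the population."""
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    total = len(records)
    sample = []
    for name in sorted(strata):  # sorted so the draw order does not depend on input order
        members = strata[name]
        # Proportional quota, at least 1 and never more than the stratum holds.
        # Per-stratum rounding means the final size can differ slightly from target_n.
        quota = min(len(members), max(1, round(target_n * len(members) / total)))
        sample.extend(rng.sample(members, quota))
    return sample
```

Calling the function twice with the same records and seed returns the same sample, which is the property the fixed seed is there to provide.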
Regenerate the 5K sample:

```bash
python3 -m pipelines.extract_5k dataset/1.Training \
  --out dataset/sample_5k \
  --target-n 5000 \
  --seed 42
```

```bash
# Data audit
python3 pipelines/audit_data.py --input dataset/sample_5k/

# Split generation: default k=5
python3 pipelines/make_splits.py \
  --input dataset/sample_5k/ \
  --k 5 \
  --output workspace/cycle_M<N>/splits \
  --cycle-id M<N> \
  --kanban-task-id <task_id> \
  --min-valid-n 300 \
  --group-key student.location

# M2 approved fallback:
# preserve failed k=5 evidence, then use region merge + k=3
python3 pipelines/make_splits.py \
  --input dataset/sample_5k/ \
  --k 3 \
  --output workspace/cycle_M<N>/splits \
  --cycle-id M<N> \
  --kanban-task-id <task_id> \
  --min-valid-n 300 \
  --group-key region \
  --audit-table workspace/cycle_M<N>/audit/data_audit/audit_table_no_raw_text.csv
```
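The k=5 to k=3 fallback can be understood as a fold-size check on grouped splits. The sketch below is a simplified stand-in for `pipelines/make_splits.py`, whose actual logic is not shown here: the greedy group assignment is an assumption, and the region merge step is not modeled (only `k` changes between attempts).

```python
from collections import Counter


def grouped_folds(group_of_sample, k):
    """Assign whole groups to k folds, largest group first, into the smallest fold."""
    sizes = Counter(group_of_sample)
    fold_sizes = [0] * k
    fold_of_group = {}
    for group, n in sorted(sizes.items(), key=lambda item: (-item[1], item[0])):
        target = fold_sizes.index(min(fold_sizes))
        fold_of_group[group] = target
        fold_sizes[target] += n
    return fold_of_group, fold_sizes


def split_with_fallback(group_of_sample, ks=(5, 3), min_valid_n=300):
    """Try each k in order; keep a record of every attempt, including failures."""
    attempts = []
    for k in ks:
        fold_of_group, fold_sizes = grouped_folds(group_of_sample, k)
        ok = min(fold_sizes) >= min_valid_n
        attempts.append({"k": k, "fold_sizes": fold_sizes, "ok": ok})
        if ok:
            return fold_of_group, attempts
    return None, attempts
```

Returning the full `attempts` list mirrors the rule that failed k=5 evidence is preserved rather than overwritten by the fallback run.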
```bash
# CPU baselines
python3 -m pipelines.train \
  --models M1,M2,M3,M4 \
  --cycle-id M<N> \
  --mlflow-uri sqlite:///mlflow.db

# M5/M6 for Phase 3 acceptance must not use the legacy scalar command.
# Follow docs/multi_task_채점모델_구현_스펙_v_1_1.md for the multi-task launcher/spec.

# HPO
python3 -m pipelines.run_hpo \
  --model M4 \
  --cycle-id M<N> \
  --n-trials 30 \
  --study-name cycle_M<N>_M4 \
  --storage sqlite:///optuna.db \
  --mlflow-uri sqlite:///mlflow.db \
  --experiment-name essay-auto-scoring-phase3 \
  --kanban-task-id <task_id> \
  --split-dir workspace/cycle_M<N>/splits \
  --feature-dir workspace/cycle_M<N>/features \
  --label-dir dataset/sample_5k/ \
  --output-dir workspace/cycle_M<N>/hpo
```
```bash
# Evaluation
python3 pipelines/evaluate.py --cycle-id M<N>
```

Do not use `vastai show user` for Vast.ai authentication checks. CLI 0.5.0 may fail against the current API because it appends `owner=me`.

```bash
vastai --api-key "$VAST_API_KEY" show instances --raw
vastai --api-key "$VAST_API_KEY" search offers 'gpu_ram>=8 reliability>0.95' --raw
```

```text
.
├── AGENTS.md
├── MILESTONE.md
├── MILESTONE_v2.md
├── MILESTONE_v3.md
├── ACCEPTANCE_CRITERIA.yaml
├── VAST_GPU_GUIDE.md
├── configs/
├── pipelines/
├── tests/
├── docs/
│   ├── README.md
│   ├── archive/
│   └── research/
├── reports/      # optional generated reports, ignored when absent
├── skills/
├── workspace/    # ignored runtime artifacts
├── mlflow.db     # ignored runtime DB
└── optuna.db     # ignored runtime DB
```
| Profile | Responsibility |
|---|---|
| `aristotle` | SYNTH, cycle report, next-cycle registration |
| `tukey` | AUDIT |
| `gauss` | SPLIT, FEATURE, MODEL, HPO |
| `spearman` | EVAL |
| `turing` | REVIEW |
| `ada-lovelace` | Implementation support |
The root documents and the active document index in docs/README.md define the operating standard.
| Document | Role |
|---|---|
| `AGENTS.md` | Hermes worker behavior rules and hard rules |
| `ACCEPTANCE_CRITERIA.yaml` | Phase-level acceptance gates |
| `MILESTONE_v3.md` | Phase 3 goals and success criteria |
| `docs/phase_3_operations_guide_v_1_0.md` | Operations guide |
| `docs/multi_task_채점모델_구현_스펙_v_1_1.md` | Multi-task model implementation spec |
Completed Phase 1/2 documents, seminar materials, review checklists, and older specs are archived under docs/archive/.
Code in this repository is licensed under the MIT License.
Dataset files, source PDFs, and external model/data assets remain subject to their original provider terms. Verify redistribution rights before reusing data files outside this repository.

