Hermes Autonomous Research Workflow

This repository validates whether the Hermes Multi-Agent Kanban Board can drive a long-running research workflow in a traceable, self-recovering, and human-gated way.

The concrete case study is Korean K-12 essay auto-scoring. The primary goal of this repository is not to productize an essay scoring model, but to use a realistic machine learning research task to evaluate the reliability, traceability, and quality-evolution potential of a Hermes-based autonomous research workflow.

What This Project Validates

This is a workflow validation project, not just a model performance experiment.

Validation target	Description
Long-running execution	Verifies whether Hermes workers can keep a multi-step research chain moving over extended runs
Kanban-native dependency	Connects AUDIT, SPLIT, FEATURE, MODEL, HPO, EVAL, REVIEW, SYNTH, and DECIDE through board dependencies
Traceability	Links task bodies, artifact paths, MLflow runs, Optuna studies, and commit evidence
Self-recovery	Preserves evidence and recovers from split failures, environment limits, and interrupted long-running jobs
Human gate	Uses DECIDE tasks with `[Continue]`, `[Phase-up]`, and `[Stop]` for explicit cycle control
Quality evolution	Tests whether model quality improves from baselines to Transformer, HPO, and ensemble stages
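
To make the Kanban-native dependency row concrete, here is a minimal sketch of how a chain of stages could be promoted one step at a time. It is not the Hermes API: the `Task` class, the `todo`/`ready`/`done` states, and the helper names are hypothetical, and a strictly linear chain is an assumption (the real board may allow several parents per task).

```python
from dataclasses import dataclass, field

# Stage names come from the table above; the strictly linear order is an assumption.
STAGES = ["AUDIT", "SPLIT", "FEATURE", "MODEL", "HPO", "EVAL", "REVIEW", "SYNTH", "DECIDE"]


@dataclass
class Task:
    name: str
    parents: list = field(default_factory=list)
    status: str = "todo"  # hypothetical states: todo -> ready -> done


def build_chain(stages):
    """Create one task per stage, each depending on the previous stage."""
    tasks = {}
    previous = None
    for stage in stages:
        tasks[stage] = Task(stage, parents=[previous] if previous else [])
        previous = stage
    return tasks


def promote_ready(tasks):
    """Mark every 'todo' task whose parents are all 'done' as 'ready'."""
    promoted = []
    for task in tasks.values():
        if task.status == "todo" and all(tasks[p].status == "done" for p in task.parents):
            task.status = "ready"
            promoted.append(task.name)
    return promoted


board = build_chain(STAGES)
first = promote_ready(board)     # only AUDIT has no parents
board["AUDIT"].status = "done"
second = promote_ready(board)    # SPLIT becomes ready once AUDIT is done
```

A task is promoted only when every parent is `done`, so a failed or interrupted stage holds back everything downstream until it is recovered.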

Case Study

Korean K-12 essay auto-scoring is used as a realistic ML benchmark for validating the Hermes workflow.

Item	Description
Domain	Korean K-12 essay auto-scoring
Data	5,003 stratified samples from AI Hub Training data
Task	Multi-task regression for rubric-level scores and overall score
Models	M1 dummy, M2 length, M3 TF-IDF+Ridge, M4 LightGBM, M5 KLUE-RoBERTa, M6 ensemble
Optimization	Hyperparameter optimization (HPO) with Optuna
Tracking	MLflow + SQLite
Evaluation	QWK, RMSE, MAE, rubric-level metrics, score-band fairness
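
For readers unfamiliar with QWK (quadratic weighted kappa), the sketch below computes it in plain Python. It assumes integer scores within a known range; the function name and signature are illustrative and are not taken from `pipelines/evaluate.py`, which may compute the metric differently.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_score, max_score):
    """Quadratic weighted kappa for integer scores in [min_score, max_score]."""
    n_cat = max_score - min_score + 1
    n = len(y_true)
    if n == 0 or n != len(y_pred) or n_cat < 2:
        raise ValueError("need equal-length, non-empty inputs and at least two categories")

    observed = [[0] * n_cat for _ in range(n_cat)]
    hist_true = [0] * n_cat
    hist_pred = [0] * n_cat
    for t, p in zip(y_true, y_pred):
        i, j = t - min_score, p - min_score
        observed[i][j] += 1
        hist_true[i] += 1
        hist_pred[j] += 1

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two score distributions were independent.
            weighted_expected += weight * hist_true[i] * hist_pred[j] / n

    if weighted_expected == 0:
        # Both sides used one identical category; treating this as perfect
        # agreement is a convention chosen for this sketch.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


perfect = quadratic_weighted_kappa([0, 1, 2], [0, 1, 2], 0, 2)
reversed_scores = quadratic_weighted_kappa([0, 1, 2], [2, 1, 0], 0, 2)
```

Because the penalty grows with the square of the score gap, a prediction that is off by two points costs four times as much as one that is off by one. With scikit-learn installed, `sklearn.metrics.cohen_kappa_score(y_true, y_pred, weights="quadratic")` computes the same statistic.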

Workflow Overview

Each stage is registered as a Hermes Kanban task. When a task's parent dependencies complete, the board promotes it to the ready state automatically. Long-running jobs run as external processes and are tracked by progress polling, so workers are not blocked in the foreground.
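
The external-execution-plus-polling pattern can be sketched with the standard library alone. This is a generic illustration, not the Hermes mechanism: `launch`, `poll_until_done`, and the interval and timeout values are hypothetical.

```python
import subprocess
import sys
import time


def launch(command):
    """Start the job as a separate process and return immediately."""
    return subprocess.Popen(command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)


def poll_until_done(process, interval_s=0.1, timeout_s=30.0):
    """Check the job periodically; return its exit code, or None on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        code = process.poll()  # None while the job is still running
        if code is not None:
            return code
        time.sleep(interval_s)
    return None


# A short stand-in for a long-running training job.
job = launch([sys.executable, "-c", "import time; time.sleep(0.2)"])
exit_code = poll_until_done(job)
```

The caller stays free between polls, and a timeout returns `None` without killing the job, so the decision to wait longer or to recover stays with the caller.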

Current State

Item	Current value
Active board	`essay-auto-scoring-research-phase3`
Phase	Phase 3 Mid Multi-task
Primary data	`dataset/sample_5k/`
Active cycle	`M2R` recovery chain
Models	M1-M4 CPU baseline, M5 multi-task KLUE-RoBERTa, M6 multi-output ensemble
Tracking	`sqlite:///mlflow.db`, `sqlite:///optuna.db`
Human gate	`[Continue]`, `[Phase-up]`, `[Stop]` in `DECIDE-*` tasks
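
As an illustration of the human gate, the sketch below extracts a decision marker from free text. How Hermes actually reads a `DECIDE-*` task is not described here; the assumption is that a reviewer writes exactly one of the three markers, and `parse_gate_decision` is a hypothetical helper.

```python
import re

# The three markers listed in the Human gate row.
_GATE_PATTERN = re.compile(r"\[(Continue|Phase-up|Stop)\]")


def parse_gate_decision(text):
    """Return the single gate marker found in text, or None if absent or ambiguous."""
    found = set(_GATE_PATTERN.findall(text))
    if len(found) == 1:
        return found.pop()
    return None


decision = parse_gate_decision("Reviewed the cycle report. [Phase-up]")
```

Returning `None` for both missing and conflicting markers keeps the gate fail-closed: the cycle does not advance unless the decision is unambiguous.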

Source of truth:

Project rules: AGENTS.md
Phase goal: MILESTONE_v3.md
Gates: ACCEPTANCE_CRITERIA.yaml
Docs index: docs/README.md

Quick Start

# Check the Python environment
python3 -c "import pandas, sklearn, lightgbm, mlflow, transformers, datasets, accelerate, optuna"

# Inspect the active board
hermes kanban boards list
hermes kanban stats
hermes kanban list --sort created

# Inspect a task
hermes kanban show <task_id>
hermes kanban runs <task_id>
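
The one-line import check above stops at the first missing package. As an optional alternative, this small sketch lists every missing package in one pass. `missing_modules` is a hypothetical helper, and because it only looks modules up without importing them, a broken install can still pass.

```python
import importlib.util

# Same module names as the one-line check above.
REQUIRED = ["pandas", "sklearn", "lightgbm", "mlflow",
            "transformers", "datasets", "accelerate", "optuna"]


def missing_modules(names):
    """Return the names that cannot be located on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```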

Dashboard:

http://localhost:9119/kanban

Data

Path	Purpose
`dataset/1.Training/라벨링데이터/`	Original AI Hub Training data, read-only
`dataset/2.Validation/라벨링데이터/`	Candidate final holdout, not used in current training folds
`dataset/sample_5k/`	Phase 3 primary sample, stratified from Training with seed 42
`dataset/sample/`	Phase 1 toy evidence, read-only

Regenerate the 5K sample:

python3 -m pipelines.extract_5k dataset/1.Training \
  --out dataset/sample_5k \
  --target-n 5000 \
  --seed 42
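
To show what seeded stratified sampling means in practice, here is a minimal sketch. It is not the code in `pipelines.extract_5k`; the `grade` field, the function name, and the rounding rule are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, strata_key, target_n, seed=42):
    """Draw about target_n records, keeping each stratum's share of the population."""
    rng = random.Random(seed)  # local RNG so the draw is reproducible
    groups = defaultdict(list)
    for record in records:
        groups[record[strata_key]].append(record)

    fraction = target_n / len(records)
    sample = []
    for key in sorted(groups):  # fixed iteration order keeps the result deterministic
        members = groups[key]
        # At least one record per stratum, never more than the stratum holds.
        k = min(len(members), max(1, round(len(members) * fraction)))
        sample.extend(rng.sample(members, k))
    return sample


# Toy population: 80 records in stratum "A", 20 in stratum "B".
records = [{"id": i, "grade": "A" if i < 80 else "B"} for i in range(100)]
picked = stratified_sample(records, "grade", target_n=10)
```

In this sketch each stratum is rounded independently and keeps at least one record, so the total can differ slightly from `target_n`. Running it again with the same seed returns the same records.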

Pipeline

# Data audit
python3 pipelines/audit_data.py --input dataset/sample_5k/

# Split generation: default k=5
python3 pipelines/make_splits.py \
  --input dataset/sample_5k/ \
  --k 5 \
  --output workspace/cycle_M<N>/splits \
  --cycle-id M<N> \
  --kanban-task-id <task_id> \
  --min-valid-n 300 \
  --group-key student.location

# M2 approved fallback:
# preserve failed k=5 evidence, then use region merge + k=3
python3 pipelines/make_splits.py \
  --input dataset/sample_5k/ \
  --k 3 \
  --output workspace/cycle_M<N>/splits \
  --cycle-id M<N> \
  --kanban-task-id <task_id> \
  --min-valid-n 300 \
  --group-key region \
  --audit-table workspace/cycle_M<N>/audit/data_audit/audit_table_no_raw_text.csv

# CPU baselines
python3 -m pipelines.train \
  --models M1,M2,M3,M4 \
  --cycle-id M<N> \
  --mlflow-uri sqlite:///mlflow.db

# M5/M6 for Phase 3 acceptance must not use the legacy scalar command.
# Follow docs/multi_task_채점모델_구현_스펙_v_1_1.md for the multi-task launcher/spec.

# HPO
python3 -m pipelines.run_hpo \
  --model M4 \
  --cycle-id M<N> \
  --n-trials 30 \
  --study-name cycle_M<N>_M4 \
  --storage sqlite:///optuna.db \
  --mlflow-uri sqlite:///mlflow.db \
  --experiment-name essay-auto-scoring-phase3 \
  --kanban-task-id <task_id> \
  --split-dir workspace/cycle_M<N>/splits \
  --feature-dir workspace/cycle_M<N>/features \
  --label-dir dataset/sample_5k/ \
  --output-dir workspace/cycle_M<N>/hpo

# Evaluation
python3 pipelines/evaluate.py --cycle-id M<N>
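
The split step and its approved fallback (keep the failed k=5 evidence, then merge regions and retry with k=3) can be sketched as follows. This is not `pipelines/make_splits.py`: the greedy fold assignment, the merge map, and the location names are hypothetical, and it assumes that a group key means each group stays entirely within one fold.

```python
from collections import Counter


def group_folds(group_of, k):
    """Assign whole groups to k folds, largest group first, into the smallest fold."""
    sizes = Counter(group_of)
    folds = [[] for _ in range(k)]
    fold_n = [0] * k
    for group, n in sorted(sizes.items(), key=lambda item: (-item[1], item[0])):
        target = fold_n.index(min(fold_n))
        folds[target].append(group)
        fold_n[target] += n
    return folds, fold_n


def split_with_fallback(group_of, attempts, min_valid_n):
    """Try each (k, merge_map) attempt in order; return the first that passes the size check."""
    failures = []
    for k, merge_map in attempts:
        merged = [merge_map.get(g, g) for g in group_of]
        folds, fold_n = group_folds(merged, k)
        if min(fold_n) >= min_valid_n:
            return {"k": k, "folds": folds, "fold_n": fold_n, "failures": failures}
        failures.append({"k": k, "fold_n": fold_n})  # keep evidence of the failed attempt
    raise ValueError(f"no attempt met min_valid_n={min_valid_n}: {failures}")


# Toy data: one group label per sample, with four hypothetical locations.
locations = ["a"] * 500 + ["b"] * 400 + ["c"] * 350 + ["d"] * 50
result = split_with_fallback(
    locations,
    attempts=[(5, {}), (3, {"d": "c"})],  # k=5 as-is, then merge d into c with k=3
    min_valid_n=300,
)
```

Here k=5 fails because four groups cannot fill five folds of at least 300 samples; the failed attempt is recorded in `result["failures"]` before the merged k=3 attempt succeeds.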

Do not use `vastai show user` for Vast.ai authentication checks. CLI 0.5.0 may fail against the current API because it appends `owner=me`. Use one of the following commands instead:

vastai --api-key "$VAST_API_KEY" show instances --raw
vastai --api-key "$VAST_API_KEY" search offers 'gpu_ram>=8 reliability>0.95' --raw

Repository Layout

.
├── AGENTS.md
├── MILESTONE.md
├── MILESTONE_v2.md
├── MILESTONE_v3.md
├── ACCEPTANCE_CRITERIA.yaml
├── VAST_GPU_GUIDE.md
├── configs/
├── pipelines/
├── tests/
├── docs/
│   ├── README.md
│   ├── archive/
│   └── research/
├── reports/          # optional generated reports, ignored when absent
├── skills/
├── workspace/        # ignored runtime artifacts
├── mlflow.db         # ignored runtime DB
└── optuna.db         # ignored runtime DB

Hermes Agent Profiles

Profile	Responsibility
`aristotle`	SYNTH, cycle report, next-cycle registration
`tukey`	AUDIT
`gauss`	SPLIT, FEATURE, MODEL, HPO
`spearman`	EVAL
`turing`	REVIEW
`ada-lovelace`	Implementation support

Governance

The root documents and the active document index in docs/README.md define the operating standard.

Document	Role
`AGENTS.md`	Hermes worker behavior rules and hard rules
`ACCEPTANCE_CRITERIA.yaml`	Phase-level acceptance gates
`MILESTONE_v3.md`	Phase 3 goals and success criteria
`docs/phase_3_operations_guide_v_1_0.md`	Operations guide
`docs/multi_task_채점모델_구현_스펙_v_1_1.md`	Multi-task model implementation spec

Completed Phase 1/2 documents, seminar materials, review checklists, and older specs are archived under docs/archive/.

License

Code in this repository is licensed under the MIT License.

Dataset files, source PDFs, and external model/data assets remain subject to their original provider terms. Verify redistribution rights before reusing data files outside this repository.
