ldmrepo/hermes-autonomous-research-workflow

Hermes Autonomous Research Workflow

Python 3.10+ | pandas | scikit-learn | LightGBM | Hugging Face Transformers | KLUE-RoBERTa | Optuna | MLflow | SQLite | Hermes Kanban | Vast.ai

English | 한국어

This repository validates whether the Hermes Multi-Agent Kanban Board can carry an extended research workflow in a traceable, self-recovering, and human-gated way.

The concrete case study is Korean K-12 essay auto-scoring. The primary goal of this repository is not to productize an essay scoring model, but to use a realistic machine learning research task to evaluate the reliability, traceability, and quality-evolution potential of a Hermes-based autonomous research workflow.

What This Project Validates

This is a workflow validation project, not just a model performance experiment.

| Validation target | Description |
| --- | --- |
| Long-running execution | Verifies whether Hermes workers can keep a multi-step research chain moving over extended runs |
| Kanban-native dependency | Connects AUDIT, SPLIT, FEATURE, MODEL, HPO, EVAL, REVIEW, SYNTH, and DECIDE through board dependencies |
| Traceability | Links task bodies, artifact paths, MLflow runs, Optuna studies, and commit evidence |
| Self-recovery | Preserves evidence and recovers from split failures, environment limits, and interrupted long-running jobs |
| Human gate | Uses DECIDE tasks with [Continue], [Phase-up], and [Stop] for explicit cycle control |
| Quality evolution | Tests whether model quality improves from baselines to Transformer, HPO, and ensemble stages |

Case Study

Korean K-12 essay auto-scoring is used as a realistic ML benchmark for validating the Hermes workflow.

| Item | Description |
| --- | --- |
| Domain | Korean K-12 essay auto-scoring |
| Data | 5,003 stratified samples from AI Hub Training data |
| Task | Multi-task regression for rubric-level scores and overall score |
| Models | M1 dummy, M2 length, M3 TF-IDF+Ridge, M4 LightGBM, M5 KLUE-RoBERTa, M6 ensemble |
| Optimization | Optuna Hyperparameter Optimization |
| Tracking | MLflow + SQLite |
| Evaluation | QWK, RMSE, MAE, rubric-level metrics, score-band fairness |
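
QWK in the table above is quadratic weighted kappa: agreement between two sets of integer ratings, where a disagreement between ratings i and j is penalized by (i - j)^2 / (R - 1)^2 for R rating levels. The following is a minimal pure-Python sketch of the metric, not the code in pipelines/evaluate.py. It assumes scores have already been rounded to integers in a known range; the function name and arguments are illustrative.

```python
from collections import Counter


def quadratic_weighted_kappa(y_true, y_pred, min_rating, max_rating):
    """Quadratic weighted kappa for integer ratings in [min_rating, max_rating]."""
    if max_rating <= min_rating:
        raise ValueError("max_rating must be greater than min_rating")
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    observed = Counter(zip(y_true, y_pred))  # joint counts O[i, j]
    hist_true = Counter(y_true)              # marginal counts of true ratings
    hist_pred = Counter(y_pred)              # marginal counts of predicted ratings
    span = (max_rating - min_rating) ** 2    # (R - 1)^2

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_rating, max_rating + 1):
        for j in range(min_rating, max_rating + 1):
            weight = (i - j) ** 2 / span
            expected = hist_true[i] * hist_pred[j] / n  # counts under independence
            weighted_observed += weight * observed[(i, j)]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Both raters used one identical rating throughout; treated here as
        # perfect agreement (a convention chosen for this sketch).
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4))
print(quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], 1, 3))
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.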

Workflow Overview

[Figure: Hermes autonomous research workflow]

Each stage is registered as a Hermes Kanban task. Parent dependencies automatically promote the next stage to the ready state. Long-running jobs run externally and are tracked by progress polling, so workers are not left blocked in the foreground.
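
The promotion rule described above can be sketched as a small dependency check. This is a toy model, not the Hermes implementation: the Task class, the todo/ready/done status names, and the strictly linear chain are assumptions made for illustration, using the stage names listed in this README.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    parents: list = field(default_factory=list)
    status: str = "todo"  # hypothetical states: todo -> ready -> done


def promote_ready(tasks):
    """Mark each 'todo' task as 'ready' once all of its parents are 'done'."""
    promoted = []
    for task in tasks.values():
        parents_done = all(tasks[p].status == "done" for p in task.parents)
        if task.status == "todo" and parents_done:
            task.status = "ready"
            promoted.append(task.name)
    return promoted


# Assumed linear chain: each stage depends only on the previous one.
stages = ["AUDIT", "SPLIT", "FEATURE", "MODEL", "HPO", "EVAL", "REVIEW", "SYNTH", "DECIDE"]
board = {
    name: Task(name, parents=[stages[i - 1]] if i else [])
    for i, name in enumerate(stages)
}

order = []
while True:
    ready = promote_ready(board)
    if not ready:
        break
    for name in ready:
        board[name].status = "done"  # stand-in for a worker completing the task
        order.append(name)

print(" -> ".join(order))
```

Because a task is promoted only after its parents are done, no worker has to wait on an upstream stage; it simply picks up whatever is ready.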

Current State

| Item | Current value |
| --- | --- |
| Active board | essay-auto-scoring-research-phase3 |
| Phase | Phase 3 Mid Multi-task |
| Primary data | dataset/sample_5k/ |
| Active cycle | M2R recovery chain |
| Models | M1-M4 CPU baseline, M5 multi-task KLUE-RoBERTa, M6 multi-output ensemble |
| Tracking | sqlite:///mlflow.db, sqlite:///optuna.db |
| Human gate | [Continue], [Phase-up], [Stop] in DECIDE-* tasks |
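
As an illustration of the human gate, the sketch below extracts a decision marker from free text. How Hermes actually records the reviewer's choice is not described here; this assumes the reply to a DECIDE task contains exactly one of the three markers, and the function name is hypothetical.

```python
import re

ALLOWED = ("Continue", "Phase-up", "Stop")
_MARKER = re.compile(r"\[(" + "|".join(re.escape(a) for a in ALLOWED) + r")\]")


def parse_decision(text):
    """Return the single gate decision found in text, or None if absent or ambiguous."""
    found = set(_MARKER.findall(text))
    if len(found) != 1:
        return None  # no marker, or conflicting markers: keep the gate closed
    return found.pop()


print(parse_decision("EVAL reviewed, metrics meet the gate. [Phase-up]"))
print(parse_decision("Choose one: [Continue] or [Stop]"))
```

Returning None for missing or conflicting markers keeps the default conservative: the cycle does not advance without one unambiguous human choice.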

Source of truth:

  • Project rules: AGENTS.md
  • Phase goal: MILESTONE_v3.md
  • Gates: ACCEPTANCE_CRITERIA.yaml
  • Docs index: docs/README.md

Quick Start

# Check the Python environment
python3 -c "import pandas, sklearn, lightgbm, mlflow, transformers, datasets, accelerate, optuna"

# Inspect the active board
hermes kanban boards list
hermes kanban stats
hermes kanban list --sort created

# Inspect a task
hermes kanban show <task_id>
hermes kanban runs <task_id>
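
The one-line environment check above stops at the first failed import. A small script along the following lines reports every missing package at once. It is a sketch using only the standard library; note that importlib.util.find_spec only checks that a module can be located, so it will not catch a package that is installed but broken.

```python
import importlib.util

# Module names taken from the import check in this README.
REQUIRED = [
    "pandas", "sklearn", "lightgbm", "mlflow",
    "transformers", "datasets", "accelerate", "optuna",
]


def missing_modules(names):
    """Return the module names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```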

Dashboard:

http://localhost:9119/kanban

Data

| Path | Purpose |
| --- | --- |
| dataset/1.Training/라벨링데이터/ | Original AI Hub Training data, read-only |
| dataset/2.Validation/라벨링데이터/ | Candidate final holdout, not used in current training folds |
| dataset/sample_5k/ | Phase 3 primary sample, stratified from Training with seed 42 |
| dataset/sample/ | Phase 1 toy evidence, read-only |

Regenerate the 5K sample:

python3 -m pipelines.extract_5k dataset/1.Training \
  --out dataset/sample_5k \
  --target-n 5000 \
  --seed 42
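
For readers unfamiliar with seeded stratified sampling, here is a minimal sketch of the idea. It is not the logic of pipelines.extract_5k, and this README does not state which fields the sample is stratified on, so the level field below is purely hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, target_n, seed=42):
    """Draw about target_n records, keeping each stratum's share of the population."""
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)

    total = len(records)
    sample = []
    for label in sorted(strata):  # sorted so the draw order is deterministic
        members = strata[label]
        # Proportional quota with at least one record per stratum; rounding
        # means the final size can differ slightly from target_n.
        quota = max(1, round(target_n * len(members) / total))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


# Toy population with a hypothetical 'level' field: 80 elementary, 20 middle.
records = [{"id": i, "level": "elementary" if i < 80 else "middle"} for i in range(100)]
sample = stratified_sample(records, key=lambda r: r["level"], target_n=10)
print(len(sample), sorted(r["id"] for r in sample))
```

Running the function twice with the same seed returns the same records, which is the property a fixed seed is meant to guarantee.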

Pipeline

# Data audit
python3 pipelines/audit_data.py --input dataset/sample_5k/

# Split generation: default k=5
python3 pipelines/make_splits.py \
  --input dataset/sample_5k/ \
  --k 5 \
  --output workspace/cycle_M<N>/splits \
  --cycle-id M<N> \
  --kanban-task-id <task_id> \
  --min-valid-n 300 \
  --group-key student.location

# M2 approved fallback:
# preserve failed k=5 evidence, then use region merge + k=3
python3 pipelines/make_splits.py \
  --input dataset/sample_5k/ \
  --k 3 \
  --output workspace/cycle_M<N>/splits \
  --cycle-id M<N> \
  --kanban-task-id <task_id> \
  --min-valid-n 300 \
  --group-key region \
  --audit-table workspace/cycle_M<N>/audit/data_audit/audit_table_no_raw_text.csv

# CPU baselines
python3 -m pipelines.train \
  --models M1,M2,M3,M4 \
  --cycle-id M<N> \
  --mlflow-uri sqlite:///mlflow.db

# M5/M6 for Phase 3 acceptance must not use the legacy scalar command.
# Follow docs/multi_task_채점모델_구현_스펙_v_1_1.md for the multi-task launcher/spec.

# HPO
python3 -m pipelines.run_hpo \
  --model M4 \
  --cycle-id M<N> \
  --n-trials 30 \
  --study-name cycle_M<N>_M4 \
  --storage sqlite:///optuna.db \
  --mlflow-uri sqlite:///mlflow.db \
  --experiment-name essay-auto-scoring-phase3 \
  --kanban-task-id <task_id> \
  --split-dir workspace/cycle_M<N>/splits \
  --feature-dir workspace/cycle_M<N>/features \
  --label-dir dataset/sample_5k/ \
  --output-dir workspace/cycle_M<N>/hpo

# Evaluation
python3 pipelines/evaluate.py --cycle-id M<N>
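
The split step above combines a group key with a minimum validation size, and falls back from k=5 to k=3. The sketch below shows one way such a check could behave: whole groups are assigned to folds, every attempt is recorded, and the first k whose folds all reach the minimum size is used. It is an assumption-laden illustration rather than the logic of pipelines/make_splits.py, it does not model the region merge, and the region names and counts are invented.

```python
from collections import Counter


def group_folds(groups, k):
    """Greedy group k-fold: each group goes whole into the currently smallest fold."""
    sizes = Counter(groups)
    folds = [[] for _ in range(k)]
    totals = [0] * k
    # Largest groups first; ties broken by name so the result is deterministic.
    for group, n in sorted(sizes.items(), key=lambda item: (-item[1], item[0])):
        idx = totals.index(min(totals))
        folds[idx].append(group)
        totals[idx] += n
    return folds, totals


def split_with_fallback(groups, ks=(5, 3), min_valid_n=300):
    """Try each k in order; return the first whose folds all meet min_valid_n."""
    attempts = []
    for k in ks:
        folds, totals = group_folds(groups, k)
        ok = min(totals) >= min_valid_n
        attempts.append({"k": k, "fold_sizes": totals, "ok": ok})  # keep failed evidence
        if ok:
            return folds, attempts
    raise ValueError(f"no k in {ks} satisfies min_valid_n={min_valid_n}: {attempts}")


# Invented example: four groups cannot fill five folds, so k=5 fails and k=3 is used.
groups = ["north"] * 400 + ["south"] * 350 + ["east"] * 320 + ["west"] * 310
folds, attempts = split_with_fallback(groups)
for attempt in attempts:
    print(attempt)
```

Keeping the failed attempt in the returned record mirrors the requirement to preserve the failed k=5 evidence before switching to the fallback.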

Do not use vastai show user to check Vast.ai authentication. CLI 0.5.0 appends owner=me to the request, which may fail against the current API. Use one of these commands instead:

vastai --api-key "$VAST_API_KEY" show instances --raw
vastai --api-key "$VAST_API_KEY" search offers 'gpu_ram>=8 reliability>0.95' --raw

Repository Layout

.
├── AGENTS.md
├── MILESTONE.md
├── MILESTONE_v2.md
├── MILESTONE_v3.md
├── ACCEPTANCE_CRITERIA.yaml
├── VAST_GPU_GUIDE.md
├── configs/
├── pipelines/
├── tests/
├── docs/
│   ├── README.md
│   ├── archive/
│   └── research/
├── reports/          # optional generated reports, ignored when absent
├── skills/
├── workspace/        # ignored runtime artifacts
├── mlflow.db         # ignored runtime DB
└── optuna.db         # ignored runtime DB

Hermes Agent Profiles

| Profile | Responsibility |
| --- | --- |
| aristotle | SYNTH, cycle report, next-cycle registration |
| tukey | AUDIT |
| gauss | SPLIT, FEATURE, MODEL, HPO |
| spearman | EVAL |
| turing | REVIEW |
| ada-lovelace | Implementation support |

Governance

The root documents and the active document index in docs/README.md define the operating standard.

| Document | Role |
| --- | --- |
| AGENTS.md | Hermes worker behavior rules and hard rules |
| ACCEPTANCE_CRITERIA.yaml | Phase-level acceptance gates |
| MILESTONE_v3.md | Phase 3 goals and success criteria |
| docs/phase_3_operations_guide_v_1_0.md | Operations guide |
| docs/multi_task_채점모델_구현_스펙_v_1_1.md | Multi-task model implementation spec |

Completed Phase 1/2 documents, seminar materials, review checklists, and older specs are archived under docs/archive/.

License

Code in this repository is licensed under the MIT License.

Dataset files, source PDFs, and external model/data assets remain subject to their original provider terms. Verify redistribution rights before reusing data files outside this repository.
