feat(phase-19): track D auto research lessons 54-57#203
Conversation
📝 WalkthroughWalkthroughThis PR adds four complete capstone lessons (54–57) to Phase 19, building an end-to-end research automation pipeline. It includes a LaTeX paper generator, an iterative refinement loop with heuristic scoring, a parallel hypothesis scheduler with UCB branch selection, and an orchestrator composing all three. The catalog is updated to reflect new lesson and file counts, and prior capstone projects are refactored with expanded module structures. ChangesCatalog and Prior Capstone Refactoring
Lesson 54: Paper Writer
Lesson 55: Critic Loop
Lesson 56: Iteration Scheduler
Lesson 57: End-to-End Research Demo
Sequence Diagram(s)sequenceDiagram
participant User
participant E2E as End-to-End<br/>Demo
participant Sched as Iteration<br/>Scheduler
participant Critic as Critic<br/>Loop
participant Writer as Paper<br/>Writer
User->>E2E: run_demo(out_dir, seed=11)
E2E->>E2E: make_seed_hypotheses()
E2E->>Sched: IterationScheduler.run(seeds)
Sched->>Sched: UCB branch selection<br/>parallel experiments
Sched-->>E2E: SchedulerReport(best_branch, triggers)
E2E->>E2E: pick_best_branch(report)
E2E->>E2E: build_mini_paper(branch, reward)
E2E->>Critic: CriticLoop.run(mini_paper)
Critic->>Critic: deterministic_score()<br/>critic/reviser rounds
Critic-->>E2E: LoopResult(converged_paper)
E2E->>E2E: mini_to_full_paper(converged, branch)
E2E->>Writer: PaperWriter.fill_prose(paper)
Writer->>Writer: MockProseGenerator.generate()
E2E->>Writer: writer.write(paper, out_dir)
Writer->>Writer: render_latex(), render_bibtex()
Writer-->>E2E: manifest.json
E2E-->>User: DemoReport(scheduler+critic+paper)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 9
🧹 Nitpick comments (3)
phases/19-capstone-projects/54-paper-writer/code/tests/test_paper_writer.py (1)
70-99: ⚡ Quick winAdd a regression test for duplicate bibliography keys.
Given the validation gates, a dedicated test for repeated
BibEntry.keyvalues will prevent regressions in BibTeX integrity.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@phases/19-capstone-projects/54-paper-writer/code/tests/test_paper_writer.py` around lines 70 - 99, Add a new unit test in TestValidation that constructs a Paper with two BibEntry objects sharing the same key (e.g., create a Paper via _paper_min(), append two BibEntry(key="dup", ...)), then call render_latex and assert it raises PaperValidationError (use assertRaisesRegex to match a message like "duplicate bibliography key"); reference Paper, BibEntry, render_latex, and PaperValidationError so the test mirrors the existing duplicate-figure and unknown-citation tests to prevent regressions.phases/19-capstone-projects/57-end-to-end-research-demo/code/main.py (1)
183-186: ⚡ Quick winAvoid mutating
MiniPaperduring conversion.On Line 185,
sec.cites.append(...)mutates the inputminiobject in place. This converter should stay side-effect free and only shape the returnedPaper.♻️ Proposed fix
def mini_to_full_paper(mini: MiniPaper, branch: str) -> Paper: @@ - if not bib: - bib = [BibEntry( - key=f"{branch}-baseline", entry_type="article", - fields={"title": "Baseline", "author": "Synthesised", "year": "2026"}, - )] - for sec in mini.sections: - if sec.id == "intro": - sec.cites.append(f"{branch}-baseline") - break + baseline_key: str | None = None + if not bib: + baseline_key = f"{branch}-baseline" + bib = [BibEntry( + key=baseline_key, entry_type="article", + fields={"title": "Baseline", "author": "Synthesised", "year": "2026"}, + )] @@ sections = [] for s in mini.sections: figure_refs = [fig.id] if s.id == "results" else [] + cites = list(s.cites) + if baseline_key is not None and s.id == "intro" and baseline_key not in cites: + cites.append(baseline_key) sections.append(Section( id=s.id, title=s.title, body=s.body, - cites=list(s.cites), figure_refs=figure_refs, + cites=cites, figure_refs=figure_refs, ))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@phases/19-capstone-projects/57-end-to-end-research-demo/code/main.py` around lines 183 - 186, The loop mutates the input MiniPaper by calling sec.cites.append(...) on mini.sections; instead leave mini untouched and build new Section objects (or shallow copies) for the output Paper, creating each section's cites list as a copy of sec.cites plus the extra f"{branch}-baseline" where needed; specifically, when iterating over mini.sections, do not call sec.cites.append — construct a new section value (copying fields and using sec.cites + [f"{branch}-baseline"] for the intro case) and add that to the Paper.sections result so MiniPaper and mini remain side-effect free.phases/19-capstone-projects/56-iteration-scheduler/code/tests/test_scheduler.py (1)
169-202: ⚡ Quick winAdd a regression test for trigger evaluation during drain.
Please add a deadline/max-budget scenario where a branch crosses
paper_thresholdonly in drained results, then assert the trigger is still emitted.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@phases/19-capstone-projects/56-iteration-scheduler/code/tests/test_scheduler.py` around lines 169 - 202, Add a new test (e.g., test_trigger_emitted_during_drain) that constructs a seed with multiple Hypothesis objects and a slow runner that delays long enough so some results finish only during the scheduler's drain phase; configure IterationScheduler with a very short max_seconds (or a max_experiments that causes early stop) and a paper_threshold such that the target branch crosses the threshold only by results that complete during drain; run sched.run(seed) and then assert the scheduler report contains the trigger for that branch (for example by checking report.triggers or by asserting any(t.branch == "b0" for t in report.triggers) — adapt the exact assertion to the report field used in the codebase).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@phases/19-capstone-projects/54-paper-writer/code/main.py`:
- Around line 88-94: The code currently only checks that cited keys exist but
does not detect duplicate BibEntry.key values in paper.bibliography; add a
validation before the cite-check loop that scans paper.bibliography (e.g., using
bib_keys and a seen set or a Counter) to detect any duplicate BibEntry.key
values and raise PaperValidationError listing the duplicate key(s) (include the
offending key(s) in the error message) so duplicate BibTeX entries cannot be
emitted; keep this check alongside the existing cite validation that uses
bib_keys, and reference the same symbols (paper.bibliography, BibEntry.key,
bib_keys, PaperValidationError, paper.sections, sec.cites).
In `@phases/19-capstone-projects/55-critic-loop/code/main.py`:
- Around line 339-345: The branch handling sug.edit ==
"add-related-work-section" can no-op when a "Related Work" section exists but
has an empty body; change it to check for an existing section (search
paper.sections for s.title.lower() == "related work") and if found and s.body is
empty or whitespace, set s.body = "We survey adjacent work. " + ("x" * 200);
otherwise (no existing section) append a new MiniSection(id="related-work",
title="Related Work", body=...). This ensures the suggestion actually fills an
empty section instead of repeating without improving score.
In `@phases/19-capstone-projects/55-critic-loop/docs/en.md`:
- Around line 76-84: The docs incorrectly state that clarity is driven by
section count; update the description to match deterministic_score() by stating
that clarity is computed from average section body length (clarity grows when
the average/mean section body length crosses the configured threshold), and keep
the other signal descriptions (novelty, evidence, methodology, related-work)
as-is; reference the deterministic_score() function and the clarity signal when
making this edit so the prose matches the implementation.
- Around line 57-59: The doc's convergence flow incorrectly states the stop rule
as "mean score >= target" while the implementation in CriticLoop._target_met
requires every one of the five dimensions to meet or exceed target_score; update
the diagram/text (the A decision node and the related paragraphs) to state that
convergence requires all five dimensions to be >= target_score (or explicitly
list the five dimensions and the "all" condition), and make the same correction
for the other references around the same section (the lines referenced after
66–69) so the documentation matches CriticLoop._target_met.
In `@phases/19-capstone-projects/55-critic-loop/quiz.json`:
- Around line 43-52: The quiz option and explanation are incorrect:
Reviser.__call__ accepts (paper, suggestions), so update the option currently
reading "Only the list of Suggestion records" to something like "The paper and
the list of Suggestion records" and change the explanation to state that the
reviser receives both the paper and the Suggestion records (each Suggestion has
dimension, target section, and an edit instruction) and does not receive scores;
ensure the correct answer index points to the updated option and that the
wording references Reviser.__call__.
In `@phases/19-capstone-projects/56-iteration-scheduler/code/main.py`:
- Around line 209-211: The code indexes stats with result.branch without
guarding against unknown branches, causing crashes if a runner returns an unseen
branch; update both locations where you do bs = stats[result.branch] (the block
updating bs.runs and bs.reward_sum and the similar block at the later
occurrence) to first check if result.branch exists in stats and if not insert a
new BranchStats instance (or use stats.setdefault(result.branch,
BranchStats(...))) and then update bs.runs and bs.reward_sum; reference the
symbols stats, result.branch, bs.runs, bs.reward_sum, and BranchStats when
applying the fix.
- Around line 261-275: Loop draining in_flight tasks updates stats but never
runs paper.trigger, so valid triggers are lost; after successfully awaiting a
task (in the for task in list(in_flight.keys()) block) and after updating
bs.runs and bs.reward_sum, call the trigger evaluation on the associated paper
(e.g., paper.trigger(...) or the existing trigger-evaluation helper used
elsewhere) with the branch and updated stats/reward so trigger conditions are
evaluated and any resulting TraceEvent(s) appended to trace before removing the
task from in_flight; ensure to preserve the current TraceEvent("result.drain")
logic and handle exceptions exactly as before.
In `@phases/19-capstone-projects/56-iteration-scheduler/docs/en.md`:
- Around line 70-71: Update the docs to match the implementation: change wording
that currently states the scheduler posts Results to an asyncio.Queue and uses
random runner delay; instead explain that the scheduler creates tasks with
asyncio.create_task to run the async experiment runner (which yields a Result),
waits on the set of in-flight tasks using asyncio.wait(...,
return_when=asyncio.FIRST_COMPLETED), and then applies a fixed sleep delay
between loop iterations; reference the experiment runner, Result,
asyncio.create_task, the in-flight tasks set, and asyncio.FIRST_COMPLETED so
readers understand the actual control flow and timing.
In `@phases/19-capstone-projects/57-end-to-end-research-demo/docs/en.md`:
- Around line 80-81: Update the stale documentation to match the current tests
and API: remove or rewrite the claim that a test in test_e2e.py asserts an
empty-seed short-circuit to a typed exception (fix the sentence around the
original lines 80-81 to reflect that no such test exists), change the “one
figure per high-yield branch” wording to state that mini_to_full_paper() creates
one figure for the selected branch (replace the text around line 94), and
replace any reference to build_paper_from_result with the actual exposed
functions build_mini_paper and mini_to_full_paper (update references in the
block around lines 98 and the 94-100 range) so the docs accurately reflect
current function names and behavior.
---
Nitpick comments:
In `@phases/19-capstone-projects/54-paper-writer/code/tests/test_paper_writer.py`:
- Around line 70-99: Add a new unit test in TestValidation that constructs a
Paper with two BibEntry objects sharing the same key (e.g., create a Paper via
_paper_min(), append two BibEntry(key="dup", ...)), then call render_latex and
assert it raises PaperValidationError (use assertRaisesRegex to match a message
like "duplicate bibliography key"); reference Paper, BibEntry, render_latex, and
PaperValidationError so the test mirrors the existing duplicate-figure and
unknown-citation tests to prevent regressions.
In
`@phases/19-capstone-projects/56-iteration-scheduler/code/tests/test_scheduler.py`:
- Around line 169-202: Add a new test (e.g., test_trigger_emitted_during_drain)
that constructs a seed with multiple Hypothesis objects and a slow runner that
delays long enough so some results finish only during the scheduler's drain
phase; configure IterationScheduler with a very short max_seconds (or a
max_experiments that causes early stop) and a paper_threshold such that the
target branch crosses the threshold only by results that complete during drain;
run sched.run(seed) and then assert the scheduler report contains the trigger
for that branch (for example by checking report.triggers or by asserting
any(t.branch == "b0" for t in report.triggers) — adapt the exact assertion to
the report field used in the codebase).
In `@phases/19-capstone-projects/57-end-to-end-research-demo/code/main.py`:
- Around line 183-186: The loop mutates the input MiniPaper by calling
sec.cites.append(...) on mini.sections; instead leave mini untouched and build
new Section objects (or shallow copies) for the output Paper, creating each
section's cites list as a copy of sec.cites plus the extra f"{branch}-baseline"
where needed; specifically, when iterating over mini.sections, do not call
sec.cites.append — construct a new section value (copying fields and using
sec.cites + [f"{branch}-baseline"] for the intro case) and add that to the
Paper.sections result so MiniPaper and mini remain side-effect free.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1df3182b-0e2c-486c-aab4-e69bbca980b1
📒 Files selected for processing (17)
catalog.jsonphases/19-capstone-projects/54-paper-writer/code/main.pyphases/19-capstone-projects/54-paper-writer/code/tests/test_paper_writer.pyphases/19-capstone-projects/54-paper-writer/docs/en.mdphases/19-capstone-projects/54-paper-writer/quiz.jsonphases/19-capstone-projects/55-critic-loop/code/main.pyphases/19-capstone-projects/55-critic-loop/code/tests/test_critic_loop.pyphases/19-capstone-projects/55-critic-loop/docs/en.mdphases/19-capstone-projects/55-critic-loop/quiz.jsonphases/19-capstone-projects/56-iteration-scheduler/code/main.pyphases/19-capstone-projects/56-iteration-scheduler/code/tests/test_scheduler.pyphases/19-capstone-projects/56-iteration-scheduler/docs/en.mdphases/19-capstone-projects/56-iteration-scheduler/quiz.jsonphases/19-capstone-projects/57-end-to-end-research-demo/code/main.pyphases/19-capstone-projects/57-end-to-end-research-demo/code/tests/test_e2e.pyphases/19-capstone-projects/57-end-to-end-research-demo/docs/en.mdphases/19-capstone-projects/57-end-to-end-research-demo/quiz.json
Reject duplicate bibliography keys during validation so render_bibtex cannot emit duplicate BibTeX entries that would break downstream compilation.
- add-related-work-section now fills an empty existing section instead of no-op'ing, so convergence cannot stall when the title already exists - Convergence doc + diagram say all five dimensions must hit target_score, matching CriticLoop._target_met - Deterministic-scoring prose says clarity tracks average section body length (matching deterministic_score), not section count - Quiz: reviser receives (paper, suggestions), matching Reviser.__call__
- Result and drain paths now use stats.setdefault, so a runner that returns an unseen branch records the result instead of crashing - Drain phase evaluates paper.trigger after updating stats, so valid triggers fired post-budget/deadline are no longer dropped - Concurrency doc reflects the FIRST_COMPLETED in-flight wait + fixed delay_ms, matching the runner implementation
Align stale docs with current code/tests: - Stage-failure paragraph now references the real picker tests (NoTriggerError / BestResultError) instead of a non-existent empty-seed short-circuit test - mini_to_full_paper attaches one figure for the selected branch (not one per high-yield branch) - Replace stale build_paper_from_result reference with the actual exposed pair build_mini_paper + mini_to_full_paper
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
phases/19-capstone-projects/56-iteration-scheduler/code/main.py (1)
261-277:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't block past the deadline stop.
If
stop_reasonis"deadline", Lines 261-263 still await every in-flight task to completion. That can pushwall_secondswell pastmax_seconds, so the scheduler no longer honors the advertised hard wall-clock budget. Split the drain path so deadline stops cancel or skip outstanding work, and keep full draining only for non-deadline exits.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@phases/19-capstone-projects/56-iteration-scheduler/code/main.py` around lines 261 - 277, The loop that awaits every task in in_flight regardless of stop_reason can exceed the hard wall-clock budget; change the drain logic in main.py so that when self.stop_reason == "deadline" you do not await all outstanding tasks: iterate over in_flight keys and for each outstanding asyncio.Task call task.cancel() (and optionally await them with return_exceptions or simply pop them), removing them from the in_flight dict immediately, whereas keep the existing await-and-process behavior only for non-deadline exits so BranchStats updates, bs.runs/bs.reward_sum increments, paper trigger checks (paper_threshold / bs.mean) and TraceEvent appends still run in the normal draining path.
♻️ Duplicate comments (1)
phases/19-capstone-projects/57-end-to-end-research-demo/docs/en.md (1)
80-80:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix stale test names in the failure-mode paragraph.
The referenced test names do not match current
code/tests/test_e2e.py(test_no_trigger_raises,test_orphan_trigger_raises_best_result_error). Please rename them here (and avoid claiming writer non-invocation is pinned unless a direct test exists).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@phases/19-capstone-projects/57-end-to-end-research-demo/docs/en.md` at line 80, Update the failure-mode paragraph to use the current test names: replace `test_no_triggers_raises_typed_error` with `test_no_trigger_raises` and `test_best_picker_raises_when_no_triggers` with `test_orphan_trigger_raises_best_result_error`, and remove or soften the absolute claim that "the writer is never invoked" unless there is a direct test asserting writer non-invocation (instead state that tests assert the picker raises `NoTriggerError` / `BestResultError` when no branch fired a trigger).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@phases/19-capstone-projects/55-critic-loop/docs/en.md`:
- Line 76: The opening sentence incorrectly states "three signals" but the
critic actually scores five dimensions; update that phrase in the paragraph to
"five signals" (or "five scored signals") so it matches the subsequent list
(average section body length (clarity), figure count and citation count
(evidence), `originality_tag` (novelty), and the methodology/related-work
signals) — locate the sentence mentioning "three signals" near the description
of the shipped critic and revise it accordingly.
---
Outside diff comments:
In `@phases/19-capstone-projects/56-iteration-scheduler/code/main.py`:
- Around line 261-277: The loop that awaits every task in in_flight regardless
of stop_reason can exceed the hard wall-clock budget; change the drain logic in
main.py so that when self.stop_reason == "deadline" you do not await all
outstanding tasks: iterate over in_flight keys and for each outstanding
asyncio.Task call task.cancel() (and optionally await them with
return_exceptions or simply pop them), removing them from the in_flight dict
immediately, whereas keep the existing await-and-process behavior only for
non-deadline exits so BranchStats updates, bs.runs/bs.reward_sum increments,
paper trigger checks (paper_threshold / bs.mean) and TraceEvent appends still
run in the normal draining path.
---
Duplicate comments:
In `@phases/19-capstone-projects/57-end-to-end-research-demo/docs/en.md`:
- Line 80: Update the failure-mode paragraph to use the current test names:
replace `test_no_triggers_raises_typed_error` with `test_no_trigger_raises` and
`test_best_picker_raises_when_no_triggers` with
`test_orphan_trigger_raises_best_result_error`, and remove or soften the
absolute claim that "the writer is never invoked" unless there is a direct test
asserting writer non-invocation (instead state that tests assert the picker
raises `NoTriggerError` / `BestResultError` when no branch fired a trigger).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 03fd4c2e-6909-4d54-b61d-77756a8c0a4f
📒 Files selected for processing (7)
phases/19-capstone-projects/54-paper-writer/code/main.pyphases/19-capstone-projects/55-critic-loop/code/main.pyphases/19-capstone-projects/55-critic-loop/docs/en.mdphases/19-capstone-projects/55-critic-loop/quiz.jsonphases/19-capstone-projects/56-iteration-scheduler/code/main.pyphases/19-capstone-projects/56-iteration-scheduler/docs/en.mdphases/19-capstone-projects/57-end-to-end-research-demo/docs/en.md
✅ Files skipped from review due to trivial changes (2)
- phases/19-capstone-projects/56-iteration-scheduler/docs/en.md
- phases/19-capstone-projects/55-critic-loop/quiz.json
|
|
||
| ## The deterministic critic in this lesson | ||
|
|
||
| The lesson does not call a model. The shipped critic is a callable that scores a draft based on three signals: average section body length (clarity), figure count and citation count (evidence), and an `originality_tag` field on the paper metadata (novelty). The reviser knows how to push each score upward. |
There was a problem hiding this comment.
Fix scoring-signal count for consistency.
This sentence says “three signals,” but the section documents five scored dimensions/signals right below (including methodology and related-work). Please align the count in this line to avoid confusion.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@phases/19-capstone-projects/55-critic-loop/docs/en.md` at line 76, The
opening sentence incorrectly states "three signals" but the critic actually
scores five dimensions; update that phrase in the paragraph to "five signals"
(or "five scored signals") so it matches the subsequent list (average section
body length (clarity), figure count and citation count (evidence),
`originality_tag` (novelty), and the methodology/related-work signals) — locate
the sentence mentioning "three signals" near the description of the shipped
critic and revise it accordingly.
# Conflicts: # catalog.json
feat(phase-19): track D auto research lessons 54-57
feat(phase-19): track D auto research lessons 54-57
Summary
Track D auto-research loop deep capstones (Phase 19, lessons 54-57). Four self-contained Python lessons that compose into a self-running research loop.
Stats
Test plan
python3 code/main.pydemos exit 0python3 -m unittest discover -s code/testspasses per lesson (15 + 11 + 12 + 10 = 48)