chore: bump version to 1.8.0 and update docs (#145)

sankpal-shreyas · web-flow · commit 5955ea03b6b4 · 2026-04-05T05:56:56.000-04:00
- CHANGELOG.md: v1.8.0 entries for all 8 issue fixes (#136-#142)
- docs/upgrades.md: v1.8.0 upgrade notes with migration guidance
- docs/tool-reference.md: add enumerate_ssh_keys tool documentation
- docs/cli-reference.md: add syslog-analysis template, update coverage
- docs/troubleshooting.md: ReAct fallback, quant-aware param estimation
- docs/live-mode-operations.md: KV-cache fix, EP override detection
- docs/getting-started.md: quantization-aware tier description
- docs/RELEASE_PLAN.md: update roadmap through v1.8.0
- Cargo.toml: version 1.7.1 -> 1.8.0
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -18,6 +18,27 @@ The format is inspired by Keep a Changelog and this project follows Semantic Ver
 
 - (none yet)
 
+## 1.8.0 - 2026-04-05
+
+### Added
+
+- New `syslog-analysis` investigation template matching keywords: log, syslog, journal, event, audit. Runs `read_syslog` → `audit_account_changes` → `inspect_persistence_locations` (#141).
+- New `enumerate_ssh_keys` tool: cross-platform SSH key enumeration scanning `.ssh` directories for authorized_keys, private keys, and public keys (#141).
+- New `--task-template syslog-summary` maps to the `syslog-analysis` investigation template (#141).
+
+### Changed
+
+- **Severity calibration**: raised listener thresholds (Info <50, Low 50–149, Medium 150–249, High ≥250), lowered account severity (1 account → Low, 3–4 → Medium, ≥5 → High), raised persistence thresholds (Low <3, Medium 3–7, High ≥8). Normal desktops no longer trigger spurious high-severity findings (#139).
+- **Findings detail**: finding titles now include specifics (account names, persistence entry text, and SSH directory info) instead of bare counts (#140).
+- **Parameter estimation**: quantization-aware divisor replaces the hardcoded 2.2. Q4 models use 0.55 bytes/param, Q8 uses 1.1, FP16 uses 2.2, FP32 uses 4.4. Detected from model filename conventions (#138).
+- **Template tool ordering**: `file-integrity-check` now leads with `hash_binary` (was `audit_account_changes`), `ssh-key-investigation` now leads with `enumerate_ssh_keys` (#141).
+
+### Fixed
+
+- **KV-cache attention mask**: prefill attention length now accounts for forced cache padding when the model lacks a `use_cache` toggle, preventing shape broadcast errors on models like Qwen2.5 and Llama 3.2 (#136).
+- **ReAct garbage output**: when the model produces a `<final>` tag at step 0 without calling any tools, the agent falls back to template-driven execution. The quality guard now detects hallucinated `<call>` tags and `[observation]` markers inside final answers and replaces them with a deterministic summary (#137).
+- **EP reporting**: `detect_execution_provider()` now recognises DirectML and CUDA backend overrides instead of always reporting CPU (#142).
+
 ## 1.7.1 - 2026-04-05
 
 ### Changed
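
Reviewer sketch for the CHANGELOG.md severity calibration entry (#139): the thresholds quoted there can be read as three small lookup functions. This is a minimal illustration in Rust, not the project's implementation. The `Severity` enum and all function names are hypothetical, and the account mapping returns `None` for counts the changelog does not mention (0 and 2) rather than guessing.

```rust
/// Severity levels named in the changelog entry.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Info,
    Low,
    Medium,
    High,
}

/// Listener thresholds: Info <50, Low 50-149, Medium 150-249, High >=250.
fn listener_severity(count: u32) -> Severity {
    match count {
        0..=49 => Severity::Info,
        50..=149 => Severity::Low,
        150..=249 => Severity::Medium,
        _ => Severity::High,
    }
}

/// Persistence thresholds: Low <3, Medium 3-7, High >=8.
/// Whether a zero count produces a finding at all is not stated in the source.
fn persistence_severity(entries: u32) -> Severity {
    match entries {
        0..=2 => Severity::Low,
        3..=7 => Severity::Medium,
        _ => Severity::High,
    }
}

/// Account thresholds: 1 -> Low, 3-4 -> Medium, >=5 -> High.
/// Counts of 0 and 2 are not covered by the changelog, so they map to None.
fn account_severity(accounts: u32) -> Option<Severity> {
    match accounts {
        1 => Some(Severity::Low),
        3..=4 => Some(Severity::Medium),
        n if n >= 5 => Some(Severity::High),
        _ => None,
    }
}
```

Under these rules, the "normal desktop" from the upgrade notes (about 100 listeners, 1 non-default admin account) lands on Low for both findings.
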
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/Cargo.toml b/Cargo.toml
@@ -10,7 +10,7 @@ resolver = "2"
 
 [workspace.package]
 edition = "2021"
-version = "1.7.1"
+version = "1.8.0"
 license = "MIT"
 
 [workspace.dependencies]
diff --git a/docs/RELEASE_PLAN.md b/docs/RELEASE_PLAN.md
@@ -91,31 +91,30 @@ Release should be blocked when:
    Define `ExecutionProviderBackend` trait, provider registry, provider-agnostic config, extract Vitis/CPU backends, provider-aware doctor, CLI `--backend` flag, and multi-backend test harness.
 - `v1.4.0` Concrete Hardware Backends (milestone #17, tracking: #55):
    DirectML (Windows GPU), CoreML (macOS/Apple Silicon), CUDA/TensorRT (NVIDIA), QNN (Qualcomm Hexagon), non-ONNX formats (GGUF/SafeTensors), and quantization-aware loading.
+- `v1.5.0` Concrete Hardware Backends (completed).
+- `v1.6.0` Agentic Investigation Engine (completed): ReAct agent loop, task-aware LLM synthesis, temperature-scaled sampling, EP-aware debug logs, session caching, KV-cache prefix reuse.
+- `v1.7.0` Live Evaluation Hardening (completed): per-tool timing, LLM reasoning capture, evidence-derived confidence, task-specific synthesis, expanded privilege/persistence checks, tokenizer discovery.
+- `v1.7.1` Dependency Bumps (completed): toml 1.1, thiserror 2.0, sha2 0.11, CI actions v6–v8.
+- `v1.8.0` Live Evaluation Fixes (completed): KV-cache attention mask fix, ReAct garbage fallback, quantization-aware param estimation, severity recalibration, findings detail, template/tool fixes, EP reporting, syslog-analysis template, enumerate_ssh_keys tool.
 
-## Immediate Next Steps for v1.0.0
+## Immediate Next Steps
 
 Use this runbook to execute the active next milestone end-to-end.
 
 1. Create a tracking issue from the Release Checklist template.
-2. Apply labels `release`, `milestone:v1.0.0`, and priority labels as needed.
-3. Run milestone bootstrap workflow:
-   - Workflow: `Milestone Bootstrap`
-   - Inputs:
-   - `seed_roadmap`: `true` (upserts canonical milestones for the active roadmap set)
-   - `title`: `v1.0.0`
-   - `description`: `Local API and Web UI MVP: local server endpoints, security baseline, durable local data model, and initial triage UI.`
-     - `due_date`: optional (`YYYY-MM-DD`)
-4. Verify quality gates locally:
+2. Apply labels `release` and the target milestone label.
+3. Verify quality gates locally:
    - `cargo check`
    - `cargo test --workspace`
+   - `cargo clippy --all-targets -- -D warnings`
    - `cargo check -p inference_bridge --features vitis`
-5. Verify GitHub Actions CI is green on latest `main`.
-6. Tag and publish:
-   - `git tag -a v1.0.0 -m "Release v1.0.0"`
-   - `git push origin v1.0.0`
-7. Confirm `Release` workflow completed and assets are attached.
-8. Close the milestone and open a follow-on milestone.
-9. Open planning issue for the next milestone scope.
+4. Verify GitHub Actions CI is green on latest `main`.
+5. Tag and publish:
+   - `git tag -a vX.Y.Z -m "Release vX.Y.Z"`
+   - `git push origin vX.Y.Z`
+6. Confirm `Release` workflow completed and assets are attached.
+7. Close the milestone and open a follow-on milestone.
+8. Open planning issue for the next milestone scope.
 
 ## Labels and Milestones
 
diff --git a/docs/cli-reference.md b/docs/cli-reference.md
@@ -143,7 +143,7 @@ Behavior:
 
 `--list-tools` output includes tool names, descriptions, and JSON argument schemas.
 
-Current built-in coverage includes log tailing, listener inventory, file hashing, privilege vectors, persistence inventory, account-role snapshots, process-network correlation, and baseline capture for drift workflows.
+Current built-in coverage includes log tailing, listener inventory, file hashing, privilege vectors, persistence inventory, account-role snapshots, process-network correlation, SSH key enumeration, and baseline capture for drift workflows.
 
 Coverage tool argument highlights:
 
@@ -465,11 +465,12 @@ When a free-text `--task` is provided, the agent resolves a declarative investig
 Built-in investigation templates:
 
 - **broad-host-triage**: default fallback. Runs all host-level tools.
-- **ssh-key-investigation**: SSH key and account audit focus.
+- **ssh-key-investigation**: SSH key enumeration and account audit focus.
 - **persistence-analysis**: autorun and persistence mechanism checks.
 - **network-exposure-audit**: listener and network binding analysis.
 - **privilege-escalation-check**: privilege escalation indicator checks.
 - **file-integrity-check**: hash verification and file integrity analysis.
+- **syslog-analysis**: log review, account audit, and persistence checks. Matches keywords: log, syslog, journal, event, audit.
 
 List investigation templates via `--list-task-templates`.
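
Reviewer sketch for the docs/cli-reference.md change: the keyword list for `syslog-analysis` suggests a simple keyword-to-template lookup. The Rust snippet below is only an illustration under stated assumptions. `resolve_template` is a hypothetical name, whole-word case-insensitive matching is a guess at the matching rule, and only the `syslog-analysis` keywords are modeled because the diff does not list keywords for the other templates.

```rust
/// Keywords listed for `syslog-analysis` in the documentation.
const SYSLOG_KEYWORDS: [&str; 5] = ["log", "syslog", "journal", "event", "audit"];

/// Hypothetical resolver: picks a template name for a free-text task.
/// Assumption: matching is case-insensitive on whole words. The real
/// matcher may use substrings, stemming, or scoring instead.
fn resolve_template(task: &str) -> &'static str {
    let lowered = task.to_lowercase();
    let hit = lowered
        .split(|c: char| !c.is_alphanumeric())
        .any(|word| SYSLOG_KEYWORDS.iter().any(|k| *k == word));
    if hit {
        "syslog-analysis"
    } else {
        // Documented default fallback template.
        "broad-host-triage"
    }
}
```

With whole-word matching, a task such as "Check failed logins" would not select `syslog-analysis`, because "logins" is not the word "log". A substring matcher would behave differently.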
 
diff --git a/docs/getting-started.md b/docs/getting-started.md
@@ -141,6 +141,8 @@ When running in live mode, WraithRun automatically probes the loaded model to cl
 - **Moderate**: medium models. Agent uses a ReAct (Reason + Act) loop, iteratively choosing tools based on observations, then synthesizes findings via LLM.
 - **Strong**: large models (≥10B params and ≤50ms latency). Agent uses a full ReAct loop with the complete evidence window for deep iterative reasoning and synthesis.
 
+Since v1.8.0, parameter estimation is quantization-aware: Q4 models use 0.55 bytes/param, Q8 uses 1.1, FP16 uses 2.2, FP32 uses 4.4. As a result, Q4 models are classified more accurately: a 750 MB Q4 file is now estimated at ~1.4B parameters instead of ~0.3B.
+
 Override automatic classification when you know your model's capability:
 
 ```powershell
diff --git a/docs/live-mode-operations.md b/docs/live-mode-operations.md
@@ -128,6 +128,10 @@ WraithRun caches the ONNX session and tokenizer across investigation steps withi
 
 The agent also tracks prompt prefix reuse across steps. When consecutive prompts share a common prefix (e.g., system prompt + prior context), the prefix hit/miss ratio is logged for observability. Full KV-state reuse is scaffolded for a future release.
 
+Since v1.8.0, the prefill attention mask correctly accounts for forced cache padding on models that lack a `use_cache` branch toggle (#136). Previously, models like Qwen2.5 and Llama 3.2 could crash with a shape broadcast error during prefill because the attention mask length did not include the initial cache dimension.
+
+Also since v1.8.0, execution provider reporting now detects DirectML and CUDA backend overrides (#142), so `model_capability.execution_provider` in JSON output accurately reflects the active backend instead of always showing `CPUExecutionProvider`.
+
 Temperature controls affect live inference behavior:
 
 - `--temperature 0` (or omit): greedy decoding — fastest, fully deterministic output.
diff --git a/docs/tool-reference.md b/docs/tool-reference.md
@@ -142,6 +142,23 @@ Output fields:
 - `network_risk_level`
 - `records`
 
+## enumerate_ssh_keys
+
+Purpose:
+
+- Enumerates SSH key material across user home directories. Cross-platform: on Windows, scans `%USERPROFILE%\.ssh`, `ProgramData\ssh`, and other user profiles; on Linux/macOS, scans `/root/.ssh` and `/home/*/.ssh`.
+
+Arguments:
+
+- none
+
+Output fields:
+
+- `directories` (array): per-directory summary including path, `has_authorized_keys`, `private_key_count`, and `public_key_count`.
+- `total_authorized_keys_files` (integer)
+- `total_private_keys` (integer)
+- `total_public_keys` (integer)
+
 ## capture_coverage_baseline
 
 Purpose:
diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
@@ -168,6 +168,7 @@ wraithrun --task "Investigate ..." --live --model C:/models/llm.onnx --tokenizer
 ```
 
 - Tier thresholds: Basic ≤2B params or ≥200ms latency; Strong ≥10B params and ≤50ms latency; Moderate is everything in between.
+- Since v1.8.0, parameter estimation is quantization-aware. Q4 models use 0.55 bytes/param, Q8 uses 1.1, FP16 uses 2.2, FP32 uses 4.4. This may reclassify models that were previously under-estimated (e.g., a Q4 model that reported 0.5B may now correctly report ~2B and shift from Basic to Moderate).
 
 ## Final answer looks generic or templated
 
@@ -179,6 +180,7 @@ Fix:
 
 - This happens when the model is classified as Basic tier (deterministic summary) or when LLM output quality is detected as low.
 - Since v1.6.0, Moderate/Strong tiers use a ReAct loop that typically produces richer output. If output is still generic, try `--capability-override strong` or increase `--temperature` slightly (e.g., `0.1`).
+- Since v1.8.0, the quality guard also catches hallucinated `<call>` tags and `[observation]` markers inside the final answer. When detected, the agent replaces the garbage with a deterministic summary built from real findings. This means even Moderate/Strong tier runs may show a structured summary if the model hallucinates.
 
 ## Agent not calling expected tools
 
@@ -191,6 +193,7 @@ Fix:
 - Moderate/Strong tiers use a ReAct loop where the LLM decides which tools to call. The model may not choose the same tools as the template-driven Basic tier.
 - Increase `--max-steps` if the agent is exhausting its step budget before reaching all relevant tools.
 - If the model is too small, it may produce a `<final>` answer immediately. Try `--capability-override strong` to allow full iterative reasoning.
+- Since v1.8.0, if the model produces `<final>` at step 0 without calling any tools, the agent automatically falls back to template-driven execution so that real host data is still collected.
 - Check `RUST_LOG=debug` output for `react_step` entries showing the agent's reasoning at each step.
 
 ## Task returned a scope-boundary finding instead of running
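
Reviewer sketch for the docs/troubleshooting.md change: the quantization-aware estimate and the tier thresholds quoted in that file can be combined into a short worked example. This Rust snippet is an illustration only. The `Quant` and `Tier` enums and the function names are hypothetical, the quantization level is passed in directly because the filename conventions used for detection are not described in the diff, and 1 MB is taken as 10^6 bytes.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Quant {
    Q4,
    Q8,
    Fp16,
    Fp32,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Tier {
    Basic,
    Moderate,
    Strong,
}

/// Bytes-per-parameter divisors quoted in the v1.8.0 docs.
fn bytes_per_param(q: Quant) -> f64 {
    match q {
        Quant::Q4 => 0.55,
        Quant::Q8 => 1.1,
        Quant::Fp16 => 2.2,
        Quant::Fp32 => 4.4,
    }
}

/// Estimated parameter count, in billions, from the model file size.
fn estimated_params_b(file_bytes: u64, q: Quant) -> f64 {
    file_bytes as f64 / bytes_per_param(q) / 1e9
}

/// Tier rule as documented: Basic <=2B params or >=200ms latency;
/// Strong >=10B params and <=50ms latency; Moderate otherwise.
fn classify(params_b: f64, latency_ms: f64) -> Tier {
    if params_b <= 2.0 || latency_ms >= 200.0 {
        Tier::Basic
    } else if params_b >= 10.0 && latency_ms <= 50.0 {
        Tier::Strong
    } else {
        Tier::Moderate
    }
}
```

For a 750 MB file, the Q4 divisor gives roughly 1.36B parameters, while the old fixed divisor of 2.2 gives roughly 0.34B, matching the ~1.4B and ~0.3B figures in docs/getting-started.md.
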
diff --git a/docs/upgrades.md b/docs/upgrades.md
@@ -1,5 +1,22 @@
 # Upgrade Notes
 
+## v1.8.0
+
+### Breaking/visible changes
+
+- **Severity thresholds recalibrated** (#139): listener, account, and persistence findings now use higher thresholds. A normal desktop with ~100 listeners and 1 non-default admin account will report Low instead of High. Automation that keys on specific severity values should be reviewed.
+- **Finding titles include specifics** (#140): finding `title` fields now embed account names, persistence entry text, and SSH directory details (e.g., `"Non-default privileged accounts observed (1): shrey"` instead of `"Non-default privileged accounts observed (1)"`). Parsers matching on exact title strings must be updated.
+- **Parameter estimation changed** (#138): quantization-aware sizing means `estimated_params_b` in `model_capability` output may change. Q4 models now report ~4× higher param counts than before. This may reclassify some models into a higher capability tier.
+- **New tool and template** (#141): `enumerate_ssh_keys` tool added to the registry; `syslog-analysis` investigation template added. Template tool ordering changed for `file-integrity-check` and `ssh-key-investigation`.
+- **ReAct fallback behavior** (#137): Moderate/Strong tier runs may now produce a deterministic summary instead of LLM-generated text when the model hallucinates. The `final_answer` field will contain a structured SUMMARY block in these cases.
+
+### Migration
+
+- No TOML config changes required.
+- If you parse `RunReport` JSON `findings[].title` strings, update matchers — titles now include entity names and entry details.
+- If automation relies on severity thresholds, review the new calibration: listener counts below 50 are now Info (was Low at 25), single non-default admin accounts are Low (was Medium).
+- The `enumerate_ssh_keys` tool is automatically included in `ssh-key-investigation` template runs. No opt-in needed.
+
 ## v1.7.1
 
 - Dependency-only release. No breaking API changes.
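
Reviewer sketch for the v1.8.0 migration note on `findings[].title`: parsers that compared titles for exact equality can switch to prefix matching and, where useful, split off the new detail suffix. The Rust helpers below are hypothetical and assume the detail follows the first `": "` separator, as in the example title from the upgrade notes. Other finding types may format their details differently.

```rust
/// Splits a finding title into its leading part and an optional detail suffix.
/// Assumption: details follow the first ": " separator, as in
/// "Non-default privileged accounts observed (1): shrey".
fn split_title(title: &str) -> (&str, Option<&str>) {
    match title.split_once(": ") {
        Some((head, detail)) => (head, Some(detail)),
        None => (title, None),
    }
}

/// Prefix match that accepts both the pre-1.8.0 and the 1.8.0 title forms.
fn is_privileged_account_finding(title: &str) -> bool {
    title.starts_with("Non-default privileged accounts observed")
}
```

Matching on a prefix keeps existing automation working across both title formats, at the cost of ignoring the new entity names unless `split_title` is also used.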