- New `enumerate_ssh_keys` tool: cross-platform SSH key enumeration scanning `.ssh` directories for `authorized_keys`, private keys, and public keys (#141).
- New `--task-template syslog-summary` maps to the `syslog-analysis` investigation template (#141).

### Changed

- **Severity calibration**: raised listener thresholds (Info <50, Low 50–149, Medium 150–249, High ≥250), lowered account severity (1 account → Low, 3–4 → Medium, ≥5 → High), raised persistence thresholds (Low <3, Medium 3–7, High ≥8). Normal desktops no longer trigger spurious high-severity findings (#139).
- **Findings detail**: finding titles now include specifics (account names, persistence entry text, and SSH directory info) instead of bare counts (#140).
- **Parameter estimation**: quantization-aware divisor replaces the hardcoded 2.2. Q4 models use 0.55 bytes/param, Q8 uses 1.1, FP16 uses 2.2, FP32 uses 4.4. Detected from model filename conventions (#138).
- **Template tool ordering**: `file-integrity-check` now leads with `hash_binary` (was `audit_account_changes`), `ssh-key-investigation` now leads with `enumerate_ssh_keys` (#141).
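
The recalibrated listener and persistence thresholds in the Severity calibration entry can be read as simple range checks. The following is a minimal Python sketch for illustration only, not WraithRun's actual implementation; the function names are hypothetical. Account severity is left out because the entry does not state the severity for exactly two accounts.

```python
def listener_severity(listener_count: int) -> str:
    """Map a listener count to a severity label (hypothetical helper).

    Ranges follow the changelog entry: Info <50, Low 50-149,
    Medium 150-249, High >=250.
    """
    if listener_count < 50:
        return "Info"
    if listener_count < 150:
        return "Low"
    if listener_count < 250:
        return "Medium"
    return "High"


def persistence_severity(entry_count: int) -> str:
    """Map a persistence-entry count to a severity label (hypothetical helper).

    Ranges follow the changelog entry: Low <3, Medium 3-7, High >=8.
    """
    if entry_count < 3:
        return "Low"
    if entry_count < 8:
        return "Medium"
    return "High"


if __name__ == "__main__":
    for count in (25, 100, 200, 300):
        print(count, listener_severity(count))
```

Under these ranges, a host with roughly 100 listeners maps to Low.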

### Fixed

- **KV-cache attention mask**: prefill attention length now accounts for forced cache padding when the model lacks a `use_cache` toggle, preventing shape broadcast errors on models like Qwen2.5 and Llama 3.2 (#136).
- **ReAct garbage output**: when the model produces a `<final>` tag at step 0 without calling any tools, the agent falls back to template-driven execution. Quality guard now detects hallucinated `<call>` tags and `[observation]` markers inside final answers and replaces them with a deterministic summary (#137).
- **EP reporting**: `detect_execution_provider()` now recognises DirectML and CUDA backend overrides instead of always reporting CPU (#142).
docs/getting-started.md
+2 lines changed: 2 additions & 0 deletions
@@ -141,6 +141,8 @@ When running in live mode, WraithRun automatically probes the loaded model to cl
- **Moderate**: medium models. Agent uses a ReAct (Reason + Act) loop, iteratively choosing tools based on observations, then synthesizes findings via LLM.
- **Strong**: large models (≥10B params and ≤50ms latency). Agent uses a full ReAct loop with the complete evidence window for deep iterative reasoning and synthesis.

Since v1.8.0, parameter estimation is quantization-aware: Q4 models use 0.55 bytes/param, Q8 uses 1.1, FP16 uses 2.2, FP32 uses 4.4. As a result, Q4 models are classified more accurately: a 750 MB Q4 file now correctly estimates ~1.4B parameters instead of ~0.3B.
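
The arithmetic behind that estimate can be sketched as follows. This is an illustrative Python sketch, not WraithRun's code: the function names, the substring-based filename matching, the FP16 fallback, and the reading of 750 MB as 750 × 10^6 bytes are all assumptions made for the example.

```python
# Bytes-per-parameter divisors as documented for v1.8.0.
BYTES_PER_PARAM = {"q4": 0.55, "q8": 1.1, "fp16": 2.2, "fp32": 4.4}


def detect_quantization(filename: str, default: str = "fp16") -> str:
    """Guess the quantization from a model filename (hypothetical matching rule).

    The real filename conventions are not documented here; this sketch
    simply looks for the tag as a case-insensitive substring.
    """
    name = filename.lower()
    for tag in BYTES_PER_PARAM:
        if tag in name:
            return tag
    return default  # assumption: fall back to the FP16 divisor (2.2)


def estimate_params_b(file_size_bytes: int, quantization: str) -> float:
    """Estimate the parameter count in billions from the file size."""
    return file_size_bytes / BYTES_PER_PARAM[quantization] / 1e9


if __name__ == "__main__":
    size = 750_000_000  # 750 MB, decimal megabytes assumed
    quant = detect_quantization("example-model-q4.onnx")
    print(f"quantization-aware: {estimate_params_b(size, quant):.2f}B")
    print(f"fixed 2.2 divisor:  {estimate_params_b(size, 'fp16'):.2f}B")
```

Dividing the same file size by 0.55 instead of 2.2 yields a result four times larger, which is why Q4 models were previously under-estimated.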

Override automatic classification when you know your model's capability:
docs/live-mode-operations.md
+4 lines changed: 4 additions & 0 deletions
@@ -128,6 +128,10 @@ WraithRun caches the ONNX session and tokenizer across investigation steps withi

The agent also tracks prompt prefix reuse across steps. When consecutive prompts share a common prefix (e.g., system prompt + prior context), the prefix hit/miss ratio is logged for observability. Full KV-state reuse is scaffolded for a future release.

Since v1.8.0, the prefill attention mask correctly accounts for forced cache padding on models that lack a `use_cache` branch toggle (#136). Previously, models like Qwen2.5 and Llama 3.2 could crash with a shape broadcast error during prefill because the attention mask length did not include the initial cache dimension.

Also since v1.8.0, execution provider reporting detects DirectML and CUDA backend overrides (#142), so `model_capability.execution_provider` in JSON output reflects the active backend instead of always showing `CPUExecutionProvider`.

Temperature controls affect live inference behavior:
docs/tool-reference.md
+17 lines changed: 17 additions & 0 deletions
@@ -142,6 +142,23 @@ Output fields:
- `network_risk_level`
- `records`

## enumerate_ssh_keys

Purpose:

- Enumerates SSH key material across user home directories. Cross-platform: on Windows, scans `%USERPROFILE%\.ssh`, `ProgramData\ssh`, and other user profiles; on Linux/macOS, scans `/root/.ssh` and `/home/*/.ssh`.

Arguments:

- none

Output fields:

- `directories` (array): per-directory summary including path, `has_authorized_keys`, `private_key_count`, and `public_key_count`.
- Tier thresholds: Basic ≤2B params or ≥200ms latency; Strong ≥10B params and ≤50ms latency; Moderate is everything in between.
- Since v1.8.0, parameter estimation is quantization-aware. Q4 models use 0.55 bytes/param, Q8 uses 1.1, FP16 uses 2.2, FP32 uses 4.4. This may reclassify models that were previously under-estimated (e.g., a Q4 model that reported 0.5B may now correctly report ~2B and shift from Basic to Moderate).
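
To check which tier a given model should land in, the tier thresholds above can be written as a small decision function. This is a Python sketch of the documented rule only; the function name and signature are hypothetical, and WraithRun's real classifier may apply further checks.

```python
def classify_tier(estimated_params_b: float, latency_ms: float) -> str:
    """Classify a model into Basic, Moderate, or Strong (hypothetical helper).

    Documented thresholds:
      Basic:    <=2B params OR  >=200 ms latency
      Strong:   >=10B params AND <=50 ms latency
      Moderate: everything in between
    """
    if estimated_params_b <= 2.0 or latency_ms >= 200.0:
        return "Basic"
    if estimated_params_b >= 10.0 and latency_ms <= 50.0:
        return "Strong"
    return "Moderate"


if __name__ == "__main__":
    # The same model before and after a 4x increase in its parameter estimate.
    print(classify_tier(0.6, 80.0))
    print(classify_tier(2.4, 80.0))
```

The Basic and Strong conditions cannot both hold for the same model, so the order of the two checks does not change the result.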

## Final answer looks generic or templated
@@ -179,6 +180,7 @@ Fix:

- This happens when the model is classified as Basic tier (deterministic summary) or when LLM output quality is detected as low.
- Since v1.6.0, Moderate/Strong tiers use a ReAct loop that typically produces richer output. If output is still generic, try `--capability-override strong` or increase `--temperature` slightly (e.g., `0.1`).
- Since v1.8.0, the quality guard also catches hallucinated `<call>` tags and `[observation]` markers inside the final answer. When they are detected, the agent replaces the affected answer with a deterministic summary built from real findings. As a result, even Moderate/Strong tier runs may show a structured summary if the model hallucinates.
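
The kind of check the quality guard performs can be sketched in Python as follows. This is a rough illustration under stated assumptions, not WraithRun's implementation: the regular expressions, the function names, and the layout of the fallback summary are all hypothetical.

```python
import re

# Assumed patterns for leaked agent markup inside a final answer.
CALL_TAG = re.compile(r"</?call\b[^>]*>", re.IGNORECASE)
OBSERVATION_MARKER = re.compile(r"\[observation\]", re.IGNORECASE)


def looks_hallucinated(final_answer: str) -> bool:
    """Return True if the answer contains `<call>` tags or `[observation]` markers."""
    return bool(
        CALL_TAG.search(final_answer) or OBSERVATION_MARKER.search(final_answer)
    )


def guard_final_answer(final_answer: str, finding_titles: list[str]) -> str:
    """Keep a clean answer; otherwise build a deterministic summary.

    The summary format below is a placeholder, not the real SUMMARY block.
    """
    if not looks_hallucinated(final_answer):
        return final_answer
    lines = ["SUMMARY"] + [f"- {title}" for title in finding_titles]
    return "\n".join(lines)


if __name__ == "__main__":
    leaked = "Done. <call>scan_network</call> [observation] nothing found"
    print(guard_final_answer(leaked, ["Example finding title"]))
```

Because the fallback is built only from collected findings, its output is the same on every run for the same evidence, which is why it can look templated.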

## Agent not calling expected tools
@@ -191,6 +193,7 @@ Fix:

- Moderate/Strong tiers use a ReAct loop where the LLM decides which tools to call. The model may not choose the same tools as the template-driven Basic tier.
- Increase `--max-steps` if the agent is exhausting its step budget before reaching all relevant tools.
- If the model is too small, it may produce a `<final>` answer immediately. Try `--capability-override strong` to allow full iterative reasoning.
- Since v1.8.0, if the model produces `<final>` at step 0 without calling any tools, the agent automatically falls back to template-driven execution so that real host data is still collected.
- Run with `RUST_LOG=debug` and check the output for `react_step` entries showing the agent's reasoning at each step.

## Task returned a scope-boundary finding instead of running
docs/upgrades.md
+17 lines changed: 17 additions & 0 deletions
@@ -1,5 +1,22 @@
# Upgrade Notes

## v1.8.0

### Breaking/visible changes

- **Severity thresholds recalibrated** (#139): listener, account, and persistence findings now use higher thresholds. A normal desktop with ~100 listeners and 1 non-default admin account will report Low instead of High. Automation that keys on specific severity values should be reviewed.
- **Finding titles include specifics** (#140): finding `title` fields now embed account names, persistence entry text, and SSH directory details (e.g., `"Non-default privileged accounts observed (1): shrey"` instead of `"Non-default privileged accounts observed (1)"`). Parsers matching on exact title strings must be updated.
- **Parameter estimation changed** (#138): quantization-aware sizing means `estimated_params_b` in `model_capability` output may change. Q4 models now report ~4× higher param counts than before. This may reclassify some models into a higher capability tier.
- **New tool and template** (#141): `enumerate_ssh_keys` tool added to the registry; `syslog-analysis` investigation template added. Template tool ordering changed for `file-integrity-check` and `ssh-key-investigation`.
- **ReAct fallback behavior** (#137): Moderate/Strong tier runs may now produce a deterministic summary instead of LLM-generated text when the model hallucinates. The `final_answer` field will contain a structured SUMMARY block in these cases.

### Migration

- No TOML config changes required.
- If you parse `RunReport` JSON `findings[].title` strings, update your matchers: titles now include entity names and entry details.
- If automation relies on severity thresholds, review the new calibration: listener counts below 50 are now Info (was Low at 25), single non-default admin accounts are Low (was Medium).
- The `enumerate_ssh_keys` tool is automatically included in `ssh-key-investigation` template runs. No opt-in needed.

## v1.7.1

- Dependency-only release. No breaking API changes.