📡 OTel Instrumentation Improvement: Restore agent-conclusion GenAI attributes by mirroring the setup-step env normalization
Analysis Date: 2026-05-26
Priority: High
Effort: Small (< 2h)
Problem
Every gh-aw.agent.conclusion span emitted in the last 24 hours — across Matt Pocock Skills Reviewer, Test Quality Sentinel, Design Decision Gate, PR Sous Chef, Issue Monster, Daily Malicious Code Scan, Spec Librarian, CI Coach, Smoke CI, and PR Code Quality Reviewer — is missing the attributes that distinguish an agent (LLM) job from a non-agent job:
| Attribute | Expected on agent job? | Observed on gh-aw.agent.conclusion? |
| --- | --- | --- |
| gh-aw.job.name | yes | ❌ missing |
| gen_ai.operation.name | yes (chat) | ❌ missing |
| gen_ai.response.model | yes | ❌ missing |
| gen_ai.response.finish_reasons | yes (always, with unknown fallback) | ❌ missing |
| gen_ai.usage.input_tokens / output_tokens / total_tokens | yes | ❌ missing |
| gh-aw.turns, gh-aw.estimated_cost_usd | yes | ❌ missing |
| dedicated gh-aw.agent.agent (LLM) sub-span | yes | ❌ zero such spans in last 14 days |
| gh-aw.effective_tokens (NOT gated on jobName) | yes | ✅ present |
| gen_ai.request.model (NOT gated on jobName) | yes | ✅ present |
Every missing field above is gated on if (jobName === "agent") (or its derivative hasDedicatedAgentSpan) inside sendJobConclusionSpan (actions/setup/js/send_otlp_span.cjs:1790,1918-1931,2124,2165). Every field still present is not gated on jobName. This is a deterministic signature: sendJobConclusionSpan is reading jobName = "" for the agent-job's post step, while the matching gh-aw.agent.setup span on the same job correctly carries gh-aw.job.name: agent. Comparison of 2026-05-21 vs 2026-05-26 traces confirms this is a regression — token data used to flow through.
Why This Matters (DevOps Perspective)
Agent-job observability is the highest-value telemetry this repo emits: it is the only signal that tells an on-call engineer whether the LLM stopped early, ran for too many turns, burned tokens beyond budget, or chose the wrong model. With these fields absent on gh-aw.agent.conclusion:
- Backend dashboards filtering on gen_ai.operation.name:chat exclude every gh-aw agent span and show zero LLM activity.
- sum(gen_ai.usage.total_tokens) by workflow returns nothing for agent runs (only the duplicated copy on safe_outputs.conclusion survives, which inflates downstream-job tokens and undercounts agent tokens).
- The dedicated gh-aw.agent.agent sub-span, designed to measure pure LLM latency excluding setup overhead, is never emitted, so p95 LLM duration alerting is impossible.
- Alerts keyed on gen_ai.response.finish_reasons:length (truncated outputs) silently never fire.
For incident triage, this is the difference between “LLM failed” being one query away vs. requiring artifact downloads.
Current Behavior
The setup post step normalizes the env var so sendJobSetupSpan (which reads process.env.INPUT_JOB_NAME directly) always sees the value, even on runner versions that preserve the hyphen form INPUT_JOB-NAME:
// actions/setup/js/action_setup_otlp.cjs:81-87
const inputJobName = getActionInput("JOB_NAME");
if (inputJobName) {
  process.env.INPUT_JOB_NAME = inputJobName;
}
if (inputParentSpanId) {
  process.env.INPUT_PARENT_SPAN_ID = inputParentSpanId;
}
The conclusion post step does not do the same normalization:
// actions/setup/js/action_conclusion_otlp.cjs:72-86
async function run() {
  const endpoints = process.env.GH_AW_OTLP_ENDPOINTS;
  const spanName = buildSpanName(getActionInput("JOB_NAME")); // ← resolves hyphen form for the span NAME
  const startMs = parseJobStartMs(process.env.GITHUB_AW_OTEL_JOB_START_MS);
  if (endpoints) {
    console.log(`[otlp] sending conclusion span "${spanName}" to configured endpoints`);
  } else {
    console.log("[otlp] GH_AW_OTLP_ENDPOINTS not set, skipping OTLP export (will attempt JSONL mirror)");
  }
  await sendOtlpSpan.sendJobConclusionSpan(spanName, { startMs });
  // ← spanName="gh-aw.agent.conclusion" is correct, but inside sendJobConclusionSpan
  // `const jobName = process.env.INPUT_JOB_NAME || ""` returns "" when only the
  // hyphen form is set, so every `if (jobName === "agent")` branch is skipped.
}
Inside sendJobConclusionSpan (send_otlp_span.cjs:1790):
const jobName = process.env.INPUT_JOB_NAME || ""; // ← only checks underscore form
// ...
if (jobName) attributes.push(buildAttr("gh-aw.job.name", jobName));
if (jobName === "agent") {
  attributes.push(buildAttr("gen_ai.operation.name", "chat"));
  // gen_ai.response.model, gen_ai.response.finish_reasons, etc.
}
const hasDedicatedAgentSpan = jobName === "agent" && /* ... */;
if (!hasDedicatedAgentSpan && jobName === "agent") {
  attributes.push(...usageAttrs);
}
The setup span survives this same pattern only because action_setup_otlp.cjs already promotes the hyphen form into the underscore form before invoking sendJobSetupSpan.
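The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the repository's code: resolveActionInput is a hypothetical stand-in for getActionInput (whose real implementation may differ), and conclusionAttributeKeys only mimics the jobName gating with three representative attribute keys.

```javascript
// Hypothetical stand-in for getActionInput: checks the underscore form first,
// then the hyphen form that some runner versions are said to preserve.
function resolveActionInput(env, name) {
  const underscore = env[`INPUT_${name}`];
  if (underscore) return underscore.trim();
  const hyphen = env[`INPUT_${name.replace(/_/g, "-")}`];
  return hyphen ? hyphen.trim() : "";
}

// Mimics the gating pattern: keys guarded by jobName are dropped when the
// direct, underscore-only read comes back empty.
function conclusionAttributeKeys(env) {
  const jobName = env.INPUT_JOB_NAME || ""; // underscore form only
  const keys = ["gen_ai.request.model"]; // not gated on jobName
  if (jobName) keys.push("gh-aw.job.name");
  if (jobName === "agent") keys.push("gen_ai.operation.name");
  return keys;
}

// Only the hyphen form is set, as in the regression.
const hyphenOnlyEnv = { "INPUT_JOB-NAME": "agent" };
const before = conclusionAttributeKeys(hyphenOnlyEnv);

// Normalization: promote whichever form exists to the underscore form.
const normalizedEnv = { ...hyphenOnlyEnv };
const resolved = resolveActionInput(normalizedEnv, "JOB_NAME");
if (resolved) normalizedEnv.INPUT_JOB_NAME = resolved;
const after = conclusionAttributeKeys(normalizedEnv);

console.log(before); // only the ungated key
console.log(after); // ungated key plus the two jobName-gated keys
```

Passing the environment as a plain object keeps the sketch free of side effects on process.env; the real post step mutates process.env instead.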
Proposed Change
Mirror the setup-step normalization in actions/setup/js/action_conclusion_otlp.cjs:
// actions/setup/js/action_conclusion_otlp.cjs (inside run(), before sendJobConclusionSpan)
async function run() {
  const endpoints = process.env.GH_AW_OTLP_ENDPOINTS;
  // Normalize to the canonical underscore form so sendJobConclusionSpan
  // (which reads process.env.INPUT_JOB_NAME directly) always finds the value,
  // matching the normalization done by action_setup_otlp.cjs at setup time.
  const inputJobName = getActionInput("JOB_NAME");
  if (inputJobName) {
    process.env.INPUT_JOB_NAME = inputJobName;
  }
  const spanName = buildSpanName(inputJobName);
  const startMs = parseJobStartMs(process.env.GITHUB_AW_OTEL_JOB_START_MS);
  // ...rest unchanged
  await sendOtlpSpan.sendJobConclusionSpan(spanName, { startMs });
}
This is the minimal, behavior-preserving fix and stays consistent with the existing setup-step pattern. A follow-up cleanup could replace the direct process.env.INPUT_JOB_NAME reads inside sendJobSetupSpan and sendJobConclusionSpan with getActionInput("JOB_NAME") so the two paths can't drift again — but that is a larger refactor and not required for the regression fix.
Expected Outcome
After this change, gh-aw.agent.conclusion spans (and the dedicated gh-aw.agent.agent LLM span) will once again carry:
- In Grafana / Sentry / Honeycomb / Datadog: queryable gen_ai.operation.name:chat, gen_ai.usage.total_tokens, gen_ai.response.finish_reasons:length (truncation detection), gen_ai.response.model, gh-aw.turns, gh-aw.estimated_cost_usd, and gh-aw.job.name:agent for filtering. Sum-over-workflow dashboards will stop missing agent-job tokens.
- A dedicated gh-aw.agent.agent span (CLIENT kind) per agent execution, allowing p95 LLM-latency alerting separately from total job duration.
- In the JSONL mirror (/tmp/gh-aw/otel.jsonl): the same attributes survive, so artifact-only debugging is also restored.
- For on-call: a single query span.name:gh-aw.agent.conclusion AND gen_ai.response.finish_reasons:length becomes sufficient to detect every truncated agent response across all workflows.
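One way to spot-check the JSONL mirror after the fix is sketched below. The line format is an assumption: each line is taken to be a single span object with a name, a spanId, and an OTLP-style attributes array of { key, value } entries, which may not match the actual layout of /tmp/gh-aw/otel.jsonl. The function name and the inline sample are illustrative only.

```javascript
// Assumption: each JSONL line is one span object shaped like
// { name, spanId, attributes: [{ key, value: { stringValue | intValue | ... } }] }.
function attributeKeys(span) {
  return new Set((span.attributes || []).map((attr) => attr.key));
}

// Keys the issue expects on every agent conclusion span.
const REQUIRED_AGENT_KEYS = [
  "gh-aw.job.name",
  "gen_ai.operation.name",
  "gen_ai.response.finish_reasons",
];

// Returns, per agent conclusion span, the required keys that are missing.
function findIncompleteAgentConclusions(jsonlText) {
  const problems = [];
  for (const line of jsonlText.split("\n")) {
    if (!line.trim()) continue; // skip blank lines
    const span = JSON.parse(line);
    if (span.name !== "gh-aw.agent.conclusion") continue;
    const keys = attributeKeys(span);
    const missing = REQUIRED_AGENT_KEYS.filter((key) => !keys.has(key));
    if (missing.length > 0) problems.push({ spanId: span.spanId, missing });
  }
  return problems;
}

// Inline sample standing in for the mirror file's contents.
const sample = [
  JSON.stringify({
    name: "gh-aw.agent.conclusion",
    spanId: "a1",
    attributes: [{ key: "gen_ai.request.model", value: { stringValue: "example-model" } }],
  }),
  JSON.stringify({ name: "gh-aw.agent.setup", spanId: "b2", attributes: [] }),
].join("\n");

console.log(findIncompleteAgentConclusions(sample));
```

An empty result for a post-fix run would suggest the gated attributes are flowing again; it does not replace the backend queries listed above.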
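Implementation Steps
- actions/setup/js/action_conclusion_otlp.cjs: copy the 3-line INPUT_JOB_NAME normalization from action_setup_otlp.cjs:81-84 into the top of run(), before constructing spanName.
- actions/setup/js/action_conclusion_otlp.test.cjs: add a test that sets only process.env["INPUT_JOB-NAME"] = "agent" (hyphen form), calls run(), and asserts that process.env.INPUT_JOB_NAME === "agent" after the call. This locks in the contract that sendJobConclusionSpan will see the canonical form.
- Run cd actions/setup/js && npx vitest run action_conclusion_otlp.test.cjs send_otlp_span.test.cjs to confirm tests pass.
- Run make fmt to ensure formatting.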
Evidence from Live OTel Data (Sentry/Grafana)
Grafana Tempo — full trace 4655a6296a6d36b45a1a62a328bbe244 (Matt Pocock Skills Reviewer, run 26455647757, 2026-05-26T14:50:27Z):
The gh-aw.agent.conclusion span (id=0a9f22a916c26aa2, parent=b867e64153c0916a which is gh-aw.agent.setup) has:
gen_ai.request.model: claude-sonnet-4.6 ✅ (set unconditionally from awInfo.model)
gh-aw.effective_tokens: 1219039 ✅ (set unconditionally when > 0)
gh-aw.action_minutes: 3.874 ✅
gh-aw.run.status: success ✅
gh-aw.job.name: ❌ absent (gated on if (jobName))
gen_ai.operation.name: ❌ absent (gated on if (jobName === "agent"))
gen_ai.response.model: ❌ absent
gen_ai.response.finish_reasons: ❌ absent
gen_ai.usage.*: ❌ absent
gh-aw.turns, gh-aw.estimated_cost_usd: ❌ absent
The matching gh-aw.agent.setup span (id=b867e64153c0916a) in the same trace does carry gh-aw.job.name: agent, confirming the asymmetry is in the post-step path only.
Sentry spans dataset — query span.name:gh-aw.agent.conclusion timestamp:>2026-05-25 returns 14+ spans across multiple workflows (Matt Pocock Skills Reviewer, Test Quality Sentinel, Design Decision Gate, PR Sous Chef, Issue Monster, Daily Malicious Code Scan, Spec Librarian, CI Coach, Smoke CI). All are missing the jobName-gated attributes above.
Regression evidence: Sentry query span.name:gh-aw.agent.conclusion gen_ai.usage.total_tokens:>0 returns matches from 2026-05-21 (gen_ai.usage.total_tokens present, e.g. 682735, 2116561, 1269362) but zero matches from 2026-05-25 onward.
Cross-backend consistency: Sentry and Grafana Tempo agree on the attribute set — the omission is at the emit side, not an ingestion gap.
Connectivity checks performed: Sentry whoami ✅, Sentry find_organizations ✅ (github), Sentry find_projects ✅ (gh-aw, project id 4511347087179777), Grafana list_datasources ✅, Grafana tempo_traceql-search ✅, Grafana tempo_get-trace ✅. The Sentry MCP build available here does not expose search_events, so list_events with explicit fields was used throughout — this is a tool limitation, not a data gap.
Related Files
actions/setup/js/action_conclusion_otlp.cjs (the fix — add 3-line normalization in run())
actions/setup/js/action_setup_otlp.cjs (reference implementation — lines 81-84)
actions/setup/js/send_otlp_span.cjs (consumer that reads process.env.INPUT_JOB_NAME directly at line 1790; jobName-gated branches at lines 1881, 1918-1931, 2124, 2165)
actions/setup/js/action_conclusion_otlp.test.cjs (extend with a hyphen-form normalization test)
actions/setup/js/action_input_utils.cjs (provides getActionInput which already handles both forms)
Generated by the Daily OTel Instrumentation Advisor workflow