AUDIO-1: audio output service specification

JarbasAl · claude · JarbasAl · commit bda203b8eaed · 2026-06-22T19:04:30.000+01:00
Defines the audio output service: rendering pipeline, the sequential
playback queue shared by speech and sound effects, remote-client rendering
(ovos.utterance.speak.b64 -> ovos.audio.speech), output lifecycle signals,
speaking-status query, stop integration, and the listen-triggered
ovos.mic.listen follow-up.

- §4 — renumber the Listen flag section from §4.5 to §4.4 (no §4.4 existed);
  update its eight in-document references.
- §5.3 — ovos.audio.is_speaking: an absent or "default" session_id asks
  about the device-local default session (SESSION-1 §3.1), not a wildcard
  over all sessions.

The §9.6 listen field and the speak payload live in the PIPELINE-1 PR.
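
For reviewers, the §5.3 change can be pictured with a minimal sketch. It
assumes the request context is a plain mapping and that the service tracks
a set of session ids with active playback; the helper names
(resolve_query_session, is_speaking) and the active_sessions set are
hypothetical and are not part of the spec or of any OVOS library.

```python
from typing import Any, Mapping, Optional, Set

# Assumption: the device-local default session is identified by "default".
DEFAULT_SESSION_ID = "default"


def resolve_query_session(context: Optional[Mapping[str, Any]]) -> str:
    """Return the session id an ovos.audio.is_speaking request asks about.

    An absent or empty session_id falls back to the default session;
    it is never widened to "any session".
    """
    session = (context or {}).get("session") or {}
    session_id = session.get("session_id")
    return session_id if session_id else DEFAULT_SESSION_ID


def is_speaking(active_sessions: Set[str],
                context: Optional[Mapping[str, Any]]) -> bool:
    """Answer the query for exactly one session (hypothetical helper)."""
    return resolve_query_session(context) in active_sessions


# A remote session is speaking, the default session is not.
active = {"satellite-1"}
print(is_speaking(active, None))
print(is_speaking(active, {"session": {"session_id": "default"}}))
print(is_speaking(active, {"session": {"session_id": "satellite-1"}}))
```

Under the previous wording the first two queries would have been answered
as a wildcard over all sessions; in this sketch they are answered for the
default session only.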

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,15 +7,17 @@ field and adds an entry here.
 
 ## OVOS-AUDIO-1 — Audio Output Service
 
-### 1
+### 2
 
-- Initial draft. Defines two rendering modes (`ovos.utterance.speak`
-  for local playback, `ovos.utterance.speak.b64` for remote-client
-  delivery), sequential playback queue for speech and sound effects,
-  fire-and-forget playback control (`ovos.audio.speech`), session
-  scoping (default session only for local service), TTS-as-a-service
-  via `ovos.audio.speech`, stop/pause/resume/duck lifecycle, and
-  conformance roles (Audio Service, Orchestrator, Skill, TTS Plugin).
+- The audio output service: the rendering pipeline (dialog-transformer
+  chain, TTS synthesis, TTS-transformer chain, playback queue), the
+  sequential playback queue shared by speech (`ovos.utterance.speak`) and
+  sound effects (`ovos.audio.queue` / `ovos.audio.play_sound`), the
+  remote-client rendering mode (`ovos.utterance.speak.b64` →
+  `ovos.audio.speech`), output lifecycle signals
+  (`ovos.audio.output.started` / `.ended`), the speaking-status query
+  (`ovos.audio.is_speaking`), stop integration (`ovos.audio.stop`,
+  `ovos.stop`), and the `listen`-triggered `ovos.mic.listen` follow-up.
 
 ## OVOS-INTENT-1 — Sentence Template Grammar
 
diff --git a/README.md b/README.md
@@ -113,7 +113,6 @@ below). Adoption is voluntary; conformance, once adopted, is not.
 | OVOS-CONTEXT-1 | [Intent Context](intent-context.md) | 1 | [Draft — in review (PR #18)](https://github.com/OpenVoiceOS/architecture/pull/18) |
 | OVOS-CONVERSE-1 | [Active Handlers and Interactive Response](converse.md) | 1 | [Draft — in review (PR #25)](https://github.com/OpenVoiceOS/architecture/pull/25) |
 | OVOS-STOP-1 | [Stop Pipeline Plugin](ovos-stop-1.md) | 1 | [Draft — in review (PR #33)](https://github.com/OpenVoiceOS/architecture/pull/33) |
-| OVOS-AUDIO-1 | [Audio Output Service](audio-out.md) | 1 | [Draft — in review (PR #38)](https://github.com/OpenVoiceOS/architecture/pull/38) |
 | OVOS-AUDIO-IN-1 | [Audio Input Service](ovos-audio-in-1.md) | 1 | Draft |
 
 Each spec carries its own scope statement, design rationale, and
@@ -174,10 +173,12 @@ require a version bump.
 
 ## Credits
 
-These specifications were produced as part of a documentation and
-interoperability effort for OpenVoiceOS, funded by NLnet's
-[NGI0 Commons Fund](https://nlnet.nl/project/OpenVoiceOS) under
-grant agreement No
-[101135429](https://cordis.europa.eu/project/id/101135429).
+Produced for [OpenVoiceOS](https://openvoiceos.org).
 
-![NGI0 / NLnet](./ngi.png)
+[![NGI0 Commons Fund](./ngi.png)](https://nlnet.nl/project/OpenVoiceOS)
+
+This project was funded through the [NGI0 Commons Fund](https://nlnet.nl/commonsfund),
+a fund established by [NLnet](https://nlnet.nl) with financial support from the
+European Commission's [Next Generation Internet](https://ngi.eu) programme, under
+the aegis of [DG Communications Networks, Content and Technology](https://commission.europa.eu/about-european-commission/departments-and-executive-agencies/communications-networks-content-and-technology_en)
+under grant agreement No [101135429](https://cordis.europa.eu/project/id/101135429).
diff --git a/VERSIONING.md b/VERSIONING.md
@@ -0,0 +1,33 @@
+# Spec versioning policy
+
+Version numbers in this repository carry compatibility semantics anchored to
+the pre-specification behavior of the OVOS stack:
+
+| Version | Meaning |
+| --- | --- |
+| **V0** | The de facto, undocumented status quo — the behavior the stack ships before a subsystem is formalized. V0 is never written down as a spec; it is the reference point. |
+| **V1** | A formalization of behavior that is **compatible with V0**. A V0 component keeps working against a V1 implementation, even if degraded (missing optional fields, reduced guarantees, legacy namespaces honored). |
+| **V2** | Behavior that is **not backwards compatible** with V0. Adopting it requires coordinated migration (e.g. the `legacy_namespace` configuration gate). |
+
+Until launch day, every spec in this repository MUST be classified as V1 or
+V2. The classification is part of the spec header. Rules of thumb:
+
+- A spec that documents existing message flows, adds optional fields, or
+  introduces parallel namespaces while the legacy ones keep working → **V1**.
+- A spec that renames or removes message types, changes payload semantics, or
+  requires consumers to change before producers (or vice versa) → **V2**.
+- A single spec MAY contain V1 sections and V2 sections only if the V2 parts
+  are explicitly gated (configuration flag) and the ungated behavior is V1.
+
+Within a class, editorial revisions bump the spec's own revision number in
+its header; compatibility class changes (V1 → V2) are a new spec version, not
+a revision.
+
+## The 1.0 definition
+
+The compatibility classes define the project roadmap. The stack starts at V0
+(the undocumented status quo — beta). Each subsystem is formalized as V1, then
+migrated to V2 where the spec demands incompatible change. **OVOS is fully
+spec compliant when every subsystem operates on V2 — that state is the
+"breakthrough" in "from beta to breakthrough", and it is the 1.0 release
+criterion.**
diff --git a/audio-out.md b/audio-out.md
@@ -1,6 +1,6 @@
 # Audio Output Service Specification
 
-**Spec ID:** OVOS-AUDIO-1 · **Version:** 1 · **Status:** Draft
+**Spec ID:** OVOS-AUDIO-1 · **Version:** 2 · **Status:** Draft
 
 This specification defines the **audio output service** — the
 pipeline's output-side counterpart that consumes natural-language
@@ -169,7 +169,7 @@ local playback, the service **MUST** emit `ovos.audio.speech` (§4.3)
 with the synthesised audio encoded as base64. The audio is not
 enqueued and does not play on the local device.
 
-The `listen` flag (§4.5) applies: if the originating Message carries
+The `listen` flag (§4.4) applies: if the originating Message carries
 `listen: true`, the service **MUST** emit `ovos.mic.listen` after
 emitting `ovos.audio.speech`.
 
@@ -215,7 +215,7 @@ participants and their audio is delivered via
 |-------|------|----------|---------|
 | `uri` | string | no | URI referencing the audio data. |
 | `audio` | string | no | Base64-encoded audio data, used when the audio source is on a different host (alternative to `uri`). |
-| `listen` | bool | no | When `true`, re-opens the user input channel after this item plays (§4.5). |
+| `listen` | bool | no | When `true`, re-opens the user input channel after this item plays (§4.4). |
 
 Exactly one of `uri` or `audio` MUST be present.
 
@@ -251,7 +251,7 @@ The session is identified via `context.session` as usual. A bridge
 (OVOS-BRIDGE-1 §4.2.4) subscribes by `session_id` or `destination`
 and relays this message to the client.
 
-### 4.5 Listen flag
+### 4.4 Listen flag
 
 The `listen` field on `ovos.utterance.speak` is defined by
 OVOS-PIPELINE-1 §9.6. When a received Message carries `listen: true`,
@@ -300,9 +300,9 @@ of this Message.
 Components that subscribed to `ovos.audio.output.started` use this
 signal to restore state.
 
-If the last completed item carried `listen: true` (§4.5), the audio
+If the last completed item carried `listen: true` (§4.4), the audio
 output service emits `ovos.mic.listen` **after** `ovos.audio.output.ended`.
-On a stop-initiated end, `ovos.mic.listen` is not emitted (§4.5).
+On a stop-initiated end, `ovos.mic.listen` is not emitted (§4.4).
 
 ### 5.3 Speaking-status query
 
@@ -313,8 +313,9 @@ currently speaking by emitting:
 
 Request payload: none. To scope the query to a specific session,
 the requester sets `context.session.session_id` in the request
-Message; the service answers for that session only. A request with
-`session_id: "default"` in context asks about any active session.
+Message; the service answers for that session only. An absent or
+`"default"` `session_id` asks about the device-local default session
+(OVOS-SESSION-1 §3.1); it is not a wildcard over all sessions.
 
 The service replies with:
 
@@ -365,7 +366,7 @@ The audio output service **MAY** scope its response to that session.
 | `ovos.audio.output.started` | audio → broadcast | Playback session started (§5.1). |
 | `ovos.audio.output.ended` | audio → broadcast | Playback session ended (§5.2). |
 | `ovos.audio.speech` | audio → broadcast | Synthesised audio as base64 for remote clients (§4.3). |
-| `ovos.mic.listen` | audio → broadcast | Request microphone re-open after `listen: true` (§4.5). |
+| `ovos.mic.listen` | audio → broadcast | Request microphone re-open after `listen: true` (§4.4). |
 
 ---
 
@@ -388,8 +389,8 @@ The audio output service **MAY** scope its response to that session.
 - emit `ovos.audio.output.ended` when a playback session ends (§5.2);
 - clear the scheduled queue and terminate playback on stop signals (§6);
 - emit `ovos.mic.listen` after playback when the last item carries
-  `listen: true` (§4.5);
-- suppress `ovos.mic.listen` when playback ends due to a stop signal (§4.5, §6).
+  `listen: true` (§4.4);
+- suppress `ovos.mic.listen` when playback ends due to a stop signal (§4.4, §6).
 
 ### An audio output service **SHOULD**:
 
diff --git a/ngi.png b/ngi.png
diff --git a/ovos-pipeline-1.md b/ovos-pipeline-1.md
@@ -1130,7 +1130,6 @@ audio-capable deployment.
 |-------|------|----------|---------|
 | `utterance` | string | yes | The natural-language response string. |
 | `lang` | string | no | BCP-47 tag of the response language. When absent, the output stage resolves language from the session per OVOS-SESSION-1 §3.2. |
-| `listen` | bool | no | When `true`, the handler expects a follow-up utterance from the user after this response is delivered. Output consumers **SHOULD** re-open the user input channel (microphone, chat input affordance, etc.) once delivery is complete. Absent or `false` means no follow-up is expected. |
 
 **Derivation and session propagation.** A handler **MUST** derive each
 `ovos.utterance.speak` emission from the dispatch Message (§7) it
@@ -1147,26 +1146,9 @@ acts silently (playing a sound, toggling a device, queuing media) is
 conformant. When a handler emits multiple, the order of emission is the
 intended delivery order; the output stage **SHOULD** preserve it.
 
-**The `listen` flag and follow-up flows.** When a handler emits
-`ovos.utterance.speak` as the prompt in a `get_response` flow
-(OVOS-CONVERSE-1 §5), it **MUST** set `listen: true` on that Message.
-The flag is a protocol-level statement that the handler expects a
-follow-up utterance; every output consumer — audio, chat, any other
-delivery channel — reads it and re-opens the user input channel
-accordingly. Omitting the flag in a `get_response` flow is
-non-conformant: the user is asked a question but the input channel
-is never re-opened.
-
 **Broadcast.** `ovos.utterance.speak` carries no `destination` — it is
 broadcast. Any output component subscribed to the topic may consume it.
 
-**Remote-client variant.** When the intended recipient cannot render
-audio locally (e.g. a satellite without TTS), a handler or bridge MAY
-emit `ovos.utterance.speak.b64` instead. The audio output service
-processes this through the same TTS pipeline and emits
-`ovos.audio.speech` with base64-encoded audio for the client to play
-(OVOS-AUDIO-1 §3.4).
-
 ---
 
 ## 10. Per-pipeline introspection