AUDIO-1: audio output service specification

JarbasAl · JarbasAl · commit 52ef5aa1873e · 2026-06-23T06:18:00.000+01:00
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,20 @@ status quo, `2` once it is not backwards compatible. Entries are grouped under
 the spec's current class. Every pull request that alters normative content adds
 an entry here.
 
+## OVOS-AUDIO-1 — Audio Output Service
+
+### 2
+
+- The audio output service: the rendering pipeline (dialog-transformer
+  chain, TTS synthesis, TTS-transformer chain, playback queue), the
+  sequential playback queue shared by speech (`ovos.utterance.speak`) and
+  sound effects (`ovos.audio.queue` / `ovos.audio.play_sound`), the
+  remote-client rendering mode (`ovos.utterance.speak.b64` →
+  `ovos.audio.speech`), output lifecycle signals
+  (`ovos.audio.output.started` / `.ended`), the speaking-status query
+  (`ovos.audio.is_speaking`), stop integration (`ovos.audio.stop`,
+  `ovos.stop`), and the `listen`-triggered `ovos.mic.listen` follow-up.
+
 ## OVOS-INTENT-1 — Sentence Template Grammar
 
 ### 2
diff --git a/appendix/divergences.md b/appendix/divergences.md
@@ -196,6 +196,21 @@ defined by any spec** and should be removed or replaced:
 - **`ovos.utterance.speak`** (PIPELINE-1 §9.6). The NL output
   exit point; symmetric to `ovos.utterance.handle`. No current
   equivalent — TTS trigger is currently implicit.
+- **`ovos.utterance.speak.b64`** (AUDIO-1 §3.4). Variant of
+  `ovos.utterance.speak` for remote-client delivery: the audio
+  output service runs the same TTS pipeline but emits synthesised
+  audio as base64 via `ovos.audio.speech` instead of queuing for
+  local playback. Used by bridges serving satellites without TTS
+  (BRIDGE-1 §4.2.4).
+- **`ovos.audio.speech`** (AUDIO-1 §4.3). Base64-encoded
+  synthesised audio broadcast; emitted in response to
+  `ovos.utterance.speak.b64`. Carries a `listen` flag. Remote
+  clients (e.g. satellites relayed by a bridge) decode and play
+  the audio themselves.
+- **`ovos.audio.queue`** / **`ovos.audio.play_sound`** (AUDIO-1
+  §4.1, §4.2). Sound-effect playback topics. Payloads accept
+  either a `uri` or an inline base64 `audio` field, enabling
+  cross-host audio delivery without shared filesystem access.
 - **`ovos.intent.list` / `ovos.intent.describe`** (INTENT-4
   §10). Introspection topics served from the orchestrator's
   passive registration index.
diff --git a/appendix/rationale.md b/appendix/rationale.md
@@ -449,3 +449,23 @@ subscribed to `<own_skill_id>:stop`. The pipeline plugin matches
 and selects; the skill stops. Stop is one of the few cases in
 the spec set where the pipeline / skill split is not
 substitutable.
+
+
+### 4.9 Audio output service (AUDIO-1)
+
+**Sentence segmentation as a latency-reduction technique (AUDIO-1 §3.2).**
+When a TTS engine synthesises a long utterance as a single unit, the
+user must wait for the entire synthesis to complete before hearing
+anything. An implementation can reduce perceived latency by splitting
+the utterance at sentence boundaries, synthesising each sentence
+independently, and enqueuing each segment as soon as it is ready —
+so the first sentence begins playing while later sentences are still
+being synthesised.
+
+This is an internal implementation strategy: no other bus participant
+can observe whether the TTS engine segments the utterance. The visible contract
+is unchanged — `ovos.audio.output.started` fires when the first
+audio begins, `ovos.audio.output.ended` fires when the last audio
+completes. The `listen` flag is honoured after all audio for the
+originating utterance has played, regardless of how many internal
+segments were used.
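
Reviewer note (not part of the patch): the segmentation strategy described in §4.9 could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the reference implementation. `split_sentences`, `speak`, and the `synthesize` / `play` / `emit` callables are hypothetical names introduced here; only the topic strings come from the spec text. The regex splitter is deliberately naive, and emitting `ovos.mic.listen` even when nothing was played is an assumption the spec text above does not settle.

```python
import queue
import re
import threading


def split_sentences(utterance):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", utterance.strip())
    return [p for p in parts if p]


def speak(utterance, synthesize, play, emit, listen=False):
    """Synthesise sentence by sentence while earlier segments are playing.

    synthesize(text) -> audio, play(audio) and emit(topic) are hypothetical
    callables standing in for the TTS engine, the playback queue and the bus.
    """
    segments = queue.Queue()
    done = object()  # sentinel marking the end of synthesis

    def producer():
        try:
            for sentence in split_sentences(utterance):
                # Each segment is enqueued as soon as it is ready.
                segments.put(synthesize(sentence))
        finally:
            segments.put(done)

    threading.Thread(target=producer, daemon=True).start()

    started = False
    while True:
        audio = segments.get()
        if audio is done:
            break
        if not started:
            # Fired once, when the first audio begins.
            emit("ovos.audio.output.started")
            started = True
        play(audio)

    if started:
        # Fired once, after the last segment has played.
        emit("ovos.audio.output.ended")
    if listen:
        # Honoured only after all audio for the utterance has played.
        # Assumption: also emitted when no audio was produced.
        emit("ovos.mic.listen")
```

With stub callables, a three-sentence utterance produces one `started` event, three `play` calls in order, one `ended` event, and then the `listen` follow-up, which matches the externally visible contract regardless of the number of internal segments.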
diff --git a/audio-out.md b/audio-out.md