Commit bda203b

JarbasAl and claude committed
AUDIO-1: audio output service specification
Defines the audio output service: rendering pipeline, the sequential playback queue shared by speech and sound effects, remote-client rendering (ovos.utterance.speak.b64 -> ovos.audio.speech), output lifecycle signals, speaking-status query, stop integration, and the listen-triggered ovos.mic.listen follow-up.

- §4: renumber the Listen flag section from §4.5 to §4.4 (no §4.4 existed); update its eight in-document references.
- §5.3 (ovos.audio.is_speaking): an absent or "default" session_id asks about the device-local default session (SESSION-1 §3.1), not a wildcard over all sessions.

The §9.6 listen field and the speak payload live in the PIPELINE-1 PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent bf3bc4b commit bda203b

6 files changed

Lines changed: 63 additions & 44 deletions

CHANGELOG.md

Lines changed: 10 additions & 8 deletions
@@ -7,15 +7,17 @@ field and adds an entry here.
77

88
## OVOS-AUDIO-1 — Audio Output Service
99

10-
### 1
10+
### 2
1111

12-
- Initial draft. Defines two rendering modes (`ovos.utterance.speak`
13-
for local playback, `ovos.utterance.speak.b64` for remote-client
14-
delivery), sequential playback queue for speech and sound effects,
15-
fire-and-forget playback control (`ovos.audio.speech`), session
16-
scoping (default session only for local service), TTS-as-a-service
17-
via `ovos.audio.speech`, stop/pause/resume/duck lifecycle, and
18-
conformance roles (Audio Service, Orchestrator, Skill, TTS Plugin).
12+
- The audio output service: the rendering pipeline (dialog-transformer
13+
chain, TTS synthesis, TTS-transformer chain, playback queue), the
14+
sequential playback queue shared by speech (`ovos.utterance.speak`) and
15+
sound effects (`ovos.audio.queue` / `ovos.audio.play_sound`), the
16+
remote-client rendering mode (`ovos.utterance.speak.b64` ->
17+
`ovos.audio.speech`), output lifecycle signals
18+
(`ovos.audio.output.started` / `.ended`), the speaking-status query
19+
(`ovos.audio.is_speaking`), stop integration (`ovos.audio.stop`,
20+
`ovos.stop`), and the `listen`-triggered `ovos.mic.listen` follow-up.
1921

2022
## OVOS-INTENT-1 — Sentence Template Grammar
2123

README.md

Lines changed: 8 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -113,7 +113,6 @@ below). Adoption is voluntary; conformance, once adopted, is not.
113113
| OVOS-CONTEXT-1 | [Intent Context](intent-context.md) | 1 | [Draft — in review (PR #18)](https://github.com/OpenVoiceOS/architecture/pull/18) |
114114
| OVOS-CONVERSE-1 | [Active Handlers and Interactive Response](converse.md) | 1 | [Draft — in review (PR #25)](https://github.com/OpenVoiceOS/architecture/pull/25) |
115115
| OVOS-STOP-1 | [Stop Pipeline Plugin](ovos-stop-1.md) | 1 | [Draft — in review (PR #33)](https://github.com/OpenVoiceOS/architecture/pull/33) |
116-
| OVOS-AUDIO-1 | [Audio Output Service](audio-out.md) | 1 | [Draft — in review (PR #38)](https://github.com/OpenVoiceOS/architecture/pull/38) |
117116
| OVOS-AUDIO-IN-1 | [Audio Input Service](ovos-audio-in-1.md) | 1 | Draft |
118117

119118
Each spec carries its own scope statement, design rationale, and
@@ -174,10 +173,12 @@ require a version bump.
174173

175174
## Credits
176175

177-
These specifications were produced as part of a documentation and
178-
interoperability effort for OpenVoiceOS, funded by NLnet's
179-
[NGI0 Commons Fund](https://nlnet.nl/project/OpenVoiceOS) under
180-
grant agreement No
181-
[101135429](https://cordis.europa.eu/project/id/101135429).
176+
Produced for [OpenVoiceOS](https://openvoiceos.org).
182177

183-
![NGI0 / NLnet](./ngi.png)
178+
[![NGI0 Commons Fund](./ngi.png)](https://nlnet.nl/project/OpenVoiceOS)
179+
180+
This project was funded through the [NGI0 Commons Fund](https://nlnet.nl/commonsfund),
181+
a fund established by [NLnet](https://nlnet.nl) with financial support from the
182+
European Commission's [Next Generation Internet](https://ngi.eu) programme, under
183+
the aegis of [DG Communications Networks, Content and Technology](https://commission.europa.eu/about-european-commission/departments-and-executive-agencies/communications-networks-content-and-technology_en)
184+
under grant agreement No [101135429](https://cordis.europa.eu/project/id/101135429).

VERSIONING.md

Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,33 @@
1+
# Spec versioning policy
2+
3+
Version numbers in this repository carry compatibility semantics anchored to
4+
the pre-specification behavior of the OVOS stack:
5+
6+
| Version | Meaning |
7+
| --- | --- |
8+
| **V0** | The de facto, undocumented status quo — the behavior the stack ships before a subsystem is formalized. V0 is never written down as a spec; it is the reference point. |
9+
| **V1** | A formalization of behavior that is **compatible with V0**. A V0 component keeps working against a V1 implementation, even if degraded (missing optional fields, reduced guarantees, legacy namespaces honored). |
10+
| **V2** | Behavior that is **not backwards compatible** with V0. Adopting it requires coordinated migration (e.g. the `legacy_namespace` configuration gate). |
11+
12+
Until launch day, every spec in this repository MUST be classified as V1 or
13+
V2. The classification is part of the spec header. Rules of thumb:
14+
15+
- A spec that documents existing message flows, adds optional fields, or
16+
introduces parallel namespaces while the legacy ones keep working → **V1**.
17+
- A spec that renames or removes message types, changes payload semantics, or
18+
requires consumers to change before producers (or vice versa) → **V2**.
19+
- A single spec MAY contain V1 sections and V2 sections only if the V2 parts
20+
are explicitly gated (configuration flag) and the ungated behavior is V1.
21+
22+
Within a class, editorial revisions bump the spec's own revision number in
23+
its header; compatibility class changes (V1 → V2) are a new spec version, not
24+
a revision.
25+
26+
## The 1.0 definition
27+
28+
The compatibility classes define the project roadmap. The stack starts at V0
29+
(the undocumented status quo — beta). Each subsystem is formalized as V1, then
30+
migrated to V2 where the spec demands incompatible change. **OVOS is fully
31+
spec compliant when every subsystem operates on V2 — that state is the
32+
"breakthrough" in "from beta to breakthrough", and it is the 1.0 release
33+
criterion.**

audio-out.md

Lines changed: 12 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# Audio Output Service Specification
22

3-
**Spec ID:** OVOS-AUDIO-1 · **Version:** 1 · **Status:** Draft
3+
**Spec ID:** OVOS-AUDIO-1 · **Version:** 2 · **Status:** Draft
44

55
This specification defines the **audio output service** — the
66
pipeline's output-side counterpart that consumes natural-language
@@ -169,7 +169,7 @@ local playback, the service **MUST** emit `ovos.audio.speech` (§4.3)
169169
with the synthesised audio encoded as base64. The audio is not
170170
enqueued and does not play on the local device.
171171

172-
The `listen` flag (§4.5) applies: if the originating Message carries
172+
The `listen` flag (§4.4) applies: if the originating Message carries
173173
`listen: true`, the service **MUST** emit `ovos.mic.listen` after
174174
emitting `ovos.audio.speech`.
175175

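The remote-client rule in this hunk can be sketched roughly as follows. This is an illustration only, not the reference implementation: the helper name `render_for_remote_client`, the `synthesize` and `emit` callables, the `audio` key in the `ovos.audio.speech` payload, and the assumption that `utterance` and `listen` sit in the message's `data` are all hypothetical.

```python
import base64
from typing import Callable


def render_for_remote_client(
    message: dict,
    synthesize: Callable[[str], bytes],
    emit: Callable[[str, dict], None],
) -> None:
    """Handle one ovos.utterance.speak.b64 request (hypothetical helper)."""
    data = message.get("data", {})
    # synthesize() stands in for the TTS stage; it returns raw audio bytes.
    audio_bytes = synthesize(data["utterance"])
    # The audio goes to the client as base64; nothing is enqueued locally.
    # The "audio" payload key is an assumption made for this sketch.
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    emit("ovos.audio.speech", {"audio": encoded})
    # The listen follow-up is emitted only after ovos.audio.speech.
    if data.get("listen") is True:
        emit("ovos.mic.listen", {})
```

Passing `emit` in as a callable keeps the sketch independent of any particular bus client.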
@@ -215,7 +215,7 @@ participants and their audio is delivered via
215215
|-------|------|----------|---------|
216216
| `uri` | string | no | URI referencing the audio data. |
217217
| `audio` | string | no | Base64-encoded audio data, used when the audio source is on a different host (alternative to `uri`). |
218-
| `listen` | bool | no | When `true`, re-opens the user input channel after this item plays (§4.5). |
218+
| `listen` | bool | no | When `true`, re-opens the user input channel after this item plays (§4.4). |
219219

220220
Exactly one of `uri` or `audio` MUST be present.
221221

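A minimal sketch of how a receiver might check the "exactly one of `uri` or `audio`" rule from the table above. The function name `validate_queue_item` is hypothetical, and treating a key whose value is `None` as absent is an assumption of this sketch, not something the spec text states.

```python
def validate_queue_item(payload: dict) -> None:
    """Raise ValueError if the payload breaks the uri/audio rule (hypothetical helper)."""
    # Assumption: a key that is present but None counts as absent.
    has_uri = payload.get("uri") is not None
    has_audio = payload.get("audio") is not None
    # Equal flags mean either both fields are present or neither is.
    if has_uri == has_audio:
        raise ValueError("exactly one of 'uri' or 'audio' must be present")
    # 'listen' is optional; when given it must be a boolean.
    listen = payload.get("listen", False)
    if not isinstance(listen, bool):
        raise ValueError("'listen' must be a boolean when present")
```

For example, `validate_queue_item({"uri": "file:///tmp/beep.wav"})` passes, while a payload carrying both fields or neither raises `ValueError`.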
@@ -251,7 +251,7 @@ The session is identified via `context.session` as usual. A bridge
251251
(OVOS-BRIDGE-1 §4.2.4) subscribes by `session_id` or `destination`
252252
and relays this message to the client.
253253

254-
### 4.5 Listen flag
254+
### 4.4 Listen flag
255255

256256
The `listen` field on `ovos.utterance.speak` is defined by
257257
OVOS-PIPELINE-1 §9.6. When a received Message carries `listen: true`,
@@ -300,9 +300,9 @@ of this Message.
300300
Components that subscribed to `ovos.audio.output.started` use this
301301
signal to restore state.
302302

303-
If the last completed item carried `listen: true` (§4.5), the audio
303+
If the last completed item carried `listen: true` (§4.4), the audio
304304
output service emits `ovos.mic.listen` **after** `ovos.audio.output.ended`.
305-
On a stop-initiated end, `ovos.mic.listen` is not emitted (§4.5).
305+
On a stop-initiated end, `ovos.mic.listen` is not emitted (§4.4).
306306

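The ordering rule above (`ovos.mic.listen` strictly after `ovos.audio.output.ended`, and suppressed on a stop-initiated end) could look like this in code. The sketch is illustrative: `finish_playback_session`, its `stopped` flag, and the `emit` callable are hypothetical names, and it assumes the ended signal is still emitted when playback is stopped.

```python
from typing import Callable, Optional


def finish_playback_session(
    last_item: Optional[dict],
    stopped: bool,
    emit: Callable[[str, dict], None],
) -> None:
    """Emit the end-of-playback signals in order (hypothetical helper)."""
    # Assumption: the ended signal fires for natural and stop-initiated ends alike.
    emit("ovos.audio.output.ended", {})
    if stopped:
        # A stop-initiated end never re-opens the microphone.
        return
    # Only the last completed item decides whether a follow-up is expected.
    if last_item is not None and last_item.get("listen") is True:
        emit("ovos.mic.listen", {})
```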
307307
### 5.3 Speaking-status query
308308

@@ -313,8 +313,9 @@ currently speaking by emitting:
313313

314314
Request payload: none. To scope the query to a specific session,
315315
the requester sets `context.session.session_id` in the request
316-
Message; the service answers for that session only. A request with
317-
`session_id: "default"` in context asks about any active session.
316+
Message; the service answers for that session only. An absent or
317+
`"default"` `session_id` asks about the device-local default session
318+
(OVOS-SESSION-1 §3.1); it is not a wildcard over all sessions.
318319

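One way to read the revised scoping rule is sketched below: an absent `session_id` and the literal `"default"` both resolve to the default session, and the answer is never widened to other sessions. The helper names and the `speaking_sessions` set are hypothetical; a real service would track playback state however it sees fit.

```python
from typing import Set

DEFAULT_SESSION_ID = "default"


def resolve_query_session(message: dict) -> str:
    """Return the session a speaking-status query refers to (hypothetical helper)."""
    context = message.get("context") or {}
    session = context.get("session") or {}
    # Absent, empty, or "default" all resolve to the device-local default session.
    return session.get("session_id") or DEFAULT_SESSION_ID


def is_speaking(message: dict, speaking_sessions: Set[str]) -> bool:
    """Answer for the resolved session only, never as a wildcard."""
    return resolve_query_session(message) in speaking_sessions
```

With `speaking_sessions = {"abc"}`, a query without a session evaluates to `False` even though another session is speaking, which is the behavioural change this hunk introduces.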
319320
The service replies with:
320321

@@ -365,7 +366,7 @@ The audio output service **MAY** scope its response to that session.
365366
| `ovos.audio.output.started` | audio → broadcast | Playback session started (§5.1). |
366367
| `ovos.audio.output.ended` | audio → broadcast | Playback session ended (§5.2). |
367368
| `ovos.audio.speech` | audio → broadcast | Synthesised audio as base64 for remote clients (§4.3). |
368-
| `ovos.mic.listen` | audio → broadcast | Request microphone re-open after `listen: true` (§4.5). |
369+
| `ovos.mic.listen` | audio → broadcast | Request microphone re-open after `listen: true` (§4.4). |
369370

370371
---
371372

@@ -388,8 +389,8 @@ The audio output service **MAY** scope its response to that session.
388389
- emit `ovos.audio.output.ended` when a playback session ends (§5.2);
389390
- clear the scheduled queue and terminate playback on stop signals (§6);
390391
- emit `ovos.mic.listen` after playback when the last item carries
391-
`listen: true` (§4.5);
392-
- suppress `ovos.mic.listen` when playback ends due to a stop signal (§4.5, §6).
392+
`listen: true` (§4.4);
393+
- suppress `ovos.mic.listen` when playback ends due to a stop signal (§4.4, §6).
393394

394395
### An audio output service **SHOULD**:
395396

ngi.png

5.85 KB

ovos-pipeline-1.md

Lines changed: 0 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -1130,7 +1130,6 @@ audio-capable deployment.
11301130
|-------|------|----------|---------|
11311131
| `utterance` | string | yes | The natural-language response string. |
11321132
| `lang` | string | no | BCP-47 tag of the response language. When absent, the output stage resolves language from the session per OVOS-SESSION-1 §3.2. |
1133-
| `listen` | bool | no | When `true`, the handler expects a follow-up utterance from the user after this response is delivered. Output consumers **SHOULD** re-open the user input channel (microphone, chat input affordance, etc.) once delivery is complete. Absent or `false` means no follow-up is expected. |
11341133

11351134
**Derivation and session propagation.** A handler **MUST** derive each
11361135
`ovos.utterance.speak` emission from the dispatch Message (§7) it
@@ -1147,26 +1146,9 @@ acts silently (playing a sound, toggling a device, queuing media) is
11471146
conformant. When a handler emits multiple, the order of emission is the
11481147
intended delivery order; the output stage **SHOULD** preserve it.
11491148

1150-
**The `listen` flag and follow-up flows.** When a handler emits
1151-
`ovos.utterance.speak` as the prompt in a `get_response` flow
1152-
(OVOS-CONVERSE-1 §5), it **MUST** set `listen: true` on that Message.
1153-
The flag is a protocol-level statement that the handler expects a
1154-
follow-up utterance; every output consumer — audio, chat, any other
1155-
delivery channel — reads it and re-opens the user input channel
1156-
accordingly. Omitting the flag in a `get_response` flow is
1157-
non-conformant: the user is asked a question but the input channel
1158-
is never re-opened.
1159-
11601149
**Broadcast.** `ovos.utterance.speak` carries no `destination` — it is
11611150
broadcast. Any output component subscribed to the topic may consume it.
11621151

1163-
**Remote-client variant.** When the intended recipient cannot render
1164-
audio locally (e.g. a satellite without TTS), a handler or bridge MAY
1165-
emit `ovos.utterance.speak.b64` instead. The audio output service
1166-
processes this through the same TTS pipeline and emits
1167-
`ovos.audio.speech` with base64-encoded audio for the client to play
1168-
(OVOS-AUDIO-1 §3.4).
1169-
11701152
---
11711153

11721154
## 10. Per-pipeline introspection
