Commit 4ce1d22

Tom Lasswell committed
feat(lan): reality probe — capture all raw LAN replies, not just devStatus (#57)
Don't inherit another integration's idea of the LAN data surface. Replace the devStatus-only probe with a reality probe that fires a read-only query battery (devStatus + the never-used-by-libs `status` cmd + a unicast scan) at each discovered device and captures EVERY datagram it emits — whole payload, any cmd, any field, even undecodable bytes — so the real surface is measured on hardware, not assumed.

- api/lan.py: async_probe_lan_raw() + _RawProbeProtocol (unfiltered capture, keyed by source IP, list per device). LAN_PROBE_COMMANDS is a read-only battery; NO write verbs (turn/brightness/colorwc/ptReal) ever sent. The `status` reply's `pt` BLE-hex is the prime candidate for segment/scene/sensor readback the 4-field devStatus omits. Per-IP capture capped.
- diagnostics.py: each device gains lan_raw (full capture), status (devStatus summary), commands_answered; plus fleet-wide commands_answered. New _scrub_lan_addresses value-redacts any MAC/IPv4 anywhere in the capture (MAC->stable hash, IPv4->REDACTED_IP) on top of key-name TO_REDACT, since the capture keeps unknown keys. Firmware versions like "1.02.03" are not IPv4 quads, so they survive.
- docs: reality-probe note in §6.
- tests: rewrite probe coverage for the raw protocol incl. unknown-cmd preservation, undecodable capture, value-level address scrub, and the no-write-verbs guarantee.

Claude-Session: https://claude.ai/code/session_01QVkrSto5stGSV1NNS5pmKM
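
The value-level scrub described for diagnostics.py is not part of the diff shown on this page, so the following is only a minimal sketch of how such a pass could work, not the commit's `_scrub_lan_addresses`. The function name `scrub_addresses`, the `MAC_` token format, the regular expressions, and the sample keys are all assumptions made for illustration.

```python
import hashlib
import re
from typing import Any

# Assumed patterns (not taken from the commit): colon- or hyphen-separated
# MACs of 6 to 8 octets, and dotted IPv4 quads with exactly four parts.
_MAC_RE = re.compile(r"\b[0-9A-Fa-f]{2}(?:[:-][0-9A-Fa-f]{2}){5,7}\b")
_IPV4_RE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def _hash_mac(match: re.Match) -> str:
    """Map a MAC to a short stable token so devices stay distinguishable."""
    digest = hashlib.sha256(match.group(0).lower().encode("utf-8")).hexdigest()
    return f"MAC_{digest[:8]}"


def scrub_addresses(value: Any) -> Any:
    """Recursively redact MAC and IPv4 values, whatever key they sit under."""
    if isinstance(value, dict):
        return {key: scrub_addresses(item) for key, item in value.items()}
    if isinstance(value, list):
        return [scrub_addresses(item) for item in value]
    if isinstance(value, str):
        value = _MAC_RE.sub(_hash_mac, value)
        return _IPV4_RE.sub("REDACTED_IP", value)
    return value  # numbers, booleans and None pass through untouched


# Illustrative payload; the keys and values are made up for this example.
sample = {
    "msg": {
        "cmd": "scan",
        "data": {
            "ip": "192.168.1.50",
            "device": "AA:BB:CC:DD:EE:FF:11:22",
            "version": "1.02.03",
            "note": "seen at 10.0.0.7",
        },
    }
}
scrubbed = scrub_addresses(sample)
```

Because the IPv4 pattern requires four dot-separated groups, a three-part firmware string such as "1.02.03" is left alone, which matches the behaviour the commit message describes.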
1 parent df88e44 commit 4ce1d22

5 files changed

Lines changed: 391 additions & 225 deletions

custom_components/govee/api/lan.py

Lines changed: 96 additions & 71 deletions
@@ -11,37 +11,38 @@

 - ``async_scan_lan_devices`` — one bounded multicast ``scan`` (discovery): which
   devices answer and their identity/firmware metadata.
-- ``async_probe_lan_devstatus`` — a unicast ``devStatus`` query per discovered
-  device, capturing its full runtime reply so we can measure empirically how
-  much state the LAN API actually exposes. Verified against
-  ``Galorhallen/govee-local-api`` and ``wez/govee2mqtt``, a ``devStatus`` reply
-  carries exactly four runtime fields — ``onOff``, ``brightness``, ``color`` and
-  ``colorTemInKelvin`` — but we capture the whole ``data`` dict so a firmware
-  that returns more is not silently discarded (the entire point is discovery).
-
-``ptReal`` (the BLE-over-WiFi passthrough that drives scenes/segments/music) is
-deliberately NOT probed: both reference libraries send it fire-and-forget with
-no response to read back, and emitting one is a state-changing control write —
-forbidden in this read-only module. So scene/segment/music/sensor state is
-simply not readable over the LAN API; only the four ``devStatus`` fields and the
-discovery metadata are.
+- ``async_probe_lan_raw`` — a "reality probe": fire a battery of safe READ-ONLY
+  queries (``devStatus`` + ``status`` + a unicast ``scan``) at each discovered
+  device and capture **every** datagram it emits, completely unfiltered — whole
+  payload, any command, any field, even undecodable bytes. We do NOT trust any
+  other integration's notion of which commands exist or which fields a reply
+  carries (``govee-local-api`` parses only 4 ``devStatus`` fields and never sends
+  ``status`` at all); the point is to measure on real hardware what the firmware
+  actually exposes rather than inherit someone else's parser. In particular the
+  ``status`` command's ``pt`` (BLE-passthrough hex) field may carry
+  segment/scene/sensor state that the 4-field ``devStatus`` omits.
+
+``ptReal`` and the other control verbs (``turn``/``brightness``/``colorwc``) are
+deliberately NOT sent: they are state-changing writes, forbidden in this
+read-only module. Capturing what the device *volunteers* in response to read
+queries is the safe way to map the surface.

 Deliberately scoped: no control writes, no entities, no persistent socket — each
 call opens a socket, collects responses for a short timeout, and returns them.
 Protocol per ``docs/govee-protocol-reference.md`` §6:

 - Scan request -> 239.255.255.250:4001 ``{"msg":{"cmd":"scan",...}}``
 - Scan response -> 239.255.255.250:4002 ``{"msg":{"cmd":"scan","data":{...}}}``
-- devStatus query -> <device-ip>:4003 ``{"msg":{"cmd":"devStatus","data":{}}}``
-- devStatus reply -> our :4002 (unicast OR multicast, firmware-dependent)
+- Read queries -> <device-ip>:4003/4001 ``{"msg":{"cmd":"devStatus|status|scan","data":{}}}``
+- Replies -> our :4002 (unicast OR multicast, firmware-dependent)

 Critical protocol detail (the reason early builds returned zero devices, issue
 #57): a Govee device sends its scan *response* as **multicast** to the group on
 port 4002 — it does NOT unicast the reply back to the sender. So the receive
 socket MUST join the ``239.255.255.250`` group via ``IP_ADD_MEMBERSHIP`` or the
 kernel silently drops every reply before it reaches us. Binding port 4002 alone
-is not enough. The devStatus probe reuses the same group-joined 4002 socket so
-it catches replies whether a given firmware answers unicast or multicast. This
+is not enough. The reality probe reuses the same group-joined 4002 socket so it
+catches replies whether a given firmware answers unicast or multicast. This
 mirrors ``govee-local-api`` (the library behind Home Assistant's
 ``govee_light_local``) and ``wez/govee2mqtt``.
 """
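
The "critical protocol detail" in the docstring above (bind port 4002 and also join the multicast group) can be illustrated with a short standalone sketch. This is not the module's `_build_socket` or `_join_group`, whose bodies are not shown in this diff; the names `membership_request` and `open_reply_socket` are hypothetical, and only the group address and port come from the docstring.

```python
import socket

GROUP = "239.255.255.250"  # multicast group named in the docstring
REPLY_PORT = 4002          # port on which devices publish their replies


def membership_request(group: str, interface_ip: str = "0.0.0.0") -> bytes:
    """Build the 8-byte ip_mreq value: group address, then local interface."""
    return socket.inet_aton(group) + socket.inet_aton(interface_ip)


def open_reply_socket(interface_ip: str = "0.0.0.0") -> socket.socket:
    """Bind the reply port AND join the group.

    Binding alone would leave multicast replies undelivered, which is the
    failure mode the docstring describes.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", REPLY_PORT))
        sock.setsockopt(
            socket.IPPROTO_IP,
            socket.IP_ADD_MEMBERSHIP,
            membership_request(GROUP, interface_ip),
        )
    except OSError:
        sock.close()  # do not leak the descriptor if bind or join fails
        raise
    sock.setblocking(False)  # ready to hand to an asyncio datagram endpoint
    return sock
```

Calling `open_reply_socket()` needs a network interface that supports multicast and a free port 4002, so it can raise `OSError` on a sandboxed or offline host.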
@@ -69,22 +70,39 @@
 LAN_COMMAND_PORT = 4003  # devices listen here for unicast devStatus/control
 LAN_MULTICAST_TTL = 2  # let a scan / reply cross at most one router hop

-# devStatus probe budget. All probes share ONE socket and ONE collection window
+# Reality-probe budget. All probes share ONE socket and ONE collection window
 # (sends are fire-and-forget; replies arrive asynchronously), so total wall time
 # is bounded by the window regardless of device_count — 11 devices cost the same
 # ~2s as one. The cap bounds send-loop work against a large CIDR sweep, not wait.
-LAN_PROBE_WINDOW = 2.0  # seconds to collect all devStatus replies
+LAN_PROBE_WINDOW = 2.5  # seconds to collect all probe replies
 LAN_PROBE_MAX_DEVICES = 64  # hard cap on how many IPs we probe in one batch
+LAN_PROBE_MAX_REPLIES_PER_IP = 32  # guard against a chatty device flooding output

 # INADDR_ANY: join/egress on the kernel's default-route interface. Always added
 # alongside any explicit interface IPs as a catch-all for single-NIC hosts.
 _DEFAULT_INTERFACE = "0.0.0.0"

 _SCAN_REQUEST = json.dumps({"msg": {"cmd": "scan", "data": {"account_topic": "reserve"}}}).encode("utf-8")

-# Empty-data devStatus query; matches DevStatusMessage in govee-local-api and
-# Request::DevStatus{} in wez/govee2mqtt. Sent unicast to <device-ip>:4003.
-_DEVSTATUS_REQUEST = json.dumps({"msg": {"cmd": "devStatus", "data": {}}}).encode("utf-8")
+# Read-only LAN query battery for the reality probe (issue #57). We do NOT trust
+# any other integration's field/command list — we send every safe READ query we
+# know of and capture whatever the hardware actually emits, so the real LAN data
+# surface is measured, not assumed. STRICTLY read-only: NO writes
+# (turn/brightness/colorwc/ptReal) — a diagnostics probe must never mutate device
+# state. Each entry is ``(cmd, port, data)`` sent unicast; ``data`` is empty so no
+# parameters are set. Replies are captured raw on 4002 regardless of cmd.
+#
+# - ``devStatus`` (:4003) — the documented status read (4 known fields).
+# - ``status`` (:4003) — undocumented in HA libs but a ``StatusResponse`` with
+#   a ``pt`` (base64 BLE passthrough) field exists in govee-local-api yet is never
+#   sent; it may carry segment/scene/sensor state the 4-field devStatus omits.
+# - ``scan`` (:4001) — unicast discovery, captured WHOLE (not the 7-field
+#   allowlist) so any extra identity/firmware fields surface.
+LAN_PROBE_COMMANDS: tuple[tuple[str, int, dict[str, Any]], ...] = (
+    ("devStatus", LAN_COMMAND_PORT, {}),
+    ("status", LAN_COMMAND_PORT, {}),
+    ("scan", LAN_DISCOVERY_PORT, {"account_topic": "reserve"}),
+)

 # Packed multicast group address, reused for every IP_ADD/DROP_MEMBERSHIP call.
 _GROUP_BYTES = socket.inet_aton(LAN_MULTICAST_GROUP)
@@ -187,38 +205,38 @@ def error_received(self, exc: Exception) -> None: # pragma: no cover - rare
         _LOGGER.debug("LAN scan socket error: %s", exc)


-class _DevStatusProtocol(asyncio.DatagramProtocol):
-    """Collects raw Govee ``devStatus`` replies, keyed by responder IP.
+class _RawProbeProtocol(asyncio.DatagramProtocol):
+    """Captures EVERY datagram received during a probe, raw, keyed by source IP.
+
+    Deliberately unfiltered — no ``cmd`` check, no field allowlist, no shape
+    assumptions. The goal is to record exactly what the hardware emits, including
+    commands and fields no reference library parses, so the real LAN data surface
+    is *measured* rather than inherited from another integration's parser. The
+    whole ``{"msg": ...}`` payload is kept; an undecodable datagram is captured as
+    a truncated ``_unparsed`` string rather than dropped (even garbage is signal).

-    Separate from ``_ScanProtocol`` because that one hard-drops ``cmd != "scan"``.
-    Captures the ENTIRE ``data`` dict (no field allowlist) — the purpose of the
-    probe is to discover what firmware actually returns, so an allowlist would
-    throw away exactly the signal we want. Redaction happens downstream in
-    diagnostics ``_redact`` (key-name based: any ``ip``/``device``/``mac`` key a
-    firmware echoes inside ``data`` is auto-redacted there).
+    Keyed by the datagram SOURCE IP — correct for both reply paths (a unicast
+    reply to our 4002 source and a multicast reply to the group both carry the
+    device's own IP as the UDP source). Each IP accumulates a LIST of replies so
+    multiple commands' responses (devStatus + status + scan) are all retained.
+    Redaction is downstream in diagnostics (key-name + value-level address scrub).
     """

     def __init__(self) -> None:
-        self.responses: dict[str, dict[str, Any]] = {}
+        self.replies: dict[str, list[Any]] = {}

     def datagram_received(self, data: bytes, addr: tuple[str, int]) -> None:
+        bucket = self.replies.setdefault(addr[0], [])
+        if len(bucket) >= LAN_PROBE_MAX_REPLIES_PER_IP:
+            return  # chatty device / broadcast storm — keep the dump bounded
         try:
-            payload = json.loads(data.decode("utf-8", errors="replace"))
-            msg = payload.get("msg", {})
-            if msg.get("cmd") != "devStatus":
-                return  # ignore scan replies / unrelated multicast noise
-            body = msg.get("data", {})
-            if not isinstance(body, dict):
-                return
-        except (ValueError, AttributeError):
-            return
-        # Key by the datagram SOURCE IP — correct for both reply paths: a
-        # unicast reply to our 4002 source and a multicast reply to the group
-        # both carry the device's own IP as the UDP source. Last reply wins.
-        self.responses[addr[0]] = body
+            payload: Any = json.loads(data.decode("utf-8", errors="replace"))
+        except ValueError:
+            payload = {"_unparsed": data.decode("utf-8", errors="replace")[:512]}
+        bucket.append(payload)

     def error_received(self, exc: Exception) -> None:  # pragma: no cover - rare
-        _LOGGER.debug("LAN devStatus socket error: %s", exc)
+        _LOGGER.debug("LAN raw-probe socket error: %s", exc)


 def _build_socket() -> socket.socket:
@@ -355,27 +373,30 @@ async def async_scan_lan_devices(
     return list(protocol.responses.values())


-async def async_probe_lan_devstatus(
+async def async_probe_lan_raw(
     ips: list[str],
     timeout: float = LAN_PROBE_WINDOW,
     interface_ips: list[str] | None = None,
-) -> dict[str, dict[str, Any]]:
-    """Unicast ``devStatus`` to each IP and collect raw replies for ``timeout`` s.
-
-    Returns ``{responder_ip: raw_data_dict}`` capturing the WHOLE reply body for
-    each device that answers — the probe exists to measure the real LAN data
-    surface, so no field allowlist is applied here (redaction is downstream in
-    diagnostics). A device that discovers but does not answer ``devStatus`` (LAN
-    control disabled in the app, BLE-only SKU) simply has no entry — the caller
-    treats a missing IP as "no status".
-
-    Sends are fire-and-forget to ``<ip>:4003``; replies may return unicast to our
-    4002 source OR multicast to ``239.255.255.250:4002`` depending on firmware,
-    so we reuse the scan socket pattern (bound 4002 + group-joined) to catch
-    both. All probes share one socket and one collection window, so total wall
-    time is bounded by ``timeout`` regardless of device count. ``ips`` is capped
-    at ``LAN_PROBE_MAX_DEVICES`` so a large ``extra_targets`` sweep cannot blow up
-    the send loop.
+    commands: tuple[tuple[str, int, dict[str, Any]], ...] = LAN_PROBE_COMMANDS,
+) -> dict[str, list[Any]]:
+    """Reality probe: fire a read-only query battery at each IP, capture all replies.
+
+    Returns ``{responder_ip: [raw_payload, ...]}`` — the WHOLE ``{"msg": ...}`` of
+    every datagram each device emits during the window, completely unfiltered. We
+    do not trust any other integration's idea of which commands exist or which
+    fields a reply carries: we send every safe READ query in ``commands`` and
+    record exactly what comes back, so the real LAN data surface is measured. A
+    device that does not answer simply has no entry.
+
+    ``commands`` is ``((cmd, port, data), ...)`` — STRICTLY read-only (default
+    ``LAN_PROBE_COMMANDS``: devStatus + status + unicast scan). No control writes
+    are ever sent. Each is unicast to ``<ip>:port``; replies may return unicast to
+    our 4002 source OR multicast to ``239.255.255.250:4002`` depending on
+    firmware, so we reuse the scan socket pattern (bound 4002 + group-joined) to
+    catch both. All probes share one socket and one collection window, so total
+    wall time is bounded by ``timeout`` regardless of device count. ``ips`` is
+    capped at ``LAN_PROBE_MAX_DEVICES``; per-IP capture is capped at
+    ``LAN_PROBE_MAX_REPLIES_PER_IP``.

     ``interface_ips`` join the multicast group on each adapter (multi-homed
     coverage), mirroring ``async_scan_lan_devices``.
@@ -389,22 +410,26 @@ async def async_probe_lan_devstatus(

     interfaces = list(interface_ips or [])
     targets = ips[:LAN_PROBE_MAX_DEVICES]
+    requests = [
+        (json.dumps({"msg": {"cmd": cmd, "data": data}}).encode("utf-8"), port) for cmd, port, data in commands
+    ]

     loop = asyncio.get_running_loop()
     sock = _build_socket()  # raises OSError if port 4002 cannot be bound
     joined = _join_group(sock, interfaces)  # catch multicast replies too

-    transport, protocol = await loop.create_datagram_endpoint(_DevStatusProtocol, sock=sock)
-    assert isinstance(protocol, _DevStatusProtocol)
+    transport, protocol = await loop.create_datagram_endpoint(_RawProbeProtocol, sock=sock)
+    assert isinstance(protocol, _RawProbeProtocol)
     try:
         for ip in targets:
-            try:
-                transport.sendto(_DEVSTATUS_REQUEST, (ip, LAN_COMMAND_PORT))
-            except OSError as err:  # one bad/unreachable IP must not abort the batch
-                _LOGGER.debug("LAN probe: send to %s failed: %s", ip, err)
+            for request, port in requests:
+                try:
+                    transport.sendto(request, (ip, port))
+                except OSError as err:  # one bad/unreachable IP must not abort the batch
+                    _LOGGER.debug("LAN raw probe: send to %s:%s failed: %s", ip, port, err)
         await asyncio.sleep(timeout)
     finally:
         _drop_group(sock, joined)
         transport.close()

-    return dict(protocol.responses)
+    return dict(protocol.replies)
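
Because `async_probe_lan_raw` returns `{responder_ip: [raw_payload, ...]}` with no filtering, a consumer has to tolerate unknown commands, `_unparsed` entries, and JSON values that are not objects. The sketch below shows one way the per-device and fleet-wide `commands_answered` summaries mentioned in the commit message could be derived from that shape. It is an assumption-laden illustration, not the diagnostics.py code (which is not shown on this page); the function names and the `_non_object` / `_unknown` labels are hypothetical.

```python
from typing import Any


def commands_answered(replies: dict[str, list[Any]]) -> dict[str, list[str]]:
    """Per IP, list the distinct ``cmd`` values seen in the raw capture."""
    summary: dict[str, list[str]] = {}
    for ip, payloads in replies.items():
        seen: set[str] = set()
        for payload in payloads:
            if not isinstance(payload, dict):
                seen.add("_non_object")  # valid JSON, but not an object
                continue
            if "_unparsed" in payload:
                seen.add("_unparsed")  # undecodable datagram kept as text
                continue
            msg = payload.get("msg")
            cmd = msg.get("cmd") if isinstance(msg, dict) else None
            seen.add(cmd if isinstance(cmd, str) else "_unknown")
        summary[ip] = sorted(seen)
    return summary


def fleet_commands_answered(replies: dict[str, list[Any]]) -> list[str]:
    """Union of every command any device answered."""
    per_device = commands_answered(replies)
    return sorted({cmd for cmds in per_device.values() for cmd in cmds})


# Illustrative capture using documentation-range IPs; the payloads are made up.
capture = {
    "192.0.2.10": [
        {"msg": {"cmd": "devStatus", "data": {"onOff": 1}}},
        {"msg": {"cmd": "status", "data": {}}},
        {"_unparsed": "\x00\x01 not json"},
    ],
    "192.0.2.11": [{"msg": {"cmd": "scan", "data": {}}}, [1, 2, 3]],
}
per_device = commands_answered(capture)
fleet = fleet_commands_answered(capture)
```

With real hardware the capture would come from `await async_probe_lan_raw(ips)`, and a device that answered nothing would simply have no key in the result.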
