Summary
Kueue (a Kubernetes SIG project under the CNCF umbrella) is increasingly used to schedule GPU and other compute-intensive workloads in multi-tenant Kubernetes clusters. When Kueue manages a pod, it adds a `SchedulingGate`, holds the pod in the `SchedulingGated` state until a `ClusterQueue` admits the workload, then removes the gate and lets the scheduler run.

The current spawner has no awareness of this: it sees the pod stuck in a non-Running phase and either times out or shows the user a blank progress screen with no explanation of why their server is waiting.
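To make the gated state concrete, here is a minimal sketch of how a pod's scheduling state could be read from the pod object itself, assuming the pod is available as a plain dict in the Kubernetes API's JSON shape. The helper names `scheduling_gates` and `describe_scheduling_state` are hypothetical and not part of KubeSpawner. `spec.schedulingGates` and the `PodScheduled` condition are standard pod fields; the gate name `kueue.x-k8s.io/admission` is an assumption about what Kueue sets and should be checked against the Kueue version in use.

```python
from typing import Any, Dict, List

# Assumed name of the gate Kueue adds; verify against your Kueue version.
KUEUE_GATE = "kueue.x-k8s.io/admission"


def scheduling_gates(pod: Dict[str, Any]) -> List[str]:
    """Return the names of all scheduling gates still set on the pod."""
    gates = (pod.get("spec") or {}).get("schedulingGates") or []
    return [gate.get("name", "") for gate in gates]


def describe_scheduling_state(pod: Dict[str, Any]) -> str:
    """Map a pod dict to a short, user-facing description (hypothetical helper)."""
    status = pod.get("status") or {}
    phase = status.get("phase", "Unknown")
    if phase != "Pending":
        # Running, Succeeded, Failed, Unknown: nothing scheduling-related to explain.
        return phase

    gates = scheduling_gates(pod)
    if KUEUE_GATE in gates:
        return "Waiting for Kueue ClusterQueue admission"
    if gates:
        return "SchedulingGated by: " + ", ".join(gates)

    # No gates left: check whether the scheduler has already placed the pod.
    for condition in status.get("conditions") or []:
        if condition.get("type") == "PodScheduled" and condition.get("status") == "True":
            return "Scheduled, containers starting"
    return "Pending, not yet scheduled"
```

A string like this is what a progress screen could show in place of a blank wait; the wording of the messages is illustrative only.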
Kueue pod lifecycle
```mermaid
stateDiagram-v2
    direction LR
    [*] --> Pending : pod created (SchedulingGate added by Kueue)
    Pending --> SchedulingGated : waiting for ClusterQueue admission
    SchedulingGated --> PodScheduled : Kueue admits workload (gate removed)
    PodScheduled --> Running : node assigned, containers starting
    Running --> [*]
    note right of SchedulingGated
        Current spawner sees this as
        "stuck" and may time out.
        User sees a blank wait screen.
    end note
    note right of Running
        KueuePodLifecycleHandler
        logs each transition and
        surfaces Workload events
        in the spawn progress UI.
    end note
```

Proposed design
A new `lifecycle_handlers.py` module uses a strategy pattern so that different scheduling systems can plug into the spawner without adding conditional branches to the core:

- `PodLifecycleHandler` (default, no behaviour change)
  - `create_pod(v1, namespace, body)`: delegates to `v1.create_namespaced_pod`
  - `wait_for_ready(v1, pod_name, namespace, timeout, log)`: polls for the `Ready` condition
  - `get_extra_events(v1, pod_name, namespace, log)`: returns `[]`
- `KueuePodLifecycleHandler(queue_name=None)`
  - overrides `wait_for_ready` to log `SchedulingGated → Scheduled` transitions
  - overrides `get_extra_events` to fetch Kueue `Workload` CR events, surfacing queue admission progress in the JupyterHub spawn UI

Two new `KubeSpawner` traitlets wire it in. When `kueue_enabled=False` (default), `PodLifecycleHandler` is used: zero Kueue code runs and behaviour is identical to today.

No new dependencies
Everything uses `kubernetes_asyncio`, which is already required. We never import a Kueue SDK; `Workload` events are fetched via the standard CoreV1 events API.

Implementation
A working implementation is in kaddynator/kubespawner feat/kueue-lifecycle-handlers. Happy to open a PR if this direction looks reasonable to maintainers.
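For readers who have not opened the branch, the handler interface from the proposed design could look roughly like the sketch below. It is an illustration written for this description, not the code in the branch. The method names and signatures follow the design above, while `pod_state`, `on_transition`, `make_handler`, `poll_interval`, and the rule that a Workload's name contains the pod name are assumptions made for the example. The `v1` argument is expected to behave like `kubernetes_asyncio.client.CoreV1Api`.

```python
import asyncio
import time
from typing import Any, List, Optional


class PodLifecycleHandler:
    """Default strategy: create the pod, then poll until it is Ready."""

    poll_interval = 1.0  # seconds between polls (illustrative value)

    async def create_pod(self, v1: Any, namespace: str, body: Any) -> Any:
        return await v1.create_namespaced_pod(namespace, body)

    async def wait_for_ready(self, v1, pod_name, namespace, timeout, log) -> bool:
        """Return True once the pod is Ready, False if the timeout expires."""
        deadline = time.monotonic() + timeout
        previous = None
        while True:
            pod = await v1.read_namespaced_pod(pod_name, namespace)
            state = self.pod_state(pod)
            if state != previous:
                self.on_transition(previous, state, log)
                previous = state
            if state == "Ready":
                return True
            if time.monotonic() >= deadline:
                return False
            await asyncio.sleep(self.poll_interval)

    async def get_extra_events(self, v1, pod_name, namespace, log) -> List[Any]:
        return []

    def on_transition(self, old: Optional[str], new: str, log) -> None:
        """Hook for subclasses (hypothetical); the default handler stays silent."""

    @staticmethod
    def pod_state(pod: Any) -> str:
        """Collapse gates and conditions into one coarse state (hypothetical)."""
        if getattr(pod.spec, "scheduling_gates", None):
            return "SchedulingGated"
        status = pod.status
        conditions = {c.type: c.status for c in ((status.conditions if status else None) or [])}
        if conditions.get("Ready") == "True":
            return "Ready"
        if conditions.get("PodScheduled") == "True":
            return "Scheduled"
        return "Pending"


class KueuePodLifecycleHandler(PodLifecycleHandler):
    """Kueue-aware strategy: logs transitions and returns Workload events."""

    def __init__(self, queue_name: Optional[str] = None):
        self.queue_name = queue_name

    def on_transition(self, old, new, log) -> None:
        log.info("Pod scheduling state: %s -> %s", old or "created", new)

    async def get_extra_events(self, v1, pod_name, namespace, log) -> List[Any]:
        result = await v1.list_namespaced_event(
            namespace, field_selector="involvedObject.kind=Workload"
        )
        # Assumption: the Workload created for a pod embeds the pod's name.
        return [e for e in result.items if pod_name in e.involved_object.name]


def make_handler(kueue_enabled: bool = False, queue_name: Optional[str] = None):
    """Pick the strategy once, so the spawner core needs no Kueue branches."""
    if kueue_enabled:
        return KueuePodLifecycleHandler(queue_name=queue_name)
    return PodLifecycleHandler()
```

With this shape the spawner calls the same three methods whichever handler is active, and the only place that knows about Kueue is the factory that reads `kueue_enabled`.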