Feature: Kueue integration via pluggable pod lifecycle handlers #922

@kaddynator

Summary

Kueue (a Kubernetes SIG project for job queueing) is increasingly used for GPU and compute-intensive workload scheduling in multi-tenant Kubernetes clusters. When Kueue manages a pod, it adds a scheduling gate that keeps the pod Pending (reported as SchedulingGated) until a ClusterQueue admits the workload. Kueue then removes the gate and the scheduler places the pod as usual.

The current spawner has no awareness of this: it only sees a pod that has not reached the Running phase, and it either times out or shows the user a blank progress screen with no explanation of why their server is waiting.
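To make the gated state concrete, here is a minimal sketch of how a pod held by a scheduling gate could be recognised from its manifest. It relies on two standard Kubernetes fields, `spec.schedulingGates` and the `PodScheduled` condition with reason `SchedulingGated`. The helper name `is_scheduling_gated` is hypothetical and not part of KubeSpawner, and the gate name in the usage note is an assumption about what Kueue sets.

```python
def is_scheduling_gated(pod: dict) -> bool:
    """Return True if a pod manifest (as a plain dict) is held by a scheduling gate.

    Hypothetical helper for illustration; not an existing KubeSpawner function.
    """
    # A non-empty spec.schedulingGates list means the scheduler will not
    # consider the pod yet.
    spec = pod.get("spec") or {}
    if spec.get("schedulingGates"):
        return True

    # The API server also reports the gate through the PodScheduled condition.
    status = pod.get("status") or {}
    for cond in status.get("conditions") or []:
        if (
            cond.get("type") == "PodScheduled"
            and cond.get("status") == "False"
            and cond.get("reason") == "SchedulingGated"
        ):
            return True
    return False
```

For example, a manifest with `"schedulingGates": [{"name": "kueue.x-k8s.io/admission"}]` (gate name assumed here) would be reported as gated, while a pod without gates or with `PodScheduled=True` would not.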

Kueue pod lifecycle

stateDiagram-v2
    direction LR
    [*] --> Pending : pod created (SchedulingGate added by Kueue)
    Pending --> SchedulingGated : waiting for ClusterQueue admission
    SchedulingGated --> PodScheduled : Kueue admits workload (gate removed)
    PodScheduled --> Running : node assigned, containers starting
    Running --> [*]

    note right of SchedulingGated
        Current spawner sees this as
        "stuck" and may time out.
        User sees a blank wait screen.
    end note

    note right of Running
        KueuePodLifecycleHandler
        logs each transition and
        surfaces Workload events
        in the spawn progress UI.
    end note

Proposed design

A new lifecycle_handlers.py module uses the strategy pattern so that different scheduling systems can plug into the spawner without adding conditional branches to the core:

PodLifecycleHandler (default, no behaviour change)

  • create_pod(v1, namespace, body) — delegates to v1.create_namespaced_pod
  • wait_for_ready(v1, pod_name, namespace, timeout, log) — polls for Ready condition
  • get_extra_events(v1, pod_name, namespace, log) — returns []
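The default handler described above could look roughly like the sketch below. It is not the code from the linked branch: the method names and arguments come from the list above, while the polling interval, the `_is_ready` helper, and the timeout behaviour are assumptions made for illustration. `v1` is expected to behave like a `kubernetes_asyncio` `CoreV1Api` instance.

```python
import asyncio
import time


class PodLifecycleHandler:
    """Default strategy: create the pod and poll until it reports Ready."""

    poll_interval = 1.0  # seconds between polls (assumed value)

    async def create_pod(self, v1, namespace, body):
        # Plain delegation to the Kubernetes API, no scheduler-specific logic.
        return await v1.create_namespaced_pod(namespace=namespace, body=body)

    async def wait_for_ready(self, v1, pod_name, namespace, timeout, log):
        deadline = time.monotonic() + timeout
        while True:
            pod = await v1.read_namespaced_pod(name=pod_name, namespace=namespace)
            if self._is_ready(pod):
                return pod
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"pod {namespace}/{pod_name} not Ready after {timeout}s"
                )
            await asyncio.sleep(self.poll_interval)

    async def get_extra_events(self, v1, pod_name, namespace, log):
        # The default handler has nothing extra to surface.
        return []

    @staticmethod
    def _is_ready(pod):
        # Hypothetical helper: True when the pod has a Ready=True condition.
        conditions = (pod.status.conditions if pod.status else None) or []
        return any(c.type == "Ready" and c.status == "True" for c in conditions)
```

Because the handler only calls methods on the `v1` object it is given, it can be exercised in tests with a small fake client instead of a live cluster.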

KueuePodLifecycleHandler(queue_name=None)

  • Overrides wait_for_ready to log SchedulingGated → Scheduled transitions
  • Overrides get_extra_events to fetch Kueue Workload CR events, surfacing queue admission progress in the JupyterHub spawn UI
  • Falls back to standard Ready polling once the gate is lifted
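As a rough sketch of the transition logging described above, the following handler classifies the pod on each poll and logs only when the state changes. In the proposal this class would subclass PodLifecycleHandler; it is written standalone here so the example runs on its own. The `scheduling_state` helper, the state labels, and the log format are assumptions, not the code from the linked branch.

```python
import asyncio
import time


def scheduling_state(pod):
    """Classify a V1Pod-like object as SchedulingGated, Scheduled, Ready or Pending."""
    conditions = (pod.status.conditions if pod.status else None) or []
    by_type = {c.type: c for c in conditions}
    ready = by_type.get("Ready")
    if ready is not None and ready.status == "True":
        return "Ready"
    scheduled = by_type.get("PodScheduled")
    if scheduled is not None:
        if scheduled.status == "True":
            return "Scheduled"
        if getattr(scheduled, "reason", None) == "SchedulingGated":
            return "SchedulingGated"
    return "Pending"


class KueuePodLifecycleHandler:
    """Standalone sketch; the proposal subclasses PodLifecycleHandler instead."""

    poll_interval = 1.0  # seconds between polls (assumed value)

    def __init__(self, queue_name=None):
        self.queue_name = queue_name

    async def wait_for_ready(self, v1, pod_name, namespace, timeout, log):
        deadline = time.monotonic() + timeout
        previous = None
        while True:
            pod = await v1.read_namespaced_pod(name=pod_name, namespace=namespace)
            state = scheduling_state(pod)
            if state != previous:
                # Log each transition once, e.g. SchedulingGated -> Scheduled.
                log.info(
                    "Pod %s/%s: %s -> %s (queue=%s)",
                    namespace, pod_name, previous or "created", state,
                    self.queue_name,
                )
                previous = state
            if state == "Ready":
                return pod
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"pod {namespace}/{pod_name} still {state} after {timeout}s"
                )
            await asyncio.sleep(self.poll_interval)
```

Once the gate is lifted the loop is ordinary Ready polling, so the only behavioural difference from the default handler in this sketch is the extra log lines while the pod is gated.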

Two new KubeSpawner traitlets to wire it in:

c.KubeSpawner.kueue_enabled = True       # default: False
c.KubeSpawner.kueue_queue_name = "user-queue"

When kueue_enabled=False (default), PodLifecycleHandler is used — zero Kueue code runs and behaviour is identical to today.
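The issue does not spell out how kueue_queue_name reaches Kueue. One plausible mechanism, assumed here, is the `kueue.x-k8s.io/queue-name` label that Kueue reads on pods to pick a LocalQueue. The sketch below applies that label to a pod manifest only when the feature is switched on; the function name `apply_kueue_queue` is hypothetical.

```python
import copy

# Label Kueue uses to associate a pod with a LocalQueue.
QUEUE_NAME_LABEL = "kueue.x-k8s.io/queue-name"


def apply_kueue_queue(body, kueue_enabled=False, kueue_queue_name=None):
    """Return a pod manifest (dict) labelled for Kueue, or the input unchanged.

    Hypothetical helper: with kueue_enabled=False the manifest is returned
    untouched, mirroring the "no behaviour change" default.
    """
    if not kueue_enabled or not kueue_queue_name:
        return body

    patched = copy.deepcopy(body)  # never mutate the caller's manifest
    metadata = patched.setdefault("metadata", {})
    labels = metadata.get("labels") or {}
    labels[QUEUE_NAME_LABEL] = kueue_queue_name
    metadata["labels"] = labels
    return patched
```

Keeping the disabled path as an early return makes it easy to verify that existing deployments get byte-for-byte the same pod manifest as before.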

No new dependencies

Everything uses kubernetes_asyncio, which is already required. We never import a Kueue SDK; Workload events are fetched via the standard CoreV1 events API.
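A minimal sketch of fetching Workload events through the core events API, using a field selector on the involved object. `v1` is assumed to be a `kubernetes_asyncio` `CoreV1Api` instance. The function name, the returned dict shape, and the `workload_name` parameter are illustrative assumptions; how the Workload name is derived from the pod is not covered by the issue and is left to the caller.

```python
async def get_workload_events(v1, workload_name, namespace, log=None):
    """List events whose involved object is the given Kueue Workload.

    Hypothetical helper: returns a list of {"reason", "message"} dicts in
    the order the API returned them.
    """
    # Events for any object kind, including custom resources, can be
    # filtered with involvedObject.* field selectors.
    selector = (
        f"involvedObject.kind=Workload,involvedObject.name={workload_name}"
    )
    result = await v1.list_namespaced_event(
        namespace=namespace, field_selector=selector
    )
    events = [
        {"reason": event.reason, "message": event.message}
        for event in result.items
    ]
    if log is not None:
        log.debug("Found %d Workload events for %s", len(events), workload_name)
    return events
```

Since only the core events endpoint is queried, no custom-resource client or Kueue SDK is involved, which matches the "no new dependencies" goal.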

Implementation

A working implementation is on the feat/kueue-lifecycle-handlers branch of kaddynator/kubespawner. Happy to open a PR if this direction looks reasonable to maintainers.
