Automatic horizontal and vertical scaling of Hetzner worker nodes via the Kubernetes Cluster Autoscaler with the Hetzner Cloud provider.
KSail natively installs and manages the Cluster Autoscaler when
spec.cluster.autoscaler.node.enabled: true is set in the ksail config.
KSail (static baseline)
├── 3 control planes (cx33, 4 vCPU / 8 GB, never autoscaled)
└── 3 static workers (cx33, 4 vCPU / 8 GB, guaranteed minimum, Longhorn storage nodes)
Cluster Autoscaler (dynamic workers, managed by KSail)
├── Pool: autoscale-cx23 → 0-4 × CX23 (2 vCPU, 4 GB)
├── Pool: autoscale-cx33 → 0-4 × CX33 (4 vCPU, 8 GB)
├── Pool: autoscale-cx43 → 0-4 × CX43 (8 vCPU, 16 GB)
├── Pool: autoscale-cx53 → 0-4 × CX53 (16 vCPU, 32 GB)
├── maxNodesTotal: 10 (total cluster nodes incl. baseline: 6 static + 4 autoscaler)
└── Expander: [LeastNodes, LeastWaste] (priority chain)
- Horizontal scaling: the autoscaler adds workers when pods are Pending due to insufficient resources, and removes underutilized workers after a configurable cooldown.
- Vertical scaling: multiple node pools with different server types. The expander is an ordered priority chain, [LeastNodes, LeastWaste] (upstream --expander=least-nodes,least-waste). LeastNodes runs first and keeps the pools that satisfy the pending pods with the fewest total new nodes, preferring the largest adequate type so a burst consolidates onto fewer, bigger servers; LeastWaste then breaks any tie by least idle CPU/memory. (Price is not supported on Hetzner: the cluster-autoscaler hcloud provider implements no pricing API, so KSail rejects it, and the autoscaler would crash on startup if it were set.) See cluster-autoscaler FAQ.
- KSail integration: KSail installs the Cluster Autoscaler Helm chart, generates the worker config secret (cluster-autoscaler-config), and manages the Talos snapshot lifecycle. Node pool configuration lives in ksail.prod.yaml, not in Flux manifests.
- Storage architecture: autoscaler nodes are compute-only (no Hetzner volume, no Longhorn storage). Static KSail workers have dedicated Hetzner volumes and serve as Longhorn storage nodes. Pods on autoscaler nodes access Longhorn PVCs via the CSI driver (network). The Hetzner Cluster Autoscaler does not support volume attachment.
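The expander chain can be illustrated with a small model. This is a simplified sketch, not the upstream cluster-autoscaler algorithm: it treats the pending pods as one aggregate CPU/memory request, ignores bin-packing and system-reserved resources, and the names `Pool`, `nodes_needed`, `wasted_fraction`, and `pick_pool` are hypothetical. Pool sizes are taken from the diagram above.

```python
from dataclasses import dataclass
from math import ceil
from typing import Optional


@dataclass(frozen=True)
class Pool:
    name: str
    vcpu: int
    mem_gb: int


# Pool sizes as listed in the architecture diagram.
POOLS = [
    Pool("autoscale-cx23", 2, 4),
    Pool("autoscale-cx33", 4, 8),
    Pool("autoscale-cx43", 8, 16),
    Pool("autoscale-cx53", 16, 32),
]


def nodes_needed(pool: Pool, cpu: float, mem_gb: float) -> int:
    """Nodes of this type needed to cover the aggregate request (simplified)."""
    return max(ceil(cpu / pool.vcpu), ceil(mem_gb / pool.mem_gb), 1)


def wasted_fraction(pool: Pool, cpu: float, mem_gb: float) -> float:
    """Idle CPU fraction plus idle memory fraction after the scale-up."""
    n = nodes_needed(pool, cpu, mem_gb)
    return (1 - cpu / (n * pool.vcpu)) + (1 - mem_gb / (n * pool.mem_gb))


def pick_pool(pools, cpu: float, mem_gb: float, max_per_pool: int = 4) -> Optional[Pool]:
    """Apply the chain: LeastNodes filters first, LeastWaste breaks the tie."""
    options = [p for p in pools if nodes_needed(p, cpu, mem_gb) <= max_per_pool]
    if not options:
        return None  # no single pool can absorb the request within its max
    fewest = min(nodes_needed(p, cpu, mem_gb) for p in options)
    least_nodes = [p for p in options if nodes_needed(p, cpu, mem_gb) == fewest]
    return min(least_nodes, key=lambda p: wasted_fraction(p, cpu, mem_gb))


# 6 vCPU / 12 GB pending: cx43 and cx53 both need one node, cx43 wastes less.
print(pick_pool(POOLS, cpu=6, mem_gb=12).name)
```

In this model a 6 vCPU / 12 GB burst survives the LeastNodes step on both cx43 and cx53 (one node each), and LeastWaste then selects cx43 because it leaves less idle capacity.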
- Cluster Autoscaler detects Pending pods with unmet resource requests.
- It calls the Hetzner API to create a new server using:
  - HCLOUD_IMAGE: a Talos snapshot managed by KSail.
  - HCLOUD_CLOUD_INIT: base64-encoded Talos worker machine config generated by KSail.
- The server boots Talos, applies the machine config, and joins the cluster.
- Once the node is Ready, pending pods are scheduled.
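To make the HCLOUD_CLOUD_INIT step concrete, the sketch below shows the base64 round trip for a machine config. It is illustrative only: KSail generates the real value, the function names are hypothetical, and the sample string is a placeholder fragment rather than a complete Talos worker config.

```python
import base64


def encode_cloud_init(machine_config: str) -> str:
    """Base64-encode a machine config the way HCLOUD_CLOUD_INIT expects it."""
    return base64.b64encode(machine_config.encode("utf-8")).decode("ascii")


def decode_cloud_init(value: str) -> str:
    """Decode a value back to YAML, e.g. to inspect what new nodes will boot with."""
    return base64.b64decode(value).decode("utf-8")


# Placeholder fragment only; a real worker config is much longer.
sample = "version: v1alpha1\nmachine:\n  type: worker\n"

encoded = encode_cloud_init(sample)
print(encoded)
print(decode_cloud_init(encoded))
```

Decoding the stored value is a quick way to confirm that the config handed to new servers is the one you expect.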
All autoscaler configuration lives in ksail.prod.yaml under
spec.cluster.autoscaler.node:

    spec:
      cluster:
        autoscaler:
          node:
            enabled: true
            expander: [LeastNodes, LeastWaste]
            maxNodesTotal: 10
            scaleDownUnneededTime: "10m"
            pools:
              - name: autoscale-cx23
                serverType: cx23
                location: fsn1
                min: 0
                max: 4
              - name: autoscale-cx33
                serverType: cx33
                location: fsn1
                min: 0
                max: 4
              - name: autoscale-cx43
                serverType: cx43
                location: fsn1
                min: 0
                max: 4
              - name: autoscale-cx53
                serverType: cx53
                location: fsn1
                min: 0
                max: 4

| Field | Default | Description |
|---|---|---|
| enabled | false | Enable/disable node autoscaling |
| expander | LeastWaste | Node selection strategy: LeastWaste, LeastNodes, Random (Price is unsupported on Hetzner: no pricing API). Accepts a single value or an ordered priority chain, e.g. [LeastNodes, LeastWaste] (requires KSail expander-list support; scalar-only up to v7.57.0) |
| maxNodesTotal | 0 (unlimited) | Hard ceiling on total cluster nodes, including the static baseline (see ksail#5017) |
| scaleDownUnneededTime | 10m | Time before an underutilized node is eligible for removal |
| pools[].name | - | DNS-1123 pool identifier |
| pools[].serverType | - | Hetzner server type (e.g., cx23, cx33) |
| pools[].location | - | Hetzner datacenter (e.g., fsn1) |
| pools[].min | - | Minimum nodes in pool |
| pools[].max | - | Maximum nodes in pool |
- Hard max per pool: pools[].max caps each pool independently. Set to 4 so any single CX type can serve a full burst (e.g. 4× cx23) instead of forcing larger types; the maxNodesTotal total ceiling still caps the autoscaler at 4 (10 − 6 baseline), so this changes only the type distribution, never the node count.
- Total node ceiling: maxNodesTotal caps the total cluster node count, including the static baseline. Set to 10 (6 static + up to 4 autoscaler). It is passed straight to cluster-autoscaler's --max-nodes-total, so the runtime already enforces this total.
- serverLimit (spec.provider.hetzner.serverLimit): the Hetzner account hard cap (10). Under the in-progress KSail change (ksail#5017) the autoscaler validation becomes maxNodesTotal ≤ serverLimit (10 ≤ 10); until it ships, the old validation (CP + workers + min(maxNodesTotal, Σ pool.max)) rejects this config, so the KSail change must land first.
- Expander: [LeastNodes, LeastWaste] (current) is an ordered priority chain. LeastNodes runs first and keeps the pools that scale up with the fewest total new nodes (preferring the largest adequate type to keep the node count down); LeastWaste then breaks any remaining tie by least idle CPU/memory. The list form needs KSail expander-list support: releases up to v7.57.0 are scalar-only and reject a list, so the pinned KSAIL_VERSION must be bumped to a release that ships it first. (Price is unsupported on Hetzner.)
- Scale-down: underutilized nodes are removed after 10 minutes.
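The limit arithmetic above can be checked with a few lines. This is a sketch under stated assumptions: the function names are hypothetical, the formulas are transcribed from the bullets rather than from KSail's source, and it assumes the old validation compares its total against serverLimit.

```python
from typing import List


def autoscaler_headroom(max_nodes_total: int, static_nodes: int, pool_maxes: List[int]) -> int:
    """Autoscaled nodes that can actually be added on top of the static baseline."""
    pool_capacity = sum(pool_maxes)
    if max_nodes_total == 0:  # 0 means unlimited, so only the pool maxes apply
        return pool_capacity
    return max(0, min(max_nodes_total - static_nodes, pool_capacity))


def old_validation_total(control_planes: int, workers: int,
                         max_nodes_total: int, pool_maxes: List[int]) -> int:
    """Old check: CP + workers + min(maxNodesTotal, sum of pool.max)."""
    return control_planes + workers + min(max_nodes_total, sum(pool_maxes))


def new_validation_ok(max_nodes_total: int, server_limit: int) -> bool:
    """New check (ksail#5017): maxNodesTotal <= serverLimit."""
    return max_nodes_total <= server_limit


pool_maxes = [4, 4, 4, 4]  # four pools, max 4 each
server_limit = 10

print(autoscaler_headroom(10, 6, pool_maxes))        # 4
print(old_validation_total(3, 3, 10, pool_maxes))    # 16, which exceeds 10
print(new_validation_ok(10, server_limit))           # True
```

The old formula counts the baseline twice for this config (once explicitly, once inside maxNodesTotal), which is why it arrives at 16 and rejects a setup that never exceeds 10 nodes at runtime.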
Add a new entry to the pools list in ksail.prod.yaml and run
ksail cluster update. KSail updates the Helm release automatically.
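Before adding a pool, a quick local sanity check of the new entry can catch obvious mistakes. The sketch below is not KSail's validation: `check_pool` is a hypothetical helper, and it assumes the "DNS-1123 pool identifier" rule means a DNS-1123 label (lowercase alphanumerics and hyphens, at most 63 characters, starting and ending with an alphanumeric).

```python
import re
from typing import Dict, List

# DNS-1123 label pattern as used for Kubernetes object names.
DNS1123_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def check_pool(pool: Dict) -> List[str]:
    """Return a list of problems found in one pools[] entry (empty if none)."""
    problems = []
    name = pool.get("name", "")
    if len(name) > 63 or not DNS1123_LABEL.fullmatch(name):
        problems.append(f"name {name!r} is not a valid DNS-1123 label")
    for key in ("serverType", "location"):
        if not pool.get(key):
            problems.append(f"{key} is required")
    lo, hi = pool.get("min"), pool.get("max")
    if not (isinstance(lo, int) and isinstance(hi, int) and 0 <= lo <= hi):
        problems.append("min and max must be integers with 0 <= min <= max")
    return problems


good = {"name": "autoscale-cx23", "serverType": "cx23", "location": "fsn1", "min": 0, "max": 4}
bad = {"name": "Autoscale_CX23", "serverType": "cx23", "location": "", "min": 5, "max": 4}

print(check_pool(good))  # []
print(check_pool(bad))
```

KSail applies its own checks when ksail cluster update runs, so this is only a convenience for catching typos early.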
# Check autoscaler logs
kubectl -n kube-system logs -l app.kubernetes.io/name=cluster-autoscaler --tail=100
# Check for unschedulable pods
kubectl get pods -A --field-selector=status.phase=Pending
# Check autoscaler status ConfigMap
kubectl -n kube-system get cm cluster-autoscaler-status -o yaml

# Check if the Hetzner server was created
hcloud server list --selector cluster.autoscaler.nodeGroupLabel
# Check Talos bootstrap status (if server IP is known)
talosctl -n <node-ip> health
# Verify the machine config is valid
talosctl validate --config worker.yaml --mode cloud

After a full cluster rebuild (ksail cluster delete + create):
- KSail regenerates the worker config secret automatically.
- KSail manages the Talos snapshot lifecycle — no manual snapshot creation is needed.
When bumping the Talos (or Kubernetes) version in ksail.prod.yaml, the
deploy's ksail cluster update brings both node classes onto the new
baseline:
- KSail's snapshot lifecycle manager creates or updates the Talos snapshot automatically during cluster update, and the worker machine config is regenerated to match, so new autoscaler nodes boot the new version.
- The static control planes and workers are upgraded in place (rolling).
- Existing autoscaler nodes are recycled automatically so they follow the new baseline instead of drifting on the old version: after the refreshed cluster-autoscaler is ready, KSail cordons and drains each autoscaler node one at a time (via the Kubernetes eviction API, honoring PodDisruptionBudgets) and deletes its Hetzner server; the autoscaler then re-provisions any still-needed capacity from the new snapshot on demand. This runs only when the version actually changes: a no-op cluster update leaves autoscaler nodes untouched. See the KSail Autoscaler Node Upgrades docs.
A strict PodDisruptionBudget on a workload running on autoscaler nodes can slow or block the drain (the update fails rather than force-evicting), so keep PDBs realistic for compute-only/burstable workloads.
Hetzner periodically renames or retires server types. Check the
Hetzner Cloud changelog and update
the pools[].serverType values in ksail.prod.yaml.