Last updated: 2026-06-02
Operator-facing map of production: traffic path, pod placement, storage, HA failover, and the reasoning behind every architectural decision.
Related docs: admin-tools-private-access.md · rejourney-ci.md · postgres-backup-and-restore.md · clickhouse-api-endpoint-daily-stats-migration.md
| Hetzner / k8s name | Type | DC | vCPU | RAM | Role |
|---|---|---|---|---|---|
| `rejourney-fsn1-1` | CPX52 | FSN1 (Falkenstein) | 12 | 24 GB | API · ingress · primary DB · monitoring |
| `rejourney-hel1-worker-1` | CX43 | HEL1 (Helsinki) | 8 | 16 GB | Bulk workers · standby DB |
| `rejourney-hel1-quorum-1` | CX43 | HEL1 (Helsinki) | 8 | 16 GB | Bulk workers · etcd quorum · excluded from LB |
Hetzner server names and k8s node names are the same (aligned 2026-04-27). Never rename a Hetzner server without also changing node-name in /etc/rancher/k3s/config.yaml and rebuilding all PV nodeAffinity entries — mismatches cause CCM network-unavailable taints and block pod scheduling cluster-wide.
FSN1 ↔ HEL1 RTT: ~25ms. Every serial cross-DC call adds 25ms. API handlers make 5–10 serial DB calls — a HEL1 API pod adds 125–250ms of pure wire overhead per request.
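
As a rough illustration of the arithmetic above, the following sketch multiplies the number of serial DB calls by the cross-DC round-trip time. The function name and the constant are hypothetical, and the model assumes every call is strictly serial and crosses the link exactly once.

```typescript
// Illustrative only: estimates pure wire overhead when every DB call is
// serial and each one crosses the FSN1 <-> HEL1 link once.
const CROSS_DC_RTT_MS = 25; // approximate RTT quoted in this document

function crossDcOverheadMs(
  serialCalls: number,
  rttMs: number = CROSS_DC_RTT_MS,
): number {
  if (!Number.isInteger(serialCalls) || serialCalls < 0) {
    throw new RangeError("serialCalls must be a non-negative integer");
  }
  return serialCalls * rttMs;
}

// crossDcOverheadMs(5)  === 125
// crossDcOverheadMs(10) === 250
```

Parallel or pipelined calls would lower the real figure, so treat this as an upper bound for a handler that awaits each query in turn.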
All nodes carry rejourney.co/datacenter=fsn1|hel1. New nodes must get this label on join.
```mermaid
graph TD
    USER[Browser / Mobile SDK]
    CF["Cloudflare\nDNS · TLS · WAF"]
    LB["Hetzner Load Balancer\nFSN1 · private-IP backends\nexternalTrafficPolicy: Local"]

    subgraph FSN1["FSN1 (Falkenstein) · rejourney-fsn1-1"]
        TR["Traefik\n(sole ingress in normal operation)"]
        APIING["api-ingest\n/api/ingest/* · /api/sdk/config\ncolocated with primary"]
        FSN1_SVC["ingest-upload · web · monitoring\npostgres primary · redis master\npgbouncer (rw)"]
    end

    subgraph HEL1["HEL1 (Helsinki) · worker-1 + quorum-1"]
        APIDASH["api-dashboard\neverything else\nuses dbRead for analytics"]
        WORKERS["ingest-worker · replay-worker\nalert-worker · session-lifecycle-worker\nretention-worker · stripe-sync-worker"]
        STANDBY["postgres standby · redis replicas\npgbouncer (rw replicas)\npgbouncer-ro → standby"]
    end

    S3["S3-compatible object storage\nHetzner · OVH · Scaleway via storage_endpoints"]
    Backups["OVH Object Storage\nPostgres WAL/base · ClickHouse backups"]

    USER --> CF
    CF --> LB
    LB -->|"ETP:Local → FSN1 nodeport only"| TR
    TR -->|"/api/ingest/* · /api/sdk/config"| APIING
    TR -->|"everything else"| APIDASH
    APIING --> FSN1_SVC
    APIDASH -->|"writes (rare): pgbouncer → primary"| FSN1_SVC
    APIDASH -->|"reads (heavy): pgbouncer-ro → standby"| STANDBY
    FSN1_SVC -->|"WAL stream · sync replication"| STANDBY
    FSN1_SVC --> S3
    FSN1_SVC -->|"WAL/base backups"| Backups
    WORKERS --> FSN1_SVC
    WORKERS --> S3

    classDef dc fill:#1a2a3a,stroke:#4fc3f7,color:#fff
    class FSN1,HEL1 dc
```
Public path: Internet → Cloudflare → Hetzner LB (FSN1) → Traefik → backends
Admin path: Tailscale tailnet → SSH / kubectl / port-forward over 100.x addresses. Admin UIs (Grafana, Traefik dashboard) are not public.
```mermaid
graph LR
    subgraph fsn1["rejourney-fsn1-1 · CPX52 · 12 vCPU / 24 GB · FSN1"]
        direction TB
        TR0["Traefik ×1\npreferred FSN1"]
        APIING["api-ingest ×3–6\nHPA · required affinity to primary\nall serial DB calls local"]
        WEB0["web ×1"]
        U0["ingest-upload ×2\nHPA fixed"]
        SLC["session-lifecycle-worker ×5\npreferred FSN1"]
        PG1["postgres-local-1\nCNPG primary"]
        RD0["redis-node-0\nSentinel master"]
        PGB0["pgbouncer (rw)\npool 60"]
        MON["victoria-metrics · grafana\ngatus · pushgateway\nnode-exporter · cadvisor\nkube-state-metrics · postgres-exporter"]
    end

    subgraph worker1["rejourney-hel1-worker-1 · CX43 · 8 vCPU / 16 GB · HEL1"]
        direction TB
        APIDASH1["api-dashboard ×1\nHPA 2–5 · preferred HEL1\nuses dbRead for analytics"]
        IW1["ingest-worker ×bulk\nHPA 4–6 · preferred HEL1"]
        RW1["replay-worker ×bulk\nHPA 1–10 · preferred HEL1"]
        U1["ingest-upload overflow only if rescheduled"]
        PG2["postgres-local-2\nCNPG sync standby"]
        RD2["redis-node-2\nSentinel replica"]
        PGB1["pgbouncer (rw)\npool 60"]
        PGBRO1["pgbouncer-ro ×1\n→ postgres-local-ro\n(standby only)"]
    end

    subgraph quorum1["rejourney-hel1-quorum-1 · CX43 · 8 vCPU / 16 GB · HEL1 · excluded from LB"]
        direction TB
        APIDASH2["api-dashboard ×1\nHPA 2–5 · preferred HEL1"]
        IW2["ingest-worker ×bulk\nHPA 4–6 · preferred HEL1"]
        RW2["replay-worker ×bulk\nHPA 1–10 · preferred HEL1"]
        AW["alert-worker ×1\npreferred HEL1"]
        WEB1["web ×1"]
        RD1["redis-node-1\nSentinel replica"]
        PGB2["pgbouncer (rw)\npool 60"]
        PGBRO2["pgbouncer-ro ×1\n→ postgres-local-ro"]
        CRON["retention-worker CronJob\nstripe-sync-worker CronJob"]
        ETCD["etcd quorum voter"]
    end

    classDef pinned fill:#1a3a1a,stroke:#4caf50,color:#fff
    classDef dash fill:#1f2a3f,stroke:#7e8aff,color:#fff
    classDef data fill:#1a1a3a,stroke:#5c6bc0,color:#fff
    classDef worker fill:#2a1a1a,stroke:#ef5350,color:#fff
    classDef ingress fill:#1a2a2a,stroke:#26c6da,color:#fff
    classDef mon fill:#2a2a1a,stroke:#ffa726,color:#fff
    class APIING pinned
    class APIDASH1,APIDASH2 dash
    class PG1,PG2,PGB0,PGB1,PGB2,PGBRO1,PGBRO2,RD0,RD1,RD2 data
    class IW1,IW2,RW1,RW2,AW,SLC,CRON worker
    class TR0 ingress
    class MON mon
```
Color key:
- Green: `api-ingest` pods (SDK traffic only). Required pod affinity to the CNPG primary node, preferred FSN1. All serial DB calls stay local. Isolated from dashboard traffic so a slow dashboard aggregation can never starve the SDK event loop.
- Indigo: `api-dashboard` pods (operator UI + everything else). Preferred HEL1 next to the read replica. Writes go to the primary via pgbouncer (rw); heavy reads (`dbRead`) go to the standby via pgbouncer-ro. Reads stay local-DC; writes cross to FSN1, which is acceptable because writes are rare on this path.
- Blue: Data. CNPG primary + Redis master on FSN1, standby + replicas on HEL1. pgbouncer (rw) on all three nodes; pgbouncer-ro only on HEL1 (next to standby).
- Red: Workers. ingest/replay prefer HEL1 and fall back to FSN1 only when HEL1 is full; session-lifecycle prefers FSN1 because event rollup is DB-write heavy. DB latency is acceptable for async processing; ingest/replay writes use `SET LOCAL synchronous_commit = local` to skip the 25ms SyncRep wait.
- Cyan: Ingress. Traefik single replica on FSN1. quorum-1 excluded from LB entirely.
- Orange: Monitoring. All on FSN1 for simplicity. Goes dark if FSN1 fails, which is an accepted gap.
```mermaid
graph LR
    subgraph fsn1_1["rejourney-fsn1-1 · existing"]
        direction TB
        TR1["Traefik replica-1"]
        API1["api-ingest ×all\nHPA · colocated with primary"]
        PG1F["postgres-local-1\nCNPG primary"]
        RD0F["redis-node-0"]
        PGB0F["pgbouncer (rw)"]
        MONF["monitoring stack"]
    end

    subgraph fsn1_2["rejourney-fsn1-2 · NEW CPX41 or CX32"]
        direction TB
        TR2["Traefik replica-2"]
        PGB3["pgbouncer (rw) replica-4"]
    end

    subgraph hel1["HEL1: worker-1 + quorum-1 · unchanged"]
        direction TB
        APIDASHF["api-dashboard ×2\nunchanged, HEL1"]
        WK["bulk workers preferred HEL1\nsession-lifecycle preferred FSN1"]
        DB2["postgres standby · redis replicas\npgbouncer (rw) replicas\npgbouncer-ro replicas"]
    end

    classDef new fill:#1a3a2a,stroke:#66bb6a,color:#fff
    classDef existing fill:#1a3a1a,stroke:#4caf50,color:#fff
    classDef hel fill:#2a1a1a,stroke:#ef5350,color:#fff
    class fsn1_2,TR2,PGB3 new
    class fsn1_1,TR1,API1,PG1F,RD0F,PGB0F,MONF existing
    class hel1,APIDASHF,WK,DB2 hel
```
See Compute Scaling Plan for the exact steps.
- FSN1 location, round-robin across `fsn1` and `worker-1` backends. quorum-1 excluded via `node.kubernetes.io/exclude-from-external-load-balancers: "true"`.
- Uses private IPs; traffic stays inside the Hetzner private network.
- `externalTrafficPolicy: Local` on the Traefik service. Without this, kube-proxy could VXLAN-forward to the other DC's Traefik pod before the request hits Traefik, adding an invisible 25ms hop.
- 1 replica, preferred FSN1. With ETP:Local, the LB health check on `worker-1`'s nodeport returns unhealthy when no Traefik pod is there, so the LB routes 100% to FSN1 automatically. On FSN1 failure, Traefik reschedules to `worker-1` (~90s) and the LB detects it healthy.
- Required to exclude nodes with `node.kubernetes.io/exclude-from-external-load-balancers` (quorum-1).
- Trusts Cloudflare IP ranges for real-IP passthrough on both entry points.
- Middlewares: `https-redirect`, `http-www-redirect`, `www-redirect`, `security-headers`, `rate-limit-api` (1 000 req/min, burst 5 000), `rate-limit-ingest` (20 000 req/min, burst 40 000).
- Metrics on a separate `metrics` entry point, scraped by VictoriaMetrics.
Background: a single api deployment served both SDK ingest (~500 req/min/pod) and the operator dashboard (~10 req/min/pod). Same Node.js event loop, same pgbouncer pool. A heavy dashboard aggregation could block ingest writes behind it, and an ingest spike could push dashboard clicks past 3s. Split into two deployments running the same image with different routing and placement (May 2026).
Both deployments expose port 3000 and load the same Express app — the difference is purely traffic routing at the ingress and pod placement.
A backward-compat api Service still exists and aliases api-ingest so anything inside the cluster that hardcodes api:3000 (gatus, monitoring, etc.) keeps working.
- Routed by ingress for `/api/ingest/*` and `/api/sdk/config` on `api.rejourney.co`, and everything on `ingest.rejourney.co` except `/upload`.
- Required pod affinity to the current CNPG primary hostname (`cnpg.io/cluster=postgres-local,cnpg.io/instanceRole=primary`). Ingest response time regresses sharply when serial DB calls cross nodes/DCs, so new ingest pods must schedule on the same node as the writable Postgres pod.
- Preferred FSN1, weight 100 (`rejourney.co/datacenter=fsn1`). Secondary to primary-node colocation; preserves normal placement while `postgres-local-1` is primary.
- No `topologySpreadConstraints`. Tested: `maxSkew:1 ScheduleAnyway` overrides a weight-80 preference and spreads pods to HEL1, causing 6–11s p50.
- HPA `api-ingest`: min 3, max 6, target 65% CPU. Min 3 fits entirely on FSN1 in normal operation.
- Continuous colocation guard (`api-postgres-colocator` CronJob): every minute, checks the CNPG primary node and evicts at most one healthy `api-ingest` pod if it is not colocated. It only acts when the deployment is fully ready and uses the Eviction API, so the PDB keeps one-at-a-time movement. The CronJob's `API_DEPLOYMENT` env is set to `api-ingest`.
- Post-deploy colocation check in CI (`pin_deployment_to_postgres_primary api-ingest` in `scripts/k8s/deploy-release.sh`): after every rollout, evicts any `api-ingest` pods not on the CNPG primary node one at a time and waits for replacements. Fixes rollout-time drift immediately; the CronJob handles later CNPG primary movement.
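
The decision rule of the colocation guard described above can be sketched as a pure function. This is not the CronJob's actual code: the `PodInfo` shape, the `pickPodToEvict` name, and the idea of passing in a pre-fetched pod list are all assumptions made for illustration. The real job reads pod and node state from the Kubernetes API and performs the eviction through the Eviction API.

```typescript
// Hypothetical pod summary; the real job would derive this from the Kubernetes API.
interface PodInfo {
  name: string;
  nodeName: string;
  ready: boolean;
}

// Returns the name of at most one pod to evict, or null when nothing should move.
function pickPodToEvict(
  pods: PodInfo[],
  primaryNode: string,
  desiredReplicas: number,
): string | null {
  const readyPods = pods.filter((p) => p.ready);
  // Only act when the deployment is fully ready, so an in-progress rollout
  // or a pod that is still starting is never disturbed.
  if (pods.length !== desiredReplicas || readyPods.length !== desiredReplicas) {
    return null;
  }
  // Evict a single misplaced pod per run; the next run handles the next one.
  for (const pod of readyPods) {
    if (pod.nodeName !== primaryNode) {
      return pod.name;
    }
  }
  return null;
}
```

Returning a single pod per invocation mirrors the one-at-a-time movement the PDB is meant to guarantee, so even a full primary move is corrected gradually over several runs.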
- Routed by ingress for everything else on `api.rejourney.co` (`/`, dashboard, analytics, auth, webhooks, etc.).
- Preferred HEL1, weight 100: colocated with the Postgres read replica so heavy analytics aggregations are local-DC.
- `podAntiAffinity` (preferred) on hostname: spreads the two replicas across different HEL1 nodes so a single-node loss doesn't take both pods. This is an intentional scheduling preference, not a sign that one HEL1 node ran out of room; Kubernetes may still put both pods on one node if capacity forces it.
- No required affinity to the primary: most queries on this deployment use `dbRead` (read replica). The occasional write (login, settings, Stripe webhook) goes through `pgbouncer` (rw) → primary, which adds a ~25ms cross-DC hop that is acceptable at that frequency.
- HPA `api-dashboard`: min 2, max 5, target 60% CPU.
- Sets the `DATABASE_URL_READ` env var (interpolated from `POSTGRES_*` secrets; those keys must be declared before `DATABASE_URL_READ` in the env list because Kubernetes `$(VAR)` interpolation only resolves vars listed earlier in the same container).
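
To make the HPA bounds quoted for `api-ingest` and `api-dashboard` concrete, here is a minimal sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler, `ceil(currentReplicas × currentMetric / targetMetric)`, clamped to the configured min and max. The function name is hypothetical, and the sketch deliberately ignores the controller's tolerance band and stabilization windows.

```typescript
// Simplified HPA replica calculation: ratio scaling, then clamp to [min, max].
// Ignores the tolerance band and scale-down stabilization applied by the real controller.
function desiredReplicas(
  currentReplicas: number,
  currentCpuPct: number,
  targetCpuPct: number,
  min: number,
  max: number,
): number {
  if (targetCpuPct <= 0) {
    throw new RangeError("targetCpuPct must be positive");
  }
  const raw = Math.ceil((currentReplicas * currentCpuPct) / targetCpuPct);
  return Math.min(max, Math.max(min, raw));
}

// api-dashboard (min 2, max 5, target 60%): 2 pods at 90% CPU -> 3 pods.
// api-ingest   (min 3, max 6, target 65%): 3 pods at 200% CPU -> capped at 6.
```

Because the input is CPU utilization, a pod that spends most of its time waiting on the network reports low CPU and the formula does not ask for more replicas, regardless of how much work is queued.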
- `backend/src/db/client.ts` exports `db` (primary) and `dbRead`.
- `dbRead` uses `DATABASE_URL_READ` if set, otherwise falls back to `db`. Writes via `dbRead` would fail at the standby with `cannot execute INSERT in a read-only transaction`; that's the intentional guardrail.
- `api-ingest` does not set `DATABASE_URL_READ`, so its `dbRead` aliases `db`. Tests, local dev, and the legacy `api` Service backers all use the same fallback path.
- Today only the two heaviest aggregations in `dashboardOverview.ts` (`loadUserFirstSeenMap`, `loadTopUsersPreview`) use `dbRead`. Other dashboard reads are still on `db`; incremental migration is safe to do later.
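
The fallback rule for the read client can be sketched without any database driver. This is not the contents of `backend/src/db/client.ts`: the `resolveDbUrls` helper and its return shape are hypothetical, and treating an empty `DATABASE_URL_READ` the same as an unset one is an assumption.

```typescript
type Env = Record<string, string | undefined>;

interface DbUrls {
  primary: string;
  read: string;
  readIsReplica: boolean; // false means dbRead effectively aliases db
}

// Picks connection URLs for the primary and read clients.
// Assumption: an empty DATABASE_URL_READ counts as "not set".
function resolveDbUrls(env: Env): DbUrls {
  const primary = env["DATABASE_URL"];
  if (primary === undefined || primary === "") {
    throw new Error("DATABASE_URL is required");
  }
  const readUrl = env["DATABASE_URL_READ"];
  if (readUrl !== undefined && readUrl !== "") {
    return { primary, read: readUrl, readIsReplica: true };
  }
  return { primary, read: primary, readIsReplica: false };
}
```

With this shape, a deployment that omits `DATABASE_URL_READ` needs no code change: every `dbRead` query simply lands on the primary.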
- `ingest-worker` and `replay-worker` are preferred HEL1, weight 100. They fall back to FSN1 only when HEL1 is full.
- `session-lifecycle-worker` is fixed at 5 replicas, preferred FSN1. It hosts `rj-session-event-rollup` and `rj-session-effects`; keeping it close to the Postgres primary lets event/activity rollups catch up quickly after ingest storms.
- HPA: `ingest-worker` min 4 / max 6, `replay-worker` min 1 / max 10. Do not raise ingest max without checking lifecycle, Postgres, Redis, and replay headroom.
- IO-bound, not CPU-bound: HPA can undershoot during queue spikes (workers may sit below CPU limits while waiting on S3/DB round-trips). Queue depth is the primary signal.
- Workers are event-driven via BullMQ, with no SQL polling. Five queues are backed by the Redis Sentinel cluster: `rj-artifact-flush` (Redis-buffered relay uploads waiting for S3), `rj-ingest-artifacts` (cheap artifact ready/validation for events, crashes, ANRs), `rj-replay-artifacts` (screenshots, hierarchy), `rj-session-event-rollup` (per-session event metrics rollup), and `rj-session-effects` (debounced per-session reconcile/cache effects). Workers block on the queue and consume jobs as they arrive.
- Event artifact jobs now validate/summarize the payload, mark the artifact ready, and enqueue `rj-session-event-rollup`; the heavy event metrics/heatmap/ClickHouse work is serial per session so one large session cannot occupy the whole ingest worker pool. Crash/ANR jobs still process directly and enqueue `rj-session-effects` with a 15s debounce window. Replay artifacts still reconcile immediately to preserve replay readiness.
- `rj-session-event-rollup` coalesces to one job ID per session (`session-event-rollup-{sessionId}`) and uses a Redis dirty marker to avoid duplicate no-op queue storms while still catching events that arrive during an active rollup.
- The broad `queuePendingSessionEventRollups` Postgres recovery sweep is disabled by default (`RJ_SESSION_EVENT_ROLLUP_SWEEP_ENABLED` unset). Keep it off on the 20M-row production table unless the optional concurrent index in `backend/drizzle/manual/event-rollup-pending-index-concurrent.sql` has been built successfully.
- Do not broad-backfill old `recording_artifacts.event_rollup_requested_at IS NULL` rows. Historical nulls are expected legacy history and should not be interpreted as live backlog.
- BullMQ deduplication uses `jobId = artifact-{artifactId}`. Stalled jobs (worker died mid-process) are automatically re-queued after `stalledInterval = 30s`, up to `maxStalledCount = 3`.
- Retry policy: 5 attempts, exponential backoff starting at 1s. Failed jobs are kept in the failed set for 7 days (DLQ window). Completed jobs retained 1h for observability.
- Queue depth monitoring: `LLEN bull:rj-artifact-flush:wait`, `LLEN bull:rj-ingest-artifacts:wait`, `LLEN bull:rj-replay-artifacts:wait`, `LLEN bull:rj-session-event-rollup:wait`, and `LLEN bull:rj-session-effects:wait` in Redis. Artifact queues should be near zero in steady state; `rj-session-event-rollup` and `rj-session-effects` can briefly hold delayed per-session jobs during ingest bursts.
- `/health/queue` returns `503` whenever BullMQ has failed/DLQ jobs, including historical failed jobs retained during the 7-day DLQ window. Before treating it as a fresh outage, sample the newest failed job per queue and compare `failedOn` to the current deploy window.
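
For a sense of how long a failing job keeps retrying before it lands in the failed set, the sketch below lists the wait before each retry under the retry policy stated above. It assumes plain doubling from the base delay with no jitter, which is how an exponential backoff is usually defined; the helper name is hypothetical and the code does not call BullMQ.

```typescript
// Delay before each retry, assuming the delay doubles after every failed attempt.
// attempts counts the first try, so 5 attempts means at most 4 retries.
function retryDelaysMs(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(baseDelayMs * Math.pow(2, retry - 1));
  }
  return delays;
}

function totalBackoffMs(attempts: number, baseDelayMs: number): number {
  return retryDelaysMs(attempts, baseDelayMs).reduce((sum, d) => sum + d, 0);
}

// retryDelaysMs(5, 1000)  -> [1000, 2000, 4000, 8000]
// totalBackoffMs(5, 1000) -> 15000
```

Under these assumptions a job that fails every attempt spends about 15 seconds in backoff, plus its own processing time, before it becomes visible to `/health/queue` as a failed job.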
- HPA: min 2, max 2.
- Upload relay path: collect tiny artifact bodies, write `artifact:buf:{artifactId}` to Redis with a 30-minute TTL, mark `recording_artifacts.status='buffered'`, enqueue `rj-artifact-flush`, and return 204. S3 latency is moved out of the SDK request path.
- `alert-worker` is single-replica, preferred HEL1.
- `session-lifecycle-worker` is covered in the worker section above: fixed 5 replicas, preferred FSN1.
- CronJobs, preferred HEL1. Periodic, not latency-sensitive.
- 2 replicas, no affinity. Static/SSR, no DB calls.
- 3 replicas, one per node, pool 60 connections each. Total: 180 server connections, under `max_connections: 200`.
- All connect to `postgres-app-rw` (CNPG label selector `cnpg.io/instanceRole: primary`), which always resolves to the current primary after failover.
- `trafficDistribution: PreferClose`: kube-proxy routes to the local-node pgbouncer and falls back automatically.
- `required` anti-affinity on hostname: exactly one per node, always. Do not change to `preferred` and do not add `maxSurge: 1`. With 3 nodes, a surge pod has nowhere to go and deadlocks the rollout.
- Rolling update: `maxSurge: 0, maxUnavailable: 1`.
- When adding a node, bump replicas to match the new node count before deploying. Also raise `max_connections` if replicas × 60 approaches 200.
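
The connection budget above is simple multiplication, but it is worth checking before adding a node. The sketch below is illustrative: the names are hypothetical, the defaults come from the figures in this section, and it counts only pgbouncer (rw) server connections against the primary. It does not account for connections that bypass pgbouncer or for any slots Postgres reserves.

```typescript
interface PoolBudget {
  totalServerConnections: number;
  headroom: number; // negative when the pools would exceed max_connections
  fits: boolean;
}

// One pgbouncer (rw) replica per node, each with its own server-side pool.
function pgbouncerBudget(
  replicas: number,
  poolSize: number = 60,
  maxConnections: number = 200,
): PoolBudget {
  const totalServerConnections = replicas * poolSize;
  const headroom = maxConnections - totalServerConnections;
  return { totalServerConnections, headroom, fits: headroom >= 0 };
}

// 3 nodes -> 180 connections, 20 spare.
// 4 nodes -> 240 connections, which no longer fits under 200.
```

With a fourth node the default pool size already exceeds the limit, which is why `max_connections` has to be raised (or the pool size lowered) in the same change that bumps the replica count.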
- 2 replicas, preferred HEL1, pool 30 connections each. Fronts the Postgres standby for dashboard analytics reads.
- Connects to `postgres-local-ro`, not `postgres-app-ro`. See operational gotcha #14: `postgres-app-ro` is a custom alias in this cluster with a broken selector that matches both primary AND standby, so it round-robins between them. The CNPG-default `postgres-local-ro` correctly filters by `cnpg.io/instanceRole=replica`.
- `trafficDistribution: PreferClose`: `api-dashboard` pods on HEL1 hit the local pgbouncer-ro pod; an FSN1 pod is only used as a fallback.
- Soft anti-affinity on hostname: preferred spread, but doesn't block scheduling if HEL1 is full (one replica may end up on FSN1 during rollout; that's fine because the cross-DC hop only matters when HEL1 is unavailable).
- Used only by `api-dashboard` via the `DATABASE_URL_READ` env var → `dbRead` Drizzle client.
- 2 instances: `postgres-local-1` (primary, FSN1) + `postgres-local-2` (sync standby, worker-1).
- `synchronous_commit = remote_write`, `minSyncReplicas: 1`, `maxSyncReplicas: 1`. Adds ~25ms to write commits. `maxSyncReplicas: 1` means postgres degrades to async rather than blocking if standby is down.
- SyncRep is the write throughput ceiling. 33 concurrent SyncRep waits = 33 blocked connections. Ingest workers mitigate with `SET LOCAL synchronous_commit = local`. For any new write-heavy path: check `pg_stat_activity WHERE wait_event = 'SyncRep'` first.
- WAL/base backups are archived to OVH Object Storage (gzip WAL). See `postgres-backup-and-restore.md`.
- Storage: `rejourney-db-local-retain` (local-path, Retain). Data is on the node's local disk, not Hetzner cloud volumes. PVCs survive pod/cluster deletion. Standby + OVH WAL/base backups are the recovery paths.
ClickHouse is the analytics scale-out path for API endpoint telemetry. The old Postgres api_endpoint_daily_stats data table is no longer the runtime source; ClickHouse owns the API endpoint facts, imported history, and daily rollups. It is not a replacement for Postgres as the source of truth for sessions, recording artifacts, auth, billing, storage configuration, or ingest lifecycle state.
Production deployment remains gated for fresh clusters or disaster rebuilds:
- `DEPLOY_CLICKHOUSE=false` by default in `scripts/k8s/deploy-release.sh`, so normal CI deploys do not create ClickHouse or require `clickhouse-secret`.
- App flags default off: `CLICKHOUSE_ENABLED=false`, `CLICKHOUSE_DUAL_WRITE_ENABLED=false`, `CLICKHOUSE_READS_ENABLED=false`.
- All app/worker ClickHouse secret refs are `optional: true`, so the application can deploy before ClickHouse exists.
- `api-ingest` and SDK request handlers must never synchronously depend on ClickHouse. If ClickHouse is down, session capture must continue.
- `clickhouse-daily-backup` backs up the `rejourney` ClickHouse database to OVH Object Storage using the native ClickHouse `BACKUP ... TO S3(...)` command.
Topology when enabled:
| Component | Placement | Purpose |
|---|---|---|
| ClickHouse Keeper | 3 replicas, one voter per node | quorum for replicated ClickHouse tables |
| ClickHouse data | 1 shard, 2 replicas on HEL1 nodes | analytics facts and imported daily aggregates |
| `clickhouse-setup` Job | manual/explicit deploy step | creates `api_endpoint_request_events`, `api_endpoint_daily_stats_imported`, `api_endpoint_daily_rollups`, the rollup materialized view, and `schema_migrations` |
| `clickhouse-backfill-api-rollups` Job | manual or explicit deploy flag | rebuilds `api_endpoint_daily_rollups` from ClickHouse imported history plus raw request facts |
Why HEL1 for the data replicas: the first writer is ingest-worker, which already lives in HEL1 and processes artifacts asynchronously; the first reader is api-dashboard, which also lives in HEL1. Keeping ClickHouse data on HEL1 avoids loading the FSN1 Postgres primary node with analytical storage and merge work. This topology only makes sense because ClickHouse is outside the synchronous /api/ingest/* return path. The earlier FSN1 primary colocation incident showed what happens when ingest makes serial cross-DC calls: ingest latency can jump into seconds. Do not move any synchronous ingest write to a HEL1-only ClickHouse endpoint.
ClickHouse config note: the cluster profile sets prefer_column_name_to_alias=1. Keep it. It prevents ClickHouse from resolving a SELECT ... AS date alias inside a WHERE date ... predicate and was required after a production alias/type conflict on the daily trend query.
Resource impact expected after the full cutover:
- largest direct win: lower Postgres CPU, WAL, index churn, autovacuum work, and disk I/O from removing hot aggregate upserts on `api_endpoint_daily_stats`
- secondary win: lower Postgres buffer/cache pressure because `api_endpoint_daily_stats` and its indexes stop competing with transactional tables
- dashboard API endpoint analytics should become steadier under larger date ranges because ClickHouse handles grouped scans better than Postgres OLTP tables
- not a magic fix for `sessions` or `recording_artifacts` bloat; those remain Postgres source-of-truth tables until a separate lifecycle/archive design exists
- not a fix for synchronous ingest latency by itself; `api-ingest` must still colocate with the writable Postgres primary
Current rebuild/deploy model:
- Keep raw API fact writes in asynchronous artifact processing; `api-ingest` must not synchronously wait on ClickHouse.
- Run `clickhouse-setup` so the rollup table and materialized view exist.
- Run `clickhouse-backfill-api-rollups --replace` once to rebuild `api_endpoint_daily_rollups` from ClickHouse imported history and raw facts. In production deploys, use `DEPLOY_CLICKHOUSE=true RUN_CLICKHOUSE_ROLLUP_BACKFILL=true` so setup and rollup rebuild finish before the new app pods roll.
- Keep `CLICKHOUSE_ENABLED=true`, `CLICKHOUSE_DUAL_WRITE_ENABLED=true`, and `CLICKHOUSE_READS_ENABLED=true`.
- Apply the Postgres migration `20260522010000_drop_api_endpoint_daily_stats`; it drops the heavy historical table and leaves only an empty no-op compatibility shell for rolling deploy safety. After this release there is no Postgres read/write fallback for API endpoint analytics.
Local verification completed on 2026-05-21: the backfill imported 832 local api_endpoint_daily_stats rows; Postgres and ClickHouse FINAL totals matched exactly at 27,505 calls, 262 errors, and 11,229,944 summed latency ms. See the migration runbook for commands and details.
Rollup migration update on 2026-05-22: runtime reads now target api_endpoint_daily_rollups; issue generation, dashboard insights, and API degradation emails no longer read api_endpoint_daily_stats; artifact processing no longer writes that Postgres table. The one-time rollup backfill job is clickhouse-backfill-api-rollups. The remaining Postgres object with that name is only an empty no-op compatibility shell during deployment.
Production infrastructure verification on 2026-05-22:
- ClickHouse Keeper is running as 3 voters, one per node.
- ClickHouse data is running as 1 shard / 2 replicas, one on each HEL1 node.
- Historical backfill for dates `< 2026-05-21` completed: 9,325,058 rows, 180,151,363 calls, 8,322,568 errors, and 503,552,981,006 summed latency ms.
- Postgres and ClickHouse `FINAL` totals matched exactly by global total, by date, and by project for dates `< 2026-05-21`.
- Current production flags on `api-dashboard`: `CLICKHOUSE_ENABLED=true`, `CLICKHOUSE_DUAL_WRITE_ENABLED=true`, `CLICKHOUSE_READS_ENABLED=true`.
- The cutover deploy image `760a4e6b519e2ad9eed469181cd432f57055c2e8` rolled out successfully.
- `api_endpoint_daily_rollups` was live and fresh after deploy: about 622k rollup rows, about 243.5M calls, newest rollup date `2026-05-22`, and `updated_at` within seconds of the health check.
- `api_endpoint_request_events` was receiving fresh raw facts after deploy: about 2.19M raw rows with newest `inserted_at` on `2026-05-22 04:13:36 UTC`.
- The Postgres compatibility shell `public.api_endpoint_daily_stats` existed and had 0 rows. Empty is expected; the trigger discards legacy writes with `INSERT 0 0`.
- 3-node StatefulSet: `redis-node-0` (FSN1, master), `redis-node-1` (quorum-1, replica), `redis-node-2` (worker-1, replica).
- Sentinel quorum = 2/3. On FSN1 failure, HEL1 Sentinels elect a new master.
- 8 GiB volumes per node, `reclaimPolicy: Retain`.
- `maxmemory-policy: noeviction` is required. BullMQ stores job state as Redis hashes. With LRU eviction Redis silently drops job records under memory pressure; workers never see those jobs and artifacts are permanently stuck in `uploaded`. Production Redis at steady state uses ~28 MB for BullMQ state; the 8 GiB limit is effectively unlimited headroom.
- BullMQ connections require `maxRetriesPerRequest: null` on the ioredis client. This is handled by `createBullMQRedisConnection()` in `artifactBullQueue.ts`, which creates dedicated connections separate from the app's general Redis client (BullMQ internally needs a commands connection + a blocking subscribe connection).
| Component | Purpose |
|---|---|
| VictoriaMetrics | Metrics store. Scraped from node-exporter, cadvisor, kube-state-metrics, postgres-exporter, redis-metrics, Traefik, pushgateway. |
| Grafana | Dashboard UI. Port-forwarded for operator access. |
| Gatus | Public endpoint + internal service health checks. |
| Pushgateway | Push metrics from CronJobs and short-lived pods. |
| node-exporter | DaemonSet — one per node. |
| cadvisor | DaemonSet — one per node. |
| kube-state-metrics | Cluster-level k8s object metrics. |
| postgres-exporter | Scrapes CNPG primary. |
Grafana provisions the Rejourney dashboards from k8s/grafana-dashboards.yaml.
For artifact backlog incidents, start with 55 — Artifact Ingest Diagnosis.
The top row should show rj-ingest-artifacts waiting decreasing after deploy;
if ingest drops but rj-session-event-rollup grows, the bottleneck has moved
to the new per-session rollup worker. The compact queue view is also embedded
in 50 — Application.
- 2 replicas (1 FSN1, 1 HEL1). Without a HEL1 replica, FSN1 failure causes 30–60s cluster-wide DNS outage.
- Not CI-managed: k3s controls CoreDNS via its internal addon mechanism. Verify after any k3s upgrade with `kubectl get pods -n kube-system -l k8s-app=kube-dns`.
| Class | Driver | Reclaim | Used by |
|---|---|---|---|
| `rejourney-db-local-retain` | local-path | Retain | postgres, redis |
| `local-path` | local-path | Delete | grafana, victoria-metrics, gatus, pgadmin |
Retain policy on DB volumes is critical. Deleting a PVC does NOT delete the underlying data. Recreating CNPG or Redis without verifying volumes creates orphaned data silently.
| Failure | What happens |
|---|---|
| FSN1 `api-ingest` pods | Reschedule to HEL1 (the required affinity follows the promoted primary, so they land wherever Postgres lands). Slower until postgres/Redis failover completes (~30s). |
| HEL1 `api-dashboard` pods | Reschedule to whichever HEL1 node is still up; if both HEL1 nodes are gone, spill to FSN1. `dbRead` queries keep working while the standby is up because pgbouncer-ro follows it. On full HEL1 loss the standby itself is gone, so `dbRead` queries error until CNPG promotes and a new standby is rebuilt. Mitigation: there is no automatic fallback, so `dbRead` recovers only via app retry; consider switching `DATABASE_URL_READ=DATABASE_URL` during prolonged standby outages. |
| CNPG primary | postgres-local-2 auto-promotes. postgres-app-rw selector follows new primary. pgbouncer (rw) on HEL1 reconnects to local primary. postgres-local-ro momentarily has zero endpoints until CNPG rebuilds the standby — dbRead queries error during that window. |
| In-flight writes at crash | Postgres writes: no data loss — remote_write means every committed write was already buffered on standby. BullMQ jobs that were active at crash time are detected as stalled after stalledInterval = 30s and automatically re-queued. Relay uploads already ACKed to the SDK depend on the Redis artifact:buf:{artifactId} key surviving until flush; buffered flush jobs are recoverable while that 30-minute key exists. Artifact processing is idempotent — safe to reprocess. |
| Redis master | Sentinel elects new master within seconds. |
| ClickHouse Keeper voter | Any single node loss leaves 2/3 Keeper quorum. ClickHouse data remains available if at least one HEL1 data replica is healthy. |
| ClickHouse data replica | One HEL1 node loss leaves the other data replica serving analytics. Full HEL1 loss takes ClickHouse reads down. There is no Postgres fallback for API endpoint analytics after the cutover; treat this as an analytics outage, not an ingest outage. |
| Traefik | Reschedules to worker-1 (~90s). LB detects worker-1 nodeport healthy and resumes routing. |
| CoreDNS | Second replica on HEL1 keeps DNS alive. |
| Monitoring | victoria-metrics, Grafana, Gatus go offline. Accepted gap. |
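
The quorum rows in this table follow from simple majority arithmetic. The sketch below assumes majority-based quorum (more than half of the voters must be alive), which is how Raft-style systems such as etcd and ClickHouse Keeper behave; the function name is hypothetical. Redis Sentinel's quorum is a configured number, so the second helper takes it as a parameter rather than deriving it.

```typescript
// True when the surviving voters still form a strict majority.
function hasMajority(voters: number, failed: number): boolean {
  const alive = voters - failed;
  return alive >= Math.floor(voters / 2) + 1;
}

// True when enough voters remain to satisfy an explicitly configured quorum.
function meetsConfiguredQuorum(voters: number, failed: number, quorum: number): boolean {
  return voters - failed >= quorum;
}

// 3 voters, 1 node lost  -> majority holds (2 of 3).
// 3 voters, 2 nodes lost -> majority lost (1 of 3).
```

With one voter per node, losing FSN1 alone leaves two HEL1 voters and quorum holds. Losing both HEL1 nodes at once leaves a single voter, which under these assumptions is not enough.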
Post-failover: once CNPG promotes (~30s) and Redis elects master (~5s), api-ingest pods reschedule onto whatever node now holds the primary and hit local pgbouncer → local postgres → local Redis. Latency recovers close to FSN1 levels. The dashboard read path is degraded until the new standby finishes initial sync (minutes); during that window dashboard queries should be temporarily routed back to the primary (set DATABASE_URL_READ= on api-dashboard).
| Host | Path | Backend | Middlewares |
|---|---|---|---|
| `rejourney.co` | `/` | `web:80` | `security-headers` |
| `www.rejourney.co` | `/` (HTTP) | - | `http-www-redirect` |
| `www.rejourney.co` | `/` (HTTPS) | `web:80` | `www-redirect` |
| `api.rejourney.co` | `/api/ingest`, `/api/sdk/config` | `api-ingest:3000` | `security-headers`, `rate-limit-ingest` |
| `api.rejourney.co` | `/` | `api-dashboard:3000` | `security-headers`, `rate-limit-api` |
| `ingest.rejourney.co` | `/upload` | `ingest-upload:3001` | `security-headers`, `rate-limit-ingest` |
| `ingest.rejourney.co` | `/` | `api-ingest:3000` | `security-headers`, `rate-limit-ingest` |
| `*.rejourney.co` (HTTP) | `/` | - | `https-redirect` |
The api-ingest ingress carries priority 110 so the more-specific paths win against the catch-all / route on api.rejourney.co (priority 10) which goes to api-dashboard.
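
The effect of those priorities on `api.rejourney.co` can be sketched as "among all rules whose path prefix matches, the highest priority wins". The `Route` type, the `pickBackend` helper, and the plain string-prefix matching are simplifications made for illustration; they are not how Traefik is configured or implemented.

```typescript
interface Route {
  pathPrefix: string;
  backend: string;
  priority: number;
}

// Rules for api.rejourney.co, using the priorities quoted in this document.
const apiHostRoutes: Route[] = [
  { pathPrefix: "/api/ingest", backend: "api-ingest:3000", priority: 110 },
  { pathPrefix: "/api/sdk/config", backend: "api-ingest:3000", priority: 110 },
  { pathPrefix: "/", backend: "api-dashboard:3000", priority: 10 },
];

// Returns the backend of the highest-priority matching rule, or null if none match.
function pickBackend(path: string, routes: Route[] = apiHostRoutes): string | null {
  let best: Route | null = null;
  for (const route of routes) {
    if (!path.startsWith(route.pathPrefix)) {
      continue;
    }
    if (best === null || route.priority > best.priority) {
      best = route;
    }
  }
  return best === null ? null : best.backend;
}
```

Every path also matches the catch-all `/` rule, so without the higher priority on the ingest rules the outcome would depend on rule ordering rather than on intent.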
Dashboard replay object reads are intentionally dynamic. Production can sign URLs for Hetzner, OVH, Scaleway, or any active storage_endpoints row, so the web CSP should allow HTTPS object-storage reads by scheme rather than a hardcoded provider host list:
```
connect-src 'self' https: wss://api.rejourney.co
media-src 'self' https: blob:
```
Local k8s uses the same idea with http: allowed for MinIO/local endpoints. Buckets still need CORS that permits dashboard origins to GET/HEAD; otherwise rrweb segments and screenshot frames will fall back to the API proxy routes and put replay traffic back on api-dashboard.
-
Never rename a Hetzner server without a coordinated k3s migration. CCM matches Hetzner server names to k8s node names. Mismatch →
network-unavailable:NoScheduletaint → no pods schedule there. Also breaks PV nodeAffinity (immutable field — must delete/recreate PVs). Requires: k3snode-nameconfig change on all nodes, cluster-reset if etcd gets corrupted, flannel FDB repopulation (k3s restart on all nodes), PV rebuild. -
API and Traefik affinity use
rejourney.co/datacenter=fsn1. New nodes must be labelled on join.api-dashboard,ingest-worker, andreplay-workerprefer HEL1;session-lifecycle-workerprefers FSN1 because event rollup is DB-write heavy. -
Do not add
topologySpreadConstraintstoapi-ingest.maxSkew:1 ScheduleAnywayoverrides the preferred affinity and spreads pods to HEL1, causing 6–11s p50 response times. (Same constraint previously applied to the unifiedapideployment.) Workers had a similar issue with topologySpread fighting the HEL1 affinity and overflowing to FSN1; both were removed in May 2026. -
- Do not change `externalTrafficPolicy` back to `Cluster`. It adds an invisible 25ms kube-proxy VXLAN hop before every Traefik request.
- quorum-1 is excluded from the Hetzner LB via `node.kubernetes.io/exclude-from-external-load-balancers: "true"`. Do not remove this label.
- pgbouncer anti-affinity is `required`. One per node, always. Do not change to `preferred`, and do not add `maxSurge: 1`: with 3 nodes and 3 pods, a surge pod deadlocks the rollout. When adding a node, bump replicas first.
- CNPG sync replication degrades to async when standby is down. `maxSyncReplicas: 1` is intentional. You briefly lose the sync guarantee during CNPG upgrades.
- HPA can undershoot for IO-bound workers. `ingest-worker` and `replay-worker` may wait on S3 and DB instead of saturating CPU. Monitor queue depth: `LLEN bull:rj-artifact-flush:wait`, `LLEN bull:rj-ingest-artifacts:wait`, `LLEN bull:rj-replay-artifacts:wait`, `LLEN bull:rj-session-event-rollup:wait`, and `LLEN bull:rj-session-effects:wait` in Redis. If `rj-ingest-artifacts` is non-zero and growing, confirm HPA is at the 6-replica max before changing code. If `rj-session-event-rollup` is growing while ingest is empty, first confirm `session-lifecycle-worker` is 5/5, then tune rollup batch/concurrency rather than adding ingest pods.
- `api-ingest`/Postgres colocation is enforced twice. `api-ingest` has required pod affinity to the current CNPG primary, CI auto-corrects via `pin_deployment_to_postgres_primary api-ingest`, and the `api-postgres-colocator` CronJob handles later failovers (its `API_DEPLOYMENT` env is set to `api-ingest`). `api-dashboard` does NOT colocate with the primary; it lives on HEL1. If you see slow ingest: compare `kubectl get pods -n rejourney -l app=api-ingest -o wide` with `kubectl get pods -n rejourney -l cnpg.io/cluster=postgres-local,cnpg.io/instanceRole=primary -o wide`, then inspect `kubectl get jobs -n rejourney -l app=api-postgres-colocator`.
- SyncRep is the write throughput ceiling. Every committed write waits ~25ms for standby ACK. For any new write-heavy path that becomes slow: check `pg_stat_activity WHERE wait_event = 'SyncRep'` first, then add `SET LOCAL synchronous_commit = local` if the path can safely skip the standby wait (idempotent retries are acceptable). The BullMQ migration eliminated the former hottest path (`ingest_jobs` INSERT/UPDATE churn): artifact job dispatch now goes through Redis, and only the `recording_artifacts` status update remains in Postgres.
- DB storage is local-path, not Hetzner cloud volumes. PVCs survive pod deletion (Retain), but permanent node destruction loses local data. Standby + OVH WAL/base backups are the recovery paths.
- CoreDNS replica count may reset on k3s upgrades. Verify after any upgrade and re-apply `k8s/coredns-config.yaml`.
- Cloudflare WAF silently blocks ingest PUT requests. No logs, no 4xx to the client: Cloudflare drops the request without notifying you. If ingest stops working and the API/ingest-upload pods look healthy: add a custom WAF pass-rule for PUT on `ingest.rejourney.co` first, before touching anything else. There are zero logs until that rule is in place.
- `postgres-app-ro` is broken in this cluster; use `postgres-local-ro` for read-only traffic. The `postgres-app-*` services were configured as custom aliases at some point, and `postgres-app-ro` has selector `cnpg.io/podRole=instance`, which matches BOTH primary and standby, so it round-robins between them and is NOT actually read-only. The CNPG-default `postgres-local-ro` has the correct selector (`cnpg.io/instanceRole=replica`) and only routes to the standby. `pgbouncer-ro` points at `postgres-local-ro` for this reason. Verify any new read pool with `kubectl get endpoints postgres-local-ro` (should be exactly one IP, the standby).
- `DATABASE_URL_READ` env var ordering matters. On `api-dashboard`, `DATABASE_URL_READ` is interpolated from `POSTGRES_USER`/`POSTGRES_PASSWORD`/`POSTGRES_DB` via `$(VAR)` syntax. Kubernetes only resolves `$(VAR)` against env vars that appear EARLIER in the same container's env list. If you reorder env entries and put `DATABASE_URL_READ` before its inputs, the value becomes the literal string `postgresql://$(POSTGRES_USER):$(POSTGRES_PASSWORD)@pgbouncer-ro:5432/$(POSTGRES_DB)` and Postgres connections fail with cryptic auth errors.
- API split: the `api` Service is a backward-compat alias for `api-ingest`. Anything cluster-internal that hardcoded `api:3000` (gatus, monitoring scrapes) keeps working because the `api` Service still exists with selector `app: api-ingest`. Don't delete it. The colocator and the deploy script reference `api-ingest` directly, not `api`. Ingest traffic goes to `api-ingest`; dashboard/auth/everything-else goes to `api-dashboard`.
- `api-dashboard` does writes too; don't assume it's read-only. Login (which creates `user_sessions` rows), settings saves, and Stripe webhooks (`/api/webhooks/*`) all run on `api-dashboard` and write to the primary via the regular `db` client (which uses `DATABASE_URL` → `pgbouncer` → primary). Only the `dbRead` client targets the standby. If you migrate a route to `dbRead`, verify it does not write: writes via `dbRead` will fail at the standby with `cannot execute INSERT in a read-only transaction`. That is the intentional guardrail, but the failure is user-visible.
- ClickHouse must stay off the synchronous ingest path. It is safe only because artifact processing writes API request facts asynchronously and catches ClickHouse failures. If any `/api/ingest/*` handler starts waiting on ClickHouse before returning to the SDK, revisit placement first: a HEL1-only ClickHouse write endpoint would reintroduce the cross-DC latency pattern that previously pushed ingest latency into seconds.
- API endpoint analytics now require the ClickHouse rollup table. `CLICKHOUSE_READS_ENABLED=true` is expected in production after the rollup release. If `api_endpoint_daily_rollups` is missing or empty, the API endpoint page will return empty ClickHouse results because the Postgres fallback has been removed.
- Run `clickhouse-backfill-api-rollups --replace` after creating the rollup table. The materialized view handles new raw facts, but the backfill rebuilds imported history and existing raw facts into the small daily rollup queried by the dashboard.
- The raw ClickHouse fact date must stay aligned with the artifact processing day. This keeps rollup history comparable with the old Postgres aggregate and avoids late-arrival surprises around UTC day boundaries.
- Do not reintroduce Postgres fallback for API endpoint analytics. The old `api_endpoint_daily_stats` data table is dropped by migration, with only an empty no-op compatibility shell left for rolling deploy safety. New API endpoint analytics features should read ClickHouse rollups or raw ClickHouse facts only.
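For the `DATABASE_URL_READ` ordering gotcha in the list above, a small lint over a container's env list can catch a bad reorder before deploy. The sketch below is an assumption-laden illustration, not part of the deploy tooling: `unresolved_refs` is a hypothetical helper, the env list is assumed to be already loaded as a list of dicts (for example from a parsed manifest), and it ignores service-injected env vars and the `$$(VAR)` escape form.

```python
import re

# Matches Kubernetes-style $(VAR_NAME) references inside an env value.
VAR_REF = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


def unresolved_refs(env: list[dict]) -> dict[str, list[str]]:
    """Map each env var name to the $(VAR) references not defined earlier in the list."""
    seen: set[str] = set()
    problems: dict[str, list[str]] = {}
    for entry in env:
        # Entries that use valueFrom have no literal value to scan.
        value = entry.get("value") or ""
        missing = [ref for ref in VAR_REF.findall(value) if ref not in seen]
        if missing:
            problems[entry["name"]] = missing
        seen.add(entry["name"])
    return problems
```

An empty result means every reference has its input defined earlier in the list; a non-empty result names the variable that would end up holding a literal `$(...)` string.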
Current state: 1 FSN1 node running api-ingest, DB primary, ingest-upload, and monitoring; 2 HEL1 nodes running api-dashboard, workers, standby, and pgbouncer-ro. Next step is a second FSN1 node — more api-ingest headroom with local DB latency, not more HEL1 standby capacity.
All nodes carry rejourney.co/datacenter. New FSN1 nodes are immediately eligible for api-ingest/Traefik with just:
kubectl label node <new-node> rejourney.co/datacenter=fsn1
Recommended type: CPX41 (16 vCPU, 32 GB, ~€33/mo) or CX32 (8 vCPU, 32 GB, ~€19/mo). Add in FSN1 only. With strict api-ingest/Postgres-primary colocation, this node is immediate ingress/pgbouncer/general headroom; api-ingest pods only use it if the CNPG primary can also run there. After the node joins:
kubectl label node <new-node> rejourney.co/datacenter=fsn1
# Remove LB exclusion label if present:
kubectl label node <new-node> node.kubernetes.io/exclude-from-external-load-balancers-
pgbouncer (rw) requires one replica per node. With 4 nodes, set replicas = 4 and raise max_connections to at least 280 (4 × 60 = 240 connections, need headroom). Update k8s/pgbouncer.yaml. (pgbouncer-ro does not need a replica per node; it lives on HEL1 next to the standby.)
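The pgbouncer sizing arithmetic above can be written down as a small helper. This is a sketch, not the contents of k8s/pgbouncer.yaml: `pgbouncer_budget` is a hypothetical name, the 60 connections per pod comes from the 4 × 60 figure, and the fixed headroom of 40 is an assumption inferred from the gap between 240 and 280.

```python
def pgbouncer_budget(nodes: int, per_pod_conns: int = 60, headroom: int = 40) -> dict[str, int]:
    """Estimate pgbouncer (rw) replicas and the minimum Postgres max_connections."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    replicas = nodes  # required anti-affinity: exactly one pgbouncer pod per node
    pooled = replicas * per_pod_conns
    return {
        "replicas": replicas,
        "pooled_connections": pooled,
        "min_max_connections": pooled + headroom,  # headroom value is an assumption
    }
```

With `nodes=4` this reproduces the figures in the text: 4 replicas, 240 pooled connections, and a `max_connections` floor of 280.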
# k8s/hpa.yaml — HorizontalPodAutoscaler/api-ingest
minReplicas: 3 # unchanged
maxReplicas: 8 # was 6
Only raise this if the Postgres primary node has CPU/memory headroom, or after adding a CNPG instance/primary path on the new FSN1 node. api-ingest pods are required to colocate with the writable Postgres pod. api-dashboard HPA scales independently and isn't affected.
# k8s/traefik-config.yaml
replicas: 2
# affinity: required rejourney.co/datacenter=fsn1, required anti-affinity on hostname
Add the new FSN1 node as a Hetzner LB backend.
- CNPG: stays 1 primary + 1 standby. More replicas add sync overhead.
- Redis: stays 3-node Sentinel.
- ClickHouse: keep the first rollout as 1 shard / 2 HEL1 data replicas plus 3 Keeper voters. Do not colocate ClickHouse data with the FSN1 Postgres primary unless ClickHouse becomes part of a synchronous FSN1 write path.
- HEL1 nodes: unchanged (HA standby, bulk worker capacity, and `api-dashboard` home).
- `api-dashboard`: 2 replicas on HEL1 is plenty for current operator load; scale via its own HPA (2–5) if dashboard traffic grows.
ClickHouse changes the Postgres scaling pressure but not the next compute step: with api_endpoint_daily_stats writes removed and the heavy Postgres table already dropped to an empty compatibility shell, Postgres should see less CPU, WAL, autovacuum, bloat, and cache pressure. The next FSN1 node is still the right compute expansion for synchronous ingest headroom because api-ingest must stay colocated with the writable Postgres primary.