| name | performance-and-capacity |
|---|---|
| description | Use when tail latency, load tests, saturation, capacity, headroom, or peak/failover traffic need analysis |
NO CAPACITY OR PERFORMANCE PLAN WITHOUT A TRAFFIC MODEL, TAIL METRIC, SATURATION SIGNAL, AND LOAD-TEST RESULTS
If the answer only says "scale horizontally" or reports averages, it is not enough.
Users experience tail latency, not averages.
Core principle: model demand, concurrency, queueing, saturation, and fanout, then test to the knee of the curve before production finds it.
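
As a minimal sketch of why averages hide what users feel, the following compares the mean with p50 and p99 for a small set of latency samples. The sample values, the `percentile` helper, and the nearest-rank method are assumptions chosen for illustration, not measurements from any real system.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds: 97 fast requests and 3 slow ones.
latencies_ms = [20] * 97 + [900] * 3

mean_ms = sum(latencies_ms) / len(latencies_ms)  # 46.4 ms
p50_ms = percentile(latencies_ms, 50)            # 20 ms
p99_ms = percentile(latencies_ms, 99)            # 900 ms
```

In this made-up sample the mean suggests a healthy service, while 3 in 100 requests take 900 ms. The same gap is why a plan that reports only averages is not enough.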
- The user asks about p95, p99, p99.9, throughput, QPS, concurrency, queueing, saturation, hot paths, or scaling limits.
- The work concerns raw concurrent-connection, memory, file-descriptor, or autoscaling headroom and leaves connection lifecycle semantics unchanged.
- A release caused a latency or throughput regression.
- A launch, PRR, or migration needs capacity test results.
- The system needs load, stress, spike, soak, or failure-condition testing.
- Cost is discussed as a capacity/headroom tradeoff rather than a billing support question.
- The main problem is retries, timeouts, or dependency failure safety; use `dependency-resilience` instead.
- The main request is public edge abuse, denial-of-service defense, or application-layer filtering; use `edge-traffic-and-ddos-defense` instead.
- The user asks pure billing/procurement questions; out of scope.
- The work is SLO target selection without performance investigation; use `slo-and-error-budgets` instead.
- The regression is explicitly tied to a query plan, index, or schema migration; use `database-operations` instead.
- The request is browser field/lab release checks for a frontend rollout; use `web-release-gates` instead.
- Current work phase, next decision, what is known, and assumptions where details are missing.
- User journeys, SLOs, latency percentiles, throughput targets, and acceptable degradation behavior.
- Traffic model: current, peak, forecast, burstiness, tenant skew, payload size, fanout, batch ramp-up rate, and internal or privileged entry points that can bypass public quotas.
- Resource signals: CPU, memory, IO, network, lock contention, connection pools, thread pools, queue depth, queue age, GC, and background or maintenance work that shares serving capacity.
- Load-balancing behavior, locality, shard keys, hot partitions, cache hit rate, and downstream quotas.
- Existing load tests, production incidents, profiling/flame graphs, and regression data.
- Tested breakpoint, startup-to-ready time, recovery time after stress, and profile differences between normal and heavy load.
- Headroom rule, autoscaling, load-balancing, or protection-control behavior under saturation, feedback-amplification risks, static failed-domain capacity, and unit-cost constraints.
- Control-loop input contracts for autoscaling, load-balancing, and protection controls: metric source, unit, labels, filters, missing-data behavior, validation path, and compatibility across changes.
- Capacity-change plan: scale-up batch size, rebalance work, scheduler or allocator processing cost, and rollback path if adding capacity slows the serving path.
- Frame the answer before inspection. Start with a compact provisional check frame: target percentile and boundary; load-test method with scenarios and pass/stop criteria; headroom plus USE signal; overload mechanism and priority; queue-depth or in-flight work metric plus backpressure; hot-path/key hypothesis plus mitigation. Mark unknowns and refine them after investigation.
- Define the user-visible target. Choose p95/p99/p99.9 and throughput targets that map to SLOs or launch requirements.
- Build the demand model. Capture request rate, burstiness, concurrency, fanout, payload, tenant skew, batch or maintenance ramp-up curve, and seasonal peaks.
- Apply queueing sanity checks. Use Little's Law to connect arrival rate, latency, and concurrency; identify queues that can hide saturation.
- Find saturation points. Track RED for services and USE for resources. Include locks, connection pools, thread pools, caches, downstream quotas, automated scaling or protection-control actions, and every internal or external entry point's true serving capacity. Do not let privileged, batch, or maintenance paths ramp faster than the bottleneck dependency can absorb.
- Test to the knee. Run load/stress/spike/soak tests in production-like environments until latency or errors become nonlinear; include representative peak traffic, tenant skew, and background jobs that share serving resources. Record the breakpoint, startup-to-ready time, recovery behavior after stress, and the profile differences that explain bottlenecks.
- Protect the system. Define admission control, load shedding, prioritization, and graceful degradation before saturation.
- Budget background work. Set resource ceilings, scheduling limits, and preemption behavior for maintenance, config-processing, compaction, indexing, and other background paths that can starve foreground requests.
- Validate control loops. Test autoscaling, load-balancing, and protection controls when their input signals are slow, missing, erroring, renamed, relabeled, or rejected by policy validation; confirm the action does not add work to the same saturated path or scale a failing dependency into a feedback loop.
- Investigate regressions scientifically. Compare before/after profiles, deploy markers, dependency metrics, cache behavior, and resource saturation.
- Model failed-domain headroom. For HA requirements, show remaining domains have enough already-available capacity at peak; do not count emergency scaling as the primary recovery mechanism.
- Treat capacity changes as load. When adding, moving, or reserving capacity, model rebalance work, scheduler or allocator processing, and rollout batch size so the expansion itself cannot create a latency incident.
- Tie capacity to cost when relevant. Preserve required headroom and failover capacity; optimize unit economics only after risk is explicit.
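
The queueing sanity check in the steps above can be sketched with Little's Law, L = λW, where L is the mean number of requests in flight, λ is the arrival rate, and W is the mean time a request spends in the system. The function names and the numbers below are hypothetical. Note that the law relates means, so W here is mean latency, not a tail percentile.

```python
def required_concurrency(arrival_rate_per_s, mean_latency_s):
    """Little's Law: mean in-flight requests = arrival rate * mean time in system."""
    return arrival_rate_per_s * mean_latency_s


def max_sustainable_rate(concurrency_limit, mean_latency_s):
    """Highest arrival rate a fixed concurrency limit can carry before work queues."""
    if mean_latency_s <= 0:
        raise ValueError("mean_latency_s must be positive")
    return concurrency_limit / mean_latency_s


# Hypothetical peak: 1200 requests/s at 250 ms mean latency.
in_flight_at_peak = required_concurrency(1200, 0.25)  # 300 requests in flight

# Hypothetical worker or connection pool capped at 256 concurrent requests.
rate_ceiling = max_sustainable_rate(256, 0.25)         # 1024 requests/s

pool_is_bottleneck = in_flight_at_peak > 256
```

Under these assumed numbers the pool saturates below peak demand, so requests would wait in a queue that service-level latency metrics may not show. That is the kind of hidden saturation the check is meant to surface before a load test confirms it.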
Optimize around tail percentiles, saturation, queue age, and headroom rather than averages. Combine tail-at-scale design, SRE golden signals, performance baselines, load-shedding practice, and unit-cost discipline when cost is explicitly part of the reliability tradeoff. Name the static-stability and constant-work patterns as the default for headroom: pre-provision capacity that is already available when a domain fails rather than relying on reactive scaling, and forecast demand proactively with provisioning lead time rather than waiting for saturation to trigger a response.
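
A minimal sketch of the static-stability headroom check follows, assuming equally sized failure domains and a per-domain capacity taken from a measured breakpoint. The function names, the 80% utilization ceiling, and all traffic numbers are hypothetical.

```python
def survives_domain_loss(peak_rps, per_domain_capacity_rps, domains,
                         max_utilization=0.8, failed_domains=1):
    """True if the remaining domains carry peak load without exceeding max_utilization.

    Counts only capacity that is already provisioned; emergency scaling is excluded.
    """
    remaining = domains - failed_domains
    if remaining <= 0:
        return False
    usable_rps = remaining * per_domain_capacity_rps * max_utilization
    return peak_rps <= usable_rps


def normal_utilization_ceiling(domains, max_utilization_after_failure, failed_domains=1):
    """Highest steady-state utilization that still leaves room to absorb a domain loss."""
    remaining = domains - failed_domains
    if remaining <= 0:
        raise ValueError("at least one domain must remain")
    return max_utilization_after_failure * remaining / domains


# Hypothetical: 3 domains, each measured to serve 4000 requests/s before the knee.
ok_at_6000 = survives_domain_loss(6000, 4000, domains=3)  # True: 6000 <= 6400
ok_at_7000 = survives_domain_loss(7000, 4000, domains=3)  # False: 7000 > 6400
ceiling = normal_utilization_ceiling(3, 0.8)              # about 0.53
```

With three domains and an 80% post-failure ceiling, each domain should run at roughly 53% or less at peak in normal operation. Running hotter means the failover plan silently depends on reactive scaling.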
- Ideation: identify risks, defaults, unknowns, options, and the next decision before code exists.
- Design: shape the target artifact, tradeoffs, checks, and details to gather.
- Development: guide sequencing, code boundaries, checks, and acceptance criteria.
- Testing: define release-blocking tests, evals, fixtures, and failure probes.
- Release: define rollout, observability, abort, rollback, and readiness details.
- Maintenance: define owners, drift checks, cleanup triggers, and refresh cadence.
- Existing artifact: use current code, docs, telemetry, incidents, or diffs as context for the next engineering decision; do not wait for a finished artifact before guiding design, build, release, or operation.
- Missing details: state assumptions and say what to check next instead of blocking lifecycle guidance.
- Batch pipelines may use freshness and completion latency instead of request p99; route to data pipeline reliability when the system is mainly ETL.
- Internal low-impact tools may use lower headroom or follow-up-only alerts when the user accepts the SLO tradeoff.
- Hedged requests can reduce tail latency only when extra load is budgeted and duplicate work is safe.
- Predictive scaling helps predictable demand, but cold-start latency must not sit on a critical synchronous path.
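
The hedged-request condition above ("only when extra load is budgeted") can be estimated from existing latency samples. The sketch below assumes at most one duplicate per request, sent when the original is still outstanding after a fixed delay, and assumes the latency distribution does not shift under the added load. The function name, the samples, and the 10% budget are hypothetical.

```python
def hedge_extra_load_fraction(latencies_ms, hedge_delay_ms):
    """Fraction of requests still outstanding at hedge_delay_ms; each sends one duplicate."""
    if not latencies_ms:
        raise ValueError("latencies_ms must not be empty")
    hedged = sum(1 for latency in latencies_ms if latency > hedge_delay_ms)
    return hedged / len(latencies_ms)


# Hypothetical samples: 95 requests at 20 ms, 5 requests at 400 ms.
samples_ms = [20] * 95 + [400] * 5

extra_at_50ms = hedge_extra_load_fraction(samples_ms, hedge_delay_ms=50)  # 0.05
extra_at_10ms = hedge_extra_load_fraction(samples_ms, hedge_delay_ms=10)  # 1.0

# Hypothetical budget: the backend has headroom for 10% duplicate traffic.
extra_load_budget = 0.10
hedge_at_50ms_fits = extra_at_50ms <= extra_load_budget  # True
hedge_at_10ms_fits = extra_at_10ms <= extra_load_budget  # False
```

A delay set near the tail adds a small, bounded amount of work, while a delay below typical latency doubles the load. The estimate says nothing about whether duplicate work is safe, which still has to be confirmed separately.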
- Lead with the capacity model, tail-latency diagnosis, load-test plan, or headroom decision requested.
- Cover traffic shape, fanout, tail budgets, saturation signals, load shedding, test results, failure-domain headroom, and cost tradeoffs when relevant before optional performance breadth.
- Make recommendations actionable with thresholds, test scenarios, stop criteria, scaling limits, rollback actions, and regression checks where relevant.
- Name the details to inspect, such as p95/p99 metrics, peak/burst traffic, concurrency, queue age, resource saturation, downstream limits, load-test results, and unit cost; do not state details you have not seen.
- Stay technology-agnostic by default: do not introduce provider, product, framework, database, protocol, or command names unless the user supplied them or explicitly requested tool-specific guidance.
- Stay inside capacity, performance, and tail latency. Route data pipelines, dependency resilience, or FinOps only when they materially change the decision.
- Be concise: avoid generic performance advice and prefer compact capacity models, latency budgets, and test matrices.
- Output shape: render the matching shared template headings or tables in the reply, or use the same shape. Every answer, including narrow regression diagnoses, must state, in this order:
  - Target at user boundary: numeric latency/throughput target, percentile (p95/p99/p99.9), and the measurement boundary (edge, gateway, service ingress). Mark unknown explicitly.
  - Load-test methodology: name the method (synthetic load, traffic shadow, prod replay), the scenarios (normal/peak/burst/soak), and pass/stop criteria.
  - Headroom and saturation (USE): required headroom percentage and the saturation indicator(s) tracked (utilization, queue depth, queue age, pool wait, drain rate).
  - Overload behavior: load-shedding or admission-control mechanism AND which traffic class is preserved by priority.
  - Queue/backpressure model for any asynchronous path: queue-depth metric and the backpressure response.
  - Hot-path / hot-key analysis: the suspected hot path or hot key and its mitigation.
  - Background-work resource budget where maintenance, config-processing, compaction, indexing, or control work shares foreground serving capacity.
  - Capacity model (normal/peak/burst/failure-domain), capacity-change and rebalance plan, control-loop input contract and behavior under saturation with feedback-amplification guard, latency budget by hop, regression analysis, tested breakpoint, recovery-after-stress result, and cost/headroom tradeoff when cost is in scope.
- `tail_metric`: target percentile, window, and journey are stated.
- `traffic_model`: peak, burst, concurrency, fanout, and tenant skew are modeled or marked unknown.
- `saturation_signals`: resource, queue, pool, and downstream saturation metrics are identified.
- `entry_point_limits`: internal, batch, admin, and public entry points enforce steady-state and ramp-rate limits no higher than measured downstream capacity.
- `test_result`: load or regression test has scenario, stop criteria, result, and check path.
- `breakpoint_known`: the nonlinear failure point, or the reason it was not tested, is recorded.
- `background_budget`: shared-capacity background work has resource ceilings, breach actions, and representative peak-load test coverage.
- `headroom_check`: capacity includes peak, resource or dependency limits, and expected failure-domain conditions, with static capacity separated from emergency scaling.
- `capacity_change_load`: capacity additions, reservations, or moves account for rebalance work, scheduler or allocator cost, batch size, and rollback.
- `control_loop_behavior`: autoscaling, load-balancing, and protection controls have input metric contracts, expected actions under saturation, compatibility checks for label/filter changes, cannot amplify the failing path, and cannot reduce serving capacity below the user contract without an explicit overload decision.
- `recovery_after_stress`: recovery time and behavior after stress are measured or explicitly unknown.
- Average latency is used as the primary user-experience metric.
- The plan scales replicas but ignores database, cache, queue, or downstream limits.
- Load tests stop at expected peak and never find the nonlinear point.
- Queue depth is monitored without queue age or drain rate.
- Autoscaling or protection logic uses unhealthy dependency signals in a way that adds work to the saturated path.
- A capacity expansion assumes new headroom is free before measuring rebalance, scheduler, or allocator work.
- Cost cutting removes failover headroom without changing the SLO or accepting risk.
- A single fault-domain or partition recovery plan depends on scaling after the failure rather than preexisting headroom.
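
One way to check whether a load test actually reached the nonlinear point is sketched below. It uses a deliberately simple heuristic: the knee is the first load step where p99 exceeds a multiple of the lowest-load p99, or where the error rate crosses a ceiling. The `find_knee` name, the thresholds, and the step data are hypothetical; a real analysis would pick stop criteria from the SLO.

```python
def find_knee(steps, latency_multiplier=2.0, max_error_rate=0.01):
    """Return the offered load at the first nonlinear step, or None if never reached.

    steps: list of (offered_rps, p99_ms, error_rate) tuples ordered by rising load.
    The first step is treated as the baseline for p99.
    """
    if not steps:
        raise ValueError("steps must not be empty")
    baseline_p99_ms = steps[0][1]
    for offered_rps, p99_ms, error_rate in steps:
        if p99_ms > latency_multiplier * baseline_p99_ms or error_rate > max_error_rate:
            return offered_rps
    return None


# Hypothetical stepped load test: (offered requests/s, p99 in ms, error rate).
full_run = [
    (500, 40, 0.0),
    (1000, 42, 0.0),
    (1500, 47, 0.0),
    (2000, 65, 0.001),
    (2500, 180, 0.004),
    (3000, 950, 0.08),
]

knee_rps = find_knee(full_run)            # 2500
short_run_knee = find_knee(full_run[:4])  # None: test stopped before the knee
```

If expected peak were 2000 requests/s, the truncated run would pass while leaving the breakpoint unknown. The full run shows the system goes nonlinear only 25% above that peak, which is the information a headroom decision needs.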
| Mistake | Correction |
|---|---|
| Treating CPU as capacity | Include all saturation points: queues, locks, pools, IO, network, and dependencies. |
| Letting internal callers bypass quota | Apply capacity limits at every entry point and size them to the real bottleneck. |
| Reactive scaling as the only plan | Pre-provision static-stability headroom and forecast demand with provisioning lead time. |
| Testing only steady load | Add bursts, soak, failover, cold cache, and dependency-slow scenarios. |
| Letting background work share unlimited serving capacity | Give maintenance and control work explicit resource budgets and preemption behavior. |
| Hiding overload in queues | Track age and drain rate; shed work before recovery becomes impossible. |
| Optimizing p50 | Optimize the percentile users and SLOs experience. |
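
To show why queue depth alone is not enough, the sketch below estimates backlog drain time from depth, arrival rate, and service rate. It assumes steady rates during recovery; the function name and all numbers are hypothetical.

```python
def drain_time_s(queue_depth, arrival_rate_per_s, service_rate_per_s):
    """Seconds needed to empty the backlog, or None if the queue is not draining."""
    net_drain_per_s = service_rate_per_s - arrival_rate_per_s
    if net_drain_per_s <= 0:
        return None  # Backlog holds steady or grows: recovery needs shedding or more capacity.
    return queue_depth / net_drain_per_s


# Hypothetical backlog of 12000 items after a spike.
recovering = drain_time_s(12000, arrival_rate_per_s=900, service_rate_per_s=1000)  # 120.0
stuck = drain_time_s(12000, arrival_rate_per_s=1000, service_rate_per_s=1000)      # None
```

Both cases report the same depth, yet one recovers in two minutes and the other never does. Comparing the drain estimate and the age of the oldest item against the caller's timeout gives a concrete trigger for shedding work.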