Tiny eBPF experiment that attaches to a cgroup, tracks HTTP requests, attributes CPU work to the thread serving each request, injects an X-Energy-Score header on responses, and supports two attribution modes:
- psys: distribute sampled PSYS interval energy across the tracked work
- model: apply fitted per-signal coefficients directly in-kernel as microjoule weights
Requirements:
- clang (for both userland and BPF)
- libbpf headers/libs (libbpf-devel on Fedora) and pkg-config
- A cgroup you can write to (examples below use /sys/fs/cgroup/httpdemo)
make
This produces:
- http_energy.bpf.o: the BPF program
- http_energy: the user-space loader/attacher
Clean up:
make clean
- Create the target cgroup:
sudo mkdir -p /sys/fs/cgroup/httpdemo
- Start a server in that cgroup (adjust the path to your cgroup):
sudo bash -c 'echo $$ > /sys/fs/cgroup/httpdemo/cgroup.procs; exec ./scripts/workload_server.py --port 8080'
- Review or edit the energy model config:
sed -n '1,120p' ./energy_model.conf
- In another shell, load the programs:
sudo ./http_energy /sys/fs/cgroup/httpdemo ./http_energy.bpf.o ./energy_model.conf
- Curl the server and check for the X-Energy-Score header:
curl -v http://127.0.0.1:8080/
The header value is emitted as a decimal microjoule count.
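To check the header from a script instead of reading curl output, something like the following sketch could be used. It assumes only what is stated above (the header is named X-Energy-Score and carries a decimal microjoule count); the helper name `parse_energy_score` is hypothetical and not part of this repository.

```python
from typing import Mapping, Optional


def parse_energy_score(headers: Mapping[str, str]) -> Optional[int]:
    """Return the X-Energy-Score value in microjoules, or None if absent.

    Header names are compared case-insensitively, since HTTP header
    names are not case-sensitive.
    """
    for name, value in headers.items():
        if name.lower() == "x-energy-score":
            # The value is documented as a decimal microjoule count.
            return int(value.strip())
    return None
```

The headers object returned by `urllib.request.urlopen(...)` exposes `.items()`, so it can be passed to this helper directly once the demo server is running.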
- On connection establish (sockops), the program adds the socket to a sockhash and enables TCP state callbacks so per-connection state can be cleaned up on close.
- On inbound plaintext HTTP traffic, a cgroup ingress program detects the end of the request headers, creates a request ID, and marks the connection as awaiting a response.
- When the server thread reads from the TCP socket, an fexit/tcp_recvmsg program binds that thread to the active request for the connection.
- A power/cpu_frequency tracepoint updates a per-CPU clock map whenever the kernel reports a CPU frequency transition.
- Userspace opens the power/energy-psys perf event, reads its scale from sysfs, and samples total machine energy every psys_interval_ms.
- A sched_switch tracepoint charges on-CPU runtime to the currently bound request whenever that thread is scheduled in and out, and also accumulates a (tgid, cpu_khz) -> runtime_ns mapping in the process_freq_runtime map.
- sched_wakeup and sched_wakeup_new tracepoints count wakeup events by waking process in the process_wakeup_count map keyed by tgid, and also charge a wakeup penalty when the request-owning thread triggers a wakeup while the request is active.
- Per-CPU PMU counters attribute cycles, instructions, and cache misses to the active request on each sched-in/sched-out slice, and aggregate them into process_cycles, process_instructions, and process_cache_misses maps keyed by tgid.
- A sched_migrate_task hook counts migrations in the process_migrations map keyed by tgid.
- Userspace computes an interval score for every process from the logged signals, subtracts the configured idle baseline from the PSYS interval, and derives a live uJ / score factor from active_psys_uj / total_interval_score.
- That live factor is written back into the psys_split_state map. BPF converts each request's incremental score into attributed microjoules as work is observed, and also exports cumulative per-process attributed energy in process_attributed_energy_uj.
- On the first outbound HTTP/1.x response write, the sk_msg program injects X-Energy-Score using the PSYS-attributed request energy in microjoules.
- If the first response write cannot be rewritten safely, the response is left untouched and the pending request state is cleared so later responses are not corrupted.
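The live split factor described above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not the loader's actual code: the function names are hypothetical, and the clamping of negative active energy and the zero-score guard are assumptions added to keep the sketch well defined.

```python
def psys_split_factor(interval_psys_uj: float, idle_power_uw: float,
                      interval_ms: float, total_interval_score: float) -> float:
    """Return microjoules per score unit for one PSYS sampling interval."""
    # Idle energy over the interval: microwatts * seconds = microjoules.
    idle_uj = idle_power_uw * (interval_ms / 1000.0)
    # Assumption: an interval below the idle baseline contributes no
    # active energy rather than a negative amount.
    active_psys_uj = max(interval_psys_uj - idle_uj, 0.0)
    # Assumption: with no recorded score there is nothing to split.
    if total_interval_score <= 0:
        return 0.0
    return active_psys_uj / total_interval_score


def attributed_uj(score_delta: float, factor: float) -> float:
    """Convert a request's incremental score into attributed microjoules."""
    return score_delta * factor
```

For example, an interval that measured 1,000,000 uJ with 400,000 uJ of idle energy and a total score of 3,000 yields 200 uJ per score unit, so a request that accumulated 50 score units in that interval would be charged 10,000 uJ.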
energy_model.conf accepts the following keys:
- attribution_mode=psys|model selects either live PSYS interval splitting or direct in-kernel model evaluation.
- default_multiplier=<float> sets the fallback score multiplier when there is no exact freq_khz entry for the current CPU frequency.
- wakeup_penalty=<integer> adds that many score units whenever the request-owning thread triggers a scheduler wakeup.
- cycles_weight=<float> adds cycles * cycles_weight to the score on each accounted slice.
- instructions_weight=<float> adds instructions * instructions_weight to the score on each accounted slice.
- cache_miss_weight=<float> adds cache_misses * cache_miss_weight to the score on each accounted slice.
- migration_penalty=<integer> adds that many score units whenever the request-owning thread is migrated.
- idle_power_uw=<integer> subtracts that idle baseline from each sampled PSYS interval before energy is distributed across processes.
- psys_interval_ms=<integer> controls how often userspace samples PSYS and recomputes the live uJ / score factor.
- freq_khz=<khz> <float> sets an exact-match multiplier for a specific CPU frequency in kHz.
- Float weights are fixed-point scalars with 1.0 meaning delta += signal_value, 2.0 meaning delta += 2 * signal_value, and so on.
- In psys mode those deltas are intermediate score units that are converted to microjoules through the live PSYS split factor.
- In model mode those deltas are interpreted directly as microjoules, so the fitted config coefficients must already be in energy units.
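As a rough illustration of how these keys combine, the sketch below parses a config in the key=value form listed above and computes one slice's score delta. Several parts are assumptions rather than documented behavior: that `#` starts a comment, the default values in the dataclass, and that the frequency multiplier scales the slice's runtime_ns. The names `EnergyModel`, `parse_config`, and `slice_delta` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict

FLOAT_KEYS = {"default_multiplier", "cycles_weight",
              "instructions_weight", "cache_miss_weight"}
INT_KEYS = {"wakeup_penalty", "migration_penalty",
            "idle_power_uw", "psys_interval_ms"}


@dataclass
class EnergyModel:
    # Defaults here are placeholders for the sketch, not the loader's defaults.
    attribution_mode: str = "psys"
    default_multiplier: float = 1.0
    cycles_weight: float = 0.0
    instructions_weight: float = 0.0
    cache_miss_weight: float = 0.0
    wakeup_penalty: int = 0
    migration_penalty: int = 0
    idle_power_uw: int = 0
    psys_interval_ms: int = 200
    freq_multipliers: Dict[int, float] = field(default_factory=dict)


def parse_config(text: str) -> EnergyModel:
    model = EnergyModel()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "freq_khz":
            # Repeated key of the form: freq_khz=<khz> <float>
            khz, mult = value.split()
            model.freq_multipliers[int(khz)] = float(mult)
        elif key == "attribution_mode":
            if value not in ("psys", "model"):
                raise ValueError(f"unknown attribution_mode: {value}")
            model.attribution_mode = value
        elif key in FLOAT_KEYS:
            setattr(model, key, float(value))
        elif key in INT_KEYS:
            setattr(model, key, int(value))
    return model


def slice_delta(model: EnergyModel, cpu_khz: int, runtime_ns: int,
                cycles: int, instructions: int, cache_misses: int) -> float:
    # Exact-match frequency lookup with default_multiplier as the fallback.
    mult = model.freq_multipliers.get(cpu_khz, model.default_multiplier)
    # Assumption: the multiplier applies to the slice's on-CPU runtime.
    return (runtime_ns * mult
            + cycles * model.cycles_weight
            + instructions * model.instructions_weight
            + cache_misses * model.cache_miss_weight)
```

In psys mode the returned delta would still need to be multiplied by the live split factor; in model mode it would be read directly as microjoules.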
The intended workflow is:
- Run collection in attribution_mode=psys so each interval has a PSYS energy target.
- Fit a direct model from the collected CSV.
- Switch to the generated config with attribution_mode=model for direct in-kernel energy estimation.
Before collecting data, measure the host's idle platform power and copy the suggested value into energy_model.conf:
sudo ./scripts/measure_idle_power.py --duration 5 --samples 7
The script reads power/energy-psys, reports the observed idle power distribution, and prints a final line such as idle_power_uw=123456. By default it reports a robust fluctuation score based on median absolute deviation, keeps the raw range in the JSON for visibility, and still prints the suggested baseline even when the host is noisy. Add --strict if you want the command to exit nonzero when the fluctuation limit is exceeded.
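To make the idea of a robust fluctuation score concrete, here is a minimal sketch of a median-absolute-deviation summary over idle power samples. The script's exact formula is not documented here, so this assumes the median is used as the suggested baseline and MAD divided by the median as the score; `idle_power_summary` is a hypothetical name.

```python
import statistics
from typing import List, Tuple


def idle_power_summary(samples_uw: List[float]) -> Tuple[int, float]:
    """Return (suggested idle_power_uw, relative MAD fluctuation score)."""
    median = statistics.median(samples_uw)
    # Median absolute deviation: robust against a few noisy samples.
    mad = statistics.median(abs(s - median) for s in samples_uw)
    # Assumption: the score is MAD relative to the median.
    fluctuation = mad / median if median else 0.0
    return round(median), fluctuation
```

With seven samples where one is a large outlier, the median and MAD barely move, which is why a MAD-based score tolerates an occasional spike better than the raw range does.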
The loader can emit one CSV row per PSYS update interval:
sudo ./http_energy --collect-csv ./samples.csv --collect-label baseline \
/sys/fs/cgroup/httpdemo ./http_energy.bpf.o ./energy_model.conf
The CSV includes:
- PSYS interval energy (interval_psys_uj, active_psys_uj, idle_uj)
- aggregate interval features (runtime_ns, wakeups, cycles, instructions, cache_misses, migrations)
- per-frequency runtime encoded as freq_runtime_ns="800000:123;2200000:456"
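The per-frequency runtime field can be unpacked with a small helper such as the one below. It assumes only the `khz:ns` pairs separated by `;` shown in the example value; summing duplicate frequencies is a defensive choice of this sketch, and `parse_freq_runtime` is a hypothetical name.

```python
from typing import Dict


def parse_freq_runtime(field: str) -> Dict[int, int]:
    """Parse 'khz:ns;khz:ns' into a {cpu_khz: runtime_ns} mapping."""
    result: Dict[int, int] = {}
    for pair in field.split(";"):
        if not pair:
            continue  # tolerate an empty field or a trailing ';'
        khz, ns = pair.split(":")
        # Sum duplicates defensively in case a frequency appears twice.
        result[int(khz)] = result.get(int(khz), 0) + int(ns)
    return result
```

Applied to the example value above, this gives 123 ns at 800000 kHz and 456 ns at 2200000 kHz.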
The bundled workload server exposes endpoints with materially different behavior:
- /cpu?iters=...
- /json?items=...
- /compress?kb=...
- /file?kb=...
- /post?kb=...
To automate collection against that server with a mixed request profile:
sudo ./scripts/collect_signals.py \
--cgroup /sys/fs/cgroup/httpdemo \
--use-workload-server \
--workload-port 8080 \
--output-csv ./samples.csv \
--benchmark-json ./benchmark.json \
--duration 20 \
--concurrency 8
That script:
- starts the server in the target cgroup
- starts http_energy with CSV collection enabled
- runs the bundled HTTP benchmark with a mixed endpoint profile
- writes the benchmark summary and collected signal CSV
If you want to specify the mix manually, repeat --path or pass a JSON --mix-file through collect_signals.py or benchmark_http.py.
The standalone benchmark driver is also available directly:
./scripts/benchmark_http.py \
--url http://127.0.0.1:8080/ \
--profile mixed \
--duration 15 \
--concurrency 4
Fit the collected CSV against PSYS energy and emit a ready-to-use energy_model.conf:
./scripts/fit_energy_model.py \
--input-csv ./samples.csv \
--output-config ./energy_model.fitted.conf \
--report-json ./energy_model.report.json
The fitter:
- uses active_psys_uj as the default target
- fits per-frequency runtime coefficients plus wakeup/cycle/instruction/cache-miss/migration coefficients
- writes evaluation metrics for train/test splits (MAE, RMSE, MAPE, R²)
- generates a config with attribution_mode=model
- writes psys_interval_ms=200 by default in the generated config
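For readers unfamiliar with the four reported metrics, the following sketch computes them from measured and predicted interval energies using their standard definitions. It is not taken from fit_energy_model.py; the function name is hypothetical, and skipping zero targets in MAPE is an assumption made to avoid division by zero.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float],
                       y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute MAE, RMSE, MAPE (percent), and R^2 for paired values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    # Assumption: intervals with a zero target are left out of MAPE.
    nonzero = [(t, e) for t, e in zip(y_true, errors) if t != 0]
    mape = (100.0 * sum(abs(e / t) for t, e in nonzero) / len(nonzero)
            if nonzero else float("nan"))
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R^2 is undefined when every target is identical.
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}
```

Here `y_true` would be the active_psys_uj column and `y_pred` the model's estimate for the same intervals; MAE and RMSE are then in microjoules, while MAPE and R² are unitless.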
After fitting, launch the loader with the generated config:
sudo ./http_energy /sys/fs/cgroup/httpdemo ./http_energy.bpf.o ./energy_model.fitted.conf
In model mode the request datapath no longer needs PSYS to estimate per-request energy for the response header. Collection still uses PSYS, so --collect-csv should be run with a PSYS-based config.
- In psys mode the request header value represents PSYS-attributed microjoules, and the attribution accuracy still depends on how well the chosen signal weights explain whole-machine energy on your target host.
- In psys mode the live split factor is interval-based. Very short requests that finish before the first PSYS update after startup may still report 0 until the first calibrated factor is available.
- In model mode the output quality depends entirely on the host-specific dataset used to fit the coefficients.
- This works best for blocking or thread-per-request servers where one worker thread handles one request at a time.
- It does not attempt to attribute background work or async work that moves across threads.
If you hit verifier, perf permission, or attachment issues, ensure the cgroup path is correct, that your kernel supports SK_MSG and the tracepoints used here, and that hardware perf counters are available. power/energy-psys is now required for live attribution. Use make clean && make after code changes.