Skip to content

BUILDS_INPROGRESS metric not working as expected #2051

Description

@rgaiacs

As reported downstream on jupyterhub/mybinder.org-deploy#3594, the binderhub_inprogress_builds / BUILDS_INPROGRESS metric is not working as expected.

Observed Expected
Image Image

BUILDS_INPROGRESS is defined in

BUILDS_INPROGRESS = Gauge("binderhub_inprogress_builds", "Builds currently in progress")
The Prometheus client_python Gauge() can go up and down but from the observed Grafana visualisation, it is only going up. track_inprogress() is used in
with BUILDS_INPROGRESS.track_inprogress():
done = False
failed = False
def _check_result(future):
nonlocal done
nonlocal failed
try:
r = future.result()
app_log.debug("task completed: %s", r)
except Exception:
app_log.error("task failed", exc_info=True)
done = True
failed = True
# TODO: Propagate error to front-end
build_starttime = time.perf_counter()
pool = self.settings["build_pool"]
# Start building
submit_future = pool.submit(build.submit)
submit_future.add_done_callback(_check_result)
IOLoop.current().add_callback(lambda: submit_future)
log_future = None
# initial waiting event
await self.emit(
{
"phase": "waiting",
"message": "Waiting for build to start...\n",
}
)
while not done:
progress = await q.get()
# FIXME: If pod goes into an unrecoverable stage, such as ImagePullBackoff or
# whatever, we should fail properly.
if progress.kind == ProgressEvent.Kind.BUILD_STATUS_CHANGE:
phase = progress.payload.value
if progress.payload == ProgressEvent.BuildStatus.PENDING:
# nothing to do, just waiting
continue
elif progress.payload == ProgressEvent.BuildStatus.BUILT:
if build_only:
message = "Done! Image built\n"
phase = "ready"
else:
message = "Built image, launching...\n"
event = {
"phase": phase,
"message": message,
"imageName": image_name,
}
BUILD_TIME.labels(status="success").observe(
time.perf_counter() - build_starttime
)
BUILD_COUNT.labels(
status="success", **self.repo_metric_labels
).inc()
done = True
elif progress.payload == ProgressEvent.BuildStatus.RUNNING:
# start capturing build logs once the pod is running
if log_future is None:
log_future = pool.submit(build.stream_logs)
log_future.add_done_callback(_check_result)
continue
elif progress.payload == ProgressEvent.BuildStatus.BUILT:
# Do nothing, is ok!
continue
elif progress.payload == ProgressEvent.BuildStatus.FAILED:
event = {"phase": phase}
elif progress.payload == ProgressEvent.BuildStatus.UNKNOWN:
event = {"phase": phase}
else:
raise ValueError(
f"Found unknown phase {phase} in ProgressEvent"
)
elif progress.kind == ProgressEvent.Kind.LOG_MESSAGE:
# The logs are coming out of repo2docker, so we expect
# them to be JSON structured anyway
event = progress.payload
payload = json.loads(event)
if payload.get("phase") in ("failure", "failed"):
failed = True
BUILD_TIME.labels(status="failure").observe(
time.perf_counter() - build_starttime
)
BUILD_COUNT.labels(
status="failure", **self.repo_metric_labels
).inc()
await self.emit(event)
as described in the Prometheus client_python documentation.

My search pointed me that the cause of this problem could be

  • that the process might be terminated before the with block completes
  • the async / await instruction

@yuvipanda and @manics what does your experience says?

Should we replace the BUILDS_INPROGRESS with a downstream calculation of BUILDS_STARTED - BUILDS_COMPLETED - BUILDS_FAILED?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions