The switch to traefik introduced a 60s timeout, even though I thought the timeout was set to 600s (preserving the prior config).
We don't have a test for this, so requests were failing without really showing up until @yuvipanda noticed that pushing an image layer took over 60s and failed.
We should have some tests (they don't need to be in CI) to measure our expected timeouts.
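A minimal sketch of what such a timeout check could look like, using only the Python standard library. Everything named here is an assumption for illustration: `SlowHandler` is a hypothetical backend that sleeps for `?delay=N` seconds before answering, and `probe_timeout` is a hypothetical helper that requests it and reports whether the response survived. Nothing like this exists in our deployment today; the idea would be to put a slow endpoint behind the proxy and probe it with a delay just under the timeout we expect (e.g. 590s).

```python
import socket
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import parse_qs, urlparse


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical backend: GET /?delay=N sleeps N seconds, then replies 200."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        delay = float(query.get("delay", ["0"])[0])
        time.sleep(delay)
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the probe output quiet


def probe_timeout(base_url, delay, client_timeout=None):
    """Request an endpoint that takes `delay` seconds to answer.

    Returns (ok, elapsed_seconds, detail). `ok` is False if anything
    between us and the backend (or our own client timeout) cut the
    request short or returned an error status such as 502/504.
    """
    # Give the client more patience than the delay, so that a failure
    # points at something in between rather than at this script.
    client_timeout = client_timeout or delay + 30
    start = time.monotonic()
    try:
        url = f"{base_url}/?delay={delay}"
        with urllib.request.urlopen(url, timeout=client_timeout) as resp:
            ok, detail = resp.status == 200, str(resp.status)
    except urllib.error.HTTPError as e:
        ok, detail = False, str(e.code)
    except (urllib.error.URLError, socket.timeout, ConnectionError) as e:
        ok, detail = False, type(e).__name__
    return ok, time.monotonic() - start, detail
```

Run by hand against the proxied URL of the slow backend, `probe_timeout(url, 590)` coming back with `ok=False` after roughly 60s would have caught this. Since a single probe can take ten minutes, it fits a manual or scheduled check better than CI.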
It would probably also be useful to record some characteristics of failing builds. Right now we only have success/failure, but because so many of our builds are of nonsense or stale repos that fail, and we don't get that many builds overall, the build success rate metric isn't particularly meaningful unless it sits at 0 for a sustained period of time.
Failure to push at the end of the build is super meaningful, though, as that always indicates a problem. We just don't have an indicator for it.
The recent failed requests did show up in this metric as 502 errors on PATCH to harbor, but that's not the easiest thing to spot in the noise.
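As a rough sketch of what a push-failure indicator could look like: count failures by the phase they happened in, and treat only some phases as alert-worthy. The phase names (`"clone"`, `"build"`, `"push"`), the `ALERT_PHASES` set, and `summarize_failures` are all hypothetical and not what we record today; the assumption is that each build can report whether it succeeded and the last phase it reached.

```python
from collections import Counter

# Hypothetical: phases where a failure points at our infrastructure
# rather than at the user's repo.
ALERT_PHASES = {"push"}


def summarize_failures(builds):
    """Split failed builds by the phase they failed in.

    `builds` is an iterable of (succeeded, last_phase) pairs.
    Returns (failures_by_phase, alert_failures), where alert_failures
    holds only the phases listed in ALERT_PHASES.
    """
    by_phase = Counter(phase for succeeded, phase in builds if not succeeded)
    alerts = {phase: n for phase, n in by_phase.items() if phase in ALERT_PHASES}
    return by_phase, alerts


builds = [
    (False, "clone"),  # stale repo, expected noise
    (False, "build"),  # broken environment file, expected noise
    (False, "push"),   # registry rejected or timed out: always a problem
    (True, "push"),
]
by_phase, alerts = summarize_failures(builds)
```

With a split like this, a nonzero push-failure count stands out on its own instead of being buried in the overall success rate.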