OOM Detection & Confidence-Gated Sizing
Overview
CI Sizer v0.7.0 introduces confidence-gated sizing — a system that adapts recommendation aggressiveness based on how much data is available for a given workflow/job. Combined with OOM detection, the sizer can automatically recover from memory exhaustion events by applying exponential backoff and notifying the source forge via commit status.
Confidence Phases
Every workflow/job combination progresses through three confidence phases as the sizer accumulates clean (non-OOM) samples:
| Phase | Condition | Behaviour |
|---|---|---|
| unknown | 0 clean samples | Returns a bootstrap default of 4Gi memory. API responds with HTTP 200 and meta.confidence_phase == "unknown". |
| learning | 1–2 clean samples | Applies 3× headroom above observed peak. Conservative to avoid OOMs while data is sparse. |
| confident | ≥3 clean samples | Uses the tight staircase buffer (20%/10%/5%). Full algorithm precision. |
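The thresholds in the table can be read as a simple mapping from clean sample count to phase. The sketch below is illustrative only: the names `Phase` and `phaseFor` are hypothetical and are not taken from internal/receiver/sizing/confidence.go.

```go
package main

// Phase is the confidence phase for one workflow/job pair.
// (Hypothetical type for this sketch.)
type Phase string

const (
	PhaseUnknown   Phase = "unknown"
	PhaseLearning  Phase = "learning"
	PhaseConfident Phase = "confident"
)

// phaseFor maps a clean (non-OOM) sample count to a phase using the
// documented thresholds: 0 samples, 1-2 samples, and 3 or more samples.
func phaseFor(cleanSamples int) Phase {
	switch {
	case cleanSamples <= 0:
		return PhaseUnknown
	case cleanSamples < 3:
		return PhaseLearning
	default:
		return PhaseConfident
	}
}
```

For example, `phaseFor(2)` yields `learning` and `phaseFor(3)` yields `confident`.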
Client Note
The bootstrap phase (0 samples) now returns HTTP 200 instead of 404. Clients should check meta.confidence_phase to distinguish bootstrap defaults from data-driven recommendations.
OOM Detection
Cgroup v2 Detection
The collector sidecar reads the cgroup v2 memory.events file and monitors the oom_kill counter. When the counter increments during a run, the sample is marked as an OOM event.
Source: internal/cgroup/oom.go
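As a rough illustration of the counter check, the sketch below parses the contents of a memory.events file (one `key value` pair per line) and compares the oom_kill value before and after a run. The function names are hypothetical, and treating a missing oom_kill key as zero is an assumption of this sketch rather than documented collector behaviour.

```go
package main

import (
	"strconv"
	"strings"
)

// parseOOMKill extracts the oom_kill counter from the text of a cgroup v2
// memory.events file. Each line has the form "key value".
func parseOOMKill(events string) (uint64, error) {
	for _, line := range strings.Split(events, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == "oom_kill" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	// Assumption: a file without an oom_kill line counts as zero kills.
	return 0, nil
}

// oomDuringRun reports whether the counter increased between a reading
// taken at the start of a run and one taken at the end.
func oomDuringRun(before, after uint64) bool {
	return after > before
}
```

A caller would read the file once when the run starts and once when it ends, then pass both counter values to `oomDuringRun`.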
Heuristic Detection
For environments where the oom_kill counter is not available (e.g., cgroup v1), the sizer applies a heuristic: if the observed peak memory reaches ≥95% of the configured limit, the sample is marked as OOM-suspect. OOM-suspect samples are excluded from the clean sample count used for confidence phase progression.
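The heuristic and its effect on the clean sample count can be sketched as follows. The `Sample` type and both function names are hypothetical, and ignoring samples with no configured limit is an assumption made for the example.

```go
package main

// Sample is one observed run. (Hypothetical type for this sketch.)
type Sample struct {
	PeakBytes  uint64
	LimitBytes uint64
	OOMKilled  bool // set when cgroup v2 detection saw oom_kill increment
}

// isOOMSuspect reports whether peak usage reached at least 95% of the limit.
// Assumption: a zero limit means "no limit configured" and never matches.
func isOOMSuspect(peakBytes, limitBytes uint64) bool {
	if limitBytes == 0 {
		return false
	}
	// Integer comparison of peak/limit >= 95/100; realistic byte counts
	// are far too small to overflow uint64 here.
	return peakBytes*100 >= limitBytes*95
}

// countClean counts samples that are neither confirmed OOMs nor OOM-suspect.
// Only these samples advance the confidence phase.
func countClean(samples []Sample) int {
	n := 0
	for _, s := range samples {
		if s.OOMKilled || isOOMSuspect(s.PeakBytes, s.LimitBytes) {
			continue
		}
		n++
	}
	return n
}
```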
Exponential Backoff
When consecutive OOMs are detected for a workflow/job, the sizer applies exponential backoff to the memory limit:
new_limit = current_limit × 2^consecutiveOOMs
The backoff is capped at the node ceiling to prevent unbounded growth.
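A minimal sketch of the formula with the cap applied, assuming all sizes are expressed in bytes. The function name `backoffLimit` and the constant `gi` are hypothetical; the doubling loop is one way to evaluate the power of two without overflowing for large OOM counts.

```go
package main

// gi is one gibibyte in bytes, used to keep the example values readable.
const gi uint64 = 1 << 30

// backoffLimit computes current × 2^consecutiveOOMs and caps the result
// at the node ceiling.
func backoffLimit(current, ceiling uint64, consecutiveOOMs int) uint64 {
	limit := current
	for i := 0; i < consecutiveOOMs; i++ {
		// If doubling would exceed the ceiling, stop early. This also
		// prevents integer overflow when the OOM count is large.
		if limit > ceiling/2 {
			return ceiling
		}
		limit *= 2
	}
	if limit > ceiling {
		return ceiling
	}
	return limit
}
```

With a 4Gi limit and a 28Gi ceiling, one OOM gives 8Gi, two give 16Gi, and three would give 32Gi, which is capped to 28Gi.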
Node Ceiling
The maximum memory allocation is bounded by the node ceiling, which is determined by:
- Auto-detection: reads /proc/meminfo and uses 90% of total node RAM
- Manual override: configurable via the --max-memory flag
Similarly, --max-cpu caps the maximum CPU allocation.
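The auto-detection step might look like the sketch below, which takes the text of /proc/meminfo and returns 90% of MemTotal in bytes. The function names are hypothetical, and having a non-zero override take precedence over the detected value is an assumption about how the two sources combine.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nodeCeilingBytes derives the memory ceiling from /proc/meminfo contents:
// 90% of MemTotal, which the kernel reports in kB (units of 1024 bytes).
func nodeCeilingBytes(meminfo string) (uint64, error) {
	for _, line := range strings.Split(meminfo, "\n") {
		fields := strings.Fields(line)
		// Expected shape: "MemTotal:       16384000 kB"
		if len(fields) >= 2 && fields[0] == "MemTotal:" {
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, fmt.Errorf("parse MemTotal: %w", err)
			}
			// Divide before multiplying to keep the intermediate small.
			return kb * 1024 / 10 * 9, nil
		}
	}
	return 0, fmt.Errorf("MemTotal not found in meminfo")
}

// effectiveCeiling prefers a manual override (as supplied via --max-memory)
// when one is set; otherwise it falls back to the detected value.
func effectiveCeiling(detected, override uint64) uint64 {
	if override > 0 {
		return override
	}
	return detected
}
```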
Commit Status Notifications
When an OOM event is detected, the receiver posts a commit status notification to the source forge, alerting developers that their CI run was killed due to memory exhaustion.
Forgejo / GitHub
POST /api/v1/repos/{owner}/{repo}/statuses/{sha}
GitLab
POST /api/v4/projects/{id}/statuses/{sha}
Authentication is via the PRIVATE-TOKEN header for GitLab or Bearer token for Forgejo/GitHub.
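To make the difference between the two endpoints concrete, the sketch below builds (but does not send) a status request for either forge type. It is not the code in internal/receiver/notify/notify.go: the function name, the `forge` selector string, the status context "ci-sizer/oom", and the JSON body fields are all assumptions chosen for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// newStatusRequest builds a commit status request without sending it.
// forge is "gitlab" for GitLab; any other value selects the Forgejo-style
// endpoint. repo is "owner/repo" for Forgejo and the project ID for GitLab.
func newStatusRequest(forge, baseURL, repo, sha, token string) (*http.Request, error) {
	base := strings.TrimRight(baseURL, "/")
	var endpoint, state string
	if forge == "gitlab" {
		endpoint = fmt.Sprintf("%s/api/v4/projects/%s/statuses/%s", base, url.PathEscape(repo), sha)
		state = "failed" // assumed GitLab state name
	} else {
		endpoint = fmt.Sprintf("%s/api/v1/repos/%s/statuses/%s", base, repo, sha)
		state = "failure" // assumed Forgejo state name
	}

	// Assumed payload shape; the context name is hypothetical.
	body, err := json.Marshal(map[string]string{
		"state":       state,
		"context":     "ci-sizer/oom",
		"description": "CI run was killed due to memory exhaustion",
	})
	if err != nil {
		return nil, err
	}

	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if forge == "gitlab" {
		req.Header.Set("PRIVATE-TOKEN", token)
	} else {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	return req, nil
}
```

The returned request can be passed to an `http.Client` with a timeout; keeping construction separate from sending makes the URL and header logic easy to check in isolation.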
Configuration
| Flag | Environment Variable | Description | Default |
|---|---|---|---|
| --notify-enabled | RECEIVER_NOTIFY_ENABLED | Enable commit status notifications | true |
| --notify-base-url | RECEIVER_NOTIFY_BASE_URL | Forge base URL (auto-detected from push metadata if unset) | (none) |
| --notify-token | RECEIVER_NOTIFY_TOKEN | API token for forge commit status API | (none) |
In most deployments, only --notify-token is required. The base URL and node ceiling are auto-detected.
Source: internal/receiver/notify/notify.go
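One way to wire flags and environment variables together is sketched below using the standard `flag` package. The `NotifyConfig` type and `loadNotifyConfig` function are hypothetical, and the precedence shown (environment variables supply defaults, explicit flags override them) is an assumption; the receiver's actual precedence is not stated here.

```go
package main

import (
	"flag"
	"strconv"
)

// NotifyConfig holds the three notification settings.
// (Hypothetical type for this sketch.)
type NotifyConfig struct {
	Enabled bool
	BaseURL string
	Token   string
}

// loadNotifyConfig resolves settings from args and the environment.
// getenv is injected so the function can be exercised without touching the
// real environment; a real program would pass os.Getenv.
func loadNotifyConfig(args []string, getenv func(string) string) (NotifyConfig, error) {
	cfg := NotifyConfig{Enabled: true} // documented default
	if v := getenv("RECEIVER_NOTIFY_ENABLED"); v != "" {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return cfg, err
		}
		cfg.Enabled = b
	}

	fs := flag.NewFlagSet("receiver", flag.ContinueOnError)
	// Environment values act as flag defaults, so explicit flags win.
	fs.BoolVar(&cfg.Enabled, "notify-enabled", cfg.Enabled, "enable commit status notifications")
	fs.StringVar(&cfg.BaseURL, "notify-base-url", getenv("RECEIVER_NOTIFY_BASE_URL"), "forge base URL")
	fs.StringVar(&cfg.Token, "notify-token", getenv("RECEIVER_NOTIFY_TOKEN"), "forge API token")
	err := fs.Parse(args)
	return cfg, err
}
```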
Web UI Indicators
The web dashboard surfaces OOM information through several visual elements:
- Confidence badges — displayed per workflow/job showing the current phase (unknown, learning, confident)
- OOM banners — warning banners on affected workflow/job pages
- Red markers on charts — individual OOM’d runs are highlighted with red markers on the timeline chart
Sizing Response
When OOM detection is active, the sizing API response includes additional fields in the meta block:
{
"meta": {
"confidence_phase": "learning",
"clean_samples": 2,
"consecutive_ooms": 1,
"node_ceiling_memory": "28Gi",
"node_ceiling_cpu": "14"
}
}
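A client could decode the meta block along these lines. The struct and function names are hypothetical; only the JSON field names come from the response shown above, and the ceiling values are kept as strings because that is how they appear in the example.

```go
package main

import "encoding/json"

// Meta mirrors the fields of the meta block in the sizing response.
type Meta struct {
	ConfidencePhase   string `json:"confidence_phase"`
	CleanSamples      int    `json:"clean_samples"`
	ConsecutiveOOMs   int    `json:"consecutive_ooms"`
	NodeCeilingMemory string `json:"node_ceiling_memory"`
	NodeCeilingCPU    string `json:"node_ceiling_cpu"`
}

// SizingResponse holds only the part of the response this sketch needs.
type SizingResponse struct {
	Meta Meta `json:"meta"`
}

// decodeSizing parses a sizing API response body.
func decodeSizing(body string) (SizingResponse, error) {
	var resp SizingResponse
	err := json.Unmarshal([]byte(body), &resp)
	return resp, err
}

// isBootstrapDefault reports whether the recommendation is the bootstrap
// default rather than a data-driven value.
func isBootstrapDefault(m Meta) bool {
	return m.ConfidencePhase == "unknown"
}
```

Since the bootstrap phase returns HTTP 200, checking `isBootstrapDefault` is how a client would tell a default apart from a recommendation based on real samples.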
Source Files
| File | Purpose |
|---|---|
| internal/cgroup/oom.go | Cgroup v2 OOM detection via memory.events |
| internal/receiver/sizing/confidence.go | Confidence phase logic and phase transitions |
| internal/receiver/notify/notify.go | Commit status notification dispatch |