OOM Detection & Confidence-Gated Sizing

How CI Sizer detects out-of-memory events and adapts sizing recommendations through confidence phases.

Overview

CI Sizer v0.7.0 introduces confidence-gated sizing — a system that adapts recommendation aggressiveness based on how much data is available for a given workflow/job. Combined with OOM detection, the sizer can automatically recover from memory exhaustion events by applying exponential backoff and notifying the source forge via commit status.

Confidence Phases

Every workflow/job combination progresses through three confidence phases as the sizer accumulates clean (non-OOM) samples:

| Phase | Condition | Behaviour |
|---|---|---|
| unknown | 0 clean samples | Returns a bootstrap default of 4Gi memory. API responds with HTTP 200 and meta.confidence_phase == "unknown". |
| learning | 1–2 clean samples | Applies 3× headroom above observed peak. Conservative to avoid OOMs while data is sparse. |
| confident | ≥3 clean samples | Uses the tight staircase buffer (20%/10%/5%). Full algorithm precision. |
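
The phase thresholds in the table above can be sketched as a small Go function. This is an illustration only, not the contents of internal/receiver/sizing/confidence.go; the names phaseFor and confidentThreshold are hypothetical.

```go
package main

// Phase names mirror the values listed in the table above.
const (
	PhaseUnknown   = "unknown"
	PhaseLearning  = "learning"
	PhaseConfident = "confident"
)

// confidentThreshold is the number of clean samples at which the
// confident phase begins (3, per the table above).
const confidentThreshold = 3

// phaseFor maps a count of clean (non-OOM) samples to a confidence phase.
// Hypothetical helper: the real implementation may differ.
func phaseFor(cleanSamples int) string {
	switch {
	case cleanSamples <= 0:
		return PhaseUnknown
	case cleanSamples < confidentThreshold:
		return PhaseLearning
	default:
		return PhaseConfident
	}
}
```

Under this sketch, a job with two clean samples is still in the learning phase and moves to confident on its third clean sample.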

OOM Detection

Cgroup v2 Detection

The collector sidecar reads the cgroup v2 memory.events file and monitors the oom_kill counter. When the counter increments during a run, the sample is marked as an OOM event.

Source: internal/cgroup/oom.go
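
A minimal sketch of the counter check, assuming the sidecar takes one reading of memory.events at the start of a run and another at the end. It is not taken from internal/cgroup/oom.go; parseOOMKill and oomDuringRun are hypothetical names. The memory.events file holds one "key value" pair per line.

```go
package main

import (
	"errors"
	"strconv"
	"strings"
)

// parseOOMKill extracts the oom_kill counter from the text of a cgroup v2
// memory.events file. The key is matched exactly so that the separate
// "oom" line is not mistaken for it.
func parseOOMKill(contents string) (uint64, error) {
	for _, line := range strings.Split(contents, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == "oom_kill" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, errors.New("oom_kill counter not found")
}

// oomDuringRun reports whether the counter increased between the reading
// taken before the run and the reading taken after it.
func oomDuringRun(before, after uint64) bool {
	return after > before
}
```

Passing the file contents as a string keeps the parsing logic independent of where the cgroup is mounted.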

Heuristic Detection

For environments where the oom_kill counter is not available (e.g., cgroup v1), the sizer falls back to a heuristic: if the observed peak memory is at least 95% of the configured limit, the sample is marked as OOM-suspect. OOM-suspect samples do not count toward the clean sample total that drives confidence phase progression.
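
The heuristic can be sketched as follows. The sample type and both function names are hypothetical, and treating a missing limit as "not suspect" is an assumption of this sketch rather than documented behaviour.

```go
package main

// sample is a hypothetical record of one CI run.
type sample struct {
	PeakBytes  uint64 // observed peak memory
	LimitBytes uint64 // configured memory limit (0 = none configured)
	OOMKilled  bool   // set when the oom_kill counter incremented
}

// isOOMSuspect reports whether the peak is at least 95% of the limit.
// Integer arithmetic avoids floating-point rounding at the boundary.
func isOOMSuspect(peakBytes, limitBytes uint64) bool {
	if limitBytes == 0 {
		return false // assumption: nothing to compare against
	}
	return peakBytes*100 >= limitBytes*95
}

// countClean counts samples that are neither confirmed OOMs nor
// OOM-suspect; only these advance the confidence phase.
func countClean(samples []sample) int {
	clean := 0
	for _, s := range samples {
		if !s.OOMKilled && !isOOMSuspect(s.PeakBytes, s.LimitBytes) {
			clean++
		}
	}
	return clean
}
```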

Exponential Backoff

When consecutive OOMs are detected for a workflow/job, the sizer applies exponential backoff to the memory limit:

new_limit = current_limit × 2^consecutiveOOMs

The backoff is capped at the node ceiling to prevent unbounded growth.
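
A minimal sketch of the capped backoff, assuming current_limit is the limit in effect before any backoff was applied. The function name backoffLimit is hypothetical.

```go
package main

// GiB is one gibibyte in bytes.
const GiB uint64 = 1 << 30

// backoffLimit returns current × 2^consecutiveOOMs, capped at ceiling.
// It doubles one step at a time and stops as soon as the next doubling
// would exceed the ceiling, so the multiplication cannot overflow.
func backoffLimit(current uint64, consecutiveOOMs int, ceiling uint64) uint64 {
	limit := current
	for i := 0; i < consecutiveOOMs; i++ {
		if limit > ceiling/2 {
			return ceiling
		}
		limit *= 2
	}
	if limit > ceiling {
		return ceiling
	}
	return limit
}
```

With a 4Gi starting limit and a 28Gi ceiling, one OOM yields 8Gi, two yield 16Gi, and three are capped at 28Gi because 32Gi would exceed the ceiling.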

Node Ceiling

The maximum memory allocation is bounded by the node ceiling, which is determined by:

  1. Auto-detection — reads /proc/meminfo and uses 90% of total node RAM
  2. Manual override — configurable via --max-memory flag

Similarly, --max-cpu caps the maximum CPU allocation.
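
The memory ceiling calculation can be sketched like this. The function names are hypothetical, and the sketch assumes a non-zero manual override takes precedence over auto-detection. The MemTotal value in /proc/meminfo is labelled kB but counts units of 1024 bytes.

```go
package main

import (
	"errors"
	"strconv"
	"strings"
)

// memTotalBytes extracts MemTotal from the text of /proc/meminfo and
// converts it to bytes. The expected line shape is "MemTotal:  16384000 kB".
func memTotalBytes(meminfo string) (uint64, error) {
	for _, line := range strings.Split(meminfo, "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[0] == "MemTotal:" {
			kib, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			return kib * 1024, nil
		}
	}
	return 0, errors.New("MemTotal not found in meminfo")
}

// nodeCeiling returns the manual override when one is set (assumed to
// win over auto-detection), otherwise 90% of total node RAM.
func nodeCeiling(totalBytes, overrideBytes uint64) uint64 {
	if overrideBytes > 0 {
		return overrideBytes
	}
	return totalBytes / 10 * 9
}
```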

Commit Status Notifications

When an OOM event is detected, the receiver posts a commit status notification to the source forge, alerting developers that their CI run was killed due to memory exhaustion.

Forgejo / GitHub

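POST /api/v1/repos/{owner}/{repo}/statuses/{sha}

The path above is the Forgejo form. GitHub's own REST API on api.github.com serves the same endpoint without the /api/v1 prefix: POST /repos/{owner}/{repo}/statuses/{sha}.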

GitLab

POST /api/v4/projects/{id}/statuses/{sha}

Authentication uses the PRIVATE-TOKEN header for GitLab and an Authorization: Bearer token for Forgejo/GitHub.
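
The sketch below builds, without sending, the request for either forge family using the paths and headers listed above. It is not the code in internal/receiver/notify/notify.go. The statusTarget type, the newStatusRequest function, the "ci-sizer/oom" context label, and the choice to report the failing state are all assumptions made for illustration. The state names do come from the forge APIs: GitLab calls it "failed", Forgejo and GitHub call it "failure".

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// statusTarget identifies the commit to annotate. The type and its
// field names are hypothetical.
type statusTarget struct {
	Forge     string // "gitlab" selects the GitLab API; anything else uses the Forgejo-style path
	BaseURL   string // forge base URL without a trailing slash
	Owner     string // repository owner (Forgejo-style path)
	Repo      string // repository name (Forgejo-style path)
	ProjectID string // project ID (GitLab path)
	SHA       string // commit to attach the status to
}

// newStatusRequest builds the POST request that reports a failing
// commit status. The caller is responsible for sending it.
func newStatusRequest(t statusTarget, token, description string) (*http.Request, error) {
	isGitLab := t.Forge == "gitlab"

	endpoint := fmt.Sprintf("%s/api/v1/repos/%s/%s/statuses/%s",
		t.BaseURL, url.PathEscape(t.Owner), url.PathEscape(t.Repo), url.PathEscape(t.SHA))
	state := "failure"
	if isGitLab {
		endpoint = fmt.Sprintf("%s/api/v4/projects/%s/statuses/%s",
			t.BaseURL, url.PathEscape(t.ProjectID), url.PathEscape(t.SHA))
		state = "failed"
	}

	payload, err := json.Marshal(map[string]string{
		"state":       state,
		"description": description,
		"context":     "ci-sizer/oom", // hypothetical status label
	})
	if err != nil {
		return nil, err
	}

	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if isGitLab {
		req.Header.Set("PRIVATE-TOKEN", token)
	} else {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	return req, nil
}
```

Separating request construction from sending makes the forge-specific differences easy to check without network access; the resulting request can be passed to an http.Client as usual.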

Configuration

| Flag | Environment Variable | Description | Default |
|---|---|---|---|
| --notify-enabled | RECEIVER_NOTIFY_ENABLED | Enable commit status notifications | true |
| --notify-base-url | RECEIVER_NOTIFY_BASE_URL | Forge base URL (auto-detected from push metadata if unset) | |
| --notify-token | RECEIVER_NOTIFY_TOKEN | API token for forge commit status API | |

In most deployments, only --notify-token is required. The base URL and node ceiling are auto-detected.

Source: internal/receiver/notify/notify.go

Web UI Indicators

The web dashboard surfaces OOM information through several visual elements:

  • Confidence badges — displayed per workflow/job showing the current phase (unknown, learning, confident)
  • OOM banners — warning banners on affected workflow/job pages
  • Red markers on charts — individual OOM’d runs are highlighted with red markers on the timeline chart

Sizing Response

When OOM detection is active, the sizing API response includes additional fields in the meta block:

{
  "meta": {
    "confidence_phase": "learning",
    "clean_samples": 2,
    "consecutive_ooms": 1,
    "node_ceiling_memory": "28Gi",
    "node_ceiling_cpu": "14"
  }
}
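
A client could decode these fields along the following lines. The struct and function names are hypothetical; only the JSON keys come from the response shown above, and any fields outside the meta block are ignored by this sketch.

```go
package main

import "encoding/json"

// sizingMeta mirrors the meta block of the sizing API response.
type sizingMeta struct {
	ConfidencePhase   string `json:"confidence_phase"`
	CleanSamples      int    `json:"clean_samples"`
	ConsecutiveOOMs   int    `json:"consecutive_ooms"`
	NodeCeilingMemory string `json:"node_ceiling_memory"` // Kubernetes-style quantity, e.g. "28Gi"
	NodeCeilingCPU    string `json:"node_ceiling_cpu"`    // sent as a string, e.g. "14"
}

// sizingResponse holds only the part of the response this sketch reads.
type sizingResponse struct {
	Meta sizingMeta `json:"meta"`
}

// decodeSizingResponse parses a raw response body into sizingResponse.
func decodeSizingResponse(body []byte) (sizingResponse, error) {
	var resp sizingResponse
	err := json.Unmarshal(body, &resp)
	return resp, err
}
```

Both ceiling fields are kept as strings because the API returns them quoted; converting them to numbers is left to the caller.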

Source Files

| File | Purpose |
|---|---|
| internal/cgroup/oom.go | Cgroup v2 OOM detection via memory.events |
| internal/receiver/sizing/confidence.go | Confidence phase logic and phase transitions |
| internal/receiver/notify/notify.go | Commit status notification dispatch |