Kube-Prometheus-Stack Alertmanager Configuration: Complete Guide to Kubernetes Alert Routing (2026)
Collecting metrics is only half of Kubernetes observability. The other half is knowing when something goes wrong and getting the right notification to the right person at the right time. That is exactly what Alertmanager does within the kube-prometheus-stack. It receives alerts fired by Prometheus, deduplicates them, groups related alerts together, routes them to the correct notification channel, and manages silences and inhibitions so your on-call team sees signal rather than noise.
This guide covers the complete Alertmanager configuration workflow inside kube-prometheus-stack: from Helm values.yaml routing trees and Slack webhook integration to PagerDuty escalations, custom PrometheusRule definitions, alert grouping strategies, and production hardening. Every example is working YAML that you can apply to your cluster once you replace the placeholder URLs, keys, and hostnames.
What is Alertmanager in Kube-Prometheus-Stack?
Alertmanager is the alert notification engine that ships as part of the kube-prometheus-stack. While Prometheus evaluates alert rules and determines when an alert should fire, Alertmanager determines what happens next -- who gets notified, through which channel, and how alerts are grouped and deduplicated before delivery.
When you install kube-prometheus-stack via the Helm chart, Alertmanager is deployed automatically as a StatefulSet managed by the Prometheus Operator. The chart renders the alertmanager.config section of your Helm values into a Kubernetes Secret. The operator combines that base configuration with any AlertmanagerConfig custom resources it selects and generates the native Alertmanager configuration file, and a config-reloader sidecar then triggers a reload without requiring a pod restart.
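For teams that prefer to manage routing as a Kubernetes object rather than in Helm values, an AlertmanagerConfig resource might look like the sketch below. The resource name, namespace, channel, and the Secret slack-webhook with key url are all hypothetical, and whether the operator picks the resource up depends on the alertmanagerConfigSelector settings in your chart version.

```yaml
# Sketch of a namespaced AlertmanagerConfig (all names are illustrative).
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: backend-routing            # hypothetical
  namespace: backend               # hypothetical
spec:
  route:
    receiver: slack-backend
    groupBy: ["alertname"]
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
  receivers:
    - name: slack-backend
      slackConfigs:
        - channel: "#backend-alerts"
          sendResolved: true
          # The webhook URL is read from a Secret in the same namespace.
          apiURL:
            name: slack-webhook    # hypothetical Secret
            key: url
```

By default the operator scopes such a resource to alerts whose namespace label matches the resource's own namespace, which keeps one team's routing from capturing another team's alerts.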
Out of the box, Alertmanager is configured with a single null receiver -- it accepts alerts from Prometheus but does not deliver them anywhere. This is intentional: the stack gives you working metric collection and alerting rules immediately, but notification routing is something every organization must customize to match their own on-call workflows, escalation policies, and communication tools.
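As a rough illustration, the chart's default Alertmanager configuration resembles the fragment below. This is a simplified sketch, and exact defaults differ between chart versions, so check the output of helm show values prometheus-community/kube-prometheus-stack for the content that applies to you.

```yaml
# Simplified sketch of the default: every alert ends up at a receiver
# that has no notification integrations configured.
alertmanager:
  config:
    route:
      receiver: "null"
      group_by: ["namespace"]
      routes:
        - receiver: "null"
          matchers:
            - alertname = "Watchdog"
    receivers:
      - name: "null"     # no *_configs entries, so nothing is delivered
```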
Alertmanager provides five core capabilities:
- Deduplication -- When Prometheus runs in high-availability mode (two or more replicas), each instance fires the same alert independently. Alertmanager treats alerts with identical label sets as a single alert, and when Alertmanager itself runs as a cluster its replicas coordinate so that you receive only one notification per unique alert.
- Grouping -- Related alerts are batched into a single notification. For example, if 50 pods in the same namespace start crash-looping simultaneously, Alertmanager groups them into one Slack message instead of flooding the channel with 50 separate messages.
- Routing -- A routing tree directs alerts to different receivers based on label matchers. Critical infrastructure alerts go to PagerDuty, warning-level application alerts go to Slack, and informational alerts go to email -- all configurable through label-based matching.
- Inhibition -- Rules that suppress lower-priority alerts when a higher-priority alert is already active. If the entire node is down, you do not need separate alerts for every pod on that node.
- Silencing -- Time-bound rules that mute specific alerts during planned maintenance or known outages. Silences can be created through the Alertmanager UI or API.
How Alertmanager Works with Prometheus
Understanding the alert pipeline is essential before configuring anything. Here is how alerts flow through the system:
Step 1: Prometheus evaluates rules. Prometheus scrapes metrics from your cluster and evaluates the rules from your PrometheusRule resources at a regular interval (30 seconds by default under the Prometheus Operator). When a rule expression returns results, the alert transitions to pending. After the for duration elapses with the condition still true, the alert becomes firing.
Step 2: Prometheus sends alerts to Alertmanager. Prometheus pushes firing (and resolved) alerts to Alertmanager via HTTP POST to the /api/v2/alerts endpoint. In kube-prometheus-stack, this connection is preconfigured -- Prometheus knows Alertmanager's service address automatically.
Step 3: Alertmanager processes alerts. Alertmanager receives the alert, deduplicates it against existing alerts, and passes it through the routing tree. The routing tree matches alert labels against configured matchers and selects the appropriate receiver.
Step 4: Grouping and batching. Before sending a notification, Alertmanager groups alerts by the configured group_by labels. It then waits for the group_wait period to collect related alerts before sending the first notification. Subsequent alerts in the same group are batched and sent after the group_interval.
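To make the timing concrete, the fragment below annotates the three timers with an approximate timeline for a single alert group. The receiver name and durations are example values, and the timestamps are approximations rather than guarantees.

```yaml
route:
  receiver: "default-slack"
  group_by: ["alertname", "namespace"]
  # t = 0s     first alert of a new group arrives
  # t ≈ 30s    first notification is sent (group_wait)
  group_wait: 30s
  # t ≈ 5m30s  alerts that joined the group after the first
  #            notification are sent together in one update (group_interval)
  group_interval: 5m
  # If the group does not change, the still-firing alerts are
  # re-sent roughly every 4 hours (repeat_interval)
  repeat_interval: 4h
```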
Step 5: Notification delivery. The selected receiver sends the notification through its configured channel -- Slack webhook, PagerDuty Events API, SMTP email, generic webhook, or any other supported integration. If delivery fails, Alertmanager retries according to its internal retry logic.
This pipeline means there are two separate configuration surfaces: PrometheusRule resources define what alerts fire and when, while Alertmanager configuration defines where alerts go and how they are grouped. Both are managed through your Helm values or Kubernetes custom resources. You can visualize alert state and history using the Alertmanager Grafana dashboard that ships with the stack.
Configuring Alert Routing in values.yaml
The routing tree is the heart of Alertmanager configuration. It determines which receiver handles each alert based on label matching. Here is a production-grade routing configuration in your kube-prometheus-stack Helm values:
    alertmanager:
      enabled: true
      config:
        global:
          resolve_timeout: 5m
          slack_api_url: "https://hooks.slack.com/services/T00/B00/XXXX"
        route:
          receiver: "default-slack"
          group_by: ["alertname", "namespace", "job"]
          group_wait: 30s
          group_interval: 5m
          repeat_interval: 4h
          routes:
            # Critical alerts → PagerDuty (immediate page)
            - receiver: "pagerduty-critical"
              matchers:
                - severity = critical
              group_wait: 10s
              repeat_interval: 1h
              continue: true
            # Critical alerts also go to Slack; this route is reached only
            # because the PagerDuty route above sets continue: true
            - receiver: "default-slack"
              matchers:
                - severity = critical
            # Warning alerts → dedicated Slack channel
            - receiver: "slack-warnings"
              matchers:
                - severity = warning
              group_wait: 30s
              repeat_interval: 6h
            # Watchdog heartbeat → dead man's switch
            - receiver: "deadmans-switch"
              matchers:
                - alertname = Watchdog
              repeat_interval: 1m
        receivers:
          - name: "default-slack"
            slack_configs:
              - channel: "#k8s-alerts"
                send_resolved: true
          - name: "pagerduty-critical"
            pagerduty_configs:
              - routing_key: "<YOUR-PAGERDUTY-INTEGRATION-KEY>"
                severity: "critical"
          - name: "slack-warnings"
            slack_configs:
              - channel: "#k8s-warnings"
                send_resolved: true
          - name: "deadmans-switch"
            webhook_configs:
              - url: "https://nosnch.in/XXXXXXX"
Let us break down the key routing parameters:
- group_by -- Labels used to aggregate alerts into groups. Grouping by ["alertname", "namespace", "job"] means all alerts with the same name, namespace, and job are batched into one notification. Choose labels that produce meaningful groups without over-aggregating.
- group_wait -- How long to wait after the first alert in a new group before sending the notification. A 30-second wait allows related alerts to arrive and be grouped together rather than sending each one individually.
- group_interval -- After the initial notification is sent, how long to wait before sending updates about new alerts added to the same group. Five minutes is a common setting that limits notification fatigue.
- repeat_interval -- How long to wait before re-sending a notification for an alert that is still firing and has not changed. For critical alerts, 1 hour ensures the issue stays visible; for warnings, 4-6 hours is typically appropriate.
- continue -- When set to true, matching does not stop at this route -- the alert is also evaluated against the sibling routes that follow. This lets you send critical alerts to both PagerDuty and Slack, provided a later route that delivers to Slack also matches them.
The routing tree is evaluated top-down. The first matching child route handles the alert (unless continue: true is set). If no child route matches, the alert falls through to the parent route's receiver. Always define a sensible default receiver at the root level to catch unmatched alerts.
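The fall-through behavior is easiest to see with nested routes. The sketch below assumes alerts carry a team label, and the receiver names are hypothetical; each one would need a matching entry under receivers.

```yaml
route:
  receiver: "default-slack"              # used when no child route matches
  routes:
    - receiver: "slack-platform"         # hypothetical receiver
      matchers:
        - team = platform
      routes:
        # Evaluated only for alerts that already matched team = platform
        - receiver: "pagerduty-platform" # hypothetical receiver
          matchers:
            - severity = critical
    - receiver: "slack-backend"          # hypothetical receiver
      matchers:
        - team = backend
```

Under this tree, an alert labeled team=platform and severity=critical is handled by pagerduty-platform, a team=platform warning stays with slack-platform, and an alert with no team label falls back to default-slack.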
Slack Integration Step-by-Step
Slack is the most common notification target for Kubernetes alerts. Here is how to set up a production-quality Slack integration with rich message formatting:
Step 1: Create a Slack Incoming Webhook. In your Slack workspace, go to Settings & Administration → Manage Apps → Incoming Webhooks (or create a Slack App with incoming webhook permissions). Select the channel and copy the webhook URL.
Step 2: Store the webhook URL securely. Never put webhook URLs directly in values.yaml files that are committed to Git. Instead, create a Kubernetes Secret:
    apiVersion: v1
    kind: Secret
    metadata:
      name: alertmanager-slack-webhook
      namespace: monitoring
    type: Opaque
    stringData:
      slack-webhook-url: "https://hooks.slack.com/services/T00/B00/XXXX"
Step 3: Configure the Slack receiver with rich templates. The default Slack message format is basic. Use Go templates to create informative, actionable messages:
    alertmanager:
      alertmanagerSpec:
        # Mounts the Secret from Step 2 under
        # /etc/alertmanager/secrets/alertmanager-slack-webhook/
        secrets:
          - alertmanager-slack-webhook
      config:
        receivers:
          - name: "slack-critical"
            slack_configs:
              # Read the webhook URL from the mounted Secret instead of inlining it
              - api_url_file: /etc/alertmanager/secrets/alertmanager-slack-webhook/slack-webhook-url
                channel: "#k8s-critical-alerts"
                send_resolved: true
                color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
                title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
                text: |
                  *Cluster:* {{ .CommonLabels.cluster }}
                  *Namespace:* {{ .CommonLabels.namespace }}
                  *Severity:* {{ .CommonLabels.severity }}
                  {{ range .Alerts }}
                  ---
                  *Alert:* {{ .Labels.alertname }}
                  *Description:* {{ .Annotations.description }}
                  *Runbook:* {{ .Annotations.runbook_url }}
                  *Started:* {{ .StartsAt.Format "2006-01-02 15:04:05 UTC" }}
                  {{ end }}
                actions:
                  - type: button
                    text: "View in Alertmanager"
                    url: '{{ template "__alertmanagerURL" . }}'
                  # __alertmanagerURL already ends in an alerts query string,
                  # so the silence link is built from .ExternalURL instead
                  - type: button
                    text: "Silence Alert"
                    url: '{{ .ExternalURL }}/#/silences/new'
This template includes the alert name, cluster, namespace, severity, description, runbook link, and start time for each alert in the group. The action buttons link directly to the Alertmanager UI for quick silencing. Setting send_resolved: true ensures your team gets a clear "resolved" notification when the alert clears, reducing confusion about whether an issue is still active.
Step 4: Apply the configuration. Run helm upgrade to push the new Alertmanager configuration to your cluster:
    helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
      -n monitoring \
      -f values.yaml \
      --reuse-values
PagerDuty and Email Configuration
For critical production alerts, Slack alone is not sufficient. PagerDuty provides escalation policies, on-call schedules, and phone/SMS delivery that ensure critical alerts actually wake someone up.
PagerDuty Integration
Create a PagerDuty service integration using the Events API v2 integration type. Copy the integration key (also called routing key) and configure the receiver:
    alertmanager:
      config:
        receivers:
          - name: "pagerduty-critical"
            pagerduty_configs:
              - routing_key: "<YOUR-PAGERDUTY-INTEGRATION-KEY>"
                severity: '{{ if eq (index .Alerts 0).Labels.severity "critical" }}critical{{ else }}warning{{ end }}'
                description: '{{ .CommonAnnotations.description }}'
                details:
                  cluster: '{{ .CommonLabels.cluster }}'
                  namespace: '{{ .CommonLabels.namespace }}'
                  firing: '{{ .Alerts.Firing | len }}'
                  resolved: '{{ .Alerts.Resolved | len }}'
The severity field is passed to PagerDuty with each event. If the PagerDuty service is configured to derive urgency from event severity, critical triggers high-urgency incidents that page immediately, while warning creates low-urgency incidents that follow your PagerDuty notification rules.
Email (SMTP) Configuration
Email is useful for lower-priority alerts, compliance notifications, or environments where chat tools are not available. Configure the SMTP settings globally and add email receivers:
    alertmanager:
      config:
        global:
          smtp_smarthost: "smtp.company.com:587"
          smtp_from: "alertmanager@company.com"
          smtp_auth_username: "alertmanager@company.com"
          smtp_auth_password: "<SMTP-PASSWORD>"
          smtp_require_tls: true
        receivers:
          - name: "email-infra-team"
            email_configs:
              - to: "infra-team@company.com"
                send_resolved: true
                headers:
                  Subject: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }} - Kubernetes Cluster'
For production environments, always store SMTP credentials in a Kubernetes Secret rather than plaintext in values.yaml. Reference the secret via alertmanager.alertmanagerSpec.secrets in your Helm values to mount it into the Alertmanager pod.
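One way to keep the SMTP password out of values.yaml is sketched below. It assumes a Secret named alertmanager-smtp with a key named password already exists in the monitoring namespace (both names are hypothetical) and that your Alertmanager version supports smtp_auth_password_file.

```yaml
alertmanager:
  alertmanagerSpec:
    # Mounted at /etc/alertmanager/secrets/alertmanager-smtp/
    secrets:
      - alertmanager-smtp              # hypothetical Secret name
  config:
    global:
      smtp_smarthost: "smtp.company.com:587"
      smtp_from: "alertmanager@company.com"
      smtp_auth_username: "alertmanager@company.com"
      # Read the password from the mounted file instead of inlining it
      smtp_auth_password_file: /etc/alertmanager/secrets/alertmanager-smtp/password
      smtp_require_tls: true
```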
Webhook Receiver (Generic)
For custom integrations -- such as an internal incident management system or any tool that Alertmanager has no native receiver for -- use the generic webhook receiver. (Opsgenie has its own opsgenie_configs receiver, and recent Alertmanager releases add msteams_configs for Microsoft Teams.) Alertmanager sends the full alert payload as a JSON POST to your endpoint:
    receivers:
      - name: "custom-webhook"
        webhook_configs:
          - url: "https://incident-api.company.com/alertmanager"
            send_resolved: true
            max_alerts: 10
            http_config:
              bearer_token: "<API-TOKEN>"
Alert Grouping, Inhibition, and Silencing
Effective alert management is not just about where alerts are sent -- it is about reducing noise so on-call engineers can focus on real problems. Grouping, inhibition, and silencing are the three mechanisms that transform a firehose of raw alerts into actionable notifications.
Alert Grouping Strategy
Grouping controls how Alertmanager batches related alerts into a single notification. The group_by parameter in the routing configuration determines which labels define a group:
    # Group by alert name and namespace (recommended default)
    group_by: ["alertname", "namespace"]

    # Disable aggregation: group by every label, so each distinct alert
    # gets its own notification
    group_by: ["..."]

    # Empty list on the root route: every alert lands in a single group
    # (one notification for everything)
    group_by: []

    # Group by cluster and severity for multi-cluster setups
    group_by: ["cluster", "severity", "alertname"]
The recommended default grouping is ["alertname", "namespace"]. This means all KubePodCrashLooping alerts in the production namespace get batched into one notification, while the same alert in the staging namespace gets a separate notification. Adding job to the group provides finer granularity when multiple services share a namespace.
Inhibition Rules
Inhibition rules suppress lower-severity alerts when a higher-severity alert is already active for the same scope. This prevents alert storms where a single root cause triggers dozens of dependent alerts:
    alertmanager:
      config:
        inhibit_rules:
          # Critical inhibits warning for the same alert
          - source_matchers:
              - severity = critical
            target_matchers:
              - severity = warning
            equal: ["alertname", "namespace"]
          # Node down inhibits warning/info alerts that carry the same node label
          - source_matchers:
              - alertname = KubeNodeNotReady
            target_matchers:
              - 'severity =~ "warning|info"'
            equal: ["node"]
          # Cluster unreachable inhibits all other alerts for that cluster
          - source_matchers:
              - alertname = KubeClusterUnreachable
            target_matchers:
              - 'severity =~ ".*"'
            equal: ["cluster"]
The equal field specifies which labels must match between the source (suppressing) and target (suppressed) alerts. The first rule above means: if a KubePodCrashLooping alert fires with severity=critical in namespace production, any KubePodCrashLooping alert with severity=warning in the same namespace is inhibited.
Silences
Silences are time-bound rules that mute specific alerts. Unlike inhibition rules (which are part of the configuration and apply automatically whenever a source alert is firing), silences are created on-demand through the Alertmanager UI or API and have an explicit expiration time.
Access the Alertmanager UI to create silences:
    kubectl port-forward -n monitoring \
      svc/kube-prometheus-stack-alertmanager 9093:9093

    # Open http://localhost:9093 in your browser
    # Navigate to Silences → New Silence
You can also create silences programmatically via the Alertmanager API, which is useful for integrating with deployment pipelines. For example, automatically silence alerts for a namespace during a rolling deployment:
    curl -X POST http://localhost:9093/api/v2/silences \
      -H "Content-Type: application/json" \
      -d '{
        "matchers": [
          { "name": "namespace", "value": "production", "isRegex": false },
          { "name": "severity", "value": "warning", "isRegex": false }
        ],
        "startsAt": "2026-03-05T10:00:00Z",
        "endsAt": "2026-03-05T11:00:00Z",
        "createdBy": "deploy-pipeline",
        "comment": "Silencing warnings during v2.5.0 rollout"
      }'
Creating Custom PrometheusRule Alerts
While kube-prometheus-stack ships with over 100 default alerts from the kubernetes-mixin project, production environments almost always need custom alerts specific to their applications and SLOs. PrometheusRule is the custom resource that defines alert rules for the Prometheus Operator. Make sure your applications expose metrics via ServiceMonitor before writing alert rules that depend on custom metrics.
PrometheusRule Structure
Every PrometheusRule must carry the labels that the Prometheus rule selector expects, or its rules will not be loaded. With the chart's default settings, Prometheus selects rules labeled release: <your Helm release name>, which is kube-prometheus-stack in these examples:
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: app-custom-alerts
      namespace: monitoring
      labels:
        release: kube-prometheus-stack
        app: kube-prometheus-stack
    spec:
      groups:
        - name: app.rules
          rules:
            # High error rate on API endpoints
            - alert: HighApiErrorRate
              expr: |
                sum(rate(http_requests_total{status=~"5..",job="api-server"}[5m]))
                /
                sum(rate(http_requests_total{job="api-server"}[5m]))
                > 0.05
              for: 5m
              labels:
                severity: critical
                team: platform
              annotations:
                summary: "API error rate exceeds 5%"
                description: "API server error rate is {{ $value | humanizePercentage }} over the last 5 minutes."
                runbook_url: "https://runbooks.company.com/api-high-error-rate"
            # Pod memory approaching limits
            - alert: PodMemoryNearLimit
              expr: |
                container_memory_working_set_bytes
                / container_spec_memory_limit_bytes
                > 0.9
                and container_spec_memory_limit_bytes > 0
              for: 10m
              labels:
                severity: warning
              annotations:
                summary: "Pod {{ $labels.pod }} memory usage above 90% of limit"
                description: "Container {{ $labels.container }} in pod {{ $labels.pod }} (namespace {{ $labels.namespace }}) is using {{ $value | humanizePercentage }} of its memory limit."
            # Persistent volume filling up
            - alert: PVCNearlyFull
              expr: |
                kubelet_volume_stats_used_bytes
                / kubelet_volume_stats_capacity_bytes
                > 0.85
              for: 15m
              labels:
                severity: warning
              annotations:
                summary: "PVC {{ $labels.persistentvolumeclaim }} is 85% full"
                description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is {{ $value | humanizePercentage }} full. Consider expanding the volume or cleaning up data."
            # SLO burn rate alert (multi-window); 0.001 is the error budget of a 99.9% SLO
            - alert: SLOBurnRateHigh
              expr: |
                (
                  sum(rate(http_requests_total{status=~"5..",job="api-server"}[1h]))
                  /
                  sum(rate(http_requests_total{job="api-server"}[1h]))
                ) > (14.4 * 0.001)
                and
                (
                  sum(rate(http_requests_total{status=~"5..",job="api-server"}[5m]))
                  /
                  sum(rate(http_requests_total{job="api-server"}[5m]))
                ) > (14.4 * 0.001)
              for: 2m
              labels:
                severity: critical
                slo: availability
              annotations:
                summary: "SLO burn rate is consuming error budget too fast"
                description: "The API server error budget burn rate exceeds 14.4x the target. At this rate, a 30-day error budget will be exhausted in about 50 hours."
Key points for well-designed alert rules:
- Always set a for duration -- This prevents transient spikes from triggering pages. Five minutes is a reasonable default for most alerts; use shorter durations (1-2 minutes) only for genuinely time-critical conditions like data loss.
- Include meaningful annotations -- The description should tell the on-call engineer what is happening and include template variables ({{ $value }}, {{ $labels.pod }}) that provide specific context. The runbook_url should link to a document explaining how to diagnose and resolve the issue.
- Use severity labels consistently -- Define clear severity levels across your organization: critical pages someone immediately, warning needs attention within hours, info is for awareness only. Your Alertmanager routing tree should match these levels to appropriate receivers.
- Add team labels for routing -- Custom labels like team: platform or team: backend allow you to build routing rules that send alerts to the team responsible for that service.
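The SLOBurnRateHigh thresholds come from burn-rate arithmetic: with a 99.9% SLO over 30 days (720 hours), burning at 14.4x for one hour uses 14.4 / 720 = 2% of the budget, and sustaining that rate would use the whole budget in 720 / 14.4 = 50 hours. A common companion to that fast-burn rule is a slower window pair, sketched below as an additional entry for a rules list. The 6x factor with 6h and 30m windows follows the widely used multi-window pattern; the alert name is hypothetical, and the metric and job names are the same assumptions as in the example above.

```yaml
# Slow-burn companion rule (hypothetical name); append to spec.groups[].rules
- alert: SLOBurnRateElevated
  # 6x sustained for 6h uses 6 * 6 / 720 = 5% of a 30-day error budget
  expr: |
    (
      sum(rate(http_requests_total{status=~"5..",job="api-server"}[6h]))
      /
      sum(rate(http_requests_total{job="api-server"}[6h]))
    ) > (6 * 0.001)
    and
    (
      sum(rate(http_requests_total{status=~"5..",job="api-server"}[30m]))
      /
      sum(rate(http_requests_total{job="api-server"}[30m]))
    ) > (6 * 0.001)
  for: 5m
  labels:
    severity: critical
    slo: availability
  annotations:
    summary: "SLO error budget is burning at more than 6x the sustainable rate"
    description: "The API server has burned error budget at more than 6x the sustainable rate over the last 6 hours."
```

The short window in each pair keeps the alert from lingering long after the error rate has recovered, which is why both conditions are joined with and.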
Embedding Custom Rules in Helm Values
Instead of separate PrometheusRule manifests, you can define custom rules directly in your Helm values under additionalPrometheusRulesMap:
    additionalPrometheusRulesMap:
      custom-app-rules:
        groups:
          - name: app.rules
            rules:
              - alert: HighLatencyP99
                expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2
                for: 10m
                labels:
                  severity: warning
                annotations:
                  summary: "P99 latency exceeds 2 seconds"
Testing Your Alert Pipeline
Configuring alerts without testing them is a recipe for silent failures during real incidents. Here is a systematic approach to validating your entire alert pipeline from rule evaluation through notification delivery.
Step 1: Verify PrometheusRule Discovery
Confirm that Prometheus has discovered your custom rules:
    # List all PrometheusRule resources
    kubectl get prometheusrules -n monitoring

    # Check Prometheus targets and rules in the UI
    kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
    # Open http://localhost:9090/rules to see all loaded rules
    # Open http://localhost:9090/alerts to see alert states

    # Check for rule evaluation errors in Prometheus logs
    # (-c selects the prometheus container; the pod also runs sidecars)
    kubectl logs -n monitoring prometheus-kube-prometheus-stack-prometheus-0 \
      -c prometheus --tail=100 | grep -i "rule"
Step 2: Fire a Test Alert
Create a PrometheusRule whose expression is always true to test the entire pipeline end-to-end:
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: test-alert
      namespace: monitoring
      labels:
        release: kube-prometheus-stack
    spec:
      groups:
        - name: test.rules
          rules:
            - alert: TestAlertPipeline
              expr: vector(1)
              for: 1m
              labels:
                severity: warning
                team: platform
              annotations:
                summary: "Test alert -- safe to ignore"
                description: "This alert validates the notification pipeline. Delete the test-alert PrometheusRule to resolve."
Apply this rule and wait 1-2 minutes. Check each stage:
- Prometheus -- Verify the alert appears as "firing" at http://localhost:9090/alerts.
- Alertmanager -- Verify the alert appears at http://localhost:9093/#/alerts and shows the correct receiver assignment.
- Notification channel -- Verify you receive the notification in Slack, PagerDuty, email, or your configured destination.
- Cleanup -- Delete the test rule with kubectl delete prometheusrule test-alert -n monitoring. Verify you receive a "resolved" notification if send_resolved: true is configured.
Step 3: Validate Alertmanager Configuration Syntax
Before applying configuration changes to production, validate the syntax using amtool, the Alertmanager CLI:
    # Extract current Alertmanager config
    # (recent Prometheus Operator versions store it gzip-compressed under the
    # key alertmanager.yaml.gz; on older versions the key is alertmanager.yaml
    # and the gunzip step should be dropped)
    kubectl get secret -n monitoring \
      alertmanager-kube-prometheus-stack-alertmanager-generated \
      -o jsonpath='{.data.alertmanager\.yaml\.gz}' | base64 -d | gunzip > /tmp/am-config.yaml

    # Validate the configuration
    amtool check-config /tmp/am-config.yaml

    # Test routing for a specific alert
    amtool config routes test \
      --config.file=/tmp/am-config.yaml \
      severity=critical alertname=HighApiErrorRate namespace=production
The amtool config routes test command is particularly valuable -- it shows exactly which receiver a given set of labels would match, letting you verify routing logic without firing real alerts.
Production Alerting Best Practices
After configuring and testing your alert pipeline, apply these production hardening practices to ensure reliability during real incidents. For clusters with Thanos long-term storage, you can also create alerts on historical data trends that span weeks or months.
- Run Alertmanager in high-availability mode. Set alertmanager.alertmanagerSpec.replicas: 3 in your Helm values. Alertmanager instances automatically form a cluster using the Gossip protocol, deduplicating notifications across replicas. If one replica goes down, the remaining instances continue delivering alerts.
- Use the Watchdog alert as a dead man's switch. kube-prometheus-stack includes a Watchdog alert that fires continuously when everything is healthy. Route it to a dead man's switch service (Dead Man's Snitch, Healthchecks.io, or PagerDuty heartbeat). If the Watchdog notification stops arriving, the monitoring system itself is broken.
- Store secrets outside of Helm values. Webhook URLs, API tokens, and SMTP passwords should live in Kubernetes Secrets, not plaintext in values.yaml. Use alertmanager.alertmanagerSpec.secrets to mount secrets into the Alertmanager pod and reference them with file-based config parameters.
- Tune repeat intervals per severity. Critical alerts should repeat every 1 hour to stay visible. Warning alerts can repeat every 4-6 hours. Info alerts should repeat every 12-24 hours or not at all. Over-repeating causes notification fatigue; under-repeating lets issues go unnoticed.
- Create runbooks for every alert. Every PrometheusRule should have a runbook_url annotation linking to a document that explains: what this alert means, how to diagnose the root cause, how to remediate it, and what the expected impact is if left unresolved. Runbooks make the difference between a 5-minute fix and a 2-hour debugging session at 3 AM.
- Implement alert ownership. Use labels like team, service, and component on your PrometheusRules and build routing rules that send alerts to the responsible team. Unowned alerts are ignored alerts.
- Review and prune alerts quarterly. Audit your alert rules every quarter. Delete alerts nobody acts on, tighten thresholds that trigger too often, and add alerts for failure modes you discovered during incidents. A lean, high-signal alert set is worth more than 500 rules nobody trusts.
- Monitor Alertmanager itself. The kube-prometheus-stack includes an Alertmanager / Overview Grafana dashboard that shows notification success/failure rates, active alerts, and cluster health. Set up an alert on alertmanager_notifications_failed_total to detect when notifications silently fail.
- Use continue: true for critical alerts. Route critical alerts to both PagerDuty (for paging) and Slack (for team visibility) by setting continue: true on the PagerDuty route and following it with a route that sends the same alerts to Slack. This ensures the on-call engineer is paged while the rest of the team can follow along in the channel.
- Test the pipeline after every change. After modifying Alertmanager configuration or adding new PrometheusRules, fire a test alert to verify end-to-end delivery. Configuration changes that break notification delivery fail silently -- you will not know until a real incident goes unnoticed.
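Several of these practices can be expressed directly in Helm values. The fragment below is a sketch that combines a three-replica Alertmanager with a self-monitoring rule on alertmanager_notifications_failed_total. The map key, group name, alert name, threshold, and for duration are illustrative assumptions, and the chart already bundles similar Alertmanager alerts that you may prefer to reuse.

```yaml
alertmanager:
  alertmanagerSpec:
    replicas: 3          # replicas form a cluster and deduplicate notifications

additionalPrometheusRulesMap:
  alertmanager-self-monitoring:                 # hypothetical map key
    groups:
      - name: alertmanager-self.rules           # hypothetical group name
        rules:
          - alert: AlertmanagerNotificationsFailing   # hypothetical alert name
            # True when any integration has recorded failed notifications
            # during the last 5 minutes
            expr: sum by (integration) (rate(alertmanager_notifications_failed_total[5m])) > 0
            for: 10m
            labels:
              severity: critical
            annotations:
              summary: "Alertmanager is failing to deliver notifications"
              description: "One or more notification integrations have been failing for at least 10 minutes. The integration label on this alert identifies the affected channel."
```

Because this alert travels through the same Alertmanager it monitors, it complements rather than replaces the Watchdog dead man's switch.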
Conclusion
Alertmanager is the critical bridge between knowing something is wrong and getting the right person to fix it. Without properly configured alert routing, grouping, and notification delivery, even the best Prometheus metrics and alerting rules are useless -- they fire into a void that nobody monitors.
The kube-prometheus-stack makes Alertmanager operationally simple: the Prometheus Operator manages the lifecycle, Helm values provide a clean configuration surface, and the 100+ default PrometheusRules give you meaningful alerts from day one. Your job is to connect the last mile -- routing alerts to Slack, PagerDuty, email, or webhooks; tuning grouping and inhibition to reduce noise; writing custom PrometheusRules for your application-specific SLOs; and testing the entire pipeline to ensure it works when it matters most.
Start with the Slack integration, add PagerDuty for critical alerts, configure the Watchdog dead man's switch, and build from there. A well-tuned alerting pipeline is not something you build once and forget -- it evolves with every incident, every post-mortem, and every new service your team deploys.