Collecting metrics is only half of Kubernetes observability. The other half is knowing when something goes wrong and getting the right notification to the right person at the right time. That is exactly what Alertmanager does within the kube-prometheus-stack. It receives alerts fired by Prometheus, deduplicates them, groups related alerts together, routes them to the correct notification channel, and manages silences and inhibitions so your on-call team sees signal rather than noise.

This guide covers the complete Alertmanager configuration workflow inside kube-prometheus-stack: from Helm values.yaml routing trees and Slack webhook integration to PagerDuty escalations, custom PrometheusRule definitions, alert grouping strategies, and production hardening. Every example is real YAML you can apply to your cluster once you replace the placeholder URLs, keys, and credentials.

What is Alertmanager in Kube-Prometheus-Stack?

Alertmanager is the alert notification engine that ships as part of the kube-prometheus-stack. While Prometheus evaluates alert rules and determines when an alert should fire, Alertmanager determines what happens next -- who gets notified, through which channel, and how alerts are grouped and deduplicated before delivery.

When you install kube-prometheus-stack via the Helm chart, Alertmanager is deployed automatically as a StatefulSet managed by the Prometheus Operator. The chart renders the alertmanager.config section of your Helm values into a Secret. The operator combines that Secret with any AlertmanagerConfig custom resources and writes the final Alertmanager configuration. A config-reloader sidecar then reloads Alertmanager without requiring a pod restart.

Out of the box, Alertmanager is configured with a single null receiver -- it accepts alerts from Prometheus but does not deliver them anywhere. This is intentional: the stack gives you working metric collection and alerting rules immediately, but notification routing is something every organization must customize to match their own on-call workflows, escalation policies, and communication tools.

Alertmanager provides five core capabilities:

  • Deduplication -- When Prometheus runs in high-availability mode (two or more replicas), each instance fires the same alert independently. Because the copies carry identical labels, Alertmanager treats them as one alert. When Alertmanager itself runs as a cluster, its replicas share notification state, so you receive only one notification per unique alert.
  • Grouping -- Related alerts are batched into a single notification. For example, if 50 pods in the same namespace start crash-looping simultaneously, Alertmanager groups them into one Slack message instead of flooding the channel with 50 separate messages.
  • Routing -- A routing tree directs alerts to different receivers based on label matchers. Critical infrastructure alerts go to PagerDuty, warning-level application alerts go to Slack, and informational alerts go to email -- all configurable through label-based matching.
  • Inhibition -- Rules that suppress lower-priority alerts when a higher-priority alert is already active. If the entire node is down, you do not need separate alerts for every pod on that node.
  • Silencing -- Time-bound rules that mute specific alerts during planned maintenance or known outages. Silences can be created through the Alertmanager UI or API.

How Alertmanager Works with Prometheus

Understanding the alert pipeline is essential before configuring anything. Here is how alerts flow through the system:

Step 1: Prometheus evaluates rules. Prometheus scrapes metrics from your cluster and evaluates PrometheusRule resources at a regular interval (default: 30 seconds). When a rule expression returns results, the alert transitions to pending. After the for duration elapses with the condition still true, the alert becomes firing.
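The pending-to-firing logic can be modelled in a few lines. The Python sketch below is a simplification for illustration only. It ignores details such as keep_firing_for and the alert state that Prometheus restores after a restart. AlertState, evaluate, and simulate are hypothetical names. With for_seconds=0, the sketch also shows that a rule without a for clause fires on the first evaluation that returns results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertState:
    state: str = "inactive"               # "inactive", "pending" or "firing"
    active_since: Optional[float] = None  # when the expression first returned results


def evaluate(alert: AlertState, condition_true: bool, now: float, for_seconds: float) -> AlertState:
    """One simplified rule evaluation for a single alert series."""
    if not condition_true:
        # No result from the expression: the alert resets (a firing alert resolves)
        return AlertState()
    since = alert.active_since if alert.active_since is not None else now
    if now - since >= for_seconds:
        return AlertState("firing", since)
    return AlertState("pending", since)


def simulate(conditions: List[bool], interval: float = 30.0, for_seconds: float = 300.0) -> List[str]:
    """Evaluate every `interval` seconds and record the resulting state."""
    alert = AlertState()
    states = []
    for i, condition in enumerate(conditions):
        alert = evaluate(alert, condition, i * interval, for_seconds)
        states.append(alert.state)
    return states
```

For example, simulate([True] * 12) models a rule with for: 5m that is evaluated every 30 seconds. The first ten evaluations (0 to 270 seconds) report pending, and the alert first reports firing at 300 seconds.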

Step 2: Prometheus sends alerts to Alertmanager. Prometheus pushes firing (and resolved) alerts to Alertmanager via HTTP POST to the /api/v2/alerts endpoint. In kube-prometheus-stack, this connection is preconfigured -- Prometheus knows Alertmanager's service address automatically.
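The request body is a JSON array of alerts. The Python helpers below sketch that shape by building one alert and posting it. This is also a handy way to inject a synthetic alert when you test routing. The base URL assumes a port-forwarded Alertmanager on localhost:9093, as shown later in this guide. The helper names and label values are illustrative.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_alert(labels, annotations=None, duration_minutes=5, now=None):
    """Build one alert in the shape accepted by POST /api/v2/alerts."""
    if now is None:
        now = datetime.now(timezone.utc)
    return {
        "labels": dict(labels),                  # the label set is the alert's identity
        "annotations": dict(annotations or {}),
        "startsAt": now.isoformat(),
        # After endsAt the alert counts as resolved unless it is re-sent,
        # which is why Prometheus keeps re-sending alerts that are still active
        "endsAt": (now + timedelta(minutes=duration_minutes)).isoformat(),
    }


def post_alerts(alerts, base_url="http://localhost:9093"):
    """POST a list of alerts; the body is a JSON array."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling post_alerts([build_alert({"alertname": "TestAlertPipeline", "severity": "warning"})]) would send one warning-level alert. From the command line, amtool alert add serves the same purpose.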

Step 3: Alertmanager processes alerts. Alertmanager receives the alert, deduplicates it against existing alerts, and passes it through the routing tree. The routing tree matches alert labels against configured matchers and selects the appropriate receiver.

Step 4: Grouping and batching. Before sending a notification, Alertmanager groups alerts by the configured group_by labels. It then waits for the group_wait period to collect related alerts before sending the first notification. Subsequent alerts in the same group are batched and sent after the group_interval.

Step 5: Notification delivery. The selected receiver sends the notification through its configured channel -- Slack webhook, PagerDuty Events API, SMTP email, generic webhook, or any other supported integration. If delivery fails, Alertmanager retries with exponential backoff.

This pipeline means there are two separate configuration surfaces: PrometheusRule resources define what alerts fire and when, while Alertmanager configuration defines where alerts go and how they are grouped. Both are managed through your Helm values or Kubernetes custom resources. You can visualize alert state and history using the Alertmanager Grafana dashboard that ships with the stack.

Configuring Alert Routing in values.yaml

The routing tree is the heart of Alertmanager configuration. It determines which receiver handles each alert based on label matching. Here is a production-grade routing configuration in your kube-prometheus-stack Helm values:

values.yaml — Alertmanager Routing
alertmanager:
  enabled: true

  config:
    global:
      resolve_timeout: 5m
      slack_api_url: "https://hooks.slack.com/services/T00/B00/XXXX"

    route:
      receiver: "default-slack"
      group_by: ["alertname", "namespace", "job"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
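      routes:
        # Watchdog heartbeat → dead man's switch (listed first so no other route claims it)
        - receiver: "deadmans-switch"
          matchers:
            - alertname = Watchdog
          # Repeats only happen on group_interval ticks, so keep it as short as repeat_interval
          group_interval: 1m
          repeat_interval: 1m

        # Critical alerts → PagerDuty (immediate page)
        - receiver: "pagerduty-critical"
          matchers:
            - severity = critical
          group_wait: 10s
          repeat_interval: 1h
          continue: true   # keep evaluating, so the next route also receives critical alerts

        # Critical alerts → Slack as well, for team visibility
        - receiver: "default-slack"
          matchers:
            - severity = critical
          group_wait: 10s
          repeat_interval: 1h

        # Warning alerts → dedicated Slack channel
        - receiver: "slack-warnings"
          matchers:
            - severity = warning
          group_wait: 30s
          repeat_interval: 6h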

    receivers:
      - name: "default-slack"
        slack_configs:
          - channel: "#k8s-alerts"
            send_resolved: true

      - name: "pagerduty-critical"
        pagerduty_configs:
          - routing_key: "<YOUR-PAGERDUTY-INTEGRATION-KEY>"
            severity: "critical"

      - name: "slack-warnings"
        slack_configs:
          - channel: "#k8s-warnings"
            send_resolved: true

      - name: "deadmans-switch"
        webhook_configs:
          - url: "https://nosnch.in/XXXXXXX"

Let us break down the key routing parameters:

  • group_by -- Labels used to aggregate alerts into groups. Grouping by ["alertname", "namespace", "job"] means all alerts with the same name, namespace, and job are batched into one notification. Choose labels that produce meaningful groups without over-aggregating.
  • group_wait -- How long to wait after the first alert in a new group before sending the notification. A 30-second wait allows related alerts to arrive and be grouped together rather than sending each one individually.
  • group_interval -- After the initial notification is sent, how long to wait before sending updates about new alerts added to the same group. Set this to 5 minutes to avoid notification fatigue.
  • repeat_interval -- How long to wait before re-sending a notification for an alert that is still firing and has not changed. For critical alerts, 1 hour ensures the issue stays visible; for warnings, 4-6 hours is typically appropriate.
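  • continue -- When set to true, matching does not stop at this route. Alertmanager keeps evaluating the following sibling routes, and every one that matches also receives the alert. You can use this to send critical alerts to both PagerDuty and Slack, as long as a later sibling route matching the same alerts points at a Slack receiver.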
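The three timers are easiest to reason about as a timeline. The following Python function is a simplified model of one alert group, not Alertmanager's dispatcher. In this model, the first notification goes out group_wait after the group is created. The group is then flushed every group_interval, and a flush sends a notification only if the set of alerts changed or repeat_interval has elapsed since the last notification. Resolution and multiple receivers are left out.

```python
def notification_times(arrivals, group_wait, group_interval, repeat_interval, horizon):
    """Times (seconds) at which one alert group notifies, in a simplified model.

    arrivals: times at which alerts join the group; the first one creates it.
    Alerts never resolve within `horizon` in this model.
    """
    arrivals = sorted(arrivals)
    sent = []
    notified = 0                        # alerts included in the last notification
    tick = arrivals[0] + group_wait     # first flush happens after group_wait
    while tick <= horizon:
        current = sum(1 for t in arrivals if t <= tick)
        changed = current != notified
        repeat_due = bool(sent) and tick - sent[-1] >= repeat_interval
        if not sent or changed or repeat_due:
            sent.append(tick)
            notified = current
        tick += group_interval          # later flushes happen every group_interval
    return sent
```

Suppose group_wait is 30s, group_interval is 5m, and repeat_interval is 1h. Alerts arriving at 0, 10, and 200 seconds then produce notifications at 30 s, 330 s, and 3930 s. In this model, a repeat can only happen on a group_interval tick. That is why a 1-minute repeat_interval under a 5-minute group_interval still yields only one notification every five minutes.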

The routing tree is evaluated top-down. The first matching child route handles the alert (unless continue: true is set). If no child route matches, the alert falls through to the parent route's receiver. Always define a sensible default receiver at the root level to catch unmatched alerts.
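To check how continue and fallthrough interact, here is a small Python model of the top-down, depth-first matching just described. It operates on plain dictionaries shaped like the routes in values.yaml. To keep it short, it supports only equality matchers such as severity = critical; real routes also accept !=, =~, and !~. The function names are illustrative.

```python
def parse_matcher(expr):
    """Parse an equality matcher such as 'severity = critical'."""
    name, op, value = expr.partition("=")
    if not op or name.rstrip().endswith("!") or value.lstrip().startswith("~"):
        raise ValueError(f"only '=' matchers are supported in this sketch: {expr!r}")
    return name.strip(), value.strip().strip('"')


def route_matches(route, labels):
    return all(labels.get(n, "") == v for n, v in map(parse_matcher, route.get("matchers", [])))


def receivers_for(route, labels, inherited=None):
    """Receivers notified for `labels`, walking the tree depth-first."""
    if not route_matches(route, labels):
        return []
    receiver = route.get("receiver", inherited)  # child routes inherit the receiver
    found = []
    for child in route.get("routes", []):
        hits = receivers_for(child, labels, receiver)
        found.extend(hits)
        if hits and not child.get("continue", False):
            break  # the first matching child wins unless it sets continue: true
    return found or [receiver]  # fall back to this route only if no child matched


# A tree shaped like the routing example, with a Slack route after the continue: true route
TREE = {
    "receiver": "default-slack",
    "routes": [
        {"receiver": "deadmans-switch", "matchers": ["alertname = Watchdog"]},
        {"receiver": "pagerduty-critical", "matchers": ["severity = critical"], "continue": True},
        {"receiver": "default-slack", "matchers": ["severity = critical"]},
        {"receiver": "slack-warnings", "matchers": ["severity = warning"]},
    ],
}
```

With TREE, a severity=critical alert reaches ["pagerduty-critical", "default-slack"]. If you drop the third route, the same alert reaches only pagerduty-critical, because continue: true has no visible effect unless a later sibling also matches. An alert with severity=info matches no child and falls back to the root receiver, default-slack.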

Slack Integration Step-by-Step

Slack is the most common notification target for Kubernetes alerts. Here is how to set up a production-quality Slack integration with rich message formatting:

Step 1: Create a Slack Incoming Webhook. In your Slack workspace, go to Settings & Administration → Manage Apps → Incoming Webhooks (or create a Slack App with incoming webhook permissions). Select the channel and copy the webhook URL.

Step 2: Store the webhook URL securely. Never put webhook URLs directly in values.yaml files that are committed to Git. Instead, create a Kubernetes Secret:

slack-webhook-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-slack-webhook
  namespace: monitoring
type: Opaque
stringData:
  slack-webhook-url: "https://hooks.slack.com/services/T00/B00/XXXX"

Step 3: Configure the Slack receiver with rich templates. The default Slack message format is basic. Use Go templates to create informative, actionable messages:

values.yaml — Slack Receiver with Rich Templates
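alertmanager:
  alertmanagerSpec:
    # Mounts the Secret at /etc/alertmanager/secrets/alertmanager-slack-webhook/
    secrets:
      - alertmanager-slack-webhook
  config:
    receivers:
      - name: "slack-critical"
        slack_configs:
          - api_url_file: "/etc/alertmanager/secrets/alertmanager-slack-webhook/slack-webhook-url"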
            channel: "#k8s-critical-alerts"
            send_resolved: true
            color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
            title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
            text: |
              *Cluster:* {{ .CommonLabels.cluster }}
              *Namespace:* {{ .CommonLabels.namespace }}
              *Severity:* {{ .CommonLabels.severity }}
              {{ range .Alerts }}
              ---
              *Alert:* {{ .Labels.alertname }}
              *Description:* {{ .Annotations.description }}
              *Runbook:* {{ .Annotations.runbook_url }}
              *Started:* {{ .StartsAt.Format "2006-01-02 15:04:05 UTC" }}
              {{ end }}
            actions:
              - type: button
                text: "View in Alertmanager"
                url: '{{ template "__alertmanagerURL" . }}'
              - type: button
                text: "Silence Alert"
                url: '{{ .ExternalURL }}/#/silences/new'

This template includes the alert name, cluster, namespace, severity, description, runbook link, and start time for each alert in the group. The action buttons link directly to the Alertmanager UI for quick silencing. Setting send_resolved: true ensures your team gets a clear "resolved" notification when the alert clears, reducing confusion about whether an issue is still active.

Step 4: Apply the configuration. Run helm upgrade to push the new Alertmanager configuration to your cluster:

Apply configuration
# Create the webhook Secret from Step 2 first
kubectl apply -f slack-webhook-secret.yaml

helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  -n monitoring \
  -f values.yaml \
  --reuse-values

PagerDuty and Email Configuration

For critical production alerts, Slack alone is not sufficient. PagerDuty provides escalation policies, on-call schedules, and phone/SMS delivery that ensure critical alerts actually wake someone up.

PagerDuty Integration

Create a PagerDuty service integration using the Events API v2 integration type. Copy the integration key (also called routing key) and configure the receiver:

values.yaml — PagerDuty Receiver
alertmanager:
  config:
    receivers:
      - name: "pagerduty-critical"
        pagerduty_configs:
          - routing_key: "<YOUR-PAGERDUTY-INTEGRATION-KEY>"
            severity: '{{ if eq (index .Alerts 0).Labels.severity "critical" }}critical{{ else }}warning{{ end }}'
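            # CommonAnnotations is empty when the grouped alerts have different descriptions,
            # so fall back to Alertmanager's default PagerDuty description
            description: '{{ if .CommonAnnotations.description }}{{ .CommonAnnotations.description }}{{ else }}{{ template "pagerduty.default.description" . }}{{ end }}'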
            details:
              cluster: '{{ .CommonLabels.cluster }}'
              namespace: '{{ .CommonLabels.namespace }}'
              firing: '{{ .Alerts.Firing | len }}'
              resolved: '{{ .Alerts.Resolved | len }}'

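The severity field sets the PagerDuty event severity: critical, error, warning, or info. Some PagerDuty services use dynamic notifications, where urgency is based on alert severity. On those services, critical and error create high-urgency incidents that page immediately, while warning and info create low-urgency incidents that follow your low-urgency notification rules. On other services, the urgency configured on the service applies regardless of severity.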
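The PagerDuty Events API v2 accepts exactly four severity values: critical, error, warning, and info. The Python sketch below builds a trigger event of the general shape that API expects and normalizes a Prometheus severity label into that set. It is an illustration under assumptions, not Alertmanager's internal code. Mapping unrecognized values such as none to warning is a choice made here, and the helper names are hypothetical.

```python
PAGERDUTY_SEVERITIES = ("critical", "error", "warning", "info")


def pagerduty_severity(label_value):
    """Normalize a Prometheus severity label to an Events API v2 severity."""
    value = (label_value or "").strip().lower()
    # Assumption for this sketch: anything unrecognized (e.g. "none") becomes "warning"
    return value if value in PAGERDUTY_SEVERITIES else "warning"


def trigger_event(routing_key, summary, source, severity_label, dedup_key, details=None):
    """Build an Events API v2 trigger event."""
    return {
        "routing_key": routing_key,      # the service's integration key
        "event_action": "trigger",       # a later "resolve" with the same dedup_key closes it
        "dedup_key": dedup_key,          # events sharing this key update one incident
        "payload": {
            "summary": summary,
            "source": source,
            "severity": pagerduty_severity(severity_label),
            "custom_details": details or {},
        },
    }
```

Keeping the dedup_key stable for a given alert group lets PagerDuty update and later resolve the same incident instead of opening a new one for every notification.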

Email (SMTP) Configuration

Email is useful for lower-priority alerts, compliance notifications, or environments where chat tools are not available. Configure the SMTP settings globally and add email receivers:

values.yaml — Email (SMTP) Configuration
alertmanager:
  config:
    global:
      smtp_smarthost: "smtp.company.com:587"
      smtp_from: "alertmanager@company.com"
      smtp_auth_username: "alertmanager@company.com"
      smtp_auth_password: "<SMTP-PASSWORD>"
      smtp_require_tls: true

    receivers:
      - name: "email-infra-team"
        email_configs:
          - to: "infra-team@company.com"
            send_resolved: true
            headers:
              Subject: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }} - Kubernetes Cluster'

For production environments, always store SMTP credentials in a Kubernetes Secret rather than in plaintext in values.yaml. List the Secret under alertmanager.alertmanagerSpec.secrets in your Helm values. The operator mounts it into the Alertmanager pod at /etc/alertmanager/secrets/<secret-name>/, where file-based configuration parameters can read it.

Webhook Receiver (Generic)

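For custom integrations that have no native receiver, such as an internal incident management system or a relay service for a chat tool, use the generic webhook receiver. (Opsgenie has a native opsgenie_configs receiver, and recent Alertmanager releases also include msteams_configs for Microsoft Teams.) Alertmanager sends the alert group as a JSON POST to your endpoint: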

values.yaml — Generic Webhook Receiver
alertmanager:
  config:
    receivers:
      - name: "custom-webhook"
        webhook_configs:
          - url: "https://incident-api.company.com/alertmanager"
            send_resolved: true
            max_alerts: 10      # 0 (the default) sends every alert in the group
            http_config:
              authorization:
                type: Bearer
                credentials: "<API-TOKEN>"
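On the receiving end, each POST carries one alert group. The payload has envelope fields such as status, receiver, groupLabels, commonLabels, externalURL, and truncatedAlerts, plus an alerts array with each alert's status, labels, annotations, and timestamps. Below is a minimal standard-library Python sketch of such an endpoint. The port, the token check, and the summarize helper are assumptions for illustration. A real service would add TLS, request size limits, and structured logging.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

EXPECTED_TOKEN = "<API-TOKEN>"  # hypothetical: must equal the bearer token in http_config


def summarize(payload):
    """Reduce an Alertmanager webhook payload to one line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        lines.append(
            f'{alert.get("status", "unknown").upper()} {labels.get("alertname", "?")} '
            f'ns={labels.get("namespace", "-")} severity={labels.get("severity", "-")}'
        )
    if payload.get("truncatedAlerts"):
        # max_alerts was exceeded; these alerts were left out of this payload
        lines.append(f'... {payload["truncatedAlerts"]} more alert(s) truncated')
    return lines


class AlertmanagerWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.headers.get("Authorization") != f"Bearer {EXPECTED_TOKEN}":
            self.send_response(401)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in summarize(payload):
            print(line)
        self.send_response(200)  # a 2xx status tells Alertmanager delivery succeeded
        self.end_headers()


def serve(port=8080):
    """Run the receiver (blocks forever)."""
    HTTPServer(("0.0.0.0", port), AlertmanagerWebhook).serve_forever()
```

Call serve() to start listening on port 8080, then point the url of the webhook receiver at that address.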

Alert Grouping, Inhibition, and Silencing

Effective alert management is not just about where alerts are sent -- it is about reducing noise so on-call engineers can focus on real problems. Grouping, inhibition, and silencing are the three mechanisms that transform a firehose of raw alerts into actionable notifications.

Alert Grouping Strategy

Grouping controls how Alertmanager batches related alerts into a single notification. The group_by parameter in the routing configuration determines which labels define a group:

Grouping Strategy Examples
# Group by alert name and namespace (recommended default)
group_by: ["alertname", "namespace"]

# Disable aggregation: group by every label, so each unique alert is notified separately
group_by: ["..."]

# Empty list on the root route: no grouping labels, so all alerts share a single group
group_by: []

# Group by cluster and severity for multi-cluster setups
group_by: ["cluster", "severity", "alertname"]

The recommended default grouping is ["alertname", "namespace"]. This means all KubePodCrashLooping alerts in the production namespace get batched into one notification, while the same alert in the staging namespace gets a separate notification. Adding job to the group provides finer granularity when multiple services share a namespace.
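An alert's group is determined by the values of its group_by labels. The Python sketch below computes that key so you can compare how each setting partitions the same three alerts. It is a simplified model with illustrative names, not Alertmanager's code.

```python
def group_key(labels, group_by):
    """Identify the group an alert belongs to (simplified model)."""
    if list(group_by) == ["..."]:
        # Special value: group by every label, so each distinct alert is its own group
        return tuple(sorted(labels.items()))
    # A label the alert lacks counts as empty, so such alerts share a group
    return tuple((name, labels.get(name, "")) for name in sorted(group_by))


def partition(alerts, group_by):
    """Split alerts into groups keyed by group_key."""
    groups = {}
    for labels in alerts:
        groups.setdefault(group_key(labels, group_by), []).append(labels)
    return groups


CRASHLOOPS = [
    {"alertname": "KubePodCrashLooping", "namespace": "production", "pod": "api-1"},
    {"alertname": "KubePodCrashLooping", "namespace": "production", "pod": "api-2"},
    {"alertname": "KubePodCrashLooping", "namespace": "staging", "pod": "api-1"},
]
```

Partitioning CRASHLOOPS by ["alertname", "namespace"] yields two groups, one for production and one for staging. Using ["..."] yields three groups, one per alert, and an empty list yields a single group.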

Inhibition Rules

Inhibition rules suppress lower-severity alerts when a higher-severity alert is already active for the same scope. This prevents alert storms where a single root cause triggers dozens of dependent alerts:

values.yaml — Inhibition Rules
alertmanager:
  config:
    inhibit_rules:
      # Critical suppresses warning for the same alert
      - source_matchers:
          - severity = critical
        target_matchers:
          - severity = warning
        equal: ["alertname", "namespace"]

      # Node not ready suppresses warning/info alerts that carry the same node label
      - source_matchers:
          - alertname = KubeNodeNotReady
        target_matchers:
          - severity =~ "warning|info"
        equal: ["node"]

      # Cluster unreachable suppresses every alert from that cluster.
      # KubeClusterUnreachable is an example name, not a default rule: define it yourself,
      # typically in a central Prometheus that monitors several clusters.
      - source_matchers:
          - alertname = KubeClusterUnreachable
        target_matchers:
          - severity =~ ".*"
        equal: ["cluster"]

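The equal field lists the labels whose values must match between the source (suppressing) and target (suppressed) alerts. Take the first rule above: if KubePersistentVolumeFillingUp fires with severity=critical in namespace production, the KubePersistentVolumeFillingUp alert with severity=warning in the same namespace is inhibited, so only the more urgent notification is sent. Because equal compares label values, a target alert that lacks a label the source carries is not suppressed. For example, a pod alert with no node label stays active.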
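The evaluation itself is simple enough to model. The Python sketch below follows the semantics described above, using equality matchers only, passed as dictionaries; is_inhibited is a hypothetical helper. It includes the rule that an alert matching both the source and the target side cannot be inhibited by another such alert, so an alert never suppresses itself.

```python
def matches(labels, matchers):
    """Equality-only matching: every matcher label must have the given value."""
    return all(labels.get(name, "") == value for name, value in matchers.items())


def is_inhibited(target, firing, rule):
    """True if `target` is suppressed by any alert in `firing` under `rule`."""
    if not matches(target, rule["target"]):
        return False
    # An alert matching both sides cannot be inhibited by another two-sided alert (or itself)
    two_sided = matches(target, rule["source"])
    for source in firing:
        if not matches(source, rule["source"]):
            continue
        if two_sided and matches(source, rule["target"]):
            continue
        # Every label listed in `equal` must carry the same value on both alerts
        if all(source.get(name, "") == target.get(name, "") for name in rule["equal"]):
            return True
    return False
```

Consider a node rule with KubeNodeNotReady as the source, severity warning as the target, and equal: ["node"]. A KubePodCrashLooping warning without a node label stays active even while KubeNodeNotReady fires for node-1. If you add node: node-1 to that alert, it becomes inhibited.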

Silences

Silences are time-bound rules that mute specific alerts. Inhibition rules live in the Alertmanager configuration and apply whenever their source alert fires. Silences, by contrast, are created on demand through the Alertmanager UI, the API, or amtool. They match alerts by label and expire at an explicit end time.

Access the Alertmanager UI to create silences:

Access Alertmanager UI
kubectl port-forward -n monitoring \
  svc/kube-prometheus-stack-alertmanager 9093:9093

# Open http://localhost:9093 in your browser
# Navigate to Silences → New Silence

You can also create silences programmatically via the Alertmanager API, which is useful for integrating with deployment pipelines. For example, automatically silence alerts for a namespace during a rolling deployment:

Create Silence via API
curl -X POST http://localhost:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      { "name": "namespace", "value": "production", "isRegex": false },
      { "name": "severity", "value": "warning", "isRegex": false }
    ],
    "startsAt": "2026-03-05T10:00:00Z",
    "endsAt": "2026-03-05T11:00:00Z",
    "createdBy": "deploy-pipeline",
    "comment": "Silencing warnings during v2.5.0 rollout"
  }'
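Hard-coded timestamps go stale quickly in a pipeline. A small Python helper can compute startsAt and endsAt from the current time and post the same payload. Alertmanager answers with the new silence's ID. A pipeline can store that ID and expire the silence early with DELETE /api/v2/silence/{silenceID} once the rollout finishes. The sketch assumes the port-forward above (http://localhost:9093), and the function names are illustrative.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

ALERTMANAGER_URL = "http://localhost:9093"  # assumption: the port-forward shown above


def silence_payload(matchers, minutes, created_by, comment, now=None):
    """Build a POST /api/v2/silences body from {label: value} equality matchers."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": now.strftime(fmt),
        "endsAt": (now + timedelta(minutes=minutes)).strftime(fmt),
        "createdBy": created_by,
        "comment": comment,
    }


def create_silence(payload, base_url=ALERTMANAGER_URL):
    """Create the silence and return its ID."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/silences",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["silenceID"]


def expire_silence(silence_id, base_url=ALERTMANAGER_URL):
    """Expire a silence early, e.g. when the rollout finishes ahead of schedule."""
    request = urllib.request.Request(f"{base_url}/api/v2/silence/{silence_id}", method="DELETE")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```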

Creating Custom PrometheusRule Alerts

kube-prometheus-stack ships with over 100 default alerts from the kubernetes-mixin and related monitoring mixins (node-exporter, Prometheus, Alertmanager, etcd). Even so, production environments almost always need custom alerts specific to their applications and SLOs. PrometheusRule is the custom resource that defines alert rules for the Prometheus Operator. Before writing alert rules that depend on custom metrics, make sure Prometheus is actually scraping your application, typically via a ServiceMonitor.

PrometheusRule Structure

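Every PrometheusRule must carry labels that match the Prometheus rule selector, or the operator will not load it. By default, the chart selects only rules labeled release: <your Helm release name>, which is release: kube-prometheus-stack in this guide. Setting prometheus.prometheusSpec.ruleSelectorNilUsesHelmValues: false makes Prometheus select PrometheusRules regardless of that label: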

custom-alerts.yaml — Application PrometheusRule
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-custom-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
    app: kube-prometheus-stack
spec:
  groups:
    - name: app.rules
      rules:
        # High error rate on API endpoints
        - alert: HighApiErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5..",job="api-server"}[5m]))
            /
            sum(rate(http_requests_total{job="api-server"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
            team: platform
          annotations:
            summary: "API error rate exceeds 5%"
            description: "API server error rate is {{ $value | humanizePercentage }} over the last 5 minutes."
            runbook_url: "https://runbooks.company.com/api-high-error-rate"

        # Pod memory approaching limits
        - alert: PodMemoryNearLimit
          expr: |
            container_memory_working_set_bytes{container!=""}
            / container_spec_memory_limit_bytes > 0.9
            and container_spec_memory_limit_bytes > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} memory usage above 90% of limit"
            description: "Container {{ $labels.container }} in pod {{ $labels.pod }} (namespace {{ $labels.namespace }}) is using {{ $value | humanizePercentage }} of its memory limit."

        # Persistent volume filling up
        - alert: PVCNearlyFull
          expr: |
            kubelet_volume_stats_used_bytes
            / kubelet_volume_stats_capacity_bytes > 0.85
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} is 85% full"
            description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is {{ $value | humanizePercentage }} full. Consider expanding the volume or cleaning up data."

        # SLO burn rate alert (multi-window; 0.001 is the error budget of a 99.9% SLO)
        - alert: SLOBurnRateHigh
          expr: |
            (
              sum(rate(http_requests_total{status=~"5..",job="api-server"}[1h]))
              / sum(rate(http_requests_total{job="api-server"}[1h]))
            ) > (14.4 * 0.001)
            and
            (
              sum(rate(http_requests_total{status=~"5..",job="api-server"}[5m]))
              / sum(rate(http_requests_total{job="api-server"}[5m]))
            ) > (14.4 * 0.001)
          for: 2m
          labels:
            severity: critical
            slo: availability
          annotations:
            summary: "SLO burn rate is consuming error budget too fast"
            description: "The API server is burning its error budget at more than 14.4x the sustainable rate for a 99.9% SLO. At this rate, 2% of the 30-day error budget is consumed every hour and the whole budget would be exhausted in about 2 days."
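
The thresholds in SLOBurnRateHigh follow from simple arithmetic. Assume a 99.9% availability SLO over a 30-day window, so the 0.001 in the expression is the error budget. The Python sketch below computes three things: the error ratio that a 14.4x burn rate corresponds to, how long a constant burn at that rate takes to exhaust the budget, and how much budget one hour at that rate consumes. The function names are illustrative.

```python
def burn_rate_threshold(slo, burn_rate):
    """Error ratio at which errors consume the budget `burn_rate` times too fast."""
    return burn_rate * (1 - slo)


def hours_to_exhaustion(burn_rate, window_days=30):
    """Hours until the whole window's budget is gone at a constant burn rate."""
    return window_days * 24 / burn_rate


def budget_consumed(burn_rate, hours, window_days=30):
    """Fraction of the window's error budget used after `hours` at `burn_rate`."""
    return burn_rate * hours / (window_days * 24)
```

For a 14.4x burn rate, this gives an error-ratio threshold of 1.44%. The budget is exhausted after 50 hours (about two days), and each hour at that rate consumes 2% of the monthly budget.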

Key points for well-designed alert rules:

  • Always set a for duration -- This prevents transient spikes from triggering pages. Five minutes is a reasonable default for most alerts; use shorter durations (1-2 minutes) only for genuinely time-critical conditions like data loss.
  • Include meaningful annotations -- The description should tell the on-call engineer what is happening and include template variables ({{ $value }}, {{ $labels.pod }}) that provide specific context. The runbook_url should link to a document explaining how to diagnose and resolve the issue.
  • Use severity labels consistently -- Define clear severity levels across your organization: critical pages someone immediately, warning needs attention within hours, info is for awareness only. Your Alertmanager routing tree should match these levels to appropriate receivers.
  • Add team labels for routing -- Custom labels like team: platform or team: backend allow you to build routing rules that send alerts to the team responsible for that service.
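These conventions are easy to enforce automatically, for example in CI. The sketch below lints alerting rules represented as Python dictionaries, for instance after parsing a PrometheusRule manifest with a YAML library (not shown). The required labels and annotations mirror the list above; treat them as assumptions to adapt to your own standards. lint_rule and lint_groups are illustrative names.

```python
ALLOWED_SEVERITIES = {"critical", "warning", "info"}


def lint_rule(rule):
    """Return a list of convention problems found in one rule dict."""
    if "alert" not in rule:
        return []  # recording rules are out of scope here
    name = rule["alert"]
    labels = rule.get("labels", {})
    annotations = rule.get("annotations", {})
    problems = []
    if "for" not in rule:
        problems.append(f"{name}: no 'for' duration")
    if labels.get("severity") not in ALLOWED_SEVERITIES:
        problems.append(f"{name}: severity must be one of {sorted(ALLOWED_SEVERITIES)}")
    if "team" not in labels:
        problems.append(f"{name}: missing team label for routing")
    for key in ("description", "runbook_url"):
        if key not in annotations:
            problems.append(f"{name}: missing {key} annotation")
    return problems


def lint_groups(groups):
    """Lint every rule in a PrometheusRule spec.groups list."""
    return [p for group in groups for rule in group.get("rules", []) for p in lint_rule(rule)]
```

Run against the example manifest above, HighApiErrorRate passes cleanly. PodMemoryNearLimit is flagged for its missing team label and runbook_url annotation.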

Embedding Custom Rules in Helm Values

Instead of separate PrometheusRule manifests, you can define custom rules directly in your Helm values under additionalPrometheusRulesMap:

values.yaml — additionalPrometheusRulesMap
additionalPrometheusRulesMap:
  custom-app-rules:
    groups:
      - name: app.rules
        rules:
          - alert: HighLatencyP99
            expr: histogram_quantile(0.99, sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))) > 2
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "P99 latency exceeds 2 seconds"

Testing Your Alert Pipeline

Configuring alerts without testing them is a recipe for silent failures during real incidents. Here is a systematic approach to validating your entire alert pipeline from rule evaluation through notification delivery.

Step 1: Verify PrometheusRule Discovery

Confirm that Prometheus has discovered your custom rules:

Check rule discovery
# List all PrometheusRule resources
kubectl get prometheusrules -n monitoring

# Check Prometheus targets and rules in the UI
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/rules to see all loaded rules
# Open http://localhost:9090/alerts to see alert states

# Check for rule evaluation errors in Prometheus logs
kubectl logs -n monitoring prometheus-kube-prometheus-stack-prometheus-0 \
  -c prometheus --tail=100 | grep -i "rule"
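You can script the same check against the Prometheus HTTP API, which is useful in CI after a deployment. This sketch assumes the port-forward above (http://localhost:9090) and looks up a rule by name in the /api/v1/rules response. fetch_rules and rule_status are hypothetical helpers.

```python
import json
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: the port-forward above


def fetch_rules(base_url=PROMETHEUS_URL):
    """GET /api/v1/rules, restricted to alerting rules."""
    with urllib.request.urlopen(f"{base_url}/api/v1/rules?type=alert", timeout=10) as response:
        return json.load(response)


def rule_status(response, alert_name):
    """Return (group, state, health, last_error) for one alerting rule, or None if not loaded."""
    for group in response.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") == "alerting" and rule.get("name") == alert_name:
                return group.get("name"), rule.get("state"), rule.get("health"), rule.get("lastError", "")
    return None
```

rule_status(fetch_rules(), "HighApiErrorRate") returns None if the rule was never loaded, which usually points to a missing or mismatched release label.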

Step 2: Fire a Test Alert

Create a PrometheusRule that fires immediately to test the entire pipeline end-to-end:

test-alert.yaml — Always-Firing Test Alert
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test-alert
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: test.rules
      rules:
        - alert: TestAlertPipeline
          expr: vector(1)
          for: 1m
          labels:
            severity: warning
            team: platform
          annotations:
            summary: "Test alert — safe to ignore"
            description: "This alert validates the notification pipeline. Delete the test-alert PrometheusRule to resolve."

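Apply this rule with kubectl apply -f test-alert.yaml and allow 2-3 minutes. The operator has to load the rule, the for: 1m period must elapse, and Alertmanager waits group_wait before notifying. Then check each stage: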

  1. Prometheus -- Verify the alert appears as "firing" at http://localhost:9090/alerts.
  2. Alertmanager -- Verify the alert appears at http://localhost:9093/#/alerts and shows the correct receiver assignment.
  3. Notification channel -- Verify you receive the notification in Slack, PagerDuty, email, or your configured destination.
  4. Cleanup -- Delete the test rule with kubectl delete prometheusrule test-alert -n monitoring. Verify you receive a "resolved" notification if send_resolved: true is configured.
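You can also script stages 2 and 4. The Alertmanager v2 API lists active alerts along with the receivers they were routed to and whether a silence or inhibition rule suppresses them. This sketch assumes the port-forward to localhost:9093 used earlier. fetch_alerts and routing_report are illustrative names.

```python
import json
import urllib.parse
import urllib.request

ALERTMANAGER_URL = "http://localhost:9093"  # assumption: the port-forward used earlier


def fetch_alerts(alertname, base_url=ALERTMANAGER_URL):
    """GET /api/v2/alerts filtered to one alertname."""
    query = urllib.parse.urlencode({"filter": f'alertname="{alertname}"'})
    with urllib.request.urlopen(f"{base_url}/api/v2/alerts?{query}", timeout=10) as response:
        return json.load(response)


def routing_report(alerts):
    """Map each alert fingerprint to (state, sorted receiver names)."""
    return {
        alert["fingerprint"]: (
            alert["status"]["state"],  # "active", "suppressed", or "unprocessed"
            sorted(receiver["name"] for receiver in alert.get("receivers", [])),
        )
        for alert in alerts
    }
```

routing_report(fetch_alerts("TestAlertPipeline")) should show the alert as active, routed to the receiver your tree selects for severity=warning. After cleanup, it should return an empty report.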

Step 3: Validate Alertmanager Configuration Syntax

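Use amtool, the Alertmanager CLI, to validate configuration syntax and routing. The commands below check the configuration the operator currently generates. Run the same checks against a candidate file before rolling a change out to production: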

Validate config with amtool
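# Extract the current Alertmanager config generated by the operator.
# Newer Prometheus Operator releases may store it gzip-compressed under the key
# alertmanager.yaml.gz; if so, use '{.data.alertmanager\.yaml\.gz}' and pipe the base64 -d output through gunzip.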
kubectl get secret -n monitoring \
  alertmanager-kube-prometheus-stack-alertmanager-generated \
  -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d > /tmp/am-config.yaml

# Validate the configuration
amtool check-config /tmp/am-config.yaml

# Test routing for a specific alert
amtool config routes test \
  --config.file=/tmp/am-config.yaml \
  severity=critical alertname=HighApiErrorRate namespace=production

The amtool config routes test command is particularly valuable -- it shows exactly which receiver a given set of labels would match, letting you verify routing logic without firing real alerts.

Production Alerting Best Practices

After configuring and testing your alert pipeline, apply these production hardening practices to ensure reliability during real incidents. For clusters with Thanos long-term storage, you can also create alerts on historical data trends that span weeks or months.

  1. Run Alertmanager in high-availability mode. Set alertmanager.alertmanagerSpec.replicas: 3 in your Helm values. Alertmanager instances automatically form a cluster using a gossip protocol and deduplicate notifications across replicas. If one replica goes down, the remaining instances continue delivering alerts.
  2. Use the Watchdog alert as a dead man's switch. kube-prometheus-stack includes a Watchdog alert that fires continuously when everything is healthy. Route it to a dead man's switch service (Dead Man's Snitch, Healthchecks.io, or PagerDuty heartbeat). If the Watchdog notification stops arriving, the monitoring system itself is broken.
  3. Store secrets outside of Helm values. Webhook URLs, API tokens, and SMTP passwords should live in Kubernetes Secrets, not plaintext in values.yaml. Use alertmanager.alertmanagerSpec.secrets to mount secrets into the Alertmanager pod and reference them with file-based config parameters.
  4. Tune repeat intervals per severity. Critical alerts should repeat every 1 hour to stay visible. Warning alerts can repeat every 4-6 hours. Info alerts should repeat every 12-24 hours or not at all. Over-repeating causes notification fatigue; under-repeating lets issues go unnoticed.
  5. Create runbooks for every alert. Every PrometheusRule should have a runbook_url annotation linking to a document that explains: what this alert means, how to diagnose the root cause, how to remediate it, and what the expected impact is if left unresolved. Runbooks make the difference between a 5-minute fix and a 2-hour debugging session at 3 AM.
  6. Implement alert ownership. Use labels like team, service, and component on your PrometheusRules and build routing rules that send alerts to the responsible team. Unowned alerts are ignored alerts.
  7. Review and prune alerts quarterly. Audit your alert rules every quarter. Delete alerts nobody acts on, tighten thresholds that trigger too often, and add alerts for failure modes you discovered during incidents. A lean, high-signal alert set is worth more than 500 rules nobody trusts.
  8. Monitor Alertmanager itself. The kube-prometheus-stack includes an Alertmanager / Overview Grafana dashboard that shows notification success/failure rates, active alerts, and cluster health. Its default rules also include AlertmanagerFailedToSendAlerts, which is based on alertmanager_notifications_failed_total. Make sure that alert reaches a channel that does not depend on the integration that is failing.
  9. Use continue: true for critical alerts. To route critical alerts to both PagerDuty (for paging) and Slack (for team visibility), set continue: true on the PagerDuty route and place a Slack route that also matches severity=critical after it. The on-call engineer is paged, and the rest of the team can follow along in the channel.
  10. Test the pipeline after every change. After modifying Alertmanager configuration or adding new PrometheusRules, fire a test alert to verify end-to-end delivery. Configuration changes that break notification delivery fail silently -- you will not know until a real incident goes unnoticed.

Conclusion

Alertmanager is the critical bridge between knowing something is wrong and getting the right person to fix it. Without properly configured alert routing, grouping, and notification delivery, even the best Prometheus metrics and alerting rules are useless -- they fire into a void that nobody monitors.

The kube-prometheus-stack makes Alertmanager operationally simple: the Prometheus Operator manages the lifecycle, Helm values provide a clean configuration surface, and the 100+ default PrometheusRules give you meaningful alerts from day one. Your job is to connect the last mile -- routing alerts to Slack, PagerDuty, email, or webhooks; tuning grouping and inhibition to reduce noise; writing custom PrometheusRules for your application-specific SLOs; and testing the entire pipeline to ensure it works when it matters most.

Start with the Slack integration, add PagerDuty for critical alerts, configure the Watchdog dead man's switch, and build from there. A well-tuned alerting pipeline is not something you build once and forget -- it evolves with every incident, every post-mortem, and every new service your team deploys.
