The kube-prometheus-stack Helm chart ships with a values.yaml file that spans over 4,000 lines and controls the behavior of every component in the stack: Prometheus, Grafana, Alertmanager, node-exporter, kube-state-metrics, and the Prometheus Operator itself. Getting these settings right is the difference between a monitoring stack that survives production workloads and one that crashes under pressure, loses data on pod restarts, or silently drops alerts.

This guide walks through every major section of the kube-prometheus-stack values.yaml — not just listing what each key does, but explaining why each setting matters, what the real-world defaults should be, and how the values interact with each other. Every YAML example is a working snippet you can paste directly into your own override file.

Why values.yaml Matters

The default values.yaml in kube-prometheus-stack is designed for quick evaluation, not production readiness. Out of the box, Prometheus stores metrics in an emptyDir volume (data is lost on pod restart), resource requests and limits are unset (the scheduler cannot make intelligent placement decisions), and Alertmanager sends notifications to a blackhole receiver (alerts fire but go nowhere). These defaults let you install the stack in two minutes and start exploring — but they will fail you the first time a node drains or a pod gets OOM-killed.

The values.yaml file is the single point of control for the entire monitoring stack. Rather than editing Kubernetes manifests directly, you declare your desired state in values.yaml and Helm renders the corresponding Deployments, ConfigMaps, and custom resources. This matters because kube-prometheus-stack relies on Custom Resource Definitions (Prometheus, Alertmanager, ServiceMonitor, PodMonitor) managed by the Prometheus Operator: Helm renders the Prometheus and Alertmanager custom resources from your values, and the Operator turns them into StatefulSets. The values.yaml is therefore the primary interface for configuring those resources.

Understanding the structure of values.yaml also matters because the chart is a meta-chart. It bundles several sub-charts (Grafana, kube-state-metrics, prometheus-node-exporter) under a single umbrella. Each sub-chart has its own values namespace, and knowing which prefix maps to which component prevents hours of debugging misconfigured overrides.

How to View and Override Default Values

Before customizing anything, inspect the full default values to understand what you are overriding. Helm provides built-in commands for this:

View defaults
# Dump the entire default values.yaml to a file
helm show values prometheus-community/kube-prometheus-stack > defaults.yaml

# Or inspect a specific section of the defaults
helm show values prometheus-community/kube-prometheus-stack | grep -A 50 "prometheusSpec:"

The key principle of Helm overrides is deep merge: maps are merged key by key, so you only need to specify the values you want to change (lists are the exception and are replaced as a whole). Create a minimal file with your customizations and pass it with -f:

Install with overrides
# Install with custom overrides
helm upgrade --install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace \
  -f my-values.yaml

# You can layer multiple override files (later files win)
helm upgrade --install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  -n monitoring \
  -f base-values.yaml \
  -f environment-specific.yaml
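
As a sketch of how small an override file can be, the helper below prints a values file that changes a single key. Every other setting under prometheusSpec keeps its chart default because Helm merges maps recursively. Lists behave differently: an overridden list (tolerations, for example) replaces the default list as a whole. The helper name print_minimal_values is illustrative only.

```shell
# Print a minimal override: only the keys that differ from the chart defaults.
# Sibling keys under prometheusSpec are untouched because maps merge recursively.
print_minimal_values() {
  cat <<'EOF'
prometheus:
  prometheusSpec:
    retention: 15d
EOF
}
```

Redirect the output to a file (print_minimal_values > my-values.yaml) and pass that file with -f.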

You can also override individual values inline with --set, which is useful for CI/CD pipelines where a single value changes between environments:

Inline overrides
# Override a single value inline
helm upgrade --install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  -n monitoring \
  -f my-values.yaml \
  --set prometheus.prometheusSpec.retention=30d \
  --set grafana.adminPassword=mySecurePassword

After installation, verify which values are actually in effect by asking Helm for the values stored with the release:

Verify applied values
# See what values Helm actually used for the release
helm get values kube-prometheus-stack -n monitoring

# Show ALL values (including defaults) for the release
helm get values kube-prometheus-stack -n monitoring --all

# Dry-run to preview changes before applying
helm upgrade kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  -n monitoring -f my-values.yaml --dry-run --debug

Prometheus Configuration (prometheusSpec)

The prometheus.prometheusSpec section is the heart of the values.yaml. It maps directly to the Prometheus Custom Resource managed by the Prometheus Operator. Every field here controls how the Prometheus StatefulSet behaves — scraping, retention, storage, replication, and more.

Retention Settings

Retention controls how long Prometheus keeps time-series data before deleting it. There are two complementary settings:

prometheus-retention.yaml
prometheus:
  prometheusSpec:
    # Time-based retention — delete data older than this
    retention: 15d

    # Size-based retention — delete oldest data when TSDB exceeds this
    # Whichever limit is hit first triggers deletion
    retentionSize: "45GB"

    # WAL compression reduces disk I/O at the cost of CPU
    # Recommended for production — typically 50% reduction in WAL size
    walCompression: true

The default retention is 10d (10 days). For production, 15-30 days is common for hot storage. If you need longer retention, consider using remote write to send data to Thanos, Cortex, or Grafana Mimir rather than increasing Prometheus local retention — large TSDB volumes slow down Prometheus restarts and increase memory usage significantly.

Scrape Configuration

The scrape interval and timeout control how frequently Prometheus collects metrics and how long it waits for a target to respond:

prometheus-scrape.yaml
prometheus:
  prometheusSpec:
    # Global scrape interval — applies to all targets unless overridden
    scrapeInterval: "30s"

    # Global scrape timeout: must not exceed scrapeInterval
    scrapeTimeout: "10s"

    # How often Prometheus evaluates alerting and recording rules
    evaluationInterval: "30s"

    # External labels added to all metrics (useful for multi-cluster)
    externalLabels:
      cluster: production-us-east-1
      environment: production

Reducing scrapeInterval to 15s doubles the number of samples collected, which roughly doubles storage requirements and increases CPU and memory load. For most workloads, 30s provides sufficient granularity. Only reduce it for latency-sensitive SLO tracking where 30-second gaps in data are unacceptable.
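
The doubling is plain arithmetic: one sample per series per scrape. A small helper (the name samples_per_day is illustrative) makes the per-series cost of an interval explicit:

```shell
# Samples collected per series per day for a given scrape interval in seconds.
samples_per_day() {
  echo $(( 86400 / $1 ))
}

samples_per_day 30   # prints 2880
samples_per_day 15   # prints 5760
```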

ServiceMonitor and PodMonitor Selectors

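By default, Prometheus looks in all namespaces for ServiceMonitor and PodMonitor resources, but the chart only selects those labeled with the Helm release name (release: <release-name>), because serviceMonitorSelectorNilUsesHelmValues and its PodMonitor counterpart default to true. You can adjust the selectors for multi-tenant clusters or to isolate monitoring scopes: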

prometheus-selectors.yaml
prometheus:
  prometheusSpec:
    # Select ServiceMonitors by label
    serviceMonitorSelector:
      matchLabels:
        prometheus: kube-prometheus-stack

    # Watch specific namespaces only (empty = all namespaces)
    serviceMonitorNamespaceSelector:
      matchLabels:
        monitoring: "enabled"

    # Same pattern applies to PodMonitors
    podMonitorSelector: {}
    podMonitorNamespaceSelector: {}

    # And to Probes (for example, blackbox exporter probes)
    probeSelector: {}
    probeNamespaceSelector: {}

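In the Prometheus custom resource, a selector of {} (empty object) means "select everything." In this chart, however, leaving serviceMonitorSelector empty or omitting it makes the chart fall back to matching the release label, unless you also set serviceMonitorSelectorNilUsesHelmValues: false (podMonitorSelectorNilUsesHelmValues, probeSelectorNilUsesHelmValues, and ruleSelectorNilUsesHelmValues work the same way). Only add label selectors when you have a specific reason to restrict which ServiceMonitors Prometheus picks up.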
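
To make the selector semantics concrete, here is a sketch of a ServiceMonitor that the selectors in prometheus-selectors.yaml would match. The names my-app, my-namespace, and http-metrics are hypothetical placeholders, and the helper name render_servicemonitor is illustrative.

```shell
# Print a ServiceMonitor carrying the label that serviceMonitorSelector matches on.
# my-app, my-namespace, and http-metrics are placeholders, not chart defaults.
render_servicemonitor() {
  cat <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-namespace
  labels:
    prometheus: kube-prometheus-stack   # matched by serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app    # selects the Service to scrape
  endpoints:
    - port: http-metrics                # named port on that Service
      interval: 30s
EOF
}
```

The namespace must also satisfy serviceMonitorNamespaceSelector, for example via kubectl label namespace my-namespace monitoring=enabled, before applying the manifest with render_servicemonitor | kubectl apply -f -.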

Replicas and High Availability

prometheus-ha.yaml
prometheus:
  prometheusSpec:
    # Run 2 replicas for high availability
    replicas: 2

    # Shard across multiple Prometheus instances for large clusters
    # Each shard gets a subset of targets automatically
    shards: 1

    # Pod anti-affinity ensures replicas run on different nodes
    podAntiAffinity: "hard"

    # Topology spread constraints for zone-aware scheduling
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus

Running 2 replicas means both Prometheus instances scrape the same targets independently. They produce near-identical data but are not perfectly synchronized. For deduplication at query time, pair this with Thanos Query, which can tell the replicas apart by the external label the Prometheus Operator adds to each one (prometheus_replica by default, configurable via replicaExternalLabelName).

Grafana Configuration

Grafana runs as a sub-chart within kube-prometheus-stack. All Grafana settings are nested under the grafana: key and follow the Grafana Helm chart's values structure.

Admin Credentials and Basic Settings

grafana-basic.yaml
grafana:
  # Enable or disable Grafana entirely
  enabled: true

  # Default admin credentials (change these!)
  adminUser: admin
  adminPassword: my-secure-grafana-password

  # Or, preferably, reference an existing Kubernetes Secret (use one approach, not both)
  admin:
    existingSecret: grafana-admin-credentials
    userKey: admin-user
    passwordKey: admin-password

  # Number of Grafana replicas
  replicas: 1

  # Grafana.ini settings passed directly to the Grafana config
  grafana.ini:
    server:
      root_url: https://grafana.example.com
    auth.generic_oauth:
      enabled: true
      name: SSO
      client_id: grafana
      scopes: openid profile email

Dashboard and Datasource Sidecar

The sidecar is what makes Grafana dashboards self-provisioning in kube-prometheus-stack. It watches for ConfigMaps with specific labels and loads them as dashboards or data sources:

grafana-sidecar.yaml
grafana:
  sidecar:
    dashboards:
      enabled: true
      # Label that ConfigMaps must have to be picked up
      label: grafana_dashboard
      labelValue: "1"
      # Watch all namespaces, not just the release namespace
      searchNamespace: ALL
      # Directory inside the Grafana pod where the sidecar writes dashboard files
      folder: /tmp/dashboards
      folderAnnotation: grafana_folder
      provider:
        foldersFromFilesStructure: true
    datasources:
      enabled: true
      label: grafana_datasource
      labelValue: "1"

Grafana Persistence

By default, Grafana uses emptyDir storage, meaning dashboards created through the UI, user preferences, and alert rules are lost on pod restart. Enable persistence for production:

grafana-persistence.yaml
grafana:
  persistence:
    enabled: true
    type: pvc
    storageClassName: gp3
    accessModes:
      - ReadWriteOnce
    size: 10Gi
    finalizers:
      - kubernetes.io/pvc-protection

Even with persistence enabled, the recommended approach for production dashboards is ConfigMap provisioning (described in the Grafana dashboards guide) rather than relying on the SQLite database stored on the PVC. The PVC should be seen as a safety net, not the primary storage mechanism for dashboards.

Alertmanager Configuration

The Alertmanager configuration in values.yaml controls how alerts are routed, grouped, and delivered to notification channels. This is the section most teams get wrong on the first attempt.

Alertmanager Spec

alertmanager-spec.yaml
alertmanager:
  enabled: true

  alertmanagerSpec:
    # Replicas for high availability (instances form a gossip cluster; no shared storage is required)
    replicas: 3

    # How long Alertmanager keeps its data (silences and notification log)
    retention: 120h

    # Persistent storage for alert silences and notification state
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 5Gi

    # Resource limits
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 256Mi

Alert Routing Configuration

The alertmanager.config section defines how alerts flow from Prometheus to your notification channels. This is the actual Alertmanager configuration — not Helm abstractions:

alertmanager-routing.yaml
alertmanager:
  config:
    global:
      resolve_timeout: 5m
      slack_api_url: "https://hooks.slack.com/services/T.../B.../xxx"

    route:
      # Group alerts by these labels before sending
      group_by: ['namespace', 'alertname', 'severity']
      # Wait this long to batch alerts in the same group
      group_wait: 30s
      # Wait this long before sending updates for an existing group
      group_interval: 5m
      # Wait this long before resending a notification
      repeat_interval: 4h
      receiver: slack-default

      routes:
        # Critical alerts go to PagerDuty immediately
        - matchers:
            - severity = critical
          receiver: pagerduty-critical
          group_wait: 10s
          repeat_interval: 1h

        # Warning alerts go to a dedicated Slack channel
        - matchers:
            - severity = warning
          receiver: slack-warnings
          repeat_interval: 12h

        # Drop Watchdog alerts (they are heartbeat checks) by routing them to the null receiver
        - matchers:
            - alertname = Watchdog
          receiver: "null"

    receivers:
      - name: "null"
      - name: slack-default
        slack_configs:
          - channel: "#monitoring-alerts"
            send_resolved: true
            title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
            text: "{{ range .Alerts }}*{{ .Labels.severity }}*: {{ .Annotations.summary }}\n{{ end }}"
      - name: slack-warnings
        slack_configs:
          - channel: "#monitoring-warnings"
            send_resolved: true
      - name: pagerduty-critical
        pagerduty_configs:
          - routing_key: "your-pagerduty-integration-key"
            severity: '{{ .CommonLabels.severity }}'

    inhibit_rules:
      # If a critical alert fires, suppress warning alerts with the same namespace and alertname
      - source_matchers:
          - severity = critical
        target_matchers:
          - severity = warning
        equal: ['namespace', 'alertname']

The group_wait, group_interval, and repeat_interval settings are the most commonly misconfigured. Setting group_wait too low causes alert storms — dozens of individual notifications instead of a single grouped message. Setting repeat_interval too low causes alert fatigue, where teams start ignoring repeated notifications.
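
To get a feel for repeat_interval, it helps to count how often one unresolved alert group can notify per day. The helper below is a rough sketch (the name max_repeats_per_day is illustrative) that assumes the group's membership does not change; new alerts joining a group trigger extra notifications on the group_interval schedule.

```shell
# Approximate upper bound on repeat notifications per day for one alert group
# that keeps firing unchanged.
# Usage: max_repeats_per_day <repeat_interval_hours>
max_repeats_per_day() {
  echo $(( 24 / $1 ))
}

max_repeats_per_day 1    # prints 24 (critical route above)
max_repeats_per_day 4    # prints 6  (default route above)
max_repeats_per_day 12   # prints 2  (warning route above)
```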

Node Exporter and Kube-State-Metrics Settings

These two exporters are the primary data sources for infrastructure and Kubernetes-level metrics. Both run as sub-charts with their own configuration namespaces.

Node Exporter

Node Exporter runs as a DaemonSet, placing one pod on every node to collect OS-level metrics (CPU, memory, disk, network, filesystem):

node-exporter.yaml
prometheus-node-exporter:
  enabled: true

  # Run on all nodes including control plane
  tolerations:
    - effect: NoSchedule
      operator: Exists

  # Resource limits — node-exporter is lightweight
  resources:
    requests:
      cpu: 50m
      memory: 30Mi
    limits:
      cpu: 100m
      memory: 50Mi

  # Extra arguments to enable/disable specific collectors
  extraArgs:
    - --collector.filesystem.mount-points-exclude=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/.+)($|/)
    - --collector.filesystem.fs-types-exclude=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$

The tolerations field is critical: without it, node-exporter pods will not be scheduled on nodes with NoSchedule taints (control plane nodes, GPU nodes, or nodes with custom taints), and you get blind spots in your monitoring wherever node-exporter cannot run. Nodes tainted with NoExecute need an additional toleration for that effect.

Kube-State-Metrics

Kube-state-metrics generates metrics about the state of Kubernetes objects (Deployments, Pods, Nodes, PVCs, etc.). It queries the Kubernetes API server and converts object states into Prometheus metrics:

kube-state-metrics.yaml
kube-state-metrics:
  enabled: true

  # Only generate metrics for these resource types (reduce cardinality)
  collectors:
    - daemonsets
    - deployments
    - horizontalpodautoscalers
    - jobs
    - namespaces
    - nodes
    - persistentvolumeclaims
    - persistentvolumes
    - pods
    - replicasets
    - statefulsets

  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 100m
      memory: 128Mi

  # Add custom labels from Kubernetes objects as Prometheus labels
  metricLabelsAllowlist:
    - pods=[app.kubernetes.io/name,app.kubernetes.io/instance]
    - deployments=[app.kubernetes.io/name,app.kubernetes.io/version]

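The metricLabelsAllowlist setting is powerful but often overlooked. It exposes the listed Kubernetes object labels on the corresponding kube_<resource>_labels metrics (for example kube_pod_labels), which you can join against other metrics to answer questions like "show me CPU usage grouped by application version." Be selective: every additional label adds to metric cardinality, and labels whose values change often can multiply the number of series.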

Storage and Persistence Configuration

Storage is the most critical production configuration. Without persistent storage, every pod restart loses all historical metrics, alert silences, and Grafana state. This is the single most common failure mode in new kube-prometheus-stack deployments.

Prometheus Storage

prometheus-storage.yaml
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              # Size depends on: retention period x ingestion rate x bytes per sample
              # Rule of thumb: ~2 bytes per sample after compression
              # 100k active series x 2,880 samples/day (30s interval) x 15 days x 2 bytes ~ 8.6GB
              # 500k active series under the same assumptions ~ 43GB
              storage: 50Gi

Sizing the PVC correctly requires understanding your cluster's cardinality. Use the Prometheus expression prometheus_tsdb_head_series to check active time series count, and prometheus_tsdb_storage_blocks_bytes to see current disk usage. A good rule of thumb: provision roughly 2x the disk space you expect to need, because the WAL, compaction (which writes a new block before deleting the ones it replaces), and cardinality growth all need headroom.
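
The sizing rule of thumb (retention x ingestion rate x bytes per sample) can be worked through with shell arithmetic. This is a sketch under the stated assumption of about 2 bytes per sample; the helper name estimate_tsdb_bytes is illustrative, and real usage depends on churn and compression.

```shell
# Estimate on-disk sample data in bytes.
# Usage: estimate_tsdb_bytes <active_series> <scrape_interval_s> <retention_days> <bytes_per_sample>
estimate_tsdb_bytes() {
  samples_per_series=$(( 86400 / $2 * $3 ))
  echo $(( $1 * samples_per_series * $4 ))
}

estimate_tsdb_bytes 100000 30 15 2   # prints 8640000000  (about 8.6 GB)
estimate_tsdb_bytes 500000 30 15 2   # prints 43200000000 (about 43 GB)
```

Doubling the result for headroom gives a starting PVC size; retentionSize should then sit comfortably below the PVC capacity.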

Alertmanager Storage

alertmanager-storage.yaml
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              # Alertmanager stores silences and nflog — 5Gi is plenty
              storage: 5Gi

Alertmanager storage requirements are modest — it only stores alert silences and the notification log. 5Gi is sufficient for virtually all deployments. The critical thing is that storage exists at all, so silences survive pod restarts during maintenance windows.

Resource Requests and Limits

Setting resource requests and limits is essential for production stability. Without them, the Kubernetes scheduler cannot make informed placement decisions, and monitoring components compete with application workloads for resources — often losing during memory pressure events.

resource-limits.yaml
# Prometheus — the most resource-intensive component
prometheus:
  prometheusSpec:
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 4Gi

# Prometheus Operator
prometheusOperator:
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 200m
      memory: 256Mi

# Grafana
grafana:
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 256Mi

# Alertmanager
alertmanager:
  alertmanagerSpec:
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 256Mi

Prometheus memory usage scales linearly with the number of active time series. A rough formula: memory = active_series x 2KB + 500MB base. A cluster with 500,000 active series needs approximately 1.5GB of memory. Always set memory limits 50-100% higher than requests to handle query spikes and compaction bursts.
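
The rough formula above can be written as a small shell helper. It is only an estimate under the stated 2KB-per-series assumption, using decimal units (1 MB = 1000 KB); the helper name estimate_prometheus_memory_mb is illustrative.

```shell
# Estimate Prometheus memory in MB: active_series x 2KB + 500MB base.
# Usage: estimate_prometheus_memory_mb <active_series>
estimate_prometheus_memory_mb() {
  echo $(( $1 * 2 / 1000 + 500 ))
}

estimate_prometheus_memory_mb 500000    # prints 1500
estimate_prometheus_memory_mb 2000000   # prints 4500
```

Compare the estimate against process_resident_memory_bytes on a running instance before settling on requests and limits.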

Avoid setting CPU limits on Prometheus if possible. CPU throttling can cause Prometheus to fall behind on scraping, creating gaps in metrics data. If your cluster enforces CPU limits via LimitRange, set the limit at least 4x the request.

Network Policies and Security

Securing the monitoring stack is critical because it has broad read access to the Kubernetes API and stores sensitive operational data. The values.yaml provides several security-related configuration options.

Network Policies

network-policies.yaml
# Label the Prometheus pods so your own NetworkPolicies can select them, and enable the Grafana sub-chart's policy
prometheus:
  prometheusSpec:
    podMetadata:
      labels:
        networking/allow-prometheus: "true"

grafana:
  networkPolicy:
    enabled: true
    # Allow ingress from specific namespaces
    ingress:
      - from:
          - namespaceSelector:
              matchLabels:
                name: ingress-nginx
        ports:
          - port: 3000
            protocol: TCP

Security Contexts and RBAC

security-context.yaml
prometheus:
  prometheusSpec:
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 2000
      fsGroup: 2000
      seccompProfile:
        type: RuntimeDefault

    containers:
      - name: prometheus
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
              - ALL

grafana:
  securityContext:
    runAsNonRoot: true
    runAsUser: 472
    fsGroup: 472

  # Use existing ServiceAccount or let the chart create one
  serviceAccount:
    create: true
    name: grafana
    annotations:
      # For AWS IRSA or GCP Workload Identity
      eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/grafana-role

Ingress Configuration

Expose Grafana (and optionally Prometheus/Alertmanager) via an Ingress resource:

ingress.yaml
grafana:
  ingress:
    enabled: true
    ingressClassName: nginx
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
      nginx.ingress.kubernetes.io/ssl-redirect: "true"
    hosts:
      - grafana.example.com
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.example.com

prometheus:
  ingress:
    enabled: true
    ingressClassName: nginx
    annotations:
      nginx.ingress.kubernetes.io/auth-type: basic
      nginx.ingress.kubernetes.io/auth-secret: prometheus-basic-auth
    hosts:
      - prometheus.example.com
    tls:
      - secretName: prometheus-tls
        hosts:
          - prometheus.example.com

Always protect Prometheus and Alertmanager with authentication when exposing them via Ingress. Unlike Grafana, they do not enforce authentication by default, so anyone with access to the URL can read all metrics and modify alert silences. Use basic auth, OAuth2 proxy, or a VPN.
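
The basic-auth annotations in the Prometheus Ingress above reference a Secret named prometheus-basic-auth, which ingress-nginx expects to contain an htpasswd-style file under the key auth. A minimal sketch of generating that file content, assuming openssl is installed (the helper name render_htpasswd_line is illustrative):

```shell
# Print one htpasswd-style line using an APR1 (Apache MD5) password hash.
# Usage: render_htpasswd_line <user> <password>
render_htpasswd_line() {
  printf '%s:%s\n' "$1" "$(openssl passwd -apr1 "$2")"
}
```

Write the output to a file named auth (render_htpasswd_line admin 'change-me' > auth) and create the Secret with kubectl create secret generic prometheus-basic-auth --from-file=auth -n monitoring.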

Complete Production-Ready values.yaml Example

Here is a comprehensive production values.yaml that combines all the best practices discussed in this guide. This configuration is suitable for a medium-sized cluster (50-200 nodes, 500k-2M active time series):

production-values.yaml
# ============================================================
# Production-Ready kube-prometheus-stack values.yaml
# Cluster size: 50-200 nodes, 500k-2M active time series
# ============================================================

# --- Global Settings ---
fullnameOverride: kps
defaultRules:
  create: true
  rules:
    alertmanager: true
    etcd: true
    configReloaders: true
    general: true
    k8sContainerCpuUsageSecondsTotal: true
    k8sContainerMemoryWorkingSetBytes: true
    k8sPodOwner: true
    kubeApiserverAvailability: true
    kubeApiserverBurnrate: true
    kubeApiserverHistogram: true
    kubeApiserverSlos: true
    kubePrometheusGeneral: true
    kubePrometheusNodeRecording: true
    kubernetesApps: true
    kubernetesResources: true
    kubernetesStorage: true
    kubernetesSystem: true
    kubeSchedulerAlerting: true
    kubeSchedulerRecording: true
    kubeStateMetrics: true
    network: true
    node: true
    nodeExporterAlerting: true
    nodeExporterRecording: true
    prometheus: true
    prometheusOperator: true

# --- Prometheus Operator ---
prometheusOperator:
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 200m
      memory: 256Mi
  admissionWebhooks:
    enabled: true
    patch:
      enabled: true

# --- Prometheus ---
prometheus:
  prometheusSpec:
    replicas: 2
    retention: 15d
    retentionSize: "45GB"
    walCompression: true
    scrapeInterval: "30s"
    evaluationInterval: "30s"
    podAntiAffinity: "hard"
    externalLabels:
      cluster: production
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    resources:
      requests:
        cpu: 1000m
        memory: 4Gi
      limits:
        memory: 8Gi
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      fsGroup: 2000
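    # Select monitors regardless of the Helm release label
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    serviceMonitorNamespaceSelector: {}
    podMonitorSelector: {}
    podMonitorNamespaceSelector: {}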

# --- Alertmanager ---
alertmanager:
  alertmanagerSpec:
    replicas: 3
    retention: 120h
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 5Gi
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 256Mi
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['namespace', 'alertname']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: slack-default
      routes:
        - matchers:
            - severity = critical
          receiver: pagerduty-critical
          group_wait: 10s
        - matchers:
            - alertname = Watchdog
          receiver: "null"
    receivers:
      - name: "null"
      - name: slack-default
        slack_configs:
          - channel: "#monitoring"
            send_resolved: true
      - name: pagerduty-critical
        pagerduty_configs:
          - routing_key: "your-key-here"

# --- Grafana ---
grafana:
  enabled: true
  replicas: 1
  admin:
    existingSecret: grafana-admin-credentials
    userKey: admin-user
    passwordKey: admin-password
  persistence:
    enabled: true
    storageClassName: gp3
    size: 10Gi
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 256Mi
  ingress:
    enabled: true
    ingressClassName: nginx
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
    hosts:
      - grafana.example.com
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.example.com

# --- Node Exporter ---
prometheus-node-exporter:
  tolerations:
    - effect: NoSchedule
      operator: Exists
  resources:
    requests:
      cpu: 50m
      memory: 30Mi
    limits:
      cpu: 100m
      memory: 50Mi

# --- Kube State Metrics ---
kube-state-metrics:
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 100m
      memory: 128Mi

This configuration provides: high-availability Prometheus with 2 replicas and anti-affinity, persistent storage for all stateful components, resource limits on every component, Alertmanager with routing to Slack and PagerDuty, Grafana with TLS ingress and credential management via Secrets, and node-exporter tolerations for full cluster coverage.

Adapt the storage class (gp3) to match your cloud provider: use standard-rwo or premium-rwo on GKE, managed-premium on AKS, or your custom StorageClass for on-premises clusters. Adjust the PVC size, retentionSize, and resource requests based on your cluster's active time series count. The values above are starting points: at 15 days of retention and roughly 2 bytes per sample, the 50Gi volume only fits the lower end of the stated series range.

Conclusion

The kube-prometheus-stack values.yaml is the single most important file in your Kubernetes monitoring infrastructure. Getting it right means the difference between a monitoring stack that survives production incidents and one that fails precisely when you need it most.

The critical takeaways from this guide are:

  1. Always configure persistent storage for Prometheus and Alertmanager. Without it, you lose all metrics and silences on every pod restart.
  2. Set resource requests and limits on every component. Prometheus especially needs memory guarantees to avoid OOM kills during query spikes or compaction.
  3. Configure Alertmanager routing to real notification channels. The default blackhole receiver means alerts fire but nobody knows.
  4. Use retentionSize alongside retention so Prometheus cannot fill its disk. Size-based retention is your safety net when cardinality unexpectedly increases.
  5. Override only what you need. Keep your values file minimal and let Helm deep-merge with the chart defaults. This makes upgrades dramatically less painful.
  6. Test changes with --dry-run --debug before applying them. A typo in values.yaml can take down your monitoring stack.

Start with the quick install guide, customize your values.yaml using the examples in this guide, and build from there. The Helm chart documentation covers additional advanced configuration options for edge cases not covered here.
