Kube-Prometheus-Stack values.yaml: Every Configuration Option Explained (2026)
The kube-prometheus-stack Helm chart ships with a values.yaml file that spans over 4,000 lines and controls the behavior of every component in the stack: Prometheus, Grafana, Alertmanager, node-exporter, kube-state-metrics, and the Prometheus Operator itself. Getting these settings right is the difference between a monitoring stack that survives production workloads and one that crashes under pressure, loses data on pod restarts, or silently drops alerts.
This guide walks through every major section of the kube-prometheus-stack values.yaml — not just listing what each key does, but explaining why each setting matters, what the real-world defaults should be, and how the values interact with each other. Every YAML example is a working snippet you can paste directly into your own override file.
Why values.yaml Matters
The default values.yaml in kube-prometheus-stack is designed for quick evaluation, not production readiness. Out of the box, Prometheus stores metrics in an emptyDir volume (data is lost on pod restart), resource requests and limits are unset (the scheduler cannot make intelligent placement decisions), and Alertmanager sends notifications to a blackhole receiver (alerts fire but go nowhere). These defaults let you install the stack in two minutes and start exploring — but they will fail you the first time a node drains or a pod gets OOM-killed.
The values.yaml file is the single point of control for the entire monitoring stack. Rather than editing Kubernetes manifests directly, you declare your desired state in values.yaml and Helm renders the correct Deployments, StatefulSets, ConfigMaps, and CRDs. This is critical because the kube-prometheus-stack uses Custom Resource Definitions (Prometheus, Alertmanager, ServiceMonitor, PodMonitor) managed by the Prometheus Operator — and the values.yaml is the primary interface for configuring those CRDs.
Understanding the structure of values.yaml also matters because the chart is a meta-chart. It bundles several sub-charts (Grafana, kube-state-metrics, prometheus-node-exporter) under a single umbrella. Each sub-chart has its own values namespace, and knowing which prefix maps to which component prevents hours of debugging misconfigured overrides.
How to View and Override Default Values
Before customizing anything, inspect the full default values to understand what you are overriding. Helm provides built-in commands for this:
    # Dump the entire default values.yaml to a file
    helm show values prometheus-community/kube-prometheus-stack > defaults.yaml

    # Or inspect one section of the defaults, such as prometheusSpec
    helm show values prometheus-community/kube-prometheus-stack | grep -A 50 "prometheusSpec:"
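The key principle of Helm overrides is deep merge: you only need to specify the values you want to change. Maps are merged key by key, but lists are replaced wholesale, so overriding a list means restating every item you want to keep. Create a minimal file with your customizations and pass it with -f: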
    # Install with custom overrides
    helm upgrade --install kube-prometheus-stack \
      prometheus-community/kube-prometheus-stack \
      -n monitoring --create-namespace \
      -f my-values.yaml

    # You can layer multiple override files (later files win)
    helm upgrade --install kube-prometheus-stack \
      prometheus-community/kube-prometheus-stack \
      -n monitoring \
      -f base-values.yaml \
      -f environment-specific.yaml
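As a rough illustration of the override file passed with -f above, a minimal my-values.yaml might contain only the keys that differ from the chart defaults. The values below are placeholders chosen for the example, not recommendations.

```yaml
# my-values.yaml (illustrative; values are placeholders)
prometheus:
  prometheusSpec:
    retention: 30d            # only this key changes; sibling keys keep chart defaults
grafana:
  adminPassword: change-me    # placeholder; prefer an existing Secret in production
```

Everything not mentioned in the file continues to come from the chart's own values.yaml.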
You can also override individual values inline with --set, which is useful for CI/CD pipelines where a single value changes between environments:
    # Override a single value inline
    helm upgrade --install kube-prometheus-stack \
      prometheus-community/kube-prometheus-stack \
      -n monitoring \
      -f my-values.yaml \
      --set prometheus.prometheusSpec.retention=30d \
      --set grafana.adminPassword=mySecurePassword
After installation, verify which values are actually in effect by asking Helm for the release's values, and preview changes with a dry run before applying them:

    # See what values Helm actually used for the release
    helm get values kube-prometheus-stack -n monitoring

    # Show ALL values (including defaults) for the release
    helm get values kube-prometheus-stack -n monitoring --all

    # Dry-run to preview changes before applying
    helm upgrade kube-prometheus-stack \
      prometheus-community/kube-prometheus-stack \
      -n monitoring -f my-values.yaml --dry-run --debug
Prometheus Configuration (prometheusSpec)
The prometheus.prometheusSpec section is the heart of the values.yaml. It maps directly to the Prometheus Custom Resource managed by the Prometheus Operator. Every field here controls how the Prometheus StatefulSet behaves — scraping, retention, storage, replication, and more.
Retention Settings
Retention controls how long Prometheus keeps time-series data before compacting and deleting it. There are two complementary settings:
    prometheus:
      prometheusSpec:
        # Time-based retention: delete data older than this
        retention: 15d
        # Size-based retention: delete oldest data when TSDB exceeds this
        # Whichever limit is hit first triggers deletion
        retentionSize: "45GB"
        # WAL compression reduces disk I/O at the cost of CPU
        # Recommended for production; typically 50% reduction in WAL size
        walCompression: true
The default retention is 10d (10 days). For production, 15-30 days is common for hot storage. If you need longer retention, consider using remote write to send data to Thanos, Cortex, or Grafana Mimir rather than increasing Prometheus local retention — large TSDB volumes slow down Prometheus restarts and increase memory usage significantly.
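To make the remote-write suggestion concrete, here is a minimal sketch that keeps local retention short and forwards samples to a long-term store. The endpoint URL is hypothetical; substitute the push URL of your own Mimir, Cortex, or Thanos Receive installation, and add authentication as that backend requires.

```yaml
prometheus:
  prometheusSpec:
    retention: 15d
    remoteWrite:
      # Hypothetical endpoint, shown only to illustrate the shape of the setting
      - url: https://metrics.example.com/api/v1/push
```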
Scrape Configuration
The scrape interval and timeout control how frequently Prometheus collects metrics and how long it waits for a target to respond:
    prometheus:
      prometheusSpec:
        # Global scrape interval: applies to all targets unless overridden
        scrapeInterval: "30s"
        # Global scrape timeout: must not exceed scrapeInterval
        scrapeTimeout: "10s"
        # How often Prometheus evaluates alerting and recording rules
        evaluationInterval: "30s"
        # External labels added to all metrics (useful for multi-cluster)
        externalLabels:
          cluster: production-us-east-1
          environment: production
Reducing scrapeInterval to 15s doubles the number of samples collected, which roughly doubles storage requirements and raises memory and CPU usage. For most workloads, 30s provides sufficient granularity. Only reduce it for latency-sensitive SLO tracking where 30-second gaps in data are unacceptable.
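Rather than lowering the global interval, a single latency-sensitive target can be scraped more often through its own ServiceMonitor. The sketch below assumes a hypothetical Service named checkout-api in a hypothetical shop namespace, exposing a port named http-metrics, and a Helm release called kube-prometheus-stack.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: checkout-api                 # hypothetical service
  namespace: shop                    # hypothetical namespace
  labels:
    release: kube-prometheus-stack   # assumed release name, so the default selector matches
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: checkout-api
  endpoints:
    - port: http-metrics             # name of the Service port exposing /metrics
      interval: 15s                  # overrides the global scrapeInterval for this target only
      scrapeTimeout: 10s
```

Every other target keeps the 30s default, so the extra sample volume is limited to the one service that needs it.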
ServiceMonitor and PodMonitor Selectors
By default, the chart configures Prometheus to look in all namespaces for ServiceMonitor and PodMonitor resources, but to select only those carrying the Helm release label (release: <release-name>). You can tighten or loosen this for multi-tenant clusters or to isolate monitoring scopes:

    prometheus:
      prometheusSpec:
        # Select ServiceMonitors by label
        serviceMonitorSelector:
          matchLabels:
            prometheus: kube-prometheus-stack
        # Watch specific namespaces only (empty = all namespaces)
        serviceMonitorNamespaceSelector:
          matchLabels:
            monitoring: "enabled"
        # Same pattern applies to PodMonitors
        podMonitorSelector: {}
        podMonitorNamespaceSelector: {}
        # And to Probes (blackbox exporter probes)
        probeSelector: {}
        probeNamespaceSelector: {}

An empty namespace selector ({}) means "all namespaces." The object selectors behave differently in this chart: when serviceMonitorSelector is empty or omitted and serviceMonitorSelectorNilUsesHelmValues is left at its default of true, the chart selects only ServiceMonitors labeled with the Helm release name. To select everything, set serviceMonitorSelectorNilUsesHelmValues: false (and the matching podMonitor, probe, and rule flags). This is the most common reason a ServiceMonitor you created yourself is silently ignored.
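For a cluster where one Prometheus should pick up every monitor and rule regardless of labels, a sketch along the following lines is a common starting point. It assumes a chart version that exposes the *SelectorNilUsesHelmValues flags, which default to true and otherwise restrict selection to objects labeled with the Helm release name.

```yaml
prometheus:
  prometheusSpec:
    # With these flags set to false, an empty selector is rendered as {} (match all)
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    probeSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    podMonitorSelector: {}
    probeSelector: {}
    ruleSelector: {}
```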
Replicas and High Availability
    prometheus:
      prometheusSpec:
        # Run 2 replicas for high availability
        replicas: 2
        # Shard across multiple Prometheus instances for large clusters
        # Each shard gets a subset of targets automatically
        shards: 1
        # Pod anti-affinity ensures replicas run on different nodes
        podAntiAffinity: "hard"
        # Topology spread constraints for zone-aware scheduling
        topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: prometheus
Running 2 replicas means both Prometheus instances scrape the same targets independently. They produce near-identical data but are not perfectly synchronized. For deduplication at query time, pair this with Thanos Query, using the replica label the Prometheus Operator adds to each instance (prometheus_replica by default, configurable via replicaExternalLabelName).
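As a small sketch of the replica-label setup, the fragment below renames the per-replica external label to replica, a name chosen here purely for illustration. A Thanos Query deployment would then be started with a matching --query.replica-label=replica flag so it can collapse the two copies of each series.

```yaml
prometheus:
  prometheusSpec:
    replicas: 2
    # Name of the external label the operator sets to each replica's pod name
    replicaExternalLabelName: replica   # illustrative choice; the operator default is prometheus_replica
    externalLabels:
      cluster: production-us-east-1
```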
Grafana Configuration
Grafana runs as a sub-chart within kube-prometheus-stack. All Grafana settings are nested under the grafana: key and follow the Grafana Helm chart's values structure.
Admin Credentials and Basic Settings
    grafana:
      # Enable or disable Grafana entirely
      enabled: true
      # Default admin credentials (change these!)
      adminUser: admin
      adminPassword: my-secure-grafana-password
      # Or use an existing Kubernetes Secret
      admin:
        existingSecret: grafana-admin-credentials
        userKey: admin-user
        passwordKey: admin-password
      # Number of Grafana replicas
      replicas: 1
      # Grafana.ini settings passed directly to the Grafana config
      grafana.ini:
        server:
          root_url: https://grafana.example.com
        auth.generic_oauth:
          enabled: true
          name: SSO
          client_id: grafana
          scopes: openid profile email
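The existingSecret option above expects a Secret whose keys match userKey and passwordKey. A minimal sketch of such a Secret, assuming the release lives in the monitoring namespace, could look like this; the password value is a placeholder.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin-credentials
  namespace: monitoring              # assumed release namespace
type: Opaque
stringData:
  admin-user: admin
  admin-password: replace-with-a-generated-password   # placeholder
```

In practice the Secret would usually come from a secrets manager or sealed-secrets workflow rather than a plain manifest checked into Git.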
Dashboard and Datasource Sidecar
The sidecar is what makes Grafana dashboards self-provisioning in kube-prometheus-stack. It watches for ConfigMaps with specific labels and loads them as dashboards or data sources:
    grafana:
      sidecar:
        dashboards:
          enabled: true
          # Label that ConfigMaps must have to be picked up
          label: grafana_dashboard
          labelValue: "1"
          # Watch all namespaces, not just the release namespace
          searchNamespace: ALL
          # Path inside the Grafana container where the sidecar writes dashboards
          folder: /tmp/dashboards
          # Annotation on the ConfigMap that names the target subfolder
          folderAnnotation: grafana_folder
          provider:
            foldersFromFilesStructure: true
        datasources:
          enabled: true
          label: grafana_datasource
          labelValue: "1"
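With the sidecar settings above, a dashboard is delivered as a labeled ConfigMap. The sketch below uses a hypothetical dashboard name, namespace, and folder, and a deliberately tiny dashboard JSON; a real dashboard would be exported from Grafana.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: team-overview-dashboard      # hypothetical name
  namespace: shop                    # any namespace works when searchNamespace is ALL
  labels:
    grafana_dashboard: "1"           # must match sidecar label and labelValue
  annotations:
    grafana_folder: team-dashboards  # read via folderAnnotation; hypothetical folder name
data:
  team-overview.json: |
    {
      "title": "Team Overview",
      "uid": "team-overview",
      "panels": []
    }
```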
Grafana Persistence
By default, Grafana uses emptyDir storage, meaning dashboards created through the UI, user preferences, and alert rules are lost on pod restart. Enable persistence for production:
    grafana:
      persistence:
        enabled: true
        type: pvc
        storageClassName: gp3
        accessModes:
          - ReadWriteOnce
        size: 10Gi
        finalizers:
          - kubernetes.io/pvc-protection
Even with persistence enabled, the recommended approach for production dashboards is ConfigMap provisioning (described in the Grafana dashboards guide) rather than relying on the SQLite database stored on the PVC. The PVC should be seen as a safety net, not the primary storage mechanism for dashboards.
Alertmanager Configuration
The Alertmanager configuration in values.yaml controls how alerts are routed, grouped, and delivered to notification channels. This is the section most teams get wrong on the first attempt.
Alertmanager Spec
    alertmanager:
      enabled: true
      alertmanagerSpec:
        # Replicas for high availability (the operator clusters them via gossip; no shared storage needed)
        replicas: 3
        # How long Alertmanager keeps alert and notification data
        retention: 120h
        # Persistent storage for alert silences and notification state
        storage:
          volumeClaimTemplate:
            spec:
              storageClassName: gp3
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 5Gi
        # Resource limits
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
Alert Routing Configuration
The alertmanager.config section defines how alerts flow from Prometheus to your notification channels. This is the actual Alertmanager configuration — not Helm abstractions:
    alertmanager:
      config:
        global:
          resolve_timeout: 5m
          slack_api_url: "https://hooks.slack.com/services/T.../B.../xxx"
        route:
          # Group alerts by these labels before sending
          group_by: ['namespace', 'alertname', 'severity']
          # Wait this long to batch alerts in the same group
          group_wait: 30s
          # Wait this long before sending updates for an existing group
          group_interval: 5m
          # Wait this long before resending a notification
          repeat_interval: 4h
          receiver: slack-default
          routes:
            # Critical alerts go to PagerDuty immediately
            - matchers:
                - severity = critical
              receiver: pagerduty-critical
              group_wait: 10s
              repeat_interval: 1h
            # Warning alerts go to a dedicated Slack channel
            - matchers:
                - severity = warning
              receiver: slack-warnings
              repeat_interval: 12h
            # Silence Watchdog alerts (they are heartbeat checks)
            - matchers:
                - alertname = Watchdog
              receiver: "null"
        receivers:
          - name: "null"
          - name: slack-default
            slack_configs:
              - channel: "#monitoring-alerts"
                send_resolved: true
                title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
                # Double quotes so YAML turns \n into a real newline
                text: "{{ range .Alerts }}*{{ .Labels.severity }}*: {{ .Annotations.summary }}\n{{ end }}"
          - name: slack-warnings
            slack_configs:
              - channel: "#monitoring-warnings"
                send_resolved: true
          - name: pagerduty-critical
            pagerduty_configs:
              - routing_key: "your-pagerduty-integration-key"
                severity: '{{ .CommonLabels.severity }}'
        inhibit_rules:
          # If a critical alert fires, suppress warning alerts for the same alertname
          - source_matchers:
              - severity = critical
            target_matchers:
              - severity = warning
            equal: ['namespace', 'alertname']
The group_wait, group_interval, and repeat_interval settings are the most commonly misconfigured. Setting group_wait too low causes alert storms — dozens of individual notifications instead of a single grouped message. Setting repeat_interval too low causes alert fatigue, where teams start ignoring repeated notifications.
Node Exporter and Kube-State-Metrics Settings
These two exporters are the primary data sources for infrastructure and Kubernetes-level metrics. Both run as sub-charts with their own configuration namespaces.
Node Exporter
Node Exporter runs as a DaemonSet, placing one pod on every node to collect OS-level metrics (CPU, memory, disk, network, filesystem):
    prometheus-node-exporter:
      enabled: true
      # Run on all nodes including control plane
      tolerations:
        - effect: NoSchedule
          operator: Exists
      # Resource limits: node-exporter is lightweight
      resources:
        requests:
          cpu: 50m
          memory: 30Mi
        limits:
          cpu: 100m
          memory: 50Mi
      # Extra arguments to enable/disable specific collectors
      extraArgs:
        - --collector.filesystem.mount-points-exclude=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/.+)($|/)
        - --collector.filesystem.fs-types-exclude=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
The tolerations field is critical — without it, node-exporter pods will not be scheduled on tainted nodes (control plane nodes, GPU nodes, or nodes with custom taints). You get blind spots in your monitoring wherever node-exporter cannot run.
Kube-State-Metrics
Kube-state-metrics generates metrics about the state of Kubernetes objects (Deployments, Pods, Nodes, PVCs, etc.). It queries the Kubernetes API server and converts object states into Prometheus metrics:
    kube-state-metrics:
      enabled: true
      # Only generate metrics for these resource types (reduce cardinality)
      collectors:
        - daemonsets
        - deployments
        - horizontalpodautoscalers
        - jobs
        - namespaces
        - nodes
        - persistentvolumeclaims
        - persistentvolumes
        - pods
        - replicasets
        - statefulsets
      resources:
        requests:
          cpu: 50m
          memory: 64Mi
        limits:
          cpu: 100m
          memory: 128Mi
      # Add custom labels from Kubernetes objects as Prometheus labels
      metricLabelsAllowlist:
        - pods=[app.kubernetes.io/name,app.kubernetes.io/instance]
        - deployments=[app.kubernetes.io/name,app.kubernetes.io/version]
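The metricLabelsAllowlist setting is powerful but often overlooked. It lets you promote Kubernetes object labels to Prometheus metric labels (exposed on kube_pod_labels, kube_deployment_labels, and similar series), enabling queries like "show me CPU usage grouped by application version" through a join. Be selective: every allowlisted label adds index and memory overhead, and labels whose values change often, such as a version, create series churn.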
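One way to use the promoted labels is a recording rule that joins container CPU usage with kube_pod_labels. The sketch below groups by the app.kubernetes.io/name label allowlisted for pods in the example above; the rule name, record name, and the release label value are assumptions made for illustration.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-cpu-by-name              # hypothetical
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumed release name, so the default rule selector matches
spec:
  groups:
    - name: app-cpu.rules
      rules:
        - record: app:container_cpu_usage_seconds:rate5m
          expr: |
            sum by (label_app_kubernetes_io_name) (
              rate(container_cpu_usage_seconds_total{container!=""}[5m])
              * on (namespace, pod) group_left (label_app_kubernetes_io_name)
              kube_pod_labels
            )
```

Grouping by version instead would require the version label to be allowlisted for pods as well, since the join happens on namespace and pod.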
Storage and Persistence Configuration
Storage is the most critical production configuration. Without persistent storage, every pod restart loses all historical metrics, alert silences, and Grafana state. This is the single most common failure mode in new kube-prometheus-stack deployments.
Prometheus Storage
    prometheus:
      prometheusSpec:
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: gp3
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  # Size depends on: retention period x ingestion rate x bytes per sample
                  # Rule of thumb: ~2 bytes per sample after compression
                  # 100k active series x 30s interval x 15 days ~ 8.6GB
                  # 500k active series under the same settings ~ 43GB
                  storage: 50Gi
Sizing the PVC correctly requires understanding your cluster's cardinality. Use the Prometheus expression prometheus_tsdb_head_series to check active time series count, and prometheus_tsdb_storage_blocks_bytes to see current disk usage. A good rule of thumb: provision roughly 2x the disk space you expect to need, because the WAL and head block take space on top of persisted blocks, and compaction writes new blocks before deleting the ones it replaces.
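The sizing rule can be written out as a short worked calculation. The figures below assume the roughly 2 bytes per sample rule of thumb, a 30s scrape interval, 15 days of retention, and 500,000 active series; real clusters vary, so treat the result as an estimate.

```latex
\begin{align*}
\text{disk} &\approx N_{\text{series}} \times \frac{T_{\text{retention}}}{\Delta t_{\text{scrape}}} \times b_{\text{sample}} \\
            &= 500{,}000 \times \frac{15 \times 86{,}400\ \text{s}}{30\ \text{s}} \times 2\ \text{B} \\
            &= 500{,}000 \times 43{,}200 \times 2\ \text{B} \\
            &= 4.32 \times 10^{10}\ \text{B} \approx 43\ \text{GB}
\end{align*}
```

The same formula gives about 8.6 GB for 100,000 series, before adding headroom for the WAL and compaction.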
Alertmanager Storage
    alertmanager:
      alertmanagerSpec:
        storage:
          volumeClaimTemplate:
            spec:
              storageClassName: gp3
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  # Alertmanager stores silences and nflog; 5Gi is plenty
                  storage: 5Gi
Alertmanager storage requirements are modest — it only stores alert silences and the notification log. 5Gi is sufficient for virtually all deployments. The critical thing is that storage exists at all, so silences survive pod restarts during maintenance windows.
Resource Requests and Limits
Setting resource requests and limits is essential for production stability. Without them, the Kubernetes scheduler cannot make informed placement decisions, and monitoring components compete with application workloads for resources — often losing during memory pressure events.
    # Prometheus: the most resource-intensive component
    prometheus:
      prometheusSpec:
        resources:
          requests:
            cpu: 500m
            memory: 2Gi
          limits:
            cpu: 2000m
            memory: 4Gi

    # Prometheus Operator
    prometheusOperator:
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 200m
          memory: 256Mi

    # Grafana
    grafana:
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 256Mi

    # Alertmanager
    alertmanager:
      alertmanagerSpec:
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
Prometheus memory usage scales linearly with the number of active time series. A rough formula: memory = active_series x 2KB + 500MB base. A cluster with 500,000 active series needs approximately 1.5GB of memory. Always set memory limits 50-100% higher than requests to handle query spikes and compaction bursts.
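Applying the rough formula above to the upper end of a medium cluster, 2 million active series, gives a feel for how quickly memory grows. This is only the article's rule of thumb worked through; actual usage also depends on label sizes, churn, and query load.

```latex
\begin{align*}
\text{memory} &\approx N_{\text{series}} \times 2\ \text{KB} + 500\ \text{MB} \\
              &= 2{,}000{,}000 \times 2\ \text{KB} + 500\ \text{MB} \\
              &= 4\ \text{GB} + 0.5\ \text{GB} = 4.5\ \text{GB}
\end{align*}
```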
Avoid setting CPU limits on Prometheus if possible. CPU throttling can cause Prometheus to fall behind on scraping, creating gaps in metrics data. If your cluster enforces CPU limits via LimitRange, set the limit at least 4x the request.
Network Policies and Security
Securing the monitoring stack is critical because it has broad read access to the Kubernetes API and stores sensitive operational data. The values.yaml provides several security-related configuration options.
Network Policies
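    # Label Prometheus pods so your own NetworkPolicies can select them
    prometheus:
      prometheusSpec:
        podMetadata:
          labels:
            networking/allow-prometheus: "true"

    # The Grafana sub-chart can render a NetworkPolicy for its service port (3000 by default)
    # Key names follow the Grafana chart's networkPolicy block; verify against your chart version
    grafana:
      networkPolicy:
        enabled: true
        # Create an ingress policy for the Grafana port
        ingress: true
        # Do not accept traffic from arbitrary sources
        allowExternal: false
        # Allow ingress from namespaces matching this selector
        explicitNamespacesSelector:
          matchLabels:
            name: ingress-nginx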
Security Contexts and RBAC
    prometheus:
      prometheusSpec:
        securityContext:
          runAsNonRoot: true
          runAsUser: 1000
          runAsGroup: 2000
          fsGroup: 2000
          seccompProfile:
            type: RuntimeDefault
        containers:
          - name: prometheus
            securityContext:
              allowPrivilegeEscalation: false
              readOnlyRootFilesystem: true
              capabilities:
                drop:
                  - ALL

    grafana:
      securityContext:
        runAsNonRoot: true
        runAsUser: 472
        fsGroup: 472
      # Use existing ServiceAccount or let the chart create one
      serviceAccount:
        create: true
        name: grafana
        annotations:
          # For AWS IRSA or GCP Workload Identity
          eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/grafana-role
Ingress Configuration
Expose Grafana (and optionally Prometheus/Alertmanager) via an Ingress resource:
    grafana:
      ingress:
        enabled: true
        ingressClassName: nginx
        annotations:
          cert-manager.io/cluster-issuer: letsencrypt-prod
          nginx.ingress.kubernetes.io/ssl-redirect: "true"
        hosts:
          - grafana.example.com
        tls:
          - secretName: grafana-tls
            hosts:
              - grafana.example.com

    prometheus:
      ingress:
        enabled: true
        ingressClassName: nginx
        annotations:
          nginx.ingress.kubernetes.io/auth-type: basic
          nginx.ingress.kubernetes.io/auth-secret: prometheus-basic-auth
        hosts:
          - prometheus.example.com
        tls:
          - secretName: prometheus-tls
            hosts:
              - prometheus.example.com
Always protect Prometheus and Alertmanager with authentication when exposing them via Ingress. Unlike Grafana, they have no built-in authentication — anyone with access to the URL can read all metrics and modify alert silences. Use basic auth, OAuth2 proxy, or a VPN.
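The auth-secret annotation used for the Prometheus Ingress above refers to a Secret that ingress-nginx reads htpasswd-formatted credentials from. A minimal sketch, assuming the monitoring namespace and a user called admin, might look like the following; the hash is a placeholder you would generate yourself.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: prometheus-basic-auth
  namespace: monitoring              # assumed release namespace
type: Opaque
stringData:
  # ingress-nginx expects the htpasswd content under the "auth" key
  # Placeholder value; generate a real entry with: htpasswd -nb admin <password>
  auth: "admin:<htpasswd-hash>"
```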
Complete Production-Ready values.yaml Example
Here is a comprehensive production values.yaml that combines all the best practices discussed in this guide. This configuration is suitable for a medium-sized cluster (50-200 nodes, 500k-2M active time series):
    # ============================================================
    # Production-Ready kube-prometheus-stack values.yaml
    # Cluster size: 50-200 nodes, 500k-2M active time series
    # ============================================================

    # --- Global Settings ---
    fullnameOverride: kps

    defaultRules:
      create: true
      rules:
        alertmanager: true
        etcd: true
        configReloaders: true
        general: true
        k8sContainerCpuUsageSecondsTotal: true
        k8sContainerMemoryWorkingSetBytes: true
        k8sPodOwner: true
        kubeApiserverAvailability: true
        kubeApiserverBurnrate: true
        kubeApiserverHistogram: true
        kubeApiserverSlos: true
        kubePrometheusGeneral: true
        kubePrometheusNodeRecording: true
        kubernetesApps: true
        kubernetesResources: true
        kubernetesStorage: true
        kubernetesSystem: true
        kubeSchedulerAlerting: true
        kubeSchedulerRecording: true
        kubeStateMetrics: true
        network: true
        node: true
        nodeExporterAlerting: true
        nodeExporterRecording: true
        prometheus: true
        prometheusOperator: true

    # --- Prometheus Operator ---
    prometheusOperator:
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 200m
          memory: 256Mi
      admissionWebhooks:
        enabled: true
        patch:
          enabled: true

    # --- Prometheus ---
    prometheus:
      prometheusSpec:
        replicas: 2
        retention: 15d
        retentionSize: "45GB"
        walCompression: true
        scrapeInterval: "30s"
        evaluationInterval: "30s"
        podAntiAffinity: "hard"
        externalLabels:
          cluster: production
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: gp3
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 50Gi
        resources:
          requests:
            cpu: 1000m
            memory: 4Gi
          limits:
            memory: 8Gi
        securityContext:
          runAsNonRoot: true
          runAsUser: 1000
          fsGroup: 2000
        serviceMonitorSelector: {}
        serviceMonitorNamespaceSelector: {}
        podMonitorSelector: {}
        podMonitorNamespaceSelector: {}

    # --- Alertmanager ---
    alertmanager:
      alertmanagerSpec:
        replicas: 3
        retention: 120h
        storage:
          volumeClaimTemplate:
            spec:
              storageClassName: gp3
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 5Gi
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
      config:
        global:
          resolve_timeout: 5m
        route:
          group_by: ['namespace', 'alertname']
          group_wait: 30s
          group_interval: 5m
          repeat_interval: 4h
          receiver: slack-default
          routes:
            - matchers:
                - severity = critical
              receiver: pagerduty-critical
              group_wait: 10s
            - matchers:
                - alertname = Watchdog
              receiver: "null"
        receivers:
          - name: "null"
          - name: slack-default
            slack_configs:
              - channel: "#monitoring"
                send_resolved: true
          - name: pagerduty-critical
            pagerduty_configs:
              - routing_key: "your-key-here"

    # --- Grafana ---
    grafana:
      enabled: true
      replicas: 1
      admin:
        existingSecret: grafana-admin-credentials
        userKey: admin-user
        passwordKey: admin-password
      persistence:
        enabled: true
        storageClassName: gp3
        size: 10Gi
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 256Mi
      ingress:
        enabled: true
        ingressClassName: nginx
        annotations:
          cert-manager.io/cluster-issuer: letsencrypt-prod
        hosts:
          - grafana.example.com
        tls:
          - secretName: grafana-tls
            hosts:
              - grafana.example.com

    # --- Node Exporter ---
    prometheus-node-exporter:
      tolerations:
        - effect: NoSchedule
          operator: Exists
      resources:
        requests:
          cpu: 50m
          memory: 30Mi
        limits:
          cpu: 100m
          memory: 50Mi

    # --- Kube State Metrics ---
    kube-state-metrics:
      resources:
        requests:
          cpu: 50m
          memory: 64Mi
        limits:
          cpu: 100m
          memory: 128Mi
This configuration provides: high-availability Prometheus with 2 replicas and anti-affinity, persistent storage for all stateful components, resource limits on every component, Alertmanager with routing to Slack and PagerDuty, Grafana with TLS ingress and credential management via Secrets, and node-exporter tolerations for full cluster coverage.
Adapt the storage class (gp3) to match your cloud provider: for example standard-rwo or premium-rwo on GKE, managed-csi-premium on AKS, or your custom StorageClass for on-premises clusters. Run kubectl get storageclass to see what your cluster actually offers. Adjust resource requests based on your cluster's active time series count; the values above are starting points for medium clusters.
Conclusion
The kube-prometheus-stack values.yaml is the single most important file in your Kubernetes monitoring infrastructure. Getting it right means the difference between a monitoring stack that survives production incidents and one that fails precisely when you need it most.
The critical takeaways from this guide are:
- Always configure persistent storage for Prometheus and Alertmanager. Without it, you lose all metrics and silences on every pod restart.
- Set resource requests and limits on every component. Prometheus especially needs memory guarantees to avoid OOM kills during query spikes or compaction.
- Configure Alertmanager routing to real notification channels. The default blackhole receiver means alerts fire but nobody knows.
- Use retentionSize alongside retention so Prometheus cannot fill its disk. Size-based retention is your safety net when cardinality unexpectedly increases.
- Override only what you need. Keep your values file minimal and let Helm deep-merge with the chart defaults. This makes upgrades dramatically less painful.
- Test changes with --dry-run --debug before applying them. A typo in values.yaml can take down your monitoring stack.
Start with the quick install guide, customize your values.yaml using the examples in this guide, and build from there. The Helm chart documentation covers additional advanced configuration options for edge cases not covered here.