Monitoring Kubernetes with Prometheus and Grafana
Set up a production-ready observability stack for Kubernetes using Prometheus for metrics collection and Grafana for visualization. Learn to configure alerts, dashboards, and scrape targets.
January 10, 2026 · Phan Minh Anh
Why Observability Matters
In distributed systems, you can't fix what you can't see. A proper monitoring stack gives you real-time visibility into cluster health, application performance, and infrastructure metrics.
Architecture Overview
┌─────────────────────────────────────────┐
│           Kubernetes Cluster            │
│   ┌──────────┐      ┌──────────┐        │
│   │ App Pods │      │  Nodes   │        │
│   └────┬─────┘      └────┬─────┘        │
│        │                 │   /metrics   │
│   ┌────▼─────────────────▼─────┐        │
│   │         Prometheus         │        │
│   └─────────────┬──────────────┘        │
│                 │                       │
│   ┌─────────────▼──────────────┐        │
│   │          Grafana           │        │
│   └────────────────────────────┘        │
└─────────────────────────────────────────┘
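To make the `/metrics` arrow concrete: Prometheus periodically sends an HTTP GET to each target and reads back plain-text samples, one per line. The sketch below parses a small hand-written sample of that text format using only the Python standard library. The sample payload and the `parse_samples` helper are illustrative assumptions, and the parser covers only the simple cases (no escaped label values, timestamps, or exemplars).

```python
import re

# Hand-written sample of what a /metrics endpoint might return (illustrative only).
SAMPLE = """\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/",status="200"} 1027
http_requests_total{method="POST",endpoint="/login",status="401"} 3
process_open_fds 12
"""

# metric_name, optional {label="value",...} block, then the sample value.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" metadata lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a simple sample line; this sketch ignores it
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_samples(SAMPLE):
    print(name, labels, value)
```

Each distinct combination of metric name and label values becomes its own time series in Prometheus, which is why label cardinality matters when you instrument an application.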
Installing with Helm
# Add the kube-prometheus-stack chart
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install the full monitoring stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values values.yaml
Prometheus Configuration
# values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 50Gi
grafana:
  adminPassword: "your-secure-password"
  persistence:
    enabled: true
    size: 10Gi
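Whether 50Gi is enough for 30 days of retention depends on how many samples the cluster produces. The Prometheus documentation gives a rough sizing formula: needed disk space is approximately retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below applies that formula; the ingestion rate of 10,000 samples per second is a made-up assumption, so substitute a figure measured on your own cluster (for example, from the rate of `prometheus_tsdb_head_samples_appended_total`).

```python
GIB = 1024 ** 3


def estimate_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rough TSDB disk estimate: retention seconds * samples/s * bytes/sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Assumption for illustration only: 10,000 samples ingested per second.
ASSUMED_SAMPLES_PER_SECOND = 10_000

needed = estimate_disk_bytes(30, ASSUMED_SAMPLES_PER_SECOND)
print(f"Estimated need: {needed / GIB:.1f} GiB of the 50 GiB requested")
```

Under these assumptions the estimate lands just below 50 GiB, which leaves very little headroom. A higher ingestion rate would call for a larger volume or a shorter retention period.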
Instrumenting Your Application
Add Prometheus metrics to your app:
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'Request latency')

@REQUEST_LATENCY.time()
def handle_request(method, endpoint):
    # your logic here; record the real response status instead of a fixed '200'
    REQUEST_COUNT.labels(method=method, endpoint=endpoint, status='200').inc()

# Expose /metrics for Prometheus to scrape (port 8000 is an arbitrary choice)
start_http_server(8000)
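It helps to know what a histogram actually records. A Prometheus histogram exposes cumulative buckets: each observation increments every bucket whose upper bound (`le`) is greater than or equal to the observed value, alongside a running `_sum` and `_count`. The toy class below models that behavior with the standard library only. It is not the `prometheus_client` implementation, and the bucket bounds are hypothetical rather than the library defaults.

```python
import math


class SketchHistogram:
    """Toy model of cumulative histogram buckets (illustrative only)."""

    def __init__(self, upper_bounds):
        # An implicit +Inf bucket always catches every observation.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        self.total += value
        self.count += 1
        # Cumulative: increment every bucket whose bound is >= the value.
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.bucket_counts[i] += 1

    def render(self, name):
        """Return lines resembling the text exposition of a histogram."""
        lines = []
        for bound, n in zip(self.upper_bounds, self.bucket_counts):
            le = "+Inf" if math.isinf(bound) else str(bound)
            lines.append(f'{name}_bucket{{le="{le}"}} {n}')
        lines.append(f"{name}_sum {self.total}")
        lines.append(f"{name}_count {self.count}")
        return lines


hist = SketchHistogram([0.1, 0.5, 1.0])  # hypothetical bounds, in seconds
for latency in (0.05, 0.3, 0.3, 2.0):
    hist.observe(latency)
print("\n".join(hist.render("http_request_duration_seconds")))
```

Because the buckets are cumulative, quantile estimates are only as precise as the bucket bounds, so pick bounds that bracket the latencies you care about.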
Key Alerts to Configure
groups:
  - name: cluster-health
    rules:
      - alert: HighCPUUsage
        # node_cpu_seconds_total is a counter, so compare the idle rate, not the raw value
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.1
        for: 5m
        annotations:
          summary: "Node {{ $labels.instance }} CPU usage above 90%"
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        annotations:
          summary: "Pod {{ $labels.pod }} is crash-looping"
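`node_cpu_seconds_total` is a cumulative counter of the seconds a CPU has spent in each mode, so a usage figure comes from how fast the idle counter grows rather than from its raw value. The sketch below mimics a simplified per-second rate over two samples and derives the busy fraction from it. The sample values are hypothetical, and the helper ignores counter resets and the extrapolation that PromQL's `rate()` performs.

```python
def per_second_rate(first, last):
    """Simplified rate between two (timestamp_seconds, counter_value) samples."""
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)


def cpu_usage_fraction(idle_first, idle_last):
    """Fraction of one CPU core that was busy between the two samples."""
    idle_rate = per_second_rate(idle_first, idle_last)  # idle seconds per second
    return 1.0 - idle_rate


# Hypothetical samples 300 s apart: the core was idle for 24 of those seconds.
usage = cpu_usage_fraction((0, 50_000.0), (300, 50_024.0))
print(f"CPU usage: {usage:.0%}")
```

In this example the raw counter sits around 50,000, so comparing it directly against 0.1 would never be true. The idle rate of 0.08 is the value that crosses the threshold.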
Essential Grafana Dashboards
- Kubernetes Cluster Overview (Dashboard ID: 315) — node CPU, memory, disk
- Kubernetes Pod Resources (Dashboard ID: 6417) — per-pod resource usage
- Node Exporter Full (Dashboard ID: 1860) — deep node-level metrics
Import each one by entering its ID in Grafana's dashboard import dialog, which pulls the definition from grafana.com.
Pro Tips
- Set recording rules to pre-compute expensive queries
- Use Alertmanager for routing alerts to Slack, PagerDuty, email
- Enable persistent storage for Prometheus — losing metrics history is painful
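As one illustration of the alert-routing tip, Alertmanager can POST grouped alerts as JSON to a webhook receiver, which is a common way to bridge into chat tools. The sketch below turns such a payload into short, human-readable lines. The payload is hand-written and trimmed to a few fields, and `format_alerts` is a hypothetical helper, so check the Alertmanager webhook documentation for the full schema before relying on it.

```python
import json

# Hand-written, trimmed example of a webhook payload (illustrative only).
SAMPLE_PAYLOAD = json.loads("""
{
  "status": "firing",
  "receiver": "team-chat",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "PodCrashLooping", "pod": "api-7d9f"},
      "annotations": {"summary": "Pod api-7d9f is crash-looping"}
    },
    {
      "status": "resolved",
      "labels": {"alertname": "HighCPUUsage", "instance": "node-1"},
      "annotations": {"summary": "Node node-1 CPU usage above 90%"}
    }
  ]
}
""")


def format_alerts(payload):
    """Return one short text line per alert in the payload."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "unnamed")
        summary = annotations.get("summary", "(no summary)")
        lines.append(f"[{status}] {name}: {summary}")
    return lines


for line in format_alerts(SAMPLE_PAYLOAD):
    print(line)
```

Alertmanager's built-in Slack, PagerDuty, and email receivers cover the common cases without any custom code; a webhook like this is mainly useful when a destination needs custom formatting.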