Core Concepts
Rapid overview
- Observability & Monitoring
- The Three Pillars
- Metrics
- Logs
- Traces
- Prometheus
- Architecture
- Metric Types
- PromQL Basics
- Service Discovery
- Recording Rules
- Alerting
- Prometheus Alerting Rules
- Alertmanager Configuration
- Grafana
- Dashboard JSON Model
- Dashboard as Code (Grafonnet)
- Logging
- Structured Logging
- Fluentd Configuration
- Loki (Grafana Logging)
- LogQL Queries
- Distributed Tracing
- OpenTelemetry SDK
- Jaeger Query Examples
- Trace Context Propagation
- SLOs and SLIs
- Defining SLIs
- Error Budget
- Golden Signals
- Interview Questions
- 1. Explain the difference between metrics, logs, and traces
- 2. How do you design alerts that don't cause alert fatigue?
- 3. What are SLOs and why are they important?
- 4. How do you troubleshoot a slow request?
Observability & Monitoring
The Three Pillars
Metrics
- Numeric measurements over time
- Aggregatable, low cardinality
- Examples: CPU usage, request count, error rate
Logs
- Timestamped text records
- High cardinality, detailed context
- Examples: Application logs, audit logs
Traces
- Request flow across services
- Distributed context propagation
- Examples: End-to-end request tracking
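The three pillars are complementary views of the same events. A minimal pure-Python sketch (all names illustrative, not any real SDK) of how a single failed request might surface in each pillar, tied together by a shared trace id:

```python
import json
import time
import uuid
from collections import Counter

metrics = Counter()   # pillar 1: aggregatable numbers, low cardinality
log_lines = []        # pillar 2: detailed per-event records
spans = []            # pillar 3: causally linked timings

def handle_request(path, user_id):
    trace_id = uuid.uuid4().hex          # correlates all three signals
    start = time.time()
    ok = False                           # simulate a failure
    # Metric: only the route and outcome -- never the user (cardinality!)
    metrics[(path, "ok" if ok else "error")] += 1
    # Log: full high-cardinality context for debugging this one event
    log_lines.append(json.dumps({
        "ts": start, "level": "error", "path": path,
        "user_id": user_id, "trace_id": trace_id,
    }))
    # Span: duration plus the trace id for cross-service correlation
    spans.append({"trace_id": trace_id, "name": path,
                  "duration_s": time.time() - start})
    return trace_id

handle_request("/orders", "user-789")
```

Note the asymmetry: the metric deliberately drops `user_id` to stay aggregatable, while the log keeps it, and the `trace_id` in the log links the event to its span.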
Prometheus
Architecture
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│   Target    │─────▶│ Prometheus  │─────▶│  Alertmgr   │
│  (metrics)  │      │   Server    │      │             │
└─────────────┘      └──────┬──────┘      └─────────────┘
                            │
                     ┌──────▼──────┐
                     │   Grafana   │
                     │             │
                     └─────────────┘
Metric Types
| Type | Description | Example |
|---|---|---|
| Counter | Monotonically increasing | Total requests |
| Gauge | Can go up or down | Current connections |
| Histogram | Distribution buckets | Request duration |
| Summary | Quantiles (client-side) | Response size |
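A detail that trips people up: Prometheus histogram buckets are cumulative — each `le` bucket counts every observation at or below its upper bound. A tiny pure-Python sketch of that behavior (illustrative only, not the client library API):

```python
class Histogram:
    """Cumulative buckets, as Prometheus exposes them."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds)            # finite upper bounds (e.g. seconds)
        self.buckets = {b: 0 for b in self.bounds}
        self.buckets[float("inf")] = 0          # +Inf bucket catches everything
        self.count, self.total = 0, 0.0

    def observe(self, value):
        self.count += 1
        self.total += value
        for b in self.buckets:                  # cumulative: all bounds >= value
            if value <= b:
                self.buckets[b] += 1

h = Histogram([0.1, 0.2, 0.5])
for v in [0.05, 0.15, 0.15, 0.4]:
    h.observe(v)
# buckets: le=0.1 -> 1, le=0.2 -> 3, le=0.5 -> 4, le=+Inf -> 4
```

Because buckets are cumulative, `le="+Inf"` always equals the total observation count, which is what makes `histogram_quantile` (below) work from bucket counters alone.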
PromQL Basics
# Instant vector
http_requests_total{job="api", status="200"}
# Range vector (last 5 minutes)
http_requests_total[5m]
# Rate (per-second)
rate(http_requests_total[5m])
# Increase (total over period)
increase(http_requests_total[1h])
# Aggregation
sum(rate(http_requests_total[5m])) by (service)
# Histogram quantile (p95)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
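`histogram_quantile` finds the bucket where the target rank falls, then interpolates linearly inside it. A simplified sketch of that estimation (Prometheus's real implementation handles more edge cases, such as NaN and non-monotonic buckets):

```python
def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, ending at +Inf."""
    total = buckets[-1][1]
    rank = q * total                      # rank of the target observation
    lower, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return lower              # quantile beyond the last finite bucket
            width = count - prev_count
            # Linear interpolation within the bucket that contains the rank
            return lower + (bound - lower) * (rank - prev_count) / width
        lower, prev_count = bound, count

buckets = [(0.1, 50), (0.2, 90), (0.5, 100), (float("inf"), 100)]
p95 = histogram_quantile(0.95, buckets)   # rank 95 is in (0.2, 0.5]: 0.2 + 0.3*5/10 = 0.35
```

This is why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile lands in.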
Service Discovery
# Prometheus config for Kubernetes
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with the prometheus.io/scrape: "true" annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Rewrite the scrape address to use the port from the annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
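The relabeling model is simple once seen in isolation: each rule joins its `source_labels` with `;`, matches the regex against the result, and either keeps/drops the target or writes a replacement into `target_label`. A minimal sketch covering just the `keep` and `replace` actions (illustrative, not Prometheus internals):

```python
import re

def relabel(labels, configs):
    """Tiny sketch of Prometheus relabeling: 'keep' and 'replace' only."""
    labels = dict(labels)
    for cfg in configs:
        value = ";".join(labels.get(l, "") for l in cfg["source_labels"])
        m = re.fullmatch(cfg.get("regex", "(.*)"), value)  # regex is anchored
        if cfg["action"] == "keep":
            if not m:
                return None                  # drop this target entirely
        elif cfg["action"] == "replace" and m:
            # Prometheus uses $1 syntax; convert to Python's \1 for re.expand
            repl = cfg.get("replacement", "$1").replace("$", "\\")
            labels[cfg["target_label"]] = m.expand(repl)
    return labels

configs = [
    {"source_labels": ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"],
     "action": "keep", "regex": "true"},
]
kept = relabel({"__address__": "10.0.0.5:8080",
                "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true"},
               configs)                              # annotation matches -> kept
dropped = relabel({"__address__": "10.0.0.6:8080"}, configs)   # no annotation -> None
```

A missing source label joins in as an empty string, which is why the `keep` rule silently drops unannotated pods.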
Recording Rules
groups:
  - name: api_rules
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
      - record: job:http_error_rate:ratio
        expr: job:http_errors:rate5m / job:http_requests:rate5m
Alerting
Prometheus Alerting Rules
Alerting rules are evaluated by Prometheus itself; Alertmanager (next section) only groups and routes the alerts Prometheus fires.
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_error_rate:ratio > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
      - alert: HighMemoryUsage
        expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} memory > 90%"
Alertmanager Configuration
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
    - match:
        severity: warning
      receiver: 'slack-warnings'

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<key>'
        severity: critical
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#warnings'
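The routing tree behaves as a first-match-wins walk: an alert descends into the first child route whose matchers fit its labels, otherwise it falls back to the parent's receiver. A minimal sketch of that logic (illustrative; it ignores `continue`, regex matchers, and grouping):

```python
def pick_receiver(alert_labels, route):
    """Return the receiver for an alert: first matching child route wins."""
    for child in route.get("routes", []):
        if all(alert_labels.get(k) == v for k, v in child.get("match", {}).items()):
            return pick_receiver(alert_labels, child)   # recurse into the subtree
    return route["receiver"]                            # no child matched

root = {
    "receiver": "default-receiver",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty-critical"},
        {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
    ],
}
pick_receiver({"severity": "critical", "alertname": "HighErrorRate"}, root)
# -> "pagerduty-critical"
```

An alert with no `severity` label (or an unmatched value like `info`) falls through to `default-receiver`, which is why a catch-all root receiver is mandatory.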
Grafana
Dashboard JSON Model
{
  "title": "API Dashboard",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (service)",
          "legendFormat": "{{service}}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))",
          "legendFormat": "Error Rate"
        }
      ],
      "thresholds": {
        "steps": [
          {"color": "green", "value": null},
          {"color": "yellow", "value": 0.01},
          {"color": "red", "value": 0.05}
        ]
      }
    }
  ]
}
Dashboard as Code (Grafonnet)
local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local row = grafana.row;
local prometheus = grafana.prometheus;
local graphPanel = grafana.graphPanel;
dashboard.new(
  'API Dashboard',
  time_from='now-1h',
)
.addPanel(
  graphPanel.new(
    'Request Rate',
    datasource='Prometheus',
  )
  .addTarget(
    prometheus.target(
      'sum(rate(http_requests_total[5m])) by (service)',
      legendFormat='{{service}}',
    )
  ),
  gridPos={x: 0, y: 0, w: 12, h: 8},
)
Logging
Structured Logging
{
"timestamp": "2024-01-15T10:30:00Z",
"level": "error",
"message": "Failed to process order",
"service": "order-service",
"trace_id": "abc123",
"span_id": "def456",
"user_id": "user-789",
"order_id": "order-101",
"error": {
"type": "ValidationError",
"message": "Invalid payment method"
}
}
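With the Python standard library alone, a custom `logging.Formatter` can emit records in this shape. A sketch (field names mirror the example above; real services usually reach for a library such as `structlog` or `python-json-logger`):

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": "order-service",
        }
        # Fields passed via `extra=` are attached as attributes on the record
        for key in ("trace_id", "span_id", "user_id", "order_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("Failed to process order",
             extra={"trace_id": "abc123", "order_id": "order-101"})
```

One line per event, machine-parseable keys: that is what lets the Fluentd/Loki pipelines below index `level` and `trace_id` without fragile regex parsing.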
Fluentd Configuration
# Kubernetes logs to Elasticsearch
<source>
@type tail
path /var/log/containers/*.log
pos_file /var/log/fluentd-containers.log.pos
tag kubernetes.*
<parse>
@type json
time_key time
time_format %Y-%m-%dT%H:%M:%S.%NZ
</parse>
</source>
<filter kubernetes.**>
@type kubernetes_metadata
@id filter_kube_metadata
</filter>
<match kubernetes.**>
@type elasticsearch
host elasticsearch
port 9200
logstash_format true
logstash_prefix kubernetes
<buffer>
@type file
path /var/log/fluentd-buffers/kubernetes
flush_mode interval
flush_interval 5s
</buffer>
</match>
Loki (Grafana Logging)
# Promtail config
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - json:
          expressions:
            level: level
            message: message
      - labels:
          level:
      - timestamp:
          source: timestamp
          format: RFC3339Nano
LogQL Queries
# Filter by label
{namespace="production", app="api"}
# Text filter
{app="api"} |= "error"
# JSON parsing
{app="api"} | json | level="error"
# Aggregation (logs per minute)
sum(rate({app="api"}[1m])) by (level)
# Pattern matching
{app="api"} |~ "user_id=\\d+"
Distributed Tracing
OpenTelemetry SDK
// Node.js instrumentation
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const provider = new NodeTracerProvider();

// SimpleSpanProcessor exports each span as it ends -- fine for local dev;
// production setups typically use BatchSpanProcessor to cut export overhead.
provider.addSpanProcessor(
  new SimpleSpanProcessor(
    new OTLPTraceExporter({
      url: 'http://otel-collector:4317',
    })
  )
);
provider.register();

// Auto-instrument incoming/outgoing HTTP calls and Express routes
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});
Jaeger Query Examples
# Find traces with errors
service=api AND status.code=ERROR
# Find slow traces
service=api AND duration>1s
# Find by tag
service=api AND user.id="123"
Trace Context Propagation
# Kubernetes deployment with trace context
spec:
  containers:
    - name: api
      env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://otel-collector:4317"
        - name: OTEL_SERVICE_NAME
          value: "api-service"
        - name: OTEL_PROPAGATORS
          value: "tracecontext,baggage"
SLOs and SLIs
Defining SLIs
# SLI definitions (Sloth)
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: api-service
spec:
  service: "api"
  labels:
    team: platform
  slos:
    - name: "availability"
      objective: 99.9
      description: "API availability"
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(http_requests_total[{{.window}}]))
      alerting:
        pageAlert:
          disable: false
        ticketAlert:
          disable: false
    - name: "latency"
      objective: 99
      description: "API latency p99 < 200ms"
      sli:
        events:
          # Bad events = all requests minus those that finished within 200ms
          errorQuery: sum(rate(http_request_duration_seconds_count[{{.window}}])) - sum(rate(http_request_duration_seconds_bucket{le="0.2"}[{{.window}}]))
          totalQuery: sum(rate(http_request_duration_seconds_count[{{.window}}]))
Error Budget
# Error budget remaining
1 - (
(1 - (sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))))
/
(1 - 0.999) # SLO target
)
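Plugging numbers in: a 99.9% SLO allows a 0.1% error rate, so if the measured error rate over the window is 0.05%, half the budget remains. The same arithmetic as the query above, as a sketch:

```python
def error_budget_remaining(good, total, slo=0.999):
    """Fraction of the error budget left over the window (1.0 = untouched)."""
    availability = good / total
    allowed_error_rate = 1 - slo             # 0.001 for a 99.9% SLO
    actual_error_rate = 1 - availability
    return 1 - actual_error_rate / allowed_error_rate

# 999,500 good out of 1,000,000 requests: 0.05% errors against a 0.1% budget
remaining = error_budget_remaining(good=999_500, total=1_000_000)   # ~0.5
```

A negative result means the budget is exhausted, the usual trigger for freezing feature launches in favor of reliability work.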
Golden Signals
| Signal | Description | Metric Example |
|---|---|---|
| Latency | Request duration | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) |
| Traffic | Request rate | sum(rate(http_requests_total[5m])) |
| Errors | Error rate | sum(rate(http_requests_total{status=~"5.."}[5m])) |
| Saturation | Resource utilization | container_memory_usage_bytes / container_spec_memory_limit_bytes |
Interview Questions
1. Explain the difference between metrics, logs, and traces
Metrics:
- Numeric, time-series data
- Aggregated, low storage cost
- Best for: Alerting, dashboards, trends
Logs:
- Detailed event records
- High cardinality, high storage
- Best for: Debugging, audit trails
Traces:
- Request flow across services
- Causally connected spans
- Best for: Performance analysis, debugging distributed systems
2. How do you design alerts that don't cause alert fatigue?
- Alert on symptoms, not causes
- Set appropriate thresholds with hysteresis
- Use multi-window alerts (burn rate)
- Group related alerts
- Include actionable runbooks
- Review and tune regularly
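The multi-window burn-rate point deserves a concrete shape: burn rate is observed error rate divided by the budgeted error rate, and a page fires only when both a long and a short window burn fast, so the problem is proven to be both sustained and still happening. A sketch using a Google SRE Workbook-style threshold (values illustrative):

```python
def burn_rate(error_rate, slo=0.999):
    """How fast the error budget is being consumed; 1.0 = exactly on pace."""
    return error_rate / (1 - slo)

def should_page(err_rate_1h, err_rate_5m, slo=0.999, threshold=14.4):
    # 14.4x sustained for 1h burns ~2% of a 30-day budget.
    # Requiring both windows avoids paging on spikes that already recovered.
    return (burn_rate(err_rate_1h, slo) > threshold and
            burn_rate(err_rate_5m, slo) > threshold)

should_page(err_rate_1h=0.02, err_rate_5m=0.03)   # sustained and ongoing -> page
should_page(err_rate_1h=0.02, err_rate_5m=0.0)    # already recovered -> no page
```

This is "alert on symptoms" in practice: the alert condition is budget consumption as users experience it, not any individual cause like CPU or restarts.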
3. What are SLOs and why are they important?
- Service Level Objectives define reliability targets
- Based on user-facing metrics (SLIs)
- Error budget = allowed failures before impacting SLO
- Balance reliability with feature velocity
- Guide on-call and incident response priorities
4. How do you troubleshoot a slow request?
- Check traces - find slow span
- Check metrics - CPU, memory, saturation
- Check logs - errors, warnings
- Check dependencies - database, external APIs
- Profile if needed - identify bottleneck