Core Concepts
Rapid overview
- Observability & Monitoring
- The Three Pillars
- Metrics
- Logs
- Traces
- Prometheus
- Architecture
- Metric Types
- PromQL Basics
- Service Discovery
- Recording Rules
- Alerting
- Prometheus Alerting Rules
- Alertmanager Configuration
- Grafana
- Dashboard JSON Model
- Dashboard as Code (Grafonnet)
- Logging
- Structured Logging
- Fluentd Configuration
- Loki (Grafana Logging)
- LogQL Queries
- Distributed Tracing
- OpenTelemetry SDK
- Jaeger Query Examples
- Trace Context Propagation
- SLOs and SLIs
- Defining SLIs
- Error Budget
- Golden Signals
- Interview Questions
- 1. Explain the difference between metrics, logs, and traces
- 2. How do you design alerts that don't cause alert fatigue?
- 3. What are SLOs and why are they important?
- 4. How do you troubleshoot a slow request?
Observability & Monitoring
The Three Pillars
Metrics
- Numeric measurements over time
- Aggregatable, low cardinality
- Examples: CPU usage, request count, error rate
Logs
- Timestamped text records
- High cardinality, detailed context
- Examples: Application logs, audit logs
Traces
- Request flow across services
- Distributed context propagation
- Examples: End-to-end request tracking
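The three pillars are complementary views of the same events. A minimal pure-Python sketch (all names illustrative, not any real SDK) of how a single failed request might surface in each pillar, tied together by a shared trace id:

```python
import json
import time
import uuid
from collections import Counter

metrics = Counter()   # pillar 1: aggregatable numbers, low cardinality
log_lines = []        # pillar 2: detailed per-event records
spans = []            # pillar 3: causally linked timings

def handle_request(path, user_id):
    trace_id = uuid.uuid4().hex          # correlates all three signals
    start = time.time()
    ok = False                           # simulate a failure
    # Metric: only the route and outcome -- never the user (cardinality!)
    metrics[(path, "ok" if ok else "error")] += 1
    # Log: full high-cardinality context for debugging this one event
    log_lines.append(json.dumps({
        "ts": start, "level": "error", "path": path,
        "user_id": user_id, "trace_id": trace_id,
    }))
    # Span: duration plus the trace id for cross-service correlation
    spans.append({"trace_id": trace_id, "name": path,
                  "duration_s": time.time() - start})
    return trace_id

handle_request("/orders", "user-789")
```

Note the asymmetry: the metric deliberately drops `user_id` to stay aggregatable, while the log keeps it, and the `trace_id` in the log links the event to its span.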
Prometheus
Architecture
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│   Target    │─────▶│ Prometheus  │─────▶│  Alertmgr   │
│  (metrics)  │      │   Server    │      │             │
└─────────────┘      └──────┬──────┘      └─────────────┘
                            │
                     ┌──────▼──────┐
                     │   Grafana   │
                     │             │
                     └─────────────┘
Metric Types
| Type | Description | Example |
|---|---|---|
| Counter | Monotonically increasing | Total requests |
| Gauge | Can go up or down | Current connections |
| Histogram | Distribution buckets | Request duration |
| Summary | Quantiles (client-side) | Response size |
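A detail that trips people up: Prometheus histogram buckets are cumulative — each `le` bucket counts every observation at or below its upper bound. A tiny pure-Python sketch of that behavior (illustrative only, not the client library API):

```python
class Histogram:
    """Cumulative buckets, as Prometheus exposes them."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds)            # finite upper bounds (e.g. seconds)
        self.buckets = {b: 0 for b in self.bounds}
        self.buckets[float("inf")] = 0          # +Inf bucket catches everything
        self.count, self.total = 0, 0.0

    def observe(self, value):
        self.count += 1
        self.total += value
        for b in self.buckets:                  # cumulative: all bounds >= value
            if value <= b:
                self.buckets[b] += 1

h = Histogram([0.1, 0.2, 0.5])
for v in [0.05, 0.15, 0.15, 0.4]:
    h.observe(v)
# buckets: le=0.1 -> 1, le=0.2 -> 3, le=0.5 -> 4, le=+Inf -> 4
```

Because buckets are cumulative, `le="+Inf"` always equals the total observation count, which is what makes `histogram_quantile` (below) work from bucket counters alone.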
PromQL Basics
# Instant vector
http_requests_total{job="api", status="200"}
# Range vector (last 5 minutes)
http_requests_total[5m]
# Rate (per-second)
rate(http_requests_total[5m])
# Increase (total over period)
increase(http_requests_total[1h])
# Aggregation
sum(rate(http_requests_total[5m])) by (service)
# Histogram quantile (p95)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
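`histogram_quantile` finds the bucket where the target rank falls, then interpolates linearly inside it. A simplified sketch of that estimation (Prometheus's real implementation handles more edge cases, such as NaN and non-monotonic buckets):

```python
def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, ending at +Inf."""
    total = buckets[-1][1]
    rank = q * total                      # rank of the target observation
    lower, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return lower              # quantile beyond the last finite bucket
            width = count - prev_count
            # Linear interpolation within the bucket that contains the rank
            return lower + (bound - lower) * (rank - prev_count) / width
        lower, prev_count = bound, count

buckets = [(0.1, 50), (0.2, 90), (0.5, 100), (float("inf"), 100)]
p95 = histogram_quantile(0.95, buckets)   # rank 95 is in (0.2, 0.5]: 0.2 + 0.3*5/10 = 0.35
```

This is why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile lands in.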
Service Discovery
# Prometheus config for Kubernetes
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with the prometheus.io/scrape: "true" annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Rewrite the scrape address to use the port from the annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
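The relabeling model is simple once seen in isolation: each rule joins its `source_labels` with `;`, matches the regex against the result, and either keeps/drops the target or writes a replacement into `target_label`. A minimal sketch covering just the `keep` and `replace` actions (illustrative, not Prometheus internals):

```python
import re

def relabel(labels, configs):
    """Tiny sketch of Prometheus relabeling: 'keep' and 'replace' only."""
    labels = dict(labels)
    for cfg in configs:
        value = ";".join(labels.get(l, "") for l in cfg["source_labels"])
        m = re.fullmatch(cfg.get("regex", "(.*)"), value)  # regex is anchored
        if cfg["action"] == "keep":
            if not m:
                return None                  # drop this target entirely
        elif cfg["action"] == "replace" and m:
            # Prometheus uses $1 syntax; convert to Python's \1 for re.expand
            repl = cfg.get("replacement", "$1").replace("$", "\\")
            labels[cfg["target_label"]] = m.expand(repl)
    return labels

configs = [
    {"source_labels": ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"],
     "action": "keep", "regex": "true"},
]
kept = relabel({"__address__": "10.0.0.5:8080",
                "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true"},
               configs)                              # annotation matches -> kept
dropped = relabel({"__address__": "10.0.0.6:8080"}, configs)   # no annotation -> None
```

A missing source label joins in as an empty string, which is why the `keep` rule silently drops unannotated pods.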
Recording Rules
groups:
  - name: api_rules
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
      - record: job:http_error_rate:ratio
        expr: job:http_errors:rate5m / job:http_requests:rate5m
Alerting
Prometheus Alerting Rules
Alerting rules are evaluated by Prometheus itself; Alertmanager (next section) only groups and routes the alerts Prometheus fires.
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_error_rate:ratio > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
      - alert: HighMemoryUsage
        expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} memory > 90%"
Alertmanager Configuration
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
    - match:
        severity: warning
      receiver: 'slack-warnings'

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<key>'
        severity: critical
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#warnings'
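The routing tree behaves as a first-match-wins walk: an alert descends into the first child route whose matchers fit its labels, otherwise it falls back to the parent's receiver. A minimal sketch of that logic (illustrative; it ignores `continue`, regex matchers, and grouping):

```python
def pick_receiver(alert_labels, route):
    """Return the receiver for an alert: first matching child route wins."""
    for child in route.get("routes", []):
        if all(alert_labels.get(k) == v for k, v in child.get("match", {}).items()):
            return pick_receiver(alert_labels, child)   # recurse into the subtree
    return route["receiver"]                            # no child matched

root = {
    "receiver": "default-receiver",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty-critical"},
        {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
    ],
}
pick_receiver({"severity": "critical", "alertname": "HighErrorRate"}, root)
# -> "pagerduty-critical"
```

An alert with no `severity` label (or an unmatched value like `info`) falls through to `default-receiver`, which is why a catch-all root receiver is mandatory.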
Grafana
Dashboard JSON Model
{
  "title": "API Dashboard",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (service)",
          "legendFormat": "{{service}}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))",
          "legendFormat": "Error Rate"
        }
      ],
      "thresholds": {
        "steps": [
          {"color": "green", "value": null},
          {"color": "yellow", "value": 0.01},
          {"color": "red", "value": 0.05}
        ]
      }
    }
  ]
}
Dashboard as Code (Grafonnet)
local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local row = grafana.row;
local prometheus = grafana.prometheus;
local graphPanel = grafana.graphPanel;
dashboard.new(
  'API Dashboard',
  time_from='now-1h',
)
.addPanel(
  graphPanel.new(
    'Request Rate',
    datasource='Prometheus',
  )
  .addTarget(
    prometheus.target(
      'sum(rate(http_requests_total[5m])) by (service)',
      legendFormat='{{service}}',
    )
  ),
  gridPos={x: 0, y: 0, w: 12, h: 8},
)
Logging
Structured Logging
{
"timestamp": "2024-01-15T10:30:00Z",
"level": "error",
"message": "Failed to process order",
"service": "order-service",
"trace_id": "abc123",
"span_id": "def456",
"user_id": "user-789",
"order_id": "order-101",
"error": {
"type": "ValidationError",
"message": "Invalid payment method"
}
}
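With the Python standard library alone, a custom `logging.Formatter` can emit records in this shape. A sketch (field names mirror the example above; real services usually reach for a library such as `structlog` or `python-json-logger`):

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": "order-service",
        }
        # Fields passed via `extra=` are attached as attributes on the record
        for key in ("trace_id", "span_id", "user_id", "order_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("Failed to process order",
             extra={"trace_id": "abc123", "order_id": "order-101"})
```

One line per event, machine-parseable keys: that is what lets the Fluentd/Loki pipelines below index `level` and `trace_id` without fragile regex parsing.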
Fluentd Configuration
# Kubernetes logs to Elasticsearch
<source>
@type tail
path /var/log/containers/*.log
pos_file /var/log/fluentd-containers.log.pos
tag kubernetes.*
<parse>
@type json
time_key time
time_format %Y-%m-%dT%H:%M:%S.%NZ
</parse>
</source>
<filter kubernetes.**>
@type kubernetes_metadata
@id filter_kube_metadata
</filter>
<match kubernetes.**>
@type elasticsearch
host elasticsearch
port 9200
logstash_format true
logstash_prefix kubernetes
<buffer>
@type file
path /var/log/fluentd-buffers/kubernetes
flush_mode interval
flush_interval 5s
</buffer>
</match>
Loki (Grafana Logging)
# Promtail config
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - json:
          expressions:
            level: level
            message: message
      - labels:
          level:
      - timestamp:
          source: timestamp
          format: RFC3339Nano
LogQL Queries
# Filter by label
{namespace="production", app="api"}
# Text filter
{app="api"} |= "error"
# JSON parsing
{app="api"} | json | level="error"
# Aggregation (logs per minute)
sum(rate({app="api"}[1m])) by (level)
# Pattern matching
{app="api"} |~ "user_id=\\d+"
Distributed Tracing
OpenTelemetry SDK
// Node.js instrumentation
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const provider = new NodeTracerProvider();

// SimpleSpanProcessor exports each span as it ends -- fine for local dev;
// production setups typically use BatchSpanProcessor to cut export overhead.
provider.addSpanProcessor(
  new SimpleSpanProcessor(
    new OTLPTraceExporter({
      url: 'http://otel-collector:4317',
    })
  )
);
provider.register();

// Auto-instrument incoming/outgoing HTTP calls and Express routes
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});
Jaeger Query Examples
# Find traces with errors
service=api AND status.code=ERROR
# Find slow traces
service=api AND duration>1s
# Find by tag
service=api AND user.id="123"
Trace Context Propagation
# Kubernetes deployment with trace context
spec:
  containers:
    - name: api
      env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://otel-collector:4317"
        - name: OTEL_SERVICE_NAME
          value: "api-service"
        - name: OTEL_PROPAGATORS
          value: "tracecontext,baggage"
SLOs and SLIs
Defining SLIs
# SLI definitions (Sloth)
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: api-service
spec:
  service: "api"
  labels:
    team: platform
  slos:
    - name: "availability"
      objective: 99.9
      description: "API availability"
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(http_requests_total[{{.window}}]))
      alerting:
        pageAlert:
          disable: false
        ticketAlert:
          disable: false
    - name: "latency"
      objective: 99
      description: "API latency p99 < 200ms"
      sli:
        events:
          # Bad events = all requests minus those that finished within 200ms
          errorQuery: sum(rate(http_request_duration_seconds_count[{{.window}}])) - sum(rate(http_request_duration_seconds_bucket{le="0.2"}[{{.window}}]))
          totalQuery: sum(rate(http_request_duration_seconds_count[{{.window}}]))
Error Budget
# Error budget remaining
1 - (
(1 - (sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))))
/
(1 - 0.999) # SLO target
)
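Plugging numbers in: a 99.9% SLO allows a 0.1% error rate, so if the measured error rate over the window is 0.05%, half the budget remains. The same arithmetic as the query above, as a sketch:

```python
def error_budget_remaining(good, total, slo=0.999):
    """Fraction of the error budget left over the window (1.0 = untouched)."""
    availability = good / total
    allowed_error_rate = 1 - slo             # 0.001 for a 99.9% SLO
    actual_error_rate = 1 - availability
    return 1 - actual_error_rate / allowed_error_rate

# 999,500 good out of 1,000,000 requests: 0.05% errors against a 0.1% budget
remaining = error_budget_remaining(good=999_500, total=1_000_000)   # ~0.5
```

A negative result means the budget is exhausted, the usual trigger for freezing feature launches in favor of reliability work.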
Golden Signals
| Signal | Description | Metric Example |
|---|---|---|
| Latency | Request duration | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) |
| Traffic | Request rate | sum(rate(http_requests_total[5m])) |
| Errors | Error rate | sum(rate(http_requests_total{status=~"5.."}[5m])) |
| Saturation | Resource utilization | container_memory_usage_bytes / container_spec_memory_limit_bytes |
Interview Questions
1. Explain the difference between metrics, logs, and traces
Metrics:
- Numeric, time-series data
- Aggregated, low storage cost
- Best for: Alerting, dashboards, trends
Logs:
- Detailed event records
- High cardinality, high storage
- Best for: Debugging, audit trails
Traces:
- Request flow across services
- Causally connected spans
- Best for: Performance analysis, debugging distributed systems
2. How do you design alerts that don't cause alert fatigue?
- Alert on symptoms, not causes
- Set appropriate thresholds with hysteresis
- Use multi-window alerts (burn rate)
- Group related alerts
- Include actionable runbooks
- Review and tune regularly
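The multi-window burn-rate point deserves a concrete shape: burn rate is observed error rate divided by the budgeted error rate, and a page fires only when both a long and a short window burn fast, so the problem is proven to be both sustained and still happening. A sketch using a Google SRE Workbook-style threshold (values illustrative):

```python
def burn_rate(error_rate, slo=0.999):
    """How fast the error budget is being consumed; 1.0 = exactly on pace."""
    return error_rate / (1 - slo)

def should_page(err_rate_1h, err_rate_5m, slo=0.999, threshold=14.4):
    # 14.4x sustained for 1h burns ~2% of a 30-day budget.
    # Requiring both windows avoids paging on spikes that already recovered.
    return (burn_rate(err_rate_1h, slo) > threshold and
            burn_rate(err_rate_5m, slo) > threshold)

should_page(err_rate_1h=0.02, err_rate_5m=0.03)   # sustained and ongoing -> page
should_page(err_rate_1h=0.02, err_rate_5m=0.0)    # already recovered -> no page
```

This is "alert on symptoms" in practice: the alert condition is budget consumption as users experience it, not any individual cause like CPU or restarts.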
3. What are SLOs and why are they important?
- Service Level Objectives define reliability targets
- Based on user-facing metrics (SLIs)
- Error budget = allowed failures before impacting SLO
- Balance reliability with feature velocity
- Guide on-call and incident response priorities
4. How do you troubleshoot a slow request?
- Check traces - find slow span
- Check metrics - CPU, memory, saturation
- Check logs - errors, warnings
- Check dependencies - database, external APIs
- Profile if needed - identify bottleneck