Grafana for .NET Applications

What is Grafana?

Grafana is an open-source visualization and analytics platform that allows you to query, visualize, alert on, and understand your metrics from multiple data sources.

┌──────────────────────────────────────────────────────────┐
│                         Grafana                          │
│ ┌────────────┐  ┌────────────┐  ┌─────────────────────┐  │
│ │ Dashboards │  │  Alerting  │  │    Explore View     │  │
│ └─────┬──────┘  └─────┬──────┘  └──────────┬──────────┘  │
│       │               │                    │             │
│ ┌─────▼───────────────▼────────────────────▼──────────┐  │
│ │                    Data Sources                      │  │
│ │ Prometheus │ Loki │ Tempo │ Jaeger │ InfluxDB │ SQL  │  │
│ └──────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘

Grafana Stack (LGTM)

Component   Purpose                            Query Language
Loki        Log aggregation                    LogQL
Grafana     Visualization                      -
Tempo       Distributed tracing                TraceQL
Mimir       Metrics (Prometheus-compatible)    PromQL

Setting Up Grafana with .NET

Docker Compose Setup

version: '3.8'

services:
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-storage:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml

  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"   # tempo
      - "4317:4317"   # otlp grpc
      - "4318:4318"   # otlp http
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml

volumes:
  grafana-storage:
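
The compose file mounts ./provisioning into the Grafana container, which lets you pre-configure data sources instead of adding them by hand in the UI. A minimal sketch (file path and uid values are illustrative, but note that alert rules provisioned later reference data sources by uid):

```yaml
# provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100
  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo:3200
```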

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'dotnet-app'
    static_configs:
      - targets: ['host.docker.internal:5000']
    metrics_path: /metrics

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

.NET App Configuration

// Install: prometheus-net.AspNetCore

using Prometheus;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddControllers(); // required for MapControllers() below

var app = builder.Build();

// Enable HTTP metrics
app.UseHttpMetrics(options =>
{
    options.AddCustomLabel("host", context => context.Request.Host.Host);
});

// Expose /metrics endpoint
app.MapMetrics();

app.MapControllers();
app.Run();
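
The business-level queries later in this guide (orders_created_total, payments_total, and so on) assume the app also exposes custom metrics. A sketch using prometheus-net, with hypothetical metric and label names:

```csharp
using Prometheus;

public static class AppMetrics
{
    // Counter: orders created, labeled by sales channel
    public static readonly Counter OrdersCreated = Metrics.CreateCounter(
        "orders_created_total", "Total orders created.",
        new CounterConfiguration { LabelNames = new[] { "channel" } });

    // Counter: payment attempts, labeled by outcome
    public static readonly Counter Payments = Metrics.CreateCounter(
        "payments_total", "Payment attempts by status.",
        new CounterConfiguration { LabelNames = new[] { "status" } });

    // Histogram: order processing duration in seconds
    public static readonly Histogram OrderDuration = Metrics.CreateHistogram(
        "order_processing_duration_seconds", "Order processing time.");
}

// In a handler:
// AppMetrics.OrdersCreated.WithLabels("web").Inc();
// AppMetrics.Payments.WithLabels("failed").Inc();
// using (AppMetrics.OrderDuration.NewTimer()) { /* process order */ }
```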

Grafana Dashboards

Dashboard JSON Structure

{
  "dashboard": {
    "id": null,
    "uid": "dotnet-app",
    "title": ".NET Application Dashboard",
    "tags": ["dotnet", "api"],
    "timezone": "browser",
    "refresh": "30s",
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "panels": []
  }
}
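
Dashboards in this JSON format can be loaded from disk automatically via a provider file in the mounted provisioning directory, rather than imported by hand (paths here are illustrative):

```yaml
# provisioning/dashboards/provider.yaml
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: .NET
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards
```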

Essential Panels for .NET Apps

Request Rate Panel

{
  "title": "Request Rate",
  "type": "timeseries",
  "gridPos": { "x": 0, "y": 0, "w": 8, "h": 8 },
  "targets": [
    {
      "expr": "sum(rate(http_requests_received_total[5m])) by (method, endpoint)",
      "legendFormat": "{{method}} {{endpoint}}"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "reqps"
    }
  }
}

Latency Panel (p50, p95, p99)

{
  "title": "Request Latency",
  "type": "timeseries",
  "gridPos": { "x": 8, "y": 0, "w": 8, "h": 8 },
  "targets": [
    {
      "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
      "legendFormat": "p50"
    },
    {
      "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
      "legendFormat": "p95"
    },
    {
      "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
      "legendFormat": "p99"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "s"
    }
  }
}

Error Rate Panel

{
  "title": "Error Rate",
  "type": "stat",
  "gridPos": { "x": 16, "y": 0, "w": 8, "h": 8 },
  "targets": [
    {
      "expr": "sum(rate(http_requests_received_total{code=~\"5..\"}[5m])) / sum(rate(http_requests_received_total[5m])) * 100",
      "legendFormat": "Error %"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percent",
      "thresholds": {
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 1 },
          { "color": "red", "value": 5 }
        ]
      }
    }
  }
}

.NET Runtime Metrics

{
  "title": "GC Collections",
  "type": "timeseries",
  "targets": [
    {
      "expr": "rate(dotnet_collection_count_total[5m])",
      "legendFormat": "Gen {{generation}}"
    }
  ]
},
{
  "title": "Heap Size",
  "type": "timeseries",
  "targets": [
    {
      "expr": "dotnet_gc_heap_size_bytes",
      "legendFormat": "Gen {{generation}}"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "bytes"
    }
  }
},
{
  "title": "Thread Pool",
  "type": "timeseries",
  "targets": [
    {
      "expr": "dotnet_threadpool_num_threads",
      "legendFormat": "Active Threads"
    },
    {
      "expr": "dotnet_threadpool_queue_length",
      "legendFormat": "Queue Length"
    }
  ]
}
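
Not all of the dotnet_* series above are emitted by default: prometheus-net ships basic process and GC stats, while the richer runtime metrics (thread pool, contention, detailed GC timings) typically come from the separate prometheus-net.DotNetRuntime package. A sketch assuming that package:

```csharp
// Install: prometheus-net.DotNetRuntime
using Prometheus.DotNetRuntime;

// Start collecting runtime event counters (GC, thread pool, JIT,
// lock contention) before the app begins serving traffic.
// Dispose the returned handle on shutdown to stop collection.
IDisposable collector = DotNetRuntimeStatsBuilder
    .Default()
    .StartCollecting();
```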

PromQL for .NET Metrics

Common Queries

# Request rate by endpoint
sum(rate(http_requests_received_total[5m])) by (endpoint)

# Error rate percentage
sum(rate(http_requests_received_total{code=~"5.."}[5m]))
  / sum(rate(http_requests_received_total[5m])) * 100

# Latency percentiles
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))

# Requests in flight
http_requests_in_progress

# .NET GC pause ratio (already a gauge ratio; no rate() needed)
dotnet_gc_pause_ratio * 100

# Memory usage
process_working_set_bytes / 1024 / 1024

# CPU usage
rate(process_cpu_seconds_total[5m]) * 100

Business Metrics

# Orders per minute
sum(rate(orders_created_total[1m])) * 60

# Revenue per hour (counter; name follows the _total convention)
sum(increase(order_amount_total[1h]))

# Failed payments rate
sum(rate(payments_total{status="failed"}[5m]))
  / sum(rate(payments_total[5m])) * 100

# Cache hit rate
sum(rate(cache_hits_total[5m]))
  / (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m]))) * 100

Grafana Alerting

Alert Rule Definition

# provisioning/alerting/rules.yaml
apiVersion: 1
groups:
  - orgId: 1
    name: dotnet-alerts
    folder: .NET Alerts
    interval: 1m
    rules:
      - uid: high-error-rate
        title: High Error Rate
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus
            model:
              expr: sum(rate(http_requests_received_total{code=~"5.."}[5m])) / sum(rate(http_requests_received_total[5m])) * 100
          - refId: C
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [5]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [A]
                  reducer:
                    type: last
              type: threshold
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $values.A }}%"

      - uid: high-latency
        title: High Latency
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
          - refId: C
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [2]
                    type: gt
              type: threshold
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "p95 latency is {{ $values.A }}s"

Contact Points

# provisioning/alerting/contactpoints.yaml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: slack-notifications
    receivers:
      - uid: slack-receiver
        type: slack
        settings:
          url: ${SLACK_WEBHOOK_URL}
          recipient: "#alerts"
        disableResolveMessage: false

  - orgId: 1
    name: pagerduty-critical
    receivers:
      - uid: pagerduty-receiver
        type: pagerduty
        settings:
          integrationKey: ${PAGERDUTY_KEY}
          severity: critical
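
Contact points only define destinations; routing alerts to them is the job of a notification policy, which can also be provisioned. A sketch matching the severity labels used above (field names follow Grafana's alerting provisioning schema; treat this as a starting point):

```yaml
# provisioning/alerting/policies.yaml
apiVersion: 1
policies:
  - orgId: 1
    receiver: slack-notifications      # default route for everything
    routes:
      - receiver: pagerduty-critical   # critical alerts page instead
        object_matchers:
          - ["severity", "=", "critical"]
```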

Loki for Logs

LogQL Queries

# Filter by app label, then by line content
{app="order-service"} |= "error"

# JSON parsing
{app="order-service"} | json | level="error"

# Count errors by type
sum(count_over_time({app="order-service"} |= "error" [5m])) by (error_type)

# Search for specific user
{app="order-service"} | json | user_id="12345"

# Pattern matching
{app="order-service"} |~ "order_id=\\d+"

# Unwrap a numeric field and average it over 5m
avg_over_time({app="order-service"} | json | unwrap duration [5m])

.NET Logging to Loki

// Install: Serilog.Sinks.Grafana.Loki

Log.Logger = new LoggerConfiguration()
    .MinimumLevel.Information()
    .Enrich.FromLogContext()
    .Enrich.WithProperty("app", "order-service")
    .WriteTo.GrafanaLoki(
        "http://loki:3100",
        labels: new List<LokiLabel>
        {
            new LokiLabel { Key = "app", Value = "order-service" },
            new LokiLabel { Key = "env", Value = "production" }
        },
        propertiesAsLabels: new[] { "level", "SourceContext" })
    .CreateLogger();

Tempo for Traces

TraceQL Queries

# Find traces by service
{ resource.service.name = "order-service" }

# Find error traces
{ status = error }

# Find slow spans
{ duration > 1s }

# Find traces with specific attribute
{ span.http.url =~ "/api/orders.*" }

# Combine conditions
{ resource.service.name = "order-service" && status = error && duration > 500ms }

.NET Tracing to Tempo

// Install: OpenTelemetry.Extensions.Hosting, OpenTelemetry.Instrumentation.AspNetCore,
//          OpenTelemetry.Instrumentation.Http, OpenTelemetry.Exporter.OpenTelemetryProtocol

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("http://tempo:4317");
            options.Protocol = OtlpExportProtocol.Grpc;
        }));

Dashboard as Code

Grafonnet (Jsonnet)

local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local prometheus = grafana.prometheus;
local graphPanel = grafana.graphPanel;
local statPanel = grafana.statPanel;

dashboard.new(
  '.NET Application Dashboard',
  time_from='now-1h',
  refresh='30s',
  tags=['dotnet', 'api'],
)
.addPanel(
  graphPanel.new(
    'Request Rate',
    datasource='Prometheus',
    span=8,
  )
  .addTarget(
    prometheus.target(
      'sum(rate(http_requests_received_total[5m])) by (endpoint)',
      legendFormat='{{endpoint}}',
    )
  ),
  gridPos={ x: 0, y: 0, w: 8, h: 8 },
)
.addPanel(
  statPanel.new(
    'Error Rate',
    datasource='Prometheus',
  )
  .addTarget(
    prometheus.target(
      'sum(rate(http_requests_received_total{code=~"5.."}[5m])) / sum(rate(http_requests_received_total[5m])) * 100',
    )
  )
  .addThreshold({ color: 'green', value: 0 })
  .addThreshold({ color: 'yellow', value: 1 })
  .addThreshold({ color: 'red', value: 5 }),
  gridPos={ x: 8, y: 0, w: 4, h: 4 },
)

Best Practices

Dashboard Design

  1. Use consistent layouts: Request rate → Latency → Errors → Resources
  2. Add thresholds: Color-code values for quick understanding
  3. Include variables: Filter by environment, service, endpoint
  4. Link panels: Click error rate → see related logs
  5. Document: Add descriptions to panels
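
Point 3 above ("include variables") maps to the dashboard's templating block. A sketch of a query variable populated from Prometheus label values (the variable name is arbitrary):

```json
"templating": {
  "list": [
    {
      "name": "endpoint",
      "label": "Endpoint",
      "type": "query",
      "datasource": "Prometheus",
      "query": "label_values(http_requests_received_total, endpoint)",
      "refresh": 2,
      "includeAll": true
    }
  ]
}
```

Panel queries can then filter with a selector like {endpoint=~"$endpoint"}.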

Performance

  1. Use recording rules for expensive queries
  2. Set appropriate time ranges (don't query months of data)
  3. Limit label cardinality (avoid high-cardinality labels like user_id)
  4. Use instant queries for stat panels
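
Recording rules (point 1) precompute expensive expressions on the Prometheus side so dashboards query a cheap pre-aggregated series instead. A sketch (the level:metric:operation rule names are a common convention, not a requirement):

```yaml
# rules.yml, referenced via rule_files in prometheus.yml
groups:
  - name: dotnet-app-recording
    interval: 30s
    rules:
      # Per-endpoint request rate, precomputed every 30s
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_received_total[5m])) by (job, endpoint)
      # p95 latency, precomputed
      - record: job:http_request_duration:p95_5m
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
```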

Interview Questions

1. What's the difference between Grafana and Prometheus?

  • Prometheus: time-series database + scraper + alerting rules
  • Grafana: visualization layer that queries Prometheus (and other data sources)

2. How do you reduce dashboard load time?

  • Use recording rules for complex queries
  • Limit time range defaults
  • Reduce panel count per dashboard
  • Use instant queries for stats
  • Avoid high-cardinality labels

3. Explain the LGTM stack

  • Loki: Log aggregation (like Elasticsearch, but lighter)
  • Grafana: Unified visualization for all telemetry
  • Tempo: Distributed tracing storage
  • Mimir: Long-term metrics storage (Prometheus-compatible)

4. How do you correlate logs, metrics, and traces?

  • Use consistent labels (service name, environment)
  • Add trace IDs to logs
  • Configure exemplars in Prometheus
  • Use Grafana's correlations feature to link panels
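
For the "trace IDs in logs" point, the active trace context is available from System.Diagnostics.Activity and can be pushed into the Serilog context so Loki lines can be joined to Tempo traces. A sketch (the trace_id/span_id property names are a convention, not fixed; requires Enrich.FromLogContext() as configured earlier):

```csharp
using System.Diagnostics;
using Serilog.Context;

// Middleware that stamps every log line written in the request
// scope with the current trace and span IDs.
app.Use(async (context, next) =>
{
    var activity = Activity.Current;
    using (LogContext.PushProperty("trace_id", activity?.TraceId.ToString()))
    using (LogContext.PushProperty("span_id", activity?.SpanId.ToString()))
    {
        await next();
    }
});
```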