Grafana for .NET Applications

What is Grafana?

Grafana is an open-source visualization and analytics platform that allows you to query, visualize, alert on, and understand your metrics from multiple data sources.

┌──────────────────────────────────────────────────────────┐
│                         Grafana                          │
│ ┌────────────┐  ┌────────────┐  ┌─────────────────────┐  │
│ │ Dashboards │  │  Alerting  │  │    Explore View     │  │
│ └─────┬──────┘  └─────┬──────┘  └──────────┬──────────┘  │
│       │               │                    │             │
│ ┌─────▼───────────────▼────────────────────▼──────────┐  │
│ │                    Data Sources                      │  │
│ │ Prometheus │ Loki │ Tempo │ Jaeger │ InfluxDB │ SQL  │  │
│ └──────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘

Grafana Stack (LGTM)

Component   Purpose                            Query Language
Loki        Log aggregation                    LogQL
Grafana     Visualization                      -
Tempo       Distributed tracing                TraceQL
Mimir       Metrics (Prometheus-compatible)    PromQL

Setting Up Grafana with .NET

Docker Compose Setup

version: '3.8'

services:
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-storage:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml

  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"   # tempo
      - "4317:4317"   # otlp grpc
      - "4318:4318"   # otlp http
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml

volumes:
  grafana-storage:
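
The compose file mounts ./provisioning into the Grafana container, which lets you pre-configure data sources instead of adding them by hand in the UI. A minimal sketch (file path and uid values are illustrative, but note that alert rules provisioned later reference data sources by uid):

```yaml
# provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100
  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo:3200
```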

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'dotnet-app'
    static_configs:
      - targets: ['host.docker.internal:5000']
    metrics_path: /metrics

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

.NET App Configuration

// Install: prometheus-net.AspNetCore

using Prometheus;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddControllers(); // required for MapControllers() below

var app = builder.Build();

// Enable HTTP metrics
app.UseHttpMetrics(options =>
{
    options.AddCustomLabel("host", context => context.Request.Host.Host);
});

// Expose /metrics endpoint
app.MapMetrics();

app.MapControllers();
app.Run();
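
The business-level queries later in this guide (orders_created_total, payments_total, and so on) assume the app also exposes custom metrics. A sketch using prometheus-net, with hypothetical metric and label names:

```csharp
using Prometheus;

public static class AppMetrics
{
    // Counter: orders created, labeled by sales channel
    public static readonly Counter OrdersCreated = Metrics.CreateCounter(
        "orders_created_total", "Total orders created.",
        new CounterConfiguration { LabelNames = new[] { "channel" } });

    // Counter: payment attempts, labeled by outcome
    public static readonly Counter Payments = Metrics.CreateCounter(
        "payments_total", "Payment attempts by status.",
        new CounterConfiguration { LabelNames = new[] { "status" } });

    // Histogram: order processing duration in seconds
    public static readonly Histogram OrderDuration = Metrics.CreateHistogram(
        "order_processing_duration_seconds", "Order processing time.");
}

// In a handler:
// AppMetrics.OrdersCreated.WithLabels("web").Inc();
// AppMetrics.Payments.WithLabels("failed").Inc();
// using (AppMetrics.OrderDuration.NewTimer()) { /* process order */ }
```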

Grafana Dashboards

Dashboard JSON Structure

{
  "dashboard": {
    "id": null,
    "uid": "dotnet-app",
    "title": ".NET Application Dashboard",
    "tags": ["dotnet", "api"],
    "timezone": "browser",
    "refresh": "30s",
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "panels": []
  }
}
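
Dashboards in this JSON format can be loaded from disk automatically via a provider file in the mounted provisioning directory, rather than imported by hand (paths here are illustrative):

```yaml
# provisioning/dashboards/provider.yaml
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: .NET
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards
```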

Essential Panels for .NET Apps

Request Rate Panel

{
  "title": "Request Rate",
  "type": "timeseries",
  "gridPos": { "x": 0, "y": 0, "w": 8, "h": 8 },
  "targets": [
    {
      "expr": "sum(rate(http_requests_received_total[5m])) by (method, endpoint)",
      "legendFormat": "{{method}} {{endpoint}}"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "reqps"
    }
  }
}

Latency Panel (p50, p95, p99)

{
  "title": "Request Latency",
  "type": "timeseries",
  "gridPos": { "x": 8, "y": 0, "w": 8, "h": 8 },
  "targets": [
    {
      "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
      "legendFormat": "p50"
    },
    {
      "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
      "legendFormat": "p95"
    },
    {
      "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
      "legendFormat": "p99"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "s"
    }
  }
}

Error Rate Panel

{
  "title": "Error Rate",
  "type": "stat",
  "gridPos": { "x": 16, "y": 0, "w": 8, "h": 8 },
  "targets": [
    {
      "expr": "sum(rate(http_requests_received_total{code=~\"5..\"}[5m])) / sum(rate(http_requests_received_total[5m])) * 100",
      "legendFormat": "Error %"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percent",
      "thresholds": {
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 1 },
          { "color": "red", "value": 5 }
        ]
      }
    }
  }
}

.NET Runtime Metrics

{
  "title": "GC Collections",
  "type": "timeseries",
  "targets": [
    {
      "expr": "rate(dotnet_collection_count_total[5m])",
      "legendFormat": "Gen {{generation}}"
    }
  ]
},
{
  "title": "Heap Size",
  "type": "timeseries",
  "targets": [
    {
      "expr": "dotnet_gc_heap_size_bytes",
      "legendFormat": "Gen {{generation}}"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "bytes"
    }
  }
},
{
  "title": "Thread Pool",
  "type": "timeseries",
  "targets": [
    {
      "expr": "dotnet_threadpool_num_threads",
      "legendFormat": "Active Threads"
    },
    {
      "expr": "dotnet_threadpool_queue_length",
      "legendFormat": "Queue Length"
    }
  ]
}
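
Not all of the dotnet_* series above are emitted by default: prometheus-net ships basic process and GC stats, while the richer runtime metrics (thread pool, contention, detailed GC timings) typically come from the separate prometheus-net.DotNetRuntime package. A sketch assuming that package:

```csharp
// Install: prometheus-net.DotNetRuntime
using Prometheus.DotNetRuntime;

// Start collecting runtime event counters (GC, thread pool, JIT,
// lock contention) before the app begins serving traffic.
// Dispose the returned handle on shutdown to stop collection.
IDisposable collector = DotNetRuntimeStatsBuilder
    .Default()
    .StartCollecting();
```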

PromQL for .NET Metrics

Common Queries

# Request rate by endpoint
sum(rate(http_requests_received_total[5m])) by (endpoint)

# Error rate percentage
sum(rate(http_requests_received_total{code=~"5.."}[5m]))
  / sum(rate(http_requests_received_total[5m])) * 100

# Latency percentiles
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))

# Requests in flight
http_requests_in_progress

# .NET GC pause ratio (already a gauge ratio; no rate() needed)
dotnet_gc_pause_ratio * 100

# Memory usage
process_working_set_bytes / 1024 / 1024

# CPU usage
rate(process_cpu_seconds_total[5m]) * 100

Business Metrics

# Orders per minute
sum(rate(orders_created_total[1m])) * 60

# Revenue per hour (counter; name follows the _total convention)
sum(increase(order_amount_total[1h]))

# Failed payments rate
sum(rate(payments_total{status="failed"}[5m]))
  / sum(rate(payments_total[5m])) * 100

# Cache hit rate
sum(rate(cache_hits_total[5m]))
  / (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m]))) * 100

Grafana Alerting

Alert Rule Definition

# provisioning/alerting/rules.yaml
apiVersion: 1
groups:
  - orgId: 1
    name: dotnet-alerts
    folder: .NET Alerts
    interval: 1m
    rules:
      - uid: high-error-rate
        title: High Error Rate
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus
            model:
              expr: sum(rate(http_requests_received_total{code=~"5.."}[5m])) / sum(rate(http_requests_received_total[5m])) * 100
          - refId: C
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [5]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [A]
                  reducer:
                    type: last
              type: threshold
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $values.A }}%"

      - uid: high-latency
        title: High Latency
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
          - refId: C
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [2]
                    type: gt
              type: threshold
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "p95 latency is {{ $values.A }}s"

Contact Points

# provisioning/alerting/contactpoints.yaml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: slack-notifications
    receivers:
      - uid: slack-receiver
        type: slack
        settings:
          url: ${SLACK_WEBHOOK_URL}
          recipient: "#alerts"
        disableResolveMessage: false

  - orgId: 1
    name: pagerduty-critical
    receivers:
      - uid: pagerduty-receiver
        type: pagerduty
        settings:
          integrationKey: ${PAGERDUTY_KEY}
          severity: critical
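
Contact points only define destinations; routing alerts to them is the job of a notification policy, which can also be provisioned. A sketch matching the severity labels used above (field names follow Grafana's alerting provisioning schema; treat this as a starting point):

```yaml
# provisioning/alerting/policies.yaml
apiVersion: 1
policies:
  - orgId: 1
    receiver: slack-notifications      # default route for everything
    routes:
      - receiver: pagerduty-critical   # critical alerts page instead
        object_matchers:
          - ["severity", "=", "critical"]
```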

Loki for Logs

LogQL Queries

# Filter by app label, then by line content
{app="order-service"} |= "error"

# JSON parsing
{app="order-service"} | json | level="error"

# Count errors by type
sum(count_over_time({app="order-service"} |= "error" [5m])) by (error_type)

# Search for specific user
{app="order-service"} | json | user_id="12345"

# Pattern matching
{app="order-service"} |~ "order_id=\\d+"

# Unwrap a numeric field and average it over 5m
avg_over_time({app="order-service"} | json | unwrap duration [5m])

.NET Logging to Loki

// Install: Serilog.Sinks.Grafana.Loki

Log.Logger = new LoggerConfiguration()
    .MinimumLevel.Information()
    .Enrich.FromLogContext()
    .Enrich.WithProperty("app", "order-service")
    .WriteTo.GrafanaLoki(
        "http://loki:3100",
        labels: new List<LokiLabel>
        {
            new LokiLabel { Key = "app", Value = "order-service" },
            new LokiLabel { Key = "env", Value = "production" }
        },
        propertiesAsLabels: new[] { "level", "SourceContext" })
    .CreateLogger();

Tempo for Traces

TraceQL Queries

# Find traces by service
{ resource.service.name = "order-service" }

# Find error traces
{ status = error }

# Find slow spans
{ duration > 1s }

# Find traces with specific attribute
{ span.http.url =~ "/api/orders.*" }

# Combine conditions
{ resource.service.name = "order-service" && status = error && duration > 500ms }

.NET Tracing to Tempo

// Install: OpenTelemetry.Extensions.Hosting, OpenTelemetry.Instrumentation.AspNetCore,
//          OpenTelemetry.Instrumentation.Http, OpenTelemetry.Exporter.OpenTelemetryProtocol

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("http://tempo:4317");
            options.Protocol = OtlpExportProtocol.Grpc;
        }));

Dashboard as Code

Grafonnet (Jsonnet)

local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local prometheus = grafana.prometheus;
local graphPanel = grafana.graphPanel;
local statPanel = grafana.statPanel;

dashboard.new(
  '.NET Application Dashboard',
  time_from='now-1h',
  refresh='30s',
  tags=['dotnet', 'api'],
)
.addPanel(
  graphPanel.new(
    'Request Rate',
    datasource='Prometheus',
    span=8,
  )
  .addTarget(
    prometheus.target(
      'sum(rate(http_requests_received_total[5m])) by (endpoint)',
      legendFormat='{{endpoint}}',
    )
  ),
  gridPos={ x: 0, y: 0, w: 8, h: 8 },
)
.addPanel(
  statPanel.new(
    'Error Rate',
    datasource='Prometheus',
  )
  .addTarget(
    prometheus.target(
      'sum(rate(http_requests_received_total{code=~"5.."}[5m])) / sum(rate(http_requests_received_total[5m])) * 100',
    )
  )
  .addThreshold({ color: 'green', value: 0 })
  .addThreshold({ color: 'yellow', value: 1 })
  .addThreshold({ color: 'red', value: 5 }),
  gridPos={ x: 8, y: 0, w: 4, h: 4 },
)

Best Practices

Dashboard Design

  1. Use consistent layouts: Request rate → Latency → Errors → Resources
  2. Add thresholds: Color-code values for quick understanding
  3. Include variables: Filter by environment, service, endpoint
  4. Link panels: Click error rate → see related logs
  5. Document: Add descriptions to panels
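
Point 3 above ("include variables") maps to the dashboard's templating block. A sketch of a query variable populated from Prometheus label values (the variable name is arbitrary):

```json
"templating": {
  "list": [
    {
      "name": "endpoint",
      "label": "Endpoint",
      "type": "query",
      "datasource": "Prometheus",
      "query": "label_values(http_requests_received_total, endpoint)",
      "refresh": 2,
      "includeAll": true
    }
  ]
}
```

Panel queries can then filter with a selector like {endpoint=~"$endpoint"}.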

Performance

  1. Use recording rules for expensive queries
  2. Set appropriate time ranges (don't query months of data)
  3. Limit label cardinality (avoid high-cardinality labels like user_id)
  4. Use instant queries for stat panels
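
Recording rules (point 1) precompute expensive expressions on the Prometheus side so dashboards query a cheap pre-aggregated series instead. A sketch (the level:metric:operation rule names are a common convention, not a requirement):

```yaml
# rules.yml, referenced via rule_files in prometheus.yml
groups:
  - name: dotnet-app-recording
    interval: 30s
    rules:
      # Per-endpoint request rate, precomputed every 30s
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_received_total[5m])) by (job, endpoint)
      # p95 latency, precomputed
      - record: job:http_request_duration:p95_5m
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
```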

Interview Questions

1. What's the difference between Grafana and Prometheus?

  • Prometheus: time-series database + scraper + alerting rules
  • Grafana: visualization layer that queries Prometheus (and other data sources)

2. How do you reduce dashboard load time?

  • Use recording rules for complex queries
  • Limit time range defaults
  • Reduce panel count per dashboard
  • Use instant queries for stats
  • Avoid high-cardinality labels

3. Explain the LGTM stack

  • Loki: Log aggregation (like Elasticsearch, but lighter)
  • Grafana: Unified visualization for all telemetry
  • Tempo: Distributed tracing storage
  • Mimir: Long-term metrics storage (Prometheus-compatible)

4. How do you correlate logs, metrics, and traces?

  • Use consistent labels (service name, environment)
  • Add trace IDs to logs
  • Configure exemplars in Prometheus
  • Use Grafana's correlations feature to link panels
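
For the "trace IDs in logs" point, the active trace context is available from System.Diagnostics.Activity and can be pushed into the Serilog context so Loki lines can be joined to Tempo traces. A sketch (the trace_id/span_id property names are a convention, not fixed; requires Enrich.FromLogContext() as configured earlier):

```csharp
using System.Diagnostics;
using Serilog.Context;

// Middleware that stamps every log line written in the request
// scope with the current trace and span IDs.
app.Use(async (context, next) =>
{
    var activity = Activity.Current;
    using (LogContext.PushProperty("trace_id", activity?.TraceId.ToString()))
    using (LogContext.PushProperty("span_id", activity?.SpanId.ToString()))
    {
        await next();
    }
});
```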