Grafana
- Grafana for .NET Applications
- What is Grafana?
- Grafana Stack (LGTM)
- Setting Up Grafana with .NET
- Docker Compose Setup
- Prometheus Configuration
- .NET App Configuration
- Grafana Dashboards
- Dashboard JSON Structure
- Essential Panels for .NET Apps
- Request Rate Panel
- Latency Panel (p50, p95, p99)
- Error Rate Panel
- .NET Runtime Metrics
- PromQL for .NET Metrics
- Common Queries
- Business Metrics
- Grafana Alerting
- Alert Rule Definition
- Contact Points
- Loki for Logs
- LogQL Queries
- .NET Logging to Loki
- Tempo for Traces
- TraceQL Queries
- .NET Tracing to Tempo
- Dashboard as Code
- Grafonnet (Jsonnet)
- Best Practices
- Dashboard Design
- Performance
- Interview Questions
- 1. What's the difference between Grafana and Prometheus?
- 2. How do you reduce dashboard load time?
- 3. Explain the LGTM stack
- 4. How do you correlate logs, metrics, and traces?
Grafana for .NET Applications
What is Grafana?
Grafana is an open-source visualization and analytics platform that allows you to query, visualize, alert on, and understand your metrics from multiple data sources.
┌──────────────────────────────────────────────────────────────┐
│                           Grafana                            │
│  ┌────────────┐  ┌────────────┐  ┌────────────────────────┐  │
│  │ Dashboards │  │  Alerting  │  │      Explore View      │  │
│  └─────┬──────┘  └─────┬──────┘  └───────────┬────────────┘  │
│        │               │                     │               │
│  ┌─────▼───────────────▼─────────────────────▼────────────┐  │
│  │                      Data Sources                       │  │
│  │  Prometheus │ Loki │ Tempo │ Jaeger │ InfluxDB │ SQL    │  │
│  └─────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘
Grafana Stack (LGTM)
| Component | Purpose | Query Language |
|---|---|---|
| Loki | Log aggregation | LogQL |
| Grafana | Visualization | - |
| Tempo | Distributed tracing | TraceQL |
| Mimir | Metrics (Prometheus-compatible) | PromQL |
Setting Up Grafana with .NET
Docker Compose Setup
version: '3.8'

services:
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-storage:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml

  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"   # tempo
      - "4317:4317"   # otlp grpc
      - "4318:4318"   # otlp http
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml

volumes:
  grafana-storage:
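The `./provisioning` mount above lets Grafana pick up data sources on startup instead of configuring them by hand. A minimal sketch of such a file (the file name and `uid` values are illustrative, but the `uid`s must match whatever alert rules and dashboards reference):

```yaml
# provisioning/datasources/datasources.yaml (illustrative file name and uids)
apiVersion: 1
datasources:
  - name: Prometheus
    uid: prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    uid: loki
    type: loki
    url: http://loki:3100
  - name: Tempo
    uid: tempo
    type: tempo
    url: http://tempo:3200
```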
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'dotnet-app'
    static_configs:
      - targets: ['host.docker.internal:5000']
    metrics_path: /metrics

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
.NET App Configuration
// Install: prometheus-net.AspNetCore
using Prometheus;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddControllers();

var app = builder.Build();

// Enable HTTP metrics (request count, duration, in-progress)
app.UseHttpMetrics(options =>
{
    options.AddCustomLabel("host", context => context.Request.Host.Host);
});

// Expose the /metrics endpoint for Prometheus to scrape
app.MapMetrics();
app.MapControllers();

app.Run();
Grafana Dashboards
Dashboard JSON Structure
{
  "dashboard": {
    "id": null,
    "uid": "dotnet-app",
    "title": ".NET Application Dashboard",
    "tags": ["dotnet", "api"],
    "timezone": "browser",
    "refresh": "30s",
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "panels": []
  }
}
Essential Panels for .NET Apps
Request Rate Panel
{
  "title": "Request Rate",
  "type": "timeseries",
  "gridPos": { "x": 0, "y": 0, "w": 8, "h": 8 },
  "targets": [
    {
      "expr": "sum(rate(http_requests_received_total[5m])) by (method, endpoint)",
      "legendFormat": "{{method}} {{endpoint}}"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "reqps"
    }
  }
}
Latency Panel (p50, p95, p99)
{
  "title": "Request Latency",
  "type": "timeseries",
  "gridPos": { "x": 8, "y": 0, "w": 8, "h": 8 },
  "targets": [
    {
      "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
      "legendFormat": "p50"
    },
    {
      "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
      "legendFormat": "p95"
    },
    {
      "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
      "legendFormat": "p99"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "s"
    }
  }
}
Error Rate Panel
{
  "title": "Error Rate",
  "type": "stat",
  "gridPos": { "x": 16, "y": 0, "w": 8, "h": 8 },
  "targets": [
    {
      "expr": "sum(rate(http_requests_received_total{code=~\"5..\"}[5m])) / sum(rate(http_requests_received_total[5m])) * 100",
      "legendFormat": "Error %"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percent",
      "thresholds": {
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 1 },
          { "color": "red", "value": 5 }
        ]
      }
    }
  }
}
.NET Runtime Metrics
{
  "title": "GC Collections",
  "type": "timeseries",
  "targets": [
    {
      "expr": "rate(dotnet_collection_count_total[5m])",
      "legendFormat": "Gen {{generation}}"
    }
  ]
},
{
  "title": "Heap Size",
  "type": "timeseries",
  "targets": [
    {
      "expr": "dotnet_gc_heap_size_bytes",
      "legendFormat": "Gen {{generation}}"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "bytes"
    }
  }
},
{
  "title": "Thread Pool",
  "type": "timeseries",
  "targets": [
    {
      "expr": "dotnet_threadpool_num_threads",
      "legendFormat": "Active Threads"
    },
    {
      "expr": "dotnet_threadpool_queue_length",
      "legendFormat": "Queue Length"
    }
  ]
}
PromQL for .NET Metrics
Common Queries
# Request rate by endpoint
sum(rate(http_requests_received_total[5m])) by (endpoint)
# Error rate percentage
sum(rate(http_requests_received_total{code=~"5.."}[5m]))
/ sum(rate(http_requests_received_total[5m])) * 100
# Latency percentiles
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))
# Requests in flight
http_requests_in_progress
# .NET GC pause ratio (a gauge that is already a ratio — no rate() needed)
dotnet_gc_pause_ratio * 100
# Memory usage
process_working_set_bytes / 1024 / 1024
# CPU usage
rate(process_cpu_seconds_total[5m]) * 100
Business Metrics
# Orders per minute
sum(rate(orders_created_total[1m])) * 60
# Revenue per hour
sum(increase(order_total_amount[1h]))
# Failed payments rate
sum(rate(payments_total{status="failed"}[5m]))
/ sum(rate(payments_total[5m])) * 100
# Cache hit rate
sum(rate(cache_hits_total[5m]))
/ (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m]))) * 100
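The business queries above assume the application exports counters such as `orders_created_total` and `payments_total`. With prometheus-net, declaring those custom metrics might look like this (metric and class names are illustrative, not part of any library):

```csharp
using Prometheus;

public static class ShopMetrics
{
    // Counters are monotonically increasing; query them with rate()/increase()
    public static readonly Counter OrdersCreated = Metrics
        .CreateCounter("orders_created_total", "Total orders created.");

    public static readonly Counter Payments = Metrics
        .CreateCounter("payments_total", "Payment attempts.",
            new CounterConfiguration { LabelNames = new[] { "status" } });

    public static readonly Counter CacheHits = Metrics
        .CreateCounter("cache_hits_total", "Cache hits.");
    public static readonly Counter CacheMisses = Metrics
        .CreateCounter("cache_misses_total", "Cache misses.");
}

// Usage inside a handler:
//   ShopMetrics.OrdersCreated.Inc();
//   ShopMetrics.Payments.WithLabels("failed").Inc();
```

Keep the `status` label to a small fixed set of values (e.g. `succeeded`/`failed`) — free-form label values blow up cardinality.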
Grafana Alerting
Alert Rule Definition
# provisioning/alerting/rules.yaml
apiVersion: 1
groups:
  - orgId: 1
    name: dotnet-alerts
    folder: .NET Alerts
    interval: 1m
    rules:
      - uid: high-error-rate
        title: High Error Rate
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus
            model:
              expr: sum(rate(http_requests_received_total{code=~"5.."}[5m])) / sum(rate(http_requests_received_total[5m])) * 100
          - refId: C
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [5]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [A]
                  reducer:
                    type: last
              type: threshold
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $values.A }}%"

      - uid: high-latency
        title: High Latency
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
          - refId: C
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [2]
                    type: gt
              type: threshold
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "p95 latency is {{ $values.A }}s"
Contact Points
# provisioning/alerting/contactpoints.yaml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: slack-notifications
    receivers:
      - uid: slack-receiver
        type: slack
        settings:
          url: ${SLACK_WEBHOOK_URL}
          recipient: "#alerts"
        disableResolveMessage: false

  - orgId: 1
    name: pagerduty-critical
    receivers:
      - uid: pagerduty-receiver
        type: pagerduty
        settings:
          integrationKey: ${PAGERDUTY_KEY}
          severity: critical
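To actually route alerts by the `severity` labels set in the rules above, a notification policy can be provisioned alongside the contact points. A sketch (file name is illustrative; the receiver names must match the contact points):

```yaml
# provisioning/alerting/policies.yaml (illustrative file name)
apiVersion: 1
policies:
  - orgId: 1
    receiver: slack-notifications   # default route for everything else
    routes:
      - receiver: pagerduty-critical
        object_matchers:
          - ["severity", "=", "critical"]
```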
Loki for Logs
LogQL Queries
# Filter by app label and line content
{app="order-service"} |= "error"
# JSON parsing
{app="order-service"} | json | level="error"
# Count errors by type
sum(count_over_time({app="order-service"} |= "error" [5m])) by (error_type)
# Search for specific user
{app="order-service"} | json | user_id="12345"
# Pattern matching
{app="order-service"} |~ "order_id=\\d+"
# Unwrap a numeric field and aggregate it over time
avg_over_time({app="order-service"} | json | unwrap duration [5m])
.NET Logging to Loki
// Install: Serilog.Sinks.Grafana.Loki
Log.Logger = new LoggerConfiguration()
    .MinimumLevel.Information()
    .Enrich.FromLogContext()
    .Enrich.WithProperty("app", "order-service")
    .WriteTo.GrafanaLoki(
        "http://loki:3100",
        labels: new List<LokiLabel>
        {
            new LokiLabel { Key = "app", Value = "order-service" },
            new LokiLabel { Key = "env", Value = "production" }
        },
        propertiesAsLabels: new[] { "level", "SourceContext" })
    .CreateLogger();
Tempo for Traces
TraceQL Queries
# Find traces by service
{ resource.service.name = "order-service" }
# Find error traces
{ status = error }
# Find slow spans
{ duration > 1s }
# Find traces with specific attribute
{ span.http.url =~ "/api/orders.*" }
# Combine conditions
{ resource.service.name = "order-service" && status = error && duration > 500ms }
.NET Tracing to Tempo
builder.Services.AddOpenTelemetry()
    // The service name set here surfaces as resource.service.name in TraceQL
    .ConfigureResource(resource => resource.AddService("order-service"))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("http://tempo:4317");
            options.Protocol = OtlpExportProtocol.Grpc;
        }));
Dashboard as Code
Grafonnet (Jsonnet)
local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local prometheus = grafana.prometheus;
local graphPanel = grafana.graphPanel;
local statPanel = grafana.statPanel;

dashboard.new(
  '.NET Application Dashboard',
  time_from='now-1h',
  refresh='30s',
  tags=['dotnet', 'api'],
)
.addPanel(
  graphPanel.new(
    'Request Rate',
    datasource='Prometheus',
    span=8,
  )
  .addTarget(
    prometheus.target(
      'sum(rate(http_requests_received_total[5m])) by (endpoint)',
      legendFormat='{{endpoint}}',
    )
  ),
  gridPos={ x: 0, y: 0, w: 8, h: 8 },
)
.addPanel(
  statPanel.new(
    'Error Rate',
    datasource='Prometheus',
  )
  .addTarget(
    prometheus.target(
      'sum(rate(http_requests_received_total{code=~"5.."}[5m])) / sum(rate(http_requests_received_total[5m])) * 100',
    )
  )
  .addThreshold({ color: 'green', value: 0 })
  .addThreshold({ color: 'yellow', value: 1 })
  .addThreshold({ color: 'red', value: 5 }),
  gridPos={ x: 8, y: 0, w: 4, h: 4 },
)
Best Practices
Dashboard Design
- Use consistent layouts: Request rate → Latency → Errors → Resources
- Add thresholds: Color-code values for quick understanding
- Include variables: Filter by environment, service, endpoint
- Link panels: Click error rate → see related logs
- Document: Add descriptions to panels
Performance
- Use recording rules for expensive queries
- Set appropriate time ranges (don't query months of data)
- Limit label cardinality (avoid high-cardinality labels like user_id)
- Use instant queries for stat panels
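Recording rules precompute expensive expressions on the Prometheus side so dashboards query a cheap, pre-aggregated series. A sketch (rule and file names are illustrative; the file must be listed under `rule_files` in prometheus.yml):

```yaml
# rules.yml (illustrative) — referenced via rule_files in prometheus.yml
groups:
  - name: dotnet-recording
    interval: 30s
    rules:
      - record: job:http_request_error_ratio:rate5m
        expr: |
          sum(rate(http_requests_received_total{code=~"5.."}[5m]))
          / sum(rate(http_requests_received_total[5m]))
```

Panels then chart `job:http_request_error_ratio:rate5m * 100` instead of re-evaluating the full division on every refresh.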
Interview Questions
1. What's the difference between Grafana and Prometheus?
- Prometheus: time-series database + metrics scraper + alerting rules
- Grafana: visualization layer that queries Prometheus (and other data sources)
2. How do you reduce dashboard load time?
- Use recording rules for complex queries
- Limit time range defaults
- Reduce panel count per dashboard
- Use instant queries for stats
- Avoid high-cardinality labels
3. Explain the LGTM stack
- Loki: Log aggregation (like Elasticsearch, but lighter)
- Grafana: Unified visualization for all telemetry
- Tempo: Distributed tracing storage
- Mimir: Long-term metrics storage (Prometheus-compatible)
4. How do you correlate logs, metrics, and traces?
- Use consistent labels (service name, environment)
- Add trace IDs to logs
- Configure exemplars in Prometheus
- Use Grafana's correlations feature to link panels
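Stamping trace IDs onto log events can be done with a small Serilog enricher that reads the current `System.Diagnostics.Activity`. A hypothetical sketch (the enricher class is illustrative, not part of Serilog):

```csharp
using System.Diagnostics;
using Serilog.Core;
using Serilog.Events;

// Hypothetical enricher: adds the current trace/span ids to each log event
// so Loki log lines can be cross-linked to Tempo traces in Grafana.
public class TraceIdEnricher : ILogEventEnricher
{
    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory factory)
    {
        var activity = Activity.Current;
        if (activity == null) return;

        logEvent.AddPropertyIfAbsent(
            factory.CreateProperty("trace_id", activity.TraceId.ToString()));
        logEvent.AddPropertyIfAbsent(
            factory.CreateProperty("span_id", activity.SpanId.ToString()));
    }
}

// Register it with: .Enrich.With(new TraceIdEnricher()) on the LoggerConfiguration
```

With `trace_id` in the log payload, a Loki derived field (or Grafana's correlations feature) can turn it into a one-click jump to the matching Tempo trace.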