OpenTelemetry Collector
- OpenTelemetry Collector
- What is the OpenTelemetry Collector?
- Why Use the Collector?
- Collector Architecture
- Pipeline Components
- Key Components
- Basic Configuration
- Minimal Setup
- Production Configuration
- .NET Integration
- Set Up OpenTelemetry in ASP.NET Core
- Custom Instrumentation
- Kubernetes Deployment
- Collector as DaemonSet (Node-level)
- Collector as Deployment (Centralized)
- Tail-Based Sampling
- Interview Questions
- 1. Why use the OTel Collector instead of direct export?
- 2. What's the difference between Agent and Gateway deployment modes?
- 3. Explain tail-based vs head-based sampling
- 4. How do you handle collector failures?
OpenTelemetry Collector
What is the OpenTelemetry Collector?
The OpenTelemetry Collector is a vendor-agnostic proxy that receives, processes, and exports telemetry data (metrics, traces, logs). It acts as a central hub for your observability pipeline.
┌─────────────┐      ┌─────────────────────────────────────┐      ┌─────────────┐
│    App 1    │─────▶│                                     │─────▶│   Jaeger    │
│ (OTel SDK)  │      │                                     │      └─────────────┘
└─────────────┘      │                                     │
                     │       OpenTelemetry Collector       │      ┌─────────────┐
┌─────────────┐      │                                     │─────▶│ Prometheus  │
│    App 2    │─────▶│  Receivers → Processors → Exporters │      └─────────────┘
│ (OTel SDK)  │      │                                     │
└─────────────┘      │                                     │      ┌─────────────┐
                     │                                     │─────▶│   Grafana   │
┌─────────────┐      │                                     │      │    Cloud    │
│    App 3    │─────▶│                                     │      └─────────────┘
│ (OTel SDK)  │      └─────────────────────────────────────┘
└─────────────┘
Why Use the Collector?
| Benefit | Description |
|---|---|
| Vendor Agnostic | Export to multiple backends simultaneously |
| Decoupling | Apps don't need to know about backend changes |
| Processing | Filter, transform, batch, retry at the collector level |
| Resource Efficiency | Reduce network overhead with batching |
| Reliability | Built-in retries and buffering |
Collector Architecture
Pipeline Components
# otel-collector-config.yaml
receivers: # How data gets IN
processors: # What happens to the data
exporters: # Where data goes OUT
service: # Ties it all together
Key Components
Receivers: Accept data from external sources
- `otlp` - OpenTelemetry Protocol (gRPC/HTTP)
- `jaeger` - Jaeger format
- `zipkin` - Zipkin format
- `prometheus` - Scrape Prometheus metrics
- `hostmetrics` - System metrics (CPU, memory, disk)
Processors: Transform data
- `batch` - Batch data for efficiency
- `memory_limiter` - Prevent OOM
- `attributes` - Add/modify attributes
- `filter` - Drop unwanted data
- `tail_sampling` - Smart trace sampling
Exporters: Send data to backends
- `otlp` - Send to another collector or OTLP backend
- `jaeger` - Jaeger backend
- `prometheus` - Prometheus Remote Write
- `logging` - Console output (debugging)
- `file` - Write to files
Basic Configuration
Minimal Setup
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 1s
send_batch_size: 1024
exporters:
logging:
loglevel: debug
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch]
exporters: [logging]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [logging]
logs:
receivers: [otlp]
processors: [batch]
exporters: [logging]
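To try the minimal config locally, the collector can be run as a container. A compose-style sketch, assuming the config file sits next to the compose file (the `latest` tag and mount path are illustrative; pin a version in real use):

```yaml
# docker-compose.yaml (sketch)
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8888:8888"   # the collector's own metrics
```

With the `logging` exporter at debug level, any OTLP data sent to ports 4317/4318 shows up in the container's stdout, which makes this a quick way to verify SDK wiring.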
Production Configuration
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
max_recv_msg_size_mib: 4
http:
endpoint: 0.0.0.0:4318
cors:
allowed_origins:
- "http://localhost:*"
# Scrape Prometheus metrics from apps
prometheus:
config:
scrape_configs:
- job_name: 'otel-collector'
scrape_interval: 15s
static_configs:
- targets: ['localhost:8888']
# Collect host metrics
hostmetrics:
collection_interval: 30s
scrapers:
cpu:
memory:
disk:
network:
processors:
# Prevent out of memory
memory_limiter:
check_interval: 1s
limit_mib: 1000
spike_limit_mib: 200
# Batch for efficiency
batch:
timeout: 5s
send_batch_size: 10000
# Add resource attributes
resource:
attributes:
- key: environment
value: production
action: upsert
- key: service.namespace
value: my-company
action: upsert
# Filter out noisy traces
filter:
traces:
span:
- 'attributes["http.target"] == "/health"'
- 'attributes["http.target"] == "/metrics"'
# Add attributes to all spans
attributes:
actions:
- key: deployment.environment
value: ${ENVIRONMENT}
action: insert
exporters:
# Send to Jaeger
jaeger:
endpoint: jaeger:14250
tls:
insecure: true
# Send to Prometheus
prometheus:
endpoint: 0.0.0.0:8889
namespace: myapp
# Send to Grafana Cloud
otlphttp:
endpoint: https://otlp-gateway.grafana.net/otlp
headers:
Authorization: "Basic ${GRAFANA_CLOUD_TOKEN}"
# Debug logging
logging:
loglevel: warn
service:
telemetry:
logs:
level: info
metrics:
address: 0.0.0.0:8888
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch, resource, filter, attributes]
exporters: [jaeger, otlphttp]
metrics:
receivers: [otlp, prometheus, hostmetrics]
processors: [memory_limiter, batch, resource]
exporters: [prometheus, otlphttp]
logs:
receivers: [otlp]
processors: [memory_limiter, batch, resource]
exporters: [otlphttp, logging]
.NET Integration
Set Up OpenTelemetry in ASP.NET Core
// Install packages:
// OpenTelemetry.Exporter.OpenTelemetryProtocol
// OpenTelemetry.Extensions.Hosting
// OpenTelemetry.Instrumentation.AspNetCore
// OpenTelemetry.Instrumentation.Http
// OpenTelemetry.Instrumentation.SqlClient
using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;
var builder = WebApplication.CreateBuilder(args);
// Define resource (service identity)
var resourceBuilder = ResourceBuilder.CreateDefault()
.AddService(
serviceName: "OrderService",
serviceVersion: "1.0.0",
serviceInstanceId: Environment.MachineName)
.AddAttributes(new Dictionary<string, object>
{
["deployment.environment"] = builder.Environment.EnvironmentName,
["team"] = "platform"
});
// Configure OpenTelemetry
builder.Services.AddOpenTelemetry()
.WithTracing(tracing => tracing
.SetResourceBuilder(resourceBuilder)
.AddAspNetCoreInstrumentation(options =>
{
options.RecordException = true;
options.Filter = httpContext =>
!httpContext.Request.Path.StartsWithSegments("/health");
})
.AddHttpClientInstrumentation(options =>
{
options.RecordException = true;
options.FilterHttpRequestMessage = request =>
!request.RequestUri?.Host.Contains("health") ?? true;
})
.AddSqlClientInstrumentation(options =>
{
options.SetDbStatementForText = true;
options.RecordException = true;
})
.AddSource("OrderService") // Custom ActivitySource
.AddOtlpExporter(options =>
{
options.Endpoint = new Uri("http://otel-collector:4317");
options.Protocol = OpenTelemetry.Exporter.OtlpExportProtocol.Grpc;
}))
.WithMetrics(metrics => metrics
.SetResourceBuilder(resourceBuilder)
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddRuntimeInstrumentation()
.AddProcessInstrumentation()
.AddMeter("OrderService") // Custom Meter
.AddOtlpExporter(options =>
{
options.Endpoint = new Uri("http://otel-collector:4317");
}));
// Configure logging to send to collector
builder.Logging.AddOpenTelemetry(options =>
{
options.SetResourceBuilder(resourceBuilder);
options.IncludeFormattedMessage = true;
options.IncludeScopes = true;
options.ParseStateValues = true;
options.AddOtlpExporter(otlp =>
{
otlp.Endpoint = new Uri("http://otel-collector:4317");
});
});
var app = builder.Build();
app.Run();
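Rather than hard-coding the collector endpoint, the .NET OTLP exporter also honors the standard `OTEL_*` environment variables when no endpoint is set in code, which keeps deployment details out of the application. A compose-style sketch (the service and image names are illustrative):

```yaml
services:
  order-service:
    image: order-service:latest   # illustrative image name
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
      - OTEL_EXPORTER_OTLP_PROTOCOL=grpc
      - OTEL_SERVICE_NAME=OrderService
      - OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,team=platform
```

The same variables work across OpenTelemetry SDKs, so one convention covers polyglot services.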
Custom Instrumentation
using System.Diagnostics;
using System.Diagnostics.Metrics;
using OpenTelemetry.Trace; // for activity.RecordException(...)
public class OrderService
{
// Custom ActivitySource for traces
private static readonly ActivitySource ActivitySource = new("OrderService");
// Custom Meter for metrics
private static readonly Meter Meter = new("OrderService", "1.0.0");
private static readonly Counter<long> OrdersCreated = Meter.CreateCounter<long>(
"orders_created_total",
description: "Total number of orders created");
private static readonly Histogram<double> OrderProcessingDuration = Meter.CreateHistogram<double>(
"order_processing_duration_seconds",
unit: "s",
description: "Order processing duration");
private readonly ILogger<OrderService> _logger;
public OrderService(ILogger<OrderService> logger) => _logger = logger;
public async Task<Order> CreateOrderAsync(CreateOrderRequest request, CancellationToken ct)
{
// Start a span
using var activity = ActivitySource.StartActivity("CreateOrder", ActivityKind.Internal);
activity?.SetTag("order.user_id", request.UserId);
activity?.SetTag("order.total", request.Total);
var stopwatch = Stopwatch.StartNew();
try
{
_logger.LogInformation("Creating order for user {UserId}", request.UserId);
var order = await ProcessOrderAsync(request, ct);
// Record metrics
OrdersCreated.Add(1,
new KeyValuePair<string, object?>("status", "success"),
new KeyValuePair<string, object?>("payment_method", request.PaymentMethod));
activity?.SetTag("order.id", order.Id);
activity?.SetStatus(ActivityStatusCode.Ok);
return order;
}
catch (Exception ex)
{
OrdersCreated.Add(1,
new KeyValuePair<string, object?>("status", "failure"),
new KeyValuePair<string, object?>("error_type", ex.GetType().Name));
activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
activity?.RecordException(ex);
_logger.LogError(ex, "Order creation failed for user {UserId}", request.UserId);
throw;
}
finally
{
stopwatch.Stop();
OrderProcessingDuration.Record(
stopwatch.Elapsed.TotalSeconds,
new KeyValuePair<string, object?>("order_type", request.OrderType));
}
}
}
Kubernetes Deployment
Collector as DaemonSet (Node-level)
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: otel-collector
namespace: observability
spec:
selector:
matchLabels:
app: otel-collector
template:
metadata:
labels:
app: otel-collector
spec:
containers:
- name: otel-collector
image: otel/opentelemetry-collector-contrib:latest
args:
- --config=/etc/otel-collector-config.yaml
ports:
- containerPort: 4317 # OTLP gRPC
- containerPort: 4318 # OTLP HTTP
- containerPort: 8888 # Metrics
resources:
limits:
memory: 500Mi
cpu: 500m
requests:
memory: 200Mi
cpu: 100m
volumeMounts:
- name: config
mountPath: /etc/otel-collector-config.yaml
subPath: otel-collector-config.yaml
volumes:
- name: config
configMap:
name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
name: otel-collector
namespace: observability
spec:
selector:
app: otel-collector
ports:
- name: otlp-grpc
port: 4317
- name: otlp-http
port: 4318
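The DaemonSet above mounts its configuration from a ConfigMap named `otel-collector-config`. A minimal sketch of that object (the embedded collector config is abbreviated, and the gateway address is illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: observability
data:
  otel-collector-config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      batch:
    exporters:
      otlphttp:
        endpoint: http://otel-collector-gateway:4318   # illustrative
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlphttp]
```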
Collector as Deployment (Centralized)
apiVersion: apps/v1
kind: Deployment
metadata:
name: otel-collector-gateway
namespace: observability
spec:
replicas: 3
selector:
matchLabels:
app: otel-collector-gateway
template:
metadata:
labels:
app: otel-collector-gateway
spec:
containers:
- name: otel-collector
image: otel/opentelemetry-collector-contrib:latest
args:
- --config=/etc/otel-collector-config.yaml
resources:
limits:
memory: 2Gi
cpu: 1000m
Tail-Based Sampling
Tail-based sampling defers the keep/drop decision until a whole trace has been collected, so you can sample based on outcome (errors, latency) rather than guessing up front.
processors:
tail_sampling:
decision_wait: 10s
num_traces: 100000
expected_new_traces_per_sec: 1000
policies:
# Always keep errors
- name: errors
type: status_code
status_code:
status_codes: [ERROR]
# Keep slow traces
- name: slow-traces
type: latency
latency:
threshold_ms: 1000
# Sample 10% of successful traces
- name: probabilistic
type: probabilistic
probabilistic:
sampling_percentage: 10
# Keep traces with specific attributes
- name: important-users
type: string_attribute
string_attribute:
key: user.tier
values: [premium, enterprise]
service:
pipelines:
traces:
receivers: [otlp]
processors: [tail_sampling, batch]
exporters: [jaeger]
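Note that tail sampling only works if every span of a trace reaches the same collector instance. With a multi-replica gateway, a common pattern is a thin routing tier using the `loadbalancing` exporter to partition spans by trace ID; a sketch (the headless-service DNS name is illustrative):

```yaml
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-collector-gateway-headless.observability.svc.cluster.local
```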
Interview Questions
1. Why use the OTel Collector instead of direct export?
- Decoupling: Apps don't need backend credentials
- Batching: Reduces network overhead
- Processing: Filter, transform, sample at collector level
- Reliability: Retries and buffering
- Multi-backend: Export to multiple destinations
2. What's the difference between Agent and Gateway deployment modes?
Agent (DaemonSet):
- Runs on every node
- Low latency, local collection
- Processes data close to source
Gateway (Deployment):
- Centralized, scaled horizontally
- Aggregates from multiple agents
- Better for cross-cutting processing
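In practice the two modes are combined: each node-level agent does only lightweight processing and forwards everything to the gateway over OTLP. A sketch of the agent-side pipeline (the gateway address is illustrative):

```yaml
exporters:
  otlp:
    endpoint: otel-collector-gateway.observability.svc.cluster.local:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```

Heavy, cross-cutting work such as tail sampling then lives only in the gateway config.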
3. Explain tail-based vs head-based sampling
Head-based: Decision at trace start (random percentage)
- Simple, predictable
- May miss important traces
Tail-based: Decision after trace completes
- Can sample based on outcome (errors, latency)
- Requires buffering, more resources
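Head-based sampling at the collector can be done with the `probabilistic_sampler` processor, which decides per trace ID with no buffering:

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 10   # keep roughly 10% of traces
```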
4. How do you handle collector failures?
- Use persistent queues (file-based)
- Deploy as StatefulSet with PVCs
- Configure retry policies
- Use load balancer for gateway mode
- Monitor collector health metrics
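The first points can be sketched with the `file_storage` extension backing the exporter's sending queue, so buffered data survives a collector restart (the directory, endpoint, and sizes are illustrative):

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/queue
exporters:
  otlp:
    endpoint: backend:4317
    sending_queue:
      enabled: true
      storage: file_storage   # persist the queue on disk
      queue_size: 5000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_elapsed_time: 300s
service:
  extensions: [file_storage]
```

In Kubernetes, pairing this with a StatefulSet and a PVC keeps the queue directory across pod rescheduling.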