OpenTelemetry Collector

What is the OpenTelemetry Collector?

The OpenTelemetry Collector is a vendor-agnostic proxy that receives, processes, and exports telemetry data (metrics, traces, logs). It acts as a central hub for your observability pipeline.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   App 1     │────▢│                                     │────▢│  Jaeger     β”‚
β”‚ (OTel SDK)  β”‚     β”‚                                     β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚                                     β”‚
                    β”‚      OpenTelemetry Collector        β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚                                     │────▢│  Prometheus β”‚
β”‚   App 2     │────▢│  Receivers β”‚ Processors β”‚ Exporters β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ (OTel SDK)  β”‚     β”‚                                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚                                     β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚                                     │────▢│  Grafana    β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚                                     β”‚     β”‚  Cloud      β”‚
β”‚   App 3     │────▢│                                     β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ (OTel SDK)  β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Why Use the Collector?

Benefit                Description
Vendor Agnostic        Export to multiple backends simultaneously
Decoupling             Apps don't need to know about backend changes
Processing             Filter, transform, batch, retry at the collector level
Resource Efficiency    Reduce network overhead with batching
Reliability            Built-in retries and buffering
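The multi-backend benefit follows directly from the pipeline model: one pipeline can list several exporters, and every item is fanned out to all of them. A minimal sketch (the backend names and endpoints here are placeholders):

```yaml
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317                    # placeholder address
  otlphttp/vendor:
    endpoint: https://otlp.example.com/otlp  # placeholder address

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, otlphttp/vendor]  # fan-out: each span goes to both
```

The `type/name` suffix (e.g. `otlp/jaeger`) lets the same exporter type appear more than once with different settings.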

Collector Architecture

Pipeline Components

# otel-collector-config.yaml
receivers:    # How data gets IN
processors:   # What happens to the data
exporters:    # Where data goes OUT
service:      # Ties it all together

Key Components

Receivers: Accept data from external sources

  • otlp - OpenTelemetry Protocol (gRPC/HTTP)
  • jaeger - Jaeger format
  • zipkin - Zipkin format
  • prometheus - Scrape Prometheus metrics
  • hostmetrics - System metrics (CPU, memory, disk)

Processors: Transform data

  • batch - Batch data for efficiency
  • memory_limiter - Prevent OOM
  • attributes - Add/modify attributes
  • filter - Drop unwanted data
  • tail_sampling - Smart trace sampling

Exporters: Send data to backends

  • otlp - Send to another collector or OTLP backend
  • jaeger - Jaeger backend (removed in recent Collector releases; Jaeger now ingests OTLP directly)
  • prometheus - Expose metrics on a Prometheus scrape endpoint (prometheusremotewrite handles Remote Write)
  • logging - Console output for debugging (deprecated in favor of the debug exporter in newer releases)
  • file - Write to files

Basic Configuration

Minimal Setup

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  # Console output for debugging (newer Collector releases replace this
  # with the `debug` exporter and a `verbosity` setting)
  logging:
    loglevel: debug

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging]
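One way to try this locally is to run the Collector as a container with the config file mounted in. A docker-compose sketch (the image tag and file paths are illustrative; pin a released version in practice):

```yaml
services:
  otel-collector:
    image: otel/opentelemetry-collector:latest   # pin a concrete version in practice
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
```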

Production Configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 4
      http:
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins:
            - "http://localhost:*"

  # Scrape Prometheus metrics from apps
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 15s
          static_configs:
            - targets: ['localhost:8888']

  # Collect host metrics
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      network:

processors:
  # Prevent out of memory
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200

  # Batch for efficiency
  batch:
    timeout: 5s
    send_batch_size: 10000

  # Add resource attributes
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert
      - key: service.namespace
        value: my-company
        action: upsert

  # Filter out noisy traces
  filter:
    traces:
      span:
        - 'attributes["http.target"] == "/health"'
        - 'attributes["http.target"] == "/metrics"'

  # Add attributes to all spans
  attributes:
    actions:
      - key: deployment.environment
        value: ${env:ENVIRONMENT}  # env-var substitution (older releases also accept ${ENVIRONMENT})
        action: insert

exporters:
  # Send to Jaeger (the dedicated jaeger exporter was removed in newer
  # Collector releases; there, point an otlp exporter at Jaeger's OTLP port)
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true

  # Expose metrics for Prometheus to scrape
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: myapp

  # Send to Grafana Cloud
  otlphttp:
    endpoint: https://otlp-gateway.grafana.net/otlp
    headers:
      Authorization: "Basic ${GRAFANA_CLOUD_TOKEN}"

  # Debug logging
  logging:
    loglevel: warn

service:
  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888

  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource, filter, attributes]
      exporters: [jaeger, otlphttp]

    metrics:
      receivers: [otlp, prometheus, hostmetrics]
      processors: [memory_limiter, batch, resource]
      exporters: [prometheus, otlphttp]

    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlphttp, logging]

.NET Integration

Setup OpenTelemetry in ASP.NET Core

// Install packages:
// OpenTelemetry.Exporter.OpenTelemetryProtocol
// OpenTelemetry.Extensions.Hosting
// OpenTelemetry.Instrumentation.AspNetCore
// OpenTelemetry.Instrumentation.Http
// OpenTelemetry.Instrumentation.SqlClient
// OpenTelemetry.Instrumentation.Runtime
// OpenTelemetry.Instrumentation.Process

using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

// Define resource (service identity)
var resourceBuilder = ResourceBuilder.CreateDefault()
    .AddService(
        serviceName: "OrderService",
        serviceVersion: "1.0.0",
        serviceInstanceId: Environment.MachineName)
    .AddAttributes(new Dictionary<string, object>
    {
        ["deployment.environment"] = builder.Environment.EnvironmentName,
        ["team"] = "platform"
    });

// Configure OpenTelemetry (the shared resourceBuilder above carries the
// service identity, so no separate ConfigureResource call is needed)
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .SetResourceBuilder(resourceBuilder)
        .AddAspNetCoreInstrumentation(options =>
        {
            options.RecordException = true;
            options.Filter = httpContext =>
                !httpContext.Request.Path.StartsWithSegments("/health");
        })
        .AddHttpClientInstrumentation(options =>
        {
            options.RecordException = true;
            options.FilterHttpRequestMessage = request =>
                request.RequestUri?.Host.Contains("health") != true;
        })
        .AddSqlClientInstrumentation(options =>
        {
            options.SetDbStatementForText = true;
            options.RecordException = true;
        })
        .AddSource("OrderService") // Custom ActivitySource
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("http://otel-collector:4317");
            options.Protocol = OpenTelemetry.Exporter.OtlpExportProtocol.Grpc;
        }))
    .WithMetrics(metrics => metrics
        .SetResourceBuilder(resourceBuilder)
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddProcessInstrumentation()
        .AddMeter("OrderService") // Custom Meter
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("http://otel-collector:4317");
        }));

// Configure logging to send to collector
builder.Logging.AddOpenTelemetry(options =>
{
    options.SetResourceBuilder(resourceBuilder);
    options.IncludeFormattedMessage = true;
    options.IncludeScopes = true;
    options.ParseStateValues = true;
    options.AddOtlpExporter(otlp =>
    {
        otlp.Endpoint = new Uri("http://otel-collector:4317");
    });
});

var app = builder.Build();
app.Run();
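Instead of hard-coding the Collector endpoint in Program.cs, the OTLP exporter also honors the standard OpenTelemetry environment variables, which keeps deployment-specific values out of code. A sketch (the endpoint and names are examples):

```shell
# Standard OTel SDK environment variables (values here are examples)
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_SERVICE_NAME="OrderService"
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,team=platform"
```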

Custom Instrumentation

using System.Diagnostics;
using System.Diagnostics.Metrics;
using OpenTelemetry.Trace; // RecordException extension method

public class OrderService
{
    // Custom ActivitySource for traces
    private static readonly ActivitySource ActivitySource = new("OrderService");

    // Custom Meter for metrics
    private static readonly Meter Meter = new("OrderService", "1.0.0");
    private static readonly Counter<long> OrdersCreated = Meter.CreateCounter<long>(
        "orders_created_total",
        description: "Total number of orders created");
    private static readonly Histogram<double> OrderProcessingDuration = Meter.CreateHistogram<double>(
        "order_processing_duration_seconds",
        unit: "s",
        description: "Order processing duration");

    private readonly ILogger<OrderService> _logger;

    public OrderService(ILogger<OrderService> logger) => _logger = logger;

    public async Task<Order> CreateOrderAsync(CreateOrderRequest request, CancellationToken ct)
    {
        // Start a span
        using var activity = ActivitySource.StartActivity("CreateOrder", ActivityKind.Internal);
        activity?.SetTag("order.user_id", request.UserId);
        activity?.SetTag("order.total", request.Total);

        var stopwatch = Stopwatch.StartNew();

        try
        {
            _logger.LogInformation("Creating order for user {UserId}", request.UserId);

            var order = await ProcessOrderAsync(request, ct);

            // Record metrics
            OrdersCreated.Add(1,
                new KeyValuePair<string, object?>("status", "success"),
                new KeyValuePair<string, object?>("payment_method", request.PaymentMethod));

            activity?.SetTag("order.id", order.Id);
            activity?.SetStatus(ActivityStatusCode.Ok);

            return order;
        }
        catch (Exception ex)
        {
            OrdersCreated.Add(1,
                new KeyValuePair<string, object?>("status", "failure"),
                new KeyValuePair<string, object?>("error_type", ex.GetType().Name));

            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            activity?.RecordException(ex);

            _logger.LogError(ex, "Order creation failed for user {UserId}", request.UserId);
            throw;
        }
        finally
        {
            stopwatch.Stop();
            OrderProcessingDuration.Record(
                stopwatch.Elapsed.TotalSeconds,
                new KeyValuePair<string, object?>("order_type", request.OrderType));
        }
    }
}

Kubernetes Deployment

Collector as DaemonSet (Node-level)

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest  # pin a specific version in production
          args:
            - --config=/etc/otel-collector-config.yaml
          ports:
            - containerPort: 4317  # OTLP gRPC
            - containerPort: 4318  # OTLP HTTP
            - containerPort: 8888  # Metrics
          resources:
            limits:
              memory: 500Mi
              cpu: 500m
            requests:
              memory: 200Mi
              cpu: 100m
          volumeMounts:
            - name: config
              mountPath: /etc/otel-collector-config.yaml
              subPath: otel-collector-config.yaml
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
    - name: otlp-http
      port: 4318

Collector as Deployment (Centralized)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector-gateway
  namespace: observability
spec:
  replicas: 3
  selector:
    matchLabels:
      app: otel-collector-gateway
  template:
    metadata:
      labels:
        app: otel-collector-gateway
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          args:
            - --config=/etc/otel-collector-config.yaml
          resources:
            limits:
              memory: 2Gi
              cpu: 1000m

Tail-Based Sampling

Sample traces intelligently based on outcome (errors, latency).

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      # Always keep errors
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Keep slow traces
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000

      # Sample 10% of successful traces
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

      # Keep traces with specific attributes
      - name: important-users
        type: string_attribute
        string_attribute:
          key: user.tier
          values: [premium, enterprise]

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [jaeger]
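Caveat: tail sampling only works when all spans of a trace arrive at the same Collector instance. With several gateway replicas, a common pattern is a front tier that routes by trace ID using the loadbalancing exporter from the contrib distribution. A sketch (the headless-Service hostname is an assumption):

```yaml
# First tier: route whole traces to a stable sampling-tier instance
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-collector-sampling.observability   # assumed headless Service

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```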

Interview Questions

1. Why use the OTel Collector instead of direct export?

  • Decoupling: Apps don't need backend credentials
  • Batching: Reduces network overhead
  • Processing: Filter, transform, sample at collector level
  • Reliability: Retries and buffering
  • Multi-backend: Export to multiple destinations

2. What's the difference between Agent and Gateway deployment modes?

Agent (DaemonSet):

  • Runs on every node
  • Low latency, local collection
  • Processes data close to source

Gateway (Deployment):

  • Centralized, scaled horizontally
  • Aggregates from multiple agents
  • Better for cross-cutting processing

3. Explain tail-based vs head-based sampling

Head-based: Decision at trace start (random percentage)

  • Simple, predictable
  • May miss important traces

Tail-based: Decision after trace completes

  • Can sample based on outcome (errors, latency)
  • Requires buffering, more resources
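For comparison, head-style percentage sampling is also available inside the Collector as the probabilistic_sampler processor; a minimal fragment:

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 10   # keep roughly 10% of traces, decided per trace ID
```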

4. How do you handle collector failures?

  • Use persistent queues (file-based)
  • Deploy as StatefulSet with PVCs
  • Configure retry policies
  • Use load balancer for gateway mode
  • Monitor collector health metrics
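The persistent-queue point can be sketched with the file_storage extension wired into an exporter's sending queue, so buffered data survives a restart (the endpoint, path, and settings are illustrative):

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/queue

exporters:
  otlp:
    endpoint: backend.example.com:4317
    sending_queue:
      enabled: true
      storage: file_storage    # spool the queue to disk instead of memory
    retry_on_failure:
      enabled: true

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```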