OpenTelemetry Collector

What is the OpenTelemetry Collector?

The OpenTelemetry Collector is a vendor-agnostic proxy that receives, processes, and exports telemetry data (metrics, traces, logs). It acts as a central hub for your observability pipeline.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   App 1     │────▢│                                     │────▢│  Jaeger     β”‚
β”‚ (OTel SDK)  β”‚     β”‚                                     β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚                                     β”‚
                    β”‚      OpenTelemetry Collector        β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚                                     │────▢│  Prometheus β”‚
β”‚   App 2     │────▢│  Receivers β”‚ Processors β”‚ Exporters β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ (OTel SDK)  β”‚     β”‚                                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚                                     β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚                                     │────▢│  Grafana    β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚                                     β”‚     β”‚  Cloud      β”‚
β”‚   App 3     │────▢│                                     β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ (OTel SDK)  β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Why Use the Collector?

Benefit                Description
Vendor Agnostic        Export to multiple backends simultaneously
Decoupling             Apps don't need to know about backend changes
Processing             Filter, transform, batch, retry at the collector level
Resource Efficiency    Reduce network overhead with batching
Reliability            Built-in retries and buffering
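The multi-backend benefit follows directly from the pipeline model: one pipeline can list several exporters, and every item is fanned out to all of them. A minimal sketch (the backend names and endpoints here are placeholders):

```yaml
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317                    # placeholder address
  otlphttp/vendor:
    endpoint: https://otlp.example.com/otlp  # placeholder address

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, otlphttp/vendor]  # fan-out: each span goes to both
```

The `type/name` suffix (e.g. `otlp/jaeger`) lets the same exporter type appear more than once with different settings.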

Collector Architecture

Pipeline Components

# otel-collector-config.yaml
receivers:    # How data gets IN
processors:   # What happens to the data
exporters:    # Where data goes OUT
service:      # Ties it all together

Key Components

Receivers: Accept data from external sources

  • otlp - OpenTelemetry Protocol (gRPC/HTTP)
  • jaeger - Jaeger format
  • zipkin - Zipkin format
  • prometheus - Scrape Prometheus metrics
  • hostmetrics - System metrics (CPU, memory, disk)

Processors: Transform data

  • batch - Batch data for efficiency
  • memory_limiter - Prevent OOM
  • attributes - Add/modify attributes
  • filter - Drop unwanted data
  • tail_sampling - Smart trace sampling

Exporters: Send data to backends

  • otlp - Send to another collector or OTLP backend
  • jaeger - Jaeger backend (removed in recent Collector releases; Jaeger now ingests OTLP directly)
  • prometheus - Expose metrics on a Prometheus scrape endpoint (prometheusremotewrite handles Remote Write)
  • logging - Console output for debugging (deprecated in favor of the debug exporter in newer releases)
  • file - Write to files

Basic Configuration

Minimal Setup

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  # Console output for debugging (newer Collector releases replace this
  # with the `debug` exporter and a `verbosity` setting)
  logging:
    loglevel: debug

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging]
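One way to try this locally is to run the Collector as a container with the config file mounted in. A docker-compose sketch (the image tag and file paths are illustrative; pin a released version in practice):

```yaml
services:
  otel-collector:
    image: otel/opentelemetry-collector:latest   # pin a concrete version in practice
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
```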

Production Configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 4
      http:
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins:
            - "http://localhost:*"

  # Scrape Prometheus metrics from apps
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 15s
          static_configs:
            - targets: ['localhost:8888']

  # Collect host metrics
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      network:

processors:
  # Prevent out of memory
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200

  # Batch for efficiency
  batch:
    timeout: 5s
    send_batch_size: 10000

  # Add resource attributes
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert
      - key: service.namespace
        value: my-company
        action: upsert

  # Filter out noisy traces
  filter:
    traces:
      span:
        - 'attributes["http.target"] == "/health"'
        - 'attributes["http.target"] == "/metrics"'

  # Add attributes to all spans
  attributes:
    actions:
      - key: deployment.environment
        value: ${env:ENVIRONMENT}  # env-var substitution (older releases also accept ${ENVIRONMENT})
        action: insert

exporters:
  # Send to Jaeger (the dedicated jaeger exporter was removed in newer
  # Collector releases; there, point an otlp exporter at Jaeger's OTLP port)
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true

  # Expose metrics for Prometheus to scrape
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: myapp

  # Send to Grafana Cloud
  otlphttp:
    endpoint: https://otlp-gateway.grafana.net/otlp
    headers:
      Authorization: "Basic ${GRAFANA_CLOUD_TOKEN}"

  # Debug logging
  logging:
    loglevel: warn

service:
  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888

  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource, filter, attributes]
      exporters: [jaeger, otlphttp]

    metrics:
      receivers: [otlp, prometheus, hostmetrics]
      processors: [memory_limiter, batch, resource]
      exporters: [prometheus, otlphttp]

    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlphttp, logging]

.NET Integration

Setup OpenTelemetry in ASP.NET Core

// Install packages:
// OpenTelemetry.Exporter.OpenTelemetryProtocol
// OpenTelemetry.Extensions.Hosting
// OpenTelemetry.Instrumentation.AspNetCore
// OpenTelemetry.Instrumentation.Http
// OpenTelemetry.Instrumentation.SqlClient
// OpenTelemetry.Instrumentation.Runtime
// OpenTelemetry.Instrumentation.Process

using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

// Define resource (service identity)
var resourceBuilder = ResourceBuilder.CreateDefault()
    .AddService(
        serviceName: "OrderService",
        serviceVersion: "1.0.0",
        serviceInstanceId: Environment.MachineName)
    .AddAttributes(new Dictionary<string, object>
    {
        ["deployment.environment"] = builder.Environment.EnvironmentName,
        ["team"] = "platform"
    });

// Configure OpenTelemetry (the shared resourceBuilder above carries the
// service identity, so no separate ConfigureResource call is needed)
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .SetResourceBuilder(resourceBuilder)
        .AddAspNetCoreInstrumentation(options =>
        {
            options.RecordException = true;
            options.Filter = httpContext =>
                !httpContext.Request.Path.StartsWithSegments("/health");
        })
        .AddHttpClientInstrumentation(options =>
        {
            options.RecordException = true;
            options.FilterHttpRequestMessage = request =>
                request.RequestUri?.Host.Contains("health") != true;
        })
        .AddSqlClientInstrumentation(options =>
        {
            options.SetDbStatementForText = true;
            options.RecordException = true;
        })
        .AddSource("OrderService") // Custom ActivitySource
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("http://otel-collector:4317");
            options.Protocol = OpenTelemetry.Exporter.OtlpExportProtocol.Grpc;
        }))
    .WithMetrics(metrics => metrics
        .SetResourceBuilder(resourceBuilder)
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddProcessInstrumentation()
        .AddMeter("OrderService") // Custom Meter
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("http://otel-collector:4317");
        }));

// Configure logging to send to collector
builder.Logging.AddOpenTelemetry(options =>
{
    options.SetResourceBuilder(resourceBuilder);
    options.IncludeFormattedMessage = true;
    options.IncludeScopes = true;
    options.ParseStateValues = true;
    options.AddOtlpExporter(otlp =>
    {
        otlp.Endpoint = new Uri("http://otel-collector:4317");
    });
});

var app = builder.Build();
app.Run();
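Instead of hard-coding the Collector endpoint in Program.cs, the OTLP exporter also honors the standard OpenTelemetry environment variables, which keeps deployment-specific values out of code. A sketch (the endpoint and names are examples):

```shell
# Standard OTel SDK environment variables (values here are examples)
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_SERVICE_NAME="OrderService"
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,team=platform"
```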

Custom Instrumentation

using System.Diagnostics;
using System.Diagnostics.Metrics;
using OpenTelemetry.Trace; // RecordException extension method

public class OrderService
{
    // Custom ActivitySource for traces
    private static readonly ActivitySource ActivitySource = new("OrderService");

    // Custom Meter for metrics
    private static readonly Meter Meter = new("OrderService", "1.0.0");
    private static readonly Counter<long> OrdersCreated = Meter.CreateCounter<long>(
        "orders_created_total",
        description: "Total number of orders created");
    private static readonly Histogram<double> OrderProcessingDuration = Meter.CreateHistogram<double>(
        "order_processing_duration_seconds",
        unit: "s",
        description: "Order processing duration");

    private readonly ILogger<OrderService> _logger;

    public OrderService(ILogger<OrderService> logger) => _logger = logger;

    public async Task<Order> CreateOrderAsync(CreateOrderRequest request, CancellationToken ct)
    {
        // Start a span
        using var activity = ActivitySource.StartActivity("CreateOrder", ActivityKind.Internal);
        activity?.SetTag("order.user_id", request.UserId);
        activity?.SetTag("order.total", request.Total);

        var stopwatch = Stopwatch.StartNew();

        try
        {
            _logger.LogInformation("Creating order for user {UserId}", request.UserId);

            var order = await ProcessOrderAsync(request, ct);

            // Record metrics
            OrdersCreated.Add(1,
                new KeyValuePair<string, object?>("status", "success"),
                new KeyValuePair<string, object?>("payment_method", request.PaymentMethod));

            activity?.SetTag("order.id", order.Id);
            activity?.SetStatus(ActivityStatusCode.Ok);

            return order;
        }
        catch (Exception ex)
        {
            OrdersCreated.Add(1,
                new KeyValuePair<string, object?>("status", "failure"),
                new KeyValuePair<string, object?>("error_type", ex.GetType().Name));

            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            activity?.RecordException(ex);

            _logger.LogError(ex, "Order creation failed for user {UserId}", request.UserId);
            throw;
        }
        finally
        {
            stopwatch.Stop();
            OrderProcessingDuration.Record(
                stopwatch.Elapsed.TotalSeconds,
                new KeyValuePair<string, object?>("order_type", request.OrderType));
        }
    }
}

Kubernetes Deployment

Collector as DaemonSet (Node-level)

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest  # pin a specific version in production
          args:
            - --config=/etc/otel-collector-config.yaml
          ports:
            - containerPort: 4317  # OTLP gRPC
            - containerPort: 4318  # OTLP HTTP
            - containerPort: 8888  # Metrics
          resources:
            limits:
              memory: 500Mi
              cpu: 500m
            requests:
              memory: 200Mi
              cpu: 100m
          volumeMounts:
            - name: config
              mountPath: /etc/otel-collector-config.yaml
              subPath: otel-collector-config.yaml
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
    - name: otlp-http
      port: 4318

Collector as Deployment (Centralized)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector-gateway
  namespace: observability
spec:
  replicas: 3
  selector:
    matchLabels:
      app: otel-collector-gateway
  template:
    metadata:
      labels:
        app: otel-collector-gateway
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          args:
            - --config=/etc/otel-collector-config.yaml
          resources:
            limits:
              memory: 2Gi
              cpu: 1000m

Tail-Based Sampling

Sample traces intelligently based on outcome (errors, latency).

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      # Always keep errors
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Keep slow traces
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000

      # Sample 10% of successful traces
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

      # Keep traces with specific attributes
      - name: important-users
        type: string_attribute
        string_attribute:
          key: user.tier
          values: [premium, enterprise]

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [jaeger]
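Caveat: tail sampling only works when all spans of a trace arrive at the same Collector instance. With several gateway replicas, a common pattern is a front tier that routes by trace ID using the loadbalancing exporter from the contrib distribution. A sketch (the headless-Service hostname is an assumption):

```yaml
# First tier: route whole traces to a stable sampling-tier instance
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-collector-sampling.observability   # assumed headless Service

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```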

Interview Questions

1. Why use the OTel Collector instead of direct export?

  • Decoupling: Apps don't need backend credentials
  • Batching: Reduces network overhead
  • Processing: Filter, transform, sample at collector level
  • Reliability: Retries and buffering
  • Multi-backend: Export to multiple destinations

2. What's the difference between Agent and Gateway deployment modes?

Agent (DaemonSet):

  • Runs on every node
  • Low latency, local collection
  • Processes data close to source

Gateway (Deployment):

  • Centralized, scaled horizontally
  • Aggregates from multiple agents
  • Better for cross-cutting processing

3. Explain tail-based vs head-based sampling

Head-based: Decision at trace start (random percentage)

  • Simple, predictable
  • May miss important traces

Tail-based: Decision after trace completes

  • Can sample based on outcome (errors, latency)
  • Requires buffering, more resources
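For comparison, head-style percentage sampling is also available inside the Collector as the probabilistic_sampler processor; a minimal fragment:

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 10   # keep roughly 10% of traces, decided per trace ID
```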

4. How do you handle collector failures?

  • Use persistent queues (file-based)
  • Deploy as StatefulSet with PVCs
  • Configure retry policies
  • Use load balancer for gateway mode
  • Monitor collector health metrics
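The persistent-queue point can be sketched with the file_storage extension wired into an exporter's sending queue, so buffered data survives a restart (the endpoint, path, and settings are illustrative):

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/queue

exporters:
  otlp:
    endpoint: backend.example.com:4317
    sending_queue:
      enabled: true
      storage: file_storage    # spool the queue to disk instead of memory
    retry_on_failure:
      enabled: true

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```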