
Taming the Data Explosion: A Guide to Managing Cardinality in OpenTelemetry

Jesse Raleigh · January 23, 2026

Introduction: The Sticker Shock Moment

You've done everything right. Your team has migrated from proprietary agents to OpenTelemetry. The instrumentation is working beautifully—traces are flowing, metrics are populating dashboards, and your SREs are finally getting the visibility they've been asking for. Then the bill arrives.

The problem isn't volume—it's high cardinality. Unique combinations of attribute values create exponential growth in your metric streams, bloating storage, slowing queries, and driving up costs far beyond what you budgeted.

High cardinality isn't just about how much data you're collecting—it's about how diverse that data is.

Key Terms

Cardinality: The number of unique values or combinations in a dataset. High cardinality means millions of unique values; low cardinality means a small, predictable set.

OpenTelemetry Collector: A vendor-agnostic data pipeline that receives, processes, and exports telemetry data (metrics, traces, logs).

Metric Stream: A unique time-series created by a distinct combination of metric name and label values in time-series databases like Prometheus.

Semantic Conventions: Standardized attribute naming patterns defined by the OpenTelemetry project to ensure consistency across implementations.

Transform Processor: An OTel Collector component that modifies telemetry data in-flight, such as dropping attributes or aggregating values.

What is Cardinality? (The LEGO Analogy)

Think of cardinality as the number of unique combinations your data can produce. To make this concrete, let's use a LEGO analogy.

Low Cardinality: The LEGO Kit

Imagine a LEGO kit with specific, numbered bags and clear instructions. Every piece has a defined place. That's what low-cardinality data looks like—structured, predictable, and easy to manage.

order_status = ["pending", "processing", "shipped", "delivered", "cancelled"]
# Only 5 possible values across millions of orders

High Cardinality: The Mixed LEGO Pile

Now imagine dumping ten different LEGO sets into one giant pile with no instructions. Every piece is unique, and finding anything specific becomes overwhelming. That's high-cardinality data—millions of unique values with no natural grouping.

order_id = "ORD-8f4e3a21-9b7c-4d1e-a5f6-2c8b9e3d7a1f"
user_id  = "USR-7c2d8e9f-3a1b-4c5e-9f6a-8d2e3b7c1a4f"
container_id = "k8s-pod-checkout-svc-7f9d8c3e2b1a-5g6h7"
# Millions of unique values across millions of requests


The Technical Impact

Here's where it gets expensive. Time-series databases like Prometheus create a unique metric stream for each distinct combination of label values. Consider a simple HTTP request duration metric:

  • 5 HTTP methods x 20 endpoints = 100 streams. Attach a unique order ID to each request and that becomes 5 x 20 x 1,000,000 = 100,000,000 streams per day.

Each of those streams requires:

  • Index space in memory
  • Storage for time-series data points
  • Computation overhead for queries that must scan millions of streams
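The multiplication above is easy to sanity-check before an attribute ever ships. A minimal back-of-envelope sketch (the method, endpoint, and order-ID counts are the illustrative figures from the bullet above):

```python
# Back-of-envelope cardinality check: the worst-case number of metric streams
# is the product of the distinct values of every attribute on the metric.
from math import prod

def stream_count(attribute_cardinalities: dict) -> int:
    """Worst-case stream count for a metric with the given attribute cardinalities."""
    return prod(attribute_cardinalities.values())

# Without the unique ID: a small, predictable set of streams.
safe = stream_count({"http.method": 5, "http.route": 20})

# With one unique ID attached, the count explodes.
exploded = stream_count({"http.method": 5, "http.route": 20, "order_id": 1_000_000})

print(safe)      # 100
print(exploded)  # 100000000
```

Running this check in a code review takes seconds and catches most cardinality mistakes before they reach production.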

Case Study: The "SparkPlug Motors" Mistake

Let's make this real with a fictional (but very common) story.


The Scenario

SparkPlug Motors is a mid-size auto parts e-commerce company. They recently adopted OpenTelemetry for their checkout service. A well-intentioned developer added detailed attributes to track checkout performance:

import time

from opentelemetry import metrics

meter = metrics.get_meter(__name__)
checkout_latency = meter.create_histogram(
    name="checkout.duration",
    description="Time to process checkout",
    unit="ms"
)
 
def process_checkout(order_id, user_id, items):
    start_time = time.time()
    result = perform_checkout(order_id, user_id, items)
    duration = (time.time() - start_time) * 1000
 
    # THE MISTAKE: Adding unique IDs as attributes
    checkout_latency.record(
        duration,
        attributes={
            "order_id": order_id,        # UNIQUE per request
            "user_id": user_id,           # UNIQUE per user
            "order_status": result.status # LOW cardinality (5 values)
        }
    )
 
    return result

The Mistake Analysis

The developer's reasoning was sound: "We might want to filter checkout latency by specific orders or users." But the implementation was fatally flawed.

Here's the cardinality math:

  • SparkPlug processes ~50,000 orders per day
  • ~10,000 active users per day
  • 30-day metric retention

Instead of creating ~5 metric streams (one per order status), they created ~1,500,000 unique streams per month. Every one of those streams stays resident in the index for the full 30-day retention window, even though each receives only a handful of data points before the order completes.
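The same arithmetic, applied to SparkPlug's own numbers from the scenario above:

```python
# SparkPlug's figures: every order_id becomes a new label value,
# so retained streams grow linearly with traffic, not with the metric's design.
orders_per_day = 50_000
retention_days = 30

streams_with_ids = orders_per_day * retention_days  # one stream per order
streams_without_ids = 5                             # one stream per order_status value

print(streams_with_ids)     # 1500000
print(streams_without_ids)  # 5
```

The ratio between those two numbers (300,000x) is the multiplier on index memory, storage, and query fan-out.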

The Consequence

Within two weeks:

  1. Prometheus memory usage spiked 600%, causing OOM-kills on their monitoring infrastructure
  2. Grafana dashboards timed out when trying to visualize P95 checkout latency
  3. Monthly observability costs jumped from $2K to $10K as their vendor charged per active time series
  4. SRE team couldn't use the data because aggregation queries were too slow during incidents

The irony? Nobody actually queried checkout latency by order_id. The attribute was never used for its intended purpose.

Strategy 1: Centralize Complexity in the Collector

The first and most impactful strategy is using the OpenTelemetry Collector as a centralized data refinery. Instead of modifying every application, you process and transform telemetry data at a single control point.

Centralized Data Processing: The OTel Collector Refinery Pattern

The OTel Collector acts as a pipeline sitting between your applications and backends. Platform teams can configure processors to enforce governance policies—stripping high-cardinality attributes, standardizing names, and routing data to appropriate destinations.

Here's how SparkPlug fixed their problem:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
 
processors:
  # Drop high-cardinality attributes from metrics
  transform/drop_high_cardinality:
    error_mode: ignore
    metric_statements:
      - context: datapoint
        statements:
          # Remove order_id and user_id from all metrics
          - delete_key(attributes, "order_id")
          - delete_key(attributes, "user_id")
          # Allow-list the business-critical low-cardinality attributes
          # (this also drops anything the delete_key calls missed)
          - keep_keys(attributes, ["order_status", "payment_method", "region"])
 
  # Batch for efficiency
  batch:
    timeout: 10s
    send_batch_size: 1024
 
exporters:
  prometheus:
    # The collector serves a scrape endpoint here; point Prometheus at it
    endpoint: "0.0.0.0:8889"
 
  # For compliance/audit, send full-fidelity data to object storage
  file:
    path: /var/log/otel/high-cardinality-archive.json
 
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [transform/drop_high_cardinality, batch]
      exporters: [prometheus]
 
    # Separate pipeline for archival (if needed for compliance)
    metrics/archive:
      receivers: [otlp]
      processors: [batch]
      exporters: [file]

Benefits of the Collector Approach

  1. No application code changes required—developers don't need to redeploy
  2. Centralized governance—the platform team controls what reaches the backend
  3. Flexibility—route high-cardinality data to cheap storage while sending aggregates to expensive backends
  4. Fast iteration—change processing rules without redeploying applications

Enterprise Governance Tip

Create a collector configuration library with pre-approved processor templates that teams can compose:

configs/
├── processors/
│   ├── drop-pii.yaml
│   ├── drop-high-cardinality-ids.yaml
│   ├── standardize-attributes.yaml
│   └── aggregate-to-histograms.yaml
└── pipelines/
    ├── web-services.yaml
    ├── data-pipelines.yaml
    └── infrastructure.yaml

Strategy 2: Aggregation and the "Inversion" Question

Aggregating to Histograms

The second strategy is to aggregate data at the source. Instead of recording raw latency values with unique identifiers, use histograms that automatically bucket values into meaningful ranges.

Here's the refactored SparkPlug code:

import logging
import time

from opentelemetry import metrics, trace

logger = logging.getLogger(__name__)
meter = metrics.get_meter(__name__)

# Histogram automatically aggregates into buckets
checkout_latency = meter.create_histogram(
    name="checkout.duration",
    description="Time to process checkout",
    unit="ms"
)

def process_checkout(order_id, user_id, items):
    start_time = time.time()
    result = perform_checkout(order_id, user_id, items)
    duration = (time.time() - start_time) * 1000

    # FIXED: Only low-cardinality attributes
    checkout_latency.record(
        duration,
        attributes={
            "order_status": result.status,      # 5 values
            "payment_method": result.payment,   # 4 values
            "region": result.region             # 3 values
        }
    )

    # High-cardinality context goes to TRACES/LOGS, not metrics
    if duration > 5000:  # Slow checkout
        span_context = trace.get_current_span().get_span_context()
        logger.warning(
            "Slow checkout detected",
            extra={
                "order_id": order_id,
                "user_id": user_id,
                "duration_ms": duration,
                "trace_id": trace.format_trace_id(span_context.trace_id)
            }
        )

    return result

The key change: order_id and user_id are removed from metric attributes and instead logged only when something interesting happens (slow checkout). This preserves debugging capability while eliminating cardinality explosion.
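One practical detail when logging trace context: in the OpenTelemetry SDKs the trace ID on a span context is an integer, and backends display it as a 32-character lowercase hex string, so format it before logging (the Python SDK ships a helper for this; a stdlib-only equivalent is shown here):

```python
def format_trace_id(trace_id: int) -> str:
    """Render an integer trace ID as the 32-char hex form tracing backends display."""
    return format(trace_id, "032x")

# Example: an integer trace ID and its hex rendering.
print(format_trace_id(0x8F4E3A219B7C4D1EA5F62C8B9E3D7A1F))
# → 8f4e3a219b7c4d1ea5f62c8b9e3d7a1f
```

Logging the raw integer instead makes trace lookup painful, since the value in your logs won't match the ID shown in your tracing UI.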

The "Inversion" Question

Most teams approach instrumentation by asking: "What data might be useful?" This leads to over-collection. Instead, invert the question: "What data would be a show-stopper if it were missing?"

Applied to SparkPlug:

  • Question: Do I need to query checkout latency by order_id 30 days from now?
  • Reality Check: No. If a specific order is slow, you'll see it in real-time logs and traces. For trends and alerting, you need aggregate percentiles by status and region.
  • Question: What if I need to debug why order #12345 was slow?
  • Answer: That's a trace query, not a metrics query. Traces are designed for request-level debugging.

The Three Pillars Pattern

The key insight is that each telemetry pillar has different cardinality tolerances:

Pillar  | Cardinality Tolerance                      | Retention  | Use Case
Metrics | LOW (hundreds to low thousands of streams) | 30-90 days | Trends, dashboards, alerting
Traces  | MEDIUM-HIGH (sampled)                      | 7-15 days  | Request-level debugging
Logs    | HIGH (indexed selectively)                 | 7-30 days  | Text search, audit trails

order_id belongs in traces and logs, not metrics. This single insight would have prevented SparkPlug's entire cardinality crisis.

Strategy 3: Standardize with Semantic Conventions

The Wild West of Naming

When different teams name attributes differently for the same concept, you get cardinality multiplication without any new information:

# Team A (E-commerce)
attributes = {"user_id": "12345"}
 
# Team B (Analytics)
attributes = {"userId": "12345"}
 
# Team C (Mobile)
attributes = {"uid": "12345"}
 
# Team D (Legacy migration)
attributes = {"customer_identifier": "12345"}

These four attributes represent the same concept but create four separate dimensions in your metrics backend. Cross-team correlation becomes impossible, and your effective cardinality quadruples.
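Until every team migrates to standard names, a normalization shim can fold known aliases onto one canonical key. A minimal sketch (the alias table is hypothetical; populate it with whatever spellings your teams actually emit — in production this mapping would more likely live in the Collector's transform processor than in application code):

```python
# Hypothetical alias table: every historical spelling maps to one canonical key.
CANONICAL_KEYS = {
    "user_id": "user.id",
    "userId": "user.id",
    "uid": "user.id",
    "customer_identifier": "user.id",
}

def normalize(attributes: dict) -> dict:
    """Rewrite known attribute-name aliases to their canonical names."""
    return {CANONICAL_KEYS.get(key, key): value for key, value in attributes.items()}

print(normalize({"userId": "12345", "region": "us-east"}))
# → {'user.id': '12345', 'region': 'us-east'}
```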

The Fix: OpenTelemetry Semantic Conventions

OpenTelemetry defines standardized attribute names for common concepts. By adopting these conventions, teams ensure consistent naming across the entire organization.

Example: HTTP Server Instrumentation

// BAD: Custom attribute names
span.SetAttributes(
    attribute.String("request_method", "POST"),
    attribute.String("url", "/checkout"),
    attribute.String("client_ip", "192.168.1.1"),
    attribute.Int("response_code", 200),
)
 
// GOOD: Semantic conventions
import semconv "go.opentelemetry.io/otel/semconv/v1.21.0"

span.SetAttributes(
    semconv.HTTPMethod("POST"),
    semconv.HTTPRoute("/checkout"),
    semconv.NetSockPeerAddr("192.168.1.1"),
    semconv.HTTPStatusCode(200),
)

Enterprise Governance: The Attribute Registry

For organization-specific attributes that go beyond OpenTelemetry's standard conventions, create an attribute registry that defines approved names, types, and cardinality expectations:

version: "1.0"
namespaces:
 
  sparkplug.order:
    attributes:
      - name: sparkplug.order.status
        type: string
        cardinality: low
        allowed_values: [pending, processing, shipped, delivered, cancelled]
        description: "Current status of order"
 
      - name: sparkplug.order.payment_method
        type: string
        cardinality: low
        allowed_values: [credit_card, paypal, apple_pay, google_pay]
        description: "Payment method used"
 
  sparkplug.user:
    attributes:
      - name: sparkplug.user.region
        type: string
        cardinality: low
        allowed_values: [us-east, us-west, eu-central, apac]
        description: "User's geographical region"
 
      # High-cardinality - TRACES/LOGS ONLY
      - name: sparkplug.user.id
        type: string
        cardinality: high
        allowed_in: [traces, logs]
        forbidden_in: [metrics]
        description: "Unique user identifier - DO NOT use in metrics"

From this registry, SparkPlug's platform team generated type-safe attribute helpers:

package sparkplug

import "go.opentelemetry.io/otel/attribute"

// Metrics-safe attributes (low cardinality)
func OrderStatus(value string) attribute.KeyValue {
    return attribute.String("sparkplug.order.status", value)
}

func PaymentMethod(value string) attribute.KeyValue {
    return attribute.String("sparkplug.order.payment_method", value)
}

// TraceOnlyAttribute is a distinct type, so passing UserID to a helper that
// accepts only attribute.KeyValue fails at compile time.
type TraceOnlyAttribute attribute.KeyValue

// High-cardinality: returns TraceOnlyAttribute, not attribute.KeyValue
func UserID(value string) TraceOnlyAttribute {
    return TraceOnlyAttribute(attribute.String("sparkplug.user.id", value))
}

The platform team then:

  1. Published a code generator that creates type-safe attribute helpers from the registry
  2. Configured the OTel Collector to validate and drop attributes not in the registry
  3. Integrated with CI/CD to flag violations during code review
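The registry check in step 3 can be a small script run in CI. A sketch, assuming the registry YAML above has been loaded into a dict (the hypothetical REGISTRY subset and the extraction of attribute names from source code are stand-ins for your real pipeline):

```python
# Subset of the registry as it would look after loading the YAML above.
REGISTRY = {
    "sparkplug.order.status": {"cardinality": "low"},
    "sparkplug.user.id": {"cardinality": "high", "forbidden_in": ["metrics"]},
}

def violations(metric_attributes: list) -> list:
    """Return a message for each attribute that is unknown or forbidden on metrics."""
    errors = []
    for name in metric_attributes:
        entry = REGISTRY.get(name)
        if entry is None:
            errors.append(f"{name}: not in registry")
        elif "metrics" in entry.get("forbidden_in", []):
            errors.append(f"{name}: forbidden in metrics")
    return errors

print(violations(["sparkplug.order.status", "sparkplug.user.id", "order_id"]))
# → ['sparkplug.user.id: forbidden in metrics', 'order_id: not in registry']
```

Failing the build on a non-empty result turns the registry from documentation into an enforced contract.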

Real-World Results: SparkPlug's Transformation

After implementing all three strategies over six weeks, SparkPlug Motors saw dramatic improvements:


Cardinality Reduction

  • Before: 1.5M active metric streams
  • After: 847 active metric streams
  • Reduction: 99.94%

Performance Improvements

  • Prometheus memory usage: 18GB to 3.2GB (82% reduction)
  • Grafana P95 query latency: 8.4s to 180ms (97% improvement)
  • Dashboard load time: 45s to 2.1s (95% improvement)

Cost Savings

  • Monthly observability costs: $10,000 to $2,400 (76% reduction)
  • Prevented need for Prometheus cluster expansion (saved $15K in infrastructure)

Developer Experience

  • Deployment velocity unchanged (no application code modifications required)
  • Incident MTTR improved by 40% (dashboards actually usable during outages)
  • Cross-team metric correlation now possible thanks to semantic conventions

Conclusion: Better Signal, Lower Bill

High cardinality is the silent budget killer in observability. It provides granular detail at a premium cost, and most high-cardinality attributes are rarely queried after the initial instrumentation.

The strategies outlined in this guide provide a practical framework:

  1. Centralize governance using the OTel Collector as your enforcement point
  2. Question uniqueness—if an attribute has millions of values, it probably belongs in traces or logs, not metrics
  3. Standardize naming through Semantic Conventions and an attribute registry
  4. Measure and iterate with regular cardinality audits

SparkPlug Motors reduced their metric streams by 99.94%, cut costs by 76%, and dramatically improved their team's ability to respond to incidents. The key wasn't collecting less data—it was collecting the right data, in the right place, at the right granularity.

Should I ever use high-cardinality attributes in metrics?

Only if you have a specific business justification and the budget to support it. For most use cases, high-cardinality data belongs in traces or logs where you can sample and index selectively.

How do I know if cardinality is my problem?

Look for signs like: Prometheus memory consumption growing faster than traffic, slow dashboard queries, increased observability vendor costs, or time-series databases running out of memory. Run count by (__name__) ({__name__=~".+"}) in Prometheus to identify the top cardinality offenders.

Won't dropping attributes lose important debugging context?

No, if you route high-cardinality data to the right pillar. Use metrics for trends and alerting, traces for request-level debugging, and logs for detailed context. The OTel Collector can send identical data to different destinations based on use case.

Can I apply these techniques to an existing deployment?

Yes. The OpenTelemetry Collector sits between your applications and backends, so you can deploy it incrementally without modifying application code. Start with one service as a pilot, validate the results, then roll out to the rest.

What's the biggest mistake teams make with cardinality?

Adding unique IDs (order IDs, user IDs, request IDs, container IDs) as metric attributes without understanding the downstream cost. Always ask: "Will I actually query by this attribute in a dashboard or alert?" If the answer is no, don't include it in metrics.

Not Sure Where to Start?

Take our free OTEL Maturity Assessment to identify gaps and get a personalized action plan.