Spring Boot + Prometheus Metric Cardinality Explosion: Memory Spiking from Careless Labels? An Aggregation and Sampling Approach - Server-Side Development Blog

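Anyone who runs a monitoring system has probably hit this problem: Prometheus memory usage keeps climbing, dashboards load more and more slowly, and eventually the server crashes with an OOM. Investigation then shows that the culprit is a single endpoint whose label takes too many distinct values, causing a metric cardinality explosion.

I ran into exactly this case once: an endpoint attached the user ID as a label, and with several million active users in production that label had several million distinct values. A single metric ballooned to millions of time series almost at once, and Prometheus ran out of memory and CPU.

This post looks at the Prometheus cardinality explosion problem and how to address it with an aggregation and sampling approach.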

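What is metric cardinality explosion?

1. A quick review of the Prometheus metric model

A Prometheus time series is identified by two parts:

metric name + label combination = one unique time series

For example:
api_request_total{method="GET", status="200", path="/users"} = 12345
api_request_total{method="POST", status="200", path="/orders"} = 67890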

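2. The label cardinality problem

The number of time series a metric can produce is bounded by the product of the number of distinct values of each label:

metric cardinality = (values of method) × (values of status) × (values of path)

Suppose:
- method: 5 values (GET, POST, PUT, DELETE, PATCH)
- status: 10 values (200, 201, 400, 401, 403, 404, 500, 502, 503, 504)
- path: 100 values

cardinality = 5 × 10 × 100 = 5,000 time series

That looks manageable. But suppose the path label is replaced by user_id:

- user_id: 1,000,000 values (every user ID is a different value)

cardinality = 5 × 10 × 1,000,000 = 50,000,000 time series!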
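The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: `estimate_cardinality` and the two dictionaries are illustrative names, and the result is an upper bound, since Prometheus only stores series for label combinations that actually occur.

```python
from math import prod


def estimate_cardinality(label_value_counts):
    """Upper bound on series count: product of distinct values per label."""
    return prod(label_value_counts.values())


bounded = {"method": 5, "status": 10, "path": 100}
unbounded = {"method": 5, "status": 10, "user_id": 1_000_000}

print(estimate_cardinality(bounded))    # 5000
print(estimate_cardinality(unbounded))  # 50000000
```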

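3. What cardinality explosion costs you

The chain of effects:

1. Prometheus memory usage surges
   ↓
2. Scrapes take longer and fail with timeouts
   ↓
3. Query performance degrades and PromQL queries time out
   ↓
4. Alerts fire late or not at all
   ↓
5. In severe cases Prometheus is OOM-killed and restarts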

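Labels that commonly cause cardinality explosion

1. Business IDs such as user IDs and order IDs

// Wrong: a business ID used as a label
api_request_total{user_id="123456", path="/api/orders"}

// Right: keep business IDs out of labels
api_request_total{path="/api/orders"}

2. Specific numeric parameters

// Wrong: pagination parameters used as labels
user_list_total{page="1", page_size="20"}

// Right: aggregate the pagination parameters away or drop them
user_list_total{}

3. Specific IPs or host names

// Wrong: a specific IP used as a label
http_requests{source_ip="192.168.1.100"}

// Right: aggregate by IP range or data center
http_requests{datacenter="dc1", region="us-east"}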
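When aggregating by IP range, the mapping from address to range can be done before the value ever becomes a label. The sketch below uses the standard `ipaddress` module; the function name `ip_to_subnet_label` and the /24 prefix are assumptions for illustration, and whether a /24 is coarse enough depends on how large the address space in question is.

```python
import ipaddress


def ip_to_subnet_label(ip, prefix_len=24):
    """Collapse an IPv4 address into its enclosing subnet (a /24 by default)."""
    network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
    return str(network)


print(ip_to_subnet_label("192.168.1.100"))  # 192.168.1.0/24
print(ip_to_subnet_label("192.168.1.7"))    # 192.168.1.0/24
```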

4. Random values such as UUIDs and tokens

// Wrong: a token used as a label
api_calls{token="eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."}

// Right: drop this kind of label
api_calls{}
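One way to apply all four rules at once is to pass labels through an allowlist before recording a metric, so that anything not explicitly approved is dropped. A minimal sketch, assuming a hypothetical allowlist `ALLOWED_LABELS` and helper `sanitize_labels`; neither comes from a real library.

```python
# Hypothetical allowlist: only these label names may reach the metrics library.
ALLOWED_LABELS = {"method", "status", "path"}


def sanitize_labels(labels, allowed=ALLOWED_LABELS):
    """Keep only allowlisted label names; every other label is dropped."""
    return {name: value for name, value in labels.items() if name in allowed}


raw = {"method": "GET", "path": "/api/orders", "user_id": "123456", "token": "abc"}
print(sanitize_labels(raw))  # {'method': 'GET', 'path': '/api/orders'}
```

An allowlist fails closed: a new high-cardinality label added by mistake is dropped by default rather than exported.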

Solution: aggregation and sampling

Our approach uses a "defense in depth" strategy:

┌─────────────────────────────────────────────────────────────────┐
│               Overall Cardinality Governance Plan               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Layer 1: Label design rules (prevention)                      │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ • No high-cardinality labels (user ID, IP, etc.)        │   │
│   │ • Use business-dimension labels (endpoint, service)     │   │
│   │ • Prefer static labels                                  │   │
│   └─────────────────────────────────────────────────────────┘   │
│                           ↓                                     │
│   Layer 2: Runtime aggregation (for existing metrics)           │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ • Pre-aggregate with recording rules                    │   │
│   │ • Build a hierarchical metric system                    │   │
│   │ • Clean up historical data regularly                    │   │
│   └─────────────────────────────────────────────────────────┘   │
│                           ↓                                     │
│   Layer 3: Smart sampling (extreme cases)                       │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ • Auto-sample high-cardinality metrics                  │   │
│   │ • Keep key dimensions, drop low-value labels            │   │
│   │ • Persist sampled data periodically for analysis        │   │
│   └─────────────────────────────────────────────────────────┘   │
│                           ↓                                     │
│   Layer 4: Alerting and monitoring (detection)                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ • Alert on series-count growth rate                     │   │
│   │ • Alert on per-metric series count                      │   │
│   │ • Alert on memory usage                                 │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

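Core component design

1. Label rule checker

Check label configuration at build time or at runtime:

Core logic:

import re


class LabelValidationError(ValueError):
    """Raised when a label violates the naming or value rules."""


class LabelValidator:
    MAX_VALUE_LENGTH = 200

    def __init__(self):
        # Label names that usually carry unbounded values
        self.forbidden_name_patterns = [
            r'.*_id$',      # names ending in _id
            r'.*_token$',   # names ending in _token
            r'.*_uuid$',    # names ending in _uuid
        ]
        # An IP address shows up in the label value, not in the label name
        self.ip_pattern = re.compile(r'\d+\.\d+\.\d+\.\d+')

    def validate(self, metric_name, labels):
        for label_name, label_value in labels.items():
            for pattern in self.forbidden_name_patterns:
                if re.match(pattern, label_name):
                    raise LabelValidationError(
                        f"Label '{label_name}' in metric '{metric_name}' "
                        f"matches forbidden pattern '{pattern}'"
                    )

            value = str(label_value)
            if self.ip_pattern.fullmatch(value):
                raise LabelValidationError(
                    f"Label '{label_name}' in metric '{metric_name}' "
                    f"has an IP address as its value"
                )

            # Check the length of the label value
            if len(value) > self.MAX_VALUE_LENGTH:
                raise LabelValidationError(
                    f"Value of label '{label_name}' is too long ({len(value)} characters)"
                )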
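For the build-time side of this check, a CI job usually wants the full list of violations rather than an exception on the first one. The following is a minimal sketch under that assumption; `find_violations`, `FORBIDDEN_NAME_PATTERNS`, and the shape of `definitions` (metric name mapped to its label names) are illustrative, not part of any existing tool.

```python
import re

# Same idea as the validator's name patterns, repeated so the sketch stands alone.
FORBIDDEN_NAME_PATTERNS = [r".*_id$", r".*_token$", r".*_uuid$"]


def find_violations(metric_definitions):
    """Return every (metric, label) pair whose label name is forbidden."""
    violations = []
    for metric_name, label_names in metric_definitions.items():
        for label_name in label_names:
            if any(re.match(p, label_name) for p in FORBIDDEN_NAME_PATTERNS):
                violations.append((metric_name, label_name))
    return violations


definitions = {
    "api_request_total": ["method", "status", "path"],
    "order_created_total": ["order_id", "region"],
}
print(find_violations(definitions))  # [('order_created_total', 'order_id')]
```

A CI step could fail the build whenever the returned list is non-empty.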

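2. Metric registry

Manage all metrics in one place and record a baseline for each:

Core logic:

import logging
import warnings

logger = logging.getLogger(__name__)


class MetricRegistry:
    def __init__(self):
        self.metrics = {}
        self.baselines = {}

    def register(self, metric_name, labels, baseline_series_count=None):
        """Register a metric; `labels` maps each label name to its possible values."""
        current_count = self._estimate_series_count(labels)

        if metric_name in self.metrics:
            baseline = self.baselines[metric_name]

            # More than 50% above the recorded baseline counts as abnormal growth
            if current_count > baseline * 1.5:
                warnings.warn(f"Metric {metric_name} series count increased significantly")
                self._trigger_alert(metric_name, current_count, baseline)

        self.metrics[metric_name] = labels
        # Use the caller's baseline when one is given, otherwise the current estimate
        self.baselines[metric_name] = baseline_series_count or current_count

    def _estimate_series_count(self, labels):
        count = 1
        for label_name, label_values in labels.items():
            count *= len(label_values)
        return count

    def _trigger_alert(self, metric_name, current_count, baseline):
        # Placeholder: forward to whatever alerting channel is in use
        logger.error("Metric %s grew from %d to %d series",
                     metric_name, baseline, current_count)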

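3. High-cardinality metric sampler

Apply sampling to high-cardinality metrics:

Core logic:

import random
import time


class HighCardinalitySampler:
    FORBIDDEN_SUFFIXES = ('_id', '_token', '_uuid')

    def __init__(self, sample_rate=0.01, threshold=10000):
        self.sample_rate = sample_rate  # 1% sampling by default
        self.threshold = threshold      # more series than this counts as high cardinality
        self.sampled_data = {}
        self.seen_series = {}           # metric name -> label sets observed so far

    def record(self, metric_name, labels, value):
        # Only metrics that are already high-cardinality go through sampling
        if not self._is_high_cardinality(metric_name, labels):
            return False  # no sampling needed

        # Probabilistic sampling
        if random.random() < self.sample_rate:
            key = self._generate_sample_key(metric_name, labels)
            self.sampled_data[key] = {
                'value': value,
                'timestamp': time.time(),
                'labels': labels
            }
            return True

        return False

    def _is_high_cardinality(self, metric_name, labels):
        """True once the metric has produced more than `threshold` distinct label sets."""
        seen = self.seen_series.setdefault(metric_name, set())
        if len(seen) <= self.threshold:
            # Stop tracking past the threshold so this set stays bounded
            seen.add(frozenset(labels.items()))
        return len(seen) > self.threshold

    def _is_forbidden_label(self, label_name):
        return label_name.endswith(self.FORBIDDEN_SUFFIXES)

    def _generate_sample_key(self, metric_name, labels):
        """Build the key under which a sample is stored."""
        # Keep only the key labels and filter out high-cardinality ones
        filtered_labels = sorted(
            (k, v) for k, v in labels.items()
            if not self._is_forbidden_label(k)
        )
        return f"{metric_name}:{filtered_labels}"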
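The sampler above draws a fresh random number for every observation, so whether a given user or order is sampled changes from call to call. A common alternative, shown here as a swapped-in technique rather than the author's method, is hash-based sampling: the decision is derived from the key itself, so the same key is always either kept or dropped. `should_sample` is a hypothetical helper name.

```python
import hashlib


def should_sample(key, sample_rate=0.01):
    """Deterministically keep roughly `sample_rate` of all distinct keys."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# The decision for a given key never changes between calls.
print(should_sample("user-42") == should_sample("user-42"))  # True
```

Because the decision is stable, all observations for a sampled key are retained, which makes the sampled data easier to analyze afterwards.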

4. Aggregation rule manager

Pre-aggregate metrics with recording rules:

Core logic:

class RecordingRuleManager:
    def __init__(self):
        self.rules = []
    
    def generate_rules(self, metrics):
        """`metrics` are objects exposing `name`, `series_count` and `labels`."""
        rules = []
        
        for metric in metrics:
            # Generate an aggregation rule where one is needed
            if self._needs_aggregation(metric):
                rule = self._create_aggregation_rule(metric)
                rules.append(rule)
        
        return rules
    
    def _needs_aggregation(self, metric):
        """Decide whether the metric needs an aggregation rule."""
        # High-cardinality metrics need aggregation
        if metric.series_count > 50000:
            return True
        
        # Frequently queried metrics need aggregation
        if metric.name in ['api_request_total', 'api_request_duration']:
            return True
        
        return False
    
    def _create_aggregation_rule(self, metric):
        """Create the aggregation rule."""
        rule_name = f"{metric.name}_aggregated"
        
        # Aggregate the path label away, keeping method and status
        if 'path' in metric.labels:
            expr = f"sum by (method, status) ({metric.name})"
        else:
            expr = f"sum({metric.name})"
        
        return {
            'name': rule_name,
            'expr': expr,
            'labels': {
                'aggregated': 'true',
                'aggregation_type': 'high_cardinality_reduction'
            }
        }
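The rule dictionaries still have to reach Prometheus as a rule file. The sketch below renders them as YAML text without any third-party dependency; `render_rule_group` is a hypothetical helper, and it assumes each rule has the `name`, `expr`, and optional `labels` keys used above, with `name` written out as the `record` field.

```python
def render_rule_group(group_name, rules, interval="30s"):
    """Render recording rules as the YAML text of a Prometheus rule file."""
    lines = [
        "groups:",
        f"  - name: {group_name}",
        f"    interval: {interval}",
        "    rules:",
    ]
    for rule in rules:
        # Single-quote the expression; YAML escapes a quote by doubling it.
        expr = rule["expr"].replace("'", "''")
        lines.append(f"      - record: {rule['name']}")
        lines.append(f"        expr: '{expr}'")
        if rule.get("labels"):
            lines.append("        labels:")
            for key, value in rule["labels"].items():
                lines.append(f"          {key}: '{value}'")
    return "\n".join(lines) + "\n"


rules = [{
    "name": "api_request_total_aggregated",
    "expr": "sum by (method, status) (api_request_total)",
    "labels": {"aggregated": "true"},
}]
print(render_rule_group("aggregation_rules", rules))
```

The generated file can then be listed under `rule_files` in the Prometheus configuration.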

5. Cardinality monitor and alerter

Watch metric cardinality as it changes:

Core logic:

class CardinalityAlerter:
    def __init__(self, prometheus_client):
        # A wrapper object exposing get_all_metrics(); not the prometheus_client library
        self.prometheus = prometheus_client
        self.thresholds = {
            'series_count_per_metric': 50000,
            'total_series_count': 500000,
            'growth_rate_per_minute': 0.1  # 10%
        }
    
    def check(self):
        """Run once per minute so the growth rate is a per-minute figure."""
        alerts = []
        
        # Check the series count of every metric
        metrics = self.prometheus.get_all_metrics()
        for metric in metrics:
            series_count = self._get_series_count(metric)
            
            if series_count > self.thresholds['series_count_per_metric']:
                alerts.append({
                    'severity': 'critical',
                    'metric': metric.name,
                    'message': f"Metric {metric.name} has {series_count} series"
                })
            
            # Check the growth rate
            growth_rate = self._calculate_growth_rate(metric)
            if growth_rate > self.thresholds['growth_rate_per_minute']:
                alerts.append({
                    'severity': 'warning',
                    'metric': metric.name,
                    'message': f"Metric {metric.name} growing too fast: {growth_rate:.1%}/min"
                })
        
        return alerts
    
    def _get_series_count(self, metric):
        """Return the number of series the metric currently has."""
        return len(metric.time_series)
    
    def _calculate_growth_rate(self, metric):
        """Return the relative growth in series count since the previous check."""
        current = self._get_series_count(metric)
        previous = metric.previous_series_count
        
        if previous == 0:
            return 0
        
        return (current - previous) / previous
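The alerter above leaves open where the per-metric series counts come from. One option is the Prometheus HTTP API: an instant query of `count by (__name__) ({__name__=~".+"})` returns one sample per metric name. The sketch below is an assumption about how that could be wired up with the standard library only; the helper names and the `base_url` parameter are hypothetical, and the query itself can be expensive on a server that is already overloaded.

```python
import json
import urllib.parse
import urllib.request

SERIES_COUNT_QUERY = 'count by (__name__) ({__name__=~".+"})'


def fetch_instant_query(base_url, query, timeout=10):
    """Run an instant query against the Prometheus HTTP API (/api/v1/query)."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def series_counts(payload):
    """Map metric name -> series count from an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    return {
        item["metric"].get("__name__", ""): int(float(item["value"][1]))
        for item in payload["data"]["result"]
    }


def over_threshold(counts, threshold=50000):
    """Return (metric, count) pairs above the threshold, largest first."""
    offenders = [(name, count) for name, count in counts.items() if count > threshold]
    return sorted(offenders, key=lambda pair: pair[1], reverse=True)


# Example response in the documented instant-vector shape (values are made up).
sample = {
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"__name__": "api_request_total"}, "value": [1700000000, "120000"]},
        {"metric": {"__name__": "jvm_memory_used_bytes"}, "value": [1700000000, "24"]},
    ]},
}
print(over_threshold(series_counts(sample)))  # [('api_request_total', 120000)]
```

In a real deployment, `series_counts(fetch_instant_query(base_url, SERIES_COUNT_QUERY))` would replace the hard-coded `sample`.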

The complete workflow

The complete metric governance flow:

┌─────────────────────────────────────────────────────────────────┐
│                       Governance Workflow                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Development: label rule checks                              │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ • Check label configuration during code review          │   │
│   │ • Add label validation to the CI/CD pipeline            │   │
│   │ • Forbid high-cardinality labels                        │   │
│   └─────────────────────────────────────────────────────────┘   │
│                           ↓                                     │
│  2. Runtime: metric registration and monitoring                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ • Record metric baselines                               │   │
│   │ • Monitor series counts in real time                    │   │
│   │ • Detect abnormal growth                                │   │
│   └─────────────────────────────────────────────────────────┘   │
│                           ↓                                     │
│  3. Detection: handling high-cardinality metrics                │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ • Trigger sampling automatically                        │   │
│   │ • Create aggregation rules                              │   │
│   │ • Send alert notifications                              │   │
│   └─────────────────────────────────────────────────────────┘   │
│                           ↓                                     │
│  4. Continuous improvement: periodic review and cleanup         │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ • Analyze where high-cardinality metrics come from      │   │
│   │ • Improve label design                                  │   │
│   │ • Remove unused metrics                                 │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Label design best practices

1. Recommended labels

✅ Static classification labels:
• Endpoint type: path, method, operation
• Environment: env, region, datacenter
• Service: service, application, component
• Business dimension: user_type, plan_type, product_line

✅ Labels with a limited set of values:
• status: 10-20 values
• method: 5-10 values
• level: 3-5 values

✅ Boolean labels:
• is_mobile: true/false
• is_cached: true/false
• is_authenticated: true/false

2. Forbidden labels

❌ High-cardinality labels:
• user_id, order_id, session_id
• token, jwt, access_token
• ip_address, client_id
• request_id (under high concurrency)

❌ Dynamically changing labels:
• timestamp
• random_value
• Any label that contains a UUID

❌ Labels with overly long values:
• Full URLs
• Full SQL statements
• Long descriptive strings
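The "limited set of values" rule can also be enforced defensively in code, so that a label that was expected to stay small cannot grow without bound. The sketch below is one possible guard, not something the original design describes: `BoundedLabelValues` is a hypothetical class that passes through the first `max_values` distinct values and maps everything after that to a single overflow value.

```python
class BoundedLabelValues:
    """Pass through the first `max_values` distinct values, then return `overflow`."""

    def __init__(self, max_values=100, overflow="other"):
        self.max_values = max_values
        self.overflow = overflow
        self._seen = set()

    def normalize(self, value):
        if value in self._seen:
            return value
        if len(self._seen) < self.max_values:
            self._seen.add(value)
            return value
        # The budget is used up: collapse every new value into one bucket.
        return self.overflow


plan_type = BoundedLabelValues(max_values=2)
print([plan_type.normalize(v) for v in ["free", "pro", "enterprise", "free"]])
# ['free', 'pro', 'other', 'free']
```

With this guard the worst case for the label is `max_values + 1` distinct values, at the cost of losing detail for late-arriving values.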

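Configuration suggestions

# Prometheus configuration (prometheus.yml)
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Cardinality limits. metric_relabel_configs belongs inside a scrape job;
# the job name is a placeholder and targets are omitted.
scrape_configs:
  - job_name: spring-boot-app
    metric_relabel_configs:
      # Drop every label whose name ends in _id. Series must still be
      # unique after the label is removed.
      - action: labeldrop
        regex: '.*_id'

# Recording rules (rule file listed under rule_files)
groups:
  - name: aggregation_rules
    interval: 30s
    rules:
      - record: api_request_total:sum
        expr: sum by (method, status, path) (rate(api_request_total[5m]))

      - record: api_request_duration:p99
        expr: histogram_quantile(0.99, sum by (le, path) (rate(api_request_duration_bucket[5m])))

  # Alerting rules
  - name: cardinality_alerts
    rules:
      - alert: HighCardinalityMetric
        expr: count(api_request_total) > 50000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Metric api_request_total has too many series"

      - alert: MetricSeriesGrowthRate
        expr: (count(api_request_total) - count(api_request_total offset 1m)) / count(api_request_total offset 1m) > 0.1
        for: 2m
        labels:
          severity: warning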

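Before and after

Metric	Before	After	Change
Max series per metric	5,000,000	50,000	-99%
Prometheus memory usage	64 GB	16 GB	-75%
Scrape duration	120 s	15 s	-87.5%
Average query response time	30 s	500 ms	-98.3%
OOM frequency	1-2 times per week	Essentially eliminated	✅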

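Summary

The core principles of Prometheus cardinality governance:

Prevention first: avoid high-cardinality labels from the development stage onward
Layered governance: label rules → runtime monitoring → smart sampling → aggregation rules
Continuous monitoring: watch series counts in real time and alert on growth rate
Regular optimization: analyze where high-cardinality metrics come from and keep improving label design

Remember: one good label design beats ten optimization schemes. Only by controlling cardinality at the source can you solve cardinality explosion at its root.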

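Getting the source code

This article is also published in the blog section of the mini program; follow the mini-program blog if you need the source code.

WeChat official account: 服务端技术精选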
