Spring Boot + Distributed Transaction Monitoring Dashboard + Failure-Rate Alerts: Real-Time Transaction Health, Anomalies Detected in Seconds

2026-03-27 / SpringBoot, transaction monitoring, distributed transactions

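Preface

In a microservice architecture, monitoring and alerting for distributed transactions are key to keeping the system stable. Teams typically run into four problems:

Poor visibility: a distributed transaction spans several services, so its overall state is hard to follow
Hard fault localization: when a transaction fails, the root cause is hard to pin down quickly
Late alerts: traditional log-based monitoring cannot surface anomalies in real time
No health measure: there is no quantitative view of overall transaction health

This article walks through building a distributed transaction monitoring dashboard with Spring Boot + Micrometer + Prometheus + Grafana, covering real-time health monitoring and fast failure-rate alerting. In-process alerts react within seconds; Prometheus alerts follow the scrape interval and the `for` duration of each rule.

I. Challenges of Monitoring Distributed Transactions

1. Complexity of Distributed Transactions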

┌───────────────────────────────────────────────────────────────────┐
│              Distributed transaction execution path               │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐ │
│  │   Order   │───▶│ Inventory │───▶│  Payment  │───▶│  Notify   │ │
│  └───────────┘    └───────────┘    └───────────┘    └───────────┘ │
│        │                │                │                │       │
│        ▼                ▼                ▼                ▼       │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐ │
│  │   DB-1    │    │   DB-2    │    │   DB-3    │    │   MQ-1    │ │
│  └───────────┘    └───────────┘    └───────────┘    └───────────┘ │
│                                                                   │
│  Monitoring challenges:                                           │
│  1. Cross-service state is hard to track as a whole               │
│  2. Transactions can span a long time                             │
│  3. Failures have many different causes                           │
│  4. Retries make monitoring more complex                          │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘

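2. Limitations of Traditional Monitoring Approaches

Approach	Strengths	Weaknesses	Best suited for
Log monitoring	Detailed information	Poor real-time behavior, hard to aggregate	Troubleshooting
APM tools	Distributed tracing	High cost, steep learning curve	Performance analysis
Database monitoring	Visibility into the data layer	No business-level view	Database tuning
Custom monitoring	Tailored to the business	High development cost	Specific requirements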

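3. Metric System

Category	Metric	Description	Alert threshold
Execution	Total transactions	Number of transactions executed	-
	Successes	Transactions that completed successfully	-
	Failures	Transactions that failed	-
	Failure rate	Failures / total	> 5%
Performance	Average duration	Mean transaction execution time	> 5s
	P95 duration	Duration within which 95% of transactions complete	> 10s
	P99 duration	Duration within which 99% of transactions complete	> 30s
State	In progress	Transactions currently executing	> 100
	Stuck	Transactions that timed out without completing	> 10
Health	Health score	Composite score (0-100)	< 80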
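
To make the table concrete, here is a minimal sketch of the arithmetic behind the failure rate and the P95/P99 rows. The class name `MetricMath` and the sample data are illustrative assumptions; in the setup described in this article, Micrometer and Prometheus compute these values, and Prometheus estimates quantiles from histogram buckets rather than from raw samples.

```java
import java.util.Arrays;

/** Illustrative helper (hypothetical name): the arithmetic behind the metric table. */
final class MetricMath {

    /** Failure rate in percent: failures / total * 100, or 0 when nothing has run. */
    static double failureRatePercent(long failures, long total) {
        return total > 0 ? (failures * 100.0) / total : 0.0;
    }

    /** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
    static long percentile(long[] durationsMillis, double p) {
        if (durationsMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = durationsMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        long[] samples = new long[100];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = (i + 1) * 100L; // 100 ms, 200 ms, ..., 10000 ms
        }
        System.out.println(failureRatePercent(6, 100)); // 6.0, above the 5% threshold
        System.out.println(percentile(samples, 95));    // 9500 (ms), below the 10s threshold
    }
}
```

With these samples the failure-rate alert would trigger while the P95 alert would not, which shows why the two are tracked separately.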

II. Technical Architecture

1. Overall Architecture

┌─────────────────────────────────────────────────────────────┐
│       Distributed transaction monitoring architecture       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                   Application layer                   │  │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐          │  │
│  │  │   Order   │  │ Inventory │  │  Payment  │          │  │
│  │  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘          │  │
│  │        │              │              │                │  │
│  │        └──────────────┼──────────────┘                │  │
│  │                       ▼                               │  │
│  │             ┌───────────────────┐                     │  │
│  │             │    Micrometer     │                     │  │
│  │             │ metrics collector │                     │  │
│  │             └─────────┬─────────┘                     │  │
│  └───────────────────────┼───────────────────────────────┘  │
│                          ▼                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                      Data layer                       │  │
│  │             ┌───────────────────┐                     │  │
│  │             │    Prometheus     │                     │  │
│  │             │  time-series DB   │                     │  │
│  │             └─────────┬─────────┘                     │  │
│  └───────────────────────┼───────────────────────────────┘  │
│                          ▼                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                  Presentation layer                   │  │
│  │  ┌────────────────┐  ┌────────────────┐               │  │
│  │  │    Grafana     │  │  Alertmanager  │               │  │
│  │  │   dashboards   │  │    alerting    │               │  │
│  │  └────────────────┘  └────────────────┘               │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘

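2. Technology Choices

Technology	Version	Purpose
Spring Boot	3.2.0	Base framework
Spring Boot Actuator	-	Health checks and metric exposure
Micrometer	1.12.0	Metrics collection facade
Prometheus	2.45.0	Metric scraping and time-series storage
Grafana	10.0.0	Dashboards
Alertmanager	0.26.0	Alert routing and notification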

3. Data Flow

┌─────────────────────────────────────────────────────────────┐
│                    Monitoring data flow                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. Transaction execution                                   │
│     ┌─────────────────────────────────────────────┐         │
│     │ @Transactional                              │         │
│     │ public void processOrder() {                │         │
│     │     // business logic                       │         │
│     │ }                                           │         │
│     └─────────────────────────────────────────────┘         │
│                           ↓                                 │
│  2. Metric collection                                       │
│     ┌─────────────────────────────────────────────┐         │
│     │ meterRegistry.counter("transaction.total",  │         │
│     │     "type", "order", "status", "success")   │         │
│     │     .increment();                           │         │
│     └─────────────────────────────────────────────┘         │
│                           ↓                                 │
│  3. Metric exposure                                         │
│     ┌─────────────────────────────────────────────┐         │
│     │ GET /actuator/prometheus                    │         │
│     │ Returns metrics in Prometheus text format   │         │
│     └─────────────────────────────────────────────┘         │
│                           ↓                                 │
│  4. Scraping                                                │
│     ┌─────────────────────────────────────────────┐         │
│     │ Prometheus pulls the metrics periodically   │         │
│     │ and stores them as time series              │         │
│     └─────────────────────────────────────────────┘         │
│                           ↓                                 │
│  5. Visualization & alerting                                │
│     ┌─────────────────────────────────────────────┐         │
│     │ Grafana queries Prometheus for dashboards   │         │
│     │ Prometheus evaluates the alert rules;       │         │
│     │ Alertmanager sends the notifications        │         │
│     └─────────────────────────────────────────────┘         │
│                                                             │
└─────────────────────────────────────────────────────────────┘
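
Step 3 returns plain text, one sample per line. The sketch below renders a single sample line by hand, only to show the shape of what Prometheus scrapes. `ExpositionLine` is a hypothetical class; in a real application the Micrometer Prometheus registry produces this output, including the `# HELP` and `# TYPE` lines omitted here.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

/** Hypothetical sketch: renders one sample line in the Prometheus text format. */
final class ExpositionLine {

    static String render(String name, Map<String, String> labels, double value) {
        StringJoiner joined = new StringJoiner(",", "{", "}");
        joined.setEmptyValue(""); // a metric without labels has no braces at all
        // TreeMap gives a stable, alphabetical label order
        for (Map.Entry<String, String> e : new TreeMap<>(labels).entrySet()) {
            joined.add(e.getKey() + "=\"" + escape(e.getValue()) + "\"");
        }
        return name + joined + " " + value;
    }

    /** Label values escape backslash, double quote and newline. */
    static String escape(String v) {
        return v.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        // prints: transaction_total{status="success",type="order"} 3.0
        System.out.println(render("transaction_total",
                Map.of("type", "order", "status", "success"), 3));
    }
}
```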

III. Implementation

1. Project Structure

SpringBoot-TransactionMonitor-Demo/
├── src/
│   └── main/
│       ├── java/
│       │   └── com/
│       │       └── example/
│       │           └── monitor/
│       │               ├── TransactionMonitorApplication.java
│       │               ├── config/
│       │               │   ├── MetricsConfig.java
│       │               │   └── PrometheusConfig.java
│       │               ├── metrics/
│       │               │   ├── TransactionMetrics.java
│       │               │   └── TransactionHealthIndicator.java
│       │               ├── aspect/
│       │               │   └── TransactionMetricsAspect.java
│       │               ├── service/
│       │               │   ├── TransactionMonitorService.java
│       │               │   ├── AlertService.java
│       │               │   └── OrderService.java
│       │               ├── controller/
│       │               │   ├── OrderController.java
│       │               │   └── MonitorController.java
│       │               ├── dto/
│       │               │   ├── ApiResponse.java
│       │               │   └── TransactionStatistics.java
│       │               └── event/
│       │                   └── TransactionEvent.java
│       └── resources/
│           ├── application.yml
│           └── grafana/
│               └── dashboard.json
├── prometheus/
│   └── prometheus.yml
├── grafana/
│   └── provisioning/
│       ├── dashboards/
│       └── datasources/
├── docker-compose.yml
├── pom.xml
└── README.md

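2. Core Code

2.1 Transaction Metrics Collector

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;

import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;

@Component
@Slf4j
public class TransactionMetrics {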

    private final MeterRegistry meterRegistry;
    
    private final Counter totalCounter;
    private final Counter successCounter;
    private final Counter failureCounter;
    private final Counter timeoutCounter;
    private final Timer executionTimer;
    private final Gauge pendingGauge;
    
    private final AtomicLong pendingCount = new AtomicLong(0);

    public TransactionMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        
        // Counter: total number of transactions started
        this.totalCounter = Counter.builder("transaction_total")
            .description("Total number of transactions")
            .tag("application", "transaction-monitor")
            .register(meterRegistry);
        
        // Counter: successful transactions
        this.successCounter = Counter.builder("transaction_success")
            .description("Number of successful transactions")
            .tag("application", "transaction-monitor")
            .register(meterRegistry);
        
        // Counter: failed transactions
        this.failureCounter = Counter.builder("transaction_failure")
            .description("Number of failed transactions")
            .tag("application", "transaction-monitor")
            .register(meterRegistry);
        
        // Counter: timed-out transactions
        this.timeoutCounter = Counter.builder("transaction_timeout")
            .description("Number of timeout transactions")
            .tag("application", "transaction-monitor")
            .register(meterRegistry);
        
        // Timer: execution duration, with client-side percentiles and histogram buckets
        this.executionTimer = Timer.builder("transaction_duration")
            .description("Transaction execution duration")
            .tag("application", "transaction-monitor")
            .publishPercentiles(0.5, 0.95, 0.99)
            .publishPercentileHistogram()
            .minimumExpectedValue(Duration.ofMillis(1))
            .maximumExpectedValue(Duration.ofSeconds(60))
            .register(meterRegistry);
        
        // Gauge: number of transactions currently in progress
        this.pendingGauge = Gauge.builder("transaction_pending", pendingCount, AtomicLong::get)
            .description("Number of pending transactions")
            .tag("application", "transaction-monitor")
            .register(meterRegistry);
    }
    
    public void recordStart(String transactionType) {
        pendingCount.incrementAndGet();
        totalCounter.increment();
        log.debug("Transaction started: type={}", transactionType);
    }
    
    public void recordSuccess(String transactionType, Duration duration) {
        pendingCount.decrementAndGet();
        successCounter.increment();
        executionTimer.record(duration);
        log.debug("Transaction success: type={}, duration={}ms", 
                 transactionType, duration.toMillis());
    }
    
    public void recordFailure(String transactionType, Duration duration, String errorType) {
        pendingCount.decrementAndGet();
        failureCounter.increment();
        executionTimer.record(duration);
        log.warn("Transaction failed: type={}, duration={}ms, error={}", 
                transactionType, duration.toMillis(), errorType);
    }
    
    public void recordTimeout(String transactionType) {
        pendingCount.decrementAndGet();
        timeoutCounter.increment();
        log.error("Transaction timeout: type={}", transactionType);
    }
    
    public double getFailureRate() {
        double total = totalCounter.count();
        double failures = failureCounter.count();
        return total > 0 ? (failures / total) * 100 : 0;
    }
    
    public long getPendingCount() {
        return pendingCount.get();
    }
}
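
`getFailureRate()` above divides cumulative counters, so it reports the failure rate since application startup and reacts slowly after a long healthy run. If the in-process check should reflect recent behavior instead, a time-windowed rate is one option. The sketch below is an assumption-laden illustration: `SlidingFailureRate` is a hypothetical class, and the caller passes timestamps in explicitly so the logic stays easy to test.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper: failure rate over the last windowMillis instead of since startup. */
final class SlidingFailureRate {

    private record Outcome(long timestampMillis, boolean failed) {}

    private final Deque<Outcome> outcomes = new ArrayDeque<>();
    private final long windowMillis;
    private int failures;

    SlidingFailureRate(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Records one finished transaction at the given time. */
    synchronized void record(long nowMillis, boolean failed) {
        evict(nowMillis);
        outcomes.addLast(new Outcome(nowMillis, failed));
        if (failed) {
            failures++;
        }
    }

    /** Failure rate in percent over the window ending at nowMillis; 0 when the window is empty. */
    synchronized double failureRatePercent(long nowMillis) {
        evict(nowMillis);
        return outcomes.isEmpty() ? 0.0 : failures * 100.0 / outcomes.size();
    }

    /** Drops outcomes that have aged out of the window. */
    private void evict(long nowMillis) {
        while (!outcomes.isEmpty()
                && nowMillis - outcomes.peekFirst().timestampMillis() >= windowMillis) {
            if (outcomes.removeFirst().failed()) {
                failures--;
            }
        }
    }
}
```

The deque holds one entry per transaction in the window, so memory grows with throughput; at high volume, fixed time buckets would be the more economical design. On the Prometheus side, `rate(...[5m])` already provides this windowed view.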

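2.2 Transaction Monitoring Aspect

import lombok.extern.slf4j.Slf4j;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

import java.time.Duration;

@Aspect
@Component
@Slf4j
public class TransactionMetricsAspect {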

    @Autowired
    private TransactionMetrics transactionMetrics;
    
    @Autowired
    private ApplicationEventPublisher eventPublisher;

    @Around("@annotation(transactional)")
    public Object monitorTransaction(ProceedingJoinPoint joinPoint, 
                                    Transactional transactional) throws Throwable {
        String transactionType = joinPoint.getSignature().getName();
        long startTime = System.nanoTime();
        
        transactionMetrics.recordStart(transactionType);
        
        try {
            Object result = joinPoint.proceed();
            
            Duration duration = Duration.ofNanos(System.nanoTime() - startTime);
            transactionMetrics.recordSuccess(transactionType, duration);
            
            // Publish a transaction-success event
            eventPublisher.publishEvent(new TransactionEvent(
                transactionType, "SUCCESS", duration.toMillis()
            ));
            
            return result;
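        } catch (Throwable e) {
            // Catch Throwable, not just Exception: an Error would otherwise skip
            // recordFailure and leave the pending gauge permanently too high.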
            Duration duration = Duration.ofNanos(System.nanoTime() - startTime);
            transactionMetrics.recordFailure(transactionType, duration, e.getClass().getSimpleName());
            
            // Publish a transaction-failure event
            eventPublisher.publishEvent(new TransactionEvent(
                transactionType, "FAILURE", duration.toMillis(), e.getMessage()
            ));
            
            throw e;
        }
    }
}
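
The bookkeeping the aspect performs can be seen more easily without Spring or AspectJ. The sketch below is a framework-free illustration, not the article's implementation: `MonitoredCall` is a hypothetical class that wraps any `Supplier`, and it uses `finally` so that the in-progress count is restored no matter how the body exits.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Hypothetical, framework-free version of the aspect's bookkeeping. */
final class MonitoredCall {

    final AtomicLong pending = new AtomicLong();
    final AtomicLong successes = new AtomicLong();
    final AtomicLong failures = new AtomicLong();
    final AtomicLong totalNanos = new AtomicLong();

    <T> T run(Supplier<T> body) {
        long start = System.nanoTime();
        pending.incrementAndGet();
        boolean ok = false;
        try {
            T result = body.get();
            ok = true;
            return result;
        } finally {
            // Runs for normal returns, exceptions and errors alike,
            // so the pending count cannot leak.
            pending.decrementAndGet();
            (ok ? successes : failures).incrementAndGet();
            totalNanos.addAndGet(System.nanoTime() - start);
        }
    }

    public static void main(String[] args) {
        MonitoredCall monitor = new MonitoredCall();
        monitor.run(() -> "order created");
        try {
            monitor.run(() -> { throw new IllegalStateException("payment rejected"); });
        } catch (IllegalStateException expected) {
            // the failure has been counted; the exception still reaches the caller
        }
        System.out.println(monitor.successes.get() + " ok, "
                + monitor.failures.get() + " failed, "
                + monitor.pending.get() + " pending"); // 1 ok, 1 failed, 0 pending
    }
}
```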

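2.3 Transaction Health Indicator

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class TransactionHealthIndicator implements HealthIndicator {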

    @Autowired
    private TransactionMetrics transactionMetrics;
    
    @Value("${health.transaction.failure-rate-threshold:5.0}")
    private double failureRateThreshold;
    
    @Value("${health.transaction.pending-threshold:100}")
    private long pendingThreshold;

    @Override
    public Health health() {
        double failureRate = transactionMetrics.getFailureRate();
        long pendingCount = transactionMetrics.getPendingCount();
        
        Health.Builder builder;
        
        if (failureRate > failureRateThreshold || pendingCount > pendingThreshold) {
            builder = Health.down()
                .withDetail("failureRate", String.format("%.2f%%", failureRate))
                .withDetail("pendingCount", pendingCount)
                .withDetail("status", "UNHEALTHY");
        } else {
            builder = Health.up()
                .withDetail("failureRate", String.format("%.2f%%", failureRate))
                .withDetail("pendingCount", pendingCount)
                .withDetail("status", "HEALTHY");
        }
        
        return builder.build();
    }
}

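2.4 Alert Service

import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.event.EventListener;
import org.springframework.scheduling.annotation.Async;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Requires @EnableAsync and @EnableScheduling on a configuration class;
// without them @Async and @Scheduled have no effect.
@Service
@Slf4j
public class AlertService {

    // NotificationService is not shown in this article; it stands for the
    // project's channel adapter (email, DingTalk, SMS, ...).
    @Autowired
    private NotificationService notificationService;

    // Injected so the scheduled check can read the live metrics
    @Autowired
    private TransactionMetrics transactionMetrics;
    
    @Value("${alert.failure-rate.threshold:5.0}")
    private double failureRateThreshold;
    
    @Value("${alert.pending.threshold:100}")
    private long pendingThreshold;
    
    private final Map<String, Long> alertCooldown = new ConcurrentHashMap<>();
    private static final long COOLDOWN_MINUTES = 5;

    @Async
    @EventListener
    public void handleTransactionEvent(TransactionEvent event) {
        log.info("Transaction event: type={}, status={}, duration={}ms",
                event.getType(), event.getStatus(), event.getDuration());
        
        if ("FAILURE".equals(event.getStatus())) {
            checkAndSendAlert("transaction_failure", 
                String.format("Transaction failure alert: type=%s, error=%s", 
                    event.getType(), event.getErrorMessage()));
        }
    }
    
    @Scheduled(fixedRate = 60000)
    public void checkTransactionHealth() {
        double failureRate = transactionMetrics.getFailureRate();
        long pending = transactionMetrics.getPendingCount();
        
        // Check the failure rate
        if (failureRate > failureRateThreshold) {
            checkAndSendAlert("high_failure_rate",
                String.format("Transaction failure rate too high: %.2f%% (threshold: %.2f%%)",
                    failureRate, failureRateThreshold));
        }
        
        // Check the number of in-progress transactions
        if (pending > pendingThreshold) {
            checkAndSendAlert("high_pending_count",
                String.format("Too many pending transactions: %d (threshold: %d)",
                    pending, pendingThreshold));
        }
    }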
    
    private void checkAndSendAlert(String alertType, String message) {
        Long lastAlertTime = alertCooldown.get(alertType);
        long now = System.currentTimeMillis();
        
        if (lastAlertTime == null || 
            now - lastAlertTime > COOLDOWN_MINUTES * 60 * 1000) {
            
            log.warn("Sending alert: type={}, message={}", alertType, message);
            notificationService.sendAlert(alertType, message);
            alertCooldown.put(alertType, now);
        }
    }
}
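
`checkAndSendAlert` reads the cooldown map and writes it in two separate steps, so the `@Async` listener and the scheduled check can both pass the test at the same moment and send the same alert twice. One way to close that gap is to do the check and the update in a single atomic `compute` call. The sketch below assumes a hypothetical `AlertCooldown` class and takes the current time as a parameter to keep it testable.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical helper: per-alert-type cooldown with an atomic check-and-update. */
final class AlertCooldown {

    private final ConcurrentMap<String, Long> lastSent = new ConcurrentHashMap<>();
    private final long cooldownMillis;

    AlertCooldown(long cooldownMillis) {
        this.cooldownMillis = cooldownMillis;
    }

    /** Returns true if the caller should send now; only one caller wins per cooldown period. */
    boolean tryAcquire(String alertType, long nowMillis) {
        boolean[] acquired = {false};
        // compute() runs atomically per key, so check and update cannot interleave
        lastSent.compute(alertType, (key, previous) -> {
            if (previous == null || nowMillis - previous > cooldownMillis) {
                acquired[0] = true;
                return nowMillis;
            }
            return previous;
        });
        return acquired[0];
    }

    public static void main(String[] args) {
        AlertCooldown cooldown = new AlertCooldown(5 * 60 * 1000L);
        System.out.println(cooldown.tryAcquire("high_failure_rate", 0));       // true
        System.out.println(cooldown.tryAcquire("high_failure_rate", 60_000));  // false, still cooling down
        System.out.println(cooldown.tryAcquire("high_pending_count", 60_000)); // true, separate alert type
    }
}
```

In the service, the call would become `if (cooldown.tryAcquire(alertType, System.currentTimeMillis())) { ... send ... }`.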

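2.5 Monitoring Statistics Service

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.distribution.ValueAtPercentile;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

import java.time.LocalDateTime;
import java.util.concurrent.TimeUnit;

@Service
@Slf4j
public class TransactionMonitorService {

    private final TransactionMetrics transactionMetrics;
    private final MeterRegistry meterRegistry;

    public TransactionMonitorService(TransactionMetrics transactionMetrics,
                                     MeterRegistry meterRegistry) {
        this.transactionMetrics = transactionMetrics;
        this.meterRegistry = meterRegistry;

        // Expose the health score as a gauge so that the transaction_health_score
        // alert rule and dashboard panel have a series to query.
        Gauge.builder("transaction_health_score", this,
                service -> service.getStatistics().getHealthScore())
            .description("Composite transaction health score (0-100)")
            .register(meterRegistry);
    }

    public TransactionStatistics getStatistics() {
        double total = getCounterValue("transaction_total");
        double success = getCounterValue("transaction_success");
        double failure = getCounterValue("transaction_failure");
        double timeout = getCounterValue("transaction_timeout");
        long pending = transactionMetrics.getPendingCount();
        
        double failureRate = total > 0 ? (failure / total) * 100 : 0;
        double successRate = total > 0 ? (success / total) * 100 : 0;
        
        Timer timer = meterRegistry.find("transaction_duration").timer();
        double avgDuration = 0, p95Duration = 0, p99Duration = 0;
        if (timer != null) {
            avgDuration = timer.mean(TimeUnit.MILLISECONDS);
            // Timer has no percentile(...) accessor in Micrometer 1.12; read the
            // percentiles configured via publishPercentiles(...) from a snapshot.
            for (ValueAtPercentile v : timer.takeSnapshot().percentileValues()) {
                if (Math.abs(v.percentile() - 0.95) < 1e-9) p95Duration = v.value(TimeUnit.MILLISECONDS);
                if (Math.abs(v.percentile() - 0.99) < 1e-9) p99Duration = v.value(TimeUnit.MILLISECONDS);
            }
        }
        
        int healthScore = calculateHealthScore(failureRate, pending, avgDuration);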
        
        return TransactionStatistics.builder()
            .total((long) total)
            .success((long) success)
            .failure((long) failure)
            .timeout((long) timeout)
            .pending(pending)
            .successRate(successRate)
            .failureRate(failureRate)
            .avgDuration(avgDuration)
            .p95Duration(p95Duration)
            .p99Duration(p99Duration)
            .healthScore(healthScore)
            .status(healthScore >= 80 ? "HEALTHY" : "UNHEALTHY")
            .timestamp(LocalDateTime.now())
            .build();
    }
    
    private double getCounterValue(String name) {
        Counter counter = meterRegistry.find(name).counter();
        return counter != null ? counter.count() : 0;
    }
    
    private int calculateHealthScore(double failureRate, long pending, double avgDuration) {
        int score = 100;
        
        // Penalty for failure rate (percent)
        if (failureRate > 10) score -= 30;
        else if (failureRate > 5) score -= 20;
        else if (failureRate > 2) score -= 10;
        
        // Penalty for in-progress transaction count
        if (pending > 100) score -= 20;
        else if (pending > 50) score -= 10;
        else if (pending > 20) score -= 5;
        
        // Penalty for average duration (milliseconds)
        if (avgDuration > 10000) score -= 20;
        else if (avgDuration > 5000) score -= 10;
        else if (avgDuration > 2000) score -= 5;
        
        return Math.max(0, score);
    }
}
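
A few worked inputs make the scoring rules in `calculateHealthScore` easier to reason about. The standalone class below copies the thresholds from the method above so it can be run on its own; `HealthScoreExample` is only an illustrative name.

```java
/** Standalone copy of the scoring rules above, for worked examples only. */
final class HealthScoreExample {

    static int score(double failureRatePercent, long pending, double avgDurationMillis) {
        int score = 100;

        if (failureRatePercent > 10) score -= 30;
        else if (failureRatePercent > 5) score -= 20;
        else if (failureRatePercent > 2) score -= 10;

        if (pending > 100) score -= 20;
        else if (pending > 50) score -= 10;
        else if (pending > 20) score -= 5;

        if (avgDurationMillis > 10000) score -= 20;
        else if (avgDurationMillis > 5000) score -= 10;
        else if (avgDurationMillis > 2000) score -= 5;

        return Math.max(0, score);
    }

    public static void main(String[] args) {
        System.out.println(score(2.5, 10, 1200));     // 90: only the >2% failure penalty applies
        System.out.println(score(6.0, 0, 0));         // 80: still reported as HEALTHY (>= 80)
        System.out.println(score(6.0, 60, 5500));     // 60: 100 - 20 - 10 - 10
        System.out.println(score(12.0, 150, 12000));  // 30: the lowest reachable score
    }
}
```

Two things stand out. A failure rate just above 5% with no other penalty yields exactly 80, which the service still labels HEALTHY even though the failure-rate alert threshold is exceeded. And the largest penalties add up to 70, so the score never drops below 30 and `Math.max(0, score)` is purely defensive.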

3. Configuration Files

3.1 application.yml

server:
  port: 8080

spring:
  application:
    name: transaction-monitor-demo
  datasource:
    url: jdbc:h2:mem:testdb
    driver-class-name: org.h2.Driver
    username: sa
    password: 
  jpa:
    hibernate:
      ddl-auto: create-drop
    show-sql: false

management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  endpoint:
    health:
      show-details: always
    prometheus:
      enabled: true
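  metrics:
    tags:
      application: ${spring.application.name}
  # Spring Boot 3.x moved management.metrics.export.prometheus.*
  # to management.prometheus.metrics.export.*
  prometheus:
    metrics:
      export:
        enabled: true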

health:
  transaction:
    failure-rate-threshold: 5.0
    pending-threshold: 100

alert:
  failure-rate:
    threshold: 5.0
  pending:
    threshold: 100

logging:
  level:
    com.example.monitor: DEBUG

3.2 Prometheus Configuration

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'spring-boot-application'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['host.docker.internal:8080']

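3.3 Alert Rules

# Micrometer's Prometheus naming appends _total to counter names that do not
# already end in it: "transaction_failure" becomes transaction_failure_total,
# while "transaction_total" should stay transaction_total. Check the exact
# names against the output of /actuator/prometheus.
groups:
  - name: transaction_alerts
    rules:
      - alert: HighTransactionFailureRate
        expr: |
          (sum(rate(transaction_failure_total[5m])) /
           sum(rate(transaction_total[5m]))) * 100 > 5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Transaction failure rate too high"
          description: "Transaction failure rate is {{ $value | printf \"%.2f\" }}%, above the 5% threshold"

      - alert: HighPendingTransactionCount
        expr: transaction_pending > 100
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Too many pending transactions"
          description: "There are {{ $value }} pending transactions, above the threshold of 100"

      - alert: SlowTransactionDuration
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(transaction_duration_seconds_bucket[5m]))
          ) > 10
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Transactions are too slow"
          description: "P95 transaction duration is {{ $value | humanizeDuration }}, above the 10s threshold"

      # transaction_health_score must be exposed by the application as a gauge
      # for this rule to evaluate.
      - alert: TransactionHealthScoreLow
        expr: transaction_health_score < 80
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Transaction health score too low"
          description: "Current health score is {{ $value }}, below the threshold of 80"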
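
The `for:` field in these rules means the condition must hold continuously for that long before the alert fires; until then the alert is only pending. The sketch below models that behavior as a small state machine. It is a simplified illustration under stated assumptions, not Prometheus code: `ForDurationRule` is a hypothetical class, and evaluation times are passed in explicitly.

```java
/** Hypothetical model of an alert rule with a "for" duration. */
final class ForDurationRule {

    enum State { INACTIVE, PENDING, FIRING }

    private final double threshold;
    private final long forMillis;
    private long activeSinceMillis = -1; // -1 means the condition is currently false

    ForDurationRule(double threshold, long forMillis) {
        this.threshold = threshold;
        this.forMillis = forMillis;
    }

    /** Feeds one evaluation result and returns the state after it. */
    State evaluate(long nowMillis, double value) {
        if (!(value > threshold)) {      // also treats NaN as "condition false"
            activeSinceMillis = -1;      // any gap resets the clock
            return State.INACTIVE;
        }
        if (activeSinceMillis < 0) {
            activeSinceMillis = nowMillis;
        }
        return nowMillis - activeSinceMillis >= forMillis ? State.FIRING : State.PENDING;
    }

    public static void main(String[] args) {
        ForDurationRule rule = new ForDurationRule(5.0, 120_000); // "> 5" with "for: 2m"
        System.out.println(rule.evaluate(0, 6.0));        // PENDING
        System.out.println(rule.evaluate(60_000, 7.0));   // PENDING
        System.out.println(rule.evaluate(120_000, 6.5));  // FIRING
        System.out.println(rule.evaluate(135_000, 4.0));  // INACTIVE
    }
}
```

This is why a rule with `for: 2m` cannot notify faster than two minutes after the first bad evaluation, regardless of the scrape interval.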

4. Grafana Dashboard Configuration

4.1 Dashboard JSON

{
  "dashboard": {
    "title": "Distributed Transaction Monitoring Dashboard",
    "panels": [
      {
        "title": "Transaction health",
        "type": "gauge",
        "targets": [
          {
            "expr": "transaction_health_score",
            "legendFormat": "Health score"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "red", "value": 0},
                {"color": "yellow", "value": 60},
                {"color": "green", "value": 80}
              ]
            },
            "min": 0,
            "max": 100
          }
        }
      },
      {
        "title": "Transaction success rate",
        "type": "stat",
        "targets": [
          {
            "expr": "(sum(rate(transaction_success_total[5m])) / sum(rate(transaction_total[5m]))) * 100",
            "legendFormat": "Success rate"
          }
        ]
      },
      {
        "title": "Transaction failure rate",
        "type": "stat",
        "targets": [
          {
            "expr": "(sum(rate(transaction_failure_total[5m])) / sum(rate(transaction_total[5m]))) * 100",
            "legendFormat": "Failure rate"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "green", "value": 0},
                {"color": "yellow", "value": 2},
                {"color": "red", "value": 5}
              ]
            }
          }
        }
      },
      {
        "title": "Transaction trend",
        "type": "timeseries",
        "targets": [
          {
            "expr": "rate(transaction_total[5m])",
            "legendFormat": "Total"
          },
          {
            "expr": "rate(transaction_success_total[5m])",
            "legendFormat": "Success"
          },
          {
            "expr": "rate(transaction_failure_total[5m])",
            "legendFormat": "Failure"
          }
        ]
      },
      {
        "title": "Transaction duration distribution",
        "type": "heatmap",
        "targets": [
          {
            "expr": "sum by (le) (rate(transaction_duration_seconds_bucket[5m]))",
            "format": "heatmap",
            "legendFormat": "{{le}}"
          }
        ]
      },
      {
        "title": "Pending transactions",
        "type": "gauge",
        "targets": [
          {
            "expr": "transaction_pending",
            "legendFormat": "Pending"
          }
        ]
      }
    ]
  }
}

IV. Dashboard Walkthrough

1. Core Metrics Panel

┌─────────────────────────────────────────────────────────────┐
│        Distributed transaction monitoring dashboard         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │ Health score │  │ Success rate │  │ Failure rate │       │
│  │              │  │              │  │              │       │
│  │      90      │  │    97.5%     │  │     2.5%     │       │
│  │   ●───────   │  │   ●───────   │  │   ●───────   │       │
│  │   HEALTHY    │  │              │  │   WARNING    │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                  Transaction trend                   │   │
│  │  100 ┤                                               │   │
│  │   80 ┤    ╭──╮                                       │   │
│  │   60 ┤   ╭╯  ╰╮    ╭──╮                              │   │
│  │   40 ┤  ╭╯    ╰──╮╯  ╰╮                              │   │
│  │   20 ┤ ╭╯        ╰────╰───╮                          │   │
│  │    0 ┼─┴──────────────────┴──────────────────────    │   │
│  │       10:00  10:05  10:10  10:15  10:20  10:25       │   │
│  │       ── total  ── success  ── failure               │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                             │
│  ┌──────────────────┐  ┌──────────────────┐                 │
│  │   P95 duration   │  │   P99 duration   │                 │
│  │                  │  │                  │                 │
│  │      2.3 s       │  │      5.8 s       │                 │
│  └──────────────────┘  └──────────────────┘                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘

2. Alert Configuration Panel

┌─────────────────────────────────────────────────────────────┐
│                  Alert rule configuration                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Rule                Condition               Limit  Level   │
│  ─────────────────────────────────────────────────────────  │
│  High failure rate   failure rate > limit    5%     critical│
│  Too many pending    pending count > limit   100    warning │
│  Slow transactions   P95 duration > limit    10s    warning │
│  Low health score    health score < limit    80     critical│
│                                                             │
│  Notification channels:                                     │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│  │  Email   │  │ DingTalk │  │  WeCom   │  │   SMS    │     │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

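V. Best Practices

1. Metric Design Principles

Principle	Description	Example
Naming convention	Use a common prefix and include the unit	`transaction_*`
Sensible labels	Keep the set of label values small	`type`, `status`
Cardinality control	Avoid high-cardinality labels	Never use an ID as a label
Appropriate granularity	Balance precision against overhead	Seconds vs. milliseconds

2. Alerting Strategy

Strategy	Description	Recommendation
Tiered alerts	Handle each severity differently	Critical → SMS, warning → email
Alert convergence	Avoid alert storms	Merge similar alerts
Silencing	Mute alerts during maintenance	Silence during planned maintenance windows
Recovery notices	Notify when the problem is resolved	Send a recovery notification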
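
The first two rows of the table can be sketched in a few lines. In this stack, Alertmanager normally does both jobs through its routing and grouping configuration; the Java below only illustrates the idea for in-process alerts. `AlertRouting`, its nested types, and the channel names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of tiered routing and alert convergence. */
final class AlertRouting {

    enum Severity { CRITICAL, WARNING }

    record Alert(String name, Severity severity, String message) {}

    /** Severity decides the channel, as in the table: critical -> SMS, warning -> email. */
    static String channelFor(Severity severity) {
        return severity == Severity.CRITICAL ? "sms" : "email";
    }

    /** Collapses alerts that share a name into one line with a count, to avoid alert storms. */
    static List<String> converge(List<Alert> alerts) {
        Map<String, Integer> counts = new LinkedHashMap<>();      // keeps first-seen order
        Map<String, Severity> severities = new LinkedHashMap<>();
        for (Alert alert : alerts) {
            counts.merge(alert.name(), 1, Integer::sum);
            severities.putIfAbsent(alert.name(), alert.severity());
        }
        List<String> notifications = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            String channel = channelFor(severities.get(entry.getKey()));
            notifications.add(channel + ": " + entry.getKey() + " x" + entry.getValue());
        }
        return notifications;
    }

    public static void main(String[] args) {
        List<Alert> burst = List.of(
            new Alert("HighTransactionFailureRate", Severity.CRITICAL, "6.1%"),
            new Alert("HighTransactionFailureRate", Severity.CRITICAL, "6.4%"),
            new Alert("SlowTransactionDuration", Severity.WARNING, "11s"));
        // [sms: HighTransactionFailureRate x2, email: SlowTransactionDuration x1]
        System.out.println(converge(burst));
    }
}
```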

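3. Dashboard Design

Element	Description	Recommendation
Core metrics	Place them where they are seen first	Top area
Trend charts	Show changes over time	Middle area
Detailed data	Allow drill-down	Bottom area
Alert status	Show active alerts in real time	Sidebar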

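VI. FAQ

Q1: How do I deal with high-cardinality labels?

A: You can:

Avoid using IDs, timestamps, and similar values as labels
Group label values into a small set of categories
Use a histogram with fixed buckets instead of one counter series per distinct value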
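
Grouping label values into categories is the advice that most often needs code. The sketch below maps arbitrary exceptions onto a fixed set of four values that would be safe to use as an `error` tag. The class name `ErrorCategory` and the chosen categories are assumptions for illustration; a real project would pick categories that match its own failure modes.

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: map any exception to one of a few fixed label values. */
final class ErrorCategory {

    /**
     * Returns one of "timeout", "validation", "io" or "other".
     * The result set is bounded, so the number of time series stays bounded too,
     * unlike tagging with the exception message or an order ID.
     */
    static String of(Throwable error) {
        if (error instanceof TimeoutException) {
            return "timeout";
        }
        if (error instanceof IllegalArgumentException) {
            return "validation";
        }
        if (error instanceof IOException) {
            return "io";
        }
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(of(new TimeoutException("payment gateway")));   // timeout
        System.out.println(of(new NumberFormatException("amount=abc")));   // validation
        System.out.println(of(new IllegalStateException("unexpected")));   // other
    }
}
```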

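Q2: How do I keep alerts accurate?

A: You can:

Set sensible thresholds and hold durations
Combine several conditions in one alert
Review and tune the alert rules regularly

Q3: How do I reduce the performance impact of monitoring?

A: You can:

Use a sampling strategy
Record metrics asynchronously where recording is expensive
Choose a reasonable scrape interval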
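
As one possible reading of the sampling advice, the sketch below always counts every transaction cheaply but signals that expensive detail (for example a full timing breakdown or an event publication) should be recorded only for every Nth call. `SampledRecorder` is a hypothetical class; `LongAdder` is used because it scales better than a single atomic counter under heavy write contention.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch: count everything, record expensive detail for 1 in N calls. */
final class SampledRecorder {

    private final int sampleEvery;
    private final AtomicLong sequence = new AtomicLong();
    private final LongAdder total = new LongAdder();
    private final LongAdder sampled = new LongAdder();

    SampledRecorder(int sampleEvery) {
        if (sampleEvery < 1) {
            throw new IllegalArgumentException("sampleEvery must be >= 1");
        }
        this.sampleEvery = sampleEvery;
    }

    /** Always counts the call; returns true when the caller should record full detail. */
    boolean record() {
        total.increment();
        boolean sample = sequence.getAndIncrement() % sampleEvery == 0;
        if (sample) {
            sampled.increment();
        }
        return sample;
    }

    long total() {
        return total.sum();
    }

    long sampled() {
        return sampled.sum();
    }

    public static void main(String[] args) {
        SampledRecorder recorder = new SampledRecorder(10);
        for (int i = 0; i < 100; i++) {
            recorder.record();
        }
        System.out.println(recorder.total() + " counted, " + recorder.sampled() + " sampled");
        // 100 counted, 10 sampled
    }
}
```

Counters and failure rates stay exact under this scheme; only the sampled detail becomes an estimate.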

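Q4: How do I trace a transaction across services?

A: You can:

Correlate records with a Trace ID
Integrate a distributed tracing system
Attach the Trace ID to logs, events, or exemplars rather than to metric labels, since a Trace ID is a high-cardinality value (see Q1)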

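VII. Extensions

1. Custom Metrics

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component
public class CustomMetrics {
    
    private final MeterRegistry registry;
    
    public CustomMetrics(MeterRegistry registry) {
        this.registry = registry;
        
        // Custom business gauge. The value function is called on every scrape,
        // so it should be cheap and must not block.
        Gauge.builder("order_pending_amount", this, CustomMetrics::calculatePendingAmount)
            .tag("type", "high_value")
            .register(registry);
    }
    
    private double calculatePendingAmount() {
        // Business-specific calculation goes here
        return 0;
    }
}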

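2. Dynamic Thresholds

import org.springframework.beans.factory.annotation.Value;
import org.springframework.cloud.context.config.annotation.RefreshScope;
import org.springframework.stereotype.Service;

// @RefreshScope comes from Spring Cloud (spring-cloud-context), which is not part of
// the technology table above. Property names match the alert.* keys in application.yml.
@Service
@RefreshScope
public class DynamicThresholdService {
    
    @Value("${alert.failure-rate.threshold:5.0}")
    private double failureRateThreshold;
    
    @Value("${alert.pending.threshold:100}")
    private long pendingThreshold;
    
    public boolean shouldAlert(TransactionStatistics stats) {
        // TransactionStatistics names the field "pending", so the getter is getPending()
        return stats.getFailureRate() > failureRateThreshold ||
               stats.getPending() > pendingThreshold;
    }
}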

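VIII. Summary

The distributed transaction monitoring system built with Spring Boot + Micrometer + Prometheus + Grafana provides:

Real-time monitoring: in-process events surface failures within seconds, while Prometheus alerts follow the 15s scrape interval plus each rule's `for` duration
Visual dashboard: transaction health at a glance
Multi-dimensional alerting: rules on failure rate, pending count, latency, and health score
Health assessment: a quantitative transaction health score

The approach has the following advantages:

Low intrusiveness: implemented with AOP, so business code stays unchanged
Low overhead: metrics are in-memory counters and timers, and alert handling runs asynchronously
Easy to extend: supports custom metrics and alert rules
Standardized: built on the Prometheus ecosystem, so it integrates easily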


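Title: Spring Boot + Distributed Transaction Monitoring Dashboard + Failure-Rate Alerts: Real-Time Transaction Health, Anomalies Detected in Seconds
Author: jiangyi
URL: http://jiangyi.space/articles/2026/03/27/1774246506844.html
WeChat official account: 服务端技术精选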
