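Spring Boot + System Resource Watermark Monitoring + Automatic Degradation: Switching Off Non-Core Features When CPU or Memory Crosses a Threshold

Introduction

In enterprise applications, stability and reliability are critical. Yet even a carefully designed system can run out of resources: the CPU saturates, or memory runs short. If nothing is done when a resource hits its limit, the whole system can go down, hurting both users and the business.

Picture an e-commerce system during a promotion. A surge of traffic pushes server CPU usage above 90% and memory usage close to 95%. Responses slow to a crawl, users cannot place orders, and the system may crash outright. If non-core features were switched off automatically as soon as usage crossed a threshold, the freed resources could keep the core features running.

Resource watermark monitoring with automatic degradation addresses this problem. The application samples CPU, memory, and other resource usage at a fixed interval; when usage exceeds a preset threshold, it disables non-core features so that core features keep working. This article walks through an implementation of both parts in a Spring Boot project.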

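1. Core Concepts of Resource Watermark Monitoring

1.1 What Is a Resource Watermark

A resource watermark is the usage level of a system resource (CPU, memory, disk, network, and so on), usually expressed as a percentage. Once usage crosses a certain threshold, the system is at risk of degraded performance or a crash.

1.2 Core Resource Metrics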

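Metric | Description | Normal | Warning | Critical
CPU usage | Percentage of processor capacity in use | 0%-70% | 70%-90% | 90%+
Memory usage | Percentage of memory in use | 0%-75% | 75%-90% | 90%+
Disk usage | Percentage of disk space in use | 0%-80% | 80%-90% | 90%+
Network bandwidth | Percentage of network bandwidth in use | 0%-70% | 70%-90% | 90%+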
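
The bands in this table map naturally onto a small classification function. The sketch below is only an illustration: the class name WaterLevel, the Status enum, and the choice to treat each threshold as an inclusive lower bound are assumptions, not part of the article's project.

```java
// Hypothetical helper: classifies a usage percentage against two thresholds.
final class WaterLevel {

    enum Status { NORMAL, WARNING, CRITICAL }

    private WaterLevel() {
    }

    /**
     * Returns the band that usagePercent falls into.
     * Both thresholds are treated as inclusive lower bounds (an assumption).
     */
    static Status classify(double usagePercent, double warningThreshold, double criticalThreshold) {
        if (warningThreshold > criticalThreshold) {
            throw new IllegalArgumentException("warning threshold must not exceed critical threshold");
        }
        if (usagePercent >= criticalThreshold) {
            return Status.CRITICAL;
        }
        if (usagePercent >= warningThreshold) {
            return Status.WARNING;
        }
        return Status.NORMAL;
    }
}
```

With the CPU row of the table, WaterLevel.classify(65, 70, 90) yields NORMAL, 70 yields WARNING, and 95 yields CRITICAL.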

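1.3 Why Monitoring Matters

  • Early warning: spot abnormal resource usage in time to act on it
  • Performance tuning: understand how the system uses resources and optimize where it counts
  • Automatic degradation: switch off non-core features when a resource reaches its threshold
  • Failure analysis: provide data for finding the cause of an outage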

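2. Resource Monitoring in Spring Boot

2.1 Common Monitoring Options

Option | Characteristics | Best suited for
Spring Boot Actuator | Built-in monitoring, easy to use | Quick integration, basic monitoring
Micrometer | Unified metrics collection with support for many monitoring backends | Complex systems, multi-dimensional monitoring
JMX | Java's standard management and monitoring API | Traditional Java applications
Custom monitoring | Flexible and targeted | Specific scenarios, in-depth monitoring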
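
As a rough illustration of the JMX row, the JDK's platform MXBeans can be read directly, without any framework. The sketch below is not part of the article's project: the class name JmxSnapshot and both method names are made up for illustration, and it reports JVM heap usage and the load average per CPU, which are not the same figures that Actuator exposes.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper that reads two figures from the standard platform MXBeans.
final class JmxSnapshot {

    private JmxSnapshot() {
    }

    /** Heap in use as a percentage of the maximum heap (or of the committed size if no maximum is defined). */
    static double heapUsedPercent() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long limit = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        return heap.getUsed() * 100.0 / limit;
    }

    /**
     * One-minute load average divided by the CPU count, as a percentage.
     * Returns -1.0 on platforms that do not report a load average (for example Windows).
     */
    static double loadPerCpuPercent() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage();
        return load < 0 ? -1.0 : load * 100.0 / os.getAvailableProcessors();
    }
}
```

Calling these two methods from a scheduled task would be enough for a first, framework-free watermark check; the Actuator and Micrometer setup that follows adds export and dashboards on top.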

2.2 Implementing Resource Monitoring

2.2.1 Dependencies

<dependencies>
    <!-- Spring Boot Actuator -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>

    <!-- Micrometer -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
    </dependency>
</dependencies>

2.2.2 Configuration

management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
  endpoint:
    health:
      show-details: always

# System resource monitoring configuration
system:
  monitor:
    enabled: true
    cpu:
      warning-threshold: 70
      critical-threshold: 90
    memory:
      warning-threshold: 75
      critical-threshold: 90
    disk:
      warning-threshold: 80
      critical-threshold: 90
    check-interval: 5000

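2.2.3 Resource Monitoring Service

import java.io.File;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import lombok.extern.slf4j.Slf4j;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

// Note: @Scheduled only fires if @EnableScheduling is present on a configuration class.
@Service
@Slf4j
public class SystemResourceMonitorService {

    private final SystemResourceProperties properties;
    private final OperatingSystemMXBean osBean = ManagementFactory.getOperatingSystemMXBean();
    private final Runtime runtime = Runtime.getRuntime();

    // Latest samples in percent; the gauges below read these fields on every scrape.
    private volatile double cpuUsage;
    private volatile double memoryUsage;
    private volatile double diskUsage;

    public SystemResourceMonitorService(SystemResourceProperties properties, MeterRegistry meterRegistry) {
        this.properties = properties;
        // Register each gauge once. Calling meterRegistry.gauge(name, value) with a boxed double
        // on every tick does not update the gauge, because Micrometer only keeps a weak
        // reference to the object it was first given.
        // No CPU gauge is registered here: Actuator already exports system.cpu.usage
        // (as a ratio between 0 and 1), and a second gauge with the same name would be ignored.
        Gauge.builder("system.memory.usage", this, s -> s.memoryUsage).register(meterRegistry);
        Gauge.builder("system.disk.usage", this, s -> s.diskUsage).register(meterRegistry);
    }

    @Scheduled(fixedRateString = "${system.monitor.check-interval:5000}")
    public void monitorResources() {
        if (!properties.isEnabled()) {
            return;
        }

        cpuUsage = getCpuUsage();
        checkThreshold("CPU", cpuUsage,
                properties.getCpu().getWarningThreshold(), properties.getCpu().getCriticalThreshold());

        memoryUsage = getMemoryUsage();
        checkThreshold("Memory", memoryUsage,
                properties.getMemory().getWarningThreshold(), properties.getMemory().getCriticalThreshold());

        diskUsage = getDiskUsage();
        checkThreshold("Disk", diskUsage,
                properties.getDisk().getWarningThreshold(), properties.getDisk().getCriticalThreshold());
    }

    private double getCpuUsage() {
        // System-wide CPU load through the com.sun.management extension (HotSpot-based JVMs).
        // getCpuLoad() needs JDK 14+; on older JDKs call getSystemCpuLoad() instead.
        if (osBean instanceof com.sun.management.OperatingSystemMXBean) {
            double load = ((com.sun.management.OperatingSystemMXBean) osBean).getCpuLoad();
            return load < 0 ? 0.0 : load * 100; // negative means "not available yet"
        }
        return 0.0; // CPU load is not available on this JVM
    }

    private double getMemoryUsage() {
        // JVM heap only: used heap relative to the maximum heap (-Xmx). Dividing by
        // totalMemory() would measure against the currently committed heap, which grows on demand.
        long usedMemory = runtime.totalMemory() - runtime.freeMemory();
        return usedMemory * 100.0 / runtime.maxMemory();
    }

    private double getDiskUsage() {
        // Usage of the partition that holds the working directory.
        File dir = new File(".").getAbsoluteFile();
        long total = dir.getTotalSpace();
        return total == 0 ? 0.0 : (total - dir.getUsableSpace()) * 100.0 / total;
    }

    private void checkThreshold(String resource, double usage, int warning, int critical) {
        if (usage >= critical) {
            log.error("{} usage critical: {}%", resource, usage);
            // Hook for triggering degradation (wired up in section 4.4.1)
        } else if (usage >= warning) {
            log.warn("{} usage warning: {}%", resource, usage);
            // Hook for raising a warning
        }
    }

}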

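3. Automatic Degradation

3.1 What Is Automatic Degradation

Automatic degradation means that when resource usage reaches a preset threshold, the system automatically switches off or scales back non-core features, freeing resources so that core features keep running. Common strategies include:

  • Feature degradation: switch off non-core features
  • Performance degradation: relax a feature's performance targets
  • Service degradation: temporarily stop offering certain services

3.2 Feature Tiers

Features can be grouped into tiers by importance:

Tier | Description | Examples | Degradation policy
Core | Features the system must always provide | Order creation, payment processing | Never degraded
Important | Important but not strictly required | Product browsing, search | Degraded last
Non-core | Supporting features | Recommendations, log analysis | Degraded early
Decorative | Features that only enhance the user experience | Animations, personalization | Degraded first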

3.3 Implementing Automatic Degradation

3.3.1 Degradation Configuration

@Data
@Component // registers the properties class as a bean so that it can be injected
@ConfigurationProperties(prefix = "system.degrade")
public class SystemDegradeProperties {

    private boolean enabled = true;
    private List<DegradeRule> rules = new ArrayList<>();

    @Data
    public static class DegradeRule {
        private String id;
        private String feature;
        private int level; // 1 = core, 2 = important, 3 = non-core, 4 = decorative
        private String condition;
        private String action;
    }

}

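3.3.2 Degradation Service

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

@Service
@Slf4j
public class SystemDegradeService {

    private final SystemDegradeProperties properties;

    // feature name -> currently enabled?
    private final Map<String, Boolean> featureStatus = new ConcurrentHashMap<>();

    public SystemDegradeService(SystemDegradeProperties properties) {
        this.properties = properties;
        // Every configured feature starts out enabled.
        properties.getRules().forEach(rule -> featureStatus.put(rule.getFeature(), true));
    }

    public void degrade(DegradeLevel level) {
        if (!properties.isEnabled()) {
            return;
        }

        log.info("Degrading system to level: {}", level);

        // A feature stays enabled only while its tier number is below the level's cutoff.
        // Tier 1 (core) is below every cutoff, so core features are never switched off.
        // Re-evaluating every rule also re-enables features when the level eases.
        properties.getRules().forEach(rule -> {
            boolean enabled = rule.getLevel() < level.getValue();
            featureStatus.put(rule.getFeature(), enabled);
            if (!enabled) {
                log.warn("Feature {} disabled due to degradation", rule.getFeature());
            }
        });
    }

    public void recover() {
        if (!properties.isEnabled()) {
            return;
        }

        log.info("Recovering system");

        // Re-enable every configured feature.
        properties.getRules().forEach(rule -> {
            featureStatus.put(rule.getFeature(), true);
            log.info("Feature {} enabled", rule.getFeature());
        });
    }

    public boolean isFeatureEnabled(String feature) {
        // Features without a rule default to enabled.
        return featureStatus.getOrDefault(feature, true);
    }

    // Snapshot of all feature states, used by the status endpoint in SystemController.
    public Map<String, Boolean> getFeatureStatus() {
        return new HashMap<>(featureStatus);
    }

    // The value is a cutoff: rules whose tier number is >= the cutoff are disabled.
    public enum DegradeLevel {
        NONE(Integer.MAX_VALUE), // nothing disabled
        LIGHT(4),                // decorative features off
        MEDIUM(3),               // non-core and decorative features off
        HEAVY(2);                // everything except core features off

        private final int value;

        DegradeLevel(int value) {
            this.value = value;
        }

        public int getValue() {
            return value;
        }
    }

}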

4. Complete Spring Boot Implementation

4.1 Project Dependencies

<dependencies>
    <!-- Spring Boot Starter -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>

    <!-- Spring Boot Actuator -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>

    <!-- Micrometer -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
    </dependency>

    <!-- Spring Boot Configuration Processor -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-configuration-processor</artifactId>
        <optional>true</optional>
    </dependency>

    <!-- Lombok -->
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>

    <!-- Test -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
</dependencies>

4.2 Configuration File

server:
  port: 8080

spring:
  application:
    name: system-resource-monitor-demo

management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
  endpoint:
    health:
      show-details: always

# System resource monitoring configuration
system:
  monitor:
    enabled: true
    cpu:
      warning-threshold: 70
      critical-threshold: 90
    memory:
      warning-threshold: 75
      critical-threshold: 90
    disk:
      warning-threshold: 80
      critical-threshold: 90
    check-interval: 5000

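  # Automatic degradation rules. The sample service reads only `feature` and `level`;
  # `condition` and `action` document intent and are not evaluated by the code.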
  degrade:
    enabled: true
    rules:
      - id: 1
        feature: recommend
        level: 4
        condition: cpu > 80% or memory > 85%
        action: disable

      - id: 2
        feature: analytics
        level: 3
        condition: cpu > 85% or memory > 90%
        action: disable

      - id: 3
        feature: cache
        level: 2
        condition: cpu > 90% or memory > 95%
        action: disable

      - id: 4
        feature: order
        level: 1
        condition: never
        action: never
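
The condition strings in these rules are plain text, so something has to parse them before they can drive a decision. One possible approach is sketched below. Everything in it is an assumption made for illustration: the class name ConditionEvaluator, the restriction of the grammar to "name > N%" terms joined by "or" plus the literal "never", and the use of a map from metric name to current percentage.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical evaluator for conditions such as "cpu > 80% or memory > 85%".
final class ConditionEvaluator {

    // One term: metric name, '>', a number, and a trailing percent sign.
    private static final Pattern TERM = Pattern.compile("\\s*(\\w+)\\s*>\\s*(\\d+(?:\\.\\d+)?)%\\s*");

    private ConditionEvaluator() {
    }

    /** Returns true if any "or"-joined term holds for the given usage percentages. */
    static boolean matches(String condition, Map<String, Double> usagePercent) {
        if ("never".equalsIgnoreCase(condition.trim())) {
            return false;
        }
        // Split on the word "or" surrounded by whitespace, so "memory" is not split.
        for (String term : condition.split("(?i)\\s+or\\s+")) {
            Matcher m = TERM.matcher(term);
            if (!m.matches()) {
                throw new IllegalArgumentException("Unsupported condition term: " + term);
            }
            Double current = usagePercent.get(m.group(1));
            if (current == null) {
                throw new IllegalArgumentException("Unknown metric: " + m.group(1));
            }
            if (current > Double.parseDouble(m.group(2))) {
                return true;
            }
        }
        return false;
    }
}
```

A per-rule check of this kind could replace the tier-based cutoff if each feature needs its own trigger; the rest of this article keeps the simpler tier model.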

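4.3 Configuration Classes

4.3.1 Resource Monitoring Properties

@Data
@Component // registers the properties class as a bean so that it can be injected
@ConfigurationProperties(prefix = "system.monitor")
public class SystemResourceProperties {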

    private boolean enabled = true;
    private Cpu cpu = new Cpu();
    private Memory memory = new Memory();
    private Disk disk = new Disk();
    private long checkInterval = 5000;

    @Data
    public static class Cpu {
        private int warningThreshold = 70;
        private int criticalThreshold = 90;
    }

    @Data
    public static class Memory {
        private int warningThreshold = 75;
        private int criticalThreshold = 90;
    }

    @Data
    public static class Disk {
        private int warningThreshold = 80;
        private int criticalThreshold = 90;
    }

}

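4.4 Services

4.4.1 Resource Monitoring Service

import java.io.File;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.Arrays;
import java.util.Collections;

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import lombok.extern.slf4j.Slf4j;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

// Note: @Scheduled only fires if @EnableScheduling is present on a configuration class.
@Service
@Slf4j
public class SystemResourceMonitorService {

    private final SystemResourceProperties properties;
    private final SystemDegradeService degradeService;
    private final OperatingSystemMXBean osBean = ManagementFactory.getOperatingSystemMXBean();
    private final Runtime runtime = Runtime.getRuntime();

    // Latest samples in percent; the gauges below read these fields on every scrape.
    private volatile double cpuUsage;
    private volatile double memoryUsage;
    private volatile double diskUsage;

    // Level applied on the previous tick, so that we only act on transitions.
    private volatile SystemDegradeService.DegradeLevel currentLevel = SystemDegradeService.DegradeLevel.NONE;

    public SystemResourceMonitorService(SystemResourceProperties properties,
                                        SystemDegradeService degradeService,
                                        MeterRegistry meterRegistry) {
        this.properties = properties;
        this.degradeService = degradeService;
        // Register each gauge once; a boxed double passed to meterRegistry.gauge(...) on every
        // tick would never update, because Micrometer only holds a weak reference to it.
        // No CPU gauge: Actuator already exports system.cpu.usage as a ratio between 0 and 1.
        Gauge.builder("system.memory.usage", this, s -> s.memoryUsage).register(meterRegistry);
        Gauge.builder("system.disk.usage", this, s -> s.diskUsage).register(meterRegistry);
    }

    @Scheduled(fixedRateString = "${system.monitor.check-interval:5000}")
    public void monitorResources() {
        if (!properties.isEnabled()) {
            return;
        }

        cpuUsage = getCpuUsage();
        memoryUsage = getMemoryUsage();
        diskUsage = getDiskUsage();

        // Take the heaviest level any resource asks for (enum order: NONE < LIGHT < MEDIUM < HEAVY).
        SystemDegradeService.DegradeLevel level = Collections.max(Arrays.asList(
                levelFor("CPU", cpuUsage, properties.getCpu().getWarningThreshold(),
                        properties.getCpu().getCriticalThreshold(), SystemDegradeService.DegradeLevel.HEAVY),
                levelFor("Memory", memoryUsage, properties.getMemory().getWarningThreshold(),
                        properties.getMemory().getCriticalThreshold(), SystemDegradeService.DegradeLevel.HEAVY),
                levelFor("Disk", diskUsage, properties.getDisk().getWarningThreshold(),
                        properties.getDisk().getCriticalThreshold(), SystemDegradeService.DegradeLevel.MEDIUM)));

        if (level == currentLevel) {
            return; // nothing changed since the last tick
        }
        if (level.compareTo(currentLevel) < 0) {
            degradeService.recover(); // pressure eased: re-enable, then re-apply the lighter level
        }
        if (level != SystemDegradeService.DegradeLevel.NONE) {
            degradeService.degrade(level);
        }
        currentLevel = level;
    }

    private double getCpuUsage() {
        // System-wide CPU load through the com.sun.management extension (HotSpot-based JVMs).
        // getCpuLoad() needs JDK 14+; on older JDKs call getSystemCpuLoad() instead.
        if (osBean instanceof com.sun.management.OperatingSystemMXBean) {
            double load = ((com.sun.management.OperatingSystemMXBean) osBean).getCpuLoad();
            return load < 0 ? 0.0 : load * 100; // negative means "not available yet"
        }
        return 0.0; // CPU load is not available on this JVM
    }

    private double getMemoryUsage() {
        // JVM heap only: used heap relative to the maximum heap (-Xmx), not the committed size.
        long usedMemory = runtime.totalMemory() - runtime.freeMemory();
        return usedMemory * 100.0 / runtime.maxMemory();
    }

    private double getDiskUsage() {
        // Usage of the partition that holds the working directory.
        File dir = new File(".").getAbsoluteFile();
        long total = dir.getTotalSpace();
        return total == 0 ? 0.0 : (total - dir.getUsableSpace()) * 100.0 / total;
    }

    // Maps one resource reading to the degradation level it calls for.
    private SystemDegradeService.DegradeLevel levelFor(String resource, double usage, int warning,
            int critical, SystemDegradeService.DegradeLevel criticalLevel) {
        if (usage >= critical) {
            log.error("{} usage critical: {}%", resource, usage);
            return criticalLevel;
        }
        if (usage >= warning) {
            log.warn("{} usage warning: {}%", resource, usage);
            return SystemDegradeService.DegradeLevel.LIGHT;
        }
        return SystemDegradeService.DegradeLevel.NONE;
    }

}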

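4.4.2 Automatic Degradation Service

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

@Service
@Slf4j
public class SystemDegradeService {

    private final SystemDegradeProperties properties;

    // feature name -> currently enabled?
    private final Map<String, Boolean> featureStatus = new ConcurrentHashMap<>();

    public SystemDegradeService(SystemDegradeProperties properties) {
        this.properties = properties;
        // Every configured feature starts out enabled.
        properties.getRules().forEach(rule -> featureStatus.put(rule.getFeature(), true));
    }

    public void degrade(DegradeLevel level) {
        if (!properties.isEnabled()) {
            return;
        }

        log.info("Degrading system to level: {}", level);

        // A feature stays enabled only while its tier number is below the level's cutoff.
        // Tier 1 (core) is below every cutoff, so core features are never switched off.
        // Re-evaluating every rule also re-enables features when the level eases.
        properties.getRules().forEach(rule -> {
            boolean enabled = rule.getLevel() < level.getValue();
            featureStatus.put(rule.getFeature(), enabled);
            if (!enabled) {
                log.warn("Feature {} disabled due to degradation", rule.getFeature());
            }
        });
    }

    public void recover() {
        if (!properties.isEnabled()) {
            return;
        }

        log.info("Recovering system");

        // Re-enable every configured feature.
        properties.getRules().forEach(rule -> {
            featureStatus.put(rule.getFeature(), true);
            log.info("Feature {} enabled", rule.getFeature());
        });
    }

    public boolean isFeatureEnabled(String feature) {
        // Features without a rule default to enabled.
        return featureStatus.getOrDefault(feature, true);
    }

    // Snapshot of all feature states, used by the status endpoint in SystemController.
    public Map<String, Boolean> getFeatureStatus() {
        return new HashMap<>(featureStatus);
    }

    // The value is a cutoff: rules whose tier number is >= the cutoff are disabled.
    public enum DegradeLevel {
        NONE(Integer.MAX_VALUE), // nothing disabled
        LIGHT(4),                // decorative features off
        MEDIUM(3),               // non-core and decorative features off
        HEAVY(2);                // everything except core features off

        private final int value;

        DegradeLevel(int value) {
            this.value = value;
        }

        public int getValue() {
            return value;
        }
    }

}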

4.5 Controllers

4.5.1 System Resource Controller

@RestController
@RequestMapping("/api/system")
@Slf4j
public class SystemController {

    @Autowired
    private SystemResourceMonitorService monitorService;

    @Autowired
    private SystemDegradeService degradeService;

    @GetMapping("/status")
    public Map<String, Object> getSystemStatus() {
        Map<String, Object> status = new HashMap<>();
        status.put("cpuUsage", getCpuUsage());
        status.put("memoryUsage", getMemoryUsage());
        status.put("diskUsage", getDiskUsage());
        status.put("featureStatus", degradeService.getFeatureStatus());
        return status;
    }

    @PostMapping("/degrade/light")
    public void degradeLight() {
        degradeService.degrade(SystemDegradeService.DegradeLevel.LIGHT);
    }

    @PostMapping("/degrade/medium")
    public void degradeMedium() {
        degradeService.degrade(SystemDegradeService.DegradeLevel.MEDIUM);
    }

    @PostMapping("/degrade/heavy")
    public void degradeHeavy() {
        degradeService.degrade(SystemDegradeService.DegradeLevel.HEAVY);
    }

    @PostMapping("/recover")
    public void recover() {
        degradeService.recover();
    }

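    private double getCpuUsage() {
        // System-wide CPU load through the com.sun.management extension (HotSpot-based JVMs).
        // getCpuLoad() needs JDK 14+; on older JDKs call getSystemCpuLoad() instead.
        java.lang.management.OperatingSystemMXBean os =
                java.lang.management.ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            double load = ((com.sun.management.OperatingSystemMXBean) os).getCpuLoad();
            return load < 0 ? 0.0 : load * 100; // negative means "not available yet"
        }
        return 0.0; // CPU load is not available on this JVM
    }

    private double getMemoryUsage() {
        // JVM heap only: used heap relative to the maximum heap (-Xmx), not the committed size.
        Runtime runtime = Runtime.getRuntime();
        long usedMemory = runtime.totalMemory() - runtime.freeMemory();
        return usedMemory * 100.0 / runtime.maxMemory();
    }

    private double getDiskUsage() {
        // Usage of the partition that holds the working directory.
        java.io.File dir = new java.io.File(".").getAbsoluteFile();
        long total = dir.getTotalSpace();
        return total == 0 ? 0.0 : (total - dir.getUsableSpace()) * 100.0 / total;
    }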

}

4.5.2 Business Feature Controller

@RestController
@RequestMapping("/api/feature")
@Slf4j
public class FeatureController {

    @Autowired
    private SystemDegradeService degradeService;

    @GetMapping("/recommend")
    public Map<String, Object> getRecommendations() {
        if (!degradeService.isFeatureEnabled("recommend")) {
            return Map.of("status", "degraded", "message", "Recommendations are temporarily disabled");
        }

        // Recommendation logic goes here
        return Map.of("status", "ok", "recommendations", List.of("Product 1", "Product 2"));
    }

    @GetMapping("/analytics")
    public Map<String, Object> getAnalytics() {
        if (!degradeService.isFeatureEnabled("analytics")) {
            return Map.of("status", "degraded", "message", "Analytics are temporarily disabled");
        }

        // Analytics logic goes here
        return Map.of("status", "ok", "analytics", Map.of("clicks", 100, "views", 1000));
    }

    @GetMapping("/cache")
    public Map<String, Object> getCache() {
        if (!degradeService.isFeatureEnabled("cache")) {
            return Map.of("status", "degraded", "message", "The cache feature is temporarily disabled");
        }

        // Cache logic goes here
        return Map.of("status", "ok", "cache", Map.of("key", "value"));
    }

    @PostMapping("/order")
    public Map<String, Object> createOrder(@RequestBody Map<String, Object> order) {
        if (!degradeService.isFeatureEnabled("order")) {
            return Map.of("status", "degraded", "message", "Ordering is temporarily disabled");
        }

        // Order creation logic goes here
        return Map.of("status", "ok", "orderId", "123456");
    }

}
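
Each endpoint in this controller repeats the same "check the flag, else return a fallback" pattern. One way to factor it out is sketched below. The class name FeatureGate and its call method are hypothetical; the only thing assumed from the article's code is a predicate with the shape of isFeatureEnabled(String).

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical wrapper that runs an action only while its feature is enabled.
final class FeatureGate {

    private final Predicate<String> enabledCheck;

    FeatureGate(Predicate<String> enabledCheck) {
        this.enabledCheck = enabledCheck;
    }

    /** Runs action if the feature is enabled, otherwise returns the fallback's result. */
    <T> T call(String feature, Supplier<T> action, Supplier<T> fallback) {
        return enabledCheck.test(feature) ? action.get() : fallback.get();
    }
}
```

In the controller this could be constructed as new FeatureGate(degradeService::isFeatureEnabled), after which each handler becomes a single call with the live logic and the degraded response as its two suppliers.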

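5. Monitoring and Alerting

5.1 Prometheus Metrics

The Prometheus endpoint exposes the following metrics. Note that system_cpu_usage is the gauge Micrometer exports by default, and it is a ratio between 0 and 1 rather than a percentage.

system_cpu_usage
system_memory_usage
system_disk_usage

5.2 Grafana Dashboard

A dashboard for this setup would typically contain:

  • A resource usage panel
  • A feature degradation status panel
  • Resource trend charts
  • An alert history panel

5.3 Alert Rules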

groups:
  - name: system_resource_alerts
    rules:
      - alert: HighCpuUsage
        expr: system_cpu_usage * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage has been above 80% for more than 5 minutes"

      - alert: CriticalCpuUsage
        expr: system_cpu_usage * 100 > 90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU usage"
          description: "CPU usage has been above 90% for more than 2 minutes"

      - alert: HighMemoryUsage
        expr: system_memory_usage > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage has been above 85% for more than 5 minutes"

      - alert: CriticalMemoryUsage
        expr: system_memory_usage > 90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critical memory usage"
          description: "Memory usage has been above 90% for more than 2 minutes"

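6. Best Practices

6.1 Setting Thresholds Sensibly

Principles

  • Base thresholds on how the system actually behaves
  • Take peak load into account
  • Leave enough headroom
  • Revisit thresholds regularly

Recommendations

  • CPU usage: warning at 70%, critical at 90%
  • Memory usage: warning at 75%, critical at 90%
  • Disk usage: warning at 80%, critical at 90%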
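
Leaving headroom also means not reacting to a single spike. One common way to do that is to compare a short moving average, rather than each raw sample, against the thresholds. The sketch below illustrates the idea; the class name SlidingAverage and the window size are assumptions, not something the article's code contains.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical moving average over the most recent N usage samples.
final class SlidingAverage {

    private final int windowSize;
    private final Deque<Double> samples = new ArrayDeque<>();
    private double sum;

    SlidingAverage(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
    }

    /** Adds a sample and returns the mean of the most recent windowSize samples. */
    synchronized double add(double sample) {
        samples.addLast(sample);
        sum += sample;
        if (samples.size() > windowSize) {
            sum -= samples.removeFirst(); // drop the oldest sample from the window
        }
        return sum / samples.size();
    }
}
```

With a window of three samples at a 5-second interval, a single reading of 99% between two readings of 60% averages to 73%, which stays below a 90% critical threshold.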

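6.2 Feature Tiers

Principles

  • Draw a clear line between core and non-core features
  • Assign sensible priorities to features
  • Protect the availability of core features
  • Allow non-core features to be degraded

Recommendations

  • Core: orders, payment, login
  • Important: product browsing, search
  • Non-core: recommendations, analytics
  • Decorative: animations, personalization

6.3 Degradation Strategy

Principles

  • Degrade progressively
  • Degrade non-core features first
  • Keep core features running
  • Log every degradation event

Recommendations

  • Light degradation: switch off decorative features
  • Medium degradation: also switch off non-core features
  • Heavy degradation: also switch off important features, keeping only core features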
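
The three levels above can be read as a cutoff on the tier number, where 1 is core and 4 is decorative. The following sketch shows that reading in isolation. The names TierCutoff, Level, and apply are hypothetical, and the cutoff values are chosen here to match the list above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of "degradation level = cutoff on the tier number".
final class TierCutoff {

    enum Level {
        NONE(Integer.MAX_VALUE), // nothing disabled
        LIGHT(4),                // tier 4 (decorative) off
        MEDIUM(3),               // tiers 3 and 4 off
        HEAVY(2);                // tiers 2 to 4 off, only tier 1 (core) remains

        final int cutoff;

        Level(int cutoff) {
            this.cutoff = cutoff;
        }
    }

    private TierCutoff() {
    }

    /** Returns feature -> enabled for the given level; a feature is enabled while tier < cutoff. */
    static Map<String, Boolean> apply(Map<String, Integer> tiers, Level level) {
        Map<String, Boolean> enabled = new LinkedHashMap<>();
        tiers.forEach((feature, tier) -> enabled.put(feature, tier < level.cutoff));
        return enabled;
    }
}
```

Because the core tier is 1 and the smallest cutoff is 2, no level can switch off a core feature, which is the property the principles above ask for.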

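6.4 Recovery Strategy

Principles

  • Recover automatically
  • Recover progressively
  • Verify that recovery had the intended effect
  • Log every recovery event

Recommendations

  • Restore features automatically once resource usage is back in the normal range
  • Restore important features first (core features are never switched off, so they need no recovery)
  • Restore non-core and decorative features last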
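
If features are restored the moment usage dips below the threshold, the system can flap between degraded and normal. A common remedy is hysteresis: degrade at one threshold, but recover only after usage has stayed below a lower threshold for several consecutive samples. The sketch below illustrates this under stated assumptions; the class name HysteresisSwitch and the example numbers (degrade at 90%, recover below 75%, two calm samples) are not from the article.

```java
// Hypothetical two-threshold switch that avoids flapping around a single threshold.
final class HysteresisSwitch {

    private final double degradeAt;
    private final double recoverBelow;
    private final int requiredCalmSamples;
    private boolean degraded;
    private int calmSamples;

    HysteresisSwitch(double degradeAt, double recoverBelow, int requiredCalmSamples) {
        if (recoverBelow >= degradeAt || requiredCalmSamples < 1) {
            throw new IllegalArgumentException("need recoverBelow < degradeAt and at least one calm sample");
        }
        this.degradeAt = degradeAt;
        this.recoverBelow = recoverBelow;
        this.requiredCalmSamples = requiredCalmSamples;
    }

    /** Feeds one usage sample and returns whether the system should currently be degraded. */
    synchronized boolean update(double usagePercent) {
        if (usagePercent >= degradeAt) {
            degraded = true;
            calmSamples = 0;
        } else if (degraded) {
            // Count consecutive samples below the recovery threshold; any other sample resets the count.
            calmSamples = usagePercent < recoverBelow ? calmSamples + 1 : 0;
            if (calmSamples >= requiredCalmSamples) {
                degraded = false;
                calmSamples = 0;
            }
        }
        return degraded;
    }
}
```

One such switch per degradation level would also give progressive recovery, since lighter levels can be configured with lower recovery thresholds.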

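7. Summary

Resource watermark monitoring and automatic degradation are important building blocks of a highly available system. In a real project, set the monitoring thresholds according to the system's characteristics and business needs, define the feature tiers clearly, and choose a degradation strategy that lets the system shed load automatically when resources run short, so that core features keep working.

Discussion

  1. How does your project handle resource pressure?
  2. What do you see as the biggest challenge in automatic degradation?
  3. Have you ever had a system crash because it ran out of resources?

Feel free to share your thoughts in the comments. For more technical articles, follow the WeChat official account 服务端技术精选.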


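Title: Spring Boot + System Resource Watermark Monitoring + Automatic Degradation: Switching Off Non-Core Features When CPU or Memory Crosses a Threshold
Author: jiangyi
URL: http://jiangyi.space/articles/2026/04/03/1774780016743.html
WeChat official account: 服务端技术精选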