Prometheus Alerting Rules Configuration Guide

This article explains how to write Prometheus alerting rules and walks through concrete examples. The rules cover failed pings, service outages, and high CPU, memory, and disk usage.

Configuration File Structure

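A Prometheus alerting rule file contains one or more rule groups, and each group contains one or more alerting rules. The following is an example configuration file: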

groups:
  - name: node-rules
    rules:
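      - alert: PingFail # host does not respond to ping
        expr: probe_success == 0 # alert condition, written as a PromQL query
        for: 1m # how long the condition must hold before the alert fires
        labels: # labels attached to the alert
          severity: error
        annotations: # annotations: human-readable details about the alert
          summary: "{{$labels.instance}} is not responding to ping"
          description: "Category {{$labels.job}}, instance {{$labels.instance}} is not responding to ping"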
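      - alert: InstanceDown # service is down
        expr: up == 0 # alert condition, written as a PromQL query
        for: 1m # how long the condition must hold before the alert fires
        labels: # labels attached to the alert
          severity: error
        annotations: # annotations: human-readable details about the alert
          summary: "{{$labels.instance}} is down"
          description: "Category {{$labels.application}}, instance {{$labels.instance}} is down"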
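      - alert: CpuHigh   # CPU usage too high
        expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[2m])) by (instance)) * 100 > 80
        for: 1m # how long the condition must hold before the alert fires
        labels: # labels attached to the alert
          severity: warn
        annotations: # annotations: human-readable details about the alert
          summary: "{{$labels.instance}} CPU usage above 80%"
          description: "Category {{$labels.application}}, instance {{$labels.instance}}: average CPU usage over 2 minutes above 80%, current value {{humanize $value }}%"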
      - alert: WindowCpuHigh   # Windows CPU usage too high
        expr: 100 - (avg by (instance) (irate(windows_cpu_time_total{job=~"window",mode="idle"}[2m])) * 100) > 80
        for: 1m # how long the condition must hold before the alert fires
        labels: # labels attached to the alert
          severity: warn
        annotations: # annotations: human-readable details about the alert
          summary: "{{$labels.instance}} CPU usage above 80%"
          description: "Category {{$labels.application}}, instance {{$labels.instance}}: average CPU usage over 2 minutes above 80%, current value {{humanize $value }}%"
      - alert: MemoryHigh   # memory usage too high
        expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes)/ (node_memory_MemTotal_bytes)* 100 > 90
        for: 1m # how long the condition must hold before the alert fires
        labels: # labels attached to the alert
          severity: warn
        annotations: # annotations: human-readable details about the alert
          summary: "{{$labels.instance}} memory usage above 90%"
          description: "Category {{$labels.application}}, instance {{$labels.instance}}: memory usage above 90%, current value={{humanize $value }}"
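      - alert: WindowMemoryHigh   # Windows memory usage too high
        # Both metrics are gauges, so they are compared directly; irate() only makes sense for counters.
        expr: (100 - 100 * windows_os_physical_memory_free_bytes{job=~"window"} / windows_cs_physical_memory_bytes{job=~"window"}) > 90
        for: 1m # how long the condition must hold before the alert fires
        labels: # labels attached to the alert
          severity: warn
        annotations: # annotations: human-readable details about the alert
          summary: "{{$labels.instance}} memory usage above 90%"
          description: "Category {{$labels.application}}, instance {{$labels.instance}}: memory usage above 90%, current value={{humanize $value }}"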
      - alert: DiskHigh   # disk usage too high
        expr: (((topk(1, node_filesystem_size_bytes{job="node",fstype=~"ext.?|xfs",mountpoint!~".*pods.*"}) by (instance))-(node_filesystem_free_bytes{job="node",fstype=~"ext.*|xfs",mountpoint!~".*pods.*"} AND topk(1,node_filesystem_size_bytes{job="node",fstype=~"ext.?|xfs",mountpoint!~".*pods.*"}) by (instance)))*100 /((node_filesystem_avail_bytes{job="node",fstype=~"ext.*|xfs",mountpoint!~".*pods.*"} AND topk(1,node_filesystem_size_bytes{job="node",fstype=~"ext.?|xfs",mountpoint!~".*pods.*"}) by (instance))+((topk(1, node_filesystem_size_bytes{job="node",fstype=~"ext.?|xfs",mountpoint!~".*pods.*"}) by (instance))- (node_filesystem_free_bytes{job="node",fstype=~"ext.*|xfs",mountpoint!~".*pods.*"} AND topk(1, node_filesystem_size_bytes{job="node",fstype=~"ext.?|xfs",mountpoint!~".*pods.*"}) by (instance))))) > 90
        for: 1m # how long the condition must hold before the alert fires
        labels: # labels attached to the alert
          severity: warn
        annotations: # annotations: human-readable details about the alert
          summary: "{{$labels.device}} on {{$labels.instance}}: disk usage above 90%"
          description: "Category {{$labels.application}}, {{$labels.device}} on instance {{$labels.instance}}: disk usage above 90%, current value {{humanize $value }}%"
      - alert: WindowDiskHigh   # Windows disk usage too high
        expr: (100 - (windows_logical_disk_free_bytes{volume=~"[C-F]:"} / windows_logical_disk_size_bytes{volume=~"[C-F]:"})*100)  > 90
        for: 1m # how long the condition must hold before the alert fires
        labels: # labels attached to the alert
          severity: warn
        annotations: # annotations: human-readable details about the alert
          summary: "Volume {{$labels.volume}} on {{$labels.instance}}: disk usage above 90%"
          description: "Category {{$labels.application}}, volume {{$labels.volume}} on instance {{$labels.instance}}: disk usage above 90%, current value {{humanize $value }}%"
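
Prometheus only evaluates these rules if the rule file is listed under rule_files in its main configuration. A minimal sketch, assuming the rules above are saved as rules/node-rules.yml next to prometheus.yml and that an Alertmanager listens on localhost:9093 (the path, the interval, and the address are all assumptions, not taken from this article):

```yaml
# prometheus.yml (excerpt)
global:
  evaluation_interval: 1m   # how often rule expressions are evaluated (assumed value)

rule_files:
  - "rules/node-rules.yml"  # assumed path to the rule file shown above

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]  # assumed Alertmanager address
```

Running promtool check rules rules/node-rules.yml before reloading Prometheus should catch syntax errors in the rule file.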

Rule Explanations

1. PingFail

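Alert name: PingFail
Expression: probe_success == 0
Duration: 1 minute
Labels: severity: error
Annotations:
- Summary: {{$labels.instance}} is not responding to ping
- Description: Category {{$labels.job}}, instance {{$labels.instance}} is not responding to ping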
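
The probe_success metric is exported by the blackbox exporter, so this rule only has data if such probes are being scraped. A rough sketch of a matching scrape job, assuming a blackbox exporter running on localhost:9115 with an ICMP module named icmp, and using a placeholder target address (the job name, module name, exporter address, and target are all assumptions):

```yaml
# prometheus.yml (excerpt)
scrape_configs:
  - job_name: blackbox-icmp            # hypothetical job name
    metrics_path: /probe
    params:
      module: [icmp]                   # assumed module name from blackbox.yml
    static_configs:
      - targets: ["192.0.2.10"]        # placeholder host to ping
    relabel_configs:
      # Pass the original target to the exporter as the ?target= parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed host as the instance label used in the annotations.
      - source_labels: [__param_target]
        target_label: instance
      # Send the scrape itself to the blackbox exporter.
      - target_label: __address__
        replacement: "localhost:9115"  # assumed exporter address
```

With this relabeling, {{$labels.instance}} in the alert refers to the probed host rather than to the exporter.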

2. InstanceDown

Alert name: InstanceDown
Expression: up == 0
Duration: 1 minute
Labels: severity: error
Annotations:
- Summary: {{$labels.instance}} is down
- Description: Category {{$labels.application}}, instance {{$labels.instance}} is down

3. CpuHigh

Alert name: CpuHigh
Expression: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[2m])) by (instance)) * 100 > 80
Duration: 1 minute
Labels: severity: warn
Annotations:
- Summary: {{$labels.instance}} CPU usage above 80%
- Description: Category {{$labels.application}}, instance {{$labels.instance}}: average CPU usage over 2 minutes above 80%, current value {{humanize $value }}%
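
If the same CPU expression is also needed elsewhere, for example in dashboards, it could be precomputed with a recording rule and the alert written against the recorded series. This is only a sketch of that variant, not the configuration used in this article; the group name and the series name instance:node_cpu_usage:percent are made up for illustration:

```yaml
groups:
  - name: node-cpu-sketch                          # hypothetical group name
    rules:
      # Precompute per-instance CPU usage in percent.
      - record: instance:node_cpu_usage:percent    # hypothetical series name
        expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m]))) * 100
      # Alert on the recorded series instead of repeating the full expression.
      - alert: CpuHigh
        expr: instance:node_cpu_usage:percent > 80
        for: 1m
        labels:
          severity: warn
```

This variant would replace the CpuHigh rule above rather than sit alongside it, since both use the same alert name.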

4. WindowCpuHigh

Alert name: WindowCpuHigh
Expression: 100 - (avg by (instance) (irate(windows_cpu_time_total{job=~"window",mode="idle"}[2m])) * 100) > 80
Duration: 1 minute
Labels: severity: warn
Annotations:
- Summary: {{$labels.instance}} CPU usage above 80%
- Description: Category {{$labels.application}}, instance {{$labels.instance}}: average CPU usage over 2 minutes above 80%, current value {{humanize $value }}%

5. MemoryHigh

Alert name: MemoryHigh
Expression: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes)/ (node_memory_MemTotal_bytes)* 100 > 90
Duration: 1 minute
Labels: severity: warn
Annotations:
- Summary: {{$labels.instance}} memory usage above 90%
- Description: Category {{$labels.application}}, instance {{$labels.instance}}: memory usage above 90%, current value={{humanize $value }}

6. WindowMemoryHigh

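Alert name: WindowMemoryHigh
Expression: (100 - 100 * windows_os_physical_memory_free_bytes{job=~"window"} / windows_cs_physical_memory_bytes{job=~"window"}) > 90
Both metrics are gauges, so they are compared directly without irate(), which only applies to counters.
Duration: 1 minute
Labels: severity: warn
Annotations:
- Summary: {{$labels.instance}} memory usage above 90%
- Description: Category {{$labels.application}}, instance {{$labels.instance}}: memory usage above 90%, current value={{humanize $value }}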

7. DiskHigh

Alert name: DiskHigh
Expression: (((topk(1, node_filesystem_size_bytes{job="node",fstype=~"ext.?|xfs",mountpoint!~".*pods.*"}) by (instance))-(node_filesystem_free_bytes{job="node",fstype=~"ext.*|xfs",mountpoint!~".*pods.*"} AND topk(1,node_filesystem_size_bytes{job="node",fstype=~"ext.?|xfs",mountpoint!~".*pods.*"}) by (instance)))*100 /((node_filesystem_avail_bytes{job="node",fstype=~"ext.*|xfs",mountpoint!~".*pods.*"} AND topk(1,node_filesystem_size_bytes{job="node",fstype=~"ext.?|xfs",mountpoint!~".*pods.*"}) by (instance))+((topk(1, node_filesystem_size_bytes{job="node",fstype=~"ext.?|xfs",mountpoint!~".*pods.*"}) by (instance))- (node_filesystem_free_bytes{job="node",fstype=~"ext.*|xfs",mountpoint!~".*pods.*"} AND topk(1, node_filesystem_size_bytes{job="node",fstype=~"ext.?|xfs",mountpoint!~".*pods.*"}) by (instance))))) > 90
Duration: 1 minute
Labels: severity: warn
Annotations:
- Summary: {{$labels.device}} on {{$labels.instance}}: disk usage above 90%
- Description: Category {{$labels.application}}, {{$labels.device}} on instance {{$labels.instance}}: disk usage above 90%, current value {{humanize $value }}%
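
Stripped of the topk(1, ...) selection, which keeps only the largest filesystem per instance, the DiskHigh expression reduces to used * 100 / (avail + used), where used = size - free. The sketch below writes that formula out for every matching filesystem instead of only the largest one, so it is a different rule and not a drop-in replacement; the group name and the alert name DiskHighAllFilesystems are hypothetical:

```yaml
groups:
  - name: node-disk-sketch              # hypothetical group name
    rules:
      - alert: DiskHighAllFilesystems   # hypothetical alert name
        # used * 100 / (avail + used), with used = size - free
        expr: |
          (
            node_filesystem_size_bytes{job="node",fstype=~"ext.?|xfs",mountpoint!~".*pods.*"}
            - node_filesystem_free_bytes{job="node",fstype=~"ext.?|xfs",mountpoint!~".*pods.*"}
          ) * 100
          /
          (
            node_filesystem_avail_bytes{job="node",fstype=~"ext.?|xfs",mountpoint!~".*pods.*"}
            + (
              node_filesystem_size_bytes{job="node",fstype=~"ext.?|xfs",mountpoint!~".*pods.*"}
              - node_filesystem_free_bytes{job="node",fstype=~"ext.?|xfs",mountpoint!~".*pods.*"}
            )
          ) > 90
        for: 1m
        labels:
          severity: warn
```

Dividing by avail + used rather than by size leaves out blocks reserved for root, since those count toward free but not toward avail.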

8. WindowDiskHigh

Alert name: WindowDiskHigh
Expression: (100 - (windows_logical_disk_free_bytes{volume=~"[C-F]:"} / windows_logical_disk_size_bytes{volume=~"[C-F]:"})*100) > 90
Duration: 1 minute
Labels: severity: warn
Annotations:
- Summary: Volume {{$labels.volume}} on {{$labels.instance}}: disk usage above 90%
- Description: Category {{$labels.application}}, volume {{$labels.volume}} on instance {{$labels.instance}}: disk usage above 90%, current value {{humanize $value }}%

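With the configuration above, you can continuously monitor resource usage across your systems and receive an alert as soon as a metric crosses its threshold. This lets you act before a problem escalates and helps keep the system running reliably.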

老陕小张学技术接地气
