Building SLOs for a CI/CD Platform in Practice

2025-04-22

Applied to the CI/CD platform: https://horizoncd.github.io/

1. Dashboard

Grafana dashboard definition:
https://github.com/ryanwuer/helm-charts/blob/main/charts/horizon/files/slo-dashboard.json

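2. Metrics

The core metrics fall into three groups:

  1. Platform API
  2. CI/CD capability
  3. K8s scheduling

The dashboard above reports each SLO over 1h, 1d, and 30d windows, and each window carries a different guarantee level. The rest of this post uses the 1d window as the example.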

2.1 Platform API

2.1.1 Base metrics

# Request count, Counter type
horizon_request_total

# Request latency, Histogram type
horizon_request_duration_seconds
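
The aggregation rules that follow rely on one property of Prometheus histograms: a `_bucket{le="x"}` series counts every observation whose value is at most x (buckets are cumulative), while `_count` counts all observations. Subtracting one from the other therefore gives the number of requests slower than x. A minimal Python sketch of that arithmetic follows; the bucket boundaries and the sample latencies are made up for illustration and are not taken from the platform.

```python
from bisect import bisect_right

# Hypothetical bucket upper bounds in seconds. The real boundaries of
# horizon_request_duration_seconds are not given in this post.
BOUNDS = [0.1, 0.5, 1.0, 5.0]


def cumulative_buckets(latencies):
    """Return {le: number of observations <= le}, like Prometheus buckets."""
    ordered = sorted(latencies)
    buckets = {le: bisect_right(ordered, le) for le in BOUNDS}
    buckets[float("inf")] = len(ordered)  # the +Inf bucket equals _count
    return buckets


def slow_requests(latencies, threshold):
    """Requests slower than `threshold`: _count minus the le=threshold bucket."""
    return len(latencies) - cumulative_buckets(latencies)[threshold]


# Illustrative latencies in seconds.
sample = [0.05, 0.3, 0.9, 1.0, 1.2, 4.0, 7.5]
print(slow_requests(sample, 1.0))  # reads slower than 1 s
print(slow_requests(sample, 5.0))  # writes slower than 5 s
```

The threshold has to coincide with an existing bucket boundary, which is why the rules below query `le="1"` and `le="5"` directly.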

2.1.2 Aggregation rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: horizon-core-api-availability
  namespace: horizoncd
spec:
  groups:
  - interval: 3m
    name: horizon-1d-availability.rules
    rules:
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="GET",code=~"2.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="POST",code=~"2.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="PUT",code=~"2.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="DELETE",code=~"2.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="GET",code=~"3.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="POST",code=~"3.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="PUT",code=~"3.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="DELETE",code=~"3.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="GET",code=~"4.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="POST",code=~"4.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="PUT",code=~"4.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="DELETE",code=~"4.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="GET",code=~"5.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="POST",code=~"5.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="PUT",code=~"5.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="DELETE",code=~"5.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code) (code_verb:horizon_request_total:increase1d{verb=~"GET"})
      labels:
        verb: read
      record: code:horizon_request_total:increase1d
    - expr: sum by (code) (code_verb:horizon_request_total:increase1d{verb=~"POST|PUT|DELETE"})
      labels:
        verb: write
      record: code:horizon_request_total:increase1d
    - expr: |-
        1 - (
          (
            # write too slow
            sum(increase(horizon_request_duration_seconds_count{verb=~"POST|PUT|DELETE"}[1d]))
            -
            sum(increase(horizon_request_duration_seconds_bucket{verb=~"POST|PUT|DELETE",le="5"}[1d]))
          ) +
          (
            # read too slow
            sum(increase(horizon_request_duration_seconds_count{verb=~"GET"}[1d]))
            -
            sum(increase(horizon_request_duration_seconds_bucket{verb=~"GET",le="1"}[1d]))
          ) +
          # errors
          sum(code:horizon_request_total:increase1d{code=~"5.."} or vector(0))
        )
        /
        sum(code:horizon_request_total:increase1d)
      labels:
        range: 1d
        verb: all
      record: horizon_request:availability
    - expr: |-
        1 - (
          sum(increase(horizon_request_duration_seconds_count{verb=~"GET"}[1d]))
          -
          sum(increase(horizon_request_duration_seconds_bucket{verb=~"GET",le="1"}[1d]))
          +
          # errors
          sum(code:horizon_request_total:increase1d{verb="read",code=~"5.."} or vector(0))
        )
        /
        sum(code:horizon_request_total:increase1d{verb="read"})
      labels:
        range: 1d
        verb: read
      record: horizon_request:availability
    - expr: |-
        1 - (
          (
            # too slow
            sum(increase(horizon_request_duration_seconds_count{verb=~"POST|PUT|DELETE"}[1d]))
            -
            sum(increase(horizon_request_duration_seconds_bucket{verb=~"POST|PUT|DELETE",le="5"}[1d]))
          )
          +
          # errors
          sum(code:horizon_request_total:increase1d{verb="write",code=~"5.."} or vector(0))
        )
        /
        sum(code:horizon_request_total:increase1d{verb="write"})
      labels:
        range: 1d
        verb: write
      record: horizon_request:availability
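
The three `horizon_request:availability` rules share one formula: a request is bad if it returned a 5xx or exceeded its latency threshold (1 s for reads, 5 s for writes), and availability is one minus the bad fraction. A minimal Python sketch of that formula with invented counts; the function and parameter names are illustrative only.

```python
def availability(total, errors_5xx, slow_read, slow_write):
    """1 - (slow reads + slow writes + 5xx errors) / total requests."""
    if total == 0:
        # Undefined without traffic; PromQL would give NaN or no sample here.
        return None
    bad = slow_read + slow_write + errors_5xx
    return 1 - bad / total


# Invented daily counts: 10000 requests, 5 errors, 20 slow reads, 5 slow writes.
print(availability(total=10000, errors_5xx=5, slow_read=20, slow_write=5))
```

As in the rule itself, a request that is both slow and a 5xx would be subtracted twice if the histogram also records failed requests, so the result is a slightly conservative estimate.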

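2.2 CI/CD capability

2.2.1 Base metrics

CI is provided by Tekton.

# Duration of each Tekton Step, Histogram type
horizon_step_duration_seconds

2.2.2 Aggregation rules

The SLO covers three stages: pulling code, building the image, and deploying. The compile stage depends heavily on the user's own code, so it is excluded from the SLO.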

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: horizon-core-pipeline-availability
  namespace: horizoncd
spec:
  groups:
  - interval: 5m
    name: pipeline-1d-availability.rules
    rules:
    - expr: |-
        sum(increase(horizon_step_duration_seconds_count{result="ok",step="git"}[1d]))
        /
        sum(increase(horizon_step_duration_seconds_count{step="git"}[1d]) > 0) or vector(1)
      labels:
        range: 1d
      record: horizon_git:availability
    - expr: |-
        sum(increase(horizon_step_duration_seconds_bucket{result="ok",step="git",le="20"}[1d]))
        /
        sum(increase(horizon_step_duration_seconds_count{result="ok",step="git"}[1d]) > 0) or vector(1)
      labels:
        range: 1d
      record: horizon_git_rt:availability
    - expr: |-
        sum(increase(horizon_step_duration_seconds_count{result="ok",step="image"}[1d]))
        /
        sum(increase(horizon_step_duration_seconds_count{step="image"}[1d]) > 0) or vector(1)
      labels:
        range: 1d
      record: horizon_image:availability
    - expr: |-
        sum(increase(horizon_step_duration_seconds_bucket{result="ok",step="image",le="120"}[1d]))
        /
        sum(increase(horizon_step_duration_seconds_count{result="ok",step="image"}[1d]) > 0) or vector(1)
      labels:
        range: 1d
      record: horizon_image_rt:availability
    - expr: |-
        sum(increase(horizon_step_duration_seconds_bucket{le="240",step="deploy",result="ok"}[1d]))
        /
        sum(increase(horizon_step_duration_seconds_bucket{le="240",step="deploy"}[1d]) > 0) or vector(1)
      labels:
        range: 1d
      record: horizon_deploy:availability
    - expr: |-
        sum(increase(horizon_step_duration_seconds_bucket{step="deploy",result="ok",le="40"}[1d]))
        /
        sum(increase(horizon_step_duration_seconds_count{step="deploy",result="ok"}[1d]) > 0) or vector(1)
      labels:
        range: 1d
      record: horizon_deploy_rt:availability
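
Each stage above gets two ratios: a success ratio (successful runs over all runs) and a response-time ratio (successful runs that finished within the threshold over all successful runs). The trailing `or vector(1)` makes the ratio default to 1 when there were no runs in the window. A minimal Python sketch of that pattern as used for the git and image steps; it treats the counts as plain numbers and does not model the case where a series is absent altogether. All names and figures are illustrative.

```python
def ratio_or_one(numerator, denominator):
    """numerator / denominator, or 1.0 when nothing ran in the window.

    Mirrors the shape `sum(a) / sum(b > 0) or vector(1)`.
    """
    if denominator <= 0:
        return 1.0
    return numerator / denominator


def step_slos(ok_runs, total_runs, ok_runs_within_rt):
    """Success ratio and response-time ratio for one pipeline step."""
    return {
        "availability": ratio_or_one(ok_runs, total_runs),
        "rt_availability": ratio_or_one(ok_runs_within_rt, ok_runs),
    }


# Invented figures for the git step: 200 runs, 196 succeeded, 190 within 20 s.
print(step_slos(ok_runs=196, total_runs=200, ok_runs_within_rt=190))
# A day without any pipeline runs counts as fully available.
print(step_slos(ok_runs=0, total_runs=0, ok_runs_within_rt=0))
```

Defaulting to 1 keeps a quiet day from showing up as an outage on the dashboard, at the cost of hiding the fact that nothing was measured.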

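2.3 K8s scheduling

These SLOs are built on metrics exposed by kube-state-metrics, such as container start time and Pod Pending status.

2.3.1 Container startup time

Container startup time is the difference between the moment the container process started and the moment the Pod was scheduled. It measures how quickly the container environment is prepared (image pull, IP allocation, volume mounts, and so on).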

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: horizon-core-container-create-availability
  namespace: horizoncd
spec:
  groups:
  - interval: 5m
    name: horizon-container-create-1d-availability.rules
    rules:
    - expr: |-
        (
          avg(kube_pod_container_state_started) by (pod, container) > (time() - 86400)
        )
          - ignoring(container) group_left()
        (
          avg(kube_pod_status_scheduled_time) by (pod) + (sum(kube_pod_container_status_restarts_total) by(pod) == 0)
        )
      record: pod_container:container_start:duration1d
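    - expr: |-
        # count() over an empty vector returns nothing, so default the numerator
        # to 0; otherwise a day where every start is slow falls through to vector(1)
        (count(pod_container:container_start:duration1d <= 40) or vector(0))
        /
        (count(pod_container:container_start:duration1d) > 0) or vector(1)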
      labels:
        range: 1d
      record: horizon_container_create_rt:availability

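Notes:

  1. 86400 s is one day; comparing against time() - 86400 keeps only Pods created within the last day
  2. For the 1d SLO, the response-time (RT) threshold is set to 40 s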
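
The intended calculation can be sketched in Python as follows: keep Pods with zero restarts whose containers started in the last day, subtract the scheduling time from each container start time, and report the share of durations at or below 40 s. The dictionary layout and the timestamps are hypothetical stand-ins for the kube-state-metrics series, not an actual API.

```python
DAY_SECONDS = 86400
RT_THRESHOLD_SECONDS = 40


def start_durations(pods, now):
    """Startup durations for containers started in the last day, zero restarts."""
    durations = []
    for pod in pods:
        if pod["restarts"] != 0:
            continue  # a restart overwrites the original start time
        for started in pod["container_started"]:
            if started > now - DAY_SECONDS:
                durations.append(started - pod["scheduled"])
    return durations


def container_create_availability(pods, now):
    """Share of startups within the threshold; 1.0 when there were none."""
    durations = start_durations(pods, now)
    if not durations:
        return 1.0
    within = sum(1 for d in durations if d <= RT_THRESHOLD_SECONDS)
    return within / len(durations)


now = 1_000_000  # arbitrary Unix timestamp for the example
pods = [
    {"scheduled": now - 3600, "container_started": [now - 3588], "restarts": 0},
    {"scheduled": now - 7200, "container_started": [now - 7145], "restarts": 0},
    {"scheduled": now - 600, "container_started": [now - 100], "restarts": 2},
    {"scheduled": now - 200000, "container_started": [now - 199990], "restarts": 0},
]
print(start_durations(pods, now))
print(container_create_availability(pods, now))
```

Only the first two Pods contribute here: the third has restarted and the fourth started more than a day ago.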

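2.3.2 Resource fulfillment

Availability drops when Pods stay in the Pending state for a long time, whether because the cluster is short on resources or because of network, volume mount, or similar problems.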

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: horizon-core-resources-meet-availability
  namespace: horizoncd
spec:
  groups:
  - interval: 5m
    name: horizon-resources-meet-1d-availability.rules
    rules:
    - expr: kube_pod_status_phase{phase="Pending"} and ignoring(phase) (kube_pod_start_time > (time() - 86400))
      record: created:pod_pending:in1d
    - expr: (1 - count(sum_over_time(created:pod_pending:in1d[10m]) > 5) /
        (count(created:pod_pending:in1d) > 0)) or vector(1)
      labels:
        range: 1d
      record: horizon_resources_meet:availability

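Notes:

  1. 86400 s is one day; comparing against time() - 86400 keeps only Pods created within the last day
  2. sum_over_time adds up the 0/1 Pending samples from the last 10 minutes; Pods whose sum exceeds 5 are counted as Pending for more than 5 minutes, which holds when the series has one sample per minute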
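
The Pending check can be sketched in Python as follows. It assumes one 0/1 Pending sample per Pod per minute, so a 10-minute window holds ten samples; that sampling rate is an assumption of this sketch, and the Pod names and values are invented.

```python
def pending_too_long(samples, threshold=5):
    """True when the Pod was Pending in more than `threshold` samples."""
    return sum(samples) > threshold


def resources_meet_availability(pod_samples):
    """1 - (long-Pending Pods / Pods started in the last day), or 1.0 if none."""
    if not pod_samples:
        return 1.0
    stuck = sum(1 for s in pod_samples.values() if pending_too_long(s))
    return 1 - stuck / len(pod_samples)


# Ten one-minute samples per Pod: 1 means Pending, 0 means any other phase.
pod_samples = {
    "pod-a": [1] * 10,           # Pending for the whole window
    "pod-b": [1, 1, 1] + [0] * 7,  # briefly Pending, then scheduled
    "pod-c": [0] * 10,
    "pod-d": [0] * 10,
}
print(resources_meet_availability(pod_samples))
```

With a sparser sampling interval the same threshold of 5 would correspond to a longer Pending time, so the threshold and the window need to be chosen together with the interval.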