Applies to the CI/CD platform: https://horizoncd.github.io/
1. Dashboard
Grafana dashboard JSON:
https://github.com/ryanwuer/helm-charts/blob/main/charts/horizon/files/slo-dashboard.json
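2. Metrics overview
The core metrics fall into three groups:
- Platform API
- CI/CD capability
- K8s scheduling
The SLO panels in the dashboard above aggregate over 1h, 1d, and 30d windows, each corresponding to a different guarantee level. The rest of this document uses the 1d window as the example.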
2.1 Platform API
2.1.1 Base metrics
# Request count, Counter type
horizon_request_total
# Request duration, Histogram type
horizon_request_duration_seconds
2.1.2 Aggregation rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: horizon-core-api-availability
  namespace: horizoncd
spec:
  groups:
  - interval: 3m
    name: horizon-1d-availability.rules
    rules:
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="GET",code=~"2.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="POST",code=~"2.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="PUT",code=~"2.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="DELETE",code=~"2.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="GET",code=~"3.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="POST",code=~"3.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="PUT",code=~"3.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="DELETE",code=~"3.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="GET",code=~"4.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="POST",code=~"4.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="PUT",code=~"4.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="DELETE",code=~"4.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="GET",code=~"5.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="POST",code=~"5.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="PUT",code=~"5.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code, verb) (increase(horizon_request_total{verb="DELETE",code=~"5.."}[1d]))
      record: code_verb:horizon_request_total:increase1d
    - expr: sum by (code) (code_verb:horizon_request_total:increase1d{verb=~"GET"})
      labels:
        verb: read
      record: code:horizon_request_total:increase1d
    - expr: sum by (code) (code_verb:horizon_request_total:increase1d{verb=~"POST|PUT|DELETE"})
      labels:
        verb: write
      record: code:horizon_request_total:increase1d
    - expr: |-
        1 - (
          (
            # write too slow
            sum(increase(horizon_request_duration_seconds_count{verb=~"POST|PUT|DELETE"}[1d]))
            -
            sum(increase(horizon_request_duration_seconds_bucket{verb=~"POST|PUT|DELETE",le="5"}[1d]))
          ) +
          (
            # read too slow
            sum(increase(horizon_request_duration_seconds_count{verb=~"GET"}[1d]))
            -
            sum(increase(horizon_request_duration_seconds_bucket{verb=~"GET",le="1"}[1d]))
          ) +
          # errors
          sum(code:horizon_request_total:increase1d{code=~"5.."} or vector(0))
        )
        /
        sum(code:horizon_request_total:increase1d)
      labels:
        range: 1d
        verb: all
      record: horizon_request:availability
    - expr: |-
        1 - (
          # read too slow
          sum(increase(horizon_request_duration_seconds_count{verb=~"GET"}[1d]))
          -
          sum(increase(horizon_request_duration_seconds_bucket{verb=~"GET",le="1"}[1d]))
          +
          # errors
          sum(code:horizon_request_total:increase1d{verb="read",code=~"5.."} or vector(0))
        )
        /
        sum(code:horizon_request_total:increase1d{verb="read"})
      labels:
        range: 1d
        verb: read
      record: horizon_request:availability
    - expr: |-
        1 - (
          (
            # write too slow
            sum(increase(horizon_request_duration_seconds_count{verb=~"POST|PUT|DELETE"}[1d]))
            -
            sum(increase(horizon_request_duration_seconds_bucket{verb=~"POST|PUT|DELETE",le="5"}[1d]))
          )
          +
          # errors
          sum(code:horizon_request_total:increase1d{verb="write",code=~"5.."} or vector(0))
        )
        /
        sum(code:horizon_request_total:increase1d{verb="write"})
      labels:
        range: 1d
        verb: write
      record: horizon_request:availability
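The arithmetic behind the horizon_request:availability rules can be sketched in Python. This is a minimal illustration, not part of the platform: the function names and all counts are hypothetical. The thresholds (1s for reads, 5s for writes) are the le buckets used in the rules above.

```python
def slow_count(count, bucket_le):
    """Requests slower than the threshold: all observations minus those in the `le` bucket."""
    return count - bucket_le


def availability(total, slow, errors):
    """1 - (slow + errors) / total, the shape of the horizon_request:availability rules."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 1 - (slow + errors) / total


# Hypothetical one-day counts, for illustration only.
read_slow = slow_count(count=9000, bucket_le=8991)   # 9 reads slower than 1s
write_slow = slow_count(count=1000, bucket_le=998)   # 2 writes slower than 5s
errors_5xx = 4                                       # requests answered with 5xx
total = 10000                                        # all requests in the window

overall = availability(total, read_slow + write_slow, errors_5xx)
print(round(overall, 4))  # 0.9985
```

As in the rules, a request that is both slow and answered with 5xx is counted twice in the numerator, so the result is a slightly conservative estimate.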
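2.2 CI/CD capability
2.2.1 Base metrics
CI is built on Tekton.
# Duration of each Tekton Step, Histogram type
horizon_step_duration_seconds
2.2.2 Aggregation rules
The SLO covers three stages: fetching code, building the image, and deploying. The compile stage is excluded from the SLO because its duration depends heavily on user code.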
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: horizon-core-pipeline-availability
  namespace: horizoncd
spec:
  groups:
  - interval: 5m
    name: pipeline-1d-availability.rules
    rules:
    - expr: |-
        sum(increase(horizon_step_duration_seconds_count{result="ok",step="git"}[1d]))
        /
        sum(increase(horizon_step_duration_seconds_count{step="git"}[1d]) > 0) or vector(1)
      labels:
        range: 1d
      record: horizon_git:availability
    - expr: |-
        sum(increase(horizon_step_duration_seconds_bucket{result="ok",step="git",le="20"}[1d]))
        /
        sum(increase(horizon_step_duration_seconds_count{result="ok",step="git"}[1d]) > 0) or vector(1)
      labels:
        range: 1d
      record: horizon_git_rt:availability
    - expr: |-
        sum(increase(horizon_step_duration_seconds_count{result="ok",step="image"}[1d]))
        /
        sum(increase(horizon_step_duration_seconds_count{step="image"}[1d]) > 0) or vector(1)
      labels:
        range: 1d
      record: horizon_image:availability
    - expr: |-
        sum(increase(horizon_step_duration_seconds_bucket{result="ok",step="image",le="120"}[1d]))
        /
        sum(increase(horizon_step_duration_seconds_count{result="ok",step="image"}[1d]) > 0) or vector(1)
      labels:
        range: 1d
      record: horizon_image_rt:availability
    - expr: |-
        sum(increase(horizon_step_duration_seconds_bucket{le="240",step="deploy",result="ok"}[1d]))
        /
        sum(increase(horizon_step_duration_seconds_bucket{le="240",step="deploy"}[1d]) > 0) or vector(1)
      labels:
        range: 1d
      record: horizon_deploy:availability
    - expr: |-
        sum(increase(horizon_step_duration_seconds_bucket{step="deploy",result="ok",le="40"}[1d]))
        /
        sum(increase(horizon_step_duration_seconds_count{step="deploy",result="ok"}[1d]) > 0) or vector(1)
      labels:
        range: 1d
      record: horizon_deploy_rt:availability
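Each pipeline rule above follows the same pattern: a ratio of "good" steps to all steps, with a default of 1 when the step did not run in the window. The sketch below illustrates that pattern for the git step. The function name and the counts are hypothetical and only serve the example.

```python
def step_ratio(good, total):
    """Return good / total, or 1.0 when the step did not run (total == 0).

    Mirrors `sum(good) / sum(total > 0) or vector(1)`: the `> 0` filter removes
    series with no increase, so the division yields no result when nothing ran,
    and `or vector(1)` supplies the default value.
    """
    if total <= 0:
        return 1.0
    return good / total


# Hypothetical one-day counts for the git step.
git_runs = 200            # all git steps
git_ok = 198              # steps with result="ok"
git_ok_within_20s = 190   # successful steps that fall in the le="20" bucket

git_availability = step_ratio(git_ok, git_runs)              # success ratio
git_rt_availability = step_ratio(git_ok_within_20s, git_ok)  # latency ratio among successes
idle_day = step_ratio(0, 0)                                  # no runs: defaults to 1.0

print(git_availability, round(git_rt_availability, 4), idle_day)
```

The default of 1 keeps the SLO panel from showing gaps or false alarms on days when no pipeline ran.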
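2.3 K8s scheduling
Based on metrics exposed by kube-state-metrics, such as container start time and Pod Pending status.
2.3.1 Container startup latency
Container startup latency is the difference between the time the container process started and the time the Pod was scheduled. It measures how quickly the container environment is prepared (image pull, IP allocation, volume mounts, and so on).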
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: horizon-core-container-create-availability
  namespace: horizoncd
spec:
  groups:
  - interval: 5m
    name: horizon-container-create-1d-availability.rules
    rules:
    - expr: |-
        (
          avg(kube_pod_container_state_started) by (pod, container) > (time() - 86400)
        )
        - ignoring(container) group_left()
        (
          avg(kube_pod_status_scheduled_time) by (pod) + (sum(kube_pod_container_status_restarts_total) by(pod) == 0)
        )
      record: pod_container:container_start:duration1d
    - expr: count(pod_container:container_start:duration1d <= 40) /
        (count(pod_container:container_start:duration1d) > 0) or vector(1)
      labels:
        range: 1d
      record: horizon_container_create_rt:availability
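Notes:
- 86400s is one day; it restricts the calculation to Pods whose containers started within the last day
- For the 1d SLO, the latency threshold is 40s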
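The calculation above can be illustrated with a short Python sketch: subtract the Pod scheduled time from the container start time, then take the share of containers at or below the 40s threshold. The function name and the timestamps are hypothetical.

```python
def container_create_availability(durations, threshold=40.0):
    """Fraction of containers whose startup latency is <= threshold seconds.

    Returns 1.0 when there are no samples, like the `or vector(1)` default.
    """
    if not durations:
        return 1.0
    within = sum(1 for d in durations if d <= threshold)
    return within / len(durations)


# Hypothetical samples: (container start time, Pod scheduled time) in Unix seconds.
samples = [
    (1700000012, 1700000000),
    (1700000130, 1700000100),
    (1700000265, 1700000200),
    (1700000308, 1700000300),
]
durations = [started - scheduled for started, scheduled in samples]  # [12, 30, 65, 8]
ratio = container_create_availability(durations)
print(ratio)  # 0.75
```

One difference from the PromQL is worth keeping in mind: when data exists but no container meets the threshold, this sketch returns 0, whereas the recorded expression has no numerator series in that case and falls back to `vector(1)`.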
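2.3.2 Resource fulfillment
Availability drops when Pods stay in the Pending state for a long time, whether because the cluster is short of resources or because of network, volume mount, or similar problems.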
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: horizon-core-resources-meet-availability
  namespace: horizoncd
spec:
  groups:
  - interval: 5m
    name: horizon-resources-meet-1d-availability.rules
    rules:
    - expr: kube_pod_status_phase{phase="Pending"} and ignoring(phase) (kube_pod_start_time > (time() - 86400))
      record: created:pod_pending:in1d
    - expr: (1 - count(sum_over_time(created:pod_pending:in1d[10m]) > 5) /
        (count(created:pod_pending:in1d) > 0)) or vector(1)
      labels:
        range: 1d
      record: horizon_resources_meet:availability
Notes:
- 86400s is one day; it selects Pods created within the last day
- sum_over_time is used to count Pods that have been Pending for more than 5 minutes
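The following Python sketch illustrates the counting logic: each Pod contributes a series of 0/1 Pending samples over the last 10 minutes, and a Pod counts as stuck when the sum of those samples exceeds 5. The function name and sample data are hypothetical, and the sketch assumes one sample per minute. With the 5m group interval shown above, a 10m window would hold only two or three samples, so the threshold would have to be scaled accordingly.

```python
def resources_meet_availability(pending_samples, threshold=5):
    """1 - (Pods whose Pending samples sum above threshold) / (all Pods).

    pending_samples maps a Pod name to its 0/1 Pending samples in the window.
    Returns 1.0 when no Pods were created, like the `or vector(1)` default.
    """
    if not pending_samples:
        return 1.0
    stuck = sum(1 for s in pending_samples.values() if sum(s) > threshold)
    return 1 - stuck / len(pending_samples)


# Hypothetical window of ten one-minute samples per Pod (1 = Pending).
pods = {
    "pod-a": [0] * 10,                        # never Pending
    "pod-b": [1] * 10,                        # Pending for the whole window
    "pod-c": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # Pending briefly, then running
    "pod-d": [0] * 10,
}
ratio = resources_meet_availability(pods)
print(ratio)  # 0.75
```

Only pod-b exceeds the threshold here, so one of four Pods is counted against availability.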