Observability chain
| Prometheus
↓ select
ServiceMonitor
↓ select
Service
↓ endpoint
Pod /metrics
|
1. First, add the Helm repo and install the chart:
| helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install the full stack into the monitoring namespace
helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring --create-namespace
|
2. Inspect the fields on the Prometheus object in your own cluster before hooking anything up to Prometheus:
kubectl get prometheus -n monitoring -o yaml | grep -A 5 serviceMonitorSelector
| (base) savilahao@bogon ~ % kubectl get prometheus -n monitoring -o yaml | grep -A 5 serviceMonitorSelector
serviceMonitorSelector:
  matchLabels:
    release: prometheus  # with a Helm install this is usually the Helm release name
shards: 1
tsdb:
  outOfOrderTimeWindow: 0s
|
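The grep above depends on how the YAML happens to be laid out. As an alternative, here is a minimal sketch that reads the same field with a jsonpath query. The function name `prometheus_sm_selector` is made up for illustration, and the namespace defaults to monitoring as in this walkthrough. How the map is printed varies between kubectl versions.

```shell
# prometheus_sm_selector [NAMESPACE]
# Print the serviceMonitorSelector.matchLabels of every Prometheus object
# in the namespace (defaults to "monitoring", as used in this walkthrough).
prometheus_sm_selector() {
  kubectl get prometheus -n "${1:-monitoring}" \
    -o jsonpath='{.items[*].spec.serviceMonitorSelector.matchLabels}{"\n"}'
}
```

Call it as `prometheus_sm_selector`, or pass another namespace as the first argument.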
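- The Prometheus Operator only picks up resources that carry specific labels. The entries under the matchLabels key are what you must supply: they go into metadata.labels in the YAML you kubectl apply later.
- Find the targets where Agones exposes its metrics (locate the Service and its port name):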
| # First list the Service objects
(base) savilahao@bogon ~ % kubectl get svc -n agones-system | grep metrics
agones-allocator-metrics-service ClusterIP 10.101.11.151 <none> 8080/TCP 26h
agones-controller-metrics-service ClusterIP 10.109.209.46 <none> 8080/TCP 26h
agones-extensions-metrics-service ClusterIP 10.105.232.133 <none> 8080/TCP 26h
# Check the port name of that Service
(base) savilahao@bogon ~ % kubectl get svc agones-controller-metrics-service -n agones-system -o jsonpath='{.spec.ports[*].name}{"\n"}'
metrics   # run the same command for the allocator and extensions Services
# See which labels the Agones Service itself carries
(base) savilahao@bogon ~ % kubectl get svc agones-controller-metrics-service -n agones-system --show-labels
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE LABELS
agones-controller-metrics-service ClusterIP 10.109.209.46 <none> 8080/TCP 26h agones.dev/role=controller,app.kubernetes.io/managed-by=Helm,app=agones,chart=agones-1.57.0,heritage=Helm,release=agones
|
kubectl get svc agones-allocator-metrics-service -n agones-system --show-labels
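Both select arrows in the chain are plain label matches: a matchLabels selector matches an object only when every listed key=value pair appears in that object's labels. The bash sketch below illustrates that rule against the labels printed above. The helper `labels_match` is hypothetical and is not part of kubectl or the Prometheus Operator.

```shell
# labels_match SELECTOR LABELS
# Succeeds when every key=value pair in SELECTOR also appears in LABELS.
# Both arguments are comma-separated key=value lists, the same format
# that `kubectl get ... --show-labels` prints. Written for bash.
labels_match() {
  selector=$1
  labels=$2
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case ",$labels," in
      *",$pair,"*) ;;                 # this pair is present, keep checking
      *) IFS=$old_ifs; return 1 ;;    # one missing pair means no match
    esac
  done
  IFS=$old_ifs
  return 0
}

# Labels copied from the agones-controller-metrics-service output above.
svc_labels='agones.dev/role=controller,app.kubernetes.io/managed-by=Helm,app=agones,chart=agones-1.57.0,heritage=Helm,release=agones'

labels_match 'app=agones' "$svc_labels" \
  && echo "app=agones selects the controller metrics Service"
labels_match 'release=prometheus' "$svc_labels" \
  || echo "release=prometheus does not match this Service"
```

This is why release: prometheus belongs on the ServiceMonitor itself (for Prometheus to find it) while app: agones belongs in its spec.selector (to find the Services).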
Assemble the YAML from the information gathered:
| apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: agones-metrics
  namespace: agones-system  # preferably the workload's namespace, not the monitoring namespace created earlier
  labels:
    # Replace with the Prometheus selector found above (serviceMonitorSelector.matchLabels)
    release: prometheus
spec:
  endpoints:
    - port: metrics  # replace with the port name found above (it may be http or web)
    - port: http     # this is the name I found on the allocator
  namespaceSelector:
    matchNames:
      - agones-system
  selector:
    matchLabels:
      # Replace with the Service label found above
      app: agones
|
Verify:
| (base) savilahao@bogon ~ % kubectl get servicemonitor -n agones-system
NAME AGE
agones-all-metrics 4m12s
|
Enter the Grafana control room
kubectl port-forward svc/prometheus-grafana -n monitoring 3000:80
Log in with the username and password.
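If you did not set the credentials yourself, the Grafana subchart normally stores them in a Secret. A minimal sketch, assuming the Secret is named prometheus-grafana (matching the release name prometheus used here) and holds the admin-user and admin-password keys. The function name `grafana_credential` is made up for illustration.

```shell
# grafana_credential KEY
# Print one decoded field of the Grafana admin Secret.
# Assumption: the Secret is prometheus-grafana in the monitoring namespace.
grafana_credential() {
  kubectl get secret prometheus-grafana -n monitoring \
    -o jsonpath="{.data.$1}" | base64 --decode
}
```

Run `grafana_credential admin-user` and `grafana_credential admin-password` to print the two values.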

Addendum: port-forward commands for reaching each server:
- kubectl port-forward svc/prometheus-grafana -n monitoring 3000:80 exposes the Grafana web UI.
- kubectl port-forward -n monitoring svc/prometheus-operated 9090 exposes the Prometheus server itself.
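With the Prometheus port-forward running, you can also confirm that the ServiceMonitor actually produced scrape targets, which goes one step further than checking that the object exists. The sketch below queries the Prometheus HTTP API at /api/v1/targets. The function name `scrape_pools` is hypothetical, and the address assumes the port-forward to 9090 shown above. Pools created from a ServiceMonitor are usually named serviceMonitor/<namespace>/<name>/<endpoint index>.

```shell
# scrape_pools
# Print the distinct scrape pools that Prometheus has active targets for.
# Assumption: `kubectl port-forward -n monitoring svc/prometheus-operated 9090`
# is running in another terminal.
scrape_pools() {
  curl -s 'http://localhost:9090/api/v1/targets?state=active' |
    grep -o '"scrapePool":"[^"]*"' | sort -u
}
```

Filtering with `scrape_pools | grep agones` should list one pool per endpoint of the ServiceMonitor once Prometheus has picked it up.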
After importing the JSON into Dashboards:
There was one hiccup: at first every panel showed a red triangle with No data. Fix: add the Prometheus data source manually.
Reading the dashboard:
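- GameServers count per type (pie chart) & count overview (line chart): these two panels show that the cluster currently has only 1 game server in the Ready state. The green line in the line chart stays flat at 1, which means this game server has been alive for a while.
- GameServers per node & Node availability: these two panels at the bottom monitor the underlying node resources. They tell you that one Node (your Minikube VM) is currently "used" and that the lone game server is running on it.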
Q: Why do Fleet RollOut Percentage and Fleet Replicas Count show No Data?
The cluster currently has only a single standalone GameServer; no Fleet resource has been deployed.
Evidence:
| (base) savilahao@bogon ~ % kubectl get fleet
No resources found in default namespace.
|
Grafana monitoring experiment: deploy a GameServer fleet! (the Fleet object)
Apply it with kubectl apply -f -:
| cat <<EOF | kubectl apply -f -
apiVersion: "agones.dev/v1"
kind: Fleet
metadata:
  name: simple-game-fleet
spec:
  # Ask the cluster to keep 5 game servers at all times (all Ready until some are allocated)
  replicas: 5
  template:
    spec:
      ports:
        - name: default
          containerPort: 7654
      template:
        spec:
          containers:
            - name: simple-game-server
              image: us-docker.pkg.dev/agones-images/examples/simple-game-server:0.35
EOF
|
Our fleet is all set!
| (base) savilahao@bogon ~ % kubectl get fleet
NAME SCHEDULING DESIRED CURRENT ALLOCATED READY AGE
simple-game-fleet Packed 5 5 0 5 5m23s
|
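To script the same check instead of reading the table by eye, a small polling loop over the Fleet's status.readyReplicas field is enough. This is only a sketch: the function name `wait_fleet_ready` and the 2-second interval are arbitrary choices, and it polls forever if the fleet never becomes ready.

```shell
# wait_fleet_ready FLEET WANT
# Poll the Fleet until at least WANT game servers report Ready.
wait_fleet_ready() {
  fleet=$1
  want=$2
  while :; do
    ready=$(kubectl get fleet "$fleet" -o jsonpath='{.status.readyReplicas}')
    # Treat an empty field as 0 so the numeric comparison stays valid.
    [ "${ready:-0}" -ge "$want" ] && break
    sleep 2
  done
  echo "$fleet: $ready/$want Ready"
}
```

For the fleet above, the call would be `wait_fleet_ready simple-game-fleet 5`.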
Now enjoy the view in Grafana.
Q: "General, what if I suddenly want to wipe out this newly formed fleet?"
| kubectl delete fleet simple-game-fleet
|