哪些网站页面简洁,网站建设一般好久到期,做网站什么域名好,网站开发流程人物最近长沙跑了半个多月#xff0c;跟甲方客户对了下项目指标#xff0c;许久没更新 回来后继续研究如何实现 grafana实现HAMi vgpu虚拟化监控#xff0c;毕竟合同里写了需要体现gpu资源限制和算力共享以及体现算力卡资源共享监控 先说下为啥要用HAMi吧#xff0c; 一个重要原…最近长沙跑了半个多月跟甲方客户对了下项目指标许久没更新 回来后继续研究如何实现 grafana实现HAMi vgpu虚拟化监控毕竟合同里写了需要体现gpu资源限制和算力共享以及体现算力卡资源共享监控 先说下为啥要用HAMi吧 一个重要原因是公司有人引见了这个工具的作者 很多问题我都可以直接向作者提问
HAMi是一个国产的GPU与国产加速卡支持的GPU与国产加速卡型号与具体特性请查看此项目官网https://github.com/Project-HAMi/HAMi/虚拟化开源项目实现以kubernetes为基础的容器场景下GPU或加速卡虚拟化。HAMi原名“k8s-vGPU-scheduler”
最初由我司开源现已在国内与国际上愈加流行是管理Kubernetes中异构设备的中间件。它可以管理不同类型的异构设备如GPU、NPU等在Pod之间共享异构设备根据设备的拓扑信息和调度策略做出更好的调度决策。为了阐述的简明性本文只提供一种可行的办法最终实现使用prometheus抓取监控指标并作为数据源、使用grafana来展示监控信息的目的。 本文假定已经部署好Kubernetes集群、HAMi。以下涉及到的相关组件都是在kubernetes集群内安装的相关组件或软件版本信息如下
组件或软件名称版本备注kubernetes集群v1.23.1AMD64构架服务器环境下HAMi根据向开源作者提问当前HAMi版本发行机制还不够成熟暂以安装HAMi的scheduler.kubeScheduler.imageTag 参数值为其版本此值要跟kubernetes版本看齐项目地址https://github.com/Project-HAMi/HAMi/kube-prometheus stack prom/prometheus:v2.27.1关于监控的安装参见实现prometheusgrafana的监控部署_prometheus grafana监控部署-CSDN博客dcgm-exporternvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04
HAMi 的默认安装方式是通过helm添加Helm仓库:
helm repo add hami-charts https://project-hami.github.io/HAMi/ 检查Kubernetes版本并安装HAMi服务器版本为1.23.1: helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTagv1.23.1 -n kube-system
验证hami安装成功
kubectl get pods -n kube-system 确认hami-device-plugin和hami-scheduler都处于Running状态表示安装成功。 把helm安装转为hami-install.yaml helm template hami hami-charts/hami --set scheduler.kubeScheduler.imageTagv1.23.1 -n kube-system hami-install.yaml
该格式部署
---
# Source: hami/templates/device-plugin/monitorserviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:name: hami-device-pluginnamespace: kube-systemlabels:app.kubernetes.io/component: hami-device-pluginhelm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: 2.4.0app.kubernetes.io/managed-by: Helm
---
# Source: hami/templates/scheduler/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:name: hami-schedulernamespace: kube-systemlabels:app.kubernetes.io/component: hami-schedulerhelm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: 2.4.0app.kubernetes.io/managed-by: Helm
---
# Source: hami/templates/device-plugin/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:name: hami-device-pluginlabels:app.kubernetes.io/component: hami-device-pluginhelm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: 2.4.0app.kubernetes.io/managed-by: Helm
data:config.json: |{nodeconfig: [{name: m5-cloudinfra-online02,devicememoryscaling: 1.8,devicesplitcount: 10,migstrategy:none,filterdevices: {uuid: [],index: []}}]}
---
# Source: hami/templates/scheduler/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:name: hami-schedulerlabels:app.kubernetes.io/component: hami-schedulerhelm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: 2.4.0app.kubernetes.io/managed-by: Helm
data:config.json: |{kind: Policy,apiVersion: v1,extenders: [{urlPrefix: https://127.0.0.1:443,filterVerb: filter,bindVerb: bind,enableHttps: true,weight: 1,nodeCacheCapable: true,httpTimeout: 30000000000,tlsConfig: {insecure: true},managedResources: [{name: nvidia.com/gpu,ignoredByScheduler: true},{name: nvidia.com/gpumem,ignoredByScheduler: true},{name: nvidia.com/gpucores,ignoredByScheduler: true},{name: nvidia.com/gpumem-percentage,ignoredByScheduler: true},{name: nvidia.com/priority,ignoredByScheduler: true},{name: cambricon.com/vmlu,ignoredByScheduler: true},{name: hygon.com/dcunum,ignoredByScheduler: true},{name: hygon.com/dcumem,ignoredByScheduler: true },{name: hygon.com/dcucores,ignoredByScheduler: true},{name: iluvatar.ai/vgpu,ignoredByScheduler: true}],ignoreable: false}]}
---
# Source: hami/templates/scheduler/configmapnew.yaml
apiVersion: v1
kind: ConfigMap
metadata:name: hami-scheduler-newversionlabels:app.kubernetes.io/component: hami-schedulerhelm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: 2.4.0app.kubernetes.io/managed-by: Helm
data:config.yaml: |apiVersion: kubescheduler.config.k8s.io/v1kind: KubeSchedulerConfigurationleaderElection:leaderElect: falseprofiles:- schedulerName: hami-schedulerextenders:- urlPrefix: https://127.0.0.1:443filterVerb: filterbindVerb: bindnodeCacheCapable: trueweight: 1httpTimeout: 30senableHTTPS: truetlsConfig:insecure: truemanagedResources:- name: nvidia.com/gpuignoredByScheduler: true- name: nvidia.com/gpumemignoredByScheduler: true- name: nvidia.com/gpucoresignoredByScheduler: true- name: nvidia.com/gpumem-percentageignoredByScheduler: true- name: nvidia.com/priorityignoredByScheduler: true- name: cambricon.com/vmluignoredByScheduler: true- name: hygon.com/dcunumignoredByScheduler: true- name: hygon.com/dcumemignoredByScheduler: true- name: hygon.com/dcucoresignoredByScheduler: true- name: iluvatar.ai/vgpuignoredByScheduler: true
---
# Source: hami/templates/scheduler/device-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:name: hami-scheduler-devicelabels:app.kubernetes.io/component: hami-schedulerhelm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: 2.4.0app.kubernetes.io/managed-by: Helm
data:device-config.yaml: |-nvidia:resourceCountName: nvidia.com/gpuresourceMemoryName: nvidia.com/gpumemresourceMemoryPercentageName: nvidia.com/gpumem-percentageresourceCoreName: nvidia.com/gpucoresresourcePriorityName: nvidia.com/priorityoverwriteEnv: falsedefaultMemory: 0defaultCores: 0defaultGPUNum: 1deviceSplitCount: 10deviceMemoryScaling: 1deviceCoreScaling: 1cambricon:resourceCountName: cambricon.com/vmluresourceMemoryName: cambricon.com/mlu.smlu.vmemoryresourceCoreName: cambricon.com/mlu.smlu.vcorehygon:resourceCountName: hygon.com/dcunumresourceMemoryName: hygon.com/dcumemresourceCoreName: hygon.com/dcucoresmetax:resourceCountName: metax-tech.com/gpumthreads:resourceCountName: mthreads.com/vgpuresourceMemoryName: mthreads.com/sgpu-memoryresourceCoreName: mthreads.com/sgpu-coreiluvatar: resourceCountName: iluvatar.ai/vgpuresourceMemoryName: iluvatar.ai/vcuda-memoryresourceCoreName: iluvatar.ai/vcuda-corevnpus:- chipName: 910BcommonWord: Ascend910AresourceName: huawei.com/Ascend910AresourceMemoryName: huawei.com/Ascend910A-memorymemoryAllocatable: 32768memoryCapacity: 32768aiCore: 30templates:- name: vir02memory: 2184aiCore: 2- name: vir04memory: 4369aiCore: 4- name: vir08memory: 8738aiCore: 8- name: vir16memory: 17476aiCore: 16- chipName: 910B3commonWord: Ascend910BresourceName: huawei.com/Ascend910BresourceMemoryName: huawei.com/Ascend910B-memorymemoryAllocatable: 65536memoryCapacity: 65536aiCore: 20aiCPU: 7templates:- name: vir05_1c_16gmemory: 16384aiCore: 5aiCPU: 1- name: vir10_3c_32gmemory: 32768aiCore: 10aiCPU: 3- chipName: 310P3commonWord: Ascend310PresourceName: huawei.com/Ascend310PresourceMemoryName: huawei.com/Ascend310P-memorymemoryAllocatable: 21527memoryCapacity: 24576aiCore: 8aiCPU: 7templates:- name: vir01memory: 3072aiCore: 1aiCPU: 1- name: vir02memory: 6144aiCore: 2aiCPU: 2- name: vir04memory: 12288aiCore: 4aiCPU: 4
---
# Source: hami/templates/device-plugin/monitorrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:name: hami-device-plugin-monitor
rules:- apiGroups:- resources:- podsverbs:- get- create- watch- list- update- patch- apiGroups:- resources:- nodesverbs:- get- update- list- patch
---
# Source: hami/templates/device-plugin/monitorrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:name: hami-device-pluginlabels:app.kubernetes.io/component: hami-device-pluginhelm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: 2.4.0app.kubernetes.io/managed-by: Helm
roleRef:apiGroup: rbac.authorization.k8s.iokind: ClusterRole#name: cluster-adminname: hami-device-plugin-monitor
subjects:- kind: ServiceAccountname: hami-device-pluginnamespace: kube-system
---
# Source: hami/templates/scheduler/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:name: hami-schedulerlabels:app.kubernetes.io/component: hami-schedulerhelm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: 2.4.0app.kubernetes.io/managed-by: Helm
roleRef:apiGroup: rbac.authorization.k8s.iokind: ClusterRolename: cluster-admin
subjects:- kind: ServiceAccountname: hami-schedulernamespace: kube-system
---
# Source: hami/templates/device-plugin/monitorservice.yaml
apiVersion: v1
kind: Service
metadata:name: hami-device-plugin-monitorlabels:app.kubernetes.io/component: hami-device-pluginhelm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: 2.4.0app.kubernetes.io/managed-by: Helm
spec:externalTrafficPolicy: Localselector:app.kubernetes.io/component: hami-device-plugintype: NodePortports:- name: monitorportport: 31992targetPort: 9394nodePort: 31992
---
# Source: hami/templates/scheduler/service.yaml
apiVersion: v1
kind: Service
metadata:name: hami-schedulerlabels:app.kubernetes.io/component: hami-schedulerhelm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: 2.4.0app.kubernetes.io/managed-by: Helm
spec:type: NodePortports:- name: httpport: 443targetPort: 443nodePort: 31998protocol: TCP- name: monitorport: 31993targetPort: 9395nodePort: 31993protocol: TCPselector:app.kubernetes.io/component: hami-schedulerapp.kubernetes.io/name: hamiapp.kubernetes.io/instance: hami
---
# Source: hami/templates/device-plugin/daemonsetnvidia.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:name: hami-device-pluginlabels:app.kubernetes.io/component: hami-device-pluginhelm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: 2.4.0app.kubernetes.io/managed-by: Helm
spec:selector:matchLabels:app.kubernetes.io/component: hami-device-pluginapp.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamitemplate:metadata:labels:app.kubernetes.io/component: hami-device-pluginhami.io/webhook: ignoreapp.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamispec:imagePullSecrets: []serviceAccountName: hami-device-pluginpriorityClassName: system-node-criticalhostPID: truehostNetwork: truecontainers:- name: device-pluginimage: projecthami/hami:latestimagePullPolicy: IfNotPresentlifecycle:postStart:exec:command: [/bin/sh,-c, cp -f /k8s-vgpu/lib/nvidia/* /usr/local/vgpu/]command:- nvidia-device-plugin- --config-file/device-config.yaml- --mig-strategynone- --disable-core-limitfalse- -vfalseenv:- name: NODE_NAMEvalueFrom:fieldRef:fieldPath: spec.nodeName- name: NVIDIA_MIG_MONITOR_DEVICESvalue: all- name: HOOK_PATHvalue: /usr/localsecurityContext:allowPrivilegeEscalation: falsecapabilities:drop: [ALL]add: [SYS_ADMIN]volumeMounts:- name: device-pluginmountPath: /var/lib/kubelet/device-plugins- name: libmountPath: /usr/local/vgpu- name: usrbinmountPath: /usrbin- name: deviceconfigmountPath: /config- name: hosttmpmountPath: /tmp- name: device-configmountPath: /device-config.yamlsubPath: device-config.yaml- name: vgpu-monitorimage: projecthami/hami:latestimagePullPolicy: IfNotPresentcommand: [vGPUmonitor]securityContext:allowPrivilegeEscalation: falsecapabilities:drop: [ALL]add: [SYS_ADMIN]env:- name: NVIDIA_VISIBLE_DEVICESvalue: all- name: NVIDIA_MIG_MONITOR_DEVICESvalue: all- name: HOOK_PATHvalue: /usr/local/vgpu volumeMounts:- name: ctrsmountPath: /usr/local/vgpu/containers- name: dockersmountPath: /run/docker- name: containerdsmountPath: /run/containerd- name: sysinfomountPath: /sysinfo- name: hostvarmountPath: /hostvarvolumes:- name: ctrshostPath:path: /usr/local/vgpu/containers- name: hosttmphostPath:path: /tmp- name: dockershostPath:path: /run/docker- name: containerdshostPath:path: /run/containerd- name: device-pluginhostPath:path: /var/lib/kubelet/device-plugins- name: libhostPath:path: /usr/local/vgpu- name: usrbinhostPath:path: /usr/bin- name: sysinfohostPath:path: /sys- name: hostvarhostPath:path: /var- name: deviceconfigconfigMap:name: hami-device-plugin- name: device-configconfigMap:name: hami-scheduler-devicenodeSelector: gpu: on
---
# Source: hami/templates/scheduler/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:name: hami-schedulerlabels:app.kubernetes.io/component: hami-schedulerhelm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: 2.4.0app.kubernetes.io/managed-by: Helm
spec:replicas: 1selector:matchLabels:app.kubernetes.io/component: hami-schedulerapp.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamitemplate:metadata:labels:app.kubernetes.io/component: hami-schedulerapp.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamihami.io/webhook: ignorespec:imagePullSecrets: []serviceAccountName: hami-schedulerpriorityClassName: system-node-criticalcontainers:- name: kube-schedulerimage: registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.31.0imagePullPolicy: IfNotPresentcommand:- kube-scheduler- --config/config/config.yaml- -v4- --leader-electtrue- --leader-elect-resource-namehami-scheduler- --leader-elect-resource-namespacekube-systemvolumeMounts:- name: scheduler-configmountPath: /config- name: vgpu-scheduler-extenderimage: projecthami/hami:latestimagePullPolicy: IfNotPresentenv:command:- scheduler- --http_bind0.0.0.0:443- --cert_file/tls/tls.crt- --key_file/tls/tls.key- --scheduler-namehami-scheduler- --metrics-bind-address:9395- --node-scheduler-policybinpack- --gpu-scheduler-policyspread- --device-config-file/device-config.yaml- --debug- -v4ports:- name: httpcontainerPort: 443protocol: TCPvolumeMounts:- name: tls-configmountPath: /tls- name: device-configmountPath: /device-config.yamlsubPath: device-config.yamlvolumes:- name: tls-configsecret:secretName: hami-scheduler-tls- name: scheduler-configconfigMap:name: hami-scheduler-newversion- name: device-configconfigMap:name: hami-scheduler-device
---
# Source: hami/templates/scheduler/webhook.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:name: hami-webhook
webhooks:- admissionReviewVersions:- v1beta1clientConfig:service:name: hami-schedulernamespace: kube-systempath: /webhookport: 443failurePolicy: IgnorematchPolicy: Equivalentname: vgpu.hami.ionamespaceSelector:matchExpressions:- key: hami.io/webhookoperator: NotInvalues:- ignoreobjectSelector:matchExpressions:- key: hami.io/webhookoperator: NotInvalues:- ignorereinvocationPolicy: Neverrules:- apiGroups:- apiVersions:- v1operations:- CREATEresources:- podsscope: *sideEffects: NonetimeoutSeconds: 10
---
# Source: hami/templates/scheduler/job-patch/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:name: hami-admissionannotations:helm.sh/hook: pre-install,pre-upgrade,post-install,post-upgradehelm.sh/hook-delete-policy: before-hook-creation,hook-succeededlabels:helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: 2.4.0app.kubernetes.io/managed-by: Helmapp.kubernetes.io/component: admission-webhook
---
# Source: hami/templates/scheduler/job-patch/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:name: hami-admissionannotations:helm.sh/hook: pre-install,pre-upgrade,post-install,post-upgradehelm.sh/hook-delete-policy: before-hook-creation,hook-succeededlabels:helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: 2.4.0app.kubernetes.io/managed-by: Helmapp.kubernetes.io/component: admission-webhook
rules:- apiGroups:- admissionregistration.k8s.ioresources:#- validatingwebhookconfigurations- mutatingwebhookconfigurationsverbs:- get- update
---
# Source: hami/templates/scheduler/job-patch/clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:name: hami-admissionannotations:helm.sh/hook: pre-install,pre-upgrade,post-install,post-upgradehelm.sh/hook-delete-policy: before-hook-creation,hook-succeededlabels:helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: 2.4.0app.kubernetes.io/managed-by: Helmapp.kubernetes.io/component: admission-webhook
roleRef:apiGroup: rbac.authorization.k8s.iokind: ClusterRolename: hami-admission
subjects:- kind: ServiceAccountname: hami-admissionnamespace: kube-system
---
# Source: hami/templates/scheduler/job-patch/role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:name: hami-admissionannotations:helm.sh/hook: pre-install,pre-upgrade,post-install,post-upgradehelm.sh/hook-delete-policy: before-hook-creation,hook-succeededlabels:helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: 2.4.0app.kubernetes.io/managed-by: Helmapp.kubernetes.io/component: admission-webhook
rules:- apiGroups:- resources:- secretsverbs:- get- create
---
# Source: hami/templates/scheduler/job-patch/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:name: hami-admissionannotations:helm.sh/hook: pre-install,pre-upgrade,post-install,post-upgradehelm.sh/hook-delete-policy: before-hook-creation,hook-succeededlabels:helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: 2.4.0app.kubernetes.io/managed-by: Helmapp.kubernetes.io/component: admission-webhook
roleRef:apiGroup: rbac.authorization.k8s.iokind: Rolename: hami-admission
subjects:- kind: ServiceAccountname: hami-admissionnamespace: kube-system
---
# Source: hami/templates/scheduler/job-patch/job-createSecret.yaml
apiVersion: batch/v1
kind: Job
metadata:name: hami-admission-createannotations:helm.sh/hook: pre-install,pre-upgradehelm.sh/hook-delete-policy: before-hook-creation,hook-succeededlabels:helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: 2.4.0app.kubernetes.io/managed-by: Helmapp.kubernetes.io/component: admission-webhook
spec:template:metadata:name: hami-admission-createlabels:helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: 2.4.0app.kubernetes.io/managed-by: Helmapp.kubernetes.io/component: admission-webhookhami.io/webhook: ignorespec:imagePullSecrets: []containers:- name: createimage: liangjw/kube-webhook-certgen:v1.1.1imagePullPolicy: IfNotPresentargs:- create- --cert-nametls.crt- --key-nametls.key- --hosthami-scheduler.kube-system.svc,127.0.0.1- --namespacekube-system- --secret-namehami-scheduler-tlsrestartPolicy: OnFailureserviceAccountName: hami-admissionsecurityContext:runAsNonRoot: truerunAsUser: 2000
---
# Source: hami/templates/scheduler/job-patch/job-patchWebhook.yaml
apiVersion: batch/v1
kind: Job
metadata:name: hami-admission-patchannotations:helm.sh/hook: post-install,post-upgradehelm.sh/hook-delete-policy: before-hook-creation,hook-succeededlabels:helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: 2.4.0app.kubernetes.io/managed-by: Helmapp.kubernetes.io/component: admission-webhook
spec:template:metadata:name: hami-admission-patchlabels:helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: 2.4.0app.kubernetes.io/managed-by: Helmapp.kubernetes.io/component: admission-webhookhami.io/webhook: ignorespec:imagePullSecrets: []containers:- name: patchimage: liangjw/kube-webhook-certgen:v1.1.1imagePullPolicy: IfNotPresentargs:- patch- --webhook-namehami-webhook- --namespacekube-system- --patch-validatingfalse- --secret-namehami-scheduler-tlsrestartPolicy: OnFailureserviceAccountName: hami-admissionsecurityContext:runAsNonRoot: truerunAsUser: 2000部署dcgm-exporter
apiVersion: apps/v1
kind: DaemonSet
metadata:name: dcgm-exporterlabels:app.kubernetes.io/name: dcgm-exporterapp.kubernetes.io/version: 3.6.1
spec:updateStrategy:type: RollingUpdateselector:matchLabels:app.kubernetes.io/name: dcgm-exporterapp.kubernetes.io/version: 3.6.1template:metadata:labels:app.kubernetes.io/name: dcgm-exporterapp.kubernetes.io/version: 3.6.1name: dcgm-exporterspec:containers:- image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04env:- name: DCGM_EXPORTER_LISTENvalue: :9400- name: DCGM_EXPORTER_KUBERNETESvalue: truename: dcgm-exporterports:- name: metricscontainerPort: 9400securityContext:runAsNonRoot: falserunAsUser: 0capabilities:add: [SYS_ADMIN]volumeMounts:- name: pod-gpu-resourcesreadOnly: truemountPath: /var/lib/kubelet/pod-resourcesvolumes:- name: pod-gpu-resourceshostPath:path: /var/lib/kubelet/pod-resources---kind: Service
apiVersion: v1
metadata:name: dcgm-exporterlabels:app.kubernetes.io/name: dcgm-exporterapp.kubernetes.io/version: 3.6.1
spec:selector:app.kubernetes.io/name: dcgm-exporterapp.kubernetes.io/version: 3.6.1ports:- name: metricsport: 9400
dcgm-exporter安装成功 参考这个hami-vgpu dashboard 下载panel 的json文件
hami-vgpu-dashboard | Grafana Labs 导入后grafana中将创建一个名为“hami-vgpu-dashboard”的dashboard但此页面中有一些Panel如vGPUCorePercentage还没有数据 ServiceMonitor 是 Prometheus Operator 中的一个自定义资源主要用于监控 Kubernetes 中的服务。它的作用包括
1. 自动化发现
ServiceMonitor 允许 Prometheus 自动发现和监控 Kubernetes 中的服务。通过定义 ServiceMonitor您可以告诉 Prometheus 监控特定服务的端点。
2. 配置抓取参数
您可以在 ServiceMonitor 中设置抓取的相关参数例如
抓取间隔定义 Prometheus 多频繁抓取数据如每 30 秒。超时定义抓取请求的超时时间。标签选择器指定要监控的服务的标签确保 Prometheus 仅抓取相关服务的数据。 dcgm-exporter需要配置两个service monitor
hami-device-plugin-svc-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:name: hami-device-plugin-svc-monitornamespace: kube-system
spec:selector:matchLabels:app.kubernetes.io/component: hami-device-pluginnamespaceSelector:matchNames:- kube-systemendpoints:- path: /metricsport: monitorportinterval: 15shonorLabels: falserelabelings:- sourceLabels: [__meta_kubernetes_endpoints_name]regex: hami-.*replacement: $1action: keep- sourceLabels: [__meta_kubernetes_pod_node_name]regex: (.*)targetLabel: node_namereplacement: ${1}action: replace- sourceLabels: [__meta_kubernetes_pod_host_ip]regex: (.*)targetLabel: ipreplacement: $1action: replace
hami-scheduler-svc-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:name: hami-scheduler-svc-monitornamespace: kube-system
spec:selector:matchLabels:app.kubernetes.io/component: hami-schedulernamespaceSelector:matchNames:- kube-systemendpoints:- path: /metricsport: monitorinterval: 15shonorLabels: falserelabelings:- sourceLabels: [__meta_kubernetes_endpoints_name]regex: hami-.*replacement: $1action: keep- sourceLabels: [__meta_kubernetes_pod_node_name]regex: (.*)targetLabel: node_namereplacement: ${1}action: replace- sourceLabels: [__meta_kubernetes_pod_host_ip]regex: (.*)targetLabel: ipreplacement: $1action: replace
确认创建的ServiceMonitor 启动gpu pod一个测试下
apiVersion: v1
kind: Pod
metadata:name: gpu-pod-1
spec:restartPolicy: Nevercontainers:- name: cuda-containerimage: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.2.1command: [sleep, infinity]resources:limits:nvidia.com/gpu: 1nvidia.com/gpumem: 1000nvidia.com/gpucores: 10
如果看到pod一直pending 状态 检查下节点如果出现下面gpu为0的情况 需要 docker1:下载NVIDIA-DOCKER2安装包并安装2:修改/etc/docker/daemon.json文件内容加上{default-runtime: nvidia,runtimes: {nvidia: {path: /usr/bin/nvidia-container-runtime,runtimeArgs: []}},}k8s:1:下载k8s-device-plugin 镜像2:编写nvidia-device-plugin.yml创建驱动pod
使用这个yml进行创建
apiVersion: apps/v1
kind: DaemonSet
metadata:name: nvidia-device-plugin-daemonsetnamespace: kube-system
spec:selector:matchLabels:name: nvidia-device-plugin-dsupdateStrategy:type: RollingUpdatetemplate:metadata:labels:name: nvidia-device-plugin-dsspec:tolerations:- key: nvidia.com/gpuoperator: Existseffect: NoSchedulepriorityClassName: system-node-criticalcontainers:- image: nvidia/k8s-device-plugin:1.11name: nvidia-device-plugin-ctrenv:- name: FAIL_ON_INIT_ERRORvalue: falsesecurityContext:allowPrivilegeEscalation: falsecapabilities:drop: [ALL]volumeMounts:- name: device-pluginmountPath: /var/lib/kubelet/device-pluginsvolumes:- name: device-pluginhostPath:path: /var/lib/kubelet/device-plugins gpu pod启动后进入查看下 gpu内存和限制的大小相同设置成功 访问下{scheduler node ip}:31993/metrics 日志最后有两行
vGPUPodsDeviceAllocated{containeridx0,deviceusedcore40,deviceuuidGPU-7666e9de-679b-a768-51c6-260b81cd00ec,nodename192.168.110.126,podnamegpu-pod-1,podnamespacedefault,zonevGPU} 1.048576e10
vGPUPodsDeviceAllocated{containeridx0,deviceusedcore40,deviceuuidGPU-7666e9de-679b-a768-51c6-260b81cd00ec,nodename192.168.110.126,podnamegpu-pod-2,podnamespacedefault,zonevGPU} 1.048576e10
可以看到相同deviceuuid的gpu被不同pod共享使用
exec进入hami-device-plugin daemonset里面执行nvidia-smi -L 可以看到机器上所有显卡的信息 rootnode126:/# nvidia-smi -L GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-7666e9de-679b-a768-51c6-260b81cd00ec) GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-9f32af29-1a72-6e47-af2c-72b1130a176b) rootnode126:/# 之前创建的两个serviceMonitor会去请求
app.kubernetes.io/component: hami-scheduler 和app.kubernetes.io/component: hami-device-plugin 的/metrics 接口获取数据
当gpu-pod跑起来以后查看hami-vgpu-metrics-dashboard