

I spent most of the past half month in Changsha going over project acceptance metrics with the client, so there have been no updates for a while. Now that I'm back, I'm continuing to work out how to monitor HAMi vGPU virtualization with Grafana, since the contract explicitly calls for demonstrating GPU resource limits, compute sharing, and monitoring of shared accelerator-card resources.

First, why HAMi: one important reason is that someone at the company introduced me to the tool's author, so I can put questions to the author directly.

HAMi is a Chinese open-source virtualization project for GPUs and domestic accelerator cards (for the supported GPU/accelerator models and their specific features, see the project site: https://github.com/Project-HAMi/HAMi/). It implements GPU and accelerator virtualization for container workloads on Kubernetes. Originally named "k8s-vGPU-scheduler" and first open-sourced by my company, it has become increasingly popular both in China and abroad as middleware for managing heterogeneous devices in Kubernetes: it manages different device types (GPU, NPU, and so on), shares a device among Pods, and makes better scheduling decisions based on device topology and scheduling policy. For brevity, this post presents just one workable approach: scrape the monitoring metrics with Prometheus, use Prometheus as the data source, and display the monitoring information in Grafana.

This post assumes a Kubernetes cluster and HAMi are already deployed. All components below are installed inside the Kubernetes cluster; versions are as follows:

| Component | Version | Notes |
| --- | --- | --- |
| kubernetes cluster | v1.23.1 | AMD64 servers |
| HAMi | see notes | According to the author, HAMi's release process is not yet mature; for now treat the `scheduler.kubeScheduler.imageTag` install parameter as its version, and keep that value aligned with the Kubernetes version. Project: https://github.com/Project-HAMi/HAMi/ |
| kube-prometheus stack | prom/prometheus:v2.27.1 | For installing the monitoring stack, see the earlier prometheus+grafana deployment post on CSDN |
| dcgm-exporter | nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04 | |

HAMi installs via Helm by default. Add the Helm repository:

```shell
helm repo add hami-charts https://project-hami.github.io/HAMi/
```

Check your Kubernetes version and install HAMi (the server here is v1.23.1):

```shell
helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.23.1 -n kube-system
```

Verify the installation:

```shell
kubectl get pods -n kube-system
```

If hami-device-plugin and hami-scheduler are both Running, the installation succeeded.

To deploy from plain YAML instead, render the Helm chart into hami-install.yaml:

```shell
helm template hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.23.1 -n kube-system > hami-install.yaml
```

The rendered manifests:

```yaml
---
# Source: hami/templates/device-plugin/monitorserviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hami-device-plugin
  namespace: kube-system
  labels:
    app.kubernetes.io/component: hami-device-plugin
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
---
# Source: hami/templates/scheduler/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hami-scheduler
  namespace: kube-system
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
---
# Source: hami/templates/device-plugin/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-device-plugin
  labels:
    app.kubernetes.io/component: hami-device-plugin
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
data:
  config.json: |
    {
      "nodeconfig": [
        {
          "name": "m5-cloudinfra-online02",
          "devicememoryscaling": 1.8,
          "devicesplitcount": 10,
          "migstrategy": "none",
          "filterdevices": {
            "uuid": [],
            "index": []
          }
        }
      ]
    }
---
# Source: hami/templates/scheduler/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-scheduler
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
data:
  config.json: |
    {
      "kind": "Policy",
      "apiVersion": "v1",
      "extenders": [
        {
          "urlPrefix": "https://127.0.0.1:443",
          "filterVerb": "filter",
          "bindVerb": "bind",
          "enableHttps": true,
          "weight": 1,
          "nodeCacheCapable": true,
          "httpTimeout": 30000000000,
          "tlsConfig": {
            "insecure": true
          },
          "managedResources": [
            { "name": "nvidia.com/gpu", "ignoredByScheduler": true },
            { "name": "nvidia.com/gpumem", "ignoredByScheduler": true },
            { "name": "nvidia.com/gpucores", "ignoredByScheduler": true },
            { "name": "nvidia.com/gpumem-percentage", "ignoredByScheduler": true },
            { "name": "nvidia.com/priority", "ignoredByScheduler": true },
            { "name": "cambricon.com/vmlu", "ignoredByScheduler": true },
            { "name": "hygon.com/dcunum", "ignoredByScheduler": true },
            { "name": "hygon.com/dcumem", "ignoredByScheduler": true },
            { "name": "hygon.com/dcucores", "ignoredByScheduler": true },
            { "name": "iluvatar.ai/vgpu", "ignoredByScheduler": true }
          ],
          "ignoreable": false
        }
      ]
    }
---
# Source: hami/templates/scheduler/configmapnew.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-scheduler-newversion
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: false
    profiles:
      - schedulerName: hami-scheduler
    extenders:
      - urlPrefix: "https://127.0.0.1:443"
        filterVerb: filter
        bindVerb: bind
        nodeCacheCapable: true
        weight: 1
        httpTimeout: 30s
        enableHTTPS: true
        tlsConfig:
          insecure: true
        managedResources:
          - name: nvidia.com/gpu
            ignoredByScheduler: true
          - name: nvidia.com/gpumem
            ignoredByScheduler: true
          - name: nvidia.com/gpucores
            ignoredByScheduler: true
          - name: nvidia.com/gpumem-percentage
            ignoredByScheduler: true
          - name: nvidia.com/priority
            ignoredByScheduler: true
          - name: cambricon.com/vmlu
            ignoredByScheduler: true
          - name: hygon.com/dcunum
            ignoredByScheduler: true
          - name: hygon.com/dcumem
            ignoredByScheduler: true
          - name: hygon.com/dcucores
            ignoredByScheduler: true
          - name: iluvatar.ai/vgpu
            ignoredByScheduler: true
---
# Source: hami/templates/scheduler/device-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-scheduler-device
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
data:
  device-config.yaml: |-
    nvidia:
      resourceCountName: nvidia.com/gpu
      resourceMemoryName: nvidia.com/gpumem
      resourceMemoryPercentageName: nvidia.com/gpumem-percentage
      resourceCoreName: nvidia.com/gpucores
      resourcePriorityName: nvidia.com/priority
      overwriteEnv: false
      defaultMemory: 0
      defaultCores: 0
      defaultGPUNum: 1
      deviceSplitCount: 10
      deviceMemoryScaling: 1
      deviceCoreScaling: 1
    cambricon:
      resourceCountName: cambricon.com/vmlu
      resourceMemoryName: cambricon.com/mlu.smlu.vmemory
      resourceCoreName: cambricon.com/mlu.smlu.vcore
    hygon:
      resourceCountName: hygon.com/dcunum
      resourceMemoryName: hygon.com/dcumem
      resourceCoreName: hygon.com/dcucores
    metax:
      resourceCountName: metax-tech.com/gpu
    mthreads:
      resourceCountName: mthreads.com/vgpu
      resourceMemoryName: mthreads.com/sgpu-memory
      resourceCoreName: mthreads.com/sgpu-core
    iluvatar:
      resourceCountName: iluvatar.ai/vgpu
      resourceMemoryName: iluvatar.ai/vcuda-memory
      resourceCoreName: iluvatar.ai/vcuda-core
    vnpus:
      - chipName: 910B
        commonWord: Ascend910A
        resourceName: huawei.com/Ascend910A
        resourceMemoryName: huawei.com/Ascend910A-memory
        memoryAllocatable: 32768
        memoryCapacity: 32768
        aiCore: 30
        templates:
          - name: vir02
            memory: 2184
            aiCore: 2
          - name: vir04
            memory: 4369
            aiCore: 4
          - name: vir08
            memory: 8738
            aiCore: 8
          - name: vir16
            memory: 17476
            aiCore: 16
      - chipName: 910B3
        commonWord: Ascend910B
        resourceName: huawei.com/Ascend910B
        resourceMemoryName: huawei.com/Ascend910B-memory
        memoryAllocatable: 65536
        memoryCapacity: 65536
        aiCore: 20
        aiCPU: 7
        templates:
          - name: vir05_1c_16g
            memory: 16384
            aiCore: 5
            aiCPU: 1
          - name: vir10_3c_32g
            memory: 32768
            aiCore: 10
            aiCPU: 3
      - chipName: 310P3
        commonWord: Ascend310P
        resourceName: huawei.com/Ascend310P
        resourceMemoryName: huawei.com/Ascend310P-memory
        memoryAllocatable: 21527
        memoryCapacity: 24576
        aiCore: 8
        aiCPU: 7
        templates:
          - name: vir01
            memory: 3072
            aiCore: 1
            aiCPU: 1
          - name: vir02
            memory: 6144
            aiCore: 2
            aiCPU: 2
          - name: vir04
            memory: 12288
            aiCore: 4
            aiCPU: 4
---
# Source: hami/templates/device-plugin/monitorrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: hami-device-plugin-monitor
rules:
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - get
      - create
      - watch
      - list
      - update
      - patch
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
      - update
      - list
      - patch
---
# Source: hami/templates/device-plugin/monitorrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hami-device-plugin
  labels:
    app.kubernetes.io/component: hami-device-plugin
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  #name: cluster-admin
  name: hami-device-plugin-monitor
subjects:
  - kind: ServiceAccount
    name: hami-device-plugin
    namespace: kube-system
---
# Source: hami/templates/scheduler/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hami-scheduler
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: hami-scheduler
    namespace: kube-system
---
# Source: hami/templates/device-plugin/monitorservice.yaml
apiVersion: v1
kind: Service
metadata:
  name: hami-device-plugin-monitor
  labels:
    app.kubernetes.io/component: hami-device-plugin
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  externalTrafficPolicy: Local
  selector:
    app.kubernetes.io/component: hami-device-plugin
  type: NodePort
  ports:
    - name: monitorport
      port: 31992
      targetPort: 9394
      nodePort: 31992
---
# Source: hami/templates/scheduler/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: hami-scheduler
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  type: NodePort
  ports:
    - name: http
      port: 443
      targetPort: 443
      nodePort: 31998
      protocol: TCP
    - name: monitor
      port: 31993
      targetPort: 9395
      nodePort: 31993
      protocol: TCP
  selector:
    app.kubernetes.io/component: hami-scheduler
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
---
# Source: hami/templates/device-plugin/daemonsetnvidia.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hami-device-plugin
  labels:
    app.kubernetes.io/component: hami-device-plugin
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
      app.kubernetes.io/name: hami
      app.kubernetes.io/instance: hami
  template:
    metadata:
      labels:
        app.kubernetes.io/component: hami-device-plugin
        hami.io/webhook: ignore
        app.kubernetes.io/name: hami
        app.kubernetes.io/instance: hami
    spec:
      imagePullSecrets: []
      serviceAccountName: hami-device-plugin
      priorityClassName: system-node-critical
      hostPID: true
      hostNetwork: true
      containers:
        - name: device-plugin
          image: projecthami/hami:latest
          imagePullPolicy: IfNotPresent
          lifecycle:
            postStart:
              exec:
                command: ["/bin/sh", "-c", "cp -f /k8s-vgpu/lib/nvidia/* /usr/local/vgpu/"]
          command:
            - nvidia-device-plugin
            - --config-file=/device-config.yaml
            - --mig-strategy=none
            - --disable-core-limit=false
            - -v=false
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: all
            - name: HOOK_PATH
              value: /usr/local
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
              add: ["SYS_ADMIN"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: lib
              mountPath: /usr/local/vgpu
            - name: usrbin
              mountPath: /usrbin
            - name: deviceconfig
              mountPath: /config
            - name: hosttmp
              mountPath: /tmp
            - name: device-config
              mountPath: /device-config.yaml
              subPath: device-config.yaml
        - name: vgpu-monitor
          image: projecthami/hami:latest
          imagePullPolicy: IfNotPresent
          command: ["vGPUmonitor"]
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
              add: ["SYS_ADMIN"]
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: all
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: all
            - name: HOOK_PATH
              value: /usr/local/vgpu
          volumeMounts:
            - name: ctrs
              mountPath: /usr/local/vgpu/containers
            - name: dockers
              mountPath: /run/docker
            - name: containerds
              mountPath: /run/containerd
            - name: sysinfo
              mountPath: /sysinfo
            - name: hostvar
              mountPath: /hostvar
      volumes:
        - name: ctrs
          hostPath:
            path: /usr/local/vgpu/containers
        - name: hosttmp
          hostPath:
            path: /tmp
        - name: dockers
          hostPath:
            path: /run/docker
        - name: containerds
          hostPath:
            path: /run/containerd
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: lib
          hostPath:
            path: /usr/local/vgpu
        - name: usrbin
          hostPath:
            path: /usr/bin
        - name: sysinfo
          hostPath:
            path: /sys
        - name: hostvar
          hostPath:
            path: /var
        - name: deviceconfig
          configMap:
            name: hami-device-plugin
        - name: device-config
          configMap:
            name: hami-scheduler-device
      nodeSelector:
        gpu: "on"
---
# Source: hami/templates/scheduler/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hami-scheduler
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-scheduler
      app.kubernetes.io/name: hami
      app.kubernetes.io/instance: hami
  template:
    metadata:
      labels:
        app.kubernetes.io/component: hami-scheduler
        app.kubernetes.io/name: hami
        app.kubernetes.io/instance: hami
        hami.io/webhook: ignore
    spec:
      imagePullSecrets: []
      serviceAccountName: hami-scheduler
      priorityClassName: system-node-critical
      containers:
        - name: kube-scheduler
          image: registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.31.0
          imagePullPolicy: IfNotPresent
          command:
            - kube-scheduler
            - --config=/config/config.yaml
            - -v=4
            - --leader-elect=true
            - --leader-elect-resource-name=hami-scheduler
            - --leader-elect-resource-namespace=kube-system
          volumeMounts:
            - name: scheduler-config
              mountPath: /config
        - name: vgpu-scheduler-extender
          image: projecthami/hami:latest
          imagePullPolicy: IfNotPresent
          command:
            - scheduler
            - --http_bind=0.0.0.0:443
            - --cert_file=/tls/tls.crt
            - --key_file=/tls/tls.key
            - --scheduler-name=hami-scheduler
            - --metrics-bind-address=:9395
            - --node-scheduler-policy=binpack
            - --gpu-scheduler-policy=spread
            - --device-config-file=/device-config.yaml
            - --debug
            - -v=4
          ports:
            - name: http
              containerPort: 443
              protocol: TCP
          volumeMounts:
            - name: tls-config
              mountPath: /tls
            - name: device-config
              mountPath: /device-config.yaml
              subPath: device-config.yaml
      volumes:
        - name: tls-config
          secret:
            secretName: hami-scheduler-tls
        - name: scheduler-config
          configMap:
            name: hami-scheduler-newversion
        - name: device-config
          configMap:
            name: hami-scheduler-device
---
# Source: hami/templates/scheduler/webhook.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: hami-webhook
webhooks:
  - admissionReviewVersions:
      - v1beta1
    clientConfig:
      service:
        name: hami-scheduler
        namespace: kube-system
        path: /webhook
        port: 443
    failurePolicy: Ignore
    matchPolicy: Equivalent
    name: vgpu.hami.io
    namespaceSelector:
      matchExpressions:
        - key: hami.io/webhook
          operator: NotIn
          values:
            - ignore
    objectSelector:
      matchExpressions:
        - key: hami.io/webhook
          operator: NotIn
          values:
            - ignore
    reinvocationPolicy: Never
    rules:
      - apiGroups:
          - ""
        apiVersions:
          - v1
        operations:
          - CREATE
        resources:
          - pods
        scope: "*"
    sideEffects: None
    timeoutSeconds: 10
---
# Source: hami/templates/scheduler/job-patch/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hami-admission
  annotations:
    helm.sh/hook: pre-install,pre-upgrade,post-install,post-upgrade
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
---
# Source: hami/templates/scheduler/job-patch/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: hami-admission
  annotations:
    helm.sh/hook: pre-install,pre-upgrade,post-install,post-upgrade
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
rules:
  - apiGroups:
      - admissionregistration.k8s.io
    resources:
      #- validatingwebhookconfigurations
      - mutatingwebhookconfigurations
    verbs:
      - get
      - update
---
# Source: hami/templates/scheduler/job-patch/clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hami-admission
  annotations:
    helm.sh/hook: pre-install,pre-upgrade,post-install,post-upgrade
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: hami-admission
subjects:
  - kind: ServiceAccount
    name: hami-admission
    namespace: kube-system
---
# Source: hami/templates/scheduler/job-patch/role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: hami-admission
  annotations:
    helm.sh/hook: pre-install,pre-upgrade,post-install,post-upgrade
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
rules:
  - apiGroups:
      - ""
    resources:
      - secrets
    verbs:
      - get
      - create
---
# Source: hami/templates/scheduler/job-patch/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: hami-admission
  annotations:
    helm.sh/hook: pre-install,pre-upgrade,post-install,post-upgrade
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: hami-admission
subjects:
  - kind: ServiceAccount
    name: hami-admission
    namespace: kube-system
---
# Source: hami/templates/scheduler/job-patch/job-createSecret.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hami-admission-create
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
spec:
  template:
    metadata:
      name: hami-admission-create
      labels:
        helm.sh/chart: hami-2.4.0
        app.kubernetes.io/name: hami
        app.kubernetes.io/instance: hami
        app.kubernetes.io/version: "2.4.0"
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/component: admission-webhook
        hami.io/webhook: ignore
    spec:
      imagePullSecrets: []
      containers:
        - name: create
          image: liangjw/kube-webhook-certgen:v1.1.1
          imagePullPolicy: IfNotPresent
          args:
            - create
            - --cert-name=tls.crt
            - --key-name=tls.key
            - --host=hami-scheduler.kube-system.svc,127.0.0.1
            - --namespace=kube-system
            - --secret-name=hami-scheduler-tls
      restartPolicy: OnFailure
      serviceAccountName: hami-admission
      securityContext:
        runAsNonRoot: true
        runAsUser: 2000
---
# Source: hami/templates/scheduler/job-patch/job-patchWebhook.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hami-admission-patch
  annotations:
    helm.sh/hook: post-install,post-upgrade
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
spec:
  template:
    metadata:
      name: hami-admission-patch
      labels:
        helm.sh/chart: hami-2.4.0
        app.kubernetes.io/name: hami
        app.kubernetes.io/instance: hami
        app.kubernetes.io/version: "2.4.0"
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/component: admission-webhook
        hami.io/webhook: ignore
    spec:
      imagePullSecrets: []
      containers:
        - name: patch
          image: liangjw/kube-webhook-certgen:v1.1.1
          imagePullPolicy: IfNotPresent
          args:
            - patch
            - --webhook-name=hami-webhook
            - --namespace=kube-system
            - --patch-validating=false
            - --secret-name=hami-scheduler-tls
      restartPolicy: OnFailure
      serviceAccountName: hami-admission
      securityContext:
        runAsNonRoot: true
        runAsUser: 2000
```

Deploy dcgm-exporter:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/version: "3.6.1"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
      app.kubernetes.io/version: "3.6.1"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: dcgm-exporter
        app.kubernetes.io/version: "3.6.1"
      name: dcgm-exporter
    spec:
      containers:
        - image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04
          env:
            - name: DCGM_EXPORTER_LISTEN
              value: ":9400"
            - name: DCGM_EXPORTER_KUBERNETES
              value: "true"
          name: dcgm-exporter
          ports:
            - name: metrics
              containerPort: 9400
          securityContext:
            runAsNonRoot: false
            runAsUser: 0
            capabilities:
              add: ["SYS_ADMIN"]
          volumeMounts:
            - name: pod-gpu-resources
              readOnly: true
              mountPath: /var/lib/kubelet/pod-resources
      volumes:
        - name: pod-gpu-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources
---
kind: Service
apiVersion: v1
metadata:
  name: dcgm-exporter
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/version: "3.6.1"
spec:
  selector:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/version: "3.6.1"
  ports:
    - name: metrics
      port: 9400
```

With that, dcgm-exporter is installed.

Next, use the hami-vgpu dashboard: download the panel JSON from "hami-vgpu-dashboard | Grafana Labs". After importing it, Grafana creates a dashboard named "hami-vgpu-dashboard", but some panels (for example vGPUCorePercentage) have no data yet.

ServiceMonitor is a custom resource from Prometheus Operator, used mainly to monitor services in Kubernetes. Its roles include:

1. Automated discovery: a ServiceMonitor lets Prometheus automatically discover and monitor services in Kubernetes; by defining one, you tell Prometheus which service endpoints to scrape.
2. Scrape configuration: in a ServiceMonitor you set scrape parameters such as the interval (how often Prometheus scrapes, e.g. every 30 seconds), the timeout for scrape requests, and label selectors that restrict Prometheus to the relevant services.

Two ServiceMonitors need to be created, one for the device plugin and one for the scheduler.

hami-device-plugin-svc-monitor.yaml:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-device-plugin-svc-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
  namespaceSelector:
    matchNames:
      - kube-system
  endpoints:
    - path: /metrics
      port: monitorport
      interval: 15s
      honorLabels: false
      relabelings:
        - sourceLabels: [__meta_kubernetes_endpoints_name]
          regex: hami-.*
          replacement: $1
          action: keep
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          regex: (.*)
          targetLabel: node_name
          replacement: ${1}
          action: replace
        - sourceLabels: [__meta_kubernetes_pod_host_ip]
          regex: (.*)
          targetLabel: ip
          replacement: $1
          action: replace
```

hami-scheduler-svc-monitor.yaml:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-scheduler-svc-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-scheduler
  namespaceSelector:
    matchNames:
      - kube-system
  endpoints:
    - path: /metrics
      port: monitor
      interval: 15s
      honorLabels: false
      relabelings:
        - sourceLabels: [__meta_kubernetes_endpoints_name]
          regex: hami-.*
          replacement: $1
          action: keep
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          regex: (.*)
          targetLabel: node_name
          replacement: ${1}
          action: replace
        - sourceLabels: [__meta_kubernetes_pod_host_ip]
          regex: (.*)
          targetLabel: ip
          replacement: $1
          action: replace
```

Confirm both ServiceMonitors were created.

Start a GPU pod to test:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-1
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.2.1
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 1000
          nvidia.com/gpucores: 10
```

If the pod stays Pending, check the node. If the node reports a GPU capacity of 0, fix the following.

On the Docker side:

1. Download and install the nvidia-docker2 package.
2. Edit /etc/docker/daemon.json and add:

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

On the Kubernetes side:

1. Pull the k8s-device-plugin image.
2. Write nvidia-device-plugin.yml and create the driver pod with this YAML:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      priorityClassName: system-node-critical
      containers:
        - image: nvidia/k8s-device-plugin:1.11
          name: nvidia-device-plugin-ctr
          env:
            - name: FAIL_ON_INIT_ERROR
              value: "false"
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```

Once the GPU pod is running, exec into it and check: the GPU memory seen inside the container matches the configured limit, so the restriction took effect.

Now visit {scheduler node ip}:31993/metrics. The output ends with two lines like:

```
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-1",podnamespace="default",zone="vGPU"} 1.048576e+10
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-2",podnamespace="default",zone="vGPU"} 1.048576e+10
```

The same deviceuuid appears for different pods, i.e. one physical GPU is being shared.

Exec into the hami-device-plugin daemonset pod and run `nvidia-smi -L` to list every card on the machine:

```
root@node126:/# nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-7666e9de-679b-a768-51c6-260b81cd00ec)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-9f32af29-1a72-6e47-af2c-72b1130a176b)
root@node126:/#
```

The two ServiceMonitors created earlier scrape the /metrics endpoints of the workloads labeled app.kubernetes.io/component: hami-scheduler and app.kubernetes.io/component: hami-device-plugin. With the GPU pods running, open the hami-vgpu-metrics-dashboard and the panels now show data.
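For custom Grafana panels, the same sharing information can come straight from PromQL. These are sketch queries built only on the vGPUPodsDeviceAllocated metric and the labels visible in the scrape above; other HAMi metric names vary by version, so check your own /metrics output before wiring up panels:

```
# Total vGPU memory (bytes) allocated on each physical card
sum by (deviceuuid) (vGPUPodsDeviceAllocated{zone="vGPU"})

# Number of containers currently sharing each physical card
count by (deviceuuid) (vGPUPodsDeviceAllocated{zone="vGPU"})
```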
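The sharing check done by eye on the two vGPUPodsDeviceAllocated lines can also be scripted. Below is a minimal sketch that parses Prometheus text-exposition lines and groups allocated bytes by `deviceuuid`, so a card shared by several pods stands out; the sample data is copied from the scrape output above, and the helper name `pods_per_gpu` is made up for this example:

```python
import re
from collections import defaultdict

# Sample lines in Prometheus text exposition format, copied from the
# hami-scheduler {node ip}:31993/metrics scrape shown above.
SCRAPE = '''
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-1",podnamespace="default",zone="vGPU"} 1.048576e+10
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-2",podnamespace="default",zone="vGPU"} 1.048576e+10
'''

# One metric sample: name{label="value",...} value
METRIC_RE = re.compile(r'vGPUPodsDeviceAllocated\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def pods_per_gpu(text):
    """Group allocated bytes by GPU UUID; more than one pod per UUID
    means that physical card is being shared."""
    gpus = defaultdict(dict)
    for m in METRIC_RE.finditer(text):
        labels = dict(LABEL_RE.findall(m.group('labels')))
        gpus[labels['deviceuuid']][labels['podname']] = float(m.group('value'))
    return gpus

if __name__ == '__main__':
    for uuid, pods in pods_per_gpu(SCRAPE).items():
        print(uuid, '->', sorted(pods))
```

In a real check you would feed the function the body of an HTTP GET against the scheduler's metrics NodePort instead of the hard-coded sample.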