# Helm

A Helm chart for deploying vLLM on Kubernetes.

Helm is the package manager for Kubernetes. It helps you deploy vLLM on k8s and automates the deployment of vLLM Kubernetes applications. With Helm, you can deploy the same framework architecture with different configurations to multiple namespaces by overriding variable values.

This guide will walk you through the process of deploying vLLM with Helm, including the necessary prerequisites, the steps for Helm installation, and documentation of the architecture and the values.yaml configuration file.
## Prerequisites

Before you begin, ensure that you have the following:

- A running Kubernetes cluster
- The NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`), available at NVIDIA/k8s-device-plugin
- Available GPU resources in your cluster
- S3 storage containing the model to deploy
## Installing the chart

To install the chart with the release name `test-vllm`, run:

```bash
helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f values.yaml \
  --set secrets.s3endpoint=$ACCESS_POINT \
  --set secrets.s3bucketname=$BUCKET \
  --set secrets.s3accesskeyid=$ACCESS_KEY \
  --set secrets.s3accesskey=$SECRET_KEY
```
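The multi-namespace pattern described above can be sketched with a per-environment override file. The keys below are taken from the values table later in this document; the file name `values-staging.yaml`, the namespace, and the values shown are hypothetical examples, not recommendations:

```yaml
# values-staging.yaml -- hypothetical override file for a second namespace.
# All keys below appear in the chart's values table; the values are examples.
replicaCount: 2                # run two vLLM replicas in this namespace
image:
  repository: "vllm/vllm-openai"
  tag: "latest"
extraInit:
  pvcStorage: "50Gi"           # PVC size for the model downloaded by the init container
  s3modelpath: "relative_s3_model_path/opt-125m"
```

It could then be deployed alongside the first release with something like `helm upgrade --install --create-namespace --namespace=ns-vllm-staging test-vllm-staging . -f values-staging.yaml`, plus the same `--set secrets.*` flags as above.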
## Uninstalling the chart

To uninstall the `test-vllm` deployment, run:

```bash
helm uninstall test-vllm --namespace=ns-vllm
```

This command removes all Kubernetes components associated with the chart (including persistent volumes) and deletes the release.
## Architecture

## Values
| Key | Type | Default | Description |
|---|---|---|---|
| autoscaling | object | {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80} | Autoscaling configuration |
| autoscaling.enabled | bool | false | Enable autoscaling |
| autoscaling.maxReplicas | int | 100 | Maximum replicas |
| autoscaling.minReplicas | int | 1 | Minimum replicas |
| autoscaling.targetCPUUtilizationPercentage | int | 80 | Target CPU utilization for autoscaling |
| configs | object | {} | Configmap |
| containerPort | int | 8000 | Container port |
| customObjects | list | [] | Custom Objects configuration |
| deploymentStrategy | object | {} | Deployment strategy configuration |
| externalConfigs | list | [] | External configuration |
| extraContainers | list | [] | Additional containers configuration |
| extraInit | object | {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m","awsEc2MetadataDisabled":true} | Additional configuration for the init container |
| extraInit.pvcStorage | string | "50Gi" | Storage size of the S3 |
| extraInit.s3modelpath | string | "relative_s3_model_path/opt-125m" | Path of the model on the S3 which hosts model weights and config files |
| extraInit.awsEc2MetadataDisabled | boolean | true | Disables the use of the Amazon EC2 instance metadata service |
| extraPorts | list | [] | Additional ports configuration |
| gpuModels | list | ["TYPE_GPU_USED"] | Type of GPU used |
| image | object | {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"} | Image configuration |
| image.command | list | ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"] | Container launch command |
| image.repository | string | "vllm/vllm-openai" | Image repository |
| image.tag | string | "latest" | Image tag |
| livenessProbe | object | {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10} | Liveness probe configuration |
| livenessProbe.failureThreshold | int | 3 | Number of consecutive probe failures after which Kubernetes considers the overall check failed: the container is not alive |
| livenessProbe.httpGet | object | {"path":"/health","port":8000} | Configuration of the kubelet HTTP request to the server |
| livenessProbe.httpGet.path | string | "/health" | Path to access on the HTTP server |
| livenessProbe.httpGet.port | int | 8000 | Name or number of the container port the server is listening on |
| livenessProbe.initialDelaySeconds | int | 15 | Number of seconds after the container has started before the liveness probe is initiated |
| livenessProbe.periodSeconds | int | 10 | How often (in seconds) to perform the liveness probe |
| maxUnavailablePodDisruptionBudget | string | "" | Disruption Budget configuration |
| readinessProbe | object | {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5} | Readiness probe configuration |
| readinessProbe.failureThreshold | int | 3 | Number of consecutive probe failures after which Kubernetes considers the overall check failed: the container is not ready |
| readinessProbe.httpGet | object | {"path":"/health","port":8000} | Configuration of the kubelet HTTP request to the server |
| readinessProbe.httpGet.path | string | "/health" | Path to access on the HTTP server |
| readinessProbe.httpGet.port | int | 8000 | Name or number of the container port the server is listening on |
| readinessProbe.initialDelaySeconds | int | 5 | Number of seconds after the container has started before the readiness probe is initiated |
| readinessProbe.periodSeconds | int | 5 | How often (in seconds) to perform the readiness probe |
| replicaCount | int | 1 | Number of replicas |
| resources | object | {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}} | Resource configuration |
| resources.limits."nvidia.com/gpu" | int | 1 | Number of GPUs used |
| resources.limits.cpu | int | 4 | Number of CPUs |
| resources.limits.memory | string | "16Gi" | CPU memory configuration |
| resources.requests."nvidia.com/gpu" | int | 1 | Number of GPUs used |
| resources.requests.cpu | int | 4 | Number of CPUs |
| resources.requests.memory | string | "16Gi" | CPU memory configuration |
| secrets | object | {} | Secrets configuration |
| serviceName | string | | Service name |
| servicePort | int | 80 | Service port |
| labels.environment | string | "test" | Environment name |
| labels.release | string | "test" | Release name |
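To illustrate how the autoscaling and probe keys in the table fit together in values.yaml, here is a sketch of a fragment enabling the horizontal pod autoscaler; the thresholds shown are examples, not recommendations:

```yaml
# Example fragment only: exercises the autoscaling and readinessProbe keys
# documented in the values table above. Tune the numbers for your workload.
autoscaling:
  enabled: true                        # turn on the HPA
  minReplicas: 1
  maxReplicas: 4
  targetCPUUtilizationPercentage: 80   # scale out above 80% CPU
readinessProbe:
  httpGet:
    path: /health                      # vLLM's health endpoint
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 5
```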