Deploying LLMs on a Single Machine

In this post, I will share my experience deploying LLMs in production. I will cover the challenges I faced, the solutions I found, and the lessons I learned along the way.

  1. First, I managed to rent a GPU from RunPod and run one complete conversation (multiple inference calls).
  2. Now I know that if you want to deploy this in production, you not only need multiple GPUs (not just the single one I’m using in this write-up), you also need a system to help you manage resources (i.e. Kubernetes).
    • 2.1. They have a solution for this called the vLLM production-stack. I’ve got to find a cluster with powerful GPUs, OR I could just deploy the stack, since I’ve now experienced model inference firsthand. But it will be hard to tell whether I deployed it successfully or not… ~(- <_ -)>

If using AKS: Set up automated deployments to the AKS cluster (optional)

Create a runpod.io account to get cloud GPU access (API access)

  • You can use API keys for project access
  • Use a passkey for secure, passwordless authentication
  • How to choose the GPU?
    • It depends on your needs; the cheapest option on RunPod was enough for me to run the model

run-pod-api-passkey-tutorial

Create SSH keys for remote access (virtual machine access)

  • RunPod is just a service that provides you with powerful GPUs

  • create an SSH key pair

ssh-keygen -t rsa -b 4096 -C "This is a comment"
  • Paste the public key into runpod.io at Account > Settings > SSH Public Keys (see the snippet after this list for printing it)

ssh-key-recognize

  • Click Update Public Key
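If you accepted the defaults from ssh-keygen above, the public key lives at ~/.ssh/id_rsa.pub (an assumption; adjust the path if you saved it elsewhere):

# print the public key so you can copy-paste it into runpod.io
cat ~/.ssh/id_rsa.pub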

ssh into the server

ssh [server-ip] -i ~/
  • Go to Pods -> Deploy -> create a pod -> wait till it creates -> click Connect

connect options
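The Connect panel shows the exact command for your pod; it typically looks something like this (a sketch, with the host, port, and key path as placeholders):

# SSH over the pod's exposed TCP port (values come from RunPod's Connect panel)
ssh root@<pod-public-ip> -p <exposed-port> -i ~/.ssh/id_rsa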

Yay! Now you’ve got powerful GPUs.

  • let’s see what OS it is
root@f00742593164:/# uname -a
Linux f00742593164 6.8.0-49-generic #49~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Nov  6 17:42:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
  • alright, Ubuntu 22.04, classic

  • alright … same xxx, different day

  • install Python; let’s use uv this time

  • to make it more interesting, let’s build vLLM from source, because:

    1. in production, are you really sure you don’t want to customize it to fit your company’s needs?
    2. are you sure the vLLM production-stack is mature enough?
  • I only have one machine, so I could spin up a few virtual pods, add k8s to manage them, and then use the production-stack to get the “LLM in production starter pack” (a sketch of that local-cluster idea follows this list)
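For reference, a lightweight way to fake that multi-pod setup on one box is a local cluster tool like kind (a sketch, assuming Docker is available on the machine, which, as it turns out later, is often not the case inside a rented pod):

# create a throwaway local Kubernetes cluster (hypothetical name)
kind create cluster --name vllm-lab

# confirm the single node is up
kubectl get nodes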

Install dependencies

Install uv python virtual environment manager

Install python

  • recommended Python version: 3.12

  • run a good old apt update && apt upgrade first to fetch the latest packages

    • always do this first; it makes sure your OS packages are clean and fresh
  • note that different Linux families ship different package managers; Ubuntu is Debian-based, so apt is the one we use here:

    • apt (Debian-based)
    • yum (Red Hat-based)
    • dnf (Fedora-based)
    • zypper (openSUSE-based)
apt update && apt upgrade

Let’s check the GPU to make sure I’m not getting scammed

root@f00742593164:/# nvidia-smi
Fri May  9 01:39:25 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 PCIe               On  |   00000000:61:00.0 Off |                    0 |
| N/A   35C    P0             48W /  310W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Good…

Let’s see if the system has Python or not

root@f00742593164:/# which python
# /usr/bin/python

root@f00742593164:/# whereis python
# python: /usr/bin/python

root@f00742593164:/# /usr/bin/python
# Python 3.10.12 (main, Feb  4 2025, 14:57:36) [GCC 11.4.0] on linux
# Type "help", "copyright", "credits" or "license" for more information.
# >>> exit()

Impressive, but we need 3.12 -_-

Well don’t worry, 🤓☝️ we can use uv, a Python package and project manager written in Rust🐐.

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# downloading uv 0.7.3 x86_64-unknown-linux-gnu
# no checksums to verify
# installing to /root/.local/bin
#   uv
#   uvx
# everything's installed!
#
# To add $HOME/.local/bin to your PATH, either restart your shell or run:
#
#     source $HOME/.local/bin/env (sh, bash, zsh)
#     source $HOME/.local/bin/env.fish (fish)
#

# know what shell you're using
echo $SHELL

# Add the env loader to your bash startup
#echo 'source $HOME/.local/bin/env' >> ~/.bashrc

# use this instead so it won't accidentally break the shell and disconnect you,
# but you have to re-run it every time you log in
export PATH="$HOME/.local/bin:$PATH"

root@f00742593164:/# echo $SHELL
/bin/bash

# Reload
source ~/.bashrc

# test uv
# root@f00742593164:/# uv
# An extremely fast Python package manager.
#
# Usage: uv [OPTIONS] <COMMAND>
#
# Commands:
#   run      Run a command or script
#   init     Create a new project
#   add      Add dependencies to the project
#   remove   Remove dependencies from the project
#   sync     Update the project's environment
#   lock     Update the project's lockfile
#   export   Export the project's lockfile to an alternate format
#   tree     Display the project's dependency tree
#   tool     Run and install commands provided by Python packages
#   python   Manage Python versions and installations
#   pip      Manage Python packages with a pip-compatible interface
#   venv     Create a virtual environment
#   build    Build Python packages into source distributions and wheels
#   publish  Upload distributions to an index
#   cache    Manage uv's cache
#   self     Manage the uv executable
#   version  Read or update the project's version
#   help     Display documentation for a command
#
# Cache options:
#   -n, --no-cache               Avoid reading from or writing to the cache, instead using a temporary directory for
#                                the duration of the operation [env: UV_NO_CACHE=]
#       --cache-dir <CACHE_DIR>  Path to the cache directory [env: UV_CACHE_DIR=]
#
# Python options:
#   --managed-python       Require use of uv-managed Python versions [env: UV_MANAGED_PYTHON=]
#   --no-managed-python    Disable use of uv-managed Python versions [env: UV_NO_MANAGED_PYTHON=]
#   --no-python-downloads  Disable automatic downloads of Python. [env: "UV_PYTHON_DOWNLOADS=never"]
#
# Global options:
#   -q, --quiet...
#           Use quiet output
#   -v, --verbose...
#           Use verbose output
#       --color <COLOR_CHOICE>
#           Control the use of color in output [possible values: auto, always, never]
#       --native-tls
#           Whether to load TLS certificates from the platform's native certificate store [env: UV_NATIVE_TLS=]
#       --offline
#           Disable network access [env: UV_OFFLINE=]
#       --allow-insecure-host <ALLOW_INSECURE_HOST>
#           Allow insecure connections to a host [env: UV_INSECURE_HOST=]
#       --no-progress
#           Hide all progress outputs [env: UV_NO_PROGRESS=]
#       --directory <DIRECTORY>
#           Change to the given directory prior to running the command
#       --project <PROJECT>
#           Run the command within the given project directory [env: UV_PROJECT=]
#       --config-file <CONFIG_FILE>
#           The path to a `uv.toml` file to use for configuration [env: UV_CONFIG_FILE=]
#       --no-config
#           Avoid discovering configuration files (`pyproject.toml`, `uv.toml`) [env: UV_NO_CONFIG=]
#   -h, --help
#           Display the concise help for this command
#   -V, --version
#           Display the uv version
#
# Use `uv help` for more details.



# create a python virtual environment in one line
# NOTE: create this venv in a directory above both production-stack and vllm, for consistency

# uv venv [venv path] --python [python version]
uv venv --python 3.12 --seed
# --seed  # Install seed packages (one or more of: `pip`, `setuptools`, and
# #`wheel`) into the virtual environment [env: UV_VENV_SEED=]

# root@f00742593164:/# uv venv --python 3.12 --seed
# Using CPython 3.12.10
# Creating virtual environment with seed packages at: .venv
#  + pip==25.1.1
# Activate with: source .venv/bin/activate

source .venv/bin/activate
  • the --seed option installs essential packaging tools into the virtual environment after it’s created; it does not install Python itself
    • so it’s like putting the “pre-packages” (pip, setuptools, wheel) in place first, similar to installing prerequisites with apt before building Python
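A quick way to see what --seed actually put into the venv (a sketch; on this Python version uv seeds pip, and optionally setuptools/wheel):

# activate the venv and list the seeded packages
source .venv/bin/activate
python -V
pip list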

Build vLLM from source

  • since vLLM builds against CUDA, let’s check the CUDA version first
nvcc --version

# nvcc: NVIDIA (R) Cuda compiler driver
# Copyright (c) 2005-2022 NVIDIA Corporation
# Built on Wed_Sep_21_10:33:58_PDT_2022
# Cuda compilation tools, release 11.8, V11.8.89
# Build cuda_11.8.r11.8/compiler.31833905_0
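As a cross-check (a sketch: nvcc reports the installed toolkit version, while nvidia-smi reports the driver’s maximum supported CUDA version, so the two can legitimately differ), once PyTorch lands in the venv you can ask which CUDA build it ships with:

# print the torch version and the CUDA version it was built against
python -c "import torch; print(torch.__version__, torch.version.cuda)"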
  • build vLLM from source with the correct CUDA version

  • clone my fork (you can just clone the original repo)

(.venv) root@f00742593164:/home# git clone https://github.com/davidgao7/vllm.git
Cloning into 'vllm'...
remote: Enumerating objects: 70376, done.
remote: Counting objects: 100% (189/189), done.
remote: Compressing objects: 100% (157/157), done.
remote: Total 70376 (delta 92), reused 32 (delta 32), pack-reused 70187 (from 3)
Receiving objects: 100% (70376/70376), 48.99 MiB | 15.81 MiB/s, done.
Resolving deltas: 100% (54878/54878), done.
  • build from source (for NVIDIA CUDA)

  • From the docs: if you only need to change Python code, you can build and install vLLM without compilation. With pip’s --editable flag, changes you make to the code are reflected when you run vLLM, which is exactly what we want.

cd vllm
VLLM_USE_PRECOMPILED=1 uv pip install --editable .

This command will do the following:

  1. Look for the current branch in your vLLM clone (default branch: main).

  2. Identify the corresponding base commit in the main branch.

  3. Download the pre-built wheel of the base commit.

  4. Use its compiled libraries in the installation.

(vllm) root@49924402e807:/workspace/vllm# VLLM_USE_PRECOMPILED=1 uv pip install --editable .
Resolved 146 packages in 24.08s
      Built vllm @ file:///workspace/vllm
Prepared 1 package in 56.50s
░░░░░░░░░░░░░░░░░░░░ [0/146] Installing wheels...                                                                 warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
         If the cache and target directories are on different filesystems, hardlinking may not be supported.
         If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
Installed 146 packages in 54.49s
 + aiohappyeyeballs==2.6.1
 + aiohttp==3.11.18
 + aiosignal==1.3.2
 + airportsdata==20250224
 + annotated-types==0.7.0
 + anyio==4.9.0
 + astor==0.8.1
 + attrs==25.3.0
 + blake3==1.0.4
 + cachetools==5.5.2
 + certifi==2025.4.26
 + charset-normalizer==3.4.2
 + click==8.1.8
 + cloudpickle==3.1.1
 + compressed-tensors==0.9.4
 + cupy-cuda12x==13.4.1
 + deprecated==1.2.18
 + depyf==0.18.0
 + dill==0.4.0
 + diskcache==5.6.3
 + distro==1.9.0
 + dnspython==2.7.0
 + einops==0.8.1
 + email-validator==2.2.0
 + fastapi==0.115.12
 + fastapi-cli==0.0.7
 + fastrlock==0.8.3
 + filelock==3.18.0
 + frozenlist==1.6.0
 + fsspec==2025.3.2
 + gguf==0.16.3
 + googleapis-common-protos==1.70.0
 + grpcio==1.71.0
 + h11==0.16.0
 + hf-xet==1.1.0
 + httpcore==1.0.9
 + httptools==0.6.4
 + httpx==0.28.1
 + huggingface-hub==0.31.1
 + idna==3.10
 + importlib-metadata==8.6.1
 + interegular==0.3.3
 + jinja2==3.1.6
 + jiter==0.9.0
 + jsonschema==4.23.0
 + jsonschema-specifications==2025.4.1
 + lark==1.2.2
 + llguidance==0.7.19
 + llvmlite==0.44.0
 + lm-format-enforcer==0.10.11
 + markdown-it-py==3.0.0
 + markupsafe==3.0.2
 + mdurl==0.1.2
 + mistral-common==1.5.4
 + mpmath==1.3.0
 + msgpack==1.1.0
 + msgspec==0.19.0
 + multidict==6.4.3
 + nest-asyncio==1.6.0
 + networkx==3.4.2
 + ninja==1.11.1.4
 + numba==0.61.2
 + numpy==2.2.5
 + nvidia-cublas-cu12==12.6.4.1
 + nvidia-cuda-cupti-cu12==12.6.80
 + nvidia-cuda-nvrtc-cu12==12.6.77
 + nvidia-cuda-runtime-cu12==12.6.77
 + nvidia-cudnn-cu12==9.5.1.17
 + nvidia-cufft-cu12==11.3.0.4
 + nvidia-cufile-cu12==1.11.1.6
 + nvidia-curand-cu12==10.3.7.77
 + nvidia-cusolver-cu12==11.7.1.2
 + nvidia-cusparse-cu12==12.5.4.2
 + nvidia-cusparselt-cu12==0.6.3
 + nvidia-nccl-cu12==2.26.2
 + nvidia-nvjitlink-cu12==12.6.85
 + nvidia-nvtx-cu12==12.6.77
 + openai==1.78.0
 + opencv-python-headless==4.11.0.86
 + opentelemetry-api==1.32.1
 + opentelemetry-exporter-otlp==1.32.1
 + opentelemetry-exporter-otlp-proto-common==1.32.1
 + opentelemetry-exporter-otlp-proto-grpc==1.32.1
 + opentelemetry-exporter-otlp-proto-http==1.32.1
 + opentelemetry-proto==1.32.1
 + opentelemetry-sdk==1.32.1
 + opentelemetry-semantic-conventions==0.53b1
 + opentelemetry-semantic-conventions-ai==0.4.7
 + outlines==0.1.11
 + outlines-core==0.1.26
 + packaging==25.0
 + partial-json-parser==0.2.1.1.post5
 + pillow==11.2.1
 + prometheus-client==0.21.1
 + prometheus-fastapi-instrumentator==7.1.0
 + propcache==0.3.1
 + protobuf==5.29.4
 + psutil==7.0.0
 + py-cpuinfo==9.0.0
 + pycountry==24.6.1
 + pydantic==2.11.4
 + pydantic-core==2.33.2
 + pygments==2.19.1
 + python-dotenv==1.1.0
 + python-json-logger==3.3.0
 + python-multipart==0.0.20
 + pyyaml==6.0.2
 + pyzmq==26.4.0
 + ray==2.46.0
 + referencing==0.36.2
 + regex==2024.11.6
 + requests==2.32.3
 + rich==14.0.0
 + rich-toolkit==0.14.5
 + rpds-py==0.24.0
 + safetensors==0.5.3
 + scipy==1.15.3
 + sentencepiece==0.2.0
 + setuptools==79.0.1
 + shellingham==1.5.4
 + six==1.17.0
 + sniffio==1.3.1
 + starlette==0.46.2
 + sympy==1.14.0
 + tiktoken==0.9.0
 + tokenizers==0.21.1
 + torch==2.7.0
 + torchaudio==2.7.0
 + torchvision==0.22.0
 + tqdm==4.67.1
 + transformers==4.51.3
 + triton==3.3.0
 + typer==0.15.3
 + typing-extensions==4.13.2
 + typing-inspection==0.4.0
 + urllib3==2.4.0
 + uvicorn==0.34.2
 + uvloop==0.21.0
 + vllm==0.1.dev6367+g3d1e387.precompiled (from file:///workspace/vllm)
 + watchfiles==1.0.5
 + websockets==15.0.1
 + wrapt==1.17.2
 + xformers==0.0.30
 + xgrammar==0.1.18
 + yarl==1.20.0
 + zipp==3.21.0

[!NOTE] 🤡 20GB is not enough to install the packages

Usually /workspace has more storage.

(.venv) root@f00742593164:/home/vllm# df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          20G   12G  8.5G  58% /
tmpfs            64M     0   64M   0% /dev
tmpfs           756G     0  756G   0% /sys/fs/cgroup
shm              88G     0   88G   0% /dev/shm
/dev/sda2       3.5T   22G  3.3T   1% /usr/bin/nvidia-smi
/dev/sdb         20G     0   20G   0% /workspace
tmpfs           756G   12K  756G   1% /proc/driver/nvidia
tmpfs           756G  4.0K  756G   1% /etc/nvidia/nvidia-application-profiles-rc.d
tmpfs           152G   52M  152G   1% /run/nvidia-persistenced/socket
tmpfs           756G     0  756G   0% /proc/acpi
tmpfs           756G     0  756G   0% /proc/scsi
tmpfs           756G     0  756G   0% /sys/firmware
tmpfs           756G     0  756G   0% /sys/devices/virtual/powercap
(.venv) root@f00742593164:/home/vllm# df -h --output=source,target,size,used,avail
Filesystem     Mounted on                                    Size  Used Avail
overlay        /                                              20G   12G  8.5G
tmpfs          /dev                                           64M     0   64M
tmpfs          /sys/fs/cgroup                                756G     0  756G
shm            /dev/shm                                       88G     0   88G
/dev/sda2      /usr/bin/nvidia-smi                           3.5T   22G  3.3T
/dev/sdb       /workspace                                     20G     0   20G
tmpfs          /proc/driver/nvidia                           756G   12K  756G
tmpfs          /etc/nvidia/nvidia-application-profiles-rc.d  756G  4.0K  756G
tmpfs          /run/nvidia-persistenced/socket               152G   52M  152G
tmpfs          /proc/acpi                                    756G     0  756G
tmpfs          /proc/scsi                                    756G     0  756G
tmpfs          /sys/firmware                                 756G     0  756G
tmpfs          /sys/devices/virtual/powercap                 756G     0  756G
(.venv) root@f00742593164:/home/vllm# cd ..
(.venv) root@f00742593164:/home# mv vllm/ ../workspace/
(.venv) root@f00742593164:/home# ls
(.venv) root@f00742593164:/home# cd ..
(.venv) root@f00742593164:/# cd workspace/
(.venv) root@f00742593164:/workspace# ls
vllm
(.venv) root@f00742593164:/workspace# cd vllm
(.venv) root@f00742593164:/workspace/vllm# df -h --output=source,target,size,used,avail
Filesystem     Mounted on                                    Size  Used Avail
overlay        /                                              20G   11G  9.6G
tmpfs          /dev                                           64M     0   64M
tmpfs          /sys/fs/cgroup                                756G     0  756G
shm            /dev/shm                                       88G     0   88G
/dev/sda2      /usr/bin/nvidia-smi                           3.5T   22G  3.3T
/dev/sdb       /workspace                                     20G  1.2G   19G
tmpfs          /proc/driver/nvidia                           756G   12K  756G
tmpfs          /etc/nvidia/nvidia-application-profiles-rc.d  756G  4.0K  756G
tmpfs          /run/nvidia-persistenced/socket               152G   52M  152G
tmpfs          /proc/acpi                                    756G     0  756G
tmpfs          /proc/scsi                                    756G     0  756G
tmpfs          /sys/firmware                                 756G     0  756G
tmpfs          /sys/devices/virtual/powercap                 756G     0  756G
(.venv) root@f00742593164:/workspace/vllm# rm -rf ../../.venv/

The Git setup shouldn’t have changed; double check…

(.venv) root@f00742593164:/workspace/vllm# git remote -v
origin  https://github.com/davidgao7/vllm.git (fetch)
origin  https://github.com/davidgao7/vllm.git (push)

Noice, 😅

Try that again… Oh yeah, don’t forget to create the virtual environment again and activate it.

VLLM_USE_PRECOMPILED=1 pip install --editable .

Install python env

(.venv) root@f00742593164:/workspace/vllm# uv venv --python 3.12 --seed
Using CPython 3.12.10
Creating virtual environment with seed packages at: .venv
warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
         If the cache and target directories are on different filesystems, hardlinking may not be supported.
         If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
 + pip==25.1.1
Activate with: source .venv/bin/activate
(.venv) root@f00742593164:/workspace/vllm# source .venv/bin/activate
(vllm) root@f00742593164:/workspace/vllm# python
Python 3.12.10 (main, Apr  9 2025, 04:03:51) [Clang 20.1.0 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()
(vllm) root@f00742593164:/workspace/vllm#

Install dependencies

VLLM_USE_PRECOMPILED=1 pip install --editable .
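Before running a full generation, a minimal import check (a sketch) confirms the editable install is wired up:

# verify the editable vLLM install resolves and print its version
python -c "import vllm; print(vllm.__version__)"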

Let’s check whether vLLM really works; here is an example script.


from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)


def main():
    # Create an LLM.
    llm = LLM(model="facebook/opt-125m")
    # Generate texts from the prompts.
    # The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt:    {prompt!r}")
        print(f"Output:    {generated_text!r}")
        print("-" * 60)


if __name__ == "__main__":
    main()

In this example, we pull the opt-125m model from Hugging Face and generate some text with it.

uv run examples/offline_inference/basic/basic.py

Finally, if your machine is powerful enough (and you stored everything on the right, larger disk), you can run one inference (many inference steps, if we’re counting per token).

(vllm) root@49924402e807:/workspace/vllm# uv run examples/offline_inference/basic/basic.py
INFO 05-09 06:46:59 [__init__.py:248] Automatically detected platform cuda.
config.json: 100%|███████████████████████████████████████████████████████████████| 651/651 [00:00<00:00, 3.33MB/s]
INFO 05-09 06:47:58 [config.py:752] This model supports multiple tasks: {'classify', 'score', 'generate', 'embed', 'reward'}. Defaulting to 'generate'.
INFO 05-09 06:47:58 [config.py:2057] Chunked prefill is enabled with max_num_batched_tokens=8192.
tokenizer_config.json: 100%|█████████████████████████████████████████████████████| 685/685 [00:00<00:00, 4.21MB/s]
vocab.json: 100%|██████████████████████████████████████████████████████████████| 899k/899k [00:00<00:00, 3.76MB/s]
merges.txt: 100%|██████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 2.91MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████| 441/441 [00:00<00:00, 3.28MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████| 137/137 [00:00<00:00, 809kB/s]
INFO 05-09 06:48:05 [core.py:61] Initializing a V1 LLM engine (v0.1.dev6367+g3d1e387) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=facebook/opt-125m, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 05-09 06:48:09 [utils.py:2587] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fb0ff70d8b0>
INFO 05-09 06:48:10 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-09 06:48:10 [cuda.py:221] Using Flash Attention backend on V1 engine.
WARNING 05-09 06:48:10 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 05-09 06:48:10 [gpu_model_runner.py:1393] Starting to load model facebook/opt-125m...
INFO 05-09 06:48:11 [weight_utils.py:257] Using model weights format ['*.bin']
pytorch_model.bin: 100%|███████████████████████████████████████████████████████| 251M/251M [00:03<00:00, 83.0MB/s]
INFO 05-09 06:48:15 [weight_utils.py:273] Time spent downloading weights for facebook/opt-125m: 3.667131 seconds
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.20it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.19it/s]

INFO 05-09 06:48:15 [default_loader.py:278] Loading weights took 0.32 seconds
INFO 05-09 06:48:15 [gpu_model_runner.py:1411] Model loading took 0.2389 GiB and 4.768521 seconds
INFO 05-09 06:48:24 [backends.py:437] Using cache directory: /root/.cache/vllm/torch_compile_cache/f75f64201d/rank_0_0 for vLLM's torch.compile
INFO 05-09 06:48:24 [backends.py:447] Dynamo bytecode transform time: 8.85 s
INFO 05-09 06:48:27 [backends.py:138] Cache the graph of shape None for later use
INFO 05-09 06:48:34 [backends.py:150] Compiling a graph for general shape takes 9.91 s
INFO 05-09 06:48:37 [monitor.py:33] torch.compile takes 18.75 s in total
INFO 05-09 06:48:38 [kv_cache_utils.py:639] GPU KV cache size: 582,208 tokens
INFO 05-09 06:48:38 [kv_cache_utils.py:642] Maximum concurrency for 2,048 tokens per request: 284.28x
INFO 05-09 06:48:59 [gpu_model_runner.py:1774] Graph capturing finished in 21 secs, took 0.20 GiB
INFO 05-09 06:48:59 [core.py:163] init engine (profile, create kv cache, warmup model) took 43.73 seconds
INFO 05-09 06:48:59 [core_client.py:442] Core engine process 0 ready.
Adding requests: 100%|█████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 233.82it/s]
Processed prompts: 100%|█████| 4/4 [00:00<00:00, 11.66it/s, est. speed input: 75.80 toks/s, output: 186.57 toks/s]

Generated Outputs:
------------------------------------------------------------
Prompt:    'Hello, my name is'
Output:    ' Ken Li and I’m a computer science student. My primary interest is'
------------------------------------------------------------
Prompt:    'The president of the United States is'
Output:    ' attempting to subvert a very important issue, one which many of us are of'
------------------------------------------------------------
Prompt:    'The capital of France is'
Output:    ' filled with tourists, retirees and independent artists but, by far, the most popular'
------------------------------------------------------------
Prompt:    'The future of AI is'
Output:    ' becoming more connected. This requires more research and development than ever before. With the'
------------------------------------------------------------

Love the verbose logs.
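Since the production stack will eventually talk to vLLM over its OpenAI-compatible HTTP API, the same model can also be smoke-tested through the server (a sketch; the port and model below are just the defaults used above):

# start vLLM's OpenAI-compatible server in the background (venv must be active)
vllm serve facebook/opt-125m --port 8000 &

# once the server reports it is ready, send a completion request
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "facebook/opt-125m", "prompt": "Hello, my name is", "max_tokens": 32}'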


Message from future me: you’ll face a lot of unrelated problems when trying to deploy on a single pod, so start with a cluster if you have one!

Now for the real problem that needs solving: we need to add /v1/audio/transcriptions support to the production stack.

POC: Add /v1/audio/transcriptions Support to vLLM Production-Stack

1. Background

  • GitHub Issue: vllm-project/production-stack#410
  • Problem:
    The vLLM backend already supports /v1/audio/transcriptions (Whisper-based ASR),
    but the production-stack router does not expose it, causing 404 errors.
  • Goal:
    Extend the production-stack router to expose the /v1/audio/transcriptions endpoint and successfully forward requests to the backend.
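To make “success” concrete, this is roughly the request the router should be able to forward once the endpoint is exposed (a sketch; the router address, API key, audio file, and Whisper model name are placeholders). Today the router answers this path with a 404; after the change it should proxy it to a Whisper-serving vLLM backend.

# OpenAI-style transcription request sent through the production-stack router
curl http://<router-service>/v1/audio/transcriptions \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -F file=@sample.wav \
  -F model=openai/whisper-small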

2. Environment Setup

| Pain in single pod | Why it breaks | How Instant Clusters fix it |
| --- | --- | --- |
| OverlayFS / bind‑mount denied | Container security on a pod | Nodes are full VMs → kernel features allowed (RunPod Docs) |
| /dev/kmsg & other device files missing | Device files masked | Full device namespace, kubelet starts cleanly |
| Static intra‑cluster networking needed | Pod IPs are ephemeral | Each node gets a static NODE_ADDR; primary node gets PRIMARY_ADDR (RunPod Docs) |
| GPU scheduling | No privileged runtime | Install NVIDIA device‑plugin once; GPUs appear in kubectl describe node |
| Scalability | Single host only | Up to 8 nodes × 8 GPUs per node (64 GPU max) with high‑speed mesh (RunPod Docs) |

So what machine configuration should I choose?

| Option | Meets K8s‑production need? | Gives GPU node? | Avoids earlier sandbox errors? | Practical notes |
| --- | --- | --- | --- | --- |
| RunPod Single Pod | ✘ (kubelet fails) | ✔ RTX 3090 | ✘ sandbox limits | Already ruled out |
| RunPod Instant Cluster | ✘ (K8s not supported) | ✔ multi‑GPU | ✔ root VM, but no kube‑API | Great for PyTorch/Slurm, not Helm |
| RunPod Bare Metal | ✔ install K3s easily | ✔ pick 1–8 GPUs | ✔ full server | Same RunPod UI; hourly billing |
| AWS / Azure / GCP managed K8s (EKS / AKS / GKE) | ✔ fully managed control‑plane | ✔ add GPU node‑pool | ✔ real VMs | Slightly higher cost; IAM setup |
| Self‑host K3s on any GPU VM (Lambda, Paperspace, Hetzner, OCI) | ✔ single‑node or multi‑node | ✔ if you rent GPU VM | ✔ full root | You manage everything, but simple for dev work |
  • uhh, I still have 10 bucks remaining on RunPod…
    • the cheapest bare-metal cluster costs hundreds of bucks :(
    • and I do need a k8s environment

I found that GKE is the cheapest option

Why GKE?

  • Focus on your app, not infrastructure: GKE handles the heavy lifting of managing Kubernetes, letting you concentrate on building and innovating.
  • GPU Power Made Easy: GKE makes it straightforward to use GPUs, which is crucial for your AI/ML workloads and for resolving this specific bug.
  • Scale Effortlessly: GKE grows with you, easily handling increased demand on your application without you having to worry about the details.
  • Production-Ready Reproduction: GKE gives you the environment you need to pinpoint and fix the router pod, vLLM + Whisper worker, and GPU interaction bug.
  • Plays Well with Everything: GKE integrates smoothly with other Google Cloud services you might need.
  • Cost-Effective: GKE has features like Autopilot mode and cost optimization tools to help you manage your spending.
  • Secure: GKE offers built-in security features to protect your applications.
  • Future-Proof: Kubernetes is the industry standard, and GKE is a leading managed Kubernetes service.

  My cluster choices (a creation-command sketch follows this list):

  1. standard cluster
  2. region: us-east1
  3. fleet registration: A Google Cloud fleet is a top‑level “super‑cluster” object that lets you manage multiple Kubernetes clusters (GKE, on‑prem, or other clouds) as a single unit, enabling shared policies, multi‑cluster services, and centralized observability with no extra cost.
  4. restrict control-plane access (CIDR) to only my IP address
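For reference, recreating an Autopilot cluster like the one I end up connecting to below could look roughly like this from the CLI (a sketch; I actually clicked through the console, and the cluster name and CIDR are placeholders):

# hypothetical recreation of the cluster via gcloud
gcloud container clusters create-auto autopilot-cluster-1 \
  --region us-east1 \
  --enable-master-authorized-networks \
  --master-authorized-networks <my-ip>/32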

Connect to cluster

For Google Cloud, we can install the gcloud CLI tool and then use gcloud container clusters get-credentials to connect to the cluster.

# the command we ultimately want to run (needs the gcloud CLI installed first):
# gcloud container clusters get-credentials autopilot-cluster-1 --region us-east1

# install the gcloud CLI (macOS, Apple Silicon build)
wget https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-darwin-arm.tar.gz
tar -xf google-cloud-cli-darwin-arm.tar.gz
cd google-cloud-sdk
chmod +x install.sh  # just in case...
sh install.sh

And it writes to your rc file… Check the execution order; you can modify it yourself.
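A quick check that the new shell actually picks up the CLI (a sketch):

# confirm gcloud resolves on PATH and print its version
which gcloud && gcloud --version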

Then log in with the gcloud CLI and set the corresponding project ID

  1. log in
gcloud auth login
  2. set the corresponding project ID
gcloud config set project gen-lang-client-0475913374
  3. connect the local kubectl to the cluster
gcloud container clusters get-credentials autopilot-cluster-1 --region us-east1

# Fetching cluster endpoint and auth data.
# CRITICAL: ACTION REQUIRED: gke-gcloud-auth-plugin, which is needed for continued use of kubectl, was not found or is not executable. Install gke-gcloud-auth-plugin for use with kubectl by following https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_plugin
# kubeconfig entry generated for autopilot-cluster-1.

OK, more installations…

  1. Install the required plugin

reference

gcloud components install gke-gcloud-auth-plugin

#
# Your current Google Cloud CLI version is: 521.0.0
# Installing components from version: 521.0.0
#
# ┌────────────────────────────────────────────────────────────────┐
# │              These components will be installed.               │
# ├────────────────────────────────────────────┬─────────┬─────────┤
# │                    Name                    │ Version │   Size  │
# ├────────────────────────────────────────────┼─────────┼─────────┤
# │ gke-gcloud-auth-plugin (Platform Specific) │  0.5.10 │ 3.3 MiB │
# └────────────────────────────────────────────┴─────────┴─────────┘
#
# For the latest full release notes, please visit:
#   https://cloud.google.com/sdk/release_notes
#
# Once started, canceling this operation may leave your SDK installation in an inconsistent state.
#
# Do you want to continue (Y/n)?  Y
#
# Performing in place update...
#
# ╔════════════════════════════════════════════════════════════╗
# ╠═ Downloading: gke-gcloud-auth-plugin                      ═╣
# ╠════════════════════════════════════════════════════════════╣
# ╠═ Downloading: gke-gcloud-auth-plugin (Platform Specific)  ═╣
# ╠════════════════════════════════════════════════════════════╣
# ╠═ Installing: gke-gcloud-auth-plugin                       ═╣
# ╠════════════════════════════════════════════════════════════╣
# ╠═ Installing: gke-gcloud-auth-plugin (Platform Specific)   ═╣
# ╚════════════════════════════════════════════════════════════╝
#
# Performing post processing steps...done.
#
# Update done!
  2. Try to initialize the cluster again
gcloud container clusters get-credentials autopilot-cluster-1 --region us-east1 # or other region

# Fetching cluster endpoint and auth data.
# kubeconfig entry generated for autopilot-cluster-1.

[!NOTE] Performance Monitoring: To monitor and verify the latency and performance between your location and the GKE cluster, you can utilize tools like GCPing, which provides latency measurements to various Google Cloud regions.

  3. Connect to the cluster 🔥
> kubectl get nodes
No resources found

Nice, it’s working (an empty node list is expected on Autopilot, since nodes are only provisioned once a workload requests resources)

Install NVIDIA GPU Device Plugin (needed?)

  • Currently, no: GKE Autopilot automatically provisions GPU nodes and installs the necessary NVIDIA drivers and device plugins when you deploy a workload that requests GPU resources (see the sketch below).

Psst Here is the Plugin repo
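For example, a GPU-requesting pod on Autopilot could look roughly like this (a sketch; the pod name, image tag, and accelerator type are placeholders, not what the production-stack chart ultimately deploys):

# hypothetical smoke test: Autopilot provisions a GPU node for this pod on its own
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-l4
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF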

Reproducing the issue

Deploy the current production-stack on GCP

For GCP, there are a few relevant docs in the repo:

  • /tutorials/cloud_deployments/02-GCP-GKE-deployment.md
  • /deployment_on_cloud/gcp/README.md
  • /deployment_on_cloud/gcp/OPT125_CPU/README.md

I already have a GKE cluster, so I just need to manually deploy the vLLM production stack using Helm; there’s no need to use the scripts that create a cluster.

Add the official Helm repo to my local Helm setup.

helm repo add production-stack https://vllm-project.github.io/production-stack-helm-charts/
helm repo update

[!NOTE] I can access the Autopilot k8s cluster locally; think of Autopilot as a serverless Kubernetes experience. If I were to gain SSH access to an Autopilot node, it would be considered a security vulnerability, because I could potentially bypass the Kubernetes API and run commands directly on the node.

Here are the official docs for vLLM production deployment. For code, we will use the Helm repo.

  • The official documentation uses the lmcache repo, which might be outdated.

So, the correct way to add the production-stack repo is:


# adds the URL to local list of helm repositories
helm repo add vllm https://vllm-project.github.io/production-stack

# > helm repo add vllm https://vllm-project.github.io/production-stack
# zsh: command not found: helm

# downloads and refreshes the latest list of charts from the repo added
helm repo update

Instead of lmcache.github.io/helm/

😅 Need to install Helm; yay, one more install.

To manage Kubernetes deployments on the GKE Autopilot cluster using Helm, I need to install the Helm CLI on my local machine.

Here are the official docs on how to install Helm. Here is the one-liner:

curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
#
# > curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
#   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
#                                  Dload  Upload   Total   Spent    Left  Speed
# 100 11913  100 11913    0     0  61922      0 --:--:-- --:--:-- --:--:-- 62046
# Downloading https://get.helm.sh/helm-v3.17.3-darwin-arm64.tar.gz
# Verifying checksum... Done.
# Preparing to install helm into /usr/local/bin
# Password:
# helm installed into /usr/local/bin/helm

Rerun the repo add and update commands


# adds the URL to local list of helm repositories
helm repo add vllm https://vllm-project.github.io/production-stack
# "vllm" has been added to your repositories

# downloads and refreshes the latest list of charts from the repo added
helm repo update
# Hang tight while we grab the latest from your chart repositories...
# ...Successfully got an update from the "vllm" chart repository
# Update Complete. ⎈Happy Helming!⎈

helm search repo vllm
# NAME            CHART VERSION   APP VERSION     DESCRIPTION
# vllm/vllm-stack 0.1.2                           The stack deployment of vLLM

# quick sanity check
helm show values vllm/vllm-stack
# # -- Default values for llmstack helm chart
# # -- Declare variables to be passed into your templates.
#
# # -- Serving engine configuratoon
# servingEngineSpec:
#   enableEngine: true
#   # -- Customized labels for the serving engine deployment
#   labels:
#     environment: "test"
#     release: "test"
#   # vllmApiKey: (optional) api key for securing the vLLM models. Can be either:
#   #   - A string containing the token directly (will be stored in a generated secret)
#   #   - An object referencing an existing secret:
#   #     secretName: "my-existing-secret"
#   #     secretKey: "vllm-api-key"
#   #
#   # modelSpec - configuring multiple serving engines deployments that runs different models
#   # Each entry in the modelSpec array should contain the following fields:
#   # - name: (string) The name of the model, e.g., "example-model"
#   # - repository: (string) The repository of the model, e.g., "vllm/vllm-openai"
#   # - tag: (string) The tag of the model, e.g., "latest"
#   # - imagePullSecret: (Optional, string) Name of secret with credentials to private container repository, e.g. "secret"
#   # - modelURL: (string) The URL of the model, e.g., "facebook/opt-125m"
#   # - chatTemplate: (Optional, string) Chat template (Jinga2) specifying tokenizer configuration, e.g. "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ 'Question:\n' + message['content'] + '\n\n' }}{% elif message['role'] == 'system' %}\n{{ 'System:\n' + message['content'] + '\n\n' }}{% elif message['role'] == 'assistant' %}{{ 'Answer:\n'  + message['content'] + '\n\n' }}{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ 'Answer:\n' }}{% endif %}{% endfor %}"
#   #
#   # - replicaCount: (int) The number of replicas for the model, e.g. 1
#   # - requestCPU: (int) The number of CPUs requested for the model, e.g. 6
#   # - requestMemory: (string) The amount of memory requested for the model, e.g., "16Gi"
#   # - requestGPU: (int) The number of GPUs requested for the model, e.g., 1
#   # - requestGPUType: (Optional, string) The type of GPU requested, e.g., "nvidia.com/mig-4g.71gb". If not specified, defaults to "nvidia.com/gpu"
#   # - limitCPU: (Optional, string) The CPU limit for the model, e.g., "8"
#   # - limitMemory: (Optional, string) The memory limit for the model, e.g., "32Gi"
#   # Note: If limitCPU and limitMemory are not specified, only GPU resources will have limits set equal to their requests.
#   # - pvcStorage: (Optional, string) The amount of storage requested for the model, e.g., "50Gi".
#   # - pvcAccessMode: (Optional, list) The access mode policy for the mounted volume, e.g., ["ReadWriteOnce"]
#   # - storageClass: (Optional, String) The storage class of the PVC e.g., "", default is ""
#   # - pvcMatchLabels: (Optional, map) The labels to match the PVC, e.g., {model: "opt125m"}
#   # - extraVolumes: (Optional, list) Additional volumes to add to the pod, in Kubernetes volume format. https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.32/#volume-v1-core
#   #   Example for an emptyDir volume:
#   #   extraVolumes:
#   #   - name: tmp-volume
#   #     emptyDir:
#   #       medium: ""
#   #       sizeLimit: 5Gi
#   # - extraVolumeMounts: (Optional, list) Additional volume mounts to add to the container, in Kubernetes volumeMount format. https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.32/#volumemount-v1-core
#   #   Example for mounting the tmp-volume to /tmp:
#   #   extraVolumeMounts:
#   #   - name: tmp-volume
#   #     mountPath: /tmp
#   # - initContainer: (optional, list of objects) The configuration for the init container to be run before the main container.
#   #   - name: (string) The name of the init container, e.g., "init"
#   #   - image: (string) The Docker image for the init container, e.g., "busybox:latest"
#   #   - command: (optional, list) The command to run in the init container, e.g., ["sh", "-c"]
#   #   - args: (optional, list) Additional arguments to pass to the command, e.g., ["ls"]
#   #   - env: (optional, list) List of environment variables to set in the container, each being a map with:
#   #   - resources: (optional, map) The resource requests and limits for the container:
#   #   - mountPvcStorage: (optional, bool) Whether to mount the model's volume.
#   #
#   # - vllmConfig: (optional, map) The configuration for the VLLM model, supported options are:
#   #   - enablePrefixCaching: (optional, bool) Enable prefix caching, e.g., false
#   #   - enableChunkedPrefill: (optional, bool) Enable chunked prefill, e.g., false
#   #   - maxModelLen: (optional, int) The maximum model length, e.g., 16384
#   #   - dtype: (optional, string) The data type, e.g., "bfloat16"
#   #   - tensorParallelSize: (optional, int) The degree of tensor parallelism, e.g., 2
#   #   - maxNumSeqs: (optional, int) Maximum number of sequences to be processed in a single iteration., e.g., 32
#   #   - maxLoras: (optional, int) The maximum number of LoRA models to be loaded in a single batch, e.g., 4
#   #   - gpuMemoryUtilization: (optional, float) The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. e.g., 0.95
#   #   - extraArgs: (optional, list) Extra command line arguments to pass to vLLM, e.g., ["--disable-log-requests"]
#   #
#   # - lmcacheConfig: (optional, map) The configuration of the LMCache for KV offloading, supported options are:
#   #   - enabled: (optional, bool) Enable LMCache, e.g., true
#   #   - cpuOffloadingBufferSize: (optional, string) The CPU offloading buffer size, e.g., "30"
#   #
#   # - hf_token: (optional) Hugging Face token configuration. Can be either:
#   #   - A string containing the token directly (will be stored in a generated secret)
#   #   - An object referencing an existing secret:
#   #     secretName: "my-existing-secret"
#   #     secretKey: "hf-token-key"
#   #
#   # - env: (optional, list) The environment variables to set in the container, e.g., your HF_TOKEN
#   #
#   # - nodeSelectorTerms: (optional, list) The node selector terms to match the nodes
#   #
#   # - shmSize: (optional, string) The size of the shared memory, e.g., "20Gi"
#   #
#   # Example:
#   # vllmApiKey: "vllm_xxxxxxxxxxxxx"
#   # modelSpec:
#   # - name: "mistral"
#   #   repository: "lmcache/vllm-openai"
#   #   tag: "latest"
#   #   modelURL: "mistralai/Mistral-7B-Instruct-v0.2"
#   #   replicaCount: 1
#   #
#   #   requestCPU: 10
#   #   requestMemory: "64Gi"
#   #   requestGPU: 1
#   #
#   #   pvcStorage: "50Gi"
#   #   pvcAccessMode:
#   #     - ReadWriteOnce
#   #   pvcMatchLabels:
#   #     model: "mistral"
#   #   initContainer:
#   #     name: my-container
#   #     image: busybox
#   #     command: ["sh"]
#   #     env: {}
#   #     args: []
#   #     resources: {}
#   #     mountPvcStorage: true
#   #
#   #   vllmConfig:
#   #     enableChunkedPrefill: false
#   #     enablePrefixCaching: false
#   #     maxModelLen: 16384
#   #     dtype: "bfloat16"
#   #     maxNumSeqs: 32
#   #     gpuMemoryUtilization: 0.95
#   #     maxLoras: 4
#   #     extraArgs: ["--disable-log-requests", "--trust-remote-code"]
#   #
#   #   lmcacheConfig:
#   #     enabled: true
#   #     cpuOffloadingBufferSize: "30"
#   #
#   #   hf_token: "hf_xxxxxxxxxxxxx"
#   #
#   #
#   #   nodeSelectorTerms:
#   #     - matchExpressions:
#   #       - key: nvidia.com/gpu.product
#   #         operator: "In"
#   #         values:
#   #         - "NVIDIA-RTX-A6000"
#   modelSpec: []
#
#   # -- Container port
#   containerPort: 8000
#   # -- Service port
#   servicePort: 80
#
#   # -- Set other environment variables from config map
#   configs: {}
#
#   # -- deployment strategy
#   strategy: {}
#
#   # -- Readiness probe configuration
#   startupProbe:
#     # -- Number of seconds after the container has started before startup probe is initiated
#     initialDelaySeconds: 15
#     # -- How often (in seconds) to perform the startup probe
#     periodSeconds: 10
#     # -- Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not ready
#     failureThreshold:
#       60
#       # -- Configuration of the Kubelet http request on the server
#     httpGet:
#       # -- Path to access on the HTTP server
#       path: /health
#       # -- Name or number of the port to access on the container, on which the server is listening
#       port: 8000
#
#   # -- Liveness probe configuration
#   livenessProbe:
#     # -- Number of seconds after the container has started before liveness probe is initiated
#     initialDelaySeconds: 15
#     # -- Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not alive
#     failureThreshold: 3
#     # -- How often (in seconds) to perform the liveness probe
#     periodSeconds: 10
#     # -- Configuration of the Kubelet http request on the server
#     httpGet:
#       # -- Path to access on the HTTP server
#       path: /health
#       # -- Name or number of the port to access on the container, on which the server is listening
#       port: 8000
#
#   # -- Disruption Budget Configuration
#   maxUnavailablePodDisruptionBudget: ""
#
#   # -- Tolerations configuration (when there are taints on nodes)
#   # Example:
#   # tolerations:
#   #   - key: "node-role.kubernetes.io/control-plane"
#   #     operator: "Exists"
#   #     effect: "NoSchedule"
#   tolerations: []
#
#   # -- RuntimeClassName configuration, set to "nvidia" if the model requires GPU
#   runtimeClassName: "nvidia"
#
#   # -- SchedulerName configuration
#   schedulerName: ""
#
#   # -- Pod-level security context configuration. https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.32/#podsecuritycontext-v1-core
#   securityContext: {}
#     # -- Run as a non-root user ID
#     # runAsUser: 1000
#     # -- Run with a non-root group ID
#     # runAsGroup: 1000
#     # -- Run as non-root
#     # runAsNonRoot: true
#     # -- Set the seccomp profile
#     # seccompProfile:
#     #   type: RuntimeDefault
#     # -- Drop all capabilities
#     # capabilities:
#     #   drop:
#     #   - ALL
#     # -- Set the file system group ID for all containers
#     # fsGroup: 1000
#
#   # -- Container-level security context configuration. https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.32/#securitycontext-v1-core
#   containerSecurityContext:
#     # -- Run as non-root
#     runAsNonRoot: false
#     # -- Don't allow privilege escalation
#     # allowPrivilegeEscalation: false
#     # -- Drop all capabilities
#     # capabilities:
#     #   drop:
#     #   - ALL
#     # -- Read-only root filesystem
#     # readOnlyRootFilesystem: true
#
# routerSpec:
#   # -- The docker image of the router. The following values are defaults:
#   repository: "lmcache/lmstack-router"
#   tag: "latest"
#   imagePullPolicy: "Always"
#
#   # -- Whether to enable the router service
#   enableRouter: true
#
#   # -- Number of replicas
#   replicaCount: 1
#
#   # -- Container port
#   containerPort: 8000
#
#   # -- Service type
#   serviceType: ClusterIP
#
#   # -- Service port
#   servicePort: 80
#
#   # -- Service discovery mode, supports "k8s" or "static". Defaults to "k8s" if not set.
#   serviceDiscovery: "k8s"
#
#   # -- If serviceDiscovery is set to "static", the comma-separated values below are required. There needs to be the same number of backends and models
#   staticBackends: ""
#   staticModels: ""
#
#   # -- routing logic, could be "roundrobin" or "session"
#   routingLogic: "roundrobin"
#
#   # -- session key if using "session" routing logic
#   sessionKey: ""
#
#   # -- extra router commandline arguments
#   extraArgs: []
#
#   # -- Interval in seconds to scrape the serving engine metrics
#   engineScrapeInterval: 15
#
#   # -- Window size in seconds to calculate the request statistics
#   requestStatsWindow: 60
#
#   # -- deployment strategy
#   strategy: {}
#
#   # vllmApiKey: (optional) api key for securing the vLLM models. Must be an object referencing an existing secret
#   #   secretName: "my-existing-secret"
#   #   secretKey: "vllm-api-key"
#
#   # -- router resource requests and limits
#   resources:
#     requests:
#       cpu: "4"
#       memory: "16G"
#     limits:
#       cpu: "8"
#       memory: "32G"
#
#   # -- Customized labels for the router deployment
#   labels:
#     environment: "router"
#     release: "router"
#
#   ingress:
#     # -- Enable ingress controller resource
#     enabled: false
#
#     # -- IngressClass that will be used to implement the Ingress
#     className: ""
#
#     # -- Additional annotations for the Ingress resource
#     annotations:
#       {}
#       # kubernetes.io/ingress.class: alb
#       # kubernetes.io/ingress.class: nginx
#       # kubernetes.io/tls-acme: "true"
#
#     # The list of hostnames to be covered with this ingress record.
#     hosts:
#       - host: vllm-router.local
#         paths:
#           - path: /
#             pathType: Prefix
#
#     # --  The tls configuration for hostnames to be covered with this ingress record.
#     tls: []
#     #  - secretName: vllm-router-tls
#     #    hosts:
#     #      - vllm-router.local
#
#   # The node selector terms to match the nodes
#   # Example:
#   #   nodeSelectorTerms:
#   #     - matchExpressions:
#   #       - key: nvidia.com/gpu.product
#   #         operator: "In"
#   #         values:
#   #         - "NVIDIA-RTX-A6000"
#   nodeSelectorTerms: []
#
#   # -- TODO: Readiness probe configuration
#   #startupProbe:
#   #  # -- Number of seconds after the container has started before startup probe is initiated
#   #  initialDelaySeconds: 5
#   #  # -- How often (in seconds) to perform the startup probe
#   #  periodSeconds: 5
#   #  # -- Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not ready
#   #  failureThreshold: 100
#   #   # -- Configuration of the Kubelet http request on the server
#   #  httpGet:
#   #    # -- Path to access on the HTTP server
#   #

Let’s parse the YAML with yq

> helm show values vllm/vllm-stack | yq
# -- Default values for llmstack helm chart
# -- Declare variables to be passed into your templates.

# -- Serving engine configuratoon
servingEngineSpec:
  enableEngine: true
  # -- Customized labels for the serving engine deployment
  labels:
    environment: "test"
    release: "test"
  # vllmApiKey: (optional) api key for securing the vLLM models. Can be either:
  #   - A string containing the token directly (will be stored in a generated secret)
  #   - An object referencing an existing secret:
  #     secretName: "my-existing-secret"
  #     secretKey: "vllm-api-key"
  #
  # modelSpec - configuring multiple serving engines deployments that runs different models
  # Each entry in the modelSpec array should contain the following fields:
  # - name: (string) The name of the model, e.g., "example-model"
  # - repository: (string) The repository of the model, e.g., "vllm/vllm-openai"
  # - tag: (string) The tag of the model, e.g., "latest"
  # - imagePullSecret: (Optional, string) Name of secret with credentials to private container repository, e.g. "secret"
  # - modelURL: (string) The URL of the model, e.g., "facebook/opt-125m"
  # - chatTemplate: (Optional, string) Chat template (Jinga2) specifying tokenizer configuration, e.g. "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ 'Question:\n' + message['content'] + '\n\n' }}{% elif message['role'] == 'system' %}\n{{ 'System:\n' + message['content'] + '\n\n' }}{% elif message['role'] == 'assistant' %}{{ 'Answer:\n'  + message['content'] + '\n\n' }}{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ 'Answer:\n' }}{% endif %}{% endfor %}"
  #
  # - replicaCount: (int) The number of replicas for the model, e.g. 1
  # - requestCPU: (int) The number of CPUs requested for the model, e.g. 6
  # - requestMemory: (string) The amount of memory requested for the model, e.g., "16Gi"
  # - requestGPU: (int) The number of GPUs requested for the model, e.g., 1
  # - requestGPUType: (Optional, string) The type of GPU requested, e.g., "nvidia.com/mig-4g.71gb". If not specified, defaults to "nvidia.com/gpu"
  # - limitCPU: (Optional, string) The CPU limit for the model, e.g., "8"
  # - limitMemory: (Optional, string) The memory limit for the model, e.g., "32Gi"
  # Note: If limitCPU and limitMemory are not specified, only GPU resources will have limits set equal to their requests.
  # - pvcStorage: (Optional, string) The amount of storage requested for the model, e.g., "50Gi".
  # - pvcAccessMode: (Optional, list) The access mode policy for the mounted volume, e.g., ["ReadWriteOnce"]
  # - storageClass: (Optional, String) The storage class of the PVC e.g., "", default is ""
  # - pvcMatchLabels: (Optional, map) The labels to match the PVC, e.g., {model: "opt125m"}
  # - extraVolumes: (Optional, list) Additional volumes to add to the pod, in Kubernetes volume format. https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.32/#volume-v1-core
  #   Example for an emptyDir volume:
  #   extraVolumes:
  #   - name: tmp-volume
  #     emptyDir:
  #       medium: ""
  #       sizeLimit: 5Gi
  # - extraVolumeMounts: (Optional, list) Additional volume mounts to add to the container, in Kubernetes volumeMount format. https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.32/#volumemount-v1-core
  #   Example for mounting the tmp-volume to /tmp:
  #   extraVolumeMounts:
  #   - name: tmp-volume
  #     mountPath: /tmp
  # - initContainer: (optional, list of objects) The configuration for the init container to be run before the main container.
  #   - name: (string) The name of the init container, e.g., "init"
  #   - image: (string) The Docker image for the init container, e.g., "busybox:latest"
  #   - command: (optional, list) The command to run in the init container, e.g., ["sh", "-c"]
  #   - args: (optional, list) Additional arguments to pass to the command, e.g., ["ls"]
  #   - env: (optional, list) List of environment variables to set in the container, each being a map with:
  #   - resources: (optional, map) The resource requests and limits for the container:
  #   - mountPvcStorage: (optional, bool) Whether to mount the model's volume.
  #
  # - vllmConfig: (optional, map) The configuration for the VLLM model, supported options are:
  #   - enablePrefixCaching: (optional, bool) Enable prefix caching, e.g., false
  #   - enableChunkedPrefill: (optional, bool) Enable chunked prefill, e.g., false
  #   - maxModelLen: (optional, int) The maximum model length, e.g., 16384
  #   - dtype: (optional, string) The data type, e.g., "bfloat16"
  #   - tensorParallelSize: (optional, int) The degree of tensor parallelism, e.g., 2
  #   - maxNumSeqs: (optional, int) Maximum number of sequences to be processed in a single iteration., e.g., 32
  #   - maxLoras: (optional, int) The maximum number of LoRA models to be loaded in a single batch, e.g., 4
  #   - gpuMemoryUtilization: (optional, float) The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. e.g., 0.95
  #   - extraArgs: (optional, list) Extra command line arguments to pass to vLLM, e.g., ["--disable-log-requests"]
  #
  # - lmcacheConfig: (optional, map) The configuration of the LMCache for KV offloading, supported options are:
  #   - enabled: (optional, bool) Enable LMCache, e.g., true
  #   - cpuOffloadingBufferSize: (optional, string) The CPU offloading buffer size, e.g., "30"
  #
  # - hf_token: (optional) Hugging Face token configuration. Can be either:
  #   - A string containing the token directly (will be stored in a generated secret)
  #   - An object referencing an existing secret:
  #     secretName: "my-existing-secret"
  #     secretKey: "hf-token-key"
  #
  # - env: (optional, list) The environment variables to set in the container, e.g., your HF_TOKEN
  #
  # - nodeSelectorTerms: (optional, list) The node selector terms to match the nodes
  #
  # - shmSize: (optional, string) The size of the shared memory, e.g., "20Gi"
  #
  # Example:
  # vllmApiKey: "vllm_xxxxxxxxxxxxx"
  # modelSpec:
  # - name: "mistral"
  #   repository: "lmcache/vllm-openai"
  #   tag: "latest"
  #   modelURL: "mistralai/Mistral-7B-Instruct-v0.2"
  #   replicaCount: 1
  #
  #   requestCPU: 10
  #   requestMemory: "64Gi"
  #   requestGPU: 1
  #
  #   pvcStorage: "50Gi"
  #   pvcAccessMode:
  #     - ReadWriteOnce
  #   pvcMatchLabels:
  #     model: "mistral"
  #   initContainer:
  #     name: my-container
  #     image: busybox
  #     command: ["sh"]
  #     env: {}
  #     args: []
  #     resources: {}
  #     mountPvcStorage: true
  #
  #   vllmConfig:
  #     enableChunkedPrefill: false
  #     enablePrefixCaching: false
  #     maxModelLen: 16384
  #     dtype: "bfloat16"
  #     maxNumSeqs: 32
  #     gpuMemoryUtilization: 0.95
  #     maxLoras: 4
  #     extraArgs: ["--disable-log-requests", "--trust-remote-code"]
  #
  #   lmcacheConfig:
  #     enabled: true
  #     cpuOffloadingBufferSize: "30"
  #
  #   hf_token: "hf_xxxxxxxxxxxxx"
  #
  #
  #   nodeSelectorTerms:
  #     - matchExpressions:
  #       - key: nvidia.com/gpu.product
  #         operator: "In"
  #         values:
  #         - "NVIDIA-RTX-A6000"
  modelSpec: []
  # -- Container port
  containerPort: 8000
  # -- Service port
  servicePort: 80
  # -- Set other environment variables from config map
  configs: {}
  # -- deployment strategy
  strategy: {}
  # -- Startup probe configuration
  startupProbe:
    # -- Number of seconds after the container has started before startup probe is initiated
    initialDelaySeconds: 15
    # -- How often (in seconds) to perform the startup probe
    periodSeconds: 10
    # -- Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not ready
    failureThreshold: 60
    # -- Configuration of the Kubelet http request on the server
    httpGet:
      # -- Path to access on the HTTP server
      path: /health
      # -- Name or number of the port to access on the container, on which the server is listening
      port: 8000
  # -- Liveness probe configuration
  livenessProbe:
    # -- Number of seconds after the container has started before liveness probe is initiated
    initialDelaySeconds: 15
    # -- Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not alive
    failureThreshold: 3
    # -- How often (in seconds) to perform the liveness probe
    periodSeconds: 10
    # -- Configuration of the Kubelet http request on the server
    httpGet:
      # -- Path to access on the HTTP server
      path: /health
      # -- Name or number of the port to access on the container, on which the server is listening
      port: 8000
  # -- Disruption Budget Configuration
  maxUnavailablePodDisruptionBudget: ""
  # -- Tolerations configuration (when there are taints on nodes)
  # Example:
  # tolerations:
  #   - key: "node-role.kubernetes.io/control-plane"
  #     operator: "Exists"
  #     effect: "NoSchedule"
  tolerations: []
  # -- RuntimeClassName configuration, set to "nvidia" if the model requires GPU
  runtimeClassName: "nvidia"
  # -- SchedulerName configuration
  schedulerName: ""
  # -- Pod-level security context configuration. https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.32/#podsecuritycontext-v1-core
  securityContext: {}
  # -- Run as a non-root user ID
  # runAsUser: 1000
  # -- Run with a non-root group ID
  # runAsGroup: 1000
  # -- Run as non-root
  # runAsNonRoot: true
  # -- Set the seccomp profile
  # seccompProfile:
  #   type: RuntimeDefault
  # -- Drop all capabilities
  # capabilities:
  #   drop:
  #   - ALL
  # -- Set the file system group ID for all containers
  # fsGroup: 1000

  # -- Container-level security context configuration. https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.32/#securitycontext-v1-core
  containerSecurityContext:
    # -- Run as non-root
    runAsNonRoot: false
    # -- Don't allow privilege escalation
    # allowPrivilegeEscalation: false
    # -- Drop all capabilities
    # capabilities:
    #   drop:
    #   - ALL
    # -- Read-only root filesystem
    # readOnlyRootFilesystem: true
routerSpec:
  # -- The docker image of the router. The following values are defaults:
  repository: "lmcache/lmstack-router"
  tag: "latest"
  imagePullPolicy: "Always"
  # -- Whether to enable the router service
  enableRouter: true
  # -- Number of replicas
  replicaCount: 1
  # -- Container port
  containerPort: 8000
  # -- Service type
  serviceType: ClusterIP
  # -- Service port
  servicePort: 80
  # -- Service discovery mode, supports "k8s" or "static". Defaults to "k8s" if not set.
  serviceDiscovery: "k8s"
  # -- If serviceDiscovery is set to "static", the comma-separated values below are required. There needs to be the same number of backends and models
  staticBackends: ""
  staticModels: ""
  # -- routing logic, could be "roundrobin" or "session"
  routingLogic: "roundrobin"
  # -- session key if using "session" routing logic
  sessionKey: ""
  # -- extra router commandline arguments
  extraArgs: []
  # -- Interval in seconds to scrape the serving engine metrics
  engineScrapeInterval: 15
  # -- Window size in seconds to calculate the request statistics
  requestStatsWindow: 60
  # -- deployment strategy
  strategy: {}
  # vllmApiKey: (optional) api key for securing the vLLM models. Must be an object referencing an existing secret
  #   secretName: "my-existing-secret"
  #   secretKey: "vllm-api-key"

  # -- router resource requests and limits
  resources:
    requests:
      cpu: "4"
      memory: "16G"
    limits:
      cpu: "8"
      memory: "32G"
  # -- Customized labels for the router deployment
  labels:
    environment: "router"
    release: "router"
  ingress:
    # -- Enable ingress controller resource
    enabled: false
    # -- IngressClass that will be used to implement the Ingress
    className: ""
    # -- Additional annotations for the Ingress resource
    annotations: {}
    # kubernetes.io/ingress.class: alb
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"

    # The list of hostnames to be covered with this ingress record.
    hosts:
      - host: vllm-router.local
        paths:
          - path: /
            pathType: Prefix
    # --  The tls configuration for hostnames to be covered with this ingress record.
    tls: []
    #  - secretName: vllm-router-tls
    #    hosts:
    #      - vllm-router.local
  # The node selector terms to match the nodes
  # Example:
  #   nodeSelectorTerms:
  #     - matchExpressions:
  #       - key: nvidia.com/gpu.product
  #         operator: "In"
  #         values:
  #         - "NVIDIA-RTX-A6000"
  nodeSelectorTerms: []
# -- TODO: Startup probe configuration
#startupProbe:
#  # -- Number of seconds after the container has started before startup probe is initiated
#  initialDelaySeconds: 5
#  # -- How often (in seconds) to perform the startup probe
#  periodSeconds: 5
#  # -- Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not ready
#  failureThreshold: 100
#   # -- Configuration of the Kubelet http request on the server
#  httpGet:
#    # -- Path to access on the HTTP server
#

Here is what these YAML settings mean:

| Section | What It Means | Impact |
| --- | --- | --- |
| servingEngineSpec.enableEngine: true | The vLLM serving engine will deploy | ✅ A model server (FastAPI/vLLM backend) will run |
| servingEngineSpec.modelSpec: [] | No models configured! | ❌ The vLLM backend will start EMPTY: no model loaded, not usable |
| servingEngineSpec.runtimeClassName: "nvidia" | Requests GPU scheduling (runtimeClass = NVIDIA) | ⚠️ Requires GPU-capable nodes if enforced (Autopilot handles this automatically, but may error if none are available) |
| servingEngineSpec.containerPort: 8000, servicePort: 80 | The pod listens on 8000 internally; the service listens on 80 | Standard setup for HTTP access inside the cluster |
| routerSpec.enableRouter: true | The API router (proxy) will deploy | ✅ Will be able to route user traffic |
| routerSpec.repository: lmcache/lmstack-router, tag: latest | Uses the default router Docker image | Fine, but pulling latest can occasionally be unstable |
| routerSpec.serviceType: ClusterIP | The router is only reachable inside the cluster | ❗ No public IP; needs a port-forward or LoadBalancer later if required |
| routerSpec.routingLogic: roundrobin | The router spreads requests evenly across backends | Good for simple load balancing |
| routerSpec.resources.requests: {cpu: "4", memory: "16G"} | The router reserves at least 4 CPU cores and 16G of RAM | 🏋️ Heavy for tiny clusters; fine at production scale |
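
As the table notes, the router Service is ClusterIP-only, so from outside the cluster you would first need a kubectl port-forward (or switch to a LoadBalancer). A quick check through a forwarded port might look like this sketch; the exact service name depends on your Helm release:

# Assumes something like: kubectl port-forward svc/<release>-router-service 8000:80
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=5)
print(resp.status_code, resp.json())  # expect 200 and the models served behind the router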

I found that I could test this using just one pod with a GPU.

Open an API port on runpod

  - 8001 is already used by nginx, so let's use 8002
# Serve on port 8002 instead:
vllm serve --task transcription openai/whisper-small \
  --host 0.0.0.0 --port 8002

# if successful, you will see a log like this
# (workspace) root@33e3bb091952:/workspace/production-stack# vllm serve   --task transcription   openai/whisper-small   --host 0.0.0.0   --port 8002
# INFO 05-29 21:52:43 [__init__.py:243] Automatically detected platform cuda.
# INFO 05-29 21:52:59 [__init__.py:31] Available plugins for group vllm.general_plugins:
# INFO 05-29 21:52:59 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
# INFO 05-29 21:52:59 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
# INFO 05-29 21:53:01 [api_server.py:1289] vLLM API server version 0.9.0
# INFO 05-29 21:53:02 [cli_args.py:300] non-default args: {'host': '0.0.0.0', 'port': 8002, 'task': 'transcription'}
# config.json: 100%|███████████████████████████████████████████████████████████| 1.97k/1.97k [00:00<00:00, 11.5MB/s]
# preprocessor_config.json: 100%|████████████████████████████████████████████████| 185k/185k [00:00<00:00, 4.46MB/s]
# INFO 05-29 21:53:03 [config.py:3131] Downcasting torch.float32 to torch.float16.
# WARNING 05-29 21:53:30 [arg_utils.py:1583] --task transcription is not supported by the V1 Engine. Falling back to V0.
# INFO 05-29 21:53:46 [__init__.py:243] Automatically detected platform cuda.
# INFO 05-29 21:53:54 [api_server.py:257] Started engine process with PID 9035
# INFO 05-29 21:53:54 [__init__.py:31] Available plugins for group vllm.general_plugins:
# INFO 05-29 21:53:54 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
# INFO 05-29 21:53:54 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
# INFO 05-29 21:53:54 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.0) with config: model='openai/whisper-small', speculative_config=None, tokenizer='openai/whisper-small', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=448, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=openai/whisper-small, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, pooler_config=None, compilation_config={"compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "cudagraph_capture_sizes": [256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 256}, use_cached_outputs=True,
# tokenizer_config.json: 100%|███████████████████████████████████████████████████| 283k/283k [00:00<00:00, 4.41MB/s]
# vocab.json: 100%|██████████████████████████████████████████████████████████████| 836k/836k [00:00<00:00, 33.2MB/s]
# tokenizer.json: 100%|████████████████████████████████████████████████████████| 2.48M/2.48M [00:00<00:00, 31.7MB/s]
# merges.txt: 100%|██████████████████████████████████████████████████████████████| 494k/494k [00:00<00:00, 14.5MB/s]
# normalizer.json: 100%|████████████████████████████████████████████████████████| 52.7k/52.7k [00:00<00:00, 242MB/s]
# added_tokens.json: 100%|██████████████████████████████████████████████████████| 34.6k/34.6k [00:00<00:00, 203MB/s]
# special_tokens_map.json: 100%|███████████████████████████████████████████████| 2.19k/2.19k [00:00<00:00, 34.0MB/s]
# generation_config.json: 100%|████████████████████████████████████████████████| 3.87k/3.87k [00:00<00:00, 48.9MB/s]
# INFO 05-29 21:53:55 [cuda.py:292] Using Flash Attention backend.
# INFO 05-29 21:53:56 [parallel_state.py:1064] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
# INFO 05-29 21:53:56 [model_runner.py:1170] Starting to load model openai/whisper-small...
# INFO 05-29 21:53:57 [weight_utils.py:291] Using model weights format ['*.safetensors']
# model.safetensors: 100%|████████████████████████████████████████████████████████| 967M/967M [00:03<00:00, 294MB/s]
# INFO 05-29 21:54:01 [weight_utils.py:307] Time spent downloading weights for openai/whisper-small: 3.485868 seconds
# INFO 05-29 21:54:01 [weight_utils.py:344] No model.safetensors.index.json found in remote.
# Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
# Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.22it/s]
# Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.22it/s]
#
# INFO 05-29 21:54:01 [default_loader.py:280] Loading weights took 0.28 seconds
# INFO 05-29 21:54:01 [model_runner.py:1202] Model loading took 0.4668 GiB and 5.083266 seconds
# INFO 05-29 21:54:03 [enc_dec_model_runner.py:315] Starting profile run for multi-modal models.
# INFO 05-29 21:54:07 [worker.py:291] Memory profiling takes 5.44 seconds
# INFO 05-29 21:54:07 [worker.py:291] the current vLLM instance can use total_gpu_memory (23.54GiB) x gpu_memory_utilization (0.90) = 21.19GiB
# INFO 05-29 21:54:07 [worker.py:291] model weights take 0.47GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 6.29GiB; the rest of the memory reserved for KV Cache is 14.35GiB.
# INFO 05-29 21:54:07 [executor_base.py:112] # cuda blocks: 26125, # CPU blocks: 7281
# INFO 05-29 21:54:07 [executor_base.py:117] Maximum concurrency for 448 tokens per request: 933.04x
# INFO 05-29 21:54:12 [model_runner.py:1512] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
# Capturing CUDA graph shapes: 100%|████████████████████████████████████████████████| 35/35 [00:15<00:00,  2.22it/s]
# INFO 05-29 21:54:27 [model_runner.py:1670] Graph capturing finished in 16 secs, took 0.15 GiB
# INFO 05-29 21:54:27 [llm_engine.py:428] init engine (profile, create kv cache, warmup model) took 25.88 seconds
# INFO 05-29 21:54:29 [api_server.py:1336] Starting vLLM API server on http://0.0.0.0:8002
# INFO 05-29 21:54:29 [launcher.py:28] Available routes are:
# INFO 05-29 21:54:29 [launcher.py:36] Route: /openapi.json, Methods: HEAD, GET
# INFO 05-29 21:54:29 [launcher.py:36] Route: /docs, Methods: HEAD, GET
# INFO 05-29 21:54:29 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: HEAD, GET
# INFO 05-29 21:54:29 [launcher.py:36] Route: /redoc, Methods: HEAD, GET
# INFO 05-29 21:54:29 [launcher.py:36] Route: /health, Methods: GET
# INFO 05-29 21:54:29 [launcher.py:36] Route: /load, Methods: GET
# INFO 05-29 21:54:29 [launcher.py:36] Route: /ping, Methods: POST
# INFO 05-29 21:54:29 [launcher.py:36] Route: /ping, Methods: GET
# INFO 05-29 21:54:29 [launcher.py:36] Route: /tokenize, Methods: POST
# INFO 05-29 21:54:29 [launcher.py:36] Route: /detokenize, Methods: POST
# INFO 05-29 21:54:29 [launcher.py:36] Route: /v1/models, Methods: GET
# INFO 05-29 21:54:29 [launcher.py:36] Route: /version, Methods: GET
# INFO 05-29 21:54:29 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
# INFO 05-29 21:54:29 [launcher.py:36] Route: /v1/completions, Methods: POST
# INFO 05-29 21:54:29 [launcher.py:36] Route: /v1/embeddings, Methods: POST
# INFO 05-29 21:54:29 [launcher.py:36] Route: /pooling, Methods: POST
# INFO 05-29 21:54:29 [launcher.py:36] Route: /classify, Methods: POST
# INFO 05-29 21:54:29 [launcher.py:36] Route: /score, Methods: POST
# INFO 05-29 21:54:29 [launcher.py:36] Route: /v1/score, Methods: POST
# INFO 05-29 21:54:29 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
# INFO 05-29 21:54:29 [launcher.py:36] Route: /rerank, Methods: POST
# INFO 05-29 21:54:29 [launcher.py:36] Route: /v1/rerank, Methods: POST
# INFO 05-29 21:54:29 [launcher.py:36] Route: /v2/rerank, Methods: POST
# INFO 05-29 21:54:29 [launcher.py:36] Route: /invocations, Methods: POST
# INFO 05-29 21:54:29 [launcher.py:36] Route: /metrics, Methods: GET
# INFO:     Started server process [8831]
# INFO:     Waiting for application startup.
# INFO:     Application startup complete.

# And update your router to point there:
export VLLM_BACKEND_URL="http://localhost:8002"
uvicorn main_router:app --reload --port 8000
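
Before wiring the router to it, a quick sanity check that the backend on 8002 is up never hurts; here is a minimal sketch using the routes from the log above:

import requests

base = "http://localhost:8002"
print(requests.get(f"{base}/health", timeout=5).status_code)   # expect 200
print(requests.get(f"{base}/v1/models", timeout=5).json())     # expect openai/whisper-small listed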

Also run the production stack.

Good thing we have a consistent Python environment; we can just set the PYTHONPATH.

Also install the dependencies for production-stack in editable mode:

source /workspace/.venv/bin/activate
pip install -e "/workspace/production-stack[dev]"

[!NOTE] production-stack needs Python 3.12, so let's use a Python 3.12 venv.

If you run out of disk space while installing, move the pip cache to the workspace volume:

export PIP_CACHE_DIR=/workspace/.pip_cache
mkdir -p /workspace/.pip_cache
pip cache purge            # clear the old cache on overlay
pip install "vllm[audio]"
pip install -e "production-stack[dev]"

If pip is missing from the venv, bootstrap it with ensurepip:

(workspace) root@33e3bb091952:/workspace# which pip
/usr/local/bin/pip
(workspace) root@33e3bb091952:/workspace# python -m pip install --upgrade pip setuptools
/workspace/.venv/bin/python: No module named pip
(workspace) root@33e3bb091952:/workspace# ls .venv/
CACHEDIR.TAG  bin  lib  lib64  pyvenv.cfg
(workspace) root@33e3bb091952:/workspace# ls .venv/bin
activate      activate.csh   activate.nu   activate_this.py  pydoc.bat  python3
activate.bat  activate.fish  activate.ps1  deactivate.bat    python     python3.12
(workspace) root@33e3bb091952:/workspace# python -m ensurepip --upgrade
Looking in links: /tmp/tmp8w8h4iwe
Processing /tmp/tmp8w8h4iwe/pip-23.2.1-py3-none-any.whl
Installing collected packages: pip
Successfully installed pip-23.2.1

Install vLLM (with audio support) without filling the small overlay cache:

export PIP_NO_CACHE_DIR=1
pip install "vllm[audio]"

Install the router in editable mode:

pip install -e "production-stack[dev]"

Verify both are installed under Python 3.12:

pip list | grep -E "vllm|vllm-router"

# (workspace) root@33e3bb091952:/workspace# pip list | grep -E "vllm|vllm-router"
# vllm                                     0.9.0

Testing script

  1. Start the Whisper backend:
vllm serve --task transcription openai/whisper-small \
  --host 0.0.0.0 --port 8002
  2. Start the production-stack router (your main_router on port 8000), for example with:
  - Add this to src/vllm_router/run-router.sh and comment out the other router commands for testing:
# Use this command when testing with whisper transcription
ROUTER_PORT=$1
BACKEND_URL=$2

python3 -m vllm_router.app \
    --host 0.0.0.0 \
    --port "${ROUTER_PORT}" \
    --service-discovery static \
    --static-backends "${BACKEND_URL}" \
    --static-models "openai/whisper-small" \
    --static-model-types "transcription" \
    --routing-logic roundrobin \
    --log-stats \
    --engine-stats-interval 10 \
    --request-stats-window 10

# To disable this warning, you can either:
#         - Avoid using `tokenizers` before the fork if possible
#         - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
# INFO 05-29 23:29:15 [api_server.py:1336] Starting vLLM API server on http://0.0.0.0:8002
# INFO 05-29 23:29:15 [launcher.py:28] Available routes are:
# INFO 05-29 23:29:15 [launcher.py:36] Route: /openapi.json, Methods: HEAD, GET
# INFO 05-29 23:29:15 [launcher.py:36] Route: /docs, Methods: HEAD, GET
# INFO 05-29 23:29:15 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: HEAD, GET
# INFO 05-29 23:29:15 [launcher.py:36] Route: /redoc, Methods: HEAD, GET
# INFO 05-29 23:29:15 [launcher.py:36] Route: /health, Methods: GET
# INFO 05-29 23:29:15 [launcher.py:36] Route: /load, Methods: GET
# INFO 05-29 23:29:15 [launcher.py:36] Route: /ping, Methods: POST
# INFO 05-29 23:29:15 [launcher.py:36] Route: /ping, Methods: GET
# INFO 05-29 23:29:15 [launcher.py:36] Route: /tokenize, Methods: POST
# INFO 05-29 23:29:15 [launcher.py:36] Route: /detokenize, Methods: POST
# INFO 05-29 23:29:15 [launcher.py:36] Route: /v1/models, Methods: GET
# INFO 05-29 23:29:15 [launcher.py:36] Route: /version, Methods: GET
# INFO 05-29 23:29:15 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
# INFO 05-29 23:29:15 [launcher.py:36] Route: /v1/completions, Methods: POST
# INFO 05-29 23:29:15 [launcher.py:36] Route: /v1/embeddings, Methods: POST
# INFO 05-29 23:29:15 [launcher.py:36] Route: /pooling, Methods: POST
# INFO 05-29 23:29:15 [launcher.py:36] Route: /classify, Methods: POST
# INFO 05-29 23:29:15 [launcher.py:36] Route: /score, Methods: POST
# INFO 05-29 23:29:15 [launcher.py:36] Route: /v1/score, Methods: POST
# INFO 05-29 23:29:15 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
# INFO 05-29 23:29:15 [launcher.py:36] Route: /rerank, Methods: POST
# INFO 05-29 23:29:15 [launcher.py:36] Route: /v1/rerank, Methods: POST
# INFO 05-29 23:29:15 [launcher.py:36] Route: /v2/rerank, Methods: POST
# INFO 05-29 23:29:15 [launcher.py:36] Route: /invocations, Methods: POST
# INFO 05-29 23:29:15 [launcher.py:36] Route: /metrics, Methods: GET
# INFO:     Started server process [9747]
# INFO:     Waiting for application startup.
# INFO:     Application startup complete.
# INFO:     127.0.0.1:46250 - "GET /metrics HTTP/1.1" 200 OK
  - Make sure to make it executable:
(workspace) root@33e3bb091952:/workspace/production-stack# cd src/vllm_router
(workspace) root@33e3bb091952:/workspace/production-stack/src/vllm_router# chmod +x run-router.sh
(workspace) root@33e3bb091952:/workspace/production-stack/src/vllm_router# ./run-router.sh 8000 http://localhost:8002
# [2025-05-29 23:30:44,964] INFO: Scraping metrics from 1 serving engine(s) (engine_stats.py:136:vllm_router.stats.engine_stats)
# [2025-05-29 23:30:44,964] INFO: Initializing round-robin routing logic (routing_logic.py:392:vllm_router.routers.routing_logic)
# INFO:     Started server process [10309]
# INFO:     Waiting for application startup.
# [2025-05-29 23:30:45,089] INFO: httpx AsyncClient instantiated. Id 140322398720160 (httpx_client.py:31:vllm_router.httpx_client)
# INFO:     Application startup complete.
# INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
# [2025-05-29 23:30:54,965] INFO:
# ==================================================
# Server: http://localhost:8002
# Models:
#   - openai/whisper-small
#  Engine Stats: Running Requests: 0.0, Queued Requests: 0.0, GPU Cache Hit Rate: 0.00
#  Request Stats: No stats available
# --------------------------------------------------
# ==================================================
#  (log_stats.py:104:vllm_router.stats.log_stats)
# [2025-05-29 23:30:54,984] INFO: Scraping metrics from 1 serving engine(s) (engine_stats.py:136:vllm_router.stats.engine_stats)
# [2025-05-29 23:31:04,966] INFO:
# ==================================================
# Server: http://localhost:8002
# Models:
#   - openai/whisper-small
#  Engine Stats: Running Requests: 0.0, Queued Requests: 0.0, GPU Cache Hit Rate: 0.00
#  Request Stats: No stats available
# --------------------------------------------------
# ==================================================
#  (log_stats.py:104:vllm_router.stats.log_stats)
# [2025-05-29 23:31:04,991] INFO: Scraping metrics from 1 serving engine(s) (engine_stats.py:136:vllm_router.stats.engine_stats)
# [2025-05-29 23:31:14,966] INFO:
# ==================================================
# Server: http://localhost:8002
# Models:
#   - openai/whisper-small
#  Engine Stats: Running Requests: 0.0, Queued Requests: 0.0, GPU Cache Hit Rate: 0.00
#  Request Stats: No stats available
# --------------------------------------------------
# ==================================================
#  (log_stats.py:104:vllm_router.stats.log_stats)
# [2025-05-29 23:31:14,997] INFO: Scraping metrics from 1 serving engine(s) (engine_stats.py:136:vllm_router.stats.engine_stats)
# [2025-05-29 23:31:24,967] INFO:
# ==================================================
# Server: http://localhost:8002
# Models:
#   - openai/whisper-small
#  Engine Stats: Running Requests: 0.0, Queued Requests: 0.0, GPU Cache Hit Rate: 0.00
#  Request Stats: No stats available
# --------------------------------------------------
# ==================================================
#  (log_stats.py:104:vllm_router.stats.log_stats)
# [2025-05-29 23:31:25,004] INFO: Scraping metrics from 1 serving engine(s) (engine_stats.py:136:vllm_router.stats.engine_stats)
  3. Smoke test the transcription endpoint:
curl -v http://localhost:8000/v1/audio/transcriptions \
  -F 'file=@/path/to/audio.wav;type=audio/wav' \
  -F 'model=openai/whisper-small' \
  -F 'language=en'

[!NOTE] Record an input audio file before testing.

Implementation Details

Enabling the transcription model type

Add a new enum member in utils.py so the parser recognizes transcription as a valid model type:

class ModelType(enum.Enum):
    # ... existing model types ...
    transcription = "/v1/audio/transcriptions"

Then include it in get_all_fields() so that validate_static_model_types() accepts it, as sketched below.
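
A minimal sketch of what the enum plus helper could look like; member names other than transcription are elided here, and the actual production-stack code may differ:

import enum


class ModelType(enum.Enum):
    # ... existing model types elided ...
    transcription = "/v1/audio/transcriptions"

    @staticmethod
    def get_all_fields() -> list[str]:
        # every member name, used by the parser to validate --static-model-types
        return [model_type.name for model_type in ModelType]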

Parser validation

Update validate_static_model_types in parsers/parser.py to allow the new transcription string:

def validate_static_model_types(model_types: str | None) -> None:
    all_models = utils.ModelType.get_all_fields()
    # all_models now includes 'transcription'

This ensures --static-model-types transcription passes validation; a rough sketch of the check is shown below.
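
A rough, hypothetical sketch of that check (the real parser may differ in its error handling and messages):

from vllm_router import utils  # assumed import; parser.py references utils.ModelType


def validate_static_model_types(model_types: str | None) -> None:
    if model_types is None:
        return
    all_models = utils.ModelType.get_all_fields()  # now includes 'transcription'
    for model_type in model_types.split(","):
        if model_type.strip() not in all_models:
            raise ValueError(
                f"Invalid model type '{model_type}'; must be one of {all_models}"
            )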

Main router proxy logic

In routers/main_router.py’s audio_transcriptions handler, use the same service-discovery + routing logic you already have for chat/completions, but:

  1. Filter by model_label == "transcription" (or skip the label filter in dev) and check that the requested model is in ep.model_names.
  2. Increase the httpx timeout, since Whisper transcription can take a while:
async with httpx.AsyncClient(
    base_url=chosen_url,
    timeout=httpx.Timeout(connect=5, read=120, write=30, pool=None),
) as client:
    ...  # forward the multipart request to the backend here
  3. Proxy the multipart form unchanged to /v1/audio/transcriptions on the backend (a minimal sketch of this step follows the list).
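
Here is a minimal, hypothetical sketch of that proxy step; it is not the actual production-stack handler, and the backend URL is hard-coded for this single-pod test:

from fastapi import FastAPI, Request
from fastapi.responses import Response
import httpx

app = FastAPI()


@app.post("/v1/audio/transcriptions")
async def audio_transcriptions(request: Request):
    chosen_url = "http://localhost:8002"  # assumed static backend for this test
    # Preserve the multipart boundary; httpx recomputes content-length itself.
    headers = {"content-type": request.headers.get("content-type", "")}
    body = await request.body()
    async with httpx.AsyncClient(
        base_url=chosen_url,
        timeout=httpx.Timeout(connect=5, read=120, write=30, pool=None),
    ) as client:
        upstream = await client.post(
            "/v1/audio/transcriptions", content=body, headers=headers
        )
    return Response(
        content=upstream.content,
        status_code=upstream.status_code,
        media_type=upstream.headers.get("content-type"),
    )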

Smoke-Testing

  1. Start the Whisper backend on port 8002:
vllm serve --task transcription openai/whisper-small \
  --host 0.0.0.0 --port 8002
  2. Run the modified router on port 8000:
./run-router.sh 8000 http://localhost:8002
  3. Hit the endpoint with curl, noting that you must specify a valid response_format (e.g., json):
curl -v http://localhost:8000/v1/audio/transcriptions \
  -F 'file=@/path/to/audio.wav;type=audio/wav' \
  -F 'model=openai/whisper-small' \
  -F 'response_format=json' \
  -F 'language=en'
  4. Finally, you should get back something like:
{"text":"…your transcription…"}

which confirms end-to-end routing.
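
If you prefer scripting the check, an equivalent request in Python might look like this sketch (the audio path is a placeholder):

import requests

# Send a local WAV file through the router and print the transcription.
with open("audio.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",
        files={"file": ("audio.wav", f, "audio/wav")},
        data={
            "model": "openai/whisper-small",
            "response_format": "json",
            "language": "en",
        },
        timeout=120,
    )
resp.raise_for_status()
print(resp.json()["text"])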

Troubleshooting & Tips

  - A 503 “No transcription backend” means the router didn’t detect any backend with model_label == "transcription". Double-check --static-model-labels transcription, or skip the label filter in dev.

  - Timeouts: if you see ReadTimeout, bump up the read value in httpx.Timeout(...).

  - Response format errors: Whisper only accepts one of json, text, srt, verbose_json, or vtt for response_format.
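
For example, a more generous read budget could look like this (values are illustrative):

import httpx

# Longer read timeout for slow transcriptions; connect/write stay modest.
timeout = httpx.Timeout(connect=5, read=300, write=30, pool=None)
client = httpx.AsyncClient(base_url="http://localhost:8002", timeout=timeout)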