Configuring vLLM for Production Deployment (More Complex)

For production use, I want to run the container in the background and ensure that it starts automatically after a reboot. Additionally, I mount a local model directory so that models do not have to be redownloaded every time the container starts. This saves the time otherwise required for the download and, in my opinion, leads to a more stable system: the model files on disk never change, so you always get exactly the same behavior, and above all, you avoid potential problems with the download process itself and the many gigabytes such a model comprises.

First, I create a directory on my server on a drive where the models should be stored. Many hundreds of GBs can accumulate here quickly. Therefore, think about how you best want to manage the models. Of course, you can later copy this folder along with the models to another drive. For now, I’m just creating the folder in my user’s home directory ~/.

Command: mkdir -p ~/models

Now I stop the test container (if it is still running) with Ctrl+C.
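
If Ctrl+C does not work, for example because the test container was started detached, you can also stop it by name or ID; the placeholder below must be replaced with whatever docker ps shows for your test container:

Command: docker ps

Command: docker stop <name-or-ID-of-test-container>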

In the next step, I set up the container again in the background with a name, a restart policy, and the mounted model directory.

Note: I needed several attempts before I could successfully download the model at all. The reason was most likely the Christmas season and the resulting high data volume at my internet service provider, presumably due to streaming and the like. In the docker run command that follows, each model is stored in your home folder under the models subfolder, in a subdirectory named after the model. When you download a different model, you have to adjust three places in the command: the container name, the repo_id, and the local_dir.

Model: Qwen2.5-Math-1.5B-Instruct

Command: docker run -it --rm --name vllm-Qwen2.5-Math-1.5B-Instruct -v ~/models:/data --dns 8.8.8.8 --dns 8.8.4.4 --ipc=host nvcr.io/nvidia/vllm:25.11-py3 /bin/bash -c "pip install hf_transfer && export HF_HUB_ENABLE_HF_TRANSFER=1 && python3 -c \"from huggingface_hub import snapshot_download; snapshot_download(repo_id='Qwen/Qwen2.5-Math-1.5B-Instruct', local_dir='/data/Qwen2.5-Math-1.5B-Instruct', max_workers=1, resume_download=True)\""
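
To verify that the download actually landed on the host and not inside the container, you can list the mounted folder afterwards; the path assumes the ~/models mount from above:

Command: ls -lh ~/models/Qwen2.5-Math-1.5B-Instruct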

Here is another model that I was able to download successfully.

Model: openai/gpt-oss-20b

Command: docker run -it --rm --name vllm-gpt-oss-20b -v ~/models:/data --dns 8.8.8.8 --dns 8.8.4.4 --ipc=host nvcr.io/nvidia/vllm:25.11-py3 /bin/bash -c "pip install hf_transfer && export HF_HUB_ENABLE_HF_TRANSFER=1 && python3 -c \"from huggingface_hub import snapshot_download; snapshot_download(repo_id='openai/gpt-oss-20b', local_dir='/data/gpt-oss-20b', max_workers=1, resume_download=True)\""
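
Since several models quickly add up to hundreds of GB, it can be helpful to check how much space each downloaded model occupies; this assumes that everything lives under ~/models as described above:

Command: du -sh ~/models/*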

After the model has been successfully downloaded, I now start the vLLM server with the locally stored model. Important: I use the local path /data/gpt-oss-20b instead of the Hugging Face name:

Command: docker run -d --gpus all --name vllm-server -p 8000:8000 --restart unless-stopped -v ~/models:/data --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/vllm:25.11-py3 vllm serve /data/gpt-oss-20b --gpu-memory-utilization 0.6

The parameter -d starts the container in the background, --name vllm-server gives it a name, --restart unless-stopped ensures that the container automatically restarts after a reboot, and -v ~/models:/data mounts the local model directory. With /data/gpt-oss-20b, I refer to the locally stored model.
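
The server needs a moment to load the model into GPU memory. Before pointing any clients at it, you can check locally whether the OpenAI-compatible API already responds; the model listing endpoint is part of vLLM's standard API:

Command: curl http://localhost:8000/v1/models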

Now you can access the vLLM server from any computer in the network. Open http://<IP-Address-AI-TOP-ATOM>:8000 in your browser or with cURL (replace <IP-Address-AI-TOP-ATOM> with the IP address of your AI TOP ATOM).
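
Here is a minimal test request against the OpenAI-compatible chat endpoint as a sketch. By default, vLLM registers the model under the path it was started with, so /data/gpt-oss-20b should work as the model name here (again, replace the IP address with that of your AI TOP ATOM):

Command: curl http://<IP-Address-AI-TOP-ATOM>:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "/data/gpt-oss-20b", "messages": [{"role": "user", "content": "Hello, who are you?"}], "max_tokens": 100}'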

This download command is optimized to fetch huge model files from Hugging Face stably and with good performance, bypassing typical sources of error such as DNS problems or slow downloads. When I wrote this guide during the Christmas holidays 2025, my internet provider apparently had massive bandwidth problems, and I had to restart the download very frequently because even the provider's DNS resolution stopped working at times.

Docker Parameters Used

In the docker run command I used, I relied on a variety of flags that I would like to explain below. Whether all of them are truly necessary for you, I cannot say; when I wrote this report, I had problems with my internet connection and needed a stable command with which I could download the approx. 65 GB of the gpt-oss-120b model. To speed up testing, I used the smaller gpt-oss-20b.

1. Runtime Parameters

The following parameters all refer to how the created Docker container behaves at runtime.

  • docker run: The base command for creating and starting a new container.

  • -it: Combines -i (interactive) and -t (tty). This allows you to see the download progress bars live in the terminal and interact with Ctrl+C if necessary.

  • --rm: “Auto-Remove”. Ensures that the container is removed automatically as soon as the command inside it finishes, i.e., once the download is complete. This keeps your system clean: we only want to keep the data in the mounted folder, i.e., the downloaded model, not the container itself.

  • --name vllm-gpt-oss-20b: Assigns a fixed name to the container. Without this parameter, Docker would assign a random name. With a name, you can easily monitor the status in another terminal via docker stats vllm-gpt-oss-20b.

2. System & Network Configuration

Here it gets a bit technical, and I needed a few hours with many attempts until I could finally download a model successfully. These parameters therefore solve specific problems with large amounts of data.

  • -v ~/models:/data: The volume mapping. It links the folder ~/models on your machine (here, the AI TOP ATOM) with the path /data inside the container. This way, the gigabytes of model data land directly on your drive and not in the container's writable layer, which would be discarded together with the container. The big advantage is that you never have to redownload the models and can quickly switch between them, because unlike Ollama, vLLM requires restarting the container with a specific model; a sketch of such a switch follows after this list.

  • --dns 8.8.8.8 --dns 8.8.4.4: Forces the container to use Google DNS servers. This is a “life hack” against the error Temporary failure in name resolution, which often occurs when the default DNS is overloaded by thousands of requests (like during Hugging Face token refreshes).

  • --ipc=host: Allows the container access to the host system’s shared memory. vLLM needs this for efficient data processing between different processes.
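
As mentioned in the volume-mapping point above, switching models with vLLM means restarting the server container. A minimal sketch, assuming the Qwen model from earlier has already been downloaded to ~/models:

Command: docker stop vllm-server && docker rm vllm-server

Command: docker run -d --gpus all --name vllm-server -p 8000:8000 --restart unless-stopped -v ~/models:/data --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/vllm:25.11-py3 vllm serve /data/Qwen2.5-Math-1.5B-Instruct --gpu-memory-utilization 0.6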

3. The Container Image

Here we specify which image is used; it is exactly the image that the NVIDIA playbooks require for the DGX Spark and thus also for my Gigabyte AI TOP ATOM.

  • nvcr.io/nvidia/vllm:25.11-py3: The official vLLM image from the NVIDIA Container Registry (NVCR). The tag 25.11 follows NVIDIA's year.month scheme, i.e., a November 2025 release that is already optimized for current GPUs. I assume it is not necessarily the very latest build, but it is stable enough that the playbook authors chose it.
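
If you want to separate the sizable image download from the model download, you can pull the image in advance and check that it is present; this is optional, as docker run pulls the image automatically if it is missing:

Command: docker pull nvcr.io/nvidia/vllm:25.11-py3

Command: docker images nvcr.io/nvidia/vllm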

4. Execution Logic (Bash Scripting)

Everything after the image name is executed inside the container. I chose this approach to gain better control over the download process. Alongside the --dns flags, these were the most important parameters for working around my download problems at the time.

  • /bin/bash -c "...": Starts a bash shell in the container to execute a chain of commands.

  • pip install hf_transfer: Installs a specialized Rust-based package that massively speeds up downloads from Hugging Face by transferring data in parallel streams. The number of files downloaded at the same time is controlled separately, which I limited to exactly one with the max_workers=1 parameter below.

  • export HF_HUB_ENABLE_HF_TRANSFER=1: Enables this high-speed mode for the Hugging Face library.
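
As a sketch of an alternative to the inline Python call, newer versions of huggingface_hub also ship a command-line downloader that uses the same hf_transfer acceleration; the exact flags may differ depending on the installed version, so treat this only as a variant of the approach above:

Command: docker run -it --rm -v ~/models:/data --ipc=host nvcr.io/nvidia/vllm:25.11-py3 /bin/bash -c "pip install 'huggingface_hub[hf_transfer]' && export HF_HUB_ENABLE_HF_TRANSFER=1 && huggingface-cli download openai/gpt-oss-20b --local-dir /data/gpt-oss-20b"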

5. The Python Download Script

Now come the parameters that control how the download itself is handled.

  • snapshot_download(...): In my understanding, the safest method to download an entire model repository.

  • repo_id='openai/gpt-oss-20b': The unique identifier of the model on Hugging Face.

  • local_dir='/data/gpt-oss-20b': Storage destination inside the container (which lands on your machine thanks to volume mapping). This name becomes important later when we want to start a container specifically with this or another already downloaded model.

  • max_workers=1: Limits the number of parallel file downloads. With unstable connections or DNS problems, 1 is often the most stable choice. In my case there were still occasional error messages during the download, but it ultimately completed successfully.

  • resume_download=True: Originally a crucial parameter: it allows the script to continue exactly where it left off after an interruption, instead of redownloading gigabytes of data. However, newer versions of huggingface_hub have deprecated it because downloads are now resumed automatically, so it can be omitted.
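
For completeness, here is a sketch of the same download call without the deprecated parameter; in recent huggingface_hub versions, interrupted downloads are resumed automatically anyway:

Command: docker run -it --rm --name vllm-gpt-oss-20b -v ~/models:/data --dns 8.8.8.8 --dns 8.8.4.4 --ipc=host nvcr.io/nvidia/vllm:25.11-py3 /bin/bash -c "pip install hf_transfer && export HF_HUB_ENABLE_HF_TRANSFER=1 && python3 -c \"from huggingface_hub import snapshot_download; snapshot_download(repo_id='openai/gpt-oss-20b', local_dir='/data/gpt-oss-20b', max_workers=1)\""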

That’s it for the parameters I used.

Firewall Settings (Optional)

If a firewall is active, you must open port 8000:

Command: sudo ufw allow 8000
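
Whether the rule is active can be checked with ufw status. If you prefer to allow access only from your own LAN, you can restrict the rule to your subnet; the address range below is just an example and must match your network:

Command: sudo ufw status

Command: sudo ufw allow from 192.168.178.0/24 to any port 8000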

You can view the container logs at any time:

Command: docker logs -f vllm-server
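
To confirm that the restart policy is in place and the server will come back after a reboot, you can also check the container status and its configured policy; the format string uses standard docker inspect syntax:

Command: docker ps --filter name=vllm-server

Command: docker inspect -f '{{ .HostConfig.RestartPolicy.Name }}' vllm-server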

Click here for Part 3 of the installation and configuration guide: “Install vLLM on Gigabyte AI TOP ATOM: High-Performance LLM Inference with OpenAI-Compatible API – Part 3-3”.