Silicon Vampires 2
Posted on October 19, 2025 • 8 min read • 1,518 words

How Do We Run LLMs
In our previous post, we described how we use the Ollama and vLLM inference software stacks to run large language models (LLMs) within e-INFRA CZ. We explained what’s required to operate LLMs on-premises — including model weights, their formats, and the necessary hardware (GPUs). Most importantly, we shared the lessons we learned from serving these models on our own infrastructure.
We explored what the KV cache (key-value buffer) is and how it affects memory requirements, how to effectively use multiple GPUs, and why CPUs aren’t ideal for inference. We concluded the post by noting that we successfully ran two models (DeepSeek R1 and Qwen3-Coder) simultaneously on the same GPU set — with the hope that everything would perform smoothly.
Our assumption that the DeepSeek R1 and Qwen3-Coder models would not conflict when running in parallel turned out to be incorrect. We observed significant slowdowns, particularly when some users ran continuous benchmarking workloads. As an initial fix, we split the models across the GPUs (five for DeepSeek R1, three for Qwen3-Coder) and ran each in pipeline-parallel mode using vLLM. This setup eventually stabilized, but a new issue appeared.
Pipeline-parallel mode is generally known to be slower than tensor-parallel mode, but in our case, the slowdown was far more severe than expected. Throughput dropped from approximately 60 tokens per second per model to only ~20 tokens per second per model.
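For reference, here is a minimal sketch of what such a split, pipeline-parallel launch can look like; the model identifiers, ports, and GPU assignments are illustrative rather than our exact configuration:

```python
import os
import subprocess

# Illustrative only: split eight GPUs between two vLLM servers, each running
# in pipeline-parallel mode. Model names and ports are placeholders.
launches = [
    ("0,1,2,3,4", "deepseek-ai/DeepSeek-R1-0528", 5, 8000),         # DeepSeek R1 on five GPUs
    ("5,6,7", "Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8", 3, 8001),  # Qwen3-Coder on three GPUs
]

for gpus, model, pp_size, port in launches:
    subprocess.Popen(
        ["vllm", "serve", model,
         "--pipeline-parallel-size", str(pp_size),
         "--port", str(port)],
        env={**os.environ, "CUDA_VISIBLE_DEVICES": gpus},  # pin each server to its GPU subset
    )
```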
Additionally, Qwen3-Coder running in FP8 precision consistently failed in tensor-parallel mode — a known issue acknowledged by the Qwen team on their Hugging Face model page.
While troubleshooting the Qwen3-Coder tensor-parallel failures, we deployed SGLang and immediately observed substantial performance improvements compared to vLLM:
The performance above reflects single-request throughput, meaning the inference API was handling only one request at a time. However, LLM inference systems are typically capable of serving multiple requests concurrently, and in such multi-request scenarios, we observed a significant increase in total throughput:
Despite these gains, Qwen3-Coder continued to crash consistently in tensor-parallel mode, even when using SGLang. We also noticed that pipeline-parallel mode suffered from very low GPU utilization (peaking around 26%).
This prompted an experiment: we ran DeepSeek R1 in tensor-parallel mode across all 8 GPUs, while simultaneously running Qwen3-Coder in pipeline-parallel mode on the same GPU set.
Our hypothesis was that Qwen3-Coder’s low GPU utilization would not significantly impact DeepSeek R1’s performance, and this proved correct: we achieved stable concurrent operation of both models. DeepSeek R1’s performance was fully satisfactory, and Qwen3-Coder ran without crashes, although its speed still left considerable room for improvement.
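To make the arrangement concrete, the sketch below launches DeepSeek R1 with SGLang in tensor-parallel mode across all eight GPUs and a second, pipeline-parallel server for Qwen3-Coder on the same GPU set. The model names and ports are placeholders, and using vLLM for the pipeline-parallel server is an assumption made here purely for illustration:

```python
import os
import subprocess

env = {**os.environ, "CUDA_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7"}  # both servers see all GPUs

# DeepSeek R1 with SGLang, tensor-parallel across all eight GPUs.
subprocess.Popen(
    ["python", "-m", "sglang.launch_server",
     "--model-path", "deepseek-ai/DeepSeek-R1-0528",
     "--tp", "8", "--port", "30000"],
    env=env,
)

# Qwen3-Coder in pipeline-parallel mode on the same GPU set
# (sketched here with vLLM; the engine choice is an assumption).
subprocess.Popen(
    ["vllm", "serve", "Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8",
     "--pipeline-parallel-size", "8", "--port", "30001"],
    env=env,
)
```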
Despite this stability, we were still dissatisfied with Qwen3-Coder’s performance. Support from both the vLLM and SGLang developers was limited (though SGLang at least flagged the issue for someone to investigate). Eventually, we tested a special build optimized for the Blackwell architecture with CUDA 12.9 (instead of the standard CUDA 12.8 build).
Surprisingly, this version was stable for Qwen3-Coder even in tensor-parallel FP8 mode, allowing us to achieve full speed: ~60 tokens/s for a single request.
Running two tensor-parallel models on the same GPU set still caused significant slowdowns. The solution was CUDA MPS (Multi-Process Service), which allows multiple processes to access the GPUs simultaneously without major performance drops. The caveat: if one process crashes, all processes crash.
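On a bare node, enabling MPS amounts to starting its control daemon before any CUDA process; a minimal sketch, using the conventional pipe and log directories (adjust paths and permissions to your system):

```python
import os
import subprocess

# Minimal sketch: start the CUDA MPS control daemon on a node.
# The pipe and log directories below are the conventional defaults.
env = {
    **os.environ,
    "CUDA_MPS_PIPE_DIRECTORY": "/tmp/nvidia-mps",
    "CUDA_MPS_LOG_DIRECTORY": "/tmp/nvidia-mps-log",
}
os.makedirs(env["CUDA_MPS_PIPE_DIRECTORY"], exist_ok=True)
os.makedirs(env["CUDA_MPS_LOG_DIRECTORY"], exist_ok=True)

# "-d" daemonizes the control process; CUDA processes that inherit the same
# CUDA_MPS_PIPE_DIRECTORY afterwards share the GPUs through MPS.
subprocess.run(["nvidia-cuda-mps-control", "-d"], env=env, check=True)
```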
Our entire setup runs in Kubernetes, with the NVIDIA GPU Operator handling low-level NVIDIA driver management. Although the operator claims to support CUDA MPS, in practice it does not, so manual configuration was required. CUDA MPS depends on shared IPC memory, which conflicts with Kubernetes’ default model of isolating containers in separate kernel namespaces. We identified two potential workarounds:
The HostIPC approach deploys the models and the MPS server separately but enables hostIPC: true so they share the host IPC namespace. This requires privileged containers, but GPU sharing requires elevated permissions anyway. We opted for the HostIPC solution, which proved stable.
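For illustration, a model-server pod along these lines can be described with the Kubernetes Python client; the image, GPU count, and pipe path are placeholders, and the exact spec depends on how the MPS daemon is run on the node:

```python
from kubernetes import client

# Illustrative sketch of a model-server pod that shares the host IPC namespace
# (hostIPC: true) and mounts the MPS pipe directory from the host.
# Image, GPU count, and paths are placeholders.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-server"),
    spec=client.V1PodSpec(
        host_ipc=True,  # rendered as hostIPC: true in the pod manifest
        volumes=[client.V1Volume(
            name="mps-pipe",
            host_path=client.V1HostPathVolumeSource(path="/tmp/nvidia-mps"),
        )],
        containers=[client.V1Container(
            name="inference",
            image="registry.example.org/llm-server:latest",
            security_context=client.V1SecurityContext(privileged=True),
            resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "8"}),
            env=[client.V1EnvVar(name="CUDA_MPS_PIPE_DIRECTORY", value="/tmp/nvidia-mps")],
            volume_mounts=[client.V1VolumeMount(name="mps-pipe", mount_path="/tmp/nvidia-mps")],
        )],
    ),
)
# The pod can then be submitted with client.CoreV1Api().create_namespaced_pod("default", pod).
```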
Using CUDA MPS together with SGLang, we have achieved a stable, high-performance setup for both DeepSeek R1 and Qwen3-Coder; SGLang also serves GPT-OSS-120B, which is one of our main models.
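From a user’s perspective, both vLLM and SGLang expose an OpenAI-compatible API, so the models can be queried with the standard openai package; the base URL below is a placeholder for our endpoint:

```python
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

# Illustrative client usage against an OpenAI-compatible endpoint exposed by
# vLLM or SGLang. The base URL is a placeholder for our deployment.
client = OpenAI(base_url="http://llm.example.internal/v1", api_key="not-needed")

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1-0528",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return response.choices[0].message.content

# A few parallel requests; batching like this is what lifts total throughput
# well above the single-request numbers quoted earlier.
with ThreadPoolExecutor(max_workers=4) as pool:
    answers = list(pool.map(ask, ["What is a KV cache?"] * 4))
```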
We conducted several rounds of testing:
For context, we provide a comparison with the American JetStream:
We conducted a series of tests focused on Czech language proficiency and published the results on the benczechmark Hugging Face page:
| Model | Mean Score | Evaluation Seconds |
|---|---|---|
| DeepSeek-R1-0528 | 85.3 | 1165 |
| Command-A | 82.8 | 1309 |
| LLaMA-3.3 | 73.5 | 2536 |
| LLaMA-4-Scout | 71.5 | 637 |
| Aya Expanse | 63.5 | 2794 |
| Gemma 3 | 59.2 | 1642 |
| Phi-4 | 58.2 | 1210 |
Here, the Mean Score represents the average performance across various Czech language tasks, while Evaluation Seconds indicates the duration of the first test run.
A particularly noteworthy result comes from the Command-A model, which has only 111B parameters compared to the DeepSeek-R1 model’s 685B parameters, yet scores only a few points lower. This suggests that Command-A likely offers the best performance-to-cost ratio.
The highest-scoring model we evaluated for benczechmark is DeepSeek V3-0324, achieving an 86.8 Mean Score. However, this is largely because the tests do not heavily measure reasoning capabilities, allowing instruction-tuned models to perform better. Since our GPU resources are insufficient to run both DeepSeek V3-0324 and DeepSeek-R1-0528 concurrently, we have selected DeepSeek-R1-0528 as the sole model.
We also ran another set of tests using the Aider coding benchmark, which evaluates models across 225 programming tasks in various languages.
| Model | Score | Mean Task Seconds |
|---|---|---|
| GPT-5 (high) | 88 * | |
| DeepSeek-R1-0528 | 69.3 | 310.3 |
| Qwen3-Coder | 57.8 | 29.3 |
| GPT-OSS-120B | 54.2 | 24.2 |
| Qwen-2.5-Coder | 16.4 * | 318.5 |
| Command-A | 12 * | |
| Gemma 3 | 4.9 * | |
| LLaMA 4 Scout | 4.9 | 20.2 |
(* An asterisk indicates results taken directly from the Aider leaderboard; all others are from our own measurements.)
The GPT‑5 model is included only as a reference point, representing the best performance observed to date. Among the models we actually run, GPT‑OSS‑120B offers the most favorable performance‑to‑cost ratio: it can be executed on a single NVIDIA H100 GPU, whereas Qwen3‑Coder requires at least three B200 GPUs and DeepSeek‑R1‑0528 needs five. The H100 is roughly half the capacity of a B200, making GPT‑OSS‑120B the most economical choice.
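To make that comparison explicit, here is the back-of-the-envelope arithmetic, counting one H100 as roughly half a B200 as stated above:

```python
# Aider score per B200-equivalent GPU, using the scores above and the GPU
# counts from the text (one H100 counted as ~0.5 B200).
models = {
    "GPT-OSS-120B": {"score": 54.2, "b200_equiv": 0.5},
    "Qwen3-Coder": {"score": 57.8, "b200_equiv": 3.0},
    "DeepSeek-R1-0528": {"score": 69.3, "b200_equiv": 5.0},
}

for name, m in models.items():
    print(f"{name}: {m['score'] / m['b200_equiv']:.1f} points per B200-equivalent")
# GPT-OSS-120B comes out far ahead (~108 vs. ~19 and ~14), which is why we
# consider it the most economical of the models we run.
```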
We are interested in exploring the following directions:
At this point, small models seem to offer little value if large models can be efficiently deployed.
- CohereLabs/command-a-translate-08-2025 – promising for Czech-language use.
- zai-org/GLM-4.5V – an example of an open-source multimodal model.
- CohereLabs/command-a-vision-07-2025 – another multimodal example.
- OpenGVLab/InternVL3_5-38B-HF – yet another multimodal example, with a potentially good performance-to-cost ratio since it is only a 38B model.

Current LLMs are already very capable, but they still struggle with factual accuracy. In the Czech Republic we also observe a deficiency in handling the Czech language: while OpenAI/GPT performs well overall, it still makes mistakes when generating Czech text.
Although we cannot directly improve the model’s Czech proficiency, there are many techniques for injecting factual knowledge and reducing hallucinations. We believe AI agents are the appropriate solution. However, agentic AI depends on the ability to search the Internet, and this introduces complications:
This is precisely where an AI agent can add value—it can decide whether a given request needs external information and select the most appropriate source.
This is the subject of our further research.