{"id":2485,"date":"2026-06-08T19:23:02","date_gmt":"2026-06-08T19:23:02","guid":{"rendered":"https:\/\/ai-box.eu\/?p=2485"},"modified":"2026-06-09T03:17:52","modified_gmt":"2026-06-09T03:17:52","slug":"llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint","status":"publish","type":"post","link":"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/","title":{"rendered":"llama-benchy: llama-bench-style LLM benchmarks for any OpenAI-compatible endpoint"},"content":{"rendered":"<p>If, like me, you run your models locally and sovereignly, you know the problem: I want to know how fast a model really is on <strong>my<\/strong> hardware, not as a theoretical figure but the way it actually reaches me as an end user. For exactly that, the <strong>llama.cpp<\/strong> world has the popular tool <strong>llama-bench<\/strong>. The catch: it works exclusively with llama.cpp. The moment I bring <strong>vLLM<\/strong>, <strong>SGLang<\/strong> or any other inference server into play, I&#8217;m left without a comparable tool.<\/p>\n<p>In this post I introduce you to <strong>llama-benchy<\/strong>, a tool that closes exactly this gap: I get the same llama-bench-style measurements at different context depths, but for <strong>any OpenAI-compatible endpoint<\/strong>. 
No matter whether you run Ollama, vLLM or llama.cpp.<\/p>\n<p>You&#8217;ll find llama-benchy here on GitHub: <a href=\"https:\/\/github.com\/eugr\/llama-benchy\" target=\"_blank\" rel=\"noopener\">https:\/\/github.com\/eugr\/llama-benchy<\/a><\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_85 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<\/div>\n<nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#Why_the_existing_tools_fall_short\" >Why the existing tools fall short<\/a><\/li><li 
class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#What_llama-benchy_can_do\" >What llama-benchy can do<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#Installation_with_uv\" >Installation with uv<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#The_first_measurement\" >The first measurement<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#Running_llama-benchy_with_Ollama\" >Running llama-benchy with Ollama<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#My_result_on_two_NVIDIA_RTX_A6000\" >My result on two NVIDIA RTX A6000<\/a><ul class='ez-toc-list-level-4' ><li class='ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#My_verdict_what_are_these_numbers_good_for\" >My verdict: what are these numbers good for?<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-8\" 
href=\"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#And_now_under_load_four_parallel_clients\" >And now under load: four parallel clients<\/a><ul class='ez-toc-list-level-4' ><li class='ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#My_verdict_on_concurrency\" >My verdict on concurrency<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-10\" href=\"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#How_to_find_the_right_tokenizer_ID\" >How to find the right tokenizer ID<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-11\" href=\"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#Understanding_the_metrics\" >Understanding the metrics<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-12\" href=\"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#Measuring_prefix_caching_realistically\" >Measuring prefix caching realistically<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-13\" href=\"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#Throughput_under_load_concurrency\" >Throughput under load: concurrency<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-14\" href=\"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#Further_processing_of_the_results\" >Further 
processing of the results<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-15\" href=\"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#Conclusion\" >Conclusion<\/a><\/li><\/ul><\/nav><\/div>\n<h3><span class=\"ez-toc-section\" id=\"Why_the_existing_tools_fall_short\"><\/span>Why the existing tools fall short<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Before we get started, it&#8217;s worth a look at the underlying motivation, because it explains very well <em>what<\/em> llama-benchy does better.<\/p>\n<p><strong>llama-bench<\/strong> is great, but it has two limitations: it&#8217;s tied to llama.cpp, and it measures directly through the C++ engine. That measurement is therefore not necessarily representative of what you, as a user, actually experience through the API.<\/p>\n<p><strong>vLLM<\/strong> does ship its own powerful benchmarking tool that also runs against other engines, but there are pitfalls in the details (source: <a href=\"https:\/\/github.com\/eugr\/llama-benchy#motivation\" target=\"_blank\" rel=\"noopener\">https:\/\/github.com\/eugr\/llama-benchy#motivation<\/a>):<\/p>\n<ul>\n<li>Measuring prompt-processing speeds cleanly at different context lengths is tricky and sometimes impossible. <code>vllm bench sweep serve<\/code> repeats the same prompt across multiple runs, which hits the prefix cache in servers such as <strong>llama-server<\/strong>. The result: unrealistically low TTFT values and absurdly high prompt-processing speeds.<\/li>\n<li>Its TTFT is not the time to the first <em>usable token<\/em>, but the time to the very first data chunk from the server \u2013 and in <code>\/v1\/chat\/completions<\/code> mode that chunk often doesn&#8217;t contain any generated token yet.<\/li>\n<li>Only the random dataset allows a freely chosen token count. 
But a randomly generated token sequence can&#8217;t be meaningfully used for <strong>Speculative Decoding<\/strong> or <strong>MTP<\/strong> (Multi-Token Prediction).<\/li>\n<\/ul>\n<p>The author of llama-benchy writes that in early January 2026 he simply couldn&#8217;t find any tool delivering llama-bench-style measurements at different context depths for arbitrary OpenAI-compatible endpoints. So he built it himself.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"What_llama-benchy_can_do\"><\/span>What llama-benchy can do<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>The feature list is tailored exactly to the weaknesses above:<\/p>\n<ul>\n<li>Measures <strong>Prompt Processing (pp)<\/strong> and <strong>Token Generation (tg)<\/strong> at different context depths.<\/li>\n<li>Optionally separates pre-filling the context from the actual prompt processing over already-cached context (<strong>prefix-caching measurement<\/strong>).<\/li>\n<li>Reports <strong>TTFR<\/strong> (Time To First Response), <strong>est_ppt<\/strong> (estimated prompt-processing time) and <strong>e2e_ttft<\/strong> (End-to-End Time To First Token).<\/li>\n<li>Configurable prompt length (<code>--pp<\/code>), generation length (<code>--tg<\/code>) and context depth (<code>--depth<\/code>).<\/li>\n<li>Multiple runs (<code>--runs<\/code>) with mean \u00b1 standard deviation.<\/li>\n<li>Uses <strong>HuggingFace tokenizers<\/strong> for accurate token counts.<\/li>\n<li>Handles <strong>MTP chunks<\/strong> correctly.<\/li>\n<li>Downloads a book from Project Gutenberg as the source text, so that spec-decoding\/MTP models are measured realistically (default: Sherlock Holmes).<\/li>\n<li>Supports concurrent requests (<code>--concurrency<\/code>) to measure throughput under load.<\/li>\n<li>Saves results as <strong>Markdown<\/strong>, <strong>JSON<\/strong> or <strong>CSV<\/strong>.<\/li>\n<li>Auto-detects the HuggingFace model name via the <code>\/models<\/code> endpoint when <code>--model<\/code> 
isn&#8217;t set.<\/li>\n<\/ul>\n<p>One current limitation you should know: it measures exclusively against the <code>\/v1\/chat\/completions<\/code> endpoint.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Installation_with_uv\"><\/span>Installation with uv<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Installation via <strong>uv<\/strong> is recommended. The nice part: you don&#8217;t have to install anything permanently. With <code>uvx<\/code> you launch the release version straight from PyPI:<\/p>\n<p><strong>Command: <\/strong><code>uvx llama-benchy --base-url &lt;ENDPOINT_URL&gt; --model &lt;MODEL_NAME&gt;<\/code><\/p>\n<p>If you&#8217;d rather test the current version from the main branch, the command looks like this:<\/p>\n<p><strong>Command: <\/strong><code>uvx --from git+https:\/\/github.com\/eugr\/llama-benchy llama-benchy --base-url &lt;ENDPOINT_URL&gt; --model &lt;MODEL_NAME&gt;<\/code><\/p>\n<p>Alternatively, you can of course install the tool the classic way into a virtual environment (<code>uv venv<\/code> + <code>uv pip install -e .<\/code>) or system-wide (<code>uv pip install -U llama-benchy<\/code>).<\/p>\n<h3><span class=\"ez-toc-section\" id=\"The_first_measurement\"><\/span>The first measurement<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Once you&#8217;ve set up llama-benchy, here&#8217;s a typical call against a local endpoint:<\/p>\n<p><strong>Command: <\/strong><code>llama-benchy --base-url http:\/\/spark:8888\/v1 --model openai\/gpt-oss-120b --depth 0 4096 8192 16384 32768 --latency-mode generation<\/code><\/p>\n<p>Out comes a table in the familiar llama-bench look. For each context depth it gives one <code>pp<\/code> and one <code>tg<\/code> row, including the standard deviation. 
Exactly what I need for a serious comparison: I see at a glance how prefill and decode speed drop off as the context grows.<\/p>\n<p>My tip from the README, which I can only underline: use the <strong>&#8220;generation&#8221;<\/strong> latency mode. It gets the prompt-processing values closest to the real numbers. This is especially true for shorter prompts.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Running_llama-benchy_with_Ollama\"><\/span>Running llama-benchy with Ollama<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>I like to run my models locally with <strong>Ollama<\/strong>, and that can be benchmarked with no trouble at all. Ollama exposes an <strong>OpenAI-compatible endpoint<\/strong> at <code>\/v1<\/code>. So you simply pass the address of your Ollama server as the base URL. If Ollama runs locally, that&#8217;s <code>http:\/\/localhost:11434\/v1<\/code>; in my case the inference runs on a dedicated machine, so the base URL is e.g. <code>http:\/\/192.168.2.57:11434\/v1<\/code>.<\/p>\n<p><strong>Command:<\/strong> <code>uvx llama-benchy --base-url http:\/\/localhost:11434\/v1 --model qwen3.6:27b --tokenizer Qwen\/Qwen3.6-27B --depth 0 4096 8192 --latency-mode generation<\/code><\/p>\n<p><strong>The crucial pitfall:<\/strong> by default the <code>--tokenizer<\/code> parameter falls back to the value of <code>--model<\/code>. With Ollama, though, <code>--model<\/code> is the Ollama tag (e.g. <code>qwen3.6:27b<\/code>) and that is <strong>not a valid HuggingFace tokenizer name<\/strong>. llama-benchy would then try to load an HF repo with exactly that name and fail. That&#8217;s why, for Ollama models, you must always set <code>--tokenizer<\/code> <strong>explicitly<\/strong> to the matching HF repo so the token counting is correct.<\/p>\n<p>Good to know: the tokenizer is identical across all quantizations of a model. Whether you run BF16, FP8 or a GGUF quantization makes no difference to the tokenizer. 
All you have to do is always point to the vendor&#8217;s official base repo.<\/p>\n<p>For my four currently most-used models, the matching tokenizer IDs look like this:<\/p>\n<table>\n<thead>\n<tr>\n<th>Ollama tag<\/th>\n<th>HF tokenizer ID (for <code>--tokenizer<\/code>)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>nemotron3:33b-bf16<\/code><\/td>\n<td><code>nvidia\/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>qwen3.6:27b<\/code><\/td>\n<td><code>Qwen\/Qwen3.6-27B<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>qwen3.6:35b-a3b-bf16<\/code><\/td>\n<td><code>Qwen\/Qwen3.6-35B-A3B<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>nemotron3:33b<\/code><\/td>\n<td><code>nvidia\/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>As you can see, <code>nemotron3:33b<\/code> and <code>nemotron3:33b-bf16<\/code> point to the same HF repo \u2013 it&#8217;s the same model (NVIDIA Nemotron 3 Nano Omni), just once as Q4_K_M and once as BF16. That&#8217;s exactly the point mentioned above: quantization doesn&#8217;t change the tokenizer.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"My_result_on_two_NVIDIA_RTX_A6000\"><\/span>My result on two NVIDIA RTX A6000<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>I ran the benchmark on my inference server with two <strong>NVIDIA RTX A6000<\/strong>. 
The model <code>qwen3.6:35b-a3b-bf16<\/code> is unquantized, in <strong>BF16<\/strong>.<\/p>\n<p><strong>Command:<\/strong> <code>uvx llama-benchy --base-url http:\/\/192.168.2.57:11434\/v1 --model qwen3.6:35b-a3b-bf16 --tokenizer Qwen\/Qwen3.6-35B-A3B --depth 0 4096 8192 --latency-mode generation<\/code><\/p>\n<table>\n<thead>\n<tr>\n<th>test<\/th>\n<th>t\/s<\/th>\n<th>peak t\/s<\/th>\n<th>ttfr (ms)<\/th>\n<th>est_ppt (ms)<\/th>\n<th>e2e_ttft (ms)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>pp2048<\/td>\n<td>1530.59 \u00b1 21.06<\/td>\n<td>\u2014<\/td>\n<td>1618.61 \u00b1 18.72<\/td>\n<td>1338.74 \u00b1 18.72<\/td>\n<td>1618.61 \u00b1 18.72<\/td>\n<\/tr>\n<tr>\n<td>tg32<\/td>\n<td>57.18 \u00b1 0.56<\/td>\n<td>59.05 \u00b1 0.58<\/td>\n<td>\u2014<\/td>\n<td>\u2014<\/td>\n<td>\u2014<\/td>\n<\/tr>\n<tr>\n<td>pp2048 @ d4096<\/td>\n<td>1490.79 \u00b1 2.93<\/td>\n<td>\u2014<\/td>\n<td>4402.98 \u00b1 8.73<\/td>\n<td>4123.11 \u00b1 8.73<\/td>\n<td>4402.98 \u00b1 8.73<\/td>\n<\/tr>\n<tr>\n<td>tg32 @ d4096<\/td>\n<td>56.98 \u00b1 0.43<\/td>\n<td>58.85 \u00b1 0.44<\/td>\n<td>\u2014<\/td>\n<td>\u2014<\/td>\n<td>\u2014<\/td>\n<\/tr>\n<tr>\n<td>pp2048 @ d8192<\/td>\n<td>1469.24 \u00b1 2.98<\/td>\n<td>\u2014<\/td>\n<td>7249.49 \u00b1 14.70<\/td>\n<td>6969.62 \u00b1 14.70<\/td>\n<td>7249.49 \u00b1 14.70<\/td>\n<\/tr>\n<tr>\n<td>tg32 @ d8192<\/td>\n<td>56.96 \u00b1 1.88<\/td>\n<td>58.83 \u00b1 1.94<\/td>\n<td>\u2014<\/td>\n<td>\u2014<\/td>\n<td>\u2014<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><em>llama-benchy 0.3.7 \u00b7 2026-06-08 \u00b7 latency mode: generation<\/em><\/p>\n<h4><span class=\"ez-toc-section\" id=\"My_verdict_what_are_these_numbers_good_for\"><\/span>My verdict: what are these numbers good for?<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>For <strong>interactive single-user use<\/strong>, this is a thoroughly solid result. Around <strong>57 tokens\/s<\/strong> in decode is well above my reading speed. 
So the response feels fluid, almost like real-time typing. I especially like that the decode rate stays <strong>nearly constant<\/strong> across all context depths: whether 0 or 8k context, I always land at ~57 t\/s. That makes the behavior nicely predictable.<\/p>\n<p>The honest weak spot is the <strong>time to the first token with long context<\/strong>. Prefill runs at a stable ~1,500 t\/s, but with 8k of context plus the 2,048-token prompt that means about 7 seconds of waiting before anything comes back at all. For a chat with short prompts that&#8217;s irrelevant; for a <strong>RAG setup with large context windows<\/strong> it&#8217;s noticeable.<\/p>\n<p>I see two levers for more speed: first, the model runs here <strong>unquantized in BF16<\/strong>. Since the A6000 (Ampere) has no FP8 in hardware, a <strong>Q4\/Q5 or AWQ quantization<\/strong> would be the obvious choice. It should speed up decode considerably and free up VRAM. That kind of comparison is exactly what llama-benchy makes quick. Second, Ollama spreads the model across both cards as a layer split; an engine with true <strong>tensor parallelism<\/strong> (e.g. vLLM) could lift both prefill and decode here.<\/p>\n<p><strong>My verdict:<\/strong> for my sovereign homelab use with a single user and the model in full BF16 quality, this is absolutely usable. Anyone optimizing for throughput under load or fast responses with huge contexts should benchmark quantization and a different inference engine \u2013 and that&#8217;s exactly what llama-benchy is for.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"And_now_under_load_four_parallel_clients\"><\/span>And now under load: four parallel clients<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>It gets interesting when several requests hit the Ollama server at the same time. With <code>--concurrency 4<\/code>, llama-benchy fires off four clients in parallel. 
The table then gains two throughput columns: <strong>t\/s (total)<\/strong> for the aggregate throughput of all four combined, and <strong>t\/s (req)<\/strong> for the average per individual request.<\/p>\n<p><strong>Command:<\/strong> <code>uvx llama-benchy --base-url http:\/\/192.168.2.57:11434\/v1 --model qwen3.6:35b-a3b-bf16 --tokenizer Qwen\/Qwen3.6-35B-A3B --depth 0 4096 8192 --latency-mode generation --concurrency 4<\/code><\/p>\n<table>\n<thead>\n<tr>\n<th>test<\/th>\n<th>t\/s (total)<\/th>\n<th>t\/s (req)<\/th>\n<th>ttfr (ms)<\/th>\n<th>est_ppt (ms)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>pp2048 (c4)<\/td>\n<td>1116.50 \u00b1 1.40<\/td>\n<td>1561.19 \u00b1 1738.91<\/td>\n<td>4495.64 \u00b1 2116.70<\/td>\n<td>3289.39 \u00b1 2116.70<\/td>\n<\/tr>\n<tr>\n<td>tg32 (c4)<\/td>\n<td>20.06 \u00b1 0.06<\/td>\n<td>61.72 \u00b1 0.84<\/td>\n<td>\u2014<\/td>\n<td>\u2014<\/td>\n<\/tr>\n<tr>\n<td>pp2048 @ d4096 (c4)<\/td>\n<td>1333.33 \u00b1 2.71<\/td>\n<td>879.66 \u00b1 605.50<\/td>\n<td>11439.95 \u00b1 5213.15<\/td>\n<td>10233.70 \u00b1 5213.15<\/td>\n<\/tr>\n<tr>\n<td>tg32 @ d4096 (c4)<\/td>\n<td>8.55 \u00b1 0.02<\/td>\n<td>59.96 \u00b1 0.94<\/td>\n<td>\u2014<\/td>\n<td>\u2014<\/td>\n<\/tr>\n<tr>\n<td>pp2048 @ d8192 (c4)<\/td>\n<td>1379.59 \u00b1 1.94<\/td>\n<td>822.71 \u00b1 518.98<\/td>\n<td>18478.59 \u00b1 8353.74<\/td>\n<td>17272.34 \u00b1 8353.74<\/td>\n<\/tr>\n<tr>\n<td>tg32 @ d8192 (c4)<\/td>\n<td>5.41 \u00b1 0.01<\/td>\n<td>60.01 \u00b1 0.87<\/td>\n<td>\u2014<\/td>\n<td>\u2014<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><em>llama-benchy 0.3.7 \u00b7 2026-06-08 \u00b7 latency mode: generation \u00b7 peak t\/s (req) ~62\u201364<\/em><\/p>\n<h4><span class=\"ez-toc-section\" id=\"My_verdict_on_concurrency\"><\/span>My verdict on concurrency<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>At first glance, <code>t\/s (req)<\/code> looks excellent: every single request still streams at around <strong>60 tokens\/s<\/strong>, virtually unchanged from single-user 
operation. The interesting value, though, is <code>t\/s (total)<\/code>, and for token generation it <strong>drops under load instead of rising<\/strong>: from ~20 t\/s without context, through 8.5, down to 5.4 t\/s at 8k. So four parallel users get fewer tokens per second in total than a single user (where it was 57 t\/s).<\/p>\n<p>The reason lies in the time to first response (<code>ttfr<\/code>): it climbs to a mean of <strong>~18 seconds<\/strong> at 8k context, and the enormous spread (\u00b18.3 s) reveals that some requests are served immediately while others sit in the queue for a long time. That&#8217;s exactly how a server behaves when it processes parallel requests one after another instead of bundling them. Ollama, with its llama.cpp backend, is geared primarily towards sovereign single-user use on your own hardware, not towards multi-user serving.<\/p>\n<p>My takeaway: for me as a single user in the homelab, everything is great. But anyone who wants to serve <strong>multiple users at the same time<\/strong> needs an engine with <strong>continuous batching<\/strong> like vLLM or SGLang. There, the four requests would share the compute step and the total throughput should be well above the single-user value. And that&#8217;s exactly the point: llama-benchy made this difference visible to me in a single run. Instead of just guessing, I now have the numbers.<\/p>\n<h4><span class=\"ez-toc-section\" id=\"How_to_find_the_right_tokenizer_ID\"><\/span>How to find the right tokenizer ID<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>For your own models, the ID is quick to find:<\/p>\n<ol>\n<li>Go to <a href=\"https:\/\/huggingface.co\" target=\"_blank\" rel=\"noopener\">huggingface.co<\/a> and search for the model name (e.g. &#8220;Qwen3.6 27B&#8221; or &#8220;Nemotron 3 Nano Omni&#8221;).<\/li>\n<li>Pick the <strong>vendor&#8217;s official base repo<\/strong>. 
You can recognize it by the organization prefix: <code>nvidia\/\u2026<\/code>, <code>Qwen\/\u2026<\/code>, <code>openai\/\u2026<\/code> or <code>zai-org\/\u2026<\/code>. Stay away from third-party GGUF or quant forks (such as <code>unsloth\/\u2026<\/code> or <code>lmstudio-community\/\u2026<\/code>); you want the canonical tokenizer repo.<\/li>\n<li>In the <strong>&#8220;Files and versions&#8221;<\/strong> tab, check whether <code>tokenizer.json<\/code> or <code>tokenizer_config.json<\/code> is present. If so, the repo ID \u2013 i.e. the <code>Organization\/Model<\/code> part of the path \u2013 is exactly what you enter for <code>--tokenizer<\/code>.<\/li>\n<\/ol>\n<p>A practical trick: in the vLLM and SGLang examples on the HF model cards, the canonical ID usually appears as <code>--model-path<\/code> or <code>--model<\/code>. That&#8217;s exactly the string you need for <code>--tokenizer<\/code>. And on the model page at <a href=\"https:\/\/ollama.com\" target=\"_blank\" rel=\"noopener\">ollama.com<\/a> the upstream source is often linked directly.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Understanding_the_metrics\"><\/span>Understanding the metrics<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>So that you can actually interpret the table rather than just admire it, here are the most important columns. All times are in milliseconds.<\/p>\n<p><strong>t\/s (tokens per second)<\/strong> means something different depending on the row:<\/p>\n<ul>\n<li>For <strong>Prompt Processing<\/strong>: <code>total prompt tokens \/ est_ppt<\/code> &#8211; i.e. 
the prefill speed, with <code>est_ppt<\/code> taken in seconds. In my single-user run above, that is 2,048 tokens in 1.339 s, or about 1,530 t\/s.<\/li>\n<li>For <strong>Token Generation<\/strong>: <code>(generated tokens - 1) \/ (time of last token - time of first token)<\/code> &#8211; the pure decode speed, excluding the latency of the first token.<\/li>\n<\/ul>\n<p><strong>peak t\/s<\/strong> exists only for token generation: the highest throughput measured in any 1-second window during the run.<\/p>\n<p><strong>ttfr (Time To First Response)<\/strong> is the raw time until the client receives <em>any<\/em> stream data from the server. That can also be empty chunks or role definitions. This number includes the network latency. It is the same quantity that <code>vllm bench serve<\/code> reports as TTFT.<\/p>\n<p><strong>est_ppt (Estimated Prompt Processing Time)<\/strong> is calculated as <code>ttfr - estimated latency<\/code> and thereby estimates the pure server-side processing time of the prompt. In my single-user run above, the estimated latency works out to about 280 ms at every depth.<\/p>\n<p><strong>e2e_ttft (End-to-End Time To First Token)<\/strong> is <code>time of the first content token - start time<\/code>. That is the total time I, as a user, perceive from sending the request to the first visible generated token.<\/p>\n<p>The latency mechanism behind it is interesting. Via <code>--latency-mode<\/code> the tool estimates the latency and subtracts it from <code>ttfr<\/code> to arrive at <code>est_ppt<\/code>:<\/p>\n<ul>\n<li><strong>api<\/strong> (default): time to fetch <code>\/models<\/code> &#8211; eliminates only the network latency.<\/li>\n<li><strong>generation<\/strong>: time to generate exactly 1 token &#8211; tries to factor out network <em>and<\/em> server overhead.<\/li>\n<li><strong>none<\/strong>: latency is assumed to be 0.<\/li>\n<\/ul>\n<h3><span class=\"ez-toc-section\" id=\"Measuring_prefix_caching_realistically\"><\/span>Measuring prefix caching realistically<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>This gets interesting for everyone who works with long, recurring context. 
That can be the case with system prompts or RAG, for example. With <code>--enable-prefix-caching<\/code> (and a <code>--depth<\/code> greater than 0), llama-benchy runs a two-stage process per run:<\/p>\n<ol>\n<li><strong>Context Load:<\/strong> the context is sent as a system message with an empty user message. The server has to process and cache it. Reported as <code>ctx_pp @ d{depth}<\/code>.<\/li>\n<li><strong>Inference:<\/strong> the same context plus the actual prompt as a user message. Now the server should reuse the cached context. Reported as <code>pp{tokens} @ d{depth}<\/code>.<\/li>\n<\/ol>\n<p>This way I see how fast a follow-up prompt really runs against an already pre-filled context, and that&#8217;s exactly the number that matters in everyday use.<\/p>\n<p><strong>Command:<\/strong> <code>llama-benchy --base-url http:\/\/spark:8888\/v1 --model openai\/gpt-oss-120b --depth 0 4096 8192 16384 32768 --latency-mode generation --enable-prefix-caching<\/code><\/p>\n<p>A note from the README: normally you don&#8217;t have to disable prompt caching on the server, because the probability of cache hits is low. If you do get hits, <code>--no-cache<\/code> adds some noise to the requests to avoid them and additionally sends <code>cache-prompt=false<\/code> to the server.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Throughput_under_load_concurrency\"><\/span>Throughput under load: concurrency<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Anyone running a server for multiple parallel users wants to know the saturation point. With <code>--concurrency 1 2 4<\/code>, llama-benchy launches several parallel clients and adds two columns to the table:<\/p>\n<ul>\n<li><strong>t\/s (total):<\/strong> the aggregate throughput of all clients combined.<\/li>\n<li><strong>t\/s (req):<\/strong> the average throughput per client.<\/li>\n<\/ul>\n<p>This way you find exactly the point at which more clients no longer increase the total throughput. 
That&#8217;s extremely handy for correctly gauging your own hardware.<\/p>\n<p>By the way, <code>--depth<\/code>, <code>--pp<\/code>, <code>--tg<\/code> and <code>--concurrency<\/code> can be combined freely. The benchmarks then run over all combinations, nested in the order depth \u2192 pp \u2192 tg \u2192 concurrency.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Further_processing_of_the_results\"><\/span>Further processing of the results<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>For your own analysis or visualizations, you can export the data as JSON or CSV. <code>--format json<\/code> delivers the most detailed data, and with <code>--save-total-throughput-timeseries<\/code> you even get the total throughput written into the JSON in 1-second intervals. With that, the progress of a run can be plotted cleanly.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Conclusion\"><\/span>Conclusion<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>llama-benchy is exactly the tool I&#8217;d been missing in my homelab. I can finally compare my locally running models with the same familiar llama-bench metrics. On top of that, I&#8217;m independent of whichever inference engine works behind the API endpoint. The fact that the tool deliberately addresses the weaknesses of existing benchmarks (real TTFT to the first usable token, clean prefix caching, a realistic text source for MTP) makes it credible to me.<\/p>\n<p>For me it fits perfectly into the idea of <strong>sovereign AI<\/strong>: I measure on my own hardware, with my own models, using a lean open-source tool that launches via <code>uvx<\/code> without any installation overhead. 
If you run local LLM endpoints and finally want solid numbers instead of gut feeling, I can clearly recommend llama-benchy.<\/p>\n<p>You&#8217;ll find the project on GitHub: <a href=\"https:\/\/github.com\/eugr\/llama-benchy\" target=\"_blank\" rel=\"noopener\">github.com\/eugr\/llama-benchy<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>If, like me, you run your models locally and sovereignly, you know the problem: I want to know how fast a model really is on my hardware and not as a theoretical figure, but the way it actually reaches me as an end user. For exactly that, the llama.cpp world has the popular tool llama-bench. [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":2487,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_et_pb_use_builder":"","_et_pb_old_content":"","_et_gb_content_width":"","footnotes":""},"categories":[162,8,50],"tags":[1534,1530,1521,1524,1522,980,1533,1523,1351,1529,1526,1525,1032,1532,1527,1531,1528,877],"class_list":["post-2485","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-large-language-models-en","category-news","category-top-story-en","tag-gpu-benchmark","tag-huggingface-tokenizer","tag-llama-benchy","tag-llama-cpp","tag-llm-benchmark","tag-local-llm-inference","tag-mtp","tag-ollama-benchmark","tag-openai-compatible-endpoint","tag-prefix-caching","tag-prompt-processing","tag-sglang","tag-sovereign-ai","tag-speculative-decoding","tag-token-generation","tag-tokens-per-second","tag-ttft","tag-vllm","et-has-post-format-content","et_post_format-et-post-format-standard"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v28.1 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>llama-benchy: llama-bench-style LLM benchmarks for any OpenAI-compatible endpoint - Exploring the Future: Inside the AI Box<\/title>\n<meta name=\"description\" content=\"llama-benchy brings llama-bench 
benchmarks to any OpenAI-compatible endpoint \u2013 Ollama, vLLM or llama.cpp. Measure pp, tg &amp; TTFT on your own hardware.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"llama-benchy: llama-bench-style LLM benchmarks for any OpenAI-compatible endpoint - Exploring the Future: Inside the AI Box\" \/>\n<meta property=\"og:description\" content=\"llama-benchy brings llama-bench benchmarks to any OpenAI-compatible endpoint \u2013 Ollama, vLLM or llama.cpp. Measure pp, tg &amp; TTFT on your own hardware.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/\" \/>\n<meta property=\"og:site_name\" content=\"Exploring the Future: Inside the AI Box\" \/>\n<meta property=\"article:published_time\" content=\"2026-06-08T19:23:02+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-06-09T03:17:52+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/llama-benchy-1-scaled.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"2560\" \/>\n\t<meta property=\"og:image:height\" content=\"1928\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Maker\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Ingmar_Stapel\" \/>\n<meta name=\"twitter:site\" content=\"@Ingmar_Stapel\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Maker\" \/>\n\t<meta name=\"twitter:label2\" 
content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"13 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\\\/2485\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\\\/2485\\\/\"},\"author\":{\"name\":\"Maker\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/#\\\/schema\\\/person\\\/cc91d08618b3feeef6926591b465eab1\"},\"headline\":\"llama-benchy: llama-bench-style LLM benchmarks for any OpenAI-compatible endpoint\",\"datePublished\":\"2026-06-08T19:23:02+00:00\",\"dateModified\":\"2026-06-09T03:17:52+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\\\/2485\\\/\"},\"wordCount\":2385,\"commentCount\":0,\"image\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\\\/2485\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/ai-box.eu\\\/wp-content\\\/uploads\\\/2026\\\/06\\\/llama-benchy-1-scaled.jpg\",\"keywords\":[\"GPU Benchmark\",\"HuggingFace Tokenizer\",\"llama-benchy\",\"llama.cpp\",\"LLM Benchmark\",\"local LLM inference\",\"MTP\",\"Ollama Benchmark\",\"OpenAI-compatible endpoint\",\"Prefix Caching\",\"Prompt Processing\",\"SGLang\",\"sovereign AI\",\"Speculative Decoding\",\"Token Generation\",\"tokens per second\",\"TTFT\",\"vLLM\"],\"articleSection\":[\"Large Language Models\",\"News\",\"Top 
story\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\\\/2485\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\\\/2485\\\/\",\"url\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\\\/2485\\\/\",\"name\":\"llama-benchy: llama-bench-style LLM benchmarks for any OpenAI-compatible endpoint - Exploring the Future: Inside the AI Box\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\\\/2485\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\\\/2485\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/ai-box.eu\\\/wp-content\\\/uploads\\\/2026\\\/06\\\/llama-benchy-1-scaled.jpg\",\"datePublished\":\"2026-06-08T19:23:02+00:00\",\"dateModified\":\"2026-06-09T03:17:52+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/#\\\/schema\\\/person\\\/cc91d08618b3feeef6926591b465eab1\"},\"description\":\"llama-benchy brings llama-bench benchmarks to any OpenAI-compatible endpoint \u2013 Ollama, vLLM or llama.cpp. 
Measure pp, tg & TTFT on your own hardware.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\\\/2485\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\\\/2485\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\\\/2485\\\/#primaryimage\",\"url\":\"https:\\\/\\\/ai-box.eu\\\/wp-content\\\/uploads\\\/2026\\\/06\\\/llama-benchy-1-scaled.jpg\",\"contentUrl\":\"https:\\\/\\\/ai-box.eu\\\/wp-content\\\/uploads\\\/2026\\\/06\\\/llama-benchy-1-scaled.jpg\",\"width\":2560,\"height\":1928,\"caption\":\"llama-benchy\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\\\/2485\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Start\",\"item\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"llama-benchy: llama-bench-style LLM benchmarks for any OpenAI-compatible endpoint\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/#website\",\"url\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/\",\"name\":\"Exploring the Future: Inside the AI Box\",\"description\":\"Inside the AI Box, we share our experiences and discoveries in the world of artificial 
intelligence.\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/#\\\/schema\\\/person\\\/cc91d08618b3feeef6926591b465eab1\",\"name\":\"Maker\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g\",\"caption\":\"Maker\"},\"description\":\"I live in Bavaria near Munich. In my head I always have many topics and try out especially in the field of Internet new media much in my spare time. I write on the blog because it makes me fun to report about the things that inspire me. I am happy about every comment, about suggestion and very about questions.\",\"sameAs\":[\"https:\\\/\\\/ai-box.eu\"],\"url\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/author\\\/ingmars\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"llama-benchy: llama-bench-style LLM benchmarks for any OpenAI-compatible endpoint - Exploring the Future: Inside the AI Box","description":"llama-benchy brings llama-bench benchmarks to any OpenAI-compatible endpoint \u2013 Ollama, vLLM or llama.cpp. 
Measure pp, tg & TTFT on your own hardware.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/","og_locale":"en_US","og_type":"article","og_title":"llama-benchy: llama-bench-style LLM benchmarks for any OpenAI-compatible endpoint - Exploring the Future: Inside the AI Box","og_description":"llama-benchy brings llama-bench benchmarks to any OpenAI-compatible endpoint \u2013 Ollama, vLLM or llama.cpp. Measure pp, tg & TTFT on your own hardware.","og_url":"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/","og_site_name":"Exploring the Future: Inside the AI Box","article_published_time":"2026-06-08T19:23:02+00:00","article_modified_time":"2026-06-09T03:17:52+00:00","og_image":[{"width":2560,"height":1928,"url":"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/llama-benchy-1-scaled.jpg","type":"image\/jpeg"}],"author":"Maker","twitter_card":"summary_large_image","twitter_creator":"@Ingmar_Stapel","twitter_site":"@Ingmar_Stapel","twitter_misc":{"Written by":"Maker","Est. 
reading time":"13 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#article","isPartOf":{"@id":"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/"},"author":{"name":"Maker","@id":"https:\/\/ai-box.eu\/en\/#\/schema\/person\/cc91d08618b3feeef6926591b465eab1"},"headline":"llama-benchy: llama-bench-style LLM benchmarks for any OpenAI-compatible endpoint","datePublished":"2026-06-08T19:23:02+00:00","dateModified":"2026-06-09T03:17:52+00:00","mainEntityOfPage":{"@id":"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/"},"wordCount":2385,"commentCount":0,"image":{"@id":"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#primaryimage"},"thumbnailUrl":"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/llama-benchy-1-scaled.jpg","keywords":["GPU Benchmark","HuggingFace Tokenizer","llama-benchy","llama.cpp","LLM Benchmark","local LLM inference","MTP","Ollama Benchmark","OpenAI-compatible endpoint","Prefix Caching","Prompt Processing","SGLang","sovereign AI","Speculative Decoding","Token Generation","tokens per second","TTFT","vLLM"],"articleSection":["Large Language Models","News","Top story"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/","url":"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/","name":"llama-benchy: 
llama-bench-style LLM benchmarks for any OpenAI-compatible endpoint - Exploring the Future: Inside the AI Box","isPartOf":{"@id":"https:\/\/ai-box.eu\/en\/#website"},"primaryImageOfPage":{"@id":"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#primaryimage"},"image":{"@id":"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#primaryimage"},"thumbnailUrl":"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/llama-benchy-1-scaled.jpg","datePublished":"2026-06-08T19:23:02+00:00","dateModified":"2026-06-09T03:17:52+00:00","author":{"@id":"https:\/\/ai-box.eu\/en\/#\/schema\/person\/cc91d08618b3feeef6926591b465eab1"},"description":"llama-benchy brings llama-bench benchmarks to any OpenAI-compatible endpoint \u2013 Ollama, vLLM or llama.cpp. Measure pp, tg & TTFT on your own hardware.","breadcrumb":{"@id":"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#primaryimage","url":"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/llama-benchy-1-scaled.jpg","contentUrl":"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/llama-benchy-1-scaled.jpg","width":2560,"height":1928,"caption":"llama-benchy"},{"@type":"BreadcrumbList","@id":"https:\/\/ai-box.eu\/en\/news\/llama-benchy-llama-bench-style-llm-benchmarks-for-any-openai-compatible-endpoint\/2485\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Start","item":"https:\/\/ai-box.eu\/en\/"},{"@type":"ListItem","position":2,
"name":"llama-benchy: llama-bench-style LLM benchmarks for any OpenAI-compatible endpoint"}]},{"@type":"WebSite","@id":"https:\/\/ai-box.eu\/en\/#website","url":"https:\/\/ai-box.eu\/en\/","name":"Exploring the Future: Inside the AI Box","description":"Inside the AI Box, we share our experiences and discoveries in the world of artificial intelligence.","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/ai-box.eu\/en\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/ai-box.eu\/en\/#\/schema\/person\/cc91d08618b3feeef6926591b465eab1","name":"Maker","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g","caption":"Maker"},"description":"I live in Bavaria near Munich. In my head I always have many topics and try out especially in the field of Internet new media much in my spare time. I write on the blog because it makes me fun to report about the things that inspire me. 
I am happy about every comment, about suggestion and very about questions.","sameAs":["https:\/\/ai-box.eu"],"url":"https:\/\/ai-box.eu\/en\/author\/ingmars\/"}]}},"_links":{"self":[{"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/posts\/2485","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/comments?post=2485"}],"version-history":[{"count":1,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/posts\/2485\/revisions"}],"predecessor-version":[{"id":2486,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/posts\/2485\/revisions\/2486"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/media\/2487"}],"wp:attachment":[{"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/media?parent=2485"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/categories?post=2485"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/tags?post=2485"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}