{"id":2266,"date":"2026-05-16T08:51:39","date_gmt":"2026-05-16T08:51:39","guid":{"rendered":"https:\/\/ai-box.eu\/?p=2266"},"modified":"2026-05-16T11:25:34","modified_gmt":"2026-05-16T11:25:34","slug":"tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada","status":"publish","type":"post","link":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/","title":{"rendered":"TensorRT-LLM in Numbers: FP16 vs. FP8 on the RTX A6000 Ada"},"content":{"rendered":"<p>In the first three posts of this little series I explained <strong>why<\/strong> I&#8217;m tackling TensorRT-LLM on the A6000 Ada (<a href=\"#\">Part 1<\/a>), <strong>how<\/strong> I built the setup with Docker and helper scripts (<a href=\"#\">Part 2<\/a>), and which <strong>build pipeline<\/strong> for FP16 and FP8 sits behind all of this (<a href=\"#\">Part 3<\/a>). Now on to the most exciting part: <strong>the real numbers<\/strong> and what can be learned from them for the transfer to Edge-LLM.<\/p>\n<p>What I&#8217;m showing here are not synthetic benchmarks from a marketing slide, but measurements from my own setup: Qwen2.5-7B-Instruct, RTX 6000 Ada Generation (SM89), Ubuntu 24.04, TensorRT-LLM 1.2.1 in the NGC container. 
The generation tokens\/sec are measured with identical prompts and identical sampling parameters, so that FP16 and FP8 are directly comparable.<\/p>\n<div id=\"attachment_2172\" style=\"width: 310px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_RTX_A6000_ADA-scaled.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-2172\" class=\"NVIDIA RTX A6000 Ada wp-image-2172 size-medium\" title=\"NVIDIA RTX A6000 Ada\" src=\"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_RTX_A6000_ADA-300x226.jpg\" alt=\"NVIDIA RTX A6000 Ada\" width=\"300\" height=\"226\" srcset=\"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_RTX_A6000_ADA-300x226.jpg 300w, https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_RTX_A6000_ADA-1024x771.jpg 1024w, https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_RTX_A6000_ADA-768x578.jpg 768w, https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_RTX_A6000_ADA-1536x1157.jpg 1536w, https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_RTX_A6000_ADA-2048x1542.jpg 2048w, https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_RTX_A6000_ADA-1080x813.jpg 1080w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><p id=\"caption-attachment-2172\" class=\"wp-caption-text\">NVIDIA RTX A6000 Ada<\/p><\/div>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_83 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<span class=\"ez-toc-title-toggle\"><a href=\"#\" class=\"ez-toc-pull-right ez-toc-btn ez-toc-btn-xs ez-toc-btn-default ez-toc-toggle\" aria-label=\"Toggle Table of Content\"><span class=\"ez-toc-js-icon-con\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" 
xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/span><\/a><\/span><\/div>\n<nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/#The_comparison_table\" >The comparison table<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/#The_real_insight_which_numbers_are_reliable\" >The real insight: which numbers are reliable?<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/#Engine_load_factor_184%C3%97_thanks_to_a_smaller_file\" >Engine load: factor 1.84\u00d7 thanks to a smaller file<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/#Tokenssec_where_the_real_FP8_gain_is_to_be_found\" >Tokens\/sec: where the real FP8 gain is 
to be found<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/#The_lesson_that_isnt_in_the_docs_KV-cache_quantization\" >The lesson that isn&#8217;t in the docs: KV-cache quantization<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/#What_of_this_transfers_to_Edge-LLM\" >What of this transfers to Edge-LLM?<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/#Directly_transferable\" >Directly transferable<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/#What_I_would_have_to_learn_anew\" >What I would have to learn anew<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/#What_has_to_be_thought_about_completely_differently\" >What has to be thought about completely differently<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-10\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/#Conclusion_was_it_worth_it\" >Conclusion: was it worth it?<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-11\" 
href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/#Concretely_on_disk\" >Concretely on disk<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-12\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/#Mentally\" >Mentally<\/a><\/li><\/ul><\/li><\/ul><\/nav><\/div>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"The_comparison_table\"><\/span>The comparison table<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<figure class=\"wp-block-table\">\n<table>\n<thead>\n<tr>\n<th>Metric<\/th>\n<th>FP16 TensorRT<\/th>\n<th>FP8 TensorRT<\/th>\n<th>Delta<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td colspan=\"4\"><strong>Build pipeline<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Quantize\/Convert<\/td>\n<td>2:00 (120s)<\/td>\n<td>7:34 (454s)<\/td>\n<td>+278 %<\/td>\n<\/tr>\n<tr>\n<td>Engine build<\/td>\n<td>2:46 (166s)<\/td>\n<td>3:56 (236s)<\/td>\n<td>+42 %<\/td>\n<\/tr>\n<tr>\n<td>Total<\/td>\n<td>4:46 (286s)<\/td>\n<td>11:30 (690s)<\/td>\n<td>+141 %<\/td>\n<\/tr>\n<tr>\n<td colspan=\"4\"><strong>Artifacts<\/strong><\/td>\n<\/tr>\n<tr>\n<td>TRT-LLM checkpoint<\/td>\n<td>15 GB<\/td>\n<td>8.2 GB<\/td>\n<td><strong>\u221245 %<\/strong><\/td>\n<\/tr>\n<tr>\n<td>TensorRT engine<\/td>\n<td>15 GB<\/td>\n<td>8.2 GB<\/td>\n<td><strong>\u221245 %<\/strong><\/td>\n<\/tr>\n<tr>\n<td colspan=\"4\"><strong>Runtime<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Engine load<\/td>\n<td>33.24 s<\/td>\n<td>18.04 s<\/td>\n<td><strong>\u221246 % (1.84\u00d7)<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Generation time (3 prompts)<\/td>\n<td>2.50 s<\/td>\n<td>1.53 s<\/td>\n<td>\u221239 %<\/td>\n<\/tr>\n<tr>\n<td>Tokens\/sec (batched)<\/td>\n<td>153.72<\/td>\n<td><strong>251.15<\/strong><\/td>\n<td><strong>+63 % (1.63\u00d7)<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Output 
quality<\/td>\n<td>\u2713<\/td>\n<td>\u2713<\/td>\n<td>equal<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p>The most important number is in the second-to-last row: <strong>+63 % throughput at a 45 % smaller engine.<\/strong> That is exactly in the range NVIDIA advertises for FP8 on Ada, and one that, before this experiment, I could only have argued for theoretically.<\/p>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"The_real_insight_which_numbers_are_reliable\"><\/span>The real insight: which numbers are reliable?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>When repeating the experiment across multiple sessions \u2014 and I really did run my own build scripts many times until everything worked the way I wanted \u2014 I noticed the following: the <strong>build times fluctuate considerably<\/strong>, while the <strong>inference performance remains remarkably reproducible<\/strong>.<\/p>\n<p>Example from two of my sessions:<\/p>\n<figure class=\"wp-block-table\">\n<table>\n<thead>\n<tr>\n<th>Metric<\/th>\n<th>Session A<\/th>\n<th>Session B<\/th>\n<th>Difference<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>FP16 total build<\/td>\n<td>9:10<\/td>\n<td>4:46<\/td>\n<td>nearly halved<\/td>\n<\/tr>\n<tr>\n<td>FP8 total build<\/td>\n<td>5:54<\/td>\n<td>11:30<\/td>\n<td>nearly doubled<\/td>\n<\/tr>\n<tr>\n<td>FP16 tokens\/sec<\/td>\n<td>154.00<\/td>\n<td>153.72<\/td>\n<td>\u00b10.2 %<\/td>\n<\/tr>\n<tr>\n<td>FP8 tokens\/sec<\/td>\n<td>250.04<\/td>\n<td>251.15<\/td>\n<td>\u00b10.4 %<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p>The explanation lies in the nature of the respective operations. <strong>Build time is I\/O-dominated<\/strong>, and I have a somewhat aging machine and SSD here. Writing a 15 GB engine file therefore depends on the disk-cache state, on other running processes, and possibly on the SSD&#8217;s power management. 
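To put the spread into numbers, here is a small self-contained check of the two sessions above (the helper function is my own, not from any benchmark tooling):

```python
def rel_spread(a: float, b: float) -> float:
    """Relative spread of two measurements: |a - b| / mean(a, b)."""
    return abs(a - b) / ((a + b) / 2)

# Session A vs. Session B, values from the table above.
build_spread = rel_spread(550, 286)        # FP16 total build: 9:10 vs 4:46, in seconds
tps_spread   = rel_spread(154.00, 153.72)  # FP16 tokens/sec

print(f"build time spread: {build_spread:.0%}")
print(f"tokens/sec spread: {tps_spread:.2%}")
```

Roughly 60 % spread on the build time versus well under one percent on throughput, which is exactly the pattern described here.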
<strong>Inference is compute-dominated<\/strong> \u2014 the GPU computes deterministically for the same engine and the same prompts, largely independent of the surrounding system state once the model is loaded into the GPU&#8217;s VRAM.<\/p>\n<p>From this follows a practical note: <strong>inference benchmarks are reliable, build-time comparisons need more caution<\/strong> \u2014 but when a model is running well, you don&#8217;t constantly rebuild it. In my opinion, fluctuating build times are therefore not a real problem.<\/p>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Engine_load_factor_184%C3%97_thanks_to_a_smaller_file\"><\/span>Engine load: factor 1.84\u00d7 thanks to a smaller file<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>For the engine load time, the smaller FP8 file pays off directly. FP16 engine load on my setup: 33.24 seconds. FP8 engine load: 18.04 seconds. <strong>Factor 1.84\u00d7.<\/strong> This is of course important when we&#8217;re not thinking of an RTX A6000 GPU in a workstation but of an edge device that has to get, for example, a voice assistant up and running as fast as possible after power-on.<\/p>\n<p>If I now look at the load times in detail: in both cases, a constant share goes to KV-cache allocation and MPI-worker initialization, which is independent of the engine size. The rest is the pure reading of the <code>.engine<\/code> file into memory \u2014 let&#8217;s just call it VRAM here, since the values are from my GPU \u2014 and this is the part that shrinks with engine size. With FP8 the file is only about half the size, so exactly this part is halved.<\/p>\n<p>For me personally, engine load time is rarely critical, because I start the engine once at container start. 
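As a cross-check on the 1.84x figure: the load-time ratio lands almost exactly on the file-size ratio, which suggests that on my particular setup the constant share is small compared to the file read. The arithmetic, with the numbers from above:

```python
# Engine load times (seconds) and engine file sizes (GB) from the measurements above.
load_fp16, load_fp8 = 33.24, 18.04
size_fp16, size_fp8 = 15.0, 8.2

load_ratio = load_fp16 / load_fp8  # ~1.84
size_ratio = size_fp16 / size_fp8  # ~1.83

# With load = constant_share + size / read_bandwidth, a large constant share
# would pull load_ratio clearly below size_ratio; here the two nearly coincide.
print(f"load ratio: {load_ratio:.2f}, size ratio: {size_ratio:.2f}")
```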
But for applications where the service cold start matters \u2014 for example a cloud worker that spins up on demand, or a setup where models have to be swapped regularly \u2014 that&#8217;s a real advantage.<\/p>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Tokenssec_where_the_real_FP8_gain_is_to_be_found\"><\/span>Tokens\/sec: where the real FP8 gain is to be found<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Whether 251 tokens\/sec is a lot or not depends on the comparison \u2014 and the comparison is tricky. On the same A6000 Ada, I also measured Ollama at two quantization levels, in order to cleanly separate the effects of compilation and quantization:<\/p>\n<figure class=\"wp-block-table\">\n<table>\n<thead>\n<tr>\n<th>Engine<\/th>\n<th>Quantization<\/th>\n<th>Tokens\/sec<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Ollama FP16<\/td>\n<td>16 bit<\/td>\n<td>60<\/td>\n<\/tr>\n<tr>\n<td>TRT-LLM FP16<\/td>\n<td>16 bit<\/td>\n<td>154<\/td>\n<\/tr>\n<tr>\n<td>Ollama Q4_K_M<\/td>\n<td>4 bit<\/td>\n<td>168<\/td>\n<\/tr>\n<tr>\n<td><strong>TRT-LLM FP8<\/strong><\/td>\n<td><strong>8 bit<\/strong><\/td>\n<td><strong>251<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p>From this, three effects can be cleanly separated:<\/p>\n<p><strong>At the same bit width, TRT-LLM clearly wins.<\/strong> FP16 against FP16 is a factor of roughly 2.6\u00d7 (154 vs. 60 tok\/s) \u2014 the pure advantage of a hardware-specific, compiled pipeline over a portable llama.cpp implementation with generic dispatch paths. That is the actual &#8220;compilation win&#8221; that often hides behind the blanket statement &#8220;TRT-LLM is faster&#8221;.<\/p>\n<p><strong>Aggressive quantization recovers memory bandwidth, but the relative gain is smaller on a faster baseline.<\/strong> Ollama gains a factor of 2.8\u00d7 with Q4_K_M over its own FP16; TRT-LLM, on the other hand, only gains a factor of 1.6\u00d7 with FP8 over its FP16. 
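For transparency, the factors quoted in this section are straight arithmetic on the table; a quick recomputation (tokens/sec rounded from my runs):

```python
tps = {
    "ollama_fp16": 60.0,   # llama.cpp backend, 16 bit
    "trtllm_fp16": 154.0,  # compiled, same bit width
    "ollama_q4":   168.0,  # Q4_K_M default quantization
    "trtllm_fp8":  251.0,  # compiled + hardware FP8
}

compile_win  = tps["trtllm_fp16"] / tps["ollama_fp16"]  # ~2.6x: compilation alone
ollama_quant = tps["ollama_q4"]   / tps["ollama_fp16"]  # ~2.8x: 4 bit over FP16
trtllm_quant = tps["trtllm_fp8"]  / tps["trtllm_fp16"]  # ~1.6x: FP8 over FP16
fp8_vs_q4    = tps["trtllm_fp8"]  / tps["ollama_q4"]    # ~1.5x: realistic user comparison
combined     = tps["trtllm_fp8"]  / tps["ollama_fp16"]  # ~4.2x: compilation + FP8 cores

print(compile_win, ollama_quant, trtllm_quant, fp8_vs_q4, combined)
```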
That isn&#8217;t because FP8 quantizes worse \u2014 it&#8217;s that the TRT-LLM FP16 baseline is simply already much higher.<\/p>\n<p><strong>The real surprise lies in the realistic user comparison.<\/strong> If you naively pit &#8220;TRT-LLM with FP16&#8221; against &#8220;Ollama with default quantization&#8221;, you compare 154 tok\/s against 168 tok\/s \u2014 and <strong>Ollama wins<\/strong>. The 4-bit quantization recovers more performance in the decode loop than compilation brings. Only with TRT-LLM FP8 does the compiled backend pull ahead again \u2014 factor 1.5\u00d7 over Ollama Q4. That is significantly less than the often-claimed &#8220;3\u00d7 faster than Ollama&#8221;, but still a clear lead.<\/p>\n<p>Where TRT-LLM plays out its full strength is the combination: FP8 quantization <strong>plus<\/strong> hardware FP8 tensor cores. Against an unquantized Ollama FP16, that&#8217;s a factor of 4.2\u00d7. But without the hardware FP8 feature of the Ada generation (or Hopper, Blackwell, Jetson Thor), TRT-LLM would <em>no longer<\/em> win in many use cases against a well-quantized llama.cpp model. That makes the choice of hardware at least as decisive as the choice of inference backend.<\/p>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"The_lesson_that_isnt_in_the_docs_KV-cache_quantization\"><\/span>The lesson that isn&#8217;t in the docs: KV-cache quantization<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>I already told the story in <a href=\"#\">Part 3<\/a>, but I want to record it here once more as a central lesson: my first FP8 build had additionally set <code>--kv_cache_dtype fp8<\/code>. The performance numbers were <strong>even better<\/strong> (236 instead of 251 tok\/s), but the model produced <strong>complete token salad<\/strong>:<\/p>\n<pre class=\"wp-block-code\"><code>strugg (str, 1, 1, 1, 1, 1, 1, 1) 1) 1) 1) 1) 1) 1) 1) 1) 1) ...<\/code><\/pre>\n<p>On all three prompts the same phenomenon. 
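A crude automated quality gate would have flagged this collapse without anyone reading transcripts. A minimal sketch (function name and threshold are my own choices, not part of any TRT-LLM tooling):

```python
def looks_degenerate(text: str, min_unique_ratio: float = 0.5) -> bool:
    """Flag generations dominated by a few repeated tokens (token salad)."""
    tokens = text.split()
    if len(tokens) < 10:  # too short to judge
        return False
    unique_ratio = len(set(tokens)) / len(tokens)
    return unique_ratio < min_unique_ratio

salad = "strugg (str, 1, 1, 1, 1, 1, 1, 1) 1) 1) 1) 1) 1) 1) 1) 1) 1)"
print(looks_degenerate(salad))  # -> True
print(looks_degenerate("The engine compiles the model into fused kernels "
                       "that exploit the FP8 tensor cores of the Ada GPU."))  # -> False
```

In CI/CD I would run a handful of fixed prompts through every new engine and fail the build when such a check trips, in addition to a human skim of the outputs.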
The engine ran perfectly \u2014 KV-cache allocation, tensor-core utilization, throughput, all metrics looked great. Only one thing wasn&#8217;t measured: <strong>output quality.<\/strong><\/p>\n<p>The diagnosis: for 7B models, FP8 quantization of the KV cache is too aggressive. The quality headroom of these smaller models isn&#8217;t enough to compress both the weights and the activations stored in the KV cache down to FP8. For 70B+ models \u2014 which typically have more redundancy and therefore more robustness \u2014 an FP8 KV cache can work.<\/p>\n<p>The fix was a single configuration change: instead of setting <code>--kv_cache_dtype fp8<\/code>, I simply left the option out entirely. The KV cache then stays at its native FP16. The output becomes sensible again, and the other performance advantages of the FP8 weights are preserved.<\/p>\n<p><strong>The general lesson I take away from this session:<\/strong> quantization performance without quality verification is worthless. A benchmark script would never have caught the bug \u2014 the engine did generate tokens, and even faster than the variant without FP8 KV. Only reading the actual outputs revealed the collapse.<\/p>\n<p>That applies to every quantization, not just FP8: INT8, INT4-AWQ, NF4 \u2014 they all have a quality\/performance trade-off. And as the Ollama comparison above has shown, cross-engine comparisons are also not as simple as &#8220;the compiled backend automatically wins&#8221;. Anyone who runs inference in production absolutely needs a quality sample in CI\/CD and an honest view of the baseline against which the comparison is made \u2014 not just latency measurements from a single session.<\/p>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"What_of_this_transfers_to_Edge-LLM\"><\/span>What of this transfers to Edge-LLM?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Back to the actual goal of this series. 
What have I learned that I can later take with me straight to Jetson Thor (or another edge target)?<\/p>\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Directly_transferable\"><\/span>Directly transferable<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<ul class=\"wp-block-list\">\n<li>The <strong>pipeline concepts<\/strong>: HF \u2192 ONNX\/TRT-LLM checkpoint \u2192 TensorRT engine. With Edge-LLM the middle stage is called ONNX instead of TRT-LLM checkpoint, but the separation of convert\/quantize and build is identical.<\/li>\n<li>The <strong>build-once-deploy-many pattern<\/strong>: the engine is built on a workstation, then deployed as a static artifact. Edge-LLM is explicitly designed only for this pattern \u2014 the high-level Python API with on-the-fly engine build, as TRT-LLM has it, doesn&#8217;t exist in Edge-LLM at all.<\/li>\n<li>The <strong>understanding of the build bottlenecks<\/strong>: kernel auto-tuning vs. disk I\/O. On edge targets with limited disk performance, serialization becomes relatively even more important.<\/li>\n<li>The <strong>FP8 quantization with ModelOpt<\/strong>. Works the same way on SM89 (Ada) and on the corresponding Jetson variants.<\/li>\n<li>The <strong>KV-cache lesson<\/strong> is universal. Just as relevant on edge.<\/li>\n<li>The <strong>methodology note<\/strong>: inference benchmarks are reliable, build-time comparisons need multiple runs. That will be just as important when comparing edge configurations.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"What_I_would_have_to_learn_anew\"><\/span>What I would have to learn anew<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>C++-only runtime.<\/strong> Edge-LLM has no Python layer. Anyone who wants to execute an engine there writes a C++ application against the LLM runtime API. 
For my tests so far I&#8217;ve always used Python \u2014 on edge that isn&#8217;t available.<\/li>\n<li><strong>Static memory layout<\/strong> instead of paged KV cache. Edge-LLM forgoes dynamic allocation because the workloads are predictable. Different tuning knobs.<\/li>\n<li><strong>Cross-compile tooling.<\/strong> Engines for the Jetson Thor (SM101, aarch64) are typically built on an x86 host, and the finished <code>.engine<\/code> file is then deployed to the edge device. With TRT-LLM on my Ada I build natively \u2014 same architecture, container, build and run on the same machine.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"What_has_to_be_thought_about_completely_differently\"><\/span>What has to be thought about completely differently<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>Hardware FP8 availability on the edge target.<\/strong> The Jetson Thor brings hardware FP8 with it (same architecture class as Ada\/Hopper). On older edge platforms like the Jetson Orin this isn&#8217;t available. That naturally means that there, the advantage over a well-quantized llama.cpp model melts away, as the Ollama comparison above hinted at. Anyone planning Edge-LLM performance therefore first has to clarify which architecture generation the target hardware is.<\/li>\n<li><strong>Power\/thermal budget.<\/strong> On the A6000 Ada I can burn 300 watts. On a Jetson Thor that&#8217;s maybe 60 watts under full load, often 30 watts sustained. The choice of quantization is then no longer just performance vs. quality, but also performance per watt.<\/li>\n<li><strong>Model size.<\/strong> What fits easily on the A6000 Ada (Qwen-7B in FP8: 8 GB) does in principle also fit on a Jetson Thor with 128 GB of shared memory \u2014 but there the model engine competes with vision pipelines, the OS, other tasks. 
Realistic model sizes are more like 1B\u201314B.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Conclusion_was_it_worth_it\"><\/span>Conclusion: was it worth it?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Several days of work, four blog posts. What have I learned in the end for my LLM projects?<\/p>\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Concretely_on_disk\"><\/span>Concretely on disk<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<ul class=\"wp-block-list\">\n<li>An FP16 and an FP8 engine for Qwen-7B, both deployable as <code>.engine<\/code> files<\/li>\n<li>Seven build and run scripts that make the complete workflow reproducible<\/li>\n<li>Measurements with all relevant metrics<\/li>\n<li>A pitfall collection that I don&#8217;t have to learn twice<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Mentally\"><\/span>Mentally<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<ul class=\"wp-block-list\">\n<li>A really concrete understanding of what happens between a HuggingFace repo and an efficient edge deployment<\/li>\n<li>An honest assessment of FP8 as a quantization technique \u2014 it isn&#8217;t only &#8220;1.63\u00d7 faster&#8221;, but also &#8220;watch out for the KV cache&#8221; and &#8220;view build-time comparisons critically&#8221; should that ever become a topic<\/li>\n<li>A roadmap for what I&#8217;ll still have to learn later for Edge-LLM (C++ runtime, cross-compile, static memory model)<\/li>\n<\/ul>\n<p>For me, this was the right investment. Edge AI on sovereign hardware is no longer &#8220;coming soon&#8221; \u2014 the tools are there, the architectural concepts are clear, and with every step the distance between my workstation and a Jetson Thor gets smaller. 
Anyone who works in this field, or who wants to learn about it, can repeat this exercise with any Ada-generation or newer GPU.<\/p>\n<p>Now I&#8217;ll go and have a look in my workshop to see what edge devices I still have around. I should still be able to dig up an old Jetson Nano.<\/p>\n<br>\r\n\t<br>\r\n<h2>Article overview - TensorRT-LLM on the RTX A6000 Ada:<\/h2>\r\n<a title=\"Preparing an Ubuntu 24.04 Server for AI Inference: CUDA, Docker, NVIDIA Container Toolkit\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/preparing-an-ubuntu-24-04-server-for-ai-inference-cuda-docker-nvidia-container-toolkit\/2268\/\">Preparing an Ubuntu 24.04 Server for AI Inference: CUDA, Docker, NVIDIA Container Toolkit<\/a><br>\r\n<a title=\"TensorRT-LLM on the RTX A6000 Ada: Preparing for the Edge-LLM Ecosystem\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-on-the-rtx-a6000-ada-preparing-for-the-edge-llm-ecosystem\/2255\/\">TensorRT-LLM on the RTX A6000 Ada: Preparing for the Edge-LLM Ecosystem<\/a><br>\r\n<a title=\"TensorRT-LLM on Ubuntu 24.04: Setup with Docker and Helper Scripts\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-on-ubuntu-24-04-setup-with-docker-and-helper-scripts\/2257\/\">TensorRT-LLM on Ubuntu 24.04: Setup with Docker and Helper Scripts<\/a><br>\r\n<a title=\"TensorRT-LLM Pipeline: Building Persistent Engines with FP16 and FP8\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/\">TensorRT-LLM Pipeline: Building Persistent Engines with FP16 and FP8<\/a><br>\r\n<a title=\"TensorRT-LLM in Numbers: FP16 vs. FP8 on the RTX A6000 Ada\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/\">TensorRT-LLM in Numbers: FP16 vs. 
FP8 on the RTX A6000 Ada<\/a><br>\r\n\t\r\n\t<br>\r\n\t<br>\n","protected":false},"excerpt":{"rendered":"<p>In the first three posts of this little series I explained why I&#8217;m tackling TensorRT-LLM on the A6000 Ada (Part 1), how I built the setup with Docker and helper scripts (Part 2), and which build pipeline for FP16 and FP8 sits behind all of this (Part 3). Now on to the most exciting part: [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":2173,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_et_pb_use_builder":"","_et_pb_old_content":"","_et_gb_content_width":"","footnotes":""},"categories":[162,50],"tags":[1039,1172,1163,1162,1167,1171,1169,1174,1166,1170,1164,1176,1173,1175,1165,1168],"class_list":["post-2266","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-large-language-models-en","category-top-story-en","tag-edge-ai","tag-edge-llm","tag-fp16-vs-fp8","tag-fp8-quantization","tag-hardware-fp8-ada","tag-kv-cache-quantization","tag-llm-inference-benchmark","tag-llm-inference-performance","tag-modelopt-fp8-ptq","tag-nvidia-ada-sm89","tag-qwen-7b-benchmark","tag-rtx-a6000-ada","tag-tensorrt-engine-build","tag-tensorrt-llm","tag-tensorrt-llm-benchmark","tag-tensorrt-llm-vs-ollama","et-has-post-format-content","et_post_format-et-post-format-standard"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.6 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>TensorRT-LLM in Numbers: FP16 vs. 
FP8 on the RTX A6000 Ada - Exploring the Future: Inside the AI Box<\/title>\n<meta name=\"description\" content=\"Real-world TensorRT-LLM FP8 vs FP16 benchmark on RTX A6000 Ada with Qwen-7B: 251 tok\/s, 45% smaller engine, plus Ollama Q4_K_M comparison.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"TensorRT-LLM in Numbers: FP16 vs. FP8 on the RTX A6000 Ada - Exploring the Future: Inside the AI Box\" \/>\n<meta property=\"og:description\" content=\"Real-world TensorRT-LLM FP8 vs FP16 benchmark on RTX A6000 Ada with Qwen-7B: 251 tok\/s, 45% smaller engine, plus Ollama Q4_K_M comparison.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/\" \/>\n<meta property=\"og:site_name\" content=\"Exploring the Future: Inside the AI Box\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-16T08:51:39+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-05-16T11:25:34+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_RTX_A6000_ADA-scaled.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"2560\" \/>\n\t<meta property=\"og:image:height\" content=\"1928\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Maker\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Ingmar_Stapel\" \/>\n<meta name=\"twitter:site\" content=\"@Ingmar_Stapel\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" 
content=\"Maker\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\\\/2266\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\\\/2266\\\/\"},\"author\":{\"name\":\"Maker\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/#\\\/schema\\\/person\\\/cc91d08618b3feeef6926591b465eab1\"},\"headline\":\"TensorRT-LLM in Numbers: FP16 vs. FP8 on the RTX A6000 Ada\",\"datePublished\":\"2026-05-16T08:51:39+00:00\",\"dateModified\":\"2026-05-16T11:25:34+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\\\/2266\\\/\"},\"wordCount\":1969,\"commentCount\":0,\"image\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\\\/2266\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/ai-box.eu\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/NVIDIA_RTX_A6000_ADA-scaled.jpg\",\"keywords\":[\"Edge AI\",\"Edge-LLM\",\"FP16 vs FP8\",\"FP8 quantization\",\"Hardware FP8 Ada\",\"KV cache quantization\",\"LLM inference benchmark\",\"LLM inference performance\",\"ModelOpt FP8 PTQ\",\"NVIDIA Ada SM89\",\"Qwen 7B benchmark\",\"RTX A6000 Ada\",\"TensorRT engine build\",\"TensorRT-LLM\",\"TensorRT-LLM benchmark\",\"TensorRT-LLM vs Ollama\"],\"articleSection\":[\"Large Language Models\",\"Top 
story\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\\\/2266\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\\\/2266\\\/\",\"url\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\\\/2266\\\/\",\"name\":\"TensorRT-LLM in Numbers: FP16 vs. FP8 on the RTX A6000 Ada - Exploring the Future: Inside the AI Box\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\\\/2266\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\\\/2266\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/ai-box.eu\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/NVIDIA_RTX_A6000_ADA-scaled.jpg\",\"datePublished\":\"2026-05-16T08:51:39+00:00\",\"dateModified\":\"2026-05-16T11:25:34+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/#\\\/schema\\\/person\\\/cc91d08618b3feeef6926591b465eab1\"},\"description\":\"Real-world TensorRT-LLM FP8 vs FP16 benchmark on RTX A6000 Ada with Qwen-7B: 251 tok\\\/s, 45% smaller engine, plus Ollama Q4_K_M 
comparison.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\\\/2266\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\\\/2266\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\\\/2266\\\/#primaryimage\",\"url\":\"https:\\\/\\\/ai-box.eu\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/NVIDIA_RTX_A6000_ADA-scaled.jpg\",\"contentUrl\":\"https:\\\/\\\/ai-box.eu\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/NVIDIA_RTX_A6000_ADA-scaled.jpg\",\"width\":2560,\"height\":1928},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\\\/2266\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Start\",\"item\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"TensorRT-LLM in Numbers: FP16 vs. 
FP8 on the RTX A6000 Ada\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/#website\",\"url\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/\",\"name\":\"Exploring the Future: Inside the AI Box\",\"description\":\"Inside the AI Box, we share our experiences and discoveries in the world of artificial intelligence.\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/#\\\/schema\\\/person\\\/cc91d08618b3feeef6926591b465eab1\",\"name\":\"Maker\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g\",\"caption\":\"Maker\"},\"description\":\"I live in Bavaria near Munich. I always have many topics in my head, and in my spare time I especially like trying out new things in the field of internet and new media. I write on the blog because I enjoy reporting on the things that inspire me. I am happy about every comment, every suggestion, and especially about questions.\",\"sameAs\":[\"https:\\\/\\\/ai-box.eu\"],\"url\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/author\\\/ingmars\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"TensorRT-LLM in Numbers: FP16 vs. 
FP8 on the RTX A6000 Ada - Exploring the Future: Inside the AI Box","description":"Real-world TensorRT-LLM FP8 vs FP16 benchmark on RTX A6000 Ada with Qwen-7B: 251 tok\/s, 45% smaller engine, plus Ollama Q4_K_M comparison.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/","og_locale":"en_US","og_type":"article","og_title":"TensorRT-LLM in Numbers: FP16 vs. FP8 on the RTX A6000 Ada - Exploring the Future: Inside the AI Box","og_description":"Real-world TensorRT-LLM FP8 vs FP16 benchmark on RTX A6000 Ada with Qwen-7B: 251 tok\/s, 45% smaller engine, plus Ollama Q4_K_M comparison.","og_url":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/","og_site_name":"Exploring the Future: Inside the AI Box","article_published_time":"2026-05-16T08:51:39+00:00","article_modified_time":"2026-05-16T11:25:34+00:00","og_image":[{"width":2560,"height":1928,"url":"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_RTX_A6000_ADA-scaled.jpg","type":"image\/jpeg"}],"author":"Maker","twitter_card":"summary_large_image","twitter_creator":"@Ingmar_Stapel","twitter_site":"@Ingmar_Stapel","twitter_misc":{"Written by":"Maker","Est. reading time":"10 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/#article","isPartOf":{"@id":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/"},"author":{"name":"Maker","@id":"https:\/\/ai-box.eu\/en\/#\/schema\/person\/cc91d08618b3feeef6926591b465eab1"},"headline":"TensorRT-LLM in Numbers: FP16 vs. 
FP8 on the RTX A6000 Ada","datePublished":"2026-05-16T08:51:39+00:00","dateModified":"2026-05-16T11:25:34+00:00","mainEntityOfPage":{"@id":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/"},"wordCount":1969,"commentCount":0,"image":{"@id":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/#primaryimage"},"thumbnailUrl":"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_RTX_A6000_ADA-scaled.jpg","keywords":["Edge AI","Edge-LLM","FP16 vs FP8","FP8 quantization","Hardware FP8 Ada","KV cache quantization","LLM inference benchmark","LLM inference performance","ModelOpt FP8 PTQ","NVIDIA Ada SM89","Qwen 7B benchmark","RTX A6000 Ada","TensorRT engine build","TensorRT-LLM","TensorRT-LLM benchmark","TensorRT-LLM vs Ollama"],"articleSection":["Large Language Models","Top story"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/","url":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/","name":"TensorRT-LLM in Numbers: FP16 vs. 
FP8 on the RTX A6000 Ada - Exploring the Future: Inside the AI Box","isPartOf":{"@id":"https:\/\/ai-box.eu\/en\/#website"},"primaryImageOfPage":{"@id":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/#primaryimage"},"image":{"@id":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/#primaryimage"},"thumbnailUrl":"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_RTX_A6000_ADA-scaled.jpg","datePublished":"2026-05-16T08:51:39+00:00","dateModified":"2026-05-16T11:25:34+00:00","author":{"@id":"https:\/\/ai-box.eu\/en\/#\/schema\/person\/cc91d08618b3feeef6926591b465eab1"},"description":"Real-world TensorRT-LLM FP8 vs FP16 benchmark on RTX A6000 Ada with Qwen-7B: 251 tok\/s, 45% smaller engine, plus Ollama Q4_K_M comparison.","breadcrumb":{"@id":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/#primaryimage","url":"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_RTX_A6000_ADA-scaled.jpg","contentUrl":"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_RTX_A6000_ADA-scaled.jpg","width":2560,"height":1928},{"@type":"BreadcrumbList","@id":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Start","item":"https:\/\/ai-box.eu\/en\/"},{"@type":"ListItem","position":2,"name":"TensorRT-LLM in Numbers: FP16 vs. 
FP8 on the RTX A6000 Ada"}]},{"@type":"WebSite","@id":"https:\/\/ai-box.eu\/en\/#website","url":"https:\/\/ai-box.eu\/en\/","name":"Exploring the Future: Inside the AI Box","description":"Inside the AI Box, we share our experiences and discoveries in the world of artificial intelligence.","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/ai-box.eu\/en\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/ai-box.eu\/en\/#\/schema\/person\/cc91d08618b3feeef6926591b465eab1","name":"Maker","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g","caption":"Maker"},"description":"I live in Bavaria near Munich. I always have many topics in my head, and in my spare time I especially like trying out new things in the field of internet and new media. I write on the blog because I enjoy reporting on the things that inspire me. I am happy about every comment, every suggestion, and especially about questions.","sameAs":["https:\/\/ai-box.eu"],"url":"https:\/\/ai-box.eu\/en\/author\/ingmars\/"}]}},"_links":{"self":[{"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/posts\/2266","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/comments?post=2266"}],"version-history":[{"count":1,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/posts\/2266\/revisions"}],"predecessor-version":[{"id":2267,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/posts\/2266\/revisions\/2267"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/media\/2173"}],"wp:attachment":[{"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/media?parent=2266"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/categories?post=2266"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/tags?post=2266"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}