{"id":2262,"date":"2026-05-16T07:26:08","date_gmt":"2026-05-16T07:26:08","guid":{"rendered":"https:\/\/ai-box.eu\/?p=2262"},"modified":"2026-05-16T11:30:17","modified_gmt":"2026-05-16T11:30:17","slug":"tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8","status":"publish","type":"post","link":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/","title":{"rendered":"TensorRT-LLM Pipeline: Building Persistent Engines with FP16 and FP8"},"content":{"rendered":"<p>In the <a href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-on-ubuntu-24-04-setup-with-docker-and-helper-scripts\/2257\/\" target=\"_blank\" rel=\"noopener\">second part<\/a> I set up TensorRT-LLM inside a Docker container and validated with TinyLlama that the pipeline basically runs. In doing so, one important detail came to light: the default API <code>from tensorrt_llm import LLM<\/code> uses the <strong>PyTorch backend<\/strong> and does <strong>not<\/strong> produce a deployable engine file. For the classic build-once-deploy-many pattern, and in order to try out an architecture that will later also work on Edge-LLM, I need the <strong>TensorRT backend<\/strong> and its two-stage build workflow.<\/p>\n<p>In this part I&#8217;ll walk through, step by step together with you, the building of <strong>two persistent engines<\/strong> for Qwen2.5-7B-Instruct: once in FP16 as a baseline, once in FP8 to activate the hardware Transformer Engine on the Ada architecture. Both as deployable <code>.engine<\/code> files, both reproducible with build scripts.<\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_83 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<span class=\"ez-toc-title-toggle\"><a href=\"#\" class=\"ez-toc-pull-right ez-toc-btn ez-toc-btn-xs ez-toc-btn-default ez-toc-toggle\" aria-label=\"Toggle Table of Content\"><span class=\"ez-toc-js-icon-con\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/span><\/a><\/span><\/div>\n<nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#The_two-stage_pipeline\" >The two-stage pipeline<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" 
href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#build_qwen_fp16sh_the_baseline\" >build_qwen_fp16.sh: the baseline<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#Stage_1_HF_%E2%86%92_TRT-LLM_checkpoint\" >Stage 1: HF \u2192 TRT-LLM checkpoint<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#Stage_2_TRT-LLM_checkpoint_%E2%86%92_TensorRT_engine\" >Stage 2: TRT-LLM checkpoint \u2192 TensorRT engine<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#Preparation_pulling_Qwen-7B_into_the_HF_cache\" >Preparation: pulling Qwen-7B into the HF cache<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#Starting_the_build_process\" >Starting the build process<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#Script_output_after_the_run\" >Script output after the run:<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#Using_the_engine_at_runtime\" >Using the engine at runtime<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#FP8_%E2%80%94_the_Ada_joker\" >FP8 \u2014 the Ada joker<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-10\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#Stage_1_HF_%E2%86%92_TRT-LLM_checkpoint_with_quantization\" >Stage 1: HF \u2192 TRT-LLM checkpoint with quantization<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-11\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#Stage_2_TRT-LLM_checkpoint_%E2%86%92_TensorRT_engine-2\" >Stage 2: TRT-LLM checkpoint \u2192 TensorRT engine<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-12\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#What_happens_around_it_in_the_build_qwen_fp8sh_script\" >What happens around it in the build_qwen_fp8.sh script<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-13\" 
href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#The_KV_cache_trap_or_how_I_built_token_salad\" >The KV cache trap (or: how I built token salad)<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-14\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#Starting_the_FP8_build_process\" >Starting the FP8 build process<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-15\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#Script_output_after_the_run-2\" >Script output after the run<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-16\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#Using_the_FP8_engine_at_runtime\" >Using the FP8 engine at runtime<\/a><\/li><\/ul><\/nav><\/div>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"The_two-stage_pipeline\"><\/span>The two-stage pipeline<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>TRT-LLM splits the build process into two clearly defined stages, each with its own CLI tool:<\/p>\n<pre class=\"wp-block-preformatted\">HuggingFace Checkpoint\r\n        \u2193\r\n[Stage 1] convert_checkpoint.py  (or quantize.py for FP8)\r\n        \u2193\r\nTRT-LLM Checkpoint  (rank0.safetensors + config.json)\r\n        \u2193\r\n[Stage 2] trtllm-build\r\n        \u2193\r\nTensorRT Engine  (rank0.engine + config.json)<\/pre>\n<p><strong>Stage 1<\/strong> converts the HuggingFace weights into the internal TRT-LLM checkpoint format. For FP8, this step additionally performs the post-training quantization based on calibration samples. The output is still a portable format that is not hardware-specific.<\/p>\n<p><strong>Stage 2<\/strong> compiles the TRT-LLM checkpoint into a <strong>hardware-specific<\/strong> TensorRT engine. This is where kernel auto-tuning, graph optimization and the precision choice per operation happen. The output is an <code>.engine<\/code> file that only runs on the target GPU architecture (in my case SM89 for Ada).<\/p>\n<p>This separation is exactly the same logic as in Edge-LLM, where Stage 1 runs as an ONNX export on an x86 host, and Stage 2 (engine build) happens either on the x86 host for a specific target or directly on the edge device.<\/p>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"build_qwen_fp16sh_the_baseline\"><\/span>build_qwen_fp16.sh: the baseline<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>My first build script automates both stages for Qwen2.5-7B-Instruct in FP16. Until I had this script running\u2026 phew, it took a while. The problem was that I couldn&#8217;t find the Python script <code>convert_checkpoint.py<\/code>. 
Only after intensive reading of error messages and searching inside the container did I finally find it.<\/p>\n<p>Here, briefly, is the core logic of what happens:<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Stage_1_HF_%E2%86%92_TRT-LLM_checkpoint\"><\/span>Stage 1: HF \u2192 TRT-LLM checkpoint<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>The model downloaded from HF is converted into the TRT-LLM checkpoint format.<\/p>\n<pre class=\"wp-block-code\"><code>python3 \/app\/tensorrt_llm\/examples\/models\/core\/qwen\/convert_checkpoint.py \\\r\n    --model_dir \"$QWEN_HF\" \\\r\n    --output_dir \/workspace\/checkpoints\/qwen2.5-7b-fp16 \\\r\n    --dtype float16\r\n<\/code><\/pre>\n<h3><span class=\"ez-toc-section\" id=\"Stage_2_TRT-LLM_checkpoint_%E2%86%92_TensorRT_engine\"><\/span>Stage 2: TRT-LLM checkpoint \u2192 TensorRT engine<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>This is where the TensorRT engine is built.<\/p>\n<pre class=\"wp-block-code\"><code>trtllm-build \\\r\n    --checkpoint_dir \/workspace\/checkpoints\/qwen2.5-7b-fp16 \\\r\n    --output_dir \/workspace\/engines\/qwen2.5-7b-fp16 \\\r\n    --gemm_plugin float16 \\\r\n    --max_batch_size 4 \\\r\n    --max_seq_len 4096<\/code><\/pre>\n<p>Around this I build the following features into the script (a compressed sketch of this skeleton follows further below, right before we start the build):<\/p>\n<ul class=\"wp-block-list\">\n<li><strong><code>set -euo pipefail<\/code><\/strong> for strict error handling<\/li>\n<li><strong>Sanity check:<\/strong> running inside the container, <code>trtllm-build<\/code> on PATH, HF model in the cache<\/li>\n<li><strong>Idempotency:<\/strong> if <code>rank0.engine<\/code> already exists, prompt the user (the user can skip the build \u2014 this really saved me time)<\/li>\n<li><strong>Per-stage timing:<\/strong> with <code>date +%s<\/code> for start and end time, nicely formatted output<\/li>\n<li><strong>Statistics log:<\/strong> results are additionally written to <code>\/workspace\/engines\/qwen2.5-7b-fp16-build.log<\/code> so that multiple builds can be compared side by side<\/li>\n<li><strong>Colorized log output:<\/strong> matches the style of my <code>setup_trtllm.sh<\/code><\/li>\n<\/ul>\n<h3><span class=\"ez-toc-section\" id=\"Preparation_pulling_Qwen-7B_into_the_HF_cache\"><\/span>Preparation: pulling Qwen-7B into the HF cache<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Before <code>build_qwen_fp16.sh<\/code> can get going, the Qwen-7B model itself first has to be sitting on disk. As a sanity check, the script verifies whether Qwen2.5-7B-Instruct can be found in the HF cache under <code>\/workspace\/cache\/hub\/<\/code>, and aborts if the model isn&#8217;t there.<\/p>\n<p>A deliberate decision I made is the clear split between the different scripts. The build script should not download models from the net itself. That would be opaque (a 14 GB download starting without explicit confirmation) and would make debugging hard. Hence the clear separation, so that you can always tell whether something was a download error or a build error; that matters, because in my case the internet connection is a bit flaky from time to time. Cleanly separated responsibilities.<\/p>\n<p>Instead, I download Qwen up front with a small Python script: <code>qwen_fp16.py<\/code>. It uses the PyTorch backend of TRT-LLM \u2014 that is, the default import <code>from tensorrt_llm import LLM<\/code> that we already got to know in part 2. It then downloads the model from HuggingFace and afterwards performs a brief inference for verification.<\/p>\n
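<p>In outline, such a script can look roughly like the following sketch. This is a simplified stand-in, not the exact <code>qwen_fp16.py<\/code> from the repo; the prompts and token counts are placeholders I chose for illustration.<\/p>\n<pre class=\"wp-block-code\"><code># Sketch: pull Qwen2.5-7B-Instruct into the HF cache and run a short smoke test.\r\n# Uses the default (PyTorch backend) LLM API from part 2 - no engine is built here.\r\nfrom tensorrt_llm import LLM, SamplingParams\r\n\r\ndef main():\r\n    # The first call downloads ~14 GB from HuggingFace into the HF cache;\r\n    # later calls find the model there and skip the download.\r\n    llm = LLM(model=\"Qwen\/Qwen2.5-7B-Instruct\")\r\n\r\n    prompts = [\r\n        \"Explain briefly what TensorRT-LLM is.\",\r\n        \"What does FP8 mean for LLM inference?\",\r\n    ]\r\n    for output in llm.generate(prompts, SamplingParams(max_tokens=64)):\r\n        print(output.outputs[0].text)\r\n\r\nif __name__ == \"__main__\":\r\n    main()<\/code><\/pre>\n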
<p>Download the real script now and save it, as usual, in your working directory. In my case that&#8217;s the path <code>\/data\/trtllm\/<\/code>.<\/p>\n<p>GitHub: <a href=\"https:\/\/github.com\/custom-build-robots\/tensorrt-llm-edge-prep\/tree\/main\/script\" target=\"_blank\" rel=\"noopener\">tensorrt-llm-edge-prep-script<\/a><\/p>\n<p>On the host you now have to jump into the container as usual:<br \/>\nCommand: <code>.\/start_trtllm.sh exec<\/code><\/p>\n<p>Then you run the script as follows:<\/p>\n<p>Command: <code>python3 qwen_fp16.py<\/code><\/p>\n<p>On the first run it took a while in my case. The download ran for about 10 minutes for the 14 GB of the model. Depending on your internet connection, that can easily turn into two cups of coffee. After the successful download, it took another 1\u20132 minutes to complete the model init and to produce the two generated answers. On the second and all following runs, the model is in the cache and the startup is noticeably faster.<\/p>\n<p>Note: I also kept running into the following error. This has nothing to do with the script itself \u2014 it&#8217;s just that the download went wrong. If this happens to you as well, just start the script again.<\/p>\n<p><code>RuntimeError: Data processing error: CAS service error : ReqwestMiddleware Error: Request failed after 5 retries<\/code><\/p>\n<p>All that matters is that the script runs through once completely \u2014 the concrete inference quality isn&#8217;t the goal here. We need the model on disk. Once <code>qwen_fp16.py<\/code> has run successfully, Qwen-7B is in the HF cache and <code>build_qwen_fp16.sh<\/code> will find it.<\/p>\n<p>Now on to the actual build.<\/p>\n
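<p>Before grabbing the real file, here is the compressed sketch of the skeleton announced above (sanity checks, idempotency guard, per-stage timing). It is deliberately simplified: no colors, no statistics log, and the way <code>QWEN_HF<\/code> is resolved from the HF cache is just one plausible variant, not necessarily what <code>build_qwen_fp16.sh<\/code> in the repo does.<\/p>\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/env bash\r\n# Simplified sketch of the build_qwen_fp16.sh skeleton - see the repo for the full script.\r\nset -euo pipefail\r\n\r\nCKPT_DIR=\/workspace\/checkpoints\/qwen2.5-7b-fp16\r\nENGINE_DIR=\/workspace\/engines\/qwen2.5-7b-fp16\r\n# Hypothetical helper: resolve the snapshot directory of the downloaded model in the HF cache\r\nQWEN_HF=\"$(ls -d \/workspace\/cache\/hub\/models--Qwen--Qwen2.5-7B-Instruct\/snapshots\/*\/ 2>\/dev\/null | head -n 1 || true)\"\r\n\r\n# Sanity checks: are we inside the container, is the model on disk?\r\ncommand -v trtllm-build >\/dev\/null || { echo \"[ERR] trtllm-build not on PATH - run this inside the container\"; exit 1; }\r\n[ -n \"$QWEN_HF\" ] || { echo \"[ERR] Qwen2.5-7B-Instruct not found in the HF cache\"; exit 1; }\r\n\r\n# Idempotency: do not silently overwrite an existing engine\r\nif [ -f \"$ENGINE_DIR\/rank0.engine\" ]; then\r\n    read -r -p \"Engine already exists - rebuild? [y\/N] \" answer\r\n    [ \"$answer\" = \"y\" ] || exit 0\r\nfi\r\n\r\n# Stage 1 + Stage 2 with per-stage timing\r\nt0=$(date +%s)\r\npython3 \/app\/tensorrt_llm\/examples\/models\/core\/qwen\/convert_checkpoint.py \\\r\n    --model_dir \"$QWEN_HF\" --output_dir \"$CKPT_DIR\" --dtype float16\r\nt1=$(date +%s)\r\ntrtllm-build --checkpoint_dir \"$CKPT_DIR\" --output_dir \"$ENGINE_DIR\" \\\r\n    --gemm_plugin float16 --max_batch_size 4 --max_seq_len 4096\r\nt2=$(date +%s)\r\necho \"Convert: $((t1 - t0))s   Build: $((t2 - t1))s   Total: $((t2 - t0))s\"<\/code><\/pre>\n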
<h3><span class=\"ez-toc-section\" id=\"Starting_the_build_process\"><\/span>Starting the build process<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Now download the script <code>build_qwen_fp16.sh<\/code> from GitHub here: <a href=\"https:\/\/github.com\/custom-build-robots\/tensorrt-llm-edge-prep\/tree\/main\/script\" target=\"_blank\" rel=\"noopener\">tensorrt-llm-edge-prep-script<\/a><\/p>\n<p>Save it, as usual, into your working directory on your disk. In my case that&#8217;s the path <code>\/data\/trtllm\/<\/code>.<\/p>\n<p>Then make it executable with the following command:<\/p>\n<p>Command: <code>chmod +x build_qwen_fp16.sh<\/code><\/p>\n<p>On the host you now have to jump into the container as usual:<br \/>\nCommand: <code>.\/start_trtllm.sh exec<\/code><\/p>\n<p>Inside the container, switch to the workspace directory:<br \/>\nCommand: <code>cd \/workspace<\/code><\/p>\n<p>Now you run it inside the container.<\/p>\n<p>Command: <code>.\/build_qwen_fp16.sh<\/code><\/p>\n<h3><span class=\"ez-toc-section\" id=\"Script_output_after_the_run\"><\/span>Script output after the run:<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>On the first run this takes about <strong>5 minutes<\/strong> \u2014 roughly <strong>2 min convert<\/strong> and <strong>3 min build<\/strong>. 
Output at the end:<\/p>\n<p><code>==================================================<\/code><br \/>\n<code>[INFO] STUFE 3: Verifikation<\/code><br \/>\n<code>==================================================<\/code><br \/>\n<code>[ OK ] Engine erfolgreich erzeugt:<\/code><br \/>\n<code>total 15G<\/code><br \/>\n<code>-rw-r--r-- 1 root root 5.3K May 16 06:10 config.json<\/code><br \/>\n<code>-rw-r--r-- 1 root root 15G May 16 06:11 rank0.engine<\/code><\/p>\n<p><code>==================================================<\/code><br \/>\n<code>Build-Statistik (f\u00fcr Interview-Tabelle)<\/code><br \/>\n<code>==================================================<\/code><br \/>\n<code>Modell: Qwen2.5-7B-Instruct<\/code><br \/>\n<code>Pr\u00e4zision: float16<\/code><br \/>\n<code>GPU: NVIDIA RTX 6000 Ada Generation<\/code><br \/>\n<code>Convert-Zeit: 00:02:00 (120s)<\/code><br \/>\n<code>Build-Zeit: 00:02:46 (166s)<\/code><br \/>\n<code>Gesamt-Zeit: 00:04:46 (286s)<\/code><br \/>\n<code>Checkpoint: 15G<\/code><br \/>\n<code>Engine-Datei: 15G<\/code><br \/>\n<code>Engine-Verz.: 15G<\/code><br \/>\n<code>Pfad: \/workspace\/engines\/qwen2.5-7b-fp16<\/code><br \/>\n<code>==================================================<\/code><\/p>\n<p><code>[ OK ] Statistik geloggt nach: \/workspace\/engines\/qwen2.5-7b-fp16-build.log<\/code><\/p>\n<p><code>[INFO] N\u00e4chster Schritt: Engine mit Python laden und Tokens generieren<\/code><br \/>\n<code>Beispiel:<\/code><br \/>\n<code>from tensorrt_llm import LLM, SamplingParams<\/code><br \/>\n<code>llm = LLM(model=\"\/workspace\/engines\/qwen2.5-7b-fp16\")<\/code><br \/>\n<code>out = llm.generate([\"Hallo\"], SamplingParams(max_tokens=50))<\/code><\/p>\n<p><code>root@b0f64442cfcb:\/workspace#<\/code><\/p>\n<p>15 GB for the engine isn&#8217;t exactly small, but that&#8217;s the normal size for a 7-billion-parameter model in FP16 (14 GB weights plus engine overhead).<\/p>\n<p>In the build log there are a few interesting detail lines on kernel auto-tuning, engine serialization and peak RAM. I&#8217;m saving that analysis for part 4, where it only gets its real significance in direct comparison with the FP8 build.<\/p>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Using_the_engine_at_runtime\"><\/span>Using the engine at runtime<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The built engine is not loaded by the default Python API \u2014 that one uses the PyTorch backend, after all. I need the TensorRT backend import:<\/p>\n<pre class=\"wp-block-code\"><code>from tensorrt_llm._tensorrt_engine import LLM  # NOT the default import!\r\nfrom tensorrt_llm import SamplingParams<\/code><\/pre>\n
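<p>Compressed to its essentials, loading and querying the engine with this backend looks roughly like the sketch below. It is a simplified stand-in for <code>run_engine_qwen_fp16.py<\/code>; the prompt and the token count are placeholders, and the real script adds the three test prompts plus the performance statistics.<\/p>\n<pre class=\"wp-block-code\"><code># Sketch: load the persistent FP16 engine with the TensorRT backend and generate a few tokens.\r\nimport time\r\n\r\nfrom tensorrt_llm._tensorrt_engine import LLM  # NOT the default import!\r\nfrom tensorrt_llm import SamplingParams\r\n\r\nt0 = time.time()\r\n# Points at the engine directory containing rank0.engine and config.json\r\nllm = LLM(model=\"\/workspace\/engines\/qwen2.5-7b-fp16\")\r\nprint(f\"Engine load: {time.time() - t0:.1f} s\")\r\n\r\nprompts = [\"Briefly explain what a TensorRT engine is.\"]  # placeholder test prompt\r\nfor output in llm.generate(prompts, SamplingParams(max_tokens=100)):\r\n    print(output.outputs[0].text)<\/code><\/pre>\n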
<p>The complete test script is called <code>run_engine_qwen_fp16.py<\/code> and is available in the repo:<\/p>\n<p>GitHub: <a href=\"https:\/\/github.com\/custom-build-robots\/tensorrt-llm-edge-prep\/tree\/main\/script\" target=\"_blank\" rel=\"noopener\">tensorrt-llm-edge-prep-script<\/a><\/p>\n<p>It loads the engine, measures the load time, generates tokens for three test prompts and prints a performance statistic at the end.<\/p>\n<p>Run the script inside the container as usual.<\/p>\n<p>Command: <code>python3 run_engine_qwen_fp16.py<\/code><\/p>\n<p>The concrete measurements that get printed \u2014 for example how long the loading takes, how many tokens per second are generated \u2014 are coming in the <strong>last part of this series<\/strong>, where they get their significance from direct comparison with FP8. For now it&#8217;s enough to know: the engine is loaded cleanly and produces readable German answers.<\/p>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"FP8_%E2%80%94_the_Ada_joker\"><\/span>FP8 \u2014 the Ada joker<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Up to this point I&#8217;ve played through the pipeline once in full. But the really interesting argument for an Ada GPU is the <strong>hardware FP8<\/strong> support via the Transformer Engine. Ampere can&#8217;t do this, Hopper can, Ada can, and the new Blackwell architecture can too. The Jetson Thor packs the same architecture class inside.<\/p>\n<p>FP8 is not &#8220;FP16 \/ 2&#8221;. FP8 is a <strong>different number representation<\/strong> that requires significantly more care with scaling. That&#8217;s why there is no simple <code>--dtype fp8<\/code> switch. Instead, an explicit <strong>post-training quantization<\/strong> (PTQ) step runs using the <a href=\"https:\/\/github.com\/NVIDIA\/TensorRT-Model-Optimizer\" target=\"_blank\" rel=\"noreferrer noopener\">NVIDIA ModelOpt<\/a> tool.<\/p>\n<p>The pipeline stays two-stage as with FP16 \u2014 only that Stage 1 now runs the <code>quantize.py<\/code> script instead of <code>convert_checkpoint.py<\/code>.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Stage_1_HF_%E2%86%92_TRT-LLM_checkpoint_with_quantization\"><\/span>Stage 1: HF \u2192 TRT-LLM checkpoint with quantization<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>The model is run through 1024 calibration samples (default dataset). ModelOpt observes the activation distributions per layer and determines <strong>optimal FP8 scaling factors<\/strong> so that as little information as possible is lost on the FP8 cast. 
That is the &#8220;intelligent&#8221; part \u2014 no simple rounding, but a statistically grounded value-range adaptation.<\/p>\n<pre class=\"wp-block-code\"><code>python3 \/app\/tensorrt_llm\/examples\/quantization\/quantize.py \\\r\n    --model_dir \"$QWEN_HF\" \\\r\n    --output_dir \/workspace\/checkpoints\/qwen2.5-7b-fp8 \\\r\n    --dtype float16 \\\r\n    --qformat fp8 \\\r\n    --calib_size 1024<\/code><\/pre>\n<p>The output is a TRT-LLM checkpoint with FP8 weights and a quantization plan.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Stage_2_TRT-LLM_checkpoint_%E2%86%92_TensorRT_engine-2\"><\/span>Stage 2: TRT-LLM checkpoint \u2192 TensorRT engine<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Then the engine build with FP8-specific plugins:<\/p>\n<pre class=\"wp-block-code\"><code>trtllm-build \\\r\n    --checkpoint_dir \/workspace\/checkpoints\/qwen2.5-7b-fp8 \\\r\n    --output_dir \/workspace\/engines\/qwen2.5-7b-fp8 \\\r\n    --gemm_plugin auto \\\r\n    --use_fp8_context_fmha enable \\\r\n    --max_batch_size 4 \\\r\n    --max_seq_len 4096<\/code><\/pre>\n<p>The <code>--use_fp8_context_fmha enable<\/code> activates an <strong>FP8-optimized attention<\/strong> in the prefill path. In the FP16 build the log had shown this line: <em>&#8220;FP8 Context FMHA is disabled because it must be used together with the fp8 quantization workflow.&#8221;<\/em> \u2014 now we are in the fp8 workflow, so it can finally be enabled.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"What_happens_around_it_in_the_build_qwen_fp8sh_script\"><\/span>What happens around it in the build_qwen_fp8.sh script<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>My <code>build_qwen_fp8.sh<\/code> has the same basic structure as the FP16 script (sanity check, idempotency, per-stage timing, statistics log), but with three FP8-specific extensions:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Quantize instead of convert<\/strong> in Stage 1 with the ModelOpt parameters<\/li>\n<li><strong>FP8 plugins<\/strong> in Stage 2 (<code>--use_fp8_context_fmha enable<\/code>)<\/li>\n<li><strong>KV_CACHE_DTYPE as a variable<\/strong> \u2014 empty by default. I&#8217;ll explain why in the next section.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"The_KV_cache_trap_or_how_I_built_token_salad\"><\/span>The KV cache trap (or: how I built token salad)<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>My first attempt with FP8 included an additional quantization on top: I had set <code>--kv_cache_dtype fp8<\/code>, assuming that an FP8-quantized KV cache would also bring performance gains. It does \u2014 tokens\/sec went up to 236, engine size stayed at 8.2 GB.<\/p>\n<p>The problem: <strong>the output was completely unusable.<\/strong><\/p>\n<p>Remember the test prompt &#8220;Briefly explain what a TensorRT engine is&#8221;? In the FP8 + FP8-KV version, this is what came out:<\/p>\n<pre class=\"wp-block-code\"><code>strugg (str, 1, 1, 1, 1, 1, 1, 1) 1) 1) 1) 1) 1) 1) 1) 1) 1) 1) 1) 1) 1)\r\n1) 1) 1) 1) 1) 1) 1) 1) 1) 1) 1) 1) 1) 1) 1) 1) 1) 1) 1) 1)<\/code><\/pre>\n<p>For all three prompts, similar token salad came out. The <strong>infrastructure<\/strong> ran perfectly, the engine was loaded, generated with high throughput, all metrics looked great. The <strong>model quality<\/strong> was completely destroyed.<\/p>\n<p>Diagnosis: For 7B models, FP8 quantization of the KV cache is <strong>too aggressive<\/strong>. The quality headroom is too small. With 70B+ models that would probably still be okay, because the model has enough robustness \u2014 but with a 7B model it collapses.<\/p>\n<p>The fix was one line: instead of <code>--kv_cache_dtype fp8<\/code> I just leave the option out entirely. The KV cache then stays at native FP16, only the weights and activations are quantized. The script now documents this cleanly: an empty <code>KV_CACHE_DTYPE=\"\"<\/code> variable means &#8220;do not quantize&#8221;, and in that case the argument switch is conditionally omitted.<\/p>\n
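<p>In script terms the pattern looks roughly like this (a sketch of the idea, shown here on the quantize call; the same conditional works wherever the switch is passed):<\/p>\n<pre class=\"wp-block-code\"><code># Sketch: only pass --kv_cache_dtype when a value is actually set.\r\nKV_CACHE_DTYPE=\"\"   # empty = leave the KV cache in FP16, \"fp8\" = quantize it as well\r\n\r\nEXTRA_ARGS=()\r\nif [ -n \"$KV_CACHE_DTYPE\" ]; then\r\n    EXTRA_ARGS+=(--kv_cache_dtype \"$KV_CACHE_DTYPE\")\r\nfi\r\n\r\npython3 \/app\/tensorrt_llm\/examples\/quantization\/quantize.py \\\r\n    --model_dir \"$QWEN_HF\" \\\r\n    --output_dir \/workspace\/checkpoints\/qwen2.5-7b-fp8 \\\r\n    --dtype float16 \\\r\n    --qformat fp8 \\\r\n    --calib_size 1024 \\\r\n    \"${EXTRA_ARGS[@]}\"<\/code><\/pre>\n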
<p>That was the most important lesson of this exercise: <strong>quantization performance without quality verification is worthless.<\/strong> A benchmark script would never have caught the bug. The tokens\/sec were actually better. Only reading the generated texts made the problem visible.<\/p>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Starting_the_FP8_build_process\"><\/span>Starting the FP8 build process<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>After the FP16 step, the Qwen-7B model is already in the HF cache \u2014 so preparation as with FP16 is not needed. Just grab the FP8 build script directly:<\/p>\n<p>GitHub: <a href=\"https:\/\/github.com\/custom-build-robots\/tensorrt-llm-edge-prep\/tree\/main\/script\" target=\"_blank\" rel=\"noopener\">tensorrt-llm-edge-prep-script<\/a><\/p>\n<p>Save it, as usual, in your working directory (in my case <code>\/data\/trtllm\/<\/code>) and make it executable:<\/p>\n<p>Command: <code>chmod +x build_qwen_fp8.sh<\/code><\/p>\n<p>If you&#8217;re not yet in the container:<br \/>\nCommand: <code>.\/start_trtllm.sh exec<\/code><\/p>\n<p>Inside the container, switch to the workspace directory:<br \/>\nCommand: <code>cd \/workspace<\/code><\/p>\n<p>And run it:<br \/>\nCommand: <code>.\/build_qwen_fp8.sh<\/code><\/p>\n<h3><span class=\"ez-toc-section\" id=\"Script_output_after_the_run-2\"><\/span>Script output after the run<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>On my run this took about <strong>11.5 minutes<\/strong> in total: roughly <strong>7.5 min quantize<\/strong> (the expensive ModelOpt step with 1024 calibration samples) and just under <strong>4 min build<\/strong>, as the statistics block below shows. 
The engine that comes out is noticeably smaller than the FP16 one, only 8.2 GB instead of 15 GB, so serializing it goes correspondingly faster; the quantization pass is what makes the total turnaround longer than in the FP16 run.<\/p>\n<p><code>[ OK ] Build fertig in 00:03:56 (236s)<\/code><\/p>\n<p><code>==================================================<\/code><br \/>\n<code>[INFO] STUFE 3: Verifikation<\/code><br \/>\n<code>==================================================<\/code><br \/>\n<code>[ OK ] Engine erfolgreich gebaut:<\/code><br \/>\n<code>total 8.2G<\/code><br \/>\n<code>-rw-r--r-- 1 root root 5.7K May 16 07:03 config.json<\/code><br \/>\n<code>-rw-r--r-- 1 root root 8.2G May 16 07:06 rank0.engine<\/code><\/p>\n<p><code>==================================================<\/code><br \/>\n<code>Build-Statistik FP16 (f\u00fcr Interview-Tabelle)<\/code><br \/>\n<code>==================================================<\/code><br \/>\n<code>Modell: Qwen2.5-7B-Instruct<\/code><br \/>\n<code>Pr\u00e4zision: FP8 (Weights + KV-Cache)<\/code><br \/>\n<code>Activations: float16<\/code><br \/>\n<code>GPU: NVIDIA RTX 6000 Ada Generation<\/code><br \/>\n<code>Quantize-Zeit: 00:07:34 (454s)<\/code><br \/>\n<code>Build-Zeit: 00:03:56 (236s)<\/code><br \/>\n<code>Gesamt-Zeit: 00:11:30 (690s)<\/code><br \/>\n<code>Checkpoint: 8.2G<\/code><br \/>\n<code>Engine-Datei: 8.2G<\/code><br \/>\n<code>Engine-Verz.: 8.2G<\/code><br \/>\n<code>Pfad: \/workspace\/engines\/qwen2.5-7b-fp8<\/code><br \/>\n<code>==================================================<\/code><\/p>\n<p><code>[ OK ] Statistik geloggt nach: \/workspace\/engines\/qwen2.5-7b-fp8-build.log<\/code><\/p>\n<p><code>[INFO] N\u00e4chster Schritt: FP8-Engine testen mit run_engine_qwen_fp8.py<\/code><br \/>\n<code>Vergleichswerte aus FP16-Lauf:<\/code><br \/>\n<code>Engine-Gr\u00f6\u00dfe: 14.5 GB<\/code><br \/>\n<code>Engine-Load: ~13 s<\/code><br \/>\n<code>Tokens\/sec: ~154 (batched)<\/code><br \/>\n<code>Erwartung FP8:<\/code><br \/>\n<code>Engine-Gr\u00f6\u00dfe: ~8 GB (44% kleiner)<\/code><br \/>\n<code>Tokens\/sec: 1,4-1,8x h\u00f6her dank Hardware-FP8<\/code><\/p>\n<p><code>root@b0f64442cfcb:\/workspace#<\/code><\/p>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Using_the_FP8_engine_at_runtime\"><\/span>Using the FP8 engine at runtime<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The FP8 engine is loaded by the TensorRT backend just like the FP16 engine. The program code in <code>run_engine_qwen_fp8.py<\/code> is almost identical to <code>run_engine_qwen_fp16.py<\/code>, only the engine path now points to <code>\/workspace\/engines\/qwen2.5-7b-fp8<\/code>, which we created earlier. 
As always, the program <code>run_engine_qwen_fp8.py<\/code> is available on GitHub.<\/p>\n<p>GitHub: <a href=\"https:\/\/github.com\/custom-build-robots\/tensorrt-llm-edge-prep\/tree\/main\/script\" target=\"_blank\" rel=\"noopener\">tensorrt-llm-edge-prep-script<\/a><\/p>\n<p>Once you&#8217;ve downloaded the Python program, you run it inside the container as usual with the following command:<\/p>\n<p>Command: <code>python3 run_engine_qwen_fp8.py<\/code><\/p>\n<p>Now we have generated a really nice FP8 model, and the following image clearly shows the output for the three questions.<\/p>\n<div id=\"attachment_2239\" style=\"width: 1034px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/Tensor_RT_LLM_FP8_result-1024x682.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-2239\" class=\"size-large wp-image-2239\" src=\"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/Tensor_RT_LLM_FP8_result-1024x682.jpg\" alt=\"Tensor RT LLM FP8 result\" width=\"1024\" height=\"682\" srcset=\"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/Tensor_RT_LLM_FP8_result-1024x682.jpg 1024w, https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/Tensor_RT_LLM_FP8_result-300x200.jpg 300w, https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/Tensor_RT_LLM_FP8_result-768x511.jpg 768w, https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/Tensor_RT_LLM_FP8_result-1080x719.jpg 1080w, https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/Tensor_RT_LLM_FP8_result.jpg 1325w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><p id=\"caption-attachment-2239\" class=\"wp-caption-text\">Tensor RT LLM FP8 result<\/p><\/div>\n<p>If everything is running correctly, you get three coherent German answers again to the same prompts as with FP16 \u2014 but noticeably faster. The concrete numbers, the comparison with FP16, and the insights into build bottlenecks and engine-load behavior are coming in the <strong>last part of this series<\/strong>.<\/p>\n<br>\r\n\t<br>\r\n<h2>Article overview - TensorRT-LLM on the RTX A6000 Ada:<\/h2>\r\n<a title=\"Preparing an Ubuntu 24.04 Server for AI Inference: CUDA, Docker, NVIDIA Container Toolkit\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/preparing-an-ubuntu-24-04-server-for-ai-inference-cuda-docker-nvidia-container-toolkit\/2268\/\">Preparing an Ubuntu 24.04 Server for AI Inference: CUDA, Docker, NVIDIA Container Toolkit<\/a><br>\r\n<a title=\"TensorRT-LLM on the RTX A6000 Ada: Preparing for the Edge-LLM Ecosystem\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-on-the-rtx-a6000-ada-preparing-for-the-edge-llm-ecosystem\/2255\/\">TensorRT-LLM on the RTX A6000 Ada: Preparing for the Edge-LLM Ecosystem<\/a><br>\r\n<a title=\"TensorRT-LLM on Ubuntu 24.04: Setup with Docker and Helper Scripts\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-on-ubuntu-24-04-setup-with-docker-and-helper-scripts\/2257\/\">TensorRT-LLM on Ubuntu 24.04: Setup with Docker and Helper Scripts<\/a><br>\r\n<a title=\"TensorRT-LLM Pipeline: Building Persistent Engines with FP16 and FP8\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/\">TensorRT-LLM Pipeline: Building Persistent Engines with FP16 and FP8<\/a><br>\r\n<a title=\"TensorRT-LLM in Numbers: FP16 vs. FP8 on the RTX A6000 Ada\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-in-numbers-fp16-vs-fp8-on-the-rtx-a6000-ada\/2266\/\">TensorRT-LLM in Numbers: FP16 vs. 
FP8 on the RTX A6000 Ada<\/a><br>\r\n\t\r\n\t<br>\r\n\t<br>\n","protected":false},"excerpt":{"rendered":"<p>In the second part I set up TensorRT-LLM inside a Docker container and validated with TinyLlama that the pipeline basically runs. In doing so, one important detail came to light: the default API from tensorrt_llm import LLM uses the PyTorch backend and does not produce a deployable engine file. For the classic build-once-deploy-many pattern, and [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":2240,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_et_pb_use_builder":"","_et_pb_old_content":"","_et_gb_content_width":"","footnotes":""},"categories":[162,50],"tags":[1203,1204,1201,1207,1162,1182,1171,1200,1206,1205,1176,1196,1173,1175,1202],"class_list":["post-2262","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-large-language-models-en","category-top-story-en","tag-build_qwen_fp16-sh","tag-build_qwen_fp8-sh","tag-convert_checkpoint-py","tag-fp8-context-fmha","tag-fp8-quantization","tag-hardware-fp8","tag-kv-cache-quantization","tag-modelopt-ptq","tag-post-training-quantization","tag-qwen2-5-7b-instruct","tag-rtx-a6000-ada","tag-tensorrt-backend","tag-tensorrt-engine-build","tag-tensorrt-llm","tag-trtllm-build","et-has-post-format-content","et_post_format-et-post-format-standard"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.6 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>TensorRT-LLM Pipeline: Building Persistent Engines with FP16 and FP8 - Exploring the Future: Inside the AI Box<\/title>\n<meta name=\"description\" content=\"TensorRT-LLM FP8 quantization with ModelOpt PTQ on RTX A6000 Ada \u2014 build_qwen_fp16.sh and build_qwen_fp8.sh scripts plus the KV-cache trap.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"TensorRT-LLM Pipeline: Building Persistent Engines with FP16 and FP8 - Exploring the Future: Inside the AI Box\" \/>\n<meta property=\"og:description\" content=\"TensorRT-LLM FP8 quantization with ModelOpt PTQ on RTX A6000 Ada \u2014 build_qwen_fp16.sh and build_qwen_fp8.sh scripts plus the KV-cache trap.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/\" \/>\n<meta property=\"og:site_name\" content=\"Exploring the Future: Inside the AI Box\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-16T07:26:08+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-05-16T11:30:17+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/Tensor_RT_LLM_FP8_result.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1325\" \/>\n\t<meta property=\"og:image:height\" content=\"882\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Maker\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Ingmar_Stapel\" \/>\n<meta name=\"twitter:site\" content=\"@Ingmar_Stapel\" 
\/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Maker\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"12 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\\\/2262\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\\\/2262\\\/\"},\"author\":{\"name\":\"Maker\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/#\\\/schema\\\/person\\\/cc91d08618b3feeef6926591b465eab1\"},\"headline\":\"TensorRT-LLM Pipeline: Building Persistent Engines with FP16 and FP8\",\"datePublished\":\"2026-05-16T07:26:08+00:00\",\"dateModified\":\"2026-05-16T11:30:17+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\\\/2262\\\/\"},\"wordCount\":1938,\"commentCount\":0,\"image\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\\\/2262\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/ai-box.eu\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/Tensor_RT_LLM_FP8_result.jpg\",\"keywords\":[\"build_qwen_fp16.sh\",\"build_qwen_fp8.sh\",\"convert_checkpoint.py\",\"FP8 Context FMHA\",\"FP8 quantization\",\"Hardware FP8\",\"KV cache quantization\",\"ModelOpt PTQ\",\"post-training quantization\",\"Qwen2.5-7B-Instruct\",\"RTX A6000 Ada\",\"TensorRT backend\",\"TensorRT engine build\",\"TensorRT-LLM\",\"trtllm-build\"],\"articleSection\":[\"Large Language Models\",\"Top story\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\\\/2262\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\\\/2262\\\/\",\"url\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\\\/2262\\\/\",\"name\":\"TensorRT-LLM Pipeline: Building Persistent Engines with FP16 and FP8 - Exploring the Future: Inside the AI Box\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\\\/2262\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\\\/2262\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/ai-box.eu\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/Tensor_RT_LLM_FP8_result.jpg\",\"datePublished\":\"2026-05-16T07:26:08+00:00\",\"dateModified\":\"2026-05-16T11:30:17+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/#\\\/schema\\\/person\\\/cc91d08618b3feeef6926591b465eab1\"},\"description\":\"TensorRT-LLM FP8 quantization with ModelOpt PTQ on RTX A6000 Ada \u2014 build_qwen_fp16.sh and build_qwen_fp8.sh scripts plus the KV-cache 
trap.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\\\/2262\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\\\/2262\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\\\/2262\\\/#primaryimage\",\"url\":\"https:\\\/\\\/ai-box.eu\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/Tensor_RT_LLM_FP8_result.jpg\",\"contentUrl\":\"https:\\\/\\\/ai-box.eu\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/Tensor_RT_LLM_FP8_result.jpg\",\"width\":1325,\"height\":882,\"caption\":\"Tensor RT LLM FP8 result\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/top-story-en\\\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\\\/2262\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Start\",\"item\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"TensorRT-LLM Pipeline: Building Persistent Engines with FP16 and FP8\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/#website\",\"url\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/\",\"name\":\"Exploring the Future: Inside the AI Box\",\"description\":\"Inside the AI Box, we share our experiences and discoveries in the world of artificial intelligence.\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/#\\\/schema\\\/person\\\/cc91d08618b3feeef6926591b465eab1\",\"name\":\"Maker\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g\",\"caption\":\"Maker\"},\"description\":\"I live in Bavaria near Munich. In my head I always have many topics and try out especially in the field of Internet new media much in my spare time. I write on the blog because it makes me fun to report about the things that inspire me. I am happy about every comment, about suggestion and very about questions.\",\"sameAs\":[\"https:\\\/\\\/ai-box.eu\"],\"url\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/author\\\/ingmars\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"TensorRT-LLM Pipeline: Building Persistent Engines with FP16 and FP8 - Exploring the Future: Inside the AI Box","description":"TensorRT-LLM FP8 quantization with ModelOpt PTQ on RTX A6000 Ada \u2014 build_qwen_fp16.sh and build_qwen_fp8.sh scripts plus the KV-cache trap.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/","og_locale":"en_US","og_type":"article","og_title":"TensorRT-LLM Pipeline: Building Persistent Engines with FP16 and FP8 - Exploring the Future: Inside the AI Box","og_description":"TensorRT-LLM FP8 quantization with ModelOpt PTQ on RTX A6000 Ada \u2014 build_qwen_fp16.sh and build_qwen_fp8.sh scripts plus the KV-cache trap.","og_url":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/","og_site_name":"Exploring the Future: Inside the AI Box","article_published_time":"2026-05-16T07:26:08+00:00","article_modified_time":"2026-05-16T11:30:17+00:00","og_image":[{"width":1325,"height":882,"url":"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/Tensor_RT_LLM_FP8_result.jpg","type":"image\/jpeg"}],"author":"Maker","twitter_card":"summary_large_image","twitter_creator":"@Ingmar_Stapel","twitter_site":"@Ingmar_Stapel","twitter_misc":{"Written by":"Maker","Est. reading time":"12 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#article","isPartOf":{"@id":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/"},"author":{"name":"Maker","@id":"https:\/\/ai-box.eu\/en\/#\/schema\/person\/cc91d08618b3feeef6926591b465eab1"},"headline":"TensorRT-LLM Pipeline: Building Persistent Engines with FP16 and FP8","datePublished":"2026-05-16T07:26:08+00:00","dateModified":"2026-05-16T11:30:17+00:00","mainEntityOfPage":{"@id":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/"},"wordCount":1938,"commentCount":0,"image":{"@id":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#primaryimage"},"thumbnailUrl":"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/Tensor_RT_LLM_FP8_result.jpg","keywords":["build_qwen_fp16.sh","build_qwen_fp8.sh","convert_checkpoint.py","FP8 Context FMHA","FP8 quantization","Hardware FP8","KV cache quantization","ModelOpt PTQ","post-training quantization","Qwen2.5-7B-Instruct","RTX A6000 Ada","TensorRT backend","TensorRT engine build","TensorRT-LLM","trtllm-build"],"articleSection":["Large Language Models","Top story"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/","url":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/","name":"TensorRT-LLM Pipeline: Building Persistent Engines with FP16 and FP8 - Exploring the Future: 
Inside the AI Box","isPartOf":{"@id":"https:\/\/ai-box.eu\/en\/#website"},"primaryImageOfPage":{"@id":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#primaryimage"},"image":{"@id":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#primaryimage"},"thumbnailUrl":"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/Tensor_RT_LLM_FP8_result.jpg","datePublished":"2026-05-16T07:26:08+00:00","dateModified":"2026-05-16T11:30:17+00:00","author":{"@id":"https:\/\/ai-box.eu\/en\/#\/schema\/person\/cc91d08618b3feeef6926591b465eab1"},"description":"TensorRT-LLM FP8 quantization with ModelOpt PTQ on RTX A6000 Ada \u2014 build_qwen_fp16.sh and build_qwen_fp8.sh scripts plus the KV-cache trap.","breadcrumb":{"@id":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#primaryimage","url":"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/Tensor_RT_LLM_FP8_result.jpg","contentUrl":"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/Tensor_RT_LLM_FP8_result.jpg","width":1325,"height":882,"caption":"Tensor RT LLM FP8 result"},{"@type":"BreadcrumbList","@id":"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-pipeline-building-persistent-engines-with-fp16-and-fp8\/2262\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Start","item":"https:\/\/ai-box.eu\/en\/"},{"@type":"ListItem","position":2,"name":"TensorRT-LLM Pipeline: Building Persistent Engines with FP16 and FP8"}]},{"@type":"WebSite","@id":"https:\/\/ai-box.eu\/en\/#website","url":"https:\/\/ai-box.eu\/en\/","name":"Exploring the Future: Inside the AI Box","description":"Inside the AI Box, we share our experiences and discoveries in the world of artificial intelligence.","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/ai-box.eu\/en\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/ai-box.eu\/en\/#\/schema\/person\/cc91d08618b3feeef6926591b465eab1","name":"Maker","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g","caption":"Maker"},"description":"I live in Bavaria near Munich. In my head I always have many topics and try out especially in the field of Internet new media much in my spare time. I write on the blog because it makes me fun to report about the things that inspire me. 
I am happy about every comment, about suggestion and very about questions.","sameAs":["https:\/\/ai-box.eu"],"url":"https:\/\/ai-box.eu\/en\/author\/ingmars\/"}]}},"_links":{"self":[{"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/posts\/2262","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/comments?post=2262"}],"version-history":[{"count":4,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/posts\/2262\/revisions"}],"predecessor-version":[{"id":2272,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/posts\/2262\/revisions\/2272"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/media\/2240"}],"wp:attachment":[{"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/media?parent=2262"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/categories?post=2262"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/tags?post=2262"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}