{"id":2603,"date":"2026-06-17T04:28:58","date_gmt":"2026-06-17T04:28:58","guid":{"rendered":"https:\/\/ai-box.eu\/?p=2603"},"modified":"2026-06-18T02:43:51","modified_gmt":"2026-06-18T02:43:51","slug":"local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally","status":"publish","type":"post","link":"https:\/\/ai-box.eu\/en\/news\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\/2603\/","title":{"rendered":"Local Voice Agent: Wiring ASR, LLM and TTS into a Loop with NVIDIA Pipecat Locally"},"content":{"rendered":"<p>Every building block is now available locally: ASR (<a href=\"https:\/\/ai-box.eu\/en\/news\/nvidia-nim-locally-running-german-speech-recognition-as-a-microservice\/2556\/\" target=\"_blank\" rel=\"noopener\">Parakeet<\/a>, <a href=\"https:\/\/ai-box.eu\/en\/news\/nvidia-canary-locally-multilingual-speech-recognition-and-translation-as-a-nim\/2562\/\" target=\"_blank\" rel=\"noopener\">Canary<\/a>), TTS (<a href=\"https:\/\/ai-box.eu\/en\/news\/nvidia-magpie-tts-locally-german-speech-output-as-a-microservice\/2570\/\" target=\"_blank\" rel=\"noopener\">Magpie<\/a>), an LLM via my Ollama server and the orchestrator (<a href=\"https:\/\/ai-box.eu\/en\/news\/nvidia-nemo-agent-toolkit-nat-set-up-the-agent-orchestrator-locally\/2579\/\" target=\"_blank\" rel=\"noopener\">NAT, Part 5<\/a>). Now I&#8217;m connecting them into a continuous, <strong>interruptible speech loop<\/strong>. This will be my first small local voice agent. What I&#8217;m aiming for is a kind of <strong>general-purpose agent<\/strong>: I speak, it understands, it gets things done and answers me in natural language. The things it does will be small at first, but the point here is the principle \u2013 how it works at all.<\/p>\n<p>In this part I&#8217;m building along <strong>Path A<\/strong>: the LLM sits directly in the loop (my Ollama server), so we first get a cleanly running voice-to-voice loop. 
The &#8220;real brain&#8221; \u2013 the NAT agent with tool calling (telling the time and so on) that we already met in Part 5 \u2013 I&#8217;ll wire in for <strong>Part 7<\/strong>. The wake word follows in <strong>Part 8<\/strong>.<\/p>\n<div id=\"attachment_2597\" style=\"width: 617px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/NVIDIA_nim_voice_assistant_web_frontend_09.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-2597\" class=\" wp-image-2597\" src=\"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/NVIDIA_nim_voice_assistant_web_frontend_09.jpg\" alt=\"NVIDIA NIM Voice Assistant Web Frontend\" width=\"607\" height=\"667\" srcset=\"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/NVIDIA_nim_voice_assistant_web_frontend_09.jpg 737w, https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/NVIDIA_nim_voice_assistant_web_frontend_09-273x300.jpg 273w\" sizes=\"(max-width: 607px) 100vw, 607px\" \/><\/a><p id=\"caption-attachment-2597\" class=\"wp-caption-text\">NVIDIA NIM Voice Assistant Web Frontend<\/p><\/div>\n<h2><span class=\"ez-toc-section\" id=\"What_is_NVIDIA_Pipecat\"><\/span>What is NVIDIA Pipecat?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>To understand the loop, it&#8217;s worth looking at the two layers that come together here.<\/p>\n<p><strong>Pipecat<\/strong> is an open-source framework for real-time voice and multimodal agents. At its core is a <strong>pipeline architecture<\/strong>: audio and text flow as small &#8220;frames&#8221; through a chain of stages (input \u2192 ASR \u2192 LLM \u2192 TTS \u2192 output). Pipecat takes care of exactly the tricky real-time problems you&#8217;d otherwise have to build laboriously yourself: the continuous streaming of audio chunks, <strong>turn detection<\/strong> (when have I finished speaking?) and <strong>interruptibility<\/strong> (cutting in on the agent). The framework is deliberately vendor-neutral; you can plug in open, commercial or your own models.<\/p>\n<p><strong>NVIDIA Pipecat<\/strong> is the NVIDIA-native extension of it and part of the <strong>ACE Controller<\/strong> project. 
It ships ready-made building blocks for exactly the models I run locally in this series:<\/p>\n<ul>\n<li><strong>Riva ASR service:<\/strong> talks to my speech recognition (Parakeet or Canary).<\/li>\n<li><strong>Riva TTS service:<\/strong> talks to my speech output (Magpie).<\/li>\n<li><strong>LLM service:<\/strong> plugs in an LLM as the &#8220;thinking&#8221; stage; in my case my Ollama server via the OpenAI-compatible interface.<\/li>\n<\/ul>\n<p>On top of that come NVIDIA-specific helpers such as the FastAPI WebSocket transport (a WebSocket-based audio transport including a small browser test UI), synchronization of the transcripts with audio playback, and latency optimizations like <em>Speculative Speech Processing<\/em>, which starts processing while I&#8217;m still speaking.<\/p>\n<p>The decisive point for my sovereign approach: by default the Riva services point to the NVIDIA cloud, but they can just as easily be aimed at <strong>local<\/strong> endpoints. That makes Pipecat the connective tissue that turns my individual, locally running NIMs into a real, coherent dialogue. With this local configuration, no audio packet leaves my own network.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_architecture_at_a_glance\"><\/span>The architecture at a glance<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>[Diagram: microphone (browser) \u2192 ASR (Parakeet NIM) \u2192 LLM (Ollama) \u2192 TTS (Magpie NIM) \u2192 speaker, with feedback and interruptibility.]<\/p>\n<p>The loop itself is a pipeline in this order:<\/p>\n<pre><code class=\"language-text\">transport.input() \u2192 STT \u2192 context.user() \u2192 LLM \u2192 TTS \u2192 transport.output() \u2192 context.assistant()<\/code><\/pre>\n<h2><span class=\"ez-toc-section\" id=\"Prerequisites\"><\/span>Prerequisites<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<ul>\n<li>The <strong>Parakeet ASR NIM<\/strong> (Part 2) and the <strong>Magpie TTS NIM<\/strong> (Part 4) now have to run <strong>at the same time<\/strong>.<\/li>\n<li>My <strong>Ollama server<\/strong> (on the second machine) as the LLM backend.<\/li>\n<li>A <strong>Python 3.12 venv<\/strong> \u2013 nvidia-pipecat requires Python 3.12 (so a dedicated environment, separate from the venvs of the other parts).<\/li>\n<li>A browser for testing (microphone access), just like the Gradio demo in Part 1.<\/li>\n<\/ul>\n<h2><span class=\"ez-toc-section\" id=\"Step_1_Install_nvidia-pipecat\"><\/span>Step 1: Install nvidia-pipecat<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>nvidia-pipecat explicitly requires <strong>Python 3.12<\/strong> and, unfortunately, nothing newer, which is a small problem on my Ubuntu server. Current distributions often no longer ship it: Ubuntu 25.10, for instance, comes only with Python 3.13 (and 3.14), and a <code>python3.12 -m venv<\/code> promptly answers with <code>Command 'python3.12' not found<\/code>. I wouldn&#8217;t force the newer versions here, because some dependencies (Pipecat itself, the ONNX\/Silero components for the VAD, and so on) often don&#8217;t have ready-made wheels for 3.13\/3.14 yet. With a newer Python version, the build will probably end in errors. 
So we specifically need a Python 3.12 \u2013 at least at the time I wrote this article, that was my path.<\/p>\n<p>First check whether you even have it:<\/p>\n<p><strong>Command:<\/strong> <code>python3.12 --version<\/code><\/p>\n<p><strong>Case A \u2013 Python 3.12 is available.<\/strong> Then the classic venv is enough:<\/p>\n<p><strong>Command:<\/strong> <code>python3.12 -m venv ~\/venvs\/pipecat<\/code><\/p>\n<p><strong>Command:<\/strong> <code>source ~\/venvs\/pipecat\/bin\/activate<\/code><\/p>\n<p><strong>Command:<\/strong> <code>pip install --upgrade pip setuptools wheel<\/code><\/p>\n<p><strong>Case B \u2013 Python 3.12 is missing<\/strong> (e.g. on Ubuntu 25.10). Here <code>uv<\/code> is the cleanest route: the tool fetches an isolated Python 3.12 itself \u2013 entirely without <code>sudo<\/code> and without touching your system Python.<\/p>\n<p>First install uv (if not already present) and put it on the PATH:<\/p>\n<p><strong>Command:<\/strong> <code>curl -LsSf https:\/\/astral.sh\/uv\/install.sh | sh<\/code><\/p>\n<p><strong>Command:<\/strong> <code>source $HOME\/.local\/bin\/env<\/code><\/p>\n<p>Then create the venv with Python 3.12 (uv downloads 3.12 automatically if needed) and activate it:<\/p>\n<p><strong>Command:<\/strong> <code>uv venv --python 3.12 ~\/venvs\/pipecat<\/code><\/p>\n<p><strong>Command:<\/strong> <code>source ~\/venvs\/pipecat\/bin\/activate<\/code><\/p>\n<p><strong>Note:<\/strong> In a uv venv you install packages with <code>uv pip install \u2026<\/code> instead of <code>pip install \u2026<\/code>.<\/p>\n<p><strong>Now install the package:<\/strong> depending on which route you took, you again have to choose A or B. 
For me it&#8217;s still the B variant:<\/p>\n<p><strong>Command (Case A):<\/strong> <code>pip install nvidia-pipecat<\/code><\/p>\n<p><strong>Command (Case B):<\/strong> <code>uv pip install nvidia-pipecat<\/code><\/p>\n<p><strong>Good to know &#8211; the important extras already come along:<\/strong><\/p>\n<p>A glance at the install output shows that <code>nvidia-pipecat<\/code> (for me it was version 0.4.0 based on <code>pipecat-ai<\/code> 0.0.98) already pulls in the building blocks we&#8217;re about to need as dependencies: <code>onnxruntime<\/code> (runtime for the Silero VAD), <code>aiortc<\/code> together with <code>pipecat-ai-small-webrtc-prebuilt<\/code> (WebRTC transport including a ready-made browser client), <code>nvidia-riva-client<\/code> (ASR\/TTS) and <code>openai<\/code> (for the OpenAI-compatible LLM connection to Ollama). So you most likely won&#8217;t need to install anything extra. If a <code>ModuleNotFoundError<\/code> does show up on the first run, install the matching extra specifically, e.g. <code>uv pip install \"pipecat-ai[silero]\"<\/code>.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Step_2_Start_both_NIMs_at_the_same_time_and_mind_the_ports\"><\/span>Step 2: Start both NIMs at the same time and mind the ports<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Up to now we always ran the NIMs individually. For the loop it&#8217;s different: ASR <strong>and<\/strong> TTS now have to run <strong>at the same time<\/strong> and stay started for the whole session. Each NIM is its own foreground container, so we start them in <strong>two separate terminals<\/strong> (we&#8217;ll use a third terminal later for the Pipecat bot).<\/p>\n<p><strong>And yes, here we have to watch the ports.<\/strong><\/p>\n<p>By default both NIMs want ports <code>9000<\/code> (HTTP) and <code>50051<\/code> (gRPC). If we started both like that, there&#8217;d be a port conflict and the second container wouldn&#8217;t even come up. 
The solution: Parakeet stays on the default ports, and we remap Magpie to <code>9001<\/code>\/<code>50052<\/code>. Worth remembering: the <strong>gRPC ports<\/strong> (50051 for ASR, 50052 for TTS) are exactly the addresses we&#8217;ll point the Pipecat services at in Step 4.<\/p>\n<p>Now open two terminal windows to your application server.<\/p>\n<p><strong>Note:<\/strong> In a freshly opened terminal the API key is usually no longer set. So in each one we should first run the familiar command below (as in Part 2).<\/p>\n<p><strong>Command:<\/strong> <code>export NGC_API_KEY=\"nvapi-xxxxxxxxxxxxxxxxxxxxx\"<\/code><\/p>\n<p><strong>Terminal 1 \u2013 Parakeet (ASR)<\/strong> on the default ports 9000\/50051:<\/p>\n<p><strong>Command:<\/strong> <code class=\"language-bash\">docker run -it --rm --name=parakeet-1-1b-rnnt-multilingual --runtime=nvidia --gpus '\"device=0\"' --shm-size=8GB -e NGC_API_KEY -e NIM_HTTP_API_PORT=9000 -e NIM_GRPC_API_PORT=50051 -p 9000:9000 -p 50051:50051 -e NIM_TAGS_SELECTOR=\"mode=str\" -v ~\/.cache\/nim:\/opt\/nim\/.cache nvcr.io\/nim\/nvidia\/parakeet-1-1b-rnnt-multilingual:latest<\/code><\/p>\n<p><strong>Terminal 2 \u2013 Magpie (TTS)<\/strong>, remapped to 9001\/50052. Important: here I deliberately force <code>batch_size=8<\/code> \u2013 why is explained in the VRAM box just below.<\/p>\n<p><strong>Command:<\/strong> <code class=\"language-bash\">docker run -it --rm --name=magpie-tts-multilingual --runtime=nvidia --gpus '\"device=0\"' --shm-size=8GB -e NGC_API_KEY -e NIM_HTTP_API_PORT=9001 -e NIM_GRPC_API_PORT=50052 -p 9001:9001 -p 50052:50052 -e NIM_TAGS_SELECTOR=\"name=magpie-tts-multilingual,batch_size=8\" -v ~\/.cache\/nim:\/opt\/nim\/.cache nvcr.io\/nim\/nvidia\/magpie-tts-multilingual:latest<\/code><\/p>\n<p>Both containers keep running \u2013 meaning you must not close the two terminal windows. That&#8217;s not ideal, but later, once everything works, we can also start the containers automatically in the background if we want. 
In a third terminal, briefly check that both services are active and working \u2013 let&#8217;s say healthy. Here we again have to mind the <strong>different ports<\/strong>.<\/p>\n<p><strong>Command:<\/strong> <code>curl http:\/\/localhost:9000\/v1\/health\/ready<\/code> (ASR, Parakeet)<\/p>\n<p>As a result I got the following back, confirming that the container and service are running.<\/p>\n<blockquote><p><code>{\"object\":\"health.response\",\"message\":\"ready\",\"status\":\"ready\"}<\/code><\/p><\/blockquote>\n<p><strong>Command:<\/strong> <code>curl http:\/\/localhost:9001\/v1\/health\/ready<\/code> (TTS, Magpie)<\/p>\n<p>Here too I got a positive result, confirming the container and service are running.<\/p>\n<blockquote><p><code>{\"object\":\"health.response\",\"message\":\"ready\",\"status\":\"ready\"}<\/code><\/p><\/blockquote>\n<p><strong>Note: VRAM and why batch_size=8:<\/strong> Unlike Part 4, where Magpie ran alone, ASR and TTS now share the same GPU (and even more so if other services are running on the card). The profile selector initially reached for the large <code>batch_size=32<\/code> profile. That occupied around 31 GB of VRAM for me, whereas <code>batch_size=8<\/code> gets by with ~11 GB. Together with Parakeet (and everything else), the 48 GB card filled up and the container aborted while loading. In the log it looked like this:<\/p>\n<blockquote><p><code>[TRT] [E] Error Code 2: OutOfMemory (Requested size was 10786262016 bytes.)<\/code><br \/>\n<code>... Failed to create an execution context!<\/code><br \/>\n<code>&gt; Triton server died before reaching ready state.<\/code><\/p><\/blockquote>\n<p>That&#8217;s why I deliberately append <code>batch_size=8<\/code> in the <code>NIM_TAGS_SELECTOR<\/code> above. Check beforehand with <code>nvidia-smi<\/code> how much VRAM is still free. 
If other GPU services are running there, stop them for the loop session or give them a different card.<\/p>\n<p><strong>Note: Persist the model build (no more 28-minute rebuild):<\/strong><\/p>\n<p>You&#8217;ve surely noticed that on startup Magpie <em>builds<\/em> the TensorRT codec decoder for several minutes. That&#8217;s because Magpie ships as an RMIR model and there&#8217;s no ready-made engine to download for the Ada GPU (the log says <code>model_type: rmir<\/code>). The mounted <code>~\/.cache\/nim<\/code> only saves the download, not the build. The finished engine lands in the container-internal <code>\/data\/models<\/code> and is discarded every time with <code>--rm<\/code>. NVIDIA solves this with a one-time <strong>export<\/strong>: build the engine once, write it to a mounted folder, and load it from there on later starts.<\/p>\n<p><strong>Step A<\/strong><\/p>\n<p><strong>Export the finished engine once<\/strong> (the container builds the engine, writes it to the export folder and exits):<\/p>\n<p><strong>Command: <\/strong><code class=\"language-bash\">export NIM_EXPORT_PATH=~\/nim_export<\/code><\/p>\n<p><strong>Command: <\/strong> <code class=\"language-bash\">mkdir -p $NIM_EXPORT_PATH &amp;&amp; chmod 777 $NIM_EXPORT_PATH<\/code><\/p>\n<p>We don&#8217;t need to publish any ports for this command, because the container only builds the engine, exports it, and then shuts down as soon as the export is complete.<\/p>\n<p><strong>Command: <\/strong><code class=\"language-bash\">docker run -it --rm --name=magpie-tts-multilingual --runtime=nvidia --gpus '\"device=0\"' --shm-size=8GB -e NGC_API_KEY -e NIM_TAGS_SELECTOR=\"name=magpie-tts-multilingual,batch_size=8\" -v ~\/.cache\/nim:\/opt\/nim\/.cache -v $NIM_EXPORT_PATH:\/opt\/nim\/export -e NIM_EXPORT_PATH=\/opt\/nim\/export nvcr.io\/nim\/nvidia\/magpie-tts-multilingual:latest<\/code><\/p>\n<p>Now there should be a file named <code>tts-MagpieTTS_21hz_codec_trtllm_Multilingual.tar.gz<\/code> 
in the path <code class=\"language-bash\">~\/nim_export<\/code>. To check, run the following command:<\/p>\n<p><strong>Command:<\/strong> <code>ls ~\/nim_export<\/code><\/p>\n<p>This file then contains the built model. Now load this model as described in Step B.<\/p>\n<p><strong>Step B<\/strong><\/p>\n<p><strong>Load the finished engine from persistent storage:<\/strong> on subsequent starts you mount the same export folder and additionally set <code>NIM_DISABLE_MODEL_DOWNLOAD=true<\/code>. The container then skips both the download and the rebuild and is ready in seconds instead of minutes.<\/p>\n<p>For the exact flags and paths for your NIM version, it&#8217;s best to consult NVIDIA&#8217;s docs <a href=\"https:\/\/docs.nvidia.com\/nim\/speech\/latest\/deployment\/docker\/model-caching.html\" target=\"_blank\" rel=\"noopener\">&#8220;Model Caching for Speech NIM Containers&#8221;<\/a>. The export workflow for RMIR models is described there step by step.<\/p>\n<p>For me the command looked like this.<\/p>\n<p><strong>Command:<\/strong> <code>docker run -it --rm --name=magpie-tts-multilingual --runtime=nvidia --gpus '\"device=0\"' --shm-size=8GB -e NGC_API_KEY -e NIM_TAGS_SELECTOR=\"name=magpie-tts-multilingual,batch_size=8\" -e NIM_DISABLE_MODEL_DOWNLOAD=true -e NIM_HTTP_API_PORT=9001 -e NIM_GRPC_API_PORT=50052 -p 9001:9001 -p 50052:50052 -v ~\/.cache\/nim:\/opt\/nim\/.cache -v $NIM_EXPORT_PATH:\/opt\/nim\/export -e NIM_EXPORT_PATH=\/opt\/nim\/export nvcr.io\/nim\/nvidia\/magpie-tts-multilingual:latest<\/code><\/p>\n<p><strong>Note:<\/strong> If you get an error like <code>docker: invalid spec: :\/opt\/nim\/export: empty section between colons<\/code>, then <code>NIM_EXPORT_PATH<\/code> is empty. After a server restart, both variables have to be set again:<\/p>\n<p><strong>Command:<\/strong> <code>export NGC_API_KEY=\"nvapi-xxxxxxxxxxxxxxxxxxxxx\"<\/code><\/p>\n<p><strong>Command: <\/strong><code class=\"language-bash\">export NIM_EXPORT_PATH=~\/nim_export<\/code><\/p>\n<p><strong>Command:<\/strong> 
<code>curl http:\/\/localhost:9001\/v1\/health\/ready<\/code> (TTS, Magpie)<\/p>\n<p>Here too I got a positive result, confirming the container and service are running.<\/p>\n<blockquote><p><code>{\"object\":\"health.response\",\"message\":\"ready\",\"status\":\"ready\"}<\/code><\/p><\/blockquote>\n<p>With the following command you can list all voices available in the model and their languages.<\/p>\n<p><strong>Command:<\/strong> <code>curl http:\/\/localhost:9001\/v1\/audio\/list_voices<\/code><\/p>\n<p>For German, the following voices are available. They become relevant in Step 4, where we configure the voice.<\/p>\n<blockquote><p># Mia (female)<br \/>\nMagpie-Multilingual.DE-DE.Mia<br \/>\nMagpie-Multilingual.DE-DE.Mia.Neutral<br \/>\nMagpie-Multilingual.DE-DE.Mia.Calm<br \/>\nMagpie-Multilingual.DE-DE.Mia.Angry<br \/>\nMagpie-Multilingual.DE-DE.Mia.Happy<br \/>\nMagpie-Multilingual.DE-DE.Mia.Sad<\/p>\n<p># Sofia (female)<br \/>\nMagpie-Multilingual.DE-DE.Sofia<br \/>\nMagpie-Multilingual.DE-DE.Sofia.Neutral<br \/>\nMagpie-Multilingual.DE-DE.Sofia.Calm<br \/>\nMagpie-Multilingual.DE-DE.Sofia.Angry<br \/>\nMagpie-Multilingual.DE-DE.Sofia.Happy<br \/>\nMagpie-Multilingual.DE-DE.Sofia.Fearful<\/p>\n<p># Pascal (male)<br \/>\nMagpie-Multilingual.DE-DE.Pascal<br \/>\nMagpie-Multilingual.DE-DE.Pascal.Neutral<br \/>\nMagpie-Multilingual.DE-DE.Pascal.Calm<br \/>\nMagpie-Multilingual.DE-DE.Pascal.Angry<br \/>\nMagpie-Multilingual.DE-DE.Pascal.Happy<br \/>\nMagpie-Multilingual.DE-DE.Pascal.Disgust<br \/>\nMagpie-Multilingual.DE-DE.Pascal.Sad<\/p>\n<p># Diego (male)<br \/>\nMagpie-Multilingual.DE-DE.Diego<br \/>\nMagpie-Multilingual.DE-DE.Diego.Neutral<br \/>\nMagpie-Multilingual.DE-DE.Diego.Calm<br \/>\nMagpie-Multilingual.DE-DE.Diego.Angry<br \/>\nMagpie-Multilingual.DE-DE.Diego.Happy<br \/>\nMagpie-Multilingual.DE-DE.Diego.PleasantSurprised<br \/>\nMagpie-Multilingual.DE-DE.Diego.Disgust<\/p><\/blockquote>\n<h2><span class=\"ez-toc-section\" 
id=\"Step_3_Get_the_official_nvidia-pipecat_example_as_a_scaffold\"><\/span>Step 3: Get the official nvidia-pipecat example as a scaffold<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Instead of building the FastAPI and runner mechanics by hand, we take NVIDIA&#8217;s official <strong>&#8220;speech-to-speech&#8221; example<\/strong> as the basis. It already ships the ACETransport, the pipeline runner and a small web test UI. The best part of this solution: ASR, LLM and TTS can be switched entirely via <strong>environment variables<\/strong>, without touching the Python code.<\/p>\n<p><strong>Command:<\/strong> <code>git clone https:\/\/github.com\/NVIDIA\/voice-agent-examples.git<\/code><\/p>\n<p><strong>Command:<\/strong> <code>cd ~\/voice-agent-examples\/examples\/voice_agent_websocket<\/code><\/p>\n<p>The example brings its own uv-managed environment. We create it and install the dependencies \u2013 from here on we work in this project venv (the venv from Step 1 mainly served to verify Python 3.12 and the installation):<\/p>\n<p><strong>Command:<\/strong> <code>uv venv<\/code><\/p>\n<p><strong>Command:<\/strong> <code>source .venv\/bin\/activate<\/code><\/p>\n<p><strong>Command:<\/strong> <code>uv sync<\/code><\/p>\n<p><strong>Important:<\/strong> the example can also spin up its own ASR\/TTS\/LLM containers via <code>docker compose<\/code>. But we don&#8217;t want that here. Our NIMs (Step 2) and the Ollama server are already running. 
So we take the pure Python route and simply point the example at our already-running services via the <code>.env<\/code>.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Step_4_Configure_for_local_Ollama_and_German_env\"><\/span>Step 4: Configure for local, Ollama and German (.env)<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Now the clever bit: we copy the example environment file and point it at our local services and at German.<\/p>\n<p><strong>Command:<\/strong> <code>cp env.example .env<\/code><\/p>\n<p><strong>Command:<\/strong> <code>nano .env<\/code><\/p>\n<p>The bundled <code>env.example<\/code> points to the NVIDIA cloud by default and is set to English. We remap it to our local NIM gRPC ports (remember the Magpie remap to 50052 from Step 2) and to the Ollama server. Here&#8217;s what my <code>.env<\/code> looks like:<\/p>\n<pre><code class=\"language-bash\">NVIDIA_API_KEY=dummy\r\n\r\nENABLE_SPECULATIVE_SPEECH=true\r\nCHAT_HISTORY_LIMIT=20\r\n\r\n# ASR \u2013 local Parakeet NIM (default gRPC port from Step 2)\r\nASR_SERVER_URL=localhost:50051\r\nASR_LANGUAGE=de-DE\r\n# Leave ASR_MODEL_NAME empty = default model of the local NIM (details in the note below)\r\nASR_MODEL_NAME=\r\n\r\n# TTS \u2013 local Magpie NIM (remapped to 50052 in Step 2)\r\nTTS_SERVER_URL=localhost:50052\r\nTTS_LANGUAGE=de-DE\r\nTTS_VOICE_ID=Magpie-Multilingual.DE-DE.Mia.Calm\r\nTTS_MODEL_NAME=magpie_tts_ensemble-Magpie-Multilingual\r\n\r\n# LLM \u2013 my Ollama server (OpenAI-compatible)\r\nNVIDIA_LLM_URL=http:\/\/&lt;OLLAMA-IP&gt;:11434\/v1\r\nNVIDIA_LLM_MODEL=&lt;your-ollama-model&gt;   # e.g. qwen3:8b<\/code><\/pre>\n<p>A few words on why exactly these values:<\/p>\n<ul>\n<li><strong>NVIDIA_API_KEY=dummy:<\/strong> The key isn&#8217;t needed for the local services. 
But the OpenAI-compatible client for Ollama insists on a non-empty value, hence the placeholder.<\/li>\n<li><strong>ASR_SERVER_URL \/ TTS_SERVER_URL:<\/strong> point to the gRPC ports of our two NIMs (50051 for Parakeet, 50052 for Magpie). In <code>env.example<\/code> these lines are commented out \u2013 we enable them and enter our local addresses.<\/li>\n<li><strong>ASR_LANGUAGE \/ TTS_LANGUAGE \/ TTS_VOICE_ID:<\/strong> these three are <em>not<\/em> in <code>env.example<\/code>, but <code>bot.py<\/code> does evaluate them (otherwise the default en-US, or the voice &#8220;Aria&#8221;, kicks in). So we add them by hand to switch to German and the Magpie voice from Part 4.<\/li>\n<li><strong>TTS_MODEL_NAME:<\/strong> is already the default and matches our Magpie export exactly (<code>magpie_tts_ensemble-Magpie-Multilingual<\/code>) \u2013 nothing to do here.<\/li>\n<\/ul>\n<p><strong>Note on ASR_MODEL_NAME:<\/strong> The default in the example (<code>parakeet-1.1b-en-US-asr-streaming-silero-vad-sortformer<\/code>) is an <em>English<\/em> model with a different name than the one our local <code>parakeet-1-1b-rnnt-multilingual<\/code> provides. The Parakeet container reveals the exact name on startup: look in its log for the line <code>Successfully registered: &lt;name&gt; for ASR<\/code> and enter that <code>&lt;name&gt;<\/code>. If you want it quick, leave <code>ASR_MODEL_NAME=<\/code> empty \u2013 then the local endpoint uses its default model.<\/p>\n<p><strong>Important:<\/strong> don&#8217;t comment out the line, because a commented-out <code>ASR_MODEL_NAME<\/code> makes <code>bot.py<\/code> fall back to the English default.<\/p>\n<p>You set the <strong>system prompt<\/strong> \u2013 the role of the general assistant \u2013 directly in <code>bot.py<\/code>, where the <code>messages<\/code> with the <code>system<\/code> role are defined. 
I replace the English default with:<\/p>\n<pre><code class=\"language-python\">    messages = [\r\n        {\r\n            \"role\": \"system\",\r\n            # Adjacent string literals are concatenated, so each part ends with a space.\r\n            \"content\": \"Du bist ein hilfsbereiter Sprachassistent der in allen Lebenslagen des Benutzers von Dir bereitwillig hilft. \"\r\n            \"Antworte freundlich, h\u00f6flich und in h\u00f6chstens einem kurzen Satz auf Deutsch. \"\r\n            \"Keine Aufz\u00e4hlungen oder Listen.\",\r\n        },\r\n    ]<\/code><\/pre>\n<p>A few lines further down, the <code>on_client_connected<\/code> handler holds the greeting with which the bot opens the conversation. I switch that to German as well, otherwise it might introduce itself in English:<\/p>\n<pre><code class=\"language-python\">        messages.append({\"role\": \"system\", \"content\": \"Bitte stell dich dem Nutzer kurz auf Deutsch vor.\"})<\/code><\/pre>\n<h2><span class=\"ez-toc-section\" id=\"Step_5_Start_the_voice_loop_and_test_it_in_the_browser\"><\/span>Step 5: Start the voice loop and test it in the browser<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Now everything comes together. Before you start the bot, make sure the <strong>two NIMs from Step 2 are still running<\/strong> \u2013 because of the one-time export or closed terminals, they may have been stopped in the meantime. 
Restart them if necessary:<\/p>\n<ul>\n<li><strong>Parakeet (ASR)<\/strong> as in Step 2 (ports 9000\/50051).<\/li>\n<li><strong>Magpie (TTS)<\/strong> via the fast <strong>Step-B command<\/strong> (loads the exported engine, ports 9001\/50052) \u2013 no more rebuild.<\/li>\n<\/ul>\n<p>Quickly verify both are healthy:<\/p>\n<p><strong>Command:<\/strong> <code>curl http:\/\/localhost:9000\/v1\/health\/ready<\/code><\/p>\n<p><strong>Command:<\/strong> <code>curl http:\/\/localhost:9001\/v1\/health\/ready<\/code><\/p>\n<p>If both report <code>ready<\/code>, start the bot in a third terminal \u2013 in the project folder and with the project venv active:<\/p>\n<p><strong>Command:<\/strong> <code>cd ~\/voice-agent-examples\/examples\/voice_agent_websocket<\/code><\/p>\n<p><strong>Command:<\/strong> <code>source .venv\/bin\/activate<\/code><\/p>\n<p><strong>Command:<\/strong> <code>python bot.py<\/code><\/p>\n<p>This hosts the small test UI together with the voice agent server. At the end, Uvicorn should be listening on <code>0.0.0.0:8100<\/code>. Then open in the browser:<\/p>\n<p><strong>URL:<\/strong> <code>http:\/\/&lt;SERVER-IP&gt;:8100\/static\/index.html<\/code><\/p>\n<p><strong>Microphone access without HTTPS:<\/strong> Since my A6000 Ada server is headless and has no HTTPS encryption, the browser initially won&#8217;t allow the microphone. For me the <strong>SSH tunnel<\/strong> worked reliably:<\/p>\n<p><strong>Command:<\/strong> <code>ssh -L 8100:localhost:8100 ingmar@192.168.2.119<\/code><\/p>\n<p>Then open <code>http:\/\/localhost:8100\/static\/index.html<\/code> \u2013 over <code>localhost<\/code> the browser allows the microphone. 
(Alternatively, in Chrome under <code>chrome:\/\/flags\/<\/code>, add the address <code>http:\/\/&lt;SERVER-IP&gt;:8100<\/code> under &#8220;Insecure origins treated as secure&#8221;.)<\/p>\n<p>In the UI you click <strong>Start Audio<\/strong> \u2013 and the loop is live: I speak a German sentence, Parakeet transcribes, the Ollama LLM answers, and Magpie reads the answer aloud in the voice Mia. Entirely in German and on my own hardware.<\/p>\n<p>[Placeholder: describe one or two of my own example sentences and the spoken answer; insert a screenshot of the UI.]<\/p>\n<p><strong>Port note:<\/strong> If you want to change the port, adjust it in the <code>uvicorn.run<\/code> line in <code>bot.py<\/code> and in the <code>wsUrl<\/code> in <code>static\/index.html<\/code>.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Step_6_Turn_detection_interruptibility_and_speculative_speech\"><\/span>Step 6: Turn detection, interruptibility and speculative speech<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Three mechanisms make the whole thing feel like a conversation rather than a walkie-talkie \u2013 and the nice part is that all three are already built into the example and controlled via the <code>.env<\/code> file or <code>bot.py<\/code>, so the pipeline itself doesn&#8217;t have to be rebuilt.<\/p>\n<p><strong>1. Interruptibility.<\/strong> In <code>bot.py<\/code> the pipeline is configured with <code>allow_interruptions=True<\/code>. This lets me <strong>cut in<\/strong> on the agent \u2013 it stops the output and listens again. This matters especially in the in-car scenario: you want to be able to correct course without waiting for the end of the answer.<\/p>\n<p><strong>2. Turn detection via the VAD profile.<\/strong> When have I finished speaking? 
That&#8217;s controlled by the <code>VAD_PROFILE<\/code> environment variable in <code>bot.py<\/code>:<\/p>\n<ul>\n<li><strong>ASR<\/strong> (default): the Nemotron ASR delivers the end-of-speech and interruption signals itself \u2013 no separate VAD needed.<\/li>\n<li><strong>Silero<\/strong>: with <code>VAD_PROFILE=Silero<\/code> the WebSocket transport instead gets a <code>SileroVADAnalyzer<\/code> placed in front.<\/li>\n<\/ul>\n<p><strong>3. Speculative Speech Processing.<\/strong> Already enabled in Step 4 via <code>ENABLE_SPECULATIVE_SPEECH=true<\/code>. Instead of waiting for the final transcript, the agent already works on the early, preliminary ASR transcripts \u2013 which noticeably reduces response latency. This only works with the Nemotron ASR and is switched on automatically in the example; no code edit needed.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Step_7_Measure_the_latency_chain\"><\/span>Step 7: Measure the latency chain<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>For the conversation to feel natural, what counts is the <strong>total latency<\/strong> from the end of my utterance to the first sound of the answer. The good news: Pipecat already measures these values for every stage, so we don&#8217;t have to design or program anything ourselves. Pipecat just doesn&#8217;t write them out by default, which is why we see no measurements in the terminal. The values flow through the pipeline as small <code>MetricsFrame<\/code>s and are simply discarded at the end as long as nobody listens for them or picks them up. So we have to do two things: turn the metrics <strong>on<\/strong> and attach an <strong>observer<\/strong> that writes them to the log.<\/p>\n<p><strong>Enable and log the metrics.<\/strong> Pipecat ships a ready-made <code>MetricsLogObserver<\/code> for this. So, as said at the start, you don&#8217;t have to build anything yourself. 
In <code>bot.py<\/code>, find the place where the <code>PipelineTask<\/code> is created and add the two metrics switches and the observer.<\/p>\n<p>Insert the following import line at the very top among the imports of the Python program:<\/p>\n<pre><code class=\"language-python\">from pipecat.observers.loggers.metrics_log_observer import MetricsLogObserver<\/code><\/pre>\n<p>Make the following change in <code>task = PipelineTask(...)<\/code>:<\/p>\n<pre><code class=\"language-python\">task = PipelineTask(\r\n    pipeline,\r\n    params=PipelineParams(\r\n        allow_interruptions=True,\r\n        enable_metrics=True,\r\n        enable_usage_metrics=True,\r\n        send_initial_empty_metrics=True,\r\n        report_only_initial_ttfb=True,\r\n        start_metadata={\"stream_id\": stream_id},\r\n    ),\r\n    observers=[MetricsLogObserver()],\r\n)<\/code><\/pre>\n<p>Now save the changes to the Python program.<\/p>\n<p><strong>What you see in the log afterwards.<\/strong> As soon as you restart the bot and speak a sentence, lines like these appear per conversation turn (with the service names from our local setup):<\/p>\n<blockquote><p><code>NvidiaLLMService#0 TTFB: 0.837<\/code><br \/>\n<code>NemotronTTSService#0 TTFB: 0.171<\/code><br \/>\n<code>NemotronTTSService#0 processing time: 0.0005<\/code><\/p><\/blockquote>\n<p>The values are given in <strong>seconds<\/strong>. 
Here&#8217;s how to map them to my three stages:<\/p>\n<ul>\n<li><strong>ASR (Parakeet):<\/strong> the <code>processing time<\/code> of the <code>NemotronASRService<\/code> \u2013 roughly the time to the final transcript.<\/li>\n<li><strong>LLM (Ollama) \u2013 TTFT:<\/strong> the <code>TTFB<\/code> value of the <code>NvidiaLLMService<\/code> \u2013 from the finished transcript to the first token of my Ollama model.<\/li>\n<li><strong>TTS (Magpie) \u2013 TTFA:<\/strong> the <code>TTFB<\/code> value of the <code>NemotronTTSService<\/code> \u2013 from the first LLM text to the first audio sample.<\/li>\n<\/ul>\n<p>The approximate voice-to-voice latency is then the sum of ASR + LLM TTFB + TTS TTFB \u2013 plus the network hops that, in my case, come from the separate Ollama server on the second machine.<\/p>\n<p><strong>The bonus round: measuring the real end-to-end time.<\/strong> If you want &#8220;from the end of my utterance to the first sound&#8221; as a single number, hook onto the frames that mark the turn boundaries: <code>VADUserStoppedSpeakingFrame<\/code> (I&#8217;m done \u2013 start the clock) and <code>BotStartedSpeakingFrame<\/code> (the bot starts speaking \u2013 stop the clock). 
A compact custom observer is enough:<\/p>\n<pre><code class=\"language-python\">import time\r\nfrom loguru import logger\r\nfrom pipecat.observers.base_observer import BaseObserver, FramePushed\r\nfrom pipecat.frames.frames import VADUserStoppedSpeakingFrame, BotStartedSpeakingFrame\r\n\r\nclass VoiceLatencyObserver(BaseObserver):\r\n    \"\"\"Logs the time from the end of user speech to the first bot audio.\"\"\"\r\n\r\n    def __init__(self):\r\n        super().__init__()\r\n        self._t_user_stopped = None  # monotonic timestamp; None while no turn is pending\r\n\r\n    async def on_push_frame(self, data: FramePushed) -&gt; None:\r\n        frame = data.frame\r\n        if isinstance(frame, VADUserStoppedSpeakingFrame):\r\n            # User finished speaking: start the clock (monotonic ignores wall-clock jumps)\r\n            self._t_user_stopped = time.monotonic()\r\n        elif isinstance(frame, BotStartedSpeakingFrame) and self._t_user_stopped is not None:\r\n            # Bot starts speaking: stop the clock and log once per turn\r\n            ms = (time.monotonic() - self._t_user_stopped) * 1000\r\n            logger.info(f\"Voice-to-Voice latency: {ms:.0f} ms\")\r\n            self._t_user_stopped = None<\/code><\/pre>\n<p>You simply add it to the <code>observers=[...]<\/code> list as well:<\/p>\n<pre><code class=\"language-python\">    observers=[MetricsLogObserver(), VoiceLatencyObserver()],<\/code><\/pre>\n<p><strong>Note:<\/strong> With <code>VAD_PROFILE=ASR<\/code> (our default) the Nemotron ASR delivers the end-of-speech signals itself \u2013 depending on the Pipecat version, the frame that marks the end of my utterance (and thus starts the clock) may then be named slightly differently. The <code>MetricsLogObserver<\/code> works in any case, so it&#8217;s the reliable starting point; take the custom observer as the icing on the cake.<\/p>\n<p><strong>My measurements.<\/strong> This is what the latency chain looked like on my first run. 
I asked my assistant &#8220;Kannst du mir sagen, wie du dich heute f\u00fchlst?&#8221; (Can you tell me how you&#8217;re feeling today?); the LLM was the large <code>gemma4:26b<\/code> running on the separate dual-RTX-A6000 Ollama inference server, with ASR (Parakeet) and TTS (Magpie) on the RTX A6000 Ada:<\/p>\n<table>\n<thead>\n<tr>\n<th>Stage<\/th>\n<th>What&#8217;s measured<\/th>\n<th>My value (gemma4:26b)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>ASR (Parakeet)<\/strong><\/td>\n<td>Compute latency to the final transcript<\/td>\n<td>\u2248 32 ms<\/td>\n<\/tr>\n<tr>\n<td><strong>LLM (Ollama)<\/strong><\/td>\n<td>Time to the first complete sentence (this releases TTS)<\/td>\n<td>\u2248 7.05 s<\/td>\n<\/tr>\n<tr>\n<td><strong>TTS (Magpie)<\/strong><\/td>\n<td>Time-to-First-Audio (TTFB)<\/td>\n<td>\u2248 73 ms<\/td>\n<\/tr>\n<tr>\n<td><strong>Voice-to-voice (total)<\/strong><\/td>\n<td>End of utterance \u2192 first sound<\/td>\n<td>\u2248 7.1 s<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The numbers speak for themselves: <strong>ASR and TTS together, at around 100 ms, are practically free. Almost the entire latency sits in the LLM.<\/strong> In my case the large <code>gemma4:26b<\/code> was running, and for my short question it produced a hefty 423 completion tokens, even though the system prompt asks for only a short sentence. On top of that comes a subtlety of the pipeline: the TTS stage doesn&#8217;t wait for the first <em>token<\/em>, but for the first complete <em>sentence<\/em>. A fast first token (my greeting arrived in 0.49 s) therefore helps little if the model rambles on before the first sentence ends.<\/p>\n<p>A word on the often-suspected &#8220;model loading&#8221;: yes, on the very first call Ollama pulls the model into VRAM, which can cost time once. In my measurement, though, the model was already warm \u2013 the greeting, after all, arrived in 0.49 s. 
So the seven seconds here come almost entirely from the sheer generation time of such a large model, not from loading.<\/p>\n<p>The conclusion is clear: for a fluid conversation, a 26B model is simply too heavy at this point in the loop. If you want to hit the target of <strong>under roughly 700\u2013800 ms<\/strong>, use a smaller, faster model for the reasoning in the loop. That&#8217;s exactly what the model dropdown in my upgraded UI is handy for: switch to a 7\u20138B instruct model once, speak the same sentence, and the latency drops noticeably. The two-machine split is interesting here too: the LLM on the separate Ollama server takes load off the A6000 Ada GPU (which runs ASR and TTS) but costs a network hop. Only measuring shows in black and white that this hop is negligible here \u2013 the bottleneck is the model size alone.<\/p>\n<div id=\"attachment_2599\" style=\"width: 1034px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/NVIDIA_nim_voice_assistant_web_frontend_10-1024x522.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-2599\" class=\"size-large wp-image-2599\" src=\"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/NVIDIA_nim_voice_assistant_web_frontend_10-1024x522.jpg\" alt=\"NVIDIA NIM Voice Assistant web frontend inference server\" width=\"1024\" height=\"522\" srcset=\"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/NVIDIA_nim_voice_assistant_web_frontend_10-1024x522.jpg 1024w, https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/NVIDIA_nim_voice_assistant_web_frontend_10-300x153.jpg 300w, https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/NVIDIA_nim_voice_assistant_web_frontend_10-768x391.jpg 768w, https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/NVIDIA_nim_voice_assistant_web_frontend_10-1536x782.jpg 1536w, https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/NVIDIA_nim_voice_assistant_web_frontend_10-1080x550.jpg 1080w, 
https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/NVIDIA_nim_voice_assistant_web_frontend_10.jpg 1543w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><p id=\"caption-attachment-2599\" class=\"wp-caption-text\">NVIDIA NIM Voice Assistant web frontend inference server<\/p><\/div>\n<h2><span class=\"ez-toc-section\" id=\"Tips_and_troubleshooting\"><\/span>Tips and troubleshooting<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<ul>\n<li><strong>Port conflict:<\/strong> the most common stumbling block \u2013 ASR and TTS both want 9000\/50051. That&#8217;s exactly why we remapped Magpie to 9001\/50052 (Step 2).<\/li>\n<li><strong>Sample rates between stages:<\/strong> keep a consistent 16 kHz mono; sample-rate mismatches between the audio stages are a typical source of errors.<\/li>\n<li><strong>Two machines:<\/strong> the LLM runs separately \u2013 check network latency and the reachability of the <code>base_url<\/code>.<\/li>\n<li><strong>Python 3.12:<\/strong> nvidia-pipecat doesn&#8217;t run in the venvs of the other parts \u2013 use a dedicated environment.<\/li>\n<li><strong>Service rename as of nvidia-pipecat 0.4.0:<\/strong> if you write your own pipeline code, import <code>NemotronASRService<\/code> or <code>NemotronTTSService<\/code> (from <code>nvidia_pipecat.services.riva_speech<\/code>). The old names <code>RivaASRService<\/code>\/<code>RivaTTSService<\/code> now only work as deprecated aliases that emit a warning.<\/li>\n<li><strong>Exact voice name:<\/strong> take the German Magpie <code>voice_id<\/code> exactly as listed by <code>--list-voices<\/code>.<\/li>\n<\/ul>\n<h2><span class=\"ez-toc-section\" id=\"Conclusion\"><\/span>Conclusion<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>With that, the first complete <strong>local voice loop<\/strong> is in place: I speak, Parakeet transcribes, my Ollama LLM answers, Magpie speaks. I&#8217;ve built the whole setup end to end on my own hardware, interruptible, in German. 
The skeleton of my agent is alive.<\/p>\n<p>For now, though, the agent just &#8220;thinks&#8221; freely off the cuff. In the next part I replace the bare LLM with the <strong>NAT agent from Part 5<\/strong> (Path B). That&#8217;s another big step for the setup, because then the agent can trigger real actions \u2013 telling the time, for instance \u2013 and turns from a conversation partner into an <em>assistant<\/em>. In Part 8, finally, the <strong>wake word<\/strong> comes in as the doorman in front of it.<\/p>\n<p>If you build the loop yourself: feel free to write in the comments what total latency you achieve on your hardware.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Every building block is now available locally: ASR (Parakeet, Canary), TTS (Magpie), an LLM via my Ollama server and the orchestrator (NAT, Part 5). Now I&#8217;m connecting them into a continuous, interruptible speech loop. This will be my first small local voice agent. What I&#8217;m aiming for is a kind of general-purpose agent: I speak, 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":2600,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_et_pb_use_builder":"","_et_pb_old_content":"","_et_gb_content_width":"","footnotes":""},"categories":[162,8],"tags":[1708,1703,1711,1704,1699,1707,1617,1705,1698,306,1706,1712,1176,1700,1032,1709,1701,1710,1702],"class_list":["post-2603","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-large-language-models-en","category-news","tag-asr-tts-llm-pipeline","tag-barge-in","tag-fastapi-websocket","tag-latency-measurement","tag-local-voice-agent","tag-magpie-tts","tag-nvidia-nim","tag-nvidia-pipecat","tag-nvidia-pipecat-locally","tag-ollama-en","tag-parakeet-asr","tag-python-3-12-uv","tag-rtx-a6000-ada","tag-self-hosted-speech-assistant","tag-sovereign-ai","tag-speculative-speech","tag-speech-to-speech","tag-turn-detection","tag-voice-loop","et-has-post-format-content","et_post_format-et-post-format-standard"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v28.1 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Local Voice Agent: Wiring ASR, LLM and TTS into a Loop with NVIDIA Pipecat Locally - Exploring the Future: Inside the AI Box<\/title>\n<meta name=\"description\" content=\"Build an interruptible German voice agent with NVIDIA Pipecat locally \u2013 wiring Parakeet ASR, Magpie TTS and an Ollama LLM into one speech-to-speech loop.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/ai-box.eu\/en\/news\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\/2603\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Local Voice Agent: Wiring ASR, LLM and TTS into a Loop with NVIDIA 
Pipecat Locally - Exploring the Future: Inside the AI Box\" \/>\n<meta property=\"og:description\" content=\"Build an interruptible German voice agent with NVIDIA Pipecat locally \u2013 wiring Parakeet ASR, Magpie TTS and an Ollama LLM into one speech-to-speech loop.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/ai-box.eu\/en\/news\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\/2603\/\" \/>\n<meta property=\"og:site_name\" content=\"Exploring the Future: Inside the AI Box\" \/>\n<meta property=\"article:published_time\" content=\"2026-06-17T04:28:58+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-06-18T02:43:51+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/NVIDIA_nim_voice_assistant_web_frontend_10.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1543\" \/>\n\t<meta property=\"og:image:height\" content=\"786\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Maker\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Ingmar_Stapel\" \/>\n<meta name=\"twitter:site\" content=\"@Ingmar_Stapel\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Maker\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"23 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\\\/2603\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\\\/2603\\\/\"},\"author\":{\"name\":\"Maker\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/#\\\/schema\\\/person\\\/cc91d08618b3feeef6926591b465eab1\"},\"headline\":\"Local Voice Agent: Wiring ASR, LLM and TTS into a Loop with NVIDIA Pipecat Locally\",\"datePublished\":\"2026-06-17T04:28:58+00:00\",\"dateModified\":\"2026-06-18T02:43:51+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\\\/2603\\\/\"},\"wordCount\":3771,\"commentCount\":0,\"image\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\\\/2603\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/ai-box.eu\\\/wp-content\\\/uploads\\\/2026\\\/06\\\/NVIDIA_nim_voice_assistant_web_frontend_10.jpg\",\"keywords\":[\"ASR TTS LLM Pipeline\",\"barge-in\",\"FastAPI WebSocket\",\"latency measurement\",\"local voice agent\",\"Magpie TTS\",\"NVIDIA NIM\",\"NVIDIA Pipecat\",\"NVIDIA Pipecat locally\",\"Ollama\",\"Parakeet ASR\",\"Python 3.12 uv\",\"RTX A6000 Ada\",\"self-hosted speech assistant\",\"sovereign AI\",\"Speculative Speech\",\"speech-to-speech\",\"Turn-Detection\",\"voice loop\"],\"articleSection\":[\"Large Language 
Models\",\"News\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\\\/2603\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\\\/2603\\\/\",\"url\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\\\/2603\\\/\",\"name\":\"Local Voice Agent: Wiring ASR, LLM and TTS into a Loop with NVIDIA Pipecat Locally - Exploring the Future: Inside the AI Box\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\\\/2603\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\\\/2603\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/ai-box.eu\\\/wp-content\\\/uploads\\\/2026\\\/06\\\/NVIDIA_nim_voice_assistant_web_frontend_10.jpg\",\"datePublished\":\"2026-06-17T04:28:58+00:00\",\"dateModified\":\"2026-06-18T02:43:51+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/#\\\/schema\\\/person\\\/cc91d08618b3feeef6926591b465eab1\"},\"description\":\"Build an interruptible German voice agent with NVIDIA Pipecat locally \u2013 wiring Parakeet ASR, Magpie TTS and an Ollama LLM into one speech-to-speech 
loop.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\\\/2603\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\\\/2603\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\\\/2603\\\/#primaryimage\",\"url\":\"https:\\\/\\\/ai-box.eu\\\/wp-content\\\/uploads\\\/2026\\\/06\\\/NVIDIA_nim_voice_assistant_web_frontend_10.jpg\",\"contentUrl\":\"https:\\\/\\\/ai-box.eu\\\/wp-content\\\/uploads\\\/2026\\\/06\\\/NVIDIA_nim_voice_assistant_web_frontend_10.jpg\",\"width\":1543,\"height\":786,\"caption\":\"NVIDIA NIM Voice Assistant web frontend hardware\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/news\\\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\\\/2603\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Start\",\"item\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Local Voice Agent: Wiring ASR, LLM and TTS into a Loop with NVIDIA Pipecat Locally\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/#website\",\"url\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/\",\"name\":\"Exploring the Future: Inside the AI Box\",\"description\":\"Inside the AI Box, we share our experiences and discoveries in the world of artificial 
intelligence.\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/#\\\/schema\\\/person\\\/cc91d08618b3feeef6926591b465eab1\",\"name\":\"Maker\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g\",\"caption\":\"Maker\"},\"description\":\"I live in Bavaria near Munich. In my head I always have many topics and try out especially in the field of Internet new media much in my spare time. I write on the blog because it makes me fun to report about the things that inspire me. I am happy about every comment, about suggestion and very about questions.\",\"sameAs\":[\"https:\\\/\\\/ai-box.eu\"],\"url\":\"https:\\\/\\\/ai-box.eu\\\/en\\\/author\\\/ingmars\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Local Voice Agent: Wiring ASR, LLM and TTS into a Loop with NVIDIA Pipecat Locally - Exploring the Future: Inside the AI Box","description":"Build an interruptible German voice agent with NVIDIA Pipecat locally \u2013 wiring Parakeet ASR, Magpie TTS and an Ollama LLM into one speech-to-speech loop.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/ai-box.eu\/en\/news\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\/2603\/","og_locale":"en_US","og_type":"article","og_title":"Local Voice Agent: Wiring ASR, LLM and TTS into a Loop with NVIDIA Pipecat Locally - Exploring the Future: Inside the AI Box","og_description":"Build an interruptible German voice agent with NVIDIA Pipecat locally \u2013 wiring Parakeet ASR, Magpie TTS and an Ollama LLM into one speech-to-speech loop.","og_url":"https:\/\/ai-box.eu\/en\/news\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\/2603\/","og_site_name":"Exploring the Future: Inside the AI Box","article_published_time":"2026-06-17T04:28:58+00:00","article_modified_time":"2026-06-18T02:43:51+00:00","og_image":[{"width":1543,"height":786,"url":"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/NVIDIA_nim_voice_assistant_web_frontend_10.jpg","type":"image\/jpeg"}],"author":"Maker","twitter_card":"summary_large_image","twitter_creator":"@Ingmar_Stapel","twitter_site":"@Ingmar_Stapel","twitter_misc":{"Written by":"Maker","Est. 
reading time":"23 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/ai-box.eu\/en\/news\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\/2603\/#article","isPartOf":{"@id":"https:\/\/ai-box.eu\/en\/news\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\/2603\/"},"author":{"name":"Maker","@id":"https:\/\/ai-box.eu\/en\/#\/schema\/person\/cc91d08618b3feeef6926591b465eab1"},"headline":"Local Voice Agent: Wiring ASR, LLM and TTS into a Loop with NVIDIA Pipecat Locally","datePublished":"2026-06-17T04:28:58+00:00","dateModified":"2026-06-18T02:43:51+00:00","mainEntityOfPage":{"@id":"https:\/\/ai-box.eu\/en\/news\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\/2603\/"},"wordCount":3771,"commentCount":0,"image":{"@id":"https:\/\/ai-box.eu\/en\/news\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\/2603\/#primaryimage"},"thumbnailUrl":"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/NVIDIA_nim_voice_assistant_web_frontend_10.jpg","keywords":["ASR TTS LLM Pipeline","barge-in","FastAPI WebSocket","latency measurement","local voice agent","Magpie TTS","NVIDIA NIM","NVIDIA Pipecat","NVIDIA Pipecat locally","Ollama","Parakeet ASR","Python 3.12 uv","RTX A6000 Ada","self-hosted speech assistant","sovereign AI","Speculative Speech","speech-to-speech","Turn-Detection","voice loop"],"articleSection":["Large Language 
Models","News"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/ai-box.eu\/en\/news\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\/2603\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/ai-box.eu\/en\/news\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\/2603\/","url":"https:\/\/ai-box.eu\/en\/news\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\/2603\/","name":"Local Voice Agent: Wiring ASR, LLM and TTS into a Loop with NVIDIA Pipecat Locally - Exploring the Future: Inside the AI Box","isPartOf":{"@id":"https:\/\/ai-box.eu\/en\/#website"},"primaryImageOfPage":{"@id":"https:\/\/ai-box.eu\/en\/news\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\/2603\/#primaryimage"},"image":{"@id":"https:\/\/ai-box.eu\/en\/news\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\/2603\/#primaryimage"},"thumbnailUrl":"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/NVIDIA_nim_voice_assistant_web_frontend_10.jpg","datePublished":"2026-06-17T04:28:58+00:00","dateModified":"2026-06-18T02:43:51+00:00","author":{"@id":"https:\/\/ai-box.eu\/en\/#\/schema\/person\/cc91d08618b3feeef6926591b465eab1"},"description":"Build an interruptible German voice agent with NVIDIA Pipecat locally \u2013 wiring Parakeet ASR, Magpie TTS and an Ollama LLM into one speech-to-speech 
loop.","breadcrumb":{"@id":"https:\/\/ai-box.eu\/en\/news\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\/2603\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/ai-box.eu\/en\/news\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\/2603\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ai-box.eu\/en\/news\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\/2603\/#primaryimage","url":"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/NVIDIA_nim_voice_assistant_web_frontend_10.jpg","contentUrl":"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/06\/NVIDIA_nim_voice_assistant_web_frontend_10.jpg","width":1543,"height":786,"caption":"NVIDIA NIM Voice Assistant web frontend hardware"},{"@type":"BreadcrumbList","@id":"https:\/\/ai-box.eu\/en\/news\/local-voice-agent-wiring-asr-llm-and-tts-into-a-loop-with-nvidia-pipecat-locally\/2603\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Start","item":"https:\/\/ai-box.eu\/en\/"},{"@type":"ListItem","position":2,"name":"Local Voice Agent: Wiring ASR, LLM and TTS into a Loop with NVIDIA Pipecat Locally"}]},{"@type":"WebSite","@id":"https:\/\/ai-box.eu\/en\/#website","url":"https:\/\/ai-box.eu\/en\/","name":"Exploring the Future: Inside the AI Box","description":"Inside the AI Box, we share our experiences and discoveries in the world of artificial 
intelligence.","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/ai-box.eu\/en\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/ai-box.eu\/en\/#\/schema\/person\/cc91d08618b3feeef6926591b465eab1","name":"Maker","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/e96b93fc3c7e50c1f21c5c6b1f146dc4867936141360830b328947b32cacf93a?s=96&d=mm&r=g","caption":"Maker"},"description":"I live in Bavaria near Munich. I always have many topics on my mind and try out a lot in my spare time, especially in the field of Internet and new media. I write this blog because I enjoy reporting on the things that inspire me. 
I am happy about every comment, every suggestion and especially about questions.","sameAs":["https:\/\/ai-box.eu"],"url":"https:\/\/ai-box.eu\/en\/author\/ingmars\/"}]}},"_links":{"self":[{"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/posts\/2603","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/comments?post=2603"}],"version-history":[{"count":2,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/posts\/2603\/revisions"}],"predecessor-version":[{"id":2605,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/posts\/2603\/revisions\/2605"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/media\/2600"}],"wp:attachment":[{"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/media?parent=2603"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/categories?post=2603"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ai-box.eu\/en\/wp-json\/wp\/v2\/tags?post=2603"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}