{"id":2277,"date":"2026-05-16T19:43:29","date_gmt":"2026-05-16T19:43:29","guid":{"rendered":"https:\/\/ai-box.eu\/?p=2277"},"modified":"2026-05-16T19:50:27","modified_gmt":"2026-05-16T19:50:27","slug":"nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer","status":"publish","type":"post","link":"https:\/\/ai-box.eu\/en\/large-language-models-en\/nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer\/2277\/","title":{"rendered":"NeMo Agent Toolkit on the RTX A6000 Ada &#8211; From Inference Layer to Orchestrator Layer"},"content":{"rendered":"<p>In my four-part <a href=\"https:\/\/ai-box.eu\/en\/top-story-en\/tensorrt-llm-on-the-rtx-a6000-ada-preparing-for-the-edge-llm-ecosystem\/2255\/\">TensorRT-LLM series<\/a> I showed how I optimize inference performance on the RTX A6000 Ada \u2014 251 tokens\/sec with Qwen-2.5-7B in FP8, deployable <code>.engine<\/code> files, all cleanly reproducible. But in doing so, I had only built one part of the stack: the <strong>inference layer<\/strong>.<\/p>\n<p>Inspired by the ever-present GenAI agent publications, it became clear to me: a production AI stack consists of multiple layers. Inference is only one of them. Above it sits the <strong>orchestrator<\/strong> \u2014 the layer that decides which tool gets called when, that does multi-step reasoning, that produces the actual agent behavior.<\/p>\n<p>NVIDIA has released the <strong>NeMo Agent Toolkit (NAT)<\/strong> as an open-source library for exactly this. In this post I&#8217;ll show you how I installed NAT on my Ubuntu 24.04 server \u2014 cleanly isolated in a Python venv, with my existing Ollama setup as the backend \u2014 and how I got my first ReAct agent up and running. Including two non-trivial pitfalls that I&#8217;d rather spare you.<\/p>\n<p>Here&#8217;s the link to the NeMo Agent Toolkit: <a href=\"https:\/\/github.com\/NVIDIA\/NeMo-Agent-Toolkit\" target=\"_blank\" rel=\"noopener\">https:\/\/github.com\/NVIDIA\/NeMo-Agent-Toolkit<\/a><\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_83 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<span class=\"ez-toc-title-toggle\"><a href=\"#\" class=\"ez-toc-pull-right ez-toc-btn ez-toc-btn-xs ez-toc-btn-default ez-toc-toggle\" aria-label=\"Toggle Table of Content\"><span class=\"ez-toc-js-icon-con\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/span><\/a><\/span><\/div>\n<nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link 
ez-toc-heading-1\" href=\"https:\/\/ai-box.eu\/en\/large-language-models-en\/nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer\/2277\/#What_is_this_actually_about\" >What is this actually about?<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/ai-box.eu\/en\/large-language-models-en\/nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer\/2277\/#Prerequisites\" >Prerequisites<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/ai-box.eu\/en\/large-language-models-en\/nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer\/2277\/#Architecture_overview_what_runs_where\" >Architecture overview: what runs where?<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/ai-box.eu\/en\/large-language-models-en\/nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer\/2277\/#Step_1_Install_uv\" >Step 1: Install uv<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/ai-box.eu\/en\/large-language-models-en\/nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer\/2277\/#Step_2_Create_the_directory_structure\" >Step 2: Create the directory structure<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/ai-box.eu\/en\/large-language-models-en\/nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer\/2277\/#Step_3_Create_a_Python_venv_for_NAT\" >Step 3: Create a Python venv for NAT<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/ai-box.eu\/en\/large-language-models-en\/nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer\/2277\/#Step_4_Install_the_NeMo_Agent_Toolkit\" >Step 4: Install the NeMo Agent Toolkit<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/ai-box.eu\/en\/large-language-models-en\/nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer\/2277\/#Step_5_Verify_the_Ollama_OpenAI_API\" >Step 5: Verify the Ollama OpenAI API<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/ai-box.eu\/en\/large-language-models-en\/nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer\/2277\/#Step_6_First_agent_workflow_configuration\" >Step 6: First agent workflow configuration<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-10\" href=\"https:\/\/ai-box.eu\/en\/large-language-models-en\/nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer\/2277\/#Step_7_Run_the_agent\" >Step 7: Run the agent<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-11\" href=\"https:\/\/ai-box.eu\/en\/large-language-models-en\/nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer\/2277\/#The_Qwen_trap_when_the_agent_suddenly_speaks_Chinese\" >The Qwen trap: when the agent suddenly speaks Chinese<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-12\" 
href=\"https:\/\/ai-box.eu\/en\/large-language-models-en\/nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer\/2277\/#Step_8_Tool_extension_with_Wikipedia_search\" >Step 8: Tool extension with Wikipedia search<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-13\" href=\"https:\/\/ai-box.eu\/en\/large-language-models-en\/nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer\/2277\/#Pitfalls_youll_probably_hit\" >Pitfalls you&#8217;ll probably hit<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-14\" href=\"https:\/\/ai-box.eu\/en\/large-language-models-en\/nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer\/2277\/#1_Qwen-7B_doesnt_always_do_ReAct_cleanly\" >1. Qwen-7B doesn&#8217;t always do ReAct cleanly<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-15\" href=\"https:\/\/ai-box.eu\/en\/large-language-models-en\/nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer\/2277\/#2_Custom_system_prompt_needs_template_variables\" >2. Custom system prompt needs template variables<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-16\" href=\"https:\/\/ai-box.eu\/en\/large-language-models-en\/nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer\/2277\/#3_Connection_error_to_Ollama\" >3. Connection error to Ollama<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-17\" href=\"https:\/\/ai-box.eu\/en\/large-language-models-en\/nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer\/2277\/#4_Performance_expectations\" >4. Performance expectations<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-18\" href=\"https:\/\/ai-box.eu\/en\/large-language-models-en\/nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer\/2277\/#5_What_you_should_NOT_do\" >5. What you should NOT do<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-19\" href=\"https:\/\/ai-box.eu\/en\/large-language-models-en\/nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer\/2277\/#What_comes_next\" >What comes next?<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-20\" href=\"https:\/\/ai-box.eu\/en\/large-language-models-en\/nemo-agent-toolkit-auf-der-rtx-a6000-ada-vom-inferenz-layer-zum-orchestrator-layer\/2277\/#Conclusion\" >Conclusion<\/a><\/li><\/ul><\/nav><\/div>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"What_is_this_actually_about\"><\/span>What is this actually about?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>An <strong>agent<\/strong> in the modern LLM context is more than a chatbot. A chatbot gets a question and answers it. An agent gets a task and decides on its own which tools to call, in which order, and when it has enough information for a final answer. 
The classic pattern is called <strong>ReAct<\/strong> (Reason + Act): the model thinks, picks an action (tool call), observes the result, thinks again \u2014 until it has a finished answer.<\/p>\n<p>At the architectural level, it looks like this:<\/p>\n<div id=\"attachment_2278\" style=\"width: 894px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_NeMo_Agent_Toolkit_02-scaled.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-2278\" class=\"wp-image-2278 \" src=\"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_NeMo_Agent_Toolkit_02-1024x710.jpg\" alt=\"NVIDIA NeMo Agent Toolkit - Architecture\" width=\"884\" height=\"613\" srcset=\"https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_NeMo_Agent_Toolkit_02-1024x710.jpg 1024w, https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_NeMo_Agent_Toolkit_02-300x208.jpg 300w, https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_NeMo_Agent_Toolkit_02-768x532.jpg 768w, https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_NeMo_Agent_Toolkit_02-1536x1065.jpg 1536w, https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_NeMo_Agent_Toolkit_02-2048x1420.jpg 2048w, https:\/\/ai-box.eu\/wp-content\/uploads\/2026\/05\/NVIDIA_NeMo_Agent_Toolkit_02-1080x749.jpg 1080w\" sizes=\"(max-width: 884px) 100vw, 884px\" \/><\/a><p id=\"caption-attachment-2278\" class=\"wp-caption-text\">NVIDIA NeMo Agent Toolkit &#8211; Architecture<\/p><\/div>\n<p>NAT is explicitly built to be <strong>framework-agnostic<\/strong>: you can hang LangChain, LlamaIndex, CrewAI or even custom frameworks behind it. And the inference layer is decoupled via the OpenAI-compatible API. That&#8217;s a very flexible architecture \u2014 what&#8217;s hanging underneath doesn&#8217;t matter to NAT at all. Ollama, vLLM, TensorRT-LLM, or an NVIDIA NIM \u2014 as long as it speaks OpenAI format, it works.<\/p>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Prerequisites\"><\/span>Prerequisites<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>This guide assumes that your server is already fundamentally prepared for AI inference. If not, work through my foundation post first: <a href=\"https:\/\/ai-box.eu\/en\/top-story-en\/preparing-an-ubuntu-24-04-server-for-ai-inference-cuda-docker-nvidia-container-toolkit\/2268\/\">Preparing an Ubuntu 24.04 server for AI inference<\/a>.<\/p>\n<p>Specifically you need:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Ubuntu 24.04 LTS<\/strong><\/li>\n<li>An <strong>NVIDIA GPU<\/strong> (I use the RTX A6000 Ada \u2014 for 7B models in 4-bit quantization any card with at least 8 GB VRAM will do)<\/li>\n<li><strong>Ollama already running<\/strong> with <code>qwen2.5:7b-instruct<\/code> in the model cache (or any other model of your choice)<\/li>\n<li><strong>Python 3.11, 3.12 or 3.13<\/strong> on the host \u2014 system Python on Ubuntu 24.04 is Python 3.12, which is fine<\/li>\n<li>Internet connection for the initial package downloads (about 1.5 GB)<\/li>\n<\/ul>\n
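<p>Before moving on, it is worth a quick sanity check that the essentials from this list are actually in place: the GPU driver is visible, the model sits in the Ollama cache, and the Python version fits. Three commands are enough for that:<\/p>\n<p><strong>Command:<\/strong> <code>nvidia-smi<\/code><\/p>\n<p><strong>Command:<\/strong> <code>ollama list<\/code><\/p>\n<p><strong>Command:<\/strong> <code>python3 --version<\/code><\/p>\n<p>If <code>qwen2.5:7b-instruct<\/code> doesn&#8217;t show up in the <code>ollama list<\/code> output, pull it first with <code>ollama pull qwen2.5:7b-instruct<\/code>.<\/p>\n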
<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Architecture_overview_what_runs_where\"><\/span>Architecture overview: what runs where?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Before we get started, a quick look at the clean separation of components. On my server, multiple layers coexist with different isolation mechanisms:<\/p>\n<figure class=\"wp-block-table\">\n<table>\n<thead>\n<tr>\n<th>Component<\/th>\n<th>Where it runs<\/th>\n<th>Why there<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>NVIDIA drivers, CUDA, Docker<\/td>\n<td>System-wide<\/td>\n<td>Needed by all applications, no isolation required<\/td>\n<\/tr>\n<tr>\n<td>Ollama<\/td>\n<td>System service (systemctl)<\/td>\n<td>Daemon character, listens on port 11434<\/td>\n<\/tr>\n<tr>\n<td>TensorRT-LLM<\/td>\n<td>Docker container (NGC image)<\/td>\n<td>Complex dependency stack \u2192 container isolates it<\/td>\n<\/tr>\n<tr>\n<td><strong>NeMo Agent Toolkit (NAT)<\/strong><\/td>\n<td><strong>Python venv on the host<\/strong><\/td>\n<td>Medium complexity \u2014 venv is sufficient<\/td>\n<\/tr>\n<tr>\n<td>Custom tools (Python)<\/td>\n<td>In the same venv as NAT<\/td>\n<td>Direct access to NAT API<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p>The key point: NAT doesn&#8217;t need a container because it&#8217;s a pure Python library. But it absolutely needs its own Python environment, because its roughly 80 to 120 dependencies would otherwise collide with other Python projects or system tools. The NAT documentation explicitly warns against <code>conda<\/code>, by the way. So I use vanilla venv.<\/p>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Step_1_Install_uv\"><\/span>Step 1: Install uv<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><strong>uv<\/strong> is a modern Python package manager that is about 10-100x faster than pip. The NVIDIA documentation explicitly recommends it as the preferred option for the NAT installation. The installation boils down to three commands, executed one after the other:<\/p>\n<p><strong>Command:<\/strong> <code>curl -LsSf https:\/\/astral.sh\/uv\/install.sh | sh<\/code><\/p>\n<p><strong>Command:<\/strong> <code>source ~\/.bashrc<\/code><\/p>\n<p><strong>Command:<\/strong> <code>uv --version<\/code><\/p>\n<p>On my system uv ends up at <code>~\/.local\/bin\/uv<\/code>, and the version line looks something like <code>uv 0.x.x<\/code>. If the verification fails, your shell most likely hasn&#8217;t picked up the new PATH entry yet; a <code>source ~\/.profile<\/code> or a fresh terminal helps.<\/p>\n
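<p>If <code>uv --version<\/code> still fails after that, you can add the install directory to your PATH manually. The installer normally takes care of this; the path below assumes the default install location:<\/p>\n<p><strong>Command:<\/strong> <code>echo 'export PATH=\"$HOME\/.local\/bin:$PATH\"' &gt;&gt; ~\/.bashrc &amp;&amp; source ~\/.bashrc<\/code><\/p>\n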
<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Step_2_Create_the_directory_structure\"><\/span>Step 2: Create the directory structure<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>I create my own project directory for NAT, separate from my other AI projects. The advantage: I can later have any number of agent projects running in parallel without their dependencies getting in each other&#8217;s way.<\/p>\n<p><strong>Command:<\/strong> <code>mkdir -p ~\/nat-playground\/configs<\/code><\/p>\n<p><strong>Command:<\/strong> <code>mkdir -p ~\/nat-playground\/tools<\/code><\/p>\n<p><strong>Command:<\/strong> <code>cd ~\/nat-playground<\/code><\/p>\n<p>The eventual directory structure looks like this:<\/p>\n<pre class=\"wp-block-preformatted\">~\/nat-playground\/\r\n\u251c\u2500\u2500 .venv\/                     # Own Python environment (Step 3)\r\n\u251c\u2500\u2500 configs\/                   # YAML workflow configurations\r\n\u2514\u2500\u2500 tools\/                     # Custom tools (Python modules)<\/pre>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Step_3_Create_a_Python_venv_for_NAT\"><\/span>Step 3: Create a Python venv for NAT<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Now we create the virtual Python environment. With uv this happens in a single step, and I&#8217;m still amazed how easy it is:<\/p>\n<p><strong>Command:<\/strong> <code>uv venv --python 3.12 --seed .venv<\/code><\/p>\n<p><strong>Command:<\/strong> <code>source .venv\/bin\/activate<\/code><\/p>\n<p>The <code>--seed<\/code> flag ensures that pip is installed inside the venv as well, which simplifies plugin installations. You can tell the venv is active by the <code>(.venv)<\/code> in the shell prompt.<\/p>\n<p>What happens here? A venv is basically just a directory with its own <code>python<\/code> binary (often a symlink to the system Python), its own <code>site-packages<\/code> folder, and an activation script. When activated, the shell PATH is manipulated so that calls to <code>python<\/code> or <code>pip<\/code> find the venv binaries first. Packages consequently land in the venv-owned <code>site-packages<\/code> folder \u2014 completely isolated from the system, which is exactly what we want.<\/p>\n<p>Advantage: if I break the installation, I just delete <code>.venv\/<\/code> and start over. The system Python remains untouched.<\/p>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Step_4_Install_the_NeMo_Agent_Toolkit\"><\/span>Step 4: Install the NeMo Agent Toolkit<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>With the venv active, I install NAT with the LangChain plugin. LangChain is the standard framework bridge and is needed for the ReAct agent:<\/p>\n<p><strong>Command:<\/strong> <code>uv pip install \"nvidia-nat[langchain]\"<\/code><\/p>\n<p>The command pulls about 80-120 packages \u2014 LangChain, Pydantic, httpx, the OpenAI client and quite a bit more. With uv the installation took about 5-10 minutes for me.<\/p>\n<p>Now let&#8217;s check whether NAT has actually been installed.<\/p>\n<p><strong>Command:<\/strong> <code>nat --version<\/code><\/p>\n<p><strong>Command:<\/strong> <code>nat info components -t llm_provider<\/code><\/p>\n<p>The second command lists the available LLM providers. You should see at least <code>openai<\/code> and <code>nim<\/code> in the list. The <code>openai<\/code> provider is the one we&#8217;ll use for Ollama. It works with any OpenAI-compatible endpoint.<\/p>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Step_5_Verify_the_Ollama_OpenAI_API\"><\/span>Step 5: Verify the Ollama OpenAI API<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Since version 0.1.24, Ollama provides an OpenAI-compatible endpoint at <code>\/v1<\/code>. A quick function test before we connect NAT to it. It&#8217;s important that your Ollama inference server is running and reachable:<\/p>\n<p><strong>Command:<\/strong> <code>curl http:\/\/localhost:11434\/v1\/chat\/completions -H \"Content-Type: application\/json\" -d '{<br \/>\n\"model\": \"qwen2.5:7b-instruct\",<br \/>\n\"messages\": [{\"role\":\"user\",\"content\":\"Answer with OK.\"}],<br \/>\n\"max_tokens\": 10<br \/>\n}'<\/code><\/p>\n<p>If you get back a JSON response with <code>\"OK\"<\/code> in the content field, everything is ready.<\/p>\n
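<p>You can also list the models the endpoint exposes. The OpenAI-compatible <code>\/v1\/models<\/code> route is available in current Ollama versions, and the <code>model_name<\/code> we&#8217;ll put into the NAT config in Step 6 has to match one of the IDs returned here:<\/p>\n<p><strong>Command:<\/strong> <code>curl http:\/\/localhost:11434\/v1\/models<\/code><\/p>\n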
<p><strong>Note<\/strong>: If NAT is supposed to run on a different machine than Ollama, Ollama must listen on all interfaces (see pitfall 3 further down for a systemd override sketch).<\/p>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Step_6_First_agent_workflow_configuration\"><\/span>Step 6: First agent workflow configuration<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>NAT workflows are defined in YAML files. We start with a minimal workflow that uses a single tool, which returns the current time. The workflow is stored in a *.yml file; I created the following file in the configs folder using nano.<\/p>\n<p><strong>Command:<\/strong> <code>cd ~\/nat-playground\/configs<\/code><\/p>\n<p><strong>Command:<\/strong> <code>nano ollama_agent.yml<\/code><\/p>\n<p>Into this file you paste the following content, which describes the workflow.<\/p>\n<pre class=\"wp-block-code\"><code>llms:\r\n  ollama_llm:\r\n    _type: openai\r\n    api_key: \"EMPTY\"\r\n    base_url: \"http:\/\/localhost:11434\/v1\"\r\n    model_name: \"qwen2.5:7b-instruct\"\r\n    temperature: 0.0\r\n    max_retries: 3\r\n\r\nfunctions:\r\n  current_datetime:\r\n    _type: current_datetime\r\n\r\nworkflow:\r\n  _type: react_agent\r\n  tool_names: [current_datetime]\r\n  llm_name: ollama_llm\r\n  verbose: true\r\n  parse_agent_response_max_retries: 3<\/code><\/pre>\n<p>With CTRL + X followed by Y, you save the change.<\/p>\n<p>Three main blocks:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong><code>llms:<\/code><\/strong> Defines the available LLM backends. <code>_type: openai<\/code> makes Ollama usable as a generic OpenAI endpoint. The <code>api_key: \"EMPTY\"<\/code> is mandatory \u2014 even though Ollama doesn&#8217;t check it, the field has to be set, otherwise you&#8217;ll get a validation error.<\/li>\n<li><strong><code>functions:<\/code><\/strong> Defines the tools the agent is allowed to call. <code>current_datetime<\/code> is a built-in of NAT.<\/li>\n<li><strong><code>workflow:<\/code><\/strong> Defines the agent pattern. <code>react_agent<\/code> is the classic Reason+Act pattern.<\/li>\n<\/ul>\n
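<p>Because the <code>_type: openai<\/code> provider only cares about the endpoint, swapping the inference backend is a pure config change. As an illustrative sketch (port and model name are assumptions; use whatever your server actually reports under <code>\/v1\/models<\/code>), the same <code>llms:<\/code> block pointed at a vLLM or TensorRT-LLM OpenAI-compatible server could look like this:<\/p>\n<pre class=\"wp-block-code\"><code>llms:\r\n  local_llm:\r\n    _type: openai\r\n    api_key: \"EMPTY\"\r\n    # Assumed endpoint of a vLLM \/ TensorRT-LLM OpenAI-compatible server:\r\n    base_url: \"http:\/\/localhost:8000\/v1\"\r\n    # Assumed model ID; must match what the server reports under \/v1\/models:\r\n    model_name: \"Qwen2.5-7B-Instruct\"\r\n    temperature: 0.0\r\n    max_retries: 3<\/code><\/pre>\n<p>Everything else in the workflow stays untouched \u2014 that is exactly the decoupling described at the beginning.<\/p>\n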
<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Step_7_Run_the_agent\"><\/span>Step 7: Run the agent<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>When you now want to execute the agent workflow, make sure you are still inside the active .venv. Then run the following commands.<\/p>\n<p><strong>Command:<\/strong> <code>cd ~\/nat-playground<\/code><\/p>\n<p><strong>Command:<\/strong> <code>nat run --config_file configs\/ollama_agent.yml --input \"What time is it now and what can I derive from that for my workday?\"<\/code><\/p>\n<p>If everything works, you&#8217;ll see a ReAct trace in the terminal that might look like this. In my case, there were still a lot of Chinese characters in between.<\/p>\n<pre class=\"wp-block-code\"><code>Thought: I need to find out the current time.\r\nAction: current_datetime\r\nAction Input: {}\r\nObservation: 2026-05-16 16:56:52 +0000\r\nThought: It's Saturday evening, 18:56 Munich time. From this I can derive...\r\nFinal Answer: It is currently 18:56 on May 16, 2026...<\/code><\/pre>\n<p>With this, you have cleanly coupled the inference layer (Ollama) and the orchestrator layer (NAT). Exactly the architectural separation we want to achieve, in this case with Ollama instead of TensorRT-LLM.<\/p>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"The_Qwen_trap_when_the_agent_suddenly_speaks_Chinese\"><\/span>The Qwen trap: when the agent suddenly speaks Chinese<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>This is where it gets interesting. On my first run, I got the following trace:<\/p>\n<pre class=\"wp-block-code\"><code>Thought: \u4e3a\u4e86\u56de\u7b54\u8fd9\u4e2a\u95ee\u9898\uff0c\u6211\u9700\u8981\u83b7\u53d6\u5f53\u524d\u7684\u65f6\u95f4\u548c\u65e5\u671f\u4fe1\u606f...\r\nAction: current_datetime\r\nAction Input: {\"unused\": \"2023-11-29T15:48:00Z\"}\r\n...\r\nFinal Answer: \u5f53\u524d\u65f6\u95f4\u4e3a2026\u5e745\u670816\u65e5\u4e0b\u53484\u70b956\u520652\u79d2...<\/code><\/pre>\n<p>The agent called the tool correctly, received the right time \u2014 but the entire response came back in <strong>Chinese<\/strong>. That&#8217;s a well-known quirk of the Qwen 2.5 family: the model originates from Alibaba and regularly falls back into its main training language under structured reasoning; the effect is especially pronounced in models below 14B parameters.<\/p>\n<p>The solution: an explicit <strong>system prompt<\/strong> that enforces the language. But watch out \u2014 NAT&#8217;s <code>react_agent<\/code> is template-based. If you set your own system prompt, you must include the placeholders <code>{tools}<\/code> and <code>{tool_names}<\/code> yourself, otherwise the agent doesn&#8217;t know which tools are available. That cost me a few minutes on my first attempt because I hadn&#8217;t done it.<\/p>\n<p>The correct solution with a system prompt that enforces English looks like this. Just create a new workflow file:<\/p>\n<p><strong>Command:<\/strong> <code>cd ~\/nat-playground\/configs<\/code><\/p>\n<p><strong>Command:<\/strong> <code>nano ollama_agent_system_prompt.yml<\/code><\/p>\n<pre class=\"wp-block-code\"><code>llms:\r\n  ollama_llm:\r\n    _type: openai\r\n    api_key: \"EMPTY\"\r\n    base_url: \"http:\/\/localhost:11434\/v1\"\r\n    model_name: \"qwen2.5:7b-instruct\"\r\n    temperature: 0.0\r\n    max_retries: 3\r\n\r\nfunctions:\r\n  current_datetime:\r\n    _type: current_datetime\r\n\r\nworkflow:\r\n  _type: react_agent\r\n  tool_names: [current_datetime]\r\n  llm_name: ollama_llm\r\n  system_prompt: |\r\n    You are a helpful English-speaking assistant.\r\n    IMPORTANT: Answer EXCLUSIVELY in English. Your thoughts must also be in English.\r\n    NEVER use Chinese or any other language.\r\n    You have access to the following tools:\r\n    {tools}\r\n    Use the following format for your answer:\r\n    Question: the input question you must answer\r\n    Thought: reason in English about what to do next\r\n    Action: the action to take \u2014 must be one of: [{tool_names}]\r\n    Action Input: the input to the action\r\n    Observation: the result of the action\r\n    ... (this cycle can repeat)\r\n    Thought: I now know the final answer\r\n    Final Answer: the final answer \u2014 in English\r\n    Begin!\r\n  verbose: true\r\n  parse_agent_response_max_retries: 3<\/code><\/pre>\n
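<p>The run command is the same as before, only with the new configuration file:<\/p>\n<p><strong>Command:<\/strong> <code>nat run --config_file configs\/ollama_agent_system_prompt.yml --input \"What time is it now and what can I derive from that for my workday?\"<\/code><\/p>\n<p>The trace should now stay in English from the first Thought to the Final Answer.<\/p>\n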
<p>The two curly braces <code>{tools}<\/code> and <code>{tool_names}<\/code> are template variables that NAT replaces at runtime:<\/p>\n<ul class=\"wp-block-list\">\n<li><code>{tools}<\/code> becomes the detailed description of all tools (name, description, parameters)<\/li>\n<li><code>{tool_names}<\/code> becomes the comma-separated list of tool names<\/li>\n<\/ul>\n<p>With NAT&#8217;s default prompt (i.e. if you omit <code>system_prompt<\/code>), NAT does this automatically \u2014 but then the prompt is in English by default, which is fine for English output but doesn&#8217;t help if you want to enforce a different language.<\/p>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Step_8_Tool_extension_with_Wikipedia_search\"><\/span>Step 8: Tool extension with Wikipedia search<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The <code>current_datetime<\/code> test was trivial. Things get more interesting with real tools. NAT ships <code>wiki_search<\/code> as a built-in. Now extend the config with the Wikipedia search as shown briefly below:<\/p>\n<pre class=\"wp-block-code\"><code>functions:\r\n  current_datetime:\r\n    _type: current_datetime\r\n  \r\n<strong>  wikipedia_search:\r\n    _type: wiki_search\r\n    max_results: 3<\/strong>\r\n\r\nworkflow:\r\n  _type: react_agent\r\n  tool_names: [current_datetime, <strong>wikipedia_search<\/strong>]\r\n  llm_name: ollama_llm\r\n  system_prompt: |\r\n    ...as above...\r\n  verbose: true\r\n  parse_agent_response_max_retries: 3<\/code><\/pre>\n<p>Now you can run the workflows as already shown. Be sure to use the correct names matching what you named your *.yml files.<\/p>\n<p><strong>Command:<\/strong> <code>nat run --config_file configs\/ollama_agent.yml --input \"Who was Nikola Tesla and in what year did he die?\"<\/code><\/p>\n<p>The agent should now independently select <code>wikipedia_search<\/code>, interpret the result and return a summarized answer \u2014 this time, ideally, in English.<\/p>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Pitfalls_youll_probably_hit\"><\/span>Pitfalls you&#8217;ll probably hit<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"1_Qwen-7B_doesnt_always_do_ReAct_cleanly\"><\/span>1. Qwen-7B doesn&#8217;t always do ReAct cleanly<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Even with a system prompt: smaller models (7B class) don&#8217;t always stick perfectly to the ReAct format. You&#8217;ll occasionally see output where action and final answer are mixed together. That&#8217;s the reason for <code>parse_agent_response_max_retries: 3<\/code> in the config \u2014 NAT attempts the re-parsing automatically.<\/p>\n<p>If it fails persistently: switch to a larger model that you can still run. Larger models are significantly more reliable at ReAct reasoning.<\/p>\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"2_Custom_system_prompt_needs_template_variables\"><\/span>2. Custom system prompt needs template variables<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>If you set <code>system_prompt<\/code>, then <code>{tools}<\/code> and <code>{tool_names}<\/code> must be in it. Otherwise you&#8217;ll get the ugly <code>ValueError: Invalid system_prompt<\/code>.<\/p>\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"3_Connection_error_to_Ollama\"><\/span>3. Connection error to Ollama<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>If you see <code>Connection refused<\/code>, check in order:<\/p>\n<ul class=\"wp-block-list\">\n<li>Is Ollama running? <code>systemctl status ollama<\/code><\/li>\n<li>Is <code>base_url<\/code> correct? Watch out: NAT needs <code>\/v1<\/code> at the end<\/li>\n<li>If Ollama runs remotely: does Ollama accept connections from outside? (set <code>OLLAMA_HOST=0.0.0.0<\/code>; a sketch follows below this list)<\/li>\n<\/ul>\n
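<p>For the last point: assuming Ollama was installed via the official script and runs as a systemd service (as in my setup), you can set the environment variable through a systemd override. Open the override, add the two lines below, then restart the service:<\/p>\n<p><strong>Command:<\/strong> <code>sudo systemctl edit ollama<\/code><\/p>\n<pre class=\"wp-block-code\"><code>[Service]\r\nEnvironment=\"OLLAMA_HOST=0.0.0.0\"<\/code><\/pre>\n<p><strong>Command:<\/strong> <code>sudo systemctl restart ollama<\/code><\/p>\n<p>Keep in mind that the API is then reachable from your whole network, so only do this in a trusted environment.<\/p>\n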
Otherwise you&#8217;ll get the ugly <code>ValueError: Invalid system_prompt<\/code>.<\/p>\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"3_Connection_error_to_Ollama\"><\/span>3. Connection error to Ollama<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>If you see <code>Connection refused<\/code>, check in order:<\/p>\n<ul class=\"wp-block-list\">\n<li>Is Ollama running? <code>systemctl status ollama<\/code><\/li>\n<li>Is <code>base_url<\/code> correct? Watch out: NAT needs <code>\/v1<\/code> at the end<\/li>\n<li>If Ollama runs remotely: does Ollama accept connections from outside? (set <code>OLLAMA_HOST=0.0.0.0<\/code>)<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"4_Performance_expectations\"><\/span>4. Performance expectations<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>A ReAct loop with two to three tool calls typically takes 5 to 15 seconds. That feels slower than simple chat inference. The reason: the agent generates significantly more tokens than a simple answer (Thoughts, Actions, Observations, final answer).<\/p>\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"5_What_you_should_NOT_do\"><\/span>5. What you should NOT do<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<ul class=\"wp-block-list\">\n<li><code>sudo pip install nvidia-nat<\/code> \u2014 global installation as root can mess up system-Python-based tools<\/li>\n<li><code>pip install nvidia-nat<\/code> without an active venv \u2014 lands in the user directory depending on config<\/li>\n<li>Conda environment instead of venv \u2014 the NAT docs explicitly warn against it<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"What_comes_next\"><\/span>What comes next?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>With this setup, you have the orchestrator layer on your workstation. The exciting extensions are obvious:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Connect MCP tools:<\/strong> NAT supports the Model Context Protocol natively. This lets you attach tools like filesystem access, GitHub API or your own REST endpoints<\/li>\n<li><strong>Custom Python tool:<\/strong> e.g. an MQTT bridge to an ESP32 robot. That would be the exciting bridge between LLM agent and physical AI that personally interests me a lot<\/li>\n<li><strong>A2A protocol:<\/strong> (Agent-to-Agent) orchestrate multiple agents, distribute tasks<\/li>\n<li><strong>NAT as FastAPI server:<\/strong> with <code>nat serve<\/code> \u2014 that would be the web UI hookup<\/li>\n<li><strong>Vision-Language-Models:<\/strong> if you want to feed the agent with images too, that&#8217;s the next model step<\/li>\n<\/ul>\n<p>My next plan: a custom tool that communicates with an ESP32 robot car via MQTT. That turns the ReAct agent into a real hybrid between language model and embedded hardware \u2014 and physical AI gets a concrete, tangible meaning in my AI workshop.<\/p>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Conclusion\"><\/span>Conclusion<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The transition from the pure inference layer like Ollama to the orchestrator layer (NAT) is conceptually a bigger jump than it first appears. Suddenly you&#8217;re dealing with multi-step reasoning, tool selection, prompt templating and format robustness. These are all topics that played no role in pure inference. 
<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Conclusion\"><\/span>Conclusion<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The transition from a pure inference layer like Ollama to the orchestrator layer (NAT) is conceptually a bigger jump than it first appears. Suddenly you&#8217;re dealing with multi-step reasoning, tool selection, prompt templating and format robustness. These are all topics that played no role in pure inference. At the same time, the setup with NAT is surprisingly straightforward: one venv, one <code>uv pip install<\/code>, one YAML file \u2014 and the orchestrator is up and running. The complexity lies not in the setup, but in the details: the language drift of smaller models, the template variables, the prompt-engineering subtleties.<\/p>\n<p>For me, this is the second pillar of the stack, and together with the first pillar \u2014 the inference server \u2014 I now have a solid foundation to tackle the next edge AI topics: VLMs, Physical AI via ESP32 bridges, and eventually porting to a Jetson Thor, once I get that far.<\/p>\n<p>Good luck with your own setup!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In my four-part TensorRT-LLM series I showed how I optimize inference performance on the RTX A6000 Ada \u2014 251 tokens\/sec with Qwen-2.5-7B in FP8, deployable .engine files, all cleanly reproducible. But in doing so, I had only built one part of the stack: the inference layer. Inspired by the ever-present GenAI agent publications, it became [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":2278,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_et_pb_use_builder":"","_et_pb_old_content":"","_et_gb_content_width":"","footnotes":""},"categories":[1219,162,50],"tags":[1221,333,1224,1220,43,1227,306,1134,1231,1225,1230,1226,1223,1222,1176,1229,1198,1228],"class_list":["post-2277","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-genai-agents","category-large-language-models-en","category-top-story-en","tag-agentic-ai","tag-langchain-en","tag-nat","tag-nemo-agent-toolkit","tag-nvidia-en","tag-nvidia-nat","tag-ollama-en","tag-openai-compatible-api","tag-openai-kompatibel","tag-orchestrator","tag-orchestrator-layer","tag-python-venv","tag-qwen-2-5","tag-react-agent","tag-rtx-a6000-ada","tag-system_prompt-template","tag-ubuntu-24-04","tag-uv-pip","et-has-post-format-content","et_post_format-et-post-format-standard"]}