In my four-part TensorRT-LLM series I showed how I optimize inference performance on the RTX A6000 Ada — 251 tokens/sec with Qwen-2.5-7B in FP8, deployable .engine files, all cleanly reproducible. But in doing so, I had only built one part of the stack: the inference layer.
Inspired by the steady stream of GenAI agent publications, it became clear to me that a production AI stack consists of multiple layers. Inference is only one of them. Above it sits the orchestrator — the layer that decides which tool gets called when, that does multi-step reasoning, that produces the actual agent behavior.
NVIDIA has released the NeMo Agent Toolkit (NAT) as an open-source library for exactly this. In this post I’ll show you how I installed NAT on my Ubuntu 24.04 server — cleanly isolated in a Python venv, with my existing Ollama setup as the backend — and how I got my first ReAct agent up and running. Including two non-trivial pitfalls that I’d rather spare you.
Here’s the link to the NeMo Agent Toolkit: https://github.com/NVIDIA/NeMo-Agent-Toolkit
What is this actually about?
An agent in the modern LLM context is more than a chatbot. A chatbot gets a question and answers it. An agent gets a task and decides on its own which tools to call, in which order, and when it has enough information for a final answer. The classic pattern is called ReAct (Reason + Act): the model thinks, picks an action (tool call), observes the result, thinks again — until it has a finished answer.
At the architectural level, the picture is straightforward: the inference layer at the bottom serves raw token generation, and the orchestrator sits on top of it, driving the reasoning loop and the tool calls.
NAT is explicitly built to be framework-agnostic: you can hang LangChain, LlamaIndex, CrewAI or even custom frameworks behind it. And the inference layer is decoupled via the OpenAI-compatible API. That’s a very flexible architecture — what’s hanging underneath doesn’t matter to NAT at all. Ollama, vLLM, TensorRT-LLM, or an NVIDIA NIM — as long as it speaks OpenAI format, it works.
Prerequisites
This guide assumes that your server is already fundamentally prepared for AI inference. If not, work through my foundation post first: Preparing an Ubuntu 24.04 server for AI inference.
Specifically you need:
- Ubuntu 24.04 LTS
- An NVIDIA GPU (I use the RTX A6000 Ada — for 7B models in 4-bit quantization any card with at least 8 GB VRAM will do)
- Ollama already running with qwen2.5:7b-instruct in the model cache (or any other model of your choice)
- Python 3.11, 3.12 or 3.13 on the host — system Python on Ubuntu 24.04 is Python 3.12, which is fine
- Internet connection for the initial package downloads (about 1.5 GB)
Architecture overview: what runs where?
Before we get started, a quick look at the clean separation of components. On my server, multiple layers coexist with different isolation mechanisms:
| Component | Where it runs | Why there |
|---|---|---|
| NVIDIA drivers, CUDA, Docker | System-wide | Needed by all applications, no isolation required |
| Ollama | System service (systemctl) | Daemon character, listens on port 11434 |
| TensorRT-LLM | Docker container (NGC image) | Complex dependency stack → container isolates it |
| NeMo Agent Toolkit (NAT) | Python venv on the host | Medium complexity — venv is sufficient |
| Custom tools (Python) | In the same venv as NAT | Direct access to NAT API |
The key point: NAT doesn’t need a container because it’s a pure Python library. But it absolutely needs its own Python environment because its roughly 80 to 120 dependencies would otherwise collide with other Python projects or system tools. The NAT documentation explicitly warns against conda, by the way. So I use vanilla venv.
Step 1: Install uv
uv is a modern Python package manager that is roughly 10-100x faster than pip. The NVIDIA documentation explicitly recommends it as the preferred option for the NAT installation. The following commands now need to be run, step by step.
Command: curl -LsSf https://astral.sh/uv/install.sh | sh
Command: source ~/.bashrc
Command: uv --version
On my system uv ends up at ~/.local/bin/uv, and the version line looks something like: uv 0.x.x. If the verification fails, the installer typically ended up in the wrong place — a source ~/.profile or a fresh terminal helps.
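If uv is still not found after that, the usual cause is that ~/.local/bin is not on your PATH yet. A quick manual fix for the current shell session (assuming the default install location mentioned above):

Command: export PATH="$HOME/.local/bin:$PATH"

Command: uv --version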
Step 2: Create the directory structure
I create my own project directory for NAT, separate from my other AI projects. The advantage: I can later have any number of agent projects running in parallel without their dependencies getting in each other’s way.
Command: mkdir -p ~/nat-playground/configs
Command: mkdir -p ~/nat-playground/tools
Command: cd ~/nat-playground
The eventual directory structure looks like this:
~/nat-playground/
├── .venv/     # Own Python environment (Step 3)
├── configs/   # YAML workflow configurations
└── tools/     # Custom tools (Python modules)
Step 3: Create a Python venv for NAT
Now we create the virtual Python environment. With uv this happens in a single step, and I’m still amazed how easy it is:
Command: uv venv --python 3.12 --seed .venv
Command: source .venv/bin/activate
The --seed flag ensures that pip is installed inside the venv as well, which simplifies plugin installations. You can tell the venv is active by the (.venv) in the shell prompt.
What happens here? A venv is basically just a directory with its own python binary (often a symlink to the system Python), its own site-packages folder, and an activation script. When activated, the shell PATH is manipulated so that when python or pip is called, the venv binaries are found first. These install packages into the venv-owned site-packages folder — completely isolated from the system, which is exactly what we want.
Advantage: if I break the installation, I just delete .venv/ and start over. The system Python remains untouched.
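A quick sanity check of that isolation: with the venv active, the resolved interpreter and its prefix should both point into .venv/:

Command: which python

Command: python -c "import sys; print(sys.prefix)"

Both outputs should contain ~/nat-playground/.venv.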
Step 4: Install the NeMo Agent Toolkit
With the venv active, I install NAT with the LangChain plugin. LangChain is the standard framework bridge and is needed for the ReAct agent:
Command: uv pip install "nvidia-nat[langchain]"
The command pulls about 80-120 packages — LangChain, Pydantic, httpx, openai client and quite a bit more. With uv the installation took about 5-10 minutes for me.
Now let’s check whether NAT has actually been installed.
Command: nat --version
Command: nat info components -t llm_provider
The second command lists the available LLM providers. You should see at least openai and nim in the list. The openai provider is the one we’ll use for Ollama. It works with any OpenAI-compatible endpoint.
Step 5: Verify the Ollama OpenAI API
Since version 0.1.24, Ollama provides an OpenAI-compatible endpoint at /v1. A quick function test before we connect NAT to it. It’s important that your Ollama inference server is running and reachable:
Command: curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "qwen2.5:7b-instruct",
"messages": [{"role":"user","content":"Answer with OK."}],
"max_tokens": 10
}'
If you get back a JSON response with "OK" in the content field, everything is ready.
Note: If NAT is supposed to run on a different machine than Ollama, Ollama must listen on all interfaces.
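If that is your setup, the usual route on a systemd-managed installation is an override that sets OLLAMA_HOST. A sketch, assuming Ollama runs as the ollama service from the official installer:

Command: sudo systemctl edit ollama

In the editor that opens, add these two lines, then save and restart the service:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"

Command: sudo systemctl restart ollama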
Step 6: First agent workflow configuration
NAT workflows are defined in YAML files. We start with a minimal workflow that uses a single tool only. It should return the current time. The workflow is stored in a *.yml file. I created the following file in the configs folder using nano.
Command: cd ~/nat-playground/configs
Command: nano ollama_agent.yml
Into this file you paste the following content describing the workflow.
llms:
  ollama_llm:
    _type: openai
    api_key: "EMPTY"
    base_url: "http://localhost:11434/v1"
    model_name: "qwen2.5:7b-instruct"
    temperature: 0.0
    max_retries: 3

functions:
  current_datetime:
    _type: current_datetime

workflow:
  _type: react_agent
  tool_names: [current_datetime]
  llm_name: ollama_llm
  verbose: true
  parse_agent_response_max_retries: 3
With CTRL + X followed by Y, you save the change.
Three main blocks:
- llms: defines the available LLM backends. _type: openai makes Ollama usable as a generic OpenAI endpoint. The api_key: "EMPTY" is mandatory — even though Ollama doesn’t check it, the field has to be set, otherwise you’ll get a validation error.
- functions: defines the tools the agent is allowed to call. current_datetime is a built-in of NAT.
- workflow: defines the agent pattern. react_agent is the classic Reason+Act pattern.
Step 7: Run the agent
To run the agent workflow, make sure the venv is still active, then execute the following commands.
Command: cd ~/nat-playground
Command: nat run --config_file configs/ollama_agent.yml --input "What time is it now and what can I derive from that for my workday?"
If everything works, you’ll see a ReAct trace in the terminal that might look like this. In my case, there were still a lot of Chinese characters in between.
Thought: I need to find out the current time.
Action: current_datetime
Action Input: {}
Observation: 2026-05-16 16:56:52 +0000
Thought: It's Saturday evening, 18:56 Munich time. From this I can derive...
Final Answer: It is currently 18:56 on May 16, 2026...
With this, you have cleanly coupled the inference layer (Ollama) and the orchestrator layer (NAT). Exactly the architectural separation we want to achieve. In this case with Ollama instead of TensorRT-LLM.
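If you later swap the backend, only the llms block changes. As a rough sketch (endpoint, port and model name below are placeholders for whatever your vLLM or TensorRT-LLM OpenAI-compatible server actually exposes, not values from this setup):

llms:
  local_llm:
    _type: openai
    api_key: "EMPTY"
    base_url: "http://localhost:8000/v1"   # placeholder: your server's OpenAI-compatible endpoint
    model_name: "qwen2.5-7b-instruct"      # placeholder: the model id that server serves
    temperature: 0.0
    max_retries: 3

The workflow block then references local_llm via llm_name; nothing else in the config needs to change.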
The Qwen trap: when the agent suddenly speaks Chinese
This is where it gets interesting. On my first run, I got the following trace:
Thought: 为了回答这个问题,我需要获取当前的时间和日期信息...
Action: current_datetime
Action Input: {"unused": "2023-11-29T15:48:00Z"}
...
Final Answer: 当前时间为2026年5月16日下午4点56分52秒...
The agent called the tool correctly, received the right time — but the entire response came back in Chinese. That’s a well-known quirk of the Qwen 2.5 family: the model originates from Alibaba and regularly falls back into its main training language under structured reasoning. Especially pronounced in models below 14B parameters.
The solution: an explicit system prompt that enforces the language. But watch out — NAT’s react_agent is template-based. If you set your own system prompt, you must include the placeholders {tools} and {tool_names} yourself, otherwise the agent doesn’t know which tools are available. That cost me a few minutes on my first attempt because I hadn’t done it.
The correct solution with a system prompt that enforces English looks like this. Just create a new workflow; the llms and functions blocks stay the same, only the workflow block gets the explicit prompt.
Command: cd ~/nat-playground/configs
Command: nano ollama_agent_system_prompt.yml
llms:
  ollama_llm:
    _type: openai
    api_key: "EMPTY"
    base_url: "http://localhost:11434/v1"
    model_name: "qwen2.5:7b-instruct"
    temperature: 0.0
    max_retries: 3

functions:
  current_datetime:
    _type: current_datetime

workflow:
  _type: react_agent
  tool_names: [current_datetime]
  llm_name: ollama_llm
  system_prompt: |
    You are a helpful English-speaking assistant.
    IMPORTANT: Answer EXCLUSIVELY in English. Your thoughts must also be in English.
    NEVER use Chinese or any other language.
    You have access to the following tools:
    {tools}
    Use the following format for your answer:
    Question: the input question you must answer
    Thought: reason in English about what to do next
    Action: the action to take — must be one of: [{tool_names}]
    Action Input: the input to the action
    Observation: the result of the action
    ... (this cycle can repeat)
    Thought: I now know the final answer
    Final Answer: the final answer — in English
    Begin!
  verbose: true
  parse_agent_response_max_retries: 3
The two curly braces {tools} and {tool_names} are template variables that NAT replaces at runtime:
- {tools} becomes the detailed description of all tools (name, description, parameters)
- {tool_names} becomes the comma-separated list of tool names
With NAT’s default prompt (i.e. if you omit system_prompt), NAT does this automatically — but then the prompt is in English by default, which is fine for English output but doesn’t help if you want to enforce a different language.
Step 8: Tool extension with Wikipedia search
The current_datetime test we first tried is trivial. Things get more interesting with real tools. NAT has wiki_search as a built-in. Now extend the config with the Wiki search as shown briefly below:
functions:
  current_datetime:
    _type: current_datetime
  wikipedia_search:
    _type: wiki_search
    max_results: 3

workflow:
  _type: react_agent
  tool_names: [current_datetime, wikipedia_search]
  llm_name: ollama_llm
  system_prompt: |
    ...as above...
  verbose: true
  parse_agent_response_max_retries: 3
Now you can run the workflows as already shown. Be sure to use the correct names matching what you named your *.yml files.
Command: nat run --config_file configs/ollama_agent.yml --input "Who was Nikola Tesla and in what year did he die?"
The agent should now independently select wikipedia_search, interpret the result and return a summarized answer — this time, ideally, in English.
Pitfalls you’ll probably hit
1. Qwen-7B doesn’t always do ReAct cleanly
Even with a system prompt: smaller models (7B class) don’t always stick perfectly to the ReAct format. You’ll occasionally see output where action and final answer are mixed together. That’s the reason for parse_agent_response_max_retries: 3 in the config — NAT attempts the re-parsing automatically.
If it fails persistently: switch to a larger model that you can still run. Larger models are significantly more reliable at ReAct reasoning.
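If you have the VRAM for it, the switch is a single line in the llms block. Here with the 14B Qwen variant as an example (assuming you have pulled it into Ollama beforehand, e.g. with ollama pull qwen2.5:14b-instruct):

llms:
  ollama_llm:
    _type: openai
    api_key: "EMPTY"
    base_url: "http://localhost:11434/v1"
    model_name: "qwen2.5:14b-instruct"   # any larger model served by your Ollama works here
    temperature: 0.0
    max_retries: 3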
2. Custom system prompt needs template variables
If you set system_prompt, then {tools} and {tool_names} must be in it. Otherwise you’ll get the ugly ValueError: Invalid system_prompt.
3. Connection error to Ollama
If you see Connection refused, check in order:
- Is Ollama running? systemctl status ollama
- Is base_url correct? Watch out: NAT needs /v1 at the end
- If Ollama runs remotely: does Ollama accept connections from outside? (set OLLAMA_HOST=0.0.0.0)
4. Performance expectations
A ReAct loop with two to three tool calls typically takes 5 to 15 seconds. That feels slower than simple chat inference. The reason: the agent generates significantly more tokens than a simple answer (Thoughts, Actions, Observations, final answer).
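If you want numbers instead of a gut feeling, the shell's time builtin is enough for a rough comparison between configs and models:

Command: time nat run --config_file configs/ollama_agent.yml --input "What time is it now?"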
5. What you should NOT do
- sudo pip install nvidia-nat — a global installation as root can mess up system-Python-based tools
- pip install nvidia-nat without an active venv — lands in the user directory depending on your config
- A conda environment instead of venv — the NAT docs explicitly warn against it
What comes next?
With this setup, you have the orchestrator layer on your workstation. The exciting extensions are obvious:
- Connect MCP tools: NAT supports the Model Context Protocol natively. This lets you attach tools like filesystem access, GitHub API or your own REST endpoints
- Custom Python tool: e.g. an MQTT bridge to an ESP32 robot. That would be the exciting bridge between LLM agent and physical AI that personally interests me a lot
- A2A protocol (Agent-to-Agent): orchestrate multiple agents, distribute tasks
- NAT as FastAPI server: with nat serve — that would be the web UI hookup
- Vision-Language-Models: if you want to feed the agent with images too, that’s the next model step
My next plan: a custom tool that communicates with an ESP32 robot car via MQTT. That turns the ReAct agent into a real hybrid between language model and embedded hardware — and physical AI gets a concrete, tangible meaning in my AI workshop.
Conclusion
The transition from the pure inference layer like Ollama to the orchestrator layer (NAT) is conceptually a bigger jump than it first appears. Suddenly you’re dealing with multi-step reasoning, tool selection, prompt templating and format robustness. These are all topics that played no role in pure inference. At the same time, the setup with NAT is surprisingly straightforward: one venv, one uv pip install, one YAML file — and you have the orchestrator standing. The complexity lies not in the setup, but in the details: the language drift in smaller models, the template variables, the prompt-engineering subtleties.
For me, this is the second pillar of the stack, and together with the first pillar — the inference server — I now have a solid foundation to tackle the next edge AI topics: VLMs, Physical AI via ESP32 bridges, and eventually the porting to a Jetson Thor, when I get that far.
Good luck with your own setup!