Building Your First Agentic AI Workflow with NeMo Agent Toolkit

Agentic AI, models that call tools, plan, and iterate, has moved from research to production. NVIDIA’s NeMo Agent Toolkit gives you a structured framework for building these workflows on NVIDIA GPUs. This tutorial walks through your first agent, end to end.

What You’ll Build

A research-assistant agent that can:

Search a local document corpus via RAG
Call a Python tool to execute a code snippet
Reason over the combined results to produce a structured answer

By the end you’ll have a working agent and a deployment template you can extend.

Prerequisites

An NVIDIA GPU (H100, B200, RTX PRO 6000, L40S, or DGX Spark all work)
Python 3.11+, Docker
An NGC API key from build.nvidia.com
Approximately 200 GB free disk for model weights

Step 1: Install the Toolkit

pip install nemo-agent-toolkit
export NGC_API_KEY=your_key_here

Step 2: Pull a Reasoning NIM

Use a function-calling capable model. Llama 3.3 70B Instruct is a strong default:

docker run -d --gpus all \
  -p 8000:8000 \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v ~/nim-cache:/opt/nim/.cache \
  nvcr.io/nim/meta/llama-3.3-70b-instruct:latest

Wait for the container to log “Application startup complete.” Verify:

curl http://localhost:8000/v1/models

Step 3: Define Your Tools

Create tools.py:

from nemo_agent_toolkit import tool

@tool
def search_docs(query: str) -> str:
    '''Search the local document corpus for relevant passages.'''
    from chromadb import PersistentClient
    client = PersistentClient(path="./chroma")
    coll = client.get_collection("docs")
    results = coll.query(query_texts=[query], n_results=3)
    return "\n---\n".join(results["documents"][0])

@tool
def run_python(code: str) -> str:
    '''Execute a Python snippet and return its stdout.'''
    import subprocess, tempfile
    with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
        f.write(code)
        path = f.name
    out = subprocess.run(["python", path], capture_output=True, text=True, timeout=10)
    return out.stdout or out.stderr

Step 4: Build a Document Index

from chromadb import PersistentClient
client = PersistentClient(path="./chroma")
coll = client.get_or_create_collection("docs")

# Index your local corpus
import glob
for path in glob.glob("docs/**/*.md", recursive=True):
    with open(path) as f:
        text = f.read()
    coll.add(documents=[text], ids=[path])

Step 5: Define the Agent

Create agent.py:

from nemo_agent_toolkit import Agent, OpenAIClient
from tools import search_docs, run_python

client = OpenAIClient(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
    model="meta/llama-3.3-70b-instruct"
)

agent = Agent(
    client=client,
    tools=[search_docs, run_python],
    system_prompt=(
        "You are a research assistant. Use the search_docs tool to find "
        "relevant context, and run_python to compute or verify numbers. "
        "Always cite the documents you used."
    ),
    max_iterations=6
)

Step 6: Run It

if __name__ == "__main__":
    result = agent.run(
        "What does our internal documentation say about RTX PRO 6000 "
        "memory configurations, and how much total VRAM is in a 4-GPU setup?"
    )
    print(result)

python agent.py

What Happens Under the Hood

The agent receives the user query
The model decides to call search_docs with a focused search query
RAG returns the top 3 passages
The model decides to call run_python to multiply VRAM x 4
The Python tool returns the computed value
The model composes a final answer with citations

Step 7: Add Structured Output

Production agents need schema-validated responses. NeMo Agent Toolkit supports Pydantic models:

from pydantic import BaseModel
from typing import List

class Answer(BaseModel):
    summary: str
    total_vram_gb: int
    sources: List[str]

result = agent.run(query, response_model=Answer)
print(result.total_vram_gb)

Step 8: Deploy

For production, package the agent as a FastAPI service and serve behind a NIM:

from fastapi import FastAPI

app = FastAPI()

@app.post("/ask")
def ask(query: str):
    return agent.run(query, response_model=Answer).model_dump()

Containerize, deploy to Kubernetes alongside the NIM, and front with your API gateway.

Production Considerations

Sandbox tool execution, never run untrusted Python in the same process as your agent
Set max_iterations to bound runaway loops
Cache RAG results for repeated queries
Observability, log every tool call, token count, and decision
Eval harness, build regression tests before changing the system prompt

Where to Take It Next

Add domain-specific tools (database queries, internal APIs)
Try multi-agent patterns where one agent supervises others
Move to NVIDIA AI Enterprise for production support
Profile and tune with TensorRT-LLM if latency matters

Building agentic AI in production? Browse our NVIDIA AI Enterprise and DGX Spark product pages or contact our team for an architecture review tailored to your use case.