Building Your First Agentic AI Workflow with NeMo Agent Toolkit
Agentic AI, models that call tools, plan, and iterate, has moved from research to production. NVIDIA’s NeMo Agent Toolkit gives you a structured framework for building these workflows on NVIDIA GPUs. This tutorial walks through your first agent, end to end.
What You’ll Build
A research-assistant agent that can:
- Search a local document corpus via RAG
- Call a Python tool to execute a code snippet
- Reason over the combined results to produce a structured answer
By the end you’ll have a working agent and a deployment template you can extend.
Prerequisites
- An NVIDIA GPU (H100, B200, RTX PRO 6000, L40S, or DGX Spark all work)
- Python 3.11+, Docker
- An NGC API key from build.nvidia.com
- Approximately 200 GB free disk for model weights
Step 1: Install the Toolkit
pip install nemo-agent-toolkit
export NGC_API_KEY=your_key_here
Step 2: Pull a Reasoning NIM
Use a function-calling capable model. Llama 3.3 70B Instruct is a strong default:
docker run -d --gpus all \
-p 8000:8000 \
-e NGC_API_KEY=$NGC_API_KEY \
-v ~/nim-cache:/opt/nim/.cache \
nvcr.io/nim/meta/llama-3.3-70b-instruct:latest
Wait for the container to log “Application startup complete.” Verify:
curl http://localhost:8000/v1/models
Step 3: Define Your Tools
Create tools.py:
from nemo_agent_toolkit import tool
@tool
def search_docs(query: str) -> str:
'''Search the local document corpus for relevant passages.'''
from chromadb import PersistentClient
client = PersistentClient(path="./chroma")
coll = client.get_collection("docs")
results = coll.query(query_texts=[query], n_results=3)
return "\n---\n".join(results["documents"][0])
@tool
def run_python(code: str) -> str:
'''Execute a Python snippet and return its stdout.'''
import subprocess, tempfile
with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
f.write(code)
path = f.name
out = subprocess.run(["python", path], capture_output=True, text=True, timeout=10)
return out.stdout or out.stderr
Step 4: Build a Document Index
from chromadb import PersistentClient
client = PersistentClient(path="./chroma")
coll = client.get_or_create_collection("docs")
# Index your local corpus
import glob
for path in glob.glob("docs/**/*.md", recursive=True):
with open(path) as f:
text = f.read()
coll.add(documents=[text], ids=[path])
Step 5: Define the Agent
Create agent.py:
from nemo_agent_toolkit import Agent, OpenAIClient
from tools import search_docs, run_python
client = OpenAIClient(
base_url="http://localhost:8000/v1",
api_key="not-needed",
model="meta/llama-3.3-70b-instruct"
)
agent = Agent(
client=client,
tools=[search_docs, run_python],
system_prompt=(
"You are a research assistant. Use the search_docs tool to find "
"relevant context, and run_python to compute or verify numbers. "
"Always cite the documents you used."
),
max_iterations=6
)
Step 6: Run It
if __name__ == "__main__":
result = agent.run(
"What does our internal documentation say about RTX PRO 6000 "
"memory configurations, and how much total VRAM is in a 4-GPU setup?"
)
print(result)
python agent.py
What Happens Under the Hood
- The agent receives the user query
- The model decides to call
search_docswith a focused search query - RAG returns the top 3 passages
- The model decides to call
run_pythonto multiply VRAM x 4 - The Python tool returns the computed value
- The model composes a final answer with citations
Step 7: Add Structured Output
Production agents need schema-validated responses. NeMo Agent Toolkit supports Pydantic models:
from pydantic import BaseModel
from typing import List
class Answer(BaseModel):
summary: str
total_vram_gb: int
sources: List[str]
result = agent.run(query, response_model=Answer)
print(result.total_vram_gb)
Step 8: Deploy
For production, package the agent as a FastAPI service and serve behind a NIM:
from fastapi import FastAPI
app = FastAPI()
@app.post("/ask")
def ask(query: str):
return agent.run(query, response_model=Answer).model_dump()
Containerize, deploy to Kubernetes alongside the NIM, and front with your API gateway.
Production Considerations
- Sandbox tool execution, never run untrusted Python in the same process as your agent
- Set
max_iterationsto bound runaway loops - Cache RAG results for repeated queries
- Observability, log every tool call, token count, and decision
- Eval harness, build regression tests before changing the system prompt
Where to Take It Next
- Add domain-specific tools (database queries, internal APIs)
- Try multi-agent patterns where one agent supervises others
- Move to NVIDIA AI Enterprise for production support
- Profile and tune with TensorRT-LLM if latency matters
Building agentic AI in production? Browse our NVIDIA AI Enterprise and DGX Spark product pages or contact our team for an architecture review tailored to your use case.