
Deploying NVIDIA NIM Microservices on a Kubernetes Cluster

NVIDIA NIM microservices give you OpenAI-compatible model endpoints in containers. Running one on a laptop is easy. Running them in production on a Kubernetes cluster, with autoscaling, observability, and tenant isolation, takes a structured approach. This tutorial walks through that path end to end.

Prerequisites

  • A Kubernetes cluster (1.28+) with GPU nodes (H100, H200, B200, RTX PRO 6000, or L40S recommended)
  • kubectl and helm installed locally
  • An NGC API key from build.nvidia.com (free for development)
  • Cluster admin permissions for the initial GPU Operator install

Step 1: Install the NVIDIA GPU Operator

The GPU Operator manages drivers, runtime, monitoring, and device plugins. Install it once per cluster:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

kubectl create namespace gpu-operator

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set dcgmExporter.enabled=true

Wait until every pod in the gpu-operator namespace is Running (one-shot validator pods will show Completed instead):

kubectl get pods -n gpu-operator
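
Scanning the full pod list by eye gets tedious while the operator rolls out. As a minimal sketch, the filter below prints only the pods that have not settled yet. The function name not_ready_pods is made up for illustration, and it assumes kubectl's default column layout (NAME, READY, STATUS, RESTARTS, AGE). It looks only at the STATUS column, not at container readiness.

```shell
# Print pods whose STATUS is neither Running nor Completed.
# Expects `kubectl get pods --no-headers` output on stdin
# (default columns: NAME READY STATUS RESTARTS AGE).
not_ready_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 " (" $3 ")" }'
}
```

Usage: kubectl get pods -n gpu-operator --no-headers | not_ready_pods. Empty output suggests the rollout has settled.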

Step 2: Verify GPUs Are Visible to Kubernetes

kubectl describe nodes | grep nvidia.com/gpu

You should see nvidia.com/gpu: N under both Capacity and Allocatable for each GPU node.
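
On clusters with many nodes, a per-node table is easier to read than grep output. The sketch below assumes the GPU count is published under the allocatable resource nvidia.com/gpu, as described above. The function names list_node_gpus and sum_gpus are hypothetical helpers, not part of kubectl.

```shell
# List each node with its allocatable GPU count (<none> for non-GPU nodes).
list_node_gpus() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}

# Sum the GPUS column from list_node_gpus output on stdin. Skips the header
# row and any non-numeric value such as <none>.
sum_gpus() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}
```

Usage: list_node_gpus | sum_gpus prints the total number of schedulable GPUs in the cluster.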

Step 3: Create the NIM Namespace and Secrets

kubectl create namespace nim
kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY" \
  --namespace=nim

kubectl create secret generic ngc-api \
  --from-literal=NGC_API_KEY="$NGC_API_KEY" \
  --namespace=nim
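
Both secrets read the key from the NGC_API_KEY environment variable, so an unset variable silently produces secrets with an empty value. A small guard, sketched here with the hypothetical helper name require_ngc_key, can catch that before the secrets are created.

```shell
# Fail early if NGC_API_KEY is empty, so the secrets are not created
# with a blank value.
require_ngc_key() {
  if [ -z "${NGC_API_KEY:-}" ]; then
    echo "NGC_API_KEY is not set; export it before creating the secrets" >&2
    return 1
  fi
}
```

Usage: run require_ngc_key && kubectl create secret ... so the kubectl command only executes when the key is present.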

Step 4: Deploy a NIM

Use the NVIDIA NIM Helm chart. Example for the Llama 3.3 70B Instruct NIM:

helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-1.x.y.tgz \
  --username='$oauthtoken' --password="$NGC_API_KEY"

cat > values.yaml <
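
The contents of values.yaml are not shown above. As a minimal sketch of what such a file might contain, the block below uses key names from NVIDIA's published nim-llm examples (image, model.ngcAPISecret, imagePullSecrets, persistence, resources) and the two secrets from Step 3. The image repository path, the tag placeholder, and the GPU count are assumptions; check them against the NGC catalog and the chart's own values before use.

```shell
# Sketch only: key names follow NVIDIA's published nim-llm examples. The
# repository path, tag, and GPU count are assumptions to adjust.
cat > values.yaml <<'EOF'
image:
  repository: nvcr.io/nim/meta/llama-3.3-70b-instruct
  tag: "REPLACE_WITH_TAG"      # pick a tag from the NGC catalog
model:
  ngcAPISecret: ngc-api        # generic secret from Step 3 (key NGC_API_KEY)
imagePullSecrets:
  - name: ngc-secret           # docker-registry secret from Step 3
persistence:
  enabled: true                # keep downloaded model files across restarts
resources:
  limits:
    nvidia.com/gpu: 4          # assumption: size this to the model profile
EOF
```

With the file in place, an install along the lines of helm install llama nim-llm-1.x.y.tgz --namespace nim -f values.yaml would follow. The release name llama is an assumption, chosen because it would yield the service name llama-nim-llm used in Step 5.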

Step 5: Verify the Endpoint

kubectl port-forward -n nim svc/llama-nim-llm 8000:8000

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [{"role":"user","content":"Hello"}]
  }'
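
A 70B model can take several minutes to download and load, so the first request may fail simply because the container is not ready. The sketch below is a generic retry helper (wait_until is a hypothetical name) that reruns any probe command until it succeeds. The readiness path /v1/health/ready in the usage note is an assumption; confirm it against the documentation for your NIM version.

```shell
# Run a probe command repeatedly until it succeeds.
# usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 on the first success, 1 if every attempt fails.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

Usage, with the port-forward from this step running: wait_until 60 10 curl -sf http://localhost:8000/v1/health/ready probes every 10 seconds for up to 10 minutes before giving up.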

Step 6: Add Autoscaling

NIM containers expose Prometheus metrics. Combine them with KEDA for queue-aware autoscaling:

helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace

cat > scaler.yaml <
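
The contents of scaler.yaml are not shown above. As a hedged sketch, a KEDA ScaledObject with a Prometheus trigger could look like the block below. Everything specific in it is an assumption: the target name and kind (llama-nim-llm as a StatefulSet), the Prometheus address (the default service name when kube-prometheus-stack is installed as release prometheus in the monitoring namespace, as in Step 7), the metric name num_requests_waiting, and the threshold and replica bounds.

```shell
# Sketch only: the target name/kind, Prometheus address, metric name, and
# threshold are assumptions to adapt to your cluster.
cat > scaler.yaml <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-nim-scaler
  namespace: nim
spec:
  scaleTargetRef:
    kind: StatefulSet            # use Deployment if that is what the chart created
    name: llama-nim-llm
  minReplicaCount: 1
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
        query: sum(num_requests_waiting)
        threshold: "10"          # target queued requests per replica
EOF
```

Apply it with kubectl apply -f scaler.yaml. The trigger only works once a Prometheus server is reachable at serverAddress and is scraping the NIM metrics endpoint.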

Step 7: Add Observability

Install Prometheus and Grafana with the NVIDIA DCGM exporter dashboard:

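helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace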

# Import the NVIDIA DCGM dashboard (ID 12239) into Grafana

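Once Prometheus is scraping both the DCGM exporter and the NIM metrics endpoint (for example, through ServiceMonitor resources), you have GPU utilization, NIM request rates, and queue depth in one place.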
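
Before building dashboards, it can help to spot-check a single metric straight from the container. The sketch below sums all samples of one metric from Prometheus text-format output. The helper name metric_sum is hypothetical, and the metrics path /v1/metrics and metric name num_requests_waiting in the usage note are assumptions to confirm against your NIM version.

```shell
# Sum every sample of one metric from Prometheus text-format input on stdin.
# Matches both "name value" and "name{labels} value" lines. Assumes label
# values contain no spaces and lines carry no trailing timestamp.
# Exits non-zero if the metric does not appear at all.
metric_sum() {
  awk -v name="$1" '
    $1 == name || index($1, name "{") == 1 { total += $NF; found = 1 }
    END { if (found) print total + 0; else exit 1 }
  '
}
```

Usage, with the Step 5 port-forward running: curl -s http://localhost:8000/v1/metrics | metric_sum num_requests_waiting prints the current queue depth across all label sets.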

Step 8: Front With an API Gateway

For production, add an API gateway in front of the NIM service for auth, rate limiting, and routing. Common choices are Kong, Envoy, and NGINX. The NIM exposes OpenAI-compatible endpoints, so any gateway that understands HTTP/JSON works.
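
As one possible shape for the NGINX option, the sketch below writes an Ingress manifest for the ingress-nginx controller with basic auth and a per-client rate limit set through annotations. It assumes ingress-nginx is already installed with ingress class nginx. The host nim.example.com, the secret name nim-basic-auth, the limit of 10 requests per second, and the backend service name llama-nim-llm are placeholders.

```shell
# Sketch only: assumes the ingress-nginx controller. Host, auth secret,
# rate limit, and backend service name are placeholders.
cat > nim-ingress.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nim-gateway
  namespace: nim
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "10"
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: nim-basic-auth
spec:
  ingressClassName: nginx
  rules:
    - host: nim.example.com
      http:
        paths:
          - path: /v1
            pathType: Prefix
            backend:
              service:
                name: llama-nim-llm
                port:
                  number: 8000
EOF
```

Apply it with kubectl apply -f nim-ingress.yaml after creating the basic-auth secret in the nim namespace. Token-based auth or per-tenant quotas would need a fuller gateway such as Kong or Envoy.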

Step 9: Multi-Model Routing

Most production deployments end up serving multiple models. A common pattern:

  • Deploy each NIM as its own Kubernetes Service
  • Use a thin gateway (Triton's model router or a custom service) to dispatch by model name
  • Apply per-model autoscaling rules
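
Dispatching by model name boils down to a lookup table from the model field of the request to the Service that hosts it. The sketch below shows that table as a shell function. The function name route_model, the second model entry, and both in-cluster URLs are hypothetical, and a real gateway would do this lookup on the parsed request body.

```shell
# Map an OpenAI-style model name to the in-cluster Service URL serving it.
# Prints the URL on success; returns 1 for an unknown model.
route_model() {
  case "$1" in
    meta/llama-3.3-70b-instruct)
      echo "http://llama-nim-llm.nim.svc.cluster.local:8000" ;;
    example/other-model)  # hypothetical second NIM
      echo "http://other-nim-llm.nim.svc.cluster.local:8000" ;;
    *)
      echo "unknown model: $1" >&2
      return 1 ;;
  esac
}
```

Usage: route_model meta/llama-3.3-70b-instruct prints the upstream base URL, to which the gateway would append /v1/chat/completions.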

Operational Checklist

  • Back up or persist the NIM model cache (/opt/nim/.cache); this saves cold-start time
  • Set the GPU request/limit (nvidia.com/gpu) explicitly per replica; Kubernetes schedules whole GPUs rather than GPU memory
  • Configure pod anti-affinity to spread replicas across GPU nodes
  • Monitor tokens/sec, p99 latency, and queue depth as your golden metrics
  • Plan model upgrades: NIMs version cleanly, but test new tags in staging first
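
The anti-affinity item can be expressed as a small Helm values overlay. The sketch below assumes the chart exposes a standard affinity value and that pods carry the label app.kubernetes.io/instance set to the release name (llama here). Both are assumptions to verify with helm show values and kubectl get pods --show-labels.

```shell
# Sketch only: assumes the chart accepts a top-level `affinity` value and
# labels pods with app.kubernetes.io/instance=<release name>.
cat > spread.yaml <<'EOF'
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname   # spread across nodes
          labelSelector:
            matchLabels:
              app.kubernetes.io/instance: llama
EOF
```

Passing it as an extra values file, for example helm upgrade llama nim-llm-1.x.y.tgz --namespace nim -f values.yaml -f spread.yaml, would layer it over the base configuration. The preferred (soft) rule is used so that replicas still schedule when GPU nodes are scarce.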

