Deploying NVIDIA NIM Microservices on a Kubernetes Cluster
NVIDIA NIM microservices give you OpenAI-compatible model endpoints in containers. Running one on a laptop is easy. Running them in production on a Kubernetes cluster, with autoscaling, observability, and tenant isolation, takes a structured approach. This tutorial walks through that path end to end.
Prerequisites
- A Kubernetes cluster (1.28+) with GPU nodes (H100, H200, B200, RTX PRO 6000, or L40S recommended)
- kubectl and helm installed locally
- An NGC API key from build.nvidia.com (free for development)
- Cluster admin permissions for the initial GPU Operator install
Step 1: Install the NVIDIA GPU Operator
The GPU Operator manages drivers, runtime, monitoring, and device plugins. Install it once per cluster:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
kubectl create namespace gpu-operator
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--set driver.enabled=true \
--set toolkit.enabled=true \
--set dcgmExporter.enabled=true
Wait until all pods in the gpu-operator namespace are Running (one-shot validator pods finish as Completed):
kubectl get pods -n gpu-operator
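Rather than re-running the command by hand, the wait can be scripted. The following is a minimal sketch, assuming the default `kubectl get pods --no-headers` column layout (NAME READY STATUS RESTARTS AGE). The function names and the 10-second poll interval are illustrative choices, not part of the GPU Operator.

```shell
# Succeeds only if stdin has at least one pod line and every pod's STATUS
# column is Running or Completed.
all_pods_settled() {
  awk '$3 != "Running" && $3 != "Completed" { bad = 1 }
       END { exit (bad || NR == 0) }'
}

# Poll a namespace until its pods settle or the timeout (seconds) expires.
# Arguments (both optional): namespace, timeout.
wait_for_gpu_operator() {
  ns="${1:-gpu-operator}"
  timeout="${2:-600}"
  waited=0
  until kubectl get pods -n "$ns" --no-headers | all_pods_settled; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for pods in $ns" >&2
      return 1
    fi
    sleep 10
    waited=$((waited + 10))
  done
}
```

Calling `wait_for_gpu_operator gpu-operator 900` would block for up to 15 minutes, which leaves room for driver compilation on freshly provisioned nodes.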
Step 2: Verify GPUs Are Visible to Kubernetes
kubectl describe nodes | grep nvidia.com/gpu
You should see nvidia.com/gpu: N under both Capacity and Allocatable for each GPU node.
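For a more compact view than grepping `describe` output, the allocatable GPU count can be pulled per node with a custom-columns query. This is a sketch; the function names are illustrative, and nodes without GPUs are expected to show `<none>` in the second column.

```shell
# Print one line per node: node name and allocatable nvidia.com/gpu count.
# The backslash escapes the dot inside the resource name.
gpu_allocatable_per_node() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu' \
    --no-headers
}

# Sum allocatable GPUs across the cluster, skipping non-numeric entries
# such as <none>.
gpu_total() {
  gpu_allocatable_per_node |
    awk '$2 ~ /^[0-9]+$/ { sum += $2 } END { print sum + 0 }'
}
```

A total of 0 from `gpu_total` usually means the device plugin has not registered yet, so it is worth re-checking the gpu-operator pods before moving on.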
Step 3: Create the NIM Namespace and Secrets
kubectl create namespace nim
kubectl create secret docker-registry ngc-secret \
--docker-server=nvcr.io \
--docker-username='$oauthtoken' \
--docker-password="$NGC_API_KEY" \
--namespace=nim
kubectl create secret generic ngc-api \
--from-literal=NGC_API_KEY="$NGC_API_KEY" \
--namespace=nim
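Both secret commands read the key from the NGC_API_KEY environment variable. If it is unset, kubectl will still create the secrets, just with an empty value, and the failure only shows up later as an image pull error. A small guard, sketched here with a hypothetical function name, catches that early:

```shell
# Fail fast when NGC_API_KEY is missing or empty in the current shell.
require_ngc_key() {
  if [ -z "${NGC_API_KEY:-}" ]; then
    echo "NGC_API_KEY is not set; export it before creating the secrets" >&2
    return 1
  fi
}

# Confirm that both secrets from this step exist in the nim namespace.
check_nim_secrets() {
  kubectl get secret ngc-secret ngc-api --namespace=nim
}
```

Running `require_ngc_key && check_nim_secrets` after this step gives a quick confirmation before deploying anything that depends on the secrets.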
Step 4: Deploy a NIM
Use the NVIDIA NIM Helm chart. Example for the Llama 3.3 70B Instruct NIM:
helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-1.x.y.tgz \
--username='$oauthtoken' --password="$NGC_API_KEY"
cat > values.yaml <
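As an illustration only, a values file for this chart might look like the sketch below. Every key is an assumption based on commonly documented nim-llm chart options, and the image path, tag placeholder, and GPU count are hypothetical. Check them against the chart's own defaults (`helm show values` on the fetched archive) before relying on them.

```shell
# Hypothetical values for the nim-llm chart; every key is an assumption
# to verify against the chart's default values.yaml.
cat > values.yaml <<'EOF'
image:
  repository: nvcr.io/nim/meta/llama-3.3-70b-instruct  # assumed image path
  tag: "REPLACE_WITH_TAG"                              # choose a tag from NGC
model:
  ngcAPISecret: ngc-api      # generic secret created in Step 3
imagePullSecrets:
  - name: ngc-secret         # docker-registry secret created in Step 3
persistence:
  enabled: true              # keep the model cache across pod restarts
resources:
  limits:
    nvidia.com/gpu: 4        # assumption: size to your GPUs and model
EOF
```

The chart would then be installed into the nim namespace with this file passed via `-f values.yaml`. The Service name used in Step 5 (llama-nim-llm) suggests a release named llama, though that is inferred rather than stated.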
Step 5: Verify the Endpoint
# port-forward blocks, so run it in one terminal and the curl in another
kubectl port-forward -n nim svc/llama-nim-llm 8000:8000
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta/llama-3.3-70b-instruct",
"messages": [{"role":"user","content":"Hello"}]
}'
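A large model can take a while to download and load, so the first request may fail even though the pod is running. The sketch below polls the NIM readiness endpoint (`/v1/health/ready`) before sending traffic. The function name, retry count, and the NIM_POLL_SECONDS variable are illustrative assumptions.

```shell
# Poll the readiness endpoint until it answers successfully.
# Arguments (both optional): base URL, maximum number of attempts.
wait_for_nim() {
  base_url="${1:-http://localhost:8000}"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f turns HTTP errors into a non-zero exit status.
    if curl -fsS "$base_url/v1/health/ready" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "${NIM_POLL_SECONDS:-10}"
  done
  echo "NIM at $base_url not ready after $attempts attempts" >&2
  return 1
}
```

With the port-forward active, `wait_for_nim && curl http://localhost:8000/v1/models` would list the model names the endpoint accepts, which is a handy check on the "model" field used in the request above.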
Step 6: Add Autoscaling
NIM containers expose Prometheus metrics. Combine them with KEDA for queue-aware autoscaling:
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace
cat > scaler.yaml <
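For orientation, a queue-aware KEDA ScaledObject could be sketched as follows. The ScaledObject fields and the prometheus trigger are standard KEDA, but several specifics are assumptions to verify in your cluster: the workload kind and name, the Prometheus address (which presumes the kube-prometheus-stack release from Step 7, scraping the NIM pods), the metric name `num_requests_waiting`, and the threshold.

```shell
# Hypothetical ScaledObject: scale on the number of queued requests.
cat > scaler.yaml <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-nim-scaler          # illustrative name
  namespace: nim
spec:
  scaleTargetRef:
    kind: StatefulSet             # assumption: use Deployment if the chart creates one
    name: llama-nim-llm           # assumption: workload shares the Service name
  minReplicaCount: 1
  maxReplicaCount: 4              # assumption: cap at the GPUs you can spare
  triggers:
    - type: prometheus
      metadata:
        # assumption: service name produced by a release called "prometheus"
        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
        query: sum(num_requests_waiting)   # assumed NIM queue-depth metric
        threshold: "10"                    # target queued requests per replica
EOF
```

After checking the names, `kubectl apply -f scaler.yaml` would register the scaler. Since each new replica needs its own GPUs and has to load the model, scale-up will be slow compared with typical stateless services.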
Step 7: Add Observability
Install Prometheus and Grafana with the NVIDIA DCGM exporter dashboard:
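helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace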
# Import the NVIDIA DCGM dashboard (ID 12239) into Grafana
You now have GPU utilization, NIM request rates, and queue depth in one place.
Step 8: Front With an API Gateway
For production, add an API gateway in front of the NIM service for auth, rate limiting, and routing. Common choices are Kong, Envoy, and NGINX. The NIM exposes OpenAI-compatible endpoints, so any gateway that understands HTTP/JSON works.
Step 9: Multi-Model Routing
Most production deployments end up serving multiple models. A common pattern:
- Deploy each NIM as its own Kubernetes Service
- Use a thin gateway (Triton's model router or a custom service) to dispatch by model name
- Apply per-model autoscaling rules
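The dispatch step in that pattern comes down to a lookup from the request's model name to a Service URL. A minimal sketch of the lookup, with hypothetical Service names and a made-up second model, might look like this:

```shell
# Map a model name to the in-cluster URL of the Service that serves it.
# The second entry and both Service names are hypothetical placeholders.
route_model() {
  case "$1" in
    meta/llama-3.3-70b-instruct)
      echo "http://llama-nim-llm.nim.svc.cluster.local:8000" ;;
    example/second-model)
      echo "http://second-nim-llm.nim.svc.cluster.local:8000" ;;
    *)
      echo "unknown model: $1" >&2
      return 1 ;;
  esac
}
```

A real gateway would read the "model" field from the JSON body and proxy the request to the resolved URL. Keeping the mapping explicit makes it easy to return a clear error for models that are not deployed.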
Operational Checklist
- Back up the NIM model cache (/opt/nim/.cache) to save cold-start time
- Set GPU memory request/limit explicitly per replica
- Configure pod anti-affinity to spread replicas across GPU nodes
- Monitor token/sec, p99 latency, and queue depth as your golden metrics
- Plan model upgrades: NIMs version cleanly, but test new tags in staging first
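To make the anti-affinity item concrete, here is a sketch of a soft (preferred) rule that spreads replicas across nodes. The affinity fields are standard Kubernetes, but the pod label is an assumption to check against your pods with `kubectl get pods -n nim --show-labels`. Where the fragment goes depends on the chart: either an affinity value, if the chart exposes one, or the pod template spec directly.

```shell
# Hypothetical pod anti-affinity fragment: prefer placing replicas on
# different nodes, without blocking scheduling when that is impossible.
cat > anti-affinity.yaml <<'EOF'
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: nim-llm   # assumed pod label
EOF
```

A preferred rule is used here so that replicas can still schedule when GPU nodes are scarce. Switching to requiredDuringSchedulingIgnoredDuringExecution would enforce strict spreading at the cost of pending pods.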