Deploying NVIDIA NIM Microservices on a Kubernetes Cluster

NVIDIA NIM microservices give you OpenAI-compatible model endpoints in containers. Running one on a laptop is easy. Running them in production on a Kubernetes cluster, with autoscaling, observability, and tenant isolation, takes a structured approach. This tutorial walks through that path end to end.

Prerequisites

A Kubernetes cluster (1.28+) with GPU nodes (H100, H200, B200, RTX PRO 6000, or L40S recommended)
kubectl and helm installed locally
An NGC API key from build.nvidia.com (free for development)
Cluster admin permissions for the initial GPU Operator install

Step 1: Install the NVIDIA GPU Operator

The GPU Operator manages drivers, runtime, monitoring, and device plugins. Install it once per cluster:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

kubectl create namespace gpu-operator

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set dcgmExporter.enabled=true

Wait until every pod in the gpu-operator namespace reports Running (one-shot validation pods may show Completed instead):

kubectl get pods -n gpu-operator
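
A small poll can replace repeated manual checks. The sketch below is one way to do it, assuming some operator pods finish in the Completed state instead of Running. The helper name pods_settled is hypothetical; it only parses the output of kubectl get pods --no-headers passed on stdin.

```shell
# pods_settled (hypothetical helper): reads `kubectl get pods --no-headers`
# output on stdin. Succeeds only if at least one pod is listed and every
# STATUS column (field 3) is Running or Completed.
pods_settled() {
  awk '
    NF == 0 { next }                                   # skip blank lines
    { seen = 1 }
    $3 != "Running" && $3 != "Completed" { bad = 1 }   # e.g. Pending, Init:0/1
    END { exit !(seen && !bad) }
  '
}

# Example use against a live cluster (left commented out here):
#   until kubectl get pods -n gpu-operator --no-headers | pods_settled; do
#     sleep 10
#   done
```

The loop keeps polling every ten seconds and stops as soon as no pod is left in a pending, initializing, or failing state.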

Step 2: Verify GPUs Are Visible to Kubernetes

kubectl describe nodes | grep nvidia.com/gpu

You should see nvidia.com/gpu: N under both Capacity and Allocatable for each GPU node.
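
For a per-node summary, the allocatable count can also be read through custom columns. This is a sketch, not the only way to check: gpu_total is a hypothetical helper that sums the second column of the table and skips nodes that report no GPUs.

```shell
# gpu_total (hypothetical helper): sums the GPUS column of the table on stdin,
# skipping the header row and any non-numeric value such as <none>.
gpu_total() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

# Example use against a live cluster (left commented out here):
#   kubectl get nodes \
#     -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu' \
#     | gpu_total
```

A total of zero usually means the device plugin has not registered yet, so recheck the gpu-operator pods before moving on.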

Step 3: Create the NIM Namespace and Secrets

kubectl create namespace nim
kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY" \
  --namespace=nim

kubectl create secret generic ngc-api \
  --from-literal=NGC_API_KEY="$NGC_API_KEY" \
  --namespace=nim
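
Both secrets are built from $NGC_API_KEY, so an unset variable can leave you with empty credentials that only fail later at image pull time. A small guard, sketched here under the hypothetical name require_ngc_key, catches that before kubectl runs.

```shell
# require_ngc_key (hypothetical helper): fail early if NGC_API_KEY is unset or empty.
require_ngc_key() {
  if [ -z "${NGC_API_KEY:-}" ]; then
    echo "NGC_API_KEY is not set; export it before creating the secrets" >&2
    return 1
  fi
}

# Example use (left commented out here):
#   require_ngc_key && kubectl create secret generic ngc-api \
#     --from-literal=NGC_API_KEY="$NGC_API_KEY" --namespace=nim
```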

Step 4: Deploy a NIM

Use the NVIDIA NIM Helm chart. Example for the Llama 3.3 70B Instruct NIM:

helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-1.x.y.tgz \
  --username='$oauthtoken' --password="$NGC_API_KEY"

cat > values.yaml <


Step 5: Verify the Endpoint

Forward the service port in one terminal, then send a test request from a second terminal:

kubectl port-forward -n nim svc/llama-nim-llm 8000:8000

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [{"role":"user","content":"Hello"}]
  }'
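
A large model can take a while to download and load, so the first request may fail while the pod is still starting. The sketch below polls a readiness route before sending traffic. It assumes the NIM serves /v1/health/ready on the same port as the API; wait_for_nim and its arguments are hypothetical names.

```shell
# wait_for_nim (hypothetical helper): poll the readiness route until it
# answers with a 2xx status or the attempts run out.
#   $1 = base URL, $2 = max attempts (default 30), $3 = delay in seconds (default 10)
wait_for_nim() {
  nim_url=$1
  nim_attempts=${2:-30}
  nim_delay=${3:-10}
  nim_try=0
  while [ "$nim_try" -lt "$nim_attempts" ]; do
    # -f makes curl return non-zero on HTTP errors; output is discarded.
    if curl -fsS -o /dev/null "$nim_url/v1/health/ready"; then
      return 0
    fi
    nim_try=$((nim_try + 1))
    sleep "$nim_delay"
  done
  return 1
}

# Example use with the port-forward above (left commented out here):
#   wait_for_nim http://localhost:8000 && echo "NIM is ready"
```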

Step 6: Add Autoscaling

NIM containers expose Prometheus metrics. Combine them with KEDA for queue-aware autoscaling:

helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace

cat > scaler.yaml <
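
To reason about what a queue-aware rule will do, it helps to work the arithmetic by hand. Assuming the scaler targets an average queue depth per replica (the usual Horizontal Pod Autoscaler behaviour for an average-value metric), the replica count is the total queue depth divided by the per-replica target, rounded up and clamped to the configured bounds. The function and parameter names below are illustrative only.

```shell
# desired_replicas (illustrative helper): ceil(queued / per_replica),
# clamped to [min, max].
#   $1 = total queued requests, $2 = target queue depth per replica,
#   $3 = minimum replicas,      $4 = maximum replicas
desired_replicas() {
  queued=$1
  per_replica=$2
  min=$3
  max=$4
  if [ "$per_replica" -le 0 ]; then
    echo "per-replica target must be positive" >&2
    return 1
  fi
  # Integer ceiling division.
  want=$(( (queued + per_replica - 1) / per_replica ))
  if [ "$want" -lt "$min" ]; then want=$min; fi
  if [ "$want" -gt "$max" ]; then want=$max; fi
  echo "$want"
}

desired_replicas 25 10 1 8    # prints 3
desired_replicas 500 10 1 8   # prints 8 (capped at the maximum)
```

Because each replica holds at least one GPU, the maximum doubles as a cost ceiling, so size it against the GPUs actually available in the cluster.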

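Step 7: Add Observability

Install Prometheus and Grafana, then import the NVIDIA DCGM exporter dashboard:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace

# Import the NVIDIA DCGM dashboard (ID 12239) into Grafana

Once Prometheus is also scraping the NIM metrics endpoint, GPU utilization, NIM request rates, and queue depth are all visible in one place.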

Step 8: Front With an API Gateway

For production, add an API gateway in front of the NIM service for authentication, rate limiting, and routing. Common choices are Kong, Envoy, and NGINX. Because the NIM exposes OpenAI-compatible endpoints, any gateway that can proxy HTTP/JSON works.

Step 9: Multi-Model Routing

Most production deployments end up serving multiple models. A common pattern:

Deploy each NIM as its own Kubernetes Service
Use a thin gateway (Triton's model router or a custom service) to dispatch by model name
Apply per-model autoscaling rules
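
The dispatch step can be as small as a lookup from model name to Service URL. The sketch below assumes a second, hypothetical NIM (example/other-model behind a Service named other-nim-llm) next to the Llama Service from Step 5. A real gateway would read the model field from the request body and proxy the call; this only shows the mapping.

```shell
# route_model (hypothetical helper): map a model name to its in-cluster
# Service URL. Unknown models return a non-zero status so the caller can
# answer with an error instead of guessing a backend.
route_model() {
  case "$1" in
    meta/llama-3.3-70b-instruct)
      echo "http://llama-nim-llm.nim.svc:8000" ;;
    example/other-model)            # hypothetical second model
      echo "http://other-nim-llm.nim.svc:8000" ;;
    *)
      echo "no route for model: $1" >&2
      return 1 ;;
  esac
}

route_model meta/llama-3.3-70b-instruct   # prints http://llama-nim-llm.nim.svc:8000
```

Keeping one Service per model means each entry can scale and roll out independently of the others.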

Operational Checklist

Persist or back up the NIM model cache (/opt/nim/.cache) to cut cold-start time
Set GPU requests and limits (nvidia.com/gpu) explicitly per replica
Configure pod anti-affinity to spread replicas across GPU nodes
Monitor tokens/sec, p99 latency, and queue depth as your golden metrics
Plan model upgrades: NIMs version cleanly, but test new tags in staging first

Need help operationalizing NIM deployments? We help architect, deploy, and tune NIM-based inference platforms. Browse our NVIDIA AI Enterprise product page or contact our team.
