NVIDIA Rubin CPX: A New Class of GPU for Massive-Context Inference
Most discussions of GPU performance focus on training. Inference gets less ink, but it is where the bulk of the AI bill ultimately lands. NVIDIA’s Rubin CPX is the first GPU designed specifically for one phase of inference: context prefill. In this article we explain what that means, why it changes the economics of long-context AI, and where Rubin CPX fits in your roadmap.
Two Phases of LLM Inference
Every transformer inference call has two distinct phases:
- Prefill (context): The model ingests the prompt, possibly hundreds of thousands of tokens, and computes attention key/value caches. This phase is compute-heavy and embarrassingly parallel within a single request.
- Decode (generation): The model emits one token at a time, each step depending on all previous KV cache entries. This phase is memory-bandwidth bound and inherently sequential.
The two phases stress GPUs in different ways. Running both on identical hardware is the simple choice, but it is rarely the optimal one.
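To make the contrast concrete, here is a rough back-of-envelope sketch in Python. It assumes a dense decoder-only model in which a forward pass costs about 2 FLOPs per parameter per token and every weight is read from memory once per pass. The function name, the 70B parameter count, and the 1-byte weights are illustrative assumptions, not figures from NVIDIA.

```python
def arithmetic_intensity(params: float, tokens_per_pass: int,
                         bytes_per_param: float = 1.0) -> float:
    """Rough FLOPs per byte of weight traffic for one forward pass.

    Assumes ~2 FLOPs per parameter per token and that every weight is
    read from memory once per pass. Attention FLOPs and KV-cache reads
    are ignored, so this is a coarse estimate only.
    """
    flops = 2.0 * params * tokens_per_pass
    bytes_moved = params * bytes_per_param
    return flops / bytes_moved


PARAMS = 70e9  # hypothetical 70B-parameter model, 1 byte per weight

# Prefill pushes the whole prompt through the weights in one pass.
prefill_intensity = arithmetic_intensity(PARAMS, tokens_per_pass=8192)
# Single-request decode pushes one token through the same weights.
decode_intensity = arithmetic_intensity(PARAMS, tokens_per_pass=1)

print(f"prefill: {prefill_intensity:.0f} FLOPs/byte")
print(f"decode:  {decode_intensity:.0f} FLOPs/byte")
```

Under these assumptions prefill performs thousands of times more arithmetic per byte of weights than single-request decode. Batching many decode requests raises the decode figure, but each request still has to read its own KV cache at every step.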
The Disaggregated Inference Idea
Disaggregated inference assigns prefill and decode to separate pools of GPUs. Prefill nodes process the prompt and build the KV cache, which is then transferred over NVLink or a fast network fabric to the decode nodes that generate the output tokens. The idea is well established in research; Rubin CPX is NVIDIA's first GPU built specifically for the prefill side of that split.
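The cost of that handoff depends on how large the KV cache is. The sketch below estimates it for a standard transformer with grouped-query attention. The model shape (80 layers, 8 KV heads, head dimension 128), the 2-byte cache elements, and the 100 GB/s link are hypothetical values chosen for illustration; they do not describe Rubin CPX or any specific model.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for one sequence.

    Each layer stores one key tensor and one value tensor (hence the
    factor of 2), each with num_kv_heads * head_dim values per token.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(num_bytes: int, link_gbytes_per_s: float) -> float:
    """Time to move num_bytes over a link, ignoring latency and overlap."""
    return num_bytes / (link_gbytes_per_s * 1e9)


# Hypothetical model shape and a one-million-token prompt.
cache = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                       seq_len=1_000_000)
# Hypothetical 100 GB/s effective link bandwidth.
seconds = transfer_seconds(cache, link_gbytes_per_s=100.0)

print(f"KV cache: {cache / 1e9:.2f} GB, transfer: {seconds:.2f} s")
```

In practice a serving stack can stream the cache layer by layer while prefill is still running, so the visible delay may be smaller than this serial estimate.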
Why Now
The case for disaggregation strengthens as context windows grow. Three trends drive Rubin CPX:
- Million-token contexts are becoming common in coding agents and document analysis
- Agentic workflows repeatedly re-prefill on tool-call results, multiplying prefill demand
- Video and multimodal models tokenize huge inputs, dwarfing the cost of generation
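The agentic trend in the list above can be put in numbers with a small sketch. It counts how many tokens must be prefilled over a multi-turn agent session, with and without reuse of the KV cache from the previous turn. The function name and the token counts are hypothetical, and the model treats each turn as simply appending a fixed number of new tokens (tool output plus generated text) to the context.

```python
def total_prefill_tokens(initial_prompt: int, tokens_added_per_turn: int,
                         turns: int, reuse_prefix: bool) -> int:
    """Total tokens prefilled across a session of `turns` model calls."""
    total = 0
    context = initial_prompt  # tokens in the context at the current turn
    cached = 0                # tokens whose KV entries are already cached
    for _ in range(turns):
        if reuse_prefix:
            total += context - cached  # only the new suffix is prefilled
            cached = context
        else:
            total += context           # the whole history is prefilled again
        context += tokens_added_per_turn
    return total


# Hypothetical session: 20k-token prompt, 2k new tokens per turn, 10 turns.
without_reuse = total_prefill_tokens(20_000, 2_000, 10, reuse_prefix=False)
with_reuse = total_prefill_tokens(20_000, 2_000, 10, reuse_prefix=True)

print(without_reuse, with_reuse)
```

With these inputs the session prefills 290,000 tokens without cache reuse and 38,000 with it, which is why both prefill capacity and cache retention matter for agent workloads.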
What’s Different About CPX
NVIDIA has not published a complete spec sheet, but the architecture is optimized for:
- Cost-efficient GDDR7 memory rather than HBM (NVIDIA's announcement lists 128 GB), since compute-bound prefill depends less on memory bandwidth than decode does
- Compute density tuned for attention prefill rather than balanced for decode
- NVLink 6 connectivity to standard Rubin nodes for cache transfer
- Native NVFP4 precision
System Architecture
In a Rubin NVL72 rack you can mix CPX and standard Rubin GPUs. The serving stack (Triton Inference Server with disaggregated extensions, or a custom orchestrator) routes each incoming request through prefill on CPX, transfers the resulting KV cache over NVLink 6, and pins decode to standard Rubin GPUs, which are tuned to maximize tokens per second.
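The request flow described above can be sketched as a toy router. This is not Triton's API or any NVIDIA interface: the class names, worker names, and round-robin policy are all hypothetical, and the prefill and cache transfer steps are stand-ins that only record what a real system would do.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Request:
    request_id: str
    prompt_tokens: int
    max_new_tokens: int


@dataclass
class KVHandle:
    """Stand-in for a reference to a KV cache produced by prefill."""
    request_id: str
    prefill_worker: str
    num_tokens: int


class DisaggregatedRouter:
    """Toy router: round-robin over a prefill pool and a decode pool."""

    def __init__(self, prefill_workers: list[str], decode_workers: list[str]):
        self._prefill = cycle(prefill_workers)
        self._decode = cycle(decode_workers)

    def route(self, req: Request) -> dict:
        # 1. Pick a prefill worker (a CPX GPU in the article's design).
        prefill_worker = next(self._prefill)
        # 2. A real system would run prefill here; we only record the result.
        kv = KVHandle(req.request_id, prefill_worker, req.prompt_tokens)
        # 3. Pick a decode worker and hand it the KV cache.
        decode_worker = next(self._decode)
        return {
            "request_id": req.request_id,
            "prefill": kv.prefill_worker,
            "decode": decode_worker,
            "kv_tokens_transferred": kv.num_tokens,
        }


router = DisaggregatedRouter(["cpx-0", "cpx-1"], ["rubin-0"])
plan_a = router.route(Request("a", prompt_tokens=100_000, max_new_tokens=256))
plan_b = router.route(Request("b", prompt_tokens=50_000, max_new_tokens=512))
print(plan_a)
print(plan_b)
```

A production orchestrator would replace round-robin with load-aware and cache-aware placement, but the three steps (prefill, transfer, decode) stay the same.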
TCO Implications
For long-context workloads NVIDIA cites step-function improvements in tokens-per-dollar versus monolithic Rubin or Blackwell deployments. The exact ratio depends on:
- Prompt-to-output length ratio (CPX wins more as prompts grow)
- Request mix (a long-tailed distribution amortizes CPX better than a uniform one)
- Cache-reuse patterns (RAG with shared prefixes benefits from KV cache pinning)
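The first factor above, the prompt-to-output ratio, can be explored with a simple estimate of how much of a request's GPU time goes to prefill. The sketch assumes a single unbatched request on monolithic hardware where a GPU-second costs the same in either phase. The throughput figures (20,000 prefill tokens/s and 50 decode tokens/s) are placeholders, not measurements of any GPU.

```python
def prefill_time_share(prompt_tokens: int, output_tokens: int,
                       prefill_tok_per_s: float,
                       decode_tok_per_s: float) -> float:
    """Fraction of a request's GPU time spent in prefill."""
    prefill_time = prompt_tokens / prefill_tok_per_s
    decode_time = output_tokens / decode_tok_per_s
    return prefill_time / (prefill_time + decode_time)


# Placeholder throughputs; substitute measured values for your own stack.
PREFILL_TPS = 20_000.0
DECODE_TPS = 50.0

# Long-document workload: 1M-token prompt, 1k-token answer.
long_context = prefill_time_share(1_000_000, 1_000, PREFILL_TPS, DECODE_TPS)
# Chat workload: 500-token prompt, 1k-token answer.
short_chat = prefill_time_share(500, 1_000, PREFILL_TPS, DECODE_TPS)

print(f"long-context prefill share: {long_context:.1%}")
print(f"short-chat prefill share:   {short_chat:.1%}")
```

With these placeholder numbers prefill takes roughly 71% of the time for the long-context request and well under 1% for the chat request. The higher that share is across your traffic, the more a dedicated prefill pool has to offer.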
When Rubin CPX Is Not the Answer
Rubin CPX adds operational complexity. If your workload is dominated by short prompts and long outputs (chatbots with brief context, code completion with limited file scope), monolithic decode-optimized GPUs win. The disaggregated architecture pays off when prefill cost dominates total inference cost.
Software Maturity
The hardware is half the story. The other half is a serving stack that orchestrates prefill and decode pools, manages KV cache transfer, and respects per-tenant SLAs. NVIDIA is investing in Triton extensions and reference designs; expect open-source frameworks (vLLM, SGLang, TensorRT-LLM) to gain first-class CPX support over 2026.
Should You Buy CPX?
Yes, if you operate a long-context inference fleet at scale. No, if your inference workload is short-prompt or you are not yet on the Rubin platform. Maybe, if you are standing up a new fleet: in that case, start with a CPX-aware design even if your initial CPX mix is small.
Want a CPX sizing analysis for your workload mix? Browse our NVIDIA Rubin CPX product page or contact our team for a TCO model tailored to your traffic profile.