According to a recent LinkedIn post from Perplexity, the company is highlighting new research on serving post-trained Qwen3 235B models on NVIDIA GB200 NVL72 Blackwell racks. The post emphasizes that GB200 appears to deliver a major improvement over the prior Hopper generation for high-throughput inference on large mixture-of-experts (MoE) models, making it more than just a training platform.
The LinkedIn post explains that the prefill and decode phases stress hardware differently: prefill is characterized as compute-bound, and decode as latency- and memory-bound. Perplexity reports benchmark data suggesting that NVLS all-reduce latency drops from 586.1 microseconds on H200 to 313.3 microseconds on GB200, while MoE prefill combine time at EP=4 falls from 730.1 to 438.5 microseconds.
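To put the quoted latencies in relative terms, the short Python sketch below computes the percentage reduction and the implied speedup for each benchmark. The four microsecond values come from the post as reported above; the dictionary layout and the `relative_change` helper are illustrative names and are not part of Perplexity's research.

```python
# Latencies in microseconds as quoted from the post: (H200, GB200).
benchmarks = {
    "NVLS all-reduce": (586.1, 313.3),
    "MoE prefill combine (EP=4)": (730.1, 438.5),
}


def relative_change(before_us, after_us):
    """Return (percent reduction, speedup factor) for a latency pair."""
    reduction_pct = (before_us - after_us) / before_us * 100
    speedup = before_us / after_us
    return reduction_pct, speedup


# Hypothetical helper structure: benchmark name -> (reduction %, speedup).
results = {
    name: relative_change(h200_us, gb200_us)
    for name, (h200_us, gb200_us) in benchmarks.items()
}

for name, (reduction_pct, speedup) in results.items():
    print(f"{name}: {reduction_pct:.1f}% lower latency, {speedup:.2f}x faster")
```

On these figures, GB200 latency is roughly 47% lower for the NVLS all-reduce and roughly 40% lower for the MoE prefill combine step. These are single data points from the post, not end-to-end serving results.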
The post further indicates that GB200’s rack-scale NVLink domain enables levels of decode-phase parallelism that Hopper could not support, resulting in higher throughput at elevated token generation speeds. It also notes that features such as Blackwell Tensor Cores, enhanced memory bandwidth, SHARP reductions, Blackwell-native quantization, custom kernels, and rack-scale NVLink can translate into faster answers and lower serving costs for large-model inference.
For investors, this research focus suggests Perplexity is positioning its infrastructure to capture efficiency gains from NVIDIA’s latest data center platform, potentially improving unit economics as model sizes and traffic scale. If these performance and cost advantages prove durable, they could strengthen Perplexity’s competitiveness in AI search and assistant markets, while also underscoring NVIDIA’s strategic importance as a core technology partner for large-model inference at scale.

