According to a recent LinkedIn post, FriendliAI is emphasizing support for draft-model speculative decoding in its Dedicated Endpoints for large language model (LLM) inference. The post describes a setup in which FriendliAI trains the draft models and pairs them automatically, so that multiple tokens are predicted ahead and then verified in parallel, with the feature enabled through a single configuration toggle.
The company’s LinkedIn post highlights that this approach aims to accelerate output by allowing the target model to verify several proposed tokens in one forward pass, at roughly the cost of standard single-token decoding. The post also suggests that quality is preserved because verification follows the target model’s own next-token distribution, and that adopting the feature requires no extra training, model management, or application code changes.
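The propose-then-verify loop described in the post can be illustrated with a minimal Python sketch. This is not FriendliAI's implementation: `target_next` and `draft_next` are hypothetical toy functions standing in for real models, and the sketch uses the simpler greedy variant of speculative decoding, whereas production systems typically apply rejection sampling over the target model's full next-token distribution. In a real deployment, the verification step would be a single batched forward pass of the target model; here it is simulated with repeated function calls.

```python
from typing import Callable, List, Tuple

NextTokenFn = Callable[[List[int]], int]


def target_next(prefix: List[int]) -> int:
    """Hypothetical toy 'target model': deterministic greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_next(prefix: List[int]) -> int:
    """Hypothetical toy 'draft model': agrees with the target except
    when the prefix length is a multiple of 4."""
    tok = target_next(prefix)
    return tok if len(prefix) % 4 else (tok + 1) % 50


def speculative_step(prefix: List[int], k: int,
                     draft: NextTokenFn, target: NextTokenFn) -> List[int]:
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model scores all k + 1 positions. A real system would
    #    do this in one batched forward pass; this sketch loops instead.
    target_choice = [target(list(prefix) + proposed[:i]) for i in range(k + 1)]

    # 3. Accept the longest run of proposals that match the target, then
    #    append the target's own token (a correction or a bonus token).
    accepted: List[int] = []
    for i in range(k):
        if proposed[i] != target_choice[i]:
            break
        accepted.append(proposed[i])
    accepted.append(target_choice[len(accepted)])
    return accepted


def generate_speculative(prompt: List[int], n: int, k: int,
                         draft: NextTokenFn,
                         target: NextTokenFn) -> Tuple[List[int], int]:
    """Generate n tokens; also return the number of verification passes."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n:
        out.extend(speculative_step(out, k, draft, target))
        passes += 1
    return out[:len(prompt) + n], passes


def generate_greedy(prompt: List[int], n: int, target: NextTokenFn) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out


if __name__ == "__main__":
    spec, passes = generate_speculative([1, 2, 3], 20, 4, draft_next, target_next)
    print(spec == generate_greedy([1, 2, 3], 20, target_next), passes)
```

Because every emitted token is either confirmed or supplied by the target function, the speculative output matches plain greedy decoding token for token, while the number of verification passes is smaller than the number of generated tokens whenever the draft is right at least some of the time.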
According to the post, this speculative decoding method is positioned as a way to reduce the autoregressive bottleneck that typically slows long generations, particularly where memory bandwidth dominates inference latency. FriendliAI contrasts its method with N-gram speculative decoding, indicating that draft models can generalize beyond literal repetition and may better serve agentic pipelines, long-form generation, and code completion use cases.
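For contrast, the N-gram approach mentioned in the post can be sketched as a lookup over tokens already present in the context. The function below is a generic illustration under assumed names (`ngram_propose`, `n`, `k` are hypothetical), not FriendliAI's or any specific library's implementation. It shows why such a proposer only helps when the upcoming text literally repeats an earlier span: with no earlier match, it has nothing to propose.

```python
from typing import List


def ngram_propose(tokens: List[int], n: int, k: int) -> List[int]:
    """Propose up to k tokens by copying what followed the most recent
    earlier occurrence of the last n tokens. Returns [] if no match."""
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    # Scan right to left over earlier positions, skipping the suffix itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + k]
    return []


if __name__ == "__main__":
    # The context ends with (5, 6), which appeared earlier, so the
    # proposer copies the tokens that followed that earlier occurrence.
    print(ngram_propose([5, 6, 7, 8, 5, 6], n=2, k=3))
    # No earlier occurrence of (3, 4): nothing can be proposed.
    print(ngram_propose([1, 2, 3, 4], n=2, k=3))
```

A trained draft model, by comparison, can propose plausible continuations even when the context contains no repeated span, which is the generalization advantage the post attributes to the draft-model approach.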
The LinkedIn content notes support for several target models, including Gemma-4-31b-it, Kimi-K2.6, Qwen3.6-27B, GLM-5.1, GLM-5, DeepSeek-V3.2, and MiniMax-M2.5, suggesting a focus on compatibility with a range of high-capacity LLMs. For investors, this feature set may indicate an effort to differentiate FriendliAI in the competitive inference infrastructure market by addressing latency and developer friction, potentially improving customer retention and expanding usage among AI-intensive enterprises.

