FriendliAI – Weekly Recap

FriendliAI spent the week underscoring its positioning as a high-performance provider of AI inference infrastructure, highlighting both benchmark results and new product capabilities. Independent leaderboards from Artificial Analysis were cited as showing FriendliAI’s Model APIs delivering faster output speeds and lower latency than key third-party endpoints for open-weight models GLM-5.1 and Gemma-4-31B.

The company reported roughly 133 output tokens per second with about a one-second time-to-first-token for GLM-5.1, and around 62 tokens per second for Gemma-4-31B in non-reasoning workloads. FriendliAI framed these results as evidence that it delivers high output speed without sacrificing latency, emphasizing that its infrastructure is tuned for real-world production use rather than peak theoretical throughput.
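To put the two reported metrics together, a rough estimate of total response time is the time-to-first-token plus the number of output tokens divided by the output speed. The sketch below applies that to the GLM-5.1 figures quoted above; the function name and the 500-token response length are illustrative assumptions, and real latency also depends on factors the figures do not capture, such as prompt length and load.

```python
def estimate_response_time(ttft_s: float, tokens_per_s: float, n_tokens: int) -> float:
    """Rough end-to-end time: wait for the first token, then stream the rest.

    Hypothetical helper for illustration; it treats output speed as constant.
    """
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return ttft_s + n_tokens / tokens_per_s


if __name__ == "__main__":
    # Reported GLM-5.1 figures: ~1 s time-to-first-token, ~133 output tokens/s.
    # The 500-token response length is an assumed example, not a reported value.
    seconds = estimate_response_time(ttft_s=1.0, tokens_per_s=133.0, n_tokens=500)
    print(f"Estimated time for a 500-token response: {seconds:.1f} s")
```

Under these assumptions a 500-token answer would take a little under five seconds, with most of that time spent streaming output rather than waiting for the first token.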

Product updates focused on speculative decoding support in Dedicated Endpoints, aimed at accelerating large language model inference without major application changes. FriendliAI’s approach uses draft models that it trains and automatically pairs with target models: the draft model proposes several tokens ahead, and the target model then verifies them in parallel. The feature is enabled through a single configuration toggle.

The company stated that verification follows each target model’s own next-token distribution, which is intended to preserve output quality while keeping the computational cost per step close to standard decoding. This method is positioned to reduce autoregressive bottlenecks, particularly for long-form text, code completion, and agentic pipelines where memory bandwidth often dominates latency.
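The propose-then-verify loop described above can be illustrated with the generic speculative sampling scheme from the research literature. This is a minimal sketch under stated assumptions, not FriendliAI's implementation: the two toy "models" are hypothetical fixed distributions over three tokens, and a production system would score all proposed positions in one batched forward pass of the target model rather than in a Python loop.

```python
import random


def sample(dist, rng):
    """Draw one token from a {token: probability} mapping."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


def speculative_step(prefix, draft_model, target_model, k, rng):
    """Run one speculative step and return the newly emitted tokens (1 to k+1)."""
    # 1. The cheap draft model proposes k tokens one after another.
    proposed, draft_dists = [], []
    ctx = list(prefix)
    for _ in range(k):
        q = draft_model(ctx)
        tok = sample(q, rng)
        proposed.append(tok)
        draft_dists.append(q)
        ctx.append(tok)

    # 2. The target model scores every position (k proposals plus one bonus slot).
    target_dists = [target_model(list(prefix) + proposed[:i]) for i in range(k + 1)]

    # 3. Verify left to right: accept each proposal with probability min(1, p/q).
    emitted = []
    for i, tok in enumerate(proposed):
        p, q = target_dists[i], draft_dists[i]
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            emitted.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop this step.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        emitted.append(sample(residual, rng))
        return emitted

    # Every proposal was accepted: take one extra token from the target itself.
    emitted.append(sample(target_dists[k], rng))
    return emitted


# Hypothetical toy models: fixed next-token distributions that ignore the context.
def target_model(ctx):
    return {"a": 0.7, "b": 0.2, "c": 0.1}


def draft_model(ctx):
    return {"a": 0.5, "b": 0.3, "c": 0.2}


if __name__ == "__main__":
    rng = random.Random(0)
    print(speculative_step([], draft_model, target_model, k=4, rng=rng))
```

The accept-or-resample rule is what keeps the emitted tokens distributed according to the target model, which matches the quality-preservation claim in the text. The speedup comes from emitting several tokens per target-model pass whenever the draft model guesses well.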

FriendliAI noted that speculative decoding is available for several prominent open and regional models, including Gemma-4-31b-it, Kimi-K2.6, Qwen3.6-27B, GLM-5.1, GLM-5, DeepSeek-V3.2, and MiniMax-M2.5. By contrasting its draft-model approach with N-gram-based methods, the company suggested its solution can generalize beyond token repetition and may offer broader performance gains for complex workloads.

In parallel, FriendliAI expanded its model portfolio by adding DeepSeek AI’s DeepSeek-V4-Pro and DeepSeek-V4-Flash to its Dedicated Endpoints, both featuring 1 million-token context windows. DeepSeek-V4-Flash is marketed as a performance-efficiency option with 284 billion total parameters but only 13 billion active per token, while DeepSeek-V4-Pro targets more demanding reasoning, coding, and long-context tasks.
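The gap between total and active parameters is easier to judge as a ratio. The short calculation below uses only the two DeepSeek-V4-Flash figures quoted above; the function name is an illustrative assumption, and the ratio is a rough proxy for per-token compute, not a measurement of speed or cost.

```python
def active_fraction(active_params: float, total_params: float) -> float:
    """Share of a model's parameters used for each generated token."""
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    return active_params / total_params


if __name__ == "__main__":
    # Figures as reported for DeepSeek-V4-Flash: 13B active of 284B total.
    share = active_fraction(13e9, 284e9)
    print(f"Active share per token: {share:.1%}")
```

On the reported figures, fewer than one in twenty parameters is exercised per token, which is the basis for the "performance-efficiency" positioning.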

These DeepSeek models, which have quickly become heavily used on OpenRouter, are intended to serve high-capacity, cost-conscious workloads on FriendliAI’s platform. Strategically, offering both a throughput-optimized variant and a higher-capability tier allows the company to address a range of use cases, from AI agents and coding assistants to enterprise applications requiring extended context.

Across these announcements, FriendliAI is reinforcing a value proposition built around speed, latency, and open-weight model flexibility, rather than proprietary systems. If current adoption trends and benchmark advantages persist, the company’s focus on production-grade performance and ease of integration could support deeper enterprise usage and strengthen its competitive position in the AI infrastructure market.

Overall, the week’s developments portray FriendliAI as increasingly focused on differentiated inference capabilities, combining benchmarked speed, speculative decoding features, and access to frontier-level open-weight models to compete for high-volume, performance-sensitive AI workloads.
