In a recent LinkedIn post, FriendliAI highlighted Host KV Cache, an infrastructure feature of its Friendli Dedicated Endpoints. According to the post, when GPU memory is exhausted, key-value (KV) cache data can be offloaded to host memory and retrieved as needed, effectively tying cache capacity to system memory rather than to GPU VRAM limits.
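The post does not describe FriendliAI's implementation in detail, but the general pattern is a two-tier cache: hot KV blocks stay on the GPU, and cold blocks spill to host RAM instead of being discarded. The sketch below illustrates that idea under simple assumptions (an LRU eviction policy, block-granular offload); the class and method names are hypothetical and are not FriendliAI's API.

```python
# Illustrative sketch of a two-tier KV cache that spills least-recently-used
# blocks from a fixed GPU budget into host memory. Assumptions (LRU policy,
# block granularity) are ours, not details from FriendliAI's post.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, gpu_capacity_blocks: int):
        self.gpu_capacity = gpu_capacity_blocks
        self.gpu = OrderedDict()   # block_id -> KV block (resident on GPU)
        self.host = {}             # block_id -> KV block (offloaded to host RAM)

    def put(self, block_id, kv_block):
        # New blocks land in the GPU tier; once the GPU budget is exceeded,
        # evict LRU blocks to host memory instead of failing the request.
        self.gpu[block_id] = kv_block
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_capacity:
            evicted_id, evicted_kv = self.gpu.popitem(last=False)  # LRU block
            self.host[evicted_id] = evicted_kv  # offload rather than discard

    def get(self, block_id):
        # A host-tier hit is promoted back to the GPU on demand; in a real
        # system this is a host-to-device copy, not a dict move.
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.host:
            kv_block = self.host.pop(block_id)
            self.put(block_id, kv_block)  # may evict other blocks in turn
            return kv_block
        return None  # miss: the serving engine would recompute this block
```

Under this scheme, total cache capacity is bounded by host memory rather than VRAM; the trade-off is a host-to-device copy when an offloaded block is reused, which is typically far cheaper than recomputing the attention states from scratch.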
The post suggests this architecture is designed to support higher concurrency while maintaining long context windows, a key requirement for multi-turn conversations, document Q&A, and code assistants. For investors, this points to FriendliAI’s focus on performance optimization in inference workloads, a competitive factor in AI infrastructure that could enhance product stickiness and appeal to enterprise customers.
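A rough back-of-envelope calculation shows why VRAM is the binding constraint here. The model dimensions below are illustrative, based on a Llama-2-70B-style architecture with grouped-query attention; the post itself cites no specific figures.

```python
# Back-of-envelope KV cache sizing. All figures are illustrative assumptions
# (Llama-2-70B-style dimensions), not numbers from FriendliAI's post.
num_layers   = 80     # transformer layers
num_kv_heads = 8      # grouped-query attention KV heads
head_dim     = 128    # dimension per head
bytes_per_el = 2      # fp16

# Keys and values are both cached, hence the leading factor of 2.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el
print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")    # ~320 KiB

context_len = 32_768
per_sequence_gb = bytes_per_token * context_len / 1024**3
print(f"Per 32k-token sequence: {per_sequence_gb:.1f} GiB")       # ~10 GiB

# On an 80 GiB GPU, model weights plus a handful of such sequences exhaust
# VRAM; spilling cold blocks to host RAM (often hundreds of GiB) raises the
# ceiling on concurrent long-context requests.
print(f"Sequences fitting in 80 GiB (ignoring weights): {80 // per_sequence_gb:.0f}")
```

At roughly 10 GiB of KV cache per 32k-token sequence, even a top-end GPU serves only a few such requests at once from VRAM alone, which is the concurrency pressure that host offloading is meant to relieve.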
According to the post, the feature requires no API changes and can be enabled at endpoint creation, which may lower adoption friction for existing users. If the capability delivers the claimed benefits at scale, it could help FriendliAI differentiate in a crowded AI serving market, potentially supporting customer retention, higher usage-based revenue, and expansion into latency-sensitive, high-throughput applications.

