Protege is a private, AI-focused data infrastructure company that curates and licenses compliant, multimodal datasets. This weekly recap reviews notable developments shaping its positioning in healthcare AI and broader data infrastructure. The company’s recent updates center on new medical benchmarks for AI in clinical documentation and coding, alongside ongoing validation of its data-centric strategy and investor backing.
Protege detailed specialized benchmarks built from “uncontaminated, evaluation-ready” electronic medical record datasets linked to payer-approved bills, including raw clinical notes, submitted billing codes, and ancillary codes. By holding out these datasets from model pretraining at the patient level, the benchmarks are designed to reduce data contamination and benchmark inflation, addressing a key weakness of many public medical coding datasets.
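For readers unfamiliar with the term, a patient-level holdout means every record belonging to a given patient lands on the same side of the split, so no patient's notes can appear in both training and evaluation data. The following is a minimal illustrative sketch of that general idea, not Protege's actual method; the record fields (`patient_id`, `note`), the function name, and the 20% holdout fraction are all assumptions made for the example.

```python
import hashlib


def split_by_patient(records, holdout_fraction=0.2):
    """Split records so that all records of one patient share a side.

    Hypothetical helper: hashes the patient ID to a stable value in [0, 1)
    and sends the patient to the holdout set when that value falls below
    holdout_fraction.
    """
    train, holdout = [], []
    for rec in records:
        digest = hashlib.sha256(rec["patient_id"].encode("utf-8")).hexdigest()
        bucket = int(digest[:8], 16) / 0x100000000  # stable value in [0, 1)
        if bucket < holdout_fraction:
            holdout.append(rec)
        else:
            train.append(rec)
    return train, holdout


# Synthetic example data (assumed schema, not real clinical records).
records = [
    {"patient_id": "p1", "note": "visit 1"},
    {"patient_id": "p1", "note": "visit 2"},
    {"patient_id": "p2", "note": "visit 1"},
    {"patient_id": "p3", "note": "visit 1"},
    {"patient_id": "p3", "note": "visit 2"},
]

train, holdout = split_by_patient(records)
train_ids = {r["patient_id"] for r in train}
holdout_ids = {r["patient_id"] for r in holdout}
```

Because the assignment depends only on the patient ID, `train_ids` and `holdout_ids` never overlap, which is the property that distinguishes a patient-level split from a record-level one.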
The new benchmarks focus on payer-approved rather than merely submitted claims, aligning model evaluation with reimbursement and compliance outcomes critical to providers and payers. Protege collaborated with Vals AI to assess models on primary and secondary ICD code assignment and optimization of compliant code sets, with expert coder review used to validate performance and highlight practical gaps.
Reported results show models achieving about 88% accuracy on clinical documentation tasks but only 56% on medical coding, underscoring the higher complexity and structured reasoning required for billing. Protege characterizes medical coding as an evidence extraction and optimization problem that must account for disease severity, comorbidities, procedural context, and institution-specific standard operating procedures (SOPs), indicating that current AI still struggles with these nuanced requirements.
These findings reinforce Protege’s thesis that robust, payer-aligned benchmarks and datasets will remain critical infrastructure for healthcare AI, particularly in revenue-cycle, compliance, and administrative workflows. If widely adopted, the benchmarks could drive recurring demand for Protege’s specialized data products and evaluation frameworks, supporting deeper integration with healthcare AI developers and providers.
In parallel, CEO Bobby Samuels used an appearance on Andreessen Horowitz’s Raising Health podcast to reiterate that data, rather than compute or model architecture, has become the primary bottleneck in AI performance. He argued that unlocking real-world data at scale, while compensating data holders, is essential for the next wave of AI progress and sits at the core of Protege’s business model and platform design.
The podcast, hosted by a16z partners Daisy Wolf and Eva Steinman, who led Protege’s latest funding round, also highlighted that the company has completed three financings in under two years. This cadence suggests sustained venture confidence in its data infrastructure strategy and provides capital to expand dataset coverage, refine benchmarks, and build scalable products for regulated sectors like healthcare.
Samuels emphasized a philosophy that data owners should share in the value created from their assets, pointing toward structured partnerships and revenue-sharing arrangements as a foundation for long-term data access. Such an approach may help Protege navigate regulatory, privacy, and ethical constraints while attracting institutional data providers seeking compliant monetization paths for their information.
From a financial and strategic perspective, the combination of rigorous, payer-aligned medical benchmarks and continued backing from a high-profile venture firm positions Protege as a potential key player in AI data infrastructure. While long-term outcomes will depend on execution, competition, and policy, this week’s updates indicate steady progress in deepening its healthcare footprint and reinforcing its role as a trusted provider of high-quality, real-world AI datasets.
Overall, the week underscored Protege’s focus on closing the AI data gap in healthcare through specialized benchmarks and a compensated data-partnership model, supported by ongoing investor confidence and a clear thesis that high-quality, compliant data remains the core constraint and opportunity in AI development.

