A LinkedIn post from Deccan AI describes internal benchmarking work focused on the reasoning capabilities of so‑called deep research agents, beyond their ability to retrieve information. The post contrasts retrieval with higher‑order tasks such as outcome forecasting and hypothesis generation, suggesting that existing benchmarks may underweight these dimensions.
According to the post, Deccan AI designed new rubrics, including “Outcome Forecasting” and “Future Scope,” and evaluated 104 PhD‑level prompts on ChatGPT Deep Research and Gemini Deep Research, with scoring performed by subject‑matter experts. The results cited indicate a disconnect between retrieval and reasoning, with ChatGPT reportedly achieving a 74% retrieval score but only 33% on forecasting, while Gemini allegedly performed better on forecasting but weaker on retrieval.
The company’s LinkedIn commentary also highlights what it terms a “compliance paradox,” in which adding more specific instructions to prompts degraded performance in both models, though in different ways. The post implies that this behavior could have implications for how enterprises design evaluation prompts and deploy research agents in production workflows, particularly where reliability and interpretability are critical.
For investors, this emphasis on nuanced benchmarking suggests Deccan AI may be positioning itself as a specialist in evaluation frameworks and agent reliability, rather than just model deployment. If the firm can convert these insights into proprietary tooling or consulting offerings, it could tap into growing enterprise demand for robust AI evaluation, potentially enhancing its competitive standing within the AI tooling and model‑ops ecosystem.
The focus on reasoning‑centric metrics may also reflect a strategic attempt to differentiate from vendors that primarily emphasize retrieval or raw model performance scores. As enterprises in regulated or high‑stakes domains seek stronger assurances around forecasting and decision support, capabilities in designing and validating such benchmarks could become a meaningful driver of customer adoption and pricing power for companies like Deccan AI.