According to a recent LinkedIn post from Turing, the company is working with ServiceNow on EnterpriseOps-Gym, a benchmark aimed at evaluating how AI agents perform on real-world enterprise workflows. The post highlights that Turing contributed more than 1,000 prompts spanning HR, IT service management, customer service, email, calendar, drive, collaboration tools, and hybrid workflows.
The LinkedIn post indicates that benchmarked tasks ranged from seven to thirty steps and incorporated real policy constraints, with outcomes evaluated by deterministic verifier scripts that checked actual system state rather than just output quality. According to the post, the top frontier model completed only 37.4% of tasks, while providing human-authored plans improved completion rates by 14 to 35 percentage points.
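The post does not describe the verifier scripts themselves, but the idea of deterministic, state-based checking can be sketched in a few lines. The toy ITSM state, the ticket ID, and the function below are all hypothetical illustrations, not details from EnterpriseOps-Gym: the point is that success is judged by inspecting the final system state, not by grading the agent's text output.

```python
# Hypothetical sketch of a deterministic, state-based verifier.
# All names and the toy "ticket system" state are illustrative,
# not taken from EnterpriseOps-Gym.

def verify_ticket_resolution(state: dict) -> bool:
    """Pass only if the workflow left the system in the expected end state."""
    ticket = state.get("tickets", {}).get("INC-1001")
    if ticket is None:
        return False
    return (
        ticket.get("status") == "resolved"        # ticket actually closed
        and ticket.get("assignee") is not None    # ownership recorded
        and "resolution_note" in ticket           # policy: note required
        # last audit entry must show the resolution actually happened
        and state.get("audit_log", [])[-1:] == [("INC-1001", "resolved")]
    )

# A plausible end state after an agent run; the verifier returns the same
# answer every time for this state, regardless of what the agent "said".
final_state = {
    "tickets": {
        "INC-1001": {
            "status": "resolved",
            "assignee": "alice",
            "resolution_note": "Rebooted the VPN gateway.",
        }
    },
    "audit_log": [("INC-1001", "assigned"), ("INC-1001", "resolved")],
}
```

A verifier like this is binary and repeatable, which is what makes low headline completion rates (such as the reported 37.4%) hard to dispute: a task either left the system in the required state or it did not.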
For investors, the post suggests a sizable gap between current AI agent capabilities and the reliability required for complex enterprise operations, which may sustain demand for specialized tooling, orchestration, and evaluation frameworks. Turing’s role in building a domain-specific benchmark with a large enterprise software provider could strengthen its positioning in the enterprise AI stack, potentially supporting future monetization via benchmarking, consulting, or workflow-automation solutions.
The emphasis on planning as a key bottleneck, rather than raw model capability, points to a near-term focus on agent design, workflow engineering, and integration with enterprise systems. If EnterpriseOps-Gym gains traction as a reference standard, Turing could benefit from increased visibility among large corporate buyers evaluating AI agents for mission-critical processes, though the post does not provide information on commercial terms or revenue impact.
The post also cautions that enterprises deploying AI agents without testing them on “long-horizon, stateful workflows” may face evaluation blind spots, implying potential operational risk for adopters. This perspective may incentivize more rigorous testing and validation budgets, creating an adjacent market opportunity for firms like Turing that position themselves around benchmarking, safety, and reliability in enterprise AI deployments.

