According to a recent LinkedIn post from Turing, the company collaborated with ServiceNow on EnterpriseOps-Gym, an evaluation benchmark focused on how enterprise work is actually performed. The post notes that Turing contributed more than 1,000 prompts spanning HR, IT service management, customer service, email, calendars, file storage, collaboration tools, and hybrid workflows.
The post highlights that benchmark tasks involved 7 to 30 steps and incorporated real policy constraints, with performance assessed via deterministic verifier scripts that checked underlying system state rather than just output quality. According to the post, the top frontier model achieved a task completion rate of only 37.4 percent on these long-horizon, stateful workflows.
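To illustrate what checking system state rather than output quality can mean in practice, here is a minimal Python sketch. It is not taken from EnterpriseOps-Gym: the `TicketSystem` class, the `INC-1` ticket, the policy rule, and the `verify_task` and `completion_rate` helpers are all hypothetical names assumed for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TicketSystem:
    """Hypothetical in-memory stand-in for an IT service management backend."""
    tickets: dict = field(default_factory=dict)

    def create(self, ticket_id, priority):
        self.tickets[ticket_id] = {
            "priority": priority,
            "assignee": None,
            "status": "open",
        }

    def assign(self, ticket_id, assignee):
        self.tickets[ticket_id]["assignee"] = assignee

    def close(self, ticket_id):
        self.tickets[ticket_id]["status"] = "closed"


def verify_task(system):
    """Deterministic check on the final state; the agent's own report is ignored."""
    ticket = system.tickets.get("INC-1")
    if ticket is None:
        return False
    # Assumed policy constraint: a ticket counts as resolved only if it
    # was assigned to someone before being closed.
    return ticket["status"] == "closed" and ticket["assignee"] is not None


def completion_rate(results):
    """Percentage of tasks whose verifier returned True."""
    return 100.0 * sum(results) / len(results)
```

In this sketch, an agent that closes the ticket without assigning it fails verification even if its final message claims success, which is the kind of gap an output-only evaluation would miss.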
As described in the post, providing AI agents with human-authored plans improved completion rates by 14 to 35 percentage points, suggesting that planning, rather than raw model capability, may be the main bottleneck in complex enterprise scenarios. The post argues that organizations deploying enterprise agents without this kind of rigorous, workflow-level evaluation may be leaving significant blind spots in their testing.
For investors, the collaboration with ServiceNow and the creation of EnterpriseOps-Gym may position Turing as a specialist in benchmarking and improving AI agents for real-world enterprise operations. If adopted more broadly, such benchmarks could enhance Turing’s influence in enterprise AI tooling, potentially supporting demand for its services as companies seek reliable deployment metrics for agentic AI systems.
The post also implicitly underscores a gap between current AI hype and practical performance in multi-step, policy-constrained environments, which could shape procurement decisions and budget allocation toward evaluation and orchestration layers. This dynamic may benefit vendors like Turing that can demonstrate measurable gains in task completion and risk management, particularly in large organizations with complex workflows and compliance requirements.

