Troveo Expands Licensed AI Training Data Platform Across Five New High-Value Domains

29 April 2026 | News

New datasets in audio, text, robotics, gameplay, and enterprise workflows aim to solve AI’s biggest bottleneck—access to high-quality, rights-cleared training data

Troveo, the world’s largest provider of licensed real-world data for AI, announced a major expansion of its platform into five new data categories. Troveo is accelerating AI development by providing training-ready datasets at scale.

Troveo has now paid out more than $20 million to content owners, underscoring strong demand from AI labs and model builders for licensed, rights-cleared training data not available on the public internet.

Access to training data is a key bottleneck for developing next-generation frontier models. Troveo’s labeled datasets allow labs to train much more efficiently and reach significantly higher model performance. Troveo can also help labs meet aggressive deadlines by delivering the data immediately.

This data is difficult to access and often lives in broadcast archives, studio vaults, enterprise systems and private collections. Accessing it requires relationships, infrastructure and legal groundwork.

Troveo has proven this model with video, building a library of more than 8 million hours of licensed video footage.

Now Troveo is expanding into five new domains:

Audio: Four million hours of single- and multi-channel audio across dozens of languages and dialects, used to develop voice-based models including automatic speech recognition, voice assistance and conversational AI.
Text: Billions of words sourced from publishers and other rights holders, structured for training, fine-tuning and evaluation.
Agentic trajectories (enterprise workflow traces): Real-world business data sourced directly from companies across a range of industries that captures actual enterprise workflows.
Gameplay: Video game data, including time-synced keystroke and character progression metadata used for frontier world models.
Egocentric robotics: Real-world, first-person perspective data from real operating environments that power the robotics world.

“Beyond access to compute and top-tier talent, training data remains the biggest bottleneck for building the next generation of AI models. The most valuable data for solving that is real-world, meaning it captures the complexity of how people actually live and work,” said Marty Pesis, founder and CEO of Troveo. “It is clean, accurately labeled and ready to train on. And it’s non-public, meaning it has not been incorporated into a prior training run. It lives in archives, hard drives and operating environments that nobody has indexed or packaged for AI. Troveo delivers this data directly into the training environments of the world’s top labs.”

Legal and competitive pressures around training data have intensified across the AI industry. More model builders are seeking data pipelines that are legally defensible and traceable to rights holders. Every dataset in Troveo’s library is sourced and licensed from content owners. Troveo works with thousands of content owners and has active relationships with AI labs and model builders across the industry, including some of the largest technology companies in the world.

Troveo will continue to release new datasets on a regular cadence across all six categories. The full catalog is available at troveo.ai.
