29 April 2026 | News
Troveo, the world’s largest provider of licensed real-world data for AI, announced a major expansion of its platform into five new data categories. Troveo is accelerating AI development by providing training-ready datasets at scale.
Troveo has now paid out more than $20 million to content owners, underscoring strong demand from AI labs and model builders for licensed, rights-cleared training data not available on the public internet.
Access to training data is a key bottleneck in developing next-generation frontier models. Troveo's labeled datasets allow labs to train much more efficiently and reach significantly better model performance. Troveo can also help labs meet aggressive deadlines by delivering the data immediately.
This data is difficult to access and often lives in broadcast archives, studio vaults, enterprise systems and private collections. Accessing it requires relationships, infrastructure and legal groundwork.
Troveo has proven this model with video, building a library of more than 8 million hours of licensed video footage.
Now Troveo is expanding into five new domains:
“Beyond access to compute and top-tier talent, training data remains the biggest bottleneck for building the next generation of AI models. The most valuable data for solving that is real-world, meaning it captures the complexity of how people actually live and work,” said Marty Pesis, founder and CEO of Troveo. “It is clean, accurately labeled and ready to train on. And it’s non-public, meaning it has not been incorporated into a prior training run. It lives in archives, hard drives and operating environments that nobody has indexed or packaged for AI. Troveo delivers this data directly into the training environments of the world’s top labs.”
Legal and competitive pressures around training data have intensified across the AI industry. More model builders are seeking data pipelines that are legally defensible and traceable to rights holders. Every dataset in Troveo’s library is sourced and licensed from content owners. Troveo works with thousands of content owners and has active relationships with AI labs and model builders across the industry, including some of the largest technology companies in the world.
Troveo will continue to release new datasets on a regular cadence across all six categories. The full catalog is available at troveo.ai.