
Inside Big Tech's underground race to buy AI training data

Social media logos are seen through a magnifier in this illustration taken May 25, 2021.
PHOTO: Reuters file

NEW YORK — At its peak in the early 2000s, Photobucket was the world's top image-hosting site. The media backbone for once-hot services like Myspace and Friendster, it boasted 70 million users and accounted for nearly half of the US online photo market.

Today only two million people still use Photobucket, according to analytics tracker Similarweb. But the generative AI revolution may give it a new lease of life.

CEO Ted Leonard, who runs the 40-strong company out of Edwards, Colorado, told Reuters he is in talks with multiple tech companies to license Photobucket's 13 billion photos and videos to be used to train generative AI models that can produce new content in response to text prompts.

He has discussed rates of between five cents and US$1 (S$1.35) per photo and more than US$1 per video, he said, with prices varying widely by both the buyer and the types of imagery sought.

"We've spoken to companies that have said, 'we need way more,'" Leonard added, with one buyer telling him they wanted over a billion videos, more than his platform has.

"You scratch your head and say, where do you get that?"

Photobucket declined to identify its prospective buyers, citing commercial confidentiality. The ongoing negotiations, which haven't been previously reported, suggest the company could be sitting on billions of dollars' worth of content and give a glimpse into a bustling data market that's arising in the rush to dominate generative AI technology.

Tech giants like Google, Meta and Microsoft-backed OpenAI initially used reams of data scraped from the internet for free to train generative AI models like ChatGPT that can mimic human creativity. They have said that doing so is both legal and ethical, though they face lawsuits from a string of copyright holders over the practice.


At the same time, these tech companies are also quietly paying for content locked behind paywalls and login screens, giving rise to a hidden trade in everything from chat logs to long forgotten personal photos from faded social media apps.

"There is a rush right now to go for copyright holders that have private collections of stuff that is not available to be scraped," said Edward Klaris from law firm Klaris Law, which says it's advising content owners on deals worth tens of millions of dollars apiece to license archives of photos, movies and books for AI training.

Reuters spoke to more than 30 people with knowledge of AI data deals, including current and former executives at companies involved, lawyers and consultants, to provide the first in-depth exploration of this fledgling market — detailing the types of content being bought, the prices materialising, plus emerging concerns about the risk of personal data making its way into AI models without people's knowledge or explicit consent.

OpenAI, Google, Meta, Microsoft, Apple and Amazon all declined to comment on specific data deals and discussions for this article, although Microsoft and Google referred Reuters to supplier codes of conduct that include data-privacy provisions.

Google added that it would "take immediate action, up to and including termination" of its agreement with a supplier if it discovered a violation.

Many major market research firms say they have not even begun to estimate the size of the opaque AI data market, where companies often don't disclose agreements. Those researchers who do, such as Business Research Insights, put the market at roughly US$2.5 billion now and forecast it could grow close to US$30 billion within a decade.

Generative data gold rush

The data land grab comes as makers of big generative AI "foundation" models face increasing pressure to account for the massive amounts of content they feed into their systems, a process known as "training" that requires intensive computing power and often takes months to complete.

Tech companies say the technology would be cost-prohibitive if they couldn't use vast archives of free scraped web page data, such as those provided by non-profit repository Common Crawl, which they describe as "publicly available".

Their approach has nonetheless drawn a wave of copyright lawsuits and regulatory heat, while prompting publishers to add code to their websites to block scraping.

In response, AI model makers have started hedging risks and securing data-supply chains, both through deals with content owners and via a burgeoning industry of data brokers that has popped up to satisfy demand.

Meta CEO Mark Zuckerberg delivers a speech, as the letters AI for artificial intelligence appear on screen, at the Meta Connect event at the company's headquarters in Menlo Park, California, US, Sept 27, 2023.
PHOTO: Reuters file

In the months after ChatGPT debuted in late 2022, for instance, companies including Meta, Google, Amazon and Apple all struck agreements with stock image provider Shutterstock to use hundreds of millions of images, videos and music files in its library for training, according to a person familiar with the arrangements.

The deals with Big Tech firms initially ranged from US$25 million to US$50 million each, though most were later expanded, Shutterstock's Chief Financial Officer Jarrod Yahes told Reuters. Smaller tech players have followed suit, spurring a fresh "flurry of activity" in the past two months, he added.

Yahes declined to comment on individual contracts. The Apple agreement, and the size of the other deals, haven't previously been made public.

A Shutterstock competitor, Freepik, told Reuters it had struck agreements with two large tech companies to license the majority of its archive of 200 million images at two to four cents per image. There are five more similar deals in the pipeline, said CEO Joaquin Cuenca Abela, declining to identify buyers.

OpenAI, an early Shutterstock customer, has also signed licensing agreements with at least four news organisations, including The Associated Press and Axel Springer. Thomson Reuters, the owner of Reuters News, separately said it has struck deals to license news content to help train AI large language models, but didn't disclose details.

'Ethically sourced' content

An industry of dedicated AI data firms is emerging too, securing rights to real-world content like podcasts, short-form videos and interactions with digital assistants, while also building networks of short-term contract workers to produce custom visuals and voice samples from scratch, in an Uber-esque gig economy for data.

Seattle-based Defined.ai licenses data to a range of companies including Google, Meta, Apple, Amazon and Microsoft, CEO Daniela Braga told Reuters.


Rates vary by buyer and content type, but Braga said companies are generally willing to pay US$1 to US$2 per image, US$2 to US$4 per short-form video and US$100 to US$300 per hour of longer films. The market rate for text is US$0.001 per word, she added.

Images of nudity, which require the most sensitive handling, go for US$5 to US$7, she said.

Defined.ai splits those earnings with content providers, Braga said. It markets its datasets as "ethically sourced", as it obtains consent from people whose data it uses and strips out personally identifying information, she added.

One of the firm's suppliers, a Brazil-based entrepreneur, said he pays owners of the photos, podcasts and medical data he sources about 20 per cent to 30 per cent of total deal amounts.

The priciest images in his portfolio are those used to train AI systems that block content like graphic violence barred by the tech companies, said the supplier, who spoke on condition his company wasn't identified, citing commercial sensitivity.

To fulfil those requests, he obtains images of crime scenes, conflict violence and surgeries — mainly from police, freelance photojournalists and medical students, respectively — often in places in South America and Africa where distributing graphic images is more common, he said.

He said he has received images from freelance photographers in Gaza since the start of the war there in October, plus some from Israel at the outset of hostilities.

His company hires nurses accustomed to seeing violent injuries to anonymise and annotate the images, which are disturbing to untrained eyes, he added.

'I would find it risky'

While licensing could resolve some legal and ethical issues, resurrecting the archives of old internet names like Photobucket as fuel for the latest AI models raises others, particularly around user privacy, according to many of the industry players interviewed.

AI systems have been caught regurgitating exact copies of their training data, spitting out, for example, the Getty Images watermark, verbatim paragraphs of New York Times articles and images of real people. That means a person's private photos or intimate thoughts posted decades ago could potentially wind up in generative AI outputs without notice or explicit consent.

Ted Leonard, Chief Executive Officer of Photobucket, poses for a portrait in Edwards, Colorado, US, March 6, 2024.
PHOTO: Reuters file

Photobucket CEO Leonard says he is on solid legal ground, citing an update to the company's terms of service in October that grants it the "unrestricted right" to sell any uploaded content for the purpose of training AI systems. He sees licensing data as an alternative to selling ads.

"We need to pay our bills, and this could give us the ability to continue to support free accounts," he said.

Defined.ai's Braga said she avoids acquiring content from "platform" companies like Photobucket and prefers to source social media photos from influencers who create them, who she said have a clearer claim to licensing rights.

"I would find it very risky," Braga said of platform content. "If there's some AI that generates something that resembles a picture of someone who never approved that, that's a problem."

Photobucket is not alone among platforms in embracing licensing. Tumblr's parent company Automattic said last month it was sharing content with "select AI companies". In February, Reuters reported Reddit struck a deal with Google to make its content available for training the latter's AI models.

Ahead of its initial public offering in March, Reddit disclosed that its data-licensing business is the subject of a US Federal Trade Commission inquiry and acknowledged it could fall foul of evolving privacy and intellectual-property regulations.

The FTC, which warned businesses in February against retroactively changing terms of service for AI usage, declined to comment on the Reddit inquiry or say whether it was looking into other training data deals.


Source: Reuters
