RE: LeoThread 2024-08-31 09:20

in LeoFinance, 5 months ago

The org behind the dataset used to train Stable Diffusion claims it has removed CSAM

LAION, the German nonprofit group behind the dataset used to train Stable Diffusion, among other generative AI models, claims it has removed suspected CSAM from its training datasets.

LAION, the German research org that created the data used to train Stable Diffusion, among other generative AI models, has released a new dataset that it claims has been “thoroughly cleaned of known links to suspected child sexual abuse material (CSAM).”

#newsonleo #stablediffusion #ai #technology

The new dataset, Re-LAION-5B, is actually a re-release of an old dataset, LAION-5B — but with “fixes” implemented with recommendations from the nonprofit Internet Watch Foundation, Human Rights Watch, the Canadian Center for Child Protection and the now-defunct Stanford Internet Observatory. It’s available for download in two versions, Re-LAION-5B Research and Re-LAION-5B Research-Safe (which also removes additional NSFW content), both of which were filtered for thousands of links to known — and “likely” — CSAM, LAION says.

“LAION has been committed to removing illegal content from its datasets from the very beginning and has implemented appropriate measures to achieve this from the outset,” LAION wrote in a blog post. “LAION strictly adheres to the principle that illegal content is removed ASAP after it becomes known.”

It is important to note that LAION’s datasets don’t contain images, and never did. Rather, they’re indexes of links to images and image alt text that LAION curated, all of which came from a different dataset of scraped sites and web pages, the Common Crawl.

The release of Re-LAION-5B comes after an investigation in December 2023 by the Stanford Internet Observatory that found that LAION-5B — specifically a subset called LAION-5B 400M — included at least 1,679 links to illegal images scraped from social media posts and popular adult websites. According to the report, 400M also contained links to “a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes.”