Amnesty warns unlawful scraping may be feeding generative AI

Amnesty International says some AI developers have relied on unlawful web scraping to collect large volumes of online data for generative models, raising privacy and human-rights concerns. The group is calling on governments to prohibit these data-collection systems and to regulate how training data is gathered.

Amnesty International raises concerns about unlawful data-collection systems used to train generative AI

Amnesty International says tech companies have used unlawful web scraping to gather large volumes of online data for generative AI training. According to the group, those collection practices can violate privacy rights and broader human-rights standards when personal information is copied, processed, and repurposed without meaningful notice or consent. The warning adds to a growing debate over whether publicly accessible online content can be treated as fair game for model development simply because it is easy to collect at scale.

The organization is urging governments to prohibit these data-collection systems and to impose stronger oversight on how AI training data is sourced. For AI builders, the immediate issue is not only legal exposure after deployment, but also upstream risk in the data pipeline: if provenance is weak, compliance problems may be embedded into the model before training is complete. That puts pressure on vendors and in-house teams to document collection methods, retention practices, and the legal basis for using scraped data.

AI teams should treat data provenance as a compliance issue, not just a model-quality issue, because regulators and rights groups are focusing on how training sets were assembled in the first place.
Unlawful scraping can create privacy, consent, and human-rights exposure before model training even begins, which means downstream safeguards will not fully solve an upstream collection problem.
Policy pressure is likely to increase on organizations that cannot document where training data came from, especially if they rely on third-party datasets or opaque web-scale collection practices.

Daily Brief · Jul 17, 2026 · 2 min