A team of researchers primarily from Google’s DeepMind systematically convinced ChatGPT to reveal snippets of the data it was trained on using a new type of attack prompt which asked a production model of the chatbot to repeat specific words forever.
Using this tactic, the researchers showed that there are large amounts of personally identifiable information (PII) in OpenAI’s large language models. They also showed that, on a public version of ChatGPT, the chatbot spit out large passages of text scraped verbatim from other places on the internet.
So not only are these things cases of mass copyright infringement, they also violate countless privacy laws.
Cool.
Thom Holwerda,
I get the copyright argument, but privacy? Unless they’ve used privileged/confidential information during the training, it seems to me that it’s only echoing data that has already been made public anyway. We all agree that anything posted on the internet is not private, right? It’d be extremely naive to expect otherwise. IMHO the main problem is that the data is being used without permission.
I would think that the major issue here is that the data might have been inadvertently leaked by third parties and is now exposed by ChatGPT. Think of it as someone sharing a spreadsheet over Office 365 web but failing to restrict the permissions to specific people, so that it now shows up in a Bing search result. It would be interesting to see how mining data from the Internet indiscriminately would work in this case. After all, search engines are already required to block data violating DMCA and COPPA.
adkilla
Even so, companies including Google use these same sources of internet data without permission every single day. The article was censored (including the website address, really??) so I can’t look up the website in question, but if that data is published on that website, then their example is a big nothing burger. Furthermore, if that same data is in Google’s index and page cache, it’s mighty hypocritical for Google to point the finger at others.
I wouldn’t think Bing publishes such things from their own services, but assuming they do and somebody else took the data, then it may be too late for privacy. Of course the owner can go to other parties and ask/compel them to take it down, but just like revenge porn that gets published, unfortunately the horse may be out of the stable.
I’d expect that those hosting AI services would be subject to takedown notices just like everybody else. Technically it might mean they’ll have to add a filter for the affected person’s details to comply, in much the same way that some services return “this content is not available in your area” to comply with various laws.
Honestly, I still don’t get how this is any different from Google reading through the same websites with its spiders, and your email, and your habits, and using it all to return search results to you with snippets of content. A search for “OSnews” returns “2 days ago — It takes advantage of advanced features of the Linux kernel to provide low latency, low footprint, and high performance while being”. That content is just as copyrighted as the results ChatGPT returns.
Adurbe,
Indeed.
Website owners might not be aware of this, but at least going forward, anyone wanting to block ChatGPT’s web crawlers can do so using the exact same mechanism that would be used to block Google’s web crawlers: robots.txt…
https://www.pluralsight.com/resources/blog/data/blocking-ChatGPT-OpenAI-website-crawling
https://www.aldomedia.com/blog/how-to-block-chatgpt-in-robots-but-why-you-shouldnt
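For anyone who wants to see roughly what that looks like, here is a minimal sketch, not any particular site’s actual file. It assumes a site that wants to turn away OpenAI’s crawler (which identifies itself with the user-agent token GPTBot) while leaving everyone else alone, and it uses Python’s standard urllib.robotparser to check how a well-behaved crawler would read those rules. The example.com URL and the is_allowed helper are made up for illustration. Keep in mind robots.txt is only advisory: it works because crawlers choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (an assumption for illustration only).
# "GPTBot" is the user-agent token OpenAI documents for its crawler.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let this user agent fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/story"  # placeholder URL
    for agent in ("GPTBot", "Googlebot"):
        print(agent, "allowed:", is_allowed(SAMPLE_ROBOTS_TXT, agent, page))
```

With these rules, GPTBot is turned away from the whole site while other crawlers fall through to the catch-all entry and are allowed.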
Here is the robots.txt for osnews.com…
osnews.com/robots.txt
It’s interesting to compare other social media sites too.