Stack Overflow signs deal with OpenAI, bans users trying to alter answers

Thom Holwerda 2024-05-07 General Development 36 Comments

We’re all aware of Stack Overflow – it’s a place where programmers and regular users can ask technical questions, and get answers from anyone who thinks they know the answer. Stack Overflow has become so ubiquitous among programmers and developers that “I just copied the code off Stack Overflow” has become a running meme to indicate you don’t fully grasp how something works, but at least it works.

If you’ve ever contributed answers to Stack Overflow, you might want to consider deleting them, altering them, or perhaps even going as far as requesting a GDPR removal if you’re in the European Union, because Stack Overflow has just announced a close partnership with “AI” company OpenAI (or, more accurately, “Open” “AI”). Stripped of marketing speak, the gist is exactly as you’d expect: OpenAI will absorb the questions and answers on Stack Overflow into its models, whether their respective authors like it or not.

As much as you may want to try and delete your answers if you’re not interested in having your work generate profit for OpenAI, deleting popular questions and answers is not possible on Stack Overflow. The other option is altering your answers to render them useless, but it seems Stack Overflow is not going to allow you to do this, either. Ben Humphreys tried to alter his highest-rated answers, and Stack Overflow just reverted them, and proceeded to ban him from the platform.

> Stack Overflow does not let you delete questions that have accepted answers and many upvotes because it would remove knowledge from the community.
>
> So instead I changed my highest-rated answers to a protest message.
>
> Within an hour mods had changed the questions back and suspended my account for 7 days.
↫ Ben Humphreys

Now that they’ve made what is most likely an incredibly lucrative deal with OpenAI that’s going to net Stack Overflow’s owners boatloads of money, they obviously can’t let users delete or alter their answers to lower the monetary value of Stack Overflow’s content. Measures to prevent deletion or alteration are probably one of the clauses in the agreement between Stack Overflow and OpenAI. So there’s likely not much you can do to not have your answers sucked into OpenAI, but you should at least be aware it’s happening in case of future answers you might want to contribute.

36 Comments

2024-05-07 11:57 am

Alfman verbose=1
Thom Holwerda,

> If you’ve ever contributed answers to Stack Overflow, you might want to consider deleting them, altering them, or perhaps even going as far as requesting a GDPR removal if you’re in the European Union, because Stack Overflow has just announced a close partnership with “AI” company OpenAI (or, more accurately, “Open” “AI”). Stripped of marketing speak, the gist is exactly as you’d expect: OpenAI will absorb the questions and answers on Stack Overflow into its models, whether their respective authors like it or not.

I see where you’re coming from, but this also comes across as hostile to openness. Why should information be open for anyone to use for any purpose except for AI? Sure, I get that you are concerned about compensation, but the fact is the Stack Exchange contributors volunteered their answers even knowing that employees of for-profit companies are using the platform. The volunteers themselves might also be using Stack Exchange in the course of their professional work. They didn’t care about compensation then, so why now? I honestly think AI is a very good fit for the S/E model.

Ideally I’d like all these AI models to be as open and accessible as the original data. I am concerned about AI models that push us towards closed/proprietary services. But in principle I don’t have a problem with AI using published information to improve information science. Isn’t that a good thing? To be clear, I have sympathy when the AI displaces real jobs and livelihoods: artists, musicians, translators like you, etc. Many are dismissing AI as a joke and I think they’ll be proven wrong. In this specific instance, is it a moral problem for AI to displace volunteers like those on Stack Exchange? I concede this could happen, but it’s much less clear to me that it raises the same moral issue as displacing paid jobs.

> As much as you may want to try and delete your answers if you’re not interested in having your work generate profit for OpenAI, deleting popular questions and answers is not possible on Stack Overflow. The other option is altering your answers to render them useless, but it seems Stack Overflow is not going to allow you to do this, either. Ben Humphreys tried to alter his highest-rated answers, and Stack Overflow just reverted them, and proceeded to ban him from the platform.

I haven’t really combed through the terms and conditions. Whether the user has a right to retract submissions after the fact will obviously be a very interesting matter for the courts to settle. Their copyright policy does have provisions for reporting infringements; however, it’s clearly about third-party allegations, not contributors themselves changing their minds after the fact.

https://stackoverflow.com/legal/terms-of-service/public#copyright

I believe stack exchange would be covered by US laws. I’m genuinely curious where it will go. Keep us informed 🙂

2024-05-07 12:42 pm

Drumhellar
I agree with you, it isn’t exactly “open” if you allow users to use it for one purpose but not for another. At any rate, as per the agreement you linked to, the ToS grants Stack Exchange an irrevocable and perpetual license to the submitted content under a Creative Commons license (CC BY-SA 4.0: https://creativecommons.org/licenses/by-sa/4.0/).

2024-05-07 2:17 pm

Drumhellar
Oops. Messed up the link to the license: CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0/)

2024-05-08 9:18 am

alexvoda
Do the terms of service allow Stack Exchange to relicense that content? Because OpenAI is not abiding by either the BY or the SA.

2024-05-08 9:56 am

Alfman verbose=1
alexvoda,

> Do the terms of service allow Stack Exchange to relicense that content? Because OpenAI is not abiding by either the BY or the SA.

Do you have any specific evidence to back this allegation?

I’m not privy to the details of the deal, but it seems conceivable that they will use Stack Exchange data to train a model that complies with the Creative Commons license terms. Anyway, I haven’t seen evidence to the contrary.
2024-05-08 2:26 pm

Drumhellar
How are they not?

2024-05-08 4:11 pm

mbq
Under CC-BY-SA **AND** “you grant Stack Overflow the perpetual and irrevocable right and license to access, use, process, copy, distribute, export, display and to commercially exploit such Subscriber Content, even if such Subscriber Content has been contributed and subsequently removed by you as reasonably necessary”. This is a loophole they can use to avoid CC.

2024-05-08 5:26 pm

Drumhellar
The perpetual and irrevocable right you mentioned is granted by the CC-BY-SA.

There is no “AND”. Stack Overflow is merely doing what the license specifically allows. It isn’t a loophole. It’s literally the exact reason the license exists.

2024-05-07 1:41 pm

CaptainN--
Because multiple companies are now profiting off of the free labor of individual contributors, who may have contributed in the name of “openness” – but this feels like a step too far. It’s not just that StackOverflow is selling the highway, they are now granting a third party the dataset to quite intentionally obsolete the people who generated all the value (that is their often stated goal). In other words, this is an exploit of openness. I’ve always had an issue with open source for these reasons (especially non-copy-left license derivatives), which is kind of a related issue – they both devalue the hard work of humans, but this kind of takes it to the next level. Contributors rightfully see this as an exploit.

2024-05-07 3:15 pm

Andreas Reichel
Thank you for summarizing 60k years of human development and history. What exactly is the “AI” specific problem now?

2024-05-07 3:30 pm

CaptainN-
If you really think 60k years of human development is all exploitation, then you need to read more…

The difference is not in the human condition, it’s in the degree of exploitation. If degree doesn’t matter to you, then I don’t know what else to say.

2024-05-07 3:39 pm

Andreas Reichel
> To be clear, I have sympathy when the AI displaces real jobs and livelihoods: artists, musicians, translators like you, etc.

You surely do know that throughout those 60k years of modern mankind, slavery has been an integral part of our species — across all countries and cultures and centuries, until the very much present time? But please, just go on …
2024-05-07 3:41 pm

Andreas Reichel
Delete the comment above please, my quoting was completely off.
I wanted to quote CaptainN, not Alfman.

I will prefer to stay absent until this forum has some basic correction, preview and editing facilities.
2024-05-07 7:10 pm

CaptainN-
If you think a substantial portion of that 60k years is marred by slavery – like I said, you need to read more.

2024-05-07 3:26 pm

Alfman verbose=1
CaptainN–,

> Because multiple companies are now profiting off of the free labor of individual contributors

This is kind of the point I was making though: Stack Exchange has always allowed that. They never required for-profit companies to pay a dime to use their questions and answers, and AFAIK Stack Exchange has never discriminated against business purposes. Adding that restriction now would be changing the longstanding policy of openness.

Obviously there is a real debate over training AI with copyrightable works, but that’s a separate issue. Here we are talking about contributions submitted under communal licenses. AI is not violating copyrights here, but for some reason I get the impression that some would like to stigmatize AI everywhere, even when copyrights are actually being adhered to.

> I’ve always had an issue with open source for these reasons (especially non-copy-left license derivatives), which is kind of a related issue – they both devalue the hard work of humans, but this kind of takes it to the next level. Contributors rightfully see this as an exploit.

You are perfectly entitled to this opinion. But I would suggest that you should not be submitting anything under terms you disagree with in the first place. If you do, it’s not the other party’s fault.

2024-05-07 3:39 pm

CaptainN-
You are saying that AI could have just trained off SEO content from SO, and therefore they already had access to the content? That’s not a settled area.

This new deal is likely the result of AI companies reading the wind – they got a lot of flak for using publicly available, yet still copyrighted material. SO owns the content on their platform (it’s in their TOS), so they can make the argument that OpenAI used their copyrighted material to produce derivative work through AI (which is very true). This deal indemnifies OpenAI from that copyright infringement.

SO is completely within their rights to do this, BTW, according to their TOS, and their business model and all that. It’s just that it’s not really in line with what people likely thought they were getting when they participated on that platform. It feels like a violation of the spirit of that platform – that matters.

If nothing else, all these moves being made by AI companies and centralized content hosts of all types are clearly showing the problem of centralized content management. There are interesting problems that the web 2.0 phenomenon addressed – and those problems remain if there is to be an alternative decentralized approach to content hosting – specifically, find-ability. Google used to have an answer, before they ruined it in recent years, but there haven’t been too many viable alternatives for finding content in decentralized systems, aside from link sharing and re-sharing in centralized social media networks (which also comes with commentary, and all that).

2024-05-07 3:44 pm

CaptainN-
Another problem with decentralized platforms is the business model- technical folks don’t talk about business concerns enough, IMHO. Businesses need to be able to produce something valuable, AND sell it. It’s not one or the other. How do you create a valuable decentralized social media platform, or to put it more directly – last-mile content aggregator – and also sell it – that is, earn enough revenue to keep the lights on, and the business running? Ads is one way (that’s what Google, Twitter and Facebook do) – is that really the only possible way? It’s a very interesting problem. IMHO, technical people need to think more about that type of stuff.
2024-05-07 4:02 pm

Alfman verbose=1
CaptainN-,

> You are saying that AI could have just trained off SEO content from SO, and therefore they already had access to the content? That’s not a settled area.

I didn’t say any of that. As a copyright matter though, I don’t see any smoke, much less fire.

> SO is completely within their rights to do this, BTW, according to their TOS, and their business model and all that. It’s just that it’s not really in line with what people likely thought they were getting when they participated on that platform. It feels like a violation of the spirit of that platform – that matters.

I agree with you that they are within their rights. You’re saying it’s against the spirit of the platform, but I’m not sure whether that’s true or just anti-AI bias. If this union ends up producing a more powerful interface to answer questions, I would argue that is very much in line with the spirit of the platform.
2024-05-07 4:06 pm

Alfman verbose=1
CaptainN-,

> Ads is one way (that’s what Google, Twitter and Facebook do) – is that really the only possible way? It’s a very interesting problem. IMHO, technical people need to think more about that type of stuff.

Ideally other business models could be successful. Advertising seems to be killing all the other business models. Users generally flock to services they don’t need to pay for. This hurts many of us.
2024-05-07 5:05 pm

CaptainN-
> If this union ends up producing a more powerful interface to answer questions,

The current crop of AI hyped products based on LLM literally cannot be better than google + SO – in fact, I’d argue it’s already proven itself worse. You get decent solutions to well-defined (and well-solved) problems but LLMs literally cannot come up with novel solutions to any problem – even relatively simple ones. Example, I asked it to write some code to add brotli support to a node.js stream, and it just can’t figure it out. There are plenty of examples of the same code for gzip. A human who knows that realm answering a question on Stack Overflow will not only be able to answer the question, probably quickly – but also provide a ton of useful context the human thinks is relevant.
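For readers wondering what the brotli task mentioned above might involve: Node's built-in `zlib` module ships `createBrotliCompress()` alongside `createGzip()`, so one plausible reading of "add brotli support to a node.js stream" is swapping the gzip transform for the Brotli one. The following is only a minimal sketch under that assumption; the helper names `withBrotli` and `collect` and the default quality of 5 are made up for illustration and do not come from the thread.

```typescript
import { Readable, pipeline } from "node:stream";
import { brotliDecompressSync, constants, createBrotliCompress } from "node:zlib";

// Hypothetical helper (not from the thread): returns a stream that yields
// the Brotli-compressed bytes of `source`.
function withBrotli(source: Readable, quality: number = 5): Readable {
  const compressor = createBrotliCompress({
    // Quality ranges from 0 (fastest) to 11 (smallest output).
    params: { [constants.BROTLI_PARAM_QUALITY]: quality },
  });
  // pipeline() wires source -> compressor and reports failures to the callback.
  pipeline(source, compressor, (err) => {
    if (err) console.error("brotli pipeline failed:", err);
  });
  return compressor;
}

// Hypothetical helper: drains a stream into a single Buffer.
function collect(stream: Readable): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const chunks: Buffer[] = [];
    stream.on("data", (chunk: Buffer) => chunks.push(chunk));
    stream.on("end", () => resolve(Buffer.concat(chunks)));
    stream.on("error", reject);
  });
}

// Usage: compress an in-memory stream, then check that it round-trips.
function roundTrip(text: string): Promise<string> {
  const source = Readable.from([Buffer.from(text, "utf8")]);
  return collect(withBrotli(source)).then((compressed) =>
    brotliDecompressSync(compressed).toString("utf8"),
  );
}
```

The same shape works for an HTTP response or a file stream: anything readable can be passed to `withBrotli`, and the result can be piped onward like any other stream.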

Why can’t the LLM produce the same? Because it’s just a giant vector database. It literally predicts the next likeliest word in a sequence, phrase by phrase, based on what it has in a vector database “trained” (silly word here, IMHO) on pre-existing art – it therefore cannot come up with anything that doesn’t exactly match something in its training set (in the vector database). That’s just how it works.

There are other AI technologies that might be able to provide that type of data, but LLM ain’t it. I pity the poor saps trying to build businesses on this tech.
2024-05-07 8:42 pm

Alfman verbose=1
CaptainN-,

> The current crop of AI hyped products based on LLM literally cannot be better than google + SO – in fact, I’d argue it’s already proven itself worse.

Why not? Honestly, I am quite impressed with LLMs’ ability to create custom-tailored answers to questions in ways that google + SO cannot do.

> LLMs literally cannot come up with novel solutions to any problem – even relatively simple ones. Example, I asked it to write some code to add brotli support to a node.js stream, and it just can’t figure it out.

I disagree with this. While today’s LLMs are static and don’t do their own research, they can still extrapolate novel solutions by using the training data axiomatically. That’s quite an achievement, and very useful in answering questions Stack Exchange style I might add. Of course the LLMs today aren’t able to independently verify the truth of anything they’ve been trained on and they are kind of naive in this respect, although I suspect this may change in the future.

In terms of asking it to code, keep in mind you are asking a very generic LLM to write code, not an LLM that was trained to code. If we did build LLM models for coding, they would probably do a lot better than what you’ve seen from generic LLMs.

> Why can’t the LLM produce the same? Because it’s just a giant vector database. It literally predicts the next likeliest word in a sequence, phrase by phrase, based on what it has in a vector database “trained” (silly word here, IMHO) on pre-existing art – it therefore cannot come up with anything that doesn’t exactly match something in its training set (in the vector database). That’s just how it works.

Outputting words sequentially does not diminish the intelligence of what’s being output. A more convoluted algorithm might “seem” more intelligent to us, but this is just bias and not an objective indicator of intelligence. A mathematician would recognize that any algorithm that outputs data can be converted into another algorithm that creates the exact same output one word at a time (even including the human brain at the extreme). Therefore we should not be treating this as a deficiency. We need to evaluate AI outputs on their merits, and in particular treat the system as a black box, without preconceptions.
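The conversion claim above can be illustrated with a toy wrapper. This is only a sketch of the output-equivalence argument, not a description of how any LLM works internally; the names `Algorithm`, `NextWord`, `oneWordAtATime` and `drain` are hypothetical and chosen purely for illustration.

```typescript
// Any "whole answer at once" procedure, modelled as a function from input to text.
type Algorithm = (input: string) => string;

// A word-by-word producer: each call returns the next word, or undefined when done.
type NextWord = () => string | undefined;

// Wraps an arbitrary algorithm so that its output is handed out one word at a time.
function oneWordAtATime(algorithm: Algorithm, input: string): NextWord {
  const words = algorithm(input).split(" ");
  let position = 0;
  return () => (position < words.length ? words[position++] : undefined);
}

// Reassembles the words; the result equals what the original algorithm returned.
function drain(next: NextWord): string {
  const words: string[] = [];
  for (let word = next(); word !== undefined; word = next()) {
    words.push(word);
  }
  return words.join(" ");
}

// Usage with a stand-in algorithm that answers in one go.
const shout: Algorithm = (input) => input.toUpperCase() + " !";
const wordByWord = drain(oneWordAtATime(shout, "sequential output proves nothing"));
```

Because splitting on a single space and joining with a single space are inverses, `wordByWord` is identical to `shout(...)` called directly, which is the sense in which the emission order says nothing about the quality of the output.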

> There are other AI technologies that might be able to provide that type of data, but LLM ain’t it. I pity the poor saps trying to build businesses on this tech.

I know you are being sarcastic, but they’re not the ones who are going to deserve pity when all is said and done.

2024-05-07 5:13 pm

Bill Shooter of Bul Platinum Prime
So, if I understand you correctly, you’re saying that people who contributed to Stack Overflow should be against it because AI may replace their jobs?

Still absorbing the issue, haven’t really decided if it’s a bad thing or not, really just trying to understand everyone else’s position.

My initial reaction was “that’s great!” There was some speculation that OpenAI had already trained on Stack Overflow and was profiting off of that dataset without compensating Stack Overflow, who did the hard work of bringing everyone together. But while that is a net positive over just taking people’s data without compensation, the use of the data may still be a net negative.

2024-05-07 1:51 pm

cheemosabe
It’s because of situations of this nature that I appreciate permissive licenses like BSD. You shouldn’t lock up knowledge. Maybe OpenAI won’t use it for the best advantage (although I personally don’t fault them with anything, they’ve brought about some significant change, or at least insight).

Profit is probably the last thing we should be afraid of when it comes to AI. I’m not sure how this will play out and I’m a little afraid of where it might lead, but it’s probably a little late to call stop, and certainly not for this reason.
2024-05-07 2:20 pm

Geck
There is no real problem involved, when it comes to using SO content for AI purposes, as the licence allows it. The problem is the licence has obligations too. The main problem is OpenAI doesn’t attribute authors and that is not allowed by the SO licence. Authors must be attributed.

2024-05-07 3:13 pm

Andreas Reichel
You sure you are attributing all the authors and inventors whenever you use an Otto motor, a diesel engine, a relational database (Codd did not get anything), calculus or zero?
There is a whole very worthwhile lecture on how the Arabs and their contributions to science have been wiped!

2024-05-07 3:17 pm

Andreas Reichel
Thom: please give us the possibility to edit and proofread posts! Thank you in advance.
2024-05-07 5:27 pm

Geck
The licence SO authors use for existing content is clear. In short, if you use the content you must attribute the author. Yes, I have always attributed the author when using code from SO, as there is no reason why I wouldn’t, and so will ChatGPT. That is, if any part of the answer it provides is based on SO content, then attribution must also be provided.

2024-05-07 3:26 pm

Andreas Reichel
> To be clear, I have sympathy when the AI displaces real jobs and livelihoods: artists, musicians, translators like you, etc.

What was the largest industry around 1900? Whaling! And it vanished completely within less than 1 year, when mineral oil was put to use.
Should we have banned mineral oil at that time?! Think about the climate change it caused!

2024-05-07 4:55 pm

mrroman
Do you have any sources (about whaling)? That’s very interesting. I would think about steel, mining or weaving.
2024-05-07 5:17 pm

Bill Shooter of Bul Platinum Prime
That’s one data point. Another one is the death of journalism at the hands of various online multinational companies. Journalism doesn’t exist to the extent it used to. I don’t think we’re better off on any level of society. There really should have been a better way to preserve what we had while also moving into the digital age, and we’re all worse off for our current situation. I fear that may also happen with the increase in AI in many industries. Any gains we may see might just be offset by dramatically larger losses. It’s naïve to see it as only a positive or negative cause of change.

2024-05-07 8:57 pm

Alfman verbose=1
Bill Shooter of Bul,

> I fear that may also happen with the increase in AI in many industries. Any gains we may see might just be offset by dramatically larger losses. It’s naïve to see it as only a positive or negative cause of change.

Yes. I’ve been saying for quite some time that AI is coming whether we want it to or not. While I respect the opinions of those who wish AI would go away, I think fighting AI is futile, and by focusing on the wrong goal they may be losing the ability to shape what AI ultimately becomes.

IMHO AI has the capacity to create both a utopia as well as a dystopia…which one we get in the future really depends on us and our choices today.

2024-05-08 6:58 am

kurkosdr
> I see where you’re coming from, but this also comes across as hostile to openness. Why should information be open for anyone to use for any purpose except for AI?

Same question here. An LLM is a piece of software after all.

Looks like Thom has a beef with AI because it’s making his job as a translator increasingly obsolete.

Similarly, some StackOverflow contributors have a beef with AI because it threatens to make their jobs as computer programmers increasingly obsolete.

So, they will engage in mental gymnastics to “prove” how AI (LLMs) should be subject to special copyright laws (where fair-use doesn’t apply) and exempt from existing terms-of-service. About that last part, if you are uploading something to someone else’s computer, assume they can use it as they please because that’s what the terms-of-service probably say.

2024-05-08 8:55 am

Alfman verbose=1
kurkosdr,

> About that last part, if you are uploading something to someone else’s computer, assume they can use it as they please because that’s what the terms-of-service probably say.

I think this speaks directly to the Ben Humphreys quote in the article. He doesn’t like their decision, but he doesn’t really have the authority to renege on the terms of earlier submissions in order to punish the platform. If it bothers him that much he probably needs to close his account. I understand this is not satisfying, but it does provide an important lesson to everyone: understand what you are agreeing to!

He can try to go to court if he really feels so strongly about it, but I’m not sure he’s really got a case here unless something in the terms and conditions was illegal – I don’t know why they would be though.

2024-05-07 1:37 pm

CaptainN--
If you don’t pay for the product, you are the product. We need new business models – all of these are business problems, not technical ones.
2024-05-08 1:17 am

drstorm
Please Thom, I’m begging you, define what constitutes an AI. No quotes. I am being sincere. I just want to know.

2024-05-08 6:38 am

kurkosdr
In this particular case, it means an LLM.