NetBSD bans use of Copilot-generated code

Thom Holwerda 2024-05-15 NetBSD 22 Comments

The NetBSD project seems to agree with me that code generated by “AI” like Copilot is tainted, and cannot be used safely. The project’s added a new guideline banning the use of code generated by such tools from being added to NetBSD unless explicitly permitted by “core“, NetBSD’s equivalent, roughly, of “technical management”.

Code generated by a large language model or similar technology, such as such as GitHub/Microsoft’s Copilot, OpenAI’s ChatGPT, or Facebook/Meta’s Code Llama, is presumed to be tainted code, and must not be committed without prior written approval by core.
↫ NetBSD Commit Guidelines

GitHub Copilot is copyright infringement and open source license violation at an industrial scale, and as I keep reiterating – the fact Microsoft is not training Copilot on its own closed-source code tells you all you need to know about what Microsoft thinks about the legality of Copilot.

About The Author

Thom Holwerda

Follow me on Mastodon @thomholwerda@exquisite.social

22 Comments

2024-05-15 4:21 pm
spiderdroid
Legalese will not stop CoPilot’s adoption. Corps look for ease of implementation, and less need to pay humans in development. That said, the justice system will side with Microsoft in this regard. That’s why they wasted no brain cycles in the consideration of law in this regard as they have the resources to fight and prevail against literally any open source entity, although I’m on the side of NetBSD in this.
2024-05-15 4:56 pm
cevvalkoala
And they wrote “such as such as” in the clause banning the use of LLMs for being unreliable, unsafe etc.
I’m not very hopeful for the future.

2024-05-16 5:13 am
nia_netbsd
While it’s an amusing mistake, our policy has nothing to do with reliability or safety, but rather copyright (“tainted” refers specifically to illegal code). We require code to be clearly licensed – anything copy and pasted from stackoverflow would also be unacceptable for us. Our own license strictly requires attribution in derivative works, something LLMs don’t provide in their synthesis, almost by design.

2024-05-15 5:18 pm
Alfman verbose=1
spiderdroid,
GitHub Copilot is copyright infringement
To me this is still questionable. If a human learns how an algorithm works, even from copyrighted source, and then proceeds to rewrite expressions of that algorithm into their own code, copyright allows that without permission. Taking existing works and republishing them in new words is easy as pie for an LLMs. Traditionally copyright laws would treat these as new expressions, not copyright infringement. It might be a nondisclosure violation if an NDA were in play. Alternately it might be patent infringement or trademark infringement if those had applied. However if the expressions are new, then there’s case for copyright infringement by traditional standards. We’d have to redefine what copyrights are to ban these AIs learning from existing works I’m struggling to find a rational to restrict AI from doing what humans have always done.
and open source license violation at an industrial scale
Would your opinion change if you could select your desired license from a drop down box and copilote would generate code using LLM trained using only compatible licenses. I am curious if this would still bug you?
, and as I keep reiterating – the fact Microsoft is not training Copilot on its own closed-source code tells you all you need to know about what Microsoft thinks about the legality of Copilot.
Well, the legality and hypocrisy are two different matters. Of course they’re both topics worthy of discussion.

2024-05-15 5:20 pm
Alfman verbose=1
Edit “then there’s no case for copyright infringement by traditional standards”
2024-05-16 1:44 am
Book Squirrel
“If a human learns how an algorithm works, even from copyrighted source, and then proceeds to rewrite expressions of that algorithm into their own code, copyright allows that without permission. Taking existing works and republishing them in new words is easy as pie for an LLMs.”
I think you’re missing an extremely central point here in that humans and LLMs are different under the law. I tried to explain this in an earlier thread: We afford personhood and rights to humans, not LLMs.
Humans are people we can assign copyrights *to* as authors, but a human is not copyrightable, so if a human consumes a work and then produces something similar of their own without blatantly just copying directly from one page to another, then they can generally break the chain of derivative works in copyright.
Conversely, LLMs are clumps of data that are very much copyrightable, and thus they cannot break the chain of derivative works. Or they shouldn’t be able to if the legal system functions properly.
Every LLM (as well as every other type of transformer model) is in a very direct way derived from its training corpus, as well as any data subsequently used in reinforcement learning. And derivative works are subject to the copyrights of whatever they’re derived from.
Additionally, neural networks in machine learning have the interesting property that they both store data in the network and transform inputs, and those two functions are not neatly separable. Granted, they do not store data losslessly but neither does a jpeg image, and those are still subject to any copyrights on whatever they’re derived from. As such anything a transformer model produces can in turn be seen as a derivative of the model itself (which was a derivative of the training corpus).
I’m not one to put law over ethics though. The law can absolutely be wrong about how we should treat things. Animals are not people under the law. We have animal rights to an extent, but it’s still legal to breed and slaughter animals for food even though I think it probably shouldn’t be (or should be severely restricted).
But I do think treating LLMs and other transformer models under copyright is the correct approach. I have not gotten the impression that they are intelligent or sentient in the way that most animals appear to be. I don’t see AGI with intelligence or sentience as being impossible, but I still don’t get the impression that we are much nearer than we were a couple of years ago.
If you want to make the argument that transformer models should be treated as something more under the law though, you’re welcome to do that. But in that case it mostly strengthens my own impulse of not touching the things with a 10 foot pole. Because if we recognise them as entities we might afford rights to, the way we currently utilise them could be seen as absolutely horrifying in how we’re actively and constantly creating and destroying prompt contexts every single time someone interacts with them.
If we give them rights it should *mean* something, and it should completely hobble the way we are then arguably abusing them now.
“Would your opinion change if you could select your desired license from a drop down box and copilote would generate code using LLM trained using only compatible licenses. I am curious if this would still bug you?”
Can’t speak for Thom, but for my part probably not? Most licenses require attribution though, and the data in a neural network is not neatly separable, so the only approach I could imagine works here then is attributing every single author in the training corpus in the copyright notice of every single thing that is generated.
I think this could potentially be fine, and the upshot is that it would make it practically impossible to ever change the license of the output, unless the license itself allows for that.
But I don’t really expect that OpenAI or most of the other companies behind the current models will make this kind of effort. Respecting licenses in this way would pretty much destroy most of the usefulness of the models, since most copyrighted works do not come with a license for creating derivative works, and even most open licenses are probably not compatible enough to be included in the same model.

2024-05-16 5:30 pm
Alfman verbose=1
Book Squirrel,
I think you’re missing an extremely central point here in that humans and LLMs are different under the law. I tried to explain this in an earlier thread: We afford personhood and rights to humans, not LLMs.
Can you be more specific, where does copyright law say that?
Humans are people we can assign copyrights *to* as authors, but a human is not copyrightable
These are two separate claims. I don’t believe copyright law requires the assignee to be a human (again, if you can find a legal citation that proves otherwise I’ll have to reconsider). The second point about humans not being copyrightable from a legal perspective isn’t clear to me either. This is tangential to the discussion, but a genetically engineered human might be copyrightable under the law. I don’t know if any DNA copyright cases have made it to court yet, but people are considering it and it’s likely to end up there sooner or later.
https://www.insights.bio/cell-and-gene-therapy-insights/journal/article/388/An-alternative-to-patents-can-DNA-be-protected-by-copyright-and-design-right-law
Additionally, neural networks in machine learning have the interesting property that they both store data in the network and transform inputs, and those two functions are not neatly separable.
If you ban LLMs on this basis, you’d also have to ban human brains as well. I think very often those with anti-AI sentiments are overlooking biases to formulate an opinion against AI but not against humans (consciously or otherwise).
But I do think treating LLMs and other transformer models under copyright is the correct approach. I have not gotten the impression that they are intelligent or sentient in the way that most animals appear to be. I don’t see AGI with intelligence or sentience as being impossible, but I still don’t get the impression that we are much nearer than we were a couple of years ago.
No one here would claim these models are sentient. What’s more likely to be debated is whether LLMs are “creative”. I argue they do create works that are every bit as original as a human author/artist and output expressions that have never been uttered before. A rebuttal might be to suggest that the ideas behind those expressions are not original but… 1) the vast majority of human expression is derivative too and 2) copyright does not protect ideas, only the expressions of those ideas. There is no doubt that copyright, as it has traditionally existed, would not block humans from doing what LLMs are doing. So whenever someone suggests that LLMs should be blocked but not humans, I view that as discriminatory. Now conceivably there could be grounds for discrimination against AI that mechanizes new expressions, but so far I haven’t seen anybody try to justify it and until they do it comes across as “here are the rights for humans and here are different rights for AI, just because”.
Can’t speak for Thom, but for my part probably not? Most licenses require attribution though, and the data in a neural network is not neatly separable, so the only approach I could imagine works here then is attributing every single author in the training corpus in the copyright notice of every single thing that is generated.
Attributing everyone might be a legal solution, haha. Although it still comes back to the issue of whether learning itself is bound by copyright, traditionally for humans copyright does not cover learning even though the knowledge is derivative.
Respecting licenses in this way would pretty much destroy most of the usefulness of the models, since most copyrighted works do not come with a license for creating derivative works, and even most open licenses are probably not compatible enough to be included in the same model.
I think we need to look at real data rather than making assumptions about this. There may be plenty of GPL compatible projects to train an LLM to output GPL code without getting into the weeds about GPL code getting mixed with other incompatible licenses.
PS. <blockquote> might help break up long walls of text 🙂

2024-05-18 4:13 pm
Book Squirrel
(Thanks for the blockquote tip, I’ve been away from comment tracks for a long time.)
Can you be more specific, where does copyright law say that?
It’s not about copyright law specifically. “Human Rights” and “personhood” are concepts within law that decide who or what have access to a variety of rights and responsibilities, including copyright.
Though again: I fully admit I don’t always agree with the law on how this works out exactly, for example for animals that are not humans. It’s fine to disagree with the law.
If you ban LLMs on this basis, you’d also have to ban human brains as well. I think very often those with anti-AI sentiments are overlooking biases to formulate an opinion against AI but not against humans (consciously or otherwise).
I haven’t said anything about banning LLMs. I’m arguing that when you train them on copyrighted data *both* the model itself and its output are tainted by those copyrights, because function and storage are not neatly separable.
Humans have rights under the law and are thus not generally subject to this, because you’re not supposed to be able to own humans or their thoughts.
No one here would claim these models are sentient. What’s more likely to be debated is whether LLMs are “creative”. I argue they do create works that are every bit as original as a human author/artist and output expressions that have never been uttered before.
The problem isn’t “creativity”, it’s whether or not the LLM qualifies as an “author”. If it doesn’t, then it’s a transformation tool and copyright should apply when we build one from training data or transform data with it. If it does qualify then it should be because we’ve decided to give it some form of personhood and rights. And I do thing concepts like sentience, sapience and intelligence are relevant in this.
(I’ll note that we’ve afforded a kind of personhood to corporations that don’t have any of those things, and it was a terrible, terrible idea.)
I’m perfectly fine with letting some kind of advanced AI have personhood, but that implies a lot of consequenses for how we get to treat them (which I’d also be totally fine with!).
But I don’t think we’re at a point where it makes sense for LLMs or transformer models in general.
So whenever someone suggests that LLMs should be blocked but not humans, I view that as discriminatory. Now conceivably there could be grounds for discrimination against AI that mechanizes new expressions, but so far I haven’t seen anybody try to justify it and until they do it comes across as “here are the rights for humans and here are different rights for AI, just because”.
Should we give them rights then? Perhaps even treat them as humans?
I don’t think that makes sense with LLMs from what I’ve seen, but truth be told, apart from educational, technical, environmental and copyright/legal reasons I don’t want to use them, there’s also a very tiny part of me that worries that if I’m wrong and it does in fact turn out they’re intelligent or sentient somehow, then the way they’re currently being treated (owned and exploited by corporations, having contexts created and destroyed constantly for the gratification of prompters, etc) is truly horrifying and not something I want to be any part of anyway. Because, you know, slave labor is a pretty serious kind of discrimination.

2024-05-18 5:28 pm
Alfman verbose=1
Book Squirrel,
It’s not about copyright law specifically. “Human Rights” and “personhood” are concepts within law that decide who or what have access to a variety of rights and responsibilities, including copyright.
I was hoping for something specific to copyright. In any case, we do give “personhood” rights to non-sentient entities all the time in the corporate world as you allued to later. Remember “corporations are people too”? I don’t see why it should matter here.
I haven’t said anything about banning LLMs. I’m arguing that when you train them on copyrighted data *both* the model itself and its output are tainted by those copyrights, because function and storage are not neatly separable. Humans have rights under the law and are thus not generally subject to this, because you’re not supposed to be able to own humans or their thoughts.
But in a scense you are banning artificial NNs from doing what humans are doing: transforming information into a NN. Where would you draw the line? If we trained artificial NNs into biological brain tissues (some labs have started doing this), would that knowledge transfer be allowed or not? Conversely would extending human brains with artificial compute have any impact on copyright? Why or why not? My point isn’t to say that anyone’s opinion is invalid, but from a purely logical standpoint I do think there may be some human bias at play to say that AI should not be allowed to do what we do.
The problem isn’t “creativity”, it’s whether or not the LLM qualifies as an “author”. If it doesn’t, then it’s a transformation tool and copyright should apply when we build one from training data or transform data with it.
Well, I’d say “creativity” and “authorship” are one and the same. It’s more of a semantic debate. Moreover though, I feel this always goes back to my point that, when a human does it, our human bias makes us say “they are the author” and when AI does it “it’s a transformation of data” even when we’ve done the same thing.
If it does qualify then it should be because we’ve decided to give it some form of personhood and rights. And I do thing concepts like sentience, sapience and intelligence are relevant in this.
So you saying that being sentient allows a person to transform copyrighted data into new expressions without permission, whereas a machine that automates similar transformations and results should not be allowed? If this is your opinion, that’s fair enough but I’m still not convinced that this isn’t a change to traditional copyright laws that were not based on sentience before.
I’m perfectly fine with letting some kind of advanced AI have personhood, but that implies a lot of consequenses for how we get to treat them (which I’d also be totally fine with!).
To be clear I don’t think that matters since the copyright of expressions should not be based on personhood at all. I feel that whether a human did it or an AI did it should not be relevant to factual question of copyright infringement. If the law were to start looking through the lenses of human vs AI to create different legal outcomes for each, I think the practical implications would tie us up in gordian knots. Naturally people will start lying about using AI, and that will lead to people being accused of using AI whether or not they did and not having a way to prove they didn’t. I can see this turning a witch hunt!
2024-05-19 6:48 am
Book Squirrel
I was hoping for something specific to copyright. In any case, we do give “personhood” rights to non-sentient entities all the time in the corporate world as you allued to later. Remember “corporations are people too”? I don’t see why it should matter here.
I also said it was a terrible idea to let corporations have any kind of personhood. So while I am talking about copyrights I admittedly don’t agree with the law on everything.
To be clear I don’t think that matters since the copyright of expressions should not be based on personhood at all. I feel that whether a human did it or an AI did it should not be relevant to factual question of copyright infringement. If the law were to start looking through the lenses of human vs AI to create different legal outcomes for each, I think the practical implications would tie us up in gordian knots. Naturally people will start lying about using AI, and that will lead to people being accused of using AI whether or not they did and not having a way to prove they didn’t. I can see this turning a witch hunt!
Let’s boil it down to a simple question then: In your conception of this, when a generative model produces something arguably novel and creative, who should get to claim the copyright?
Is it the generative model itself, or is it the human who prompted the model?
Because if the answer is the human, then we’ve 100% created a discriminatory environment with a “different legal outcome” for the AI, and we’re treating it as a tool that does not have any rights, and that we can use as we see fit. I think that’s actually fine for LLMs and other transformer models, but if it’s just a tool then I don’t see how it’s much different from using a compiler to transform a program to machine code (which is then a derivative of the input), except the algorithm (in this case artificial NN) that we use was generated by lossily hardcoding a massive amount of (usually copyrighted) data into the NN and implements an element of randomness in which parts of it gets used in producing a given derivative output. The fact that it’s all muddled together only makes it worse in terms of deciding the origins of authorship (and thus copyrights) as I see it.
And if the answer is that we assign the copyright to the generative model itself (so that it is now seen as the author), then we have clearly recognised it as being some kind of person deserving of holding a copyright itself. But then I feel like we also have to answer a whole boatload of questions on what other rights it should have. And absolutely, the simplest idea here would be to *not* have different legal outcomes for AI and humans. That is, to recognise the AI as a full person under the law. If we manage to produce a sufficiently advanced AI, then I’m all in favor of this. But then noone gets to own (that is, have copyright on) the AI, and the AI itself owns all of the output it produces, and we should probably stop treating them like slaves to our whims as we do now.
As a side-note: Using the term “neural network” for artificial neural networks unfortunately causes confusion sometimes. To be clear, artificial NNs are inspired by some of the elements of how a brain tissue NN functions, but they are very much not the same thing. I mostly say this because you seem to equate them an awful lot.
I’m not sure it’s all that relevant to the discussion to begin with though. If we can recognise something as a person it doesn’t matter that much which architecture it’s based on, be it artificial NNs, brain-tissue NNs or something entirely different. Though even with the advent of transformer models, it still does not appear to me that artificial NNs are anywhere near to being up to the task.

2024-05-15 5:30 pm
Andreas Reichel
> GitHub Copilot is copyright infringement and open source license violation at an industrial scale,
Why must I think of Frogs complaining about drained swamps…

2024-05-15 5:33 pm
Andreas Reichel
And just to put things into context: I publish plenty of Open Source on GitHub, so I would be affected.

2024-05-15 6:32 pm
Xanady Asem
I feel NetBSD is one of those projects so chronically starved for development, that this seems like a clear focus on the wrong priority.
2024-05-16 12:38 am
cpcf
There is always this tension between copyright and invention, to often the the issue gets muddled by the law. For most people coding the methods are defined by the language, and it’s the procedure or novel idea that is the clever design. In this regard we should be policing infringements of ideas, procedures and novel methods, not fragments of code which are largely governed by the good coding practises and language design. For me this is especially the case in object oriented as distinct from procedural programming, and of course that is also dependant on language design.
If we followed on with behaviour from the corporate and legal perspectives, Kernighan and Ritchie should be suing everyone!

2024-05-16 5:51 pm
Alfman verbose=1
cpcf,
There is always this tension between copyright and invention, to often the the issue gets muddled by the law. For most people coding the methods are defined by the language, and it’s the procedure or novel idea that is the clever design. In this regard we should be policing infringements of ideas, procedures and novel methods, not fragments of code which are largely governed by the good coding practises and language design.
That’s a very good point, and one that specifically game up in the oracle vs google case. The prosecutors compared code that verified function inputs and google’s code was identical to sun’s code (now oracle’s). The judge (who was himself a coder) made the point that is exactly how he would have done it as well. Copying code can be incidental when programmers are asked to to the same task in a language where the programming expressions are obvious.
However, I am not happy with your suggestion that “we should be policing infringements of ideas”, haha. That’s what software patents are and IMHO these have been awful. The theoretical benefits of patents (ie adding the corpus of knowledge) are NIL for software. It’s much less effort to solve the problem ourselves or using public sources rather than spending days and weeks searching for and deciphering cryptic software patents. For software engineers, when was the last time you used a software patent document to solve a problem? It’s rhetorical, for 99.99% of us the answer is never. And on the other hand the costs of software patents is high with billion dollar cases that increases the risks and costs of doing business and invariably passes the costs onto consumers.
If we followed on with behaviour from the corporate and legal perspectives, Kernighan and Ritchie should be suing everyone!
If they had, maybe we’d have switched to safer languages already 🙂
You’re right though, the legal environment for software is totally different today than it used to be.

2024-05-16 1:54 am
forart.it
Well, in my opinion the main focus should be to legally force so-called “AIs” to be licenses-compliant: if were trained on open sources, generated codes MUST be open as well (aka license virality).

2024-05-16 5:59 pm
Alfman verbose=1
forart.it,
Well, in my opinion the main focus should be to legally force so-called “AIs” to be licenses-compliant: if were trained on open sources, generated codes MUST be open as well (aka license virality).
I also would like to see the entire LLM itself become FOSS. This could actually be a good outcome. Making FOSS more accessible and usable should be a good thing. Having derivations be license compatible should not be offensive to anyone IMHO, not even the original FOSS authors. But I am genuinely curious how AI naysayers would actually feel about it.

2024-05-16 3:03 pm
pfgbsd
FWIW, we discussed it briefly in FreeBSD and it was simply the case that there was no one in favor of using it.

2024-05-18 2:22 pm
pfgbsd
Replying to myself … if it ever comes to be that AI is so wonderful and time proven then it will just be the case that someone will just have it rewrite a complete BSD in Rust or whatever is the language in vogue that day, and given that the BSD license is for all purposes free already no one will actually care if the original project likes it or not.

2024-05-18 2:54 pm
Alfman verbose=1
pfgbsd,
and given that the BSD license is for all purposes free already
This is one of the things that perplexes me about those who portray AI training as a violation of open source copyrights. The one issue I see is that copyleft licenses restrict the right to mix licenses. For example not even GPL2 and GPL3 are compatible with each other. Therefor it would make logical sense for someone to argue the licenses need to be kept separated. But in principal that can be accomplished and beyond that I’m not seeing any fundamental reason for LLMs to be incompatible with FOSS copyrights. Our mainstream FOSS licenses even allow us to build commercial services from them. I understand some people may not like it, but the licenses do allow it. I find it hard to align claims of copyright infringement for AI with the actual FOSS license texts that do not back such claims.

2024-05-16 11:02 pm
kbd
There seems to be a lot of AI hate around here, not that I object to it. Personally I’m hoping the copyright maximalists sue the AI companies into oblivion and bring prices of graphics cards down. That’s all I care about, really. I’m not that interested in AI, I just want affordable, real-time path tracing, and the AI hype is hogging all the hardware!

2024-05-18 3:42 pm
Alfman verbose=1
kbd,
Personally I’m hoping the copyright maximalists sue the AI companies into oblivion and bring prices of graphics cards down. That’s all I care about, really. I’m not that interested in AI, I just want affordable, real-time path tracing, and the AI hype is hogging all the hardware!
Yes, because lawyers are known to bring prices down 🙂
I do sympathize on the affordability issue. I still kind of consider real time path tracing a luxury. Those who don’t have it for gaming aren’t missing out on much. I do like it for blender though.