Today, we are launching a technical preview of GitHub Copilot, a new AI pair programmer that helps you write better code. GitHub Copilot draws context from the code you’re working on, suggesting whole lines or entire functions. It helps you quickly discover alternative ways to solve problems, write tests, and explore new APIs without having to tediously tailor a search for answers on the internet. As you type, it adapts to the way you write code—to help you complete your work faster.
Sounds like a cool and useful feature, but this does raise some interesting questions about the code it generates. Sure, generated code might be entirely new, but what about possible cases where the code it “generates” is just taken from the existing projects the AI was trained on? The AI was trained on open source code available on GitHub, including a lot of code licensed under, for instance, the GPL. GitHub says in the Copilot FAQ:
GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set. Here is an in-depth study on the model’s behavior. Many of these cases happen when you don’t provide sufficient context (in particular, when editing an empty file), or when there is a common, perhaps even universal, solution to the problem. We are building an origin tracker to help detect the rare instances of code that is repeated from the training set, to help you make good real-time decisions about GitHub Copilot’s suggestions.
That 0.1% may not sound like a lot, but that’s misleading – another way to put it is that out of every 1000 suggestions Copilot makes, 1 is copy/pasted code someone has written and selected a license for, and that license must, of course, be respected. On top of that, it’s hard to argue that code generated from a set of existing open source code doesn’t constitute a derivative work, and is thus covered by the copyright open source licenses are based on.
I am not a lawyer, so I’m not going to argue Copilot is definitively a massive GPL violation, but as a layman, on the face of it, it definitely feels like a tool that’s going to strip a lot of code of its license – without the consent or permission of the code’s authors.
A single line of code probably cannot be copyrighted on its own. The same rule would apply to a manual copy-paste performed by a human being: there would be no grounds for a legal case.
Try copyrighting a single note in a musical piece and claiming that everyone else is infringing on your rights.
Rather, the problem is in the tool itself: does it violate copyright by holding a “database of copyrighted code” and pulling pieces out of it?
This.
Pretty much like using a few seconds of a movie to illustrate a point and then claiming ownership of the whole thing.
People should calm down a little.
What are the freetards asking for? Remaining in the prehistory of computing by not using the code that’s already out there – because reasons – and expecting someone to revolutionize everything and provide it for free, just for the sake of it? Where is their own FOSS “AI code blender” that also handles the licensing dependencies in a graceful manner?
Correct. Thom repeatedly says he is not a programmer, yet still comments on programming matters. And Thom says he is not a lawyer, so you at least have an honest opinion from him.
But with regard to politics, Thom has never, as far as I know, commented or put up a disclaimer that he is not a political scientist nor an expert on Chinese foreign and domestic policy. Does he travel to Xinjiang, China to see for himself what the human rights situation there is, for example? He poses as an expert on China by repeatedly slamming it on human rights issues, while ignoring his European roots.
He is a Dutch citizen, correct? It may sound like I am off-topic, but everyone knows that when it comes to China, Thom’s comments mirror those of the Western media.
Did you know, folks, that the Netherlands is hosting a terrorist by the name of Joma Sison, a founder of communism in the Philippines? I hope Thom will lobby his government to deport that man; many of our fellow citizens have been victims of the communists’ abuses in the Philippines since the 1980s.
Sounds like a question for the FSF. This is a problem never before envisioned by the license authors, and thus they should determine whether or not this constitutes a license violation.
That’s really not how laws work.
It’ll be up to the courts to decide whether or not a copyright was/wasn’t violated by who/what. The FSF might create new licenses with new clauses for “AI generated code”, but whether a court accepts any such new clause or merely laughs at it is also up to the courts.
My guess is that the courts will say “AI assisted a human to write the code, but the human is ultimately responsible for their code regardless of what assisted them” and the human programmer/s will be blamed when the AI violates someone’s copyright. Other alternatives throw the whole legal system into turmoil (if the courts decide a computer can break a law; then should computers that break laws be punished? fined? imprisoned?).
Brendan,
I agree with your analysis; ultimately a human (or corporation) will be blamed. Regarding GPL copyright violations, though, these happen so frequently anyway without any consequence that I have trouble imagining anyone bringing GitHub or its users to court over this, especially over such generic code snippets.
However, with AI assistants I believe it remains a bit ambiguous whether the user or the creator would be at fault. For example, with self-driving cars, you could either hold the owner responsible or the company that produced the AI. Tesla’s assisted driving feature has already killed and maimed people, including its own drivers.
A couple examples…
https://electrek.co/2019/03/01/tesla-driver-crash-truck-trailer-autopilot/
https://www.theguardian.com/technology/2018/mar/31/tesla-car-crash-autopilot-mountain-view
While Tesla says the driver is responsible, I imagine we’ll see court cases going both ways.
I’d be amazed if, buried in the fine print, Tesla doesn’t say something like “driver must monitor and take over in case autopilot does something dodgy” so that (from a legal standpoint) the driver gets blamed regardless. Otherwise Tesla’s legal department would be horrified at the potential for culpability.
More specifically; I think Tesla’s cars are “level 2: driver must constantly supervise” and that Elon is a liar (deliberately giving people the impression that it’s “level 4: human driver not needed” for marketing purposes when it is not for legal purposes).
For GitHub’s Copilot; this strategy doesn’t work. From a legal standpoint; when a copyright violation is noticed the publisher is responsible, and it doesn’t even matter if the publisher had nothing to do with writing (or copying) any of the work they published. The publisher can’t say “Oh, I didn’t write the code, I hired Dave to write it so Dave is responsible” (or “Oh, I didn’t write the code, I used an AI assistant”) because the court simply doesn’t care about who/what the author is in the first place. Essentially; GitHub could say “you are responsible for monitoring the output of our code-pilot for copyright violations”; and if they did, instead of their AI not getting blamed (because it’s not the publisher) the human user would not get blamed (because they aren’t the publisher either, unless they actually are both an author and a publisher).
Note that this also works in reverse – you can violate copyrights as much as you like with no fear of any consequences, as long as you don’t publish it. Sadly; I think this is also how most “cloud providers” operate (e.g. Amazon’s “GPL derived work” stays on their server and is never published, so GPL is irrelevant and they can do whatever they like with other people’s code while raking in millions of $$ and giving nothing back to the original developers).
Brendan,
I’d think so too, and Tesla’s license agreements can say whatever they want them to, but the courts and law don’t always uphold the legal claims asserted in fine print.
I think it depends on both the particulars of the case and the court. If “Dave” is a serial offender (as Github would be), they might well be justified in pursuing Dave. Though I still have doubts over the copyrightability of such short generic code snippets in the first place.
Court cases can be unpredictable especially when it comes to uncharted territory like this. Sometimes even the judges disagree amongst themselves and can overturn precedent. I think we’re left with speculation until it actually happens.
I see you’re of the “you’re not guilty if they don’t catch you” philosophy, haha 🙂
Amazon isn’t automatically entitled to ignore license terms just because they don’t publish. That has more to do with the specifics of the GPL itself, which for better or worse permits them to do this.
You could choose to publish your code under a license such as Affero instead…
http://www.gnu.org/licenses/license-list.html#AGPLv3.0
Basically, all cloud providers would die without open source software like the Linux ecosystem.
Do Google, FB, AWS, etc. contribute to the open source ecosystem? Yes, of course. But are those contributions enough? That’s the question.
You can’t violate copyright if you don’t publish. So it’s misleading to call it a violation.
Also, there is nothing wrong with complying with the GPL by not publishing or distributing code. And as much as the FSF and GNU love to remind us, they have no problem with others making money off Free Software.
mkone,
Are you referring to the GPL specifically? Here in the US you aren’t allowed to copy movies/games/software/etc. without permission even if you don’t publish them. There are certain limited exceptions, like making backups of media you own.
Of course, but that’s because the GPL specifically permits these actions. The AGPL is another license that applies restrictions to business models that don’t involve publishing/distributing software. The reason such licenses can work is that copyright still applies even when the users are not publishers.
If I said “I won’t sell you my banana unless you agree with the following conditions” then it’d be a contract of sale that has nothing to do with copyright whatsoever.
If I said “I won’t sell you a copy of my software unless you agree with the following conditions” then (at least in theory) it’d also be a contract of sale that has nothing to do with copyright whatsoever.
The problem is that copyright (designed and intended to protect the author/publisher’s business model, even if there’s no copyright notice of any kind) has been conflated with a contract of sale; so “copyright notices” include things that have nothing to do with copyrights that could’ve (and should’ve) been part of a “contract of sale” (or End User License Agreement) instead.
This is why you get things like this: https://perens.com/2017/05/28/understanding-the-gpl-is-a-contract-court-case/ where the court decided that GPL is simultaneously a contract (the “contract of sale” parts that have nothing to do with copyright) and a copyright license (the parts that grant rights to redistribute copies).
Brendan,
That’s an interesting way to put it. There may be a distinction between contracts of sale and copyrights. A lot of license agreements seem to blur such distinctions, though.
This reminds me of software that helps users `write` musical compositions like melodies and chord progressions by analyzing what you’ve created so far. Music is mathematical in nature, with a finite combination of audible notes. I don’t know the legality of using such software when the analysis and algorithms used to generate a composition produce something identical or very similar to existing works.

It sounds like this GitHub Copilot thing is similar in that it doesn’t directly copy & paste, but rather tries to make logical decisions to generate a desired result, and in some cases it may make the same choices a person has. There are not infinite ways to accomplish things in code once you factor out adding crap that serves no real purpose. For example, there are not a bazillion ways to write a text scroller. Or put differently: if you instructed a group of people who’ve never seen “Hello, World” code to each write a “Hello, World”, you’ll surely have some who write the same code, so who should get the copyright? Who should determine the license?

Further, I don’t think too much weight should be given to the fact that the AI learned by analyzing existing code. That’s exactly the same way a lot of people learn to code as well. I’m not sure where the law draws the line on stuff like this, or if it’s even possible to do so in a clear & sane way.
Every Melody Has Been Copyrighted (and they’re all on this hard drive):
https://www.youtube.com/watch?v=sfXn_ecH5Rw
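To get a rough sense of the scale involved, here’s a back-of-the-envelope count in Python, assuming (purely for illustration) 8-note melodies drawn from the 12 chromatic pitches of a single octave, ignoring rhythm and octave changes:

    # Illustrative only: count every 8-note melody over one octave.
    PITCHES = 12   # chromatic pitches in one octave (assumption)
    LENGTH = 8     # notes per melody (assumption)

    total = PITCHES ** LENGTH  # every ordered sequence of pitches
    print(total)               # 429981696 -- huge, but finite and enumerable

Finite and enumerable is exactly what makes a brute-force “copyright every melody” stunt like the one in the video possible.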
I didn’t know anyone had done that but I’m not surprised by it either. I’m very curious to see how this would be argued in court and what the ultimate outcome would/will be. I think it will be challenging enough just to get consensus on basic terminology, much less the weeds of it.
Then I remember this:
https://www.rollingstone.com/music/music-news/led-zeppelin-stairway-to-heaven-copyright-infringement-ruling-appeal-964530/
‘On top of that, it’s hard to argue that code generated from a set of existing open source code doesn’t constitute a derivative work’
Sorry Thom, that’s just plain BS. Once you’ve learned the basics of a language, you learn to actually use it well by reading others’ code, and open source advocates have long argued that the ability to study Open Source code to learn how to code well is one of the great benefits of Libre/OSS. Having AI help you with this doesn’t violate the principle.
The copyright argument you’re putting forward, that every snippet of copyrighted code is itself copyright protected, was last advanced by SCO, and I believe we’re all glad about how that turned out for them.
Except that neural networks don’t “learn”. A certain input will always produce the same output, and that output will be a function only of the input and the training set.
crystall,
There’s no reason a neural net can’t be stateful and adaptive though (not that GitHub is using them that way, but just saying…). In theory, our brains may be little more than neural nets, albeit very large ones. It will be very interesting to see what artificial neural nets can do once they become larger than our own and are given lifelike training conditions.
crystall,
That is incorrect. Neural networks use random initialization and random updates precisely to avoid being deterministic.
After all, if you could perfectly fit the data, you would not need a complex model in the first place.
sukru,
I could be wrong, but I thought crystall was talking about the evaluation of a completed NN and not the training phase, which is more chaotic. The neural nets used in self-driving cars, for example, are (probably) deterministic for each release of the software – given the same input, they will produce the same output. Theoretically they could add randomness into the NN’s realtime evaluation, but what would be the purpose of intentionally deviating from the NN’s optimal & expected output for a given input?
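For what it’s worth, here is a minimal sketch of that distinction using a toy one-layer NumPy “network” (not anything GitHub or Tesla actually ships): training-time initialization is random, but evaluating the finished network is deterministic.

    import numpy as np

    # Training-time randomness: weights start from a random initialization,
    # so two training runs generally produce different networks.
    rng = np.random.default_rng()        # unseeded: differs on every run
    W = rng.standard_normal((4, 3))

    # Evaluation-time determinism: once W is frozen for a release,
    # the same input always maps to the same output.
    def forward(x):
        return np.tanh(x @ W)            # toy single-layer forward pass

    x = np.array([1.0, 0.5, -0.25, 2.0])
    assert np.array_equal(forward(x), forward(x))  # identical every call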
That sounds so dangerous to me. I am curious about how ML/AI works. I am still a newbie at this.
Input -> machine evaluates input -> output -> machine evaluates output.
The evaluation phase is bound to be error-prone at first release, so you then design the system so that it also evaluates whether the output is beneficial. If it is, the machine learns that the specific input was beneficial according to the output. But such judgments are only possible in certain scenarios that a human can easily figure out, while a machine has a hard time figuring them out. Take, for example, a self-driving car running along unpaved roads: there is no way the machine can learn anything from that, because there are no markings for it to learn from.
Another thing: in the case of a car, if an accident happens after the AI produces its output, it may be unable to evaluate that output because of the crash; or a sensor may register the sudden impact and flag the last input/output as the cause of the damage, but regardless, humans need to intervene. This is how I see it, and why development in this area is very slow.
That’s my simple understanding of machine learning.
Can you elaborate on the part about adding randomness? What are the benefits of it? I see none, only accidents.
AER,
You seem to be thinking about online learning, where there is a real-time feedback loop.
In general, these systems will prefer “offline” learning, where a model is trained beforehand and the feedback is used for the next iteration (1 hour later? next day? next week? next month?).
And AI systems that depend on machine learning (like those making decisions in a car) will generally use “reinforcement learning” and simulations, running millions or billions of scenarios to learn. For example, “AlphaGo Zero” plays against itself for days on end to learn winning strategies in the game of Go: https://deepmind.com/blog/article/alphago-zero-starting-scratch
And there are higher-level systems and redundancies in mission-critical AI, so your vehicle is very unlikely to crash. The same goes for the Mars helicopter, for example: even with a “1 frame glitch” it landed safely.
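A rough sketch of that offline pattern (hypothetical names; real pipelines are far more involved): the deployed model is frozen, feedback is merely logged, and retraining happens later as a batch step.

    # Hypothetical offline-learning loop: the serving model never changes
    # mid-flight; logged feedback only influences the *next* release.
    feedback_log = []

    def serve(model, x):
        y = model(x)                   # deterministic for this release
        feedback_log.append((x, y))    # log it; no real-time learning
        return y

    def retrain(train_fn, history):
        # Run hourly/nightly/weekly: fold the logged feedback into
        # the training data and produce the next model version.
        return train_fn(history + feedback_log)

    # Toy usage: a "model" that doubles its input.
    model_v1 = lambda x: 2 * x
    serve(model_v1, 3)                 # -> 6, logged for the next retrain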
Alfman,
If they meant the same trained model, yes, to a point. But saying the output is purely a “function of the input and the training set” would be incorrect.
Even for the same trained model:
(1) models are usually updated very often, so there is only a small window (1 day?) in which this can happen;
(2) you need to write the exact same code, probably at the exact same speed, to get the exact same output.
Of course, for common patterns like “var input = inputFactory….” you would get “.newInput()”, but it would be the same even if a human were to write that code.
sukru,
Yes, but that depends on what he meant semantically by “input”.
There’s the input that trains & generates the NN itself as output, which is typically a very random process. And then there’s the input that gets evaluated to produce real-time decision outputs, which is usually done deterministically, though it doesn’t have to be if random variation is desirable.
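One common way that optional randomness shows up at evaluation time is temperature sampling: instead of always taking the highest-scoring output, you sample from the output distribution. A toy sketch (made-up numbers, not any real product’s pipeline):

    import numpy as np

    scores = np.array([2.0, 1.0, 0.5])   # toy model outputs (logits)

    def pick(scores, temperature=0.0):
        if temperature == 0.0:
            return int(np.argmax(scores))  # deterministic: argmax every time
        p = np.exp(scores / temperature)   # softmax over tempered logits
        p /= p.sum()
        return int(np.random.choice(len(scores), p=p))  # stochastic pick

    pick(scores)                  # always 0
    pick(scores, temperature=1.0) # varies from call to call

Useful for something like code suggestions, where varied alternatives are a feature; much less so for a car deciding whether to brake.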
Hmm, I’m quite confused why you would say this in response to my post, which was about autonomous cars. It would seem to me that updating every day is way too frequent, as it would not allow sufficient time for hands-on QA testing. Besides, unless there is a serious problem, the skills needed for driving honestly do not change on a daily basis. There’s very little that a driver who is qualified to drive on the 1st of the month would need to learn to be qualified on the 2nd, and then on the 3rd.
Updates to Tesla’s autonomous driving software typically involve specific new features that are beta tested first. These are not pushed every day, and sometimes not even every month. They are also staggered across the world at different dates to minimize the risk of faulty updates.
I get the impression maybe you are talking about how GitHub’s NN works instead? My post wasn’t referring to GitHub’s approach and I’m not familiar with their implementation. Deterministic neural net output is still common though.
Alfman,
Yes, we seem to be talking about two (or three) different things.
In terms of ML models for vehicles vs. coding: if I were in charge, I would prefer faster updates for the coding models, since users will introduce new techniques and library versions every day. The driving models, on the other hand, will of course need slower releases. But I don’t think they would use *one single model*.
There’s an easy way around these licensing problems.
Train the AI on a permissively licensed codebase, such as code under the MIT or BSD licenses.
Since it’s trained on GitHub code, and repositories can set their licensing scheme, the training set might already be filtered that way.
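As a sketch of what that filtering could look like, relying on GitHub’s public search API and the repository’s self-declared license field (illustrative only; nothing here reflects how Copilot was actually trained):

    import requests

    # Hypothetical pre-filter: collect only repos whose *declared*
    # license is permissive. This trusts the self-reported field.
    PERMISSIVE = ["mit", "bsd-2-clause", "bsd-3-clause", "apache-2.0"]

    def permissive_repos(language="python"):
        for lic in PERMISSIVE:
            r = requests.get(
                "https://api.github.com/search/repositories",
                params={"q": f"language:{language} license:{lic}"},
            )
            r.raise_for_status()
            for repo in r.json()["items"]:
                yield repo["full_name"]  # e.g. "owner/repo"

Of course, this trusts the license field completely, which is exactly the weakness raised below.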
Repositories often don’t have the licensing for the repository set properly, or have mixed licensing that the repository field doesn’t well represent. Copyright law in practice is a mess of nuance and conditions, so a simple repo licensing field really isn’t enough to go on. And then you have the problem that there’s almost certainly copyright-violating code on GitHub already, and some of that likely ended up in the training data. That’s still a problem for Microsoft (GitHub) even if someone else in the chain is the one who messed up on copyright compliance originally (MS may be indemnified in terms of $ by the GitHub ToS, but they may still have the issue of the resultant code being potentially “derived” from illegally copied material).
To me it’s still a non-issue: the same code snippets can surely be found in both copyrighted and non-copyrighted source code. Do you really believe the two are fundamentally different when it comes to calling APIs or doing string or math operations? That some larger algorithms might be patented is one thing, but we’re talking about an AI helping you fill in the blanks, not writing a full application for you.
So maybe we should calm down a bit.
I have concerns about the *output* of the engine too.
If you accept a Copilot suggestion for a block of code, then given that the suggestion came from Microsoft/GitHub, presumably they still own the copyright on that code, even once it’s embedded in your project. Without a formal written document signed by the original copyright holder (GitHub), the copyright can’t be assigned back to the person who accepted the suggestion. A simple clickthrough ToS agreement (even if it included such an assignment) wouldn’t be sufficient in many jurisdictions, particularly the US. I suppose they could give an extremely broad license grant to users of Copilot, but that still has the problem that there’s Microsoft-owned copyrighted code in the resultant work, meaning all sorts of potential issues down the road.
There’s also the concern that Copilot seems to extract data from what you’re doing directly in the VS Code editor too. If there is a copyright issue with the source data being converted into the resultant database/suggestions, then there’s no way GitHub can reliably determine whether it has the right to use the code open in VS Code to build its database, regardless of what agreements the users click through.
What a wonderful first world problem.
Indeed it is, when you can create a software license that forces people using your code to open up theirs, while still calling that freedom.