GitHub Copilot – Your AI-powered accomplice to steal code?

Last week GitHub and its parent company Microsoft announced “GitHub Copilot – their/your new AI pair programmer”. The New Stack, The Verge, and CNBC, among others, have reported on it extensively. There is a lot of buzz around this new service, especially within the Open Source and Free Software world, not only among developers but also among the lawyers and legal experts who support them. Yet the actual news is not that groundbreaking, because Copilot is not the first of its kind: similar ML-/AI-based offerings such as Tabnine, Kite, CodeGuru, and IntelliCode are already out there and have also been trained on public code.

Copilot is currently in “technical preview” and, according to GitHub, is planned to be offered as a commercial version.

Illustration: GitHub Inc. © 2021

Its core appears to be OpenAI Codex, a descendant of the famous GPT-3 for natural language processing. According to its homepage, it “[…] has been trained on a selection of English language and source code from publicly available sources, including code in public repositories on GitHub”. Update 2021/07/08: GitHub Support appears to have confirmed that all public code on GitHub was used as training data.

GitHub is the platform where most of the global Open Source community’s source code has accumulated by now: 65+ million developers, 200+ million repositories (as of 2021) or 23+ million owners of 128+ million public repositories (as of 2020). Alternatives have become scarce unless you want to host your code yourself.

Great, what amazing times we are living in! Sounds like with Copilot you no longer need your human co-programmers, who assisted you during the good old times in the form of pair programming or code review. Lucky you and especially your employer. On top of that you will save precious time because it will help you to directly fix a bug, write typical functions or even “[…] learn how to use a new framework without spending most of your time spelunking through the docs or searching the web”. Not to mention copying & pasting useful code fragments from Stack Overflow or other publicly available sources like GitHub.

At the same time, two essential questions arise, in case you care a bit about authorship:

  1. Did the training of the AI infringe any copyright of the original authors who actually wrote the code that was used as training data?
  2. Will you violate any copyright by including Copilot’s code suggestions in your source code?

Let’s not talk about another aspect that GitHub mentions in their FAQs – personal data: “[…] In some cases, the model will suggest what appears to be personal data – email addresses, phone numbers, access keys, etc. […]”

Let’s start with the first question: Did the training infringe any copyright?

When the homepage went online on June 29th, Microsoft was convinced that using publicly available data is “[…] common practice […]” within the domain of machine learning. Just a day later it changed its reasoning to “[…] fair use […]” (via alexjc).

Fair Use is something only known to Anglo-American jurisdictions. As GitHub is U.S. based, this doctrine probably applies. Within the European Union, the EU Copyright Directive 2019/790 would cover such endeavours, as it allows even commercial use as long as the rightholders have not expressly reserved such use (Article 4). In his blog post “Github Copilot: initial thoughts from an English law perspective” Neil Brown suggests that GitHub’s terms of service might even cover such a use case, while noting that they are not very specific. In the end this could mean that the users have consented to it by hosting their code on GitHub.

On GitHub there are numerous projects licensed under the General Public License (GPL). The main motivation behind the GPL and the Free Software movement in general is reciprocity (and of course users’ freedoms). Reciprocity means that you share your work with the public and, in return, any derivative works stay public as well, so that anyone can benefit from future changes and additions. A kind of “circular economy” of Free Software. This principle breaks down when the code is used as training data. An AI does not care about the underlying license terms, nor does it provide attribution notices when it offers suggestions based on that code, which might end up in proprietary code. It simply absorbs it.

Note that what is described above is also known as the copyleft principle whereas the copyleft effect is a myth.

The Free Software community once had to address a similar problem during the rise of Web services. Suddenly its software was no longer distributed in the form of physical copies but purely used over the network. The original license trigger – physical distribution – needed to be redefined to also include network use in order to close the so-called “application service provider loophole”, which ultimately led to the creation of the Affero General Public License (AGPL). Will we now see a “Machine General Public License (MGPL)” to close the “training data loophole”?

It is worth mentioning that there is evidence that Copilot was trained on GPL-licensed code, as it even suggests GPL-licensed code, but we will come to that in a minute.

Aside from that, there have been huge controversies around training AIs on publicly available data sets. They did not primarily revolve around copyright but around privacy: “Facial recognition’s ‘dirty little secret’: Millions of online photos scraped without consent” (IBM) or “The Secretive Company That Might End Privacy as We Know It” (Clearview AI). Still, the context was similar: data was used for a purpose for which it was not originally intended, but that purpose was neither explicitly excluded by the copyright holders nor forbidden by law.

Now to the second question: Can the use of the suggested code lead to any copyright infringement?

In general Copilot is supposed to suggest potentially useful code whilst you program. The proposed code is derived from the training data, but has been refactored by the AI to fit the current context. But to whom do those suggestions actually belong when you accept them and add them to your code? According to Copilot’s FAQ, they “[…] belong to you, and you are responsible for it.” Or in other words, don’t come knocking in case of any trouble.

What can also happen is that suggestions include code that was copied literally from others’ work and might be subject to licensing and copyright. Not to mention the quality, privacy, and security issues that need to be considered as well.

According to Copilot’s FAQ, only “[…] 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set”. In other words, roughly one in every 1000 suggestions may include straight copies that could cause copyright infringement, as it is actually the work of others that gets integrated without regard for the underlying licensing terms. If this is a valid concern, it would be a violation of any license type, no matter if Permissive or Copyleft. In her article “GitHub Copilot is not infringing your copyright” Julia Reda argues that this should not be a problem as “such use is only relevant under copyright law if the excerpt used is in turn original and unique enough to reach the threshold of originality”. Or as Andrés Guadamuz puts it in his post “Is GitHub’s Copilot potentially infringing copyright?”: “[…] in general, copyright infringement tends to be looked at from a qualitative, and not quantitative perspective.” So in sum, even if Copilot reproduces larger chunks of code, what matters is whether the reproduced part is substantial and original enough to count as an independent work that is subject to copyright.
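To put the 0.1% figure into perspective, here is a back-of-the-envelope sketch of how quickly such a small rate adds up over many suggestions. It assumes that every accepted suggestion independently has a 0.1% chance of containing a verbatim snippet. That independence assumption, as well as the names `VERBATIM_RATE` and `chance_of_verbatim`, are mine for illustration and not something GitHub has stated.

```python
# Back-of-the-envelope sketch: the independence assumption is hypothetical,
# only the 0.1% rate itself comes from Copilot's FAQ.
VERBATIM_RATE = 0.001  # "0.1% of the time"


def chance_of_verbatim(accepted_suggestions: int, rate: float = VERBATIM_RATE) -> float:
    """Probability that at least one accepted suggestion contains a verbatim snippet."""
    # Complement of "none of the suggestions is verbatim".
    return 1.0 - (1.0 - rate) ** accepted_suggestions


if __name__ == "__main__":
    for n in (10, 100, 1000, 5000):
        print(f"{n:>5} accepted suggestions: {chance_of_verbatim(n):.1%}")
```

Under these assumptions, a developer who accepts 1000 suggestions has a chance of roughly 63% of having picked up at least one verbatim snippet. Whether such a snippet is legally relevant is a separate question, as the arguments above show.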

Luis Villa adds in his high-level overview that the U.S. legal doctrine of “independent creation” might be applicable in that context as well.

Armin Ronacher posted a short video in which he demonstrates how Copilot reproduced for him the complete GPL-licensed fast inverse square root implementation from Quake, but then suggested a BSD-style license as a comment for it. Ouch … that is probably not how it is supposed to be. There are other examples in which it was able to complete a poem, leak working private API keys, or reproduce the about page of a real-world person. So let’s see what humans will do with or to it next … hopefully it will not start yelling at developers because of their ugly programming style.

Conclusion

The news about Copilot is still quite fresh, very exciting at first glance, and it feels like the discussion has just been initiated.

Right now – still being in technical preview – Copilot seems to have some major flaws. Regarding copyright: at least when you are not using it the way GitHub intended, it appears to reproduce larger, complete verbatim copies of source code that are most probably subject to copyright. If this is intended behaviour, GitHub must be transparent about the code’s origins so that the consuming developer has a chance to comply with the underlying licensing terms. Or will GitHub take full responsibility for legal compliance, as it will become a paid service anyhow? Maybe they are going to share their revenue with the original projects depending on how often Copilot cites the original work?

What could also help is an opt-out (opt-in would be my preference, but GitHub is U.S. based) that empowers GitHub users to decide for themselves whether their repositories may be used to train any AI. Adobe, for example, offers such a configuration option for its CC products.

First reports on actual user experiences are coming in, but it remains to be seen how much added value Copilot really provides compared to non-ML-/AI-powered developer assistance, ranging from simple code completion mechanisms up to sophisticated linters. While those are not really cool anymore, they have proven over decades to work very well and at least are not subject to any copyright controversy.

From a copyright perspective it could give a lasting push to the discussion about training data use in general and hopefully bring more clarity about when re-using lines of code actually constitutes a copyright violation and when it does not. It feels a bit old-fashioned that in copyright law computer code is still considered a literary work, even though the meaning of originality is fundamentally different in those two media.

At least one can now understand a bit better why Microsoft paid $7.5 billion for GitHub three years ago.