Thoughts on Copyright of GitHub Copilot // Kevin Wu

Introduction

GitHub Copilot has certainly taken the internet by storm. I recently came off the waitlist and was able to give it a try. This post, though, is not about the experience of using it but about the copyright implications of Copilot and perhaps of AI-generated content in general. That being said, I'm not a lawyer and this isn't legal advice.

Many people have pointed out that Copilot uses public GitHub repositories as training data without attributing them properly, or worse, that it may infringe on public repositories with less permissive licenses. Most articles I've read on the topic, I feel, have significant gaps in their legal analysis. I'm not sure that using copyrighted material in your ML training data actually constitutes copyright infringement, and I doubt the question has even been adjudicated. Without any precedent, we can only guess what the outcome would be ¹. In this blog post, however, I'm going to approach it the way the U.S. Supreme Court approached the Google v. Oracle case and argue that even if using copyrighted material in ML models is infringement, Copilot can use the Fair Use doctrine as an affirmative defense ².

Fair Use Defense

Typically for Fair Use, we look at four factors: transformation, nature of the copyrighted work, amount and substantiality of the portion taken, and effect on the market (market usurpation).

Transformation is about how the new work is used. Is it used for the same purpose or for a new, innovative purpose? Copyright law was created to protect innovation, so courts want to encourage creative uses that change the purpose of the work.

Nature of the copyrighted work considers whether the new work can benefit the public. This can take a lot of different forms in a fair use analysis, but it often comes down to whether the new work is purely creative or is meant to be a product sold to a market.

Amount and substantiality of the portion taken considers how much of the copyrighted work is used in the new work and how "substantial" that portion is. For example, if you take a photograph of a landmark and someone crops out the background, you could argue they took a small portion of your work but the most substantial part of it.

Finally, effect on the market considers whether the new product could effectively substitute for the original product. That is, could the new product take market share from the original?

Typically, a judge would determine fair use on a case-by-case basis by weighing these factors against each other.

Transformation

I think Copilot is largely using the repositories in a transformative way. Most of the repositories on GitHub are not being used to generate AI recommendations in a code editor, so you can easily argue that Copilot uses the code transformatively: it uses the code to generate more code through an AI model, which is effectively a different use.

There are a few repositories out there though, like TabNine, which do endeavor to do the same thing as Copilot, so this factor would get murkier in their case, but I'd argue that it's still transformative.

Copilot is also using the code transformatively in another sense. Most of the code on GitHub is meant to be executed, but to my knowledge, Copilot doesn't execute the code it is trained on (or at a minimum doesn't execute the code for the same reason a typical user would). So Copilot transforms the purpose of the code from a program to be executed into training material for an AI model.

Accordingly, Copilot wins out on this factor.

Nature of the Copyrighted Work

In this case, GitHub is not selling Copilot, so it's not for-profit. As for benefit to the public, I think it has broad potential to benefit the developer community. Therefore, I would say Copilot wins on this factor.

Amount and Substantiality of the Portion Taken

This one is particularly easy. GitHub is likely taking entire repositories, so it takes all parts of the copyrighted works, which obviously includes the most substantial parts.

In this case, Copilot loses on this factor.

Effect on the Market

This factor is significantly weakened by the Transformation factor: since these GitHub repos and Copilot are used for different purposes, it's hard to conceive of Copilot as a market substitute for the original product. I think a company like TabNine has the best case that Copilot serves as a direct competitor to its business, but against most repository owners, Copilot would still win this factor.

I want to analyze the best possible case against Copilot, so I will assume it loses this factor against a company that is directly competing with it. That being said, it would be almost trivial for GitHub to just remove those repos from the training data if it were really an issue.

Weighing of the factors

Looking at the factors from the previous section, we see that Copilot has won two and lost two. But I wouldn't consider these factors to be equal in weight. Copyright and fair use laws were created to protect innovation, not to stifle it. Since Copilot is both transformative and stands to benefit the developer community, I think it would win the fair use analysis despite having taken the works in their entirety and potentially competing with some of the repos whose code it has taken for training.

I think of it as analogous to a collage, or more precisely a photo mosaic. In a photo mosaic, many smaller artworks are used as tiles to create a larger picture (you should google it if you haven't seen one before to get an idea). Such a mosaic would likely be considered a fair use even though the artist took the entire works and the mosaic could potentially be sold as a competitor to any of the original art pieces.

Since it is likely a fair use, GitHub wouldn't legally have to attribute all the repositories that were used as training data. That being said, I think it would be a good gesture to do the attribution as a thanks to the community, or to open source the tool itself as a way of giving back.

Footnotes

¹ Intuitively, I think it should be considered some kind of derivative work.

² For context, in this case, the Supreme Court concluded that Google's use of the Java API was fair use even though they declined to decide if Google ever committed copyright infringement to begin with. This is unusual because Fair Use is typically considered an affirmative defense to copyright infringement, which is legally distinct from not committing copyright infringement. Arguably, they should have actually determined if Google committed copyright infringement since that is their role in the judicial system, but I'm just a blogger on the internet, so I won't.