My project is a copy of a copy of a copy of a copy…

GitHub may be a clone club, but Java code is the most original

Jane Elizabeth

GitHub is all about sharing code, so it makes sense that there’s a certain amount of code-copying on the site. However, recent research suggests that over 70% of the code on GitHub is just duplicates. While Java wins points for originality, all of the languages surveyed have a surprising amount of project plagiarism.

Who among us hasn’t copy-pasted some code from GitHub into whatever we’re working on? After all, code-sharing is GitHub’s raison d’être. However, there’s a non-trivial number of copycats on the site.

An international team of researchers based out of UC Irvine set out to study how code changes from copy to copy. But as the results came in, the “staggering” rate of duplication caused them to change focus.

The researchers analyzed over 428 million files, representing 4.5 million non-fork projects written in Java, C++, Python, and JavaScript. Of those, they found a mere 85 million unique files. In all, over 70% of the code on GitHub consists of clones of previously created files.
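
To make “unique files” concrete, here is a minimal sketch of one way to count them: hash every file’s contents and treat files with the same digest as clones. It assumes “unique” means byte-for-byte identical, which may be stricter than whatever the researchers actually used, and the toy corpus and function names are invented for illustration.

```python
import hashlib
from collections import defaultdict


def group_by_content(files):
    """Group file paths by the SHA-256 digest of their contents.

    `files` maps a path to its raw bytes (an illustrative input format).
    """
    groups = defaultdict(list)
    for path, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        groups[digest].append(path)
    return dict(groups)


def distinct_share(files):
    """Fraction of files that remain after collapsing identical copies."""
    if not files:
        return 0.0
    return len(group_by_content(files)) / len(files)


# Made-up corpus: three copies of one file plus one original.
corpus = {
    "proj_a/index.js": b"console.log('hello');\n",
    "proj_b/index.js": b"console.log('hello');\n",
    "proj_c/app.js": b"console.log('hello');\n",
    "proj_d/main.js": b"console.log('something else');\n",
}

print(f"{distinct_share(corpus):.0%} distinct files")  # prints "50% distinct files"
```

Exact hashing misses copies that differ only in whitespace or comments, so a count like this is best read as a lower bound on duplication.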

That’s a heck of a lot of copy-paste, friends.

And the award for “Most Original Code” goes to…

Interestingly, there’s considerable variation among programming languages. Some languages are big Dolly fans, others less so.

According to Lopes et al., the variation is striking. JavaScript has the highest rate of file duplication, with only 6% distinct files. C++ is a little better, at 23% distinct, followed by Python at 28%. Java is the most distinct, with about 60% original files.


The researchers’ map of code duplication plots the number of commits per project on the y-axis and the number of files in a project on the x-axis. The value of each tile is the percentage of duplicated files for all projects in the tile. Darker shades of red mean more clones.

A project-level analysis shows that between 9% and 31% of the projects contain at least 80% of files that can be found elsewhere.
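
As a rough illustration of that project-level measure, the sketch below flags projects in which at least 80% of the files also turn up in some other project. It assumes exact byte-level matches and uses an invented toy dataset; the function name and input format are hypothetical, not the researchers’ own tooling.

```python
import hashlib
from collections import defaultdict


def heavily_cloned_projects(projects, threshold=0.8):
    """Return names of projects where at least `threshold` of the files
    have a byte-identical copy in some other project.

    `projects` maps a project name to a list of file contents (bytes).
    """
    # Record which projects each distinct file content appears in.
    owners = defaultdict(set)
    for name, contents in projects.items():
        for content in contents:
            owners[hashlib.sha256(content).digest()].add(name)

    flagged = []
    for name, contents in projects.items():
        if not contents:
            continue
        shared = sum(
            1
            for content in contents
            if len(owners[hashlib.sha256(content).digest()]) > 1
        )
        if shared / len(contents) >= threshold:
            flagged.append(name)
    return sorted(flagged)


# Made-up data: a template, a near-verbatim copy of it, and an original project.
projects = {
    "template": [b"a", b"b", b"c", b"d"],
    "template_copy": [b"a", b"b", b"c", b"d", b"my own file"],
    "original": [b"x", b"y"],
}

print(heavily_cloned_projects(projects))  # prints ['template', 'template_copy']
```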

For added hilarity, the researchers mined the data for the most-reappropriated projects, i.e., files duplicated in bulk with no changes whatsoever.

  • Java – Minecraft-API and PhoneGap
  • C++ – GNU ISO C++ Library, homework templates, and Arduino examples
  • Python – Cactus, Shadowsocks, Scons
  • JavaScript – Adobe PhoneGap’s Hello World Template, which was found in a whopping 1,746 projects

Maybe don’t cheat off each other so blatantly, C++ learners?

Copy that

Why does this matter? It’s not like any of us are getting called out for plagiarism. The researchers are pretty clear that this is meant as a warning to other data scientists looking to “randomly sample” GitHub for datasets. Given the sheer number of duplicates in JavaScript, for example, it’s hard to draw any conclusions about what’s popular if you don’t control for clones.
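
Controlling for clones can be as simple as collapsing identical files before drawing a sample. The sketch below is one hedged way to do that, assuming exact content hashes are good enough; `sample_distinct` and the toy corpus are invented for illustration and are not part of any tool mentioned here.

```python
import hashlib
import random


def sample_distinct(files, k, seed=0):
    """Sample up to k file paths after keeping one path per distinct content.

    `files` maps a path to its raw bytes (an illustrative input format).
    """
    representatives = {}
    for path in sorted(files):  # sorted so the kept representative is stable
        digest = hashlib.sha256(files[path]).hexdigest()
        representatives.setdefault(digest, path)
    pool = sorted(representatives.values())
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))


# Made-up corpus: 1,000 copies of one template and three original files.
corpus = {f"clone_{i}/index.js": b"template" for i in range(1000)}
corpus.update({
    "orig_a/main.js": b"a",
    "orig_b/main.js": b"b",
    "orig_c/main.js": b"c",
})

picked = sample_distinct(corpus, 3)
print(picked)
```

A naive random sample of this corpus would almost always return three copies of the template. After deduplication the template counts once, the same as each original.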

To combat this, they created DéjàVu, a web service for retrieving clone information and analyzing source code, projects, and datasets. If you’re interested, they have lots and lots of files mapping cloned files on GitHub in Java, C++, Python, and JavaScript.

DéjàVu relies on community help. So, if you want to tag a few clones, they’d love your help.
