GitHub may be a clone club, but Java code is the most original
GitHub is all about sharing code, so it makes sense that there’s a certain amount of code-copying on the site. However, recent research suggests that over 70% of the code on GitHub is just duplicates. While Java wins points for originality, all of the languages surveyed have a surprising amount of project plagiarism.
Who among us hasn’t copy-pasted some code from GitHub into whatever we’re working on? After all, code-sharing is GitHub’s raison d’être. However, there’s a non-trivial number of copycats on the site.
An international team of researchers based out of UC Irvine set out to study how code changes from copy to copy. But as the results came in, the “staggering” rate of duplication caused them to change focus.
That’s a heck of a lot of copy-paste, friends.
And the award for “Most Original Code” goes to…
Interestingly, there’s considerable variation among programming languages. Some languages are big Dolly fans, others less so.
This is their map of code duplication. The y-axis is the number of commits per project; the x-axis is the number of files in a project. The value of each tile is the percentage of duplicated files across all projects in that tile. Darker shades of red mean more clones of any given project.
A project-level analysis shows that, depending on the language, between 9% and 31% of projects are made up of at least 80% files that can be found elsewhere.
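To make that project-level number a bit more concrete, here is a minimal Python sketch of how such a figure could be computed. It is an illustration under stated assumptions, not the researchers' pipeline: it treats two files as duplicates only when their contents are byte-for-byte identical (compared via SHA-256), and the function names, the `projects` input shape, and the `threshold` parameter are all made up for this example.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def file_hash(content: str) -> str:
    """Fingerprint a file by its exact contents (assumption: exact matches only)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def duplicated_share(projects: Dict[str, List[str]]) -> Dict[str, float]:
    """For each project, return the share of its files that also appear in another project.

    `projects` is a hypothetical input: project name -> list of file contents.
    """
    hashes = {name: [file_hash(f) for f in files] for name, files in projects.items()}

    # Count how many distinct projects contain each file fingerprint.
    seen_in = Counter()
    for file_hashes in hashes.values():
        for h in set(file_hashes):
            seen_in[h] += 1

    shares = {}
    for name, file_hashes in hashes.items():
        if not file_hashes:
            shares[name] = 0.0
            continue
        duplicated = sum(1 for h in file_hashes if seen_in[h] > 1)
        shares[name] = duplicated / len(file_hashes)
    return shares


def fraction_mostly_cloned(projects: Dict[str, List[str]], threshold: float = 0.8) -> float:
    """Fraction of projects whose duplicated-file share is at least `threshold`."""
    shares = duplicated_share(projects)
    if not shares:
        return 0.0
    return sum(1 for s in shares.values() if s >= threshold) / len(shares)


if __name__ == "__main__":
    # Toy data: projects "a" and "b" share four of their five files; "c" is all original.
    toy = {
        "a": ["x", "y", "z", "w", "only-in-a"],
        "b": ["x", "y", "z", "w", "only-in-b"],
        "c": ["unique-1", "unique-2"],
    }
    print(duplicated_share(toy))
    print(fraction_mostly_cloned(toy))
```

Exact hashing only catches verbatim copies; a file with a tweaked comment or renamed variable would count as original here, so a sketch like this would likely undercount clones compared with any fuzzier matching.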
For added hilarity, the researchers mined the data for the most-reappropriated projects, i.e., files duplicated in bulk with no changes whatsoever.
- Java – Minecraft-API and PhoneGap
- C++ – GNU ISO C++ Library, homework templates, and Arduino examples
- Python – Cactus, Shadowsocks, Scons
Maybe don’t cheat off each other so blatantly, C++ learners?
DéjàVu relies on community help. So, if you want to tag a few clones, the researchers would love a hand.