Simon Phipps, of Open Source fame at Sun and the Open Source Initiative, wrote an interesting article titled “GitHub needs to take open source seriously”. He warns that GitHub doesn’t have a mechanism that lets new projects choose a proper license for their work. Simon also cites a survey that says this:

[…] as many as half [of the projects] include no easily identifiable copyright licensing information. About 30 percent include some sort of licensing information in the source files, and around 20 percent have a clear license or notice file that makes it obvious under what terms the code is made available.

Only 20 percent of all surveyed projects include proper licensing information in their repo? Not good, folks! What’s sad about Simon’s article is that he doesn’t provide a link to the survey or any further information on how the data was acquired.

Since I’m really interested in the topic and wanted to see some numbers, I rolled my own “survey”, and since I’m an Open Source Guy™ I will provide all the numbers and tooling so you can reproduce them. As a teaser: my numbers don’t even roughly match those of the survey Simon mentions. Let’s start with the first chart:

Licensing overview

Here you see a comparison of the total number of “interesting”, “popular forked” and “popular starred” GitHub projects that include proper license information in their repo and of those that don’t. What does “proper license information” mean? I defined it as this:

The project’s repo contains a file named “copying”, “copyright” or “license” (all case insensitive) in the root folder

OR

the project’s repo contains a file named “readme.*” (case insensitive) that in turn contains a section called “license”.

This is actually a formalization of common practice in Open Source projects. As you can see in the chart, about 78% of the 175 projects analyzed (nearly 140) contain such easily findable license information.
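The criterion above can be sketched in a few lines of Python. This is not the script I used, just a minimal illustration under a few assumptions: the function names are made up, file names are matched exactly (so a `LICENSE.txt` would not count as a license file here), and a “section called license” is taken to mean a line that consists only of the word “license” once heading markers such as `#`, `=` or `:` are stripped.

```python
from fnmatch import fnmatchcase

# File names that count as license files (compared case-insensitively).
LICENSE_FILENAMES = {"copying", "copyright", "license"}


def readme_has_license_section(text):
    """Return True if some line is just the word "license".

    Assumption: heading markers (#, =, -, *, :) and whitespace around
    the word are ignored, so "## License" and "License:" both match,
    while "Released under the MIT license" does not.
    """
    for line in text.splitlines():
        heading = line.strip().strip("#=-*: \t").lower()
        if heading == "license":
            return True
    return False


def is_properly_licensed(root_files):
    """Apply the two criteria to the root folder of a repo.

    root_files maps each root-level file name to its text content
    (the content only matters for README files).
    """
    # Criterion 1: a file named copying, copyright or license.
    if any(name.lower() in LICENSE_FILENAMES for name in root_files):
        return True
    # Criterion 2: a readme.* file with a license section.
    return any(
        fnmatchcase(name.lower(), "readme.*")
        and readme_has_license_section(content)
        for name, content in root_files.items()
    )
```

For example, `is_properly_licensed({"README.md": "# Foo\n\n## License\n\nMIT\n"})` is true because of the second criterion, while a repo containing only `main.py` fails both.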

But I wanted to dive a little deeper and analyze more projects; GitHub hosts nearly 5 million repos, after all. Since you cannot easily get a list of all repos, I took this approach: get a list of the most popular programming languages from here and fetch the most watched projects for each of them (this is the one for JavaScript). For each of the listed projects (200 per language) I applied the search criteria from above and got the following results:

Licensing by language

In essence, this analysis of 2000 projects confirms the results from the first chart: 72% of all projects provide proper licensing information, with Ruby projects having the best ratio (90.5%) and Perl the worst (57%):

Percentage of properly licensed projects
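The percentages in this chart are plain ratios of licensed projects to analyzed projects per language. A minimal sketch of that arithmetic, assuming the per-project results are available as lists of booleans keyed by language (the function names and input shape are illustrative, not taken from my script):

```python
def licensed_share(flags):
    """Percentage of True values in a sequence of per-project results."""
    flags = list(flags)
    if not flags:
        raise ValueError("no projects to analyze")
    return 100.0 * sum(flags) / len(flags)


def share_by_language(results):
    """Map each language to its percentage of properly licensed projects."""
    return {lang: licensed_share(flags) for lang, flags in results.items()}


def overall_share(results):
    """Percentage of properly licensed projects across all languages."""
    pooled = [flag for flags in results.values() for flag in flags]
    return licensed_share(pooled)
```

With 200 projects per language, a ratio of 90.5% corresponds to 181 licensed projects: `licensed_share([True] * 181 + [False] * 19)` gives `90.5`.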

The conclusion I draw from these numbers is that the situation does not seem to be as bad as Simon indicates in his article or as the survey he cites suggests. Don’t get me wrong, I would really urge GitHub to provide a simple mechanism for license selection when creating a project/repo. I cannot believe that it is so hard for users to choose from a set of options. Simon links to an interesting wizard put up by John Cowan that could serve as a template for this effort.

In the end it would serve everyone - the creators as well as the consumers of software - to know the implications of using a certain piece of code, and only a properly licensed project serves this purpose. I personally don’t go near a project that doesn’t make clear under what terms I can use its code.

P.S.: If you don’t already know, I’m a coder, so NO, I didn’t collect these numbers by hand. Get the simple Python script I used from here if you’d like to reproduce the graphs above.

[Update] GitHub has published official numbers on license usage. These numbers indicate that the survey Simon cited in his article wasn’t as wrong as I suspected it to be. At the time I wrote this article, about 10-20 percent of all projects had proper licenses, according to GitHub’s new blog post. My numbers, thus, may seem too optimistic. I blame that on the fact that I analyzed only popular projects. So either most of the popular projects are properly licensed or mostly properly licensed projects gain popularity. I cannot say.

What I find more interesting is that the relative share of licensed repos declined over time. Only when GitHub introduced their license picker and choosealicense.com did the numbers climb. This is sad at first, but probably just a consequence of GitHub’s (and git’s) rising popularity even for very small repos (such as dotfiles etc.). I’m glad GitHub is on board with this topic. There’s still much to be achieved with regard to the sheer number of unlicensed repos.