and ran it on the https://github.com/j3-fortran/fortran_proposals repository. As you can see, this gives great statistics: the number of contributors and how active each contributor is. I would like to propose that we list all contributors somewhere on our website; we can start by listing them alphabetically. We can gather their full names later; for now I just printed their GitHub IDs. We can also list them based on when they contributed, since the script also gathers the date of each comment. For example, gathering the contributors from the past month and listing them in our newsletter (highlighting first-time contributors) would be really helpful. One can also do graphs, etc. Let me know what ideas you have for using this data.
where you can select the date range above, and it automatically sorts the authors by the number of commits they made in the given range. So one can see who contributed the most in the past month, or who contributed the most when the project started.
Instead of commits from git history, it would use the comments data that my script harvested above.
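A minimal sketch of that filtering and sorting, applied to comment data instead of commits. The record layout (a list of dicts with `author` and `date` keys) and the function name `rank_authors` are assumptions for illustration; the real JSON produced by the script may look different.

```python
from collections import Counter
from datetime import date


def rank_authors(comments, start, end):
    """Count comments per author with start <= date <= end, most active first."""
    counts = Counter(
        c["author"]
        for c in comments
        if start <= date.fromisoformat(c["date"]) <= end
    )
    # Sort by count (descending), then by name so ties have a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample records, only to show the assumed shape of the input.
comments = [
    {"author": "alice", "date": "2020-05-03"},
    {"author": "bob", "date": "2020-05-10"},
    {"author": "alice", "date": "2020-05-21"},
    {"author": "carol", "date": "2020-03-15"},
]

print(rank_authors(comments, date(2020, 5, 1), date(2020, 5, 31)))
# → [('alice', 2), ('bob', 1)]
```

The same logic is small enough to port to JavaScript if the filtering ends up running in the browser.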
@lkedward do you know what libraries could allow me to do that? I think Bokeh could probably do it. I am hoping to keep the code relatively simple, since I can see this quickly turning into a huge project. I am hoping Bokeh and/or some JavaScript libraries can present this in a static webpage (with the data embedded in it) that we can then include at fortran-lang somewhere.
I agree, we want to keep it simple. It shouldn’t be too difficult to generate and store the data statically and have JavaScript serve up the dynamic parts like date filtering and sorting (this is how the package search currently works). I’m not familiar with Bokeh, but it looks impressive. I’m happy to look into the JavaScript side of this if you want. For the static side, we just need the CI to run your script and generate a JSON file.
The JSON file actually takes a while to generate, especially if we do this for multiple repositories, so for now I was thinking of generating it manually, uploading it somewhere online (say, into a GitHub gist), and updating it from time to time. In fact, we can set up a separate CI job in a separate repository that would update it periodically (once a day or once a week or something like that). Then in our CI, we simply download the latest JSON and generate the static HTML page.
I have experience setting everything up except the last part: how to actually visualize this in a web page and have the sorting work when you change the date range selection. If you had time to help with this last part, that would be super helpful. I can set up the rest.
Hi @certik,
I’ve started prototyping a JavaScript front-end for displaying the contributor stats on the website (PR coming soon) and I’ve noticed that the issue comments do not include comments left during PR approval - this seems like an important contribution that should be counted.
I’m also wondering whether we should include commits (excluding merge commits) in our contributor stats as well?
Yes, we need to improve it to include all comments, great point.
Yes, we should include commits themselves. Those can be determined for example by:
ondrej@pn1707483:~/repos/stdlib(master)$ git shortlog -ns
113 Ondřej Čertík
76 Vandenplas, Jeremie
47 Izaak Beekman
31 Milan Curcic
16 Michael Hirsch, Ph.D
15 Juan Fiol
14 Jeremie Vandenplas
14 Nathaniel Shaffer
6 Bálint Aradi
6 Ivan
6 milancurcic
5 Izaak "Zaak" Beekman
5 Pierre de Buyl
3 J. Henneberg
3 JHenneberg
3 Nathaniel Shaffre
3 nshaffer
2 Ashwin Vishnu
2 Pedro Costa
1 Juan
1 Neil Carlson
1 sakamoti
So probably my scripts should check out the repository and use this command to extract the data. We’ll have to map the users to GitHub IDs somehow, probably by email.
Do you think the commit stats should be merged with the comments stats, or kept separate?
A problem with the output of this command is that the data isn’t broken down by date like the comment data - so we can’t plot and filter over time.
I think the commit stats should be stored separately so we have the option to plot them separately.
I’m not sure of the best way to get GitHub IDs for commits - can you use the github3.py API to do this?
Duplicates are handled by .mailmap; we should add one.
github3.py can probably get commits, but it is more reliable and faster to just get the commits using git log, parse the output for each commit, extract the information, and put it into a separate JSON file. Then we can easily analyze and process it.
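A rough sketch of that git log approach, which also keeps a date per commit so the data can be filtered over time. The function names and the JSON layout are assumptions, not the actual script. The `%aN` and `%aE` placeholders are used because they respect a `.mailmap` file if one is present, and `--no-merges` excludes merge commits.

```python
import json
import subprocess

# Tab-separated fields: hash, author name, author email, author date (ISO 8601).
LOG_FORMAT = "%H%x09%aN%x09%aE%x09%aI"


def parse_log(text):
    """Turn the tab-separated `git log` output into a list of dicts."""
    commits = []
    for line in text.splitlines():
        if not line.strip():
            continue
        sha, name, email, when = line.split("\t")
        commits.append({"sha": sha, "name": name, "email": email, "date": when})
    return commits


def read_commits(repo_path):
    """Run `git log` in an already checked-out repository."""
    result = subprocess.run(
        ["git", "log", "--no-merges", f"--pretty=format:{LOG_FORMAT}"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_log(result.stdout)


def write_commits_json(repo_path, json_path):
    """Store the commit records in their own JSON file, separate from comments."""
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(read_commits(repo_path), f, ensure_ascii=False, indent=2)
```

For example, `write_commits_json("stdlib", "commits.json")` would write one record per non-merge commit, assuming `stdlib` is a local clone (both names are placeholders).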
The GitHub API allows us to obtain the email for a GitHub ID that shares it publicly. I think there is a way to get the list of IDs for a project from GitHub, so we can then match them ourselves. For those IDs that do not share their email, we can provide a hand mapping in some file; it will not be that many people, so we can manage that.
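The matching step could be sketched as follows. This does not call the GitHub API; it assumes the email-to-ID table for public emails has already been built somehow, and that the hand mapping has been loaded from a file into a dict. The names `resolve_logins`, `public_emails`, and `manual_map` are hypothetical.

```python
def resolve_logins(commits, public_emails, manual_map):
    """Attach a GitHub ID to each commit record by author email.

    Returns the annotated commits plus a sorted list of emails that
    matched neither table, so they can be added to the hand mapping.
    """
    # Compare emails case-insensitively.
    public = {email.lower(): login for email, login in public_emails.items()}
    manual = {email.lower(): login for email, login in manual_map.items()}

    resolved = []
    unmatched = set()
    for commit in commits:
        email = commit["email"].lower()
        # Prefer the public email, fall back to the hand-maintained mapping.
        login = public.get(email) or manual.get(email)
        if login is None:
            unmatched.add(email)
        resolved.append({**commit, "login": login})
    return resolved, sorted(unmatched)
```

Returning the unmatched emails makes it easy to see which people still need an entry in the hand mapping.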