Giving credits to contributors: automatic statistics from our issue tracker

I created this simple project:

to help gather statistics of users who:

  • created new issues
  • commented in our issues

and ran it on the https://github.com/j3-fortran/fortran_proposals repository. As you can see, it gives useful statistics: the number of contributors and how active each one is. I propose that we list all contributors somewhere on our website; we can start by listing them alphabetically. We can gather full names later; for now I just printed the GitHub IDs. We can also list contributors by when they contributed, since the script also records the date of each comment. For example, we could gather everyone who contributed in the past month and list them in our newsletter, highlighting first-time contributors. One could also make graphs, etc. Let me know what ideas you have for this data.

Here is the output:

$ python stats.py 
Statistics

Contributors who created new issues and how many:
  N   GitHub ID              Number of issues
  1.  certik                   58
  2.  jacobwilliams            20
  3.  zjibben                  10
  4.  marshallward              6
  5.  aradi                     6
  6.  arjenmarkus               6
  7.  milancurcic               6
  8.  Libavius                  6
  9.  FortranFan                6
 10.  urbanjost                 5
 11.  klausler                  5
 12.  rweed                     4
 13.  gronki                    4
 14.  cmacmackin                4
 15.  vansnyder                 3
 16.  qolin1                    3
 17.  thenlich                  2
 18.  nncarlson                 2
 19.  everythingfunctional      2
 20.  ivan-pi                   2
 21.  pbrady                    2
 22.  ChinouneMehdi             1
 23.  srinathv                  1
 24.  reinh-bader               1
 25.  difference-scheme         1
 26.  jvdp1                     1
 27.  rjfarmer                  1
 28.  MichaelSiehl              1
 29.  sblionel                  1
 30.  nebulaekg                 1
 31.  Beliavsky                 1

Contributors who commented and how many times:
  N   GitHub ID              Number of comments
  1.  certik                  495
  2.  klausler                186
  3.  FortranFan              178
  4.  aradi                    96
  5.  sblionel                 87
  6.  gronki                   87
  7.  jacobwilliams            74
  8.  milancurcic              70
  9.  everythingfunctional     60
 10.  cmacmackin               55
 11.  zjibben                  43
 12.  septcolor                42
 13.  marshallward             39
 14.  rweed                    30
 15.  arjenmarkus              26
 16.  Libavius                 23
 17.  difference-scheme        20
 18.  urbanjost                20
 19.  qolin1                   20
 20.  nncarlson                19
 21.  jme52                    19
 22.  tclune                   17
 23.  ivan-pi                  14
 24.  zbeekman                 13
 25.  vansnyder                11
 26.  jvdp1                    11
 27.  LKedward                  8
 28.  thenlich                  7
 29.  pbrady                    6
 30.  MichaelSiehl              6
 31.  tskeith                   5
 32.  ChinouneMehdi             4
 33.  reinh-bader               4
 34.  pdebuyl                   4
 35.  apthorpe                  4
 36.  leonfoks                  4
 37.  wolfv                     4
 38.  epagone                   3
 39.  ThemosTsikas              3
 40.  victorsndvg               2
 41.  acferrad                  2
 42.  nebulaekg                 2
 43.  gareth-d-ga               2
 44.  srinathv                  1
 45.  reubendb                  1
 46.  richardbleikamp           1
 47.  rjfarmer                  1
 48.  ghwilliams                1
 49.  opeil                     1
 50.  aamaricci                 1
 51.  rouson                    1
 52.  zmiimz                    1
 53.  traversaro                1
 54.  swpoole                   1
 55.  haraldkl                  1
 56.  longb                     1
 57.  Beliavsky                 1
 58.  gklimowicz                1
 59.  zingale                   1
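
The newsletter idea above (contributors in the past month, with first-time contributors highlighted) could be sketched roughly as follows. This is only an illustration: it assumes the harvested data can be reduced to (GitHub ID, date) pairs, and the function name and sample records are made up, not taken from stats.py.

```python
from datetime import date


def contributors_in_range(records, start, end):
    """Return (active, first_timers) for activity dated start <= d < end.

    `records` is an iterable of (github_id, date) pairs (assumed layout).
    """
    # Earliest date on which each user appears anywhere in the data.
    first_seen = {}
    for user, day in records:
        if user not in first_seen or day < first_seen[user]:
            first_seen[user] = day
    # Everyone active in the window, listed alphabetically (case-insensitive).
    active = sorted({u for u, d in records if start <= d < end}, key=str.lower)
    # A first-timer is someone whose earliest activity is not before the window.
    first_timers = [u for u in active if first_seen[u] >= start]
    return active, first_timers


# Made-up sample records, for illustration only.
records = [
    ("alice", date(2020, 3, 2)),
    ("bob", date(2020, 5, 4)),
    ("alice", date(2020, 5, 10)),
    ("Carol", date(2020, 5, 20)),
]
active, first_timers = contributors_in_range(
    records, date(2020, 5, 1), date(2020, 6, 1)
)
```

Here `active` holds everyone to list for May, and `first_timers` the subset to highlight.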

This is a great idea @certik! It would be good to get this included on the site somewhere for each of the fortran-lang projects.

I think I figured out how to present this data. I would like it to look like this:

https://github.com/sympy/sympy/graphs/contributors?from=2019-12-28&to=2020-05-27&type=c

where you can select the date range at the top, and it automatically sorts the authors by the number of commits they made in that range. So one can see who contributed the most in the past month, or who contributed the most when the project started.

Instead of commits from git history, it would use the comments data that my script harvested above.
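
The core of such a view, ranking authors by activity inside a selected date range, might look like the sketch below. It assumes the comment data is available as (GitHub ID, date) pairs; the function name and the sample data are hypothetical.

```python
from collections import Counter
from datetime import date


def rank_by_activity(comments, start, end):
    """Rank authors by number of comments dated start <= d <= end.

    `comments` is an iterable of (github_id, date) pairs (assumed layout).
    Ties are broken alphabetically so the order is stable.
    """
    counts = Counter(user for user, day in comments if start <= day <= end)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0].lower()))


# Made-up sample data, for illustration only.
comments = [
    ("alice", date(2020, 1, 5)),
    ("bob", date(2020, 4, 1)),
    ("bob", date(2020, 4, 15)),
    ("alice", date(2020, 4, 20)),
    ("carol", date(2020, 5, 30)),
]
april = rank_by_activity(comments, date(2020, 4, 1), date(2020, 4, 30))
```

Changing the two dates is all that is needed to re-rank for another period, which is what a date-range selector on the page would do.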

@lkedward do you know which libraries would let me do that? I think Bokeh could probably do it. I am hoping to keep the code relatively simple, as I can see this quickly turning into a huge project. I am hoping Bokeh and/or some JavaScript libraries can present this in a static webpage (with the data embedded in it) that we can then include somewhere at fortran-lang.

I agree, we want to keep it simple. It shouldn’t be too difficult to generate and store the data statically and have JavaScript handle the dynamic parts like date filtering and sorting (this is how the package search works currently). I’m not familiar with Bokeh, but it looks impressive. I’m happy to look into the JavaScript side of this if you want. For the static side, we just need the CI to run your script and generate a JSON file.

The JSON file actually takes a while to generate, especially if we do this for multiple repositories, so for now I was thinking of generating it manually, uploading it somewhere online (say, to a GitHub gist), and updating it from time to time. In fact, we can set up a separate CI job in a separate repository that would update it periodically (once a day, once a week, or something like that). Then in our CI we simply download the latest JSON and generate the static HTML page.
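
As a rough sketch of the last step, a static page with the data embedded could be generated along these lines. The record layout, element IDs, and function name are assumptions for illustration; downloading the JSON and the JavaScript that reads it are left out.

```python
import json

# Minimal page skeleton; a front-end script would read the embedded JSON
# from the "stats-data" element (hypothetical ID) and render into "stats".
PAGE_TEMPLATE = """<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>Contributor statistics</title></head>
<body>
<div id="stats"></div>
<script id="stats-data" type="application/json">{data}</script>
</body>
</html>
"""


def render_page(records):
    """Return an HTML page with `records` embedded as JSON."""
    # Escape "</" so that text inside the data cannot close the script tag;
    # "\/" is a valid JSON escape for "/", so the data still parses the same.
    payload = json.dumps(records).replace("</", "<\\/")
    return PAGE_TEMPLATE.format(data=payload)


# Made-up sample records, for illustration only.
sample = [{"user": "alice", "date": "2020-05-10"}]
page = render_page(sample)
```

The CI job would then write `page` to a file that the website serves as-is.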

I have experience setting everything up except the last part: how to actually visualize this in a web page and have the sorting work when you change the date range selection. If you had time to help with this last part, that would be super helpful. I can set up the rest.

Ah I see, yeah that sounds like a good workflow. Yep I’m happy to help with the visualisation part!

I thought it might be worth mentioning some projects with a similar aim to us for contributor attribution:

https://allcontributors.org/

https://github.com/labhr/octohatrack

Unfortunately they don’t quite fulfil our needs:

  • All Contributors provides a GitHub bot, but it isn’t automated; it’s command-based.
  • octohatrack is a Python script, but it only lists contributors without any stats.

Hi @certik,
I’ve started prototyping a JavaScript front-end for displaying the contributor stats on the website (PR coming soon), and I’ve noticed that the issue comments do not include comments left during PR approval. This seems like an important contribution that should be counted.

I’m also wondering whether we should include commits (excluding merge commits) in our contributor stats as well?

@lkedward excellent! Are you using the JSON files from here: https://gitlab.com/fortran-lang/github_stats_data ?

Yes, we need to improve it to include all comments, great point.

Yes, we should include commits themselves. Those can be determined for example by:

ondrej@pn1707483:~/repos/stdlib(master)$ git shortlog -ns
   113  Ondřej Čertík
    76  Vandenplas, Jeremie
    47  Izaak Beekman
    31  Milan Curcic
    16  Michael Hirsch, Ph.D
    15  Juan Fiol
    14  Jeremie Vandenplas
    14  Nathaniel Shaffer
     6  Bálint Aradi
     6  Ivan
     6  milancurcic
     5  Izaak "Zaak" Beekman
     5  Pierre de Buyl
     3  J. Henneberg
     3  JHenneberg
     3  Nathaniel Shaffre
     3  nshaffer
     2  Ashwin Vishnu
     2  Pedro Costa
     1  Juan
     1  Neil Carlson
     1  sakamoti

So my scripts should probably check out the repository and use this command to extract the data. We’ll have to map the users to GitHub IDs somehow, probably by email.

Do you think the commit stats should be merged with the comments stats, or kept separate?

I think separate stats for comments and commits. We should also be able to handle duplicates from the git shortlog.

Yep, I’m using that data for now.

A problem with the output of this command is that the data isn’t broken down by date like the comment data - so we can’t plot and filter over time.
I think the commit stats should be stored separately so we have the option to plot them separately.

I’m not sure of the best way to get GitHub IDs for commits. Can you use the github3.py API to do this?

Duplicates are handled by .mailmap; we should add one.

github3.py can probably get commits, but it is more reliable and faster to just get them using git log: parse the output for each commit, extract the information, and put it into a separate JSON file. Then we can easily analyze and process it.
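
A minimal sketch of that git log approach is shown below. The chosen fields, the tab-separated layout, and the function names are assumptions, not the actual script; it also assumes author names contain no tab characters.

```python
import json
import subprocess

# One commit per line, tab-separated: hash, author name, author email,
# author date in strict ISO 8601 (%x09 is a literal tab).
LOG_FORMAT = "%H%x09%an%x09%ae%x09%aI"


def read_git_log(repo_path):
    """Run git log in `repo_path` (merge commits excluded) and return its output."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--no-merges",
         f"--pretty=format:{LOG_FORMAT}"],
        capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout


def parse_git_log(text):
    """Turn the tab-separated log output into a list of dicts."""
    commits = []
    for line in text.splitlines():
        if not line.strip():
            continue
        sha, name, email, when = line.split("\t")
        commits.append({"sha": sha, "name": name, "email": email, "date": when})
    return commits


def commits_to_json(text):
    """Serialize parsed commits, keeping non-ASCII names readable."""
    return json.dumps(parse_git_log(text), ensure_ascii=False, indent=2)
```

Because each record keeps its author date, the commit data can be filtered over time in the same way as the comment data. Running git with .mailmap applied (for example via the `%aN` and `%aE` placeholders) would be one way to fold duplicate names together.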

The GitHub API lets you obtain the email for a GitHub ID that shares it publicly. I think there is also a way to get the list of IDs for a project from GitHub, so we can then do the matching ourselves. For those IDs that do not share their email, we can provide a hand-written mapping in some file; it will not be that many people, so we can manage that.
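
The matching step could be sketched like this, with the hand-written mapping taking precedence over publicly shared emails. Fetching the public emails from GitHub is not shown; the dictionaries, function name, and sample values are all hypothetical.

```python
def map_to_github_ids(commits, public_emails, manual_map):
    """Attach a GitHub ID to each commit by matching on the author email.

    `public_emails` and `manual_map` both map email -> GitHub ID (assumed
    layout); entries in `manual_map` override `public_emails`.
    Returns (mapped_commits, sorted list of emails with no known ID).
    """
    # Compare emails case-insensitively.
    lookup = {email.lower(): gid for email, gid in public_emails.items()}
    lookup.update({email.lower(): gid for email, gid in manual_map.items()})

    mapped, unmatched = [], set()
    for commit in commits:
        github_id = lookup.get(commit["email"].lower())
        if github_id is None:
            unmatched.add(commit["email"])
        mapped.append({**commit, "github_id": github_id})
    return mapped, sorted(unmatched)


# Made-up sample data, for illustration only.
commits = [
    {"name": "Alice Example", "email": "Alice@example.com"},
    {"name": "Bob Example", "email": "bob@example.org"},
    {"name": "Carol Example", "email": "carol@example.net"},
]
public_emails = {"alice@example.com": "alice-gh"}
manual_map = {"bob@example.org": "bob-gh"}
mapped, unmatched = map_to_github_ids(commits, public_emails, manual_map)
```

The `unmatched` list shows which emails still need an entry in the hand-maintained file.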

Awesome, sounds like a good plan!
