API design and consistency for stdlib stats(_distribution) modules

Following Monday’s discussion about the stdlib release and API stability, I wanted to bring up a point about API design and consistency for the stats modules (with specifics from the distribution modules). I could also create an issue on GitHub, but decided to post here, since this is potentially a broader discussion not strictly tied to one issue.

There’s some inconsistency in the arguments passed to the normal and exponential distribution procedures. Consider the pdfs:

result = pdf_normal (x, loc, scale)

result = pdf_exp (x, lambda)

While the normal distribution procedure uses loc and scale (reminiscent of scipy) rather than mu and sigma, the exponential distribution procedure uses the rate lambda rather than a scale (i.e. beta = 1/lambda) and has no loc argument (though I appreciate that a location shift is needed less often than for normal distributions).

For comparison, scipy pdf arguments are as follows:
scipy.stats.norm.pdf(x, loc, scale)
scipy.stats.expon.pdf(x, loc, scale)
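
To make the two parameterisations concrete, here is a minimal sketch in plain Python (standard library only). It is not how scipy or stdlib implement the pdf; the function names expon_pdf and expon_pdf_rate are made up for illustration. It only shows that the loc/scale form with loc = 0 and scale = 1/lambda gives the same density as the rate form.

```python
import math


def expon_pdf(x, loc=0.0, scale=1.0):
    """Exponential density in location/scale form (illustrative only)."""
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    z = (x - loc) / scale          # standardised distance from the shift
    if z < 0.0:
        return 0.0                 # no mass to the left of loc
    return math.exp(-z) / scale


def expon_pdf_rate(x, lam):
    """Exponential density in rate form, lam = 1/scale (illustrative only)."""
    if lam <= 0.0:
        raise ValueError("lam must be positive")
    if x < 0.0:
        return 0.0
    return lam * math.exp(-lam * x)


# Same density, two spellings: scale = 1/lam and loc = 0.
lam = 2.0
assert math.isclose(expon_pdf(0.7, loc=0.0, scale=1.0 / lam),
                    expon_pdf_rate(0.7, lam))
```

A non-zero loc simply shifts the support so that it starts at loc instead of 0.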

This brings up a broader question of API design. If the goal is to make the API feel familiar to scipy users (as touched on in this discussion), sticking to location and scale makes sense. In that case, I’d suggest changing lambda to scale in the passed arguments. On the other hand, I have found that while the scale and location convention makes the arguments consistent across distributions, it also creates some confusion (some users do not connect the normal distribution’s scale to the standard deviation, for example, or assume that scale and location parameters mean the same thing statistically in every distribution). That is the reason I decided to stick with the more mathematical arguments in some older code that I’m currently reworking into a stats lib, while adopting procedure naming conventions more reminiscent of scipy.
Perhaps it’s worth implementing several options, allowing users to (for example) pass either scale/beta or lambda? If you’re a scipy user with little background in statistics, scale will be intuitive and lambda potentially confusing; if you come at it from the other side, it may be the other way around. However, there’s always the risk of the API getting too bloated. Anyway, this may be worth discussing.
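
As a rough illustration of the "accept either" idea, here is a small Python sketch (standard library only). The name pdf_exp_either and the keyword rate are hypothetical and are not part of stdlib or scipy; rate stands in for lambda, which is a reserved word in Python. The point is only to show the extra checks such an interface needs.

```python
import math


def pdf_exp_either(x, loc=0.0, scale=None, rate=None):
    """Exponential pdf accepting scale OR rate (hypothetical interface)."""
    if scale is not None and rate is not None:
        raise ValueError("pass either scale or rate, not both")
    if rate is not None:
        if rate <= 0.0:
            raise ValueError("rate must be positive")
        scale = 1.0 / rate         # convert to the single internal form
    elif scale is None:
        scale = 1.0                # default: standard exponential
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    z = (x - loc) / scale
    return math.exp(-z) / scale if z >= 0.0 else 0.0


# Both spellings describe the same distribution.
assert math.isclose(pdf_exp_either(0.5, rate=2.0),
                    pdf_exp_either(0.5, scale=0.5))
```

Even in this small case, the ambiguity handling takes more lines than the density itself, which is the bloat risk mentioned above.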

5 Likes

Very happy to see that there is growing interest in stdlib and that the discussion is reaching the point where the library is considered mature and ready for production.

My personal take is that discussion at the API level should not hinder progress: one PR with some code is worth a thousand threads!

It is when code is shown in action that reviewers experience the actual interface and can refine and improve on it. My experience so far has been very positive. I believe it makes sense for the exponential distribution to have an interface with the same API as the normal one.

If you look at the history, stdlib development has been kept afloat by a tiny handful of developers led by @jeremie.vandenplas’s heroic effort. Many initial discussions on the API were at some point left behind, so if you are knowledgeable in statistics, I’m sure further contributions in the area will be very much welcome.

7 Likes

OK. A PR with the suggested modifications to the procedures and API (and the associated examples and tests) is now open. :slight_smile:

2 Likes

Just a heads-up:

Since an API change proposed in this PR (modified stats exponential distribution procedures to use loc and scale) might impact some users, I’m linking it here to alert possible users, following @hkvzjal’s suggestion.

To summarise what I’m proposing with the PR (copied from GH):

I suggest introducing more consistency in the stats distribution procedures API. I have made some modifications to stdlib_stats_distribution_exponential (and associated examples and tests) to:

  1. use scale instead of lambda, and
  2. allow passing an additional location parameter loc.

With the suggested modifications in the PR, the rvs, pdf and cdf procedures can now be called as follows:
result = rvs_exp(x, loc, scale),
result = pdf_exp(x, loc, scale), and
result = cdf_exp(x, loc, scale).
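
For readers who want to see what loc and scale do to the cdf and to random sampling, here is a short Python sketch (standard library only). It is not the code in the PR; the names expon_cdf and expon_rvs are made up for illustration, and the sampler here takes only loc and scale, which is a simplification of this sketch. Sampling uses the inverse-transform method, which is an assumption and may differ from what stdlib does internally.

```python
import math
import random


def expon_cdf(x, loc=0.0, scale=1.0):
    """Exponential cdf in location/scale form (illustrative only)."""
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    z = (x - loc) / scale
    # -expm1(-z) equals 1 - exp(-z) but is more accurate for small z.
    return -math.expm1(-z) if z > 0.0 else 0.0


def expon_rvs(loc=0.0, scale=1.0, rng=random):
    """One exponential variate by inverse transform (illustrative only)."""
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    u = rng.random()               # uniform on [0, 1)
    # 1 - u lies in (0, 1], so the logarithm is always finite.
    return loc - scale * math.log(1.0 - u)


# The median of the shifted distribution sits at loc + scale * ln(2).
median = 1.0 + 2.0 * math.log(2.0)
assert math.isclose(expon_cdf(median, loc=1.0, scale=2.0), 0.5)
```

Every sample returned by expon_rvs is at least loc, and the mean of the distribution is loc + scale.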