Part 3/3: Blog series on package health

This blog post is part 3 of a 3 part series on open source package health. The series was inspired by a conversation held on Twitter. This blog post is not a comprehensive perspective on what pyOpenSci plans to track as an organization. Rather, it’s a summary of thoughts collected during the conversation on Twitter that we can use to inform our final metrics.

In this post, I’ll summarize a conversation held on Twitter that gauged what the community thought about metrics to track the health of scientific Python open source packages.

Packages and open source software, a few terms to clarify

  • When I say Python package or Python open source software, I’m referring to a tool that anyone can install to use in their Python environment.
  • When I say open source or free and open source software, I’m referring to Python tools that are free to download and whose code is openly available for anyone to see (open source).

What is package health anyway?

There are many different ways to think about and evaluate open source Python package health.

Below is what I posted on Twitter to spur a conversation about what makes a package healthy. And, more specifically, what metrics we (pyOpenSci) should collect to evaluate health.

My goal: to see what the community thought about “what constitutes package health”.

The Twitter convo made me realize that there are many different perspectives that we can consider when addressing this question.

More specifically, pyOpenSci is interested in the health of packages that support science. So we may need to build upon existing efforts that have determined what metrics to use to quantify package health and customize them to our needs.

A note about our pyOpenSci packages

pyOpenSci does not focus on foundational scientific Python packages like Xarray, Dask, or Pandas. Those packages are stable and already have a large user base and maintenance team. Rather, we focus on packages that are higher up in the ecosystem. These packages tend to have smaller user bases and smaller maintainer teams (or are often maintained by a single volunteer).

Image used by Jake VanderPlas in his 2017 PyCon keynote showing the ecosystem of scientific Python packages, starting with foundational packages and moving out to the wealth of smaller, domain-specific packages. pyOpenSci focuses on the packages that are higher up in the ecosystem, which often could benefit from more support.

Our package maintainers:

  • Often don’t have the resources to build community
  • Often are keen to build their user base and to contribute to the broader scientific Python ecosystem.

Existing efforts on health metrics: the CHAOSS project and the Software Sustainability Institute

I’d be remiss if I didn’t mention that there are several projects out there that are deeply evaluating open source package health metrics.

Several people, including Nic Weber, Karthik Ram, and Matthew Turk, mentioned the value and thought put into the CHAOSS project.

The Software Sustainability Institute, led by Neil P. Chue Hong, has also thought about package health extensively and pulled together some data accordingly. Neil was also a critical guiding member of the earlier pyOpenSci community meetings that were held in 2018.

Snyk and security (security isn’t discussed in depth in this post)

One topic that I am not delving into in this post is security issues. Snyk is definitely a leader in this space and was mentioned at least once in the conversation.

Some existing metric examples

Below are some of the metrics that you can easily access via Snyk’s website.

Pandera Python package metrics on Snyk

Here you can see what a Snyk report looks like for pandera, a package accepted into our ecosystem a few years ago. Pandera gets a very healthy report because, among other metrics, it’s heavily used.

Now let’s look at pyGMT package statistics on Snyk

In comparison to pandera, pyGMT gets a lower, but still good, health score. I suspect this is due to lower community adoption and use; pyGMT is a much newer package. We’d argue, however, that pyGMT has a very healthy level of maintenance and an even healthier package structure.

And of course the Scientific Python project has also been tracking metrics for the larger packages in the ecosystem.

What metrics should pyOpenSci track for their Python scientific open source packages?

So, back to the question at hand: what should pyOpenSci be tracking for packages in our ecosystem? Hao Ye (and a few others) nailed it: health metrics are multi-dimensional.

I may be a bit biased here considering I have a degree in ecology BUT… I definitely support the ecological perspective always and forever :)

As Justin Kiggins from Napari and CZI points out, metrics are also perspective based. We need to think carefully about the organization’s goals and what we need to measure as a marker of success and as a flag of potential issues.


Alas, it is true that metrics designed for reporting that a funder requires for a grant may differ from metrics designed for internal evaluation that informs program development. pyOpenSci has a lot to unpack there over the upcoming months!

Three open source software health metric “buckets”

Based on all of the Twitter feedback (below), and what I think might be a start at what pyOpenSci needs, I organized the Twitter conversation into three buckets:

  1. Infrastructure
  2. Maintenance
  3. Community adoption (and usability??)

These three buckets are all priorities of pyOpenSci.

DEIA is another critical concern for pyOpenSci but I won’t discuss that in this blog post.

Infrastructure in a Python open source GitHub repository as a measure of package health

So here I start with Python package infrastructure found in a GitHub repository as a preliminary measure of package health. When I think of infrastructure, I think about the files and “things” available in a repository that support its use. I know that no bucket is perfectly isolated from the others, but I’m taking a stab at this here.

The code for many open source software packages can be found on GitHub. GitHub is a free-to-use website built on top of git, a version control system. Version control allows developers to track historical changes to code and files. GitHub also allows developers to communicate openly, review new code changes, and update content in a structured way.

What does GitHub (and Ivan) think about health checks for Python open source software?

Ivan Ogasawara is a long time advisor, editor, and member of the pyOpenSci community. He’s also generally a great human being who is growing open science efforts such as Open Science Labs, a global community devoted to education efforts and tools that support open science.

Ivan was quick to point out some basic metrics offered by GitHub, which follow GitHub’s online community standards guidelines.

Actually it’s totally related, Ivan! Let’s have a look at the pyOpenSci contributing-guide GitHub repository to see how we are doing as an organization.

Note that we are missing some important components:

  • A code of conduct
  • A contributing file that helps people understand how to contribute
  • Issue template for people opening issues
  • Pull request templates to guide people through opening pull requests
  • Repository admins accepting content reports
Here you can see the community standards page in GitHub for our contributing-guide repository. Note that we are missing several important items in the repo, including a code of conduct file, a contributing file that helps people understand how to contribute to the guide, and issue and pull request templates. HELP!

Um…. we’ve got some real work to do on our guides and repos, y’all. We need to set a better example here, don’t we? Help is welcome if you are reading this and wanna contribute. Just sayin’…

GitHub’s bare minimum requirements are a great start!

The GitHub minimum requirements for what a software repository should contain are a great start towards assessing package health. In fact, I’ve created a TODO to add a link to these checks to our pre-submission and submission templates, as these are things we want to see too, and to update our repos accordingly.

Health check #1: are all GitHub community checks green?

Looking at these checks more closely, you can begin to think about different categories of checks that broadly look at package usability (README, description), community engagement (code of conduct, templates), etc. A sketch of how you might collect these checks programmatically follows the list below.

The GitHub list includes:

  • Description
  • README.md file (but what’s in that?)
  • Code of conduct (but what’s in that file?!)
  • License (OSI approved)
  • Issue templates (great for community building)
  • Pull request templates
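These checks are also easy to collect programmatically. Below is a minimal sketch, assuming the package is hosted on GitHub, that pulls the same community checks from GitHub’s REST API community profile endpoint (unauthenticated calls are rate limited; the repository name is just the example discussed above).

```python
# A minimal sketch: query GitHub's community profile endpoint, which reports
# the same community checks listed above. Requires the requests library.
import requests

REPO = "pyOpenSci/contributing-guide"  # the example repo discussed above
url = f"https://api.github.com/repos/{REPO}/community/profile"

profile = requests.get(url, timeout=10).json()
print("Community health score:", profile["health_percentage"], "%")

# Each file-based check (code of conduct, contributing, templates, etc.)
# comes back as None when the file is missing from the repository.
for name, found in profile["files"].items():
    print(f"{name}: {'present' if found else 'MISSING'}")
```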

These checks are great but don’t look at content and quality

But these checks don’t look at what’s in that README, or how the issue templates are designed to invite contributions that are useful to the maintainers (and that guide new potential contributors).

In short, GitHub checks are excellent but mostly exterior infrastructure focused. They don’t check content of those files and items.

So what do content checks look like?

As Chris mentions below, having a clearly stated goal and intention, likely articulated in the README file, is a sign of a healthy project. Ideally this goal was developed before development began. Further, if well written, it helps keep the scope of the project manageable.

Test suites and Python versions

Another topic that came up in the discussion was testing and test suites. Evan, who has been helping me improve our website navigation, suggested looking at test suites and what versions of Python those suites are testing against.

Test suites are critical not only to ensure the package functionality works as expected (if the tests are designed well); they also make it easier for contributors to check that changes they make to the code in a GitHub pull request don’t break things unexpectedly.

Checks can also be run in a Continuous Integration (CI) workflow to ensure code style is consistent (e.g., using formatting and linting tools such as Black) and to test documentation builds for broken links and other potential errors.
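To make the testing piece concrete, here is a minimal sketch of a single pytest test. The convert_temperature function is a hypothetical example, not from any real package; a real suite would grow from here and would typically run in CI against every Python version the package supports.

```python
# A minimal pytest sketch; convert_temperature is a hypothetical helper.
import pytest


def convert_temperature(celsius: float) -> float:
    """Convert a temperature from Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32


def test_freezing_point():
    # Water freezes at 0 degrees C / 32 degrees F
    assert convert_temperature(0) == 32


def test_rejects_non_numeric_input():
    # Dividing a string by an int raises TypeError, so bad input fails loudly
    with pytest.raises(TypeError):
        convert_temperature("cold")
```

Running `pytest` from the package root discovers and runs these tests; a CI service can repeat that run across multiple Python versions.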

How should pyOpenSci handle Python versions supported in our review process?

In fact, the website that you are on RIGHT NOW has a set of checks that run to test links throughout the site and to check for alt tags in support of accessibility (alt tags help people who use screen readers navigate a website).

Notice in this output from htmlproofer on GitHub Actions (continuous integration) that every page with a broken link or image with a missing alt tag is flagged. Any flags will result in a broken build on GitHub - the dreaded red x.
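htmlproofer handles this for us in CI, but for illustration, here is a simplified sketch (not htmlproofer itself) of the kind of check involved: flagging img tags that are missing alt text. The file path is hypothetical.

```python
# Simplified illustration of an accessibility check: find <img> tags with no
# alt text in an HTML file. Uses only the standard library.
from html.parser import HTMLParser


class AltTagChecker(HTMLParser):
    """Collect <img> tags that have a missing or empty alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing_alt = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            if not attrs.get("alt"):
                self.missing_alt.append(attrs.get("src", "<unknown image>"))


# Hypothetical file path, for illustration only
with open("index.html", encoding="utf-8") as f:
    checker = AltTagChecker()
    checker.feed(f.read())

for src in checker.missing_alt:
    print(f"Missing alt text: {src}")
```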

Infrastructure: Is it easily installable?

How the package is installed is another critical factor to consider. While these days most packages do seem to be uploaded to PyPI, some still aren’t. And there are other package managers to consider too, such as Conda.
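Whether a package is published on PyPI is one of the easier things to check automatically. Here is a minimal sketch using PyPI’s public JSON API; the package names are just examples.

```python
# A minimal sketch: check whether a package is published on PyPI using the
# public JSON API. Requires the requests library.
import requests


def on_pypi(package_name: str) -> bool:
    """Return True if the package has a listing on PyPI."""
    resp = requests.get(f"https://pypi.org/pypi/{package_name}/json", timeout=10)
    return resp.status_code == 200


print(on_pypi("pandera"))  # True: published on PyPI
print(on_pypi("a-package-name-that-does-not-exist"))  # False
```

A similar check could be run against conda channels for packages distributed that way.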

Maintenance activity as a metric of health

The second topic that came up frequently on Twitter was the issue of maintenance.

Jed Brown had some nice overarching insight here on things they look at that are indicators of both maintenance and bus factor (a risk factor, mentioned below, that measures how many people and institutions support maintenance). More people and more institutions equal lower risk; fewer people and fewer institutions supporting the package equate to a higher maintenance risk (or risk of the package becoming a sad orphan with no family to take care of it).

How many times have you tried to figure out what Python package you should use to process or download data, and you found 4 different packages on PyPI all in varying states of maintenance?

I’ve certainly been there. So has RenéKat, it seems:

It’s true. For a scientist (or anyone) it’s a waste of time to install something that won’t be fixed as bugs arise. It’s also not a good use of their time to have to dig into a package repository to see if it’s being maintained or not.

pyOpenSci does hope to help with this issue through a curated catalog of tools which will be developed over time.

But what constitutes maintenance?

How do we measure degree of maintenance? Number of issues being addressed and closed? Average commits each month, quarter or year?

This could be a relative metric too. Some package maintainers may spend lots of time on issues or have too many to handle quickly, as Melissa points out in reply to a comment about evaluating maintenance by looking at issues being closed:

But alas, I think there are ways around that. We can look at commits, pull requests, and such just to see if there’s any activity happening in the repository. Or whether it’s gone dark (dark referring to no longer being maintained: issues going unanswered, bugs going unfixed, etc.).
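Some of this activity is straightforward to pull. Below is a minimal sketch, assuming the package lives on GitHub, that grabs the date of the most recent commit and a rough count of commits over the last quarter; the repository name is just an example, and unauthenticated API calls are rate limited.

```python
# A minimal sketch of pulling recent activity from the GitHub REST API.
# Requires the requests library.
from datetime import datetime, timedelta, timezone

import requests

REPO = "pyOpenSci/python-package-guide"  # example repo; swap in any repository
API = f"https://api.github.com/repos/{REPO}/commits"

# Date of the most recent commit
latest = requests.get(API, params={"per_page": 1}, timeout=10).json()[0]
print("Last commit:", latest["commit"]["committer"]["date"])

# Commits over the last ~quarter (first page only; paginate for an exact count)
since = (datetime.now(timezone.utc) - timedelta(days=90)).strftime("%Y-%m-%dT%H:%M:%SZ")
recent = requests.get(API, params={"since": since, "per_page": 100}, timeout=10).json()
print("Commits in the last 90 days (first page):", len(recent))
```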

Greg, interestingly, suggested one might be able to model the expected future lifetime of a package based upon current (and past?) GitHub activity.

Uh oh! But are commits enough, Kurt asks? Is there such a thing as a perfect project?

Koen had a broader, more profound thought that would be ideal to consider when creating a new package, especially a small package that supports specific scientific workflows.

Does it do one thing, well? Really well?

Yes, please.

While this might be challenging to enforce in peer review, it is a compelling suggestion.

How do developers evaluate package maintenance?

There is a developer perspective to consider here too. Yuvi Panda pointed out a few items that they look for:

  1. Frequency of merged commits
  2. Bus factor
  3. Release cadence (a topic brought up a few times throughout the discussion)
Bus factor refers to the degree of risk associated with a package based on the number of maintainers and organizations supporting it.

Remember, bus factor has nothing to do with buses, but there is some truth to the analogy of what happens when the wheels fall off.

One thought I had here was to look at commits from the maintainer relative to total commits to get a sense of community contribution (if any).
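Here is a minimal sketch of that idea, assuming the package is hosted on GitHub: the share of commits made by the top contributor is a rough proxy for bus factor (a value near 100% suggests a bus factor of one). The repository name is just an example.

```python
# A minimal sketch: estimate how concentrated commits are among contributors
# using the GitHub REST API. Requires the requests library.
import requests

REPO = "pyOpenSci/python-package-guide"  # example repo; swap in any repository
url = f"https://api.github.com/repos/{REPO}/contributors"

# GitHub returns contributors sorted by number of commits, highest first
contributors = requests.get(url, params={"per_page": 100}, timeout=10).json()
total_commits = sum(c["contributions"] for c in contributors)
top_share = contributors[0]["contributions"] / total_commits

print(f"{len(contributors)} contributors (first page)")
print(f"Top contributor made {top_share:.0%} of the commits")
```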

The CHAOSS project has an entire working group devoted to risk.

Or perhaps pyOpenSci asks its maintainers what their perceived risk is? That is: how long do you think the package might remain maintained? They will obviously know better than anyone what their funding environment and support are like.

Erik suggested that metrics can be dangerous and somewhat subjective at times. Akin to the whole “maps can lie” idea; data can lie too. OK, it’s our interpretation that is the risk or the lie, not the data, but… you follow me, yea?

Some, including Pierre, brought up the idea of consistent releases. Not necessarily frequent releases, but just some consistency to demonstrate that the package is being updated.

Other discussions revolved around semantic versioning and release roadmaps.

Community adoption of an open source Python tool

Community adoption of a scientific Python package was another broad category seen over and over throughout the Twitter conversation.

  • How many users are using the tool?
  • How many stars does the package have?
  • How often is the package cited?

Is the package cited?

While we’d love to quantify citations, the reality of this is that most people don’t cite software. But some do, and we hope you are one of them!

What about stars (and commits) as metrics of adoption (and maintenance)?

The tweeter below looks at stars and commit date as signs of community adoption and maintenance.

As Chris Holdgraf mentions below, a package can reach a point where the same type of activity can have varying impacts on the perceived level of maintenance. Many users opening issues can represent community interest and perhaps even community adoption. And massive volumes of unaddressed issues might represent unresponsive maintainers.

Or perhaps the maintainers are just overwhelmed by catastrophic success.

Yup

But I need at least 5 (thousand) croissants, now. ANDDDD so does my friend.

Juan agrees that a steady stream of issues suggests adoption. Especially since opening issues on GitHub suggests that the users have some technical literacy.

Metrics quantifying community around tools

I’d be remiss if I didn’t at least mention that some of the discussion steered towards community around tools. For instance, Evan brought up community governance being a priority.

But the reality of our users was summarized well here by Tania. Most scientists developing tools are trying to simplify workflows that involve repeated code, workflows that others may be building to do the same thing. They aren’t necessarily focused on community, at least not yet.

Further, capturing metrics around community is hard, as Melissa points out. Most of the above resources don’t capture these types of items. And also, how would one capture the work of a community manager quantitatively?

Are some things missing here? Yes, of course.

But it’s a great start!

Some other items that didn’t come up in the conversation include:

  • Download counts (see the sketch below for one way to pull these)
  • Packages found as dependencies or in environments on GitHub
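For downloads, one possible source is pypistats, which reports PyPI download counts. Below is a minimal sketch, assuming the pypistats.org JSON API is available at the endpoint shown; download counts have well-known caveats (mirrors, CI re-installs), so they are best read as a rough signal.

```python
# A minimal sketch of pulling recent download counts. Assumes the pypistats.org
# JSON API endpoint below; requires the requests library.
import requests

package = "pandera"  # example package from our ecosystem
resp = requests.get(f"https://pypistats.org/api/packages/{package}/recent", timeout=10)
resp.raise_for_status()

downloads = resp.json()["data"]
print(f"{package} downloads in the last month: {downloads['last_month']}")
```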

Joel rightfully noted that my original tweet seemed less concerned with package quality and more concerned with community and use. I think they are right. We are hopeful that peer review metrics and recommended guidelines for packaging will get at package quality.

Summarizing it all will be a WIP (Work In Progress)

There is a lot of work to do in this area. And a lot of work that has already been done to learn from. It’s clear to me that we should start by looking at what’s been done and what people are already collecting in this area. And then customize to our needs.

A few items that stand out to me, which we could begin collecting now around package maintenance and community adoption, are below. This list will grow, but it’s a start.

Package Maintenance and Community Adoption

  • Date of last commit
  • Date of last release
  • Annual frequency of releases
  • Number of open issues / quarter
  • Issues opened by maintainers vs. non-maintainers
  • Number of commits made by non-maintainers / year

Package quality & infrastructure

  • GitHub core checks for README, contributing guide, etc.
  • Documentation & associated documentation quality (vignettes and quick start)
  • Defined scope and intent of package maintenance
  • Testing and CI setup

I will share a more comprehensive list once we pull that together as an organization in another blog post. Stay tuned for more!

Thoughts?

If you have any additional thoughts on this topic or if I missed important parts of the conversation please share in the comment section below.
