A guide to make Python packaging easier for scientists
I’ve spent the last few months working on creating a Python packaging guide. This guide seeks to help those creating new scientific Python packages select a packaging tool and workflow. This guide also supports the pyOpenSci peer review process.
Below, I provide a brief overview of our content development process, now that the packaging tools chapter of the guide has been published! Yay!
TL;DR
There are a few key takeaways from this post:
- Many learn how to create a Python package by copying the structure of other packages built by maintainers they respect. While this is an OK approach, it doesn’t help people better understand the ecosystem. And in some cases it can cause more confusion given the many different options and approaches.
- Much of the existing Python packaging blog post and guidebook content online is dated. The community needs dependable, maintained resources that they can rely on!
- Many of the existing packaging resources assume some baseline technical knowledge about packaging, making it difficult for beginners to work through and understand the content.
- pyOpenSci is creating a Python packaging guidebook driven by significant community input from packaging maintainers, scientists and others in the community. The guide’s goal is to help people understand best practices for packaging that follow current standards and use current tools.
- We will also be creating some tutorials in the near future.
The packaging chapter of our guide is online now! Stay tuned for more content on environments, CI and testing!
My jump into packaging
In the Fall of 2022, in support of my new role as Executive Director of pyOpenSci, I began to explore Python packaging tools in an effort to update our guidebook in support of our package peer review program.
I saw significant community confusion around how to create a Python package. But, in my mind, it wouldn’t be that big of a challenge to create a guidebook.
- There are a bunch of tools out there.
- There are a bunch of Python standards that people should follow.
I just needed to find the combination of tools and standards that we could recommend to people in an attempt to demystify the packaging ecosystem.
No problem, right?
At the same time I noticed that many did not want to talk about Python packaging. And I wondered, why?
My approach to learning how to create a Python package
I’ve worked on the development of 3 other Python packages. Each time, my approach to creating a package was to ask the question:
- What would GeoPandas do?
GeoPandas is a spatial library that supports working with vector data (think points, lines and polygons). I decided to follow their structure because I respected the GeoPandas maintainers greatly, and I had contributed to the package.
My approach to packaging was: “monkey see, monkey do”. I was the monkey.
I also munched on some bananas. It worked out alright.
The monkey approach to packaging is not ideal
Copying a package’s structure is like copying code from Stack Overflow and pasting it into your workflow in hopes it runs. If it doesn’t run, you don’t know enough to fix it! Frustration sets in.
However, at least on Stack Overflow you can see when a post was published and know that it might be dated. I found it hard to find up-to-date information on Python packaging tools, which was particularly challenging given how many tool options exist. And each tool’s documentation assumed some depth of knowledge about Python packaging.
Where does the authoritative and complete guide to packaging live and who maintains it? Further, is it helpful enough for a beginner to dig into and get started quickly?
In education, early wins are key
I’ve taught data intensive science for almost 20 years. If there is one thing I know about teaching those new to technical areas, it is that early wins are critical. Whether the win is creating a simple data plot within the first 20 minutes of a workshop, or using PDM’s init command to create a package structure, an early win can motivate a beginner because:
- It builds confidence that they can do it
- That confidence allows a learner to continue even when it gets more complicated
I struggled to find any resources that provided users of Python packaging tools with early wins. Rather, I found that I needed to increase my technical knowledge of packaging to even understand many of the resources out there.
How could pyOpenSci help?
To support pyOpenSci’s goals of making packaging easier for scientists while also improving package quality, I knew we needed to create a guide that would help others navigate the packaging ecosystem. At a minimum, helping users understand the tool landscape and how to pick a tool was a good start.
Python packaging is not bad. It’s just not well documented
From all of the above, I came to the conclusion that Python packaging is not bad. It’s just not well documented. If people understood what all of the tools did and how to pick one, it might be akin to shopping for a car*.
*But without the annoying salesperson who might assume you know nothing about cars if you are a woman…
A Python packaging guide for scientists
In creating this guide, I talked with scientific Python tool maintainers, folks from PyPA and Scientific Python, and maintainers of core packaging tools (such as Flit, Hatch, Poetry and PDM) to get insights into common workflows, common challenges and the tools that folks are using. This guide has been a true example of community-driven content. If you are curious, you can see the contributor list here.
The packaging chapter alone had over 200 comments to address in round 1 of review, and another 200+ in round 2. All of the chapters in our guide go through community review; however, this particular chapter elicited a LOT of strong responses regarding which tools do what, and how they should be described.
Sometimes, the discussions got tense. People have strong opinions about packaging approaches, and not everyone agrees on the best technical approach. Even more interesting, many of those involved knew something about some of the tools, but that knowledge was often based on word of mouth or a quick glance at documentation (largely because the tools are evolving quickly). The people who knew the most were also the most technical, and were often involved in the actual development of the tools.
An assessment of Python packaging tools
My takeaway from all of this:
After hundreds of comments and conversations;
After testing each one of the tools in our guide with a start to end workflow;
My takeaway is that Python doesn’t have a packaging problem (if you are a user creating a pure Python package).
Python has a much more human problem where approaches to packaging are simply unclear, not well documented and often debated - heavily.
Further, the standards created for Python packaging, while important, live on a website that is not intended for the broader public to use.
Sure, there are many tricky parts to packaging. And understanding the standards can be even trickier. This is certainly not a perfect system.
However, we can create packages using the existing tools, right now! I promise, this is true.
It’s just (extremely) hard to figure out:
- how to create a package,
- which tool to use, and
- why to use that tool over the others.
The Python ecosystem is evolving rapidly. Approaches change, and it’s hard to know which ones are the most current. Those who deeply understand the packaging challenges represent a small subset of the community, and they also tend to be highly technical.
In general, users want to use the simplest approach to publish their packages online.
Remember - early wins go a long way.
Assessment of packaging tools functionality
At the same time, I haven’t seen a good assessment of the existing tools that would help users navigate the ecosystem. I had questions about:
- what each tool does,
- what each tool could do better, and
- how I should pick a tool for my packaging needs.
It was clear that people want that guidance.
A few spoilers regarding what is to come
With all of this said, I’ll now set the stage for what’s to come from pyOpenSci in the upcoming months, and share what I’ve learned so far.
- There are a few great packaging tools that support comprehensive build workflows.
- Many of the packaging tools out there only have one core maintainer (a low bus factor). What would happen if a few of them just teamed up and worked together (and with the community) to move forward? Or could we somehow change that to add stability to the ecosystem?
- We need better documentation with clear beginning-to-end quick-start tutorials that help new users get started. If the tools were better documented, more people would use them.
Right now, Poetry is the most common (modern) packaging tool being used. Have a look at its documentation and you’ll see why! PDM, however, has numerous features that are ideal for the scientific ecosystem’s needs.
Specifically, it allows you to use different build backends, which is good news if you are creating either a pure Python package OR a package with some C/C++ extensions.
Poetry can’t (yet) be a single solution to packaging because right now its support for non-pure-Python builds is not documented (and might not ever be). But it could be a great solution for those creating pure Python packages.
What’s next?
In the next few blog posts, I’m going to present each Python build workflow tool, including:
- Flit
- PDM
- Hatch and
- Poetry.
I’ll break down the pros and cons of using each tool. I will also provide examples of what using the tool looks like. In the meantime, check out our shiny new packaging chapter here to see the overview of packaging tools and approaches for scientists creating pure Python packages.
Tutorials are also on the way
In the very near future, I’ll also create some packaging tutorials that will help you get started with creating a new package. Stay tuned for more on that as well!
Got questions on Python packaging?
If you are just getting started with Python packaging OR if you have questions about it, please use our Discourse forum to ask questions. We are happy to help!