Interview with pandas developer Tom Augspurger

“The 1.0 release does not mean a conclusion, or even slowing down, of pandas’ development.”

Maika Möbus

pandas has reached the milestone version 1.0.0. The Python library for data analysis and manipulation has already been around for 12 years and is being used in production, so what led to this decision now? We spoke to Tom Augspurger from the pandas developer team. He shared some insights on the new release, his personal highlights and where pandas is headed in the future.

pandas 1.0.0 has been released on GitHub by Tom Augspurger, following a release candidate that arrived earlier this month. We interviewed Tom to find out more about the major upgrade from v0.25.3.

Let’s see what he has to say about the new NA scalar, breaking changes and the upgrade process, and what features he is looking forward to implementing in the future.

JAXenter: You’ve made the decision to finally push pandas to version 1.0. What considerations played a part in this?

Tom: We started thinking about pandas 1.0 in earnest at our first developer sprint in July 2018. At the time, we optimistically targeted January 2019 (6 months from the sprint). We ended up needing another 18 months of development.

pandas has been “production ready” for a while now, in the sense that it’s used in production at many institutions. But we still had a few major items we wanted to iron out before calling 1.0:

  1. Clean up the API. We’d accumulated a large number of deprecations for duplicative or unclear behavior. Many of these were enforced for 1.0.
  2. Stabilize the data model. Starting around pandas 0.23 (May 2018), we clarified exactly what kind of data could be stored in a Series or DataFrame. Historically, this was just NumPy arrays or a few “extension types” that pandas defined. The 0.23 release included an interface that specified what kind of array can be stored inside pandas, and over the subsequent releases we refined that interface.

With these stabilizations in place, we felt that a 1.0 was appropriate.
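
As a rough illustration of the data model Tom describes, the sketch below builds one Series backed by a plain NumPy array and two backed by extension types, then checks which is which. The variable names are ours and purely illustrative; the check relies on `pandas.api.extensions.ExtensionDtype`, the public base class for extension dtypes.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionDtype

# Three Series with different backing storage (names are illustrative).
numpy_backed = pd.Series([1.0, 2.0, 3.0])                   # plain NumPy float64
categorical = pd.Series(["a", "b", "a"], dtype="category")  # extension type defined by pandas
nullable_int = pd.Series([1, 2, None], dtype="Int64")       # nullable integer extension type

examples = [
    ("numpy_backed", numpy_backed),
    ("categorical", categorical),
    ("nullable_int", nullable_int),
]

for name, s in examples:
    # An extension-backed Series carries a dtype derived from ExtensionDtype;
    # a NumPy-backed Series carries an ordinary numpy.dtype instead.
    is_extension = isinstance(s.dtype, ExtensionDtype)
    print(name, s.dtype, type(s.values).__name__, is_extension)
```

Third-party libraries can use the same interface to define their own array types, which is what makes a stable specification of it important.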

JAXenter: What is your personal highlight in pandas 1.0?

Tom: The new NA scalar to represent scalar missing values. This is the value used to represent “missing” in our new nullable integer, boolean, and string data types. Historically, we’ve used NaN (not a number), but that had several drawbacks. Most notably, NaN is a float and so cannot be used with integer dtypes. And NaN has some peculiar behavior in logical and comparison operations.

pandas.NA is a new concept in the scientific Python ecosystem, and it’s not clear how other libraries will adapt to handle it. We’re working with other libraries, including NumPy, to discover how we can best handle the concept of “missing data” across the ecosystem.
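
To make the difference between NaN and the new NA scalar concrete, here is a minimal sketch, assuming pandas 1.0 or later is installed. It shows the two drawbacks Tom mentions (NaN forcing integer data to float, and NaN's behavior in comparisons) next to how `pd.NA` behaves. The variable names are ours.

```python
import numpy as np
import pandas as pd

# NaN is a float, so one missing value turns an integer column into float64.
floats = pd.Series([1, 2, None])
print(floats.dtype)        # float64

# The nullable "Int64" dtype keeps the integers and marks the gap with pd.NA.
ints = pd.Series([1, 2, None], dtype="Int64")
print(ints.dtype)          # Int64
print(ints[2] is pd.NA)    # True

# Comparisons: NaN is unequal to everything, including itself,
# whereas a comparison involving pd.NA yields pd.NA ("unknown").
print(np.nan == np.nan)    # False
print(pd.NA == 1)          # <NA>

# Logical operations on pd.NA follow three-valued logic:
# the result is only NA when it really depends on the missing value.
print(pd.NA | True)        # True
print(pd.NA & False)       # False
print(pd.NA & True)        # <NA>
```

Because `pd.NA == 1` is neither `True` nor `False`, using it directly in an `if` statement raises an error, so `pd.isna()` is the safer way to test for missing values.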

JAXenter: For developers who use pandas, what will be the most significant changes when upgrading?

Tom: All of our API breaking changes are documented in our release notes. This release had relatively minor breaking changes. The largest change is probably that the (experimental) IntegerArray now uses the new pandas.NA scalar value rather than NaN. When upgrading, we always recommend:

  1. A careful read through of the release notes.
  2. Trying the release candidate.

We provide binaries for final releases and release candidates. Subscribe to our releases on GitHub by “watching” for releases.

Our full installation instructions are available here.
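
For code that touches the experimental IntegerArray, the change Tom mentions could look roughly like the sketch below. It assumes pandas 1.0 or later; the variable names are ours, and the conversion at the end is just one possible way to hand data to code that still expects NaN.

```python
import numpy as np
import pandas as pd

arr = pd.array([1, 2, None], dtype="Int64")
missing = arr[2]

# The missing element is now pd.NA rather than a float NaN.
print(missing is pd.NA)

# pd.isna() recognizes both NaN and pd.NA, so it is a version-tolerant check.
print(pd.isna(missing))
print(arr.isna())

# Downstream code that still expects NaN can request a float array explicitly.
as_float = arr.to_numpy(dtype="float64", na_value=np.nan)
print(as_float)
```

Checks written as `np.isnan(value)` on individual elements are the ones most likely to need attention, since they were written with a float NaN in mind.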

JAXenter: And, lastly, what are your future plans for pandas?

Tom: The 1.0 release does not mean a conclusion, or even slowing down, of pandas’ development. The roadmap always contains an up-to-date vision of the maintainers’ view for where the project should head.

Personally, I’m excited about improvements to the extension array interface, in particular the ability to use extension arrays to back Index objects for indexing. Interoperability with Apache Arrow, perhaps first via a native string array, is a promising area of development too.

And, as always, pandas’ development is guided primarily by its users and contributors. Anybody can join in the development process to influence where the project goes.

Thank you for the interview!

Author
Maika Möbus
Maika Möbus has been an editor for Software & Support Media since January 2019. She studied Sociology at Goethe University Frankfurt and Johannes Gutenberg University Mainz.
