Sep 01 2021

The Advantages of Pathlib

Using Python's Pathlib library

Python3 has a standard library with classes for filepaths. Are you using it yet?

If you have some experience using Python, you probably already know it has some good tools for ironing out differences between Windows and Unix paths, provided you don't build paths like this:

path = basepath + '/never' + '/do' + '/this'

The traditional answer has been to use libraries like os and os.path:

from os.path import join
path = join(basepath, 'a', 'better', 'path')

But in terms of ease of use, these libraries are starting to show their age.

Python3 includes pathlib, a more convenient class-based library for interacting with paths.

Pathlib Classes

By importing pathlib and using a Path class, we'll get a concrete class based on the underlying filesystem. In my case, I'm using Windows. Testing in a Python console will return a WindowsPath.

>>> from pathlib import Path
>>> pathlib_path = Path.cwd()
>>> type(pathlib_path)
<class 'pathlib.WindowsPath'>
>>> print(pathlib_path)
C:\Users\Tom\AppData\Local\Programs\Python\Python39

pathlib also has a PosixPath concrete class that you'll get from calling Path() on an Ubuntu machine, for instance. Each concrete class is inherited from a PurePath parent, and each PurePath class allows path operations, provided they don't touch the filesystem, which will error.

>>> from pathlib import PurePosixPath
>>> pure_linux_path = PurePosixPath('/usr/local/bin/python3')
>>> pure_linux_path.parent
PurePosixPath('/usr/local/bin')
>>> pure_linux_path.rmdir()
Traceback (most recent call last):
File "<pyshell#63>", line 1, in <module>
pure_linux_path.rmdir()
AttributeError: 'PurePosixPath' object has no attribute 'rmdir'

This can be an elegant way to do some filepath manipulation in the opposite platform; a necessary evil that I've sometimes run into for cross-platform CI projects.

Console comparisons with os

One pet-peeve of mine, especially when revisiting Python after some time away, is that os.path contains functions instead of methods. It is easy to forget that as my little whoops below demonstrates. The OOP consistency from pathlib avoids this.

>>> import os
>>> os_path = os.getcwd()
>>> os_path.exists()
Traceback (most recent call last):
File "<pyshell#11>", line 1, 
in <module> os_path.exists()
AttributeError: 'str' object has no attribute 'exists'
>>> os.path.exists(os_path)
True

>>> from pathlib import Path
>>> pathlib_path = Path.cwd()
>>> pathlib_path.exists()
True

Another source of errors is the inconsistent interface. Directories have to do with paths, so the function for listing them must be in os.path, right?

>>> os.path.listdir(os_path)
Traceback (most recent call last):
File "<pyshell#19>", line 1, in <module>
os.path.listdir(os_path)
AttributeError: module 'ntpath' has no attribute 'listdir'
>>> os.listdir(os_path)
['DLLs', 'Doc', ...]

>>> list(pathlib_path.iterdir())
[WindowsPath('C:/Users/Tom/AppData/Local/Programs/Python/Python39/DLLs'), 
 WindowsPath('C:/Users/Tom/AppData/Local/Programs/Python/Python39/Doc'),
 ...]

pathlib also provides some convenient attributes, retrieving values related to the path is as simple as it should be. Some of the terminology pathlib uses for paths can be quickly understood by looking at the pathlib cheatsheet.

>>> os_home = os.path.expanduser('~Tom')
>>> os.path.basename(os_home)
'Tom'
>>> os.path.dirname(os_home)
'C:\\Users'
>>> os.path.splitext(os.path.join(os_home, 'test.txt'))[1]
'.txt'

>>> pathlib_home = Path('~Tom').expanduser()
>>> pathlib_home.name
'Tom'
>>> pathlib_home.parent
WindowsPath('C:/Users')
>>> Path(pathlib_home, 'test.txt').suffix
'.txt'

Final examples

So while it is still important to avoid code like this:

path = basepath + '/never' + '/do' + '/this'

It is easy to understand the temptation, which brings me to my conclusion: pathlib enables me to think about paths the way I already do, without the hurdles of a dispersed interface. Below I've simplified a scenario I've encountered working on a CI project, again comparing os with the same logic refactored for pathlib.

import os
info_folder = os.path.join(os.environ.get('WORKSPACE', '.'), 'build', 'info')
os.makedirs(info_folder, exist_ok=True)
project.build()
with open(os.path.join(info_folder, 'results.xml')) as f:
    xml_results = f.read()
with open(os.path.join(info_folder, 'build_log.txt')) as f:
    build_log = f.read()

from pathlib import Path
workspace = Path(os.environ.get('WORKSPACE', '.'))
(workspace/'build'/'info').mkdir(parents=True, exist_ok=True)
project.build()
xml_results = (workspace/'build'/'info'/'results.xml').read_text()
build_log = (workspace/'build'/'info'/'build_log.txt').read_text()

With os, I'm almost forced to over-name things to keep the verbosity down, hence the info_folder variable. There really isn't a need for such a variable when using pathlib. I can use forward slashes on either platform and pathlib will manage the differences behind the scenes. This matches Java/Groovy behavior I've used in Jenkins Pipeline before too, so switching languages feels smoother. I can get back to how I actually think about the path and see the portion I care about, under one library of consistently named methods. Which of the above would you rather read?

If that isn't enough to convince you to change your code to pathlib, there is also the flexibility of partially swapping out pathlib without having to adjust for each new pathlib method. Thanks to PEP 519 and the PathLike base class, pathlib paths resolve to strings and can be used as arguments to built-in functions as if they were the path strings from os functions.

>>> with open(os.path.join(os.path.expanduser('~Tom'), 'test.txt')) as f:
...     x = f.read()
...
>>> with open(Path.home()/'test.txt') as f:
...     y = f.read()
...
>>> with (Path.home()/'test.txt').open() as f:
...     z = f.read()
...
>>> min(x == y, y == z)
True

So while using classes instead of strings comes with a bit more resource overhead, the reduction in errors and readability you get from pathlib is usually well worth it. I would love to see more code use pathlib, so if you aren't already using it, I hope this post has swayed you.