Writing type-safe code in Python

This post is one of many I plan to write detailing techniques you can use to write "type-safe" code in Python. I use quotes there because Python does not enforce type hints at runtime. Despite this, they can make your code much safer and easier to develop against, reason about, test, and maintain.

Everything is a string

If you've ever developed or maintained an HTTP/JSON API, you've likely experienced this issue: you've got some POST endpoint that accepts a large JSON payload, which gets dumped into your lap as an arbitrarily nested dict of strings. This isn't particularly useful to us, the maintainers of this endpoint. Most web frameworks, I'm assuming, will deserialize the data into native Python objects, but this is only half the battle:

What is the difference between "123-456-7890" and "jane@example.com"?

In Python (and in virtually any programming language, really), these two pieces of information are merely strings. Of course you, the reader, know that one is a phone number and the other is an email address -- but Python doesn't. All Python sees is two strings. Python wouldn't think you were crazy for trying to store that email address in a database table called phone_number.

# my_database.py
from typing import Protocol


class Database(Protocol):
    def persist(self, value: str, table: str) -> None: ...


def persist_phone_number(phone: str, db: Database) -> None:
    db.persist(phone, table="phone_number")


def persist_email(email: str, db: Database) -> None:
    db.persist(email, table="email_address")

# my_service.py
import my_database
from typing import Tuple


# UserInfo is a tuple consisting of a phone number and an email address
UserInfo = Tuple[str, str]


def customer_info(data: UserInfo, db: my_database.Database) -> None:
    email, phone = data

    if not isinstance(phone, str) or phone == "":
        raise ValueError("Phone number must be a non-empty string")

    if not isinstance(email, str) or email == "":
        raise ValueError("Email address must be a non-empty string")

    my_database.persist_phone_number(phone, db)
    my_database.persist_email(email, db)

Can you spot the problem with this code? It's a bit of a trick question: the code runs just fine. There are no syntax errors, and we validate that phone and email are both non-empty strings, which means we're calling the persistence functions with the correct types. So what's wrong?

We've unpacked the UserInfo tuple incorrectly. That line of code should read phone, email = data.

Best case scenario: your data team has some mechanism for validating the input at the database layer, which raises a DatabaseError and crashes your program. Worst case scenario: there is no further validation and you've just stuffed an email into the phone_number table and a phone number into the email_address table -- everything worked fine because you didn't violate the API contract anywhere along the line. You performed a perfectly valid task in your Python program, but now you have a data integrity problem!

Everything is a string, and that's kind of bad

How can we prevent this? We could implement custom classes for EmailAddress and PhoneNumber, but that'll get old real quick when you also need to implement classes for UserName, Password, GivenName, FamilyName, PreferredName, Pronouns, etc. Another issue is that not all of these inputs can even be validated. This also wouldn't solve the issue of having to deal with two inputs that are the same "type" (e.g., mailing addresses -- you definitely don't want to mix up sending and receiving addresses!).

You would also need to construct instances of each of those classes whenever you receive a payload, and deal with all the overhead of maintaining a custom class definition. You may also have already taken care of data validation in the frontend, or at the webserver ingress or proxy layer, so writing another data validation step is wasteful and adds technical debt.
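
To make that overhead concrete, here is a minimal sketch of what the class-based approach might look like for a single field. The EmailAddress class and its "@" check are illustrative assumptions, not a recommended validation rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmailAddress:
    """Hypothetical wrapper class; the '@' check is only a placeholder rule."""

    value: str

    def __post_init__(self) -> None:
        if "@" not in self.value:
            raise ValueError(f"Not an email address: {self.value!r}")


# Every payload field now has to be wrapped on the way in...
email = EmailAddress("jane@example.com")
# ...and unwrapped again whenever something expects a plain str.
assert email.value.upper() == "JANE@EXAMPLE.COM"
```

Multiply this by every field in the payload and the maintenance cost becomes clear.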

Different types of string

This is where typing.NewType comes in! It has virtually no runtime cost (calling a NewType simply returns the value you passed in, unchanged), but it allows you to define different "variants" of a given data type -- in our case, str -- which our type checker will treat as completely different types:

# my_database.py
from typing import NewType, Tuple, Protocol


EmailAddress = NewType("EmailAddress", str)
PhoneNumber = NewType("PhoneNumber", str)
UserInfo = Tuple[PhoneNumber, EmailAddress]


class Database(Protocol):
    def persist(self, value: PhoneNumber | EmailAddress, table: str) -> None: ...


def persist_phone_number(phone: PhoneNumber, db: Database) -> None:
    db.persist(phone, table="phone_number")


def persist_email(email: EmailAddress, db: Database) -> None:
    db.persist(email, table="email_address")
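
As a quick sanity check of the runtime behavior, the following sketch shows that a value "wrapped" in a NewType is still an ordinary str. Only the PhoneNumber definition is taken from the code above; the rest is illustration.

```python
from typing import NewType

PhoneNumber = NewType("PhoneNumber", str)

phone = PhoneNumber("123-456-7890")

# At runtime, "wrapping" a value just hands back the same object.
assert type(phone) is str
assert phone == "123-456-7890"
# All the usual str methods are still available.
assert phone.replace("-", "") == "1234567890"
```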

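Using the same customer_info function, but with a UserInfo tuple built from our NewType types, a type checker (and any editor that runs one) now flags our previous code. After the incorrect unpacking, phone is inferred as an EmailAddress and email as a PhoneNumber, so both persist calls receive the wrong type.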
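
Here is a simplified sketch of that situation. The persist functions are stand-ins that drop the Database argument and return a table name so the example is self-contained; that simplification is my assumption, not the code from above. The exact error wording depends on your type checker (mypy, pyright, and so on).

```python
from typing import NewType, Tuple

EmailAddress = NewType("EmailAddress", str)
PhoneNumber = NewType("PhoneNumber", str)
UserInfo = Tuple[PhoneNumber, EmailAddress]


def persist_phone_number(phone: PhoneNumber) -> str:
    # Stand-in for the real persistence call; returns the target table name.
    return "phone_number"


def persist_email(email: EmailAddress) -> str:
    # Stand-in for the real persistence call; returns the target table name.
    return "email_address"


def customer_info(data: UserInfo) -> None:
    email, phone = data  # still unpacked in the wrong order

    # A type checker should reject both calls: `phone` is inferred as
    # EmailAddress and `email` as PhoneNumber.
    persist_phone_number(phone)
    persist_email(email)
```

Note that the function still runs without complaint; the mistake is only caught if you actually run a type checker over the code.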

Note that you cannot get this behavior with just type aliases!
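
A short sketch of the difference, using hypothetical EmailAlias and PhoneAlias names for the alias version. An alias is just another spelling of str, so the checker has nothing to tell apart:

```python
from typing import NewType

# Plain aliases: both names are just another spelling of str.
EmailAlias = str
PhoneAlias = str


def persist_phone_alias(phone: PhoneAlias) -> None: ...


email_alias: EmailAlias = "jane@example.com"
persist_phone_alias(email_alias)  # accepted: both aliases are str

# NewType: distinct types as far as the type checker is concerned.
EmailAddress = NewType("EmailAddress", str)
PhoneNumber = NewType("PhoneNumber", str)


def persist_phone(phone: PhoneNumber) -> None: ...


persist_phone(PhoneNumber("123-456-7890"))  # accepted
# persist_phone(EmailAddress("jane@example.com"))  # rejected by a type checker
```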

This is obviously a contrived example, but using NewType can really help you and the other developers reading and maintaining your code, or your users if you're developing a library! Remember that you really need to be validating your data somewhere along the line -- Python's types are not enforced, they are merely a tool for keeping developers from making simple mistakes.
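
One way to combine the two ideas is to validate once at the boundary and only hand out the NewType from there. The sketch below assumes a hypothetical parse_phone_number helper and a placeholder pattern that matches only the example format used in this post; real phone number validation is considerably messier.

```python
import re
from typing import NewType

PhoneNumber = NewType("PhoneNumber", str)

# Placeholder pattern matching the post's example format, e.g. 123-456-7890.
_PHONE_RE = re.compile(r"\d{3}-\d{3}-\d{4}")


def parse_phone_number(raw: str) -> PhoneNumber:
    """Validate once at the boundary, then hand out the narrower type."""
    if not _PHONE_RE.fullmatch(raw):
        raise ValueError(f"Not a phone number: {raw!r}")
    return PhoneNumber(raw)
```

Anything downstream that accepts a PhoneNumber can then assume the value already went through this function, as long as nobody calls PhoneNumber(...) directly on unvalidated input.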

My next post will go into more detail on using typing.Protocol. I'm going to try to keep these posts short and sweet for now -- there's tons more to say about NewType, but you can also just go RTFM!
