Bug Bytes Web

Pydantic - Nested Models and JSON Schemas

In this post, we'll learn how to implement nested models in Pydantic model classes, including how to validate the child models. We'll also learn how to generate JSON Schemas from our Pydantic models.

This post continues from the previous post, which can be found here.

Objectives

In this post, we will learn:

How to create nested Pydantic models, and form parent-child relationships between model classes.
How to use Literal types to constrain input values.
How to define default values for fields.
How to generate a JSON Schema from your Pydantic model definition.
How to use datamodel-code-generator to generate Pydantic models automatically from JSON Schema definitions.

Nested Pydantic model classes

The data we worked with in the previous post was simple, flat data, as seen here. However, real-world apps commonly need to deal with nested objects, each with its own schema and fields.

How do we use nested models in a pydantic context?

We have extended the dataset from the previous post to add the concept of nested modules to our student records. You can view the dataset, which we'll use in this post, here.

Each student has a set of modules that they are taking in the academic year, and these modules are their own objects with multiple fields.

In the previous post, we had the following pydantic model that described a hypothetical Student record in a college/university. Note that other code (such as the DepartmentEnum) can be found in the previous post:

import uuid
from datetime import date, datetime, timedelta
from pydantic import BaseModel, confloat, validator


class Student(BaseModel):
    id: uuid.UUID
    name: str
    date_of_birth: date
    GPA: confloat(ge=0, le=4)
    course: str | None
    department: DepartmentEnum
    fees_paid: bool

    @validator('date_of_birth')
    def ensure_16_or_over(cls, value):
        sixteen_years_ago = datetime.now() - timedelta(days=365*16)

        # convert datetime object -> date
        sixteen_years_ago = sixteen_years_ago.date()
        
        # raise error if DOB is more recent than 16 years past.
        if value > sixteen_years_ago:
            raise ValueError("Too young to enrol, sorry!")
        return value

We will now need to extend the model to specify the modules that the student is taking when we fetch the new dataset.

If we inspect the new dataset here, we can see that a typical module has the following structure:

{
    "id": 1,
    "name": "Data Science and Machine Learning",
    "professor": "Prof. Susan Love",
    "credits": 20,
    "registration_code": "abc"
}

Each student record has a list of three of these modules, if the student has chosen a course. There are a few quirks here that we're going to cover with tools provided by Pydantic:

The id field can be either an integer (as above), or a UUID.
The credits field can only have the possible values of 10 or 20.
If a student has not chosen a course, then the modules list will not exist in the data.
If a student has modules, there must be exactly 3 modules for the academic year.

Let's see how to define a nested model in pydantic. Firstly, we're going to create a model that defines the Module data, as below:

class Module(BaseModel):
    id: int
    name: str
    professor: str
    credits: int
    registration_code: str

Now, this isn't a complete model by any means, but let's roll with it for now.

We now need to link to this new model from the original Student model. This forms a parent-child relationship, with Student the parent and Module the child. A student HAS a set of modules that they're taking in the academic year.

We also need to ensure that the child field is optional, or has a sensible default value - remember, if a student does not have a course (which is also optional), then by definition they won't have a set of modules for that course yet.

It's very easy to do this in pydantic. We add our new field for the list of modules below:

class Student(BaseModel):
    id: uuid.UUID
    name: str
    date_of_birth: date
    GPA: confloat(ge=0, le=4)
    course: str | None
    department: DepartmentEnum
    fees_paid: bool
    modules: list[Module] = []

    @validator('date_of_birth')
    def ensure_16_or_over(cls, value):
        sixteen_years_ago = datetime.now() - timedelta(days=365*16)

        # convert datetime object -> date
        sixteen_years_ago = sixteen_years_ago.date()
        
        # raise error if DOB is more recent than 16 years past.
        if value > sixteen_years_ago:
            raise ValueError("Too young to enrol, sorry!")
        return value

On line 9, we have the new definition. We specify a field called modules, and set the type to list[Module] indicating that we expect a list of Module objects. This means that each module in the list will itself be validated and converted to the Module model that we defined earlier.

We see some new syntax on this line too - after the type definition, we give the field a default value of an empty list. Since this field is expecting a list of objects, indicating a lack of modules with an empty list rather than None may be a more logical choice here.

This pattern outlined above is quite powerful, as the validation and typing power that Pydantic gives us can very easily be applied to nested data structures.
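To make the parent-child relationship concrete without fetching the full dataset, here is a minimal sketch using cut-down versions of the two models. The field selection and the two sample records are invented for illustration; they are not taken from the dataset.

```python
from pydantic import BaseModel


class Module(BaseModel):
    id: int
    name: str


class Student(BaseModel):
    name: str
    modules: list[Module] = []  # used when "modules" is missing from the input


# Hypothetical records, invented for this illustration
enrolled = Student(**{"name": "Ada", "modules": [{"id": 1, "name": "Algebra"}]})
not_enrolled = Student(**{"name": "Bob"})

# Each dict in the list has been validated and converted to a Module object
print(type(enrolled.modules[0]).__name__)  # Module

# With no "modules" key in the input, the default empty list is used
print(not_enrolled.modules)  # []
```

The nested dictionaries never need to be converted by hand; declaring the field as list[Module] is enough for Pydantic to build the child objects.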

We can run this whole set of code on our new dataset with the following script:

import uuid
import requests
from datetime import date, datetime, timedelta
from pydantic import BaseModel, confloat, validator
from enum import Enum

# fetch the raw JSON data from Github
url = 'https://raw.githubusercontent.com/bugbytes-io/datasets/master/students_v2.json'
data = requests.get(url).json()


# define an Enum of acceptable Department values
class DepartmentEnum(Enum):
    ARTS_AND_HUMANITIES = 'Arts and Humanities'
    LIFE_SCIENCES = 'Life Sciences'
    SCIENCE_AND_ENGINEERING = 'Science and Engineering'


# Pydantic model to outline structure/types of Modules
class Module(BaseModel):
    id: int
    name: str
    professor: str
    credits: int
    registration_code: str


# Pydantic model to outline structure/types of Students (including nested model)
class Student(BaseModel):
    id: uuid.UUID
    name: str
    date_of_birth: date
    GPA: confloat(ge=0, le=4)
    course: str | None
    department: DepartmentEnum
    fees_paid: bool
    modules: list[Module] = [] 

    @validator('date_of_birth')
    def ensure_16_or_over(cls, value):
        sixteen_years_ago = datetime.now() - timedelta(days=365*16)

        # convert datetime object -> date
        sixteen_years_ago = sixteen_years_ago.date()
        
        # raise error if DOB is more recent than 16 years past.
        if value > sixteen_years_ago:
            raise ValueError("Too young to enrol, sorry!")
        return value


# Iterate over each student record
for student in data:
    # create Pydantic model object by unpacking key/val pairs from our JSON dict as arguments 
    model = Student(**student)
    print(model)

However, when we run this code, we'll find that it fails. Remember, the id field in our module data can also be a UUID string, rather than an integer.

Let's fix this and make the id field's type a union of the integer type and the UUID type.

class Module(BaseModel):
    id: uuid.UUID | int
    name: str
    professor: str
    credits: int
    registration_code: str

If we run this now, it should work! Inspect the generated models, and you should find that the Student model has a modules field that links to a list of validated Module objects.

It's very important to know the following, however - when Pydantic encounters a union as a type, it will attempt to cast data in the order defined in the union. So for the above, it will first attempt to convert the ID to a UUID, and if that fails, it will then try to cast to an integer.

Imagine we defined a union as follows:

from pydantic import BaseModel


class Number(BaseModel):
    value: int | float


# pass fields as keyword arguments (or unpack a dict with **)
number = Number(value=2.2)
print(number)  # value=2 (int is tried first in Pydantic v1)

Even though the value passed is the floating point number 2.2, Pydantic has first attempted to convert it to an integer (as int was defined first in the union), so the stored value is now equal to 2.

This is likely undesirable behaviour. Be careful with the order of union types - read more on this here.

So now we understand how to use nested models in our pydantic data models. Let's now see what additional validations we can put on this model.

Additional Module validations

There are two additional validations we want to add to our Module model, as noted earlier:

The credits field can only have the possible values of 10 or 20.
If a student has modules, there must only be 3 modules for the academic year.

Let's start with the credits field. We can make this a Literal type, as below:

from typing import Literal

class Module(BaseModel):
    id: uuid.UUID | int
    name: str
    professor: str
    credits: Literal[10,20]
    registration_code: str

We specify Literal[10, 20] as the type - this constrains the acceptable values to 10 and 20. With the Literal type, the field cannot take on any value other than those defined in the Literal.

If we run this, the script should work. However, try changing or removing one of the values in the Literal - this should result in a ValidationError.
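The same failure can be triggered directly, without editing the Literal. This is a minimal sketch with a hypothetical one-field model (ModuleCredits is a name made up here), passing a credits value of 15 that is not in the allowed set.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class ModuleCredits(BaseModel):  # hypothetical cut-down model
    credits: Literal[10, 20]


# 20 is one of the permitted values
print(ModuleCredits(credits=20))

# 15 is not listed in the Literal, so validation fails
try:
    ModuleCredits(credits=15)
except ValidationError as exc:
    print(exc)
```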

Now, let's cover the modules definition on the Student model. If this list is not empty, it should have length 3. Let's write a simple custom validator for this task, as below (lines 11-15):

class Student(BaseModel):
    id: uuid.UUID
    name: str
    date_of_birth: date
    GPA: confloat(ge=0, le=4)
    course: str | None
    department: DepartmentEnum
    fees_paid: bool
    modules: list[Module] = []

    @validator('modules')
    def validate_module_length(cls, value):
        if len(value) and len(value) != 3:
            raise ValueError('List of modules should have length 3')
        return value

    @validator('date_of_birth')
    def ensure_16_or_over(cls, value):
        sixteen_years_ago = datetime.now() - timedelta(days=365*16)

        # convert datetime object -> date
        sixteen_years_ago = sixteen_years_ago.date()
        
        # raise error if DOB is more recent than 16 years past.
        if value > sixteen_years_ago:
            raise ValueError("Too young to enrol, sorry!")
        return value

The validator method on lines 11-15 ensures that the length of the modules list is equal to three, if the list is not empty.
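To see the validator accept and reject input, here is a small sketch with stripped-down models and a hypothetical make_modules helper that builds made-up module data. It uses the same @validator decorator as the rest of this post; Pydantic v2 still supports it but deprecates it in favour of field_validator.

```python
from pydantic import BaseModel, ValidationError, validator


class Module(BaseModel):
    name: str


class Student(BaseModel):
    modules: list[Module] = []

    @validator('modules')
    def validate_module_length(cls, value):
        # an empty list is allowed; otherwise exactly 3 modules are required
        if len(value) and len(value) != 3:
            raise ValueError('List of modules should have length 3')
        return value


def make_modules(count):
    # hypothetical helper: builds `count` made-up module dicts
    return [{"name": f"Module {i}"} for i in range(count)]


for count in (0, 3, 2):
    try:
        Student(modules=make_modules(count))
        print(count, "modules accepted")
    except ValidationError:
        print(count, "modules rejected")
```

Only the list of two modules is rejected; the empty list and the list of three both pass.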

Generating JSON Schemas

With our models defined, Pydantic allows us to create JSON Schemas from these model definitions. It's very easy to do - we simply call the .schema() or .schema_json() class methods on the model class.

JSON Schema is "a JSON-based format for describing the structure of JSON data. JSON Schema asserts what a JSON document must look like, ways to extract information from it, and how to interact with it." (source here)

Let's generate a JSON schema for our Module model:

class Module(BaseModel):
    id: uuid.UUID | int
    name: str
    professor: str
    credits: Literal[10,20]
    registration_code: str

print(Module.schema_json(indent=2))

This outputs the following JSON:

{
  "title": "Module",
  "type": "object",
  "properties": {
    "id": {
      "title": "Id",
      "anyOf": [
        {
          "type": "string",
          "format": "uuid"
        },
        {
          "type": "integer"
        }
      ]
    },
    "name": {
      "title": "Name",
      "type": "string"
    },
    "professor": {
      "title": "Professor",
      "type": "string"
    },
    "credits": {
      "title": "Credits",
      "enum": [
        10,
        20
      ],
      "type": "integer"
    },
    "registration_code": {
      "title": "Registration Code",
      "type": "string"
    }
  },
  "required": [
    "id",
    "name",
    "professor",
    "credits",
    "registration_code"
  ]
}

This defines the fields that exist on the model, the required fields, the types and different formats (for example, UUID string format), and more.
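The schema is also available as a plain dictionary via .schema(), which makes it easy to inspect individual parts. The sketch below uses cut-down models (assumed here for brevity) to show that a field with a default value, such as modules, is left out of the "required" list. In Pydantic v2, .schema() still works but is deprecated in favour of model_json_schema().

```python
from typing import Literal

from pydantic import BaseModel


class Module(BaseModel):
    name: str
    credits: Literal[10, 20]


class Student(BaseModel):
    name: str
    modules: list[Module] = []


module_schema = Module.schema()
student_schema = Student.schema()

# The Literal values appear as an enum in the schema
print(module_schema["properties"]["credits"]["enum"])  # [10, 20]

# "modules" has a default, so only "name" is required
print(student_schema["required"])  # ['name']
```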

Furthermore, this machine-readable JSON schema allows other tools to generate code from the schema.

There are also tools that allow you to take JSON Schemas, and generate Pydantic models from the schema! One example is a fantastic tool called datamodel-code-generator, available here.

If you save the above JSON Schema to a file called schema.json, we can then use datamodel-code-generator to generate a Pydantic model from the schema.

Firstly, install the library with: pip install datamodel-code-generator

Then, run the following command:

datamodel-codegen --input schema.json --output models.py

This will generate a Pydantic model that conforms to your JSON schema definition within the output file!

This is extremely useful when you have a schema for an API, and you need to generate Pydantic models that conform to the definition. For example, you can view an example schema here - copy/paste this into the schema.json file and re-run the above command:

datamodel-codegen --input schema.json --output models.py

This will generate a Pydantic model for this schema that we've never seen before, which could then be used in your applications.

Very cool and very useful.

Summary

This post has taken our knowledge of Pydantic to a new level - crucially, we've learned to deal with nested objects, which occur frequently in real-life data.

We've also learned a bit about how we can generate JSON Schemas and take advantage of code-generation tools when working with Pydantic.

In the next post, we'll learn about Pydantic's Field() function, about advanced model exporting techniques, and about model Config classes which can be used to set model-wide configuration options.
