
Modelling Results with Python Dataclasses

In this lesson, we will create Python classes that model the Euro 2020 football data that we scraped from the BBC website in the previous lesson.

We will create a Result class that models an individual result in the tournament, with attributes for each team and the goals each team scored in the match. We will create methods on that class to determine who won a match, lost a match, whether the result was a draw, and more.

We will use type-annotated Python dataclasses to model the data. We will also look at some "dunder" methods from Python's data model for modelling certain features in an intuitive, Pythonic way.

This post has an associated video on our YouTube channel - linked below.

There is also a Jupyter Notebook on GitHub, which you can check out here.

Objectives

By the end of this post, you should:

Understand how to use classes in Python to model data.
Understand how to use the @dataclass decorator from Python's dataclasses module (added in Python 3.7).
Understand how to use the attributes and methods on the Result class to gain interesting insights into tournament data.

Data Modelling

In the last post, we had the following code, used for gathering Euro 2020 results.

import time
from bs4 import BeautifulSoup
import pandas as pd
import requests

base_url = 'https://www.bbc.co.uk/sport/football/european-championship/scores-fixtures'
start_date = '2021-06-11'
end_date = '2021-07-11'
KNOCKOUT_GAMES_START = pd.Timestamp('2021-06-26')

# generate tournament dates and URLs
tournament_dates = pd.date_range(start_date, end_date)
urls = [f"{base_url}/{dt.date()}" for dt in tournament_dates]

# container to store results
results = []

def show_result(home, home_goals, away, away_goals, pens=None) -> str:
  """ Stringifies a result from the scraped data """
  if pens:
    return f"{home} {home_goals} - {away_goals} {away} ({pens})"  
  return f"{home} {home_goals} - {away_goals} {away}"


for url in urls:
  response = requests.get(url)
  # time.sleep(1)
  
  soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly to avoid bs4's warning

  # get all fixtures on the page
  fixtures = soup.find_all('article', {'class': 'sp-c-fixture'})

  for fixture in fixtures:
    home = fixture.select_one('.sp-c-fixture__team--home .sp-c-fixture__team-name-trunc').text
    away = fixture.select_one('.sp-c-fixture__team--away .sp-c-fixture__team-name-trunc').text
    home_goals = fixture.select_one('.sp-c-fixture__number--home').text
    away_goals = fixture.select_one('.sp-c-fixture__number--away').text

    game_date = pd.Timestamp(url.split("/")[-1])
    if game_date >= KNOCKOUT_GAMES_START:
      pens = fixture.select_one('.sp-c-fixture__win-message')
      if pens is not None:
        results.append(show_result(home, home_goals, away, away_goals, pens.text))
        continue
    
    results.append(show_result(home, home_goals, away, away_goals))

We now want to create a Result class that models this data in a better, more extensible and flexible manner. Currently, we only have strings representing each result.

Let's use Python's dataclasses to create a class container for results. The class will have the following attributes:

home: the home team
away: the away team
home_goals: the number of goals the home team scored
away_goals: the number of goals the away team scored
penalty_winner: the winner on penalties. Specified as Optional[str] with a default of None because matches might not go to penalties.
penalty_score: the penalty score. Specified as Optional[str] with a default of None because matches might not go to penalties.

We will also define some helpful methods that encapsulate some functionality that we are interested in. Some methods we will implement include:

is_draw(): was the result a draw?
winner(): returns the winner, or None if the result was a draw.
loser(): returns the loser, or None if the result was a draw.
goals_scored(): returns the total number of goals scored in the match
goal_difference(): returns the absolute difference between the two teams' goal counts. Example: for both 4-1 and 1-4, the difference is 3.

Let's get started and create a class now!

from dataclasses import dataclass
from typing import Optional

@dataclass
class Result:
  home: str
  away: str
  home_goals: int
  away_goals: int
  penalty_winner: Optional[str] = None # Example: "Italy"
  penalty_score: Optional[str]  = None # Example: "5-4"

  def is_draw(self) -> bool:
    score_draw = self.home_goals == self.away_goals
    no_shootout = self.penalty_winner is None  # no penalties, so a level score stands
    return score_draw and no_shootout

  def winner(self) -> Optional[str]:
    if self.is_draw(): return None

    if self.home_goals > self.away_goals:
      return self.home
    elif self.away_goals > self.home_goals:
      return self.away
    else:
      return self.penalty_winner

  def loser(self) -> Optional[str]:
    if self.is_draw(): return None

    if self.home_goals < self.away_goals:
      return self.home
    elif self.away_goals < self.home_goals:
      return self.away
    else:
      return self.home if self.penalty_winner == self.away else self.away

  def goals_scored(self) -> int:
    return self.home_goals + self.away_goals

  def goal_difference(self) -> int:
    return abs(self.home_goals - self.away_goals)

  def __contains__(self, team: str) -> bool:
    # enables the syntax: "Italy" in result
    return team in (self.home, self.away)

  def __str__(self) -> str:
    return f"{self.home} {self.home_goals}-{self.away_goals} {self.away}"

This code is explained more fully in our video on YouTube (watch from the 4 minute mark for the @dataclass implementation).

Note the @dataclass decorator on line 4, which lets us declare the class's fields on lines 6-11 with type hints, rather than writing a cumbersome __init__() constructor with lots of parameters.
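
To make that saving concrete, here is a rough sketch comparing a hand-written class with its @dataclass equivalent. The class names ManualScore and Score are hypothetical, chosen only for this illustration, and they carry just four of the fields. The point is that the decorator generates __init__(), __repr__() and __eq__() from the annotated fields.

```python
from dataclasses import dataclass


class ManualScore:
  """Hand-written version (hypothetical name, for illustration only)."""

  def __init__(self, home: str, away: str, home_goals: int, away_goals: int):
    self.home = home
    self.away = away
    self.home_goals = home_goals
    self.away_goals = away_goals

  def __repr__(self) -> str:
    return (f"ManualScore(home={self.home!r}, away={self.away!r}, "
            f"home_goals={self.home_goals!r}, away_goals={self.away_goals!r})")

  def __eq__(self, other):
    if not isinstance(other, ManualScore):
      return NotImplemented
    return (self.home, self.away, self.home_goals, self.away_goals) == (
        other.home, other.away, other.home_goals, other.away_goals)


@dataclass
class Score:
  """Same fields; __init__, __repr__ and __eq__ are generated for us."""
  home: str
  away: str
  home_goals: int
  away_goals: int


a = Score("Turkey", "Italy", 0, 3)
b = Score("Turkey", "Italy", 0, 3)
print(a)       # Score(home='Turkey', away='Italy', home_goals=0, away_goals=3)
print(a == b)  # True: the generated __eq__ compares field values
```

The generated __repr__() is also what produces the readable Result(...) output when we inspect a list of results in a notebook.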

Next, we will modify the loop from the previous post to store each result as a Result instance rather than a string. But first, we need to extract the penalty data.

Penalty Data

We need to figure out how to extract both the winning team name, and the scores, for matches that were decided by penalties. When a knockout match goes to penalties, the winning team is displayed with the following message:

TEAM_NAME win 5-4 on penalties

We need to extract both the TEAM_NAME and the score from this expression.

For the team name, we can simply split on the space and take the first element of the returned list. For the score, we will define a regular expression that finds the following pattern: one or more digits, followed by a hyphen, followed by one or more digits. The regex pattern for this is: \d+-\d+.

\d in a regular expression represents any digit, and the + symbol means "match one or more". So \d+ means "match one or more digits".

Let's write the code to extract this information below, using an example string.

import re

msg = "Italy win 5-4 on penalties"

winner = msg.split(" ")[0]
print(winner)

score = re.search(r"\d+-\d+", msg).group()  # raw string, so \d is not treated as a string escape
print(score)

This outputs Italy as the winner, and 5-4 as the score. We use the re module to search for the pattern specified for the score.
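
One limitation of splitting on spaces is that a multi-word team name such as "North Macedonia" would be cut down to "North". The sketch below is one possible alternative, assuming the message always has the shape "TEAM_NAME win X-Y on penalties". The function name parse_penalty_message and the PENALTY_PATTERN constant are hypothetical helpers, not part of the original code, and the second and third example messages are made up for illustration.

```python
import re
from typing import Optional, Tuple

# Assumed message shape: "<TEAM NAME> win <X>-<Y> on penalties"
PENALTY_PATTERN = re.compile(r"^(?P<team>.+?) win (?P<score>\d+-\d+) on penalties$")


def parse_penalty_message(msg: str) -> Optional[Tuple[str, str]]:
  """Return (winner, score), or None if the message has a different shape."""
  match = PENALTY_PATTERN.match(msg.strip())
  if match is None:
    return None
  return match.group("team"), match.group("score")


print(parse_penalty_message("Italy win 5-4 on penalties"))
# ('Italy', '5-4')
print(parse_penalty_message("North Macedonia win 5-4 on penalties"))  # made-up message
# ('North Macedonia', '5-4')
print(parse_penalty_message("Match postponed"))  # made-up message
# None
```

Returning None for an unexpected message also avoids the AttributeError that re.search(...).group() raises when the pattern is not found.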

We can now add this code to the loop that collects the data. For knockout games whose markup contains an element with the .sp-c-fixture__win-message class, we extract the penalty winner and score, so that every match in the tournament can be stored as a Result object with the correct data.

The modified loop is shown below.

results = []

for url in urls:
  response = requests.get(url)
  time.sleep(1)
  
  soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly to avoid bs4's warning

  # get all fixtures on the page
  fixtures = soup.find_all('article', {'class': 'sp-c-fixture'})

  for fixture in fixtures:
    home = fixture.select_one('.sp-c-fixture__team--home .sp-c-fixture__team-name-trunc').text
    away = fixture.select_one('.sp-c-fixture__team--away .sp-c-fixture__team-name-trunc').text
    home_goals = fixture.select_one('.sp-c-fixture__number--home').text
    away_goals = fixture.select_one('.sp-c-fixture__number--away').text

    game_date = pd.Timestamp(url.split("/")[-1])
    if game_date >= KNOCKOUT_GAMES_START:
      pens = fixture.select_one('.sp-c-fixture__win-message')
      if pens is not None:

        # extract penalty winner from string:
        # TEAM_NAME win 5-4 on penalties
        pen_winner = pens.text.split(" ")[0]

        # get the score using a regular expression
        pen_score = re.search(r"\d+-\d+", pens.text).group()

        results.append(Result(
            home, 
            away, 
            int(home_goals), 
            int(away_goals),
            penalty_winner=pen_winner,
            penalty_score=pen_score)
        )
        continue
    
    results.append(Result(
        home,
        away,
        int(home_goals),
        int(away_goals))
    )

This code is explained in greater detail in the video for this post. Importantly, we extract the penalty winner on line 25, and the score on line 28. We then instantiate Result objects using the data collected by BeautifulSoup, allowing us to model the results as classes.

If we inspect the first 5 elements of the results list, we see objects like the following.

[Result(home='Turkey', away='Italy', home_goals=0, away_goals=3, penalty_winner=None, penalty_score=None),
 Result(home='Wales', away='Switzerland', home_goals=1, away_goals=1, penalty_winner=None, penalty_score=None),
 Result(home='Denmark', away='Finland', home_goals=0, away_goals=1, penalty_winner=None, penalty_score=None),
 Result(home='Belgium', away='Russia', home_goals=3, away_goals=0, penalty_winner=None, penalty_score=None),
 Result(home='Austria', away='North Macedonia', home_goals=3, away_goals=1, penalty_winner=None, penalty_score=None)]

We have objects that encapsulate the results! And we can now use our new object-oriented approach to analyze the data a bit more naturally.
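
Before moving on to the whole dataset, here is a small sketch of how the methods behave on individual matches, including one decided by a shoot-out. The block repeats a condensed copy of the Result class so that it runs on its own, and the three Result instances are typed in by hand for illustration rather than scraped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
  # condensed copy of the class defined earlier, so this block runs on its own
  home: str
  away: str
  home_goals: int
  away_goals: int
  penalty_winner: Optional[str] = None
  penalty_score: Optional[str] = None

  def is_draw(self) -> bool:
    return self.home_goals == self.away_goals and self.penalty_winner is None

  def winner(self) -> Optional[str]:
    if self.is_draw():
      return None
    if self.home_goals > self.away_goals:
      return self.home
    if self.away_goals > self.home_goals:
      return self.away
    return self.penalty_winner

  def __contains__(self, team: str) -> bool:
    return team in (self.home, self.away)

  def __str__(self) -> str:
    return f"{self.home} {self.home_goals}-{self.away_goals} {self.away}"


# hand-typed sample data, not scraped
sample = [
  Result("Wales", "Switzerland", 1, 1),
  Result("Turkey", "Italy", 0, 3),
  Result("Italy", "England", 1, 1, penalty_winner="Italy", penalty_score="3-2"),
]

final = sample[-1]
print(final.is_draw())  # False: level on goals, but decided on penalties
print(final.winner())   # Italy

# __contains__ lets us write `team in result`
italy_games = [str(r) for r in sample if "Italy" in r]
print(italy_games)      # ['Turkey 0-3 Italy', 'Italy 1-1 England']
```

The `in` check is where the __contains__() dunder method pays off: filtering for one team's matches reads almost like plain English.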

Let's look at the winners - we'll use a list comprehension to call the .winner() method for each of our Result objects that did not end in a draw.

winners = [r.winner() for r in results if not r.is_draw()]
winners

This outputs the winners of all the matches at the Euros (we filter out the draws in the list comprehension). The first 5 winners are shown below:

['Italy',
 'Finland',
 'Belgium',
 'Austria',
 'Netherlands']

We can now perform basic analytics on this data. For example, we can count up the results of calling .winner() and .loser() to find out who won/lost the most matches in the tournament.

We'll use collections.Counter to take care of counting.

# Find the 10 teams that won the most matches in the tournament
from collections import Counter

Counter(winners).most_common(10)

This outputs the 10 teams who won the most games, along with the number of games they won:

[('Italy', 7),
 ('England', 5),
 ('Belgium', 4),
 ('Netherlands', 3),
 ('Denmark', 3),
 ('Spain', 3),
 ('Austria', 2),
 ('Czech Rep', 2),
 ('Ukraine', 2),
 ('Sweden', 2)]
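
Several teams in this list are tied on the same number of wins. Counter.most_common() lists elements with equal counts in the order they were first encountered, so the order within a tie reflects when each team first appeared in the winners list, not any further ranking. A tiny sketch with made-up data shows the behaviour:

```python
from collections import Counter

# made-up data: "B" and "A" both appear twice, and "B" is seen first
wins = ["B", "A", "B", "A", "C"]

top_two = Counter(wins).most_common(2)
print(top_two)  # [('B', 2), ('A', 2)]
```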

We can do the same for the losers - let's see the 5 teams who lost the most games.

# Find the 5 teams that lost the most matches in the tournament
losers = [r.loser() for r in results if not r.is_draw()]
Counter(losers).most_common(5)

This outputs the following teams, along with how many matches they lost.

[('Turkey', 3),
 ('Denmark', 3),
 ('North Macedonia', 3),
 ('Ukraine', 3),
 ('Russia', 2)]

Let's look at more facts about the tournament. For example, let's find out how many draws there were and how many goals were scored. We can use our class's .is_draw() and .goals_scored() methods for this, and use the sum() builtin to add up these values across the entire dataset. Summing the .is_draw() results works because Python treats True as 1 and False as 0.

num_draws = sum([r.is_draw() for r in results])
print(f"{num_draws} draws in the tournament")

total_goals = sum([r.goals_scored() for r in results])
print(f"{total_goals} goals scored in the tournament")

This shows that there were 8 draws in the tournament, and 142 goals scored.

Finally, we can check what the biggest margin of victory was in the tournament, using the max() builtin along with our class's .goal_difference() method.

# find the maximum goal difference
max_goal_diff = max([r.goal_difference() for r in results])
print(f"Biggest margin of victory: {max_goal_diff} goals")

# which result(s) does this correspond to?
biggest_victories = [str(r) for r in results if r.goal_difference() == max_goal_diff]
print(biggest_victories)

We see that the biggest margin of victory was by 5 goals, and that the only result that had such a large difference was the Slovakia 0-5 Spain result.
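
As an aside, max() also accepts a key function, which returns the Result object itself rather than just the number. The sketch below uses a trimmed-down stand-in for the Result class and three hand-typed matches, so that it runs on its own; it is one possible variation, not the approach used above.

```python
from dataclasses import dataclass


@dataclass
class Result:
  # trimmed-down stand-in for the full Result class, so this block runs on its own
  home: str
  away: str
  home_goals: int
  away_goals: int

  def goal_difference(self) -> int:
    return abs(self.home_goals - self.away_goals)

  def __str__(self) -> str:
    return f"{self.home} {self.home_goals}-{self.away_goals} {self.away}"


# hand-typed sample data, not scraped
sample = [
  Result("Turkey", "Italy", 0, 3),
  Result("Slovakia", "Spain", 0, 5),
  Result("Wales", "Switzerland", 1, 1),
]

# key= tells max() to compare results by their goal difference
biggest = max(sample, key=Result.goal_difference)
print(biggest)  # Slovakia 0-5 Spain
```

One trade-off: with a key function, max() returns only the first result it finds with the largest margin, whereas the list comprehension above returns every result that shares it.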

Summary

In this post, we have modelled our data using classes and an object-oriented approach, using Python's new @dataclass decorator and type-hinting to simplify this process.

In the next post, we'll create one more class to model team-specific statistics in the tournament, such as number of wins, number of goals scored/conceded, and more.
