The Best Way To Test Pandas Code For Machine Learning


Testing your code is critical throughout the development life cycle of a software or machine learning system. Selecting and using the appropriate Python testing tools should therefore be an essential part of writing high-quality machine learning apps.

Unlike some other tech stacks, pandas does not have a strong testing culture, and many pandas developers never write tests. If you want to write production-grade code, however, skipping tests is not an option. Good software engineering practices result in more stable data pipelines.

This post explains how to test pandas code with the built-in test helper methods (pandas.testing) and with additional tools such as beavis and datatest, which give more readable error messages and advanced testing features.

If you are looking for more articles about pandas and how to build a nice GUI for it, read these:

Powerful Data Analysis And Manipulation Using Pandas Library In A Delphi Windows App

Build The Ultimate GUI For Pandas To Perform Complex Data Analysis

If you are looking for a way to test NumPy code, read this:

How To Add Python Testing Tools Into Machine Learning Code: Testing NumPy

For more advanced topics in Python, please read the following articles:

The Comprehensive Guide To Built-In Python Testing Tools

The Super Short Guide To Built-In Python Profiling Tools

How To Add Python Profiling Tools Into Machine Learning Code


Why is `pandas` testing important?

Testing pandas code matters because data analysis needs more than the software-correctness checks of test-driven development (TDD). Test-driven data analysis (TDDA) extends TDD with tests for the meaningfulness of the analysis, the correctness and validity of the input data, and the correctness of the interpretation [1].

When you’re ready to start writing production-grade code, you should seriously consider writing tests. Good software engineering practices result in more stable data pipelines [4].

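Does `pandas` have a built-in test helper?

pandas provides a built-in test helper module called pandas.testing.

The pandas testing API reference [2] lists the following groups of utilities. Only the assertion functions live in pandas.testing; the exceptions and warnings live in pandas.errors, and show_versions and test are top-level pandas functions: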

Assertion functions

For example:

testing.assert_frame_equal(left, right[, ...]),

testing.assert_series_equal(left, right[, ...]),

testing.assert_index_equal(left, right[, ...]), etc.

Exceptions and warnings

For example:

errors.AbstractMethodError(class_instance[, ...]),

errors.AccessorRegistrationWarning,

errors.AttributeConflictWarning, etc.

Bug report function

show_versions([as_json])

Test suite runner

test([extra_args])

What does the built-in `pandas.testing` do?

Using the built-in pandas.testing functions, we can write test cases more quickly [3].

Writing tests for end-to-end ML pipelines can be extremely beneficial in situations where [3]:

Certain assumptions about the data are made.
Certain assumptions are made regarding the outcome of a computation.

Unit testing can be used to ensure that the assumptions being made are still valid and that no side effects (change to the variable, environment, runtime, etc., regardless of whether it was intended or not) are being generated [3].

pandas provides several helpful functions that can make unit testing easier. These can be found in the pandas.testing module.
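As a minimal sketch of the idea above, the test below checks one assumption about the outcome of a computation and also confirms that the function has no side effect on its input. The add_total function and its column names are hypothetical, introduced only for illustration.

```python
import pandas as pd


def add_total(df):
    """Hypothetical transformation: return a copy with a 'total' column."""
    result = df.copy()
    result["total"] = result["price"] * result["quantity"]
    return result


def test_add_total():
    source = pd.DataFrame({"price": [2.0, 3.5], "quantity": [3, 2]})
    snapshot = source.copy()  # keep a copy so side effects can be detected

    result = add_total(source)

    # Assumption about the outcome: 'total' is price times quantity.
    expected = pd.Series([6.0, 7.0], name="total")
    pd.testing.assert_series_equal(result["total"], expected)

    # No side effects: the input DataFrame must be unchanged.
    pd.testing.assert_frame_equal(source, snapshot)


test_add_total()
```

If add_total wrote the new column directly into its argument, the second assertion would fail, because source would then have a column that snapshot lacks.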

`DataFrame` tests 1: Column equality assertions

Let’s write a function that adds a boolean column to a DataFrame. The new column is True where the string in the input column begins with the letter “s” and False otherwise:

import pandas as pd

def startswith_s(df, input_col, output_col):
    # Add the boolean column in place; the function itself returns nothing.
    df[output_col] = df[input_col].str.startswith("s")

df = pd.DataFrame({"col1": ["sap", "hi"], "col2": [3, 4]})

[Screenshot: the created DataFrame, shown in the PyScripter IDE]

Let’s write a unit test that runs the startswith_s function and checks that it returns the expected result. We’ll begin with the pd.testing.assert_series_equal method.

In simple terms, the test passes if the new column correctly records whether each word starts with “s”.

startswith_s(df, "col1", "col1_startswith_s")

expected = pd.Series([True, False], name="col1_startswith_s")

pd.testing.assert_series_equal(df["col1_startswith_s"], expected)

Here is the new DataFrame after the function call, with the added col1_startswith_s column:

[Screenshot: the resulting DataFrame]
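The expected Series in the test above carries the name col1_startswith_s for a reason: by default, assert_series_equal also compares the names of the two Series. The sketch below, which rebuilds the column inline rather than calling startswith_s, shows the effect of the check_names parameter.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["sap", "hi"]})
df["col1_startswith_s"] = df["col1"].str.startswith("s")

unnamed = pd.Series([True, False])  # same values, but no name

# With the default check_names=True, the differing names make the assertion fail.
try:
    pd.testing.assert_series_equal(df["col1_startswith_s"], unnamed)
    names_matched = True
except AssertionError:
    names_matched = False

print(names_matched)

# With check_names=False, the name is ignored and the comparison passes.
pd.testing.assert_series_equal(df["col1_startswith_s"], unnamed, check_names=False)
```

Keeping the default is usually the safer choice in a test, since a wrongly named output column is often a real bug.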

`DataFrame` tests 2: Checking entire `DataFrame` equality

You may want to compare the equality of two entire DataFrames rather than just individual columns.

Here’s how to use the pandas.testing.assert_frame_equal function to compare DataFrame equality:

import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

df2 = pd.DataFrame({'col1': [5, 2], 'col2': [3, 4]})

pd.testing.assert_frame_equal(df1, df2)

Because the first value of col1 differs between the two DataFrames (1 versus 5), assert_frame_equal raises an AssertionError.

[Screenshot: the assertion error output in the PyScripter IDE]
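Strict equality is not always what a test needs. As a hedged sketch, the example below uses two optional parameters of assert_frame_equal, check_like and check_dtype, to relax the comparison. The frames_equal helper is hypothetical and exists only to turn the assertion into a True/False result.

```python
import pandas as pd


def frames_equal(left, right, **kwargs):
    """Hypothetical helper: True if assert_frame_equal passes, else False."""
    try:
        pd.testing.assert_frame_equal(left, right, **kwargs)
        return True
    except AssertionError:
        return False


df1 = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
reordered = df1[["col2", "col1"]]    # same data, columns in a different order
as_float = df1.astype("float64")     # same values, stored as floats

# By default, column order and dtypes must both match.
print(frames_equal(df1, reordered))
print(frames_equal(df1, as_float))

# check_like=True ignores the order of the index and columns.
print(frames_equal(df1, reordered, check_like=True))

# check_dtype=False compares values without requiring identical dtypes.
print(frames_equal(df1, as_float, check_dtype=False))
```

The first two comparisons fail and the last two pass, so each option should be enabled only when the relaxed property really does not matter to the code under test.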

How to test `pandas` with `beavis`?

We will compare the testing results with pandas.testing vs beavis in this section.

beavis provides testing helper functions for pandas and Dask. The test helpers are inspired by chispa and spark-fast-tests, two popular Spark test helper libraries.

Why beavis? Because the beavis error messages are more descriptive compared to the built-in pandas errors.

How to install `beavis`?

You can install beavis by running the following pip command in your command prompt:

pip install beavis

`pandas.testing` vs `beavis`

`pandas.testing` error messages

The following code example compares two columns of a DataFrame using pandas.testing:

import pandas as pd

df = pd.DataFrame({"col1": [1041, 1, 8, 5], "col2": [6, 3, 8, 7]})

pd.testing.assert_series_equal(df["col1"], df["col2"])

Here is the built-in pandas.testing error message when comparing Series that are not equal:

[Screenshots: the pandas.testing error message in Spyder and in PyScripter]

Because the values of the two columns aren’t aligned row by row in the message, it’s difficult to tell which rows are mismatched.
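When the built-in message is hard to read, one workaround that needs only pandas is to filter the rows where the two columns disagree. This is a small sketch and not part of pandas.testing.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1041, 1, 8, 5], "col2": [6, 3, 8, 7]})

# Keep only the rows where the two columns disagree, shown side by side.
mismatched = df.loc[df["col1"] != df["col2"], ["col1", "col2"]]
print(mismatched)
```

The printed frame keeps the original index labels, so the mismatched rows can be traced back to the source data.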

`beavis` error messages

The following code example compares two columns of a DataFrame using beavis:

import beavis

import pandas as pd

df = pd.DataFrame({"col1": [1041, 3, 8, 5], "col2": [6, 3, 8, 7]})

beavis.assert_pd_column_equality(df, "col1", "col2")

The beavis error message, which aligns rows and highlights mismatches in red, is shown below.

[Screenshots: the beavis error message in the Spyder IDE and in the PyScripter IDE]

This descriptive error message makes debugging easier and helps you stay in your flow. Unfortunately, the colored output did not render correctly in either IDE when we ran it. In Spyder, the mismatch in the first row was not highlighted in red. In PyScripter, all of the output was colored red, and the matching values in the second and third rows were displayed strangely.

How to test `pandas` with `datatest`?

datatest is a test-driven data-wrangling and data-validation tool.

datatest also helps to speed up and formalize data-wrangling and data-validation tasks. It was designed to work with badly formatted data by detecting and describing validation failures.

datatest validates pandas objects (DataFrame, Series, and Index) in the same way that it validates built-in types.

This demo uses a DataFrame to load and inspect data from a CSV file (movies.csv). Judging by the column check in the tests, the file has four columns: title, rating, year, and runtime.

[Screenshot: preview of the movies.csv file]

How to install `datatest`?

You can install datatest by running the following pip command in your command prompt:

pip install datatest

Using `datatest` with `pytest-style` tests

The test_movies_df.py script demonstrates pytest-style tests:

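import pytest
import pandas as pd
import datatest as dt


@pytest.fixture(scope='module')
@dt.working_directory(__file__)
def df():
    # Load the CSV once per module, resolving the path relative to this file.
    return pd.read_csv('movies.csv')


@pytest.mark.mandatory
def test_columns(df):
    # The set of column names must match the required set.
    dt.validate(
        df.columns,
        {'title', 'rating', 'year', 'runtime'},
    )


def test_title(df):
    # Every title must begin with an upper-case letter.
    dt.validate.regex(df['title'], r'^[A-Z]')


def test_rating(df):
    # Rating values are checked against the set of permitted codes.
    dt.validate.superset(
        df['rating'],
        {'G', 'PG', 'PG-13', 'R', 'NC-17', 'Not Rated'},
    )


def test_year(df):
    # Year values must be integers.
    dt.validate(df['year'], int)


def test_runtime(df):
    # Runtime values must be integers.
    dt.validate(df['runtime'], int)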

Use the following command to run the tests:

pytest test_movies_df.py


Using `datatest` with `unittest-style` tests

The test_movies_df_unit.py script demonstrates unittest-style tests:

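import pandas as pd
import datatest as dt


def setUpModule():
    # Load the CSV once for the whole module, relative to this file.
    global df
    with dt.working_directory(__file__):
        df = pd.read_csv('movies.csv')


class TestMovies(dt.DataTestCase):
    @dt.mandatory
    def test_columns(self):
        # The set of column names must match the required set.
        self.assertValid(
            df.columns,
            {'title', 'rating', 'year', 'runtime'},
        )

    def test_title(self):
        # Every title must begin with an upper-case letter.
        self.assertValidRegex(df['title'], r'^[A-Z]')

    def test_rating(self):
        # Rating values are checked against the set of permitted codes.
        self.assertValidSuperset(
            df['rating'],
            {'G', 'PG', 'PG-13', 'R', 'NC-17', 'Not Rated'},
        )

    def test_year(self):
        # Year values must be integers.
        self.assertValid(df['year'], int)

    def test_runtime(self):
        # Runtime values must be integers.
        self.assertValid(df['runtime'], int)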

Use the following command to run the tests:

python -m datatest test_movies_df_unit.py


What did the above tests just do?

The tests above perform the following checks:

1. Define a test fixture

Create a test fixture that reads the CSV file and loads it into a DataFrame.

2. Check `column` names

Check whether the data includes the expected column names or not.

For this validation, the set of values in df.columns must match the required set. The df.columns attribute is an Index object, which datatest treats like any other sequence of values.

This test is designated as mandatory because it is a prerequisite that must be met before any of the other tests can pass. When a mandatory test fails, the test suite comes to a halt and no further tests are run.

3. Check ‘`title`’ values

Check if the values in the title column begin with an upper-case letter.

This validation checks that each value in df['title'] matches the regular expression ^[A-Z].

4. Check ‘`rating`’ values

Check that the values in the rating column correspond to one of the permitted codes.

This validation ensures that the values in df['rating'] are also present in the specified set.

5. Check ‘`year`’ and ‘`runtime`’ types

Check if the values in the year and runtime columns are integers.
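To make the intent of checks 2 to 5 concrete, here is a rough plain-pandas sketch of the same conditions as described above. It is not how datatest works internally, and it does not report the individual failing values the way datatest does. The sample rows and the failed_checks helper are made up for illustration; they are not the contents of movies.csv.

```python
import pandas as pd

REQUIRED_COLUMNS = {"title", "rating", "year", "runtime"}
ALLOWED_RATINGS = {"G", "PG", "PG-13", "R", "NC-17", "Not Rated"}


def failed_checks(df):
    """Hypothetical helper: return the names of the checks that fail."""
    # 2. Column names: treated like the mandatory test, so stop at once.
    if set(df.columns) != REQUIRED_COLUMNS:
        return ["columns"]
    failures = []
    # 3. Every title begins with an upper-case letter.
    if not df["title"].str.match(r"^[A-Z]").all():
        failures.append("title")
    # 4. Every rating is one of the permitted codes.
    if not df["rating"].isin(ALLOWED_RATINGS).all():
        failures.append("rating")
    # 5. 'year' and 'runtime' hold integers.
    for col in ("year", "runtime"):
        if not pd.api.types.is_integer_dtype(df[col]):
            failures.append(col)
    return failures


# Made-up sample rows standing in for the real file.
movies = pd.DataFrame({
    "title": ["Alpha", "Beta"],
    "rating": ["PG", "R"],
    "year": [1999, 2005],
    "runtime": [101, 95],
})

print(failed_checks(movies))
```

For the sample rows every check passes, so the helper returns an empty list. Changing a rating to a code outside the permitted set, or storing the years as floats, makes the corresponding check name appear in the result.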


References & further readings

[1] Radcliffe, N. J. (2018). Introduction to pandas, testing and test-driven data analysis. Europython 2018. Stochastic Solutions Limited & Department of Mathematics, University of Edinburgh.

[2] pandas Documentation. (2023). pandas Testing API References. pandas. PyData. NumFOCUS, Inc. pandas.pydata.org/docs/reference/testing.html

[3] Srinivas, R. (2022). Unit Tests with Pandas in Python. Python in Plain English. python.plainenglish.io/unit-tests-with-pandas-bf4c596baeda

[4] mrpowers. (2021). Testing Pandas Code. MungingData. mungingdata.com/pandas/unit-testing-pandas

[5] Datatest Documentation. (2014-2021). Datatest: Test driven data-wrangling and data validation. National Committee for an Effective Congress, et al. datatest.readthedocs.io/en/stable
