
How To Add Python Testing Tools Into Pandas Machine Learning Code


Testing your code is critical throughout the development life cycle of any software or machine learning system. Selecting and using appropriate Python testing tools should therefore be an essential part of writing high-quality machine learning apps.

Unlike many other tech stacks, pandas does not have a strong testing culture, and many pandas developers never write tests. If you want to write production-grade code, however, that habit has to change. Good software engineering practices result in more stable data pipelines.

This post explains how to test pandas code with the built-in test helper functions (pandas.testing) and with additional tools such as beavis and Datatest, which give more readable error messages and advanced testing features.

Why is pandas testing important?

Testing pandas code is important because data analysis can go wrong in ways that ordinary software tests do not catch. Test-driven data analysis (TDDA) extends TDD's idea of testing for software correctness to also cover the meaningfulness of the analysis, the correctness and validity of the input data, and the correctness of the interpretation [1].

When you’re ready to start writing production-grade code, you should seriously consider writing tests. Good software engineering practices result in more stable data pipelines [4].

Does pandas have a built-in test helper?

pandas provides a built-in test helper module called pandas.testing.

The pandas testing API reference [2] groups the available helpers into the following categories:

Assertion functions

For example: 

testing.assert_frame_equal(left, right[, ...]), testing.assert_series_equal(left, right[, ...]), testing.assert_index_equal(left, right[, ...]), etc.
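As a minimal sketch of how these assertion helpers behave (the sample index values are made up for illustration), each function returns nothing when its two arguments match and raises an AssertionError when they do not:

```python
import pandas as pd

# Illustrative data: two equal indexes and one that differs.
left = pd.Index(["a", "b", "c"])
same = pd.Index(["a", "b", "c"])
different = pd.Index(["a", "b", "z"])

# Passes silently because both indexes hold the same values in the same order.
pd.testing.assert_index_equal(left, same)

# A mismatch raises AssertionError, which test runners report as a failure.
try:
    pd.testing.assert_index_equal(left, different)
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True

print(mismatch_detected)
```

Because the helpers signal failure through AssertionError, they fit into pytest or unittest test functions without extra glue code.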

Exceptions and warnings

For example:

errors.AbstractMethodError(class_instance[, ...]), errors.AccessorRegistrationWarning, errors.AttributeConflictWarning, etc.

Bug report function

show_versions([as_json])

Test suite runner

What does the built-in pandas.testing do?

The built-in pandas.testing functions let us write test cases more quickly [3].

Writing tests for end-to-end ML pipelines can be extremely beneficial in situations where [3]:

  • Certain assumptions about the data are made.
  • Certain assumptions are made regarding the outcome of a computation.

Unit tests can verify that these assumptions still hold and that the code produces no side effects (changes to variables, the environment, the runtime, and so on, whether intended or not) [3].
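The two kinds of assumptions, plus the side-effect check, could be tested along these lines. This is only a sketch: add_total is a hypothetical pipeline step, and the price and quantity columns are invented for illustration.

```python
import pandas as pd


def add_total(df):
    """Hypothetical pipeline step: return a copy with a 'total' column."""
    result = df.copy()
    result["total"] = result["price"] * result["quantity"]
    return result


def test_add_total_has_no_side_effects():
    df = pd.DataFrame({"price": [2.0, 3.5], "quantity": [3, 2]})
    snapshot = df.copy(deep=True)  # remember the input as it was

    result = add_total(df)

    # Assumption about the outcome of the computation.
    expected = pd.Series([6.0, 7.0], name="total")
    pd.testing.assert_series_equal(result["total"], expected)

    # Assumption about side effects: the input must be unchanged.
    pd.testing.assert_frame_equal(df, snapshot)


test_add_total_has_no_side_effects()
```

Comparing the input against a deep copy taken before the call is a simple way to catch accidental in-place mutation.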

pandas provides several helpful functions that can make unit testing easier. These can be found in the pandas.testing module.

DataFrame tests 1: Column equality assertions

Let’s write a function, startswith_s, that adds a boolean column to a DataFrame. Each value in the new column is True if the corresponding string begins with the letter “s”.
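A possible version of this function is sketched below. The sample data and the column names col1, col2, and col1_startswith_s are assumptions chosen for illustration.

```python
import pandas as pd


def startswith_s(df, input_col, output_col):
    """Add a boolean column that is True where input_col starts with 's'."""
    df[output_col] = df[input_col].str.startswith("s")


# Sample data assumed for illustration.
df = pd.DataFrame({"col1": ["sap", "hi"], "col2": [3, 4]})
startswith_s(df, "col1", "col1_startswith_s")
print(df)
```

Note that this sketch modifies the DataFrame in place rather than returning a new one.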

[Screenshot: the created DataFrame in the PyScripter IDE]

Let’s write a unit test that runs the startswith_s function and checks that it returns the expected result. We’ll begin with the pd.testing.assert_series_equal method.
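Such a test might look like the following sketch. The function is repeated so that the example runs on its own, and the sample data is again an assumption.

```python
import pandas as pd


def startswith_s(df, input_col, output_col):
    """Add a boolean column that is True where input_col starts with 's'."""
    df[output_col] = df[input_col].str.startswith("s")


def test_startswith_s():
    df = pd.DataFrame({"col1": ["sap", "hi"], "col2": [3, 4]})
    startswith_s(df, "col1", "col1_startswith_s")
    # The expected Series must carry the same name as the new column,
    # because assert_series_equal compares names by default.
    expected = pd.Series([True, False], name="col1_startswith_s")
    pd.testing.assert_series_equal(df["col1_startswith_s"], expected)


test_startswith_s()
```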

In simple terms, the test passes only if the new column correctly records whether each word starts with “s”.

[Screenshot: the resulting DataFrame after the test run]

DataFrame tests 2: Checking entire DataFrame equality

You may want to compare the equality of two entire DataFrames rather than just individual columns.

Here’s how to use the pandas.testing.assert_frame_equal function to compare DataFrame equality:
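A minimal sketch with made-up data follows. The second comparison uses the check_dtype parameter of assert_frame_equal to show that dtypes are compared by default.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
df2 = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Passes: same values, column names, index, and dtypes.
pd.testing.assert_frame_equal(df1, df2)

# By default dtypes must match too; check_dtype=False relaxes that,
# so an integer frame can be compared with its float counterpart.
df3 = df2.astype("float64")
pd.testing.assert_frame_equal(df1, df3, check_dtype=False)
```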

[Screenshot: the test output in the PyScripter IDE]

How to test pandas with beavis?

We will compare the testing results with pandas.testing vs beavis in this section. 

beavis is a library of test helper functions for pandas and Dask. The test helpers are inspired by chispa and spark-fast-tests, two popular Spark test helper libraries.

Why beavis? Because its error messages are more descriptive than the built-in pandas errors.

How to install beavis?

You can install beavis by running pip install beavis at your command prompt.

pandas.testing vs beavis

pandas.testing error messages

The following is the code example to compare two data frames using pandas.testing:
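As a sketch of such a comparison, the example below compares two columns of one DataFrame and captures the resulting error message. The data is illustrative, with only the first row differing.

```python
import pandas as pd

# Illustrative data: only the first row differs between the two columns.
df = pd.DataFrame({"col1": [1042, 2, 9], "col2": [5, 2, 9]})

try:
    # check_names=False so that only the values are compared,
    # not the Series names ("col1" vs "col2").
    pd.testing.assert_series_equal(df["col1"], df["col2"], check_names=False)
    error_message = None
except AssertionError as exc:
    error_message = str(exc)

print(error_message)
```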

Here is the built-in pandas.testing error message when comparing series that are not equal.

[Screenshots: the pandas.testing error message in the Spyder and PyScripter IDEs]

Because the values of the two columns aren’t aligned side by side, it’s difficult to tell which ones are mismatched.

beavis error messages

The following is the code example to compare two data frames using beavis:
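The sketch below assumes that beavis is installed and exposes assert_pd_column_equality(df, col_a, col_b) as described in its README. The import is guarded so that the script still runs without the library, and the exact exception class beavis raises is not assumed. The helper beavis_column_report and the sample data are hypothetical.

```python
import pandas as pd

try:
    import beavis  # third-party package; assumed to be installed via pip
except ImportError:
    beavis = None


def beavis_column_report(df, col_a, col_b):
    """Return beavis's mismatch message, or None if the columns match
    (or if beavis is not installed)."""
    if beavis is None:
        return None
    try:
        beavis.assert_pd_column_equality(df, col_a, col_b)
    except Exception as exc:  # exact exception class is not assumed here
        return str(exc)
    return None


# Illustrative data: only the first row differs between the two columns.
df = pd.DataFrame({"col1": [1042, 2, 9], "col2": [5, 2, 9]})
report = beavis_column_report(df, "col1", "col2")
if report is not None:
    print(report)
```

In a real test you would call the beavis assertion directly and let the test runner display the failure.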

The beavis error message, which aligns rows and highlights mismatches in red, is shown below.

[Screenshots: the beavis error message in the Spyder and PyScripter IDEs]

This descriptive error message makes debugging easier and helps you stay in flow. Unfortunately, the colored output did not render correctly in either IDE, possibly because of bugs in how Spyder and PyScripter display it. In Spyder, the mismatched first row was not highlighted in red. In PyScripter, all of the output was colored red, and the matching second and third rows showed strange output.

How to test pandas with Datatest?

Datatest is a test-driven data-wrangling and data-validation tool. 

Datatest also helps to speed up and formalize data-wrangling and data-validation tasks. It was designed to work with badly formatted data by detecting and describing validation failures.

Datatest validates pandas objects (DataFrame, Series, and Index) in the same way that it validates built-in types.

This demo uses a DataFrame to load and inspect data from a CSV file (movies.csv). The CSV file uses the following format:

[Screenshot: sample rows of movies.csv]

How to install Datatest?

You can install Datatest by running pip install datatest at your command prompt.

Using Datatest with pytest-style tests

The test_movies_df.py script demonstrates pytest-style tests:

Use the following command to run the tests:

[Screenshot: output of the pytest-style test run]

Using Datatest with unittest-style tests

The test_movies_df_unit.py script demonstrates unittest-style tests:

Use the following command to run the tests:

[Screenshot: output of the unittest-style test run]

What did the above tests just do?

The tests above perform the following checks:

1. Define a test fixture

Create a test fixture that reads the CSV file and loads it into a DataFrame.

2. Check column names

Check whether the data includes the expected column names or not.

The set of values in df.columns must match the required set for this validation. Datatest treats the df.columns attribute as an Index object, just like any other sequence of values.

This test is designated as mandatory because it is a prerequisite that must be met before any of the other tests can be passed. When a mandatory test fails, the test suite comes to a halt and no further tests are run.

3. Check ‘title’ values

Check if the values in the title column begin with an upper-case letter.

This validation checks that each value in df['title'] matches the regular expression ^[A-Z].

4. Check ‘rating’ values

Check that the values in the rating column correspond to one of the permitted codes.

This validation ensures that the values in df['rating'] are also present in the specified set.

5. Check ‘year’ and ‘runtime’ types

Check if the values in the year and runtime columns are integers.
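To make the five steps concrete, here is a rough equivalent written with plain pandas and assert statements. It does not use Datatest's API and does not reproduce its detailed failure reports. The CSV rows, the required column set, and the permitted rating codes are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for movies.csv; rows are invented for illustration.
CSV_TEXT = """title,rating,year,runtime
Alpha,PG,1999,101
Bravo,R,2005,95
Charlie,PG-13,2012,120
"""


def load_movies():
    # 1. Fixture: read the CSV data into a DataFrame.
    return pd.read_csv(io.StringIO(CSV_TEXT))


def check_movies(df):
    # 2. Column names must match the required set (assumed here).
    assert set(df.columns) == {"title", "rating", "year", "runtime"}
    # 3. Every title starts with an upper-case letter.
    assert df["title"].str.match(r"^[A-Z]").all()
    # 4. Every rating is one of the permitted codes (assumed set).
    assert df["rating"].isin({"G", "PG", "PG-13", "R", "NC-17"}).all()
    # 5. The year and runtime columns hold integers.
    assert pd.api.types.is_integer_dtype(df["year"])
    assert pd.api.types.is_integer_dtype(df["runtime"])


movies = load_movies()
check_movies(movies)
```

The column check runs first for the same reason the Datatest version marks it as mandatory: the later checks cannot work if an expected column is missing.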

References & further readings

[1] Radcliffe, N. J. (2018). Introduction to pandas, testing and test-driven data analysis. Europython 2018. Stochastic Solutions Limited & Department of Mathematics, University of Edinburgh. 

[2] pandas Testing API References. PyData. https://pandas.pydata.org/docs/reference/testing.html

[3] Unit Tests with Pandas in Python. Python in Plain English. https://python.plainenglish.io/unit-tests-with-pandas-bf4c596baeda

[4] Testing Pandas Code. MungingData. https://mungingdata.com/pandas/unit-testing-pandas/

[5] Datatest: Test driven data-wrangling and data validation. datatest. https://datatest.readthedocs.io/en/stable/
