Freelancing Data Science/AI Project-6

Pre-Commit Hooks for Data Science Projects: Boost Code Quality in Your Repositories

Format Jupyter Notebook and Python Files with a Single Pre-Commit Configuration

Introduction

Have you ever inherited a data science project with poorly written code that is difficult to understand and maintain, even for the last person who knew the code a month ago? What if there was a way to prevent this from happening in the first place?

As the Mistral Large model mentions at Le Chat [1], ‘In summary, writing clean code is an important part of being a professional software developer. It can save time and effort in the long run, make the code easier to work with, and lead to better software’. I will not go further into why we need to pay attention to code quality instead of focusing only on writing working code. It is worth noting that your code will be a signature or reference for others who may maintain your work in the future, especially when the system encounters issues.

When the subject of maintaining code quality arises, pre-commit hooks come to the rescue. A pre-commit hook is a script that helps developers automatically maintain code standards according to predefined criteria. Whenever anyone creates a commit, these scripts run before the commit is recorded and can enforce the code standards, check for syntax errors, or apply several other checks. They also keep the code consistent across contributors.

I have noticed that the code quality of projects with Python files tends to be better than that of projects with Jupyter Notebook files. This is likely because Jupyter Notebooks are often used for exploratory analysis and model building, rather than production-ready code. As a result, they may not be held to the same coding standards as Python files.

In this post, I will show you how to set up a pre-commit configuration that can format both Python and Jupyter Notebook files to maintain code quality across different file types. By following these steps, you can ensure that your code meets certain standards and is consistent and easy to maintain, regardless of whether you work with Python files or Jupyter Notebooks.

Hooks

Several pre-commit hooks can improve code quality and help maintain it over time. I will discuss five of them: Black, isort, Pyupgrade, Flake8, and Mypy.

Black

Black is a code formatter [2] that conforms the code to the PEP8 style guide [3]. It reformats the code to meet line length, indentation, spacing, and other formatting rules without any user intervention. Black is widely used because of its unified style, which saves the time otherwise spent on discussing code style.

# before
def print_your_name ( name, surname ): print(f'{name} {surname}')

# after
def print_your_name(name, surname):
    print(f"{name} {surname}")

isort

isort [4] automatically sorts the imports into sections (e.g. standard library imports, third-party imports, local application imports) and alphabetizes them within each section.

# before
import pandas as pd

import unittest
import numpy as np



import scipy

# after
import unittest

import numpy as np
import pandas as pd
import scipy

Pyupgrade

Pyupgrade [5] upgrades Python syntax to use newer language features and APIs. The minimum Python version to target can be set in the pre-commit hook (for example, with the --py39-plus argument).

# before
oasis = set(('liam','noel'))

# after
oasis = {'liam','noel'}
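
As a rough illustration of how such a rewrite candidate can be spotted, the sketch below uses the standard `ast` module to find `set(...)` calls whose single argument is a tuple or list literal. This is an assumption-laden toy, not Pyupgrade's own code, and `find_set_calls` is a hypothetical name.

```python
import ast


def find_set_calls(source):
    """Return line numbers of `set(...)` calls whose only argument is a tuple or list literal."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "set"
            and len(node.args) == 1
            and not node.keywords
            and isinstance(node.args[0], (ast.Tuple, ast.List))
        ):
            hits.append(node.lineno)
    return sorted(hits)


code = "oasis = set(('liam', 'noel'))\nblur = {'damon', 'graham'}\n"
print(find_set_calls(code))  # [1]
```

Only the first line is reported, because the second one already uses a set literal.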

Flake8

Flake8 [6] checks the style and quality of the code and then reports any issues it finds in each file. It mainly focuses on syntax errors, style violations, and complexity issues.

import transformers
def do_something_else(numbers):
    total = 0

    for number in numbers:
        total *= 2
    return total
# after running flake8, you will see these warnings:
test.py:1:1: F401 'transformers' imported but unused
test.py:2:1: E302 expected 2 blank lines, found 0
test.py:7:17: W292 no newline at end of file
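
To give a feel for what a check like F401 involves, here is a small sketch that uses the standard `ast` module to list names that are imported but never referenced. It is a simplification and not Flake8's implementation (for instance, it ignores `__all__` and names used only inside strings); `unused_imports` is a hypothetical helper.

```python
import ast


def unused_imports(source):
    """Return names that are imported but never referenced (a rough F401-like check)."""
    tree = ast.parse(source)
    imported = {}
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds the name `a`; `import a as b` binds `b`.
                name = alias.asname or alias.name.split(".")[0]
                imported[name] = node.lineno
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported[alias.asname or alias.name] = node.lineno
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return sorted(name for name in imported if name not in used)


source = "import transformers\nimport numpy as np\n\nprint(np.zeros(2))\n"
print(unused_imports(source))  # ['transformers']
```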

Mypy

Mypy [7] is a static type checker for Python to help catch type-related errors in Python code.

def speak_like_benedict_cumberbatch(my_sentence):
    my_sentence += 'some british accent'
    return my_sentence
# after running mypy (with --disallow-untyped-defs or --strict), you will see this warning:
error: Function is missing a type annotation [no-untyped-def]
Found 1 error in 1 file (checked 1 source file)
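
One way to resolve this warning is to annotate the parameter and the return value. The version below is a sketch of that fix; the `str` types are an assumption based on how the function uses its argument.

```python
def speak_like_benedict_cumberbatch(my_sentence: str) -> str:
    # Parameter and return annotations give mypy something to check.
    my_sentence += "some british accent"
    return my_sentence


print(speak_like_benedict_cumberbatch("hello, "))  # hello, some british accent
```

With the annotations in place, a call such as `speak_like_benedict_cumberbatch(42)` would be reported by Mypy before the code ever runs.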

For each hook, we can add optional arguments to make it more strict or more specific. In addition to these arguments, we can also ignore some checks. This matters because when the code does not meet the rules of these hooks, you cannot even create your commit. In the next section, you can find how to set up a pre-commit configuration that uses the hooks above for both Python files and Jupyter Notebooks.

How

After a brief explanation of code quality and some details on the hooks, it is time to set up the pre-commit hooks! To install them, we need two main elements:

  • requirements-dev.txt to contain the pre-commit and hook packages to install (if you use poetry, you can copy the package names)
  • .pre-commit-config.yaml file to configure the hooks for both file types.

Installing Packages

You can create a requirements-dev.txt file and add the packages there:

nbqa[toolchain]==1.7.1
pre-commit==3.6.0
isort==5.13.2
black==24.1.1
flake8==7.0.0
pyupgrade==3.15.0
mypy==1.8.0

nbQA [8] is a package that runs these code quality tools on Jupyter Notebooks.
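
A notebook file is JSON, so the tools above cannot read it directly. Roughly speaking, nbQA bridges this by turning the notebook's code cells into temporary Python source, running the tool on it, and mapping the result back. The sketch below shows only the first step in a simplified form; it is not nbQA's code, and `code_cells_to_source` is a hypothetical helper.

```python
import json


def code_cells_to_source(notebook_json):
    """Join the code cells of a notebook (given as a JSON string) into one Python source string."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format stores `source` as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source)
    return "\n\n".join(chunks)


# A tiny hand-written notebook, used here purely for illustration.
example = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["import numpy as np\n", "x = np.ones(3)"]},
        {"cell_type": "code", "source": "print(x)"},
    ],
    "nbformat": 4,
    "nbformat_minor": 5,
})
print(code_cells_to_source(example))
```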

You can install these packages with the command below:

pip install -r requirements-dev.txt

Now that the packages are installed, we need to create a .pre-commit-config.yaml file to configure the pre-commit hooks.

Creating the Configuration File

You can use the following configuration as a starting point:

repos:
  - repo: https://github.com/nbQA-dev/nbQA
    rev: 1.7.1
    hooks:
      - id: nbqa-black
        additional_dependencies: [black==24.1.1]
      - id: nbqa-pyupgrade
        args: [--py39-plus]
      - id: nbqa-isort
        additional_dependencies: [isort==5.13.2]
        args: [--profile=black]
      - id: nbqa-flake8
        additional_dependencies: [flake8==7.0.0]
        args: ["--ignore=E501,W503,F704,E203"]
      - id: nbqa-mypy
        additional_dependencies: [mypy==1.8.0]
        args: ['--ignore-missing-imports', '--disable-error-code=top-level-await']
  - repo: https://github.com/psf/black
    rev: 24.1.1
    hooks:
      - id: black
  - repo: https://github.com/asottile/pyupgrade
    rev: v3.15.1
    hooks:
      - id: pyupgrade
  - repo: https://github.com/pycqa/isort
    rev: 5.13.2
    hooks:
      - id: isort
        args: [--profile=black]
  - repo: https://github.com/PyCQA/flake8
    rev: 7.0.0
    hooks:
      - id: flake8
        args: ["--ignore=E501,W503,F704,E203"]
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: 'v1.9.0'
    hooks:
      - id: mypy
        args: ['--ignore-missing-imports', '--disable-error-code=top-level-await', "--disable-error-code=empty-body"]

The repos section in the configuration file lists the repositories that provide the pre-commit hooks. For each repository, the rev field pins the version (a Git tag or commit) to use, and each id field names a hook from that repository to run. The first repository listed contains all the hooks for Jupyter Notebooks, which run the tools we discussed earlier through nbQA. The subsequent repositories contain the hooks for Python files, with some arguments that specify additional options or ignore given warnings.

Activating the Pre-Commit Hooks

Once you have built the configuration file, you need to run the following command to install the pre-commit script that will run before each commit:

pre-commit install

Your pre-commit hooks are now ready to format your code! With each commit you create, auto-formatting and validations are automatically performed by the pre-commit scripts. You may encounter errors such as unused imports or missing return types during the validations. When this occurs, it is time to improve your code quality and make your code commit-ready!

If you would like to set up this pre-commit configuration on an existing project, there’s no need to be concerned! You can run the command below (please note that it may generate a massive number of errors or warnings depending on your coding style and the number of files):

pre-commit run --all-files

Now, there is no excuse for poor code quality anymore!

Final Words

Before concluding this post, I’d like to share some common mistakes to avoid when working with pre-commit scripts:

  1. Ensure new contributors have installed the development packages and activated the pre-commit script.
  2. Specify hook versions to hold consistent standards across the project. Utilizing different versions can lead to discrepancies.
  3. Be cautious when adding multiple ignores while creating new commits. Although it’s tempting to postpone improvements, these ignores can quickly accumulate, making it challenging to maintain code quality.
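
The first two points can be checked with a small script. The sketch below is one possible approach under a few assumptions: it treats the presence of a `.git/hooks/pre-commit` file as a sign that `pre-commit install` was run (it does not verify who wrote that file), and it flags requirement lines that lack an `==` pin. Both function names are hypothetical.

```python
import re
from pathlib import Path


def hook_is_installed(repo_root):
    """Return True if a pre-commit hook file exists under .git/hooks."""
    return (Path(repo_root) / ".git" / "hooks" / "pre-commit").is_file()


def unpinned_requirements(text):
    """Return requirement lines that are not pinned with `==`."""
    unpinned = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not re.search(r"==\s*\S+", line):
            unpinned.append(line)
    return unpinned


requirements = "pre-commit==3.6.0\nblack\nmypy==1.8.0\n"
print(unpinned_requirements(requirements))  # ['black']
print(hook_is_installed("."))  # depends on the current directory
```

A check like this could run in a project's setup script so that new contributors notice a missing hook early.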

In this post, I’ve explained the importance of code quality and demonstrated how to maintain it using pre-commit hooks for both Python files and Jupyter Notebooks. I hope you find this post valuable for improving the consistency and readability of your existing and future projects!
