Category: Technology

Programming, gadgets (reviews thereof) and computers

A way of working: data science

I am about to take on a couple of data science students from Lancaster University for summer projects, from past experience I always spend some time at the beginning of such projects explaining how I work with the expectation that they will at least take some notice if not repeat my methodology exactly. This methodology evolves slowly over time as I learn new things and my favoured technologies change.

Typically I develop on a Windows laptop but I use the git-bash prompt as my shell for typing in commands – this is a Linux-like terminal which I adopted after working with developers who mainly used Linux and also because I was familiar with the Unix style commandline from before the time on Linux. You can do a lot from the commandline in data science – Data Science at the Command Line by Jeroen Janssens is an excellent introduction.

I use Docker containers a bit to spin up local versions of services which are difficult to run on Windows (things like Airflow and Linkedin DataHub), some people develop entirely inside Docker containers to reduce dependency issues and make deployment of code easier.

I work pretty much entirely in Python for data processing and analysis although I generate CSV files which I load to Tableau for visualisation. I tend not to try complex processing in Tableau since I find the GUI inconvenient and confusing for such work. I use the Anaconda distribution of Python, originally because I liked that it came packaged with a load of useful libraries for data science and it handled virtual environments and installation of more tricky packages better than plain Python. It may be worth revisiting this decision. I have recently shifted my code to Python 3.9.

For a piece of work I will usually set up a Python project which can be “installed”. This blog post explains a standard structure for Python projects. I aim to use Python virtual environments on a per project basis but sometimes I fail. Typically, I will write Python modules that provide functions but also have a simple command line interface which takes two or three positional parameters. You can see this in action in the git repo here which I share as a template for myself and others!

To date I have picked up commandline arguments using sys.argv I should probably use one of the libraries to make these commandline interfaces better, there is a blog post here which compares the built-in argparse library with click and docopt. I think I might check out click for future projects.

As well as running commandline scripts I use tests to develop analysis, as well as being good software development practice, test runners make a convenient way to run arbitrary functions in a code base. I prefer to use the unittest built-in library but I’ve started using pytest for a recent project. I wrote a blog post about writing tests, since I wrote it I have learned about test mocks and pytest’s fixture functionality.

I have a library of general utilities for interacting with databases, setting up logging and writing dictionaries which I wrote because I found I was doing these things repeatedly and making my own library allowed me to forgot some of the boilerplate code required to do these things. The key utilities are included with the repo attached to this blog.

I’ve been using Visual Code as my editor for some time now, I prefer not to use full blown IDEs because I find they present more functionality than I can cope with. I think this is as a result of coding in Java using Eclipse and C# .net in Visual Studio. In any case Visual Code starts as a nice enough code editor but has been sneaking in more IDE functionality via extensions.

The extensions I use heavily in Visual Code are Python and Pylance – the Python language server provides type-hinting support. I wrote about type-hinting in Python here. I also use Rainbow CSV for when I am editing or viewing CSV files.

I could use Visual Code for accessing git, my preferred source control system, but instead I use GitKraken which has a very nice GUI interface. Since I am usually working by myself my git usage is very simple, I typically have one branch onto which I make many small commits. I have recently started working with a team where I am using feature-based branches which get merged by pull requests – this was a bit of a culture shock.

As a result of working with other people on a new project I have started using some technologies which I will just mention here. I run the black formatter, as well pylint and flake8. Black just reformats my code files when I save them and can largely be ignored. Flake8 is fairly easy to satisfy although I spent a lot of time addressing line length issues. Pylint generates quite a few warnings which I attend to but sometimes ignore.

I have also started using Make files and Azure Devops pipelines for running common tasks on my code (tests, cleanup, setting up infrastructure, linting).

Outside technology, I have a very long established method of working using a monthly Word document as a notebook, I describe it here. I tend to prefix file names with ISO8601 format dates (2022-05-22) this means that if I created a Tableau workbook or an Excel worksheet I can link it easily to what I was writing in my notebook and the status of the appropriate git repo at that point in time.

I’ve incorporated all the code related elements mentioned above in this ways-of-working-data-science git repository.

Understanding setup.py, setup.cfg and pyproject.toml in Python

This blog post is designed to clarify my thinking around installing Python packages and the use of setup.py, setup.cfg and pyproject.toml files. Hopefully it will be a useful reference for other people, and future me.

It is stimulated by my starting work on a new project where we have been discussing best practices in Python programming, and how to layout a git repository containing Python code. More broadly it is relevant to me as someone who programmes a lot in Python, mainly for my own local use, but increasingly for other people to consume my code. Prior to writing this the layout of my Python repositories was by a system of random inheritance dating back a number of years.

The subject of Python packaging, installation and publication is a bit complicated for mainly historic reasons – the original distutils module was created over 20 years ago. A suite of tools have grown up either as part of the standard library or de facto standards, and have evolved over time. Some elements are contentious in the sense that projects will have lengthy arguments over whether or not to support a particular method of configuration. A further complication for people whose main business is not distributing their code is that it isn’t necessarily at the start of a project and may never be relevant.

Update: I have updated this blog post 5th May 2023, the change is that project settings formerly in setup.cfg can now go in pyproject.toml, as per PEP-621 – described in more detail in the PyPA documentation. Currently I only use setup.cfg for flake8 configuration.  A reader from Mastodon commented that setup.py is not required for installation of a package but is required for build/publication.

tl;dr

Structure your Python project like this with setup.py and pyproject.toml in the top level with a tests directory and a src directory with a package subdirectory inside that:

The minimal setup.py file simply contains an invocation of the setuptools setup function, if you do not intend to publish your project then no setup.py file is required at all, pip install -e . will work without it:

setup.py

Setup.cfg is no longer required for configuring a package but third-party tools may still use it. Put at least this in pyproject.toml:

 

Then install the project locally:

pip install -e .

If you don’t do this “editable installation” then your tests won’t run because the package will not be installed. An editable install means that changes in the code will be immediately reflected in the functionality of the package.

It is common and recommended practice to use virtual environments for work in Python. I use the Anaconda distribution of Python in which we setup and activate a virtual environment using the following, to be run before the pip install statement

conda create -n tmp python=3.9
conda activate tmp

There is a copy of this code, including some Visual Code settings, and a .gitignore file in this GitHub repository: https://github.com/IanHopkinson/mypackage

Setup.py and setup.cfg

But why should we do it this way? It is worth stepping back a bit and defining a couple of terms:

module – a module is a file containing Python functions.

package – a package is a collection of modules intended to be installed and used together.

Basically this blog post is all about making sure import and from ... import ... works in a set of distinct use cases. Possibilities include:

  1. Coding to solve an immediate problem with no use outside of the current directory anticipated – in this case we don’t need to worry about pyproject.toml, setup.cfg, setup.py or even __init__.py.
  2. Coding to solve an immediate problem with the potentially to spread code over several files and directories – we should now make sure we put an empty __init__.py in each directory containing module files.
  3. Coding to provide a local library to reuse in other projects locally this will require us to run python setup.py develop or better pip install -e .
  4. Coding to provide a library which will be used on other systems you control again using pip install -e .
  5. Coding to provide a library which will be published publicly, here we will need to additionally make use of something like the packaging library.

I am primarily interested in cases 3 and 4, and my projects tend to be pure Python so I don’t need to worry about compiling code. More recently I have been publishing packages to a private PyPI repository but that is a subject for another blog post.

The setup.py and setup.cfg files are artefacts of the setuptools module which is designed to help with the packaging process. It is used by pip whose purpose is to install a package either locally or remotely. If we do not configure setup.py/setup.cfg correctly then pip will not work. In the past we would have written a setup.py file which contained a bunch of configuration information but now we should put that configuration information into setup.cfg which is effectively an ini format file (i.e. does not need to be executed to be read). This is why we now have the minimal setup.py file.

It is worth noting that setup.cfg is an ini format file, and pyproject.toml is a slightly more formal ini-like format.

What is pyproject.toml?

The pyproject.toml file was introduced in PEP-518 (2016) as a way of separating configuration of the build system from a specific, optional library (setuptools) and also enabling setuptools to install  itself without already being installed. Subsequently PEP-621 (2020) introduces the idea that the pyproject.toml file be used for wider project configuration and PEP-660 (2021) proposes finally doing away with the need for setup.py for editable installation using pip.

Although it is a relatively new innovation, there are a number of projects that support the use of pyproject.toml for configuration including black, pylint and mypy. More are listed here:

https://github.com/carlosperate/awesome-pyproject

Where do tests go?

Tests go in a tests directory at the top-level of the project with an __init__.py file so they are discoverable by applications like pytest. The alternative of placing them inside the src/mypackage directory means they will get deployed into production which may not be desirable.

Why put your package code in a src/ subdirectory?

Using a src directory ensures that you must install a package to test it, so as your users would do. Also it prevents tools like pytest incidently importing it.

Conclusions

I found it a useful exercise researching this blog post, the initial setup of a Python project is something I rarely consider and have previously done by rote. Now I have a clear understanding of what I’m doing, and I also understand the layout of Python projects. One of my key realisations is that this is a moving target, what was standard practice a few years ago is no longer standard, and in a few years time things will have changed again.

Python Documentation with Sphinx

I’ve been working on a proof of concept project at work, and the time has come to convert it into a production system. One of the things it was lacking was documentation, principally for the developers who would continue work on it. For software projects there is a solution to this type of problem: automated documentation systems which take the structure of the code and the comments in it (written in a particular way) and generate human readable documentation from it – typically in the form of webpages.

For Python the “go to” tool in this domain is Sphinx.

I used Sphinx a few years ago, and although I got to where I wanted in terms of the documentation it felt like a painful process. This time around I progressed much more quickly and was happier with the results. This blog post is an attempt to summarise what I did for the benefit of others (including future me). Slightly perversely, although I use a Windows 10 laptop, I use Git Bash as my command prompt but I believe everything here will apply regardless of environment.

There are any number of Sphinx guides and tutorials around, I used this one by Sam Nicholls as a basis supplemented with a lot of Googling for answers to more esoteric questions. My aim here is to introduce some pragmatic solutions to features I wanted, and to clarify some thing that might seem odd if you are holding the wrong concept of how Sphinx works in your head.

I was working on a pre-existing project. To make all of the following work I ran “pip install …” for the following libraries: sphinx, sphinx-rtd-theme, sphinx-autodoc-typehints, and m2r2. In real life these additional libraries were added progressively. sphinx-rtd-theme gives me the the popular “Readthedocs” theme, Readthedocs is a site that publishes documentation and the linked example shows what can be achieved with Sphinx. sphinx-autodoc-typehints pulls in type-hints from the code (I talked about these in another blog post) and m2r2 allows the import of Markdown (md) format files, Sphinx uses reStructuredText (rst) format by default. These are both simple formats that are designed to translate easily into HTML format which is a pain to edit manually.

With these preliminaries done the next step is to create a “docs” subdirectory in the top level of your repository and run the “sphinx-quickstart” script from the commandline. This will ask you a bunch of questions, you can usually accept the default or provide an obvious answer. The only exception to this, to my mind, is when asked “Separate source and build directories“, you should answer “yes“. When this process finishes sphinx-quickstart will have generated a couple of directories beneath “docs“: “source” and “build“. The build directory is empty, the source directory contains a conf.py file which contains all the configuration information you just provided, an index.rst file and a Makefile. I show the full directory structure of the repository further down this post.

I made minor changes to conf.py, switching the theme with html_theme = ‘sphinx_rtd_theme’, and adding the extensions I’m using:

extensions = [
'sphinx.ext.autodoc',
'sphinx_autodoc_typehints',
'm2r2',
]

In the past I added these lines to conf.py but as of 2022-12-26 this seems not to be necessary:

import os 
import sys
sys.path.insert(0, os.path.abspath('..'))

This allows the Sphinx to “see” the rest of your repository from the docs directory.

The documentation can now be built using the “make html” command but it will be a bit dull.

In order to generate the documentation from code a command like: “sphinx-apidoc -o source/ ../project_code“, run from the docs directory will generate .rst files in the source directory which reflect the code you have. To do this Sphinx imports your code, and it will use the presence of the __init__.py file to discover which directories to import. It is happy to import subdirectories of the main module as submodules. These will go into files of the form module.submodule.rst.

The rst files contain information from the docstrings in your code files, (those comments enclosed in triple double-quotes “””I’m a docstring”””. A module or submodule will get the comments from the __init__.py file as an overview then for each code file the comments at the top of the file are included. Finally, each function gets an entry based on its definition and some specially formatted documentation comments. If you use type-hinting, the sphinx-autodoc-typehints library will include that information in documentation. The following fragment shows most of the different types of annotation I am using in docstrings.

def initialise_logger(output_file:Union[str, bytes, os.PathLike], mode:Optional[str]="both")->None:
    """
    Setup logging to console and file simultanenously. The process is described here:
    Logging to Console and File In Python

    :param output_file: log file to use. Frequently we set this to:
    .. highlight:: python
    .. code-block:: python

            logname = __file__.replace("./", "").replace(".py", "")
            os.path.join("logs", "{}.log".format(logname)) 
        
    :param mode: `both` or `file only` selects whether output is sent to file and console, or file only
    
    :return: No return value
    """

My main complaint regarding the formatting of these docstrings is that reStructuredText (and I suspect all flavours of Markdown) are very sensitive to whitespace in a manner I don’t really understand. Sphinx can support other flavours of docstring but I quite like this default. The docstring above, when it is rendered, looks like this:

In common with many developers my first level of documentation is a set of markdown files in the top level of my repository. It is possible to include these into the Sphinx documentation with a little work. The two issues that need to be addressed is that commonly such files are written in Markdown rather reStructuredText. These can be fixed by using the m2r2 library. Secondly the top level of a repository is outside the Sphinx source tree, so you need to put rst files in the source directory which include the Markdown files from the root of the repository. For the CONTRIBUTIONS.md file the contributions.rst file looks like this:

.. mdinclude:: ../../CONTRIBUTIONS.md

Putting this all together the (edited) structure for my project looks like the following, I’ve included the top-level of the repository which contains the Markdown flavour files, the docs directory, where all the Sphinx material lives, and stubs to the directories containing the module code, with __init__.py files.

.

├── CONTRIBUTIONS.md
├── INSTALLATION.md
├── OVERVIEW.md
├── USAGE.md
├── andromeda_dq
│   ├── __init__.py
│   ├── scripts
│   │   ├── __init__.py
│   ├── tests
│   │   ├── __init__.py
├── docs
│   ├── Makefile
│   ├── make.bat
│   └── source
│       ├── _static
│       ├── _templates
│       ├── andromeda_dq.rst
│       ├── andromeda_dq.scripts.rst
│       ├── andromeda_dq.tests.rst
│       ├── conf.py
│       ├── contributions.rst
│       ├── index.rst
│       ├── installation.rst
│       ├── modules.rst
│       ├── overview.rst
│       └── usage.rst
├── setup.py

The index.rst file pulls together documentation in other rst files, these are referenced by their name excluded the rst extension (so myproject pulls in a link to myproject.rst). By default the index file does not pull in all of the rst files generated by apidoc, so these might need to be added (specifically the modules.rst file). The index.rst file for my project looks like this, all I have done manually to this file is add in overview, installation, usage, contributions and modules in the “toctree” section. Note that the indentation for these file imports needs to be the same as for the preceding :caption: directive.

.. Andromeda Data Quality documentation master file, created by
   sphinx-quickstart on Wed Sep 15 08:33:59 2021.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Andromeda Data Quality
======================

Documentation built using Sphinx. To re-build run `make html` in the `docs`
directory of the project.

The OVERVIEW.md, INSTALLATION.md, USAGE.md, and CONTRIBUTIONS.md files are imported 
from the top level of the repo.

Most documentation is from type-hinting and docstrings in source files.

.. toctree::
   :maxdepth: 3
   :caption: Contents:

   overview
   installation
   usage
   contributions
   modules
   


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

The (edited) HTML index page for the documentation looks like this:

For some reason Sphinx puts the text in the __init__.py files which it describes as “Module Contents” at the bottom of the relevant package description, this can be fixed by manually moving the “Module contents” section to the top of the file in the relevant package rst file.

There is a little bit of support for Sphinx in Visual Code, I’ve installed the reStructuredText Syntax highlighting extension and the Python Sphinx Highlighter extension. The one thing I haven’t managed to do is automate the process of running “make html” either on commit of new code, or when code is pushed to a remote. I suspect this will be one of the drawbacks in using Sphinx. I’m getting a bit better at adding type-hinting and docstrings as I code now.

If you have any comments, or suggestions, or find any errors in this blog post feel free to contact me on twitter (@ianhopkinson_).

Unit testing in Python using the unittest module

The aim of this blog post is to capture some simple “recipes” on testing code in Python that I can return to in the future. I thought it would also be worth sharing some of my thinking around testing more widely. The code in this GitHub gist illustrates the testing features I mention below.

#!/usr/bin/env python
# encoding: utf-8
"""
Some exercising of Python test functionality based on:
https://docs.python.org/3/library/doctest.html
https://docs.python.org/3/library/unittest.html
Generating tests dynamically with unittest
https://eli.thegreenplace.net/2014/04/02/dynamically-generating-python-test-cases
Supressing log output to console:
https://stackoverflow.com/questions/2266646/how-to-disable-and-re-enable-console-logging-in-python
The tests in this file are run using:
./tests.py -v
Ian Hopkinson 2020-11-18
"""
import unittest
import logging
def factorial(n):
"""Return the factorial of n, an exact integer >= 0.
>>> [factorial(n) for n in range(6)]
[1, 1, 2, 6, 24, 120]
>>> factorial(30)
265252859812191058636308480000000
>>> factorial(-1)
Traceback (most recent call last):
ValueError: n must be >= 0
Factorials of floats are OK, but the float must be an exact integer:
>>> factorial(30.1)
Traceback (most recent call last):
ValueError: n must be exact integer
>>> factorial(30.0)
265252859812191058636308480000000
It must also not be ridiculously large:
>>> factorial(1e100)
Traceback (most recent call last):
OverflowError: n too large
"""
import math
if not n >= 0:
raise ValueError("n must be >= 0")
if math.floor(n) != n:
raise ValueError("n must be exact integer")
if n+1 == n: # catch a value like 1e300
raise OverflowError("n too large")
result = 1
factor = 2
while factor <= n:
result *= factor
factor += 1
return result
class TestFactorial(unittest.TestCase):
test_cases = [(0, 1, "zero"),
(1, 1, "one"),
(2, 2, "two"),
(3, 5, "three"), # should be 6
(4, 24, "four"), # should be 24
(5, 120, "five")]
def test_that_factorial_30(self):
self.assertEqual(factorial(30), 265252859812191058636308480000000)
def test_that_factorial_argument_is_positive(self):
with self.assertRaises(ValueError):
factorial(-1)
def test_that_a_list_of_factorials_is_calculated_correctly(self):
# nosetests does not run subTests correctly:
# It does not report which case fails, and stops on failure
for test_case in self.test_cases:
with self.subTest(msg=test_case[2]):
print("Running test {}".format(test_case[2]), flush=True)
logging.info("Running test {}".format(test_case[2]))
self.assertEqual(factorial(test_case[0]), test_case[1], "Failure on {}".format(test_case[2]))
@unittest.skip("demonstrating skipping")
def test_nothing(self):
self.fail("shouldn't happen")
if __name__ == '__main__':
# Doctests will run if they are invoked before unittest but not vice versa
# nosetest will only invoke the unittests by default
import doctest
doctest.testmod()
# If you want your generated tests to be separate named tests then do this
# This is from https://eli.thegreenplace.net/2014/04/02/dynamically-generating-python-test-cases
def make_test_function(description, a, b):
def test(self):
self.assertEqual(factorial(a), b, description)
return test
testsmap = {
'test_one_factorial': [1, 1],
'test_two_factorial': [2, 3],
'test_three_factorial': [3, 6]}
for name, params in testsmap.items():
test_func = make_test_function(name, params[0], params[1])
setattr(TestFactorial, 'test_{0}'.format(name), test_func)
# This supresses logging output to console, like the –nologcapture flag in nosetests
logging.getLogger().addHandler(logging.NullHandler())
logging.getLogger().propagate = False
# Finally we run the tests
unittest.main(buffer=True) # supresses print output, like –nocapture in nosetests or you can use -b
view raw tests.py hosted with ❤ by GitHub

My journey with more formal code testing started about 10 years ago when I was programming in Matlab. It only really picked up a couple of years later when I moved to work at a software startup, coding in Python. I’ve read a couple of books on testing (BDD in action by John Ferguson Smart, Test-Driven Development with Python by Harry J.W. Percival) as well as Working effectively with legacy code by Michael C. Feathers which talks quite a lot about testing. I wrote a blog post a number of years ago about testing in Python when I had just embarked on the testing journey.

As it stands I now use unit testing fairly regularly although the test coverage in my code is not great.

Python has two built-in mechanisms for carrying out tests, the doctest and the unittest modules. Doctests are added as comments to the top of function definitions (see lines 27-51 in the gist above). They are run either by adding calling doctest.testmod() in a script, or adding doctest to a Python commandline as shown below.

python -m doctest -v tests.py

Personally I’ve never used doctest – I don’t like the way the tests are scattered around the code rather than being in one place, and the “replicating the REPL” seems a fragile process but I include them here for completeness.

That leaves us with the unittest module. In Python it is not unusual use a 3rd party testing library which runs on top of unittest, popular choices include nosetests and, more recently, pytest. These typically offer syntactic sugar in terms of making tests slightly easier to code, and read. There is also additional functionality in writing and running test suites. Unittest is based on the Java testing framework, Junit, as such it inherits an object-oriented approach that demands tests are methods of a class derived from unittest.TestCase. This is not particularly Pythonic, hence the popularity of 3rd party libraries.

I’ve used nosetest for a while, now but it looks like its use is no longer recommended since it is no longer being developed. Pytest is the new favoured 3rd party library. Personally, I’m probably going to revert to writing tests using unittest. As a result of writing this blog post I will probably stop using nosetests as a test runner and simply use pure unittest.

The core of unittest is to call the function under test with a set of parameters, and check that the function returns the correct response. This is done using one of the assert* methods of the unittest.TestCase class. I nearly always end up using assertEquals. This is shown in minimal form in lines 67-76 above.

With data science work we often have a list of quite similar tests to run, calling the same function with a list of arguments and checking off the response against the expected value. Writing a function for each test case is a bit laborious, unittest has a couple of features to help with this:

  • subTest puts all the test cases into a single test function, and executes them all, reporting only those that fail (see lines 82-90). This is a compact approach but not verbose. Note that nosetests does not run subTest correctly, it being a a feature of unittest only introduced in Python 3.4 (2014);
  • alternatively we can use a functional programming trip to programmatically generate test functions and add them to the unittest.TestCase class we have derived, this is shown on lines 105-116;

Sometimes you write tests that you don’t always want to run either because they are slow to run, or because you used them in addressing a particular problem and now want to keep for the purposes of documentation but not to run. Decorators in unittest are used to skip tests, @unittest.skip() is the simplest of these, this is an “opt-out”.

Once you’ve written your tests then you need to run them. I liked using nosetests for this, if you ran it in a directory then it would trundle off and find any files that looked like they contained tests and run them, reporting back on the results of the tests.

Unittest has some test discovery functionality which I haven’t yet explored, the simplest way of invoking it is simply by running the script file using:

python tests.py -v

The -v flag indicates that output should be “verbose”, the name of each test is shown with a pass/fail status, and debug output if a test fails. By default unittest shows print messages from the test functions and the code being tested on the console, and also logging messages which can confuse test output. These can be supressed by running tests with the -b flag at the commandline or setting the buffer argument to True in the call to unittest.main(). Logging messages can be supressed by adding a NullHandler, as shown in the gist above on lines 188-119.

The only functionality I’ve used in nosetests and can’t do using pure unittest is re-running only those tests that failed. This limitation could be worked around using the -k commandline flag and using a naming convention to track those test still failing.

Not covered in this blog post are the setUp and tearDown methods which can be run before and after each test method.  

I hope you found this blog post useful, I found writing it helpful in clarifying my thoughts and I now have a single point of reference in future.

Type annotations in Python, an adventure with Visual Studio Code and Pylance

I’ve been a Python programmer pretty much full time for the last 7 or 8 years, so I keep an eye out for new tools to help me with this. I’ve been using Visual Studio Code for a while now, and I really like it. Microsoft have just announced Pylance, the new language server for Python in Visual Studio Code.

The language server provides language-sensitive help like spotting syntax errors, providing function definitions and so forth. Pylance is based on the type checking engine Pyright. Python is a dynamically typed language but recently has started to support type annotations. Dynamic typing means you don’t tell the interpreter what “type” a variable is (int, string and so forth) you just use it as such. This contrasts with statically typed languages where for every variable and function you define a type as you write your code. Type annotations are a halfway house, they are not used by the Python interpreter but they can be used by tools like Pylance to check code, making it more likely to run correctly on first go.

Pylance provides a range of “Intellisense” code improvement features, as well as type annotation based checks (which can be switched off).

I was interested to use the type annotations checking functionality since one of the pleasures of working with statically typed languages is that once you’ve satisfied your compiler that all of the types are right then it has a better chance of running correctly than a program in a dynamically typed language.

I will use the write_dictionary function in my little ihutilities library as an example here, this function is defined in the file io_utils.py. The appropriate type annotation for write_dictionary is:

def write_dictionary(filename: str, data: List[Dict[str,Any]], 
append:Optional[bool]=True, delimiter:Optional[str]=",") -> None:

Essentially each variable is followed by a colon, and then a type (i.e str). Certain types are imported from the typing library (Any, List, Optional and Dict in this instance). We supply the types of the elements of the list, or dictionary. The Any type allows for any type. The Optional keyword is used for optional parameters which can have a default value. The return type is put at the end after the ->. In a *.pyi file described below, the function body is replaced with ellipsis (…).

Actually the filename type hint shouldn’t be string but I can’t get the approved type of Union[str, bytes, os.PathLike] to work with Pylance at the moment.

As an aside Pylance spotted that two imports in the io_utils.py library were unused. Once I’d applied the type annotation to the function definition it inferred the types of variables in the code, and highlighted where there might be issues. A recurring theme was that I often returned a string or None from a function, Pylance indicated this would cause a problem if I tried to measure the length of None.

There a number of different ways of providing typing information, depending on your preference and whether you are looking at your own code, or at a 3rd party library:

  1. Types provided at definition in the source module – this is the simplest method, you just replace the function def line in the source module file with the type annotated one;
  2. Types provided in the source module by use of *.pyi files – you can also put the type-annotated function definition in a *.pyi file alongside the original file in the source module in the manner of a C header file. The *.pyi file needs to sit in the same directory as its *.py sibling. This definition takes precedence over a definition in the *.py file. The reason for using this route is that it does not bring incompatible syntax into the *.py files – non-compliant interpreters will simply ignore *.pyi files but it does clutter up your filespace. Also there is a risk of the *.py and *pyi becoming inconsistent;
  3. Stub files added to the destination project – if you import write_dictionary into a project Pylance will highlight that it cannot find a stub file for ihutilities and will offer to create one. This creates a `typings` subdirectory alongside the file on which this fix was executed, this contains a subdirectory called `ihutilities` in which there are files mirroring those in the ihutilities package but with the *.pyi extension i.e. __init__.pyi, io_utils.py, etc which you can modify appropriately;
  4. Types provided by stub-only packages PEP-0561 indicates a fourth route which is to load the type annotations from a separate, stub only, module.
  5. Types provided by Typeshed – Pyright uses Typeshedfor annotations for built-in and standard libraries, as well as some popular third party libraries;

Type annotations were introduced in Python 3.5, in 2015, so are a relatively new language feature. Pyright is a little over a year old, and Pylance is a few days old. Unsurprisingly documentation in this area is relatively undeveloped. I found myself looking at the PEP (Python Enhancement Proposals) references as often as not to understand what was going on. If you want to see a list of relevant PEPs then there is a list on the Pyright README.md, I even added one myself.

Pylance is a definite improvement on the old Python language server which was itself more than adequate. I am currently undecided about type annotations, the combination of Pylance and type annotations caught some problems in my code which would only come to light in certain runtime circumstances. They seem to be a bit of an overhead which I suspect I would only use for frequently used library routines, and core code which gets run a lot and is noticeable by others when it fails. I might start by adding in some *.pyi files to my active projects.