Software Engineering for Data Scientists

For a long time I have worked as a data scientist, and before that a physical scientist – writing code to do data processing and analysis. I have done some work in software engineering teams but only in a relatively peripheral fashion – as a pair programmer to proper developers. As a result I have picked up some software engineering skills – in particular unit testing and source control. This year, for the first time, I have worked as a software engineer in a team. I thought it was worth recording the new skills and ways of working I have picked up in the process. It is worth pointing out that this was a very small team, with only three developers working a total of about 1.5 FTE.

This blog assumes some knowledge of Python and source control systems such as git.

Coding standards

At the start of the project I did some explicit work on Python project structure, which resulted in this blog post (my most read by a large margin). At this point we also discussed which Python version would be our standard, and which linters (syntax/code style enforcers) we would use (Black, flake8 and pylint) – previously I had not used any linters/syntax checkers other than those built into my preferred editor (Visual Studio Code). My Python project layout used to be a result of rote learning – working in a team forced me to clarify my thinking in this area.

Agile development

We followed an Agile development process, with work specified in JIRA tickets which were refined and executed in 2 week sprints. Team members were subjected to regular rants (from me) on the non-numerical “story points” which have the appearance of numbers BUT REALLY THEY ARE NOT! Also the metaphor of sprinting all the time is exhausting. That said, I quite like the structure of working against tickets and moving them around the JIRA board. Agile development is the subject of endless books, so I am not going to attempt to describe it in any detail here.

Source control and pull requests

To date my use of source control (mainly git these days) has been primitive; effectively I worked on a single branch to which I committed all of my code. I was fairly good at committing regularly, and my commit messages were reasonably useful. I used source control to delete code with confidence and as a record of what I was doing when.

This project was different – as is common we operated on the basis of developing new features on branches which were merged to the main branch by a process of “pull requests” (GitHub language) / “merge requests” (GitLab language). For code to be merged it needed to pass automated tests (described below) and review by another developer.

I now realise we were using the GitHub Flow strategy (a description of source control branching strategies is here), which is relatively simple and fits our requirements. It would probably have been useful to talk more explicitly about our strategy here, since I had had no previous experience of this way of working.

I struggled a bit with the code review element; my early pull requests were massive and took ages for the team to review (partly because they were massive, and partly because the team was small and had limited time for the project). At one point I Googled for advice on dealing with slow code review and read articles starting “If it takes more than a few hours for code to be reviewed…” – mine were taking a couple of weeks! My colleagues had a very hard line on comments in code (they absolutely did not want any!)

On the plus side, I learnt a lot from having my code reviewed – often it pushed me to do things I knew I should have done anyway. I also learnt from reviewing others’ code; often I would review someone else’s code and then go and change my own.

Automated pipelines

As part of our development process we used Azure Pipelines to run tests on pull requests. Azure is our corporate preference – very similar pipeline systems can be found in GitHub and GitLab. This was all new to me in practical, if not theoretical, terms.

Technically, configuring the pipeline involved a couple of components. The first is optional: we used Linux “make” targets to specify actions such as running installation, linters, unit tests and integration tests. Make targets are specified in a Makefile, and are invoked with simple commands like “make install”. I had a simple Makefile which looked something like this:
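(The sketch below gives the flavour rather than reproducing the project’s actual file – the package name, test locations and exact tool invocations are illustrative, and recipe lines must be indented with tabs.)

.PHONY: install lint unit-test integration-test

install:
	pip install -e .

lint:
	black --check .
	flake8 .
	pylint mypackage || true  # package name illustrative; pylint issues reviewed but not enforced

unit-test:
	pytest tests/

integration-test:
	pytest integration_tests/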

The make targets can be run locally as well as in the pipeline. In practice we could fix all the issues raised by the black and flake8 linters, but pylint produced a huge list of issues which we considered and then ignored (so we forced a pass for pylint in the pipeline).

The Azure Pipeline was defined using a YAML file; this is a simple example:
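(Again, this is a sketch along the lines of what we used rather than our actual file – the step names and the use of make targets are illustrative.)

# Illustrative azure-pipelines.yml: run pull request builds against main on Ubuntu with Python 3.9
trigger: none

pr:
  branches:
    include:
      - main

pool:
  vmImage: ubuntu-latest

steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: "3.9"

  - script: make install
    displayName: Install package

  - script: make unit-test
    displayName: Run unit tests
    condition: succeededOrFailed()   # run even if a previous step failed

  - script: make lint
    displayName: Run linters
    condition: succeededOrFailed()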

This YAML specifies that the pipeline will be triggered on a pull request against the main branch. The pipeline is run on an Ubuntu image (the latest one) with Python 3.9 installed. Three actions are carried out: first installation of the Python package specified in the git repo, then the unit tests, and finally a set of linters. Each of these actions is run regardless of the status of previous actions. Azure Pipelines offers a lot of pre-built tasks but they are not portable to other providers, hence the use of make targets.

The pipeline is configured by navigating to the Azure Pipeline interface and pointing at the GitHub repo (and specifically this YAML file). The pipeline is triggered when a new commit is pushed to the branch on GitHub. The results of these actions are shown in a pretty interface with extensive logging.

The only downside of using a pipeline from my point of view was that my standard local operating environment is Windows, with the git-bash prompt providing a Linux-like command-line interface. The pipeline was run on an Ubuntu image, which meant that certain tests would pass locally but not in the pipeline, and were consequently quite difficult to debug. Regular issues were around checking file sizes (line endings mean that file sizes differ between Linux and Windows) and file paths, which – even with Python’s pathlib – differ between Windows and Linux systems. Using a pipeline forces you to ensure your installation process is solid, since the pipeline image is built on every run.
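As an illustration of the line-endings problem, here is a made-up pytest-style test (not one from the project): a file-size assertion gives different answers on Windows and Linux, whereas comparing normalised text content does not.

# Illustrative only: why a file-size check can pass on Linux but fail on Windows.
from pathlib import Path

def test_report_contents(tmp_path: Path) -> None:
    report = tmp_path / "report.txt"
    # Written in text mode, so "\n" becomes "\r\n" on Windows but stays "\n" on Linux
    report.write_text("line one\nline two\n")

    # Fragile: the byte count is 18 on Linux but 20 on Windows because of CRLF endings
    # assert report.stat().st_size == 18

    # More robust: compare normalised text content instead of file size
    assert report.read_text().splitlines() == ["line one", "line two"]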

We also have a separate pipeline to publish the Python package to a private PyPI repository, but that is the subject of another blog post.

Conclusions

I learnt a lot working with other, more experienced, software engineers, and as a measure of the usefulness of this experience I have retro-fitted the standard project structure and make targets to my legacy projects. I have started using pipelines for other applications.

Book review: Curious devices and mighty machines by Samuel J.M.M. Alberti

This review is of Curious devices and mighty machines: Exploring Science Museums by Samuel J.M.M. Alberti. I picked this up because I follow a number of history of science and museum people on Twitter. One downside of this is that these are the sort of people who get sneak previews of such books, leaving us mortals a long wait before we get our hands on them!

There are a couple of thousand science museums around the world, out of a total of 30,000 museums globally. About a fifth of the population visits a science museum every year. In the UK the Science Museum Group gets about 6 million visits a year. Around 100,000 visits a year are required for a museum to be economically viable. There is an overlap between science museums and the more recently instituted “exploratoriums”. Science museums have always been technology and science museums, with artefacts actually biased towards the former. Science museum exhibits can be massive (whole aeroplanes and steam engines), they can be commonplace (for example one of billions of mobile phones) and, unlike most museums, it is not unusual for the public to be able to handle selected parts of the collection.

The first science museums came into being out of the personal “cabinets of curiosities” found in the Renaissance; they became public institutions in the 18th and 19th centuries. They were often founded to demonstrate a country’s technological prowess, or to provide training for a workforce as the Industrial Revolution occurred. Sometimes scientific workplaces became museums by the passage of time; this was certainly true of the (New) Cavendish Laboratory where I once worked – the spacious corridor outside the suite of labs I worked in contained a collection of objects including James Clerk Maxwell’s desk and some of his models of mathematical functions. It was striking how scientific apparatus transitioned from finely crafted objects in the 19th century to rather more utilitarian designs in the early 20th century. Frank Oppenheimer (brother of Robert Oppenheimer) founded the first Exploratorium in San Francisco in 1969.

Perhaps a little surprisingly, science museum collections have not historically been formed systematically. The London Science Museum started, alongside the Victoria and Albert Museum, with objects from the Great Exhibition, and was boosted by part of the (enormous) Henry Wellcome collection. More recently curators have been proactive – cultivating collectors and research and industry institutions. Acquisition by purchase at auction is less common than in the art museum world but not completely unknown. Sometimes museums will make public appeals for objects, for example during the recent COVID pandemic. It has always been the case that documents, and more recently software and other digital artefacts, greatly outnumber “physical” objects. These documents are either artefacts in their own right (for example railway posters) or documentation relating to a particular object. Digital artefacts represent a challenge since most modern scientific equipment is unusable without the software that runs it, and speaking from experience it can be challenging to keep that software running even whilst the equipment is in active use.

Like icebergs, much of a science museum collection is away from public view in increasingly specialised storage facilities. Alberti is keen to highlight the vitality and dynamism of storage facilities; curators in general appear reluctant to refer to stores as “stores”! Stores are places where research and conservation happen, and sometimes there are hazards to be managed – legacy radioactive materials are an issue both in museums and in currently operational labs.

Museums present objects in long term exhibitions, and in shorter, more focused exhibitions which may move from museum to museum. Exhibitions can be object-led or story-led, and the human stories are an important element. Science museums attract a wide age range. Pierre Bourdieu makes an appearance here; as my wife completes her Doctorate of Education, Bourdieu has been a constant occupant of the mental space of our home. His relevance here is the idea of “scientific capital”, a parallel to Bourdieu’s “cultural capital”. “Scientific capital” refers to all the scientific touch points and knowledge you might have; I have demonstrated my own “scientific capital” above, citing my experiences in world class research laboratories and with scientific research. As a scientist, science museums have been my natural home from a very young age, but this is in large part due to my family rather than formal education.

The book finishes with a chapter on campaigning with collections, covering climate change, racism and colonialism, disability, and misinformation. Museums are held in high regard in terms of confidence in the information they provide, although they see their role more as teaching scientific literacy – supported by the objects they hold – than as trying to megaphone facts. Many collections contain objects with morally dubious histories; as white Western countries we have typically ignored these issues – the Black Lives Matter movement means this is starting to change.

I think the best way of placing this book is as a social history of the science museum – the author cites Richard Fortey’s Dry Store Room No. 1 as a model/inspiration and talks of the book as a “curator confessional”. It is an entertaining enough read but rather specialist.

Book review: Writing and Script – A Very Short Introduction by Andrew Robinson

A very short book for this review: Writing and Script – A Very Short Introduction by Andrew Robinson. This fits in with my previous review of Kingdom of Characters by Jing Tsu. In some ways the “very short” format stymies my reviewing process, which involves making notes on a longer book!

Robinson makes a distinction between proto-writing and full writing. The first proto-writing – isolated symbols which clearly meant something – dates back to 20,000BC, whilst the first full writing, defined as “a system of graphic symbols which can be used to convey any and all thought”, dates back to some time around 3300BC in Mesopotamia and Egypt. It first appeared in India around 2500BC, Crete (Europe) around 1750BC, China around 1200BC and Meso-America around 900BC. In common with humanity itself there is a lively single origin / multi-origin debate – did writing arise in one place and then travel around the world, or arise separately in different places?

When I am reading books relating to history I am very keen to pin down “firsts” and “dates” – I suspect this is not a good obsession. As with mathematics, the earliest full writing was used for accountancy and bureaucracy!

An innovation in writing is the rebus principle, which allows a word to be written as a series of symbols representing sounds, whilst those symbols by themselves might convey a different meaning.

I was excited to learn a new word: boustrophedon – which means writing which goes from left to right and then right to left on alternate lines – it is from the Greek for “like the ox turns”. Writing in early scripts was often in both left-right and right-left form and only stabilised to one form (typically left-right) after a period of time. I have been learning Arabic recently (which is read right-left) and was surprised how easy this switch was from my usual left-right reading in English.

Another revelation for me is that a written script is not necessarily a guide to pronunciation. In English it broadly is, and some languages do a better job of describing pronunciation in the written form, but in other languages, like Chinese, the core script is largely about transmitting ideas. Arabic holds an intermediate position – accents were added to an alphabet comprised of consonants to provide vowels and thus clarify pronunciation.

As well as the appearance of scripts, Robinson also discusses their disappearance, this happens mainly for political reasons. For example, Egyptian Hieroglyphics fell into decline through invasions by the Greeks and then Romans who used different scripts. Cuneiform was in use for 3000 years before dying out in around 75AD.

Deciphering scripts gets a chapter of its own, classifying decipherment efforts in terms of whether the script was known or unknown and whether the language it represented was known or unknown. Given a sample of a script, the first task is to determine the writing direction and the number of distinct elements. This second task can be challenging given the variations in individual writing styles and, for example, the use of capitalisation. The next step is to identify the type of script (an alphabet – standing for vowels and consonants, a syllabary – standing for whole syllables, or logograms – standing for whole words) on the basis of the character count and other clues. The final step, decipherment itself, requires something like the Rosetta stone – the same text written in multiple languages where at least one is known – and names of people and places are often key here. A broad knowledge of languages living and dead is also a help.

The chapter on how writing systems work expands further on the alphabet/syllabary/logogram classification, with a separate chapter on alphabets – I particularly liked the alphabet family tree. Greek is considered the first alphabet to include both consonants and vowels; earlier systems were syllabaries or contained only consonants.

Japanese and Chinese writing systems are covered in a separate chapter. I don’t think I had fully absorbed that Chinese characters are a writing system equivalent to the Latin alphabet, and so can express multiple languages. Kingdom of Characters’ focus on Chinese elided the fact that Japanese has troubles of its own, particularly in the number of homophones; Japanese speakers sometimes sketch disambiguating characters with their hands to clarify their meaning.

The book finishes with an obligatory “Writing goes electronic” chapter, which highlights that text speak (i.e. m8 for mate) is an example of the rebus principle in action. Robinson also highlights that electronic publishing has not ended or diminished the importance of the physically printed language – in fact the opposite is true.

This book packs a lot into a short space and provides the reader with interesting new facts to share (I liked boustrophedon). It would not be a substantial holiday read, but it is a great introduction to the field.

Book Review: Kingdom of Characters by Jing Tsu

Kingdom of Characters: A Tale of Language, Obsession, and Genius in Modern China by Jing Tsu describes the evolution of technology to handle the Chinese language from the start of the 20th century until pretty much today (2022). As such it blends technology, linguistics and politics.

The "issue" with Chinese as a language is that it is written as characters with each character representing a whole word, in contrast to English and similar languages which build words from a relatively small number of alphabetic characters. The Chinese language uses thousands of characters regularly, and including rarer forms brings the number to tens of thousands. This means that technologies to input, print, transmit, and index Chinese language material must be changed fairly radically from the variants used to handle alphabetic languages.

The written Chinese language has been around in fairly similar form for getting on for 3000 years, and it was in China that printing was invented in around 600AD – several hundred years before it was invented in Western Europe by Gutenberg. "Penmanship" – how someone writes characters – is still seen as an important personal skill, in a way that handwriting in English is not.

Aside from the linguistic and technological aspects of the process, politics plays an important part.

Kingdom of Characters covers the modernisation of the Chinese language and its use in new technology in seven chapters, in chronological order; each chapter focuses on one or two individuals, and some attempt is made to fill out their backgrounds. The first chapter covers the standardisation of the written language to Mandarin, which culminated in the 1913 conference of the Commission on the Unification of Pronunciation.

The next step in the modernisation of Chinese was the invention of a Chinese character typewriter, developed by Zhou Houkun and Shu Zhendong and commercialised by the Commercial Press from 1926.

I found the telegraphy chapter quite telling, not for its solution but as a demonstration of what happened when China was not at the table when systems were designed – they were condemned to use a numerical code system which was more expensive than sending alphabetic letters. Interestingly, the global telegraphy system seemed to spend a great deal of time trying to stop people sending encoded messages because it saw them as “fare dodging”, and Chinese was caught up in this effort. Numbers were more expensive to send than letters, but representing whole words with numbers was seen as encoding.

Cataloguing gets a chapter of its own, covering the period from the late 1920s until the 1950s, but it feels like a continuation of other discussions on how to break the tens of thousands of characters down into a smaller set of ordered elements in a consistent and memorable fashion. There is a precedent for this: Chinese characters are written in a standard order, stroke by stroke, and there has long existed the idea of “radicals”, a small set of foundational strokes. This means that the challenge is two-fold: technical but also linguistic.

In a reprise of the standardisation discussion, the fifties saw the simplification of Chinese characters, followed by the introduction of Pinyin – a phonetic system for Chinese. This replaced the Wade-Giles phoneticisation, developed by two Westerners. Growing up in the seventies I first learned that Peking (under the Wade-Giles system) was the capital of China, for it to be replaced by Beijing (the Pinyin) in the eighties. The new system also included Chinese tones, which don’t have an equivalent in English or other Western languages.

The chapter entitled “Entering into the computer (1979)” is largely about using computers to do photo-typesetting to print Chinese. I suspect the Chinese invention of vector-based character representations may have leapfrogged Western technology. This work was born during the Cultural Revolution, which from 1966-76 impacted technological progress rather seriously. I recall in the late eighties a Chinese academic who was visiting the research group where I did my final year undergraduate project; he had worked in the fields during the Cultural Revolution – not voluntarily – and he had a better time of it than many.

The final chapter is on the burgeoning Chinese internet, with its proliferation of input methods and an audience several times larger than that of the US, although it starts with the introduction of Unicode in 1988 and the standing group tasked with adding new Chinese characters to the standard from ever more esoteric literary sources.

The broad political context of the work is the decline of China in the 19th century under the Qing Dynasty – forced to open up to foreign influences by the Opium Wars. Towards the end of this time the Chinese language, tied to the ruling dynasty, was seen as part of the problem – holding China back from becoming a modern nation. In the 20th century, 1912 saw the formation of the republican Nationalist government, although it was in regular conflict with the communists, and then with the Japanese in the Second Sino-Japanese War, which ended with the defeat of the Japanese in the Second World War. The People’s Republic of China was founded in 1949 with a renewed interest in preserving the Chinese language, but with the interests of the worker at its heart – under the Qing Dynasty literacy, and the use of the written language, had been a preserve of the ruling class.

Kingdom of Characters is pretty readable, and will appeal to those interested in radically different writing systems (when compared to alphabetic languages).

Book review: Data mesh by Zhamak Dehghani

This book, Data Mesh: Delivering Data-Driven Value at Scale by Zhamak Dehghani, essentially covers what I have been working on for the last 6 months or so. It is therefore highly relevant, but I perhaps have to be slightly cautious in what I write because of commercial confidentiality.

The data mesh is a new design for handling data within an organisation; it has been developed over the last 3 or 4 years, with Dehghani at the Thoughtworks consultancy at its core. Given its recency there are no data mesh products on the market, so one is left to build one’s own from the components available.

To a large degree the data mesh is a conceptual and organisational shift rather than a technical one: all the technical component parts for a data mesh are already available, less the programmatic glue to hold the whole thing together.

Data Mesh the book is divided into five parts: the first describes what a data mesh is in fairly abstract terms, the second explains why one might need a data mesh, and the third and fourth parts are about how to design the architecture of the data mesh itself and the data products that make it up. The final part is on “How to get started” – how to make it happen in your organisation.

Dehghani talks in terms of companies having established systems for operational data (data required to serve customers and keep the business running, such as billing information and the state of bank accounts); the data mesh is directed at analytical data – data which is derived from the operational data. She uses a fictional company, Daff, Inc., which sounds an awful lot like Spotify, to illustrate these points. Analytical data is used to drive machine learning recommender systems, for example, and a better understanding of business, customers and operations.

The legacy data systems Data Mesh describes are data warehouses and data lakes, where data is managed by a central team. The core issue this system brings is one of scalability: as the number of data sets grows, the size of the central team grows and the responsiveness of the system drops.

The data mesh is a distributed solution to this centralised system. Dehghani defines the data mesh in terms of four principles, listed in order of importance:

  1. Domain Ownership – this says that analytical data is owned by the domains that generate it rather than a centralised data team;
  2. Data as a product – analytical data is owned as a product, with the associated management, discoverability, quality standards and so forth around it. Data products are self-contained entities in their own right – in theory you can stand up the infrastructure to deliver a single data product all by itself;
  3. Self-serve data platform – a self-serve data platform is introduced which makes the process of domain ownership of data products easier, delivering the self-contained infrastructure and services that the data product defines;
  4. Federated computational governance – this is the idea that policies such as access control, data retention, encryption requirements, and actions such as the “right to be forgotten” are determined centrally by a governance board but are stored, and executed, in machine-readable form by data products;

For me the core idea is that of a swarm of self-contained data products which are all independent but which, by virtue of simple behaviours and some mesh-spanning services (such as a data catalogue), provide a whole that is greater than the sum of its parts. A parallel is drawn here with domain-driven design and microservices, on which the data mesh is modelled.

I found the parts on designing the data mesh platform and data products most interesting since this is the point I am at in my work. Dehghani breaks the data mesh down into three “planes”: the infrastructure utility plane, the data product experience plane, and the mesh experience plane (this is where the data catalogue lives).

We spent some time worrying over whether it was appropriate to include data processing functionality in our data mesh – Dehghani makes it clear that this functionality is in scope, arguing that the benefit of the data product orientation is that only a small number of data pipelines are managed together, rather than the hundreds or possibly thousands in a centralised scheme.

I have been spending my time writing code which Dehghani describes as the “sidecar” – common code that sits inside the data product to provide standard functionality. In terms of useful new ideas, I have been worrying about versioning of data schemas and attributes – Dehghani proposes that “bitemporality” is what is required here (see Martin Fowler’s blog post here for an explanation). Essentially bitemporality means recording the time at which schemas and attributes were changed, as well as the time at which data was provided, and recording the processing time. This way one can always recreate a processing step simply by checking which set of metadata and data were in play at the time (bar data being deleted by a data retention policy).
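A minimal sketch of the idea, using invented class and field names (my own illustration, not code from the book or from my project):

# Illustrative sketch of bitemporal schema metadata.
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class SchemaVersion:
    schema: dict               # e.g. column names and types
    effective_from: datetime   # when this schema applied to the data
    recorded_at: datetime      # when the data product recorded the change

def schema_as_of(versions: List[SchemaVersion],
                 data_time: datetime,
                 processing_time: datetime) -> Optional[SchemaVersion]:
    # Return the schema in force for data at `data_time`, as known to the
    # system at `processing_time` - this is what allows a processing step
    # to be recreated exactly as it originally ran.
    candidates = [v for v in versions
                  if v.effective_from <= data_time
                  and v.recorded_at <= processing_time]
    return max(candidates, key=lambda v: v.effective_from, default=None)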

Data Mesh also encouraged me to decouple my data catalogue from my data processing, so that a data product can act in a self-contained way without depending on the data catalogue which serves the whole mesh and allows data to be discovered and understood.

Overall, Data Mesh was a good read for me, in large part because of its relevance to my current work, but it is also well-written and presented. The lack of mention of specific technologies is rather refreshing and means the book will not go out of date within the next year or so. The first companies are still only a short distance into their data mesh journeys, so no doubt a book written in five years’ time will be a different one – but I am trying to solve a problem now!