Tag: data science

Book review: Data Mesh by Zhamak Dehghani

This book, Data Mesh: Delivering Data-Driven Value at Scale by Zhamak Dehghani, essentially covers what I have been working on for the last six months or so. It is therefore highly relevant, but I perhaps have to be slightly cautious in what I write because of commercial confidentiality.

The Data Mesh is a new design for handling data within an organisation. It has been developed over the last three or four years, with Dehghani at the Thoughtworks consultancy at its core. Given its recency there are no Data Mesh products on the market, so you are left to build your own from the components available.

To a large degree the data mesh is a conceptual and organisational shift rather than a technical one: all the technical component parts of a data mesh are already available, apart from the programmatic glue needed to hold the whole thing together.

Data Mesh the book is divided into five parts: the first describes what a data mesh is in fairly abstract terms, the second explains why one might need a data mesh, and the third and fourth cover how to design the architecture of the data mesh itself and the data products that make it up. The final part, “How to get started”, is about how to make it happen in your organisation.

Dehghani talks in terms of companies having established systems for operational data (data required to serve customers and keep the business running, such as billing information and the state of bank accounts); the data mesh is directed at analytical data – data derived from the operational data. She uses a fictional company, Daff, Inc., which sounds an awful lot like Spotify, to illustrate these points. Analytical data is used to drive machine learning recommender systems, for example, and a better understanding of business, customers and operations.

The legacy data systems Data Mesh describes are data warehouses and data lakes, where data is managed by a central team. The core issue with this arrangement is one of scalability: as the number of data sets grows, the size of the central team grows and the responsiveness of the system drops.

The data mesh is a distributed solution to this centralised system. Dehghani defines the data mesh in terms of four principles, listed in order of importance:

  1. Domain Ownership – this says that analytical data is owned by the domains that generate it rather than a centralised data team;
  2. Data as a product – analytical data is owned as a product, with the associated management, discoverability, quality standards and so forth around it. Data products are self-contained entities in their own right – in theory you can stand up the infrastructure to deliver a single data product all by itself;
  3. Self-serve data platform – a self-serve data platform is introduced which makes the process of domain ownership of data products easier, delivering the self-contained infrastructure and services that the data product defines;
  4. Federated computational governance – this is the idea that policies such as access control, data retention, encryption requirements, and actions such as the “right to be forgotten” are determined centrally by a governance board but are stored, and executed, in machine-readable form by data products;

For me the core idea is that of a swarm of self-contained data products which are all independent but, by virtue of simple behaviours and some mesh-spanning services (such as a data catalogue), produce a whole that is greater than the sum of its parts. A parallel is drawn here with domain-driven design and microservices, on which the data mesh is modelled.

I found the parts on designing the data mesh platform and data products most interesting since this is the point I am at in my work. Dehghani breaks the data mesh down into three “planes”: the infrastructure utility plane, the data product experience plane, and the mesh experience plane (this is where the data catalogue lives).

We spent some time worrying over whether it was appropriate to include data processing functionality in our data mesh – Dehghani makes it clear that this functionality is in scope, arguing that the benefit of the data product orientation is that only a small number of data pipelines are managed together, rather than the hundreds or possibly thousands of a centralised scheme.

I have been spending my time writing code which Dehghani describes as the “sidecar” – common code that sits inside the data product to provide standard functionality. In terms of useful new ideas, I have been worrying about versioning of data schemas and attributes – Dehghani proposes that “bitemporality” is what is required here (see Martin Fowler’s blog post here for an explanation). Essentially bitemporality means recording the time at which schemas and attributes were changed, as well as the time at which data was provided, alongside the processing time. This way one can always recreate a processing step simply by checking which set of metadata and data were in play at the time (barring data deleted under a retention policy).
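As an illustration (my own sketch rather than Dehghani’s design), a bitemporal record of schema changes might look something like this in Python, with one timestamp for when a change took effect in the domain and another for when the system recorded it:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class SchemaVersion:
    attribute: str
    definition: str
    valid_time: datetime       # when the change took effect in the domain
    processing_time: datetime  # when the system recorded the change

# Nothing is updated in place; each change is a new record.
history = [
    SchemaVersion("customer_id", "string",
                  datetime(2022, 1, 1, tzinfo=timezone.utc),
                  datetime(2022, 1, 3, tzinfo=timezone.utc)),
    SchemaVersion("customer_id", "uuid",
                  datetime(2022, 4, 1, tzinfo=timezone.utc),
                  datetime(2022, 4, 2, tzinfo=timezone.utc)),
]

def schema_as_known_at(attribute: str, as_of: datetime) -> SchemaVersion:
    """Return the version of an attribute the system knew about at `as_of`."""
    known = [v for v in history
             if v.attribute == attribute and v.processing_time <= as_of]
    return max(known, key=lambda v: v.valid_time)

# Reconstruct a past processing step: in February 2022 customer_id was a string.
print(schema_as_known_at("customer_id", datetime(2022, 2, 1, tzinfo=timezone.utc)))
```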

Data Mesh also encouraged me to decouple my data catalogue from my data processing, so that a data product can act in a self-contained way without depending on the data catalogue which serves the whole mesh and allows data to be discovered and understood.

Overall, Data Mesh was a good read for me, in large part because of its relevance to my current work, but it is also well written and presented. The lack of mention of specific technologies is rather refreshing and means the book will not go out of date within the next year or so. The first companies are still only a short distance into their data mesh journeys, so no doubt a book written in five years’ time will be a different one, but I am trying to solve a problem now!

A way of working: data science

I am about to take on a couple of data science students from Lancaster University for summer projects. From past experience, I always spend some time at the beginning of such projects explaining how I work, with the expectation that they will at least take some notice, if not repeat my methodology exactly. This methodology evolves slowly over time as I learn new things and my favoured technologies change.

Typically I develop on a Windows laptop, but I use the git-bash prompt as my shell for typing in commands – this is a Linux-like terminal which I adopted after working with developers who mainly used Linux, and also because I was familiar with the Unix-style command line from before the days of Linux. You can do a lot from the command line in data science – Data Science at the Command Line by Jeroen Janssens is an excellent introduction.

I use Docker containers a bit to spin up local versions of services which are difficult to run on Windows (things like Airflow and LinkedIn DataHub); some people develop entirely inside Docker containers to reduce dependency issues and make deployment of code easier.

I work pretty much entirely in Python for data processing and analysis, although I generate CSV files which I load into Tableau for visualisation. I tend not to attempt complex processing in Tableau since I find the GUI inconvenient and confusing for such work. I use the Anaconda distribution of Python, originally because I liked that it came packaged with a load of useful libraries for data science, and because it handled virtual environments and the installation of trickier packages better than plain Python. It may be worth revisiting this decision. I have recently shifted my code to Python 3.9.

For a piece of work I will usually set up a Python project which can be “installed”. This blog post explains a standard structure for Python projects. I aim to use Python virtual environments on a per-project basis, but sometimes I fail. Typically, I will write Python modules that provide functions but also have a simple command line interface which takes two or three positional parameters. You can see this in action in the git repo here, which I share as a template for myself and others!
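As a rough illustration of what I mean (a simplified sketch with a hypothetical module name, not the actual template repo), such a module might look like this:

```python
# my_project/process_data.py – functions plus a minimal command line interface
import sys

def process(input_path: str, output_path: str) -> None:
    """Read input_path, transform each line, and write the result to output_path."""
    with open(input_path) as fin, open(output_path, "w") as fout:
        for line in fin:
            fout.write(line.upper())   # stand-in for real processing

if __name__ == "__main__":
    # Two positional parameters: input file and output file.
    process(sys.argv[1], sys.argv[2])
```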

To date I have picked up command line arguments using sys.argv. I should probably use one of the libraries that make these command line interfaces better; there is a blog post here which compares the built-in argparse library with click and docopt. I think I might check out click for future projects.
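For comparison, the same interface written with click might look something like this (a sketch using click’s documented decorators, wrapping the hypothetical process_data module from above):

```python
# cli.py – a click-based command line interface
import click

from process_data import process  # the hypothetical module sketched above

@click.command()
@click.argument("input_path")
@click.argument("output_path")
def main(input_path: str, output_path: str) -> None:
    """Process INPUT_PATH and write the result to OUTPUT_PATH."""
    process(input_path, output_path)
    click.echo(f"Wrote {output_path}")

if __name__ == "__main__":
    main()
```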

As well as running command line scripts, I use tests to develop analysis; besides being good software development practice, test runners provide a convenient way to run arbitrary functions in a code base. I prefer to use the built-in unittest library, but I’ve started using pytest for a recent project. I wrote a blog post about writing tests; since I wrote it I have learned about test mocks and pytest’s fixture functionality.
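A minimal pytest sketch of the kind of test I mean, reusing the hypothetical process_data module from above together with pytest’s built-in tmp_path fixture:

```python
# test_process_data.py – run with `pytest`
import pytest

from process_data import process  # the hypothetical module sketched earlier

@pytest.fixture
def sample_file(tmp_path):
    """A small input file shared between tests, written to pytest's tmp_path."""
    path = tmp_path / "input.txt"
    path.write_text("hello\nworld\n")
    return path

def test_process_uppercases_lines(sample_file, tmp_path):
    output = tmp_path / "output.txt"
    process(str(sample_file), str(output))
    assert output.read_text() == "HELLO\nWORLD\n"
```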

I have a library of general utilities for interacting with databases, setting up logging and writing dictionaries, which I wrote because I found I was doing these things repeatedly, and making my own library allowed me to forget some of the boilerplate code required to do them. The key utilities are included with the repo attached to this blog.
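The sort of helpers I mean look roughly like this (an illustrative sketch, not the actual contents of the repo):

```python
# utilities.py – illustrative boilerplate-hiding helpers
import csv
import logging

def configure_logging(name: str, level: int = logging.INFO) -> logging.Logger:
    """Set up logging with a consistent format so scripts don't repeat the boilerplate."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    return logging.getLogger(name)

def write_dicts_to_csv(records: list[dict], path: str) -> None:
    """Write a list of dictionaries to CSV, taking the header from the first record."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)
```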

I’ve been using Visual Studio Code as my editor for some time now. I prefer not to use full-blown IDEs because I find they present more functionality than I can cope with; I think this is a result of coding in Java using Eclipse and C#/.NET in Visual Studio. In any case, Visual Studio Code starts as a nice enough code editor but has been sneaking in more IDE functionality via extensions.

The extensions I use heavily in Visual Studio Code are Python and Pylance – the Python language server provides type-hinting support. I wrote about type-hinting in Python here. I also use Rainbow CSV for when I am editing or viewing CSV files.
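For example, a trivially type-hinted function like the one below (my own illustrative sketch) is checked by the language server as you type:

```python
def mean(values: list[float]) -> float:
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

mean([1.0, 2.0, 3.0])     # fine
# mean("not a list")      # flagged by the type checker before the code runs
```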

I could use Visual Studio Code for accessing git, my preferred source control system, but instead I use GitKraken, which has a very nice GUI. Since I am usually working by myself, my git usage is very simple: I typically have one branch onto which I make many small commits. I have recently started working with a team where I am using feature branches which get merged by pull requests – this was a bit of a culture shock.

As a result of working with other people on a new project, I have started using some technologies which I will just mention here. I run the black formatter, as well as pylint and flake8. Black just reformats my code files when I save them and can largely be ignored. Flake8 is fairly easy to satisfy, although I spent a lot of time addressing line-length issues. Pylint generates quite a few warnings which I attend to, but sometimes ignore.

I have also started using Makefiles and Azure DevOps pipelines for running common tasks on my code (tests, cleanup, setting up infrastructure, linting).

Outside technology, I have a very long-established method of working using a monthly Word document as a notebook, which I describe here. I tend to prefix file names with ISO 8601 format dates (2022-05-22); this means that if I create a Tableau workbook or an Excel worksheet I can link it easily to what I was writing in my notebook, and to the state of the appropriate git repo at that point in time.

I’ve incorporated all the code related elements mentioned above in this ways-of-working-data-science git repository.

Book review: Data Pipelines with Apache Airflow by Bas P Harenslak and Julian R De Ruiter

My next review is of Data Pipelines with Apache Airflow by Bas P Harenslak and Julian R De Ruiter. The book was published in 2021 and is compatible with Airflow 2.0, which was released at the end of 2020.

Airflow is all about orchestrating the movement of data from sources such as APIs into other places; it originated at Airbnb. It is designed for batch processing rather than streaming data, and for pipelines that do not change much.

Data pipelines in Airflow are represented as "directed acyclic graphs" or DAGs, which are defined in Python code using "Operators" that carry out tasks. A graph is a collection of nodes (tasks in this case) with "edges" between them. The "directed acyclic" bit means tasks have a definite order – the edges between them are "directed" – and the graph cannot have loops or cycles, because that would imply having to finish a set of tasks before you could start them. A simple data pipeline would just be a linear set of tasks that always follow one from another; a more complicated pipeline might bring in data from several sources before combining them to produce a final data product.

The Operators are strung together using expressions of the form "operator_1 >> operator_2" or even "[operator_1, operator_2] >> operator_3".
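Put together, a minimal DAG definition looks something like this (a sketch based on the Airflow 2.x API rather than an example from the book; the dag_id and task names are my own):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def _process():
    print("processing the fetched data")

with DAG(
    dag_id="example_pipeline",          # hypothetical pipeline name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    fetch = BashOperator(task_id="fetch", bash_command="echo 'fetching data'")
    process = PythonOperator(task_id="process", python_callable=_process)

    # The >> expression defines the directed edge: process runs after fetch.
    fetch >> process
```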

Operators do not have to use Python: the BashOperator, for example, invokes shell commands, and other operators interact with systems such as databases or storage like S3. It is relatively easy to write your own operators. Alongside operators that do stuff, there are branch operators which select one or other path in the DAG, sensors which detect changes in filesystems and trigger work, and hooks which form connections with external services. Dummy operators can be used to simplify the appearance of DAGs.

As an orchestration system, the intention is that operators should not contain a great deal of code to process data; that function should be off-loaded to libraries or systems elsewhere.

The Airflow system comprises a web server, which allows you to observe and trigger execution of DAGs; a scheduler, which is responsible for the scheduled running of DAGs; and workers, which do the actual work of the DAG. Airflow loops over the tasks defined in a DAG and tries to execute each one; a task can execute once all the tasks upstream of it have completed successfully.

A basic implementation runs DAGs locally, using a simple queue to schedule work and a SQLite database to store metadata. A production implementation would use something like Postgres or Amazon RDS as the metadata store, schedule work using Celery, and run tasks in Docker containers marshalled by Kubernetes.

For some reason, reading this I was reminded that big projects like Airflow are just other people’s code, and if you look too carefully you’ll find something nasty. This is both comforting and mildly scary. I think the issue was that Airflow uses Jinja templating to inject parameters into code, which feels wrong but is probably a pragmatic and safe way to do it; these shenanigans are not required for Python operators. Also discussed are issues with code dependencies, which the authors suggest are best eliminated by putting operators into Docker containers, each of which contains its own code dependencies – allowing otherwise dependency-incompatible libraries to work together.

Alongside the material on Airflow there are moderate chunks on Python modules, testing, Docker and Kubernetes, and logging, so you get a well-rounded view not only of Airflow but also of the ecosystem it sits in. The book finishes with deployment into various cloud environments. I found these parts quite useful since the most complicated work I do in my role is trying to get things to work in AWS! The data science part is easy…

The book finishes with some short chapters on cloud deployments, mentioning first fully managed services such as astronomer.io, Amazon MWAA and Google Cloud Composer, before going on to talk about implementing one of the book’s demos on AWS, Azure and Google cloud services. I considered skipping these chapters but they turned out to be quite interesting in highlighting the differences between the services, and perhaps the preferences of the authors of both the book and Airflow.

I found this a readable introduction to Airflow with some nice examples, and interesting additional material. Useful if you are thinking about using Airflow, or even if you are working on data pipelines without Airflow since it provides a good description of the moving parts required.

The code repository for the book is here: https://github.com/BasPH/data-pipelines-with-apache-airflow

Book review: Deep learning with Python by François Chollet

Deep Learning with Python by François Chollet is the third book I have reviewed on deep learning neural networks. Despite these reviews only spanning a couple of years, it feels like the area is moving on rapidly. The biggest innovations I see from this book are the use of pre-trained networks and the dominance of the Keras/Tensorflow/Python ecosystem in doing deep learning.

Deep learning is a type of artificial intelligence based on many-layered neural networks. This is where the “deep” comes in – it refers to the numbers of layers in the networks. The area has boomed in the last few years with the availability of massive datasets on which to train, improvements in numerical algorithms for training neural networks and the use of GPUs to further accelerate deep learning. Neural networks have been used in production since the 1990s – by the US postal service for reading handwritten zip codes.

Chollet works on artificial intelligence at Google and is the author of the Keras deep learning library. Google is also the home of Tensorflow, a lower level library which is often used as a backend to Keras. This is a roundabout way of saying we should expect Chollet to be expert and authoritative in this area.

The book starts with some nice background to machine learning. I liked Chollet’s description of machine learning (deep learning included) being about finding a representation of data which makes the problem at hand trivial to solve. Imagine taking two pieces of coloured paper, placing them one on top of the other and then crumpling them into a ball. Machine learning is the process of un-crumpling the ball.

As an introduction to the field, Deep Learning with Python runs through some examples of deep learning applied to various classes of problem, including movie review sentiment analysis, classifying newswire articles and predicting house prices, before going back to discuss some issues these problems raise. A recurring theme is the problem of overfitting: deep learning models can learn their training data really well – essentially they memorise the answers to questions – and so when they are faced with questions they have not seen before they perform badly. Overfitting can be addressed with a range of techniques.

One twist I had not seen before is the division of the labelled data used in machine learning into three, not two, parts: training, validation and test. The use of training and validation sets is commonplace: the training set is used for training, and the validation set is used to test the quality of a model after training. The third component which Chollet introduces is the “test” set, which is like the validation set but is only used when your model is about to go into production, to see how it will perform in real life. The problem it addresses is that machine learning involves a large number of hyperparameters (things like the type of machine learning model, the number of layers in a deep network, the form of the activation function) which are not changed during training but are changed by the data scientist, quite possibly automatically and systematically. The hyperparameters can be overfitted to the validation set, hence a model can perform well on validation data (which it has effectively seen before) but not on test data, which represents real life.
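A minimal sketch of the three-way split using scikit-learn’s train_test_split (the book itself works with Keras utilities; the library choice, the toy data and the 60/20/20 proportions here are my own):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for real features and labels.
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# Hold back a test set that is only looked at once, just before production.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Split the remainder into training and validation sets; hyperparameter
# tuning is free to overfit the validation set, but never sees the test set.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)
```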

A second round of examples looks at deep learning in computer vision, using convolutional neural networks (convnets). These are related to the classic computer vision processes of convolution and image morphology. Also introduced here are recurrent neural networks (RNNs), for applications in processing sequences such as time series data and language. RNNs have memory across the steps of a sequence, which dense and convolutional networks don’t, and this makes them effective for problems where the order of the data is important.
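A small convnet in Keras, in the spirit of the book’s image classification examples (the exact layer sizes and the 28x28 greyscale input are my own illustrative choices):

```python
from tensorflow import keras
from tensorflow.keras import layers

# A small stack of convolution + pooling layers followed by a dense classifier,
# for 28x28 greyscale images and 10 classes.
model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```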

The final round of examples is in generative deep learning including generating text, the DeepDream system, image style transfer and generating images of faces.

The book ends with some thoughts on the future. Chollet comments that he doesn’t like to use the term neural networks, which implies the ability to reason and abstract in the way that humans do. One of the limitations of deep learning is that, as currently used, it does not have the ability to abstract or generate programmatic descriptions of solutions. You would not use deep learning to launch a rocket – we have detailed knowledge of the physics of rockets, gravity and the atmosphere, which makes a physics-based approach far better.

As I read, I realised that keeping up with what is new in machine learning is a critical and challenging task; Chollet addresses exactly this question, suggesting three approaches to keeping abreast of new developments:

  1. Kaggle – the machine learning competition site;
  2. ArXiv – the preprint server, in particular http://www.arxiv-sanity.com/ which is a curated view of the machine learning part of arXiv;
  3. Keras – keeping up with developments in the Keras ecosystem;

If you’re going to read one book on deep learning this should probably be the one: it is readable, covers the field pretty well, and Chollet is an authority in this area who, in my view, has particularly acute insight into deep learning.

Book review: Designing Data-Intensive Applications by Martin Kleppmann

Designing Data-Intensive Applications by Martin Kleppmann does pretty much what it says in the title. The book provides a lot of detail on how various types of databases and database functionality work, and how these can be plumbed together to build applications. It is reminiscent of Seven Databases in Seven Weeks by Eric Redmond and Jim R. Wilson, in the sense that it provides a broad overview of a range of different data systems which are specialised for different applications. It is authoritative and well written. Seven Databases is more concerned with the specifics of particular NoSQL databases, whilst Designing Data-Intensive Applications is concerned with data applications rather than just the underlying database.

The book is divided into three broad sections covering foundations of data systems, distributed data and derived data. Each chapter starts with a cartoon map of the territory, which I thought would be a bit gimmicky but which serves as a nice summary of what the chapter covers, particularly in terms of the software available.

The section on data systems talks about reliability, scalability and maintainability before going on to discuss types of database (i.e. relational, graph and so forth) and some of the low-level implementation of data storage systems such as hash indexes and B-trees.

Reliability is about a system returning responses in a timely fashion: Amazon have observed sales drop by 1% for every 100 ms of delay, and others have reported a 16% drop in consumer satisfaction with a one-second slowdown. The old academic in me twitches at providing these statistics without citing the reference; however, Designing Data-Intensive Applications is heavily referenced.

There is some interesting historical detail, including the IMS database which IBM built for the Apollo space program in the late 1960s (and which is still available today), and the CODASYL database model for graph-like databases from a little later. It’s interesting to see how some of these models have been revisited recently in the light of fast, large memory replacing slow disk or even tape drives.

I was introduced to databases rather late in my career; they are not really a core part of the scientific computing background I have. Learning the distinction between OLAP (analytics) and OLTP (transactions) databases was useful. Briefly, transactional databases are optimised to work on single rows and provide fairly strong guarantees of transactional integrity. The access pattern for analytics databases is different: typically an analytical workflow wants to take the contents of an entire column and carry out aggregations and calculations over the whole column. Transactions are not so important for such databases but consistency is: a query may take a long time to run, but it should provide results as if it ran on the database at a single point in time. These workflows are better served by so-called column stores such as Vertica.

The section on distributed data systems covers replication, partitioning, transactions and consensus. The problem with distributed systems is that you can never be sure whether things have failed for good, and it is difficult to know what order things have happened in. This reminds me a bit of teaching special relativity to physics undergraduates long ago.

It is hard even to rely on timekeeping on servers. I found this a bit surprising: when we put our minds to measuring time we can be incredibly accurate. GPS time signals have an accuracy significantly better than a microsecond, yet servers synced well using NTP (Network Time Protocol) achieve something like 100 milliseconds – many thousands of times poorer – and even that accuracy is only achieved if everything is configured correctly. This is important because we therefore cannot rely on timestamps to provide a unique ordering for events across multiple servers, nor can we even rely on timestamps synced with NTP to be always increasing!

The two big themes in terms of databases are transactions and consensus. These are the concepts that provide the best assurance on the integrity of operations and their success over distributed systems. I used the word “assurance” rather than “guarantee” deliberately because reading Designing Data-Intensive Applications it becomes clear that perfection is hard to achieve and there are always trade-offs. It also highlights the problems of the language used to describe features. Some terms are used in different ways in different contexts.

The derived data section starts with praise for the Unix way of piping data between simple command line scripts; Data Science at the Command Line covers this area in much more detail. It then goes on to discuss the MapReduce ecosystem and the differences between batch and stream processing. This feels like a section I will be returning to.

The book finishes with some speculation as to the future of the field. Two thoughts stuck with me: the first is the idea of federated databases, systems which use a common query language to interface with multiple different datastores; the second is that of unbundling functionality so that, for example, data may be stored in a standard SQL database for unique-ID-based queries and in Elasticsearch for full-text search queries – in some ways these are simply different facets of the same idea.

Designing Data-Intensive Applications is a big book with no padding, it is packed with detail including many references, but remains readable. Across a fair number of titles this is definitely one of the better technology books I have read.