Tag Archive: data science

Nov 04 2019

Book review: Deep learning with Python by François Chollet

Deep learning with Python by Francois Chollet is the third book I have reviewed on deep learning neural networks. Although these reviews span only a couple of years, the field feels like it is moving on rapidly. The biggest innovations I see in this book are the use of pre-trained networks, and the dominance of the Keras/Tensorflow/Python ecosystem for doing deep learning.

Deep learning is a type of artificial intelligence based on many-layered neural networks. This is where the “deep” comes in – it refers to the number of layers in the networks. The area has boomed in the last few years with the availability of massive datasets on which to train, improvements in numerical algorithms for training neural networks and the use of GPUs to further accelerate deep learning. Neural networks have been used in production since the 1990s – by the US postal service for reading handwritten zip codes.

Chollet works on artificial intelligence at Google and is the author of the Keras deep learning library. Google is also the home of Tensorflow, a lower level library which is often used as a backend to Keras. This is a roundabout way of saying we should expect Chollet to be expert and authoritative in this area.

The book starts with some nice background to machine learning. I liked Chollet’s description of machine learning (deep learning included) being about finding a representation of data which makes the problem at hand trivial to solve. Imagine taking two pieces of coloured paper, placing them one on top of the other and then crumpling them into a ball. Machine learning is the process of un-crumpling the ball.

As an introduction to the field, Deep Learning with Python runs through some examples of deep learning applied to various classes of problem, including movie review sentiment analysis, classifying newswire articles and predicting house prices, before going back to discuss some issues these problems raise. A recurring theme is the problem of overfitting. Deep learning models can learn their training data really well; essentially they memorise the answers to questions, so when they are faced with questions they have not seen before they perform badly. Overfitting can be addressed with a range of techniques.

One twist I had not seen before is the division of the labelled data used in machine learning into three parts, not two: training, validation and test. The use of training and validation parts is commonplace: the training set is used for training, and the validation set is used to test the quality of a model after training. The third component which Chollet introduces is the “test” set; this is like the validation set but it is only used when your model is about to go into production, to see how it will perform in real life. The problem it addresses is that machine learning involves a large number of hyperparameters (things like the type of machine learning model, the number of layers in a deep network, the form of the activation function) which are not changed during training but are changed by the data scientist, quite possibly automatically and systematically. The hyperparameters can be overfitted to the validation set, so a model can perform well on validation data (that it has seen before) but not on test data, which represents real life.
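
Roughly, the three-way split looks something like the sketch below. This is a minimal illustration using scikit-learn's train_test_split on synthetic data, not code from the book (which works directly in Keras); the proportions and names are just for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, purely for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# First hold back a test set, to be consulted only once, just before production.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Then split the remainder into training and validation sets; the validation
# set is what gradually gets overfitted as hyperparameters are tuned.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)  # 60/20/20 overall
```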

A second round of examples looks at deep learning in computer vision, using convolutional neural networks (convnets). These are related to the classic computer vision processes of convolution and image morphology. Also introduced here are recurrent neural networks (RNNs) for applications in processing sequences such as time series data and language. RNNs have a memory across time steps which dense and convolutional networks don't, and this makes them effective for problems where the order of the data is important.
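
To give a flavour of what these models look like in Keras, here is a minimal convnet of roughly the shape used for small image classification problems. This is my own sketch in the tf.keras spelling, not the book's code, and the layer sizes are illustrative.

```python
from tensorflow import keras
from tensorflow.keras import layers

# A small convnet: stacked convolution + pooling layers to extract features,
# then a dense classifier on top (e.g. for ten digit classes).
model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```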

The final round of examples is in generative deep learning including generating text, the DeepDream system, image style transfer and generating images of faces.

The book ends with some thoughts on the future. Chollet comments that he doesn't like to use the term neural networks, which implies the ability to reason and abstract in the way that humans do. One of the limitations of deep learning is that, as currently used, it does not have the ability to abstract or generate programmatic descriptions of solutions. You would not use deep learning to launch a rocket – we have detailed knowledge of the physics of rockets, gravity and the atmosphere which makes a physics-based approach far better.

As I read I realised that keeping up with what is new in machine learning is a critical and challenging task. Chollet addresses this directly, suggesting three approaches to keeping abreast of new developments:

  1. Kaggle – the machine learning competition site;
  2. ArXiv – the preprint server, in particular http://www.arxiv-sanity.com/ which is a curated view of the machine learning part of arXiv;
  3. Keras – keeping up with developments in the Keras ecosystem;

If you’re going to read one book on deep learning this should probably be the one: it is readable, covers the field pretty well, and Chollet is an authority in this area who, in my view, has particularly acute insight into deep learning.

Jul 24 2019

Book review: Designing Data-Intensive Applications by Martin Kleppmann

Designing Data-Intensive Applications by Martin Kleppmann does pretty much what it says in the title. The book provides a lot of detail on how various types of databases and database functionality work, and how these can be plumbed together to build applications. It is reminiscent of Seven Databases in Seven Weeks by Eric Redmond and Jim R. Wilson, in the sense that it provides a broad overview of a range of different data systems which are specialised for different applications. It is authoritative and well-written. Seven Databases is more concerned with the specifics of particular NoSQL databases, whilst Designing Data-Intensive Applications is concerned with data applications as a whole rather than just the underlying database.

The book is divided into three broad sections covering foundations of data systems, distributed data and derived data. Each chapter starts with a cartoon map of the territory, which I thought would be a bit gimmicky but it serves as a nice summary of what the chapter covers particularly in terms of the software available.

The section on data systems talks about reliability, scalability and maintainability before going on to discuss types of database (i.e. relational, graph and so forth) and some of the low-level implementation of data storage systems such as hash indexes and B-trees.

Scalability is partly about a system continuing to return responses in a timely fashion as load grows: Amazon have observed sales drop by 1% for every 100ms of delay, and others have reported a 16% drop in consumer satisfaction with a one second slowdown. The old academic in me twitches at providing these statistics without citing the reference; however, Designing Data-Intensive Applications is heavily referenced.

There is some interesting historical detail, including the IMS database which IBM built for the Apollo space program in the late 1960s (and which is still available today), and the CODASYL database model for graph-like databases from a little later. It's interesting to see how some of these models have been revisited recently in light of the advent of fast, large memory in place of slow disk or even tape drives.

I was introduced to databases rather late in my career; they are not really a core part of the scientific computing background I have. Learning the distinction between OLAP (analytics) and OLTP (transactions) databases was useful. Briefly, transactional databases are optimised to work on single rows and provide fairly strong guarantees of transactional integrity. The access pattern for analytics databases is different: typically an analytical workflow wants to take the contents of an entire column and carry out aggregations and calculations over the whole column. Transactions are not so important for such databases but consistency is: a query may take a long time to run but it should provide results as if it ran on the database at a single point in time. These workflows are better serviced by so-called column stores such as Vertica.
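
The difference in access patterns is easier to see with a toy example. The sketch below is my own illustration using pandas rather than anything from the book; the table and column names are invented.

```python
import pandas as pd

# A toy sales table; the names and values are invented for illustration.
sales = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer": ["alice", "bob", "alice", "carol"],
    "amount":   [12.50, 7.00, 3.25, 99.99],
})

# OLTP-style access: look up and update a single row identified by a key.
sales.loc[sales["order_id"] == 3, "amount"] = 3.75

# OLAP-style access: scan whole columns and aggregate over them.
total_by_customer = sales.groupby("customer")["amount"].sum()
print(total_by_customer)
```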

The section on distributed data systems covers replication, partitioning, transactions and consensus. The problem with distributed systems is that you never know for sure whether something has failed for good, and it's difficult to know what order things happened in. This reminds me a bit of teaching special relativity to physics undergraduates long ago.

It is hard even to rely on timekeeping on servers. I found this a bit surprising: when we put our minds to measuring time we can be incredibly accurate. GPS time signals have an accuracy significantly better than a microsecond, yet servers synced well using NTP (Network Time Protocol) achieve something like 100 milliseconds – many orders of magnitude poorer. And this accuracy is only achieved if everything is configured correctly. This matters because we cannot rely on timestamps to provide a unique order for events across multiple servers, nor can we even rely on timestamps synced with NTP to be always increasing!
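
A small Python illustration of the point about ordering, not from the book: the wall clock can be stepped backwards by NTP corrections, whereas a monotonic clock only ever moves forward (though only on a single machine).

```python
import time

# time.time() reads the wall clock; NTP corrections can step it backwards,
# so successive readings are not guaranteed to increase.
t1 = time.time()
t2 = time.time()

# time.monotonic() only ever moves forward, so it is safe for measuring
# elapsed time locally, but it says nothing about ordering across servers.
m1 = time.monotonic()
m2 = time.monotonic()

print(t2 - t1)  # usually positive, but not guaranteed to be
print(m2 - m1)  # guaranteed to be >= 0
```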

The two big themes in terms of databases are transactions and consensus. These are the concepts that provide the best assurance on the integrity of operations and their success over distributed systems. I used the word “assurance” rather than “guarantee” deliberately because reading Designing Data-Intensive Applications it becomes clear that perfection is hard to achieve and there are always trade-offs. It also highlights the problems of the language used to describe features. Some terms are used in different ways in different contexts.

The derived data section starts with praise for the Unix way of piping data between simple command line scripts; Data Science at the Command Line covers this area in much more detail. It then goes on to discuss the MapReduce ecosystem and the differences between batch and stream processing. This feels like a section I will be returning to.
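
For readers who have not met MapReduce before, the shape of the computation is simple even if the distributed machinery is not. Below is a toy, single-process word count showing the map, shuffle and reduce phases; real frameworks such as Hadoop distribute each phase across many machines.

```python
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map phase: each document is independently turned into (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: pairs are grouped by key (the word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: the counts for each word are combined.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)
```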

The book finishes with some speculation as to the future of the field; two thoughts stuck with me. The first is the idea of federated databases: systems which use a common query language to interface with multiple different datastores. The second is that of unbundling functionality so that, for example, data may be stored in a standard SQL database for queries by unique ID and in Elasticsearch for full-text search queries – in some ways these are simply different facets of the same idea.

Designing Data-Intensive Applications is a big book with no padding; it is packed with detail, including many references, but remains readable. Of the fair number of technology books I have read, this is definitely one of the better ones.

Mar 20 2019

Book review: JavaScript Patterns by Stoyan Stefanov

More technology-related reviewing next: JavaScript Patterns by Stoyan Stefanov. This is part of my continuing effort to learn JavaScript.

For me this isn’t a question of learning the nuts and bolts of a language but rather one of learning to use it fluently and idiomatically.

I thought this book might be in the spirit of the original “Gang of Four” design patterns, but although it mentions those design patterns it is more generally about good style in JavaScript. The book is divided into eight chapters, including an introduction.

The first substantive chapter, on “essentials”, talks mainly about variable declarations and some odds and ends. The most interesting of these was the behaviour of parseInt, which converts a string into an integer. Except that if the string starts with a zero, as ISO 8601 days and months do, then parseInt (called without an explicit radix) assumes it is a number in base 8 (octal)! I can foresee many long hours trying to debug this problem without this forewarning. This chapter also discusses the importance of coding style conventions.

The second chapter talks about literals and constructors. It strikes me that much of this is about unwinding the behaviour of developers more used to statically-typed languages. The JavaScript way is to create objects by example, rather than take a class definition and derive from that. Although in the permissive manner of many languages it will let you do it either way. Since this book was written JavaScript has gained a “class” keyword which allows you to construct classes as you might in Java or C#.

Next up are functions. JavaScript shares Python's view of functions as objects, allowing them to be passed as arguments. This is particularly important in JavaScript to provide “callback” functionality, which is very useful when doing asynchronous programming. I learnt here that the “currying” of functions is named after Haskell Curry, who also has a whole language named after him. I always feel when passing functions as arguments that I am fiddling with the underpinnings of reality – it can make debugging difficult too.
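
Since the comparison with Python comes up, here is what the same ideas look like there; this is my own sketch, not code from the book, and the function names are invented.

```python
from functools import partial

def on_done(result):
    """A callback: a function passed around like any other value."""
    print("finished with", result)

def fetch_data(source, callback):
    # Pretend this did some slow or asynchronous work, then call back.
    callback(f"data from {source}")

fetch_data("https://example.com", on_done)

# Currying (after Haskell Curry): fix some arguments to get a new function.
def power(base, exponent):
    return base ** exponent

square = partial(power, exponent=2)
print(square(7))  # 49
```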

I found the idea of functions that redefine themselves on first run interesting, it sounds useful and dangerous at the same time.

The chapter on object creation patterns is all about introducing module-like behaviour and namespacing to JavaScript, which at the time the book was written were not part of the language. Also covered is making properties private by hiding them in function closures.

The code reuse chapter is largely about patterns for achieving inheritance-like behaviour. It introduces a range of patterns which build up to an almost exact replication of class-based inheritance.

Finally we meet some of the classic Gang of Four design patterns. Some of these, such as the Iterator pattern, have been absorbed entirely into the core of languages like Python and, more recently, JavaScript. The Observer pattern is implemented in web browsers as events, which are ubiquitous. Perhaps the lesson of this chapter is that some of the Gang of Four patterns have been absorbed into the core of languages; we use them almost without thinking. The Strategy pattern, which selects algorithms at runtime, fits well with the chapter on functions and JavaScript's view of functions as objects.

The book finishes with a chapter on patterns for the Document Object Model, or rather JavaScript in the browser. It includes well-known advice such as not testing for browser type but rather testing for functionality. It also has advice on optimising JavaScript for deployment.

There is minimal mention of specific tools or libraries in this regard, although Yahoo’s YUI library is mentioned a few times – Stefanov has worked on this library so this is unsurprising, and not unreasonable.

This book had more of the air of Douglas Crockford's JavaScript: The Good Parts than a book on patterns, which was what I was expecting. Alternatively, perhaps “JavaScript for users of statically-typed languages”; as such it probably works pretty well for Python programmers too, although modules have always been built in to Python and there is a “class” keyword for specifying classes.

JavaScript Patterns is readable though, I’m glad I picked it up.

Aug 26 2018

Book review: Data Strategy by Bernard Marr

This is a review of Data Strategy by Bernard Marr. The proposition of the book is that all businesses are now data businesses and that they should have a strategy to exploit data. He envisages such a strategy operating through a Chief Data Officer and thus at the highest level of a company.

It is in the nature of things that to be successful you feel you have to be saying something new and interesting. The hook for this book is that big data, or the increasing availability of data, is a new and revolutionary thing. To be honest, I don't really buy this, but once we're past the hook the advice contained within is rather good.

Marr sees data benefitting businesses in three ways, and covers these in successive chapters:

  1. It can support business decisions – that’s to say helping humans make decisions;
  2. It can support business operations – this is more the automated use of data, for example, a recommender algorithm you might come across at any number of retail sites is driven by data and falls into this category;
  3. It can be an asset in its own right;

This first benefit of supporting business decisions is further sub-divided into data about the following:

  1. Customers, markets and competition
  2. Finance
  3. Internal operations
  4. People

The chapter on supporting business operations contains quite a lot of material on using sensors in manufacturing and warehouse operations but also includes fraud detection.

Subsequent chapters cover how to source and collect data, provide the human and physical infrastructure to draw meaning from it and some comments on data governance. In Europe this last topic has been the subject of enormous activity over the past couple of years with the introduction of the General Data Protection Regulation (GDPR) which determines the way in which personal information can be collected and processed.

Following the theme of big data, Marr's view is that the past is represented by data in SQL tables whilst the future is in unstructured data sources.

My background is as a physical scientist, and as such I read this with a somewhat quizzical “You're not doing this already?” face. Pretty much the whole point of a physical scientist is to collect data to better understand the world. The physical sciences have never really had a big data moment; typically we have collected and analysed data to the limit of currently available technology, but that has never been the thing itself. Philosophically, the physical sciences gave up on collecting “all of the data” long ago. One of the unappreciated features of the big detectors at CERN is their ability to throw away enormous quantities of data really fast. If you have what is effectively a building-sized CCD camera then it is the only strategy that works. This isn't to say the physical sciences always do it right, or that they are directly relevant to businesses. Physical sciences work on the basis that there are universally applicable, immutable physical laws which data is used to establish. This is not true of businesses: what works for one business need not work for another, and what works now need not work in the future.

Reading the book I kept thinking of A computer called Leo by Georgina Ferry, which describes the computer built by the J. Lyons company (who ran a teashop and catering business) in the 1950s. Lyons had been doing large-scale data work since the 1920s; in the aftermath of the Second World War they turned to automated, electronic computation. From my review I see that Charles Babbage wrote about the subject in 1832, although he was writing more about prospects for the future. IBM started its growth in computing machinery in the late 19th century. So the idea of data being core to a business is by no means new.

The text is littered with examples of data collection for business good across a wide range of sectors. Rolls-Royce's engine monitoring programme is one of my favourites: their engines send data back to Rolls-Royce four times during each flight. This can be used to support engine servicing and, I would imagine, product development. In the category of monetizing data, American Express and Acxiom are mentioned; they provide either personal or aggregate demographic information which can be used for targeted marketing.

Some entries might be a bit surprising: restaurant chains in the form of Domino's and Dickey's Barbecue Pit are big users of data. Walmart also makes an appearance. This shouldn't be surprising since the importance of data is more a matter of business scale than sector, as the Lyons company shows.

Marr repeatedly tells us that we should collect the data which answers our questions rather than just trying to collect all the data. I don't think this can be repeated too often! It seems that many businesses have been sold (or have built) Big Data infrastructure and only really started to think about how they would extract business value from the data once this had been done.

Definitely thought-provoking, and a well-structured guide as to how data can benefit your company.

Apr 26 2018

Book review: Hands-on machine learning with scikit-learn & tensorflow by Aurélien Géron

I’ve recently started playing around with recurrent neural networks and tensorflow, which brought me to Hands-on machine learning with scikit-learn & tensorflow by Aurélien Géron; as a bonus it also includes material on scikit-learn, which I’ve been using for a while.

The book is divided into two parts. The first, “Fundamentals of Machine Learning”, focuses on the functionality found in the scikit-learn library. It starts with the big picture, running through the types of machine learning which exist (supervised / unsupervised, batch / online and instance-based / model-based) and then some of the pitfalls and problems with machine learning, before a section on testing and validation. Next comes a medium-sized example of machine learning in action which demonstrates how the functionality of scikit-learn can be quickly used to develop predictions of house prices in California based on census data. This is a subject after my own heart; I’ve been working with property data in the UK for the past couple of years.

This example serves two purposes: firstly it demonstrates the practical steps you need to take when undertaking a machine learning exercise, and secondly it highlights just how concisely much of it can be executed in scikit-learn. The following chapters then go into more depth, first about how models are trained and scored and then into the details of different algorithms such as Support Vector Machines and Decision Trees. This part finishes with a chapter on ensemble methods.

Although the chapters contain some maths, their strength is in the clear explanations of the methods described. I particularly liked the chapter on ensemble methods. They also demonstrate how consistent the scikit-learn library is in its interfaces. I knew that I could switch algorithms very easily with scikit-learn, but I hadn't fully appreciated how seamlessly the library generally handles regression and multi-class classification.
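
To show what I mean by consistent interfaces, here is a sketch along the lines of the book's California housing example, though using scikit-learn's bundled fetch of the dataset rather than the book's CSV: swapping the algorithm is a one-line change because every estimator exposes the same fit/predict interface.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The same fit/predict interface regardless of the underlying algorithm.
for model in (LinearRegression(),
              RandomForestRegressor(n_estimators=50, random_state=42)):
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(type(model).__name__, round(mse, 3))
```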

I wonder whether, outside data science, it is perceived that data scientists write their own algorithms from scratch. In practical terms it is not the case, and hasn't been the case since at least the early nineties when I started doing data analysis which looks very similar to the machine learning based analysis I do today. In those days we used the NAG numerical library, Numerical Recipes in FORTRAN and libraries developed by a very limited number of colleagues in the wider academic community (probably shared by email attachment).

The second part of the book, “Neural networks and Deep Learning”, looks at the tensorflow library. Tensorflow has general applications for processing multi-dimensional arrays but it has been designed with deep learning and neural networks in mind. This means there are a whole bunch of functions to generate and train neural networks of different types and different architectures.

The section starts with an overview of tensorflow with some references to other deep learning libraries, before providing an introduction to neural networks in general, which have been around quite a while now. Following this there is a section on training deep learning networks, and the importance of the form of activation functions.
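
As a reminder of why the form of the activation function matters, here is a small numpy sketch of my own (not from the book): the sigmoid saturates for large inputs, starving deep networks of gradient, whereas ReLU keeps a constant slope for positive inputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

x = np.linspace(-6, 6, 7)
# The sigmoid squashes everything into (0, 1) and flattens out at the extremes;
# ReLU passes positive values through unchanged.
print(np.round(sigmoid(x), 3))
print(relu(x))
```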

Tensorflow will run across multiple processors, GPUs and/or servers although configuring this looks a bit complicated. Typically a neural network layer can’t be distributed over multiple processing units.

There then follow chapters on convolutional neural networks (good for image applications), recurrent neural networks (good for sequence data), autoencoders (finding compact representations) and finally reinforcement learning (good for playing Pac-Man). My current interest is in recurrent neural networks, so it was nice to see a brief description of all of the potential input/output scenarios for recurrent neural networks and how to build them.
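
As a flavour of the sequence-to-one case, the sketch below is a minimal recurrent model in the tf.keras spelling; it is my own illustration rather than the book's code, and the shapes are arbitrary.

```python
from tensorflow import keras
from tensorflow.keras import layers

# A minimal sequence-to-one model: read 30 time steps of a single feature,
# carry state across the steps in the LSTM, then predict one value at the end.
model = keras.Sequential([
    layers.LSTM(32, input_shape=(30, 1)),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```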

I spent a few years doing conventional image analysis, and convolutional neural networks feel quite similar to the convolution filters I used then, although they stack more layers (or filters) than are normally used in conventional image analysis. Furthermore, in conventional image analysis the kernels are typically handcrafted to perform certain tasks (such as detecting horizontal or vertical edges), whilst neural networks learn their kernels in training. In conventional image analysis convolution is often done in Fourier space since it is more efficient, and I see there are experiments along these lines with convolutional neural networks.
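
To make the contrast concrete, here is a handcrafted kernel of the kind conventional image analysis uses, applied with scipy; this is my own toy example, and a convnet would instead learn the values in its kernels from training data.

```python
import numpy as np
from scipy.ndimage import convolve

# A toy 8x8 "image": dark on the left half, bright on the right half.
image = np.zeros((8, 8))
image[:, 4:] = 1.0

# A handcrafted Sobel kernel for vertical edges, fixed by the analyst;
# a convnet would learn its kernel values during training instead.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

edges = convolve(image, sobel_x)
print(edges)  # large-magnitude values where the brightness jumps
```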

Developing and training neural networks has the air of an experimental science rather than a theoretical science. That's to say that rather than thinking hard and coming up with an effective neural network and training scheme, one needs to tinker with different designs and training methods and see how well they work. It has more the air of training an animal than programming a computer. There are a number of standard training / test sets of images, and successful models trained against these by other people can be downloaded. Such models can be used as-is, but alternatively just parts can be used.

This section has many references to the original literature for the development of deep learning, highlighting how recent this new iteration of neural networks is.

Overall an excellent book, scikit-learn and tensorflow are the go-to libraries for Python programmers wanting to do machine learning and deep learning respectively. This book describes their use eloquently, with references to original literature where appropriate whilst providing a good overview of both topics. The code used in the book can be found on github, as a set of Jupyter Notebooks.
