Tag: machine learning

Book review: You look like a thing and I love you by Janelle Shane

You look like a thing and I love you by Janelle Shane is a non-technical overview of machine learning. This isn’t to say it doesn’t go into some depth, or that if you are an experienced practitioner in machine learning you won’t learn something. The book is subtitled “How Artificial Intelligence Works and Why It’s Making the World a Weirder Place” but Shane makes clear at the outset that it is all about machine learning – Artificial Intelligence is essentially the non-specialist term for the field.

Machine learning is based around training an algorithm with a set of data which represents the task at hand. It might be a list of names (of kittens, for example) where essentially we are telling the algorithm “all these things here are examples of what we want”. Or it might be a set of images where we indicate the presence of dogs, cats or whatever we are interested in. Or, to use one of Shane’s examples, it might be sandwich recipes labelled as “tasty” or “not so tasty”.  After training, the algorithm will be able to generate names consistent with the training set, label images as containing cats or dogs or tell you whether a sandwich is potentially tasty.

The book has grown out of Shane’s blog AI Weirdness where she began posting about her experiences of training recurrent neural networks (a machine learning algorithm) at the beginning of 2016. This started with her attempts to generate recipes. The results are, at times, hysterically funny. Following attempts at recipes she went on to the naming of things, using neural networks to generate the names of kittens, guinea pigs, craft beers, Star Wars planet names and to generate knitting patterns. More recently she has been looking at image labelling using machine learning, and at image generation using generative adversarial networks.

The “happy path” of machine learning is interrupted by a wide range of bumps in the road which Shane identifies. These include:

  • Messy training data – the recipe data, at one point, had ISBN numbers mixed in which led to the neural network erroneously trying to include ISBN-like numbers in recipes;
  • Biased training data – someone tried to analyse the sentiment of restaurant reviews but found that Mexican restaurants were penalised because the word2vec training set (word2vec is a popular word-embedding tool which they used in their system) associated Mexican with “illegal”;
  • Not detecting the thing you thought it was detecting – Shane uses giraffes as an example: image labelling systems have a tendency to see giraffes where they don’t exist. This is because if you train a system to recognise animals then in all likelihood you will not include pictures with no animals. Therefore if you show a neural network an image of some fields and trees with no animals in it, it will likely “see” an animal because, to its knowledge, animals are always found in such scenes. And neural networks just like giraffes;
  • Inappropriate reward functions – you might think you have given your machine learning system an appropriate “reward function”, aka a measure of success, but is it really the right one? For example the COMPAS system, which advises whether prisoners in the US should be recommended for parole, was trained using a reward based on re-arrest, not re-offending. Therefore it tended to recommend against parole for black prisoners because they were more likely to be arrested (not because they were more likely to re-offend);
  • “Hacking the Matrix” – in some instances you might train your system in a simulation of the real world, for example if you want to train a robot to walk then rather than trying to build real robots you would build virtual robots and try them out in a simulated environment. The problem comes when your virtual robot works out how to cheat in the simulated environment, for example by exploiting limitations of collision detection to generate energy;
  • Problems unsuited to machine learning – some tasks are not amenable to machine learning solutions. For example, in the recipe generation problem the “memory” of the neural network limits the recipes generated because by the time a neural network has reached the 10th ingredient in a list it has effectively forgotten the first ingredient. Furthermore, once trained in one task, a neural network will “catastrophically forget” how to do that task if it is subsequently trained to do another task – machine learning systems are not generalists;

My favourite of these is “Hacking the Matrix”, where algorithms discover flaws in the simulations in which they run, or flaws in their reward system, and exploit them for gain. This blog post on AI Weirdness provides some examples, and links to original research.

Some of this is quite concerning. The examples Shane finds are the obvious ones, such as the flight simulator which found that it could meet the goal of a “minimum force” landing by making the landing force enormous and overflowing the variable that stored the value, making it zero. This is catastrophic from the pilot’s point of view. This would have been a very obvious problem which could be identified without systematic testing. But what if the problem is not so obvious but equally catastrophic when it occurs?
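To see how an overflow can turn an enormous force into zero, here is a minimal sketch. The 16-bit storage width and the function name are my assumptions, chosen purely for illustration; the book does not say how the simulator stored the value.

```python
def store_as_uint16(value: int) -> int:
    """Mimic writing an integer into an unsigned 16-bit variable (hypothetical width)."""
    return value & 0xFFFF  # keep only the lowest 16 bits, as fixed-width storage would


gentle_landing = store_as_uint16(120)
violent_landing = store_as_uint16(65536)  # one more than the largest storable value

print(gentle_landing)   # 120
print(violent_landing)  # 0
```

An optimiser that only ever sees the stored number will prefer the second landing, which is exactly the wrong answer.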

A comment that struck me towards the end of the book was that humans “fake intelligence” with prejudices and stereotypes; it isn’t just machines that use shortcuts when they can.

The book finishes with how Shane sees the future of artificial intelligence, essentially in a recognition that these systems have strengths and weaknesses and that the way forward is to combine artificial and human intelligence.

Definitely worth a read!

Book review: Deep learning with Python by François Chollet

Deep learning with Python by François Chollet is the third book I have reviewed on deep learning neural networks. Despite these reviews only spanning a couple of years it feels like the area is moving on rapidly. The biggest innovations I see from this book are in the use of pre-trained networks, and the dominance of the Keras/Tensorflow/Python ecosystem in doing deep learning.

Deep learning is a type of artificial intelligence based on many-layered neural networks. This is where the “deep” comes in – it refers to the number of layers in the networks. The area has boomed in the last few years with the availability of massive datasets on which to train, improvements in numerical algorithms for training neural networks and the use of GPUs to further accelerate deep learning. Neural networks have been used in production since the 1990s – for example by the US postal service for reading handwritten zip codes.

Chollet works on artificial intelligence at Google and is the author of the Keras deep learning library. Google is also the home of Tensorflow, a lower level library which is often used as a backend to Keras. This is a roundabout way of saying we should expect Chollet to be expert and authoritative in this area.

The book starts with some nice background to machine learning. I liked Chollet’s description of machine learning (deep learning included) being about finding a representation of data which makes the problem at hand trivial to solve. Imagine taking two pieces of coloured paper, placing them one on top of the other and then crumpling them into a ball. Machine learning is the process of un-crumpling the ball.

As an introduction to the field Deep learning with Python runs through some examples of deep learning applied to various classes of problem, including movie review sentiment analysis, classifying newswire articles and predicting house prices, before going back to discuss some issues these problems raise. A recurring theme is the problem of overfitting. Deep learning models can learn their training data really well: essentially they memorise the answers to questions, and so when they are faced with questions they have not seen before they perform badly. Overfitting can be addressed with a range of techniques.

One twist I had not seen before is the division of the labelled data used in machine learning into three, not two, parts: training, validation and test. The use of training and validation parts is commonplace: the training set is used for training and the validation set is used to test the quality of a model after training. The third component which Chollet introduces is the “test” set; this is like the validation set but it is only used when your model is about to go into production, to see how it will perform in real life. The problem it addresses is that machine learning involves a large number of hyperparameters (things like the type of machine learning model, the number of layers in a deep network, the form of the activation function) which are not changed during training but are changed by the data scientist, quite possibly automatically and systematically. The hyperparameters can be overfitted to the validation set, hence a model can perform well on validation data (which was used to tune it) but not on test data which represents real life.
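The three-way division can be sketched in a few lines of plain Python. This is my own illustration rather than code from the book; the function name and the 60/20/20 proportions are assumptions.

```python
import random


def three_way_split(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into training, validation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test_set = shuffled[:n_test]                     # touched once, at the very end
    validation_set = shuffled[n_test:n_test + n_val]  # used while tuning hyperparameters
    training_set = shuffled[n_test + n_val:]          # used to fit the model
    return training_set, validation_set, test_set


training_set, validation_set, test_set = three_way_split(range(100))
print(len(training_set), len(validation_set), len(test_set))  # 60 20 20
```

The important discipline is not in the code: the test set is only looked at once the hyperparameters have been fixed.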

A second round of examples looks at deep learning in computer vision, using convolutional neural networks (convnets). These are related to the classic computer vision process of convolution and image morphology. Also introduced here are recurrent neural networks (RNNs) for applications in processing sequences such as time series data and language. RNNs carry memory from one step of a sequence to the next, which dense and convolutional networks don’t; this makes them effective for problems where the sequence of data is important.

The final round of examples is in generative deep learning including generating text, the DeepDream system, image style transfer and generating images of faces.

The book ends with some thoughts of the future. Chollet comments that he doesn’t like to use the term neural networks which implies the ability to reason and abstract in the way that humans do. One of the limitations of deep learning is that, as currently used, it does not have the ability to abstract or generate programmatic descriptions of solutions. You would not use deep learning to launch a rocket – we have detailed knowledge of the physics of rockets, gravity and the atmosphere which makes a physics-based approach far better.

As I read I realised that keeping up with what was new in machine learning was a critical and challenging task. Chollet addresses exactly this question, suggesting three approaches to keeping abreast of new developments:

  1. Kaggle – the machine learning competition site;
  2. ArXiv – the preprint server, in particular http://www.arxiv-sanity.com/ which is a curated view of the machine learning part of arXiv;
  3. Keras – keeping up with developments in the Keras ecosystem;

If you’re going to read one book on deep learning this should probably be the one: it is readable, covers off the field pretty well, and Chollet is an authority in this area who, in my view, has particularly acute insight into deep learning.

Book review: Hands-on machine learning with scikit-learn & tensorflow by Aurélien Géron

I’ve recently started playing around with recurrent neural networks and tensorflow which brought me to Hands-on machine learning with scikit-learn & tensorflow by Aurélien Géron. As a bonus it also includes material on scikit-learn which I’ve been using for a while.

The book is divided into two parts. The first, “Fundamentals of Machine Learning”, focuses on the functionality which is found in the scikit-learn library. It starts with a big picture, running through the types of machine learning which exist (supervised / unsupervised, batched / online and instance / model) and then some of the pitfalls and problems with machine learning before a section on testing and validation. The next part is a medium sized example of machine learning in action which demonstrates how the functionality of scikit-learn can be quickly used to develop predictions of house prices in California based on census data. This is a subject after my own heart, I’ve been working with property data in the UK for the past couple of years.

This example serves two purposes, firstly it demonstrates the practical steps you need to take when undertaking a machine learning exercise and secondly it highlights just how concisely much of it can be executed in scikit-learn. The following chapters then go into more depth first about how models are trained and scored and then going into the details of different algorithms such as Support Vector Machines and Decision Trees. This part finishes with a chapter on ensemble methods.

Although the chapters contain some maths, their strength is in the clear explanations of the methods described. I particularly liked the chapter on ensemble methods. They also demonstrate how consistent the scikit-learn library is in its interfaces. I knew that I could switch algorithms very easily with scikit-learn but I hadn’t fully appreciated how the libraries generally handled regression and multi-class classification so seamlessly.

I wonder whether outside data science it is perceived that data scientists write their own algorithms from scratch. In practical terms it is not the case, and hasn’t been the case since at least the early nineties when I started data analysis which looks very similar to the machine learning based analysis I do today. In those days we used the NAG numerical library, Numerical Recipes in FORTRAN and libraries developed by a very limited number of colleagues in the wider academic community (probably shared by email attachment).

The second part of the book, “Neural networks and Deep Learning”, looks at the tensorflow library. Tensorflow has general applications for processing multi-dimensional arrays but it has been designed with deep learning and neural networks in mind. This means there are a whole bunch of functions to generate and train neural networks of different types and different architectures.

The section starts with an overview of tensorflow with some references to other deep learning libraries, before providing an introduction to neural networks in general, which have been around quite a while now. Following this there is a section on training deep learning networks, and the importance of the form of activation functions.

Tensorflow will run across multiple processors, GPUs and/or servers although configuring this looks a bit complicated. Typically a neural network layer can’t be distributed over multiple processing units.

There then follow chapters on convolutional neural networks (good for image applications), recurrent neural networks (good for sequence data), autoencoders (finding compact representations) and finally reinforcement learning (good for playing pac-man). My current interest is in recurrent neural networks, so it was nice to see a brief description of all of the potential input/output scenarios for recurrent neural networks and how to build them.

I spent a few years doing conventional image analysis. Convolutional neural networks feel quite similar to the convolution filters I used then, although they stack more layers (or filters) than are normally used in conventional image analysis. Furthermore, in conventional image analysis the kernels are typically handcrafted to perform certain tasks (such as detect horizontal or vertical edges), whilst neural networks learn their kernels in training. In conventional image analysis convolution is often done in Fourier space since it is more efficient and I see there are experiments along these lines with convolutional neural networks.
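For anyone who hasn’t met a handcrafted kernel, here is a minimal sketch in plain Python; the function name, the toy image and the choice of kernel are mine, not the book’s. Strictly it computes a cross-correlation (the kernel is not flipped), which is also what the convolution layers in neural network libraries do.

```python
def correlate2d(image, kernel):
    """Slide kernel over image (no padding) and return the grid of weighted sums."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


# Handcrafted kernel that responds to dark-to-light steps going left to right
vertical_edge_kernel = [[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]]

# Toy image: dark left half, bright right half
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

response = correlate2d(image, vertical_edge_kernel)
print(response)  # [[0, 3, 3, 0], [0, 3, 3, 0]]
```

The response is non-zero only where the window straddles the edge. A convolutional network starts with random numbers in place of the -1, 0, 1 values and adjusts them during training.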

Developing and training neural networks has the air of an experimental science rather than a theoretical science. That’s to say that rather than thinking hard and coming up with an effective neural network and training scheme one needs to tinker with different designs and training methods and see how well they work. It has more the air of training an animal than programming a computer. There are a number of standard training / test sets of images and successful models trained against these by other people can be downloaded. Such models can be used as-is but alternatively just parts can be used.

This section has many references to the original literature for the development of deep learning, highlighting how recent this new iteration of neural networks is.

Overall an excellent book, scikit-learn and tensorflow are the go-to libraries for Python programmers wanting to do machine learning and deep learning respectively. This book describes their use eloquently, with references to original literature where appropriate whilst providing a good overview of both topics. The code used in the book can be found on github, as a set of Jupyter Notebooks.

Book review: Weapons of Math Destruction by Cathy O’Neil

Obviously for any UK anglophone the title of Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy by Cathy O’Neil is going to be a bit grating. The book is an account of how algorithms can ruin people’s lives. To a degree the “Big Data” in the subtitle is incidental.

Cathy O’Neil started her career as a mathematician before working for the hedge fund D. E. Shaw as a quant and then moving to Intent Media to work as a data scientist. It’s nice to know that I’m not the only person to have become a data scientist largely by writing “data scientist” on their CV! Nowadays she is an activist in the Occupy movement.

The book is the result of O’Neil’s revelation that algorithms were often used destructively, and are responsible for gross injustices. Algorithms in this case are models that determine how companies, and sometimes government, deal with their employees, customers and citizens; whether they are offered loans, adverts of a particular sort, employment, termination or a lengthy prison sentence.

The book starts with her experience at Shaw where she saw the subprime mortgage crisis from quite close up. In a nutshell: the subprime mortgage crisis happened because it was in the interests of most of the players in the industry for the stated risk of these mortgages to be minimised. The ratings agencies were paid by the aggregators of these mortgages to rate their risk, and the purchasers of these risk ratings had an interest in those ratings being low – the ratings agencies duly obliged.

The book goes on to cover a number of other “Weapons of Math Destruction”, including models for recruitment, insurance, credit rating, scheduling (for work), politics and policing. So, for example, there are the predictive policing algorithms which direct the police to particular parts of town in an effort to reduce serious crime. Once there, the police record more anti-social behaviour, which leads the algorithm to send them there again because it turns out that serious crime is quite rare but anti-social behaviour isn’t (so there’s more data to draw on). And the police in a number of countries are following the “zero-tolerance” model which says if you address minor misdemeanours then more serious crimes are fixed automatically. The problem in the US with this approach is that the police are sent to black neighbourhoods repeatedly (rather than, say, college campuses) and the model is self-reinforcing.

O’Neil identifies several systematic problems which are typical of Weapons of Math Destruction. These are the use of proxies rather than “real outcomes”, the lack of feedback from outcomes to the model, the scale on which the model impacts people, the lack of fairness built into the model, the opacity of the models and the damage the models can do. The damage is extensive: these WMDs can lead to you being arrested, incarcerated for lengthy periods, denied a job, denied medical insurance, and offered loans at most extortionate rates to complete courses at rather low rate universities.

The book is focused almost entirely on the US, in fact the only mention of a place outside the US is of policing in the “city of Kent”. However, O’Neil does seem to rate the data and privacy legislation in Europe – where consumers should be told of the purposes to which data will be put when they supply it. Even in the States the law provides some limits on certain types of model (such as credit scoring) but these laws have not kept pace with new developments, nor are they necessarily easy to use. For example, if your credit score is wrong then fixing it, although legally mandated, is not quick and easy.

Perhaps her most telling comment is that computers don’t understand fairness, and certainly don’t exhibit fairness if they are not asked to optimise for it. Which does lead to the question “How do you implement fairness?”. In some cases it is obvious: you shouldn’t make use of algorithms which explicitly take into account gender, race or disability. But it’s easy to inadvertently bring in these parameters by, for example, postcode being correlated with race. Or part-time working being correlated with gender or disability.

As a middle aged, middle class white man with a reasonably well-paid job, living in a nice part of town I am least likely to find myself on the wrong end of an algorithm and ironically the most likely to be writing such algorithms.

I found the book very thought-provoking, it will certainly lead me to ask myself whether the algorithms and data that I am generating are fair and what the cost of any unfairness is.

Adventures in Kaggle: Forest Cover Type Prediction


This post was first published at ScraperWiki.

Regular readers of this blog will know I’ve read quite a few machine learning books, now to put this learning into action. We’ve done some machine learning for clients but I thought it would be good to do something I could share. The Forest Cover Type Prediction challenge on Kaggle seemed to fit the bill. Kaggle is the self-styled home of data science, they host a variety of machine learning oriented competitions ranging from introductory, knowledge building (such as this one) to commercial ones with cash prizes for the winners.

In the Forest Cover Type Prediction challenge we are asked to predict the type of tree found on 30x30m squares of the Roosevelt National Forest in northern Colorado. The features we are given include the altitude at which the land is found, its aspect (direction it faces), various distances to features like roads, rivers and fire ignition points, soil types and so forth. We are provided with a training set of around 15,000 entries where the tree types are given (Aspen, Cottonwood, Douglas Fir and so forth) for each 30x30m square, and a test set for which we are to predict the tree type given the “features”. This test set runs to around 500,000 entries. This is a straightforward supervised machine learning “classification” problem.

The first step must be to poke about at the data, I did a lot of this in Tableau. The feature most obviously providing predictive power is the elevation, or altitude of the area of interest. This is shown in the figure below for the training set, we see Ponderosa Pine and Cottonwood predominating at lower altitudes transitioning to Aspen, Spruce/Fir and finally Krummholz at the highest altitudes. Reading in wikipedia we discover that Krummholz is not actually a species of tree, rather something that happens to trees of several species in the cold, windswept conditions found at high altitude.

Figure 1: tree type against elevation for the training set

Data inspection over I used the scikit-learn library in Python to predict tree type from features. scikit-learn makes it ridiculously easy to jump between classifier types, the interface for each classifier is the same so once you have one running swapping in another classifier is a matter of a couple of lines of code. I tried out a couple of variants of Support Vector Machines, decision trees, k-nearest neighbour, AdaBoost and the extremely randomised trees ensemble classifier (ExtraTrees). This last was best at classifying the training set.
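To give a flavour of how little code the swapping takes, here is a minimal sketch. It uses synthetic data from make_classification as a stand-in for the forest cover features and default settings for every classifier, so it is not my competition code and the scores mean nothing beyond illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the forest cover features (not the Kaggle data)
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_held, y_train, y_held = train_test_split(
    X, y, test_size=0.25, random_state=0)

classifiers = {
    "SVM": SVC(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbours": KNeighborsClassifier(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                  # same call for every classifier
    scores[name] = clf.score(X_held, y_held)   # accuracy on the held back part
    print(f"{name}: {scores[name]:.3f}")
```

Every classifier exposes the same fit and score methods, so adding another one is a single extra line in the dictionary.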

The challenge is in mangling the data into the right shape and selecting the features to use, this is the sort of pragmatic knowledge learnt by experience rather than book-learning. As a long time data analyst I took the opportunity to try something: essentially my analysis programs would only run when the code had been committed to git source control and the SHA of the commit, its unique identifier, was stored with the analysis. This means that I can return to any analysis output and recreate it from scratch. Perhaps unexceptional for those with a strong software development background but a small novelty for a scientist.
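The commit check can be sketched roughly as follows. This is a reconstruction of the idea rather than the code I actually ran: the function names are hypothetical, and it assumes the analysis is run from inside a git working copy with git on the path.

```python
import subprocess


def git_output(*args):
    """Run a git command in the current directory and return its stripped output."""
    return subprocess.run(["git", *args], capture_output=True, text=True,
                          check=True).stdout.strip()


def committed_sha(run_git=git_output):
    """Return the current commit SHA, refusing if there are uncommitted changes."""
    # `git status --porcelain` prints nothing for a clean tree; note that it
    # also lists untracked files, so this check is on the strict side
    if run_git("status", "--porcelain"):
        raise RuntimeError("Uncommitted changes: commit before running the analysis")
    return run_git("rev-parse", "HEAD")


def tag_results(results, sha):
    """Attach the commit SHA to a dictionary of analysis results."""
    return {**results, "git_sha": sha}
```

In use the analysis script would call something like tag_results({"accuracy": 0.78}, committed_sha()) before writing its output, so every result file carries the identifier of the code that produced it.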

Using a portion of the training set to do an evaluation it looked like I was going to do really well on the Kaggle leaderboard but on first uploading my competition solution things looked terrible! It turns out this was a common experience and is a result of the relative composition of the training and test sets. Put crudely the test set is biased to higher altitudes than the training set so using a classifier which has been trained on the unmodified training set leads to poorer results than expected based on measurements on a held back part of the training set. You can see the distribution of elevation in the test set below, and compare it with the training set above.

Figure 2: distribution of elevation in the test set

We can fix this problem by biasing the training set to more closely resemble the test set; I did this on the basis of the elevation. This eventually got me to rank 430 on the leaderboard, shown in the figure below. We can see here that I’m somewhere up the long shallow plateau of performance. There is a breakaway group of about 30 participants doing much better and at the bottom there are people who perhaps made large errors in analysis but got rescued by the robustness of machine learning algorithms (I speak from experience here!).

Figure 3: my position on the Kaggle leaderboard
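One way to bias a training set towards a test set on the basis of elevation is to resample it, as in the sketch below. This shows the general idea rather than my actual code: the 200 m bin width, the function names and the toy elevations are all assumptions.

```python
import random
from collections import Counter


def elevation_bin(elevation, width=200):
    """Assign an elevation in metres to a bin of the given (assumed) width."""
    return int(elevation // width)


def resample_to_match(train_elevations, test_elevations, n_samples, seed=0):
    """Return indices into the training set, drawn with replacement so that the
    elevation histogram of the sample approximates that of the test set."""
    train_bins = [elevation_bin(e) for e in train_elevations]
    train_counts = Counter(train_bins)
    test_counts = Counter(elevation_bin(e) for e in test_elevations)
    n_train, n_test = len(train_bins), len(test_elevations)
    # Weight = share of the bin in the test set / share of the bin in the training set
    weights = [
        (test_counts[b] / n_test) / (train_counts[b] / n_train)
        for b in train_bins
    ]
    rng = random.Random(seed)
    return rng.choices(range(n_train), weights=weights, k=n_samples)


# Toy data: training set mostly low ground, test set mostly high ground
train_elevations = [2100] * 8 + [3300] * 2
test_elevations = [2100] * 2 + [3300] * 8

indices = resample_to_match(train_elevations, test_elevations, n_samples=2000)
high_fraction = sum(train_elevations[i] > 3000 for i in indices) / len(indices)
print(round(high_fraction, 2))  # close to 0.8, the share of high ground in the toy test set
```

Elevation bins which appear in the test set but not in the training set are simply ignored here; there is nothing to sample from.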

There is no doubt some mileage in tuning the parameters of the different classifiers and no doubt winning entries use more sophisticated approaches. scikit-learn does pretty well out of the box, and tuning it provides marginal improvement. We observed this in our earlier machine learning work too.
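For anyone wanting to try the tuning, scikit-learn’s GridSearchCV does the legwork. The sketch below again uses synthetic data as a stand-in, and the parameter grid is an arbitrary choice for illustration rather than anything I used in the competition.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the forest cover features (not the Kaggle data)
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

# Every combination of these values is tried with 3-fold cross-validation
param_grid = {
    "n_estimators": [50, 100],
    "max_features": ["sqrt", None],
}
search = GridSearchCV(ExtraTreesClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```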

I have mixed feelings about the Kaggle competitions. The data is nicely laid out, the problems are interesting and it’s always fun to compete. They are a great way to dip your toes in semi-practical machine learning applications. The size of the awards means it doesn’t make much sense to take part on a commercial basis.

However, the data are presented such as to exclude the use of domain knowledge, they are set up very much as machine learning challenges – look down the competitions and see how many of them feature obfuscated data, likely for reasons of commercial confidence or to make a problem more “machine learning” and less amenable to domain knowledge. To a physicist this is just a bit offensive.

If you are interested in a slightly untidy blow by blow account of my coding then it is available here in a Bitbucket Repo.