Feb 20 2017

Book review: Weapons of Math Destruction by Cathy O’Neil

Obviously, for any UK anglophone, the title of Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy by Cathy O’Neil is going to be a bit grating. The book is an account of how algorithms can ruin people’s lives. To a degree the “Big Data” in the subtitle is incidental.

Cathy O’Neil started her career as a mathematician, then worked for the D. E. Shaw hedge fund as a quant before moving to Intent Media to work as a data scientist. It’s nice to know that I’m not the only person to have become a data scientist largely by writing “data scientist” on their CV! Nowadays she is an activist in the Occupy movement.

The book is the result of O’Neil’s revelation that algorithms are often used destructively, and are responsible for gross injustices. Algorithms in this case are models that determine how companies, and sometimes governments, deal with their employees, customers and citizens: whether they are offered loans, adverts of a particular sort, employment, termination or a lengthy prison sentence.

The book starts with her experience at Shaw, where she saw the subprime mortgage crisis from quite close up. In a nutshell: the subprime mortgage crisis happened because it was in the interests of most of the players in the industry for the stated risk of these mortgages to be minimised. The ratings agencies were paid by the aggregators of these mortgages to rate their risk, and the purchasers of these risk ratings had an interest in those ratings being low – the ratings agencies duly obliged.

The book goes on to cover a number of other “Weapons of Math Destruction”, including models for recruitment, insurance, credit rating, scheduling (for work), politics and policing. So, for example, there are the predictive policing algorithms which direct the police to particular parts of town in an effort to reduce serious crime. Once there, the police record more anti-social behaviour, which leads the algorithm to send them there again: it turns out that serious crime is quite rare but anti-social behaviour isn’t, so there’s more data to draw on. The police in a number of countries are also following the “zero-tolerance” model, which says that if you address minor misdemeanours then more serious crimes are fixed automatically. The problem in the US with this approach is that the police are sent to black neighbourhoods repeatedly (rather than, say, college campuses) and the model is self-reinforcing.
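
The self-reinforcing loop can be made concrete with a toy simulation. This is only a sketch, not a model taken from the book: the two districts, their identical underlying incident rate and the rule “send every patrol wherever most incidents have been recorded” are all assumptions invented for illustration.

```python
# Toy model of a self-reinforcing patrol allocation (all numbers are made up).
TRUE_RATE = 0.1  # incidents recorded per patrol visit; identical in both districts


def simulate(rounds, patrols=10):
    # District A starts with a slightly larger record, e.g. from historic patrols.
    recorded = {"A": 6.0, "B": 5.0}
    history = []
    for _ in range(rounds):
        # The "algorithm": send every patrol to the district with the most records.
        target = max(recorded, key=recorded.get)
        # Incidents are only recorded where the police actually are.
        recorded[target] += patrols * TRUE_RATE
        history.append(target)
    return recorded, history


recorded, history = simulate(rounds=20)
print(history)   # which district was patrolled in each round
print(recorded)  # the recorded totals the algorithm "sees"
```

Although both districts have the same underlying rate, district B is never visited again, so its record never grows and the small initial difference is locked in.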

O’Neil identifies several systematic problems which are typical of Weapons of Math Destruction. These are the use of proxies rather than “real outcomes”, the lack of feedback from outcomes to the model, the scale on which the model impacts people, the lack of fairness built into the model, the opacity of the models and the damage the models can do. The damage is extensive: these WMDs can lead to you being arrested, incarcerated for lengthy periods, denied a job, denied medical insurance, and offered loans at the most extortionate rates to complete courses at rather low-rated universities.

The book is focused almost entirely on the US; in fact the only mention of a place outside the US is of policing in the “city of Kent”. However, O’Neil does seem to rate the data and privacy legislation in Europe – where consumers should be told of the purposes to which data will be put when they supply it. Even in the States the law provides some limits on certain types of model (such as credit scoring) but these laws have not kept pace with new developments, nor are they necessarily easy to use. For example, if your credit score is wrong then fixing it, although legally mandated, is not quick and easy.

Perhaps her most telling comment is that computers don’t understand fairness, and certainly don’t exhibit fairness if they are not asked to optimise for it. Which does lead to the question “How do you implement fairness?”. In some cases it is obvious: you shouldn’t make use of algorithms which explicitly take into account gender, race or disability. But it’s easy to inadvertently bring in these parameters by, for example, postcode being correlated with race. Or part-time working being correlated with gender or disability.
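
A small sketch shows how a proxy can smuggle a protected attribute back in. Everything here is synthetic: the postcodes, the groups “x” and “y” and the approval rule are hypothetical, chosen only so that postcode and group are correlated.

```python
# Synthetic applicants as (postcode, group) pairs. All values are invented.
applicants = [
    ("N1", "x"), ("N1", "x"), ("N1", "x"), ("N1", "y"),
    ("S9", "y"), ("S9", "y"), ("S9", "y"), ("S9", "x"),
]


def approval_rule(postcode):
    # The "model" never sees the group, only the postcode.
    return postcode == "N1"


def approval_rate_by_group(rows):
    """Measure the outcome per group, even though the rule ignores group."""
    totals, approved = {}, {}
    for postcode, group in rows:
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + approval_rule(postcode)
    return {g: approved[g] / totals[g] for g in totals}


rates = approval_rate_by_group(applicants)
print(rates)
```

The rule is “blind” to group, yet the approval rates differ sharply between the two groups because postcode stands in for it. Checking outcomes by group, as the last function does, is one simple way to notice this.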

As a middle-aged, middle-class white man with a reasonably well-paid job, living in a nice part of town, I am least likely to find myself on the wrong end of an algorithm and, ironically, the most likely to be writing such algorithms.

I found the book very thought-provoking; it will certainly lead me to ask myself whether the algorithms and data that I am generating are fair and what the cost of any unfairness is.

Feb 09 2017

Book review: I contain multitudes by Ed Yong

This book was a Christmas gift, for which I’m very grateful! I Contain Multitudes: The Microbes Within Us and a Grander View of Life by Ed Yong is all about bacteria.

Bacteria are somewhat neglected in the popular science literature. I think the closest I can come is The Eighth Day of Creation by Horace Freeland Judson, which is about the discovery of DNA and its role in molecular biology, in which bacteria and viruses play a part.

Yong’s book is about the relationship between bacteria and other organisms, humans included. It reveals a world where bacteria are not simply passengers on oblivious hosts but are a heavily integrated part of the host’s life cycle.

The study of the “microbiome” is relatively recent. Unravelling the members of a microbial community prior to the invention of cheap and easy DNA sequencing was hard. Carl Woese pioneered this approach in the 1970s, and used it to discover the archaea, a whole new Kingdom of life (plants and animals are two of the other Kingdoms, to give you an idea of the magnitude of this discovery). Sequencing of the bacterial inhabitants of humans gained pace in the 2000s when it was discovered that we all carry a rich community of bacteria which varies from site to site around the body, let alone from individual to individual. What is true for humans is true for other organisms.

The book continues with an overview of how important bacteria can be to an organism’s life. For example choanoflagellates, typically single-celled organisms, only form colonies in the presence of certain bacteria. And bobtail squid rely on bacterial partners to provide their luminescence. The standard lab animals (mice, zebrafish, flies) have been raised in germ-free environments and whilst they do not die, they do not flourish – even in the comfortable environment of the lab. Wolbachia bacteria interfere with the sex lives of their insect hosts: they are only passed down via the eggs of the female, and so they arrange by various means that there are more eggs and females than sperm.

These partnerships are not accidental, in the sense that organisms often provide specific structures to support their bacterial partners and exchange specific molecular markers with them. In some cases the host is essential to the survival of the bacteria it contains because they have given up on carrying out tasks essential to their continued existence, for example the supply of essential nutrients. This is true on many scales: animals from termites to cows have digestive systems designed to accommodate a particular bacterial support team to enable them to digest what would otherwise be food of low nutritional value. The early years of a human infant’s life are shaped by its acquisition of the right microbiome to prime the immune system and aid digestion.

The reason that bacteria are so effective in providing support services to their hosts is their high rate of evolution. Not only do they replicate fast, they have a promiscuous approach to DNA they come across in their environment. This means that if any bacterial species evolves a useful trait, such as the ability to digest seaweed, then its neighbours in the gut can pick up that ability via its DNA. These genes can, eventually, end up in the genome of their hosts.

Japanese people who eat nori seaweed, which contains carbohydrates which the human body can’t digest on its own, host bacteria which can. Moreover, the genes those bacteria use to carry out this digestion were acquired from marine bacteria.

Yong is not misty-eyed about his bacterial subjects. As he points out, their symbiosis with other organisms is not altogether harmonious – in the end the bacteria are in it for themselves.

The book finishes with some examples of how bacteria can be used to support human health, and speculates how this approach – currently only used in curing persistent C. difficile infections – could be extended to all manner of ailments including blood pressure and mental health problems.

I’ve been following Ed Yong on Twitter for quite a while, and where he found the time to write a book as well as everything else he seems to do is a mystery to me! His style, as a science journalist, can be seen in the book, both in the presentation of the story, with brief character sketches of the scientists involved and quotes from them, and in the titles of the chapters which are entertaining but not necessarily informative. The book is thick with examples which build into larger themes; turn to the back of the book and you’ll find references to the primary literature.

Bacteria deserve our attention, and this book is a great introduction to how they shape the lives of “higher” organisms.

Jan 01 2017

Book review: The Headspace Guide to Mindfulness & Meditation by Andy Puddicombe

To some degree this is a review of an app rather than a book. I started using the Headspace app a couple of months ago after a particularly trying burst of insomnia. The Headspace Guide to Mindfulness & Meditation by Andy Puddicombe is the book of the app; I picked up the Kindle version for £0.99.

The mindfulness / meditation division is a bit of language engineering. Meditation is the exercise you do to achieve a state of mindfulness. But when Western medical practitioners started to use techniques of meditation in treating patients, they also found greater acceptability in the scientific community when using the newly coined term “mindfulness” rather than “meditation”.

The Headspace app costs £60 per year, more if you pay by month and less if you buy for longer. There’s much more material in the app than in the book but I find it easier to read chunks of text rather than remember the preambles to the guided meditation sessions which is where the material from the book is found in the app. I think it’s pretty much the only app I’ve bought and I had it on a free trial first – so I certainly value it.

The core of the Headspace programme is the 10 minute daily meditation. This starts with a few deep breaths to get yourself going, then some scanning of the body to gain awareness of your physicality, followed by a period of focusing on your breathing and then a phase of returning to the world. Clearly, this brief description does not do the process justice. From my mechanistic point of view, the “aim” is to become a dispassionate observer of one’s thoughts – mindfulness. More widely, the aim is to carry this approach into the rest of your day.

In the app the voice of Andy is a comforting presence; perhaps more so for me because he’s from the West Country, and so makes me feel at home with his very slight Bristol accent. I found the app made it very easy for me to stick to a regular programme of daily 10 minute sessions.

I think one of the things that helped me latch on to Headspace was the exercise for insomnia, not so much the exercise itself but the fact that Andy describes exactly the form my insomnia takes (it’s also repeated in one of the case studies at the end of the book). Essentially I go to bed, and for an hour or so I fail to fall asleep. At which point I start getting stressed and frustrated at not falling asleep, and worrying about how awful the following day is going to be without any sleep. Usually the thought that set me off not sleeping is also going around and around in my head as well. At its worst this continues through the whole night.

The book is very readable. Much of it is anecdotes of Andy’s time spent in monasteries, although there are some exercises and written guidance for meditation. This may seem frivolous but I found it far more interesting than dry descriptions of meditation, and as a result more likely to stick in the mind. I found these stories, and other visualisations described, very helpful.

There are frequent references to scientific research on the benefits of meditating and mindfulness throughout the text. As a trained scientist I muttered about the lack of in-text citations, but on reaching the end I discovered the references section which, without my having followed the references up, looks legitimate.

I learned a couple of things from reading the book compared to the app. Firstly, it turns out my sofa is considered a bed for meditating purposes so I should be using a dining room chair. Secondly, it seemed implicit in the app that meditating was best done in the morning and the book makes this explicit.

Strangely I find that elements of mindfulness existed in my life before I got Headspace. When I’m running (at my best) I find myself focusing on my breathing; in fact I can tell I’ve been thinking rather than focusing on breathing because I run slower when I’m thinking. I found a similar experience when we used to go walking in the countryside, where the tramp of the feet acts as a good focus.

Meditation has its roots in Buddhism, and much of the experience that the author relates is from monasteries and his time as a monk. Despite this, the book (and the “programme”) feel like they have no religious elements to them. It has to be said adherents to meditation can sound rather evangelical – I can feel myself doing this now.

In summary, I highly recommend giving meditation, and Headspace, a go. You can try the app for free for 10 days, or if you prefer, the book is very cheap in its Kindle edition.

Dec 29 2016

Review of the year: 2016

Another year passes and once more it is time to write the annual review of my blogging. I no longer have an hour and a half or so of commuting on the train every day, so I thought my reading rate might have dropped. However, I see in the last year I have 21 book reviews on my blog as opposed to 22 last year. As usual my reading is split between technical books, the history of science and various odds and ends.

In terms of technical books, Pro Git by Scott Chacon and Ben Straub and Test-driven Development with Python by Harry J.W. Percival probably had the biggest impact on me in terms of the way I did my job. But Beautiful JavaScript edited by Anton Kovalyov was the most thought-provoking; it is an edited collection of the thoughts of a set of skilled JavaScript developers. Lab Girl by Hope Jahren is an autobiography describing how it is to be a scientist, and it is beautifully written. Maphead by Ken Jennings is about those obsessed with maps rather than science. Of the more directly science-related books I think The Invention of Science by David Wootton was the best in terms of provoking thought; it’s also very readable. The Invention covers the Scientific Revolution from 1500-1700 in terms of the language available to and used by its practitioners.

A second contender for the “sweeping overview” award is A New History of Life by Peter D. Ward and Joe Kirschvink, which focuses particularly on the work over the last 20 years on the very earliest life on earth. I read some economic history in the form of The Honourable Company by John Keay (about the East India Company) and the more general The Company by John Micklethwait and Adrian Wooldridge. I also read about the Romans, in the form of Mary Beard’s SPQR which is a history of ancient Rome, and Roman Chester by David J.P. Mason which is about my home city.

You can see all I’ve read on Goodreads. I don’t blog about my fiction reading, I think because for me blogging is mostly about reminding myself about facts and ideas I’ve read about and I struggle to see how I’d do that with fiction. Perhaps I should try. In fiction, I’ve been making some effort to read books not written by middle aged white men which has been rewarding.

This year’s holiday was to Benllech on the Isle of Anglesey, an embarrassingly short drive from home – our holiday bungalow had leaflets describing attractions in our home city! We took in a number of castles, the beach on a daily basis and the Anglesey Sea Zoo. The photo at the top is from Amlwch which was once port to the Copper Mountain.

The year has been momentous politically with Jeremy Corbyn’s re-election as leader of the Labour Party, the Leave vote in the EU referendum, David Cameron stepping down as Prime Minister and then leaving parliament, Theresa May taking over as Prime Minister and the election of Donald Trump as president in the US. I haven’t written much about all of these things. I wrote a blog post shortly before the EU referendum, setting out my reasons for voting Remain. I accidentally wrote that I thought Leave would win – which was strangely prophetic. In the aftermath of the vote I was dazed and disturbed, much as I thought I would be. I half wrote many blog posts after the vote but the only one I published was on the unsuitability of Boris Johnson for pretty much anything, let alone the delicate role of Foreign Secretary.

Things are looking up a little for my party, the Liberal Democrats, who seem the only ones prepared to oppose the government over their Brexit “plans”, and the only ones prepared to vote against the “Snooper’s Charter”. We’re the only ones making significant gains in local elections and have made significant showings in Westminster by-elections, getting a 23.5% swing in Witney and winning Richmond Park with a 30.4% swing. The Labour party seems to be marching itself into the wilderness with considerable enthusiasm.

David Laws’ Coalition was my only political reading of the year.

I’ve written a couple of times on exercise-related things. The Running Man was on my newfound enthusiasm for running. Since writing it I have bought a fancier running watch (a Garmin Forerunner 235): I read Bob Glover’s The Runner’s Handbook and decided I had to have a heart rate monitor. As it is, I don’t pay a huge amount of attention to the heart rate monitor but it is nice that the GPS is ready to go by the time I reach the end of my walk up the drive rather than five minutes later. I also wrote about cycling to work in Ride: as others struggle to find parking at work I have a 12 space bike shed mostly to myself (particularly in the winter)!

I’ve been trying out Headspace recently, which is an app for guided meditation; it seems helpful for the gloomy winter. I realise that I used to get some of the elements of meditation from our long walks in the country.

Work has been fun, I have built something which is now being sold to customers, and I made something of an impact with my sequinned jacket and willingness to dance the night away at the office Christmas party. 

Dec 27 2016

Book review: Elasticsearch–The Definitive Guide by Clinton Gormley & Zachary Tong

Back to technology with this blog post and a review of Elasticsearch – The Definitive Guide by Clinton Gormley and Zachary Tong. The book is available for free online, where it is probably more up to date; that said, Elasticsearch seems to be quite stable now. I have a dead tree copy because I’m old-fashioned.

Elasticsearch is a full-text search engine based on the Apache Lucene project. I was first made aware of it when I was working at ScraperWiki where we used it for a proof of concept system for analysing legislation from many countries (I wasn’t involved hands-on with this work). Recently, I used it to make a little auto-completion web form for company names using the Companies House dataset. From download to implementing a solution which was 1000 times faster than a naive SQL querying system took less than a day – the default configuration and system is that good!

You can treat Elasticsearch like a SQL database to a fair degree: what it refers to as indexes are what would be separate databases on a SQL server. Elasticsearch refers to document types instead of tables, and what would be rows in a SQL database are called “documents”. There are no joins as such in Elasticsearch but there are a number of workarounds such as parent-child relationships, nested objects or plain old denormalisation. I suspect one needs to be a bit cautious of treating Elasticsearch as a funny-looking SQL database.
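
As an illustration of “plain old denormalisation”, here is a minimal sketch in plain Python. The two tables and their field names are hypothetical; the point is only that the child rows are folded into the parent, so a single document carries everything a query needs and no join is required.

```python
# Hypothetical relational rows: companies and their officers in separate tables.
companies = [
    {"id": 1, "name": "Acme Widgets Ltd"},
    {"id": 2, "name": "Bolt Fixings Ltd"},
]
officers = [
    {"company_id": 1, "name": "A. Smith"},
    {"company_id": 1, "name": "B. Jones"},
    {"company_id": 2, "name": "C. Patel"},
]


def denormalise(companies, officers):
    """Fold the child rows into each parent so one document answers the query."""
    by_company = {}
    for officer in officers:
        by_company.setdefault(officer["company_id"], []).append(officer["name"])
    return [
        {"name": c["name"], "officers": by_company.get(c["id"], [])}
        for c in companies
    ]


documents = denormalise(companies, officers)
print(documents[0])
```

The price is duplication: if the same officer appeared under several companies, a change to their name would have to be applied to every document that embeds it.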

The preferred way to interact with Elasticsearch is using the HTTP API; this means that once installed you can prod away at your Elasticsearch database using curl from the command line or the Sense plugin for Google Chrome. The book is liberally scattered with examples written as HTTP requests, and online these can be launched from the browser (given a bit of configuration). To my mind the only downside of this is that queries are written in JSON, which introduces a lot of extraneous brackets and quoting. For my experiments I moved quickly to using the Python interface which seems well-supported and complete (as do other language bindings).
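
To give a flavour of the brackets and quoting, this sketch builds the JSON body for a basic `match` query and prints the equivalent curl command. It only constructs the request and does not contact a server. The index name `companies` and the field `name` are my own placeholders, and the URL assumes a local node on the default port 9200.

```python
import json

# Assumptions: a local node on the default port 9200, an index called
# "companies" and a text field called "name". None of these come from the book.
URL = "http://localhost:9200/companies/_search"


def match_query(field, text, size=5):
    """Build the JSON body for a basic full-text `match` query."""
    return {"size": size, "query": {"match": {field: text}}}


body = json.dumps(match_query("name", "acme widgets"))
# The same request as a curl command line, ready to paste into a shell.
print(f"curl -XGET '{URL}' -H 'Content-Type: application/json' -d '{body}'")
```

Building the body as a Python dictionary and serialising it at the last moment avoids most of the hand-quoting that makes raw JSON queries tedious.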

Elasticsearch: The Definitive Guide is divided into 7 sections: Getting Started, Search in Depth, Dealing with Human Language, Aggregations, Geolocation, Modelling Your Data, and finishes with Administration, Monitoring and Deployment.

The Getting Started section of the book covers everything you need to get you going but no single topic in any depth. The subsequent sections are largely about filling in that detail. The query language is completely different to SQL and queries come back with results ranked by a relevance score. I suspect this is where I’ll find myself working a lot in future, currently my queries give me a set of results which I filter in Python. I suspect I could write better queries which would return relevance scores which matched my application (and that I would trust). As it stands my queries always return *something* which may or may not be what I want. 
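
The filtering I currently do in Python looks roughly like the sketch below. The response is hand-written in the shape Elasticsearch returns (a list of hits, each with a `_score` and a `_source`); the names, scores and the threshold of 3.0 are invented, and since relevance scores are not on a fixed scale any threshold would need tuning per query.

```python
# A hand-written response in the shape Elasticsearch returns from a search;
# the scores and company names are invented.
response = {
    "hits": {
        "hits": [
            {"_score": 7.2, "_source": {"name": "Acme Widgets Ltd"}},
            {"_score": 3.1, "_source": {"name": "Acme Holdings plc"}},
            {"_score": 0.4, "_source": {"name": "Widget World Ltd"}},
        ]
    }
}


def confident_hits(response, threshold):
    """Keep only the hits whose relevance score reaches the threshold."""
    return [
        hit["_source"]["name"]
        for hit in response["hits"]["hits"]
        if hit["_score"] >= threshold
    ]


print(confident_hits(response, threshold=3.0))
```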

I found the material regarding analyzers (which are applied to searchable fields and, symmetrically, search terms) very interesting and applicable to wider search problems where Elasticsearch is not necessarily the technology to be used. There is an overlap here with natural language processing in the sense that analyzers can include tokenizers, stemmers, and synonym lookups which are all part of the NLP domain. This is expanded on further in the “Dealing with human language” section.
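
The idea of an analyzer can be sketched in a few lines of plain Python. This is a toy, not Elasticsearch’s implementation: the stop word list, the synonym table and the crude suffix-stripping “stemmer” are all stand-ins I have made up. What it does show is the symmetry, with the same chain applied to the stored text and to the search terms.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and"}   # tiny illustrative list
SYNONYMS = {"ltd": "limited"}                   # hypothetical synonym table


def stem(token):
    """Very crude suffix stripping; real stemmers (e.g. Porter) do far more."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # tokenise + lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]   # drop stop words
    tokens = [SYNONYMS.get(t, t) for t in tokens]         # map synonyms
    return [stem(t) for t in tokens]                      # stem


# The same chain is applied to the stored field and to the search terms.
indexed = analyze("The Acme Widgets Ltd")
query = analyze("acme widget limited")
print(indexed, query)
```

Both calls produce the same token list, which is why a search for “acme widget limited” can match a record stored as “The Acme Widgets Ltd”.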

The section on aggregations explains Elasticsearch’s “group by”-like functionality, and that on geolocation touches on spatial extension-like behaviour. Elasticsearch handles geohashes which are a relatively recent innovation in encoding spatial coordinates.
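
A geohash is short enough to implement by hand, which makes the idea clearer. The sketch below follows the standard public encoding (repeatedly halve the longitude and latitude ranges, interleave the resulting bits and emit one base-32 character per five bits); it is written from that description rather than taken from the book or from Elasticsearch.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash(lat, lon, precision=5):
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, value = [], 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1  # upper half: emit a 1 bit
            rng[0] = mid
        else:
            value <<= 1               # lower half: emit a 0 bit
            rng[1] = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every five bits become one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


print(geohash(57.64911, 10.40744))  # u4pru
```

Because each extra character narrows the cell, nearby points usually share a prefix, which is what makes geohashes convenient to index as ordinary strings.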

The book mentions very briefly the ELK stack which is Elasticsearch, Logstash and Kibana (all available from the Elastic website). This is used to analyse log files: Logstash funnels the log data into Elasticsearch, where it is visualised using Kibana. I tried out Kibana briefly; it’s an easy-to-use visualising frontend.

Elasticsearch is a Big Data technology from the start which means it supports sharding, replication and distribution over nodes out of the box but it runs fine on a simple single node such as my laptop.
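
The guide gives the rule for deciding which shard a document lives on as `shard = hash(routing) % number_of_primary_shards`, where the routing value defaults to the document’s id. The sketch below applies that formula with `zlib.crc32` as a stand-in hash (Elasticsearch uses its own hash function, so the actual shard numbers will differ) and with made-up document ids.

```python
import zlib


def pick_shard(routing, number_of_primary_shards):
    """shard = hash(routing) % number_of_primary_shards

    zlib.crc32 is a stand-in hash chosen for this sketch; Elasticsearch uses
    its own hash function internally.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


doc_ids = [f"company-{n}" for n in range(8)]  # hypothetical document ids
placement = {doc_id: pick_shard(doc_id, 3) for doc_id in doc_ids}
print(placement)
```

The modulo also shows why the number of primary shards is awkward to change later: a different divisor would send existing ids to different shards.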

Elasticsearch: The Definitive Guide is a pretty big book but the individual chapters are pretty short and to the point. As I’d expect from O’Reilly, it is well-edited and readable. I found it great for working out what all the parts of Elasticsearch are, and I now know what exists when it comes to solving live problems. The book is pretty good at telling you which things you can do, and which things you should do.
