
Book review: Docker Up & Running by Karl Matthias and Sean P. Kane

This review was first published at ScraperWiki.

This last week I have been reading Docker Up & Running by Karl Matthias and Sean P. Kane, a newly published book on Docker – a container technology designed to simplify the process of application testing and deployment.

Docker is a very new product, first announced in March 2013, although it is based on older technologies. It has seen rapid uptake by a number of major web-based companies who have open-sourced their tooling for using Docker. We have been using Docker at ScraperWiki for some time, and our most recent projects use it in production. It addresses a common problem for which we have tried a number of technologies in search of a solution.

For a long time I have thought of Docker as providing some sort of cut-down virtual machine; this book made me realise that is the wrong mindset – it is better to think of it as a “process wrapper”. The “Advanced Topics” chapter of the book explains how this is achieved technically. This makes Docker a much lighter-weight, faster proposition than a virtual machine.

Docker is delivered as a single binary containing both client and server components. The client gives you the power to build Docker images and to query the server, which hosts the running Docker containers. The client part of this system will run on Windows, Mac and Linux; the server will only run on Linux because of the specific Linux kernel features that Docker uses to do its stuff. Mac and Windows users can use boot2docker to run a Docker server: boot2docker runs the server inside a minimal Linux virtual machine, which removes some of the performance advantages of Docker but allows you to develop anywhere.

The problem Docker and containerisation are attempting to address is that of capturing the dependencies of an application and delivering them in a convenient package. It allows developers to produce an artefact, the Docker image, which can be handed over to an operations team for deployment without to-ing and fro-ing to get all the dependencies and system requirements fixed.

Docker can also address the problem of a development team onboarding a new member who needs to get the application up and running on their own system in order to develop it. Previously such problems were addressed with a flotilla of technologies of varying strengths and weaknesses: Chef, Puppet, Salt, Juju, virtual machines and so forth. Working at ScraperWiki I saw each of these technologies cause some sort of pain. Docker may or may not take all this pain away but it certainly looks promising.

The Docker image is compiled from instructions in a Dockerfile, which has directives to pull down a base operating system image from a registry, add files, run commands and set configuration. The “image” language is probably where my false impression of Docker as virtualisation comes from. Once we have made the Docker image there are commands to deploy and run it on a server, inspect its logs and debug the running container.
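
For a flavour of that build–run–inspect cycle, here is a rough sketch using the third-party docker-py client library. The image tag and Dockerfile location are made up, this is not code from the book, and it assumes a Docker server listening on its default Unix socket.

    # A sketch of the build / run / inspect cycle via the docker-py client.
    # The image tag "example/webapp" and the Dockerfile in "." are illustrative.
    import docker

    client = docker.Client(base_url="unix://var/run/docker.sock")

    # Build an image from a Dockerfile in the current directory
    for line in client.build(path=".", tag="example/webapp"):
        print(line)

    # Create and start a container from that image
    container = client.create_container(image="example/webapp")
    client.start(container=container["Id"])

    # Inspect the logs of the running container
    print(client.logs(container=container["Id"]))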

Docker is not a “total” solution; it has nothing to say about triggering builds, bringing up hardware or managing clusters of servers. At ScraperWiki we’ve been developing our own systems to do this, which is clearly the approach that many others are taking.

Docker Up & Running is pretty good at laying out what it is you should do with Docker, rather than what you can do with Docker. For example the book makes clear that Docker is best suited to hosting applications which have no state. You can copy files into a Docker container to store data but then you’d need to work out how to preserve those files between instances. Docker containers are expected to be volatile – here today gone tomorrow, or even here now, gone in a minute. The expectation is that you should preserve state outside of a container using environment variables, Amazon’s S3 service or an externally hosted database and so forth – depending on the size of the data. The material in the “Advanced Topics” chapter highlights the possible Docker runtime options (and then advises you not to use them unless you have very specific use cases). There are a couple of whole chapters on Docker in production systems.
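
To make the “state lives outside the container” idea concrete, a containerised application might read everything it needs from environment variables, along the lines of the sketch below; the variable names are made up for illustration.

    # A minimal sketch of a containerised app taking its configuration and
    # state locations from environment variables rather than baked-in files.
    # DATABASE_URL and S3_BUCKET are hypothetical names, not from the book.
    import os

    database_url = os.environ.get("DATABASE_URL", "postgres://localhost/example")
    s3_bucket = os.environ.get("S3_BUCKET", "example-bucket")

    print("Reading and writing state via", database_url)
    print("Archiving larger artefacts to S3 bucket", s3_bucket)

At runtime these values would be passed in with docker run’s -e flag or an env file, so the same image can move between environments unchanged.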

If my intention was to use Docker “live and in anger” then I probably wouldn’t learn how to do so from this book, since the landscape is changing so fast. I might use it to identify what it is that I should do with Docker, rather than what I can do with Docker. For the application side of ScraperWiki’s business the use of Docker is obvious; for the data science side it is not so clear. For our data science work we make heavy use of Python’s virtualenv system, which captures most of our dependencies without being opinionated about data (state).

The book contains information current to at least the beginning of 2015. It is well worth reading as an introduction to and overview of Docker.

Book Review: Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia

This post was first published at ScraperWiki.

Apache Spark is a system for doing data analysis which can be run on a single machine or across a cluster. It is pretty new technology – initial work was in 2009 and Apache adopted it in 2013. There’s a lot of buzz around it, and I have a problem for which it might be appropriate. The goal of Spark is to be faster and more amenable to iterative and interactive development than Hadoop MapReduce – a sort of IPython of Big Data. I used my traditional approach to learning more: buying a dead-tree publication, Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia, and reading it on my commute.

The core of Spark is the resilient distributed dataset (RDD), a data structure which can be distributed over multiple computational nodes. Creating an RDD is as simple as passing a file URL to a constructor (the file may be located on some Hadoop-style filesystem) or parallelizing an in-memory data structure. To this data structure are applied transformations and actions. Transformations produce another RDD from an input RDD; for example filter() returns an RDD which is the result of applying a filter to each row in the input RDD. Actions produce a non-RDD output; for example count() returns the number of elements in an RDD.
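
In PySpark this pattern looks roughly like the sketch below; the log file name is hypothetical and the local master is just for trying things out.

    # A small PySpark sketch of RDDs, transformations and actions.
    # The file "logs.txt" is a hypothetical input.
    from pyspark import SparkContext

    sc = SparkContext("local", "rdd-example")

    # Create RDDs from a file and from an in-memory collection
    lines = sc.textFile("logs.txt")
    numbers = sc.parallelize([1, 2, 3, 4, 5])

    # Transformations are lazy and return new RDDs
    errors = lines.filter(lambda line: "ERROR" in line)
    squares = numbers.map(lambda x: x * x)

    # Actions trigger the computation and return ordinary values
    print(errors.count())     # number of lines containing "ERROR"
    print(squares.collect())  # [1, 4, 9, 16, 25]

    sc.stop()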

Spark provides functionality to control how parts of an RDD are distributed over the available nodes, for example by key. In addition there is functionality to share data across multiple nodes using “broadcast variables”, and to aggregate results in “accumulators”. The behaviour of accumulators in a distributed system can be complicated, since Spark might execute the same piece of processing more than once, for example when recovering from problems on a node.
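
A short sketch of both features; the country-code lookup and phone-number records are invented for the example.

    # Broadcast variables ship a read-only value to every worker once;
    # accumulators gather simple aggregates (here, a count of bad records).
    from pyspark import SparkContext

    sc = SparkContext("local", "shared-variables-example")

    country_lookup = sc.broadcast({"44": "UK", "1": "US"})  # read-only on workers
    bad_records = sc.accumulator(0)                         # workers may only add

    def parse(record):
        code, number = record.split(":")
        if code not in country_lookup.value:
            bad_records.add(1)
            return None
        return (country_lookup.value[code], number)

    records = sc.parallelize(["44:7700900000", "1:2025550100", "99:0000000000"])
    parsed = records.map(parse).filter(lambda x: x is not None)

    print(parsed.collect())   # [('UK', '7700900000'), ('US', '2025550100')]
    print(bad_records.value)  # 1

    sc.stop()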

In addition to Spark Core there are Spark Streaming, Spark SQL, MLlib machine learning, GraphX and SparkR modules. Learning Spark covers the first three of these. The Streaming module handles data, such as log files, which grow continually over time, using a DStream structure which is composed of a sequence of RDDs with some additional time-related functions. Spark SQL introduces the DataFrame data structure (previously called SchemaRDD) which enables SQL-like queries using HiveQL. The MLlib library introduces a whole bunch of machine learning algorithms such as decision trees, random forests, support vector machines, naive Bayes and logistic regression. It also has support routines to normalise and analyse data, as well as clustering and dimension reduction algorithms.
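
For a flavour of Spark SQL, here is a sketch using the Python API of roughly the 1.3 era (the book’s examples use the older SchemaRDD interface); the JSON file, with one record per line, is hypothetical.

    # A sketch of Spark SQL: load semi-structured data and query it with SQL.
    # The file "people.json" (one JSON object per line) is hypothetical.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local", "sql-example")
    sqlContext = SQLContext(sc)

    people = sqlContext.jsonFile("people.json")  # a DataFrame (SchemaRDD in older releases)
    people.registerTempTable("people")

    adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
    for row in adults.collect():
        print(row.name, row.age)

    sc.stop()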

All of this functionality looks pretty straightforward to access, and example code is provided for Scala, Java and Python. Scala is a functional language which runs on the Java virtual machine, so it appears to get functionality equivalent to Java’s. Python, on the other hand, appears to be a second-class citizen: some functionality, particularly around I/O, lacks Python support. This raises the question of whether one should start analysis in Python and switch as and when required, or start in Scala or Java, to which you may well be forced eventually anyway. Perhaps the intended usage is Python for prototyping and Java/Scala for production.

The book, like Spark itself, is pitched at two audiences: data scientists and software engineers. This would explain the support for Python and (more recently) R, to keep the data scientists happy, and Java/Scala for the software engineers. I must admit that looking at examples in Python and Java together, I remember why I love Python! Java requires quite a lot of class declaration boilerplate to get it into the air, and brackets.

Spark will run on a standalone machine; I got it running on Windows 8.1 in short order. Analysis programs appear to be deployable to a cluster unaltered, with the changes handled in configuration files and command line options. The feeling I get from Spark is that it would be entirely appropriate to use it for analysis you might otherwise do locally with pandas or scikit-learn, and if necessary you could scale up onto a cluster with relatively little additional effort, rather than having to learn some fraction of the Hadoop ecosystem.

The book suffers a little from covering a subject area which is rapidly developing: as of early June 2015 Spark is at version 1.4, the book covers version 1.1, and things are happening fast. For example GraphX and SparkR, more recent additions to Spark, are not covered. That said, this is a great little introduction to Spark, and I’m now minded to go off and apply my new-found knowledge to the Kaggle – Avito Context Ad Clicks challenge!

Book review: Mastering Gephi Network Visualisation by Ken Cherven

This review was first posted at ScraperWiki.

A little while ago I reviewed Ken Cherven’s book Network Graph Analysis and Visualisation with Gephi; it’s fair to say I was not very complimentary about it. It was rather short, and had quite a lot of screenshots. Its strength was in introducing every single element of the Gephi interface. This book, Mastering Gephi Network Visualisation by Ken Cherven, is a different, and better, book.

Networks in this context are collections of nodes connected by edges, and networks are ubiquitous. The nodes may be people in a social network, and the edges their friendships. Or the nodes might be proteins and metabolic products and the edges the reaction pathways between them. Or any other of a multitude of systems. I’ve reviewed a couple of other books in this area, including Barabási’s popular account of the pervasiveness of networks, Linked, and van Steen’s undergraduate textbook, Graph Theory and Complex Networks, which cover the maths of network (or graph) theory in some detail.

Mastering Gephi is a practical guide to using the Gephi network visualisation software; it covers the more theoretical material regarding networks only in a peripheral fashion. Gephi is the most popular open source network visualisation system of which I’m aware; it is well-featured and under active development. Many of the network visualisations you see of, for example, Twitter social networks will have been generated using Gephi. It is a pretty complex piece of software, and if you don’t want to rely on information on the web, or on taught courses, then Cherven’s books are pretty much your only alternative.

The core chapters are on layouts, filters, statistics, segmenting and partitioning, and dynamic networks. Outside this there are some more general chapters, including one on exporting visualisations and an odd one on “network patterns” which introduced diffusion and contagion in networks but then didn’t go much further.

I found the layouts chapter particularly useful; it’s a review of the various layout algorithms available. In most cases there is no “correct” way of drawing a network on a 2D canvas: layout algorithms are designed to distribute nodes and edges on a canvas to enable the viewer to gain understanding of the network they represent. From this chapter I discovered the directed acyclic graph (DAG) layout, which can be downloaded as a Gephi plugin. Tip: I had to go and search this plugin out manually in the Gephi Marketplace; it didn’t get installed when I indiscriminately tried to install all plugins. The DAG layout is good for showing tree structures such as organisational diagrams.

I learnt of the “Chinese Whispers” and “Markov clustering” algorithms for identifying clusters within a network in the chapter on segmenting and partitioning. These algorithms are not covered in detail but sufficient information is provided that you can try them out on a network of your choice, and go look up more information on their implementation if desired. The filtering chapter is very much about the mechanics of how to do a thing in Gephi (filter a network to show a subset of nodes), whilst the statistics chapter is more about the range of network statistical measures known in the literature.

I was aware of the ability of Gephi to show dynamic networks, ones that evolve over time, but had never experimented with this functionality. Cherven’s book provides an overview of this functionality using data from baseball as an example. The example datasets are quite appealing; they include social networks in schools, baseball, and jazz musicians. I suspect they are standard examples in the network literature, but this is no bad thing.

The book follows the advice that my old PhD supervisor gave me on giving presentations: tell the audience what you are going to tell them, tell them, and then tell them what you told them. This works well for the limited time available in a spoken presentation, where repetition helps the audience remember, but it feels a bit like overkill in a book. In a book we can flick back to remind ourselves what was written earlier.

It’s a bit frustrating that the book is printed in black and white, particularly at the point where we are asked to admire the blue and yellow parts of a network visualisation! The referencing is a little erratic, with a list of books appearing in the bibliography but references to some of the details of algorithms found only in the text.

I’m happy to recommend this book as a solid overview of Gephi for those that prefer to learn from dead tree, such as myself. It has good coverage of Gephi features, and some interesting examples. In places it is a little shallow and repetitive.

The publisher sent me this book, free of charge, for review.

Book review: Cryptocurrency by Paul Vigna and Michael J. Casey

This review was first posted at ScraperWiki.

Amongst hipster start-ups in the tech industry Bitcoin has been a thing for a while. As one of the more elderly members of this community I wanted to understand a bit more about it. Cryptocurrency: How Bitcoin and Digital Money are Challenging the Global Economic Order by Paul Vigna and Michael Casey fits this bill.

Bitcoin is a digital currency: a Bitcoin has a value which can be exchanged against other currencies but it has no physical manifestation. The really interesting thing is how Bitcoins move around without any central authority; there is no Bitcoin equivalent of the Visa or BACS payment systems with their attendant organisations, nor a central bank as in the case of a normal currency. This division between Bitcoin as a currency and Bitcoin as a decentralised exchange mechanism is really important.

Conventional payment systems like Visa have a central organisation which charges retailers a percentage on every payment made using their system. This is exceedingly lucrative. Bitcoin replaces this with the blockchain – a distributed, cryptographically verified ledger of transactions. The validation is carried out by so-called ‘miners’ who are paid in Bitcoin for carrying out a computationally intensive proof-of-work task, which ensures the scarcity of Bitcoin and helps maintain its value. In principle anybody can be a Bitcoin miner; all they need is the required free software and the hardware to run it on. The generation of new Bitcoin is strictly controlled by the fundamental underpinnings of the blockchain software. Bitcoin miners are engaged in a hardware arms race with each other as they compete to complete blocks on the blockchain: more processing power equals more chances to complete blocks ahead of the competition and hence win more Bitcoin. In practice, mining meaningful quantities these days requires significant, highly specialised hardware.
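
To get a feel for why mining is computationally intensive yet cheap to verify, here is a toy proof-of-work sketch. It is a drastic simplification – real Bitcoin mining hashes block headers with double SHA-256 against a dynamically adjusted difficulty target – but the shape of the work is the same.

    # A toy proof-of-work: find a nonce such that the hash of (data + nonce)
    # starts with a given number of zero hex digits. This is an illustration,
    # not the real Bitcoin mining algorithm.
    import hashlib

    def mine(block_data, difficulty=5):
        nonce = 0
        target = "0" * difficulty
        while True:
            digest = hashlib.sha256((block_data + str(nonce)).encode()).hexdigest()
            if digest.startswith(target):
                return nonce, digest
            nonce += 1

    nonce, digest = mine("some example transactions")
    print("nonce:", nonce)
    print("hash:", digest)

    # Verification is cheap: anyone can recompute a single hash to check the work.
    check = hashlib.sha256(("some example transactions" + str(nonce)).encode()).hexdigest()
    assert check == digest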

Vigna and Casey provide a history of Bitcoin, starting with a bit of background as to how economists see currency; this amounts to the familiar division between the Austrian school and the Keynesians. The Austrians are interested in currency as gold, whilst the Keynesians are interested in currency as a medium for exchange. As a currency Bitcoin doesn’t appeal to Keynesians since there is no “quantitative easing” in Bitcoin – the government can’t print money.

Bitcoin did not appear from nowhere: during the late 90s and the early years of the 21st century there were corporate attempts at building digital currencies. These died away; they had the air of lone wolf operations hidden within corporate structures which met their end perhaps when they filtered up to a certain level and their threat to the current business model was revealed. Or perhaps in the chaos of the financial collapse.

More pertinently there were the cypherpunks, a group interested in cryptography operating on the non-governmental, non-corporate side of the community. This group was also experimenting with ideas around digital currencies. This culminated in 2008 with the launch of Bitcoin, by the elusive Satoshi Nakamoto, to a cryptography mailing list. Nakamoto has since disappeared; no one has ever met him, and no one knows whether the name is a pseudonym for one of the cypherpunks, and if so, which one.

Following its release Bitcoin experienced a period of organic growth among cryptography enthusiasts and the technically curious. As the Bitcoin currency grew, an ecosystem started to grow around it, beginning with more user-friendly routes to accessing the blockchain: wallets to hold your Bitcoins, digital currency exchanges and tools to inspect the transactions on the blockchain.

Bitcoin has suffered reverses, most notoriously the collapse of the Mt Gox currency exchange and its use in the Silk Road market, which specialised in illegal merchandise. The Mt Gox collapse demonstrated both flaws in the underlying protocol and its vulnerability to poorly managed components in the ecosystem. Alongside this has been the wildly fluctuating value of the Bitcoin against other conventional currencies.

One of the early case studies in Cryptocurrency is of women in Afghanistan, forbidden by social pressure if not actual law from owning private bank accounts. Bitcoin provides them with a means for gaining independence and control over at least some financial resources. There is the prospect of it becoming the basis of a currency exchange system for the developing world where transferring money within a country or sending money home from the developed world are as yet unsolved problems, beset both with uncertainty and high costs.

To my mind Bitcoin is an interesting idea: as a traditional currency it feels like a non-starter, but as a decentralised transaction mechanism it looks very promising. The problem with decentralisation is: who do you hold accountable? In two senses: firstly the technical sense – what if the software is flawed? Secondly, conventional currencies are backed by countries not software; a country has a stake in the success of a currency and the means to execute strategies to protect it. Bitcoin has the original vision of a vanished creator, and a very small team of core developers. As an aside, Vigna and Casey point out there is a limit within Bitcoin of 7 transactions per second, which compares with 10,000 transactions per second handled by the Visa network.

It’s difficult to see what the future holds for Bitcoin, but Vigna and Casey run through some plausible scenarios. Cryptocurrency is well-written, comprehensive and pitched at the right technical level.

Adventures in Kaggle: Forest Cover Type Prediction

This post was first published at ScraperWiki.

Regular readers of this blog will know I’ve read quite a few machine learning books; now it is time to put this learning into action. We’ve done some machine learning for clients, but I thought it would be good to do something I could share. The Forest Cover Type Prediction challenge on Kaggle seemed to fit the bill. Kaggle is the self-styled home of data science; it hosts a variety of machine learning oriented competitions, ranging from introductory, knowledge-building ones (such as this one) to commercial ones with cash prizes for the winners.

In the Forest Cover Type Prediction challenge we are asked to predict the type of tree found on 30x30m squares of the Roosevelt National Forest in northern Colorado. The features we are given include the altitude at which the land is found, its aspect (direction it faces), various distances to features like roads, rivers and fire ignition points, soil types and so forth. We are provided with a training set of around 15,000 entries where the tree types are given (Aspen, Cottonwood, Douglas Fir and so forth) for each 30x30m square, and a test set for which we are to predict the tree type given the “features”. This test set runs to around 500,000 entries. This is a straightforward supervised machine learning “classification” problem.

The first step must be to poke about at the data, and I did a lot of this in Tableau. The feature most obviously providing predictive power is the elevation, or altitude, of the area of interest. This is shown in the figure below for the training set: we see Ponderosa Pine and Cottonwood predominating at lower altitudes, transitioning to Aspen, Spruce/Fir and finally Krummholz at the highest altitudes. Reading Wikipedia we discover that Krummholz is not actually a species of tree, rather something that happens to trees of several species in the cold, windswept conditions found at high altitude.

[Figure 1: Tree type as a function of elevation in the training set]

Data inspection over, I used the scikit-learn library in Python to predict tree type from the features. scikit-learn makes it ridiculously easy to jump between classifier types: the interface for each classifier is the same, so once you have one running, swapping in another classifier is a matter of a couple of lines of code. I tried out a couple of variants of support vector machines, decision trees, k-nearest neighbours, AdaBoost and the extremely randomised trees ensemble classifier (ExtraTrees). This last was best at classifying the training set.
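
A minimal sketch of that interchangeability – the random features and labels below simply stand in for the competition data:

    # scikit-learn classifiers share the same fit/predict interface, so
    # swapping one for another is a one-line change. The random X and y
    # here stand in for the competition's features and cover types.
    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.RandomState(0)
    X = rng.rand(1000, 10)            # stand-in features
    y = rng.randint(0, 7, size=1000)  # stand-in cover types, 7 classes

    X_train, y_train = X[:700], y[:700]
    X_test, y_test = X[700:], y[700:]

    for clf in [ExtraTreesClassifier(n_estimators=100, random_state=0),
                RandomForestClassifier(n_estimators=100, random_state=0),
                KNeighborsClassifier()]:
        clf.fit(X_train, y_train)
        print(type(clf).__name__, clf.score(X_test, y_test))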

The challenge is in mangling the data into the right shape and selecting the features to use; this is the sort of pragmatic knowledge learnt by experience rather than book-learning. As a long-time data analyst I took the opportunity to try something: essentially my analysis programs would only run when the code had been committed to git source control, and the SHA of the commit, its unique identifier, was stored with the analysis. This means that I can return to any analysis output and recreate it from scratch. Perhaps unexceptional for those with a strong software development background, but a small novelty for a scientist.
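
The mechanics of this are simple – something along the lines of the sketch below (the function names are illustrative, not lifted from my competition code):

    # Refuse to run with uncommitted changes, and record the git commit SHA
    # alongside the analysis output so any result can be recreated later.
    import subprocess

    def git_sha():
        return subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()

    def working_tree_clean():
        status = subprocess.check_output(["git", "status", "--porcelain"]).decode()
        return status.strip() == ""

    if not working_tree_clean():
        raise RuntimeError("Commit your changes before running the analysis")

    print("Running analysis at commit", git_sha())
    # ... run the analysis and store git_sha() with its outputs ...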

Using a portion of the training set to do an evaluation, it looked like I was going to do really well on the Kaggle leaderboard, but on first uploading my competition solution things looked terrible! It turns out this was a common experience and is a result of the relative composition of the training and test sets. Put crudely, the test set is biased to higher altitudes than the training set, so using a classifier which has been trained on the unmodified training set leads to poorer results than expected based on measurements on a held-back part of the training set. You can see the distribution of elevation in the test set below, and compare it with the training set above.

[Figure 2: Distribution of elevation in the test set]

We can fix this problem by biasing the training set to more closely resemble the test set; I did this on the basis of the elevation. This eventually got me to rank 430 on the leaderboard, shown in the figure below. We can see here that I’m somewhere up the long, shallow plateau of performance. There is a breakaway group of about 30 participants doing much better, and at the bottom there are people who perhaps made large errors in analysis but got rescued by the robustness of machine learning algorithms (I speak from experience here!).

[Figure 3: My position on the Kaggle leaderboard]
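
One way of doing that sort of elevation-based rebalancing is to resample the training set with weights taken from the ratio of the two elevation histograms; the sketch below is illustrative, with made-up file and column names, rather than my exact procedure.

    # Resample the training set so its elevation distribution more closely
    # matches the test set's: weight each training row by the ratio of test
    # to training counts in its elevation bin. File/column names illustrative.
    import numpy as np
    import pandas as pd

    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")

    bins = np.arange(1800, 4000, 100)  # 100 m elevation bins
    train_counts, _ = np.histogram(train["Elevation"], bins=bins)
    test_counts, _ = np.histogram(test["Elevation"], bins=bins)

    # Per-bin weight: how over- or under-represented the bin is in training
    bin_weights = test_counts.astype(float) / np.maximum(train_counts, 1)
    row_bins = np.clip(np.digitize(train["Elevation"], bins) - 1, 0, len(bin_weights) - 1)
    row_weights = bin_weights[row_bins]

    # Draw a new training set (with replacement) according to those weights
    resampled = train.sample(n=len(train), replace=True,
                             weights=row_weights, random_state=0)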

There is no doubt some mileage in tuning the parameters of the different classifiers and no doubt winning entries use more sophisticated approaches. scikit-learn does pretty well out of the box, and tuning it provides marginal improvement. We observed this in our earlier machine learning work too.

I have mixed feelings about the Kaggle competitions. The data is nicely laid out, the problems are interesting and it’s always fun to compete. They are a great way to dip your toes into semi-practical machine learning applications. The size of the awards means it doesn’t make much sense to take part on a commercial basis.

However, the data are presented so as to exclude the use of domain knowledge; they are set up very much as machine learning challenges. Look down the list of competitions and see how many of them feature obfuscated data, likely for reasons of commercial confidence or to make a problem more “machine learning” and less amenable to domain knowledge. To a physicist this is just a bit offensive.

If you are interested in a slightly untidy blow by blow account of my coding then it is available here in a Bitbucket Repo.