Category: Technology

Programming, gadgets (reviews thereof) and computers

Adventures in Kaggle: Forest Cover Type Prediction


This post was first published at ScraperWiki.

Regular readers of this blog will know I’ve read quite a few machine learning books; now it’s time to put that learning into action. We’ve done some machine learning for clients, but I thought it would be good to do something I could share. The Forest Cover Type Prediction challenge on Kaggle seemed to fit the bill. Kaggle is the self-styled home of data science; it hosts a variety of machine learning oriented competitions, ranging from introductory, knowledge-building challenges (such as this one) to commercial ones with cash prizes for the winners.

In the Forest Cover Type Prediction challenge we are asked to predict the type of tree found on 30x30m squares of the Roosevelt National Forest in northern Colorado. The features we are given include the altitude at which the land is found, its aspect (direction it faces), various distances to features like roads, rivers and fire ignition points, soil types and so forth. We are provided with a training set of around 15,000 entries where the tree types are given (Aspen, Cottonwood, Douglas Fir and so forth) for each 30x30m square, and a test set for which we are to predict the tree type given the “features”. This test set runs to around 500,000 entries. This is a straightforward supervised machine learning “classification” problem.

The first step must be to poke about at the data, and I did a lot of this in Tableau. The feature most obviously providing predictive power is the elevation, or altitude, of the area of interest. This is shown in the figure below for the training set: we see Ponderosa Pine and Cottonwood predominating at lower altitudes, transitioning to Aspen, Spruce/Fir and finally Krummholz at the highest altitudes. Reading Wikipedia we discover that Krummholz is not actually a species of tree, rather something that happens to trees of several species in the cold, windswept conditions found at high altitude.

[Figure 1: elevation distribution by tree type in the training set]
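
For anyone who prefers code to Tableau, a minimal pandas/matplotlib sketch of this sort of inspection might look like the following; it assumes the competition’s train.csv with its Elevation and Cover_Type columns, and is illustrative rather than the exact plot above.

import pandas as pd
import matplotlib.pyplot as plt

# Assumes the Kaggle competition's train.csv, which has Elevation and Cover_Type columns
train = pd.read_csv("train.csv")

fig, ax = plt.subplots()
for cover_type, group in train.groupby("Cover_Type"):
    # Overlay a semi-transparent elevation histogram for each cover type
    group["Elevation"].plot.hist(bins=50, alpha=0.4, ax=ax, label=str(cover_type))
ax.set_xlabel("Elevation (m)")
ax.set_ylabel("Count")
ax.legend(title="Cover type")
plt.show()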

Data inspection over, I used the scikit-learn library in Python to predict tree type from the features. scikit-learn makes it ridiculously easy to jump between classifier types: the interface for each classifier is the same, so once you have one running, swapping in another classifier is a matter of a couple of lines of code. I tried out a couple of variants of Support Vector Machines, decision trees, k-nearest neighbours, AdaBoost and the extremely randomised trees ensemble classifier (ExtraTrees). This last was best at classifying the training set.
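
To illustrate what I mean by the uniform interface, here is a minimal scikit-learn sketch (not my actual competition code); swapping classifiers really is just a matter of changing the line that constructs the classifier.

import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Load the training data; Id is dropped, Cover_Type is the label
train = pd.read_csv("train.csv")
features = train.drop(columns=["Id", "Cover_Type"])
labels = train["Cover_Type"]

# Swapping in a different classifier only means changing this line,
# e.g. RandomForestClassifier(), KNeighborsClassifier() or SVC()
classifier = ExtraTreesClassifier(n_estimators=100)

# Every classifier exposes the same fit/predict interface, so evaluation
# code like this cross-validation is unchanged whichever classifier you choose
scores = cross_val_score(classifier, features, labels, cv=5)
print(scores.mean())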

The challenge is in mangling the data into the right shape and selecting the features to use; this is the sort of pragmatic knowledge learnt by experience rather than book-learning. As a long-time data analyst I took the opportunity to try something: essentially my analysis programs would only run when the code had been committed to git source control, and the SHA of the commit, its unique identifier, was stored with the analysis. This means that I can return to any analysis output and recreate it from scratch. Perhaps unexceptional for those with a strong software development background, but a small novelty for a scientist.
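
A rough sketch of the idea (not the actual code in my repository): refuse to run on an uncommitted working copy, and stamp every output with the commit SHA that produced it.

import subprocess

def current_commit_sha():
    """Return the SHA of the current git commit, refusing a dirty working copy."""
    if subprocess.check_output(["git", "status", "--porcelain"]).decode().strip():
        raise RuntimeError("Uncommitted changes - commit before running the analysis")
    return subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()

# Stamp an output file with the commit that produced it
sha = current_commit_sha()
with open("analysis_output_{}.txt".format(sha[:8]), "w") as output:
    output.write("Analysis produced at commit {}\n".format(sha))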

Using a portion of the training set to do an evaluation, it looked like I was going to do really well on the Kaggle leaderboard, but on first uploading my competition solution things looked terrible! It turns out this was a common experience and is a result of the relative composition of the training and test sets. Put crudely, the test set is biased to higher altitudes than the training set, so using a classifier which has been trained on the unmodified training set leads to poorer results than expected based on measurements on a held-back part of the training set. You can see the distribution of elevation in the test set below, and compare it with the training set above.

[Figure 2: elevation distribution in the test set]

We can fix this problem by biasing the training set to more closely resemble the test set; I did this on the basis of elevation. This eventually got me to rank 430 on the leaderboard, shown in the figure below. We can see here that I’m somewhere up the long, shallow plateau of performance. There is a breakaway group of about 30 participants doing much better, and at the bottom there are people who perhaps made large errors in analysis but got rescued by the robustness of machine learning algorithms (I speak from experience here!).

[Figure 3: the Kaggle leaderboard]
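
A minimal sketch of one way to bias the training set on elevation: resample it with weights set by the ratio of test to training elevation densities. The bin edges and the resampling-with-replacement approach here are illustrative assumptions rather than exactly what I did.

import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Histogram elevation in both sets over common bins
bins = np.linspace(1800, 4000, 45)
train_counts, _ = np.histogram(train["Elevation"], bins=bins)
test_counts, _ = np.histogram(test["Elevation"], bins=bins)

# Weight each training row by the ratio of test to training density in its elevation bin
bin_index = np.clip(np.digitize(train["Elevation"], bins) - 1, 0, len(bins) - 2)
weights = (test_counts / np.maximum(train_counts, 1))[bin_index]

# Resample the training set (with replacement) according to these weights
biased_train = train.sample(n=len(train), replace=True, weights=weights, random_state=42)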

There is no doubt some mileage in tuning the parameters of the different classifiers, and no doubt the winning entries use more sophisticated approaches. scikit-learn does pretty well out of the box, and tuning it provides marginal improvement. We observed this in our earlier machine learning work too.
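
For what it’s worth, this is the sort of tuning I mean – a hedged sketch using scikit-learn’s grid search over a couple of ExtraTrees parameters; the particular grid values are just for illustration.

import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

train = pd.read_csv("train.csv")
features = train.drop(columns=["Id", "Cover_Type"])
labels = train["Cover_Type"]

# An illustrative grid - the point is the mechanism, not these particular values
param_grid = {"n_estimators": [100, 300], "max_features": ["sqrt", None]}
search = GridSearchCV(ExtraTreesClassifier(), param_grid, cv=5)
search.fit(features, labels)
print(search.best_params_, search.best_score_)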

I have mixed feelings about the Kaggle competitions. The data is nicely laid out, the problems are interesting and it’s always fun to compete. They are a great way to dip your toes into semi-practical machine learning applications. The size of the awards means it doesn’t make much sense to take part on a commercial basis.

However, the data are presented so as to exclude the use of domain knowledge; they are set up very much as machine learning challenges – look down the list of competitions and see how many of them feature obfuscated data, likely for reasons of commercial confidence or to make a problem more of a “machine learning” exercise and less amenable to domain knowledge. To a physicist this is just a bit offensive.

If you are interested in a slightly untidy blow-by-blow account of my coding then it is available here in a Bitbucket Repo.

Git notes

I’ve discovered that my blog is actually a good place to put things I need to remember – see, for example, my blog post on running Ubuntu in a VM on Windows 8.

In this spirit here are my notes on using git, the distributed version control system (DVCS). These are things I picked up around the office at ScraperWiki; I wrote something there about the scheme we use for Git. This is more a compendium of useful git commands.

I use Git on both Windows and Ubuntu and I have accounts with both GitHub and Bitbucket. I’ve configured ssh on my Windows and Ubuntu machines and use that for authentication. On Windows I interact with Git using Git Bash.

Installation

On installing Git I do the following setup, obviously using my own name and email:

git config --global user.name "John Doe"
git config --global user.email johndoe@example.com
git config --global core.editor vim

I can list my config settings using:

git config -l

Starting a repo

To start a new repo we do:

git init

These days I feel bereft if I’m not “pushing” my local repository to an online repository like GitHub or Bitbucket. To add a remote repository, create one using the service of your choice, which will probably ask you to do:

git remote add origin [url]

Alternatively you can clone an existing repository into a subdirectory of your current directory with the name of the repo:

git clone [url]

This one clones into the current directory, making a mess if that’s not what you intended!

git clone [url] .

A variant, if you are using a repo with submodules in it, is:

git clone --recursive [url]

If you forgot to do the above on first cloning then you can do:

git submodule update --init

Adding and committing files

If you’ve started a new repository then you need to add some files to track:

git add [filename]

You don’t have to commit all the changes you made since the last commit; you can select them using the -p option:

git add -p

And commit them to the repository with a commit command like:

git commit -m [message]

Alternatively you can add the commit message in your favoured editor, with the difference from the previous commit shown below it:

git commit -a -v

I tend to use a remote repository as a backup, so I regularly do:

git push origin HEAD

If someone else is working on the same repository as you then things get more complicated, but that’s beyond the scope of this post.

Undoing things

If you get your commit message wrong you can edit it with:

git commit --amend

If you change your mind about staging a file for commit:

git reset HEAD [filename]

If you change your mind about the modifications you have made to a file since the last commit then you can revert to the last commit using this **destructive** command:

git checkout -- [filename]

You should be careful doing that since it will obliterate any changes you’ve made to a file, even if you saved them from the editor.

Working out where you are

You can list files in the repo with:

git ls-tree --full-tree -r HEAD

The general command for seeing what is going on is:

git status

This tells you if you have made edits which have not been staged, which branch you are on and files which are not being tracked. Whilst you are working you can see the difference from the previous commit using:

git diff

If you’ve already added files to commit then you need to do:

git diff --cached

You can see a list of all your commits using:

git log

The following command gives you more information, in a more compact form:

git log --oneline --graph --decorate

This is a good way of seeing the status of your branch and the other branches in the repository. I have aliased this set of log options as:

git lg

To do this I added the following to my ~/.gitconfig file:

[alias]
        lg = log --oneline --graph --decorate

Once you’ve committed a bunch of changes you might want to push them to a remote server. This pushes to the remote called origin, and HEAD ensures you push to your current branch; HEAD is Git’s shorthand for the latest commit on the current branch:

git push origin HEAD

Branches

The preceding commands are how you’d work using a single master branch, if you were working alone on something simple, for example. If you are working with other people, or on something more complicated, then you probably want to work on a branch. You can make a new branch by doing:

git checkout -b [branch name]

You can find out what other branches are available by doing:

git branch -v -a

Once you are on a branch you can commit changes, and push them onto your remote server, just as if you were on the master branch.

Merging and rebasing

The excitement comes when you want to merge your changes onto the master branch, or you want to get changes to your own branch that someone else has made and pushed to the remote repository. The quick and dirty way to do this is using:

git pull

This does a fetch and merge all at the same time. The better way is to fetch the changes and then merge them:

git fetch --prune --all
git merge origin/master

If you are working with someone else then you may prefer to merge changes onto the master branch by making a pull request on GitHub or BitBucket.

Accepting Pull Requests from Forks

If someone makes a Pull Request based on their forked copy of a repo then you can download it for testing by doing:

git fetch origin pull/ID/head:BRANCHNAME

To infinity and beyond! Or how I replaced my hard disk with an SSD

Nearly two years ago I bought a Sony Vaio T13 laptop as my combined work and home main computer. It’s a nice piece of kit and has served me well. When I bought it I commented that I’d like to have had an SSD drive rather than the conventional 500GB “spinning rust” hybrid drive with which it came. At the time specifying a 512GB SSD from the outset was eyewateringly expensive.

Nearly two years later I finally got around to making the upgrade! And it was remarkably straightforward. Largely through laziness I went for the 512GB SSD I’d identified nearly two years ago – the Samsung 840 Pro; the price had dropped from something like £450 to £230. There are cheaper options, down to about £150, but I’m already saving money if the Pro started at £450* ;-)

The drive itself is rather insubstantial, turning up in a padded envelope that just doesn’t feel heavy enough. It comes with a CD containing Data Migration software and some wizard diagnostic software. The migration software clones your current hard drive to the new SSD. You need a SATA to USB adaptor, like this one, to do this. Cloning my drive took about 5 hours, but I needed to decrypt it first, which was a 24-hour or so job – my drive was encrypted with Bitlocker.

With the SSD containing a clone of the original drive in hand, all that is required is to open up the laptop and swap the drives over. This turns out to be really easy on the Vaio: unscrew the battery, unscrew the drive compartment cover, unscrew the drive cage from the laptop, remove the old hard drive, remove the drive cage from the old drive, put the drive cage on the new drive and then repeat the steps in reverse to install the new drive.

A set of dinky screwdrivers is handy, and it would have helped if I’d realised the drive cage was screwed to the laptop frame before I started prising at it with the big screwdriver, but no harm done.

I actually found replacing the hard drive on my laptop easier than replacing or adding a drive to a desktop. Whenever I’ve added a hard drive to a desktop there has been cursing and skinned knuckles in removing/adding the power connector, and unseemly cowboying of the drive into some ill-fitting drive cage using left-over grub screws. Compared to this, working on the Vaio was a joy. You might want to check the “lie of the land” on your own model of laptop in terms of accessibility to the drive. My suspicion is that it will be generally straightforward, since laptops often have the drive size as an option.

The moment of truth is rebooting after installation – this Just Worked™, which was a relief. First hints of improved performance came in re-encrypting the hard drive: decrypting the conventional drive the peak IO transfer rate was about 30MB/s, whilst with the SSD the peak was around 150MB/s. Opening up Microsoft Office applications is much snappier, as is opening Sublime Text. I should probably have a go at uploading a multi-gigabyte CSV file to MySQL, which I know is heavily IO bound, but I can’t be bothered. All in all my laptop just feels rather more responsive.

I played a bit with the supplied diagnostic wizard software but didn’t think much of it, so promptly uninstalled it.

Overall: much more straightforward and less scary than I anticipated – I recommend this approach to anyone with a laptop to refresh and a modicum of courage.

*This logic brought to you by Mrs SomeBeans and her Yamaha Thundercat purchase!

NewsReader – the developers’ story


This post was first published at ScraperWiki.

ScraperWiki has been a partner in NewsReader, an EU Framework 7 research project, for the last couple of years. The aim of NewsReader is to give computers the power to “understand” the news; to extract from a myriad of news articles the underlying events which gave rise to those articles; the who, the where, the why and the what of those events. The project comprises academic researchers specialising in computational linguistics (VUA in Amsterdam, EHU in the Basque Country and FBK in Trento), Lexis Nexis – a major news aggregator – and a couple of small technology companies: ourselves at ScraperWiki and SynerScope – a Dutch startup specialising in the visualisation of complex networks.

Our role at ScraperWiki is in providing mechanisms to enable developers to exploit the NewsReader technology, and to feed news into the system. As part of this work we have developed a simple REST API which gives access to the KnowledgeStore, the system which underpins NewsReader. The native query language of the KnowledgeStore is SPARQL – the query language of the semantic web. The Simple API provides a set of predefined queries which are easier for end users to work with than raw SPARQL, and help us as service managers by providing a predictable set of optimised queries. If you want to know more technical detail then we’ve written a paper about it (here).
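
To give a flavour of what calling such an API looks like from Python, here is a minimal sketch; the base URL, query name and parameters below are placeholders invented for illustration, not the real Simple API – consult the API’s own documentation at its root URL for the real ones.

import requests

# Placeholder values for illustration only - not the real Simple API endpoints
BASE_URL = "https://example.org/simple-api"
QUERY = "event_details_filtered_by_actor"
PARAMS = {"filter": "Sepp Blatter", "output": "json"}

response = requests.get("{}/{}".format(BASE_URL, QUERY), params=PARAMS)
response.raise_for_status()

# Print whatever rows the (hypothetical) JSON payload contains
for row in response.json().get("payload", []):
    print(row)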

The Simple API has seen live action at a Hack Day on World Cup news which we held in London in the summer. Attendees were able to develop a range of applications which probed violence, money and corruption in the realm of the World Cup. I blogged about our previous Hack Day here and here. The Simple API and the Hack Day helped us shake out some bugs and add features which will make it even better next time.

“Next time” is another Hack Day, to be held in Amsterdam on 21st January 2015 and in London on 30th January 2015. This time we have processed 6,000,000 articles relating to the car industry over the period 2005-2014. The motor industry is a trillion-dollar-a-year business, so we can anticipate finding lots of valuable information in this hoard.

From our previous experience the three things that NewsReader excels at are:

  1. Finding networks of interactions, identifying important players. For the World Cup Hack Day we at ScraperWiki were handicapped slightly by having no interest in football! But the NewsReader technology enabled us to quickly identify that “Sepp Blatter”, “Jack Warner” and “Mohammed bin Hammam” were important in world football. This is illustrated in a slightly cryptic visualisation made using Gephi;
  2. Finding events of a particular type. The NewsReader technology carries out semantic role labelling: taking sentences and identifying what type of event is described in each sentence and what roles the participants took. This information is then aggregated and exposed using semantic web technology. At the World Cup Hack Day participants used this functionality to identify events involving violence, bribery, gambling, and other financial transactions;
  3. Establishing timelines. In the World Cup data we could track the events involving “Mohammed bin Hammam” through time and the types of event he was involved in. This enabled us to quickly navigate to pertinent news articles.

You can see fragments of code used to extract these data using the Simple API in these GitHub Gists (here and here), and dynamic visualisations illustrating these three features here and here.

The Simple API is up and running already; you can find it (here). It is self-documenting: simply visit the root URL and you’ll see query examples with optional and compulsory parameters. Be aware though: the Simple API is under active development, and the underlying data in the KnowledgeStore is being optimised for the Hack Days, so it may not be available when you visit.

If you want to join our automotive Hack Day then you can sign up for the Amsterdam event (here) and the London event (here).

Exploring the ONS

This post was first published at ScraperWiki.

The Office for National Statistics (ONS) is the United Kingdom statistical body charged by the government with the task of collecting and publishing  statistics related to the economy, population and society of England and Wales at national, regional and local levels. The data is typically published in the form of Excel spreadsheets.

The ONS is working on opening up their data, and making it more accessible to users. We’ve been doing a bit of work to help with that. This is typical of a number of jobs we have done. A customer has a website containing content which they want to move/process/republish elsewhere. The current website might have been built by aggregation over a number of years, and the underlying structure of the Content Management System may not be available to them. In these circumstances making a survey of the pre-existing content is an obvious first step.

The index for the ONS reference tables and datasets can be found here. Each dataset has a title, a release date, and a dataset type. There is also a URL to the dataset, and inside the title field there is an indication of the size of the file. We wrote a simple scraper to collect these pieces of information.

First up, we’ll look at the topics of the data released. There are a couple of routes into discovering these: one is to read the titles, which is OK as an approach, but the titles are quite wordy and sometimes it isn’t clear what they refer to. An alternative, in this case, is to look at the URLs of the documents. They look something like this:

http://www.ons.gov.uk/ons/rel/lms/labour-market-statistics/november-2013/table-unem03.xls

This can be quite revealing, since even if the website is not explicit about its structure, the URLs can reveal the structure the builder used. We process the URLs by splitting them at the forward slashes. The first part, http://www.ons.gov.uk/ons/rel, is common to all the URLs. Subsequent parts we can use to define a hierarchy. In this case we will focus on the fourth part of the hierarchy – “labour-market-statistics” in this instance – which gives us a human-readable description of a topic. There are approximately 400 topics as defined by this metric, as opposed to 90 or so defined by the third level of the hierarchy (a sketch of this URL processing follows the list below). Using the fourth level of the hierarchy, the key areas of the website by number of documents are:

  • Labour market statistics
  • National population projections
  • Family spending
  • Subnational labour market statistics
  • Census
  • Annual survey of hours and earnings.
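
Here is the sketch of the URL processing mentioned above; it splits each URL path on “/” (the example URL above stands in for the full list the scraper collected) and counts the fourth part of the hierarchy and the file extension.

from collections import Counter
from urllib.parse import urlparse

urls = [
    "http://www.ons.gov.uk/ons/rel/lms/labour-market-statistics/november-2013/table-unem03.xls",
    # ... the rest of the URLs collected by the scraper go here
]

topics = Counter()
extensions = Counter()
for url in urls:
    parts = urlparse(url).path.strip("/").split("/")
    # parts is ['ons', 'rel', 'lms', 'labour-market-statistics', ...];
    # parts[3] is the "fourth part of the hierarchy" discussed above
    if len(parts) > 3:
        topics[parts[3]] += 1
    extensions[url.rsplit(".", 1)[-1].lower()] += 1

print(topics.most_common(20))
print(extensions.most_common())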

We can visualise this as a treemap; here I am simply showing the top 20 areas by number of documents:

[Figure: treemap of the top 20 ONS topics by number of documents]

These 20 topics cover approximately two thirds of the total number of documents.

We can identify file types using the file extension in the URL, although this approach needs to be used a little cautiously since sometimes the extension doesn’t match the file type. Most of the files are Excel spreadsheets, although there are a few CSV and zip files, the zip files containing Excel spreadsheets. CSV appears to have been used for some of the older datasets. Most of the files are pretty small, less than 290kB, but there are a few much larger ones.

Finally we can look at the release dates for the datasets. There are datasets from as far back as 1988; in fact the dataset released in 1988 actually refers to data from 1984. There are some data released regularly from about 2001, but from 2011 a wider range of data has been released on a regular basis. We can see the monthly pattern of data releases in this timeline for 2014, which is restricted to the top 20 topics identified above:

[Figure: timeline of 2014 data releases for the top 20 topics]

This shows the big releases of labour market statistics, both national and regional, on the third Wednesday of each month. Other monthly releases include retail sales and producer prices data. And every week provisional figures on the registration of deaths in England and Wales are reported.

You can explore these data yourself using the Tableau workbook here.

The actual content of these spreadsheets is another story.

This survey approach to a website is handy for a range of applications, and the techniques used are quite general. We’ve used similar approaches to understand government and newspaper websites.