«

»

Jan 01 2014

Book review: Data Mining – Practical Machine Learning Tools and Techniques by Witten, Frank and Hall

datamining

This review was first published at ScraperWiki.

I’ve been doing more reading on machine learning, this time in the form of Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten, Eibe Frank and Mark A. Hall. This comes by recommendation of my academic colleagues on the Newsreader project, who rely heavily on machine learning techniques to do natural language processing.

Data mining is about finding structure in data, the algorithms for doing this are found in the field of machine learning. The classic example is Iris flower dataset. This dataset contains measurements of parts of a flower for three different species of Iris, the challenge is to build a system which classifies a flower to its species by its measurements. More practical examples are in the diagnosis of machine faults, credit assessment, detection of oil slicks, customer support analysis, marketing and sales.

Previously I’ve reviewed Machine Learning in Action by Peter Harrington. Data Mining is a somewhat different book. The core contents are quite similar: background to machine learning, evaluating your results and a run through the core algorithms. Machine Learning in Action is a pretty quick run through the field touching on many subjects, with toy demonstrations built from scratch in Python. Data Mining, running to almost 600 pages, is a much more thorough reference. There is a place for both types of book, even on the same bookshelf.

Data Mining is written by three members of the University of Waikato’s Computer Science Department and is based around the Weka machine learning system developed there. Weka is a complete framework, written in Java, which implements the algorithms described in the book as well as some others. Weka can be accessed via the command-line or using a GUI. As well as the machine learning algorithms there are systems for preparing data, evaluating and visualising results. A collection of well-known demonstration data sets are included. I’ve no reason to doubt the quality of the implementations in Weka, the GUI interface is functional, occasionally puzzling and not particularly slick. The book stands alone from the Weka framework but the framework provides a good playground to try out the techniques discussed in the book. Weka seems to be entirely suitable for conducting serious analysis. This approach is in contrast to the approach of Harrington in Machine Learning in Action who provides toy implementations of algorithms in Python.

The first two parts of the book provide an overview of machine learning, followed by a more detailed look at how the key algorithms are implemented. The third section is dedicated to Weka, whilst the first two sections refer to it but do not rely on it. The third section is divided into a discussion of Weka, covering all its key features and then a tutorial. I found this a bit confusing since the first part has the air of a tutorial, but isn’t, and the tutorial part keeps referring back to the overview section for its screenshots.

With some knowledge already in machine learning, the things I learned from this book:

  • better methods, and subtleties in measuring the performance of machine learning algorithms;
  • the success of the one-rule algorithm, essentially a decision tree which gets the maximum benefit from a single rule. It turns such an approach is surprisingly effective and only bettered a little, if at all, by more sophisticated algorithms;
  • getting enough, clean data to do machine learning is often a problem;
  • where to learn more!

The first edition of this book was published in 1999; my review is of the third edition. The book does show some signs of age, Machine Learning in Action  was written as a response to a poll published at the International Conference on Data Mining 2006 on the 10 most important machine learning algorithms (see the paper here). Whilst Data Mining mentions this survey, it is as something of an afterthought and the authors seem bemused by the inclusion of the PageRank algorithm used by Google to rank web pages in search results. They mention the Moa framework for data stream mining although do not discuss it in any detail. Moa focuses on techniques for large datasets.

In summary: a well-written, well-structured and readable book on machine learning algorithms with demonstrations based on an extensive machine learning framework. Definitely one to read and come back to for reference.