Tag Archive: data science

Aug 26 2018

Book review: Data Strategy by Bernard Marr

This is a review of Data Strategy by Bernard Marr. The proposition of the book is that all businesses are now data businesses and that they should have a strategy to exploit data. He envisages such a strategy operating through a Chief Data Officer and thus at the highest level of a company.

It is in the nature of things that to be successful you feel that you have to be saying something new and interesting. The hook for this book is that big data, or the increasing availability of data, is a new and revolutionary thing. To be honest, I don’t really buy this, but once we’re over the hook the advice contained within is rather good.

Marr sees data benefitting businesses in three ways, and covers these in successive chapters:

  1. It can support business decisions – that’s to say helping humans make decisions;
  2. It can support business operations – this is more the automated use of data, for example, a recommender algorithm you might come across at any number of retail sites is driven by data and falls into this category;
  3. It can be an asset in its own right.

This first benefit of supporting business decisions is further sub-divided into data about the following:

  1. Customers, markets and competition
  2. Finance
  3. Internal operations
  4. People

The chapter on supporting business operations contains quite a lot of material on using sensors in manufacturing and warehouse operations but also includes fraud detection.

Subsequent chapters cover how to source and collect data, how to provide the human and physical infrastructure to draw meaning from it, and offer some comments on data governance. In Europe this last topic has been the subject of enormous activity over the past couple of years with the introduction of the General Data Protection Regulation (GDPR), which determines the way in which personal information can be collected and processed.

Following the theme of big data, Marr’s view is that the past is represented by data in SQL tables whilst the future is in unstructured data sources.

My background is as a physical scientist, and as such I read this with a somewhat quizzical “You’re not doing this already?” face. Pretty much the whole point of a physical scientist is to collect data to better understand the world. The physical sciences have never really had a big data moment; typically we have collected and analysed data to the limit of currently available technology, but that has never been the thing itself. Philosophically, the physical sciences gave up on collecting “all of the data” long ago. One of the unappreciated features of the big detectors at CERN is their ability to throw away enormous quantities of data really fast. If you have what is effectively a building-sized CCD camera then it is the only strategy that works. This isn’t to say the physical sciences always do it right, or that they are directly relevant to businesses. The physical sciences work on the basis that there are universally applicable, immutable physical laws which data is used to establish. This is not true of businesses: what works for one business need not work for another, and what works now need not work in the future.

Reading the book I kept thinking of A Computer Called LEO by Georgina Ferry, which describes the computer built by the J. Lyons company (who ran a teashop and catering business) in the 1950s. Lyons had been doing large scale data work since the 1920s; in the aftermath of the Second World War they turned to automated, electronic computation. From my review I see that Charles Babbage wrote about the subject in 1832, although he was writing more about prospects for the future. IBM started its growth in computing machinery in the late 19th century. So the idea of data being core to a business is by no means new.

The text is littered with examples of data collection for business good across a wide range of sectors. Rolls-Royce’s engine monitoring programme is one of my favourites: their engines send data back to Rolls-Royce four times during each flight. This can be used to support engine servicing and, I would imagine, product development. In the category of monetizing data, American Express and Acxiom are mentioned; they provide either personal or aggregate demographic information which can be used for targeted marketing.

Some entries might be a bit surprising: restaurant chains in the form of Domino’s and Dickey’s Barbecue Pit are big users of data. Walmart also makes an appearance. This shouldn’t be surprising, since the importance of data is a matter more of business scale than sector, as the Lyons company shows.

Marr repeatedly tells us that we should collect the data which answers our questions rather than just trying to collect all the data. I don’t think this can be repeated too often! It seems that many businesses have been sold (or built) Big Data infrastructure and only really started to think about how they would extract business value from the data collected once this had been done.

Definitely thought-provoking, and a well-structured guide to how data can benefit your company.

Apr 26 2018

Book review: Hands-on machine learning with scikit-learn & tensorflow by Aurélien Géron

I’ve recently started playing around with recurrent neural networks and tensorflow, which brought me to Hands-on machine learning with scikit-learn & tensorflow by Aurélien Géron; as a bonus it also includes material on scikit-learn, which I’ve been using for a while.

The book is divided into two parts. The first, “Fundamentals of Machine Learning”, focuses on the functionality which is found in the scikit-learn library. It starts with the big picture, running through the types of machine learning which exist (supervised / unsupervised, batch / online and instance-based / model-based) and then some of the pitfalls and problems with machine learning, before a section on testing and validation. The next part is a medium-sized example of machine learning in action which demonstrates how the functionality of scikit-learn can be quickly used to develop predictions of house prices in California based on census data. This is a subject after my own heart; I’ve been working with property data in the UK for the past couple of years.

This example serves two purposes: firstly it demonstrates the practical steps you need to take when undertaking a machine learning exercise, and secondly it highlights just how concisely much of it can be executed in scikit-learn. The following chapters then go into more depth, first covering how models are trained and scored and then going into the details of different algorithms such as Support Vector Machines and Decision Trees. This part finishes with a chapter on ensemble methods.
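
To give a flavour of that conciseness, here is a minimal sketch of the sort of end-to-end exercise the book works through. This is not the book’s own code: it uses the closely related California housing dataset bundled with scikit-learn (fetch_california_housing) rather than the census extract the book downloads, and a random forest regressor chosen purely for illustration:

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load the data and hold back a test set
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42)

# Fit a model and evaluate it on data it has never seen
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(rmse)  # root mean squared error on the test set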

Although the chapters contain some maths, their strength is in the clear explanations of the methods described. I particularly liked the chapter on ensemble methods. They also demonstrate how consistent the scikit-learn library is in its interfaces. I knew that I could switch algorithms very easily with scikit-learn but I hadn’t fully appreciated how seamlessly the library generally handles regression and multi-class classification.
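
That consistency is easy to demonstrate. Reusing the train/test split from the sketch above, swapping algorithms is essentially a one-line change because every estimator exposes the same fit / predict / score interface (the particular estimators below are just examples):

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Every scikit-learn estimator is trained with fit() and evaluated the same way
for model in (LinearRegression(), DecisionTreeRegressor(random_state=42)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))  # R-squared on the test set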

I wonder whether outside data science it is perceived that data scientists write their own algorithms from scratch. In practical terms it is not the case, and hasn’t been the case since at least the early nineties when I started data analysis which looks very similar to the machine learning based analysis I do today. In those days we used the NAG numerical library, Numerical Recipes in FORTRAN and libraries developed by a very limited number of colleagues in the wider academic community (probably shared by email attachment).

The second part of the book, “Neural networks and Deep Learning”, looks at the tensorflow library. Tensorflow has general applications for processing multi-dimensional arrays but it has been designed with deep learning and neural networks in mind. This means there are a whole bunch of functions to generate and train neural networks of different types and different architectures.

The section starts with an overview of tensorflow with some references to other deep learning libraries, before providing an introduction to neural networks in general, which have been around quite a while now. Following this there is a section on training deep learning networks, and the importance of the form of activation functions.
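
The book’s own examples use the TensorFlow API of the time; purely to show what stacking layers and choosing activation functions looks like in code, here is a minimal sketch using the higher-level tf.keras interface (my choice, not the book’s code), with arbitrary layer sizes and eight input features:

import tensorflow as tf

# A small fully-connected network; ReLU activations in the hidden layers
# are the kind of choice the chapter on training deep networks discusses
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X_train, y_train, epochs=10) would then train it on numeric features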

Tensorflow will run across multiple processors, GPUs and/or servers although configuring this looks a bit complicated. Typically a neural network layer can’t be distributed over multiple processing units.

There then follow chapters on convolutional neural networks (good for image applications), recurrent neural networks (good for sequence data), autoencoders (finding compact representations) and finally reinforcement learning (good for playing pac-man). My current interest is in recurrent neural networks, and it was nice to see a brief description of all of the potential input/output scenarios for recurrent neural networks and how to build them.
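
As an illustration of one of those scenarios, here is a hedged sketch of a sequence-to-one recurrent network in tf.keras (again my choice of interface, not the book’s code): it reads a sequence of 30 time steps with a single feature and predicts one value, for example the next point in the series. The shapes are illustrative only:

import tensorflow as tf

# Sequence-to-one: a 30-step univariate sequence in, a single prediction out
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(30, 1)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")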

I spent a few years doing conventional image analysis, and convolutional neural networks feel quite similar to the convolution filters I used then, although they stack more layers (or filters) than are normally used in conventional image analysis. Furthermore, in conventional image analysis the kernels are typically handcrafted to perform certain tasks (such as detecting horizontal or vertical edges), whilst neural networks learn their kernels in training. In conventional image analysis convolution is often done in Fourier space since it is more efficient, and I see there are experiments along these lines with convolutional neural networks.

Developing and training neural networks has the air of an experimental science rather than a theoretical science. That’s to say that rather than thinking hard and coming up with an effective neural network and training scheme, one needs to tinker with different designs and training methods and see how well they work. It has more the air of training an animal than programming a computer. There are a number of standard training / test sets of images, and successful models trained against these by other people can be downloaded. Such models can be used as-is, but alternatively just parts can be used.
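
Downloading and reusing part of a pre-trained model looks something like the following sketch, which uses the tf.keras.applications interface (my choice of example, not the book’s code); VGG16 is one of the standard ImageNet-trained networks:

import tensorflow as tf

# Fetch a network pre-trained on ImageNet; include_top=False drops the final
# classification layers so only the convolutional part is reused
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # freeze the borrowed layers; only newly added layers get trained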

This section has many references to the original literature for the development of deep learning, highlighting how recent this new iteration of neural networks is.

Overall an excellent book, scikit-learn and tensorflow are the go-to libraries for Python programmers wanting to do machine learning and deep learning respectively. This book describes their use eloquently, with references to original literature where appropriate whilst providing a good overview of both topics. The code used in the book can be found on github, as a set of Jupyter Notebooks.

Dec 28 2017

Book review: Fraud analytics by B. Baesens, V. Van Vlasselaer and W. Verbeke

This next book is rather work-oriented: Fraud Analytics using descriptive, predictive and social network techniques: A guide to data science for fraud detection by Bart Baesens, Veronique van Vlasselaer and Wouter Verbeke.

Fraud analytics starts with an introductory chapter on the scale of the fraud problem, and some examples of types of fraud. It also provides an overview of the chapters that are to come. In the UK fraud losses stand at about £73 billion per annum; typically fraud losses amount to anything up to 5%. There are many types of fraud: credit card fraud, insurance fraud, healthcare fraud, click fraud, identity theft and so forth.

There then follows a chapter on data preparation, sampling and preprocessing. This includes some domain-related elements, such as the importance of the so-called RFM attributes: Recency, Frequency and Monetary value, which are the core variables for financial transactions. Also covered are missing values and data quality, which are more general issues in statistics.
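
As a rough sketch of what deriving RFM attributes looks like in practice, here is a small pandas example; the table and its column names (customer_id, date, amount) are invented for illustration:

import pandas as pd

# A toy transactions table; in practice this would come from a database
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "date": pd.to_datetime(["2017-01-05", "2017-03-01",
                            "2017-02-10", "2017-02-20", "2017-03-02"]),
    "amount": [50.0, 20.0, 5.0, 7.5, 12.0],
})

now = transactions["date"].max()
rfm = transactions.groupby("customer_id").agg(
    recency=("date", lambda d: (now - d.max()).days),  # days since last transaction
    frequency=("date", "count"),                        # number of transactions
    monetary=("amount", "sum"),                         # total value of transactions
)
print(rfm)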

The core of the book is three long chapters on descriptive statistics, predictive analysis and social networks.

Descriptive statistics concerns classical statistical techniques, from the detection of outliers using the z-score (the deviation from the mean measured in standard deviations) through to clustering techniques such as k-means and related methods. These clustering techniques fall into the category of unsupervised machine learning. The idea here is that fraudulent transactions are different to non-fraudulent ones; this may be a temporal separation (i.e. a change in customer behaviour may indicate that their account has been compromised and used nefariously) or it might be a snapshot across a population where fraudulent actors have different behaviour than non-fraudulent ones. Clustering techniques and outlier detection seek to identify these “different” transactions, usually for further investigation – that’s to say automated methods are used as a support for human investigators, not a replacement. This means that ranking transactions for potential fraud is key. Obviously fraudsters are continually adapting their behaviour to avoid standing out, and so fraud analytics is an arms race.
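
A minimal sketch of those two ideas, using scipy and scikit-learn on invented numeric transaction features, might look like this; the point is the ranking at the end, which would feed a human investigation queue rather than an automatic block:

import numpy as np
from scipy.stats import zscore
from sklearn.cluster import KMeans

# Each row is a transaction described by a few numeric features (invented here)
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))

# Outlier score 1: the largest absolute z-score across the features
z_outlier = np.abs(zscore(X, axis=0)).max(axis=1)

# Outlier score 2: distance to the nearest k-means cluster centre
kmeans = KMeans(n_clusters=5, random_state=0).fit(X)
cluster_distance = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Rank transactions, most anomalous first, for human investigators
ranking = np.argsort(-(z_outlier + cluster_distance))
print(ranking[:10])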

Predictive analysis is more along the lines of regression, classification and machine learning. The idea here is to develop rules for detecting fraud from training sets containing example transactions which are known to be fraudulent or non-fraudulent. Whilst not providing an in-depth implementation guide, Fraud Analytics gives a very good survey of the area. It discusses different machine learning algorithms, including their strengths and weaknesses, particularly with regard to model “understandability”. Also covered are a wide range of model evaluation methods, and the importance of an appropriate training set. A particular issue here is that fraud is relatively uncommon, so care needs to be taken in sampling training sets such that algorithms have a chance to identify fraud. These are perennial issues in machine learning and it is good to see them summarised here.
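
The class imbalance point translates directly into code. Here is a hedged scikit-learn sketch on synthetic data where only around 2% of the examples are labelled as fraud; class_weight="balanced" is one of several ways of compensating, and ROC AUC is one of the threshold-free evaluation measures the book surveys:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic, heavily imbalanced data: roughly 2% of cases are the "fraud" class
X, y = make_classification(n_samples=10000, n_features=10, weights=[0.98], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" re-weights the rare positive class during training
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))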

The chapter on social networks clearly presents an active area of research in fraud analytics. It is worth highlighting here that the term “social” is meant very broadly; it is only marginally about social networks like Twitter and Facebook. It is much more about networks of entities such as the claimant, the loss adjustor, the law enforcement official and the garage carrying out repairs. Also relevant are networks of companies, and their directors, set up to commit corporate frauds. Network (aka graph) theory is the appropriate, efficient way to handle such systems. In this chapter, network analytic ideas such as “betweenness” and “centrality” are combined with machine learning involving non-network features.
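
As a flavour of the network side, here is a small sketch using the networkx library (my choice of tool, not one named in the book); the entities and edges are invented, and scores like these would typically be added as features alongside the non-network ones:

import networkx as nx

# A toy claims network: claimants linked to the garages and adjustors on their claims
G = nx.Graph()
G.add_edges_from([
    ("claimant_A", "garage_1"), ("claimant_B", "garage_1"),
    ("claimant_C", "garage_1"), ("claimant_B", "adjustor_1"),
    ("claimant_C", "adjustor_1"), ("claimant_D", "garage_2"),
])

# Betweenness centrality highlights entities sitting on many shortest paths,
# e.g. a garage appearing in a suspicious number of otherwise unrelated claims
centrality = nx.betweenness_centrality(G)
print(sorted(centrality.items(), key=lambda kv: -kv[1])[:3])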

The book finishes with chapters on fraud analytics in operation, and a wider view. How do you use these models in production? When do you update them? How do you update them? The wider view includes some discussion of data anonymisation prior to handing it over to data scientists. This is an important area: data protection regulations across the EU are tightening up, and breaches of personal data can have serious consequences for the companies involved. Anonymisation may also provide some protection against producing biased models, i.e. those that discriminate unfairly against people on the basis of race, gender and economic circumstances, although this area should attract more active concern.

A topic not covered but mentioned a couple of times is natural language processing, for example analysing the text of claims against insurance policies.

It is best to think of this book as a guide to various topics in statistics and data science as applied to the analysis of fraud. The coverage is more in the line of an overview, rather than an in-depth implementation guide. It is pitched at the level of the practitioner rather than the non-expert manager. Aside from some comments at the end on label-based security access control (relating to SQL) and some screenshots from SAS products, it is technology agnostic.

Occasionally the English in this book slips from being fully idiomatic, but it is still fully comprehensible – it simply reads a little oddly. Not a fun read, but an essential starter if you’re interested in fraud and data science.

Oct 24 2017

Scala – installation behind a workplace web proxy

I’ve been learning Scala as part of my continuing professional development. Scala is a functional language which runs primarily on the Java Runtime Environment. It is a first class citizen for working with Apache Spark – an important platform for data science. My intention in learning Scala is to get myself thinking in a more functional programming style and to gain easy access to Java-based libraries and ecosystems; typically I program in Python.

In this post I describe how to get Scala installed and functioning on a workplace laptop, along with its dependency manager, sbt. The core issue here is that my laptop at work puts me behind a web proxy so that sbt does not Just Work™. I figure this is a common problem so I thought I’d write my experience down for the benefit of others, including my future self.

The test system in this case was a relatively recent (circa 2015) Windows 7 laptop, I like using bash as my shell on Windows rather than the Windows Command Prompt – I install this using the Git for Windows SDK.

Scala can be installed from the Scala website https://www.scala-lang.org/download/. For our purposes we will use the Windows binaries, since the sbt build tool requires additional configuration to work. Scala needs the Java JDK version 1.8 to install, and the JAVA_HOME environment variable needs to point to the appropriate place. On my laptop this is:

JAVA_HOME=C:\Program Files (x86)\Java\jdk1.8.0_131

The Java version can be established using the command:

javac -version

My Scala version is 2.12.2, obtained using:

scala -version

Sbt is the dependency manager and build tool for Scala, it is a separate install from:

http://www.scala-sbt.org/0.13/docs/Setup.html

It is possible the PATH environment variable will need to be updated manually to include the sbt executables (:/c/Program Files (x86)/sbt/bin).

I am a big fan of Visual Studio Code, so I installed the Scala helper for Visual Studio Code:

https://marketplace.visualstudio.com/items?itemName=dragos.scala-lsp

This requires a modification to the sbt config file which is described here:

http://ensime.org/build_tools/sbt/

Then we can write a trivial Scala program like:

object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
  }
}

And run it at the commandline with:

scala first.scala

To use sbt in my workplace requires proxies to be configured. The symptom of a failure to do this is that the sbt compile command fails to download the appropriate dependencies on first run, as defined in a build.sbt file, producing a line in the log like this:

Error!

Server access Error: Connection reset url=https://repo1.maven.org/maven2/net/
sourceforge/htmlcleaner/htmlcleaner/2.4/htmlcleaner-2.4.pom

In my case I established the appropriate proxy configuration from the Google Chrome browser:

chrome://net-internals/#proxy

This shows a link to the pacfile, something like:

http://pac.madeupbit.com/proxy.pac?p=somecode

The PAC file can be inspected to identify the required proxy; in my case there is a statement towards the end of the pacfile which contains the URL and port required for the proxy:

if (url.substring(0, 5) == 'http:' || url.substring(0, 6) == 'https:' || url.substring(0, 3) == 'ws:' || url.substring(0, 4) == 'wss:')
{
    return 'PROXY longproxyhosturl.com:80';
}

 

These are added to an SBT_OPTS environment variable which can either be set in a bash-like .profile file or using the Windows environment variable setup.

export SBT_OPTS="-Dhttps.proxyHost=longproxyhosturl.com -Dhttps.proxyPort=80 -Dhttps.proxySet=true"

As a bonus, if you want to use Java’s Maven dependency management tool you can use the same proxy settings but put them in a MAVEN_OPTS environment variable.

Typically to start a new project in Scala one uses the sbt new command with a pointer to a g8 template; in my workplace this does not work as normally stated because it uses the git protocol, which is blocked by default (it runs on port 9418). The normal new command in sbt looks like:

sbt new scala/scala-seed.g8

The workaround for this is to specify the g8 repo in full including the https prefix:

sbt new https://github.com/scala/scala-seed.g8

This should initialise a new project, creating a whole bunch of standard directories.

So far I’ve completed one small project in Scala. Having worked mainly in dynamically typed languages it was nice that, once I had properly defined my types and got my program to compile, it ran without obvious error. I was a bit surprised to find no standard CSV reading / writing library as there is for Python. My Python has become a little more functional as a result of my Scala programming; I’m now a bit more likely to map a function over a list rather than loop over the list explicitly, as in the little Python sketch below.
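
For anyone unfamiliar with what that stylistic shift looks like, here is the same trivial transformation written both ways in Python:

prices = [1.0, 2.5, 4.0]

# The explicit loop I would have written before
doubled = []
for p in prices:
    doubled.append(p * 2)

# The more functional habit picked up from Scala: map a function over the list
doubled = list(map(lambda p: p * 2, prices))

# Or, most idiomatically in Python, a list comprehension
doubled = [p * 2 for p in prices]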

I’ve been developing intensively in Python over the last couple of years, and this seems to have helped me in configuring my Scala environment, in terms of getting to grips with modules and packaging, dependency managers, and automated documentation building, and also in finding my test library (http://www.scalatest.org/) at an early stage.

Jun 01 2017

Book Review: Scala for the Impatient by Cay S. Horstmann

I thought I should learn a new language, and Scala seemed like a good choice, so I got Scala for the Impatient by Cay S. Horstmann.

Scala is a functional programming language which supports object orientation too. I’m attracted to it for a number of reasons. Firstly, I’m using or considering using a number of technologies which are based on Java – such as Elasticsearch, Neo4j and Spark. Although there are bindings to my favoured language, Python, for Spark in particular I feel like a second-class citizen. Scala, running as it does on the Java Virtual Machine, allows you to import Java functions easily and so gives better access to these systems.

I’m also attracted to Scala because it is rather less verbose than Java. It feels like some of the core aspects of the language ecosystem (like the dependency manager and testing frameworks) have matured rapidly although the range of available libraries is smaller than that of older languages.

Scala for the Impatient gets on with providing details of the language without much preamble. Its working assumption is that you’re somewhat familiar with Java and so concepts are explained relative to Java. I felt like it also made an assumption that you knew about the broad features of the language, since it made some use of forward referencing – where features are used in an example before being explained somewhat later in the book.

I must admit programming in Scala is a bit of a culture shock after Python. Partly because it’s compiled rather than interpreted, although the environment does what it can to elide this difference – Scala has a REPL (read-evaluate-print-loop) which works in the background by doing a quick compile. This allows you to play around with the language very easily. The second difference is static typing – Scala is a friendly statically typed language in the sense that if you initialise something with a string value then it doesn’t force you to tell it you want this to be a string. But everything does have a very definite type. It follows the modern hipster style of putting the type after the symbol name (i.e. var somevariablename: Int = 5), as in Go, rather than before, as in earlier languages (i.e. int somevariablename = 5).

You have to declare new variables as either var or val. Variables (var) are mutable and values (val) are immutable. It strikes me that static typing and this feature should fix half of my programming errors, which in a dynamically typed language are usually mis-spelling variable names, changing something as a side effect and putting the wrong type of thing into a variable – usually during I/O.

The book starts with chapters on basic control structures and data types, moving on to classes and objects and collection data types. There are odd chapters on file handling and regular expressions, and also on XML processing, which is built into the language, although it does not implement the popular XPath query language for XML. There is also a chapter on the parsing of formal grammars.

I found the chapter on futures and promises fascinating; these are relatively new ways to handle concurrency and parallelism which I hadn’t been exposed to before, and I notice they have recently been introduced to Python.
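
Since the post mentions the Python analogue, here is a hedged sketch of that analogue using Python’s concurrent.futures module (this is not the book’s Scala material, just an illustration of the same idea): submitting work returns a future immediately, and asking for its result blocks until the value is ready.

from concurrent.futures import ThreadPoolExecutor

def slow_square(x):
    # Stand-in for a long-running computation or I/O call
    return x * x

with ThreadPoolExecutor(max_workers=4) as pool:
    future = pool.submit(slow_square, 7)  # returns a Future straight away
    print(future.result())                # blocks until the result is available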

Chapters on type parameters, advanced types and implicit types had me mostly confused, although the early parts were straightforward enough. I’d heard of templating classes and data structures but as someone programming mainly in dynamically typed languages I hadn’t any call for them. It turns out templating is a whole lot more complicated than I realised!

My favourite chapter was the one on collections – perhaps because I’m a data scientist, and collections are where I put my data. Scala has a rich collection of collections and methods operating on collections. It avoids the abomination of the Python “dictionary” whose members are not ordered, as you might expect. Scala calls such a data structure a HashMap.

It remains to be seen whether reading, once again, chapters on object-oriented programming will result in me writing object-oriented programs. It hasn’t done in the past.

Scala for the Impatient doesn’t really cover the mechanics of installing Scala on your system or the development environment you might use but then such information tends to go stale fast and depends on platform. I will likely write a post on this, since installing Scala and its build tool, sbt, behind a corporate proxy was a small adventure.

Googling for further help I found myself at the Scala Cookbook by Alvin Alexander quite frequently. The definitive reference book for Scala is Programming in Scala by Martin Odersky, Lex Spoon and Bill Venners. Resorting to my now familiar technique of searching the acknowledgements for women working in the area, I found Susan Potter whose website is here.

Scala for the Impatient is well-named, it whistles through the language at a brisk pace, assuming you know how to program. It highlights the differences with Java, and provides you with the vocabulary to find out more.
