Book review: Spark GraphX in Action by Michael S. Malak and Robin East

I wrote about Spark not so long ago when I reviewed Learning Spark; at the time I noted that Learning Spark did not cover GraphX, the graph processing component of Spark. Spark GraphX in Action by Michael S. Malak and Robin East fills that gap.

I read the book via Manning's Early Access Program (MEAP): they approached me and gave me access to the whole book for free, which meant I read it on my Kindle. I tend not to do that these days for technical books because I still find paper a more flexible medium. Early Access means the book is still a little rough around the edges, but it is complete.

The authors suggest that readers should be comfortable reading Scala code to enjoy the book. Scala is the language Spark is written in, and the best way to access GraphX; access via Python (my favoured route) is impossible, and using Java sounds ugly. Scala is a functional language which runs on the Java virtual machine. It seems to be motivated by a desire to remove Java's verbosity, but perhaps goes a little too far. There is no `return` keyword for identifying the return value of a function, and it has an affectation of overloading the meaning of the underscore `_`. Even so, I felt comfortable enough reading Scala code. I was interested to read that the two "variable" declarations are `val` and `var`: `val` is immutable and is preferred, `var` is mutable. This is probably a lesson for my Python programming – immutable "variables" can provide higher performance, and using immutable bindings for things you intend to be immutable aids clarity and debugging.
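
A minimal Scala sketch (my own, not from the book) of the features mentioned above: `val` versus `var`, the implicit return value, and the underscore shorthand.

```scala
object ScalaBasics {
  def main(args: Array[String]): Unit = {
    val greeting = "hello"   // `val`: immutable binding, preferred
    var counter = 0          // `var`: mutable, use sparingly
    counter += 1

    // No `return` keyword: the last expression in the body is the result.
    def square(x: Int): Int = x * x

    // The underscore stands in for the argument of an anonymous function.
    val doubled = List(1, 2, 3).map(_ * 2)

    println(s"$greeting, square($counter) = ${square(counter)}, doubled = $doubled")
  }
}
```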

From the point of view of someone who has read about Spark and graph theory in the past, the book is pitched at the right level: there is some introductory material about Spark and about graph theory, followed by a set of examples. The book finishes with some material on inspecting running jobs in Spark using the Spark web interface. If you have never heard of Spark, then this book probably isn't a good place to start.

The examples start with basic algorithms for measuring shortest paths across a graph, connectedness, and the PageRank algorithm on which Google was originally built. These are followed by simple implementations of some further algorithms, including shortest paths with weighted edges (essential for route finding) and the travelling salesman problem. There then follows a chapter on machine learning algorithms, including recommendation engines, spam detection, and document clustering. Where appropriate the authors cite the original papers for algorithms, including PageRank, Pregel (Google's graph processing framework) and SVD++ (a key component of the winning entry for the Netflix recommendation prize), which is very welcome. The examples are outlines rather than full implementations of these sophisticated algorithms.
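
To give a flavour of the GraphX API, here is a hedged sketch (my own, not the book's code) of loading an edge list and running PageRank; the file path is made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pagerank-sketch").setMaster("local[*]"))

    // Each line of the (hypothetical) file is "sourceId destinationId".
    val graph = GraphLoader.edgeListFile(sc, "data/followers.txt")

    // Iterate until the per-vertex change drops below the tolerance.
    val ranks = graph.pageRank(0.0001).vertices

    // Print the five highest-ranked vertices.
    ranks.sortBy(_._2, ascending = false).take(5).foreach {
      case (id, rank) => println(s"vertex $id -> $rank")
    }

    sc.stop()
  }
}
```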

Finally, there is a chapter titled "The Missing Algorithms", which is more a discussion of utility functions for GraphX: importing graphs from other formats such as RDF, and operations such as merging two graphs or trimming away stray vertices.
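
Here is a hedged sketch (again mine, not the book's code) of two utilities of the sort that chapter discusses: merging two graphs by unioning their vertex and edge RDDs, and dropping vertices that have no edges.

```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx.Graph

object GraphUtilsSketch {
  // Merge two graphs; where a vertex id appears in both, arbitrarily keep
  // the attribute from the first graph.
  def merge[VD: ClassTag, ED: ClassTag](a: Graph[VD, ED],
                                        b: Graph[VD, ED]): Graph[VD, ED] = {
    val vertices = a.vertices.union(b.vertices).reduceByKey((keep, _) => keep)
    val edges = a.edges.union(b.edges)
    Graph(vertices, edges)
  }

  // Trim stray vertices: `degrees` only lists vertices with at least one edge,
  // so an inner join discards the isolated ones.
  def dropIsolatedVertices[VD: ClassTag, ED: ClassTag](
      g: Graph[VD, ED]): Graph[VD, ED] = {
    val connected = g.vertices.innerJoin(g.degrees) { (_, attr, _) => attr }
    Graph(connected, g.edges)
  }
}
```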

The book gives the impression that GraphX is not ready for the big time yet: in a couple of places the authors say "this bit has only just started working", and when they move on to using SVD++ in GraphX they explain that the algorithm is only half implemented there. Full implementations are available in other languages.

It seemed to me on my original reading about Spark that the big benefit was that you could write machine learning systems in a familiar language which ran on a single machine, and then scale up effortlessly to a computing cluster if required. Those benefits are not currently present in GraphX: you need to worry about coding in an unfamiliar language and about the quality of the underlying implementation. The appropriate approach (for me) would be to prototype using Python/Neo4j, and likely discover that that is all that is needed. Only if you have a very large graph do you need to consider switching to a Spark-based solution, and I'm not convinced GraphX is how you would do it even then.

The code samples are poorly formatted, but you can fix this by downloading the source code and viewing it in the editor of your choice, with nice syntax highlighting and consistent indenting – this makes things much clearer. The figures are clear enough, but I find the Kindle approach of embedding thumbnail-scale figures unhelpful – you need to double-click them to make them readable. A reasonable solution would be to make figures full page by default, if that is possible.

This is one of the better "* in Action" books I've read. It hasn't convinced me to use GraphX – quite the reverse – but that's no bad thing, and I've learnt a little about recommender algorithms and Scala.

Book Review: Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia

This post was first published at ScraperWiki.
learning-spark-book-coverApache Spark is a system for doing data analysis which can be run on a single machine or across a cluster, it  is pretty new technology – initial work was in 2009 and Apache adopted it in 2013. There’s a lot of buzz around it, and I have a problem for which it might be appropriate. The goal of Spark is to be faster and more amenable to iterative and interactive development than Hadoop MapReduce, a sort of Ipython of Big Data. I used my traditional approach to learning more of buying a dead-tree publication, Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia, and then reading it on my commute.

The core of Spark is the resilient distributed dataset (RDD), a data structure which can be distributed over multiple computational nodes. Creating an RDD is as simple as passing a file URL to a constructor (the file may be located on some Hadoop-style filesystem) or parallelizing an in-memory data structure. To this data structure are applied transformations and actions. Transformations produce another RDD from an input RDD; for example filter() returns an RDD which is the result of applying a filter to each row in the input RDD. Actions produce a non-RDD output; for example count() returns the number of elements in an RDD.
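
A minimal sketch of that transformation/action split, in Scala (my own illustration; the file path is made up): filter() builds a new RDD lazily, while count() and sum() trigger the actual work.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

    val lines = sc.textFile("logs/app.log")          // RDD from a file URL
    val numbers = sc.parallelize(1 to 1000)          // RDD from an in-memory collection

    val errors = lines.filter(_.contains("ERROR"))   // transformation: lazy
    println(s"error lines: ${errors.count()}")       // action: runs the job
    println(s"sum: ${numbers.sum()}")                // another action

    sc.stop()
  }
}
```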

Spark provides functionality to control how parts of an RDD are distributed over the available nodes, e.g. by key. In addition there is functionality to share data across multiple nodes using "broadcast variables", and to aggregate results in "accumulators". The behaviour of accumulators in distributed systems can be complicated, since Spark might pre-emptively execute the same piece of processing twice because of problems on a node.
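
A sketch of those three features using the Spark 1.x-era API the book describes (not the book's own example; the data and names are made up): key-based partitioning, a broadcast lookup table, and an accumulator counting unknown keys.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object SharedVarsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("shared-vars").setMaster("local[*]"))

    // Control how the RDD is distributed: partition by key across 4 partitions.
    val pairs = sc.parallelize(Seq(("gb", 1), ("us", 2), ("gb", 3)))
      .partitionBy(new HashPartitioner(4))

    // Broadcast variable: read-only data shared with every node.
    val countryNames = sc.broadcast(
      Map("gb" -> "United Kingdom", "us" -> "United States"))

    // Accumulator: may over-count if a task is speculatively re-run.
    val unknown = sc.accumulator(0)

    val named = pairs.map { case (code, value) =>
      val name = countryNames.value.getOrElse(code, { unknown += 1; "unknown" })
      (name, value)
    }

    named.collect().foreach(println)
    println(s"unknown country codes: ${unknown.value}")
    sc.stop()
  }
}
```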

In addition to Spark Core there are Spark Streaming, Spark SQL, MLlib (machine learning), GraphX and SparkR modules. Learning Spark covers the first three of these. The Streaming module handles data, such as log files, which grows continually over time, using a DStream structure comprised of a sequence of RDDs with some additional time-related functions. Spark SQL introduces the DataFrame data structure (previously called SchemaRDD) which enables SQL-like queries using HiveQL. The MLlib library provides a whole bunch of machine learning algorithms such as decision trees, random forests, support vector machines, naive Bayes and logistic regression. It also has support routines to normalise and analyse data, as well as clustering and dimension-reduction algorithms.
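
To illustrate the DStream idea, here is a hedged sketch (mine, not the book's) of a word count over a socket stream, processed as a sequence of small RDDs every five seconds; the host and port are made up for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))   // 5-second batches

    val lines = ssc.socketTextStream("localhost", 9999) // DStream[String]
    val counts = lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                               // per-batch word counts

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```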

All of this functionality looks pretty straightforward to access; example code is provided for Scala, Java and Python. Scala is a functional language which runs on the Java virtual machine, so it appears to get functionality equivalent to Java's. Python, on the other hand, appears to be a second-class citizen: some functionality, particularly around I/O, lacks Python support. This begs the question of whether one should start analysis in Python and switch as and when required, or start in Scala or Java, to which you may well be forced anyway. Perhaps the intended usage is Python for prototyping and Java/Scala for production.

The book, like Spark, is pitched at two audiences: data scientists and software engineers. This would explain the support for Python and (more recently) R, to keep the data scientists happy, and Java/Scala for the software engineers. I must admit that looking at examples in Python and Java together, I remember why I love Python! Java requires quite a lot of class-declaration boilerplate (and brackets) just to get into the air.

Spark will run on a standalone machine; I got it running on Windows 8.1 in short order. Analysis programs appear to be deployable to a cluster unaltered, with the changes handled in configuration files and command-line options. The feeling I get from Spark is that it would be entirely appropriate to undertake analysis with Spark which you might otherwise do locally using pandas or scikit-learn, and, if necessary, scale up onto a cluster with relatively little additional effort rather than having to learn some fraction of the Hadoop ecosystem.

The book suffers a little from covering a subject area which is developing rapidly: Spark is at version 1.4 as of early June 2015, the book covers version 1.1, and things are happening fast. For example, GraphX and SparkR, more recent additions to Spark, are not covered. That said, this is a great little introduction to Spark, and I'm now minded to go off and apply my new-found knowledge to the Kaggle – Avito Context Ad Clicks challenge!