Tag Archive: programming

Jun 28 2013

Testing, testing…

testingtesting

This post was first published at ScraperWiki.

Data science is a distinct profession from software engineering. Data scientists may write a lot of computer code but the aim of their code is to answer questions about data. Sometimes they might want to expose the analysis software they have written to others in order they can answer questions for themselves, and this is where the pain starts. This is because writing code that only you will use and writing code someone else will use can be quite different.

ScraperWiki is a mixed environment, it contains people with a background in software engineering and those with a background in data analysis, like myself. Left to my own devices I will write code that simply does the analysis required. What it lacks is engineering. This might show up in its responses to the unexpected, its interactions with the user, its logical structure, or its reliability.

These shortcomings are addressed by good software engineering, an area of which I have theoretical knowledge but only sporadic implementation!

I was introduced to practical testing through pair programming: there were already tests in place for the code we were working on and we just ran them after each moderate chunk of code change. It was really easy. I was so excited by it that in the next session of pair programming, with someone else, it was me that suggested we added some tests!

My programming at ScraperWiki is typically in Python, for which there a number of useful testing tools. I typically work from Windows, using the Spyder IDE and I have a bash terminal window open to commit code to either BitBucket or Github. This second terminal turns out to be very handy for running tests.

Python has an internal testing mechanism called doctest which allows you to write tests into the top of a function in what looks like a comment. Typically these comprise a call to the function from a command prompt followed by the expected response. These tests are executed by running a command like:

 python -m doctest yourfile.py

This is OK, and it’s “batteries included” but I find the mechanism a bit ugly. When you’re doing anything more complicated than testing inputs and outputs for individual functions, you want to use a more flexible mechanism like nose tools, with specloud to beautify the test output. The Git-Bash terminal on Windows needs a little shim in the form of ansicon to take full advantage of specloud’s features. Once you’re suitably tooled up, passed tests are marked with a vibrant, satisfying green and the failed tests by a dismal, uncomfortable red.

My latest project, a module which automatically extracts tables from PDF files, has testing. It divides into two categories: testing the overall functionality – handy as I fiddle with structure – and tests for mathematically or logically complex functions. In this second area I’ve started writing the tests before the functions, this is because often this type of function has a simple enough description and test case but implementation is a bit tricky. You can see the tests I have written for one of these functions here.

Testing isn’t as disruptive to my workflow as I thought it would be. Typically I would be repeatedly running my code as I explored my analysis making changes to a core pilot script. Using testing I can use multiple pilot scripts each testing different parts of my code; I’m testing more of my code more often and I can undertake moderate changes to my code, safe in the knowledge that my tests will limit the chances of unintended consequences.

Jun 01 2012

Book Review: Visualize This by Nathan Yau

9780470944882 cover.inddThis book review is of Nathan Yau’s “Visualize This: The FlowingData Guide to Design, Visualization and Statistics”. It grows out of Yau’s blog: flowingdata.com, which I recommend, and also his experience in preparing graphics for The New York Times, amongst others.

The book is a run-through of pragmatic methods in visualisation, focusing on practical means of achieving ends rather more abstract design principles for data visualisation; if you want that then I recommend Tufte’s “The Visual Display of Quantitative Information”.

The book covers a bit of data scraping, extracting useful numerical data from disparate sources, as Yau comments this is the thing that takes the time in this type of activity. It also details methods for visualising time series data, proportions, geographic data and so forth.

The key tools involved are the R and Python programming languages; I already have these installed in the form of R Studio and Python(x,y), distributions which provide an environment that looks like the Matlab one with which I have long been familiar with but which sadly is somewhat expensive for a hobby programmer. Alongside this are the freely available Processing language and the Protovis Javascript library which are good for interactive, online visualisations, and the commercial packages Adobe Illustrator, for vector graphic editing, and Adobe Flash Builder for interactive web graphics. Again these are tools I find out of my range financially for my personal use although Inkscape seems to be a good substitute for Illustrator.

With no prior knowledge of Flash and no Flash Builder, I found the sections on Flash a bit bewildering. It also highlights how perhaps this will be a book very distinctively of its time, with Apple no longer supporting Flash on iPhone its quite possible that the language will die out. And I notice on visiting the Protovis website that this is no longer under development: the authors have moved on to D3.js, Openzoom which is also mentioned is no longer supported. Python has been around for sometime now and is the lightweight language of choice for many scientists, similarly R has been around for a while and is increasing in popularity.

You won’t learn to program from this book: if you can already program you’ll see that R is a nice language in which to quickly make a wide range of plots. If you can’t program then you may be surprised how few commands R requires to produce impressive results. As someone who is a beginner in R, the examples are a nice tour of what is possible and some little tricks, such as the fact that plot functions don’t take data frames as arguments: you need to extract arrays.

As well as programming the book also includes references to a range of data sources and online tools, for example colorbrewer2.org – a tool for selecting colour schemes, and links to the various mapping APIs.

Readers of this blog will know that I am an avid data scraper and visualiser myself, and in a sense this book is an overview of that way of working – in fact I see I referenced flowingdata in my attempts to colour in maps (here).

The big thing I learned from the book in terms of workflow is the application of a vector graphics package, such as Adobe Illustrator or, Inkscape, to tidy up basic graphics produced in R. This strikes me as a very good idea, I’ve spent many a frustrating hour trying to get charts looking just right in the programming or plotting language of my choice and now I discover that the professionals use a shortcut! A quick check shows that R exports to PDF, which Inkscape can read.

Stylistically the book is exceedingly chatty, including even the odd um and huh, which helps make it quick and easy read although is a little grating. Many of the examples are also available over on flowingdata.com, although I notice that some are only accessible for paid membership. You might want to see the book as a way of showing your appreciation for the blog in physical and monetary form.

Look out for better looking visualisations from me in the future!

Dec 01 2011

Case-sensitive

As a long time programmer there is a little thing I’d like to rant about: case-sensitivity.

For the uninitiated this is the thing that makes your program think that the variable called “MyVariable” is different from the variable called “myVariable” and the variable called “Myvariable”. The problem is that some computer languages have it and some computer languages don’t.

I grew up with BASIC and later FORTRAN, case-insensitive languages which do the natural thing and assume that capitalisation does not matter. Other languages (C#, Java, C, Matlab) are not so forgiving and insist that “a” and “A” refer to two completely different things. In real life this feels like a wilful act of obstinacy, the worst excesses of teenage pedantry, it is a user experience fail.

The origins of case-sensitivity lie in the origins of the language C in the early 1970s,  FORTRAN doesn’t have it because when it was invented, in the dawn of computing, teletype printers did not support lowercase – there was no space on the print head.  I still think of FORTRAN as a language written in ALL CAPS and so rather IMPERATIVE.

There is an argument for case-sensitivity from the point of view of compactness; mathematicians, even of my relatively lowly level will name their variables in equations with letters from the Roman and Greek alphabets, subscripts and superscripts. My father, an undergraduate mathematician, even went as far as Cyrillic alphabet. Sadly the print media, even New Scientist, do not support such typographically extravagance.

It’s even worse when your language is dynamically-typed, that’s to say it allows you to create variables willy-nilly as you write your program rather than statically-typed languages which demand you tell them explicitly of the introduction of new variables. In a statically typed language if you start with a variable called “MyVariable” and later introduce “Myvariable”, by a slip of the key, then the compiler will kick-off: complaining it has no knowledge of this interloper. A dynamically-typed language will accept this new introduction silently, giving it a default value and causing untold damage in subsequent calculations.

It’s not like case-sensitivity is used in any syntactically meaningful manner: to a computer there is no practical difference between “foo” and “Foo” – the standard placeholder function name, foo” and “Foo” to the computer are simply the label you have stuck to a box containing a thing. There are some human conventions, but they are just that – and as with any convention they are honoured as much in the breech as the observance. The compiler doesn’t care.

I must admit to a fondness of CamelCase: capitalising the initial letters of each word in a long variable name, I do it in my hashtags on twitter. In the old days of FORTRAN no such fripperies existed, not only were your variable names limited in case but also in length: you had 6 characters to work your magic.

This is to ignore the many and varied uses different uses that computer languages find for brackets: {}, (), [] and even <>.

Nov 23 2011

House of Lords register of members interests

This post is about the House of Lords register of members interests, an online resource which describes the financial and other interests of members of the UK House of Lords. This follows on from earlier posts on the attendance rates of Lords, it turns out 20% of them only turn up twice a year. I also wrote a post on the political  breakdown of the House and the number of appointments to it in each year over the period since the mid-1970s. This is all of current interest since reform is in the air for the House of Lords, on which subject I made a short post.

I was curious to know the occupations of the Lords, there is no direct record of occupations but the register of members interests provides a guide. The members interests are divided into categories, described in this document and summarised below:

Category 1 Directorships
Category 2 Remunerated employment, office, profession etc.
Category 3 Public affairs advice and services to clients
Category 4a Controlling shareholding
Category 4b Not a controlling shareholding but exceeding £50,000
Category 5 Land and property, capital value exceeding £250,000 or income exceeding £5,000 but not main residence
Category 6 Sponsorship
Category 7 Overseas visits
Category 8 Gifts, benefits and hospitality
Category 9 Miscellaneous financial interests
Category 10a Un-renumerated directorship or employment
Category 10b Membership of public bodies, (hospital trusts, governing bodies etc)
Category 10c Trusteeships of galleries, museums and so forth
Category 10d Officer or trustee of a pressure group or union
Category 10e Officer or trustee of a voluntary or not-for-profit organisation

 

The values of these interests are not listed but typically the threshold value for inclusion is £500 except where stated.

The data are provided as webpages, with one page per initial letter there are no Lords whose Lord Name starts with X or Z. This is a bit awkward for carrying out analysis so I wrote a program in Python which reads the webpages using the BeautifulSoup HTML/XML parser and converts them into a single Comma Separated Value (CSV) file where each row corresponded to a single category entry for a single Lord – this is the most useful format for subsequent analysis.

The data contains entries for 828 Lords, which translates into 2821 entries in the big table. The chart below shows the number of entries for each category.

 

CategoryBreakdown

This breaks things down into more manageable chunks. I quite like the miscellaneous category 9, where people declare their spouses if they are also members of the House and Lord Edmiston who declares “Occasional income from the hiring of Member’s plane”. Those that declare no interests are split between “on leave of absence”, “no registrable interests”, “there are no interests for this peer” and “information not yet received”. The sponsorship category (6) is fairly dull, typically secretarial support from other roles.

Their Lordships are in great demand as officers and trustees of non-profits and charities, as indicated by category 10e, and as members on the boards of public bodies (category 10b).

I had hoped that category 2 would give me some feel for occupations of Lords, I was hoping to learn something of the skills distribution since it’s often claimed that the way in which they are appointed means they bring a wide range of expertise to bear. Below I show a wordle of the category 2 text.Wordle of category 2 interests textThere’s a lot of speaking and board membership going on unfortunately it’s not easy to pull occupations out of the data. I can’t help but get the impression that the breakdown of the Lords is not that dissimilar to that of the Commons, indeed many Lords are former MPs – this means lots of lawyers.

You can download the data in the form of a single file from Google Docs here. I’ve added an index column and the length of the text for each entry. Viewing as a single file in this compact format is easier than the original pages and you can do interesting things such as sort by different columns or search the entire file for keywords (professor, Tesco, BBC… etc). The Python program I wrote is here.

Aug 27 2011

Living in code

Eric Schmidt, chairman of Google is in the news with his comments at the MacTaggart Lecture at the Edinburgh International Television Festival. The headline is a general criticism of the UK education system but what he actually said was more focussed on technology and in particular IT education: bemoaning the fact that computer science was not compulsory and what of it that there was about the use of software packages rather than how to code.

I was born in 1970, and learnt to program sometime in the early 80s. I can’t remember exactly where but I suspect it was in part at the after-school computer club my school ran. A clear memory I have is of an odd man who’d brought in a TRS-80 explaining that a FOR-NEXT loop was an instruction for a computer to “go look up its bottom” – this was at a time before CRB checks. My first computer was a Commodore VIC-20, Clive Sinclair having failed to deliver a ZX81 and the BBC Micros being rather more expensive proposition than my parents were willing to afford.

Many children of the early 80s cut their teeth programming by typing in programs from computer magazines; a tedious exercise which trained you in accurate transcription and debugging. Even at that time the focus of Computer Studies lessons was on using applications rather than teaching us to program although I do remember watching the BBC programmes on programming which went alongside the BBC Micro. As I have mentioned before, programming is in my blood – both my parents were programmers in the 60s.

About 10 years ago I was teaching programming to undergraduate physicists, from a class of 50 only 2 had any previous programming experience. The same is true in my workplace, a research lab where only a small minority of us can code.

Knowing how to code gives you a different mindset when approaching computer systems. Recently I have been experimenting with my company reports database. The reports are stored as PDF files; I was told the text inside them was not accessible – now to me that sounds like a challenge! After a bit of hacking I’d worked out how to extract the full text of reports out of the PDF files but then code that once worked stopped working. This puzzled me, so I checked the text that my program was pulling from the database and instead of being a PDF file, it was a message saying “Please don’t do that"!

At the moment I’m writing a program that takes an address list file, checks to see if the addressees have a mobile phone number and if they do uploads it to an SMS service, spitting out into a separate file those that do not have a mobile phone number. To me this is a problem that has an obvious programming solution, for the people who generate the address list it’s a bit like black magic.

These days we are surrounding by technology bearing code, just about every piece of electrical equipment in my house has code in it, but it seems that ever fewer of us have been inducted into the magic of writing our own code. These days there’s just so much more fun to be had from programming: there are endless online data sources and our phones and computers have so many programmable facilities built into them.

At what age can I teach my child Python?

Older posts «