Tag: R

Making a ScraperWiki view with R


This post was first published at ScraperWiki.

In a recent post I showed how to use the ScraperWiki Twitter Search Tool to capture tweets for analysis. I demonstrated this using a search on the #InspiringWomen hashtag, using Tableau to generate a visualisation.

Here I’m going to show a tool made using the R statistical programming language which can be used to view any Twitter Search dataset. R is very widely used in both academia and industry to carry out statistical analysis. It is open source and has a large community of users who are actively developing new libraries with new functionality.

Although this viewer is a trivial example, it can be used as a template for any other R-based viewer. To break the suspense this is what the output of the tool looks like:


The tool updates when the underlying data is updated, the Twitter Search tool checks for new tweets on an hourly basis. The tool shows the number of tweets found and a histogram of the times at which they were tweeted. To limit the time taken to generate a view the number of tweets is limited to 40,000. The histogram uses bins of one minute, so the vertical axis shows tweets per minute.

The code can all be found in this BitBucket repository.

The viewer is based on the knitr package for R, which generates reports in specified formats (HTML, PDF etc) from a source template file which contains R commands which are executed to generate content. In this case we use Rhtml, rather than the alternative Markdown, which enables us to specify custom CSS and JavaScript to integrate with the ScraperWiki platform.

ScraperWiki tools live in their own UNIX accounts called “boxes”, the code for the tool lives in a subdirectory, ~/tool, and web content in the ~/http directory is displayed. In this project the http directory contains a short JavaScript file, code.js, which by the magic of jQuery and some messy bash shell commands, puts the URL of the SQL endpoint into a file in the box. It also runs a package installation script once after the tool is first installed, the only package not already installed is the ggplot2 package.

function save_api_stub(){
scraperwiki.exec('echo "' + scraperwiki.readSettings().target.url + '" > ~/tool/dataset_url.txt; ')
function run_once_install_packages(){
scraperwiki.exec('run-one tool/runonce.R &> tool/log.txt &')

view raw


hosted with ❤ by GitHub

The ScraperWiki platform has an update hook, simply an executable file called update in the ~/tool/hooks/ directory which is executed when the underlying dataset changes.

This brings us to the meat of the viewer: the knitrview.R file calls the knitr package to take the view.Rhtml file and convert it into an index.html file in the http directory. The view.Rhtml file contains calls to some functions in R which are used to create the dynamic content.

# Script to knit a file 2013-08-08
# Ian Hopkinson

view raw


hosted with ❤ by GitHub

Code for interacting with the ScraperWiki platform is in the scraperwiki_utils.R file, this contains:

  • a function to read the SQL endpoint URL which is dumped into the box by some JavaScript used in the Rhtml template.
  • a function to read the JSON output from the SQL endpoint – this is a little convoluted since R cannot natively use https, and solutions to read https are different on Windows and Linux platforms.
  • a function to convert imported JSON dataframes to a clean dataframe. The data structure returned by the rjson package is comprised of lists of lists and requires reprocessing to the preferred vector based dataframe format.

Functions for generating the view elements are in view-source.R, this means that the R code embedded in the Rhtml template are simple function calls. The main plot is generated using the ggplot2 library. 

# Script to create r-view 2013-08-14
# Ian Hopkinson
query = 'select count(*) from tweets'
number = ScraperWikiSQL(query)
#threshold = 20
bin = 60 # Size of the time bins in seconds
query = 'select created_at from tweets order by created_at limit 40000'
dates_raw = ScraperWikiSQL(query)
posix = strptime(dates_raw$created_at, "%Y-%m-%d %H:%M:%S+00:00")
num = as.POSIXct(posix)
Dates = data.frame(num)
p = qplot(num, data = Dates, binwidth = bin)
# This gets us out the histogram count values
counts = ggplot_build(p)$data[[1]]$count
timeticks = ggplot_build(p)$data[[1]]$x
# Calculate limits, method 1 – simple min and max of range
start = min(num)
finish = max(num)
minor = waiver() # Default breaks
major = waiver()
p = p+scale_x_datetime(limits = c(start, finish ),
breaks = major, minor_breaks = minor)
p = p + theme_bw() + xlab(NULL) + theme(axis.text.x = element_text(angle=45,
hjust = 1,
vjust = 1))
p = p + xlab('Date') + ylab('Tweets per minute') + ggtitle('Tweets per minute (Limited to 40000 tweets in total)')

view raw


hosted with ❤ by GitHub

So there you go – not the world’s most exciting tool but it shows the way to make live reports on the ScraperWiki platform using R. Extensions to this would be to allow some user interaction, for example by allowing them to adjust the axis limits. This could be done either using JavaScript and vanilla R or using Shiny.

What would you do with R in ScraperWiki? Let me know in the comments below or by email: ian@scraperwiki.com

Posting abroad: my book reviews at ScraperWiki

It’s been a bit quiet on my blog this year, this is partly because I’ve got a new job at ScraperWiki. This has reduced my blogging for two reasons, the first is that I am now much busier but the second is that I write for the ScraperWiki blog. I thought I’d summarise here what I’ve done there just to keep everything in one place.

There’s a lot of programming and data science in my new job , so I’ve been reading programming and data analysis books on the train into work. The book reviews are linked below:

I seem to have read quite a lot!

Related to this is a post I did on Enterprise Data Analysis and visualisation: An interview study, an academic paper published by the Stanford Visualization Group.

Finally, I’ve been on the stage – or at least presenting at a meeting – I spoke at Data Science London a couple of weeks ago about Scraping and Parsing PDF files. I wrote a short summary of the event here.

datavisualization_andykirk javascriptthegoodparts1 machinelearningcover interactivevisualisation natural-language-processing-with-python



Book Review: Visualize This by Nathan Yau

9780470944882 cover.inddThis book review is of Nathan Yau’s “Visualize This: The FlowingData Guide to Design, Visualization and Statistics”. It grows out of Yau’s blog: flowingdata.com, which I recommend, and also his experience in preparing graphics for The New York Times, amongst others.

The book is a run-through of pragmatic methods in visualisation, focusing on practical means of achieving ends rather more abstract design principles for data visualisation; if you want that then I recommend Tufte’s “The Visual Display of Quantitative Information”.

The book covers a bit of data scraping, extracting useful numerical data from disparate sources, as Yau comments this is the thing that takes the time in this type of activity. It also details methods for visualising time series data, proportions, geographic data and so forth.

The key tools involved are the R and Python programming languages; I already have these installed in the form of R Studio and Python(x,y), distributions which provide an environment that looks like the Matlab one with which I have long been familiar with but which sadly is somewhat expensive for a hobby programmer. Alongside this are the freely available Processing language and the Protovis Javascript library which are good for interactive, online visualisations, and the commercial packages Adobe Illustrator, for vector graphic editing, and Adobe Flash Builder for interactive web graphics. Again these are tools I find out of my range financially for my personal use although Inkscape seems to be a good substitute for Illustrator.

With no prior knowledge of Flash and no Flash Builder, I found the sections on Flash a bit bewildering. It also highlights how perhaps this will be a book very distinctively of its time, with Apple no longer supporting Flash on iPhone its quite possible that the language will die out. And I notice on visiting the Protovis website that this is no longer under development: the authors have moved on to D3.js, Openzoom which is also mentioned is no longer supported. Python has been around for sometime now and is the lightweight language of choice for many scientists, similarly R has been around for a while and is increasing in popularity.

You won’t learn to program from this book: if you can already program you’ll see that R is a nice language in which to quickly make a wide range of plots. If you can’t program then you may be surprised how few commands R requires to produce impressive results. As someone who is a beginner in R, the examples are a nice tour of what is possible and some little tricks, such as the fact that plot functions don’t take data frames as arguments: you need to extract arrays.

As well as programming the book also includes references to a range of data sources and online tools, for example colorbrewer2.org – a tool for selecting colour schemes, and links to the various mapping APIs.

Readers of this blog will know that I am an avid data scraper and visualiser myself, and in a sense this book is an overview of that way of working – in fact I see I referenced flowingdata in my attempts to colour in maps (here).

The big thing I learned from the book in terms of workflow is the application of a vector graphics package, such as Adobe Illustrator or, Inkscape, to tidy up basic graphics produced in R. This strikes me as a very good idea, I’ve spent many a frustrating hour trying to get charts looking just right in the programming or plotting language of my choice and now I discover that the professionals use a shortcut! A quick check shows that R exports to PDF, which Inkscape can read.

Stylistically the book is exceedingly chatty, including even the odd um and huh, which helps make it quick and easy read although is a little grating. Many of the examples are also available over on flowingdata.com, although I notice that some are only accessible for paid membership. You might want to see the book as a way of showing your appreciation for the blog in physical and monetary form.

Look out for better looking visualisations from me in the future!