This post was first published at ScraperWiki.
In a recent post I showed how to use the ScraperWiki Twitter Search Tool to capture tweets for analysis. I demonstrated this using a search on the #InspiringWomen hashtag, using Tableau to generate a visualisation.
Here I’m going to show a tool made using the R statistical programming language which can be used to view any Twitter Search dataset. R is very widely used in both academia and industry to carry out statistical analysis. It is open source and has a large community of users who are actively developing new libraries with new functionality.
Although this viewer is a trivial example, it can be used as a template for any other R-based viewer. To break the suspense this is what the output of the tool looks like:
The tool updates whenever the underlying data is updated; the Twitter Search tool checks for new tweets on an hourly basis. The tool shows the number of tweets found and a histogram of the times at which they were tweeted. To limit the time taken to generate a view, the number of tweets is limited to 40,000. The histogram uses bins of one minute, so the vertical axis shows tweets per minute.
The code can all be found in this BitBucket repository.
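The tool's JavaScript saves the URL of the underlying dataset to a file and then runs the runonce.R script in the background:

    scraperwiki.exec('echo "' + scraperwiki.readSettings().target.url + '" > ~/tool/dataset_url.txt; ')
    scraperwiki.exec('run-one tool/runonce.R &> tool/log.txt &')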
The ScraperWiki platform has an update hook: an executable file called update in the ~/tool/hooks/ directory, which is executed when the underlying dataset changes.
This brings us to the meat of the viewer: the knitrview.R file calls the knitr package to take the view.Rhtml file and convert it into an index.html file in the http directory. The view.Rhtml file contains calls to some functions in R which are used to create the dynamic content.
    # Script to knit a file 2013-08-08
    # Ian Hopkinson
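The knitting step described above can be sketched roughly as follows. This is a minimal illustration, not the contents of knitrview.R: the names knit_view and tweet_total, the toy template and the temporary directory are all assumptions made for the example. It relies only on knitr's knit() function and the Rhtml inline syntax.

```r
# Minimal sketch: knit an Rhtml template into <out_dir>/index.html.
# knit_view() and tweet_total() are hypothetical names, not from the repository.
knit_view <- function(template, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  output <- file.path(out_dir, "index.html")
  knitr::knit(input = template, output = output, quiet = TRUE)
  output
}

# Stand-in for a function that would normally query the dataset
tweet_total <- function() 42

# A toy template whose only dynamic content is one inline function call
work_dir <- tempfile("viewer")
dir.create(work_dir)
template <- file.path(work_dir, "view.Rhtml")
writeLines(c(
  "<html><body>",
  "<p>Tweets found: <!--rinline tweet_total() --></p>",
  "</body></html>"
), template)

html_file <- knit_view(template, file.path(work_dir, "http"))
```

Keeping the template down to plain function calls like this means the HTML stays readable and the R logic can be tested separately.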
Code for interacting with the ScraperWiki platform is in the scraperwiki_utils.R file, which contains:
- a function to read the JSON output from the SQL endpoint. This is a little convoluted since R cannot natively use https, and solutions to read https are different on Windows and Linux platforms.
- a function to convert imported JSON dataframes to a clean dataframe. The data structure returned by the rjson package consists of lists of lists and requires reprocessing to the preferred vector-based dataframe format.
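The second of these steps, turning a list of lists into a vector-based dataframe, might look something like the sketch below. It is written in base R and is not the code from scraperwiki_utils.R. The function name records_to_dataframe is hypothetical, and it assumes each record is a flat list of scalar values, with NULL standing in for a JSON null.

```r
# Hypothetical helper: convert a list of records (lists) into a data frame
# with one vector per column. Assumes every value is a scalar or NULL.
records_to_dataframe <- function(records) {
  columns <- unique(unlist(lapply(records, names)))
  cols <- lapply(columns, function(col) {
    # Pull one field out of every record, using NA where it is missing
    raw <- vapply(records, function(rec) {
      value <- rec[[col]]
      if (is.null(value)) NA_character_ else as.character(value)
    }, character(1))
    # Let R pick a sensible type (integer, numeric, character) per column
    type.convert(raw, as.is = TRUE)
  })
  names(cols) <- columns
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Example input shaped like parsed JSON rows
records <- list(
  list(id = "1", created_at = "2013-08-14 10:00:05+00:00"),
  list(id = "2", created_at = NULL)
)
tweets <- records_to_dataframe(records)
```

Building each column as a whole vector, rather than binding rows one at a time, keeps the conversion fast for tens of thousands of rows.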
Functions for generating the view elements are in view-source.R, so the R code embedded in the Rhtml template consists of simple function calls. The main plot is generated using the ggplot2 library.
    # Script to create r-view 2013-08-14
    # Ian Hopkinson
    library(ggplot2)
    # ScraperWikiSQL() is defined in scraperwiki_utils.R

    # Total number of tweets in the dataset
    query = 'select count(*) from tweets'
    number = ScraperWikiSQL(query)

    #threshold = 20
    bin = 60 # Size of the time bins in seconds
    query = 'select created_at from tweets order by created_at limit 40000'
    dates_raw = ScraperWikiSQL(query)
    # Timestamps are in UTC, so parse them as UTC rather than local time
    posix = strptime(dates_raw$created_at, "%Y-%m-%d %H:%M:%S+00:00", tz = "UTC")
    num = as.POSIXct(posix)
    Dates = data.frame(num)
    p = qplot(num, data = Dates, binwidth = bin)
    # This gets us out the histogram count values (first layer of the built plot)
    counts = ggplot_build(p)$data[[1]]$count
    timeticks = ggplot_build(p)$data[[1]]$x
    # Calculate limits, method 1 – simple min and max of range
    start = min(num)
    finish = max(num)
    minor = waiver() # Default breaks
    major = waiver()
    p = p + scale_x_datetime(limits = c(start, finish),
                             breaks = major, minor_breaks = minor)
    p = p + theme_bw() + theme(axis.text.x = element_text(angle = 45,
                                                          hjust = 1,
                                                          vjust = 1))
    p = p + xlab('Date') + ylab('Tweets per minute') + ggtitle('Tweets per minute (Limited to 40000 tweets in total)')
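To make the "tweets per minute" counting concrete, here is a small base R sketch of the same idea without ggplot2. The function names parse_created_at and count_per_bin and the three sample timestamps are made up for illustration. The bins here start on whole minutes, which is not necessarily where ggplot2 places its bin edges, so the counts may differ slightly from those in the plot.

```r
# Parse UTC timestamps in the format used by the created_at column
parse_created_at <- function(x) {
  as.POSIXct(strptime(x, "%Y-%m-%d %H:%M:%S+00:00", tz = "UTC"))
}

# Count how many timestamps fall into each bin of bin_seconds seconds
count_per_bin <- function(times, bin_seconds = 60) {
  # Round each time down to the start of its bin (seconds since the epoch)
  bin_start <- floor(as.numeric(times) / bin_seconds) * bin_seconds
  counts <- table(bin_start)
  data.frame(
    bin_start = as.POSIXct(as.numeric(names(counts)),
                           origin = "1970-01-01", tz = "UTC"),
    tweets = as.vector(counts)
  )
}

# Made-up sample data: two tweets in one minute, one in the next
created_at <- c("2013-08-14 10:00:05+00:00",
                "2013-08-14 10:00:40+00:00",
                "2013-08-14 10:01:10+00:00")
per_minute <- count_per_bin(parse_created_at(created_at))
```

With a bin size of 60 seconds each row of per_minute is one minute, which is why the vertical axis of the histogram can be labelled as tweets per minute.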
What would you do with R in ScraperWiki? Let me know in the comments below or by email: firstname.lastname@example.org