A Docker environment for Windows (October 2015 edition)

This blog post outlines a method for setting up a nice environment for developing in Python with Docker on a Windows 10 machine. Hopefully I have included enough of the error messages I encountered that both I and others will find this post when in distress!

For the past three years I’ve been working for ScraperWiki as a data scientist. This has meant a degree of coding in Python and interacting with my colleagues, and some customers, who use Linux (principally Ubuntu) or OS X. I have continued to use a Windows laptop. You can see my review of it here.

Until recently my setup was based on a core installation of Python and a whole bunch of handy libraries using Python(x,y). I also installed Git for Windows which gave me a shell prompt, and the command-line git commands along with some fraction of the bash environment. I also installed msysgit which provided further Linux style enhancements to my shell. I configured my shell so I could get ssh access to ScraperWiki servers in the cloud. For reasons I can’t recall I also installed ansicon.exe which gives the Windows Command prompt some of the colour highlighting of a modern shell prompt.

With this setup I could do most of what I needed from Windows, and if I had to I could fire up my Ubuntu VM and work in there. Typically I did this when I had some tricky libraries to install, or I wanted to be sure I could deploy onto ScraperWiki’s servers in the cloud. I never really got virtual environments working nicely on Windows: virtualenvwrapper, which makes such things nicer, is challenging to configure on Windows.

Students of this sort of thing will appreciate that the configuration described above is reached with a degree of trial and error, and a lot of Googling of error messages.

Times have changed and this setup was getting a bit long in the tooth. The environment around me was also changing: we started using Docker. I couldn’t get code using the Python requests library to run because of problems with SSL. Also, all the cool kids were talking about Python 3 and how new projects should all be in Python 3. I couldn’t work out how to add Python 3 to my Python(x,y) installation, and furthermore I was tied to 32-bit rather than 64-bit Python. ScraperWiki had recently done some work on making an easily deployable Python application and identified that the Anaconda Python distribution by ContinuumIO was the way to go.

Installing Python 3 and 2 using Anaconda

This worked very smoothly; there is an installer here. I had Python 3.4.3 (64-bit) working in the twinkling of an eye, and from my bash prompt I could now run the Python code which was previously broken due to OpenSSL problems. However, all was not rosy, since it turns out my latest project was accidentally Python 3 compatible whilst my older projects were not. I therefore needed Python 2 as well. In principle, with Anaconda this is as simple as doing:

conda create -n python2 python=2.7 anaconda

and then

source activate python2

This puts you into a Python 2 virtual environment which will run your old code. However, it doesn’t work from the Git Bash prompt; you need to use a Windows Command prompt (where the command is activate python2, without the source), as discussed here. But at least I now have the latest whizzy Python 3 installation and I can also run Python 2 when required. It’s worth noting that installing new libraries on Python under Windows has become rather easier with newer versions of pip, I believe due to the introduction of wheels, the pre-built package format that pip can install directly. In the past installing some libraries was a pain because of a need to compile binary components.
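With two Pythons on one machine it is easy to lose track of which one is running. The following is a minimal sketch, not part of my original setup, that reports the interpreter version and the active conda environment. The function name describe_interpreter is made up for illustration; CONDA_DEFAULT_ENV is the environment variable conda sets on activation, and it will simply be missing outside conda.

```python
import os
import sys


def describe_interpreter():
    """Return a small dict describing the Python interpreter running this code."""
    return {
        "version": "%d.%d.%d" % tuple(sys.version_info[:3]),
        "major": sys.version_info[0],
        # Root directory of the installation or environment in use
        "prefix": sys.prefix,
        # Set by conda when an environment is activated; None otherwise
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


if __name__ == "__main__":
    for key, value in sorted(describe_interpreter().items()):
        print("%s: %s" % (key, value))
```

The script avoids anything specific to Python 2 or 3, so it can be run unchanged in both environments to confirm that activation did what was expected.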

Using Docker on Windows with Docker Toolbox and the Git SDK

The next task was to get support for Docker, the container system. You can find out more about Docker in my blog post here. Essentially it is a method for running an application in an isolation unit which is defined by a simple Dockerfile, largely removing problems of dependencies and versions. Docker is intrinsically a Linux technology: it relies on several deeply embedded components of the operating system and so does not run natively on Windows. However, you can boot up a very lightweight Linux-based VM and run Docker images on that from either Windows or OS X. Until recently this was done using boot2docker. The new way is to use the Docker Toolbox. I held off installing this until it became Windows 10 compatible since, as a neophile, I had obviously upgraded to Windows 10 at the earliest opportunity. Docker Toolbox installs VirtualBox to run a VM to host Docker, and Git for Windows to provide a bash shell prompt, as well as the Docker command-line tools.

I found installing Docker Toolbox relatively smooth, although I had a problem with it finding its key and certificate files, with the error message “open <filepath>\ca.pem : The system cannot find the file specified”. This was fixed by regenerating the certificates:

docker-machine regenerate-certs default

But this alone does not give me the right workflow, since ScraperWiki make heavy use of Make to build and run containers and Git for Windows does not come configured with Make. You can see this in action for the Simple API we made for the NewsReader Project. I used the Git for Windows SDK to provide Make and other build tools. This is designed for use by Git for Windows developers; it’s based on msys2, which I also tried to install but which errored on a couple of steps. The Git SDK is more verbose in its installation but appeared to install cleanly.

Once we have the Git for Windows SDK we need to use its git-bash to launch the Docker Quickstart Terminal (rather than the version provided by the Toolbox). This means changing the command executed by the Docker Quickstart Terminal shortcut from:

"C:\Program Files\Git\git-bash.exe" "C:\Program Files\Docker Toolbox\start.sh"

to

C:\git-sdk-64\git-bash.exe "C:\Program Files\Docker Toolbox\start.sh"

Update 2016-03-21: I modified start.sh to give the docker-machine binary an absolute path, which means I can launch a plain Git Bash shell and run the start.sh script later, if required. This change required further modification to make sure paths were properly escaped. You can see my version of start.sh here: https://gist.github.com/IanHopkinson/85453a90212eb6627f29

Simply trying to run the Git SDK version of the make tool does not seem to work; you get an error like “unable to make temporary trusted Dockerfile”.

We’re into the final straight now!

My final problem was that when I tried to make my previously working application it failed with an error message:

IOError: [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/home/newsreader-demo']
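This error comes from Python’s tempfile module, which tries a list of candidate directories (here the three standard locations followed by the current working directory) and gives up if it cannot create a file in any of them. A minimal sketch of the same kind of check follows; the helper name usable_temp_dirs is made up for illustration, and it is a simplification of what tempfile does internally rather than a copy of it.

```python
import os
import tempfile


def usable_temp_dirs(candidates):
    """Return those candidate directories in which a temporary file can be created."""
    usable = []
    for directory in candidates:
        try:
            # Actually creating a file is the real test: it fails for
            # directories that are missing and for ones that are read-only.
            with tempfile.TemporaryFile(dir=directory):
                pass
        except OSError:
            continue
        usable.append(directory)
    return usable


if __name__ == "__main__":
    # The same list that appears in the error message, with the working
    # directory standing in for /home/newsreader-demo
    print(usable_temp_dirs(["/tmp", "/var/tmp", "/usr/tmp", os.getcwd()]))
```

Run inside the container, an empty list from this check would point to the same situation as the error message: no writable temporary directory is visible to Python.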

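The problem seems to be the way in which msys2 handles paths in Windows. With a single leading slash, msys2 converts /tmp into a Windows path before docker sees it; the path needs two leading slashes (//) rather than one to be passed through as intended, as described here. So all I need to do is change this line in my Makefile: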

@docker run -p 8000:8000 --read-only --rm --volume /tmp -e NEWSREADER_PUBLIC_API_KEY ianhopkinson/newsreader_demo

To this:

@docker run -p 8000:8000 --read-only --rm --volume //tmp -e NEWSREADER_PUBLIC_API_KEY ianhopkinson/newsreader_demo

Can you see what I did there?
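The change is easy to miss, so here is a minimal sketch of the rule as a Python helper that builds the command line to paste into a Makefile or type at the Git Bash prompt. The names msys_safe and docker_run_line are made up for illustration, and the sketch assumes an msys2-based shell where doubling the leading slash suppresses path conversion, as described above.

```python
def msys_safe(path):
    """Double the leading slash of an absolute POSIX path; leave other paths alone."""
    if path.startswith("/") and not path.startswith("//"):
        return "/" + path
    return path


def docker_run_line(image, volume="/tmp"):
    """Build the docker run line from the Makefile with an msys2-safe volume path."""
    parts = [
        "docker", "run", "-p", "8000:8000", "--read-only", "--rm",
        "--volume", msys_safe(volume),
        "-e", "NEWSREADER_PUBLIC_API_KEY",
        image,
    ]
    return " ".join(parts)


if __name__ == "__main__":
    print(docker_run_line("ianhopkinson/newsreader_demo"))
```

The helper only produces text. It would not help a command launched directly from a native Windows Python, because the path conversion is done by the msys2 shell when it starts a Windows program.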

Update 2015-10-14 – interactive shells into docker

If you try to get an interactive shell on a container then you get an error like:

cannot enable tty mode on non tty input

To avoid this you can use winpty:

winpty docker exec -i -t [CONTAINER_NAME] bash

There’s some discussion of this on the Docker Toolbox issue tracker.
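The underlying issue is that programs started from the Git Bash terminal do not see a real Windows console on their standard input. A minimal sketch of how a helper script might decide whether to add winpty follows. The function name interactive_exec_command is made up for illustration, and the sketch assumes a native Windows Python, whose sys.stdin.isatty() typically reports False in this terminal.

```python
import sys


def interactive_exec_command(container, stdin_is_tty=None):
    """Build a 'docker exec' command list, prefixed with winpty when stdin is not a TTY."""
    if stdin_is_tty is None:
        # Ask the running interpreter whether its standard input is a terminal
        stdin_is_tty = sys.stdin is not None and sys.stdin.isatty()
    command = ["docker", "exec", "-i", "-t", container, "bash"]
    if not stdin_is_tty:
        # winpty supplies the console that docker expects for -t
        command.insert(0, "winpty")
    return command


if __name__ == "__main__":
    # "my_container" is a placeholder name
    print(" ".join(interactive_exec_command("my_container")))
```

This only makes sense in Git Bash on Windows. On Linux or OS X there is no winpty and the plain docker exec command should be used.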

Update 2015-10-22 – Which Python are you using?

It turns out I was accidentally using the Python shipped with Git for Windows SDK, rather than the Anaconda version I had so carefully installed. I fixed this by adding this to my .profile file:

export PATH=/c/anaconda3/:$PATH

I didn’t spot it earlier because I checked the Python version by running ipython rather than python.
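A quick way to catch this kind of mix-up is to ask where each command actually resolves on the PATH. The following is a minimal sketch using shutil.which from the standard library (Python 3.3 or later); the function names first_on_path and report are made up for illustration.

```python
import os
import shutil
import sys


def first_on_path(program, path=None):
    """Return the full path of the first match for program on the PATH, or None."""
    return shutil.which(program, path=path)


def report():
    """Compare the running interpreter with what python and ipython resolve to."""
    return {
        "running": sys.executable,
        "python on PATH": first_on_path("python"),
        "ipython on PATH": first_on_path("ipython"),
    }


if __name__ == "__main__":
    for label, location in sorted(report().items()):
        print("%s: %s" % (label, location))
```

If python and ipython resolve to different installations, as they did for me, the two paths printed here will sit under different directories.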

Concluding thoughts

I wrote this partly in frustration at the amount of time I spent getting this all fixed up, and the fact that I couldn’t stop until I had fixed it. The scheme above worked for me but I suspect it is quicker and easier to do on a laptop with no history.

There’s no doubt that the situation is better than I found it 3 years ago but it is still a painful process involving much trial and error. Docker brings great benefits for developers, and once it is working makes sharing your work across multiple users very straightforward.

Book review: Data Science at the Command Line by Jeroen Janssens

This review was first published at ScraperWiki.

In the mixed environment of ScraperWiki we make use of a broad variety of tools for data analysis. Data Science at the Command Line by Jeroen Janssens covers tools available at the Linux command line for doing data analysis tasks. The book is divided thematically into chapters on Obtaining, Scrubbing, Modeling and Interpreting Data, with “intermezzo” chapters on parameterising shell scripts, using the Drake workflow tool and parallelisation using GNU Parallel.

The original motivation for the book was a desire to move away from purely GUI-based approaches to data analysis (I think he means Excel and the Windows ecosystem). This is a common desire for data analysts: GUIs are very good for a quick look-see, but once you start wanting to repeat analysis or even repeat visualisation they become more troublesome. And launching Excel just to remove a column of data seems a bit laborious. Windows does have its own command line, PowerShell, but it’s little used by data scientists. This book is about the Linux command line; the examples are all available on a virtual machine populated with all of the tools discussed in the book.

The command line is at its strongest with the early steps of the data analysis process: getting data from places, carrying out relatively minor acts of tidying and answering the question “does my data look remotely how I expect it to look?”. Janssens introduces the battle-tested tools sed, awk, and cut, which we use around the office at ScraperWiki. He also introduces jq (the JSON parser); this is a more recent introduction but it’s great for poking around in JSON files as commonly delivered by web APIs. An addition I hadn’t seen before was csvkit, which provides a suite of tools for processing CSV at the command line; I particularly like the look of csvstat. csvkit is a Python tool and I can imagine using it directly in Python as a library.
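To give a flavour of the “does my data look how I expect?” check in Python, here is a minimal sketch of a per-column summary in the spirit of csvstat. It deliberately uses only the standard library csv and statistics modules rather than csvkit’s own API, and the function name column_stats is made up for illustration.

```python
import csv
import io
import statistics


def column_stats(csv_text):
    """Summarise each column: row count, plus min/max/mean if every value is numeric."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            columns[name].append(row[name])

    summary = {}
    for name, values in columns.items():
        stats = {"count": len(values)}
        try:
            numbers = [float(value) for value in values]
        except (TypeError, ValueError):
            # Not a purely numeric column: report the number of distinct values
            stats["unique"] = len(set(values))
        else:
            if numbers:
                stats.update(
                    min=min(numbers),
                    max=max(numbers),
                    mean=statistics.mean(numbers),
                )
        summary[name] = stats
    return summary


if __name__ == "__main__":
    sample = "name,age\nann,30\nbob,40\n"
    for column, stats in column_stats(sample).items():
        print(column, stats)
```

This is far cruder than csvstat, which handles dates, nulls and much else, but it shows how naturally this sort of check moves from the shell into a Python session.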

The style of the book is to provide a stream of practical examples for different command line tools, and illustrate their application when strung together. I must admit to finding shell commands deeply cryptic in their presentation, with chunks of options effectively looking like someone typing a strong password. Data Science is not an attempt to clear the mystery of these options, more an indication that you can work great wonders once you find the right incantation.

Next up is the Rio tool for using R at the command line, principally to generate plots. I suspect this is about where I part company with Janssens on his quest to use the command line for all the things. Systems like R, ipython and the ipython notebook all offer a decent REPL (read-eval-print loop) which will convert seamlessly into an actual program. I find I use these REPLs for experimentation whilst I build a library of analysis functions for the job at hand. You can write an entire analysis program using the shell but it doesn’t mean you should!

Weka provides a nice example of smoothing the command line interface to an established package. Weka is a machine learning library written in Java; it is the code behind Data Mining: Practical Machine Learning Tools and Techniques. The edges to be smoothed are that the bare command line for Weka is somewhat involved, since it requires a whole pile of boilerplate. Janssens demonstrates nicely how to do this by automatically generating autocompletion hints for the parts of Weka which are accessible from the command line.

The book starts by pitching the command line as a substitute for GUI-driven applications, which is something I can agree with to at least some degree. It finishes by proposing the command line as a replacement for a conventional programming language, with which I can’t agree. My tendency would be to move from the command line to Python fairly rapidly, perhaps using ipython or ipython notebook as a stepping stone.

Data Science at the Command Line is definitely worth reading if not following religiously. It’s a showcase for what is possible rather than a reference book as to how exactly to do it.