
Book review: Data Pipelines with Apache Airflow by Bas P Harenslak and Julian R De Ruiter

My next review is on Data Pipelines with Apache Airflow by Bas P Harenslak and Julian R De Ruiter. The book was published in 2021, and is compatible with Airflow 2.0, which was released at the end of 2020.

Airflow, which originated at Airbnb, is all about orchestrating the movement of data from sources such as APIs into other places. It is designed for batch processing, rather than streaming data, and for pipelines that do not change much.

Data pipelines in Airflow are represented as "directed acyclic graphs" or DAGs, which are defined in Python code using "Operators" which carry out tasks. A graph is a collection of nodes (tasks in this case) with "edges" between them. The "directed acyclic" bit means tasks have a definite order, the edges between them are "directed", and the graph cannot have loops or cycles because that would imply having to finish a set of tasks before you could start them. Simple data pipelines would just be a linear set of tasks that always follow one from another; a more complicated pipeline might bring in data from several sources before combining them to produce a final data product.

The Operators are strung together using expressions of the form "operator 1 >> operator 2" or even "[operator 1, operator 2] >> operator 3". 
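To make the ">>" notation concrete, here is a minimal sketch in plain Python of how such an operator can be wired up. The Task class and the task names are hypothetical stand-ins of my own, not Airflow's classes; the sketch only shows that ">>" can record "runs before" relationships, including the list form.

```python
class Task:
    """A toy stand-in for an operator (hypothetical, not Airflow's class)."""

    def __init__(self, name):
        self.name = name
        self.downstream = []  # tasks that must run after this one

    def __rshift__(self, other):
        # Handles `task >> other` and `task >> [other1, other2]`.
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            self.downstream.append(target)
        return other  # returning the right-hand side allows a >> b >> c

    def __rrshift__(self, others):
        # Handles `[task1, task2] >> self`: a plain list has no `>>`,
        # so Python falls back to the right-hand operand's __rrshift__.
        for upstream in others:
            upstream.downstream.append(self)
        return self


# Hypothetical task names, chosen only for illustration.
fetch_api = Task("fetch_api")
fetch_db = Task("fetch_db")
combine = Task("combine")
report = Task("report")

# Two sources feed one combining task, which feeds a report.
[fetch_api, fetch_db] >> combine >> report
```

After this runs, both fetch tasks list combine as their downstream task, and combine lists report.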

Operators do not have to run Python: they can invoke code in other languages, as the BashOperator does, or interact with other systems such as databases or storage systems such as S3. It is relatively easy to write your own operators. Alongside operators that do stuff there are branch operators, which select one or other path in the DAG; sensors, which detect changes in filesystems and trigger work; and hooks, which form connections with external services. Dummy operators can be used to simplify the appearance of DAGs.

As Airflow is an orchestration system, the intention is that operators should not contain a great deal of code to process data; that function should be off-loaded to libraries or systems elsewhere.

The Airflow system comprises a web server, which allows you to observe and trigger execution of DAGs; a scheduler, which is responsible for the scheduled running of DAGs; and workers, which do the actual work of the DAG. The Airflow system loops over the tasks defined in a DAG and, for each one, checks the tasks upstream of it; if they have all completed successfully then the task can execute.
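The "loop over tasks and run those whose upstream tasks have succeeded" idea can be sketched in a few lines of plain Python. This is a toy illustration under my own assumptions (the function name run_dag, the dictionary layout and the task names are all hypothetical), not Airflow's actual scheduler.

```python
def run_dag(upstream, execute):
    """Run each task once all of its upstream tasks have succeeded.

    upstream: dict mapping task name -> set of task names it depends on.
    execute:  callable taking a task name and returning True on success.
    Returns (order of successful tasks, failed tasks, tasks never started).
    """
    done, failed, order = set(), set(), []
    pending = set(upstream)
    progressed = True
    while pending and progressed:
        progressed = False
        for task in sorted(pending):      # sorted() copies, so pending can change
            if upstream[task] <= done:    # every upstream task has succeeded
                pending.remove(task)
                progressed = True
                if execute(task):
                    done.add(task)
                    order.append(task)
                else:
                    failed.add(task)
    return order, failed, pending


# Hypothetical pipeline: two sources, a combining step and a report.
dag = {
    "fetch_api": set(),
    "fetch_db": set(),
    "combine": {"fetch_api", "fetch_db"},
    "report": {"combine"},
}

all_ok = run_dag(dag, lambda task: True)
one_failure = run_dag(dag, lambda task: task != "fetch_db")
```

When one source fails, everything downstream of it is left unstarted, which is the behaviour the upstream check is there to give you.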

A basic implementation runs DAGs locally using a simple queue to schedule work, and a sqlite database to store metadata. A production implementation would use something like Postgres or Amazon RDS as the metadata store, schedule work using Celery and run tasks in Docker containers marshalled using Kubernetes.

For some reason reading this I was reminded that big projects like Airflow are just other people’s code, and if you look too carefully you’ll find something nasty. This is both comforting and mildly scary. I think the issue was that Airflow uses Jinja templating to inject parameters into code, which feels wrong but is probably a pragmatic and safe way to do it; these shenanigans are not required for Python operators. Also discussed are issues with code dependencies, which the authors suggest are best eliminated by putting operators into Docker containers, each of which contains its own code dependencies – allowing otherwise incompatible libraries to work together.

Alongside the material on Airflow there are moderate chunks on Python modules, testing, Docker and Kubernetes and logging so you get a well rounded view not only of Airflow but also of the ecosystem it sits in. The book finishes with deployment into various Cloud environments. I found these parts quite useful since the most complicated work I do in my role is trying to get things to work in AWS! The data science part is easy…

The book finishes with some short chapters on Cloud deployments, mentioning first fully managed services such as astronomer.io, Amazon MWAA and Google Cloud Composer before going on to talk about implementing one of the demos in the book on AWS, Azure and Google cloud services. I considered skipping these chapters but they turned out to be quite interesting in highlighting the differences between services and perhaps the preferences of the authors of both the book and of Airflow.

I found this a readable introduction to Airflow with some nice examples, and interesting additional material. Useful if you are thinking about using Airflow, or even if you are working on data pipelines without Airflow since it provides a good description of the moving parts required.

The code repository for the book is here: https://github.com/BasPH/data-pipelines-with-apache-airflow

Book review: Hedy’s Folly by Richard Rhodes

Back to some history of technology with Hedy’s Folly by Richard Rhodes. This book concerns the patent granted to Hedy Lamarr, Hollywood actress, and George Antheil, experimental musician, for the frequency hopping radio communications system. Originally it was intended to allow secure, jamming-resistant communications between torpedoes and their control aircraft or ships; nowadays it is most notably the basis for Bluetooth and WiFi communications.

I’ve previously read Richard Rhodes’ “The Making of the Atomic Bomb”, which is a massive tome; Hedy’s Folly is a rather more modest affair. It provides some biographical material on Lamarr (born Hedwig Kiesler) and Antheil but only in as much as it leads to the patent of the title.

Hedy Lamarr was born in Austria in 1914, to a wealthy family – her father was a banker who clearly cultivated her interest in how things worked. Following a brief career in European theatre and film she married Fritz Mandl in 1933. He was an arms manufacturer and one of the richest men in Austria. He didn’t want to see his wife continue her acting career. On the death of her father Lamarr resolved to leave her husband but in the interim she paid close attention to the technical discussions on armaments which she was party to. In all likelihood she was doing this throughout her marriage; despite his controlling nature Mandl clearly valued her opinions (even if he didn’t like them). Lamarr then moved to the States with Louis Mayer of MGM, for whom she was to make a number of films. In this milieu she met George Antheil.

Antheil was born in Trenton, New Jersey in 1900. He travelled to Europe in 1921 where he composed the Ballet Mécanique; originally intended as the score to a film, it ended up twice the length of the film. As originally envisaged Ballet Mécanique required 16 player pianos and an aeroplane propeller – amongst many other sound making devices. Essentially Antheil’s vision was much in advance of what technology in the twenties and thirties could deliver. The player piano plays a part in the story. Player pianos were briefly popular as a way for everyone (who could afford one) to make music; they were automated pianos programmed using a paper roll with holes directing the music. The operator simply had to provide power and rhythm. They were supplanted by radio. The important feature was the ability to control sound automatically.

Antheil returned to the US, to Hollywood, in 1936 where he turned to writing film scores, his experimental music proving rather unpopular. It was here he met Hedy Lamarr.

The spirit of the Second World War in the US was that everyone would do what they could to help. Antheil had a sideline in writing about endocrinology, and made suggestions on how to defeat the Nazis by this approach. Later in the war Hedy Lamarr was to do considerable work in encouraging Americans to buy government bonds to support the war effort, as well as volunteering at the Hollywood Canteen – entertainment for servicemen.

Lamarr was an inventor in her spare time, and her background meant she knew the problems faced in torpedo guidance. So it was not surprising for her to work with Antheil on a frequency hopping patent for torpedo guidance. The central idea of the frequency hopping patent was to transmit radio instructions between controller and torpedo over a series of radio channels at different frequencies, with both ends switching synchronously between channels. In the original patent the number of channels used is relatively small (fewer than 10) and hops are relatively slow – of order minutes – controlled by a player piano style roll.
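The core of frequency hopping is that both ends follow the same channel sequence in step. Here is a minimal sketch in Python, with assumptions of my own: the shared "piano roll" is modelled as a seeded pseudo-random sequence, there are 8 channels, and the jammer sits on one fixed channel. None of these details, nor the function names, come from the patent.

```python
import random


def hop_sequence(roll_key, n_channels, n_hops):
    """Channel to use at each hop; the key stands in for the shared paper roll."""
    rng = random.Random(roll_key)
    return [rng.randrange(n_channels) for _ in range(n_hops)]


def transmit(message, tx_key, rx_key, n_channels=8, jammed_channel=None):
    """Send one symbol per hop; a symbol is lost (None) if the receiver is on
    a different channel from the transmitter, or the channel is jammed."""
    tx_hops = hop_sequence(tx_key, n_channels, len(message))
    rx_hops = hop_sequence(rx_key, n_channels, len(message))
    received = []
    for symbol, tx_channel, rx_channel in zip(message, tx_hops, rx_hops):
        if tx_channel != rx_channel or tx_channel == jammed_channel:
            received.append(None)
        else:
            received.append(symbol)
    return received


message = list("STEER LEFT TEN DEGREES")

# Matching rolls and no jammer: every symbol arrives.
clean = transmit(message, tx_key=1942, rx_key=1942)

# A jammer parked on channel 3 only hits the hops that land on channel 3.
jammed = transmit(message, tx_key=1942, rx_key=1942, jammed_channel=3)
hops_on_3 = hop_sequence(1942, 8, len(message)).count(3)
```

A jammer who does not know the sequence can only block the fraction of hops that happen to land on its channel, whereas a fixed-frequency link would be blocked entirely.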

The US Navy chose not to develop the patent, stating that the apparatus was too bulky. This seems to be a bit of a misunderstanding – the player piano inspiration was indeed quite bulky but could easily be reduced in size using the technology of the time. More likely was the fact that US torpedo performance at the beginning of the war was abysmal – 60% of torpedoes experienced technical failure, so it was likely they had other priorities.

Lamarr and Antheil’s patent on frequency hopping expired in 1959; the US military implemented several frequency hopping systems from the beginning of the sixties. As technology improved it evolved into so-called spread spectrum techniques. The difference between frequency hopping and spread spectrum is really just one of scale. These techniques finally became public in 1976.

Spread spectrum techniques eventually found important applications in Bluetooth and WiFi. Originally designed to be resistant to jamming – the deliberate use of noise to block signals – spread spectrum is also resistant to unintentional noise. Furthermore it can be used with very low power transmissions, so it can cohabit with other signals used for longer range applications and work in parts of the electromagnetic spectrum where there is a lot of noise.

Hedy Lamarr’s part in the development of frequency hopping is finally being recognised, and George Antheil’s more experimental music is finally being recognised too – technology has now reached the point where his original vision can finally be realised.

This is a fascinating little book, focused on one small invention with huge consequences. It isn’t a biography of Hedy Lamarr, and it isn’t a biography of her co-inventor George Antheil.   

Book review: William Armstrong–Magician of the North by Henrietta Heald

A return to industrial history with William Armstrong: Magician of the North by Henrietta Heald. Armstrong was a 19th century industrialist who spent his life in the north-east of England around Newcastle. His great industrial innovation was the introduction of hydraulic power to cranes and the like. His great wealth, and honours (a knighthood and then a peerage), derived from his work in the invention and sale of armaments, principally artillery and ships. His home, Cragside near Rothbury some 30 miles north of Newcastle upon Tyne, was the first to feature electric lighting amongst many other technical innovations.

Armstrong was a contemporary of Robert Stephenson, Isambard Kingdom Brunel and Joseph Whitworth – they were all born near the beginning of the 19th century. Armstrong, dying in 1900, outlasted them all; Brunel and Robert Stephenson both died in 1859.

Armstrong was born in 1810, and his parents started him on a career in the law. However, he had always been fascinated by water. This led to his realisation of the power that could be extracted from a head of water in a sealed system. A water wheel extracts energy from water falling the height of the wheel, a matter of a few metres. A sealed iron pipe, such as could now be manufactured, allowed you to capture the energy from a fall of tens of metres or more. In Newcastle upon Tyne the local landscape could provide this head of pressure, but with a little ingenuity the head of pressure could be created with a steam engine or other mechanical means. This energy could be used to drive all manner of machinery; Armstrong initially used it to power cranes and lock gates, to be used in docks and the many factories springing up around the country. Ultimately his hydraulic mechanisms drove London’s Tower Bridge.
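The advantage of a tall sealed pipe over a water wheel follows from the hydrostatic relation: pressure equals density times gravitational acceleration times head. Here is a small worked sketch in Python; the heights of 3 m and 30 m are my own illustrative stand-ins for "a few metres" and "tens of metres", not figures from the book.

```python
RHO_WATER = 1000.0  # density of water in kg per cubic metre
G = 9.81            # gravitational acceleration in metres per second squared


def head_pressure_pa(head_m):
    """Hydrostatic pressure in pascals at the foot of a water column head_m tall.

    The same number is the energy in joules released by one cubic metre
    of water falling through that height.
    """
    return RHO_WATER * G * head_m


wheel_head_m = 3.0   # assumed: water wheel, "a few metres"
pipe_head_m = 30.0   # assumed: sealed pipe, "tens of metres"

wheel_kpa = head_pressure_pa(wheel_head_m) / 1000.0   # about 29.4 kPa
pipe_kpa = head_pressure_pa(pipe_head_m) / 1000.0     # about 294.3 kPa
```

The relation is linear, so ten times the head gives ten times the pressure, and ten times the energy from each cubic metre of water.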

In the aftermath of the Crimean War, Armstrong switched his attention to building artillery. During the Crimean War the British artillery was found wanting in terms of accuracy, destructive power and firing rate. His innovations were to move from cannonballs to shells (shaped like bullets), and from muzzle loading to breech loading. He gave up the patents for his artillery pieces to the government but made a fair business on them. His activities with ordnance led to his knighthood and peerage, although ultimately he withdrew from the close relationship with the British government in armaments as a result of political manoeuvrings by competitors.

The manufacture of artillery led to the manufacture of warships, which incidentally also carried the artillery. The Japanese Navy was a particularly important customer.

He was a leading light of the Literary and Philosophical Society of Newcastle upon Tyne (Lit & Phil), and contributed to founding what is now Newcastle University. Late in his life, in 1897, he published Electric movement in air and water based on his experiments and featuring cutting-edge photographs of the phenomena he described. From a scientific point of view, Armstrong is not a name you will hear in physics classrooms (at any level) today – I don’t know if the same holds for his engineering innovations. Also late in his life he bought Bamburgh Castle, and spent a fair amount of money refurbishing it.

Magician of the North is a somewhat sympathetic view of Armstrong, along the lines of Man of Iron by Julian Glover about Thomas Telford. This contrasts with Samuel Smiles’ biography of George Stephenson and Rolt’s of Brunel, which are much more effusive about their subjects. Armstrong’s arms trading is discussed in some detail; it seems the company sailed somewhat close to the wind legally in supplying both sides in the American Civil War. A second blemish on Armstrong’s reputation came from industrial disputes with his, and other, workers on the Tyne, who were asking for shorter working hours. That said, he was clearly a pillar of the Newcastle and north-eastern community and highly regarded by most of the people most of the time. Many buildings in Newcastle bore his name as a result of his donations both whilst he was alive and after he died.

As usual the author of this biography bemoans the limited attention their subject has received. In the case of Armstrong they put this down to his extensive involvement in the arms trade which, never the most popular, was to fall further out of favour following the Great War. I’ve never seen a quantitative analysis of what makes the right amount of attention for figures in the history of science and technology.

William Armstrong died in 1900; after his death his company went into a slow decline. The Great War led to a distaste for the arms trade, and then came the Great Depression. With Armstrong gone there was no strong, capable leader for the company. The Armstrong name lived on in various spin-off companies such as Armstrong Siddeley and various amalgamations with Whitworths and Vickers.

Book review: Remote Pairing by Joe Kutner

 

This review was first published at ScraperWiki.

Pair programming is an important part of the Agile process but sometimes the programmers are not physically co-located. At ScraperWiki we have staff who do both scheduled and ad hoc remote working, so methods for working together remotely are important to us. As a result of a casual comment on Twitter, I picked up “Remote Pairing” by Joe Kutner which covers just this subject.

Remote Pairing is a short volume, less than 100 pages. It starts with a motivation for pair programming with some presentation of the evidence for its effectiveness. It then goes on to cover some of the more social aspects of pairing – how do you tell your partner you need a “comfort break”? This theme makes a slight reprise in the final chapter with some case studies of remote pairing. And then it moves into technical aspects.

The first systems mentioned are straightforward audio/visual packages including Skype and Google Hangouts. I’d not seen ScreenHero previously but it looks like it wouldn’t be an option for ScraperWiki since our developers work primarily in Ubuntu; ScreenHero only supports Windows and OS X currently. We use Skype regularly for customer calls, and Google Hangouts for our daily standup. For pairing we typically use appear.in which provides audio/visual connections and screensharing without the complexities of wrangling Google’s social ecosystem which come into play when we try to use Google Hangouts.

But these packages are not about shared interaction; for this Kutner starts with the vim/tmux combination. This is venerable technology built into Linux systems, or at least easily installable. Vim is the well-known editor; tmux allows a user to access multiple terminal sessions inside one terminal window. The combination allows programmers to work fully collaboratively on code, since both partners can type into the same workspace. You might even want to use vim and tmux when you are standing next to one another. The next chapter covers proxy servers and tmate (a fork of tmux) which make the process of sharing a session easier by providing tunnels through the Cloud.

Remote Pairing then goes on to cover interactive screensharing using VNC and NoMachine; these look like pretty portable systems. Along with the chapter on collaborating using plugins for IDEs, this is something we have not used at ScraperWiki. Around the office none of us currently make use of full blown IDEs despite having used them in the past. Several of us use Sublime Text, for which there is a commercial sharing product (floobits), but we don’t feel sufficiently motivated to try this out.

The chapter on “building a pairing server” seems a bit out of place to me, and the content is quite generic. Perhaps because at ScraperWiki we have always written code in the Cloud we take it for granted. The scheme Kutner follows uses Vagrant and Puppet to configure servers in the Cloud. This is a fairly effective scheme. We have been using Docker extensively, which is a slightly different thing since a Docker container is not a virtual machine.

Are we doing anything different in the office as a result of this book? Yes – we’ve got a good quality external microphone (a Blue Snowball), and it’s so good I’ve got one for myself. Managing audio is still something that seems a challenge for modern operating systems. To a human it seems obvious that if we’ve plugged in a headset and opened up Google Hangouts then we might want to talk to someone and that we might want to hear their voice too. To a computer this seems unimaginable. I’m looking to try out NoMachine when a suitable occasion arises.

Remote Pairing is a handy guide for those getting started with remote working, and it’s a useful summary for those wanting to see if they are missing any tricks.

Book review: Fire & Steam by Christian Wolmar

I’ve long been a bit of a train enthusiast, reflected in my reading of biographies of Brunel and Stephenson, and more recently Christian Wolmar’s The Subterranean Railway about the London Underground. This last one is my inspiration for reading Wolmar’s Fire & Steam: How the railways transformed Britain which is a more general history of railways in Britain.

Fire & Steam follows the arc of the development of the railways from the earliest signs: the development of railed ways to carry minerals from mine to water, with carriages powered by horses or men.

The railways appeared at a happy confluence of partly developed technologies. In the latter half of the 18th century the turnpike road system and canal systems were taking shape but were both limited in their capabilities. However, they demonstrated the feasibility of large civil engineering projects. Steam engines were becoming commonplace but were too heavy and cumbersome for the road system, and the associated technologies (steering, braking, suspension and so forth) were not yet ready. From a financial point of view, the railways were the first organisations to benefit from limited liability partnerships of more than six partners.

Wolmar starts his main story with the Liverpool & Manchester (L&M) line, completed in 1830, arguing that the earlier Stockton & Darlington line (1825) was not the real deal. It was much in the spirit of the earlier mine railways and passenger transport was a surprising success. The L&M was a twin-track line between two large urban centres, with trains pulled by steam engines. Although it was intended as a freight route passenger transport was built in from the start.

After a period of slow growth, limited by politics and economics, the 1840s saw an explosion in the growth of the railway system. The scale of this growth was staggering. In 1845, 240 bills were put to parliament representing approximately £100 million of work; at the time this was 150% of Gross National Product (GNP). Currently GNP is approximately £400 billion, and HS2 is expected to cost approximately £43 billion – so about 10% of GNP. Wolmar reports the opposition to the original London & Birmingham line in 1832, and it sounds quite familiar. Opposition came from several directions, some from the owners of canals and turnpike roads, some from landowners unwilling to give up any of their land, some from opportunists.
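The comparison above is simple arithmetic, and it can be checked in a few lines of Python. The sketch takes the figures exactly as quoted in this review; I have not verified them independently, and the variable names are my own.

```python
# Figures as quoted in the review (not independently checked).
bills_1845 = 100e6        # pounds of railway work put to parliament in 1845
share_of_gnp_1845 = 1.5   # quoted as 150% of GNP at the time

hs2_cost = 43e9           # pounds, expected cost of HS2
gnp_now = 400e9           # pounds, the "current" GNP figure used above

# GNP implied by the 1845 figures: roughly 67 million pounds.
implied_gnp_1845 = bills_1845 / share_of_gnp_1845

# HS2 as a share of GNP: 0.1075, i.e. "about 10%".
hs2_share = hs2_cost / gnp_now

# How much bigger the 1845 commitment was, relative to the economy.
relative_scale = share_of_gnp_1845 / hs2_share
```

On these numbers the railway bills of 1845 were, relative to the size of the economy, roughly fourteen times the commitment that HS2 represents.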

The railways utterly changed life in Britain. At the beginning of the century travel beyond your neighbouring villages was hard, but by the time of the Great Exhibition in 1851 a third of the population was able to get themselves to London, mostly by train. This was simply a part of the excursion culture: trains had been whizzing people off to the seaside, the races, and other events in great numbers from almost the beginning of the railway network. No longer were cows kept in central London in order to ensure a supply of fresh milk.

In the 19th century, financing and building railways was left to private enterprise. The government’s role was in approving new schemes, controlling fares and conditions of carriage, and largely preventing amalgamations. There was no guiding mind at work designing the rail network. Companies built what they could and competed with their neighbours. This led to a network which was in some senses excessive, giving multiple routes between population centres but this gave it resilience.

The construction of the core network took the remainder of the 19th century; no major routes were built in the 20th century, and in this century we have only seen HS1, the fast line running from London to Dover, completed.

The 20th century saw the decline of the railways, commencing after the First World War when the motor car and the lorry started to take over, relatively uninhibited by regulation and benefitting from state funding for infrastructure. The railways were requisitioned for war use during both world wars, and were hard used by it – suffering a great deal of wear and tear for relatively little compensation. War seems also to have given governments a taste for control, after the First World War the government forced a rationalisation of the many railway companies to the “Big Four”. After the Second World War the railway was fully nationalised. For much of the next 25 years it suffered considerable decline, a combination of a lack of investment, a reluctance to move away from steam power to much cheaper diesel and electric propulsion, culminating in the Beeching “rationalisation” of the network in the 1960s.

The railways picked up during the latter half of the seventies with electrification, new high speed trains and the InterCity branding. Wolmar finishes with the rail privatisation of the late 1990s, of which he has a rather negative view.

Fire & Steam feels a more well-rounded book than Subterranean Railway, which to my mind became a somewhat claustrophobic litany of lines and stations in places. Fire & Steam focuses on the bigger picture and there is a grander sweep to it.