@DanLebrero.

software, simply

First ever resilience test

Building confidence in the resilience of a home automation system.

Image attribution: Kurzschluss 12V20A / CC BY-SA 3.0.

Last week’s talk about team resilience brought back memories of my first ever automated resilience test.

Brace yourself for an old dev’s story from when lemons were the sweetest fruit available.

lemons

The scene

Between 2002 and 2006 I helped develop a pretty amazing piece of technology, internally codenamed TheBox:

TheBox

TheBox in all its glory.

TheBox was designed to be the heart and brain of the house:

  • A wifi modem-router.
  • A firewall with parental control.
  • A multimedia player.
  • A TV internet browser.
  • A home surveillance system.
  • A home automation system, controlling the lights, blinds, thermostat and locks.

First production deployment

The first production deployment happened around 2005 in 100 houses as part of a new residential development somewhere in the south of Spain.

Deployment here means driving 800 km (500 miles), flashing 100 drives, unscrewing 100 x 8 screws, slotting in 100 flash drives, screwing 100 x 8 screws back in, carrying each box to its house and plugging it in, configuring the network, modem and home automation system, and driving back those 800 km.

The new home owners were thrilled with their “futuristic” houses, our finance department (one person) was ecstatic with the first proper sales, and we, the dev team, were amazed that things actually worked.

All was happiness until …

storm

An electrical storm near the residential development caused a power outage in the area.

When the power was restored, the heart and brain of most houses (aka TheBox) failed to boot.

ahhh

To “quickly” restore service, we did a redeployment.

See previous side note.

We brought back a box to debug the issue and unfortunately the fix required yet another redeploy.

As the bug had demonstrated my lack of knowledge, a fix and some manual testing did not give me enough confidence to make yet another 1600 km (1000 miles) trip.

Back then I was already a test automation zealot, but how could I test that TheBox could survive a power outage?

The first automated resilience test

The solution had been sitting on my desk for two years:

x10 switch

An X-10 switch that can turn the power on and off in response to an X-10 command.

X-10 is an ancient home automation system that uses the existing household electrical wiring to communicate between devices.

And we were building software for home automation! So I had all the pieces to write my first ever resilience test:

setup

And the pseudocode:

forever {
    send_command(X10.power_on)
    sleep(random(2 to 4 mins))
    send_command(X10.power_off)
    sleep(10 secs)
}
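The pseudocode above could be fleshed out roughly as follows in Python. This is a sketch, not the original test code: `send_command`, the `"power_on"` / `"power_off"` strings and the `cycles` limit are hypothetical stand-ins, and a real harness would wire `send_command` to whatever X-10 controller is at hand. The `sleep` and `rng` parameters are injectable only so the loop can be exercised without waiting for hours.

```python
import random
import time
from typing import Callable, Optional


def power_cycle(
    send_command: Callable[[str], None],
    cycles: int,
    min_on_secs: float = 120.0,
    max_on_secs: float = 240.0,
    off_secs: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> int:
    """Power the device on for a random time, then cut the power, `cycles` times.

    Returns the number of simulated outages performed.
    """
    rng = rng or random.Random()
    outages = 0
    for _ in range(cycles):
        send_command("power_on")
        # Random uptime, so each cut lands at a different point of the boot/update.
        sleep(rng.uniform(min_on_secs, max_on_secs))
        send_command("power_off")
        outages += 1
        # Stay off for a fixed time before the next cycle.
        sleep(off_secs)
    return outages
```

A dry run with fake collaborators, for example `power_cycle(print, cycles=2, sleep=lambda secs: None)`, prints the command sequence without touching any hardware.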

To give me a little bit more confidence, I made the power outage happen while TheBox was trying to update its software. That made it more likely that the power would be cut while TheBox was writing to disk, simulating a riskier scenario.

The test spent four days doing:

on-off

2000 power outages later, TheBox was still happily booting with no trouble so we decided to do one last redeployment.
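As a rough sanity check on that number, here is a back-of-the-envelope estimate. It assumes the on-time is uniformly distributed between 2 and 4 minutes and the off-time is a fixed 10 seconds, as in the pseudocode; the function name is made up for this sketch.

```python
def expected_outages(days: float, min_on_secs: float, max_on_secs: float, off_secs: float) -> float:
    """Rough number of power cycles that fit in `days`, for uniform random on-times."""
    mean_cycle_secs = (min_on_secs + max_on_secs) / 2 + off_secs
    return days * 24 * 60 * 60 / mean_cycle_secs


estimate = expected_outages(days=4, min_on_secs=120, max_on_secs=240, off_secs=10)
print(round(estimate))  # prints 1819
```

That is in the same ballpark as the roughly 2000 outages, given that "four days" and "2000" are both rounded figures.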

The end

And no new outages ever happened again. Hurray!

ever == while I worked for that company =~ 12 months

And snip, snap, snout, this tale’s told out.

But why was TheBox not booting after an outage?

If you are curious, the issue was that we were using the ext2 filesystem, a similar issue to this. Moving to ext3 with journaling fixed the issue.

Or maybe it did not and we never found out.



Tagged in : resilience