I used to think that we were good at designing systems. We would phase test them, user test them,
stress test them and more, and eventually we would roll out the system and it would
work. And life was good.
Then we started getting into a new world of developments
where systems relied on networks, networks relied on servers, servers relied on
mirroring, mirroring relied on programs and programs relied on programmers.
The interlinkage and interdependencies became more and more complex
and clouded, rather than simpler and easier, and the systems started to
fail. And life was bad.
This is the state of today’s nation, where we see increasing
numbers of systems failures across the markets.
It used to be the odd outage, but most banks and exchanges
reported 99.99999999999% uptime and, if they had that 0.00000000001% downtime,
it was so minimal as not to be noticed.
This has changed, as I seem to blog more and more about outages.
The last time was after the RBS glitch of summer 2012
(maybe this is a summertime thing?), when I noted a lot of other outages:
- The RBS glitch
- The Flash Crash of 2010
- The issues faced by Aussie banks
- The London Stock Exchange outages
- Santander’s systems consolidation issues
- NASDAQ’s failures during the Facebook IPO
- The $440 million Knight Capital glitch
- BATS going batty
- Madrid and Tokyo stock exchanges outages
- The Nationwide glitch
Since then, there have been several others noted:
- NatWest hit by system failure less than a year after last
- Lloyds' banking systems failure hits 22m retail customers
- Lloyds Banking Group suffers (another) system failure ahead of NYE celebrations
And now the real biggie is Nasdaq’s outage, again. But is this
Nasdaq or NYSE or something in between?
From Reuters today:
Five days after a
glitch that paralyzed Nasdaq-listed stocks for three hours on all U.S. markets,
Nasdaq and NYSE have a different understanding of what happened in the period
preceding and during the blackout, with each side blaming the other for the
outage, according to the sources.
At the center of the
disagreement is the role of Arca, NYSE's fully electronic stock market. The
blackout, which saw trading in about 3,200 Nasdaq-listed stocks such as Apple,
Google and Facebook grind to a halt, was preceded by connectivity problems
between Arca and the Nasdaq-operated Securities Information Processor (SIP).
The SIP consolidates stock prices and distributes them to the market.
What's not clear is
whether the problem at the SIP was caused by issues at Arca or technical flaws
at the processor.
The likelihood is that these things will get worse.
For example, I remember a recent discussion with a bank
about cloud computing, and what would happen if the systems were down?
We would blame the service provider, said the bank.
The service provider would blame their service provider.
The bank says that’s their problem.
Not if it’s your downtime, says I.
In other words, you can have all the Service Level Agreements
(SLAs) and penalty clauses in the world, but the world does not work the way
you think anymore.
If the systems are down, is it the network, the cloud storage
system, the SaaS, the interconnectivity, the latency, the … the … the … the …?
The world of today is incredibly complex, operating on
systems that are interdependent and highly reliant upon each other.
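To put a rough number on that interdependence, here is a minimal sketch of how availability compounds along a chain where every link has to be up at the same time. The link names and percentages are made up for illustration, not taken from any bank or provider, and the sum assumes the links fail independently of each other, which real systems rarely do.

```python
from math import prod


def chain_availability(availabilities):
    """Availability of a serial chain: the service is up only if every link is up.

    Assumes failures are independent, which is a simplification.
    """
    return prod(availabilities)


# Hypothetical figures for illustration only, not measurements.
links = {
    "network": 0.9999,
    "cloud storage": 0.9995,
    "SaaS provider": 0.999,
    "provider's provider": 0.999,
}

combined = chain_availability(links.values())
downtime_hours = (1 - combined) * 24 * 365

print(f"combined availability: {combined:.4%}")
print(f"expected downtime: {downtime_hours:.1f} hours per year")
```

Whatever figures you plug in, the combined number can never be better than the weakest link, and every extra partner in the chain pulls it down a little further.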
Even when you run your own systems, you may find the issue is not
your own but your partner’s, or your partner’s partner’s, or your partner’s
partner’s partner’s, or …
Well, it’s obvious, isn’t it?