Don’t Worry, Be Happy… Until One Day
Continuing the disclaimer of the two other posts I am referring to: this is not a political post.
Gene Hughson has recently written about the US healthcare.gov project, in response to Uncle Bob’s post from November 12th.
This is not the first time that a software failure has caused severe damage to a mammoth project. Here’s a short quote from Wikipedia on the first launch of Ariane 5:
“Ariane 5’s first test flight (Ariane 5 Flight 501) on 4 June 1996 failed, with the rocket self-destructing 37 seconds after launch because of a malfunction in the control software. A data conversion from 64-bit floating point value to 16-bit signed integer value to be stored in a variable representing horizontal bias caused a processor trap (operand error) because the floating point value was too large to be represented by a 16-bit signed integer.”
The emphasis I have added points to a basic flaw in computer programming, one often committed by novice engineers. One would expect a high-profile aerospace project to hire better engineers than that, wouldn’t you agree?
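The class of bug is easy to reproduce. Here is a minimal Python sketch (the actual flight software was Ada, and the function names here are hypothetical) of an unchecked 64-bit-float-to-16-bit-signed-integer conversion, alongside a defensive version that surfaces the out-of-range value as an explicit error:

```python
import struct

def to_int16(horizontal_bias: float) -> int:
    # Like the Ada conversion in Flight 501, struct.pack traps when the
    # value falls outside the signed 16-bit range (-32768..32767),
    # raising an exception that propagates if left unhandled.
    return struct.unpack(">h", struct.pack(">h", int(horizontal_bias)))[0]

def to_int16_checked(horizontal_bias: float) -> int:
    # Defensive version: check the range explicitly, so the caller can
    # decide how to handle an out-of-range value instead of crashing.
    value = int(horizontal_bias)
    if not -(2**15) <= value <= 2**15 - 1:
        raise OverflowError(f"{horizontal_bias} does not fit in a signed 16-bit integer")
    return value
```

On Flight 501 the horizontal-bias value grew larger than 32767, so the unchecked conversion trapped mid-flight; the checked variant turns the same condition into an error the software can be designed to handle.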
Uncle Bob Martin thinks so:
“[…] So, if I were in government right now, I’d be thinking about laws to regulate the Software Industry. I’d be thinking about what languages and processes we should force them to use, what auditing should be done, what schooling is necessary, etc. etc. I’d be thinking about passing laws to get this unruly and chaotic industry under some kind of control.
If I were the President right now, I might even be thinking about creating a new Czar or Cabinet position: The Secretary of Software Quality. Someone who could regulate this misbehaving industry upon which so much of our future depends.”
Moreover, Uncle Bob refers to another aerospace disaster – the Challenger explosion – and the engineers’ responsibility for not stopping the launch:
“It’s easy to blame the managers. It’s appropriate to blame the managers. But it was the engineers who knew. On paper, the engineers did everything right. But they knew. They knew. And they failed to stop the launch. They failed to say: NO!, loud enough for the right people to hear.”
In response, Gene Hughson writes:
“Considering that all indications are that the laws and regulations around government purchasing and contracting contributed to this mess, I’m not sure how additional regulation is supposed to fix it.”
Sadly for our industry, I agree with Gene. Yes, engineering practice has, on the whole, a long, long way to go to become anywhere near excellent. I have a lot of respect for Uncle Bob for his huge contribution there.
But the Challenger disaster is first and foremost not an engineering failure. The disastrous potential of the problematic seal was known for a long time before it actually materialized, to everyone’s shock.
“The Rogers Commission found NASA’s organizational culture and decision-making processes had been key contributing factors to the accident. NASA managers had known contractor Morton Thiokol’s design of the SRBs contained a potentially catastrophic flaw in the O-rings since 1977, but failed to address it properly. They also disregarded warnings (an example of “go fever”) from engineers about the dangers of launching posed by the low temperatures of that morning and had failed in adequately reporting these technical concerns to their superiors.”
At the end of the day, it boils down to the fact that NASA’s leadership was operating under the false belief that with every successful launch of the shuttle, the risk of the seal failing decreased – completely contrary to common sense.
Mr. Larry Hirschhorn has an excellent description of this in his book The Workplace Within.
In such an atmosphere, when my managers, and their managers, are so indifferent to life-threatening flaws, heck, why should I exercise excellence in my mundane tasks? Why should I risk my own livelihood? After all, this is the culture here, in this workplace.
It is heartbreaking that the loss of the Columbia can be attributed to similar management pitfalls as that of the Challenger:
“In a risk-management scenario similar to the Challenger disaster, NASA management failed to recognize the relevance of engineering concerns for safety for imaging to inspect possible damage, and failed to respond to engineer requests about the status of astronaut inspection of the left wing. Engineers made three separate requests for Department of Defense (DOD) imaging of the shuttle in orbit to more precisely determine damage.”
Coming back to Uncle Bob’s conclusions: in his talk, How schools kill creativity, Sir Ken Robinson points out that the school system, in its efforts to teach, is killing creativity in favor of grades. We can only assume that legislating computer-engineering studies will, at best, not harm the existing engineering quality. It will probably achieve worse: well-certified engineers with little ability or drive to excel.
This failure has little to do with teaching and certifications, and all too much to do with culture, professionalism, and plain simple awareness.
When managers practice such an “It will be OK” attitude, everyone does. By the sound of it, the healthcare.gov failure discussed here is not that far off.
In 1992, Prime Minister Yitzhak Rabin was speaking at the Staff and Command school to prospective senior officers. Here’s what he had to say about “It will be OK”:
“One of our painful problems has a name. A given name and a surname. It is the combination of two words – ‘Yihyeh B’seder’ [“it will be OK”]. This combination of words, which many voice in the day to day life of the State of Israel, is unbearable.
Behind these two words is generally hidden everything which is not OK. The arrogance and sense of self confidence, strength and power which has no place.
The ‘Yihyeh B’seder’ has accompanied us already for a long time. For many years. And it is the hallmark of an atmosphere that borders on irresponsibility in many areas of our lives.
The ‘Yihyeh B’seder’, that same friendly slap on the shoulder, that wink, that ‘count on me’, is the hallmark of the lack of order; a lack of discipline and an absence of professionalism; the presence of negligence; an atmosphere of covering up; which to my great sorrow is the legacy of many public bodies in Israel – not just the IDF.
It is devouring us.
And we have already learned the hard and painful way that ‘Yihyeh B’seder’ means that very much is not OK.”
No, Uncle Bob, engineers are not to blame for this. Management must take responsibility for nourishing a culture that allows such poor standards.
For leaders, there are very rare times when the importance of the mission is such that you have to take a gamble. Doing so should be a source of stress and pain and doubt. When it isn’t, when rolling the dice is something done with “that same friendly slap on the shoulder, that wink, that ‘count on me’”, then the leader has ceased to be a leader.
@Gene, I agree with you on the latter part of the response.
I think, however, that there are many occasions when “the importance of the mission is such that you have to take a gamble”. You may call it an educated guess, or a calculated risk, or many other names.
The problem is when incredibly serious decisions are taken with little transparency, with too little examination of empirical data or, sometimes, with little consideration of the gravity of the situation.
In another famous NASA decision, the ground crew took a gamble when they guided the Apollo 13 mission team to sling-shot themselves beyond the dark side of the moon.
From what I know – they did check the feasibility of the procedure, they did keep the crew informed, they did handle the situation with all seriousness.
But something happened to the engineering school of thought in recent decades (the NASA failures mentioned in the post, Boston’s Big Dig, the Chernobyl disaster, to name a few), as it did in other fields (e.g. Enron, the Subprime Crisis).
Or maybe it has always been there? (the 1929 Stock-Market Crash, the Titanic) And it’s just that we are bad learners?
I should have been clearer re: the type of gamble I meant. Sometimes a leader has to risk injury to the team (casualties, stress, etc.) because of the importance of the mission. The Apollo 13 example was a gamble as well, but in that case it was a matter of taking a (calculated) chance in an attempt to save the team. Neither should be attempted in a cavalier manner, but the first type should be the most stressful, even when the gamble pays off. Those are (or, at least, should be) the rare birds.
Should be the rare ones. I agree.
Diane Vaughan’s book, ‘The Challenger Launch Decision’, presents an exhaustive analysis of the factors leading up to the disaster. Her work suggests far less of a “reckless, target-driven managers versus brave, ignored engineers” scenario than the official report does.
She points out that many of the managers were engineers themselves, and had a pretty good grasp of the issues. Rather than a one-sided disaster, Vaughan’s analysis suggests a shared system failure, related to a common culture. She suggests that there was a process of ‘normalisation of deviance’, where, in a process akin to the old story about boiling a frog, people gradually became accustomed to the situation. The performance of the O-rings in low temperature was well known, but the blow past became interpreted as safe: in effect, many people concluded that it wasn’t a problem.
This isn’t as engaging a story as the popular narrative, but it probably has more in common with day to day experience, where people do their best, but get it wrong, often because of unspoken – and unexamined – assumptions.
@Cameron, this analysis is not far from Mr. Hirschhorn’s analysis in his 1988 book mentioned in the post.
The point that catches my eye is: “…but the blow past became interpreted as safe; in effect, many people concluded it wasn’t a problem.”
This is not an engineering issue. This is a cultural issue.
And culture, in my humble opinion, is a matter for leadership to attend to.
You point to a relevant truth (if there is a truth) – “because of unspoken – and unexamined – assumptions”.
There is an unconscious process in NASA – and arguably in any organization – “helping” the collective of all individuals to ignore important data.
Similarly, in the Healthcare initiative important data had been ignored – data that leadership (management? government?) should either have been aware of, or been aware of the possibility that such organizational dynamics may exist.
My argument with Uncle Bob is that engineering malpractice is not the sole responsibility of engineering. It is leadership that either fosters such an atmosphere, or fails to nourish an alternative one.
“why should I exercise excellence in my mundane tasks? Why should I risk my own livelihood?”
I answer: because there were lives at stake.
Yes, it’s that simple.
When lack of action leads to life being lost, a border has just been crossed.
When someone fails to do everything in his power (disregarding personal costs) to prevent the death of other people, I will consider him guilty of negligence.
Disclaimer: I am no lawyer and have no grasp of the exact legal issues and terminology involved. This is just my personal opinion and sense of ethics speaking.
So I’ll be absolutely clear: when human lives are at stake, EVERYONE is responsible and accountable. It is the job of EVERYONE to make sure no one will be hurt. Personal costs (like losing your job) are of no consequence.
@Lior, while I agree on the premise, I beg to differ on the practice – or, to be more precise, on the implications of organizational life on people’s practice.
Here’s a link to Menzies’ seminal work on nurses in hospital, who practiced neglect – not because they wanted to hurt patients, but because their experience of working in hospital drove them there:
Click to access Lythp439.opd.pdf
One can speculate that Healthcare.gov had a similar effect on people.
In the NASA case there is enough empirical work to suggest that this holds true.
Larry Hirschhorn writes about the Healthcare.gov failure in his latest blog post.
Long post, very interesting read.
He suggests that the dynamics enacted within the Obama administration had seeped through to other levels.
A short quote from the blog:
“…[T]his concept of a crusade, where loyalty is favored over critical thinking, can help explain why people involved in the administration’s efforts were reluctant to express their anxieties and doubts. As the Washington Post journalists write, “On Sept. 5, White House officials visited CMS for a final demonstration of HealthCare.gov. Some staff members worried that it would fail right in front of the president’s aides. A few secretly rooted for it to fail so that perhaps the White House would wait to open the exchange until it was ready.” In other words, they withheld their doubts, secretly rooting for failure as the only way in which their doubts could then be justified.”