Everything is Hunky Dory, Always, No Matter What
What might it mean when you ask someone “how are you?” and the answer is always hunky dory?
Gene Hughson in his latest post refers to The Daily WTF’s post on a system that never reports an error. This provokes several thoughts.
1. There is a fantasy of creating fail-safe systems, and a similar fantasy of creating fail-safe organizations and teams. The truth is that we would much prefer systems (both software and organizational) that are safe-to-fail. Since most software systems and most organizations operate in complex environments that are impossible to predict, knowing what failed is paramount to the evolution of the system.
By analogy, we would love our children to magically develop the qualities to get top grades, be friends with everyone they wish, and so on. However, we become much better parents if we invest effort in helping them let us know when they need our help.
2. In the past, Windows, running on the x86 architecture, had a magical error that told us everything: General Protection Fault. Maybe the fantasy at Intel and/or Microsoft was that the x86 with Windows 3.1 was such a high-quality system, failing only on rare occasions, that a GPF was all that was needed in such conditions (of course, this is a wild and unvalidated hypothesis, for the sake of the argument only).
In practice, even the infamous Dr. Watson, in “his” first versions, was not good enough to tell us what was wrong, and additional tools were required.
Luckily, today Windows combined with third-party tools is much better at telling us what went wrong.
Moreover, modern tools tell us what is going wrong now, and even what is about to go wrong.
3. Conway’s Law tells us that the architecture of a product is a reflection of the architecture of the organization that builds it (and vice versa).
Relating to B.M.’s co-worker in The Daily WTF’s post, rather than putting the responsibility on him or her, I wonder if and how their organization is structured to hide faults, and what it means there to admit having made an error.
In organizational life, when your team members keep telling you that everything is OK, good advice is to explore how, together, you contribute to their not telling you when they could use your help. What are you collectively avoiding in order not to address the real problems?
In parallel, if you are getting frequent customer complaints about problems that are undetectable before the product is released, good advice is to explore how your architecture contributes to hiding such errors away.
Great post, Ilan, and a very apt observation about Conway’s law (I definitely would not bet against the organization being exactly what you suspect). Item 2 made me nostalgic – early in my career I had to battle an “Out of memory” error on a machine that had tons of resources. It turned out that the actual issue was related to security – I guess the resource they were low on was more descriptive error codes.
Sadly, I must admit that I have been guilty of such behavior myself.
In a data access layer I made a foolish mistake: whenever an unhandled error occurred, I reported “Fatal Error”.
Following an innocent update to the DB layer, customers began to get this error when paging to the end of a list.
As a programmer, and in later roles, this was an embarrassing lesson for life.
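The lesson generalizes: a catch-all handler that reports only “Fatal Error” erases exactly the information needed to diagnose the failure. A minimal sketch of the alternative – naming the actual condition – using a hypothetical paging function (the `fetch_page` name, `DataAccessError` class, and paging scenario are illustrative, not taken from the original system):

```python
class DataAccessError(Exception):
    """Error raised by the data-access layer, carrying specific context."""

def fetch_page(rows, page, page_size):
    """Return one page of rows, reporting specific errors instead of a blanket 'Fatal Error'."""
    if page_size <= 0:
        raise DataAccessError(f"invalid page_size {page_size}: must be positive")
    start = page * page_size
    if start >= len(rows):
        # The failure mode from the story: paging past the end of the list.
        # Naming the actual condition makes the error diagnosable by support
        # and by the customer, instead of hiding it behind "Fatal Error".
        raise DataAccessError(
            f"page {page} is past the end: {len(rows)} rows at {page_size} per page"
        )
    return rows[start:start + page_size]
```

A caller that hits the bad case now sees what actually went wrong:

```python
try:
    fetch_page(list(range(10)), page=5, page_size=4)
except DataAccessError as e:
    print(e)  # the message names the paging condition, not a generic failure
```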
Everyone makes mistakes*; it’s the lesson you take away from it that’s important (report a more specific message vs. say nothing and hide the evidence).
* I’ve had some really good (i.e. horrible) ones of my own over the years.
Very true. In the context of our mutual blog-posts, what story is being told when bugs and errors are hidden away?
That is hard to say. The possible motivations are many – it could be lack of competence or experience, it could be malicious, or it could be defensiveness caused by a toxic culture. The one constant is that those using the product are put at risk, regardless of the motivation. The potential loss of trust is enormous. It’s a great way to create what Tom Graves calls an anti-customer.