An email that broke a workflow


T Ashok, @ash_thiru on Twitter

Summary

When you are building large systems that transform other people’s businesses, stay defensive. Don’t assume that every action will be done, be it by a human or by another system. A single missed action can break the chain and the business.


Last Friday the SmartQA site went blink: inaccessible, or “socially distanced”, to use the modern terminology! This is a story of how some simple choices made by software developers while implementing an automated workflow can bring down a business, especially when the humans in support decide to become inaccessible too.

Let me tell you the story. The site smartqa.org became inaccessible last Friday, and after a few minutes I discovered that the site was not down, but unreachable. That is when my tryst with support started. Telephone calls, chats and emails went unanswered, and after five days of relentless pursuit the problem was sorted out without any help from support. So, what was the issue and what can we learn from this?

Well, the issue seemed to be that the domain expired on Friday despite having been renewed many weeks earlier. The renewal process, which is completely automated and involves no human intervention, seems to have been botched. What went wrong? After five days of being at it, I was given to understand that my current domain registrar had possibly shifted its domain-buying partnership to a different bulk domain provider. This required the new bulk domain provider to authenticate every domain owner with the current registrar. So they sent an email to each domain owner (I guess, as I got one), which had to be responded to by a certain date. In my case, the email seems to have found its way into the ‘read’ folder somehow, so it never showed up in my inbox and I did not respond by the given date. So, on the date of domain expiry, the site went blink. All because of ONE email that I did not respond to! The email was never resent when no response was received. So I, as a customer, never knew about this, and my business stopped.

All because of ONE EMAIL THAT BROKE THE WORKFLOW of automated renewal! Know what the cost of renewal is? About Rs 1000 ($12)! Just imagine if this had been an online business. $12 shuts down the business! All because a developer chose to assume that a critical action in an automated workflow would be done, never contemplating: what if it is not done, and how can I ensure that it is indeed done? In these times, with businesses becoming fully digital, these kinds of simple choices can break a business. In my case, I pursued the problem relentlessly, by analysing and by talking to a lot of people, and finally a good samaritan helped me nail the problem, and then poof, the solution happened. We all know that a problem is a problem until the solution happens. And in most cases, the solution is simple!
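
To make the “what if it is not done?” question concrete, here is a minimal sketch of how a workflow step could keep checking on a pending confirmation instead of assuming it will arrive. I do not know how the registrar's system is actually built; the names `PendingConfirmation` and `next_action`, the three-day reminder gap and the limit of three reminders are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class PendingConfirmation:
    """A critical step waiting on a reply from the domain owner (hypothetical model)."""
    domain: str
    requested_at: datetime
    deadline: datetime
    confirmed_at: Optional[datetime] = None
    reminders_sent: List[datetime] = field(default_factory=list)


def next_action(item: PendingConfirmation,
                now: datetime,
                reminder_gap: timedelta = timedelta(days=3),
                max_reminders: int = 3) -> str:
    """Decide what the workflow should do next for an unconfirmed step.

    Returns "done", "wait", "remind" or "escalate". The gap and the reminder
    limit are illustrative defaults, not values from any real registrar.
    """
    if item.confirmed_at is not None:
        return "done"
    # Never fail silently: once the deadline passes or the reminders run out,
    # hand the case to a human instead of letting the domain lapse.
    if now >= item.deadline or len(item.reminders_sent) >= max_reminders:
        return "escalate"
    # Measure the silence from the last contact, whether that was the
    # original request or the most recent reminder.
    last_contact = item.reminders_sent[-1] if item.reminders_sent else item.requested_at
    if now - last_contact >= reminder_gap:
        return "remind"
    return "wait"
```

A scheduler would call `next_action` periodically for every open item, send another email on "remind", and raise a support ticket or try a second channel such as SMS or a phone call on "escalate". The point is simply that a missing response becomes a visible state in the workflow rather than a silent dead end.
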
On a lighter note, with support going into quarantine and the site socially distanced, I went into the ICU 🙂 A happy Covid-19 story this turned out to be in the end.

“A typical accident takes seven consecutive errors”, writes Malcolm Gladwell in his book “Outliers”. In the chapter “The Ethnic Theory of Plane Crashes”, he analyses airplane disasters and says that it is a series of small errors that results in a catastrophe. The other example he cites is the famous accident at Three Mile Island (the nuclear station disaster in 1979). You may want to read a nice article that I wrote on this <Seven consecutive errors = A Catastrophe>.

When you are building large systems that transform other people’s businesses, stay defensive. Don’t assume that every action will be done, be it by a human or by another system. A single missed action can break the chain and the business.
Have a great day. 



4 Comments

  1. Cost of renewal is just $12 but the cost of this defect might be huge isn’t it?

    I feel it may not be just a choice made by a developer. I have often seen such things missed in design level. But in RCA meeting, first question would be to QA, “why didn’t you ask this question during testing”? 🙂

    • True, the impact is huge. As a choice made by dev, I meant that as ‘dev team’ we make choices that can cost. True it can be a function of design as probably in this case. Yes, the QA will always get roasted ! Thank you Prasad.

  2. Ahw! This is an exceptional scenario that happens outside the AUT. However, the impact is high due to the dependency on the other domain provider.
    Ideally the domain provider should have triggered reminder emails if the site owner had not acted within a predefined SLA, and should have communicated the impact. This is a failure of customer retention.

    • The reminder email came and I did renew well before time. A missed authentication that was never asked for again resulted in a serious failure. It is always the expectation that breaks a camel’s back! Thank you Vinodh.

