My CTO called me at 6 one morning: “The site is down! Fix it!”
I examined the system, poked around, and it recovered. No big deal. I reported that everything was okay.
Then the hiccups began occurring more frequently, correlated with growth in customers and data volume. The outages were traced to our real-time filtering engine: packets went in but never came out the other side. When this critical component locked up, the queues feeding it would build up. The freezes were temporary, and the system would see a surge of traffic once the queues drained.
So I ignored the problem, probably blaming network connectivity, and went off to do other tasks: plan a roadmap. Hire people. Write documentation.
Then came the worst six months for one of my new engineers and me. The system began freezing daily; the queues would fill up and memory would run out. Virtual machine performance slowed to a crawl as processes spilled into swap space… and then it would crash, and we’d lose data. We began watching the system every few hours, manually rebooting whenever the queues filled up and traffic cut out. Eventually we were able to automate some of the restarting.
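For the curious, the automation amounted to little more than a watchdog. Here’s a minimal sketch of the idea, assuming a RabbitMQ-style queue and a systemd-managed service; the queue check, the threshold, and the filter-engine service name are illustrative stand-ins, not the actual commands from that system.

```python
#!/usr/bin/env python3
"""Minimal watchdog sketch: restart the filtering engine when its queue backs up.

The queue command and service name below are hypothetical placeholders,
not the real tooling from the system described above.
"""
import subprocess
import time

QUEUE_DEPTH_LIMIT = 50_000      # restart once the backlog crosses this threshold
CHECK_INTERVAL_SECONDS = 60     # how often to poll the queue

def queue_depth() -> int:
    """Return the total backlog size (assumption: a RabbitMQ-style CLI)."""
    out = subprocess.run(
        ["rabbitmqctl", "list_queues", "messages"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(int(line) for line in out.splitlines() if line.strip().isdigit())

def restart_engine() -> None:
    """Bounce the filtering engine (placeholder service name)."""
    subprocess.run(["systemctl", "restart", "filter-engine"], check=True)

if __name__ == "__main__":
    while True:
        if queue_depth() > QUEUE_DEPTH_LIMIT:
            restart_engine()
        time.sleep(CHECK_INTERVAL_SECONDS)
```

A cron job running a script like this every minute buys you breathing room, but it doesn’t solve anything.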
But that’s not how you keep a system going long term.
Worst of all, I couldn’t explain what was going on. I knew there was a thread that kept locking up, but I never managed to get into the details needed to fix it. To this day, I don’t have a good explanation. We built a parallel system with a different architecture while keeping the old one on life support.
After that experience, I recommend documenting every glitch. There’s always a reason. I cannot emphasize enough how important it is for your team to understand how everything works. Be proactive about investigating the root cause of each glitch. Treat them like a real fire: spend at least a day digging into the issue and document the findings.
When you catch yourself saying, “Oh, the system is glitchy!”, write the glitch down on a piece of paper on the spot. Then transfer it to Redmine or JIRA so the entire team can see it. Here’s another anecdote: an analyst notified me that he wasn’t receiving emails from one of our systems. I told him, “Oh, the mail server is probably acting up.”
The next day I walked into a major fire: the email he never received was the acknowledgement that his bulk edit of multiple records had gone through. Customers were wondering what had happened to their data. Either the tool was broken, he had forgotten to press the button, or the processing had frozen. This was my mistake: I assumed that whatever he was doing would go through and that the email confirmation could be ignored.
That’s not good.
Every time there’s a glitch, record that glitch! At the end of every week, look for patterns. Examine where each glitch can be isolated to. There’s always an explanation. You might not have to fix every one; you might be able to work around some. But you simply cannot bury them.
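If opening a ticket feels too heavy in the moment, even a flat log you can grep will do. Below is a minimal sketch of that habit, assuming a plain CSV file; the file name, fields, and the record/weekly_summary helpers are illustrative choices, not a prescribed format.

```python
#!/usr/bin/env python3
"""Append glitches to a flat CSV log and summarize them by component.

The file name and fields are illustrative, not a required schema.
"""
import csv
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path

LOG = Path("glitches.csv")

def record(component: str, description: str) -> None:
    """Append one glitch with a timestamp, creating the header on first use."""
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["timestamp", "component", "description"])
        writer.writerow([datetime.now(timezone.utc).isoformat(), component, description])

def weekly_summary() -> Counter:
    """Count glitches per component so patterns stand out at the end of the week."""
    with LOG.open() as f:
        return Counter(row["component"] for row in csv.DictReader(f))

if __name__ == "__main__":
    record("filter-engine", "queue backed up, required restart")
    print(weekly_summary().most_common())
```

The weekly summary is the part that matters: a component that shows up every week is a pattern, not bad luck.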
An ignored glitch can turn into a major fire. Or a volcano in my situation.