Avoiding your own 3am disasters

To catch a bug

Irit Shwarchberg

Being on call is a fact of life for some developers. In this article, Irit Shwarchberg explains that although bugs are a fact of life, there are ways to mitigate those catastrophic failures and save yourself from a midnight bug hunt.

The phone rang. It was 3am: not entirely unexpected, but still unwelcome. It was my turn on call. No, not at the hospital or at a crisis center, but at the telecom company where I was working. I was in the middle of a weeklong rotation as the go-to person for any bugs detected in the system, day or night. And just my luck, it was during the night part of “day or night” that an error in the source data was detected, one that had caused a complete system failure.

The bug was mine to find, and I got to work looking for it. I began by searching our Cognos reporting system and Informatica. First I opened the report in Cognos to see whether there was an incorrect calculation anywhere. I went through all of the data items within the report and everything looked fine; nothing had been touched for the last few months. Then I dug into Informatica. I started opening various maps, looking for the specific one related to the bug. I began with the final map in the data flow, which inserts data into the table that the report relies on.

When I didn’t find the problem there, I went on to check the other maps, which meant finding and checking all of the source tables. To make the task even more difficult, there were so many maps that it was nearly impossible to tell what had already been checked and what still needed to be looked at. Although I combed the system from top to bottom to the best of my ability, the bug was nowhere to be found. Even worse, the bug was affecting a system for employee timesheets and salaries, so my coworkers’ livelihoods were entirely in my hands. Failing to find the bug on my own, I pulled in some of my colleagues to help, but they had no luck either. Neither did the Business Intelligence team.

The problem, at least on the surface, was missing data. But we triple-checked all of the fields and couldn’t find any errors. We went through the data flow over and over again, trying to find a connection that had been made the wrong way or a definition that was incorrect. After four days, I had gotten nowhere and still had no idea where the missing data was. As the end of the month approached with salaries on the line, we were racing against the clock. We had no option but to go through all of the information manually.

I began opening each row of data to map how the information was transformed on its way into the system. If a map was small and had only a source-to-table flow with a few rows, it was like winning the jackpot. But most maps have complex data flows, and understanding each data transformation is not only very difficult, it can also take days. Luckily, through our manual check, I eventually found the culprit. There was a field whose function was to count mistakes in the data, to show how many errors existed. But whoever had created the field had only allowed for a single digit. With more than nine errors in the data, the system had simply failed.
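A check along these lines could flag that kind of overflow before a load runs. This is only a minimal sketch in Python, not how our Informatica maps were built: the `FieldSpec` class, the `error_count` field name, and the one-digit limit are all hypothetical stand-ins for whatever metadata a real system exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical description of a numeric target field."""
    name: str
    max_digits: int  # declared width of the field, in digits


def find_overflows(rows, specs):
    """Return (row_index, field_name, value) for each value too wide for its field."""
    problems = []
    for index, row in enumerate(rows):
        for spec in specs:
            value = row.get(spec.name)
            if value is None:
                continue  # missing values are a separate check
            digits = len(str(abs(int(value))))
            if digits > spec.max_digits:
                problems.append((index, spec.name, value))
    return problems


# Assumed example data: an error counter that was sized for a single digit.
specs = [FieldSpec("error_count", max_digits=1)]
rows = [{"error_count": 3}, {"error_count": 9}, {"error_count": 12}]

print(find_overflows(rows, specs))  # [(2, 'error_count', 12)]
```

The point of the sketch is that the limit lives in one declared place and is compared against the data, rather than being discovered by opening maps one at a time.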

Our crisis had been averted and our salaries were paid on time. Our hunt for the bug left us with a very valuable lesson about data: the importance of understanding how our data is mapped cannot be overstated. In any database, there is both the data itself and the field that it lives within. But when you look across an organization with multiple databases, ETL tools, and multiple owners of those databases (not to mention multiple employees entering the data), consistency is rarely achieved. In one database, gender may be entered as male/female, and in another as M/F; dates can be entered month first or day first, with a two-digit or four-digit year; and details such as middle names can throw the entire system for a loop. And, as in the case of our 3am system failure, the people who build the data flows and processes are rarely the same people who use them. As a result, they’re often unaware of the limitations of the fields they create and the disasters those limitations may cause.
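To make the inconsistency problem concrete, here is a small Python sketch of normalizing values from two sources into one convention. The mapping table, function names, and sample values are assumptions made for illustration. Note that the date format is declared per source rather than guessed, because a value like 03/04/2017 is ambiguous on its own.

```python
from datetime import date, datetime

# Assumed mapping from the spellings seen in different sources to one code.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def normalize_gender(raw):
    """Map a source spelling to 'M' or 'F'; fail loudly on anything unknown."""
    key = raw.strip().lower()
    if key not in GENDER_CODES:
        raise ValueError(f"unrecognized gender value: {raw!r}")
    return GENDER_CODES[key]


def normalize_date(raw, source_format):
    """Parse a date using the format declared for the source it came from."""
    return datetime.strptime(raw.strip(), source_format).date()


# Source A (hypothetical): month first, four-digit year.
print(normalize_date("03/04/2017", "%m/%d/%Y"))  # 2017-03-04
# Source B (hypothetical): day first, two-digit year.
print(normalize_date("03/04/17", "%d/%m/%y"))    # 2017-04-03
print(normalize_gender(" Female "))              # F
```

Raising an error on an unknown value is a deliberate choice in this sketch: a load that stops with a clear message is easier to debug than one that silently drops rows.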

How do you avoid your own 3am disaster? At the end of the day (and in the middle of the night), it comes down to knowing your data. System failures cannot be ruled out entirely; bugs will ultimately happen. But understanding how your data is mapped, what fields exist, what limitations are placed on each of them, and how they’re interconnected across databases will not only give organizations better control, it will also give them more accuracy, allow different teams to work with multiple data sets, and help avoid situations of missing data, such as the one I experienced. For anyone looking to avoid manually searching through each field, automating that process is key.
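As one illustration of what automating the search might look like, the Python sketch below walks a lineage map from a report back to every table that feeds it, visiting each table exactly once. The table names and the `UPSTREAM` dictionary are invented for the example; in practice the mapping would have to be extracted from the ETL tool's own metadata, which this sketch does not attempt.

```python
from collections import deque

# Hypothetical lineage: each table maps to the tables that feed it.
UPSTREAM = {
    "salary_report": ["timesheet_fact"],
    "timesheet_fact": ["timesheet_staging", "employee_dim"],
    "timesheet_staging": ["source_timesheets"],
    "employee_dim": ["source_hr"],
}


def upstream_tables(target, upstream):
    """List every table feeding `target`, nearest first, each one only once."""
    seen = set()
    order = []
    queue = deque([target])
    while queue:
        table = queue.popleft()
        for parent in upstream.get(table, []):
            if parent not in seen:  # never check the same table twice
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


for table in upstream_tables("salary_report", UPSTREAM):
    print(table)
```

The `seen` set addresses the problem I had during the manual hunt: with many maps, it records what has already been checked and what is still left to look at.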


Irit Shwarchberg

Currently heading Customer Success at Octopai, Irit Shwarchberg is a BI expert with more than 10 years of experience leading complex data projects in IT and Telecom, both on the development side and the analysis side.
