“I crashed the New Zealand stock exchange”: 13 terrifyingly true developer horror stories
Pull up a chair and gaze into the glow of the warm monitor, as JAXenter shares with you our favourite Halloween chillers.
It’s October 31st – officially the scariest day of the year. And what better way to honour this celebration of ghouls and gremlins, we thought, than by sharing thirteen terrifying tales of developer horror gathered at JAX London?
Read on, dear thrill seeker. But don’t come crying to us if this gives you goosebumps…and hey, if you’ve got some spine chillers of your own, please feel free to share in the comment section!
1. Say rm -rf and die
This must have been, what, early noughties time? So I used to work for a company that owned another company, and I knew a guy that worked there. He was convinced that their site was being hacked, and he wanted help.
So I ssh’d into their box, had a look around, spotted some kind of problem, and fixed it for him. And then I accidentally ran an rm -rf command from the wrong directory, and wiped out their entire website.
And then I pinged him, and I was like, “dude, there’s a bit of a problem. Where are your backups?”
To which he replies: “What backups?”
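The classic guard against this mistake is to never let a recursive delete run against an unvalidated path. Here’s a minimal sketch in Python – purely hypothetical, since the original incident predates any such safeguard, and the paths are illustrative:

```python
from pathlib import Path
import shutil

# Only ever delete things strictly inside one whitelisted root.
# SAFE_ROOT is an illustrative path, not the site from the story.
SAFE_ROOT = Path("/var/www/example-site").resolve()

def check_deletable(target: str) -> Path:
    """Resolve target and verify it sits strictly under SAFE_ROOT."""
    path = Path(target).resolve()
    if SAFE_ROOT not in path.parents:
        raise ValueError(f"refusing to touch {path}: not inside {SAFE_ROOT}")
    return path

def safe_rmtree(target: str) -> None:
    """rm -rf, but only for paths that pass validation."""
    shutil.rmtree(check_deletable(target))
```

A guard like this turns “wrong directory” from a wiped website into a refused command – though, as the story shows, backups are the safeguard that actually matters.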
2. Beware of the stock exchange
So…I crashed the New Zealand stock exchange. I shut down the servers in the wrong window.
We were sitting in the Ops room, and we had a police siren set up for when the stock exchange goes down, because we need to know about it. So I enter my command on the keyboard, and all of a sudden, the siren kicks off.
I didn’t put two and two together to start with. I thought, “Cool! Our siren’s gone off.” So I started looking at various systems, and then I thought: oh, hang on a second.
And that’s the day we discovered our disaster recovery did actually work. It had never been tested before.
3. Diary of a mad tester
One of our tests says, “Get me two random numbers. Are they the same? No? Good. Pass.”
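For anyone lucky enough never to have met this antipattern, here’s a sketch of what such a test looks like – a reconstruction in Python, since the original code was never shown:

```python
import random

# The test as described: draw two random numbers and pass if they
# happen to differ. (Reconstructed, not the original code.)
def numbers_differ() -> bool:
    a = random.randint(0, 9)
    b = random.randint(0, 9)
    return a != b  # "Are they the same? No? Good. Pass."

# Run it many times: it fails roughly 10% of the time by pure chance,
# and never once exercises any production code.
failures = sum(1 for _ in range(10_000) if not numbers_differ())
```

The pass rate is a property of the random number generator, not of anything the test claims to verify.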
4. Legend of the lost operating system
I managed to delete an entire operating system.
A long time ago, we had a different setup of computers than we have today. The data lived on removable disks – they’re all hidden away these days, but back then you had great big cabinets. We had something like £25m worth of developed software on the disks, and I was doing the rotating backups. Somebody had set a switch on the front of the computer the wrong way – the auto-start switch.
So, what happened was, I put the disk in, we had a little power glitch, and it auto-started halfway through the backup. I corrupted the first £25m worth of software, and then I put the next disk in and corrupted that one. And then I put the next one in, because I was doing the full rotation! And in the end, I had none left.
I spent the next three weeks patching together a nearly-complete operating system and software. Never quite got there. I lost about two months’ work in the end. Oh my god, was I slaving away for hours and hours.
5. Print fever
We asked another department in the company to update an Excel sheet and edit the cells. They handed it back to us printed out, and someone had written the corrections on it! Because it was so big, they’d stuck it all together with sellotape, on pieces of A4, up on a whiteboard. Priceless, that.
6. Monster laptop
This is a company laptop – they don’t let you install anything on it; you’ve got to be admin. So I spent most of the weekend building a live CD, making it persistent as well, so I could use an external drive.
I came here to a workshop yesterday, booted it up once and it worked. I’d forgotten to set the persistence flag, so I reset it, set the persistence flag, and then got a read error on the initrd – the Linux initial RAM disk.
So that was it, all that work gone out the window. I had no laptop for the workshops. That was a horror story, thanks to Windows and their permissions.
7. Night of the living exploit
We released a service that allowed people to download software, which had a huge security flaw in it. Which we realised about fifteen minutes after we deployed it. And then frantically fixed the vulnerability and pushed the fix out there. We didn’t get exploited or anything, so everything was fine, but that was quite a good example of, “when the shit hits the fan, how good are you at getting a fix out?”
8. The curse of the cleaners
A lot of companies forget to lock the rooms where their server farms are, and the cleaners will just go in and clean, right? So they’ll dust the computers, and off switches will happen – Chaos Monkey style. That’s happened at a major, major investment bank here in London.
They let the cleaners into the server rooms, and all of a sudden major production systems started going down. Everyone’s going, “what’s going on?”, and ringing down to security thinking they were being hacked – and the cleaner’s there hoovering, headphones in, elbows smacking up against the servers.
9. The £100k coding horror
I used to work for a really famous bank, and we were doing some work for one of our clients who takes credit card payments. And I had to make a change, and accidentally committed something without a test, and then six months later they found out there was a bug, and it cost them £100,000.
That’s not the worst bit. So I patched it, and we went to work on a couple of branches. And no-one pulled it in. I was there for another eighteen months, and by the time I left it still hadn’t been patched.
And the client – we had no way of identifying which version of the code they had, because they were all physical sites somewhere. It relied on an engineer going out with a USB pen, whenever he felt like it, just plugging it in and upgrading their software. So that bug is probably still out there…
10. Be careful what you look for…
I came across, in our code, the perfect way to find out whether a number is negative: convert it to a string and see if the first character is a minus sign. It was in actual production code.
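For the curious, here’s roughly what that check looks like next to the boring comparison – reconstructed in Python, since the original language wasn’t mentioned – along with one place where the string version quietly disagrees:

```python
# A reconstruction of the check from the story, next to the obvious one.
def is_negative_stringly(n) -> bool:
    return str(n)[0] == "-"   # the "perfect" production version

def is_negative(n) -> bool:
    return n < 0              # the one-character alternative

# They mostly agree, but not always: Python prints negative zero as
# "-0.0", so the string check calls it negative while n < 0 does not.
```

Beyond the wasted string conversion, the two checks genuinely diverge on edge cases like negative zero – the sort of thing that surfaces years later in the least convenient way.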
11. Deep trouble
I’m at a conference today, and I got a call an hour ago, saying a number of clients’ websites had gone down. And of course, there’s no-one back in the office to fix it, because all of the guys are down here at JAX London. So we’ve got a very worried-sounding manager phoning to say help.
So we tried to get on the wifi network here, and of course everyone else is trying to use the wifi here, so we’re having to do it via mobile phones to connect in and sort out the problem. If we can get 3G in here…
We managed to fix it in the end, so at least there’s a happy ending.
12. My hairiest adventure
I used to work in a hospital, and it was about one month after I started working in IT. I was still a student. I thought I was doing a fix and deploying it to the test environment. What really happened was we had access to the live system, and it went live.
So, the next morning I came in at like 9 o’clock. Everybody was looking at me: “The information system of one of the biggest hospitals in the country isn’t working – what did you do? Roll it back!”
13. Attack of the mutant asynchronous system
Two systems, one doing the application logic and the other doing the data logic. So if we have a cancellation of some kind, the first system works on the information and saves a bunch of data before it sends off an email – like, “we’re sorry to lose you”, that kind of stuff.
Because the communication between the systems is asynchronous, cancelling a subscription sends some data off, waits for the exact amount to come back, receives it, and then sends an email with all the information combined. So this intermediary data lives on the application system while it waits for the other system to complete its work. This was – still is, actually – in production, and it’s a list of 100 entries maximum. The data sits there, waits, and then we pop it off the list.
So if at any point, for some reason, there are 100 requests in flight at once, all still being processed, we start losing data, because of that hard limit of 100 – a limit with no apparent reason behind it, other than that 100 seemed like enough. And if the server gets restarted, we lose the lot.
We’ve had that for about eight years now. Nobody really knows about it.
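The failure mode is easy to sketch. Assuming an in-memory, fixed-size list holding the in-flight cancellations – all names here are hypothetical, not the real system – the 101st request silently evicts the 1st:

```python
from collections import deque

# Hypothetical sketch of the failure mode: in-flight cancellation state
# parked in a fixed-size, in-memory list while the data system's
# asynchronous reply is awaited.
MAX_PENDING = 100
pending = deque(maxlen=MAX_PENDING)   # entry 101 silently evicts entry 1

def start_cancellation(request_id: str) -> None:
    pending.append(request_id)        # park here until the data system replies

def on_data_reply(request_id: str) -> bool:
    # If the entry was evicted – or the server restarted and the list
    # was wiped – the reply has nothing to pair with: no email is sent.
    if request_id in pending:
        pending.remove(request_id)
        return True
    return False

# 101 requests in flight at once: the first is gone before its reply arrives.
for i in range(101):
    start_cancellation(f"req-{i}")
```

Nothing errors, nothing logs; the data just quietly falls off the front of the list – which is exactly why nobody has noticed in eight years.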