Let's all point the finger at Java

Leap second causes chaos - Java and Linux blamed

One second to kill the internet. That’s all it takes. Yesterday’s leap second to align atomic clocks with the earth’s rotation caused widespread online chaos, as ill-prepared websites and software fell foul of the extra second.

Among those who choked at midnight GMT, the scheduled time for the additional second, were Reddit, Gawker, StumbleUpon, Yelp, FourSquare, LinkedIn and Mozilla - all of whom experienced brief but substantial unpredictability in their servers. And where was the finger pointed? Squarely at Java and Linux.

Reddit blamed problems in the open source database, Apache Cassandra, which is built in Java.



Mozilla found abnormally high CPU levels in their Java and MySQL servers. One Mozilla engineer reported the Java ‘choking’ bug saying that:

Servers running java apps such as Hadoop and ElasticSearch and java doesn't appear to be working. We believe this is related to the leap second happening tonight because it happened at midnight GMT.

Gawker told Wired Magazine that their leap second problem came from the Tomcat web servers they use to serve up their site:

“Our web servers running tomcat came close to zero response (we were able to handle some requests),” read an email from a site spokesman. “We were able to connect to servers in order to reset them. Only rebooting the servers cleared up the issue.”

The Linux kernel freezing problem reported across the board was easily solved by resetting the date or rebooting the system, as Gawker say. But it shows you that forward planning for occurrences such as these needs to take place.

Google, for example, managed to swerve this by gradually adding milliseconds to their NTP servers, in what they refer to as the ‘leap smear’ workaround. Smart Google, but with a reach of that size, planning is paramount.

The leap smear is talked about internally in the Site Reliability Engineering group as one of our coolest workarounds, that took a lot of experimentation and verification, but paid off by ultimately saving us massive amounts of time and energy in inspecting and refactoring code. It meant that we didn’t have to sweep our entire (large) codebase, and Google engineers developing code don’t have to worry about leap seconds. The team involved in solving this issue was a handful of people, distributed around the world, who were able to work together without restriction in order to solve this problem.
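
To make the idea of gradually adding milliseconds concrete, here is a minimal sketch of how a smear could spread one extra second evenly across a window. It is an illustration only, not Google's implementation: the linear shape, the 24-hour default window, and the function names `smear_offset` and `smeared_time` are all assumptions made for this example.

```python
def smear_offset(elapsed_s: float, window_s: float = 86_400.0, leap_s: float = 1.0) -> float:
    """Return how much of the leap second has been absorbed so far.

    elapsed_s: seconds since the smear window opened (may be outside the window).
    window_s:  length of the smear window; 24 hours is an assumed default.
    leap_s:    size of the adjustment being spread out (one second here).
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    # Clamp so the offset is 0 before the window and the full leap_s after it.
    fraction = min(max(elapsed_s / window_s, 0.0), 1.0)
    return leap_s * fraction


def smeared_time(true_s: float, window_start_s: float, window_s: float = 86_400.0) -> float:
    """Return the time a smearing server would report.

    true_s is a count of elapsed seconds that ignores the leap second. The
    reported clock falls behind it a little at a time, so clients never see
    a repeated or 61st second.
    """
    return true_s - smear_offset(true_s - window_start_s, window_s)
```

Halfway through the assumed 24-hour window, `smear_offset(43_200)` gives half a second, and by the end of the window the reported clock has absorbed the whole second without ever jumping.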

It’s amazing what a little forward planning can do to avoid near-catastrophic situations such as these, as Opera’s Marco Marongiu showed with his workaround. Let this be a lesson to those who were caught out. Time is a complex matter, for certain, but it pays to be prepared.

Chris Mayer
