Leap second causes chaos – Java and Linux blamed
One second is all you need to kill the Internet, it seems
One second to kill the internet. That’s all it takes.
Yesterday’s leap second, added to keep atomic clocks aligned with the
Earth’s rotation, caused widespread online chaos, as ill-prepared
websites and software fell foul of the extra second.
Among those that choked at midnight GMT, the scheduled time
for the additional second, were Reddit, Gawker,
StumbleUpon, Yelp, FourSquare, LinkedIn and
Mozilla – all of which experienced brief but substantial
unpredictability in their servers. And where was the finger
pointed? Squarely at Java and Linux.
Reddit blamed problems in the open source database Apache
Cassandra, which is written in Java.
Servers running java apps such as Hadoop and ElasticSearch and
java doesn’t appear to be working. We believe this is related to the
leap second happening tonight because it happened at midnight
Gawker told Wired magazine that its leap second problem
came from the Tomcat web servers it uses to serve up its
“Our web servers running tomcat came close to zero response (we
were able to handle some requests),” read an email from a site
spokesman. “We were able to connect to servers in order to reset
them. Only rebooting the servers cleared up the issue.”
The Linux kernel freezing problem reported across the board was
easily solved by resetting the date or rebooting the system, as
Gawker says. But it shows that forward planning for occurrences
such as these needs to take place.
Google, for example, managed to sidestep the problem by
gradually adding milliseconds to the time served by its NTP servers,
in what it refers to as the ‘leap smear’ workaround. Smart Google,
but with a reach of that size, planning is paramount.
The leap smear is talked about internally in the Site
Reliability Engineering group as one of our coolest workarounds,
that took a lot of experimentation and verification, but paid off
by ultimately saving us massive amounts of time and energy in
inspecting and refactoring code. It meant that we didn’t have to
sweep our entire (large) codebase, and Google engineers developing
code don’t have to worry about leap seconds. The team involved in
solving this issue was a handful of people, distributed around the
world, who were able to work together without restriction in order
to solve this problem.
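To make the idea of a leap smear concrete, here is a minimal sketch in Python. It is not Google’s implementation, which this article does not describe in detail. The linear ramp, the 20-hour window, the leap date and the names smear_offset and smeared_time are all assumptions chosen for illustration. The point is only that the extra second is absorbed a little at a time, so clients never see a clock that repeats or jumps.

```python
from datetime import datetime, timedelta, timezone

# Assumed values, for illustration only: the instant just after the
# leap second (midnight GMT) and a hypothetical 20-hour smear window.
LEAP_INSTANT = datetime(2012, 7, 1, 0, 0, 0, tzinfo=timezone.utc)
SMEAR_WINDOW = timedelta(hours=20)


def smear_offset(now, leap_instant=LEAP_INSTANT, window=SMEAR_WINDOW):
    """Return how much of the extra second (0.0 to 1.0) has been absorbed.

    `now` is a clock reading that has NOT been adjusted for the leap second.
    A linear ramp is assumed here; other shapes are possible.
    """
    start = leap_instant - window
    if now <= start:
        return 0.0  # smear has not started yet
    if now >= leap_instant:
        return 1.0  # the whole extra second has been absorbed
    return (now - start) / window  # fraction of the window elapsed


def smeared_time(now, leap_instant=LEAP_INSTANT, window=SMEAR_WINDOW):
    """Return the time a smearing server would hand out to its clients."""
    # The served clock falls behind gradually, by at most one second,
    # instead of stepping back a full second at the leap instant.
    lag = timedelta(seconds=smear_offset(now, leap_instant, window))
    return now - lag


if __name__ == "__main__":
    # Print the lag at a few points across the assumed window.
    for hours_before in (20, 15, 10, 5, 0):
        t = LEAP_INSTANT - timedelta(hours=hours_before)
        print(hours_before, "h before:", smear_offset(t), "s behind")
```

Halfway through the assumed window the served clock is half a second behind, and by midnight it is a full second behind, which is where a clock that had taken the leap second in one step would also end up. Software reading the smeared clock sees time that always moves forward.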
It’s amazing what a little forward planning can do to avoid
near-catastrophic situations such as these, as Marco Marongiu’s
workaround shows. Let this be a lesson for some.
Time is a complex matter, for certain, but it pays to be