Time – More Zonkyness

NTP is one of those strange services that are so vital to the operation of an organisation’s network; if the servers around the network get their time in a muddle, all sorts of strange things can start happening. Besides which most people expect their computers to be able to tell the right time.

But often it is one of the unloved services. After all no user is going to ask about the health of the NTP service. And if you are a senior manager involved in IT, do you know who manages your NTP infrastructure ? If so, have you ever asked them to explain the design of the NTP infrastructure ? If not, you may find a nasty surprise – your network’s NTP infrastructure may rely on whatever servers can be scavenged and with the minimum investment of time.

Of course, NTP is pretty reliable and in most circumstances extremely resilient. NTP has built in safeguards against against confused time servers sending wildly inappropriate time adjustments, and even in the event of a total NTP failure, servers should be able to keep reasonable time for at least a while. Even with a minimal of investment, an NTP infrastructure can often run merrily in the background for years without an issue.

Not that it is a good idea to ignore NTP for years. It is better by far to spend a little time and money on a yearly basis to keep things fresh – perhaps a little server, and a day’s time each year.

That was quite a long rambling introduction to the NTP “glitch” that I learned about this week, but perhaps goes some way to explaining why such a glitch occurred.

A number of organisations reported that their network had started reporting a time way back in the year 2000. It turns out that :-

The USN(aval)O(observatory) had a server that for 51 minutes reported the year as 2000 rather than 2012.
A number of organisations with an insufficient number of clock sources (i.e. just the erroneous USNO one) attempted to synchronise to the year 2000 causing the NTP daemon to stop.
Some “clever” servers noticed that NTP had stopped, and restarted it. Because most default NTP startup scripts set the clock on startup, these servers were suddenly sent back in time to the year 2000.

And a cascade of relative minor issues, becomes a major issue.

Reading around, the recommendations to prevent this sort of thing happening :-

Use an appropriate number of time sources for your main NTP servers; various suggestions have been made ranging from 5 (probably too few) to 8 (perhaps about right) to 20 (possibly overkill).
Have an appropriate number of main NTP servers for your servers (and other equipment) to synchronise their time with. Anything less than 3 is inadequate; more than 4 is recommended.
Prevent your main NTP servers from setting their time when NTP is restarted and monitor the time on each server regularly.
And a personal recommendation: Restart all your NTP daemons regularly – perhaps daily – to get them to check with the DNS for any updated NTP server names.
And as suggested above, regularly review your NTP infrastructure.