Aug 09 2018

Well, that was a weird error. I recently discovered that ntpd had mysteriously stopped working; specifically, it was not able to resolve NTP “pool” names :-

ntpd: error resolving pool Name or service not known (-2)

After some time spent blundering around down dead ends with the help of an appropriate search engine, I ended up resorting to strace. This is a tool most commonly used by developers, but it can be surprisingly useful for diagnosing system problems too.

As long as you can look past all the inscrutable output!

The strace tool runs a command and records every system call that the command makes, together with the results. And of course most commands make zillions of system calls, so you’re likely to end up with a huge output file.

To generate the output file, I ran the modern equivalent of ntpdate (ntpd -d), which tries to do the same thing using the actual NTP daemon. This is useful in this case because the command starts, configures itself (which is where the error occurs), and then exits (unlike the normal dæmon). It is important to redirect the output so that there is a file to trawl through later :-

strace ntpd -d > /var/tmp/ntpd.strace 2>&1

Once the output was generated, it was necessary to trawl through it to look for clues. The first thing was to search for “europe” (which appears in the name of one of the NTP pools I use). The first occurrence was the error claiming that the name didn’t exist :-

write(2, "error resolving pool europe.pool"..., 73error resolving pool Name or service not known (-2)

Which was somewhat odd, because you would expect the string “europe” to occur earlier, within some inscrutable attempt to resolve the name. Yet it appears as though the error occurs without any attempt to resolve the name!

As a bit of a guess I searched for “resolv.conf” which revealed :-

stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=362, ...}) = 0
openat(AT_FDCWD, "/etc/resolv.conf", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)

Apparently ntpd is unable to open the file due to a permissions problem!
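Rather than guessing at strings to search for, it is also possible to pull every failed call out of the trace in one go. What follows is a minimal sketch, not what I actually ran at the time; the function name failed_calls is made up for illustration, and it relies only on the fact that strace prints failures as "= -1 ERRNO (description)" :-

```shell
# List the system calls in an strace log that failed with a given errno.
# Usage: failed_calls ERRNO [FILE...]   (reads stdin when no FILE is given)
# NOTE: failed_calls is a hypothetical helper name, not part of strace.
failed_calls() {
    errno_name=$1
    shift
    # strace reports a failure as "... = -1 EACCES (Permission denied)";
    # -n prefixes each match with its line number in the log.
    grep -n -e "= -1 ${errno_name} " "$@"
}
```

Running failed_calls EACCES /var/tmp/ntpd.strace would list the permission failures, with line numbers to jump to in the full trace. ENOENT is usually far noisier, as programs routinely probe for files that are not there.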

Looking at my /etc/resolv.conf revealed an oddity dating back to when I tried configuring /etc/resolv.conf as a symbolic link to a file on a separate file system. The file itself was a symbolic link to /etc/resolv.conf.file.

For some reason ntpd didn’t like the symbolic link, which is a bit odd, but changing it to an ordinary file fixed the problem.
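For anyone in the same position, the check and the fix can be sketched roughly as below. The function names are made up for illustration, and the 0644 mode is an assumption taken from the stat output in the trace above; adjust it if your file should differ :-

```shell
# Report whether a path is a symbolic link and where it finally points.
# NOTE: describe_link and flatten_link are hypothetical helper names.
describe_link() {
    path=$1
    if [ -L "$path" ]; then
        printf '%s is a symlink to %s\n' "$path" "$(readlink -f -- "$path")"
    else
        printf '%s is not a symlink\n' "$path"
    fi
}

# Replace a symbolic link with an ordinary file holding the same contents.
flatten_link() {
    path=$1
    [ -L "$path" ] || return 0                  # already an ordinary file
    target=$(readlink -f -- "$path") || return 1
    tmp=$(mktemp "${path}.XXXXXX") || return 1  # temp file in the same directory
    cp -- "$target" "$tmp" || return 1
    chmod 0644 "$tmp" || return 1               # assumed mode; mktemp creates 0600
    mv -f -- "$tmp" "$path"                     # rename replaces the link itself
}
```

As root, describe_link /etc/resolv.conf followed by flatten_link /etc/resolv.conf would be the equivalent of what I did by hand. Bear in mind that anything which was maintaining the link (a DHCP client, for instance) may well put it back.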

Feb 22 2014

Having had a wee bit of fun at work dealing with an NTP DDoS attack, I feel it is long past time to tackle the root cause of the problem – the ISPs who have neglected to implement ingress/egress filtering despite it being considered best practice for well over 15 years. Yes, longer than most of us have been connected to the Internet.

It is easy to point at the operators of NTP services that allow their servers to be used as attack amplifiers. And yes these insecure NTP servers should be fixed, but given the widespread deployment of NTP in everything it could take up to a decade for a fix to be universally deployed.

And what then? Before the widespread use of NTP for amplified distributed denial of service attacks, DNS was commonly used. And after NTP is cleaned up? Or even before? There are other services which can be exploited in the same way.

But the way that amplification attacks are carried out involves two “vulnerabilities”. In addition to the vulnerable service, the attacker forges the packets they send to the vulnerable service so that the replies go back to the victim. Essentially they trick the Internet into thinking that the victim has asked a question – millions of times.

Forging the source address contained within packets is relatively easy to do. It has been known about for a very long time, and the counter-measure has been known for nearly as long. To put it simply, all the ISP has to do is to not allow packets to exit their network(s) which contain a source address that does not belong to them. Yet many ISPs – the so-called “bad” ISPs – do not implement this essential bit of basic security. The excuse that implementing such filters would be impossible with their current routers simply doesn’t wash – routers that will do this easily have been on the market for many years.
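To show how little is involved, here is a rough sketch of the idea for a Linux box acting as a border router. It is only an illustration: real ISP kit would use its own access lists or reverse path checks, the function name is made up, and the interface name and prefixes in the usage note are placeholders from the documentation ranges. The function prints the iptables rules rather than applying them, so they can be reviewed first :-

```shell
# Print egress-filter rules: forward packets out of the upstream interface
# only when their source address falls within one of our own prefixes.
# Usage: egress_rules UPSTREAM_IF PREFIX [PREFIX...]
# NOTE: egress_rules is a hypothetical helper; nothing is applied, only printed.
egress_rules() {
    wan_if=$1
    shift
    for prefix in "$@"; do
        # Our own addresses are allowed to leave.
        printf 'iptables -A FORWARD -o %s -s %s -j ACCEPT\n' "$wan_if" "$prefix"
    done
    # Anything else leaving has a source address that is not ours: drop it.
    printf 'iptables -A FORWARD -o %s -j DROP\n' "$wan_if"
}
```

For example, egress_rules eth0 192.0.2.0/24 198.51.100.0/24 prints two ACCEPT rules followed by the final DROP. The accept-then-drop ordering matters; a single negated rule per prefix would drop legitimate traffic from the other prefixes.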

It is laziness pure and simple.

These bad ISPs need to be discovered, named, and shamed.

Nov 24 2012

NTP is one of those strange services that is vital to the operation of an organisation’s network; if the servers around the network get their time in a muddle, all sorts of strange things can start happening. Besides which, most people expect their computers to be able to tell the right time.

But often it is one of the unloved services. After all, no user is going to ask about the health of the NTP service. And if you are a senior manager involved in IT, do you know who manages your NTP infrastructure? If so, have you ever asked them to explain the design of the NTP infrastructure? If not, you may find a nasty surprise – your network’s NTP infrastructure may rely on whatever servers can be scavenged, set up with the minimum investment of time.

Of course, NTP is pretty reliable and in most circumstances extremely resilient. NTP has built-in safeguards against confused time servers sending wildly inappropriate time adjustments, and even in the event of a total NTP failure, servers should be able to keep reasonable time for at least a while. Even with a minimum of investment, an NTP infrastructure can often run merrily in the background for years without an issue.

Not that it is a good idea to ignore NTP for years. It is better by far to spend a little time and money on a yearly basis to keep things fresh – perhaps a little server, and a day’s time each year.

That was quite a long, rambling introduction to the NTP “glitch” that I learned about this week, but perhaps it goes some way to explaining why such a glitch occurred.

A number of organisations reported that their network had started reporting a time way back in the year 2000. It turns out that :-

  • The US Naval Observatory (USNO) had a server that for 51 minutes reported the year as 2000 rather than 2012.
  • A number of organisations with an insufficient number of clock sources (i.e. just the erroneous USNO one) attempted to synchronise to the year 2000 causing the NTP daemon to stop.
  • Some “clever” servers noticed that NTP had stopped, and restarted it. Because most default NTP startup scripts set the clock on startup, these servers were suddenly sent back in time to the year 2000.

And so a cascade of relatively minor issues becomes a major issue.

Reading around, the recommendations to prevent this sort of thing happening are :-

  1. Use an appropriate number of time sources for your main NTP servers; various suggestions have been made ranging from 5 (probably too few) to 8 (perhaps about right) to 20 (possibly overkill).
  2. Have an appropriate number of main NTP servers for your servers (and other equipment) to synchronise their time with. Anything less than 3 is inadequate; more than 4 is recommended.
  3. Prevent your main NTP servers from setting their time when NTP is restarted and monitor the time on each server regularly.
  4. And a personal recommendation: Restart all your NTP daemons regularly – perhaps daily – to get them to check with the DNS for any updated NTP server names.
  5. And as suggested above, regularly review your NTP infrastructure.
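The monitoring side of those recommendations can be partly automated. As a rough sketch (the function names are made up, and the field positions assume the usual ten-column peers listing that ntpq -pn prints under a two-line header), the following counts the time sources a server can currently reach and complains if there are too few :-

```shell
# Count the time sources with a non-zero "reach" value in `ntpq -pn` output.
# Reads the listing on stdin; skips the two header lines.
# NOTE: reachable_sources and enough_sources are hypothetical helper names.
reachable_sources() {
    # Field 7 is the reach register; "0" means no recent replies at all.
    awk 'NR > 2 && NF >= 10 && $7 != "0" { n++ } END { print n + 0 }'
}

# Succeed only if at least MIN sources are reachable.
# Usage: ntpq -pn | enough_sources MIN
enough_sources() {
    min=$1
    count=$(reachable_sources)
    if [ "$count" -lt "$min" ]; then
        echo "WARNING: only $count reachable time sources (want at least $min)" >&2
        return 1
    fi
}
```

Something like ntpq -pn | enough_sources 5 could be run from cron or a monitoring system on each main NTP server; the threshold is whatever number you settled on from the first recommendation.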