We had some problems this morning when one of our zone masters was
offline for a planned outage. Since we have redundant masters, and this
was not even the primary master, I was not concerned and decided to
just let the DNS take care of itself. The whole point of having multiple
masters is not to have to rely on all of them being available for things
to hum along.

But things didn't quite hum along. I noticed that dynamic updates to
one zone were not finding their way to the slave servers. The updates
reached the primary master, which updated its copy of the zone
accordingly and incremented the SOA serial. It sent the NOTIFY
messages, which were acknowledged by the slave. But the slave would not
actually transfer the zone.

When I ran an "rndc status" I could see there were two hundred to three
hundred "xfers deferred." Sniffing the network, I could see the slave
trying to reach the unreachable master, but easily communicating with
the available one.

It appears I had such a backlog of transfers that the new ones,
triggered by the updates and NOTIFYs, were being placed at the end of
the queue. The problem was that even though one master was still
available, every zone check would try both masters and had to wait for
a UDP and a TCP timeout on the unreachable one before giving up. This
was taking f-o-r-e-v-e-r.

We have quite a few zones (301), though not "a lot" by many standards.
Is this how things are supposed to work? It doesn't seem like a very
robust scheme for handling the possibility of a down master. Or have we
misconfigured something to cause this problem?
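For reference, the slave side of the affected zones is set up roughly
like this; a sketch with hypothetical zone name and addresses, not our
literal config:

```
// Slave zone with two masters listed. 192.0.2.1 stands in for the
// master that was offline; 192.0.2.2 for the one that stayed up.
zone "example.com" {
        type slave;
        file "slaves/example.com";
        masters { 192.0.2.1; 192.0.2.2; };
};
```

As I understand it, the slave walks the masters list in order for each
transfer attempt, which is why every queued transfer kept banging on
the dead master first.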

BTW, I was experimenting with temporarily adding "transfers-in" and
"transfers-per-ns" statements to speed things up when the other master
came back online. Once it did, the backlog cleared up very quickly.
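For anyone curious, the knobs I was playing with go in the options
block of named.conf; the values here are just what I was trying, not
recommendations:

```
options {
        // Total concurrent inbound zone transfers (BIND default: 10).
        transfers-in 40;
        // Concurrent transfers from any one master (BIND default: 2).
        transfers-per-ns 10;
};
```

Raising these lets the queue drain faster, but it obviously does
nothing about the per-transfer timeouts against the unreachable master.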