Thank you for all your help.

Both the memory utilization and the ID mismatch errors turned out to be irrelevant. I turned off iptables and the memory usage dropped by 50%, but that made no difference to the problem.

The issue appears to be fixed at the moment, but I think that is because the Windows servers stopped sending all of those queries for records in 10.in-addr.arpa. We still don't know what triggered them. On one of the days when I took a measurement, there were over 2 million queries for records in 10.in-addr.arpa.
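
A per-day count like the one above could be taken from the resolver's query log. The following is only a minimal Python sketch: it assumes query logging is enabled and that each logged line contains text of the form `client <ip>#<port>: query: <name> IN <type>`. That log shape and the function name `count_zone_queries` are assumptions for illustration, not the method actually used for the measurement.

```python
import re
from collections import Counter

# Assumed query-log shape (hypothetical example line):
#   "... client 10.1.2.3#1025: query: 4.3.2.10.in-addr.arpa IN PTR"
QUERY_RE = re.compile(
    r"client (?P<client>[0-9a-fA-F.:]+)#\d+: query: (?P<name>\S+) IN (?P<type>\S+)"
)

def count_zone_queries(lines, zone="10.in-addr.arpa"):
    """Return (total, per-client Counter) for queries at or below `zone`."""
    zone = zone.lower().rstrip(".")
    per_client = Counter()
    for line in lines:
        m = QUERY_RE.search(line)
        if not m:
            continue  # not a query-log line in the assumed format
        name = m.group("name").lower().rstrip(".")
        # Match the zone itself or any name under it, but not e.g. 110.in-addr.arpa.
        if name == zone or name.endswith("." + zone):
            per_client[m.group("client")] += 1
    return sum(per_client.values()), per_client
```

Passing an open log file as `lines` gives the total, and `per_client.most_common(10)` would show which clients (for example the Windows servers) send the most of these queries.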

One thing that I did while this problem was happening was to create a dummy zone file on the resolvers for 10.in-addr.arpa. This had the disadvantage of causing them to say no PTR records exist in that domain, but most clients should be pointing to the Windows servers, so it shouldn't matter. To my surprise, that change did not fix the problem. The resolvers continued seizing up, often two or three times a day, and the only thing that fixed them was restarting named. (I did try other things like rndc flush as one person recommended, but none of them helped.)
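
Because the hang seemed to affect only names that were not already cached, one way to notice it early might be a small probe that sends a query to the resolver and checks that a reply with the same ID comes back in time. The sketch below is an illustration in Python, not something that was run on these servers; the function names, the fixed query ID, and the 5-second timeout are all assumptions.

```python
import socket
import struct

def build_query(name, qid):
    """Build a minimal DNS query for an A record with recursion desired."""
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)  # RD set, 1 question
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN

def reply_matches(reply, qid):
    """True if `reply` is a DNS response (QR bit set) carrying our query ID."""
    if len(reply) < 12:
        return False
    rid, flags = struct.unpack("!HH", reply[:4])
    return rid == qid and bool(flags & 0x8000)

def probe(server, name, qid=0x1234, timeout=5.0):
    """Send one UDP query; return True only if a matching reply arrives in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name, qid), (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and unreachable-port errors
            return False
    return reply_matches(reply, qid)
```

Calling something like `probe("172.21.0.100", "www.gold.com")` from cron with a name unlikely to be cached would return False while the resolver is seized up; what to do on failure (page someone, restart named) is left open here.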

Thanks again,
Maria

On Tue, Sep 21, at 05:40, Maria Iano (miano@gannett.com) wrote:

>
>
> -----Original Message-----
> From: bind-users-bounce@isc.org [mailto:bind-users-bounce@isc.org] On
> Behalf Of Ladislav Vobr
> Sent: Monday, September 13, 2004 11:51 PM
> Cc: bind-users@isc.org
> Subject: Re: Warning: ID mismatch:
>
> You will definitely experience problems when you have 100% utilization on
> the inside forwarding servers. The forwarding might be the reason; the other
> thing you mentioned might be some reachability problems. BIND gets very
> busy when some domains are completely unreachable, and also has problems
> responding since its queue gets full following up with all these
> unreachable servers several times for each such request. There is a tool
> called dnstop, I think from caida.org, that can show you real-time traffic
> going out of and coming into your DNS server, sorted by the traffic. It might
> give you a hint what your top talkers are, or the top destinations your
> name server is trying to reach, and you might be surprised what your
> nameserver is trying to do in the background.
>
> How difficult would it be to remove the forwarding and perhaps try a
> stateful firewall, letting even the internal servers resolve
> directly? There are really not many advantages to forwarding, but it
> can be a source of a lot of confusion. Generally, IMHO, people try to
> avoid it if there is some other choice.
>
> Ladislav
>
> Maria Iano wrote:
> > I agree with you on this - the ID mismatch error was a red herring.
> > I'm wondering now if there is some issue with unexpectedly high
> > memory use.
> >
> > I am still experiencing this issue, now almost daily during the work
> > week. At the times of day when we get the most lookups (lunchtime when
> > everyone starts surfing) one or the other of the servers stops
> > responding to queries. In the debugging it looks like when this
> > happens, it receives queries, and forwards them to the outside
> > resolver, but doesn't recognize the reply from the outside server. I
> > can see the packets returning from the outside server. The broken
> > piece seems to occur at that point.
> >
> > One thing I have noticed about these inside resolvers is that they are
> > running at about 100% memory use (1 GB of RAM on each) at all times.
> > Everything else in the system is fine. The load reports as 0. Things
> > like UDP socket use, and all sorts of data from sar, are all fine. The
> > outside resolvers that they forward to are identical builds on
> > identical hardware, yet they run at about 50% memory use. The outside
> > resolvers are also used by about 180 other locations, and get at least
> > 10 times the number of queries, yet they are the ones doing fine.
> >
> > There are really two differences between the servers that are fine
> > (the outside ones), and the servers that keep ceasing to resolve (the
> > inside ones). The outside ones resolve queries in the usual iterative
> > way. The inside ones resolve queries by forwarding to other servers.
> > The other difference is that the inside servers get a lot of reverse
> > 1918 queries which forward to other internal (Windows) servers, and
> > those servers sometimes don't answer. In fact those servers sometimes
> > forward the queries back out, but thankfully I don't see a loop
> > occurring, so the inside resolvers seem smart enough to drop things
> > there. I am about to get this issue fixed, in that the Windows servers
> > are about to be told they own all of that space; it has just taken a
> > week to get the process accomplished for this to happen. Last week I
> > also created a lot of dummy zones for the reverse space on our inside
> > resolvers, so the servers could answer right away. I'm not convinced
> > that will fix the issue anyway.
> >
> > I am trying to determine why the inside servers run at 100% while the
> > outside servers run at about 50% memory usage. I'm also building an
> > updated replacement for one of the inside resolvers to use Fedora in
> > place of RH8, and to no longer use the grsecurity patch, to see if
> > that helps.
> >
> > Thanks,
> > Maria
> >
> > -----Original Message-----
> > From: bind-users-bounce@isc.org [mailto:bind-users-bounce@isc.org] On
> > Behalf Of Ladislav Vobr
> > Sent: Friday, September 10, 2004 7:39 PM
> > Cc: BIND Users Mailing List
> > Subject: Re: Warning: ID mismatch:
> >
> > Sometimes, when you try to query unreachable domains, your recursive
> > servers retry several times to all of the remote name servers, and
> > most of the time there is no reply from your caching servers before
> > the dig time-out. Sometimes there is a SERVFAIL reply later than the
> > time-out.
> >
> > So if you repeat the dig command several times for the same domain,
> > you might receive the reply to the first dig while the second one is
> > waiting, thus seeing this message (ID mismatch). It is perfectly
> > valid, but came at the wrong time :-). Nothing wrong with your
> > firewall or server itself.
> >
> > So you have to think a little bit about the situation :-) I remember
> > using nslookup once and it is so stupid, it doesn't even check the
> > source IP address in the reply packets. I was troubleshooting it
> > through the firewall, with misconfigured NAT, and nslookup kept
> > working even when the reply came from a different IP :-) than the one
> > you sent it to. (But the server obviously not :-) Somebody did a
> > really poor job with nslookup. But this is a different story :-)
> >
> > Ladislav
> >
> >
> > Maria Iano wrote:
> >
> >>This same issue is recurring! This time it is on res1 again. res1 has
> >>address 172.21.0.100 and res2 has address 172.21.0.200. Below I have
> >>pasted in the series of dig commands I ran on res2 sending queries to
> >>res1. Below that I have pasted in the tethereal output during those
> >>commands.
> >>
> >>Since this issue seems to only be a problem for data which isn't
> >>cached, I wonder if there is any connection with the thread with
> >>subject 'Weird named act!'. So I also issued this command suggested in
> >>that thread:
> >>
> >>res1 in: bind$ ps -flp 24708
> >>Warning: /boot/System.map has an incorrect kernel version.
> >>  F S UID        PID  PPID  C PRI  NI ADDR    SZ WCHAN  STIME TTY          TIME CMD
> >>140 S bind     24708     1  0  74   0    -  3596 14372d Sep07 ?        00:00:55 [named]
> >>
> >>This server has a non-modular kernel with the grsecurity patch. In
> >>case it's relevant here is the output of uname -a:
> >>
> >>res1 in: bind$ uname -a
> >>Linux ent-mocux15.moc.gci 2.4.20-grsec #3 Tue Mar 25 09:21:41 EST 2003 i686 i686 i386 GNU/Linux
> >>
> >>Thanks in advance for any help!
> >>Maria
> >>
> >>###################################################
> >>Commands issued on res2
> >>###################################################
> >>
> >>res2 in: bind$ dig @res1.moc.gci www.silver.com
> >>
> >>; <<>> DiG 9.2.3 <<>> @res1.moc.gci www.silver.com
> >>;; global options: printcmd
> >>;; connection timed out; no servers could be reached
> >>res2 in: bind$ dig @res1.moc.gci www.silver.com
> >>;; Warning: ID mismatch: expected ID 56696, got 10590
> >>;; Warning: ID mismatch: expected ID 56696, got 10590
> >>
> >>; <<>> DiG 9.2.3 <<>> @res1.moc.gci www.silver.com
> >>;; global options: printcmd
> >>;; Got answer:
> >>;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56696
> >>;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 0
> >>
> >>;; QUESTION SECTION:
> >>;www.silver.com. IN A
> >>
> >>;; ANSWER SECTION:
> >>www.silver.com. 86400 IN A 205.150.176.184
> >>
> >>;; AUTHORITY SECTION:
> >>silver.com. 259200 IN NS ns1.ktrafic.com.
> >>silver.com. 259200 IN NS ns2.ktrafic.com.
> >>
> >>;; Query time: 2716 msec
> >>;; SERVER: 172.21.0.100#53(res1.moc.gci)
> >>;; WHEN: Wed Sep 8 12:19:43 2004
> >>;; MSG SIZE rcvd: 92
> >>
> >>res2 in: bind$ dig @res1.moc.gci www.gold.com
> >>
> >>; <<>> DiG 9.2.3 <<>> @res1.moc.gci www.gold.com
> >>;; global options: printcmd
> >>;; connection timed out; no servers could be reached
> >>res2 in: bind$ dig @res1.moc.gci www.gold.com
> >>
> >>; <<>> DiG 9.2.3 <<>> @res1.moc.gci www.gold.com
> >>;; global options: printcmd
> >>;; connection timed out; no servers could be reached
> >>res2 in: bind$ dig @res1.moc.gci www.gold.com
> >>
> >>; <<>> DiG 9.2.3 <<>> @res1.moc.gci www.gold.com
> >>;; global options: printcmd
> >>;; connection timed out; no servers could be reached
> >>res2 in: bind$ dig @res1.moc.gci www.purple.com
> >>;; Warning: ID mismatch: expected ID 58216, got 51960
> >>;; Warning: ID mismatch: expected ID 58216, got 51960
> >>;; Warning: ID mismatch: expected ID 58216, got 36737
> >>;; Warning: ID mismatch: expected ID 58216, got 36737
> >>;; Warning: ID mismatch: expected ID 58216, got 20208
> >>;; Warning: ID mismatch: expected ID 58216, got 20208
> >>
> >>; <<>> DiG 9.2.3 <<>> @res1.moc.gci www.purple.com
> >>;; global options: printcmd
> >>;; connection timed out; no servers could be reached
> >>res2 in: bind$ dig @res1.moc.gci www.gold.com
> >>
> >>; <<>> DiG 9.2.3 <<>> @res1.moc.gci www.gold.com
> >>;; global options: printcmd
> >>;; Got answer:
> >>;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46790
> >>;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 2, ADDITIONAL: 0
> >>
> >>;; QUESTION SECTION:
> >>;www.gold.com. IN A
> >>
> >>;; ANSWER SECTION:
> >>www.gold.com. 86313 IN CNAME gold.com.
> >>gold.com. 86311 IN A 198.70.201.51
> >>
> >>;; AUTHORITY SECTION:
> >>gold.com. 86311 IN NS extns1.jewels.com.
> >>gold.com. 86311 IN NS extns2.jewels.com.
> >>
> >>;; Query time: 1 msec
> >>;; SERVER: 172.21.0.100#53(res1.moc.gci)
> >>;; WHEN: Wed Sep 8 12:21:41 2004
> >>;; MSG SIZE rcvd: 109
> >>
> >>res2 in: bind$ dig @res1.moc.gci www.gold.com
> >>
> >>; <<>> DiG 9.2.3 <<>> @res1.moc.gci www.gold.com
> >>;; global options: printcmd
> >>;; connection timed out; no servers could be reached
> >>
> >>###################################################
> >>Output of tethereal during those commands
> >>###################################################
> >>
> >>  0.000000 172.21.0.200 -> 172.21.0.100 DNS Standard query A www.blue.com
> >>  0.000124 172.21.0.100 -> 172.21.0.200 DNS Standard query response CNAME blue.com A 216.91.187.86
> >>  4.991126 Ibm_7b:a6:69 -> Ibm_7b:a4:a3 ARP Who has 172.21.0.200? Tell 172.21.0.100
> >>  4.991493 Ibm_7b:a4:a3 -> Ibm_7b:a6:69 ARP 172.21.0.200 is at 00:02:55:7b:a4:a3
> >>  6.320441 172.21.0.200 -> 172.21.0.100 DNS Standard query A www.silver.com
> >> 11.318427 Ibm_7b:a4:a3 -> Ibm_7b:a6:69 ARP Who has 172.21.0.100? Tell 172.21.0.200
> >> 11.318438 Ibm_7b:a6:69 -> Ibm_7b:a4:a3 ARP 172.21.0.100 is at 00:02:55:7b:a6:69
> >> 11.328548 172.21.0.200 -> 172.21.0.100 DNS Standard query A www.silver.com
> >> 24.820791 172.21.0.200 -> 172.21.0.100 DNS Standard query A www.silver.com
> >> 27.536065 172.21.0.100 -> 172.21.0.200 DNS Standard query response A 205.150.176.184
> >> 27.536121 172.21.0.100 -> 172.21.0.200 DNS Standard query response A 205.150.176.184
> >> 27.536184 172.21.0.100 -> 172.21.0.200 DNS Standard query response A 205.150.176.184
> >> 36.446784 172.21.0.200 -> 172.21.0.100 DNS Standard query A www.gold.com
> >> 41.449517 172.21.0.200 -> 172.21.0.100 DNS Standard query A www.gold.com
> >> 49.777125 172.21.0.200 -> 172.21.0.100 DNS Standard query A www.gold.com
> >> 54.769991 Ibm_7b:a4:a3 -> Ibm_7b:a6:69 ARP Who has 172.21.0.100? Tell 172.21.0.200
> >> 54.770002 Ibm_7b:a6:69 -> Ibm_7b:a4:a3 ARP 172.21.0.100 is at 00:02:55:7b:a6:69
> >> 54.779985 172.21.0.200 -> 172.21.0.100 DNS Standard query A www.gold.com
> >> 61.418983 172.21.0.200 -> 172.21.0.100 DNS Standard query A www.gold.com
> >> 66.420344 172.21.0.200 -> 172.21.0.100 DNS Standard query A www.gold.com
> >> 76.502267 172.21.0.200 -> 172.21.0.100 DNS Standard query A www.purple.com
> >> 77.687081 172.21.0.100 -> 172.21.0.200 DNS Standard query response CNAME gold.com A 198.70.201.51
> >> 77.687142 172.21.0.100 -> 172.21.0.200 DNS Standard query response CNAME gold.com A 198.70.201.51
> >> 77.687208 172.21.0.100 -> 172.21.0.200 DNS Standard query response CNAME gold.com A 198.70.201.51
> >> 77.687263 172.21.0.100 -> 172.21.0.200 DNS Standard query response CNAME gold.com A 198.70.201.51
> >> 77.687328 172.21.0.100 -> 172.21.0.200 DNS Standard query response CNAME gold.com A 198.70.201.51
> >> 77.687382 172.21.0.100 -> 172.21.0.200 DNS Standard query response CNAME gold.com A 198.70.201.51
> >> 81.510874 172.21.0.200 -> 172.21.0.100 DNS Standard query A www.purple.com
> >> 82.684071 Ibm_7b:a6:69 -> Ibm_7b:a4:a3 ARP Who has 172.21.0.200? Tell 172.21.0.100
> >> 82.684293 Ibm_7b:a4:a3 -> Ibm_7b:a6:69 ARP 172.21.0.200 is at 00:02:55:7b:a4:a3
> >> 96.508164 172.21.0.100 -> 172.21.0.200 DNS Standard query response A 153.104.63.227
> >> 96.508232 172.21.0.100 -> 172.21.0.200 DNS Standard query response A 153.104.63.227
> >> 96.508587 172.21.0.200 -> 172.21.0.100 ICMP Destination unreachable
> >> 96.508589 172.21.0.200 -> 172.21.0.100 ICMP Destination unreachable
> >>101.501576 Ibm_7b:a4:a3 -> Ibm_7b:a6:69 ARP Who has 172.21.0.100? Tell 172.21.0.200
> >>101.501587 Ibm_7b:a6:69 -> Ibm_7b:a4:a3 ARP 172.21.0.100 is at 00:02:55:7b:a6:69
> >>145.126659 172.21.0.200 -> 172.21.0.100 DNS Standard query A www.gold.com
> >>145.127129 172.21.0.100 -> 172.21.0.200 DNS Standard query response CNAME gold.com A 198.70.201.51
> >>150.123148 Ibm_7b:a4:a3 -> Ibm_7b:a6:69 ARP Who has 172.21.0.100? Tell 172.21.0.200
> >>150.123159 Ibm_7b:a6:69 -> Ibm_7b:a4:a3 ARP 172.21.0.100 is at 00:02:55:7b:a6:69
> >>
> >>229.285189 172.21.0.200 -> 172.21.0.100 DNS Standard query A www.gold.com
> >>234.276056 Ibm_7b:a4:a3 -> Ibm_7b:a6:69 ARP Who has 172.21.0.100? Tell 172.21.0.200
> >>234.276067 Ibm_7b:a6:69 -> Ibm_7b:a4:a3 ARP 172.21.0.100 is at 00:02:55:7b:a6:69
> >>234.286050 172.21.0.200 -> 172.21.0.100 DNS Standard query A www.gold.com
> >>269.304469 172.21.0.100 -> 172.21.0.200 DNS Standard query response CNAME gold.com A 198.70.201.51
> >>269.304526 172.21.0.100 -> 172.21.0.200 DNS Standard query response CNAME gold.com A 198.70.201.51
> >>269.304821 172.21.0.200 -> 172.21.0.100 ICMP Destination unreachable
> >>269.304822 172.21.0.200 -> 172.21.0.100 ICMP Destination unreachable
> >>274.297311 Ibm_7b:a4:a3 -> Ibm_7b:a6:69 ARP Who has 172.21.0.100? Tell 172.21.0.200
> >>274.297324 Ibm_7b:a6:69 -> Ibm_7b:a4:a3 ARP 172.21.0.100 is at 00:02:55:7b:a6:69
> >
> >>On Wed, Sep 08, at 10:58, Ladislav Vobr (lvobr@ies.etisalat.ae) wrote:
> >>
> >>>Maria Iano wrote:
> >>>
> >>>
> >>>>I have two caching servers, res1 and res2, running BIND 9.2.3 on Red
> >>>>Hat Linux release 8.0 (Psyche). They sit inside a firewall, and
> >>>>forward queries to four different caching servers on the outside, as
> >>>>well as some internal servers authoritative for internal zones.
> >>>>
> >>>>Last week res2 started being slow and failing resolution
> >>>>intermittently. Dig queries sent from res2 to the outside resolvers
> >>>>worked correctly. Dig queries sent from res2 to res1 worked correctly.
> >>>>However, dig queries from res1 to res2 produced error messages like
> >>>>this:
> >>>>
> >>>>;; Warning: ID mismatch: expected ID 3325, got 34596
> >>>>
> >>>>with various different IDs produced from different queries. It was
> >>>>late at night (I had been paged) so I went ahead and rebooted res2.
> >>>>This cleared up the issue.
> >>>>
> >>>>Now, a week later, this same issue is occurring on res1. res1 is slow
> >>>>to respond to queries and intermittently failing to resolve names.
> >>>>Digs issued on res1 pointing to the outside resolvers work fine. Digs
> >>>>issued on res1 pointing to res2 work fine. Digs issued on res2
> >>>>pointing to res1 produce the ID mismatch errors again.
> >>>>
> >>>>I suspect that if I reboot it the error will clear up again, but
> >>>>before I do that I want to try and work out what is going on.
> >>>>
> >>>>Any advice?
> >>>
> >>>You might possibly use a packet sniffer to see what you send and what
> >>>the other side received, and similarly for the reply. On Linux you can
> >>>use tcpdump or ethereal, for example. I once faced these messages when
> >>>I was using query-source port 53 on my recursive nameserver and had
> >>>patched dig to use port 53 as a source port as well. Then I got a lot
> >>>of these every time I issued such a command from the recursive server
> >>>prompt, but it was understandable, since regular replies coming to my
> >>>nameserver confused dig.
> >>>
> >>>
> >
> > ----- End forwarded message -----
>