max tcp sockets for bind 9.4.2-P1
Hello all,
Like many of you, I recently upgraded all of our caching nameservers.
Since we were already running BIND 9.4.2, I chose to upgrade to 9.4.2-P1.
After the upgrade, I started receiving complaints that DNS queries which were
truncated and retried over TCP were failing.
It appears that BIND is limiting the number of open TCP connections to about
100 per IP address it listens on. For example, on one of our caching
nameservers:
cachens-4:~# netstat -an | grep tcp | grep 72.3.128.240 | wc -l
99
cachens-4:~# netstat -an | grep tcp | grep 72.3.128.241 | wc -l
105
From an rndc status:
tcp clients: 0/1000
Almost all (~99%) of the TCP connections in the above netstat output are in
the SYN_RECV state. My guess would be customer servers that have bad firewall
rules, but in any case, that's not really relevant to this particular
problem because nothing has changed except for the upgrade from 9.4.2 to
9.4.2-P1. I didn't change named.conf or anything else, and as you can see,
tcp-clients is set to 1000.
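To see how the connections break down by state per listening address, rather than just counting lines with grep and wc, something like the following could be used. It is only a sketch: the sample text and its addresses are made-up placeholders, and the column layout is assumed to match Linux `netstat -an` output.

```python
from collections import Counter

# Hypothetical sample in Linux `netstat -an` layout; addresses are placeholders.
SAMPLE = """\
tcp        0      0 192.0.2.10:53      0.0.0.0:*          LISTEN
tcp        0      0 192.0.2.10:53      198.51.100.7:40112 SYN_RECV
tcp        0      0 192.0.2.10:53      198.51.100.8:51514 SYN_RECV
tcp        0      0 192.0.2.10:53      198.51.100.9:33900 ESTABLISHED
tcp        0      0 192.0.2.11:53      198.51.100.7:40113 SYN_RECV
udp        0      0 192.0.2.10:53      0.0.0.0:*
"""

def tally_tcp_states(netstat_text):
    """Count TCP sockets per (local address, state) from netstat -an output."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # TCP rows have six columns: proto, recv-q, send-q, local, foreign, state.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, state = fields[3], fields[5]
        counts[(local, state)] += 1
    return counts

counts = tally_tcp_states(SAMPLE)
for (local, state), n in sorted(counts.items()):
    print(f"{local:20} {state:12} {n}")
```

On a live server the output of `netstat -an` would be fed in instead of SAMPLE.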
Did something change in the source code that would cause this? I'm
thinking of a listen() call with the backlog set to 100 that wasn't set up
that way previously. Interestingly, the ARM specifies the default for
tcp-clients as 100, but maybe that is a coincidence.
FWIW, SOMAXCONN is set to 128 on my servers. Prior to this patch, I was
using a Debian packaged version of 9.4.2, so maybe they had it set higher?
I looked all through the source and changes made by Debian to 9.4.2 and
couldn't find anything to indicate this is the case.
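For reference, here is a rough sketch of how the backlog passed to listen() interacts with SOMAXCONN on Linux. The only thing it relies on is that the kernel silently caps the requested backlog at net.core.somaxconn. The helper names are made up for illustration, and which backlog value named actually passes is not known from this thread. Half-open (SYN_RECV) connections are tracked separately, and how that limit is derived varies by kernel version, so this may not explain the numbers above.

```python
import socket

def read_int(path, default=None):
    """Read an integer from a /proc/sys file; return default if unavailable."""
    try:
        with open(path) as f:
            return int(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return default

def effective_backlog(requested, somaxconn):
    """Linux silently caps the listen() backlog at net.core.somaxconn."""
    return min(requested, somaxconn)

# Fall back to the compile-time constant when /proc is not available.
somaxconn = read_int("/proc/sys/net/core/somaxconn", default=socket.SOMAXCONN)
syn_backlog = read_int("/proc/sys/net/ipv4/tcp_max_syn_backlog")

# The requested values are arbitrary examples, not values taken from BIND.
for requested in (3, 100, 1000):
    limit = effective_backlog(requested, somaxconn)
    print(f"listen({requested}) -> accept queue limit {limit}")
print("net.core.somaxconn =", somaxconn)
print("net.ipv4.tcp_max_syn_backlog =", syn_backlog)
```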
I'm open to suggestions! This is a Debian Etch box running kernel 2.6.18 on
an x86_64 architecture. Thanks.
-- Jason
Confidentiality Notice: This e-mail message (including any attached or
embedded documents) is intended for the exclusive and confidential use of the
individual or entity to which this message is addressed, and unless otherwise
expressly indicated, is confidential and privileged information of Rackspace.
Any dissemination, distribution or copying of the enclosed material is prohibited.
If you receive this transmission in error, please notify us immediately by e-mail
at abuse@rackspace.com, and delete the original message.
Your cooperation is appreciated.
Re: max tcp sockets for bind 9.4.2-P1
On Jul 17, 6:09 am, "Jason Bratton" <jbrat...@rackspace.com> wrote:
> It appears that BIND is limiting the number of open TCP connections to about
> 100 per IP address it listens on.
> [...]
I am experiencing a similar issue with a vendor-supplied BIND that includes
the 9.4.2-P1 fixes:
QDDNS 4.1 Build 6 - Lucent DNS Server (BIND 9.4.1-P1), Copyright (c)
2008 Alcatel-Lucent
+ Includes security fixes from BIND 9.4.2-P1
It all started with a complaint that a query was failing on one of our
15 internal DNS servers. All 15 servers were recently deployed and
were identical in configuration. When I looked into the issue, I
noticed that the query generated a response which was truncated and
then reattempted using TCP. I then tested queries against the
problematic server using "dig +tcp" and discovered that all DNS
queries using TCP were failing on this server. netstat showed lots of
connections in SYN_RECV. Since the same symptoms were encountered
before when our firewall team misconfigured rules, I then checked to
see if this was the cause. I got on the problematic server and issued
queries to itself using TCP. In doing so, I noticed something very
strange. A "dig +tcp somehost.domain.com @127.0.0.1" would succeed
with no issues while a "dig +tcp somehost.domain.com
@ip.of.the.server" would result in:
; <<>> DiG 9.4.1-P1 <<>> +tcp xxxx.xxxx.xxxx @xxx.xxx.xxx.xxx
; (1 server found)
;; global options: printcmd
;; connection timed out; no servers could be reached
I am still waiting for the vendor to accept that this is not a firewall
issue, since I can reproduce it by querying the server from itself.
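For anyone who wants to repeat the loopback versus server-address comparison without dig, a minimal sketch of a DNS query over TCP using only the Python standard library might look like this. The function names are my own, the transaction ID is arbitrary, and the response is returned raw rather than parsed. The one protocol detail it depends on is that DNS over TCP prefixes each message with a two-byte length.

```python
import socket
import struct

def build_query(name, qtype=1, txid=0x1234):
    """Build a minimal DNS query (class IN, recursion desired)."""
    # Header: id, flags (RD set), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)

def frame_for_tcp(message):
    """Prefix the message with its 2-byte length, as DNS over TCP requires."""
    return struct.pack("!H", len(message)) + message

def recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes first."""
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("connection closed early")
        data += chunk
    return data

def tcp_query(server, name, timeout=5.0, port=53):
    """Send one query over TCP; return the raw response or raise OSError."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(frame_for_tcp(build_query(name)))
        (length,) = struct.unpack("!H", recv_exact(sock, 2))
        return recv_exact(sock, length)
```

Calling tcp_query("127.0.0.1", "somehost.domain.com") and then the same call with the server's own address in place of 127.0.0.1 would mirror the two dig commands above. A timeout on the second call only would match the behaviour I am seeing.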