I'm hoping some of you TCP experts out there can help me figure out
what seems to me to be strange behavior with TCP connections between
two processes running on the same computer. The basic problem is
that I'm seeing ETIMEDOUT errors on calls to write() between two
sockets that were moments ago communicating just fine. Based on my
limited understanding of the TCP protocol, I never expected to see a
connection timeout between two processes running on the same box.

The basic setup is a dual-CPU system running a fairly large number of
processes (~60) that do most of their communication using TCP
connections. Over a twelve hour period, I might see about a dozen of
these ETIMEDOUT errors. The timeouts tend to occur in processes that
have the highest data rates, but even those are relatively low
(~1 mb/sec). I've attached below a snippet from tcpdump that exhibits the
problem. The pattern (the last 6 lines of the attached tcpdump
output) is basically:

server sends packet N
client ACKs packet N
server sends packet N+1
roughly 250 msec later server resends packet N+1
client ACKs packet N again (not N+1)
server resets the connection

Every instance of this problem I've managed to capture with tcpdump
exhibits this exact same behavior.

I see this behavior on both 2.6.16 and 2.6.18 kernels. I'll be trying 2.6.22
next.

So my questions are:

- Under what conditions would you expect to see ETIMEDOUT on a local
TCP connection?
- Are there any kernel parameters I can tweak with sysctl that might
alleviate the problem?
- Can you think of anything I could be doing wrong at the application
level that would cause these timeouts?
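
In case it helps anyone compare settings against their own box, here is a minimal Python sketch that dumps a few TCP sysctls from /proc. The choice of knobs (tcp_retries1, tcp_retries2, tcp_syn_retries, tcp_keepalive_time) is only my guess at what might be relevant, not something I've confirmed, and the helper name read_tcp_knobs is made up for illustration.

```python
from pathlib import Path

# Assumption: these are the knobs worth comparing. The list is a guess,
# not a confirmed set of parameters that affect this problem.
CANDIDATE_KNOBS = (
    "tcp_retries1",
    "tcp_retries2",
    "tcp_syn_retries",
    "tcp_keepalive_time",
)


def read_tcp_knobs(base="/proc/sys/net/ipv4", names=CANDIDATE_KNOBS):
    """Return {name: value as a string, or None if the file can't be read}."""
    values = {}
    for name in names:
        try:
            values[name] = (Path(base) / name).read_text().strip()
        except OSError:
            # Missing or unreadable on this kernel: record that, don't fail.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_tcp_knobs().items():
        print(f"net.ipv4.{name} = {value}")
```

The same values can be read with sysctl on the command line; the script is just convenient for collecting them from several machines in one format.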

Below is a slightly edited fragment from tcpdump. I shortened a few
fields (removed a common prefix from the timestamps and sequence
counters) because I have no idea what the Google Groups web interface
will do re: wrapping long lines.

:48.156263 49597 > 11007: . ack 21464 win 5 4819>
:48.156655 11007 > 49597: P 21464:21624(160) ack 1 win 64

:48.197066 49597 > 11007: . ack 21624 win 4 451174820>
:48.197083 11007 > 49597: P 21624:23248(1624) ack 1 win 64

:48.237049 49597 > 11007: . ack 23248 win 1 4860>

:48.674231 11007 > 49597: P 23248:23272(24) ack 1 win 64

:48.674250 49597 > 11007: . ack 23272 win 1 5337>
:48.674460 11007 > 49597: P 23272:23432(160) ack 1 win 64

:48.899935 11007 > 49597: P 23272:23432(160) ack 1 win 64

:50.210641 49597 > 11007: . ack 23272 win 41 5563>
:50.210670 11007 > 49597: R 1517483895:1517483895(0) win 0
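
To make the timing in the last six lines easier to see, here is a small Python sketch that computes the gap between consecutive packets. It assumes the shortened timestamps are seconds within the same minute (true for these six lines), and it drops the trailing option fragments since they aren't needed for the arithmetic. The names parse and gaps_ms are just for illustration.

```python
import re

# The last six lines of the trace above, with the trailing option
# fragments dropped. Timestamps are seconds within a single minute.
TRACE = """\
:48.674231 11007 > 49597: P 23248:23272(24) ack 1 win 64
:48.674250 49597 > 11007: . ack 23272 win 1
:48.674460 11007 > 49597: P 23272:23432(160) ack 1 win 64
:48.899935 11007 > 49597: P 23272:23432(160) ack 1 win 64
:50.210641 49597 > 11007: . ack 23272 win 41
:50.210670 11007 > 49597: R 1517483895:1517483895(0) win 0
"""

LINE_RE = re.compile(r"^:(\d+\.\d+) (\d+) > (\d+): (\S+)")


def parse(trace):
    """Return a list of (seconds, source_port, flags), one per matching line."""
    events = []
    for line in trace.splitlines():
        m = LINE_RE.match(line)
        if m:
            events.append((float(m.group(1)), int(m.group(2)), m.group(4)))
    return events


def gaps_ms(events):
    """Milliseconds between consecutive events, rounded to 0.001 ms."""
    return [round((b[0] - a[0]) * 1000, 3) for a, b in zip(events, events[1:])]


if __name__ == "__main__":
    events = parse(TRACE)
    for (_, src, flags), gap in zip(events[1:], gaps_ms(events)):
        print(f"+{gap:10.3f} ms  from port {src}  flags {flags}")
```

For this fragment the retransmission comes about 225 msec after the original send, the client's repeated ACK arrives about 1.3 sec after that, and the reset follows within a few tens of microseconds.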

Any suggestions are greatly appreciated,
Thanks,
John Filo
filo@arlut.utexas.edu