Re: TCP interactive data flow
On Jul 15, 1:18 am, Rick Jones <rick.jon...@hp.com> wrote:
> Sru...@gmail.com wrote:
> > Aha, so Nagle's algorithm only kicks in when the packet to be sent is
> > very small
>
> For some definition of very small, usually < the TCP MSS (Maximum
> Segment Size). In broad handwaving terms, Nagle (should) work like
> this:
>
I'm not sure I understand what you were trying to say here.
> 1) Is this send() by the user, plus any queued, untransmitted data, >=
> MSS? If yes, then transmit the data now, modulo constraints like
> congestion or receiver window. If no, go to question 2.
>
> 2) Is the connection otherwise "idle?" That is, is there no
> transmitted but not yet ACKnowledged data outstanding on the
> connection? If yes, transmit the data now, modulo constraints like
> congestion or receiver window. If no, go to 3.
>
> 3) Queue the data until:
> a) The application provides enough data to get >= the MSS
> b) The remote ACKs the currently unACKed data
> c) The retransmission timer for currently unACKed data (if any)
> expires and there is room for (some of) the queued data in the
> segment to be retransmitted.
>
So basically the size limit is MSS, where anything smaller is buffered
until there are no unacknowledged data? But why is MSS the limit? MSS
can be greater than 1000 bytes, which in my opinion is not a tinygram.
So why handle packets with 1 byte of data the same as packets of say
400 bytes of data?
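Rick's three steps can be put into a small decision function to make the question concrete. This is a simplified sketch, not any stack's actual code: the function and parameter names are hypothetical, congestion and receiver windows are ignored, and the retransmission-timer case (3c) is not modelled.

```python
def nagle_decision(send_len: int, queued_len: int, unacked_len: int, mss: int) -> str:
    """Return "send" or "queue" for a new send() under simplified Nagle rules.

    Hypothetical model: congestion/receiver windows and the
    retransmission-timer case (3c) are deliberately left out.
    """
    pending = send_len + queued_len
    # Step 1: at least a full segment's worth is available, so transmit now.
    if pending >= mss:
        return "send"
    # Step 2: nothing transmitted-but-unACKed, i.e. the connection is idle.
    if unacked_len == 0:
        return "send"
    # Step 3: hold the sub-MSS data until more arrives (3a) or the ACK returns (3b).
    return "queue"


if __name__ == "__main__":
    mss = 1460  # assumed Ethernet-sized MSS
    print(nagle_decision(1, 0, 0, mss))         # idle connection
    print(nagle_decision(1, 0, 100, mss))       # 1 byte, data in flight
    print(nagle_decision(400, 0, 100, mss))     # 400 bytes, data in flight
    print(nagle_decision(400, 1100, 100, mss))  # queued + new reaches the MSS
```

With data already in flight, the 1-byte and the 400-byte writes get the same answer ("queue"); only the MSS threshold and the idle test decide.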
Re: TCP interactive data flow
On Jul 14, 6:11 pm, Sru...@gmail.com wrote:
> So basically the size limit is MSS, where anything smaller is buffered
> until there are no unacknowledged data? But why is MSS the limit? MSS
> can be greater than 1000 bytes, which in my opinion is not a tinygram.
> So why handle packets with 1 byte of data the same as packets of say
> 400 bytes of data?
Because in either case waiting is more efficient than sending. If you
have at least one MSS, you are going to send the same packet no matter
what.
Also, if you send at even one byte less than the MSS, you can get
repeatable degenerate behavior. For example, suppose the MSS is 768
bytes. Suppose an application has a huge amount of data to send, but
chooses to send it in 3,838 byte chunks (it has to use some chunk
size, right?). You can send four 768-byte segments immediately, and you
have 766 bytes left over. The application is about to call 'send'
again. Which is better: to wait a split second and send a full segment,
or to repeatedly and inefficiently send unfull segments with no
possible application workaround (since the app doesn't know the
MSS)?
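The 3,838-byte example can be checked with a little arithmetic. The sketch below compares two idealised stacks: one that sends each call's sub-MSS remainder at once, and one that holds the remainder and merges it with the next call, as Nagle does while unACKed data is outstanding. The function names are hypothetical, and windows and ACK timing are not modelled.

```python
def segments_sending_remainders(chunk: int, mss: int, n_sends: int) -> list:
    """Each send() is cut into MSS pieces and its leftover goes out immediately."""
    sizes = []
    for _ in range(n_sends):
        full, rest = divmod(chunk, mss)
        sizes.extend([mss] * full)
        if rest:
            sizes.append(rest)
    return sizes


def segments_coalescing(chunk: int, mss: int, n_sends: int) -> list:
    """Leftovers are held and merged with the next send(); only the very last
    partial segment is flushed (assumes data stays in flight throughout)."""
    full, rest = divmod(chunk * n_sends, mss)
    return [mss] * full + ([rest] if rest else [])


if __name__ == "__main__":
    chunk, mss, n = 3838, 768, 10
    for name, fn in [("send remainders", segments_sending_remainders),
                     ("coalesce", segments_coalescing)]:
        sizes = fn(chunk, mss, n)
        partial = sum(1 for s in sizes if s < mss)
        print(f"{name}: {len(sizes)} segments, {partial} less than full")
```

With this particular chunk size the total segment count barely changes, because each leftover is only two bytes short of an MSS. What changes is that, without coalescing, one segment in five goes out less than full on every send, and the application cannot avoid that without knowing the MSS.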
DS
Re: TCP interactive data flow
On Jul 14, 8:11 pm, Sru...@gmail.com wrote:
> So basically the size limit is MSS, where anything smaller is buffered
> until there are no unacknowledged data? But why is MSS the limit? MSS
> can be greater than 1000 bytes, which in my opinion is not a tinygram.
> So why handle packets with 1 byte of data the same as packets of say
> 400 bytes of data?
You keep overanalyzing this. With Nagle, data is buffered if there's
less than a packet's worth (MSS) to send and there is already sent
data waiting for an acknowledgement from the other end. The idea is
to transmit as few packets as possible by making them as large as
possible (which results in the most efficient utilization of the
network), while still keeping interactive traffic prompt. Thus the
maximum buffering time is about one round trip plus the other end's
delayed-ACK interval (RFC 1122 limits that to half a second; it is
200ms for most stacks).
But the question is really the opposite of yours - why *not* a full
packet (MSS)? It's obvious why you'd not want Nagle to buffer more
than a packet's (MSS) worth of data: a full packet is actually the
goal of Nagle, so once you have one there's nothing left to accomplish
but to transmit the thing. But what would you gain from capping the
buffering at some lower limit? Remembering that it's for a rather
limited time interval. And just how many applications would actually
be better off because only (say) 200 bytes got buffered, rather than
1500?
Re: TCP interactive data flow
Srubys@gmail.com wrote:
> So basically the size limit is MSS, where anything smaller is
> buffered until there are no unacknowledged data? But why is MSS the
> limit? MSS can be greater than 1000 bytes, which in my opinion is
> not a tinygram. So why handle packets with 1 byte of data the same
> as packets of say 400 bytes of data?
If you go back in time and read the original Nagle paper/RFC, MSSes
were "typically" in the 536-byte range.
The MSS is the "best" TCP can do for the ratio of data to data+headers.
I'm not sure about your question wrt 1 byte vs 400 bytes. Are you
asking why the Nagle limit isn't based on a constant rather than the
MSS? In some stacks, IIRC, one can configure the value against which
the user's send is compared. It generally defaults to the MSS for the
connection. And yes, as MTUs and thus MSSes increase in size, that
does start to look a little, well, odd... :)
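To put numbers on "the ratio of data to data+headers", here is a minimal sketch. It assumes 20-byte IPv4 and 20-byte TCP headers with no options and ignores link-layer framing; 536 is the classic default MSS and 1460 is typical on Ethernet.

```python
HEADER_BYTES = 20 + 20  # IPv4 + TCP headers, no options (assumption)


def payload_efficiency(payload: int, header_bytes: int = HEADER_BYTES) -> float:
    """Fraction of the IP datagram that is application data."""
    return payload / (payload + header_bytes)


if __name__ == "__main__":
    for payload in (1, 200, 400, 536, 1460):
        print(f"{payload:5d} bytes: {payload_efficiency(payload):.1%}")
```

By header overhead alone a 1-byte segment is about 2.4% data, a 400-byte segment about 91%, and a full 1460-byte segment about 97%.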
rick jones
--
firebug n, the idiot who tosses a lit cigarette out his car window
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
Re: TCP interactive data flow
robertwessel2@yahoo.com <robertwessel2@yahoo.com> wrote:
> But the question is really the opposite of yours - why *not* a full
> packet (MSS)?
Interestingly enough, TCP stacks trying to make use of TSO in the NIC
(Transport/TCP Segmentation Offload) have just that issue - when/if to
wait until there is even more than one MSS-worth of data before
shipping data down the stack.
rick jones
--
The computing industry isn't as much a game of "Follow The Leader" as
it is one of "Ring Around the Rosy" or perhaps "Duck Duck Goose."
- Rick Jones
Re: TCP interactive data flow
> But what would you gain from capping the buffering at
> some lower limit? Remembering that it's for a rather limited time
> interval. And just how many applications would actually be better off
> because only (say) 200 bytes got buffered, rather than 1500?
not much I presume, but still, communication between two apps where at
least one of them buffered 200 bytes instead of MSS-1 would be just a
wee faster, provided this app ( one that buffers only up to 200
bytes ) would send lots of data of size greater than 200 but smaller
than MSS.
thank you all for your help
kind regards
Re: TCP interactive data flow
Srubys@gmail.com wrote:
> > But what would you gain from capping the buffering at some lower
> > limit? Remembering that it's for a rather limited time interval.
> > And just how many applications would actually be better off
> > because only (say) 200 bytes got buffered, rather than 1500?
> not much I presume, but still, communication between two apps where at
> least one of them buffered 200 bytes instead of MSS-1 would be just a
> wee faster, provided this app ( one that buffers only up to 200
> bytes ) would send lots of data of size greater than 200 but smaller
> than MSS.
Not necessarily. Here is an example of something sending 200 bytes
at a time, first with Nagle left enabled and then with the same
200-byte sends and Nagle disabled (the nodelay case):
manny:~# netperf -H moe -c -C -- -m 200
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to moe.west (10.208.0.15) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
87380  16384   200      10.02    941.38      19.62    12.84    6.828   4.469
manny:~# netperf -H moe -c -C -- -m 200 -D
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to moe.west (10.208.0.15) port 0 AF_INET : nodelay
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
87380  16384   200      10.00    318.76      24.99    22.68    25.687  23.311
manny:~#
Notice that in the nodelay (Nagle off) case there is a significantly
higher demand placed on the CPU of the system, while throughput drops
from 941 to 319 Mbit/s. This is a four-core system, which is why the
CPU utilization caps at 25%: a single TCP connection will not
(generally) make use of more than one core. Per KB transferred, the
CPU consumed rises by roughly 3.8x on the sending side and 5.2x on the
receiving side.
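The service demand columns can be roughly reproduced from the utilization and throughput columns. This is a sketch under stated assumptions: service demand is taken as CPU time (utilization times four cores on each system) divided by data moved, with KB = 1024 bytes and throughput in 10^6 bits/s. Under those assumptions the recomputed values land within about 0.01 us/KB of the reported ones.

```python
def service_demand_us_per_kb(util_pct: float, n_cpus: int, mbit_per_s: float) -> float:
    """CPU microseconds spent per KB (1024 bytes) transferred (assumed formula)."""
    cpu_sec_per_sec = util_pct / 100.0 * n_cpus
    kb_per_sec = mbit_per_s * 1e6 / 8 / 1024
    return cpu_sec_per_sec * 1e6 / kb_per_sec


if __name__ == "__main__":
    cores = 4  # assumed for both sender and receiver
    runs = {
        # name: (throughput in 10^6 bits/s, send util %, recv util %)
        "nagle": (941.38, 19.62, 12.84),
        "nodelay": (318.76, 24.99, 22.68),
    }
    sd = {}
    for name, (tput, send_util, recv_util) in runs.items():
        sd[name] = (service_demand_us_per_kb(send_util, cores, tput),
                    service_demand_us_per_kb(recv_util, cores, tput))
        print(f"{name}: send {sd[name][0]:.3f} us/KB, recv {sd[name][1]:.3f} us/KB")
    print(f"nodelay/nagle: send x{sd['nodelay'][0] / sd['nagle'][0]:.1f}, "
          f"recv x{sd['nodelay'][1] / sd['nagle'][1]:.1f}")
```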
rick jones
--
portable adj, code that compiles under more than one compiler
Re: TCP interactive data flow
Rick Jones <rick.jones2@hp.com> wrote:
> Not necessarily. Here we have an example of something sending 200
> bytes at a time, leaving Nagle enabled, and then that same 200 byte
> send, with Nagle disabled (the nodelay case)
Those numbers were for a unidirectional test over a GbE LAN. If the
test is request/response, then we start having "races" between
standalone ACK timers, RTTs, and how many requests or responses the
application puts into the connection at one time. What follows uses
netperf built with ./configure --enable-burst, running a TCP_RR test
with a 200-byte request/response size. Again, the first run is with
defaults and the second with Nagle disabled. I've stripped the socket
buffer, request/response size and time columns to better fit in 80
columns:
manny:~# for i in 0 1 2 3 4 5 6 7 8 9 10; do netperf $HDR -t TCP_RR -H moe -c -C -B "burst $i" -- -r 200 -b $i; HDR="-P 0"; done
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to moe.west (10.208.0.15) port 0 AF_INET : first burst 0
   Trans.     CPU     CPU   S.dem   S.dem
     Rate   local  remote   local  remote
  per sec     % S     % S   us/Tr   us/Tr
  8657.38    3.61    3.47  16.681  16.044
  9247.23    3.21    3.57  13.882  15.463  burst 1
 10324.20    4.69    4.27  18.152  16.550  burst 2
 11371.37    4.08    4.31  14.340  15.150  burst 3
 13726.78    2.51    3.03   7.305   8.823  burst 4
 16007.27    4.82    8.12  12.052  20.283  burst 5
 18231.57    3.30    3.43   7.230   7.529  burst 6
 20235.90    2.98    3.01   5.893   5.950  burst 7
 22214.26    3.99    3.24   7.184   5.837  burst 8
 24002.79    4.00    2.99   6.663   4.984  burst 9
 25778.28    4.46    3.58   6.918   5.562  burst 10
....
 67198.41    7.11    6.29   4.229   3.745  burst 20
 98375.44    9.76    9.00   3.967   3.659  burst 30
132360.98   11.86   12.00   3.583   3.627  burst 40
173646.81   15.43   14.87   3.554   3.424  burst 50
204709.83   18.38   17.15   3.591   3.351  burst 60
235860.77   20.81   19.94   3.529   3.382  burst 70
manny:~# HDR="-P 1";for i in 0 1 2 3 4 5 6 7 8 9 10; do netperf $HDR -t TCP_RR -H moe -c -C -B "burst $i" -- -r 200 -b $i -D; HDR="-P 0"; done
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to moe.west (10.208.0.15) port 0 AF_INET : nodelay : first burst 0
   Trans.     CPU     CPU   S.dem   S.dem
     Rate   local  remote   local  remote
  per sec     % S     % S   us/Tr   us/Tr
  8523.55    3.78    3.52  17.720  16.509
 17714.38    5.16    4.94  11.652  11.161  burst 1
 18660.94    5.92    5.59  12.697  11.978  burst 2
 27373.66    8.78    8.53  12.828  12.462  burst 3
 34303.27   10.22   10.67  11.914  12.436  burst 4
 41652.40   11.34   10.39  10.891   9.973  burst 5
 42222.80   12.43   12.81  11.778  12.135  burst 6
 45601.75   13.03   12.76  11.430  11.196  burst 7
 48737.80   13.58   13.47  11.142  11.052  burst 8
 52505.19   14.43   14.25  10.994  10.858  burst 9
 56406.20   14.95   14.40  10.602  10.209  burst 10
....
101401.90   24.74   24.35   9.761   9.605  burst 20
102946.48   24.99   24.75   9.711   9.619  burst 30
104170.04   24.99   24.72   9.595   9.493  burst 40
I stopped at 40 in the Nagle-disabled case because it was pretty clear
things had maxed out - again, one of the four cores was saturated.
So, for smaller numbers of transactions outstanding at one time, the
transaction rate for the RR test is higher with Nagle disabled, but as
you increase the number of concurrent transactions, leaving Nagle
enabled permits a higher transaction rate because it allows several
transactions to be carried in a single TCP segment. As before, this is
reflected in the lower service demand figures for the Nagle-enabled
case.
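A short script makes the crossover in the two tables explicit. It simply re-enters the transaction rates reported above for the burst levels measured in both runs (bursts 50 to 70 exist only for the Nagle-enabled case) and reports which setting came out ahead at each level; the helper names are illustrative.

```python
from typing import Optional

# Transactions per second from the two TCP_RR runs above, keyed by burst size.
NAGLE = {0: 8657.38, 1: 9247.23, 2: 10324.20, 3: 11371.37, 4: 13726.78,
         5: 16007.27, 6: 18231.57, 7: 20235.90, 8: 22214.26, 9: 24002.79,
         10: 25778.28, 20: 67198.41, 30: 98375.44, 40: 132360.98}
NODELAY = {0: 8523.55, 1: 17714.38, 2: 18660.94, 3: 27373.66, 4: 34303.27,
           5: 41652.40, 6: 42222.80, 7: 45601.75, 8: 48737.80, 9: 52505.19,
           10: 56406.20, 20: 101401.90, 30: 102946.48, 40: 104170.04}


def common_bursts() -> list:
    """Burst levels measured in both runs."""
    return sorted(set(NAGLE) & set(NODELAY))


def first_burst_where_nagle_wins(min_burst: int = 1) -> Optional[int]:
    """Smallest common burst level >= min_burst where Nagle-on had the higher rate."""
    for burst in common_bursts():
        if burst >= min_burst and NAGLE[burst] > NODELAY[burst]:
            return burst
    return None


if __name__ == "__main__":
    for burst in common_bursts():
        print(f"burst {burst:2d}: nagle/nodelay = {NAGLE[burst] / NODELAY[burst]:.2f}")
    print("Nagle first ahead at burst", first_burst_where_nagle_wins())
```

At burst 0 the two settings are within about 2% of each other; from burst 1 through 30 nodelay leads, and by burst 40 Nagle is ahead, in line with its much lower service demand there (3.583 vs 9.595 us/Tr locally).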
rick jones
Re: TCP interactive data flow
On Jul 15, 4:16 pm, Sru...@gmail.com wrote:
> not much I presume, but still, communication between two apps where at
> least one of them buffered 200 bytes instead of MSS-1 would be just a
> wee faster, provided this app ( one that buffers only up to 200
> bytes ) would send lots of data of size greater than 200 but smaller
> than MSS.
No. Because such an application would be sending 200 bytes when no
unacknowledged data is pending and so wouldn't trigger Nagle. The data
replies from the other side would have ACKs piggy-backed on them, so
Nagle would never delay any transmissions.
Most people who criticize Nagle don't understand it.
DS
Re: TCP interactive data flow
On Jul 16, 5:01 am, David Schwartz <dav...@webmaster.com> wrote:
> On Jul 15, 4:16 pm, Sru...@gmail.com wrote:
>
> > not much I presume, but still, communication between two apps where at
> > least one of them buffered 200 bytes instead of MSS-1 would be just a
> > wee faster, provided this app ( one that buffers only up to 200
> > bytes ) would send lots of data of size greater than 200 but smaller
> > than MSS.
>
> No. Because such an application would be sending 200 bytes when no
> unacknowledged data is pending and so wouldn't trigger Nagle. The data
> replies from the other side would have ACKs piggy-backed on them, so
> Nagle would never delay any transmissions.
>
> Most people who criticize Nagle don't understand it.
I was going to start a new thread on this, but after searching for
information on nagling and delayed acks, I found this thread and
decided to hang my questions on it.
I work with an application architecture where timely notification of
data is far more important than maximum throughput (or at least, we do
not have any issues with maximum throughput).
The apps in question generally establish a connection with third-party
servers whose software we have no control over, of course. However,
it's not helpful to think in terms of client/server because
communication flows both ways. IOW, either side can initiate a flow of
information at any time. That's irrelevant anyway. Transmission is
usually in small chunks of data that are likely to be sporadic and
almost always below the MSS. A typical flow (real data, not tcp/ip
acks) may look something like:
OurApp -> submit request 112 bytes -> 3rdParty
3rdParty -> acknowledge request 89 bytes -> OurApp
3rdParty -> dataX 140 bytes -> OurApp
The problem came in because 3rdParty was implementing nagling, and
OurApp was delaying tcp/ip acks, so we were getting delays between
"acknowledge request" and "dataX" of approximately 200ms. I have
Wireshark captures if anyone really cares, but basically with tcp/ip
(simplified) included it looked something like this.
OurApp -> PSH "submit request" 112 bytes -> 3rdParty
3rdParty -> immediate ACK
3rdParty -> 40ms later PSH "acknowledge request" 89 bytes -> OurApp
(at this point 3rdParty has dataX available virtually instantly after
the above is sent, and I assume it tries to send, however due to
nagling, it is waiting for the ACK to the previous PSH).
OurApp -> delays ack: 200ms later ACK
3rdParty -> immediate PSH "dataX" 140 bytes -> OurApp
From our point of view, that delay of 200ms to receive "dataX" was
unacceptable.
I apologise in advance for the great simplification of what is
happening, or if I misunderstood the situation, but with a recent
upgrade to the 3rdParty server software, nagling has been disabled and
we see immediate resolution of this problem with so far no negative
consequences.
We are now seeing similar latency from a different 3rd party, and are
about to start investigations. In the previous example, disabling
delayed acks on the boxes OurApp ran on pretty much resolved the
problem. We saw a slight round trip latency because 3rdParty was still
waiting for the tcp/ip ACK, but it was acceptable. The better solution
was disabling nagling on their side since "dataX" is sent immediately,
incurring no round trip.
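The three situations described above can be compared with a back-of-the-envelope model. The sketch is hypothetical: it assumes dataX is ready the moment "acknowledge request" is sent, that nothing else is in flight, and that the delayed-ACK timer fires a fixed delay after the segment arrives (real stacks often run a periodic timer, so the actual wait can be anything up to the maximum).

```python
def datax_arrival_ms(rtt_ms: float, delack_ms: float,
                     nagle_on: bool, delayed_ack_on: bool) -> float:
    """Time from 3rdParty sending "acknowledge request" until dataX reaches OurApp."""
    one_way = rtt_ms / 2
    if not nagle_on:
        # dataX goes out right behind the previous segment.
        return one_way
    # Nagle holds dataX until the ACK for "acknowledge request" returns:
    # the segment travels out, OurApp (possibly) delays its ACK, the ACK
    # travels back, and only then does dataX travel out.
    ack_delay = delack_ms if delayed_ack_on else 0.0
    return one_way + ack_delay + one_way + one_way


if __name__ == "__main__":
    rtt, delack = 1.0, 200.0  # assumed LAN RTT and a 200 ms delayed-ACK timer
    print(datax_arrival_ms(rtt, delack, nagle_on=True, delayed_ack_on=True))
    print(datax_arrival_ms(rtt, delack, nagle_on=True, delayed_ack_on=False))
    print(datax_arrival_ms(rtt, delack, nagle_on=False, delayed_ack_on=True))
```

In this model Nagle on the sending side costs one extra round trip plus whatever time the receiver holds its ACK, which is the roughly 200ms seen originally, and only the "slight round trip latency" once delayed ACKs are off.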
So if the case with the latest problem turns out to be the same, and
assuming we can't rely on 3rd party2 to disable nagling, what negative
effects will disabling delayed acks on our boxes have?
If it were just my application on the box I'd have no concerns, but
these are production machines where we share server real estate with
any number of other applications. I wouldn't want to degrade their
performance in some way, say by consuming more CPU.
Any corrections to my (mis)understanding are welcome.
Re: TCP interactive data flow
Mark (newsgroups) wrote:
> Any corrections to my (mis)understanding are welcome.
Gah. I forgot to add: the connection and communication between 3rdParty
and OurApp is done via an API provided by 3rdParty (lib/dll). We have no
control over disabling delayed ACKs at the socket level. It would have
to be done at the machine level, unless someone knows a way around this.
Re: TCP interactive data flow
On Jul 16, 11:55 am, "Mark (newsgroups)" <marknewsgro...@yahoo.com>
wrote:
> So if the case with the latest problem turns out to be the same, and
> assuming we can't rely on 3rd party2 to disable nagling, what negative
> effects will disabling delayed acks on our boxes have?
>
> If it were just my application on the box I'd have no concerns, but
> they are production machines we share server real estate with any
> number of other applications. I wouldn't want to degrade their
> performance in some way, say consuming more CPU for example.
Turning off delayed acks will (usually slightly) increase the CPU load
on both ends of the conversation(s), and will result in more outbound
traffic (extra standalone ACKs) from your host. It will usually help
response time, at some cost in bandwidth.
Re: TCP interactive data flow
Here is my "boilerplate" Nagle discussion, performance discussion at
the end:
In broad terms, whenever an application does a send() call, the logic
of the Nagle algorithm is supposed to go something like this:
1) Is the quantity of data in this send, plus any queued, unsent data,
at least the MSS (Maximum Segment Size) for this connection? If
yes, send the data in the user's send now (modulo any other
constraints such as receiver's advertised window and the TCP
congestion window). If no, go to 2.
2) Is the connection to the remote otherwise idle? That is, is there
no unACKed data outstanding on the network? If yes, send the data in
the user's send now. If no, queue the data and wait. Either the
application will continue to call send() with enough data to get to a
full MSS-worth of data, or the remote will ACK all the currently sent,
unACKed data, or our retransmission timer will expire.
Now, where applications run into trouble is when they have what might
be described as "write, write, read" behaviour, where they present
logically associated data to the transport in separate 'send' calls
and those sends are typically less than the MSS for the connection.
It isn't so much that they run afoul of Nagle as they run into issues
with the interaction of Nagle and the other heuristics operating on
the remote. In particular, the delayed ACK heuristics.
When a receiving TCP is deciding whether or not to send an ACK back to
the sender, in broad handwaving terms it goes through logic similar to
this:
a) is there data being sent back to the sender? if yes, piggy-back the
ACK on the data segment.
b) is there a window update being sent back to the sender? if yes,
piggy-back the ACK on the window update.
c) has the standalone ACK timer expired? if yes, send a standalone ACK.
Window updates are generally triggered by the following heuristics:
i) would the window update be for a non-trivial fraction of the window
- typically somewhere at or above 1/4 the window, that is, has the
application "consumed" at least that much data? if yes, send a
window update. if no, check ii.
ii) would the window update be for at least 2*MSS worth of data, that
is, has the application "consumed" at least 2*MSS? if yes, send a
window update; if no, wait.
Now, going back to that write, write, read application, on the sending
side, the first write will be transmitted by TCP via logic rule 2 -
the connection is otherwise idle. However, the second small send will
be delayed as there is at that point unACKnowledged data outstanding
on the connection.
At the receiver, that small TCP segment will arrive and will be passed
to the application. The application does not have the entire app-level
message, so it will not send a reply (data to TCP) back. The typical
TCP window is much much larger than the MSS, so no window update would
be triggered by heuristic i. The data just arrived is < 2*MSS, so no
window update from heuristic ii. Since there is no window update, no
ACK is sent by heuristic b.
So, that leaves heuristic c - the standalone ACK timer. That ranges
anywhere between 50 and 200 milliseconds depending on the TCP stack in
use.
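The receiver-side heuristics a) to c), together with the window-update tests i) and ii), can be sketched the same way. Again this is a simplified, hypothetical model: the function name is illustrative, and the 1/4-window and 2*MSS thresholds are the typical values given above rather than any specific stack's.

```python
def receiver_acks_now(reply_data_ready: bool, consumed_bytes: int,
                      window: int, mss: int, ack_timer_expired: bool) -> bool:
    """Decide whether the receiving TCP sends an ACK now or keeps waiting."""
    # a) piggy-back the ACK on data going back to the sender.
    if reply_data_ready:
        return True
    # i) and ii): a window update is due if the application has consumed at
    # least a quarter of the window, or at least 2*MSS.
    window_update = 4 * consumed_bytes >= window or consumed_bytes >= 2 * mss
    # b) piggy-back the ACK on the window update.
    if window_update:
        return True
    # c) otherwise only the standalone ACK timer releases the ACK.
    return ack_timer_expired


if __name__ == "__main__":
    # Write, write, read: a hypothetical 100-byte header arrives, no reply yet.
    print(receiver_acks_now(False, 100, 65535, 1460, ack_timer_expired=False))
    print(receiver_acks_now(False, 100, 65535, 1460, ack_timer_expired=True))
```

For the small first write, neither a) nor b) fires, so the ACK, and with it the sender's Nagle-held second write, waits on the timer.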
If you've read this far :) now we can take a look at the effect of
various things touted as "fixes" to applications experiencing this
interaction. We take as our example a client-server application where
both the client and the server are implemented with a write of a small
application header, followed by application data. First, the
"default" case which is with Nagle enabled (TCP_NODELAY _NOT_ set) and
with standard ACK behaviour:
Client                        Server
Req Header               ->
                         <-   Standalone ACK after Nms
Req Data                 ->
                         <-   Possible standalone ACK
                         <-   Rsp Header
Standalone ACK           ->
                         <-   Rsp Data
Possible standalone ACK  ->
For two "messages" we end up with at least six segments on the wire.
The possible standalone ACKs will depend on whether the server's
response time or the client's think time is longer than the standalone
ACK interval on their respective sides. Now, if TCP_NODELAY is set we
see:
Client                        Server
Req Header               ->
Req Data                 ->
                         <-   Possible Standalone ACK after Nms
                         <-   Rsp Header
                         <-   Rsp Data
Possible Standalone ACK  ->
In theory, we are down to four segments on the wire, which seems good,
but frankly we can do better. First though, consider what happens
when someone disables delayed ACKs:
Client                        Server
Req Header               ->
                         <-   Immediate Standalone ACK
Req Data                 ->
                         <-   Immediate Standalone ACK
                         <-   Rsp Header
Immediate Standalone ACK ->
                         <-   Rsp Data
Immediate Standalone ACK ->
Now we definitely see eight segments on the wire. It will also be that
way if both TCP_NODELAY is set and delayed ACKs are disabled.
How about if the application did the "right" thing in the first place?
That is, sent the logically associated data at the same time:
Client                        Server
Request                  ->
                         <-   Possible Standalone ACK
                         <-   Response
Possible Standalone ACK  ->
We are down to two segments on the wire.
For "small" packets, the CPU cost per segment is about the same whether
it carries data or just an ACK. This means that the application making
the proper gathering send call will spend far fewer CPU cycles in the
networking stack.
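In code, "doing the right thing" simply means handing the header and the data to TCP in one call. A minimal Python sketch, assuming a hypothetical 4-byte length-prefixed message format; in C the same idea is usually a single send() of one buffer, or writev()/sendmsg() over separate buffers.

```python
import socket
import struct


def send_message(sock: socket.socket, payload: bytes) -> None:
    """Send a length header and its payload with a single sendall() call."""
    header = struct.pack("!I", len(payload))  # hypothetical 4-byte length prefix
    # One call, so TCP sees header and payload together (typically one segment
    # for small messages) instead of a small header write, then a data write.
    sock.sendall(header + payload)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, since TCP is a byte stream with no message edges."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf += chunk
    return bytes(buf)


def recv_message(sock: socket.socket) -> bytes:
    """Read one length-prefixed message written by send_message()."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


if __name__ == "__main__":
    # socketpair() gives a connected pair for a local demo (AF_UNIX, not TCP).
    a, b = socket.socketpair()
    send_message(a, b"request body")
    print(recv_message(b))
    a.close()
    b.close()
```

Concatenating copies the payload once; for large messages a gathering call avoids that copy, but for the small messages discussed here the copy is cheap compared with an extra segment.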
--
denial, anger, bargaining, depression, acceptance, rebirth...
where do you want to be today?
Re: TCP interactive data flow
Rick Jones wrote:
> Here is my "boilerplate" Nagle discussion, performance discussion at
> the end:
>
> In broad terms, whenever an application does a send() call, the logic
> of the Nagle algorithm is supposed to go something like this:
>
> 1) Is the quantity of data in this send, plus any queued, unsent data,
> at least the MSS (Maximum Segment Size) for this connection? If
> yes, send the data in the user's send now (modulo any other
> constraints such as receiver's advertised window and the TCP
> congestion window). If no, go to 2.
>
> 2) Is the connection to the remote otherwise idle? That is, is there
> no unACKed data outstanding on the network? If yes, send the data in
> the user's send now. If no, queue the data and wait. Either the
> application will continue to call send() with enough data to get to a
> full MSS-worth of data, or the remote will ACK all the currently sent,
> unACKed data, or our retransmission timer will expire.
>
> Now, where applications run into trouble is when they have what might
> be described as "write, write, read" behaviour, where they present
> logically associated data to the transport in separate 'send' calls
> and those sends are typically less than the MSS for the connection.
> It isn't so much that they run afoul of Nagle as they run into issues
> with the interaction of Nagle and the other heuristics operating on
> the remote. In particular, the delayed ACK heuristics.
>
> When a receiving TCP is deciding whether or not to send an ACK back to
> the sender, in broad handwaving terms it goes through logic similar to
> this:
>
> a) is there data being sent back to the sender? if yes, piggy-back the
> ACK on the data segment.
>
> b) is there a window update being sent back to the sender? if yes,
> piggy-back the ACK on the window update.
>
> c) has the standalone ACK timer expired? if yes, send a standalone ACK.
>
> Window updates are generally triggered by the following heuristics:
>
> i) would the window update be for a non-trivial fraction of the window
> - typically somewhere at or above 1/4 the window, that is, has the
> application "consumed" at least that much data? if yes, send a
> window update. if no, check ii.
>
> ii) would the window update be for at least 2*MSS worth of data, that
> is, has the application "consumed" at least 2*MSS? if yes, send a
> window update; if no, wait.
>
> Now, going back to that write, write, read application, on the sending
> side, the first write will be transmitted by TCP via logic rule 2 -
> the connection is otherwise idle. However, the second small send will
> be delayed as there is at that point unACKnowledged data outstanding
> on the connection.
>
> At the receiver, that small TCP segment will arrive and will be passed
> to the application. The application does not have the entire app-level
> message, so it will not send a reply (data to TCP) back. The typical
> TCP window is much much larger than the MSS, so no window update would
> be triggered by heuristic i. The data just arrived is < 2*MSS, so no
> window update from heuristic ii. Since there is no window update, no
> ACK is sent by heuristic b.
>
> So, that leaves heuristic c - the standalone ACK timer. That ranges
> anywhere between 50 and 200 milliseconds depending on the TCP stack in
> use.
>
> If you've read this far :) now we can take a look at the effect of
> various things touted as "fixes" to applications experiencing this
> interaction. We take as our example a client-server application where
> both the client and the server are implemented with a write of a small
> application header, followed by application data. First, the
> "default" case which is with Nagle enabled (TCP_NODELAY _NOT_ set) and
> with standard ACK behaviour:
>
> Client                        Server
> Req Header                ->
>                           <-  Standalone ACK after N ms
> Req Data                  ->
>                           <-  Possible standalone ACK
>                           <-  Rsp Header
> Standalone ACK            ->
>                           <-  Rsp Data
> Possible standalone ACK   ->
>
>
> For two "messages" we end up with at least six segments on the wire.
> The possible standalone ACKs will depend on whether the server's
> response time or the client's think time is longer than the standalone
> ACK interval on their respective sides. Now, if TCP_NODELAY is set we
> see:
>
>
> Client                        Server
> Req Header                ->
> Req Data                  ->
>                           <-  Possible Standalone ACK after N ms
>                           <-  Rsp Header
>                           <-  Rsp Data
> Possible Standalone ACK   ->
>
> In theory, we are down to four segments on the wire, which seems good,
> but frankly we can do better. First though, consider what happens
> when someone disables delayed ACKs:
>
> Client                        Server
> Req Header                ->
>                           <-  Immediate Standalone ACK
> Req Data                  ->
>                           <-  Immediate Standalone ACK
>                           <-  Rsp Header
> Immediate Standalone ACK  ->
>                           <-  Rsp Data
> Immediate Standalone ACK  ->
>
> Now we definitely see 8 segments on the wire. It will also be that way
> if both TCP_NODELAY is set and delayed ACKs are disabled.
>
> How about if the application did the "right" thing in the first place?
> That is, sent the logically associated data at the same time:
>
>
> Client                        Server
> Request                   ->
>                           <-  Possible Standalone ACK
>                           <-  Response
> Possible Standalone ACK   ->
>
> We are down to two segments on the wire.
>
> For "small" packets, the CPU cost is about the same regardless of data
> or ACK. This means that the application which is making the proper
> gathering send call will spend far fewer CPU cycles in the networking
> stack.[/color]
A pretty complete description of the problem, and seems to be exactly as
I understood it. Thanks for that.
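The write, write, read interaction above reduces to a few lines of arithmetic. A minimal Python sketch, assuming a hypothetical `Params` with made-up latency, MSS, window and message sizes, that applies the sender rules 1 and 2 and the receiver rules a-c and i-ii from the quoted description to compute when the server holds the whole request:

```python
from dataclasses import dataclass


@dataclass
class Params:
    # All values are hypothetical, chosen only for illustration.
    one_way_ms: float = 1.0      # network one-way latency
    ack_delay_ms: float = 200.0  # standalone ACK timer (50-200 ms per the post)
    mss: int = 1448
    window: int = 65535          # receiver's advertised window
    header_len: int = 16         # size of the first (header) write
    data_len: int = 100          # size of the second (data) write


def nagle_allows_send(queued: int, unacked: int, mss: int) -> bool:
    """Rules 1 and 2: send if a full MSS is queued or nothing is unACKed."""
    return queued >= mss or unacked == 0


def receiver_acks_now(reply_pending: bool, consumed: int,
                      window: int, mss: int) -> bool:
    """Rules a and b; False means only the ACK timer (rule c) will ACK."""
    if reply_pending:                        # a) piggy-back on reply data
        return True
    # b) piggy-back on a window update, triggered by heuristic i or ii
    return consumed >= window // 4 or consumed >= 2 * mss


def request_complete_ms(p: Params, nodelay: bool = False,
                        delayed_ack: bool = True,
                        single_write: bool = False) -> float:
    """When does the server hold the whole request (header + data)?"""
    if single_write:
        return p.one_way_ms                  # one segment carries everything
    # The first write goes out at t=0: the connection is idle (rule 2).
    if nodelay or nagle_allows_send(p.data_len, p.header_len, p.mss):
        return p.one_way_ms                  # second write follows immediately
    # The second write is queued behind the unACKed header. The header
    # arrives, but the app lacks a full message, so there is no reply data.
    header_arrives = p.one_way_ms
    if delayed_ack and not receiver_acks_now(False, p.header_len,
                                             p.window, p.mss):
        ack_sent = header_arrives + p.ack_delay_ms   # c) wait for the timer
    else:
        ack_sent = header_arrives                    # immediate ACK
    # The ACK travels back, releasing the queued data, which then crosses.
    return ack_sent + 2 * p.one_way_ms


for label, kw in [("Nagle + delayed ACK", {}),
                  ("TCP_NODELAY", {"nodelay": True}),
                  ("delayed ACK off", {"delayed_ack": False}),
                  ("single write", {"single_write": True})]:
    print(f"{label:20} {request_complete_ms(Params(), **kw):6.1f} ms")
```

With these made-up numbers the default case comes out at 203 ms against 1 to 3 ms for the others, nearly all of it the standalone ACK timer. The model ignores serialization time, congestion and receive window limits, and the response direction.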
Re: TCP interactive data flow
robertwessel2@yahoo.com wrote:
> On Jul 16, 11:55 am, "Mark (newsgroups)" <marknewsgro...@yahoo.com>
> wrote:
>> So if the case with the latest problem turns out to be the same, and
>> assuming we can't rely on 3rd party2 to disable nagling, what negative
>> effects will disabling delayed acks on our boxes have?
>>
>> If it were just my application on the box I'd have no concerns, but
>> these are production machines where we share server real estate with any
>> number of other applications. I wouldn't want to degrade their
>> performance in some way, say consuming more CPU for example.
>
>
> Turning off delayed acks will (usually slightly) increase the CPU load
> on both ends of the conversation(s), and will result in more send
> traffic from your host. It will usually help response time, at some
> cost in bandwidth.
Thank you. I guess the real answer is to look at the performance of the
boxes in question and see how much give we have. It's not trivial since
as I said, other applications share server real estate. But it seems
that I have understood the problem correctly.
Re: TCP interactive data flow
"Mark (newsgroups)" <marknewsgroups@yahoo.com> wrote:
> A pretty complete description of the problem, and seems to be
> exactly as I understood it. Thanks for that.
My pleasure. You should be able to plug in your message sizes and the
sizes for a standalone TCP ACK segment (plus IP header and link-layer
header) and arrive at an estimate for the differences in maximum
network bandwidth achievable.
rick jones
--
Wisdom Teeth are impacted, people are affected by the effects of events.
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
Re: TCP interactive data flow
Rick Jones wrote:
> "Mark (newsgroups)" <marknewsgroups@yahoo.com> wrote:
>> A pretty complete description of the problem, and seems to be
>> exactly as I understood it. Thanks for that.
>
> My pleasure. You should be able to plug in your message sizes and the
> sizes for a standalone TCP ACK segment (plus IP header and link-layer
> header) and arrive at an estimate for the differences in maximum
> network bandwidth achievable.
Interesting in theory, but not applicable to me in practice. I have a
few options:
1) Leave things as they are - not really acceptable, we have clients
complaining about these 100-200ms latencies.
2) Try turning off delayed ack on a production machine - possible but
I'm very worried about the negative impacts on other applications
3) Hope the 3rd party comes out with a solution with nagling disabled on
their side.
As I mentioned, I have no control over the TCP/IP communication at the
socket level, since it is done through an API they provide.
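For option 3, the change on the 3rd party's side is normally a single socket option per connection. A minimal Python sketch; `disable_nagle` is a hypothetical helper, and the unconnected socket stands in for whatever their API uses internally:

```python
import socket


def disable_nagle(sock: socket.socket) -> None:
    """Set TCP_NODELAY so small writes are not held back waiting for
    outstanding data to be ACKed."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)


# Illustrative only: in practice this would be the vendor's connected socket.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
disable_nagle(s)
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
s.close()
```

As the diagrams in Rick's earlier post show, this removes the wait for the delayed ACK but still puts separate header and data segments on the wire.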
Re: TCP interactive data flow
> Interesting in theory, but not applicable to me in practice. I have a
> few options

> 1) Leave things as they are - not really acceptable, we have clients
> complaining about these 100-200ms latencies.

> 2) Try turning off delayed ack on a production machine - possible
> but I'm very worried about the negative impacts on other
> applications
That was one of the reasons for doing the packet size overhead
calculation. If we are talking about an "ethernet like" thing, there
is 14 bytes worth of link-layer header, 20 bytes of IPv4 header and
then, assuming timestamps are on in TCP, 32 bytes of TCP header. So,
headers for any packet on the wire will be at least 14+20+32 or 66
bytes. You can then use your known application-level message and ack
sizes. That could tell you the effect at the network bandwidth level.
Effect at the CPU util level would require gathering some fundamental
performance figures for your system(s) and stack(s) with something
like netperf. Perhaps using a test system if you have one.
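The bandwidth side of that calculation can be scripted. A minimal Python sketch, assuming the 66-byte per-segment header figure above, the minimum segment counts read off the diagrams in the earlier post ("possible" ACKs left out), and made-up application message sizes:

```python
# Per-segment header bytes from the post: Ethernet (14) + IPv4 (20)
# + TCP with timestamps (32). Ethernet FCS, preamble and inter-frame
# gap are not included.
HEADER_BYTES = 14 + 20 + 32

# Minimum segments per request/response exchange, read off the diagrams:
# (segments carrying data, standalone ACKs).
SCENARIOS = {
    "Nagle + delayed ACK": (4, 2),
    "TCP_NODELAY": (4, 0),
    "delayed ACKs disabled": (4, 4),
    "single gathering send": (2, 0),
}


def wire_bytes(data_segments: int, acks: int, payload: int) -> int:
    """Bytes on the wire, both directions summed, for one exchange."""
    return (data_segments + acks) * HEADER_BYTES + payload


# Hypothetical sizes: request header + data, response header + data.
payload = 16 + 100 + 16 + 200

for name, (data_segs, acks) in SCENARIOS.items():
    total = wire_bytes(data_segs, acks, payload)
    print(f"{name:22} {data_segs + acks} segments {total:4d} bytes "
          f"({100 * (total - payload) / total:.0f}% headers)")
```

The ratios between the per-exchange totals are what matter for comparing the cases on a given link; the CPU side still has to be measured.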

> 3) Hope the 3rd party comes out with a solution with nagling
> disabled on their side.
An application-layer ACK implies an application-layer retransmission
mechanism. Is there one? Any idea what those timers happen to be and
whether the application can implement application-layer delayed ACK?
Then it could piggy-back its ACKs on replies and avoid the nagle bit.
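One way an application-layer delayed ACK could look, as a minimal Python sketch: `AppDelayedAck`, its methods and the `ACK1|` wire format are all hypothetical, and time is passed in explicitly in milliseconds so the behaviour is easy to follow:

```python
from typing import Callable, List, Optional


class AppDelayedAck:
    """Hypothetical application-layer delayed ACK.

    An app-level acknowledge is held for up to `delay_ms`. If the
    application writes a reply first, the ACK is prepended to it so both
    leave in a single send; otherwise it is flushed alone when the timer
    expires. `delay_ms` must stay well below the peer's app-level
    retransmission timeout, or the peer will resend needlessly.
    """

    def __init__(self, send: Callable[[bytes], None], delay_ms: int) -> None:
        self.send = send                  # e.g. a socket's sendall
        self.delay_ms = delay_ms
        self.pending = b""                # ACK bytes not yet sent
        self.deadline: Optional[int] = None

    def ack(self, ack_bytes: bytes, now_ms: int) -> None:
        """Queue an ACK; the first queued ACK sets the flush deadline."""
        if self.deadline is None:
            self.deadline = now_ms + self.delay_ms
        self.pending += ack_bytes

    def reply(self, data: bytes) -> None:
        """Send reply data, piggy-backing any pending ACK in the same write."""
        self.send(self.pending + data)
        self.pending, self.deadline = b"", None

    def tick(self, now_ms: int) -> None:
        """Call periodically; flushes a pending ACK once its timer expires."""
        if self.deadline is not None and now_ms >= self.deadline:
            self.send(self.pending)
            self.pending, self.deadline = b"", None


sent: List[bytes] = []
d = AppDelayedAck(sent.append, delay_ms=50)
d.ack(b"ACK1|", now_ms=0)
d.reply(b"DATA")          # one write: b"ACK1|DATA"
d.ack(b"ACK2|", now_ms=100)
d.tick(now_ms=120)        # still within 50 ms: nothing sent
d.tick(now_ms=150)        # timer expired: b"ACK2|" sent alone
```

The design mirrors TCP's own delayed ACK one layer up: the common case (a reply is ready soon) costs one write instead of two.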
rick jones
--
firebug n, the idiot who tosses a lit cigarette out his car window
Re: TCP interactive data flow
On Jul 16, 2:29 pm, "Mark (newsgroups)" <marknewsgro...@yahoo.com>
wrote:

> 3) Hope the 3rd party comes out with a solution with nagling disabled on
> their side.
You left out:
4) Hope the 3rd party comes out with a *proper* solution on their
side, sending all the data in a single write call like they're
supposed to.
5) Disabling Nagle on your side and dribbling data in the delay
interval to give your ACKs something to piggyback on.
DS
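Option 4 is what Rick's quoted write-up calls a gathering send: logically associated pieces handed to TCP in one call. A minimal Python sketch, assuming a Unix-like system where `socket.sendmsg` is available (elsewhere, `sock.sendall(header + body)` achieves the same single hand-off); `send_message` and the `HDR:` framing are hypothetical, and the demo uses a local socket pair only to show the bytes arrive intact, not to show TCP segmentation:

```python
import socket


def send_message(sock: socket.socket, header: bytes, body: bytes) -> None:
    """Hand header and body to the transport in one gathering call
    instead of two separate send()s, so a small message can leave as
    one segment."""
    total = len(header) + len(body)
    sent = sock.sendmsg([header, body])   # like writev(): one system call
    if sent < total:                      # partial write: send the rest
        sock.sendall((header + body)[sent:])


a, b = socket.socketpair()
send_message(a, b"HDR:", b"payload")
received = b""
while len(received) < 11:
    received += b.recv(64)
print(received)
a.close()
b.close()
```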
Re: TCP interactive data flow
On Jul 17, 2:50 am, David Schwartz <dav...@webmaster.com> wrote:
> On Jul 16, 2:29 pm, "Mark (newsgroups)" <marknewsgro...@yahoo.com>
> wrote:
>
> > 3) Hope the 3rd party comes out with a solution with nagling disabled on
> > their side.
>
> You left out:
>
> 4) Hope the 3rd party comes out with a *proper* solution on their
> side, sending all the data in a single write call like they're
> supposed to.
Thanks, but you lack understanding of the problem space. I'm reluctant
to actually say what it is we're doing due to sensitivities, but
needless to say you should accept that the situation I described is as
it is for a reason (I don't mean the nagling, I mean sending the
"acknowledge" and "dataX" in separate write calls).

> 5) Disabling Nagle on your side and dribbling data in the delay
> interval to give your ACKs something to piggyback on.
Firstly, I don't think this would solve the problem, since data can be
sporadic and we'd still see the delay in many cases, which is not a
solution. Secondly, as I mentioned, I do not have control over the
underlying TCP/IP communication, which is done via an API they provide.