TCP limit of 4 segments bursts drastically reduces performance? - TCP-IP



Thread: TCP limit of 4 segments bursts drastically reduces performance?

  1. TCP limit of 4 segments bursts drastically reduces performance?

    I've run into an issue where performance is drastically reduced (e.g.
    70 KB/sec when it should be 5 MB/sec) because of what appears to be the
    4 segment burst limit in the TCP stack I'm using (NetBSD). It has 17 KB
    to send, but stops after 4 segments and waits for an ack before sending
    more. The receiving TCP stack appears to intentionally have the
    "stretch ack violation" to improve performance so it doesn't ack the 4
    segments until its delayed ack timer (200ms) fires. This means the
    sender sees a 200ms pause in each 17 KB transaction, which greatly
    reduces performance.

    I've been able to hack around this by simply increasing the max burst
    rate to 8 segments. This sends enough packets to the client to trigger
    it to send a normal ack so the sender isn't slowed down by the delayed
    ack. What I'm wondering is how bad would this type of change be? Was
    the number 4 chosen because it was the best burst rate averaged across
    all situations?

    It seems that rather than putting a fixed upper limit on the number of
    segments that can be sent at once (which seems to come from RFC 3782's
    desire to avoid a burst during fast recovery), it could use the slow
    start mechanism to increase the burst rate. During normal operation
    (i.e. not during recovery), it could have no burst limit at all, to
    maximize performance. I have to admit that I don't really know what I'm
    talking about, due to my inexperience with TCP at this level, so I
    could be missing something obvious.
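
    Incidentally, the arithmetic lines up with the delayed-ack theory. A
    rough check (just a sketch using the numbers above; nothing here is
    measured):

    ```python
    # Each 17 KB transaction stalls on the receiver's 200 ms delayed-ack
    # timer, so throughput is bounded by one transaction per 200 ms.
    transaction_bytes = 17 * 1024   # 17 KB per transaction (from the post)
    delayed_ack_sec = 0.200         # receiver's delayed-ack timer

    throughput_kb_s = transaction_bytes / delayed_ack_sec / 1024
    print(f"upper bound: {throughput_kb_s:.0f} KB/sec")  # 85 KB/sec
    ```

    An upper bound of ~85 KB/sec, minus per-transaction overhead, is
    consistent with the observed 70 KB/sec.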


  2. Re: TCP limit of 4 segments bursts drastically reduces performance?

    Hi Skillzero-

    Is this related to your previous post?

    * Without looking at the complete packet trace, this is my best guess.

    This does not seem to be a case of "stretch ack violation". In TCP, an
    ACK should be sent for at least every two full-sized segments (delayed
    ACKs). However, it is possible to write a custom stack that delays ACKs
    much longer - even 4, 8, or 16 segments. That kind of behavior makes
    traffic burstier, which can overwhelm the Internet. Stretch ACKs are
    frowned upon - they are even documented as a known implementation
    problem (see RFC 2525).

    So what else could it be?

    I think your BSD kernel has the Congestion Window Monitoring (CWM)
    algorithm enabled, also called the Hughes/Touch/Heidemann algorithm. It
    imposes a burst limit of 4 full-sized segments; apparently it helps web
    servers. You can disable it and recompile your kernel, which would be
    the ideal solution for you. I lost the links, but you can google for
    instructions on how to disable it.

    Read this link for the CWM algorithm
    http://www3.ietf.org/proceedings/98d...restart-00.txt
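
    For what it's worth, NetBSD appears to expose CWM as runtime sysctl
    knobs, which would avoid a recompile entirely. The names below are from
    memory, so verify them against your kernel:

    ```shell
    # Assumed NetBSD sysctl names - check `sysctl net.inet.tcp` first.
    sysctl net.inet.tcp.cwm                  # 1 = CWM enabled
    sysctl -w net.inet.tcp.cwm=0             # disable CWM entirely
    sysctl -w net.inet.tcp.cwm_burstsize=8   # or just raise the burst limit
    ```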

    skillzero@gmail.com wrote:
    > I've run into an issue where performance is drastically reduced (e.g.
    > 70 KB/sec when it should be 5 MB/sec) because of what appears to be the
    > 4 segment burst limit in the TCP stack I'm using (NetBSD). It has 17 KB
    > to send, but stops after 4 segments and waits for an ack before sending
    > more. The receiving TCP stack appears to intentionally have the
    > "stretch ack violation" to improve performance so it doesn't ack the 4
    > segments until its delayed ack timer (200ms) fires. This means the
    > sender sees a 200ms pause in each 17 KB transaction, which greatly
    > reduces performance.
    >
    > I've been able to hack around this by simply increasing the max burst
    > rate to 8 segments. This sends enough packets to the client to trigger
    > it to send a normal ack so the sender isn't slowed down by the delayed
    > ack. What I'm wondering is how bad would this type of change be? Was
    > the number 4 chosen because it was the best burst rate averaged across
    > all situations?
    >


    The number 4 represents twice the delayed-ACK interval of two segments.
    If you bump it up to eight, it would be difficult to characterize the
    behavior without simulations.

    > It seems that rather than putting a fixed upper limit on the number of
    > segments that can be sent at once (which seems to come from RFC 3782's
    > desire to avoid a burst during fast recovery), it could use the slow
    > start mechanism to increase the burst rate. During normal operation
    > (i.e. not during recovery), it could have no burst limit at all, to
    > maximize performance. I have to admit that I don't really know what I'm
    > talking about, due to my inexperience with TCP at this level, so I
    > could be missing something obvious.


    Yes, you could disable the CWM algorithm and get this behavior. See
    above.

    Regards,
    Vivek Rajan
    http://www.unleashnetworks.com


  3. Re: TCP limit of 4 segments bursts drastically reduces performance?

    VivekRajan wrote:
    > This does not seem to be a case of "stretch ack violation". In TCP,
    > ACKs have to be sent for atleast every two full sized segments


    Picking nits, the RFCs say "should" (IIRC).

    > (delayed ACKs). However, it is possible to write a custom stack that
    > would delay ACKs for a long time - even 4,8,16 segments. This type
    > of behavior increases bursty behavior that can overwhelm the
    > internet. Stretch acks are frowned upon - even documented as a known


    It _may_ affect the Internet, but 99 times out of 10 someone's intranet
    won't care, and the burstiness will be self-limiting - segments get
    dropped, cwnds are shrunk, etc. It isn't as if it will lead to
    congestive collapse.

    > implementation problem (See RFC 2525)


    I happen to like well-implemented ACK avoidance heuristics. HP-UX
    11/11i, Solaris and IIRC Mac OS 9 all have rather good ones (their TCP
    stacks share a common ancestor). In broad handwaving terms a bare ACK
    is just as expensive in terms of CPU cycles as a data segment. Here
    is some netperf TCP_MAERTS data where the maximum number of segments
    to wait before generating an ACK is altered from 2 to 8:

    The first column is the "deferred_ack_max"; the next four are the socket
    and send sizes and the runtime; then throughput in Mbit/s; then local
    and remote CPU utilization; then local and remote service demand -
    microseconds of CPU consumed to transfer one KB (K == 1024) of data.
    The systems were not completely isolated, so the data will be a little
    noisy.

    2 131072 131072 32768 10.01 946.25 64.67 43.32 11.198 7.500
    3 131072 131072 32768 10.00 944.66 51.63 37.08 8.955 6.431
    4 131072 131072 32768 10.01 946.81 42.53 34.59 7.360 5.985
    5 131072 131072 32768 10.01 946.76 39.96 31.33 6.915 5.422
    6 131072 131072 32768 10.01 946.82 39.89 32.59 6.903 5.639
    7 131072 131072 32768 10.00 946.37 36.79 28.39 6.368 4.916
    8 131072 131072 32768 10.01 946.75 35.35 32.39 6.117 5.606

    We can see that on the system receiving the data, the service demand
    goes from ~11.2 usec/KB to ~6.1 usec/KB. On the side sending the
    data, it drops from 7.5 to as low as 5.
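
    As a quick sanity check on those numbers (just re-deriving the ratio
    from the table above):

    ```python
    # Local (receive-side) service demand in usec/KB, keyed by
    # deferred_ack_max, copied from the netperf table above.
    local_demand = {2: 11.198, 3: 8.955, 4: 7.360,
                    5: 6.915, 6: 6.903, 7: 6.368, 8: 6.117}

    saving = 1 - local_demand[8] / local_demand[2]
    print(f"CPU per KB saved at 8 vs 2: {saving:.0%}")  # about 45%
    ```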

    It is possible to go beyond 8, but the returns do start to diminish.

    rick jones
    --
    denial, anger, bargaining, depression, acceptance, rebirth...
    where do you want to be today?
    these opinions are mine, all mine; HP might not want them anyway...
    feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

  4. Re: TCP limit of 4 segments bursts drastically reduces performance?

    > _may_ affect the Internet but 99 times out of 10 someones intranet
    > won't care, and the burstiness will be self-limiting - segments get
    > dropped cwnd's are shrunk etc. it isn't as if it will lead to
    > congestive collapse.


    I could be totally off-track due to my lack of real-world experience
    with custom TCP stacks.

    If only a few TCPs in an intranet are modified to exhibit stretch-ack
    behavior, that might not be a problem. If a majority of TCPs are
    modified to 'stretch ack', it might lead to unacceptable congestion. Is
    my understanding right?

    The sender's window opens up based on incoming ACKs. When you stretch
    ACKs over 4 full-sized segments, each ACK covers more data. The ACKs
    come in half as often (compared to the default of every 2 MSS), but
    when they do come in, they open the window twice as much. This means
    the sender now bursts more data, less frequently. This applies to both
    the slow-start and congestion-avoidance phases. Of course, duplicate
    ACKs will cut the window and push TCP into fast recovery. Will such
    bursting cause the intervening routers to drop packets and push the
    sender into fast recovery even though the pipe is not full?
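
    A toy model of that burst growth (my own sketch, assuming the classic
    one-MSS-per-ACK cwnd growth of RFC 2581, with no ABC-style byte
    counting):

    ```python
    def burst_per_ack(stretch_segments):
        """Segments sent back-to-back per ACK during slow start.

        Each incoming ACK frees `stretch_segments` of window and grows
        cwnd by one MSS, so the sender emits stretch_segments + 1 at once.
        """
        return stretch_segments + 1

    for n in (2, 4, 8):
        print(f"ack every {n} segments -> bursts of {burst_per_ack(n)}")
    ```

    Half the ACKs, roughly the same goodput, but each burst is larger -
    exactly the pattern that routers with shallow queues dislike.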

    If the network has a large bandwidth, wouldn't increasing the window
    size be a better way of achieving higher throughput than stretch ACKs?

    >
    > > implementation problem (See RFC 2525)

    >
    > I happen to like well-implemented ACK avoidance heuristics. HP-UX
    > 11/11i, Solaris and IIRC Mac OS 9 all have rather good ones (their TCP
    > stacks share a common ancestor). In broad handwaving terms a bare ACK
    > is just as expensive in terms of CPU cycles as a data segment. Here
    > is some netperf TCP_MAERTS data where the maximum number of segments
    > to wait before generating an ACK is altered from 2 to 8:
    >


    > The first column is the "deferred_ack_max" the next four are socket
    > and send sizes and runtime, then throughput in Mbit/s, then local and
    > remote CPU util, then local and remote service demand - microseconds
    > of CPU consumed to transfer on KB (K == 1024) of data. The systems
    > were not completely isolated so the data will be a little noisy.
    >
    > 2 131072 131072 32768 10.01 946.25 64.67 43.32 11.198 7.500
    > 3 131072 131072 32768 10.00 944.66 51.63 37.08 8.955 6.431
    > 4 131072 131072 32768 10.01 946.81 42.53 34.59 7.360 5.985
    > 5 131072 131072 32768 10.01 946.76 39.96 31.33 6.915 5.422
    > 6 131072 131072 32768 10.01 946.82 39.89 32.59 6.903 5.639
    > 7 131072 131072 32768 10.00 946.37 36.79 28.39 6.368 4.916
    > 8 131072 131072 32768 10.01 946.75 35.35 32.39 6.117 5.606
    >

    I read somewhere, in an embedded-devices forum, that stretch ACKs
    placed a high load on the CPU. It is interesting that here stretch ACKs
    seem to decrease the CPU load from 64% to 35%. Going from 7 MSS to
    8 MSS seems to degrade things, so in your test setup can we assume
    7 MSS is the optimal stretch-ACK interval? Is there any way to measure
    bursts too?

    BTW: Are you the same Rick Jones who maintains netperf? Thanks for the
    netperf tool. It is on my to-learn list!

    Regards,
    Vivek Rajan

