I found a bug in the tcp congestion control in linux (kernels 2.6.x) - Linux

  1. I found a bug in the tcp congestion control in linux (kernels 2.6.x)

    Hello,

    I found some strange behaviour in the Linux TCP stack. I think that
    the congestion control mechanism in Linux works improperly. At the
    beginning of a transmission, the TCP sender has its congestion window
    set to 2 maximum segment sizes (cwnd = 2*MSS). When MSS = 1460 B,
    cwnd = 2920 B at the beginning of the transmission.
    I found that Linux, instead of sending 2920 bytes of data, sends
    only two packets. When the packets are small (160 bytes), it sends
    only 320 bytes in this window and waits for an acknowledgement from
    the receiver. When the packets are large (1460 bytes), it sends
    2920 B. I think that in the first case the sender should be able to
    send 18 packets of 160 bytes each and then wait for the ACK.

    In the tests, Nagle's algorithm was disabled, and I was using Wireshark
    to trace it.

    Does anyone know why it behaves this way? Is there a possibility to
    change this (using sysctl)?


    BR,
    Riki
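
For reference, the arithmetic behind the question can be sketched in a few lines of Python. This is only an illustration of the two ways of counting the initial window described in the post (a byte budget of 2*MSS versus a limit of two segments). The function names are made up for this sketch, and nothing here reads state from a real kernel.

```python
MSS = 1460      # bytes, as in the post
PAYLOAD = 160   # bytes per small packet, as in the post


def flight_by_bytes(mss, payload, segments=2):
    """Packets and bytes allowed before the first ACK if the initial
    window is treated purely as a byte budget of segments * mss."""
    cwnd_bytes = segments * mss
    packets = cwnd_bytes // payload
    return packets, packets * payload


def flight_by_segments(payload, segments=2):
    """Packets and bytes allowed before the first ACK if the initial
    window is counted in segments, whatever their size."""
    return segments, segments * payload


if __name__ == "__main__":
    print("byte budget:   ", flight_by_bytes(MSS, PAYLOAD))
    print("segment count: ", flight_by_segments(PAYLOAD))
```

With the numbers from the post, the byte-budget reading allows 18 packets (2880 bytes), while the segment-count reading allows 2 packets (320 bytes), which matches what the trace showed on Linux.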


  2. Re: I found a bug in the tcp congestion control in linux (kernels 2.6.x)

    On Sep 11, 2:35 am, piotr.rich...@gmail.com wrote:

    > I found some strange behaviour in the Linux TCP stack. I think that
    > the congestion control mechanism in Linux works improperly. At the
    > beginning of a transmission, the TCP sender has its congestion window
    > set to 2 maximum segment sizes (cwnd = 2*MSS). When MSS = 1460 B,
    > cwnd = 2920 B at the beginning of the transmission.
    > I found that Linux, instead of sending 2920 bytes of data, sends
    > only two packets. When the packets are small (160 bytes), it sends
    > only 320 bytes in this window and waits for an acknowledgement from
    > the receiver. When the packets are large (1460 bytes), it sends
    > 2920 B. I think that in the first case the sender should be able to
    > send 18 packets of 160 bytes each and then wait for the ACK.
    >
    > In the tests, Nagle's algorithm was disabled, and I was using Wireshark
    > to trace it.
    >
    > Does anyone know why it behaves this way? Is there a possibility to
    > change this (using sysctl)?


    This is not congestion control working improperly, it is slow start
    working properly. Congestion control is about finding the ultimate
    limit of what the link can handle. Slow start is about not blasting
    through that limit with your first few packets and harming existing
    connections.

    DS

  3. Re: I found a bug in the tcp congestion control in linux (kernels 2.6.x)

    OK, so something else is working improperly. In the slow start phase,
    Linux can initially send only two small packets, but it should be
    able to send the full 2*MSS bytes (about 3000 bytes).
    I tested the same thing on Windows XP and the behaviour was
    correct. Windows XP was sending about 18 small packets (160 bytes)
    without waiting for an ACK.

    BR,
    Riki




  4. Re: I found a bug in the tcp congestion control in linux (kernels 2.6.x)


    piotr.rich...@gmail.com wrote:

    > OK, so something else is working improperly. In the slow start phase,
    > Linux can initially send only two small packets, but it should be
    > able to send the full 2*MSS bytes (about 3000 bytes).
    > I tested the same thing on Windows XP and the behaviour was
    > correct. Windows XP was sending about 18 small packets (160 bytes)
    > without waiting for an ACK.


    That sounds like Nagle. Why would you want Linux to keep blasting tiny
    packets at a network with a larger MSS? It sounds like Windows
    behavior is wrong.

    DS

  5. Re: I found a bug in the tcp congestion control in linux (kernels 2.6.x)

    The Nagle algorithm was disabled in both cases (Windows XP and Linux),
    and Windows was correctly sending the number of bytes allowed
    by the congestion window. I'm investigating how TCP can be used for
    VoIP transmission, and that is the reason why I want to send packets
    more frequently.

    P.S. RFC 2581 says that the congestion window is measured in bytes,
    not in packets.

    BR,
    Riki
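
For readers who want to reproduce the setup: disabling Nagle's algorithm on a socket is normally done with the TCP_NODELAY socket option. Below is a minimal sketch in Python, assuming a plain IPv4 TCP socket. The helper names are hypothetical, and the thread does not say how the original tests set the option.

```python
import socket


def make_low_latency_socket():
    """Create an (unconnected) TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 asks the stack to send small writes right away
    # instead of holding them back while earlier data is unacknowledged.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nagle_disabled(sock):
    """Return True if TCP_NODELAY is currently set on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


if __name__ == "__main__":
    s = make_low_latency_socket()
    print("Nagle disabled:", nagle_disabled(s))
    s.close()
```

Note that TCP_NODELAY only affects when small writes leave the send queue. It does not change how the congestion window itself is counted, which is what the rest of the thread argues about.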





  6. Re: I found a bug in the tcp congestion control in linux (kernels 2.6.x)

    On Sep 11, 5:05 pm, piotr.rich...@gmail.com wrote:

    > The Nagle algorithm was disabled in both cases (Windows XP and Linux)


    That's kind of silly, as it will result in poorer network utilization.
    Sending the packet sooner will just get it to the queue on the
    bandwidth-limiting link sooner. Sending a bigger packet will result in
    better utilization of the bandwidth-limiting link.

    > and Windows was correctly sending the number of bytes allowed
    > by the congestion window. I'm investigating how TCP can be used for
    > VoIP transmission, and that is the reason why I want to send packets
    > more frequently.


    > P.S. RFC 2581 says that the congestion window is measured in bytes,
    > not in packets.


    We're not talking about the congestion window, we're talking about
    slow start. You don't want to send a whole bunch of small packets
    blind.

    Disabling Nagle is killing you. It requires both more bytes and more
    packets to be sent on the wire to send the same data. Getting the data
    on the wire quickly is *NOT* important.

    DS
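
The "more bytes and more packets" point can be made concrete with a rough calculation. The sketch below assumes 40 bytes of header per packet (a 20-byte IPv4 header plus a 20-byte TCP header, no options); that figure and the function name are assumptions for illustration, not something measured in the thread.

```python
HEADER_BYTES = 40  # assumed: 20-byte IPv4 header + 20-byte TCP header, no options


def wire_cost(total_payload, payload_per_segment, header_bytes=HEADER_BYTES):
    """Return (packets, bytes_on_wire) needed to carry total_payload bytes
    when every segment carries at most payload_per_segment bytes."""
    packets = -(-total_payload // payload_per_segment)  # ceiling division
    return packets, total_payload + packets * header_bytes


if __name__ == "__main__":
    # Nine 160-byte voice frames sent one per packet...
    print("separate: ", wire_cost(9 * 160, 160))
    # ...versus the same 1440 bytes coalesced into a single segment.
    print("coalesced:", wire_cost(9 * 160, 9 * 160))
```

Under these assumptions the nine separate packets cost 1800 bytes on the wire, while one coalesced segment costs 1480 bytes. The price of coalescing is that the first frame waits for the last one, which is exactly the latency trade-off disputed in the following posts.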

  7. Re: I found a bug in the tcp congestion control in linux (kernels 2.6.x)

    On 12 Sep, 03:20, David Schwartz wrote:
    > On Sep 11, 5:05 pm, piotr.rich...@gmail.com wrote:
    >
    > > The Nagle algorithm was disabled in both cases (Windows XP and Linux)

    >
    > That's kind of silly, as it will result in poorer network utilization.
    > Sending the packet sooner will just get it to the queue on the
    > bandwidth-limiting link sooner. Sending a bigger packet will result in
    > better utilization of the bandwidth-limiting link.


    I'm sending a 64 kbps CBR flow over TCP, so I don't think it can
    cause any congestion.

    > > and Windows was correctly sending the number of bytes allowed
    > > by the congestion window. I'm investigating how TCP can be used for
    > > VoIP transmission, and that is the reason why I want to send packets
    > > more frequently.
    > > P.S. RFC 2581 says that the congestion window is measured in bytes,
    > > not in packets.

    >
    > We're not talking about the congestion window, we're talking about
    > slow start. You don't want to send a whole bunch of small packets
    > blind.


    Slow start is the way the congestion window is increased at the
    beginning of a transmission. I want to send the number of bytes that
    the congestion window allows, but I can't. The only difference is
    that I want to send them in smaller packets.

    > Disabling Nagle is killing you. It requires both more bytes and more
    > packets to be sent on the wire to send the same data. Getting the data
    > on the wire quickly is *NOT* important.


    For VoIP transmission it is very important. For example, Skype uses
    TCP with the Nagle algorithm disabled when UDP traffic is blocked
    by a firewall.

    BR,
    Riki

  8. Re: I found a bug in the tcp congestion control in linux (kernels 2.6.x)

    On Sep 11, 11:28 pm, piotr.rich...@gmail.com wrote:

    > > Disabling Nagle is killing you. It requires both more bytes and more
    > > packets to be sent on the wire to send the same data. Getting the data
    > > on the wire quickly is *NOT* important.

    >
    > For VoIP transmission it is very important. For example, Skype uses
    > TCP with the Nagle algorithm disabled when UDP traffic is blocked
    > by a firewall.


    That's pretty dumb. It increases the number of packets required and it
    increases the number of bytes required. That's a lose/lose situation.

    DS

  9. Re: I found a bug in the tcp congestion control in linux (kernels 2.6.x)

    On 12 Sep, 09:16, David Schwartz wrote:
    > On Sep 11, 11:28 pm, piotr.rich...@gmail.com wrote:
    >
    > > > Disabling Nagle is killing you. It requires both more bytes and more
    > > > packets to be sent on the wire to send the same data. Getting the data
    > > > on the wire quickly is *NOT* important.

    >
    > > For VoIP transmission it is very important. For example, Skype uses
    > > TCP with the Nagle algorithm disabled when UDP traffic is blocked
    > > by a firewall.

    >
    > That's pretty dumb. It increases the number of packets required and it
    > increases the number of bytes required. That's a lose/lose situation.
    >
    > DS


    OK, but it is the only way to use TCP for VoIP transmission. Otherwise
    the delay and jitter will be too large. With this kind of traffic we
    want to send small packets (160 bytes) at intervals of 20 ms. When
    Nagle's algorithm is enabled, the packets are buffered and sent
    with delays.

    The original question was: why is the congestion control mechanism
    (phases: slow start and congestion avoidance) working improperly? It
    sends too little data when we use small packets.

    BR,
    Riki
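
The traffic pattern described here (160-byte packets every 20 ms) is where the 64 kbps figure mentioned earlier comes from. A small sketch of that calculation follows. The 40-byte per-packet header (IPv4 + TCP, no options) is an assumption added for illustration, and the function name is hypothetical.

```python
def cbr_rates(payload_bytes, interval_ms, header_bytes=40):
    """Return (packets_per_second, payload_bps, wire_bps) for a constant
    bit rate flow that sends one packet every interval_ms milliseconds.

    header_bytes is an assumed per-packet IPv4 + TCP header size.
    """
    pps = 1000 / interval_ms
    payload_bps = payload_bytes * 8 * pps
    wire_bps = (payload_bytes + header_bytes) * 8 * pps
    return pps, payload_bps, wire_bps


if __name__ == "__main__":
    pps, payload_bps, wire_bps = cbr_rates(160, 20)
    print(f"{pps:.0f} packets/s, {payload_bps / 1000:.0f} kbps payload, "
          f"{wire_bps / 1000:.0f} kbps including assumed headers")
```

That is 50 packets per second and 64 kbps of payload; with the assumed headers the flow occupies about 80 kbps on the wire.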

  10. Re: I found a bug in the tcp congestion control in linux (kernels 2.6.x)

    piotr.richter@gmail.com writes:

    [...]

    > I'm investigating, how TCP can be used for VoIP transmission


    And the answer is 'not at all'. Whenever a datagram is lost, the
    receiver will move somewhat more into the past of the sender, because
    instead of arriving at some time N (which is the time the datagram was
    sent + the time it needs to travel until it reaches its destination in
    absence of any transport problems), the datagram arrives at some time
    N + M, with M depending (exponentially!) on the number of
    retransmissions which were necessary until it arrived. But this means
    that all future segments now arrive M 'time units' in the past.
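
To put a rough number on M: if the retransmission timeout doubles after every expiry, the extra delay grows exponentially with the number of retransmissions. The sketch below assumes a fixed initial timeout and pure timeout-driven recovery (no fast retransmit); both the function name and the 1-second starting value in the example are assumptions, not values taken from this thread.

```python
def added_delay(initial_rto_s, retransmissions):
    """Extra delay M (in seconds) before a segment finally gets through,
    assuming it is resent only on timeout and the timeout doubles
    after each expiry."""
    delay = 0.0
    rto = initial_rto_s
    for _ in range(retransmissions):
        delay += rto   # wait for the current timeout to expire
        rto *= 2       # exponential backoff for the next attempt
    return delay


if __name__ == "__main__":
    for n in range(4):
        m = added_delay(1.0, n)
        # With one voice frame every 20 ms, M seconds of delay means
        # this many frames are queued up behind the lost segment.
        print(n, "retransmissions:", m, "s,", round(m / 0.02), "frames behind")
```

Because TCP delivers in order, every frame queued behind the lost segment is held back as well, which is the "moving into the past" effect described above.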

  11. Re: I found a bug in the tcp congestion control in linux (kernels 2.6.x)

    piotr.richter@gmail.com writes:

    [...]

    > The original question was: why is the congestion control mechanism
    > (phases: slow start and congestion avoidance) working improperly? It
    > sends too little data when we use small packets.


    It is working as defined:

    The sender starts by transmitting one segment and waiting for
    its ACK. When that ACK is received, the congestion window is
    incremented from one to two, and two segments can be sent.
    When each of those two segments is acknowledged, the
    congestion window is increased to four.
    [RFC2001]

    That you happen to force the sending TCP to use small segments instead
    of accumulating the data into larger segments just means that less
    data will be sent during slow start than would be possible. This
    doesn't turn TCP into a suitable protocol for isochronous streaming
    of audio data.
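
The quoted description can be turned into a tiny simulation. It is a sketch of the idealised behaviour in that passage only (window counted in segments, every segment acknowledged, no loss), not of what any particular kernel does; the function name is made up.

```python
def slow_start_rounds(segment_bytes, rounds, initial_segments=1):
    """Bytes sent in each round trip when the window is counted in
    segments and doubles every round (all segments ACKed, no loss)."""
    cwnd = initial_segments
    sent = []
    for _ in range(rounds):
        sent.append(cwnd * segment_bytes)
        cwnd *= 2  # one extra segment per ACK doubles the window per round
    return sent


if __name__ == "__main__":
    print("160-byte segments: ", slow_start_rounds(160, 4))
    print("1460-byte segments:", slow_start_rounds(1460, 4))
```

The number of segments per round is the same in both cases; only the amount of data differs, which is the "less data will be sent during slow start" effect described above.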

  12. Re: I found a bug in the tcp congestion control in linux (kernels 2.6.x)

    On Sep 12, 1:01 am, piotr.rich...@gmail.com wrote:

    > OK, but it is the only way to use TCP for VoIP transmission.


    Then why are we having this conversation? If you know how to make it
    work, go do that. I'm telling you why what you are doing won't work.

    > Otherwise
    > the delay and jitter will be too large.


    Okay, then pick the way that provides the best performance. I'm pretty
    sure disabling Nagle is going to make things worse, and I've explained
    why. If you find things are better with Nagle on, then turn it on.

    > With this kind of traffic we
    > want to send small packets (160 bytes) at intervals of 20 ms. When
    > Nagle's algorithm is enabled, the packets are buffered and sent
    > with delays.


    That is what you need, because these packets are too small to send
    efficiently.

    > The original question was: why is the congestion control mechanism
    > (phases: slow start and congestion avoidance) working improperly? It
    > sends too little data when we use small packets.


    It is trying to work properly, but you deliberately sabotaged it by
    disabling a critical part of its functioning (Nagle).

    DS

  13. Re: I found a bug in the tcp congestion control in linux (kernels 2.6.x)

    On 12 Sep, 11:22, Rainer Weikusat wrote:
    > It is working as defined:
    >
    >     The sender starts by transmitting one segment and waiting for
    >     its ACK. When that ACK is received, the congestion window is
    >     incremented from one to two, and two segments can be sent.
    >     When each of those two segments is acknowledged, the
    >     congestion window is increased to four.
    >     [RFC2001]


    In the same RFC (2001) you can find this:
    "When a new connection is established with a host on another network,
    the congestion window is initialized to one segment (i.e., the segment
    size announced by the other end, or the default, typically 536 or
    512)."

    If the window is defined as a number of segments, then why do they
    write about the size of the segment? It should not matter.

    BR,
    Riki




  14. Re: I found a bug in the tcp congestion control in linux (kernels 2.6.x)

    On 12 Sep, 13:14, piotr.rich...@gmail.com wrote:
    > On 12 Sep, 11:22, Rainer Weikusat wrote:
    >
    > > It is working as defined:

    >
    > >     The sender starts by transmitting one segment and waiting for
    > >     its ACK. When that ACK is received, the congestion window is
    > >     incremented from one to two, and two segments can be sent.
    > >     When each of those two segments is acknowledged, the
    > >     congestion window is increased to four.
    > >     [RFC2001]

    >
    > In the same RFC (2001) you can find this:
    > "When a new connection is established with a host on another network,
    > the congestion window is initialized to one segment (i.e., the segment
    > size announced by the other end, or the default, typically 536 or
    > 512)."
    >
    > If the window is defined as a number of segments, then why do they
    > write about the size of the segment? It should not matter.
    >
    > BR,
    > Riki


    I have some more evidence that cwnd is defined in bytes, not in the
    number of packets:

    RFC 2581: "The congestion window (cwnd) is a sender-side limit on the
    amount of data the sender can transmit into the network before
    receiving an acknowledgment (ACK)." <- They talk about the amount of
    data, not about the number of packets, that the sender can transmit.

    RFC 2581: "IW, the initial value of cwnd, MUST be less than or equal
    to 2*SMSS bytes and MUST NOT be more than 2 segments." <- We can see
    that the initial congestion window is defined in bytes.

    RFC 813, about the advertised window: "This number of bytes, called the
    window, is the maximum which the sender is permitted to transmit
    until the receiver returns some additional window."
    We know that the sender can transmit up to the minimum of the
    congestion window and the advertised window. If the advertised window
    is defined in bytes, the congestion window should also be defined in
    bytes.



    P.S. You can read about using TCP for VoIP and about the Nagle
    algorithm here: http://www1.cs.columbia.edu/~salman/...ucs-023-07.pdf
    I was running the experiments described in this document, and during
    them I noticed the (in my opinion) wrong behaviour of congestion
    control in Linux 2.6.x kernels.


    BR,
    Riki







  15. Re: I found a bug in the tcp congestion control in linux (kernels 2.6.x)

    On Sep 12, 12:24 pm, piotr.rich...@gmail.com wrote:

    > I have some more evidence that cwnd is defined in bytes, not in the
    > number of packets:


    > RFC 2581: "The congestion window (cwnd) is a sender-side limit on the
    > amount of data the sender can transmit into the network before
    > receiving an acknowledgment (ACK)." <- They talk about the amount of
    > data, not about the number of packets, that the sender can transmit.


    The "amount of data" can be measured in bytes, packets, or any other
    way you like.

    > RFC 2581: "IW, the initial value of cwnd, MUST be less than or equal
    > to 2*SMSS bytes and MUST NOT be more than 2 segments." <- We can see
    > that the initial congestion window is defined in bytes.


    Umm, it says it "MUST NOT be more than 2 segments". There are *two*
    limits imposed here -- it must be less than or equal to 2*SMSS bytes
    and must also be less than or equal to 2 segments.

    > RFC 813, about the advertised window: "This number of bytes, called the
    > window, is the maximum which the sender is permitted to transmit
    > until the receiver returns some additional window."


    The advertised window and the congestion window are completely
    different things that serve completely different purposes.

    > We know that the sender can transmit up to the minimum of the
    > congestion window and the advertised window. If the advertised window
    > is defined in bytes, the congestion window should also be defined in
    > bytes.


    That would completely break slow start. In typical cases, this would
    allow the transmitter to blast out hundreds of packets at full speed
    if an application disabled Nagle and repeatedly wrote 1-byte data
    units. That is an utterly insane thing for any application to do.

    Again, you messed up the behavior by disabling Nagle. Nagle is what
    stops this from being a problem by avoiding small packets.

    DS
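
The "two limits" reading of the RFC 2581 sentence given in this post can be written down directly. This is a sketch of that interpretation only, with hypothetical function names; it is not taken from any TCP implementation.

```python
def packets_by_bytes_only(smss, payload, max_segments=2):
    """Initial-window packets if only the 2*SMSS byte limit applied."""
    return (max_segments * smss) // payload


def initial_window_packets(smss, payload, max_segments=2):
    """Initial-window packets when both limits apply at once:
    at most 2*SMSS bytes AND at most 2 segments."""
    return min(packets_by_bytes_only(smss, payload, max_segments), max_segments)


if __name__ == "__main__":
    for payload in (1460, 160, 1):
        print(payload, "byte payload:",
              packets_by_bytes_only(1460, payload), "by bytes only,",
              initial_window_packets(1460, payload), "with both limits")
```

With a 1-byte payload the byte limit alone would admit 2920 packets before the first ACK, while the combined rule still admits only 2. That is the runaway case the segment limit is meant to prevent.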

  16. Re: I found a bug in the tcp congestion control in linux (kernels 2.6.x)

    On 12 Sep, 22:07, David Schwartz wrote:
    > On Sep 12, 12:24 pm, piotr.rich...@gmail.com wrote:
    >
    > > I have some more evidence that cwnd is defined in bytes, not in the
    > > number of packets:
    > > RFC 2581: "The congestion window (cwnd) is a sender-side limit on the
    > > amount of data the sender can transmit into the network before
    > > receiving an acknowledgment (ACK)." <- They talk about the amount of
    > > data, not about the number of packets, that the sender can transmit.

    >
    > The "amount of data" can be measured in bytes, packets, or any other
    > way you like.
    >
    > > RFC 2581: "IW, the initial value of cwnd, MUST be less than or equal
    > > to 2*SMSS bytes and MUST NOT be more than 2 segments." <- We can see
    > > that the initial congestion window is defined in bytes.

    >
    > Umm, it says it "MUST NOT be more than 2 segments". There are *two*
    > limits imposed here -- it must be less than or equal to 2*SMSS bytes
    > and must also be less than or equal to 2 segments.
    >
    > > RFC 813, about the advertised window: "This number of bytes, called the
    > > window, is the maximum which the sender is permitted to transmit
    > > until the receiver returns some additional window."

    >
    > The advertised window and the congestion window are completely
    > different things that serve completely different purposes.
    >
    > > We know that the sender can transmit up to the minimum of the
    > > congestion window and the advertised window. If the advertised window
    > > is defined in bytes, the congestion window should also be defined in
    > > bytes.



    > That would completely break slow start. In typical cases, this would
    > allow the transmitter to blast out hundreds of packets at full speed
    > if an application disabled Nagle and repeatedly wrote 1-byte data
    > units. That is an utterly insane thing for any application to do.
    > Again, you messed up the behavior by disabling Nagle. Nagle is what
    > stops this from being a problem by avoiding small packets.


    We don't want to send hundreds of 1-byte packets. Only
    160-byte packets every 20 ms. That is not a lot.

    P.S. One more document about disabling Nagle algorithm:
    http://www.cs.columbia.edu/techreports/cucs-033-04.pdf


    BR,
    Riki




  17. Re: I found a bug in the tcp congestion control in linux (kernels 2.6.x)

    On Sep 12, 1:40 pm, piotr.rich...@gmail.com wrote:

    > We don't want to send hundreds of 1-byte packets. Only
    > 160-byte packets every 20 ms. That is not a lot.


    Then you don't want the congestion window to be measured in data
    bytes. Yet that seems to be exactly what you are asking for.

    In your view, what would stop the TCP implementation from sending
    hundreds of packets with a size of 1 byte, thus overloading links and
    trashing other applications' performance, if an application disabled
    Nagle and then called 'write' in a tight loop with a 1 byte data size?

    Or do you think normal TCP-using applications should be permitted to
    completely hose the network?

    > P.S. One more document about disabling Nagle algorithm:
    > http://www.cs.columbia.edu/techreports/cucs-033-04.pdf


    I'm trying to explain to you why what you're doing won't work, you
    keep insisting that it does work and then keep asking how to fix it.
    Don't you see a contradiction there?

    Disabling Nagle means you hit the congestion window faster. It means
    you need more packets and more data to send the same information. It's
    a lose all around.

    On the bright side, it won't hurt you very much. When it screws you
    over and causes problems, it will effectively turn itself off (because
    data will build up in the send queue, allowing the sender to fill its
    segments), allowing your connection to recover (due to the boosted
    performance) until it messes you up again.

    If all you care about is 90th percentile delay and jitter, then
    disable Nagle. Your own study shows that this makes things good
    enough. What? You actually care about things like worst case
    performance? You actually care about how quickly you can ramp up? Hmm,
    so why point to a study that doesn't measure that?

    DS

  18. Re: I found a bug in the tcp congestion control in linux (kernels 2.6.x)

    piotr.richter@gmail.com writes:
    > On 12 Sep, 13:14, piotr.rich...@gmail.com wrote:
    >> On 12 Sep, 11:22, Rainer Weikusat wrote:
    >>
    >> > It is working as defined:

    >>
    >> >     The sender starts by transmitting one segment and waiting for
    >> >     its ACK. When that ACK is received, the congestion window is
    >> >     incremented from one to two, and two segments can be sent.
    >> >     When each of those two segments is acknowledged, the
    >> >     congestion window is increased to four.
    >> >     [RFC2001]

    >>
    >> In the same RFC (2001) you can find this:
    >> "When a new connection is established with a host on another network,
    >> the congestion window is initialized to one segment (i.e., the segment
    >> size announced by the other end, or the default, typically 536 or
    >> 512)."
    >>
    >> If the window is defined as a number of segments, then why do they
    >> write about the size of the segment? It should not matter.


    [...]

    > I have some more evidence that cwnd is defined in bytes, not in the
    > number of packets:


    The text I quoted above describes an algorithm to send TCP segments.
    A 'tcp segment' is the internal PDU used by TCP to transmit whatever
    data it has available for transmission. The text says 'send one
    segment, then wait for an ack, then send two segments, wait for the
    acks, and so forth'. It does not say 'send fifty segments, then wait
    for fifty acks, then do something completely unspecified, because
    someone would rather like that'.

    [...]

    > P.S. You can read about using TCP for VoIP and about the Nagle
    > algorithm here:


    The TCP retransmission algorithm is not realtime-capable. No
    retransmission algorithm intended to always deliver all data that was
    sent, in the order it was sent, can be, because that directly
    conflicts with the requirement to transmit continuously generated new
    data at a fixed bitrate.

    That's again one of these apparently complicated 'stateful things
    changing over time' lots of people simply cannot grasp. So leave it.

  19. Re: I found a bug in the tcp congestion control in linux (kernels 2.6.x)

    In article <3456dc31-2948-4d93-a9f8-50d0ebe3bce9@34g2000hsh.googlegroups.com>,
    wrote:

    >We don't want to send hundreds of 1-byte packets. Only
    >160-byte packets every 20 ms. That is not a lot.


    And what happens when one of those packets gets lost along the way?

    --
    http://www.spinics.net/lists/kernel/

  20. Re: I found a bug in the tcp congestion control in linux (kernels 2.6.x)

    In article <88c6dd6b-8d31-45ba-8e2a-533f6a3be285@w1g2000prk.googlegroups.com>,
    David Schwartz wrote:

    >It is trying to work properly, but you deliberately sabotaged it by
    >disabling a critical part of its functioning (Nagle).


    Is it that, or is he just trying to use TCP for something it shouldn't
    be used for? Things will go very badly for VoIP once a packet gets lost
    and the TCP retransmission timeouts get involved. TCP was never
    intended for realtime streaming.

    --
    http://www.spinics.net/lists/
