Mysterious network IO behavior - Unix

This is a discussion on Mysterious network IO behavior - Unix ; I have a program which does this (pseudocode), where there is one input and one or more outputs. set all fd to be nonblocking while(1){ do over all io streams { call FD_SET (network and/or local) } select(read_set,write_set,NULL,timeout); do over ...

+ Reply to Thread
Page 1 of 2 1 2 LastLast
Results 1 to 20 of 37

Thread: Mysterious network IO behavior

  1. Mysterious network IO behavior

    I have a program which does this (pseudocode), where there is
    one input and one or more outputs.


    set all fd to be nonblocking
    while(1){
    do over all io streams {
    call FD_SET (network and/or local)
    }
    select(read_set,write_set,NULL,timeout);
    do over all io_streams{
    check FD_ISSET;
    if(ok_to_read){
    read_data_into_ring_buffer;
    }
    else if(ok_to_write){
    write_data_from_ring_buffer;
    }
    }
    make_data_indices_agree
    }


    The program is a relay, it just reads from the input and copies to the
    outputs. It uses a single ring buffer (large, about 5Mb) and there are
    separate data indices for each IO stream so that the reads always go
    into the leading edge of the data block, "above" and "below" the
    existing data in the buffer, and the writes come from the trailing end
    of the data. After each cycle the indices are all compared so that
    "above" is adjusted to reflect any data read in and "below" reflects any
    data read out.


    Anyway, the odd thing is when this reads from the network, and writes to
    a local file (first) AND to the network (second) it runs at 11.4Mb/sec,
    which is pretty much full speed for this 100baseT network. Yet, when it
    reads from the network, and ONLY writes to the network file, it runs at
    about half that speed. In other words, when it ALSO (first) writes to a
    local disk, then the "network relay" part runs at full network speed,
    but when it JUST relays, which is less code executing, it runs SLOWER.
    It just started doing this after some minor changes to the code,
    essentially just changing some values which had been constants set by
    #define to variables - with the same value. Before that it ran at
    full speed with or without the extra copy to the local disk. By looking
    at the Recv and Transmit counts in ifconfig when it runs I can see that
    the number of packets moved in the slow mode is higher than in the fast
    mode, although at the receiving end of the relay the exact same data
    shows up (eventually) in either case.

    My best guess right now is that there is some timing issue uncovered by
    the minor change to the code, such that if the select(), FD_SET,
    FD_ISSET, or the read()/write() from/to the fd are hit
    at the "wrong time" (in this instance, too early) it somehow interferes
    with the actual network transfers, perhaps by placing a lock on the
    network buffer at just the wrong time, and that probably triggering a
    brief timeout, perhaps in the driver.

    I tried to run gprof() on this, but unfortunately it seems to not like
    this code very much and returns nothing useful in timing information.
    (Possibly because it spends 99% of its time in system calls.) Valgrind
    shows no memory access problems (but of course runs much slower).

    This is on Mandriva 2007.1, gcc 4.1.2, 3c59x network driver.

    ifconfig eth0
    eth0 Link encap:Ethernet HWaddr 00:E0:81:22:CC:3D
    inet addr:192.168.1.1 Bcast:192.168.1.255 Mask:255.255.255.0
    inet6 addr: fe80::2e0:81ff:fe22:cc3d/64 Scope:Link
    UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
    RX packets:17232365 errors:0 dropped:0 overruns:68178 frame:0
    TX packets:8912276 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:1000
    RX bytes:4086293930 (3.8 GiB) TX bytes:3793003690 (3.5 GiB)
    Interrupt:16
    ethtool eth0
    Settings for eth0:
    Supported ports: [ TP MII ]
    Supported link modes: 10baseT/Half 10baseT/Full
    100baseT/Half 100baseT/Full
    Supports auto-negotiation: Yes
    Advertised link modes: 10baseT/Half 10baseT/Full
    100baseT/Half 100baseT/Full
    Advertised auto-negotiation: Yes
    Speed: 100Mb/s
    Duplex: Full
    Port: MII
    PHYAD: 24
    Transceiver: internal
    Auto-negotiation: on
    Current message level: 0x00000001 (1)
    Link detected: yes



    Any ideas what (else) might cause this effect, or more importantly, how
    to avoid it? Some kernel parameter to tweak perhaps?

    Thanks,

    David Mathog

  2. Re: Mysterious network IO behavior

    David Mathog wrote:

    > Anyway, the odd thing is when this reads from the network, and writes to
    > a local file (first) AND to the network (second) it runs at 11.4Mb/sec,
    > which is pretty much full speed for this 100baseT network. Yet, when it
    > reads from the network, and ONLY writes to the network file, it runs at
    > about half that speed.


    A bit more research deepens the mystery, at least for me.

    On looking at the ifconfig there is a very good inverse correlation
    between the delta overrun count during the run and the speed. The more
    of the former, the less of the latter. Now here is the odd part, this
    program also logs the number of bytes read/written, and these are always
    seen as (small) multiples of 1448. (1448 = The MTU of 1500 minus header
    sizes). The socket buffer sizes are known (196608 on both sides) and the
    read values are never even 10% of this. So the receiver isn't being
    overrun by being full, rather the interrupts are apparently not (always)
    being serviced on time. There is nothing else major running on the test
    nodes, but the background gstat, pvm, nfs, and assorted other packets
    may pass through from time to time.

    The program has a minwrite parameter (the number of bytes which must
    have been read from the input into the ring buffer before an attempt is
    made to write from the ring buffer to the output.) By setting that to
    13500 (9 * 1500) the throughput can reach 11.4Mb/sec in pure
    network relay mode (no local write required). Yes, I checked,
    TCP_NODELAY is NOT set on these sockets. Using minwrite works
    sometimes, other times it makes no difference. When it works the
    speed is high and the overruns are zero, when if fails, the speed is low
    and the overruns high.

    This all seems to come back to the paradoxical observation that a short
    delay is required between reading from the receiver and writing to the
    transmitter for maximum throughput to be obtained. If that's true, then
    possibly it would be better to wait a bit on the reads as well, picking
    up N packets at a time instead of 1 by 1 as they come in? Perhaps drop
    a pause in before the select and just wait, umm, .005 seconds between
    reads? 12.5Mb/sec *.005 = 62500, and that should fit into
    the recv/xmit buffers with plenty of room to spare.

    Yes, that does get rid of the overruns, using a usleep() with a value
    between 2500 and 7500 dropped the overruns to only 0<->2. The number of
    packets returned by read() is much larger now, and the number returned
    by write() is usually exactly the same, so the output buffer is not
    backing up (it might however be emptying completely, which would reduce
    throughput). The throughput on the pure relay was at best 8.9Mb/sec (at
    5000 usecond delay). Without the usleep() but using either a local
    write, or a minwrite of 13500, a throughput of 11.4 Mb/sec was achieved.

    I'm groping around in the dark here. Who can tell me where the
    interference is which is resulting in these odd timing interactions? It
    is really hard optimizing code when making one part faster slows down
    the program as a whole...

    Thanks,

    David Mathog

  3. Re: Mysterious network IO behavior

    David Mathog wrote:

    Other info:
    iptables is not running on the compute nodes.

    Here are what I think are the relevant kernel parameters, anything in
    here which might be related to this thread?

    sysctl -a | grep ipv4.tcp
    net.ipv4.tcp_slow_start_after_idle = 1
    net.ipv4.tcp_workaround_signed_windows = 0
    net.ipv4.tcp_base_mss = 512
    net.ipv4.tcp_mtu_probing = 0
    net.ipv4.tcp_abc = 0
    net.ipv4.tcp_congestion_control = cubic
    net.ipv4.tcp_tso_win_divisor = 3
    net.ipv4.tcp_moderate_rcvbuf = 1
    net.ipv4.tcp_no_metrics_save = 0
    net.ipv4.tcp_low_latency = 0
    net.ipv4.tcp_frto = 0
    net.ipv4.tcp_tw_reuse = 0
    net.ipv4.tcp_adv_win_scale = 2
    net.ipv4.tcp_app_win = 31
    net.ipv4.tcp_rmem = 4096 87380 1048576
    net.ipv4.tcp_wmem = 4096 16384 1048576
    net.ipv4.tcp_mem = 24576 32768 49152
    net.ipv4.tcp_dsack = 1
    net.ipv4.tcp_ecn = 0
    net.ipv4.tcp_reordering = 3
    net.ipv4.tcp_fack = 1
    net.ipv4.tcp_orphan_retries = 0
    net.ipv4.tcp_max_syn_backlog = 1024
    net.ipv4.tcp_rfc1337 = 0
    net.ipv4.tcp_stdurg = 0
    net.ipv4.tcp_abort_on_overflow = 0
    net.ipv4.tcp_tw_recycle = 0
    net.ipv4.tcp_syncookies = 1
    net.ipv4.tcp_fin_timeout = 60
    net.ipv4.tcp_retries2 = 15
    net.ipv4.tcp_retries1 = 3
    net.ipv4.tcp_keepalive_intvl = 75
    net.ipv4.tcp_keepalive_probes = 9
    net.ipv4.tcp_keepalive_time = 7200
    net.ipv4.tcp_max_tw_buckets = 180000
    net.ipv4.tcp_max_orphans = 32768
    net.ipv4.tcp_synack_retries = 5
    net.ipv4.tcp_syn_retries = 5
    net.ipv4.tcp_retrans_collapse = 1
    net.ipv4.tcp_sack = 1
    net.ipv4.tcp_window_scaling = 1
    net.ipv4.tcp_timestamps = 1

    Thanks,

    David Mathog

  4. Re: Mysterious network IO behavior

    David Mathog wrote:
    > David Mathog wrote:


    > > Anyway, the odd thing is when this reads from the network, and
    > > writes to a local file (first) AND to the network (second) it runs
    > > at 11.4Mb/sec, which is pretty much full speed for this 100baseT
    > > network. Yet, when it reads from the network, and ONLY writes to
    > > the network file, it runs at about half that speed.


    > A bit more research deepens the mystery, at least for me.


    > On looking at the ifconfig there is a very good inverse correlation
    > between the delta overrun count during the run and the speed. The more
    > of the former, the less of the latter.


    Overrun's of what specifically? NIC buffers? Something else?

    > Now here is the odd part, this program also logs the number of bytes
    > read/written, and these are always seen as (small) multiples of
    > 1448. (1448 = The MTU of 1500 minus header sizes). The socket
    > buffer sizes are known (196608 on both sides)


    Only if you explicitly set them and under Linux (based on your next
    posting in this thread) you get 2X what you request to the core mem
    limits. Netperf has had to deal with this for a while - I finally
    broke down and added yet another socket buffer size report in the new
    "omni" tests - the "_end" socket buffer size since if one accepts
    defaults on linux, there will be the value you get at the beginning,
    and the value you have reached by the end of the connection.

    > and the read values are never even 10% of this. So the receiver
    > isn't being overrun by being full, rather the interrupts are
    > apparently not (always) being serviced on time. There is nothing
    > else major running on the test nodes, but the background gstat, pvm,
    > nfs, and assorted other packets may pass through from time to time.


    The socket buffer for TCP should _never_ be overun as that implies a
    failure of the TCP flow control.

    > The program has a minwrite parameter (the number of bytes which must
    > have been read from the input into the ring buffer before an attempt
    > is made to write from the ring buffer to the output.) By setting
    > that to 13500 (9 * 1500) the throughput can reach 11.4Mb/sec in pure
    > network relay mode (no local write required).


    I presume you see larger bytes per recv() call in that situation?

    > Yes, I checked, TCP_NODELAY is NOT set on these sockets. Using
    > minwrite works sometimes, other times it makes no difference. When
    > it works the speed is high and the overruns are zero, when if fails,
    > the speed is low and the overruns high.


    When send()ing anything >= the MSS for the connection, TCP_NODELAY
    should be a noop.

    > This all seems to come back to the paradoxical observation that a
    > short delay is required between reading from the receiver and
    > writing to the transmitter for maximum throughput to be obtained.
    > If that's true, then possibly it would be better to wait a bit on
    > the reads as well, picking up N packets at a time instead of 1 by 1
    > as they come in? Perhaps drop a pause in before the select and just
    > wait, umm, .005 seconds between reads? 12.5Mb/sec *.005 = 62500,
    > and that should fit into the recv/xmit buffers with plenty of room
    > to spare.


    You might have issues with the effective granularity of the usleep()
    call on the systems on which you run.

    > Yes, that does get rid of the overruns, using a usleep() with a value
    > between 2500 and 7500 dropped the overruns to only 0<->2. The number of
    > packets returned by read() is much larger now, and the number returned
    > by write() is usually exactly the same,


    Unless you've enable non-blocking mode on the sockets I would expect
    that the write() calls would always be returning the quantity asked.

    Are you using blocking or non-blocking sockets here?

    rick jones
    --
    Wisdom Teeth are impacted, people are affected by the effects of events.
    these opinions are mine, all mine; HP might not want them anyway...
    feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

  5. Re: Mysterious network IO behavior

    David Mathog wrote:

    > Hmm, maybe it only locks on the second call in readnbytes_noblock_n
    > which sets EAGAIN? I'll try it again without that.


    That changed the dynamics a bit but it still has slow transfers with the
    second read() removed and the code to skip 8/9 reads removed. In other
    words, the extra read() may be doing something timing related, but it is
    not the primary problem. The same odd interactions still appear when
    the second read (and second write) are not used.

    Regards,

    David Mathog

  6. Re: Mysterious network IO behavior

    > > Overrun's of what specifically? NIC buffers? Something else?

    > Not the recv. buffer, that never filled up, as indicated by the read()
    > never coming back with a full buffer. I think it indicates
    > some sort of failure to service an interrupt on time.


    > >
    > >> Now here is the odd part, this program also logs the number of bytes
    > >> read/written, and these are always seen as (small) multiples of
    > >> 1448. (1448 = The MTU of 1500 minus header sizes). The socket
    > >> buffer sizes are known (196608 on both sides)

    > >
    > > Only if you explicitly set them and under Linux (based on your next
    > > posting in this thread) you get 2X what you request to the core mem
    > > limits.


    > The value cited was returned with a getsocket(). If the 2X bug


    They don't consider it a bug, they consider it a feature...

    > applies there as well then the buffer is half what I cited, but I
    > don't think so, because by setting long usleep() the value got above
    > 100K.


    With linux's receive buffer moderation, the value for SO_RCVBUF is
    dependent on when you make the setsockopt() call. If you make it at
    the beginning (right after connect() or accept() it may be smaller
    than if you make the call after some data has been transferred over
    the connection. IIRC the same can happen with SO_SNDBUF.


    > >> The program has a minwrite parameter (the number of bytes which must
    > >> have been read from the input into the ring buffer before an attempt
    > >> is made to write from the ring buffer to the output.) By setting
    > >> that to 13500 (9 * 1500) the throughput can reach 11.4Mb/sec in pure
    > >> network relay mode (no local write required).

    > >
    > > I presume you see larger bytes per recv() call in that situation?


    > Yes, it goes up more or less linearly with delay time, with a fair
    > amount of variation.


    Aren't the sizes of your writes then a function of the receives (in
    the absence of minwrite). Wait, scratch that, I just saw the
    following:

    > > Unless you've enable non-blocking mode on the sockets I would
    > > expect that the write() calls would always be returning the
    > > quantity asked.
    > >
    > > Are you using blocking or non-blocking sockets here?


    > Non-blocking.


    And I presume you are using either select() or poll() to wait for
    either of the sockets to be readable/writeable?


    > I just did another experiment where rather than using usleep(), the code
    > just kept a counter so that it would only read every 9th cycle where
    > fd_isset on the input fd came back positive, but it would write every
    > cycle if it could. That had no significant overruns (0-2) and ran at


    Sorry to be dense, but overruns of what exactly again?

    > This experiment is a bit better though, since it narrows the problem
    > down to the 9x fewer read and write attempts, everything else being the
    > same. The read code is just:



    > if(count_reads >= bundle_reads){
    > count_reads=0;
    > ret = read_rb_noblock_n(nt->io_fd[i], &shadowrb[i], nt->size,
    > nt->minread);
    > }


    > read_rb_noblock_n calls readnbytes_noblock_n once or twice (because of
    > the ring buffer wrap), and readnbytes_noblock is shown below in its
    > entirety. Since "n" is quite large it normally hits read() once,
    > returning everything in the recv buffer, then read() comes back with -1
    > and errno==EAGAIN, at which point it exits. Perhaps it would be better
    > to just take the first thing which comes back and exit, skipping the
    > second attempt at a read() once the buffer is probably (usually) empty.
    > It was written the way it is because it wasn't clear if every
    > implementation was going to be able to return everything in the recv
    > buffer in one fell swoop.


    If you pass-in a buffer of N bytes, and a non-blocking read/recv on
    the socket returns < N bytes, you can and should (IMO) reasonably
    assume there are no more bytes to be read on the socket. That extra
    read() call will only waste CPU cycles. And you most certainly do
    _NOT_ want to sit there spinning on a read() call, you want to come
    back up to a select/poll/whatnot.

    Only if you actually get N bytes should you assume there might be
    other bytes there to be read.

    > The write code is more or less the same, but ends in write().


    Only if the write() call doesn't write all N bytes should you ever
    consider going back around, and even then, you shouldn't just sit
    there and keep trying to write() to the socket, you should come back
    to a select/poll/whatnot to await the socket becoming writable again.
    Spinning to write all the bytes is just going to waste CPU cycles.

    rick jones
    --
    firebug n, the idiot who tosses a lit cigarette out his car window
    these opinions are mine, all mine; HP might not want them anyway...
    feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

  7. Re: Mysterious network IO behavior

    David Mathog writes:

    [...]

    > Now here is the odd part, this program also logs the number of
    > bytes read/written, and these are always seen as (small) multiples
    > of 1448. (1448 = The MTU of 1500 minus header sizes).


    There is nothing odd about that: It means that you are receiving data
    in small, integral multiples of the TCP MSS, or, putting this into
    other words: That the program reads data somewhat slower than data
    arrives from the network (otherwise, there would be exactly one
    segment read per call).

    > The socket buffer sizes are known (196608 on both sides) and the
    > read values are never even 10% of this. So the receiver isn't
    > being overrun by being full, rather the interrupts are apparently
    > not (always) being serviced on time.


    A better guess would be (and please note that this is only a guess,
    because examining the driver would be necessary to actually determine
    the details) that the network RX soft interrupt does not run often
    enough to pull skbs off the RX ring fast enough to prevent received
    frames from being dropped because there is no space on the ring to
    store them.

    > The program has a minwrite parameter (the number of bytes which must
    > have been read from the input into the ring buffer before an attempt
    > is made to write from the ring buffer to the output.) By setting that
    > to 13500 (9 * 1500) the throughput can reach 11.4Mb/sec in pure
    > network relay mode (no local write required).


    .... and this problem could well be avoided by being less aggressive
    wrt running RX and TX at the same time.

    > Yes, I checked, TCP_NODELAY is NOT set on these sockets.


    What makes you believe this should relate to the phenomenon you are
    observing anyhow? The Nagle algorithm basically just prevents a
    non-fullsized segment from being transmitted if a previous,
    non-fullsized segment has not yet been acknowledged.

    [...]

    > I'm groping around in the dark here. Who can tell me where the
    > interference is which is resulting in these odd timing interactions?
    > It is really hard optimizing code when making one part faster slows
    > down the program as a whole...


    If you spend more CPU time transmitting, eg, because of less
    externally introduced delays in the transmit path, there will be less
    CPU time available for reception.

    I would consider this to be fairly straight-forward.

  8. Re: Mysterious network IO behavior

    Rick Jones wrote:

    >> read_rb_noblock_n calls readnbytes_noblock_n once or twice (because of
    >> the ring buffer wrap), and readnbytes_noblock is shown below in its
    >> entirety. Since "n" is quite large it normally hits read() once,
    >> returning everything in the recv buffer, then read() comes back with -1
    >> and errno==EAGAIN, at which point it exits. Perhaps it would be better
    >> to just take the first thing which comes back and exit, skipping the
    >> second attempt at a read() once the buffer is probably (usually) empty.
    >> It was written the way it is because it wasn't clear if every
    >> implementation was going to be able to return everything in the recv
    >> buffer in one fell swoop.

    >
    > If you pass-in a buffer of N bytes, and a non-blocking read/recv on
    > the socket returns < N bytes, you can and should (IMO) reasonably
    > assume there are no more bytes to be read on the socket. That extra
    > read() call will only waste CPU cycles. And you most certainly do
    > _NOT_ want to sit there spinning on a read() call, you want to come
    > back up to a select/poll/whatnot.
    >
    > Only if you actually get N bytes should you assume there might be
    > other bytes there to be read.


    I went over and over the man pages, and nowhere could I find a statement
    that "when read() tries to remove N bytes from a buffer
    with M bytes in it, M I could find was that once FD_ISSET came up positive that something was
    guaranteed to be returned.

    Anyway, I think I will take the extra read() out since clearly what you
    say is what happens most of the time, if not maybe all the time. I
    thought perhaps the second call could pick up more bytes in some
    instances, but now I think that it is more likely that whatever is
    keeping some of the bytes from being read the first time is probably
    still going to be in the same state on an immediate second read.

    Anyway, this still doesn't fix the original problem - these weird
    timing issues persists when the one shot read()/write() methods are used.

    Thanks,

    David Mathog

  9. Re: Mysterious network IO behavior

    David Mathog wrote:
    > Rick Jones wrote:
    > > Only if you actually get N bytes should you assume there might be
    > > other bytes there to be read.


    > I went over and over the man pages, and nowhere could I find a
    > statement that "when read() tries to remove N bytes from a buffer
    > with M bytes in it, M > I could find was that once FD_ISSET came up positive that something
    > was guaranteed to be returned.


    Manpages are notoriously incomplete, and linux's are no exception, but
    since your read() and write() calls ultimately end-up as recv() and
    send() calls (or at least functionally equivalent) you might also
    peruse those manpages. Also, something like "Unix Network
    Programming" by Stevens, Fenner and Rudoff would be a good reference
    to have onhand - either in your personal collection, or via a nearby
    library.

    rick jones
    --
    denial, anger, bargaining, depression, acceptance, rebirth...
    where do you want to be today?
    these opinions are mine, all mine; HP might not want them anyway...
    feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

  10. Re: Mysterious network IO behavior

    Rainer Weikusat wrote:
    > David Mathog writes:
    >
    > [...]
    >
    >> Now here is the odd part, this program also logs the number of
    >> bytes read/written, and these are always seen as (small) multiples
    >> of 1448. (1448 = The MTU of 1500 minus header sizes).

    >
    > There is nothing odd about that:


    Right, the part that was odd followed that.

    >> The program has a minwrite parameter (the number of bytes which must
    >> have been read from the input into the ring buffer before an attempt
    >> is made to write from the ring buffer to the output.) By setting that
    >> to 13500 (9 * 1500) the throughput can reach 11.4Mb/sec in pure
    >> network relay mode (no local write required).

    >
    > ... and this problem could well be avoided by being less aggressive
    > wrt running RX and TX at the same time.


    I do not understand what you mean. The program is supposed to copy from
    the input to the outputs with the goal of moving data as fast as
    possible through the relay. In what way could it be made "less
    aggressive" and still maintain maximum throughput.

    >
    >> Yes, I checked, TCP_NODELAY is NOT set on these sockets.

    >
    > What makes you believe this should relate to the phenomenon you are
    > observing anyhow?


    Earlier versions of the program had a bug which caused TCP_NODELAY to be
    set. That resulted in performance degradation similar to what I'm
    seeing now. Eliminating TCP_NODELAY let it run full speed, however
    later minor changes (see earlier postings) reintroduced the throughput
    issues. It may be that TCP_NODELAY was not the problem earlier per se,
    but rather that it acted similarly to the later code changes, to uncover
    timing issues.

    > I would consider this to be fairly straight-forward.


    Please look through the Rick Jones thread. I was able to increase
    throughput by having the program skip 8 out of 9 reads following a valid
    FD_ISSET, or to a lesser extent, by just setting a 5msec usleep before
    the select(), or by having it first write to disk before writing to the
    network. That is, by having the program waste time spinning, or
    sleeping, or writing somewhere else, the network throughput increased.
    That looks to me like some sort of problem with resource locks or
    collisions triggering brief timeouts. I would like to know what it is
    exactly so that the code can try to compensate for this effect, whatever
    it turns out to be.

    Regards,

    David Mathog

  11. Re: Mysterious network IO behavior

    On Feb 4, 1:22 pm, Rainer Weikusat wrote:

    > There is nothing odd about that: It means that you are receiving data
    > in small, integral multiples of the TCP MSS, or, putting this into
    > other words: That the program reads data somewhat slower than data
    > arrives from the network (otherwise, there would be exactly one
    > segment read per call).


    Actually, the program could be reading data much faster than data is
    arriving from the network. There is no guarantee that a 'read' call
    will get all the data the network card has received. Interrupt
    coalescing, for example, could cause the 'read' call to run wickedly
    fast and still get several packets worth of data per call.

    For example:

    1) You call 'read'.
    2) A packet arrives, no interrupt is generated. (It is delayed/
    coalesced)
    3) Another packet arrives, an interrupt is generated.
    4) All the packets are processed and the 'read' call is unblocked.
    5) The process runs, gets the data from both packets, and calls 'read'
    again long before the next packet is even on the wire.

    DS

  12. Re: Mysterious network IO behavior

    On Feb 4, 9:19 am, David Mathog wrote:

    > Any ideas what (else) might cause this effect, or more importantly, how
    > to avoid it? Some kernel parameter to tweak perhaps?


    What does your CPU usage look like? If it's maxed, you probably are
    putting descriptors in the write select set even when you have to data
    to write. That's a *huge* mistake.

    DS

  13. Re: Mysterious network IO behavior


    Okay, let's run through the basics:

    1) Do you set the sockets non-blocking? (If not, that's your problem.)

    2) How do you decide which sockets to put in the write set? (If you
    put a socket in the write set even when you have no data to write to
    it, that's your problem.)

    3) Do you attempt a 'write' as soon as you have data to write or do
    you wait for a hit from 'select'? (Ideally, you should only add the
    socket to the 'write' set when you get a WOULDBLOCK indication.)

    DS

  14. Re: Mysterious network IO behavior

    David Schwartz writes:
    > On Feb 4, 1:22 pm, Rainer Weikusat wrote:
    >
    >> There is nothing odd about that: It means that you are receiving data
    >> in small, integral multiples of the TCP MSS, or, putting this into
    >> other words: That the program reads data somewhat slower than data
    >> arrives from the network (otherwise, there would be exactly one
    >> segment read per call).

    >
    > Actually, the program could be reading data much faster than data is
    > arriving from the network.


    No, it couldn't. This isn't even a question of how the various
    software parts of the stack interact: If each read call returns more
    than one SDU, the program reads the data slower than it arrives: For
    any given time interval, there are n calls to read and m SDUs returned
    and m = x * n, x > 1. Let t be the time interval, than there is a
    'read speed' of n / t and a recv speed of m / t. Because m = x * n,

    m / t = (x * n) / t

    and because of x > 1

    (x * n) / t > n / t

    A statement like 'the program is reading data faster than it arrives'
    is impossible in its own right: At most, the program could read the
    data as fast as it arrives, ie n = m.

    BTW, actual (Linux-)NIC drivers work the other way round: They don't delay
    received frames in the (possibly vain) hope that other frames will
    arrive in feature, but start 'receiving' once the first frame is
    available and continue to copy received frames off the RX DMA ring of
    the NIC without re-enabling interrupts for as long as there are frames
    to copy, cf

    http://www.linux-foundation.org/en/Net:NAPI

  15. Re: Mysterious network IO behavior

    David Mathog writes:

    [...]

    > I do not understand what you mean.


    Understanding without listening hardly ever happens.

  16. Re: Mysterious network IO behavior

    Rainer Weikusat wrote:
    > BTW, actual (Linux-)NIC drivers work the other way round: They don't
    > delay received frames in the (possibly vain) hope that other frames
    > will arrive in feature, but start 'receiving' once the first frame
    > is available and continue to copy received frames off the RX DMA
    > ring of the NIC without re-enabling interrupts for as long as there
    > are frames to copy, cf


    > http://www.linux-foundation.org/en/Net:NAPI


    My experience with netperf TCP_RR performance over a number of
    drivers/NICs suggests that there may be more to it.

    ftp://ftp.cup.hp.com/dist/networking...cy_vs_tput.txt

    If there wasn't "something" waiting for more traffic, I would expect
    the netperf single-byte, single-transaction-outstanding, TCP_RR test
    to always report best latency (transactions per second) since it only
    ever has one request or response outstanding at a time on the
    connection, so packet arrival should "always" be to an otherwise idle
    NIC (under test conditions that is). My guess has been actual NIC HW
    interrupt mitigation settings.

    rick jones
    --
    oxymoron n, commuter in a gas-guzzling luxury SUV with an American flag
    these opinions are mine, all mine; HP might not want them anyway...
    feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

  17. Re: Mysterious network IO behavior

    David Schwartz wrote:
    > Okay, let's run through the basics:
    >
    > 1) Do you set the sockets non-blocking? (If not, that's your problem.)


    They are non-blocking.

    >
    > 2) How do you decide which sockets to put in the write set? (If you
    > put a socket in the write set even when you have no data to write to
    > it, that's your problem.)


    input is placed in read_set if the buffer has space to accept input
    output is placed in write_set if the buffer holds at least minwrite
    bytes of data. (normally minwrite is 1)

    These all reference the same buffer. I tried increasing the size of
    that buffer well beyond the total size of the input (so that input
    would always go in) and it still had slow runs. So "filling up the
    input buffer" isn't the problem. As was mentioned previously,the
    receive buffer never fills all the way either. The transmit buffer
    is definitely not always ready to accept input though, unclear if
    this is because it is full though.

    >
    > 3) Do you attempt a 'write' as soon as you have data to write or do
    > you wait for a hit from 'select'? (Ideally, you should only add the
    > socket to the 'write' set when you get a WOULDBLOCK indication.)


    It waits for this select to come back:

    ret = select(fdmax,&read_set,&write_set,NULL,NULL);

    Elsewhere there was a suggestion that CPU usage be examined - that
    turned out to be interesting With "top -d 1" running in another window
    "fast" runs, which averaged about 11.4Mb/s ran at a sustained
    CPU usage of 20-25%. Oddly, slow runs ran at only 10-15% CPU usage,
    peak. But there was nothing else using time - this was the fastest
    process, it just didn't go as high as before. "nice -20" on the process
    didn't make any difference, neither did turning off sge, gmond, and the
    system health monitor, which calls sensors every 5 seconds.

    Just in case I wasn't clear, SOMETIMES the pure relay will run fast,
    just (a lot) less often then when the timing is jiggered with writes to
    the local disk, or a bunch of wasted cycles are used to slow down the
    reads slightly.

    Here is a summary of the observations for the straight relay (no messing
    about with extra cycles or local writes) for a SLOW and a FAST run

    SLOW
    CPU Usage peak: 10-15%
    delta RX 42511 (packets)
    delta TX 53444 (packets)
    delta overruns: 1983 (many)
    read() calls: 20415 (varies A LOT, +- 8000 or so)
    reads of 1448: 18801
    write() calls: 18755(varies A LOT)
    writes of 1448: 17159
    bytes relayed; 40960000

    FAST
    CPU Usage peak: 20-25%
    delta RX 42606 (packets)
    delta TX 42669 (packets. NOTE: TX very close to RX, plus a bit more)
    delta overruns: few (0-2)
    read() calls: 27275 (varies A LITTLE)
    reads of 1448: 26349
    write() calls: 22707 (varies A LOT)
    writes of 1448: 6805
    bytes relayed; 40960000

    The delta values are from ifconfig eth0 before and after the run on the
    relay node. The CPU usage is from "top -d 1" run at the same time. The
    other information is from fprintf(stderr) statements in the program,
    which are directed to /tmp/stderr.log during the run and then checked
    later. Note that when the program is running slowly the majority of
    single packet reads (1448 bytes) are written out before another packet
    is read. When it is running quickly the majority of reads are for
    single packets and a second or more packets is usually sent. (If all
    input packets were 1448, then the most reads() possible would be
    40960000/1448 => 28288. However, some packets are observed smaller than
    this.) This can happen fairly efficiently because the loop looks like this:

    while(1){
    set up FD
    select();
    if(FD_ISSET indicates input_ready)read_all_input;
    if(FD_ISSET indicates output_ready)write_all_output;
    }


    So it often does things like this:

    cycle i
    read 1448
    write 0 (output not ready)
    cycle i+1
    read 1448
    write 2496 (now output ready, write data from this and previous cycle)

    Anyway, it looks to me like when things slow down, the reads become
    rate limiting. FD_ISSET shows data available less often for read,
    and when the data comes back from the read it is often several
    packets worth. (However, it isn't ever a full recv buffer's worth.)
    Conversely, the output seems to be less congested, since more of the
    single packet input is able to go out as single packet output. That
    is consistent - if the data comes in slower there is less pressure
    on the output part of the relay.


    Others factoid, for topology:

    A->B->C

    where node B relays, B is often slow. However, for

    A->B
    or
    B->C

    where node B either just receives or just transmits, it always
    runs quickly with no tricks required.

    ifconfig doesn't show a lot of other traffic. In 60 seconds on node A
    there were 528 TX and 525 RX, it does not seem to be too bursty since
    several 10 second measurements were all about 1/5th of that. Similarly
    ifconfig on the node B showed almost the same numbers. Since even a
    slow run only takes about 6 seconds we are only talking about another 50
    RX and 50 TX packets, which shouldn't be enough to cause this.

    Regards,

    David Mathog

  18. Re: Mysterious network IO behavior

    Rick Jones wrote:
    > Rainer Weikusat wrote:
    >
    >>BTW, actual (Linux-)NIC drivers work the other way round: They don't
    >>delay received frames in the (possibly vain) hope that other frames
    >>will arrive in feature, but start 'receiving' once the first frame
    >>is available and continue to copy received frames off the RX DMA
    >>ring of the NIC without re-enabling interrupts for as long as there
    >>are frames to copy, cf

    >
    >
    >> http://www.linux-foundation.org/en/Net:NAPI

    >
    >
    > My experience with netperf TCP_RR performance over a number of
    > drivers/NICs suggests that there may be more to it.
    >
    > ftp://ftp.cup.hp.com/dist/networking...cy_vs_tput.txt


    There are interactions between InterruptThrottleRate and NAPI...in the
    case of e1000 7.6.12 using NAPI the throttle rate only comes into effect
    if it's set to "dynamic conservative" mode, and then it only applies
    once there are no more packets to process.

    > If there wasn't "something" waiting for more traffic, I would expect
    > the netperf single-byte, single-transaction-outstanding, TCP_RR test
    > to always report best latency (transactions per second) since it only
    > ever has one request or response outstanding at a time on the
    > connection, so packet arrival should "always" be to an otherwise idle
    > NIC (under test conditions that is). My guess has been actual NIC HW
    > interrupt mitigation settings.


    The other possibility is that in the multiple-outstanding-packet case
    you get a latency win for successive packets because you don't have the
    overhead of an additional interrupt...the cpu is already processing the
    first packet, so the incremental cost of additional packets is lower.

    Chris

  19. Re: Mysterious network IO behavior

    David Mathog wrote:
    > It waits for this select to come back:


    > ret = select(fdmax,&read_set,&write_set,NULL,NULL);


    > Elsewhere there was a suggestion that CPU usage be examined - that
    > turned out to be interesting With "top -d 1" running in another
    > window "fast" runs, which averaged about 11.4Mb/s ran at a sustained
    > CPU usage of 20-25%. Oddly, slow runs ran at only 10-15% CPU usage,
    > peak. But there was nothing else using time - this was the fastest
    > process, it just didn't go as high as before. "nice -20" on the
    > process didn't make any difference, neither did turning off sge,
    > gmond, and the system health monitor, which calls sensors every 5
    > seconds.


    Is this a single CPU (core) system?

    > Just in case I wasn't clear, SOMETIMES the pure relay will run fast,
    > just (a lot) less often then when the timing is jiggered with writes to
    > the local disk, or a bunch of wasted cycles are used to slow down the
    > reads slightly.


    > Here is a summary of the observations for the straight relay (no messing
    > about with extra cycles or local writes) for a SLOW and a FAST run


    > SLOW
    > CPU Usage peak: 10-15%
    > delta RX 42511 (packets)
    > delta TX 53444 (packets)
    > delta overruns: 1983 (many)
    > read() calls: 20415 (varies A LOT, +- 8000 or so)
    > reads of 1448: 18801
    > write() calls: 18755(varies A LOT)
    > writes of 1448: 17159
    > bytes relayed; 40960000


    > FAST
    > CPU Usage peak: 20-25%
    > delta RX 42606 (packets)
    > delta TX 42669 (packets. NOTE: TX very close to RX, plus a bit more)
    > delta overruns: few (0-2)
    > read() calls: 27275 (varies A LITTLE)
    > reads of 1448: 26349
    > write() calls: 22707 (varies A LOT)
    > writes of 1448: 6805
    > bytes relayed; 40960000


    > The delta values are from ifconfig eth0 before and after the run on the
    > relay node.


    Ok, so the overruns are from ifconfig - I'm guessing that ethtool
    would report them too. That suggests that the driver wasn't
    replenishing inbound DMA buffers fast enough, so you might want to
    consider increasing the inbound DMA ring size - if it is possible to
    do, it would be done via ethtool. Those drops almost certainly mean
    the data source has had to retransmit what it was sending, which means
    it will have started doing things like shrink its congestion window,
    which will limit how much data it will have outstanding at one time.

    > The CPU usage is from "top -d 1" run at the same time.


    Depending on the CPU count, your CPU usage numbers could represent a
    single CPU saturating.

    > while(1){
    > set up FD
    > select();
    > if(FD_ISSET indicates input_ready)read_all_input;
    > if(FD_ISSET indicates output_ready)write_all_output;
    > }



    > So it often does things like this:


    > cycle i
    > read 1448
    > write 0 (output not ready)
    > cycle i+1
    > read 1448
    > write 2496 (now output ready, write data from this and previous cycle)


    I think that David already suggested this - since write_all_output
    already has to deal with not being able to write everything at one
    time (and you aren't spinning in write_all_output until it is done
    RIGHT?-) you should just go ahead and call write_all_output after
    read_all_input, and only set the bit in the writefds when you aren't
    able to put everything into the socket. So something like:

    while(1) {
    select()
    if (input_ready) {
    read_all_input;
    write_all_output;
    }
    if (have_output) && output_ready {
    write_all_output;
    }

    where write_all_output does the "right" thing wrt setting the
    output_ready bit from the current return of select(), and what it
    should be for the next call to select(), all based on whether or not
    it gets all its data into the socket buffer in the one call. Again,
    there should be _NO_ spinning on non-blocking writes in the
    write_all_output routine... (nor on reads in the read_all_input
    routine)

    rick jones
    --
    web2.0 n, the dot.com reunion tour...
    these opinions are mine, all mine; HP might not want them anyway...
    feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

  20. Re: Mysterious network IO behavior

    David Mathog wrote:

    > input is placed in read_set if the buffer has space to accept input
    > output is placed in write_set if the buffer holds at least minwrite
    > bytes of data. (normally minwrite is 1)
    >
    > These all reference the same buffer. I tried increasing the size of
    > that buffer well beyond the total size of the input (so that input
    > would always go in) and it still had slow runs. So "filling up the
    > input buffer" isn't the problem.


    Correction, this is wrong.

    Counters were added to see how many consecutive times FD_SET() was
    used for an fd and was NOT followed by a successful read/write in the
    same cycle. In some of the slow runs extremely high "miss" counts were
    seen on the transmit fd, with up to 5M bytes stored in the buffer 2525
    "misses". That corresponds to the output stalling for .4 seconds (or
    so). However, in other instances equally slow runs were seen with
    no significant misses whatsoever.

    Watching "top -d 1" without this program running didn't turn up anything
    which was spiking up to a significant CPU load, and previous
    measurements showed no spikes in network load.

    There's something bizarre going on here but I cannot see what it might
    be. All the data ends up where it is supposed to, it just takes varying
    amounts of time to get there.

    Regards,

    David Mathog

+ Reply to Thread
Page 1 of 2 1 2 LastLast