Mysterious network IO behavior - Unix
This is a discussion on Mysterious network IO behavior - Unix ; I have a program which does this (pseudocode), where there is
one input and one or more outputs.
set all fd to be nonblocking
while(1){
do over all io streams {
call FD_SET (network and/or local)
}
select(read_set,write_set,NULL,timeout);
do over ...
-
Mysterious network IO behavior
I have a program which does this (pseudocode), where there is
one input and one or more outputs.
set all fd to be nonblocking
while(1){
do over all io streams {
call FD_SET (network and/or local)
}
select(read_set,write_set,NULL,timeout);
do over all io_streams{
check FD_ISSET;
if(ok_to_read){
read_data_into_ring_buffer;
}
else if(ok_to_write){
write_data_from_ring_buffer;
}
}
make_data_indices_agree
}
The program is a relay, it just reads from the input and copies to the
outputs. It uses a single ring buffer (large, about 5Mb) and there are
separate data indices for each IO stream so that the reads always go
into the leading edge of the data block, "above" and "below" the
existing data in the buffer, and the writes come from the trailing end
of the data. After each cycle the indices are all compared so that
"above" is adjusted to reflect any data read in and "below" reflects any
data read out.
Anyway, the odd thing is when this reads from the network, and writes to
a local file (first) AND to the network (second) it runs at 11.4Mb/sec,
which is pretty much full speed for this 100baseT network. Yet, when it
reads from the network, and ONLY writes to the network file, it runs at
about half that speed. In other words, when it ALSO (first) writes to a
local disk, then the "network relay" part runs at full network speed,
but when it JUST relays, which is less code executing, it runs SLOWER.
It just started doing this after some minor changes to the code,
essentially just changing some values which had been constants set by
#define to variables - with the same value. Before that it ran at
full speed with or without the extra copy to the local disk. By looking
at the Recv and Transmit counts in ifconfig when it runs I can see that
the number of packets moved in the slow mode is higher than in the fast
mode, although at the receiving end of the relay the exact same data
shows up (eventually) in either case.
My best guess right now is that there is some timing issue uncovered by
the minor change to the code, such that if the select(), FD_SET,
FD_ISSET, or the read()/write() from/to the fd are hit
at the "wrong time" (in this instance, too early) it somehow interferes
with the actual network transfers, perhaps by placing a lock on the
network buffer at just the wrong time, and that probably triggering a
brief timeout, perhaps in the driver.
I tried to run gprof() on this, but unfortunately it seems to not like
this code very much and returns nothing useful in timing information.
(Possibly because it spends 99% of its time in system calls.) Valgrind
shows no memory access problems (but of course runs much slower).
This is on Mandriva 2007.1, gcc 4.1.2, 3c59x network driver.
ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:E0:81:22:CC:3D
inet addr:192.168.1.1 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::2e0:81ff:fe22:cc3d/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:17232365 errors:0 dropped:0 overruns:68178 frame:0
TX packets:8912276 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4086293930 (3.8 GiB) TX bytes:3793003690 (3.5 GiB)
Interrupt:16
ethtool eth0
Settings for eth0:
Supported ports: [ TP MII ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
Advertised auto-negotiation: Yes
Speed: 100Mb/s
Duplex: Full
Port: MII
PHYAD: 24
Transceiver: internal
Auto-negotiation: on
Current message level: 0x00000001 (1)
Link detected: yes
Any ideas what (else) might cause this effect, or more importantly, how
to avoid it? Some kernel parameter to tweak perhaps?
Thanks,
David Mathog
-
Re: Mysterious network IO behavior
David Mathog wrote:
> Anyway, the odd thing is when this reads from the network, and writes to
> a local file (first) AND to the network (second) it runs at 11.4Mb/sec,
> which is pretty much full speed for this 100baseT network. Yet, when it
> reads from the network, and ONLY writes to the network file, it runs at
> about half that speed.
A bit more research deepens the mystery, at least for me.
On looking at the ifconfig there is a very good inverse correlation
between the delta overrun count during the run and the speed. The more
of the former, the less of the latter. Now here is the odd part, this
program also logs the number of bytes read/written, and these are always
seen as (small) multiples of 1448. (1448 = The MTU of 1500 minus header
sizes). The socket buffer sizes are known (196608 on both sides) and the
read values are never even 10% of this. So the receiver isn't being
overrun by being full, rather the interrupts are apparently not (always)
being serviced on time. There is nothing else major running on the test
nodes, but the background gstat, pvm, nfs, and assorted other packets
may pass through from time to time.
The program has a minwrite parameter (the number of bytes which must
have been read from the input into the ring buffer before an attempt is
made to write from the ring buffer to the output.) By setting that to
13500 (9 * 1500) the throughput can reach 11.4Mb/sec in pure
network relay mode (no local write required). Yes, I checked,
TCP_NODELAY is NOT set on these sockets. Using minwrite works
sometimes, other times it makes no difference. When it works the
speed is high and the overruns are zero, when if fails, the speed is low
and the overruns high.
This all seems to come back to the paradoxical observation that a short
delay is required between reading from the receiver and writing to the
transmitter for maximum throughput to be obtained. If that's true, then
possibly it would be better to wait a bit on the reads as well, picking
up N packets at a time instead of 1 by 1 as they come in? Perhaps drop
a pause in before the select and just wait, umm, .005 seconds between
reads? 12.5Mb/sec *.005 = 62500, and that should fit into
the recv/xmit buffers with plenty of room to spare.
Yes, that does get rid of the overruns, using a usleep() with a value
between 2500 and 7500 dropped the overruns to only 0<->2. The number of
packets returned by read() is much larger now, and the number returned
by write() is usually exactly the same, so the output buffer is not
backing up (it might however be emptying completely, which would reduce
throughput). The throughput on the pure relay was at best 8.9Mb/sec (at
5000 usecond delay). Without the usleep() but using either a local
write, or a minwrite of 13500, a throughput of 11.4 Mb/sec was achieved.
I'm groping around in the dark here. Who can tell me where the
interference is which is resulting in these odd timing interactions? It
is really hard optimizing code when making one part faster slows down
the program as a whole...
Thanks,
David Mathog
-
Re: Mysterious network IO behavior
David Mathog wrote:
Other info:
iptables is not running on the compute nodes.
Here are what I think are the relevant kernel parameters, anything in
here which might be related to this thread?
sysctl -a | grep ipv4.tcp
net.ipv4.tcp_slow_start_after_idle = 1
net.ipv4.tcp_workaround_signed_windows = 0
net.ipv4.tcp_base_mss = 512
net.ipv4.tcp_mtu_probing = 0
net.ipv4.tcp_abc = 0
net.ipv4.tcp_congestion_control = cubic
net.ipv4.tcp_tso_win_divisor = 3
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_no_metrics_save = 0
net.ipv4.tcp_low_latency = 0
net.ipv4.tcp_frto = 0
net.ipv4.tcp_tw_reuse = 0
net.ipv4.tcp_adv_win_scale = 2
net.ipv4.tcp_app_win = 31
net.ipv4.tcp_rmem = 4096 87380 1048576
net.ipv4.tcp_wmem = 4096 16384 1048576
net.ipv4.tcp_mem = 24576 32768 49152
net.ipv4.tcp_dsack = 1
net.ipv4.tcp_ecn = 0
net.ipv4.tcp_reordering = 3
net.ipv4.tcp_fack = 1
net.ipv4.tcp_orphan_retries = 0
net.ipv4.tcp_max_syn_backlog = 1024
net.ipv4.tcp_rfc1337 = 0
net.ipv4.tcp_stdurg = 0
net.ipv4.tcp_abort_on_overflow = 0
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_fin_timeout = 60
net.ipv4.tcp_retries2 = 15
net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_max_tw_buckets = 180000
net.ipv4.tcp_max_orphans = 32768
net.ipv4.tcp_synack_retries = 5
net.ipv4.tcp_syn_retries = 5
net.ipv4.tcp_retrans_collapse = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
Thanks,
David Mathog
-
Re: Mysterious network IO behavior
David Mathog wrote:
> David Mathog wrote:
> > Anyway, the odd thing is when this reads from the network, and
> > writes to a local file (first) AND to the network (second) it runs
> > at 11.4Mb/sec, which is pretty much full speed for this 100baseT
> > network. Yet, when it reads from the network, and ONLY writes to
> > the network file, it runs at about half that speed.
> A bit more research deepens the mystery, at least for me.
> On looking at the ifconfig there is a very good inverse correlation
> between the delta overrun count during the run and the speed. The more
> of the former, the less of the latter.
Overrun's of what specifically? NIC buffers? Something else?
> Now here is the odd part, this program also logs the number of bytes
> read/written, and these are always seen as (small) multiples of
> 1448. (1448 = The MTU of 1500 minus header sizes). The socket
> buffer sizes are known (196608 on both sides)
Only if you explicitly set them and under Linux (based on your next
posting in this thread) you get 2X what you request to the core mem
limits. Netperf has had to deal with this for a while - I finally
broke down and added yet another socket buffer size report in the new
"omni" tests - the "_end" socket buffer size since if one accepts
defaults on linux, there will be the value you get at the beginning,
and the value you have reached by the end of the connection.
> and the read values are never even 10% of this. So the receiver
> isn't being overrun by being full, rather the interrupts are
> apparently not (always) being serviced on time. There is nothing
> else major running on the test nodes, but the background gstat, pvm,
> nfs, and assorted other packets may pass through from time to time.
The socket buffer for TCP should _never_ be overun as that implies a
failure of the TCP flow control.
> The program has a minwrite parameter (the number of bytes which must
> have been read from the input into the ring buffer before an attempt
> is made to write from the ring buffer to the output.) By setting
> that to 13500 (9 * 1500) the throughput can reach 11.4Mb/sec in pure
> network relay mode (no local write required).
I presume you see larger bytes per recv() call in that situation?
> Yes, I checked, TCP_NODELAY is NOT set on these sockets. Using
> minwrite works sometimes, other times it makes no difference. When
> it works the speed is high and the overruns are zero, when if fails,
> the speed is low and the overruns high.
When send()ing anything >= the MSS for the connection, TCP_NODELAY
should be a noop.
> This all seems to come back to the paradoxical observation that a
> short delay is required between reading from the receiver and
> writing to the transmitter for maximum throughput to be obtained.
> If that's true, then possibly it would be better to wait a bit on
> the reads as well, picking up N packets at a time instead of 1 by 1
> as they come in? Perhaps drop a pause in before the select and just
> wait, umm, .005 seconds between reads? 12.5Mb/sec *.005 = 62500,
> and that should fit into the recv/xmit buffers with plenty of room
> to spare.
You might have issues with the effective granularity of the usleep()
call on the systems on which you run.
> Yes, that does get rid of the overruns, using a usleep() with a value
> between 2500 and 7500 dropped the overruns to only 0<->2. The number of
> packets returned by read() is much larger now, and the number returned
> by write() is usually exactly the same,
Unless you've enable non-blocking mode on the sockets I would expect
that the write() calls would always be returning the quantity asked.
Are you using blocking or non-blocking sockets here?
rick jones
--
Wisdom Teeth are impacted, people are affected by the effects of events.
these opinions are mine, all mine; HP might not want them anyway... 
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
-
Re: Mysterious network IO behavior
David Mathog wrote:
> Hmm, maybe it only locks on the second call in readnbytes_noblock_n
> which sets EAGAIN? I'll try it again without that.
That changed the dynamics a bit but it still has slow transfers with the
second read() removed and the code to skip 8/9 reads removed. In other
words, the extra read() may be doing something timing related, but it is
not the primary problem. The same odd interactions still appear when
the second read (and second write) are not used.
Regards,
David Mathog
-
Re: Mysterious network IO behavior
> > Overrun's of what specifically? NIC buffers? Something else?
> Not the recv. buffer, that never filled up, as indicated by the read()
> never coming back with a full buffer. I think it indicates
> some sort of failure to service an interrupt on time.
> >
> >> Now here is the odd part, this program also logs the number of bytes
> >> read/written, and these are always seen as (small) multiples of
> >> 1448. (1448 = The MTU of 1500 minus header sizes). The socket
> >> buffer sizes are known (196608 on both sides)
> >
> > Only if you explicitly set them and under Linux (based on your next
> > posting in this thread) you get 2X what you request to the core mem
> > limits.
> The value cited was returned with a getsocket(). If the 2X bug
They don't consider it a bug, they consider it a feature...
> applies there as well then the buffer is half what I cited, but I
> don't think so, because by setting long usleep() the value got above
> 100K.
With linux's receive buffer moderation, the value for SO_RCVBUF is
dependent on when you make the setsockopt() call. If you make it at
the beginning (right after connect() or accept() it may be smaller
than if you make the call after some data has been transferred over
the connection. IIRC the same can happen with SO_SNDBUF.
> >> The program has a minwrite parameter (the number of bytes which must
> >> have been read from the input into the ring buffer before an attempt
> >> is made to write from the ring buffer to the output.) By setting
> >> that to 13500 (9 * 1500) the throughput can reach 11.4Mb/sec in pure
> >> network relay mode (no local write required).
> >
> > I presume you see larger bytes per recv() call in that situation?
> Yes, it goes up more or less linearly with delay time, with a fair
> amount of variation.
Aren't the sizes of your writes then a function of the receives (in
the absence of minwrite). Wait, scratch that, I just saw the
following:
> > Unless you've enable non-blocking mode on the sockets I would
> > expect that the write() calls would always be returning the
> > quantity asked.
> >
> > Are you using blocking or non-blocking sockets here?
> Non-blocking.
And I presume you are using either select() or poll() to wait for
either of the sockets to be readable/writeable?
> I just did another experiment where rather than using usleep(), the code
> just kept a counter so that it would only read every 9th cycle where
> fd_isset on the input fd came back positive, but it would write every
> cycle if it could. That had no significant overruns (0-2) and ran at
Sorry to be dense, but overruns of what exactly again?
> This experiment is a bit better though, since it narrows the problem
> down to the 9x fewer read and write attempts, everything else being the
> same. The read code is just:
> if(count_reads >= bundle_reads){
> count_reads=0;
> ret = read_rb_noblock_n(nt->io_fd[i], &shadowrb[i], nt->size,
> nt->minread);
> }
> read_rb_noblock_n calls readnbytes_noblock_n once or twice (because of
> the ring buffer wrap), and readnbytes_noblock is shown below in its
> entirety. Since "n" is quite large it normally hits read() once,
> returning everything in the recv buffer, then read() comes back with -1
> and errno==EAGAIN, at which point it exits. Perhaps it would be better
> to just take the first thing which comes back and exit, skipping the
> second attempt at a read() once the buffer is probably (usually) empty.
> It was written the way it is because it wasn't clear if every
> implementation was going to be able to return everything in the recv
> buffer in one fell swoop.
If you pass-in a buffer of N bytes, and a non-blocking read/recv on
the socket returns < N bytes, you can and should (IMO) reasonably
assume there are no more bytes to be read on the socket. That extra
read() call will only waste CPU cycles. And you most certainly do
_NOT_ want to sit there spinning on a read() call, you want to come
back up to a select/poll/whatnot.
Only if you actually get N bytes should you assume there might be
other bytes there to be read.
> The write code is more or less the same, but ends in write().
Only if the write() call doesn't write all N bytes should you ever
consider going back around, and even then, you shouldn't just sit
there and keep trying to write() to the socket, you should come back
to a select/poll/whatnot to await the socket becoming writable again.
Spinning to write all the bytes is just going to waste CPU cycles.
rick jones
--
firebug n, the idiot who tosses a lit cigarette out his car window
these opinions are mine, all mine; HP might not want them anyway... 
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
-
Re: Mysterious network IO behavior
David Mathog writes:
[...]
> Now here is the odd part, this program also logs the number of
> bytes read/written, and these are always seen as (small) multiples
> of 1448. (1448 = The MTU of 1500 minus header sizes).
There is nothing odd about that: It means that you are receiving data
in small, integral multiples of the TCP MSS, or, putting this into
other words: That the program reads data somewhat slower than data
arrives from the network (otherwise, there would be exactly one
segment read per call).
> The socket buffer sizes are known (196608 on both sides) and the
> read values are never even 10% of this. So the receiver isn't
> being overrun by being full, rather the interrupts are apparently
> not (always) being serviced on time.
A better guess would be (and please note that this is only a guess,
because examining the driver would be necessary to actually determine
the details) that the network RX soft interrupt does not run often
enough to pull skbs off the RX ring fast enough to prevent received
frames from being dropped because there is no space on the ring to
store them.
> The program has a minwrite parameter (the number of bytes which must
> have been read from the input into the ring buffer before an attempt
> is made to write from the ring buffer to the output.) By setting that
> to 13500 (9 * 1500) the throughput can reach 11.4Mb/sec in pure
> network relay mode (no local write required).
.... and this problem could well be avoided by being less aggressive
wrt running RX and TX at the same time.
> Yes, I checked, TCP_NODELAY is NOT set on these sockets.
What makes you believe this should relate to the phenomenon you are
observing anyhow? The Nagle algorithm basically just prevents a
non-fullsized segment from being transmitted if a previous,
non-fullsized segment has not yet been acknowledged.
[...]
> I'm groping around in the dark here. Who can tell me where the
> interference is which is resulting in these odd timing interactions?
> It is really hard optimizing code when making one part faster slows
> down the program as a whole...
If you spend more CPU time transmitting, eg, because of less
externally introduced delays in the transmit path, there will be less
CPU time available for reception.
I would consider this to be fairly straight-forward.
-
Re: Mysterious network IO behavior
Rick Jones wrote:
>> read_rb_noblock_n calls readnbytes_noblock_n once or twice (because of
>> the ring buffer wrap), and readnbytes_noblock is shown below in its
>> entirety. Since "n" is quite large it normally hits read() once,
>> returning everything in the recv buffer, then read() comes back with -1
>> and errno==EAGAIN, at which point it exits. Perhaps it would be better
>> to just take the first thing which comes back and exit, skipping the
>> second attempt at a read() once the buffer is probably (usually) empty.
>> It was written the way it is because it wasn't clear if every
>> implementation was going to be able to return everything in the recv
>> buffer in one fell swoop.
>
> If you pass-in a buffer of N bytes, and a non-blocking read/recv on
> the socket returns < N bytes, you can and should (IMO) reasonably
> assume there are no more bytes to be read on the socket. That extra
> read() call will only waste CPU cycles. And you most certainly do
> _NOT_ want to sit there spinning on a read() call, you want to come
> back up to a select/poll/whatnot.
>
> Only if you actually get N bytes should you assume there might be
> other bytes there to be read.
I went over and over the man pages, and nowhere could I find a statement
that "when read() tries to remove N bytes from a buffer
with M bytes in it, M
I could find was that once FD_ISSET came up positive that something was
guaranteed to be returned.
Anyway, I think I will take the extra read() out since clearly what you
say is what happens most of the time, if not maybe all the time. I
thought perhaps the second call could pick up more bytes in some
instances, but now I think that it is more likely that whatever is
keeping some of the bytes from being read the first time is probably
still going to be in the same state on an immediate second read.
Anyway, this still doesn't fix the original problem - these weird
timing issues persists when the one shot read()/write() methods are used.
Thanks,
David Mathog
-
Re: Mysterious network IO behavior
David Mathog wrote:
> Rick Jones wrote:
> > Only if you actually get N bytes should you assume there might be
> > other bytes there to be read.
> I went over and over the man pages, and nowhere could I find a
> statement that "when read() tries to remove N bytes from a buffer
> with M bytes in it, M
> I could find was that once FD_ISSET came up positive that something
> was guaranteed to be returned.
Manpages are notoriously incomplete, and linux's are no exception, but
since your read() and write() calls ultimately end-up as recv() and
send() calls (or at least functionally equivalent) you might also
peruse those manpages. Also, something like "Unix Network
Programming" by Stevens, Fenner and Rudoff would be a good reference
to have onhand - either in your personal collection, or via a nearby
library.
rick jones
--
denial, anger, bargaining, depression, acceptance, rebirth...
where do you want to be today?
these opinions are mine, all mine; HP might not want them anyway... 
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
-
Re: Mysterious network IO behavior
Rainer Weikusat wrote:
> David Mathog writes:
>
> [...]
>
>> Now here is the odd part, this program also logs the number of
>> bytes read/written, and these are always seen as (small) multiples
>> of 1448. (1448 = The MTU of 1500 minus header sizes).
>
> There is nothing odd about that:
Right, the part that was odd followed that.
>> The program has a minwrite parameter (the number of bytes which must
>> have been read from the input into the ring buffer before an attempt
>> is made to write from the ring buffer to the output.) By setting that
>> to 13500 (9 * 1500) the throughput can reach 11.4Mb/sec in pure
>> network relay mode (no local write required).
>
> ... and this problem could well be avoided by being less aggressive
> wrt running RX and TX at the same time.
I do not understand what you mean. The program is supposed to copy from
the input to the outputs with the goal of moving data as fast as
possible through the relay. In what way could it be made "less
aggressive" and still maintain maximum throughput.
>
>> Yes, I checked, TCP_NODELAY is NOT set on these sockets.
>
> What makes you believe this should relate to the phenomenon you are
> observing anyhow?
Earlier versions of the program had a bug which caused TCP_NODELAY to be
set. That resulted in performance degradation similar to what I'm
seeing now. Eliminating TCP_NODELAY let it run full speed, however
later minor changes (see earlier postings) reintroduced the throughput
issues. It may be that TCP_NODELAY was not the problem earlier per se,
but rather that it acted similarly to the later code changes, to uncover
timing issues.
> I would consider this to be fairly straight-forward.
Please look through the Rick Jones thread. I was able to increase
throughput by having the program skip 8 out of 9 reads following a valid
FD_ISSET, or to a lesser extent, by just setting a 5msec usleep before
the select(), or by having it first write to disk before writing to the
network. That is, by having the program waste time spinning, or
sleeping, or writing somewhere else, the network throughput increased.
That looks to me like some sort of problem with resource locks or
collisions triggering brief timeouts. I would like to know what it is
exactly so that the code can try to compensate for this effect, whatever
it turns out to be.
Regards,
David Mathog
-
Re: Mysterious network IO behavior
On Feb 4, 1:22 pm, Rainer Weikusat wrote:
> There is nothing odd about that: It means that you are receiving data
> in small, integral multiples of the TCP MSS, or, putting this into
> other words: That the program reads data somewhat slower than data
> arrives from the network (otherwise, there would be exactly one
> segment read per call).
Actually, the program could be reading data much faster than data is
arriving from the network. There is no guarantee that a 'read' call
will get all the data the network card has received. Interrupt
coalescing, for example, could cause the 'read' call to run wickedly
fast and still get several packets worth of data per call.
For example:
1) You call 'read'.
2) A packet arrives, no interrupt is generated. (It is delayed/
coalesced)
3) Another packet arrives, an interrupt is generated.
4) All the packets are processed and the 'read' call is unblocked.
5) The process runs, gets the data from both packets, and calls 'read'
again long before the next packet is even on the wire.
DS
-
Re: Mysterious network IO behavior
On Feb 4, 9:19 am, David Mathog wrote:
> Any ideas what (else) might cause this effect, or more importantly, how
> to avoid it? Some kernel parameter to tweak perhaps?
What does your CPU usage look like? If it's maxed, you probably are
putting descriptors in the write select set even when you have to data
to write. That's a *huge* mistake.
DS
-
Re: Mysterious network IO behavior
Okay, let's run through the basics:
1) Do you set the sockets non-blocking? (If not, that's your problem.)
2) How do you decide which sockets to put in the write set? (If you
put a socket in the write set even when you have no data to write to
it, that's your problem.)
3) Do you attempt a 'write' as soon as you have data to write or do
you wait for a hit from 'select'? (Ideally, you should only add the
socket to the 'write' set when you get a WOULDBLOCK indication.)
DS
-
Re: Mysterious network IO behavior
David Schwartz writes:
> On Feb 4, 1:22 pm, Rainer Weikusat wrote:
>
>> There is nothing odd about that: It means that you are receiving data
>> in small, integral multiples of the TCP MSS, or, putting this into
>> other words: That the program reads data somewhat slower than data
>> arrives from the network (otherwise, there would be exactly one
>> segment read per call).
>
> Actually, the program could be reading data much faster than data is
> arriving from the network.
No, it couldn't. This isn't even a question of how the various
software parts of the stack interact: If each read call returns more
than one SDU, the program reads the data slower than it arrives: For
any given time interval, there are n calls to read and m SDUs returned
and m = x * n, x > 1. Let t be the time interval, than there is a
'read speed' of n / t and a recv speed of m / t. Because m = x * n,
m / t = (x * n) / t
and because of x > 1
(x * n) / t > n / t
A statement like 'the program is reading data faster than it arrives'
is impossible in its own right: At most, the program could read the
data as fast as it arrives, ie n = m.
BTW, actual (Linux-)NIC drivers work the other way round: They don't delay
received frames in the (possibly vain) hope that other frames will
arrive in feature, but start 'receiving' once the first frame is
available and continue to copy received frames off the RX DMA ring of
the NIC without re-enabling interrupts for as long as there are frames
to copy, cf
http://www.linux-foundation.org/en/Net:NAPI
-
Re: Mysterious network IO behavior
David Mathog writes:
[...]
> I do not understand what you mean.
Understanding without listening hardly ever happens.
-
Re: Mysterious network IO behavior
Rainer Weikusat wrote:
> BTW, actual (Linux-)NIC drivers work the other way round: They don't
> delay received frames in the (possibly vain) hope that other frames
> will arrive in feature, but start 'receiving' once the first frame
> is available and continue to copy received frames off the RX DMA
> ring of the NIC without re-enabling interrupts for as long as there
> are frames to copy, cf
> http://www.linux-foundation.org/en/Net:NAPI
My experience with netperf TCP_RR performance over a number of
drivers/NICs suggests that there may be more to it.
ftp://ftp.cup.hp.com/dist/networking...cy_vs_tput.txt
If there wasn't "something" waiting for more traffic, I would expect
the netperf single-byte, single-transaction-outstanding, TCP_RR test
to always report best latency (transactions per second) since it only
ever has one request or response outstanding at a time on the
connection, so packet arrival should "always" be to an otherwise idle
NIC (under test conditions that is). My guess has been actual NIC HW
interrupt mitigation settings.
rick jones
--
oxymoron n, commuter in a gas-guzzling luxury SUV with an American flag
these opinions are mine, all mine; HP might not want them anyway... 
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
-
Re: Mysterious network IO behavior
David Schwartz wrote:
> Okay, let's run through the basics:
>
> 1) Do you set the sockets non-blocking? (If not, that's your problem.)
They are non-blocking.
>
> 2) How do you decide which sockets to put in the write set? (If you
> put a socket in the write set even when you have no data to write to
> it, that's your problem.)
input is placed in read_set if the buffer has space to accept input
output is placed in write_set if the buffer holds at least minwrite
bytes of data. (normally minwrite is 1)
These all reference the same buffer. I tried increasing the size of
that buffer well beyond the total size of the input (so that input
would always go in) and it still had slow runs. So "filling up the
input buffer" isn't the problem. As was mentioned previously,the
receive buffer never fills all the way either. The transmit buffer
is definitely not always ready to accept input though, unclear if
this is because it is full though.
>
> 3) Do you attempt a 'write' as soon as you have data to write or do
> you wait for a hit from 'select'? (Ideally, you should only add the
> socket to the 'write' set when you get a WOULDBLOCK indication.)
It waits for this select to come back:
ret = select(fdmax,&read_set,&write_set,NULL,NULL);
Elsewhere there was a suggestion that CPU usage be examined - that
turned out to be interesting With "top -d 1" running in another window
"fast" runs, which averaged about 11.4Mb/s ran at a sustained
CPU usage of 20-25%. Oddly, slow runs ran at only 10-15% CPU usage,
peak. But there was nothing else using time - this was the fastest
process, it just didn't go as high as before. "nice -20" on the process
didn't make any difference, neither did turning off sge, gmond, and the
system health monitor, which calls sensors every 5 seconds.
Just in case I wasn't clear, SOMETIMES the pure relay will run fast,
just (a lot) less often then when the timing is jiggered with writes to
the local disk, or a bunch of wasted cycles are used to slow down the
reads slightly.
Here is a summary of the observations for the straight relay (no messing
about with extra cycles or local writes) for a SLOW and a FAST run
SLOW
CPU Usage peak: 10-15%
delta RX 42511 (packets)
delta TX 53444 (packets)
delta overruns: 1983 (many)
read() calls: 20415 (varies A LOT, +- 8000 or so)
reads of 1448: 18801
write() calls: 18755(varies A LOT)
writes of 1448: 17159
bytes relayed; 40960000
FAST
CPU Usage peak: 20-25%
delta RX 42606 (packets)
delta TX 42669 (packets. NOTE: TX very close to RX, plus a bit more)
delta overruns: few (0-2)
read() calls: 27275 (varies A LITTLE)
reads of 1448: 26349
write() calls: 22707 (varies A LOT)
writes of 1448: 6805
bytes relayed; 40960000
The delta values are from ifconfig eth0 before and after the run on the
relay node. The CPU usage is from "top -d 1" run at the same time. The
other information is from fprintf(stderr) statements in the program,
which are directed to /tmp/stderr.log during the run and then checked
later. Note that when the program is running slowly the majority of
single packet reads (1448 bytes) are written out before another packet
is read. When it is running quickly the majority of reads are for
single packets and a second or more packets is usually sent. (If all
input packets were 1448, then the most reads() possible would be
40960000/1448 => 28288. However, some packets are observed smaller than
this.) This can happen fairly efficiently because the loop looks like this:
while(1){
set up FD
select();
if(FD_ISSET indicates input_ready)read_all_input;
if(FD_ISSET indicates output_ready)write_all_output;
}
So it often does things like this:
cycle i
read 1448
write 0 (output not ready)
cycle i+1
read 1448
write 2496 (now output ready, write data from this and previous cycle)
Anyway, it looks to me like when things slow down, the reads become
rate limiting. FD_ISSET shows data available less often for read,
and when the data comes back from the read it is often several
packets worth. (However, it isn't ever a full recv buffer's worth.)
Conversely, the output seems to be less congested, since more of the
single packet input is able to go out as single packet output. That
is consistent - if the data comes in slower there is less pressure
on the output part of the relay.
Others factoid, for topology:
A->B->C
where node B relays, B is often slow. However, for
A->B
or
B->C
where node B either just receives or just transmits, it always
runs quickly with no tricks required.
ifconfig doesn't show a lot of other traffic. In 60 seconds on node A
there were 528 TX and 525 RX, it does not seem to be too bursty since
several 10 second measurements were all about 1/5th of that. Similarly
ifconfig on the node B showed almost the same numbers. Since even a
slow run only takes about 6 seconds we are only talking about another 50
RX and 50 TX packets, which shouldn't be enough to cause this.
Regards,
David Mathog
-
Re: Mysterious network IO behavior
Rick Jones wrote:
> Rainer Weikusat wrote:
>
>>BTW, actual (Linux-)NIC drivers work the other way round: They don't
>>delay received frames in the (possibly vain) hope that other frames
>>will arrive in feature, but start 'receiving' once the first frame
>>is available and continue to copy received frames off the RX DMA
>>ring of the NIC without re-enabling interrupts for as long as there
>>are frames to copy, cf
>
>
>> http://www.linux-foundation.org/en/Net:NAPI
>
>
> My experience with netperf TCP_RR performance over a number of
> drivers/NICs suggests that there may be more to it.
>
> ftp://ftp.cup.hp.com/dist/networking...cy_vs_tput.txt
There are interactions between InterruptThrottleRate and NAPI...in the
case of e1000 7.6.12 using NAPI the throttle rate only comes into effect
if it's set to "dynamic conservative" mode, and then it only applies
once there are no more packets to process.
> If there wasn't "something" waiting for more traffic, I would expect
> the netperf single-byte, single-transaction-outstanding, TCP_RR test
> to always report best latency (transactions per second) since it only
> ever has one request or response outstanding at a time on the
> connection, so packet arrival should "always" be to an otherwise idle
> NIC (under test conditions that is). My guess has been actual NIC HW
> interrupt mitigation settings.
The other possibility is that in the multiple-outstanding-packet case
you get a latency win for successive packets because you don't have the
overhead of an additional interrupt...the cpu is already processing the
first packet, so the incremental cost of additional packets is lower.
Chris
-
Re: Mysterious network IO behavior
David Mathog wrote:
> It waits for this select to come back:
> ret = select(fdmax,&read_set,&write_set,NULL,NULL);
> Elsewhere there was a suggestion that CPU usage be examined - that
> turned out to be interesting With "top -d 1" running in another
> window "fast" runs, which averaged about 11.4Mb/s ran at a sustained
> CPU usage of 20-25%. Oddly, slow runs ran at only 10-15% CPU usage,
> peak. But there was nothing else using time - this was the fastest
> process, it just didn't go as high as before. "nice -20" on the
> process didn't make any difference, neither did turning off sge,
> gmond, and the system health monitor, which calls sensors every 5
> seconds.
Is this a single CPU (core) system?
> Just in case I wasn't clear, SOMETIMES the pure relay will run fast,
> just (a lot) less often then when the timing is jiggered with writes to
> the local disk, or a bunch of wasted cycles are used to slow down the
> reads slightly.
> Here is a summary of the observations for the straight relay (no messing
> about with extra cycles or local writes) for a SLOW and a FAST run
> SLOW
> CPU Usage peak: 10-15%
> delta RX 42511 (packets)
> delta TX 53444 (packets)
> delta overruns: 1983 (many)
> read() calls: 20415 (varies A LOT, +- 8000 or so)
> reads of 1448: 18801
> write() calls: 18755(varies A LOT)
> writes of 1448: 17159
> bytes relayed; 40960000
> FAST
> CPU Usage peak: 20-25%
> delta RX 42606 (packets)
> delta TX 42669 (packets. NOTE: TX very close to RX, plus a bit more)
> delta overruns: few (0-2)
> read() calls: 27275 (varies A LITTLE)
> reads of 1448: 26349
> write() calls: 22707 (varies A LOT)
> writes of 1448: 6805
> bytes relayed; 40960000
> The delta values are from ifconfig eth0 before and after the run on the
> relay node.
Ok, so the overruns are from ifconfig - I'm guessing that ethtool
would report them too. That suggests that the driver wasn't
replenishing inbound DMA buffers fast enough, so you might want to
consider increasing the inbound DMA ring size - if it is possible to
do, it would be done via ethtool. Those drops almost certainly mean
the data source has had to retransmit what it was sending, which means
it will have started doing things like shrink its congestion window,
which will limit how much data it will have outstanding at one time.
> The CPU usage is from "top -d 1" run at the same time.
Depending on the CPU count, your CPU usage numbers could represent a
single CPU saturating.
> while(1){
> set up FD
> select();
> if(FD_ISSET indicates input_ready)read_all_input;
> if(FD_ISSET indicates output_ready)write_all_output;
> }
> So it often does things like this:
> cycle i
> read 1448
> write 0 (output not ready)
> cycle i+1
> read 1448
> write 2496 (now output ready, write data from this and previous cycle)
I think that David already suggested this - since write_all_output
already has to deal with not being able to write everything at one
time (and you aren't spinning in write_all_output until it is done
RIGHT?-) you should just go ahead and call write_all_output after
read_all_input, and only set the bit in the writefds when you aren't
able to put everything into the socket. So something like:
while(1) {
select()
if (input_ready) {
read_all_input;
write_all_output;
}
if (have_output) && output_ready {
write_all_output;
}
where write_all_output does the "right" thing wrt setting the
output_ready bit from the current return of select(), and what it
should be for the next call to select(), all based on whether or not
it gets all its data into the socket buffer in the one call. Again,
there should be _NO_ spinning on non-blocking writes in the
write_all_output routine... (nor on reads in the read_all_input
routine)
rick jones
--
web2.0 n, the dot.com reunion tour...
these opinions are mine, all mine; HP might not want them anyway... 
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
-
Re: Mysterious network IO behavior
David Mathog wrote:
> input is placed in read_set if the buffer has space to accept input
> output is placed in write_set if the buffer holds at least minwrite
> bytes of data. (normally minwrite is 1)
>
> These all reference the same buffer. I tried increasing the size of
> that buffer well beyond the total size of the input (so that input
> would always go in) and it still had slow runs. So "filling up the
> input buffer" isn't the problem.
Correction, this is wrong.
Counters were added to see how many consecutive times FD_SET() was
used for an fd and was NOT followed by a successful read/write in the
same cycle. In some of the slow runs extremely high "miss" counts were
seen on the transmit fd, with up to 5M bytes stored in the buffer 2525
"misses". That corresponds to the output stalling for .4 seconds (or
so). However, in other instances equally slow runs were seen with
no significant misses whatsoever.
Watching "top -d 1" without this program running didn't turn up anything
which was spiking up to a significant CPU load, and previous
measurements showed no spikes in network load.
There's something bizarre going on here but I cannot see what it might
be. All the data ends up where it is supposed to, it just takes varying
amounts of time to get there.
Regards,
David Mathog