non_blocking question - Unix
-
non_blocking question
I have a question concerning write rates to different devices through the
same piece of code. This comes up in the context of optimizing
throughput in a single threaded program which is relaying data along a
network daisy chain: it reads from the upstream machine, then writes to
both the local node (either a pipe or to a disk) and also over the
network to the downstream node. This is all typical FD_SET and select()
code. The speed and buffer sizes of the network and local devices can
differ dramatically from one working environment to another.
Let "addr" be a buffer holding "N" bytes, of which at least "n" (n <= N)
are requested to be written, accessed through an fd which has been set
to the nonblocking state. This code is executed:
ret = write(fd, addr, n);
fd may be for a network socket, a pipe, or a disk file. The question is
how long will this process typically reside in the write() function for
the different devices, and how much data should one expect to be
transferred per call based on the size of "n"?
For instance, the buffers associated with a network socket are typically
pretty small, and it seems like "n" can be pretty much anything bigger
than that and the write will return quickly having sent a smallish chunk
of data into the network buffer. (The write does a memory to memory
copy on the computer, which is orders of magnitude faster than any of
these writes take to complete.) For disks though, making "n" large can
result in write() hanging around sending all of the data, which
decreases the amount of time available for network reads and writes.
Is there a good way to determine the optimal size of "n" at run time for
the different IO devices? Currently my code can be tuned by setting a
"maxwrite" parameter for each device (which limits "n" to no more than
that value), but optimizing throughput that way requires a fair amount
of experimentation for each new system.
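For reference, the capped write described above might look like the following minimal sketch. The names set_nonblocking and capped_write are illustrative, not taken from the poster's program, and "maxwrite" is simply the per-device tuning parameter mentioned in the post. One assumption worth stating: on Linux, O_NONBLOCK generally has no effect on regular disk files, so a large write() to a file can still stay in the kernel until all the data has been copied, which is consistent with the behaviour described for disks.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Put fd into non-blocking mode; returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * Try to write at most maxwrite of the n pending bytes at addr.
 * Returns the number of bytes the kernel accepted (possibly fewer than
 * requested), 0 if the descriptor cannot take data right now, or -1 on
 * a real error. Names are hypothetical, chosen for this sketch only.
 */
ssize_t capped_write(int fd, const char *addr, size_t n, size_t maxwrite)
{
    size_t want = n < maxwrite ? n : maxwrite;
    ssize_t ret;

    do {
        ret = write(fd, addr, want);
    } while (ret == -1 && errno == EINTR);   /* retry if a signal interrupted us */

    if (ret == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;   /* buffer full: go back to select() and try later */
    return ret;
}
```

The caller advances its buffer offset by whatever capped_write() returns and goes back to select(); a return of 0 just means "not now".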
Thanks,
David Mathog
-
Re: non_blocking question
On a related note, have any of you ever seen programs which can be used
as rate controlled data sources and syncs? Currently pretty much all
linux/unix variants provides this:
dd if=/dev/zero bs=4196 count=100000 > /dev/null
but both the source (dd) and sync (/dev/null) run as fast as they
possibly can. Even on old systems like an Athlon MP the above command
moves 3.4GB/sec. Real IO is much, much slower. So those sorts of
sources and syncs are much too fast for emulating on a single node the
rate issues discussed in the previous post. It would be really handy if
this was possible:
mkfifo /tmp/in
mkfifo /tmp/out1
mkfifo /tmp/out2
dsource -rate 10000000 > /tmp/in &
dsync -rate 10000000 < /tmp/out1 &
dsync -rate 60000000 < /tmp/out2 &
testprogram -in /tmp/in -out1 /tmp/out1 -out2 /tmp/out2
That is, where the dsource and dsync programs emit and absorb data at
the specified rates (in MB/s). Not bursty though, at that rate even on
short time scales.
Regards,
David Mathog
-
Re: non_blocking question
David Mathog wrote:
> On a related note, have any of you ever seen programs which can be used
> as rate controlled data sources and syncs? Currently pretty much all
> linux/unix variants provides this:
>
> dd if=/dev/zero bs=4196 count=100000 > /dev/null
>
> but both the source (dd) and sync (/dev/null) run as fast as they
> possibly can. Even on old systems like an Athlon MP the above command
> moves 3.4GB/sec. Real IO is much, much slower. So those sorts of
> sources and syncs are much too fast for emulating on a single node the
> rate issues discussed in the previous post. It would be really handy if
> this was possible:
>
> mkfifo /tmp/in
> mkfifo /tmp/out1
> mkfifo /tmp/out2
> dsource -rate 10000000 > /tmp/in &
> dsync -rate 10000000 < /tmp/out1 &
> dsync -rate 60000000 < /tmp/out2 &
> testprogram -in /tmp/in -out1 /tmp/out1 -out2 /tmp/out2
>
> That is, where the dsource and dsync programs emit and absorb data at
> the specified rates (in MB/s). Not bursty though, at that rate even on
> short time scales.
Hmmm. The programs should be pretty simple to write if you
don't get absurdly anal about avoiding burstiness. If you really
really care about it you've got a real-time problem and a whole
slew of additional issues. You'll most likely want to distinguish
between time lost while blocked (output FIFO full, input FIFO empty)
and time elapsed in other activities, which you can probably do by
using non-blocking mode plus select, poll, /dev/poll, or the alert
mechanism of your choice.
Pre-emptive bug report: "dsync" should be "dsink".
Observation: If your example is intended to be realistic, you
obviously use much faster hardware than I do. Eighty terabytes a
second is *really* cranking ...
--
Eric.Sosman@sun.com
-
Re: non_blocking question
Eric Sosman wrote:
> Hmmm. The programs should be pretty simple to write if you
> don't get absurdly anal about avoiding burstiness. If you really
> really care about it you've got a real-time problem and a whole
> slew of additional issues. You'll most likely want to distinguish
> between time lost while blocked (output FIFO full, input FIFO empty)
> and time elapsed in other activities, which you can probably do by
> using non-blocking mode plus select, poll, /dev/poll, or the alert
> mechanism of your choice.
>
I thought about doing it using nanosleep() between writes/reads to do IO
at a steady rate (R, say 1000 times per second) with a fixed buffer size
(S), giving a data transfer rate of R * S, and the user can specify both
R and S. However at least on Linux that seems like it probably will not
work very well since:
The current implementation of nanosleep() is based on the normal
kernel timer mechanism, which has a resolution of 1/HZ s (see
time(7)). Therefore, nanosleep() pauses always for at least the
specified time, however it can take up to 10 ms longer than
specified until the process becomes runnable again. For the same
reason, the value returned in case of a delivered signal in *rem
is usually rounded to the next larger multiple of 1/HZ s.
10 ms is a pretty big chunk of time when each interval is supposed to be
1 ms. Dropping to 100 IO operations per second would likely get out
of that hole, but that would raise the size of the chunks so much that
it would no longer be a very good model for network IO.
For a rate-limited reader, select() alone would not work: the sender could
stuff as much data as it wanted into the reader, and select() would just
come back before the timeout period. A second timer would be required to
make the rate steady.
Hmm, just discovered the function clock_nanosleep() which is designed
for just this sort of problem. Not sure how portable that would be, but
it looks like it should do the trick, at least on recent Linux systems.
> Pre-emptive bug report: "dsync" should be "dsink".
SNIP
> Eighty terabytes a second is *really* cranking ...
Right on both counts. The second one was supposed to be B/s.
Regards,
David Mathog
-
Re: non_blocking question
David Mathog writes:
> Eric Sosman wrote:
>
>> Hmmm. The programs should be pretty simple to write if you
>> don't get absurdly anal about avoiding burstiness. If you really
>> really care about it you've got a real-time problem and a whole
>> slew of additional issues. You'll most likely want to distinguish
>> between time lost while blocked (output FIFO full, input FIFO empty)
>> and time elapsed in other activities, which you can probably do by
>> using non-blocking mode plus select, poll, /dev/poll, or the alert
>> mechanism of your choice.
>>
>
> I thought about doing it using nanosleep() between writes/reads to do
> IO at a steady rate (R, say 1000 times per second)
[...]
> However at least on Linux that seems like it
> probably will not work very well since:
>
> The current implementation of nanosleep() is based on the
> normal kernel timer mechanism, which has a resolution of 1/HZ
[...]
> Hmm, just discovered the function clock_nanosleep() which is designed
> for just this sort of problem.
It isn't, cf
The suspension time caused by this function may be longer than
requested because the argument value is rounded up to an
integer multiple of the sleep resolution
(clock_nanosleep, SUS)
The same statement can be found in the description of nanosleep and
the clock_nanosleep 'application usage' section says:
Calling clock_nanosleep() with the value TIMER_ABSTIME not
set in the flags argument and with a clock_id of
CLOCK_REALTIME is equivalent to calling nanosleep() with the
same rqtp and rmtp arguments.
The only reason 'clock_nanosleep' exists is to provide a way to sleep
until a specific absolute time has been reached (which may be an
absolute time based on CLOCK_MONOTONIC and therefore unaffected by
setting the realtime clock to a certain value).
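To illustrate the absolute-time use described above, here is a minimal sketch of a fixed-period loop built on clock_nanosleep() with TIMER_ABSTIME. The helper names (timespec_add_ns, sleep_until, run_ticks) are made up for this example. Because each deadline is computed from the previous deadline rather than from "now", oversleeping on one tick does not push back the following ticks. It does not make any individual wakeup more precise than the sleep resolution allows.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Advance *t by ns nanoseconds (ns >= 0), keeping tv_nsec in [0, 1e9). */
void timespec_add_ns(struct timespec *t, long ns)
{
    t->tv_sec  += ns / NSEC_PER_SEC;
    t->tv_nsec += ns % NSEC_PER_SEC;
    if (t->tv_nsec >= NSEC_PER_SEC) {
        t->tv_nsec -= NSEC_PER_SEC;
        t->tv_sec  += 1;
    }
}

/*
 * Sleep until the absolute CLOCK_MONOTONIC time *deadline, retrying if
 * a signal interrupts the sleep. Returns 0 or an error number
 * (clock_nanosleep reports errors through its return value).
 */
int sleep_until(const struct timespec *deadline)
{
    int rc;
    do {
        rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, deadline, NULL);
    } while (rc == EINTR);
    return rc;
}

/* Run `ticks` periods of period_ns each; returns 0 or an error number. */
int run_ticks(int ticks, long period_ns)
{
    struct timespec next;
    if (clock_gettime(CLOCK_MONOTONIC, &next) != 0)
        return errno;
    for (int i = 0; i < ticks; i++) {
        timespec_add_ns(&next, period_ns);
        /* one read() or write() of S bytes would go here */
        int rc = sleep_until(&next);
        if (rc != 0)
            return rc;
    }
    return 0;
}
```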
-
Re: non_blocking question
David Mathog wrote:
> Eric Sosman wrote:
>
>> Hmmm. The programs should be pretty simple to write if you
>> don't get absurdly anal about avoiding burstiness. If you really
>> really care about it you've got a real-time problem and a whole
>> slew of additional issues. You'll most likely want to distinguish
>> between time lost while blocked (output FIFO full, input FIFO empty)
>> and time elapsed in other activities, which you can probably do by
>> using non-blocking mode plus select, poll, /dev/poll, or the alert
>> mechanism of your choice.
>
> I thought about doing it using nanosleep() between writes/reads to do IO
> at a steady rate (R, say 1000 times per second) with a fixed buffer size
> (S), giving a data transfer rate of R * S, and the user can specify both
> R and S. [...]
You can't rely on the delay all by itself to achieve a desired
rate. Go to sleep, sure, but when you wake up check the current
time (gethrtime or something like it), and compute how many bytes
you ought to have read or written by now, whatever "now" turns out
to be. Subtract the number already done, then read or write the
difference.
The granularity of the sleep timer will cause some burstiness,
and so will the unpredictability of the scheduler as it chooses to
delay you or let you run. If you need to worry about that level
of accuracy, you'll need real-time techniques to address it. But
if you can live with "it looks right when averaged over a full
second," the naive approach is probably plenty good enough.
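The "sleep, check the clock, transfer the difference" approach can be sketched roughly as follows. This is only an illustration under assumptions: it uses clock_gettime(CLOCK_MONOTONIC) where the post suggests gethrtime or something like it, the function names are invented, and the rate is taken to be in bytes per second and below roughly 10^10 so the arithmetic does not overflow.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdint.h>
#include <time.h>

/* Nanoseconds from *start to *now (both from CLOCK_MONOTONIC). */
int64_t elapsed_ns(const struct timespec *start, const struct timespec *now)
{
    return (int64_t)(now->tv_sec - start->tv_sec) * 1000000000LL
         + (now->tv_nsec - start->tv_nsec);
}

/*
 * Bytes that should be transferred right now to stay on a target of
 * rate_bps bytes per second, given that `done` bytes have already been
 * moved during the first `ns` nanoseconds. Returns 0 when ahead of
 * schedule.
 */
uint64_t bytes_owed(uint64_t rate_bps, int64_t ns, uint64_t done)
{
    if (ns <= 0)
        return 0;
    /* Split into whole seconds and remainder so rate * time cannot
       overflow on long runs. */
    uint64_t secs = (uint64_t)ns / 1000000000ULL;
    uint64_t frac = (uint64_t)ns % 1000000000ULL;
    uint64_t due  = rate_bps * secs + rate_bps * frac / 1000000000ULL;
    return due > done ? due - done : 0;
}

/* One pacing step: nap briefly, then report how many bytes are now due. */
uint64_t pace_step(const struct timespec *start, uint64_t rate_bps,
                   uint64_t done)
{
    struct timespec nap = { 0, 1000000L };   /* ask for 1 ms; may oversleep */
    struct timespec now;

    nanosleep(&nap, NULL);
    clock_gettime(CLOCK_MONOTONIC, &now);
    return bytes_owed(rate_bps, elapsed_ns(start, &now), done);
}
```

With this scheme an oversleep only makes the next transfer larger: at 10000000 bytes per second, waking after 10 ms instead of 1 ms means 100000 bytes are due instead of 10000. The average rate stays correct while the short-term burstiness remains, as described above.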
>> Pre-emptive bug report: "dsync" should be "dsink".
> SNIP
>> Eighty terabytes a second is *really* cranking ...
>
> Right on both counts. The second one was supposed to be B/s.
A correct but unfortunate abbreviation ... ;-)
--
Eric.Sosman@sun.com
-
Re: non_blocking question
On Oct 22, 3:28 pm, David Mathog wrote:
> On a related note, have any of you ever seen programs which can be used
> as rate controlled data sources and syncs? Currently pretty much all
> linux/unix variants provides this:
>
> dd if=/dev/zero bs=4196 count=100000 > /dev/null
>
> but both the source (dd) and sync (/dev/null) run as fast as they
> possibly can. Even on old systems like an Athlon MP the above command
> moves 3.4GB/sec. Real IO is much, much slower. So those sorts of
> sources and syncs are much too fast for emulating on a single node the
> rate issues discussed in the previous post. It would be really handy if
> this was possible:
>
> mkfifo /tmp/in
> mkfifo /tmp/out1
> mkfifo /tmp/out2
> dsource -rate 10000000 > /tmp/in &
> dsync -rate 10000000 < /tmp/out1 &
> dsync -rate 60000000 < /tmp/out2 &
> testprogram -in /tmp/in -out1 /tmp/out1 -out2 /tmp/out2
>
> That is, where the dsource and dsync programs emit and absorb data at
> the specified rates (in MB/s). Not bursty though, at that rate even on
> short time scales.
>
> Regards,
>
> David Mathog
Is the --rate-limit option of "Pipe Viewer" of any use?
http://www.ivarch.com/programs/quickref/pv.shtml
-
Re: non_blocking question
David Mathog schrieb:
> On a related note, have any of you ever seen programs which can be used
> as rate controlled data sources and syncs? Currently pretty much all
> linux/unix variants provides this:
>
> dd if=/dev/zero bs=4196 count=100000 > /dev/null
>
I'm not sure if I understand your question completely, but mbuffer has
options to specify the rate at which data gets transferred. It is a
little bit similar to dd or even more to buffer, but has a whole bunch
of additional options to specify buffer size and prevent tape
stop-rewind-restart conditions when using it for performing backups to
tape drives.
Get it here: http://www.maier-komor.de/mbuffer.html
- Thomas
-
Re: non_blocking question
David Mathog wrote:
> Let "addr" be a buffer holding "N" bytes, of which at least "n" (n
> are requested to be written, accessed through an fd which has been set
> to the nonblocking state. This code is executed:
>
> ret = write(fd, addr, n);
>
> fd may be for a network socket, a pipe, or a disk file. The question is
> how long will this process typically reside in the write() function for
> the different devices, and how much data should one expect to be
> transferred per call based on the size of "n"?
>
(This is long)
I'd like to come back to this if we can. I have a program "nettee"
which creates a daisy chain connection down a number of nodes, and is
used to distribute data to those nodes. So it is like this, in general:
first node: reads data from stdin, sends to network
internal nodes: reads data from preceding node, writes it to
local disk AND copies it to next node
final node: reads data from preceding node, writes it to local disk
On our (oldish) cluster the disks can write at about 40MB/sec sustained.
They are WDC WD400BB-00DEA0, ATA 100, 2MB cache:
% hdparm -tT /dev/hda
/dev/hda:
Timing cached reads: 530 MB in 2.01 seconds = 264.22 MB/sec
Timing buffered disk reads: 134 MB in 3.02 seconds = 44.30 MB/sec
% hdparm /dev/hda
/dev/hda:
multcount = 16 (on)
IO_support = 1 (32-bit)
unmaskirq = 1 (on)
using_dma = 1 (on)
keepsettings = 0 (off)
readonly = 0 (off)
readahead = 256 (on)
geometry = 65535/16/63, sectors = 78165360, start = 0
The 100baseT network can move data between two nodes using TCP at
around 11.7 MB/s. There is apparently some sort of interference between
the network IO and actual disk IO, and it is at a fairly low level,
as this experiment shows. It uses the 0.2.0 beta nettee; the data
generated on the master is stored to the target file and also echoed to
one more node, which throws it away:
(master): dd if=/dev/zero bs=4196 count=50000 \
| nettee -root -next $NEXTNODE -out $TARGET -v 31
(slave): nettee -out /dev/null -next _EOC_
TARGET          Speed (MB/sec)
/dev/null       11.76
/tmp/foo.dat    11.76
/disk/foo.dat   10.7 (the rate fluctuates a lot)
Disk caching is enabled, so one might have thought that writing to
/disk/foo.dat (on a real disk) would be as fast as writing to
/tmp/foo.dat (a tmpfs), but it isn't, not even for a 200MB file which
fits completely into memory.
% dd if=/dev/zero bs=4096 count=50000 >/dev/null
50000+0 records in
50000+0 records out
204800000 bytes (205 MB) copied, 0.0568505 seconds, 3.6 GB/s
% dd if=/dev/zero bs=4096 count=50000 >/tmp/foo.dat
50000+0 records in
50000+0 records out
204800000 bytes (205 MB) copied, 0.865042 seconds, 237 MB/s
% dd if=/dev/zero bs=4096 count=50000 >foo.dat
50000+0 records in
50000+0 records out
204800000 bytes (205 MB) copied, 1.31895 seconds, 155 MB/s
155 MB/s isn't terrible, but it's a lot less than the speed to write
straight to memory in a tmpfs, which suggests that something about
actual disk writes slows down the system even when network IO is not
part of the equation. Also the "237 MB/s" number corresponds to a
nettee test speed which is the same as writing to /dev/null. In other
words, the "write to memory" part of this equation isn't limiting
network IO, something about actually sending data to the disk is.
There's no "sync" in these tests - they should
be perfectly happy stuffing most of the data into memory and then
exiting, to leave it to dribble out to the disk in actual writes "whenever".
I'm thinking there may be some sort of interrupt level conflict, where
the network card uses a ton of interrupts and even a few generated by
the disk controller result in a disproportionate hit in network performance.
Kernel is 2.6.19.3.
Any thoughts?
Thanks,
David Mathog
-
Re: non_blocking question
David Mathog writes:
>David Mathog wrote:
>
>> Let "addr" be a buffer holding "N" bytes, of which at least "n" (n
>> are requested to be written, accessed through an fd which has been set
>> to the nonblocking state. This code is executed:
>>
>> ret = write(fd, addr, n);
>>
>> fd may be for a network socket, a pipe, or a disk file. The question is
>> how long will this process typically reside in the write() function for
>> the different devices, and how much data should one expect to be
>> transferred per call based on the size of "n"?
>>
>
>(This is long)
>
>I'd like to come back to this if we can. I have a program "nettee"
>which creates a daisy chain connection down a number of nodes, and is
>used to distribute data to those nodes. So it is like this, in general:
It would take a long time to enumerate all the different factors
that will help isolate your performance issues, but you should keep the
following in mind:
Physical Disk Sustained Write Performance
----------------------------------------
This varies based on the actual filesystem layout on
the device. For files on a typical Linux filesystem (e.g.
a non-extent based filesystem such as ext3), the physical
storage locations for the blocks that make up your file
will not necessarily be contiguous on the media, so the
drive will need to move heads (seek) and wait for the
appropriate sector to move under the head (rotational latency).
Given a typical ext3 filesystem blocksize of 4096 bytes, a single
64kbyte write can result in potentially 16 writes to non-contiguous
sectors. This reduces your throughput considerably.
Extent-based filesystems can do better, since they typically
deal with large extents (an extent is a set of contiguous sectors
assigned to a specific file).
Processor Interrupt Handling
----------------------------------------
Most network interface cards only use a single interrupt vector
and thus handle all incoming traffic on a single core. This can
saturate the core and reduce your throughput (albeit more likely
at 1GbE or 10GbE speeds). Check /proc/interrupts to see
how the interrupts are distributed. 11.7MBytes/sec is very very
good for TCP on 100BaseT. Check sar/iostat/vmstat to see what the
processor utilization looks like.
On-board Disk Caches
----------------------------------------
They'll buffer, somewhat, the impact of the seek/rotational
latency issues, but only insofar as the disk buffer(s) allow. Considering
the relatively small buffers on the disk electronics, and the fact
you still pay the seek/latency penalties, you'll not buy much by
enabling write caches on the drive for large streaming I/O workloads.
File Caches
----------------------------------------
Unix (and Linux) applications don't write directly to the device
(even dd(1)). They write to the file cache (which consists of
otherwise unused physical memory), and the kernel eventually
flushes the output to the device as the memory is needed for other
purposes (or if the sync(1) command is used, or if the O_SYNC or
O_DSYNC flags are provided at open(2)).
Using O_DIRECT with standard file descriptors on Linux, or using
raw devices on Unix will cause the I/O to go directly to the device
bypassing the file cache.
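One way to see the file cache at work is to time the same write with and without forcing the data out to the device. The following is a rough sketch only: timed_write is a made-up helper, it uses fdatasync() after the write rather than opening the file with O_SYNC or O_DSYNC, and the numbers it produces depend heavily on the filesystem and on what else the machine is doing.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Write n bytes from buf to a blocking fd, optionally forcing them to
 * the device with fdatasync(), and store the elapsed wall-clock time in
 * *seconds. Returns 0 on success, -1 on error.
 */
int timed_write(int fd, const char *buf, size_t n, int force_to_disk,
                double *seconds)
{
    struct timespec t0, t1;
    size_t off = 0;

    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1;
    while (off < n) {
        ssize_t r = write(fd, buf + off, n - off);
        if (r < 0) {
            if (errno == EINTR)
                continue;        /* interrupted by a signal: retry */
            return -1;
        }
        off += (size_t)r;        /* partial writes are possible */
    }
    /* Without this, write() normally returns once the data is in the
       file cache, long before it reaches the disk. */
    if (force_to_disk && fdatasync(fd) != 0)
        return -1;
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1;
    *seconds = (double)(t1.tv_sec - t0.tv_sec)
             + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return 0;
}
```

Calling it once with force_to_disk set to 0 and once with 1 on the same file gives a feel for how much of a dd-style figure is really a memory copy.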
All of these factors conspire to make performance analysis interesting in
unix/linux systems.
hth
s