non_blocking question - Unix


Thread: non_blocking question

  1. non_blocking question

    I have a question concerning write rates to different devices through the
    same piece of code. This comes up in the context of optimizing
    throughput in a single threaded program which is relaying data along a
    network daisy chain: it reads from the upstream machine, then writes to
    both the local node (either a pipe or a disk) and also over the
    network to the downstream node. This is all typical FD_SET and select()
    code. The speed and buffer sizes of the network and local devices can
    differ dramatically from one working environment to another.

    Let "addr" be a buffer holding "N" bytes, of which at least "n" (n <= N)
    are requested to be written, accessed through an fd which has been set
    to the nonblocking state. This code is executed:

    ret = write(fd, addr, n);

    fd may be for a network socket, a pipe, or a disk file. The question is
    how long will this process typically reside in the write() function for
    the different devices, and how much data should one expect to be
    transferred per call based on the size of "n"?

    For instance, the buffers associated with a network socket are typically
    pretty small, and it seems like "n" can be pretty much anything bigger
    than that and the write will return quickly having sent a smallish chunk
    of data into the network buffer. (The write does a memory to memory
    copy on the computer, which is orders of magnitude faster than any of
    these writes take to complete.) For disks though, making "n" large can
    result in write() hanging around sending all of the data, which
    decreases the amount of time available for network reads and writes.

    Is there a good way to determine the optimal size of "n" at run time for
    the different IO devices? Currently my code can be tuned by setting a
    "maxwrite" parameter for each device (which limits "n" to no more than
    that value), but optimizing throughput that way requires a fair amount
    of experimentation for each new system.
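
    To make that concrete, the write path amounts to something like the
    following sketch (simplified, and the names are illustrative; "maxwrite"
    is the per-device tuning parameter described above):

    #include <errno.h>
    #include <unistd.h>

    /* Write up to "maxwrite" bytes from addr to a non-blocking fd.
     * Returns bytes consumed this pass, 0 if the device buffer is full,
     * or -1 on a real error. */
    ssize_t push_some(int fd, const char *addr, size_t n, size_t maxwrite)
    {
        size_t chunk = (n > maxwrite) ? maxwrite : n;
        ssize_t ret = write(fd, addr, chunk);

        if (ret < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return 0;   /* buffer full: wait for select() to say writable */
            return -1;      /* genuine error */
        }
        return ret;         /* may be a short write; the caller advances addr */
    }

    The caller loops over select() and calls something like push_some() on
    whichever fds are writable, so the open question is really how to pick
    "maxwrite" per device rather than how to structure the loop.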

    Thanks,

    David Mathog

  2. Re: non_blocking question

    On a related note, have any of you ever seen programs which can be used
    as rate controlled data sources and syncs? Currently pretty much all
    linux/unix variants provide this:

    dd if=/dev/zero bs=4196 count=100000 > /dev/null

    but both the source (dd) and sync (/dev/null) run as fast as they
    possibly can. Even on old systems like an Athlon MP the above command
    moves 3.4GB/sec. Real IO is much, much slower. So those sorts of
    sources and syncs are much too fast for emulating on a single node the
    rate issues discussed in the previous post. It would be really handy if
    this was possible:

    mkfifo /tmp/in
    mkfifo /tmp/out1
    mkfifo /tmp/out2
    dsource -rate 10000000 > /tmp/in &
    dsync -rate 10000000 < /tmp/out1 &
    dsync -rate 60000000 < /tmp/out2 &
    testprogram -in /tmp/in -out1 /tmp/out1 -out2 /tmp/out2

    That is, where the dsource and dsync programs emit and absorb data at
    the specified rates (in MB/s). Not bursty though, at that rate even on
    short time scales.

    Regards,

    David Mathog

  3. Re: non_blocking question

    David Mathog wrote:
    > On a related note, have any of you ever seen programs which can be used
    > as rate controlled data sources and syncs? Currently pretty much all
    > linux/unix variants provides this:
    >
    > dd if=/dev/zero bs=4196 count=100000 > /dev/null
    >
    > but both the source (dd) and sync (/dev/null) run as fast as they
    > possibly can. Even on old systems like an Athlon MP the above command
    > moves 3.4GB/sec. Real IO is much, much slower. So those sorts of
    > sources and syncs are much too fast for emulating on a single node the
    > rate issues discussed in the previous post. It would be really handy if
    > this was possible:
    >
    > mkfifo /tmp/in
    > mkfifo /tmp/out1
    > mkfifo /tmp/out2
    > dsource -rate 10000000 > /tmp/in &
    > dsync -rate 10000000 < /tmp/out1 &
    > dsync -rate 60000000 < /tmp/out2 &
    > testprogram -in /tmp/in -out1 /tmp/out1 -out2 /tmp/out2
    >
    > That is, where the dsource and dsync programs emit and absorb data at
    > the specified rates (in MB/s). Not bursty though, at that rate even on
    > short time scales.


    Hmmm. The programs should be pretty simple to write if you
    don't get absurdly anal about avoiding burstiness. If you really
    really care about it you've got a real-time problem and a whole
    slew of additional issues. You'll most likely want to distinguish
    between time lost while blocked (output FIFO full, input FIFO empty)
    and time elapsed in other activities, which you can probably do by
    using non-blocking mode plus select, poll, /dev/poll, or the alert
    mechanism of your choice.

    Pre-emptive bug report: "dsync" should be "dsink".

    Observation: If your example is intended to be realistic, you
    obviously use much faster hardware than I do. Eighty terabytes a
    second is *really* cranking ...

    --
    Eric.Sosman@sun.com

  4. Re: non_blocking question

    Eric Sosman wrote:

    > Hmmm. The programs should be pretty simple to write if you
    > don't get absurdly anal about avoiding burstiness. If you really
    > really care about it you've got a real-time problem and a whole
    > slew of additional issues. You'll most likely want to distinguish
    > between time lost while blocked (output FIFO full, input FIFO empty)
    > and time elapsed in other activities, which you can probably do by
    > using non-blocking mode plus select, poll, /dev/poll, or the alert
    > mechanism of your choice.
    >


    I thought about doing it using nanosleep() between writes/reads to do IO
    at a steady rate (R, say 1000 times per second) with a fixed buffer size
    (S), giving a data transfer rate of R * S, and the user can specify both
    R and S. However at least on Linux that seems like it probably will not
    work very well since:

    The current implementation of nanosleep() is based on the normal
    kernel timer mechanism, which has a resolution of 1/HZ s (see
    time(7)). Therefore, nanosleep() pauses always for at least the
    specified time, however it can take up to 10 ms longer than
    specified until the process becomes runnable again. For the same
    reason, the value returned in case of a delivered signal in *rem
    is usually rounded to the next larger multiple of 1/HZ s.

    10ms is a pretty big chunk of time when each interval is supposed to be
    1 ms. Dropping it to 100 IO operations per second would likely get out
    of that hole, but that would raise the size of the chunks so much that
    it would no longer be a very good model for network IO.

    For a rate-limited read, select() alone would not work: the sender
    could stuff as much data as it wanted into the reader, and select()
    would just return before the timeout period. A second timer would be
    required to make the rate steady.

    Hmm, just discovered the function clock_nanosleep() which is designed
    for just this sort of problem. Not sure how portable that would be, but
    it looks like it should do the trick, at least on recent Linuxes.


    > Pre-emptive bug report: "dsync" should be "dsink".

    SNIP
    > Eighty terabytes a second is *really* cranking ...


    Right on both counts. The second one was supposed to be B/s.

    Regards,

    David Mathog


  5. Re: non_blocking question

    David Mathog writes:
    > Eric Sosman wrote:
    >
    >> Hmmm. The programs should be pretty simple to write if you
    >> don't get absurdly anal about avoiding burstiness. If you really
    >> really care about it you've got a real-time problem and a whole
    >> slew of additional issues. You'll most likely want to distinguish
    >> between time lost while blocked (output FIFO full, input FIFO empty)
    >> and time elapsed in other activities, which you can probably do by
    >> using non-blocking mode plus select, poll, /dev/poll, or the alert
    >> mechanism of your choice.
    >>

    >
    > I thought about doing it using nanosleep() between writes/reads to do
    > IO at a steady rate (R, say 1000 times per second)


    [...]

    > However at least on Linux that seems like it
    > probably will not work very well since:
    >
    > The current implementation of nanosleep() is based on the
    > normal kernel timer mechanism, which has a resolution of 1/HZ


    [...]

    > Hmm, just discovered the function clock_nanosleep() which is designed
    > for just this sort of problem.


    It isn't, cf

    The suspension time caused by this function may be longer than
    requested because the argument value is rounded up to an
    integer multiple of the sleep resolution
    (clock_nanosleep, SUS)

    The same statement can be found in the description of nanosleep and
    the clock_nanosleep 'application usage' section says:

    Calling clock_nanosleep() with the value TIMER_ABSTIME not
    set in the flags argument and with a clock_id of
    CLOCK_REALTIME is equivalent to calling nanosleep() with the
    same rqtp and rmtp arguments.

    The only reason 'clock_nanosleep' exists is to provide a way to sleep
    until a specific absolute time has been reached (which may be an
    absolute time based on CLOCK_MONOTONIC and therefore unaffected by
    setting the realtime clock to a certain value).
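
    If one does want a steady pacer, the usual pattern is therefore to
    advance an absolute deadline each period and sleep until it, so that
    wakeup jitter does not accumulate. A sketch (error checking omitted):

    #include <time.h>

    /* Sleep until the next tick of a fixed-period schedule. The next
     * deadline is computed from the previous deadline, not from "now",
     * so individual late wakeups do not shift the whole schedule. */
    void next_tick(struct timespec *deadline, long period_ns)
    {
        deadline->tv_nsec += period_ns;
        while (deadline->tv_nsec >= 1000000000L) {
            deadline->tv_nsec -= 1000000000L;
            deadline->tv_sec  += 1;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, deadline, NULL);
    }

    (Initialize the deadline once with clock_gettime(CLOCK_MONOTONIC, ...)
    before the loop. The resolution limits quoted above still apply.)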

  6. Re: non_blocking question

    David Mathog wrote:
    > Eric Sosman wrote:
    >
    >> Hmmm. The programs should be pretty simple to write if you
    >> don't get absurdly anal about avoiding burstiness. If you really
    >> really care about it you've got a real-time problem and a whole
    >> slew of additional issues. You'll most likely want to distinguish
    >> between time lost while blocked (output FIFO full, input FIFO empty)
    >> and time elapsed in other activities, which you can probably do by
    >> using non-blocking mode plus select, poll, /dev/poll, or the alert
    >> mechanism of your choice.

    >
    > I thought about doing it using nanosleep() between writes/reads to do IO
    > at a steady rate (R, say 1000 times per second) with a fixed buffer size
    > (S), giving a data transfer rate of R * S, and the user can specify both
    > R and S. [...]


    You can't rely on the delay all by itself to achieve a desired
    rate. Go to sleep, sure, but when you wake up check the current
    time (gethrtime or something like it), and compute how many bytes
    you ought to have read or written by now, whatever "now" turns out
    to be. Subtract the number already done, then read or write the
    difference.

    The granularity of the sleep timer will cause some burstiness,
    and so will the unpredictability of the scheduler as it chooses to
    delay you or let you run. If you need to worry about that level
    of accuracy, you'll need real-time techniques to address it. But
    if you can live with "it looks right when averaged over a full
    second," the naive approach is probably plenty good enough.
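
    In outline, something like this sketch (using clock_gettime() with
    CLOCK_MONOTONIC in place of gethrtime(); on Solaris you could call
    gethrtime() directly):

    #include <stdint.h>
    #include <time.h>

    /* How many more bytes should have been moved by now to hold "rate"
     * bytes/sec, given "bytes_done" already moved since "start"? */
    int64_t bytes_due(const struct timespec *start, double rate,
                      int64_t bytes_done)
    {
        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        double elapsed = (now.tv_sec  - start->tv_sec)
                       + (now.tv_nsec - start->tv_nsec) / 1e9;
        return (int64_t)(elapsed * rate) - bytes_done;
    }

    Each time around the loop: sleep a little, call bytes_due(), then read
    or write up to that many bytes. Short transfers are simply made up on
    the next pass, so the average rate converges even though the sleep
    granularity is coarse.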

    >> Pre-emptive bug report: "dsync" should be "dsink".

    > SNIP
    >> Eighty terabytes a second is *really* cranking ...

    >
    > Right on both counts. The second one was supposed to be B/s.


    A correct but unfortunate abbreviation ... ;-)

    --
    Eric.Sosman@sun.com

  7. Re: non_blocking question

    On Oct 22, 3:28 pm, David Mathog wrote:
    > On a related note, have any of you ever seen programs which can be used
    > as rate controlled data sources and syncs? Currently pretty much all
    > linux/unix variants provides this:
    >
    > dd if=/dev/zero bs=4196 count=100000 > /dev/null
    >
    > but both the source (dd) and sync (/dev/null) run as fast as they
    > possibly can. Even on old systems like an Athlon MP the above command
    > moves 3.4GB/sec. Real IO is much, much slower. So those sorts of
    > sources and syncs are much too fast for emulating on a single node the
    > rate issues discussed in the previous post. It would be really handy if
    > this was possible:
    >
    > mkfifo /tmp/in
    > mkfifo /tmp/out1
    > mkfifo /tmp/out2
    > dsource -rate 10000000 > /tmp/in &
    > dsync -rate 10000000 < /tmp/out1 &
    > dsync -rate 60000000 < /tmp/out2 &
    > testprogram -in /tmp/in -out1 /tmp/out1 -out2 /tmp/out2
    >
    > That is, where the dsource and dsync programs emit and absorb data at
    > the specified rates (in MB/s). Not bursty though, at that rate even on
    > short time scales.
    >
    > Regards,
    >
    > David Mathog


    --rate-limit option of "Pipe Viewer" of any use?
    http://www.ivarch.com/programs/quickref/pv.shtml

  8. Re: non_blocking question

    David Mathog schrieb:
    > On a related note, have any of you ever seen programs which can be used
    > as rate controlled data sources and syncs? Currently pretty much all
    > linux/unix variants provides this:
    >
    > dd if=/dev/zero bs=4196 count=100000 > /dev/null
    >


    I'm not sure if I understand your question completely, but mbuffer has
    options to specify the rate at which data gets transferred. It is a
    little bit similar to dd or even more to buffer, but has a whole bunch
    of additional options to specify buffer size and prevent tape
    stop-rewind-restart conditions when using it for performing backups to
    tape drives.

    Get it here: http://www.maier-komor.de/mbuffer.html

    - Thomas

  9. Re: non_blocking question

    David Mathog wrote:

    > Let "addr" be a buffer holding "N" bytes, of which at least "n" (n <= N)
    > are requested to be written, accessed through an fd which has been set
    > to the nonblocking state. This code is executed:
    >
    > ret = write(fd, addr, n);
    >
    > fd may be for a network socket, a pipe, or a disk file. The question is
    > how long will this process typically reside in the write() function for
    > the different devices, and how much data should one expect to be
    > transferred per call based on the size of "n"?
    >


    (This is long)

    I'd like to come back to this if we can. I have a program "nettee"
    which creates a daisy chain connection down a number of nodes, and is
    used to distribute data to those nodes. So it is like this, in general:

    first node: reads data from stdin, sends to network
    internal nodes: reads data from preceding node, writes it to
    local disk AND copies it to next node
    final node: reads data from preceding node, writes it to local disk

    On our (oldish) cluster the disks can write at about 40MB/sec sustained
    (they are WDC WD400BB-00DEA0, ATA 100, 2MB cache):

    % hdparm -tT /dev/hda

    /dev/hda:
     Timing cached reads:         530 MB in 2.01 seconds = 264.22 MB/sec
     Timing buffered disk reads:  134 MB in 3.02 seconds =  44.30 MB/sec

    % hdparm /dev/hda

    /dev/hda:
     multcount    = 16 (on)
     IO_support   = 1 (32-bit)
     unmaskirq    = 1 (on)
     using_dma    = 1 (on)
     keepsettings = 0 (off)
     readonly     = 0 (off)
     readahead    = 256 (on)
     geometry     = 65535/16/63, sectors = 78165360, start = 0

    The 100baseT network can move data between two nodes using TCP at
    around 11.7 MB/s. There is apparently some sort of interference between
    the network IO and actual disk IO, and it happens at a fairly low
    level, because in this experiment (using the 0.2.0 beta of nettee) the
    data generated on the master is stored to the target file and also
    echoed to one more node, which throws it away:

    (master): dd if=/dev/zero bs=4196 count=50000 \
    | nettee -root -next $NEXTNODE -out $TARGET -v 31

    (slave): nettee -out /dev/null -next _EOC_

    TARGET           Speed (MB/sec)
    /dev/null        11.76
    /tmp/foo.dat     11.76
    /disk/foo.dat    10.7   (the rate fluctuates a lot)

    Disk caching is enabled, so one might have thought that writing to
    /disk/foo.dat (on a real disk) would be as fast as writing to
    /tmp/foo.dat (a tmpfs), but it isn't, not even for a 200MB file which
    fits completely into memory.

    % dd if=/dev/zero bs=4096 count=50000 >/dev/null
    50000+0 records in
    50000+0 records out
    204800000 bytes (205 MB) copied, 0.0568505 seconds, 3.6 GB/s
    % dd if=/dev/zero bs=4096 count=50000 >/tmp/foo.dat
    50000+0 records in
    50000+0 records out
    204800000 bytes (205 MB) copied, 0.865042 seconds, 237 MB/s
    % dd if=/dev/zero bs=4096 count=50000 >foo.dat
    50000+0 records in
    50000+0 records out
    204800000 bytes (205 MB) copied, 1.31895 seconds, 155 MB/s

    155 MB/s isn't terrible, but it's a lot less than the speed to write
    straight to memory in a tmpfs, which suggests that something about
    actual disk writes slows down the system even when network IO is not
    part of the equation. Also the "237 MB/s" number corresponds to a
    nettee test speed which is the same as writing to /dev/null. In other
    words, the "write to memory" part of this equation isn't limiting
    network IO, something about actually sending data to the disk is.
    There's no "sync" in these tests - they should
    be perfectly happy stuffing most of the data into memory and then
    exiting, to leave it to dribble out to the disk in actual writes "whenever".

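    One way I could try to separate the page-cache copy from the actual
    device traffic would be a standalone test along these lines
    (hypothetical, not part of nettee): time the write() phase and an
    explicit flush separately.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    static double secs(void)
    {
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC, &t);
        return t.tv_sec + t.tv_nsec / 1e9;
    }

    int main(void)
    {
        static char buf[4096];
        memset(buf, 0, sizeof buf);
        int fd = open("foo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return 1;

        double t0 = secs();
        for (int i = 0; i < 50000; i++)      /* 205 MB, as in the dd runs */
            if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf) return 1;
        double t1 = secs();
        fdatasync(fd);                       /* force it out to the platters */
        double t2 = secs();

        printf("write phase %.2f s, flush phase %.2f s\n", t1 - t0, t2 - t1);
        close(fd);
        return 0;
    }
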
    I'm thinking there may be some sort of interrupt level conflict, where
    the network card uses a ton of interrupts and even a few generated by
    the disk controller result in a disproportionate hit in network performance.

    Kernel is 2.6.19.3.

    Any thoughts?

    Thanks,

    David Mathog

  10. Re: non_blocking question

    David Mathog writes:
    >David Mathog wrote:
    >
    >> Let "addr" be a buffer holding "N" bytes, of which at least "n" (n <= N)
    >> are requested to be written, accessed through an fd which has been set
    >> to the nonblocking state. This code is executed:
    >>
    >> ret = write(fd, addr, n);
    >>
    >> fd may be for a network socket, a pipe, or a disk file. The question is
    >> how long will this process typically reside in the write() function for
    >> the different devices, and how much data should one expect to be
    >> transferred per call based on the size of "n"?
    >>

    >
    >(This is long)
    >
    >I'd like to come back to this if we can. I have a program "nettee"
    >which creates a daisy chain connection down a number of nodes, and is
    >used to distribute data to those nodes. So it is like this, in general:


    It would take a long time to enumerate all the different factors
    that will help isolate your performance issues, but you should keep the
    following in mind:

    Physical Disk Sustained Write Performance
    ----------------------------------------

    This varies based on the actual filesystem layout on
    the device. For files on a typical Linux filesystem (e.g.
    a non-extent based filesystem such as ext3), the physical
    storage locations for the blocks that make up your file
    will not necessarily be contiguous on the media, so the
    drive will need to move heads (seek) and wait for the
    appropriate sector to move under the head (rotational latency).

    Given a typical ext3 filesystem blocksize of 4096 bytes, a single
    64kbyte write can result in potentially 16 writes to non-contiguous
    sectors. This reduces your throughput considerably.

    Extent-based filesystems can do better, since they typically
    deal with large extents (an extent is a set of contiguous sectors
    assigned to a specific file).

    Processor Interrupt Handling
    ----------------------------------------

    Most network interface cards only use a single interrupt vector
    and thus handle all incoming traffic on a single core. This can
    saturate the core and reduce your throughput (albeit more likely
    at 1GbE or 10GbE speeds). Check /proc/interrupts to see
    how the interrupts are distributed. 11.7MBytes/sec is very very
    good for TCP on 100BaseT. Check sar/iostat/vmstat to see what the
    processor utilization looks like.

    On-board Disk Caches
    ----------------------------------------

    They'll buffer, somewhat, the impact of the seek/rotational
    latency issues, but only insofar as the disk buffer(s) allow. Considering
    the relatively small buffers on the disk electronics, and the fact
    you still pay the seek/latency penalties, you'll not buy much by
    enabling write caches on the drive for large streaming I/O workloads.

    File Caches
    ----------------------------------------

    Unix (and Linux) applications don't write directly to the device
    (even dd(1)). They write to the file cache (which consists of
    otherwise unused physical memory), and the kernel eventually
    flushes the output to the device as the memory is needed for other
    purposes (or if the sync(1) command is used, or if the O_SYNC or
    O_DSYNC flags are provided at open(2)).

    Using O_DIRECT with standard file descriptors on Linux, or using
    raw devices on Unix will cause the I/O to go directly to the device
    bypassing the file cache.
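
    A minimal O_DIRECT sketch on Linux (the alignment rules are the usual
    gotcha: buffer address, transfer size and file offset must all be
    suitably aligned, typically to the logical block or page size):

    #define _GNU_SOURCE                  /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int direct_write_demo(const char *path)
    {
        void  *buf;
        size_t len = 1 << 20;                      /* 1 MiB */

        if (posix_memalign(&buf, 4096, len) != 0)  /* page-aligned buffer */
            return -1;
        memset(buf, 0, len);

        int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { free(buf); return -1; }

        ssize_t ret = write(fd, buf, len);         /* bypasses the file cache */

        close(fd);
        free(buf);
        return (ret == (ssize_t)len) ? 0 : -1;
    }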

    All of these factors conspire to make performance analysis interesting in
    unix/linux systems.

    hth

    s
