Infortrend SATA-FC RAID and I/O tuning - SGI


Thread: Infortrend SATA-FC RAID and I/O tuning

  1. Infortrend SATA-FC RAID and I/O tuning

    Hi all,

    we're trying to optimize I/O performance with the following
    configuration:
    two 8x400 MHz Origin 2000 systems, IRIX 6.5.21f
    one 2-port 2 Gb/s QLogic 2342 in an XIO shoehorn in each
    Infortrend A16F-R1A (dual RAID controllers with two 2 Gb/s ports each,
    16 SATA 250 GB drives)

    We need to access the RAID from both systems, so we configured
    the RAID as follows:
    2 logical drives, RAID 5 (7+1), 128 KB stripe size (one mapped to the
    primary and the other to the secondary RAID controller)
    each logical drive divided into 2 equal partitions (using the
    RAID's partitioning feature)
    We then built xvm stripes over partitions on the different RAID
    controllers, with stripe unit=896k and ctq=32.
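
    (For reference: one full stripe width on a 7+1 logical drive is
    7 data disks x 128 KB = 896 KB, which is presumably where the 896k
    xvm stripe unit comes from; two such logical drives together give
    2 x 896 KB = 1792 KB.)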

    Running four concurrent diskperf -D -W ... runs against the one filesystem
    gave 4 x 60 MB/s at the 4 MB I/O request size (reads).
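    (That is, roughly 4 x 60 = 240 MB/s aggregate across the four streams.)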

    Next, we're trying to optimize the performance of our application.
    Watching par -k output showed us that about 70% of the I/O request sizes
    are exactly 64 KB and 90% are <= 64 KB.

    So it seems the best approach would be to set the RAID 5 stripe size to
    32 KB, but unfortunately the buggy RAID firmware only allows 128 KB! So
    the only option left is to tune the application itself. I asked the
    vendor about the possibility of changing the I/O request size and got
    this reply:

    "When ProMAX writes a seismic trace to disk we simply make a system
    call using the "write" routine. If the O/S chooses a 64K transfer,
    then that is what it does."

    So the question is: why does IRIX choose 64K, and is there any way to
    increase it? Alexis mentioned in earlier posts that IRIX can coalesce
    smaller I/Os into larger ones, but how is that achieved? I tried
    increasing dwcluster and using mkfs -d sunit=896k -d swidth=1792k, with
    no improvement.
    Maybe I interpreted the par -k output incorrectly (I did not find any
    full description of it); below is a typical extract from par -k:

    21mS[ 14] exec.exe( 2252): I/O queued; flags 0x4004009 dev 0,385 count 65536 blkno 1342042880
    21mS[ 14] exec.exe( 2252): I/O started; flags 0x4004009 dev 0,385 count 65536 blkno 1342042880
    27mS[ 6] exec.exe( 2157): I/O queued; flags 0x4004109 dev 0,385 count 65536 blkno 589130496
    27mS[ 6] exec.exe( 2157): I/O started; flags 0x4004109 dev 0,385 count 65536 blkno 589130496
    28mS[ 5] pcibr_intrd[0x2(4294967337): I/O done; flags 0x4004109 dev 0,385 count 65536 blkno 589130496
    31mS[ 5] pcibr_intrd[0x2(4294967337): I/O done; flags 0x4004009 dev 0,385 count 65536 blkno 1342042880
    32mS[ 14] exec.exe( 2252): I/O queued; flags 0x4004009 dev 0,423 count 65536 blkno 1340080640
    32mS[ 14] exec.exe( 2252): I/O started; flags 0x4004009 dev 0,423 count 65536 blkno 1340080640
    36mS[ 6] exec.exe( 2157): I/O queued; flags 0x4004109 dev 0,385 count 65536 blkno 589130624
    36mS[ 6] exec.exe( 2157): I/O started; flags 0x4004109 dev 0,385 count 65536 blkno 589130624
    38mS[ 5] pcibr_intrd[0x2(4294967337): I/O done; flags 0x4004109 dev 0,385 count 65536 blkno 589130624
    39mS[ 5] pcibr_intrd[0x2(4294967339): I/O done; flags 0x4004009 dev 0,423 count 65536 blkno 1340080640

    There are also xfsd & bdflush entries in it, but relatively few of them.
    Is there any other way to understand the I/O parameters (is it direct,
    buffered, sync or async, etc.), maybe PCP? Or can anybody suggest
    other ways of I/O tuning (e.g. our xvm slices are not aligned with the
    current RAID 5 7+1 128 KB stripe config; maybe it's better to rely on
    xvm partitioning/slices rather than on the RAID's hardware partitioning;
    something else)?

    Thank you in advance,
    Alex

  2. Re: Infortrend SATA-FC RAID and I/O tuning

    googlegroups at mail.ru wrote:

    > Next, we're trying to optimize the performance of our application.
    > Watching par -k output showed us that about 70% of the I/O request sizes
    > are exactly 64 KB and 90% are <= 64 KB


    Can't see that in the small par -k snippet you posted, though.
    >
    > So it seems the best approach would be to set the RAID 5 stripe size to
    > 32 KB


    Not really. Rule #1 is not to detune a RAID to an application, but to
    fix the application's I/O patterns.

    > but unfortunately the buggy RAID firmware only allows 128 KB! So the
    > only option left is to tune the application itself. I asked the vendor
    > about the possibility of changing the I/O request size and got this reply:
    >
    > "When ProMAX writes a seismic trace to disk we simply make a system
    > call using the "write" routine. If the O/S chooses a 64K transfer,
    > then that is what it does."


    If it's a Fortran 90 application, then FFIO may be of help.

    If they're calling libc.so's "write" routine from a C routine,
    that's worse. Unless you have the source, in which case
    it's pretty easy to make the app use FFIO.
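
    (For illustration only: the effect that user-level buffering along the
    lines of FFIO is after can be sketched in plain C. This is not the FFIO
    interface; the function names and the 4 MB buffer size below are made up.)

    #include <unistd.h>
    #include <string.h>
    #include <stdlib.h>

    #define BIGBUF (4 * 1024 * 1024)     /* hypothetical 4 MB staging buffer */

    static char  *buf;
    static size_t fill;

    /* flush whatever has accumulated with one large write() */
    ssize_t trace_flush(int fd)
    {
        ssize_t n = 0;
        if (fill > 0) {
            n = write(fd, buf, fill);
            fill = 0;
        }
        return n;
    }

    /* gather small sequential trace writes instead of issuing them directly;
       call trace_flush(fd) once more before closing the file */
    ssize_t trace_write(int fd, const void *data, size_t len)
    {
        if (buf == NULL && (buf = malloc(BIGBUF)) == NULL)
            return -1;
        if (len >= BIGBUF) {             /* oversized request: pass it through */
            trace_flush(fd);
            return write(fd, data, len);
        }
        if (fill + len > BIGBUF)         /* buffer full: push it out first */
            trace_flush(fd);
        memcpy(buf + fill, data, len);
        fill += len;
        return (ssize_t)len;
    }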

    You'll go to the buffer cache, so you'd better get ready to tune
    that. Your par -k output is already behind the buffer cache.

    >
    > So the question is: why does IRIX choose 64K, and is there any way to
    > increase it?


    The maximum biosize is already 64KB. The fact that you have smaller
    I/Os means that you're either doing small *direct* I/Os (have they
    opened the file with O_DIRECT? Are they calling fsync()?) or
    that the system isn't waiting very long before making the
    pages clean by flushing them to disk, and that the I/O patterns
    don't allow for coalescing either.
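
    (If direct I/O is the suspect, one way to see from code what the
    filesystem will accept is the F_DIOINFO fcntl on an O_DIRECT descriptor
    -- a minimal sketch with a hypothetical file path; struct dioattr and
    F_DIOINFO are as documented in fcntl(2) on IRIX.)

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        struct dioattr da;
        int fd = open("/data/trace.dat", O_RDWR | O_DIRECT);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (fcntl(fd, F_DIOINFO, &da) == 0)
            printf("direct I/O: mem align %u, min %u, max %u bytes\n",
                   da.d_mem, da.d_miniosz, da.d_maxiosz);
        else
            perror("F_DIOINFO");    /* fd or filesystem can't do direct I/O */
        close(fd);
        return 0;
    }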

    Of course, reads will never be coalesced (though IRIX will
    schedule readaheads, so that you will have a bunch of I/Os
    hitting the disk drive caches or RAID caches before the spindles
    are repositioned).


    > Alexis mentioned in earlier posts that IRIX can coalesce
    > smaller I/Os into larger ones, but how is that achieved?


    It's automagical, though you have to tune maxdmasz up to 16MB
    (1025) to make it cluster I/O that large to a single LUN, of
    course. Don't tune it higher.
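
    (maxdmasz is counted in pages; on a 64-bit Origin kernel with 16 KB
    pages, 1024 x 16 KB = 16 MB, and the extra page -- 1025 -- presumably
    allows for a maximal transfer that doesn't start on a page boundary.)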

    Of course, that only helps *writes* -- are you sure the par -k
    corresponds to writes?

    If they're really writes, and you have enough memory to contain
    a few minutes' worth of writes, try bumping up autoup and lowering
    bdflushr drastically.
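
    (Roughly speaking, autoup is the age in seconds a delayed-write buffer
    must reach before it is flushed, and bdflushr is how often, in seconds,
    bdflush wakes up to scan for such buffers -- so a larger autoup lets
    dirty data sit in memory long enough to be written out in bigger,
    better-coalesced chunks.)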

    > There are also xfsd & bdflush entries in it, but relatively few of them.
    > Is there any other way to understand the I/O parameters (is it direct,
    > buffered, sync or async, etc.), maybe PCP?


    PCP, topio (measures things *in front* of the buffer cache), sar -d,
    sar -u (to know which I/O goes through the buffer cache or is "wphy"
    direct I/O), osview (same).

    > Or can anybody suggest
    > other ways of I/O tuning (e.g. our xvm slices are not aligned with the
    > current RAID 5 7+1 128 KB stripe config,


    Ouch! Really bad for write performance on long writes on the RAID. Make
    sure your slices are all aligned to the first byte of a stripe width
    on the 7+1 LUN.

    But then, you're not doing long writes right now anyway (yet). In
    that case, I sure hope the controller has write caching (and
    that you can use it -- if the write cache isn't mirrored to
    a second controller, it may be too unsafe for you to use).
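
    (Concretely: one stripe width on the 7+1 LUN is 7 data disks x 128 KB
    = 896 KB, i.e. 1792 disk blocks of 512 bytes, so slice start addresses
    should fall on a multiple of that.)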

    --
    Alexis Cousein Senior Systems Engineer
    alexis@sgi.com SGI/Silicon Graphics Brussels

    If I have seen further, it is by standing on reference manuals.


  3. Re: Infortrend SATA-FC RAID and I/O tuning

    Alexis Cousein wrote:
    > googlegroups at mail.ru wrote:
    >
    > Can't see that in the small par -k snippet you posted, though.

    % gzcat par-k.txt.gz | wc -l
    133965
    % gzcat par-k.txt.gz | grep "count 65536"|wc -l
    78044
    % gzcat par-k.txt.gz | grep "count 16384" | wc -l
    27904
    % gzcat par-k.txt.gz | grep "count 32768" | wc -l
    7192
    % gzcat par-k.txt.gz | grep "count 49152" | wc -l
    3960
    % gzcat par-k.txt.gz | grep "count 8192" | wc -l
    3096
    % gzcat par-k.txt.gz | grep "count 4096" | wc -l
    1689

    The I/O requests with sizes > 64k correspond to "xfsd" and "bdflush",
    not to "exec.exe". Does it mean that I can ignore "xfsd" and "bdflush"
    when watching "exec.exe"?

    1509mS[ 6] exec.exe( 2157): I/O queued; flags 0x4004109 dev 0,385 count 65536 blkno 589137280
    1509mS[ 6] exec.exe( 2157): I/O started; flags 0x4004109 dev 0,385 count 65536 blkno 589137280
    1510mS[ 5] pcibr_intrd[0x2(4294967339): I/O done; flags 0x24908 dev 0,423 count 458752 blkno 954171008
    1511mS[ 15] xfsd(4294968410): I/O queued; flags 0x24908 dev 0,423 count 65536 blkno 954158592
    1511mS[ 15] xfsd(4294968410): I/O started; flags 0x24908 dev 0,423 count 65536 blkno 954158592
    1511mS[ 5] pcibr_intrd[0x2(4294967337): I/O done; flags 0x4004109 dev 0,385 count 65536 blkno 589137280
    1511mS[ 15] xfsd(4294968410): I/O queued; flags 0x24908 dev 0,385 count 65536 blkno 954158720
    1511mS[ 15] xfsd(4294968410): I/O started; flags 0x24908 dev 0,385 count 65536 blkno 954158720
    1513mS[ 11] bdflush(4294967375): I/O queued; flags 0x24108 dev 0,423 count 131072 blkno 957218432
    1513mS[ 11] bdflush(4294967375): I/O started; flags 0x24108 dev 0,423 count 131072 blkno 957218432
    1518mS[ 5] pcibr_intrd[0x2(4294967339): I/O done; flags 0x24908 dev 0,423 count 65536 blkno 954158592
    1518mS[ 5] pcibr_intrd[0x2(4294967337): I/O done; flags 0x24908 dev 0,385 count 65536 blkno 954158720
    1518mS[ 1] xfsd(4294968413): I/O queued; flags 0x24908 dev 0,423 count 655360 blkno 954167680
    1518mS[ 1] xfsd(4294968413): I/O started; flags 0x24908 dev 0,423 count 655360 blkno 954167680
    1518mS[ 1] xfsd(4294968413): I/O queued; flags 0x24908 dev 0,385 count 1048576 blkno 954168960
    1518mS[ 1] xfsd(4294968413): I/O started; flags 0x24908 dev 0,385 count 1048576 blkno 954168960
    1528mS[ 6] exec.exe( 2157): I/O queued; flags 0x4004109 dev 0,385 count 65536 blkno 589137408
    1528mS[ 6] exec.exe( 2157): I/O started; flags 0x4004109 dev 0,385 count 65536 blkno 589137408
    1531mS[ 5] pcibr_intrd[0x2(4294967337): I/O done; flags 0x4004109 dev 0,385 count 65536 blkno 589137408

    >
    > The maximum biosize is already 64KB. The fact that you have smaller
    > I/Os means that you're either doing small *direct* I/Os (have they
    > opened the file with O_DIRECT? Are they calling fsync()?) or
    > that the system isn't waiting very long before making the
    > pages clean by flushing them to disk, and that the I/O patterns
    > don't allow for coalescing either.
    >

    I've asked them but still have no meaningful reply, only this
    example of the writing code:
    *nbytes = write( *ihandle, cbuffer, *nbytes );
    and the statement that, because the code is multi-platform, they will
    not change anything...

    >
    > Of course, that only helps *writes* -- are you sure the par -k
    > corresponds to writes?

    Not sure for this particular case, but I've tried to gather data many
    times and in general it's the same as above.

    > If they're really writes, and you have enough memory to contain
    > a few minutes' worth of writes, try bumping up autoup and lowering
    > bdflushr drastically.

    The problem is that most of the time the application is CPU/memory
    bound, not I/O bound, so we'd like to keep the memory free...

    > PCP, topio (measures things *in front* of the buffer cache), sar -d,
    > sar -u (to know which I/O goes through the buffer cache or is "wphy"
    > direct I/O), osview (same).

    According to sar -u, wphy=0 all the time, and max wio=10% (average 3%)
    with average usr+sys=30% (one week of watching, almost in production mode).

    >
    >
    > > Or can anybody suggest
    > > other ways of I/O tuning (e.g. our xvm slices are not aligned with the
    > > current RAID 5 7+1 128 KB stripe config,

    >
    > Ouch! Really bad for write performance on long writes on the RAID. Make
    > sure your slices are all aligned to the first byte of a stripe width
    > on the 7+1 LUN.
    >

    Could you please help me with the alignment parameters? Should I include
    the parity drive in the calculations? My volhdr size is currently 3076
    (the default); should I use 7*128k*Constant instead, e.g. 7*128*4=3584?

    > But then, you're not doing long writes right now anyway (yet). In
    > that case, I sure hope the controller has write caching (and
    > that you can use it -- if the write cache isn't mirrored to
    > a second controller, it may be too unsafe for you to use).


    The write cache is enabled but not mirrored.

    Thank you in advance,
    Alex

  4. Re: Infortrend SATA-FC RAID and I/O tuning

    googlegroups at mail.ru wrote:
    > Could you please help me with the alignment parameters? Should I include
    > the parity drive in the calculations?


    No.

    > My volhdr size is currently 3076 (the default); should I use
    > 7*128k*Constant instead, e.g. 7*128*4=3584?


    Yes. Slice starts should be divisible by this number.
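
    (In 512-byte blocks: one stripe width is 7 x 128 KB = 896 KB = 1792
    blocks, and 3584 blocks = 1792 KB = exactly two stripe widths, so a
    slice starting on a 3584-block boundary also starts on a stripe-width
    boundary.)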

    --
    Alexis Cousein Senior Systems Engineer
    alexis@sgi.com SGI/Silicon Graphics Brussels

    If I have seen further, it is by standing on reference manuals.


  5. Re: Infortrend SATA-FC RAID and I/O tuning

    Alexis Cousein wrote in message news:<40866F45.1060301@brussels.sgi.com>...

    > > Is there any other way to understand the I/O parameters (is it direct,
    > > buffered, sync or async, etc.), maybe PCP?

    >
    > PCP, topio (measures things *in front* of the buffer cache), sar -d,
    > sar -u (to know which I/O goes through the buffer cache or is "wphy"
    > direct I/O), osview (same).


    Typical "topio":

    IRIX64 6.5 bigben IP27
    132 processes: 126 sleeping, 6 running
    System totals:          5990.0  6256.8    11.6m    23.5m
     PID command user      wcall/s rcall/s  wbyte/s  rbyte/s  wavelen ravelen
    9558 exec.ex prom72     1140.3  1522.8  3249.1k  7956.3k   2849.3  5224.7
    9490 exec.ex prom72     1957.6  2612.4  3063.2k  5923.6k   1564.8  2267.5
    9561 flexExp prom72     2511.8    85.9  5294.9k   243.9k   2108.0  2840.0
    9581 imag3d_ prom72      380.5  1140.3   3065.9  3249.1k      8.1  2849.3
    9454 exec.ex prom72        0.0   348.5      0.0  2422.6k      0.0  6951.5
    8812 exec.ex prom72        0.0   349.1      0.0  2418.2k      0.0  6927.0
    9517 exec.ex prom72        0.0   190.3      0.0  1323.6k      0.0  6957.0

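    (The wavelen/ravelen columns here are just average bytes per call -- e.g.
    for PID 9558, 3249.1k wbyte/s divided by 1140.3 wcall/s is about 2849
    bytes per write, i.e. only a few KB per call.)
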
    Does it mean that IRIX has already done its job and grouped exec.exe's
    < 7 KB I/Os into 64 KB ones? (I did not change maxdmasz yet...)
