IA64 Linux VM performance woes. - SGI


Thread: IA64 Linux VM performance woes.

  1. IA64 Linux VM performance woes.

    Hello all.

    We are trying to deploy a 128 PE SGI Altix 3700 running Linux, with 265GB main
    memory and 10TB RAID disk (TP9500) :

    # cat /etc/redhat-release
    Red Hat Linux Advanced Server release 2.1AS (Derry)

    # cat /etc/sgi-release
    SGI ProPack 2.4 for Linux, Build 240rp04032500_10054-0403250031

    # uname -a
    Linux c 2.4.21-sgi240rp04032500_10054 #1 SMP Thu Mar 25 00:45:27
    PST 2004 ia64 unknown

    We have been experiencing bad performance and downright bad behavior when we
    are trying to read or write large files (10-100GB).

    File Throughput Issues
    ----------------------
    At first, the throughput we are getting without file cache bypass tops out
    at around 440MB/sec. This particular file system has LUNs whose primary FC
    paths go over all four 2Gb/sec FC channels, so the maximum throughput
    should have been close to 800MB/sec.

    I've also noticed that the FC adapter driver threads run at 100% CPU
    utilization when they are pumping data to the RAID for a long time. Is
    there any data copying taking place in the drivers? The HBAs are from
    QLogic.


    VM Untoward Behavior
    -------------------
    A more disturbing issue is that the system does NOT clean up the file cache
    and eventually all memory gets occupied by FS pages. Then the system simply
    hangs.

    We tried enabling / removing bootCPUsets, bcfree and anything else available
    to us. The crashes just keep coming. Recently we also started seeing a lot
    of 'Cannot do kernel page out at address' messages from the bdflush and
    kupdated threads. This complicates any attempt to tune the FS in a way that
    maximizes throughput, and to finally set up sub-volumes on the RAID so that
    different FS performance objectives can be attained.


    Tuning bdflush/kupdated Behavior
    --------------------------------

    One of our main objectives at our center is to maximize file throughput for
    our systems. We are a medium-size supercomputing center where compute- and
    I/O-intensive numerical codes run in batch sub-systems. Several programs
    often expect and generate very large files, on the order of 10-70GB.
    Minimizing file access time is important in a batch environment, since
    processors remain allocated and idle while data is shuttled back and forth
    from the file system.

    Another common problem is the competition between file cache and computation
    pages. We definitely do NOT want file cache pages to be retained while
    computation pages are reclaimed.

    As far as I know, the only place in Linux where the VM / file cache behavior
    can be tuned is the 'bdflush/kupdated' settings. We need a good way to tune
    the 'bdflush' parameters. I have been trying very hard to find in-depth
    documentation on this.

    Unfortunately I have only gleaned some general and abstract advice on the
    bdflush parameters, mainly in the kernel source documentation tree
    (/usr/src/kernel/Documentation/).

    For instance, what is a 'buffer'? Is it a fixed-size block (e.g., a VM page)
    or can it be of any size? This matters because bdflush works with counts and
    percentages of dirty buffers: a small number of large buffers represents
    much more data to transfer to the disks than a large number of small
    buffers.

    Controls that are Needed
    ------------------------
    Ideally we need to:

    1. Set an upper bound on the number of memory pages ever caching FS blocks.

    2. Control the amount of data flushed out to disk in set time periods; that
    is, we need to be able to match the long-term flushing rate with the service
    rate that the I/O subsystem is capable of delivering, while tolerating
    transient spikes. We also need to be able to control the amount of
    read-ahead and write-behind, or even hint that data is only being streamed
    through, never to be reused.

    3. Specify different parameters for 2., above, per file system: we have file
    systems that are meant to transfer wide stripes of sequential data, vs. file
    systems that need to perform well with smaller-block random I/O, vs. ones
    that need to provide access to numerous smaller files. Per-file-system cache
    percentages would also be useful.

    4. Specify, if all else fails, which parts of the FS cache should be flushed
    in the near future.

    5. Provide in-depth technical documentation on the internal workings of the
    file system cache, its interaction with the VM and the interaction of XFS/LVM
    with the VM.

    6. We do operate IRIX Origins and IBM Regatta SMPs where all these issues have
    been addressed to a far more satisfying degree than on Linux. Is the IRIX file
    system cache going to be ported to ALTIX Linux? There is already a LOT of
    experience in IRIX for these types of matters that should NOT remain
    unleveraged.


    Any information, hints, or pointers to in-depth discussion of the bugs and
    tuning of the VM/FS and I/O subsystems, or other relevant topics, would be
    GREATLY appreciated!

    We are willing to share our experience with anyone who is interested in
    improving any of the above kernel sub-systems and provide feedback with
    experimental results and insights.

    Thanks

    Michael Thomadakis

    Supercomputing Center
    Texas A&M University

  2. Re: IA64 Linux VM performance woes.

    In article ,
    miket@hellas.tamu.edu said...
    > At first the throughput we are getting without file cache bypass is at around
    > 440MB/sec MAX. This specific file system has LUNs whose primary FC paths go
    > over all four 2Gb/sec FC channels and the max throughput should have been
    > close to 800MB/sec.
    >
    > I've also noticed that the FC adapter driver threads are running at 100% CPU
    > utilization when they are pumping data to the RAID for a long time. Is there
    > any data copying taking place in the drivers? The HBAs are from QLogic.


    You don't even mention what filesystem you're using. I never tried this
    sort of configuration (how unfortunate), but here's what I know:

    1) most common linux FS suck. Use XFS, and tune your FS parameters
    properly.

    2) The stock QLogic Linux driver is poorly optimized. You'll have to edit
    the qlogicfc.h source file and change the various BLOCK values to much
    higher levels to obtain satisfactory performance (typically, changing
    whatever appears as 64KB to 256KB is a good start).

    Yes, you'll have to recompile the driver. Welcome to Linux, man.
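
    In outline, and assuming the source tree matching the running kernel is
    installed under /usr/src/linux (paths and the exact make invocation may
    differ on an SGI ProPack tree):

    # cd /usr/src/linux
    # vi drivers/scsi/qlogicfc.h           (raise the 64KB-ish constants)
    # make SUBDIRS=drivers/scsi modules    (or simply 'make modules')

    Then copy the rebuilt qlogicfc.o over the installed module, run depmod -a,
    and reload the driver.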

    3) You should first try to optimize the driver and the filesystem and
    test one HBA at a time; don't start by striping everything up...

    >
    > VM Untoward Behavior
    > -------------------
    > A more disturbing issue is that the system does NOT clean up the file cache
    > and eventually all memory gets occupied by FS pages.


    This is the usual Linux way.

    > Then the system simply
    > hangs.


    This is not the usual way.

    > We tried enabling / removing bootCPUsets, bcfree and anything else available
    > to us. The crashes just keep coming. Recently we started experiencing a
    > lot of 'Cannot do kernel page out at address' by the bdflush and kupdated
    > threads as well. This complicates any attempt to tune the FS in a way that can
    > maximize the throughput and finally setup sub-volumes on the RAID in a way
    > that different FS performance objectives can be attained.
    >


    First try one CPU accessing one FS through one HBA, for heaven's sake...
    What did you learn at University?

    --
    Quis, quid, ubi, quibus auxiliis, cur, quomodo, quando?

  3. Re: IA64 Linux VM performance woes.

    In article ,
    "Michael E. Thomadakis" wrote:

    : Hello all.
    :
    : We are trying to deploy a 128 PE SGI Altix 3700 running Linux, with 265GB main
    : memory and 10TB RAID disk (TP9500) :
    :
    : # cat /etc/redhat-release
    : Red Hat Linux Advanced Server release 2.1AS (Derry)
    :
    : # cat /etc/sgi-release
    : SGI ProPack 2.4 for Linux, Build 240rp04032500_10054-0403250031
    :
    : # uname -a
    : Linux c 2.4.21-sgi240rp04032500_10054 #1 SMP Thu Mar 25 00:45:27
    : PST 2004 ia64 unknown
    :
    : We have been experiencing bad performance and downright bad behavior when we
    : are trying to read or write large files (10-100GB).
    [...]

    There's a few things you can try.

    1: Ask on lkml, since nobody in the SGI newsgroups will admit to any linux
    knowledge whatsoever.

    2: Ask SGI if they have a 2.6 kernel you can test. A lot of the changes
    (hopefully improvements) between 2.4 and 2.6 were in the VM and related areas.

    3: Twiddle with the bdflush parameters a bit more. nfract should be low, ndirty
    should be high, nref_dirt should be low, age_buffer should be low. Unfortunately
    I don't think there's any combination of values that will give you a smooth,
    controllable streamed writeout.

    You can't directly control how many pages are used for buffers, cache, or
    data. You can only probabilistically control them through the bdflush
    parameters.
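
    For concreteness: on a stock 2.4 kernel those knobs live in
    /proc/sys/vm/bdflush, which holds nine integers. The first two are nfract
    and ndirty; the meaning and order of the remaining fields varies between
    2.4 trees (and SGI's patches may change them again), so check
    Documentation/sysctl/vm.txt and fs/buffer.c in your installed source before
    touching anything. Purely as an illustration of "nfract low, ndirty high,
    age_buffer low", with placeholder values:

    # cat /proc/sys/vm/bdflush
    # echo "10 2000 0 0 500 1000 60 20 0" > /proc/sys/vm/bdflush

    The echo rewrites all nine fields at once, so start from whatever cat
    printed and change only the fields you actually mean to change.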


    Sorry I can't give you better news.


    Cheers - Tony 'Nicoya' Mantler

    --
    Tony 'Nicoya' Mantler -- Master of Code-fu -- nicoya@ubb.ca
    -- http://nicoya.feline.pp.se/ -- http://www.ubb.ca/ --

  4. Re: IA64 Linux VM performance woes.

    Michael E. Thomadakis wrote:

    > Hello all.
    >
    > We are trying to deploy a 128 PE SGI Altix 3700 running Linux, with 265GB main
    > memory and 10TB RAID disk (TP9500) :
    >
    > # cat /etc/redhat-release
    > Red Hat Linux Advanced Server release 2.1AS (Derry)
    >
    > # cat /etc/sgi-release
    > SGI ProPack 2.4 for Linux, Build 240rp04032500_10054-0403250031
    >
    > # uname -a
    > Linux c 2.4.21-sgi240rp04032500_10054 #1 SMP Thu Mar 25 00:45:27
    > PST 2004 ia64 unknown
    >
    > We have been experiencing bad performance and downright bad behavior when we
    > are trying to read or write large files (10-100GB).
    >
    > File Throughput Issues
    > ----------------------
    > At first the throughput we are getting without file cache bypass is at around
    > 440MB/sec MAX. This specific file system has LUNs whose primary FC paths go
    > over all four 2Gb/sec FC channels and the max throughput should have been
    > close to 800MB/sec.
    >
    > I've also noticed that the FC adapter driver threads are running at 100% CPU
    > utilization when they are pumping data to the RAID for a long time. Is there
    > any data copying taking place in the drivers? The HBAs are from QLogic.
    >

    You do have to live with the Linux 2.4 block layer for some time -- and that
    is painful if you're used to IRIX (or have ever used the 2.6 block layer).

    Some other people commented about the higher layers being a crock, but you
    can safely ignore those comments: the Qlogics driver and XSCSI layer are
    pretty solid (in the case of XSCSI, the architecture is close to that of
    the corresponding IRIX layer). That part of the work is something SGI
    should have nailed down pretty much for you.

    It's worse for RAIDs than for JBODs, as they depend on having lots of I/O
    operations in flight for a single LUN (and, as a result, really hate long
    I/O operations being cut into smaller pieces).

    Of course, make sure that you've used xscsiqueue to set CTQ parameters
    correctly, or not even the 2.6 block layer + XSCSI + Qlogics driver
    combination will save you...

    >
    > VM Untoward Behavior
    > -------------------
    > A more disturbing issue is that the system does NOT clean up the file cache
    > and eventually all memory gets occupied by FS pages. Then the system simply
    > hangs.


    There are quite a few engineering incidents logged on several of these aspects
    -- I would indeed suggest that this forum may not be the appropriate place
    to ask about those.

    There are already quite a number of 2.4 patches that at least address some
    issues; make sure you're current on them. You really, really, really want
    *10065* or its successors on that machine, at the very least (note: assuming
    you are not using CXFS). You seem to be running 10054.
    >
    > We tried enabling / removing bootCPUsets, bcfree and anything else available
    > to us. The crashes just keep coming. Recently we started experiencing a
    > lot of 'Cannot do kernel page out at address' by the bdflush and kupdated
    > threads as well.


    That suggests you may be running with not much swap (or perhaps missing 10065).

    Yes, IRIX wouldn't need any swap, but in high-I/O situations where the buffer
    cache fills up the memory, it *is* possible for Linux to think it has to swap
    pages out when it should be reclaiming clean buffer cache pages. I'd suggest
    configuring about 1/4 of the memory as swap, to make the kernel resist the
    temptation of waking up the out-of-memory killer when you don't want it to,
    and to let it sort things out by moving things to swap (albeit unnecessarily)
    when it has painted itself into a corner, rather than letting it stomp on
    your feet.
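
    A minimal sketch, assuming you have a spare partition or LUN to dedicate to
    it (the device name below is a placeholder):

    # mkswap /dev/swapdev
    # swapon /dev/swapdev
    # swapon -s

    plus a matching "/dev/swapdev swap swap defaults 0 0" line in /etc/fstab so
    it comes back after a reboot.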

    > Tuning bdflush/kupdated Behavior
    > --------------------------------
    >
    > One of our main objectives at our center is to maximize file throughput for our
    > systems. We are a medium size Supercomputing Center where compute and I/O
    > intensive numerical computation code runs in batch sub-systems. Several
    > programs often expect and generate very large files, on the order of 10-70GB.
    > Minimizing file access time is important in a batch environment since
    > processors remain allocated and idle while data is shuttled back and forth
    > from the file system.


    There are some things you can do in a batch environment if users are
    cooperative: if they create their large files in a scratch or tmp area you
    can identify, then instead of running bcfree (or in addition to it) you can
    use a program that calls posix_fadvise() to tell the system you're finished
    with all the files in that directory (invoke it via a find command from an
    epilogue script, as sketched below).
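
    Something along these lines, where fadvise_dontneed stands for a small
    site-written helper that calls posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED)
    on every file named on its command line (the helper name and the scratch
    path are placeholders):

    # find /scratch/$JOBID -type f -exec fadvise_dontneed {} \;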

    Of course, trying to make most applications use FFIO (in C) and thus private
    user-space caches instead of the buffer cache is also a worthwhile endeavour,
    especially for the I/O hogs.
    >
    > Another common problem is the competition between file cache and computation
    > pages. We definitely do NOT want file cache pages being retained while
    > computation pages are reclaimed.


    With the 2.4 kernel, the kernel doesn't go off-node if it can reclaim clean
    buffer cache pages on-node. Of course, that means you'd better be set up
    to flush pages to disk fast enough to avoid on-node memory being filled with
    dirty pages while you're still allocating memory.

    > Ideally we need to:


    Talk to support (and your local system engineer). There are many, many people
    in engineering helping people (probably including you) on this, and the channels
    should be working; at least they are from where I'm sitting...isn't Trey Prinz
    already working with you on these issues?

    With your setup, you'd be pushing the envelope even on an IRIX machine - and
    it's only years of hard work that have made IRIX "do the right thing" in many
    circumstances (but not all) automagically.

    --
    Alexis Cousein Senior Systems Engineer
    alexis@sgi.com SGI/Silicon Graphics Brussels

    If I have seen further, it is by standing on reference manuals.


  5. Re: IA64 Linux VM performance woes.

    Tony 'Nicoya' Mantler wrote:
    > 1: Ask on lkml, since nobody in the SGI newsgroups will admit to any linux
    > knowledge whatsoever.
    >

    Well, that's not strictly true.

    But asking on lkml is a worthwhile endeavour indeed.


    --
    Alexis Cousein Senior Systems Engineer
    alexis@sgi.com SGI/Silicon Graphics Brussels

    If I have seen further, it is by standing on reference manuals.

