CPU utilization and faulty CPUs on IRIX 6.5.x - SGI


  1. CPU utilization and faulty CPUs on IRIX 6.5.x

    If the o/s detects a fault on a CPU, will the o/s stop assigning
    threads to that CPU? Any difference if a thread has locked itself
    down to a CPU that becomes faulty?

    Also, can the o/s itself be "locked down" to a specific CPU? This was
    mentioned to me by a non-technical manager. I would assume that some
    part of the o/s is operating on each CPU.

    Thanks.

    Les

  2. Re: CPU utilization and faulty CPUs on IRIX 6.5.x

    Les Hartzman wrote:

    > If the o/s detects a fault on a CPU, will the o/s stop assigning
    > threads to that CPU? Any difference if a thread has locked itself
    > down to a CPU that becomes faulty?


    Chances are that IRIX will panic. Then at reboot, it may disable
    that CPU.

    There is also indeed a mode where a single CPU will hang in the
    kernel. Things like TLB flushes that require that CPU's cooperation
    will then gradually hang other CPUs in the kernel, so IRIX will
    still tend to die a horrible death as more and more unrelated CPUs
    are crippled, just somewhat more slowly (and sometimes slowly
    enough to let you see who's ill using icrash).
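
    To illustrate why one hung CPU drags the others down, here is a toy
    simulation of that cascade. It is only a sketch under simple
    assumptions, not a model of the IRIX kernel: each healthy CPU starts a
    TLB flush with some fixed probability per tick, and a flush cannot
    complete while any CPU is hung, so the initiator hangs too. The function
    name, the probability and the tick model are all made up for the example.

```python
import random


def simulate_cascade(n_cpus, hung_cpu, flush_prob=0.2, max_ticks=1000, seed=0):
    """Toy model of a hang spreading through TLB flushes.

    Returns a list with the number of hung CPUs after each tick,
    starting with the single CPU that hung first.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    hung = {hung_cpu}
    history = [len(hung)]
    for _ in range(max_ticks):
        if len(hung) == n_cpus:
            break
        # A flush needs every CPU to cooperate. At least one CPU is
        # already hung, so any CPU that starts a flush gets stuck waiting.
        newly_hung = {
            cpu
            for cpu in range(n_cpus)
            if cpu not in hung and rng.random() < flush_prob
        }
        hung |= newly_hung
        history.append(len(hung))
    return history


if __name__ == "__main__":
    for tick, count in enumerate(simulate_cascade(32, hung_cpu=5)):
        print(f"tick {tick}: {count} of 32 CPUs hung")
```

    A lower flush_prob stretches the decline out over more ticks, which
    corresponds to the "slow enough to look around with icrash" case.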
    > Also, can the o/s itself be "locked down" to a specific CPU?


    Some kernel threads can, and processes certainly can (see man boot_cpuset),
    and processors can be set aside so they don't have to service interrupts.

    But you'll still need some kernel code to run there -- if only to manage
    the local runqueue and local node-related kernel structures. There's
    no such thing as the "service CPUs" on a T3E.

    --
    Alexis Cousein Senior Systems Engineer
    alexis@sgi.com SGI/Silicon Graphics Brussels

    Nobody Expects the Belgian Inquisition!


  3. Re: CPU utilization and faulty CPUs on IRIX 6.5.x

    Alexis Cousein wrote in message news:...
    > Les Hartzman wrote:
    >
    > > If the o/s detects a fault on a CPU, will the o/s stop assigning
    > > threads to that CPU? Any difference if a thread has locked itself
    > > down to a CPU that becomes faulty?

    >
    > Chances are that IRIX will panic. Then at reboot, it may disable
    > that CPU.
    >
    > There is also indeed a mode where a single CPU will hang in the
    > kernel, but things like TLB flushes that require that CPU's
    > cooperation will gradually tend to hang other CPUs in the kernel,
    > so IRIX will tend to die a horrible death by the crippling of more
    > and more unrelated CPUs, but somewhat slower
    > (and sometimes, slow enough to enable you to see who's ill using icrash).


    Thank you for responding, Alexis.

    Well more specifically, on a 32 processor machine, when bit errors are
    detected on a processor (I believe these were occurring in the cache)
    and IRIX does not panic, what can be expected? Will anything change
    in the scheduling of threads to that processor?

    Les

  4. Re: CPU utilization and faulty CPUs on IRIX 6.5.x

    Les Hartzman wrote:

    > Alexis Cousein wrote in message news:...
    >
    >>Les Hartzman wrote:
    >>
    >>
    >>>If the o/s detects a fault on a CPU, will the o/s stop assigning
    >>>threads to that CPU? Any difference if a thread has locked itself
    >>>down to a CPU that becomes faulty?

    >>
    >>Chances are that IRIX will panic. Then at reboot, it may disable
    >>that CPU.
    >>
    >>There is also indeed a mode where a single CPU will hang in the
    >>kernel, but things like TLB flushes that require that CPU's
    >>cooperation will gradually tend to hang other CPUs in the kernel,
    >>so IRIX will tend to die a horrible death by the crippling of more
    >>and more unrelated CPUs, but somewhat slower
    >>(and sometimes, slow enough to enable you to see who's ill using icrash).
    >>

    >
    > Thank you for responding, Alexis.
    >
    > Well more specifically, on a 32 processor machine, when bit errors are
    > detected on a processor (I believe these were occurring in the cache)


    Well, the caches and memories all have ECC. So, if you have single-bit errors,
    you'll just see an error in SYSLOG and everything will just continue
    running.

    But if there's an *un*correctable
    cache error, it all depends on what the memory operations were.

    If you get hit by an uncorrectable error in a processor cache, chances are
    that processor is so sick it'll make everything go down.

    If you have one on *memory*, IRIX will try to contain the fault if it's
    recent: if the page was unused or in the process of being zeroed, it
    marks the page as bad; if a job was using it, it kills only that job
    and marks the page as bad. Either way it continues running. But if the
    page held vital kernel data, the only safe thing a kernel *can* do is
    panic.
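
    The containment policy for uncorrectable memory errors described above
    can be written down as a small decision function. This is only a sketch
    of the policy as stated in this post, not IRIX kernel code; the PageUse
    categories, the function name and the action strings are invented for
    the illustration.

```python
from enum import Enum


class PageUse(Enum):
    """What the faulty page was being used for (hypothetical categories)."""
    UNUSED = "unused"
    BEING_ZEROED = "being zeroed"
    USER_JOB = "data of a user job"
    KERNEL = "vital kernel data"


def contain_memory_fault(page_use):
    """Return the actions taken for an uncorrectable error on a memory page."""
    if page_use in (PageUse.UNUSED, PageUse.BEING_ZEROED):
        # Nobody depends on the contents: retire the page and carry on.
        return ["mark page bad", "continue running"]
    if page_use is PageUse.USER_JOB:
        # Only the job that owned the page has lost data.
        return ["kill job using page", "mark page bad", "continue running"]
    # Kernel data can no longer be trusted, so nothing is safe to continue.
    return ["panic"]


if __name__ == "__main__":
    for use in PageUse:
        print(f"{use.value}: {', '.join(contain_memory_fault(use))}")
```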

    > and IRIX does not panic, what can be expected? Will anything change
    > in the scheduling of threads to that processor?
    >

    I have seen situations where IRIX sees one processor in 0% user, 0% sys,
    0% intr and 0% idle, and the rest of the machine continues running. But
    these situations tend to be rare -- a CPU that goes awry tends to confuse
    the hub it's attached to, and an error avalanche is much more likely.
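
    If you collect per-CPU utilization figures, spotting that symptom is
    easy to automate: a healthy CPU's percentages add up to roughly 100,
    while a stuck one reports zero in every column. The sketch below
    assumes the figures have already been parsed into a dictionary; that
    input format and the function name are hypothetical and not tied to any
    particular IRIX tool.

```python
def find_stalled_cpus(samples):
    """Return the ids of CPUs reporting 0% in every utilization column.

    samples maps a CPU id to a dict with the keys "user", "sys",
    "intr" and "idle" (percentages). A missing key counts as 0.
    """
    fields = ("user", "sys", "intr", "idle")
    return sorted(
        cpu
        for cpu, pct in samples.items()
        if all(pct.get(field, 0) == 0 for field in fields)
    )


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    samples = {
        0: {"user": 40, "sys": 10, "intr": 0, "idle": 50},
        1: {"user": 0, "sys": 0, "intr": 0, "idle": 0},
        2: {"user": 0, "sys": 0, "intr": 0, "idle": 100},
    }
    print(find_stalled_cpus(samples))  # prints [1]
```

    Note that a fully idle CPU (0/0/0/100) is not flagged; only a CPU that
    accounts for no time at all is.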

    --
    Alexis Cousein Senior Systems Engineer
    alexis@sgi.com SGI/Silicon Graphics Brussels

    Nobody Expects the Belgian Inquisition!

