Brand new machine mystery lockup - Hardware


  1. Brand new machine mystery lockup

    I just built a server that seems to be possessed, or at least flaky.

    It's built on an Asus M2N-SLI DELUXE mobo, with an AMD Athlon 64 X2 4600+
    CPU, 2 gig RAM, and an Adaptec ASC-29320ALP SCSI adapter. The SCSI
    adapter has 2 Fujitsu 36 GB 15K drives in a software RAID-1. The power
    supply is a SILVERSTONE ST50EF-SC ATX12V / EPS12V 500W.

    Once in a while (like every 2-5 days) the machine locks up:

    Screen goes black, all fans go to full-on, and neither the power nor the
    reset button will work. It takes a flip of the power switch on the PS to
    restart it.

    Normally I would say that it's the PS, but sometimes - only sometimes,
    though - the system won't boot because mdadm can't find any of the md
    devices to boot. At this point the kernel's already booted off the SCSI
    drives, so I know they're spinning; just mdadm can't find them. This
    typically happens on a soft-reboot; again, I have to fully power cycle
    the machine to get it to boot.

    Of course there are no errors anywhere at any time in any log. The
    machine just stops.

    Google says people have had trouble with that SCSI adapter under Windows,
    but that seems to be a driver problem, and it's reported to work fine with
    Linux.

    So, I have 3 possible culprits:

    Power Supply
    Mobo
    SCSI adapter

    Any place I can look? Any diagnostics I can do? I have about 2 weeks
    left of Newegg's 30 day return timeframe, so I can do some testing....


  2. Re: Brand new machine mystery lockup

    On 2007-10-27, Yan Seiner wrote:
    > I just built a server that seems to be possessed, or at least flaky.
    >
    > It's built on an Asus M2N-SLI DELUXE mobo, with an AMD Athlon 64 X2 4600+
    > CPU, 2 gig RAM, and an Adaptec ASC-29320ALP SCSI adapter. The SCSI
    > adapter has 2 Fujitsu 36 GB 15K drives in a software RAID-1. The power
    > supply is a SILVERSTONE ST50EF-SC ATX12V / EPS12V 500W.
    >
    > Once in a while (like every 2-5 days) the machine locks up:
    >
    > Screen goes black, all fans go to full-on, and neither the power nor the
    > reset button will work. It takes a flip of the power switch on the PS to
    > restart it.
    >
    > Normally I would say that it's the PS, but sometimes - only sometimes,
    > though - the system won't boot because mdadm can't find any of the md
    > devices to boot. At this point the kernel's already booted off the SCSI
    > drives, so I know they're spinning; just mdadm can't find them. This
    > typically happens on a soft-reboot; again, I have to fully power cycle
    > the machine to get it to boot.
    >
    > Of course there are no errors anywhere at any time in any log. The
    > machine just stops.
    >
    > Google says people have had trouble with that SCSI adapter under windows
    > but that seems to be a driver problem and it's reported to work fine with
    > linux.
    >
    > So, I have 3 possible culprits:
    >
    > Power Supply
    > Mobo
    > SCSI adapter
    >
    > Any place I can look? Any diagnostics I can do? I have about 2 weeks
    > left of Newegg's 30 day return timeframe, so I can do some testing....


    Running memtest86 for several hours may be useful.

    HTH

    --
    Robert Riches
    spamtrap42@verizon.net
    (Yes, that is one of my email addresses.)

  3. Re: Brand new machine mystery lockup

    On Oct 27, 9:48 am, Yan Seiner wrote:
    > I just built a server that seems to be possessed, or at least flaky.
    >
    > It's built on an Asus M2N-SLI DELUXE mobo, with an AMD Athlon 64 X2 4600+
    > CPU, 2 gig RAM, and an Adaptec ASC-29320ALP SCSI adapter. The SCSI
    > adapter has 2 Fujitsu 36 GB 15K drives in a software RAID-1. The power
    > supply is a SILVERSTONE ST50EF-SC ATX12V / EPS12V 500W.
    >
    > Once in a while (like every 2-5 days) the machine locks up:
    >
    > Screen goes black, all fans go to full-on, and neither the power nor the
    > reset button will work. It takes a flip of the power switch on the PS to
    > restart it.
    >
    > Normally I would say that it's the PS, but sometimes - only sometimes,
    > though - the system won't boot because mdadm can't find any of the md
    > devices to boot. At this point the kernel's already booted off the SCSI
    > drives, so I know they're spinning; just mdadm can't find them. This
    > typically happens on a soft-reboot; again, I have to fully power cycle
    > the machine to get it to boot.
    >
    > Of course there are no errors anywhere at any time in any log. The
    > machine just stops.
    >
    > Google says people have had trouble with that SCSI adapter under windows
    > but that seems to be a driver problem and it's reported to work fine with
    > linux.
    >
    > So, I have 3 possible culprits:
    >
    > Power Supply
    > Mobo
    > SCSI adapter
    >
    > Any place I can look? Any diagnostics I can do? I have about 2 weeks
    > left of Newegg's 30 day return timeframe, so I can do some testing....


    Have you updated the mobo's firmware? At least a couple of years
    ago, (some) mobos shipped w/ outdated BIOS - it was up to the
    end-user to get updates from the OEM.

    Other things to do:
    -Check dumb things. Completely disassemble and subsequently
    reassemble the entire system, looking for HW 'bugs' along the
    way; is the CPU heatsink tight? Is there enough thermal compound
    on the CPU-heatsink interface? Are boards and memory modules
    inserted firmly? Are cable connectors inserted firmly? The
    principle here is to rule out the obvious, dumb things that bite
    people who don't check for them.

    -Did you calculate total system power load? Is your power supply rated
    high enough for peak load? Do you have another, higher-power, compatible
    unit to swap it with?

    -Read the manual on the BIOS settings, or at least go through all the
    items in the menu. Do they make sense? Did you tweak any voltage,
    speed, or memory access settings? If you have the inclination, return
    them all to 'default' or 'normal' settings, and apply each tweak
    one by one. Any memory tweaks should be followed with a decent round
    of memtest86.

    (Actually, BIOS update step should be here, then repeat the step
    above)

    -Software: I am unfamiliar with mdadm (is that a Minix or *BSD
    boot manager?), but if all of the above checks out okay, that's
    the next place to look for bugs. Is the software 64-bit compatible?
    Are there documentation notes/extra settings/etc for 64 bit systems?
    Have you run some searches on appropriate user lists/web sites/docs?
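
    One way to put numbers on the power-load question in the list above is
    sketched here in Python. The per-component wattages are placeholder
    assumptions for illustration, not datasheet values for this build, and
    the 80% headroom factor is an assumed rule of thumb; only the 500 W
    rating comes from the original post. Substitute figures from your own
    components' spec sheets.

```python
# Hypothetical peak draw per component, in watts. These are placeholder
# assumptions, NOT datasheet values; replace them with real figures.
PEAK_WATTS = {
    "cpu": 90.0,
    "motherboard": 50.0,
    "ram": 10.0,
    "scsi_adapter": 15.0,
    "video_card": 40.0,
    "fans": 10.0,
}
DRIVE_PEAK_WATTS = 30.0  # assumed spin-up draw of a single 15K drive


def total_peak_load(components, drive_count, drive_watts=DRIVE_PEAK_WATTS):
    """Sum the worst-case draw of all components plus all drives."""
    return sum(components.values()) + drive_count * drive_watts


def has_headroom(load_watts, psu_rating_watts, max_fraction=0.8):
    """True if the load stays under an assumed safe fraction of the rating."""
    return load_watts <= psu_rating_watts * max_fraction


if __name__ == "__main__":
    # Compare the box as built (2 drives) with the box fully populated (8).
    for drives in (2, 8):
        load = total_peak_load(PEAK_WATTS, drives)
        ok = has_headroom(load, psu_rating_watts=500.0)
        print(f"{drives} drives: {load:.0f} W peak, headroom ok: {ok}")
```

    Drive spin-up is usually the worst moment for a supply, which is why the
    sketch budgets drives at their start-up draw rather than their idle draw.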

    I hope you don't get the impression I'm talking down to you - I've
    learned the hard way, several times, to check obvious, 'dumb'
    things first. And there is a certain amount of 'magic' to
    completely disassembling and reassembling a system. But the
    steps I described, taken in order, are exactly what I do when
    hunting subtle bugs.

    Good hunting and HTH,
    Tarkin


  4. Re: Brand new machine mystery lockup

    On Sat, 27 Oct 2007 13:48:48 +0000, Yan Seiner wrote:

    > I just built a server that seems to be possessed, or at least flaky.
    >
    > It's built on an Asus M2N-SLI DELUXE mobo, with an AMD Athlon 64 X2 4600+
    > CPU, 2 gig RAM, and an Adaptec ASC-29320ALP SCSI adapter. The SCSI
    > adapter has 2 Fujitsu 36 GB 15K drives in a software RAID-1. The power
    > supply is a SILVERSTONE ST50EF-SC ATX12V / EPS12V 500W.
    >
    > Once in a while (like every 2-5 days) the machine locks up:
    >
    > Screen goes black, all fans go to full-on, and neither the power nor the
    > reset button will work. It takes a flip of the power switch on the PS to
    > restart it.
    >
    > Normally I would say that it's the PS, but sometimes - only sometimes,
    > though - the system won't boot because mdadm can't find any of the md
    > devices to boot. At this point the kernel's already booted off the SCSI
    > drives, so I know they're spinning; just mdadm can't find them. This
    > typically happens on a soft-reboot; again, I have to fully power cycle
    > the machine to get it to boot.
    >
    > Of course there are no errors anywhere at any time in any log. The
    > machine just stops.
    >
    > Google says people have had trouble with that SCSI adapter under windows
    > but that seems to be a driver problem and it's reported to work fine with
    > linux.
    >
    > So, I have 3 possible culprits:
    >
    > Power Supply
    > Mobo
    > SCSI adapter
    >
    > Any place I can look? Any diagnostics I can do? I have about 2 weeks
    > left of Newegg's 30 day return timeframe, so I can do some testing....


    YS:

    I believe the entire ASUS-m2n line is 'queered' w.r.t. *nix. That is,
    every model will fail some function with every linux distro. Different
    failures on different m2n_mobos for different distros.
    Do a GOOGLE -- then get out your 'tin hat'. It's certainly not a hw
    cluster*k that M$ would ever ...

    nss
    ****


  5. Re: Brand new machine mystery lockup

    On Sat, 27 Oct 2007 20:00:04 +0000, Tarkin wrote:

    > On Oct 27, 9:48 am, Yan Seiner wrote:
    >> I just built a server that seems to be possessed, or at least flaky.


    >> Any place I can look? Any diagnostics I can do? I have about 2 weeks
    >> left of Newegg's 30 day return timeframe, so I can do some testing....

    >
    > Have you updated the mobo's firmware? At least a couple of years ago,
    > (some) mobos shipped w/ outdated BIOS - it was up to the end-user to
    > get updates from the OEM.


    Good idea, I think I'll do that anyway. Read on.

    >
    > Other things to do:
    > -Check dumb things. Completely disassemble and subsequently
    > reassemble the entire system, looking for HW 'bugs' along the way; is
    > the CPU heatsink tight? Is there enough thermal compound on the
    > CPU-heatsink interface? Are boards and memory modules inserted firmly?
    > Are cable connectors inserted firmly? The principle here is to rule out
    > the obvious, dumb things that bite people who don't check for them.
    >
    > -Did you calculate total system power load?


    Yes.

    > Is your power supply rated
    > high enough for peak load?


    Yes. It should provide power to all 8 drives in the box, ATM it only has
    2.

    > Do you have another, higher-power, compatible
    > unit to swap it with?


    No.

    >
    > -Read the manual on the BIOS settings, or at least go through all the
    > items in the menu. Do they make sense? Did you tweak any voltage, speed,
    > or memory access settings? If you have the inclination, return them all
    > to 'default' or 'normal' settings, and apply each tweak one by one. Any
    > memory tweaks should be followed with a decent round of memtest86.
    >
    > (Actually, BIOS update step should be here, then repeat the step above)
    >
    > -Software: I am unfamiliar with mdadm (is that a Minix or *BSD
    > boot manager?),


    It's linux's softraid manager.

    > but if all of the above checks out okay, that's the next
    > place to look for bugs. Is the software 64-bit compatible? Are there
    > documentation notes/extra settings/etc for 64 bit systems? Have you run
    > some searches on appropriate user lists/web sites/docs?


    It's pretty bulletproof - I've not had any problems with mdadm in years
    of using it.

    >
    > I hope you don't get the impression I'm talking down to you - I've learned
    > the hard way, several times, to check obvious, 'dumb' things first. And
    > there is a certain amount of 'magic' to completely disassembling and
    > reassembling a system. But the steps I described, taken in order, are
    > exactly what I do when hunting subtle bugs.


    No, not offended. Exactly the procedure I followed - and discovered that
    the culprit is most likely a bad SCSI cable. I have /tmp on a raid0
    partition striped across 2 drives, and the scsi drives would just
    disappear, bringing the whole system down.

    I reseated the cable and found the drives wouldn't boot at all. So I've
    slowed the whole SCSI bus down to a crawl and I have my system back. New
    cable on order.

    Fingers crossed.
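
    For anyone chasing the same symptom, a small check for arrays that have
    lost a member can be sketched in Python by reading the text that Linux
    exposes in /proc/mdstat. This is a minimal sketch, not a monitoring
    tool: the sample text is made up in the usual mdstat layout rather than
    captured from this machine, the function name is hypothetical, and it
    only flags redundant arrays such as RAID-1, because a RAID-0 stripe
    reports no member-count status line.

```python
import re

# Matches an array header line such as "md0 : active raid1 sdb1[1] sda1[0]".
HEADER_RE = re.compile(r"^(md\w+) : (\w+)")
# Matches the member-count status such as "[2/1] [U_]".
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return {array_name: (expected, working)} for arrays missing members."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            expected, working = int(status.group(1)), int(status.group(2))
            if working < expected:
                degraded[current] = (expected, working)
    return degraded


# Made-up sample in the usual /proc/mdstat layout, not output from this box.
SAMPLE = """\
Personalities : [raid0] [raid1]
md1 : active raid0 sdb2[1] sda2[0]
      8000000 blocks 64k chunks

md0 : active raid1 sdb1[1] sda1[0]
      27840896 blocks [2/1] [U_]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
    # On a live Linux system: degraded_arrays(open("/proc/mdstat").read())
```

    Run from cron with the output appended to a log, something like this
    would at least leave a trace when a drive drops off the bus before the
    machine goes down.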

  6. Re: Brand new machine mystery lockup

    On Mon, 29 Oct 2007 00:57:58 +0000, Yan Seiner wrote:


    >> Have you updated the mobo's firmware? At least a couple of years ago,
    >> (some) mobos shipped w/ outdated BIOS - it was up to the end-user to
    >> get updates from the OEM.

    >
    > Good idea, I think I'll do that anyway. Read on.


    You might want to wait until you clear up the flakiness. If the system hangs while you are
    updating the bios . . .

  7. Re: Brand new machine mystery lockup

    On Sat, 27 Oct 2007 13:48:48 +0000, Yan Seiner wrote:

    > I just built a server that seems to be possessed, or at least flaky.
    >
    > It's built on an Asus M2N-SLI DELUXE mobo, with an AMD Athlon 64 X2
    > 4600+ CPU, 2 gig RAM, and an Adaptec ASC-29320ALP SCSI adapter. The
    > SCSI adapter has 2 Fujitsu 36 GB 15K drives in a software RAID-1. The
    > power supply is a SILVERSTONE ST50EF-SC ATX12V / EPS12V 500W.
    >
    > Once in a while (like every 2-5 days) the machine locks up:
    >
    > Screen goes black, all fans go to full-on, and neither the power nor the
    > reset button will work. It takes a flip of the power switch on the PS
    > to restart it.
    >
    > Normally I would say that it's the PS, but sometimes - only sometimes,
    > though - the system won't boot because mdadm can't find any of the md
    > devices to boot. At this point the kernel's already booted off the SCSI
    > drives, so I know they're spinning; just mdadm can't find them. This
    > typically happens on a soft-reboot; again, I have to fully power cycle
    > the machine to get it to boot.
    >
    > Of course there are no errors anywhere at any time in any log. The
    > machine just stops.
    >
    > Google says people have had trouble with that SCSI adapter under windows
    > but that seems to be a driver problem and it's reported to work fine
    > with linux.
    >
    > So, I have 3 possible culprits:
    >
    > Power Supply
    > Mobo
    > SCSI adapter
    >
    > Any place I can look? Any diagnostics I can do? I have about 2 weeks
    > left of Newegg's 30 day return timeframe, so I can do some testing....


    I wrote a system stress test that you can run,

    http://www.polybus.com/sys_basher_web/

    Sys_basher puts all of the subsystems except graphics under maximum load.
    It's multithreaded so it can keep all of your cores at maximum load. It
    also does a good job of stressing memory and disk subsystems. The log
    file records the temperatures after each test and it writes the log to
    disk between tests so that you'll have a record if the system crashes.
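
    The two ideas described above, loading every core and forcing the log to
    disk between tests, can be illustrated with a short Python sketch. This
    is not sys_basher's code and makes no attempt to match it: the function
    names, round count, and busy-loop worker are all assumptions chosen for
    illustration, and temperature logging is left out.

```python
import os
import subprocess
import sys
import tempfile

# Tiny busy loop run in each worker process; argv[1] is the duration in seconds.
BURN_SNIPPET = (
    "import sys, time\n"
    "deadline = time.monotonic() + float(sys.argv[1])\n"
    "x = 0\n"
    "while time.monotonic() < deadline:\n"
    "    x = (x * 31 + 7) % 1000003\n"
)


def burn_all_cores(seconds, workers=None):
    """Start one busy-looping process per core and wait for all of them."""
    workers = workers or os.cpu_count() or 1
    procs = [
        subprocess.Popen([sys.executable, "-c", BURN_SNIPPET, str(seconds)])
        for _ in range(workers)
    ]
    return [proc.wait() for proc in procs]


def log_line(path, text):
    """Append a line and push it to disk so it survives a hard lockup."""
    with open(path, "a") as log:
        log.write(text + "\n")
        log.flush()
        os.fsync(log.fileno())


def run_tests(log_path, rounds=3, seconds=1.0):
    """Alternate load rounds with synced log entries."""
    for n in range(1, rounds + 1):
        log_line(log_path, f"round {n} starting")
        codes = burn_all_cores(seconds)
        log_line(log_path, f"round {n} done, worker exit codes {codes}")


if __name__ == "__main__":
    demo_log = os.path.join(tempfile.gettempdir(), "stress_sketch.log")
    run_tests(demo_log, rounds=2, seconds=0.5)
    print("log written to", demo_log)
```

    Separate processes are used instead of threads because a pure-Python
    busy loop in threads would not keep more than one core busy. The
    "starting" entry is written before each round, so if the machine hangs
    mid-round the log still shows which round it died in.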


  8. Re: Brand new machine mystery lockup

    General Schvantzkopf wrote:
    > On Sat, 27 Oct 2007 13:48:48 +0000, Yan Seiner wrote:
    >
    >> I just built a server that seems to be possessed, or at least flaky.
    >>
    >> It's built on an Asus M2N-SLI DELUXE mobo, with an AMD Athlon 64 X2
    >> 4600+ CPU, 2 gig RAM, and an Adaptec ASC-29320ALP SCSI adapter. The
    >> SCSI adapter has 2 Fujitsu 36 GB 15K drives in a software RAID-1. The
    >> power supply is a SILVERSTONE ST50EF-SC ATX12V / EPS12V 500W.
    >>
    >> Once in a while (like every 2-5 days) the machine locks up:
    >>
    >> Screen goes black, all fans go to full-on, and neither the power nor the
    >> reset button will work. It takes a flip of the power switch on the PS
    >> to restart it.
    >>
    >> Normally I would say that it's the PS, but sometimes - only sometimes,
    >> though - the system won't boot because mdadm can't find any of the md
    >> devices to boot. At this point the kernel's already booted off the SCSI
    >> drives, so I know they're spinning; just mdadm can't find them. This
    >> typically happens on a soft-reboot; again, I have to fully power cycle
    >> the machine to get it to boot.
    >>
    >> Of course there are no errors anywhere at any time in any log. The
    >> machine just stops.
    >>
    >> Google says people have had trouble with that SCSI adapter under windows
    >> but that seems to be a driver problem and it's reported to work fine
    >> with linux.
    >>
    >> So, I have 3 possible culprits:
    >>
    >> Power Supply
    >> Mobo
    >> SCSI adapter
    >>
    >> Any place I can look? Any diagnostics I can do? I have about 2 weeks
    >> left of Newegg's 30 day return timeframe, so I can do some testing....

    >
    > I wrote a system stress test that you can run,
    >
    > http://www.polybus.com/sys_basher_web/
    >
    > Sys_basher puts all of the subsystems except graphics under maximum load.
    > It's multithreaded so it can keep all of your cores at maximum load. It
    > also does a good job of stressing memory and disk subsystems. The log
    > file records the temperatures after each test and it writes the log to
    > disk between tests so that you'll have a record if the system crashes.
    >


    I would say that some piece of hardware is defunct, or just possibly you
    have a driver that has a bug. I once wrote code that exhibited this kind of
    behaviour. We left a hardware analyser on it: if a timer interrupt
    happened in one, and only one, byte of the BIOS code, it went into a
    'deadly embrace'.


    Try seeing if any updated drivers or firmware exist for the SCSI adapter.

  9. Re: Brand new machine mystery lockup

    On Oct 28, 7:57 pm, Yan Seiner wrote:
    > On Sat, 27 Oct 2007 20:00:04 +0000, Tarkin wrote:
    > > On Oct 27, 9:48 am, Yan Seiner wrote:
    > >> I just built a server that seems to be possessed, or at least flaky.

    >
    > >> Any place I can look? Any diagnostics I can do? I have about 2 weeks
    > >> left of Newegg's 30 day return timeframe, so I can do some testing....

    >
    > > Have you updated the mobo's firmware? At least a couple of years ago,
    > > (some) mobos shipped w/ outdated BIOS - it was up to the end-user to
    > > get updates from the OEM.

    >
    > Good idea, I think I'll do that anyway. Read on.
    >
    >
    >
    > > Other things to do:
    > > -Check dumb things. Completely disassemble and subsequently
    > > reassemble the entire system, looking for HW 'bugs' along the way; is
    > > the CPU heatsink tight? Is there enough thermal compound on the
    > > CPU-heatsink interface? Are boards and memory modules inserted firmly?
    > > Are cable connectors inserted firmly? The principle here is to rule out
    > > the obvious, dumb things that bite people who don't check for them.

    >
    > > -Did you calculate total system power load?

    >
    > Yes.
    >
    > > Is your power supply rated
    > > high enough for peak load?

    >
    > Yes. It should provide power to all 8 drives in the box, ATM it only has
    > 2.
    >
    > > Do you have another, higher-power, compatible
    > > unit to swap it with?

    >
    > No.
    >
    >
    >
    > > -Read the manual on the BIOS settings, or at least go through all the
    > > items in the menu. Do they make sense? Did you tweak any voltage, speed,
    > > or memory access settings? If you have the inclination, return them all
    > > to 'default' or 'normal' settings, and apply each tweak one by one. Any
    > > memory tweaks should be followed with a decent round of memtest86.

    >
    > > (Actually, BIOS update step should be here, then repeat the step above)

    >
    > > -Software: I am unfamiliar with mdadm (is that a Minix or *BSD
    > > boot manager?),

    >
    > It's linux's softraid manager.
    >
    > > but if all of the above checks out okay, that's the next
    > > place to look for bugs. Is the software 64-bit compatible? Are there
    > > documentation notes/extra settings/etc for 64 bit systems? Have you run
    > > some searches on appropriate user lists/web sites/docs?

    >
    > It's pretty bulletproof - I've not had any problems with mdadm in years
    > of using it.
    >
    >
    >
    > > I hope you don't get the impression I'm talking down to you - I've learned
    > > the hard way, several times, to check obvious, 'dumb' things first. And
    > > there is a certain amount of 'magic' to completely disassembling and
    > > reassembling a system. But the steps I described, taken in order, are
    > > exactly what I do when hunting subtle bugs.

    >
    > No, not offended. Exactly the procedure I followed - and discovered that
    > the culprit is most likely a bad SCSI cable. I have /tmp on a raid0
    > partition striped across 2 drives, and the scsi drives would just
    > disappear, bringing the whole system down.
    >
    > I reseated the cable and found the drives wouldn't boot at all. So I've
    > slowed the whole SCSI bus down to a crawl and I have my system back. New
    > cable on order.
    >
    > Fingers crossed.


    Right on. On the one hand, I'm sorry you're having system problems;
    on the other, I can appreciate the irony that in this new wunder era
    of 64 bit megamachines, silly things like a bad cable can bring them
    down like a ton of bricks ;^)

    Good luck and TTFN,
    Tarkin


  10. Re: Brand new machine mystery lockup

    Tarkin wrote:
    > On Oct 28, 7:57 pm, Yan Seiner wrote:


    >> I reseated the cable and found the drives wouldn't boot at all. So I've
    >> slowed the whole SCSI bus down to a crawl and I have my system back. New
    >> cable on order.
    >>
    >> Fingers crossed.

    >
    > Right on. On the one hand, I'm sorry you're having system problems;
    > on the other, I can appreciate the irony that in this new wunder era
    > of 64 bit megamachines, silly things like a bad cable can bring them
    > down like a ton of bricks ;^)


    Oh, my bad. The cable is one that I had laying around; I chose not to
    buy a new one with the system. So I have no one to blame but myself for
    being a scrooge and trying to save $50.

    --Yan

  11. Re: Brand new machine mystery lockup

    In comp.periphs.scsi Yan Seiner wrote:
    > I just built a server that seems to be possessed, or at least flaky.
    >
    > It's built on an Asus M2N-SLI DELUXE mobo, with an AMD Athlon 64 X2 4600+
    > CPU, 2 gig RAM, and an Adaptec ASC-29320ALP SCSI adapter. The SCSI
    > adapter has 2 Fujitsu 36 GB 15K drives in a software RAID-1. The power
    > supply is a SILVERSTONE ST50EF-SC ATX12V / EPS12V 500W.
    >
    > Once in a while (like every 2-5 days) the machine locks up:
    >
    > Screen goes black, all fans go to full-on, and neither the power nor the
    > reset button will work. It takes a flip of the power switch on the PS to
    > restart it.
    >
    > Normally I would say that it's the PS, but sometimes - only sometimes,
    > though - the system won't boot because mdadm can't find any of the md
    > devices to boot. At this point the kernel's already booted off the SCSI
    > drives, so I know they're spinning; just mdadm can't find them. This
    > typically happens on a soft-reboot; again, I have to fully power cycle
    > the machine to get it to boot.
    >
    > Of course there are no errors anywhere at any time in any log. The
    > machine just stops.
    >
    > Google says people have had trouble with that SCSI adapter under windows
    > but that seems to be a driver problem and it's reported to work fine with
    > linux.
    >
    > So, I have 3 possible culprits:
    >
    > Power Supply
    > Mobo
    > SCSI adapter
    >
    > Any place I can look? Any diagnostics I can do? I have about 2 weeks
    > left of Newegg's 30 day return timeframe, so I can do some testing....


    Newegg is pretty good about returns. Just send it back, and try again.



  12. Re: Brand new machine mystery lockup

    On Oct 29, 12:17 pm, CptDondo wrote:
    > Tarkin wrote:
    > > On Oct 28, 7:57 pm, Yan Seiner wrote:
    > >> I reseated the cable and found the drives wouldn't boot at all. So I've
    > >> slowed the whole SCSI bus down to a crawl and I have my system back. New
    > >> cable on order.

    >
    > >> Fingers crossed.

    >
    > > Right on. On the one hand, I'm sorry you're having system problems;
    > > on the other, I can appreciate the irony that in this new wunder era
    > > of 64 bit megamachines, silly things like a bad cable can bring them
    > > down like a ton of bricks ;^)

    >
    > Oh, my bad. The cable is one that I had laying around; I chose not to
    > buy a new one with the system. So I have no one to blame but myself for
    > being a scrooge and trying to save $50.
    >
    > --Yan


    Ouch. That has to sting. I'm guessing if you had ordered the cable
    w/ the rest of the system, it would have been cheaper than running
    to the nearest retailer or having one 3-day shipped.

    A lesson for the logs, and an affirmation of an old saw:
    "Pennywise and pound-foolish". I've made similar mistakes,
    and they're always painful 8-.

    TTFN,
    Stevo


  13. Re: Brand new machine mystery lockup

    In comp.periphs.scsi General Schvantzkopf wrote:
    > On Sat, 27 Oct 2007 13:48:48 +0000, Yan Seiner wrote:
    >
    >> I just built a server that seems to be possessed, or at least flaky.
    >>
    >> It's built on an Asus M2N-SLI DELUXE mobo, with an AMD Athlon 64 X2
    >> 4600+ CPU, 2 gig RAM, and an Adaptec ASC-29320ALP SCSI adapter. The
    >> SCSI adapter has 2 Fujitsu 36 GB 15K drives in a software RAID-1. The
    >> power supply is a SILVERSTONE ST50EF-SC ATX12V / EPS12V 500W.
    >>
    >> Once in a while (like every 2-5 days) the machine locks up:
    >>
    >> Screen goes black, all fans go to full-on, and neither the power nor the
    >> reset button will work. It takes a flip of the power switch on the PS
    >> to restart it.
    >>
    >> Normally I would say that it's the PS, but sometimes - only sometimes,
    >> though - the system won't boot because mdadm can't find any of the md
    >> devices to boot. At this point the kernel's already booted off the SCSI
    >> drives, so I know they're spinning; just mdadm can't find them. This
    >> typically happens on a soft-reboot; again, I have to fully power cycle
    >> the machine to get it to boot.
    >>
    >> Of course there are no errors anywhere at any time in any log. The
    >> machine just stops.
    >>
    >> Google says people have had trouble with that SCSI adapter under windows
    >> but that seems to be a driver problem and it's reported to work fine
    >> with linux.
    >>
    >> So, I have 3 possible culprits:
    >>
    >> Power Supply
    >> Mobo
    >> SCSI adapter
    >>
    >> Any place I can look? Any diagnostics I can do? I have about 2 weeks
    >> left of Newegg's 30 day return timeframe, so I can do some testing....

    >
    > I wrote a system stress test that you can run,
    >
    > http://www.polybus.com/sys_basher_web/
    >
    > Sys_basher puts all of the subsystems except graphics under maximum load.
    > It's multithreaded so it can keep all of your cores at maximum load. It
    > also does a good job of stressing memory and disk subsystems. The log
    > file records the temperatures after each test and it writes the log to
    > disk between tests so that you'll have a record if the system crashes.


    This program looks interesting. Can you port it to Solaris 10? Bug me off
    list if you need access to test machines to run it on.

  14. Re: Brand new machine mystery lockup

    On Mon, 29 Oct 2007 19:56:15 +0000, Cydrome Leader wrote:

    > In comp.periphs.scsi General Schvantzkopf
    > wrote:
    >> On Sat, 27 Oct 2007 13:48:48 +0000, Yan Seiner wrote:
    >>
    >>> I just built a server that seems to be possessed, or at least flaky.
    >>>
    >>> It's built on an Asus M2N-SLI DELUXE mobo, with an AMD Athlon 64 X2
    >>> 4600+ CPU, 2 gig RAM, and an Adaptec ASC-29320ALP SCSI adapter. The
    >>> SCSI adapter has 2 Fujitsu 36 GB 15K drives in a software RAID-1. The
    >>> power supply is a SILVERSTONE ST50EF-SC ATX12V / EPS12V 500W.
    >>>
    >>> Once in a while (like every 2-5 days) the machine locks up:
    >>>
    >>> Screen goes black, all fans go to full-on, and neither the power nor
    >>> the reset button will work. It takes a flip of the power switch on
    >>> the PS to restart it.
    >>>
    >>> Normally I would say that it's the PS, but sometimes - only sometimes,
    >>> though - the system won't boot because mdadm can't find any of the md
    >>> devices to boot. At this point the kernel's already booted off the
    >>> SCSI drives, so I know they're spinning; just mdadm can't find them.
    >>> This typically happens on a soft-reboot; again, I have to fully power
    >>> cycle the machine to get it to boot.
    >>>
    >>> Of course there are no errors anywhere at any time in any log. The
    >>> machine just stops.
    >>>
    >>> Google says people have had trouble with that SCSI adapter under
    >>> windows but that seems to be a driver problem and it's reported to
    >>> work fine with linux.
    >>>
    >>> So, I have 3 possible culprits:
    >>>
    >>> Power Supply
    >>> Mobo
    >>> SCSI adapter
    >>>
    >>> Any place I can look? Any diagnostics I can do? I have about 2 weeks
    >>> left of Newegg's 30 day return timeframe, so I can do some testing....

    >>
    >> I wrote a system stress test that you can run,
    >>
    >> http://www.polybus.com/sys_basher_web/
    >>
    >> Sys_basher puts all of the subsystems except graphics under maximum
    >> load. It's multithreaded so it can keep all of your cores at maximum
    >> load. It also does a good job of stressing memory and disk subsystems.
    >> The log file records the temperatures after each test and it writes the
    >> log to disk between tests so that you'll have a record if the system
    >> crashes.

    >
    > this program looks interesting. Can you port it solaris 10? bug me off
    > list if you need access to test machines to run it on.


    It should compile on Solaris 10; it's straight POSIX C. It uses lm_sensors,
    which doesn't exist in Solaris, but you can build it with a make
    nosensors. It also uses pthreads; I don't know if Solaris supports
    pthreads. If you have a Solaris system, please test it and let me know if
    it works. It's open source, so if you need to make a patch I'd appreciate
    it if you would send me the changes and I'll incorporate them in the next
    release. You can find my e-mail address on the sys_basher website.



  15. Re: Brand new machine mystery lockup

    Yan Seiner wrote:
    > On Sat, 27 Oct 2007 20:00:04 +0000, Tarkin wrote:
    >
    >
    >> but if all of the above checks out okay, that's the next
    >> place to look for bugs. Is the software 64-bit compatible? Are there
    >> documentation notes/extra settings/etc for 64 bit systems? Have you run
    >> some searches on appropriate user lists/web sites/docs?

    >
    > It's pretty bulletproof - I've not had any problems with mdadm in years
    > of using it.
    >


    Well Sh*t. I take it back. Turns out mdadm is broken on AMD64/Debian
    Lenny.



    So... The total wreckage so far:

    1. Bad SCSI cable - causes hard drives to disappear
    2. broken mdadm - causes the machine to fail to boot as mdadm segfaults
    rather than building arrays
    3. broken keyboard (hitting the keyboard hard enough causes the reboots)

    I hope I've worked through all the bad luck for next server build.

    :-)
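
    A failure like item 2 is easy to miss when a boot script only checks for
    success or failure. As a minimal Python sketch, assuming a POSIX system,
    a wrapper can tell a command that was killed by a signal such as SIGSEGV
    apart from one that exited with an ordinary error. The function names
    are hypothetical, and the demonstration uses a stand-in child process
    that raises SIGSEGV against itself rather than mdadm.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable verdict."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative code means the child was killed by that signal.
        return "killed by " + signal.Signals(-returncode).name
    return f"failed with exit status {returncode}"


def run_and_describe(argv):
    """Run a command and report how it ended."""
    return describe_exit(subprocess.run(argv).returncode)


if __name__ == "__main__":
    # Stand-in child that sends SIGSEGV to itself, to demonstrate the check.
    crasher = [
        sys.executable,
        "-c",
        "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)",
    ]
    print(run_and_describe(crasher))
    # On the real machine one might instead wrap a command such as
    # ["mdadm", "--assemble", "--scan"] (needs root; shown only as an example).
```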

  16. Re: Brand new machine mystery lockup

    On Oct 31, 11:41 am, CptDondo wrote:
    > Yan Seiner wrote:
    > > On Sat, 27 Oct 2007 20:00:04 +0000, Tarkin wrote:

    >
    > >> but if all of the above checks out okay, that's the next
    > >> place to look for bugs. Is the software 64-bit compatible? Are there
    > >> documentation notes/extra settings/etc for 64 bit systems? Have you run
    > >> some searches on appropriate user lists/web sites/docs?

    >
    > > It's pretty bulletproof - I've not had any problems with mdadm in years
    > > of using it.

    >
    > Well Sh*t. I take it back. Turns out mdadm is broken on AMD64/Debian
    > Lenny.
    >
    >
    >
    > So... The total wreckage so far:
    >
    > 1. Bad SCSI cable - causes hard drives to disappear
    > 2. broken mdadm - causes the machine to fail to boot as mdadm segfaults
    > rather than building arrays
    > 3. broken keyboard (hitting the keyboard hard enough causes the reboots)
    >
    > I hope I've worked through all the bad luck for next server build.
    >
    > :-)


    Man-oh-man. My condolences. But, it's a brave
    new 64-bit world, so the bug will hopefully
    be fixed soon. I assume you've begged, borrowed,
    stolen, or bought a new cable (or will soon).
    And keyboards are super-cheap these days. It
    probably sounds trite, but cheer up- at least you
    didn't cook the processor or some other unholy
    disaster (involving sparks and magic smoke).

    TTFN,
    Tarkin


  17. Re: Brand new machine mystery lockup

    Yan Seiner writes:

    >Screen goes black, all fans go to full-on, and neither the power nor the
    >reset button will work. It takes a flip of the power switch on the PS to
    >restart it.
    >


    Ok. This behavior indicates that the following is happening:

    - Interrupts are disabled
    - The processor is in a tight loop
    - The processor temperature goes up (due to the tight loop)
    - an SMI interrupt triggers the BIOS to speed up the fans
    (due to the high processor temperature).

    As for reasons, in descending order of probability:

    - Undetected memory parity error
    - HBA hardware problem (driver is polling and never sees the polled bit)
    (this would also be considered a driver problem as no sane driver
    should poll forever).
    - A device driver or operating system bug.
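
    The point about polling can be illustrated with a bounded polling loop.
    This is a user-space Python sketch of the general pattern, not driver
    code: the function and exception names are hypothetical, and the
    "stuck bit" is a stand-in for a hardware status bit that never gets set.
    A real driver would apply the same idea with its own timers.

```python
import time


class PollTimeout(Exception):
    """Raised when the polled condition never becomes true in time."""


def poll_until(ready, timeout_s, interval_s=0.01):
    """Call ready() until it returns True or timeout_s seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if ready():
            return
        if time.monotonic() >= deadline:
            # Give up and report, instead of spinning forever.
            raise PollTimeout(f"condition not met within {timeout_s} s")
        time.sleep(interval_s)


if __name__ == "__main__":
    # Hypothetical stand-in for a hardware status bit that never gets set.
    def stuck_bit():
        return False

    try:
        poll_until(stuck_bit, timeout_s=0.2)
    except PollTimeout as err:
        print("gave up:", err)
```

    With a deadline in place, a missing status bit turns into a logged error
    rather than a silent hang with interrupts off.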

    scott
