I have made some changes and provided the requested details below.
The issue is still occurring, so if it looks like it's going to be more
trouble than it's worth, I will probably just replace three of the PATA
disks with a single SATA disk and move the remaining PATA disk to the
nVidia ATA controller.

Thanks for your help thus far!

On Sun, Oct 19, 2008 at 8:25 AM, Jeremy Chadwick wrote:
> On Sun, Oct 19, 2008 at 03:32:29AM +1100, Kristian Rooke wrote:
>> Thanks for the quick response!
>>
>> Please see requested output below:

>
> Cool, thanks. One thing I forgot to ask for was "vmstat -i" output.


interrupt                          total       rate
irq1: atkbd0                           6          0
irq6: fdc0                             1          0
irq14: ata0                         2060          2
irq16: atapci1                       612          0
irq17: em0                           810          0
cpu0: timer                      1812646       1998
cpu1: timer                      1812344       1998
Total                            3628479       4000
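
In case it helps with reading the table, here is a minimal Python sketch
that parses the output above and looks for IRQ lines listing more than one
device. The function names are my own, and treating "several device names
on one irq line" as the sign of sharing is an assumption about how vmstat
formats shared interrupts, not something confirmed in this thread.

```python
import re

# Output of "vmstat -i" pasted from this message.
SAMPLE = """\
interrupt total rate
irq1: atkbd0 6 0
irq6: fdc0 1 0
irq14: ata0 2060 2
irq16: atapci1 612 0
irq17: em0 810 0
cpu0: timer 1812646 1998
cpu1: timer 1812344 1998
Total 3628479 4000
"""

# "<source>: <device(s)> <total> <rate>". The header and "Total" lines
# have no colon-terminated first field, so they do not match.
LINE = re.compile(r"^(\S+):\s+(.+?)\s+(\d+)\s+(\d+)$")


def parse_vmstat_i(text):
    """Return a list of (source, [devices], total, rate) tuples."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            source, devices, total, rate = m.groups()
            rows.append((source, devices.split(), int(total), int(rate)))
    return rows


def shared_irqs(rows):
    """IRQ lines naming more than one device (assumed sharing format)."""
    return {src: devs for src, devs, _t, _r in rows
            if src.startswith("irq") and len(devs) > 1}


rows = parse_vmstat_i(SAMPLE)
print("shared IRQs:", shared_irqs(rows) or "none")
print("sum of per-source totals:", sum(t for _s, _d, t, _r in rows))
```

On the sample above this reports no shared IRQs, and the per-source totals
add up to the 3628479 shown on the Total line.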

> For now, let's break it down for ease of understanding:
>
> FreeBSD 7.0-RELEASE i386, built February 2008.
>
> atapci0: nVidia nForce MCP73 ATA133 controller -- IRQ 14
> atapci1: Silicon Image 0680 ATA133 controller -- IRQ 16
>
> ata0: attached to atapci0
> ata1: attached to atapci0
> ata2: attached to atapci1
> ata3: attached to atapci1
>
> ad0: at ata0-master PIO4
> ad4: at ata2-master PIO4
> ad5: at ata2-slave PIO4
> ad6: at ata3-master PIO4
> ad7: at ata3-slave PIO4
>
> ATA errors are reported for disks ad4, ad5, ad6, and ad7. ad0 appears
> to be error-free.
>
> First and foremost: there are known problems with Silicon Image
> controllers on all operating systems (Windows, Linux, and FreeBSD in
> particular), known for causing data loss and other sporadic issues.
> This is at least confirmed on their SATA controllers, and I've become
> quite the "pick something else" advocate when it comes to their stuff.
> However: I've no idea about their PATA controllers.


I was originally using a Promise PATA controller, but that's when the
issues first began, so I bought a cheap Silicon Image PATA controller to
replace it. After reading your email I have swapped the Silicon Image card
back out for the Promise controller. Below is the relevant line from dmesg:

atapci1: port
0xcf00-0xcf07,0xce00-0xce03,0xcd00-0xcd07,0xcc00-0xcc03,0xcb00-0xcb0f mem
0xefbf0000-0xefbfffff irq 16 at device 5.0 on pci1

>
> Secondly, so far there isn't any evidence that the ad0 disk, which uses
> the nVidia controller, has any problem -- all the disks having problems
> are on the Silicon Image controller. That is a very key piece of
> information here.
>
> If when you're writing data to, say, the ad4 disk, and you start to see
> errors on all disks (ad4 through ad7), then what this probably means is
> the controller has locked up or is behaving badly. This adds further
> evidence that the Silicon Image controller may be at fault here.
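
The isolation argument above can be sketched in a few lines of Python,
using only the device layout listed earlier in this message. The
dictionary and function names are hypothetical, and the result only says
which controller slot is common to the failing disks; it cannot tell the
card apart from its cables or the power feeding those disks.

```python
# Topology as listed in the quoted breakdown.
CHANNEL_TO_CONTROLLER = {
    "ata0": "atapci0",
    "ata1": "atapci0",
    "ata2": "atapci1",
    "ata3": "atapci1",
}
DISK_TO_CHANNEL = {
    "ad0": "ata0",
    "ad4": "ata2",
    "ad5": "ata2",
    "ad6": "ata3",
    "ad7": "ata3",
}
# Disks for which ATA errors were reported.
FAILING = {"ad4", "ad5", "ad6", "ad7"}


def controllers_for(disks):
    """Set of controllers that the given disks hang off."""
    return {CHANNEL_TO_CONTROLLER[DISK_TO_CHANNEL[d]] for d in disks}


healthy = set(DISK_TO_CHANNEL) - FAILING
# A controller is suspect if it carries failing disks and no healthy ones.
suspects = controllers_for(FAILING) - controllers_for(healthy)
print("suspect controllers:", sorted(suspects))
```
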
>
> Thirdly, you said the system requires a hard reset to get things back in
> working order. Sometimes this can be induced by a power supply that
> isn't providing decent/proper voltages, or is being overloaded,
> particularly during heavy disk I/O (drawing more power in some cases).
> It might be good to check your voltages inside of your system BIOS,
> write them down, and type them in here. FreeBSD does not provide a
> decent set of tools for monitoring this stuff inside the OS (yet; I'm
> working on it, mainly for server boards. I do what I can...)


When the error messages (the same as pasted previously) start appearing on
the console, the system becomes unresponsive.
I can no longer SSH to the machine, and when I try to use the console it
just keeps scrolling the disk error messages.

I am currently using an Anter 550W PSU. Below are the voltage readings from
the BIOS:

Vcore - 1.19V
Vcc12V - 12.30V
Vcc3.3V - 3.28V
Vcc5.0V - 5.04V

> But keep in mind that a controller locking up hard could also require a
> hard reset (pressing reset on the front of the PC) -- a soft reset
> (Ctrl-Alt-Del) would probably work, except much of the running kernel is
> spinning hard trying to deal with ATA problems.
>
> Fourthly, I see a "" line in your original dmesg.
> Can you provide that output? It's important -- sometimes people have
> seen issues where their ATA controller shows problems, but it turns out
> to be an IRQ sharing or device compatibility problem with another device
> (e.g. their board was showing ATA errors, but at the exact same time,
> also showing NIC watchdog timeouts or other anomalies). They omitted
> the dmesg data thinking it had nothing to do with the problem, when in
> fact it helps determine if the issue is truly with one piece or the
> entire system.


The omitted section was simply repeats of the error messages I previously
provided. I just had another look, and there was no mention of anything but
ad4-ad7 errors in /var/log/messages. However, if you believe the extra logs
would help, let me know and I will drop the whole lot in.

Also, the most recent times this error has occurred, nothing was written to
/var/log/messages at all. I'm guessing this is due to the system load while
the ATA problem is happening.

>
> Next, let's take a look at your SMART output, which tells a tale of
> something very very bad:
>
> Disk ad4 has a good temperature, and no sign of bad blocks/sectors. The
> disk had been powered on for a total of 7799 hours.
>
> There was a CRC error detected when attempting to set specific
> capabilities on the device. The error occurred at LBA 0 on the disk,
> which is completely bizarre, but the SMART error log might just say LBA
> 0 to indicate "no LBA was being accessed" (e.g. the error was purely
> during the mode setting attempts). However, the SMART error "wraps" its
> timestamps at 49.710 days (every 1193.046 hours), so it's going to be
> difficult to determine if the below SMART error log entry was from long
> ago, or was fairly recent. Looking at other disks might help, so let's
> continue.
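
To make the wrap-around concrete, here is a short Python sketch. It
assumes the 49.710-day figure comes from a 32-bit millisecond counter
(2**32 ms), which works out to roughly 1193 hours. The function name and
the 100-hour logged value are hypothetical; only the 7799-hour power-on
total for ad4 is taken from this thread.

```python
# Assumption: the timestamp is a 32-bit millisecond counter.
WRAP_MS = 2 ** 32
wrap_hours = WRAP_MS / 1000 / 3600
wrap_days = wrap_hours / 24
print(f"wrap period: {wrap_days:.3f} days = {wrap_hours:.3f} hours")


def possible_power_on_hours(logged_hours, total_power_on_hours):
    """Every power-on time consistent with a wrapped timestamp.

    The real time is logged_hours plus some whole number of wrap
    periods, bounded above by the disk's total power-on hours.
    """
    candidates = []
    t = logged_hours
    while t <= total_power_on_hours:
        candidates.append(round(t, 1))
        t += wrap_hours
    return candidates


# Hypothetical logged value of 100 hours on a disk with 7799 hours.
candidates = possible_power_on_hours(100.0, 7799)
print(len(candidates), "possible times:", candidates)
```

With several candidates spread across the disk's lifetime, the wrapped
timestamp alone cannot say whether an error is old or recent.
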
>
> Disk ad5 has an excellent temperature, and no sign of bad blocks/sectors
> either. The disk has been powered on for a total of 11956 hours. No
> errors were found in the SMART log.
>
> Disk ad6 has a good temperature, and no sign of bad blocks/sectors. No
> errors were found in the SMART log.
>
> Disk ad7 has an excellent temperature, and no sign of bad blocks/sectors
> either. The disk had been powered on for a total of 12512 hours.
>
> However, much like disk ad4, this disk also witnessed a CRC error when
> attempting to either do a DMA read operation or when setting
> capabilities on the device. I'm prone to believe it's when setting
> capabilities, because LBA 0 is also seen here, which isn't a likely LBA.
> This error happened at the 6310 hour mark, which was about half of its
> lifetime ago.
>
> All of this is somewhat of a mystery. Disk ad4 is on a completely
> different physical cable than disk ad7, so that *could* rule out cabling
> problems. The errors seen are only when setting device capabilities
> (making an educated guess, but I'm not 100% positive), not when actually
> accessing data on the disks. Heck, I'm not even sure the errors in the
> SMART log are accurate, as the disks have been powered on for quite some
> time after the supposed errors occurred.
>
> Power draw could also explain this, ditto with the voltage possibility.
>
> I would start by doing 3 easy things:
>
> 1) Re-enable DMA mode; it's obviously not the cause of your problems
> since PIO mode shows the same problem for you,


DMA has now been re-enabled.

> 2) Replacing both sets of PATA cables with brand new ones. There's no
> evidence this is the problem, but changing these is easy and cheap. If
> it doesn't solve the problem, then you're one step closer to tracking it
> down,


Both the cables and the controller have been changed. I just ran some more
checks and confirmed the issue is still occurring.
I have been using Samba to copy files over, but I also tested by mounting
an NTFS volume locally and copying from that, and the errors still occurred.

> 3) Getting voltages from the BIOS and providing them here. Again, this
> won't be an accurate representation of the system under load, but it's
> the best we've got right now.


As above.

> Assuming the problem continues after #2, and the voltages shown in #3
> look good, this is what I'd do for the next step:
>
> Buy a PCI, PCI-X (if so, make sure it's backwards-compatible with
> 32-bit 33MHz PCI slots, unless you actually have a PCI-X slot!) or PCI
> Express PATA controller -- specifically, one that does not use a Silicon
> Image chip. This may be hard to accomplish since PATA is a dying
> interface (and good riddance!).
>
> I will also stress this in capitals, just to make it clear: DO NOT BUY A
> SATA CONTROLLER THEN USE PATA-TO-SATA ADAPTERS. Those adapters will
> cause you even more problems. If you go the SATA route, buy actual SATA
> disks and recycle or sell your old PATA ones.
>
> That said, Highpoint and Promise both make PATA controllers -- not to
> mention, I even see that you've tried to load the hptrr(4) driver on
> that system! :-) Additionally, DO NOT use the "RAID" features of these
> cards (if you end up buying one that has such); just plug the disks in
> and use them in a JBOD fashion.
>
> You might find that the disk numbers (e.g. ad4) change on you when
> doing this; that's to be expected.
>
> Others might recommend that you should try replacing the PSU before
> buying a new PATA controller, but I have doubts the problem is with the
> PSU; I would expect more odd/awkward problems if the PSU was to blame.
> If you do try a different PSU, go with one that does 450W or more. You
> DO NOT need a l33t-g4m3-d00dz-omgwtfbbq!! 850-1000W PSU; most of the
> power draw for hard disks happens during power-on, when the disks have
> to spin up, not once they're already spinning.
>
> Hope this helps, and good luck!
>
> --
> | Jeremy Chadwick jdc at parodius.com |
> | Parodius Networking http://www.parodius.com/ |
> | UNIX Systems Administrator Mountain View, CA, USA |
> | Making life hard for others since 1977. PGP: 4BD6C0CB |
>
>

_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/lis...freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"