On Wed, Feb 27, 2008 at 01:11:36AM -0800, Stephen Hurd wrote:
> ... The corrupted sync message scared the heck out of me:
> Waiting (max 60 seconds) for system process `vnlru' to stop...done
> Waiti
> Synncgi n(gm adxi sk6s0, svencoodnedss )r efmoari nsiynsgte.m. .pr1o0c ess
> `syncer' to stop...8 7 8 3 3 3 1 0 0 0 0 done


> And after the reboot, the READ_DMA timeouts were back.

You're not the only one seeing this behaviour. There have been far too
many posts in the past reporting the same thing. Here's the breakdown:

* Some reporting this problem have been told to replace their ATA or
SATA cables (which were previously known to be working, but cables do
go bad) -- and this has fixed the problem for a couple.

* Some have checked their SMART stats and found their disks to be in
perfect condition.

* Some have switched to alternate operating systems (usually Linux) for
a short while and seen no sign of DMA timeouts.

* Some have replaced the storage controller to no avail, and some have
replaced the entire motherboard to no avail. In some cases (myself
included), replacing the motherboard did in fact help.

However: in your case, your disk does look to have problems based on the
SMART output you provided. It does not matter how new/old the disk is,
by the way. I'll point out the problematic stats. You need to replace
the disk ASAP.

BTW, for any SMART stats you see labelled "Offline", the numbers will
not be updated until you perform an offline test (smartctl -t short or
smartctl -t long).

> The only "odd" think I can think of about my system is an unusually high HZ
> value (2386) I'm building a kernel now with 1000 to check if that makes a
> difference.

This is not the cause, rest assured.

> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> 5 Reallocated_Sector_Ct 0x0033 253 253 063 Pre-fail Always - 4

This shows you've had 4 reallocated sectors, meaning your disk does in
fact have bad blocks. In 90% of the cases out there, bad blocks
continue to "grow" over time, due to whatever reason (I remember reading
an article explaining it, but I can't for the life of me find the URL).

> 194 Temperature_Celsius 0x0032 253 253 000 Old_age Always - 48

This is excessive, and may be contributing to problems. A hard disk
running at 48C is not a good sign. This should really be somewhere
between high 20s and mid 30s.

> 195 Hardware_ECC_Recovered 0x000a 253 252 000 Old_age Always - 11498

This implies a large number of ECC (error correction) activities have
occurred, but all were successful.
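
As a rough illustration of how the attributes quoted above could be
checked mechanically, here is a minimal Python sketch. The function
names and the 35C cutoff (taken from the "mid 30s" figure above) are
assumptions for illustration, not anything smartctl itself provides. It
also assumes rows shaped like the ones quoted, with the raw value in the
tenth column.

```python
# Sketch only: thresholds are rules of thumb from this thread, not vendor limits.
SAMPLE = """\
  5 Reallocated_Sector_Ct  0x0033 253 253 063 Pre-fail Always - 4
194 Temperature_Celsius    0x0032 253 253 000 Old_age  Always - 48
195 Hardware_ECC_Recovered 0x000a 253 252 000 Old_age  Always - 11498
"""

MAX_TEMP_C = 35  # assumed cutoff ("mid 30s"); adjust to taste


def parse_attributes(text):
    """Return {attribute_name: raw_value} for SMART attribute rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A row has at least 10 columns and starts with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip the row
    return attrs


def find_warnings(attrs, max_temp_c=MAX_TEMP_C):
    """Flag reallocated sectors and high temperature."""
    found = []
    realloc = attrs.get("Reallocated_Sector_Ct", 0)
    if realloc > 0:
        found.append(f"{realloc} reallocated sectors (bad blocks present)")
    temp = attrs.get("Temperature_Celsius")
    if temp is not None and temp > max_temp_c:
        found.append(f"temperature {temp}C is above {max_temp_c}C")
    return found


if __name__ == "__main__":
    for message in find_warnings(parse_attributes(SAMPLE)):
        print("WARNING:", message)
```

Hardware_ECC_Recovered is parsed but deliberately not flagged, since the
corrections it counts were all successful.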

> Error 2 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
> When the command that caused the error occurred, the device was in an unknown state.
> Error 1 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
> When the command that caused the error occurred, the device was in an unknown state.

These are automated SMART log entries confirming the DMA failures. The
fact that SMART saw them means that the disk is also aware of said
issues. These may have been caused by the reallocated sectors. It's
also interesting that the LBAs are different from the ones FreeBSD
reported issues with.

My advice to you is: replace the disk ASAP. This problem will only get
worse. Try another hard disk brand too (I don't have anything "against"
Maxtor, but it's usually recommended to avoid a brand you've had
problems with until the next time you have issues, then switch brands
again, and so on). I'm very fond of Western Digital's SE16, RE, and RE2
series currently. But avoid Fujitsu and Samsung (both have a long track
record of buggy drive firmware, forcing vendors to make custom
workarounds for issues); stick with Seagate, Western Digital, or Maxtor.

| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, USA |
| Making life hard for others since 1977. PGP: 4BD6C0CB |

freebsd-stable@freebsd.org mailing list