-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Jeremy Chadwick wrote:
> On Mon, Oct 27, 2008 at 11:16:59AM +0100, Vaclav Haisman wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> Hi,
>> I have recently bought a new disk (Seagate 500G, ST3500320NS). I have
>> enabled SMART checking using the smartmontools as usual for the disk
>> (/dev/ad6 -a -S on -s (S/../.././03|L/../../7/03) -m root). The problem
>> is that each time the test runs I get messages like the following in
>> /var/log/messages:
>>
>> Oct 26 04:54:15 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (1 retry
>> left) LBA=836986454
>> Oct 26 04:54:25 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (0
>> retries left) LBA=836986454
>> Oct 26 04:54:25 35 kernel: ad6: FAILURE - WRITE_DMA48 timed out
>> LBA=836986454
>> Oct 26 04:54:25 35 kernel: g_vfs_done():ad6s2d[WRITE(offset=13150142464,
>> length=16384)]error = 5
>>
>> And the SMART test results log on the disk contains a line like this:
>>
>> # 1  Short offline  Interrupted (host reset)  00%  297  -

>
> First and foremost, your above smartd.conf -s flags are conflicting.
> Your long offline test will never get run on Sunday; the short will run
> first, and the long won't ever start (because the short is already
> running). I would recommend telling the short test to run only on
> days 1-6 (Monday through Saturday), leaving Sunday (day 7) solely for
> the long test. (I noticed this because the "Interrupted" entry above
> shows that a short test was interrupted, not a long one.)

Thanks, I have not noticed the overlap at all.
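
For reference, a minimal sketch of why the two patterns collide. It
assumes smartd matches the -s regexp against a string of the form
T/MM/DD/d/HH (d = 1 for Monday through 7 for Sunday), as the smartd.conf
man page describes. The helper names are made up for illustration, and
Python's re module stands in for the POSIX extended regexps that smartd
uses, so treat it as an approximation rather than smartd's actual logic.

```python
import re
from datetime import datetime

def smartd_matches(regexp, test_type, when):
    """Approximate smartd's check: build T/MM/DD/d/HH and match the regexp."""
    # isoweekday() gives 1 (Monday) .. 7 (Sunday), the numbering assumed here.
    stamp = (f"{test_type}/{when.month:02d}/{when.day:02d}/"
             f"{when.isoweekday()}/{when.hour:02d}")
    return re.fullmatch(regexp, stamp) is not None

def due(regexp, when):
    """Return the test types (S = short, L = long) whose pattern matches."""
    return [t for t in "SL" if smartd_matches(regexp, t, when)]

ORIGINAL = r"(S/../.././03|L/../../7/03)"       # schedule from my smartd.conf
SEPARATED = r"(S/../../[1-6]/03|L/../../7/03)"  # short Mon-Sat, long on Sunday

sunday = datetime(2008, 10, 26, 3)  # Sunday, 03:00
monday = datetime(2008, 10, 27, 3)  # Monday, 03:00

print("original,  Sunday:", due(ORIGINAL, sunday))
print("original,  Monday:", due(ORIGINAL, monday))
print("separated, Sunday:", due(SEPARATED, sunday))
print("separated, Monday:", due(SEPARATED, monday))
```

With the original pattern both test types come due in the same hour on
Sunday; with the day-of-week field restricted to [1-6] for the short
test, only one type matches on any given day.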

>
> Second, your short offline test runs at 0300, but the errors you're
> seeing are at 0454 in the morning. A short offline test does not
> take 2 hours to run -- it takes 2 to 10 minutes -- unless the
> system is also in the middle of doing a lot of I/O, in which case the
> short test will be suspended.
>
> There are cronjobs (specifically periodic jobs) that run starting at
> 0301 in the morning ("periodic daily"), and many of those are I/O bound.
> This could possibly extend the length of the short test until 0454.
>
> Weekly periodic jobs run at 0415 in the morning, on Sundays. These also
> perform a lot of disk I/O, so it's possible that on Sunday specifically
> the short SMART test gets pushed back quite some time.
>
> Third, the DMA timeouts you're seeing are possibly caused by the drive
> taking too long when internally suspending the SMART test.
>
> In most cases, it's safe for SMART tests (short and long) to be run
> while the machine is operational, and disk I/O requests are being
> performed. When an I/O request comes and the disk is in the middle of
> performing a SMART test, the drive has to stop the SMART test (i.e.
> "suspend" it), complete the I/O request, then resume the SMART test.
>
> The FreeBSD ATA layer has a 5 second timeout on I/O requests; if it
> doesn't receive an acknowledgement back from the controller (disk)
> within 5 seconds, it'll report a timeout on whatever operation it was
> performing. I'm thinking the disk gets stuck in a "do the offline
> test, no wait, stop, there's an I/O request, okay it's done, continue
> the test, no wait, stop, there's another I/O" loop.

Can I raise the timeout? Just for the sake of elimination.

>
> Another possibility is that your drive really *does* have a bad block at
> LBA 836986454, and that one of those cron/periodic jobs is what's
> noticing it, and that upon noticing a bad block, the drive more or less
> aborts the SMART test to perform internal remapping of the block.
>
> To confirm this, you would need to boot the SeaTools utilities from DOS
> or from a CD (see Seagate's site) and run a full sector scan (NOT the
> "quick" test). This takes a few hours. Assuming it comes back clean,
> then my above claim of the offline test taking too long to suspend is
> probably the case.
>
> Possibly this is a firmware bug in the drive -- you might consider
> mailing Seagate about this problem, although I doubt their Tier 1
> support will understand what the issue is.
>
> Is the block number always the same? Do you only see this error on
> Sundays? These are two questions which might help narrow things down.

Nope, the LBA is always different and I see it in the logs once every day.

>
>> This is on 7.1-PRERELEASE #0: Wed Oct 15 18:56:54 UTC 2008, with GENERIC
>> kernel.
>>
>> Now, does the timeout cause loss of any data? Is there anything besides
>> disabling the testing that I can do about it?

>
> Do you understand what short and long offline tests actually do and what
> they're used for? :-) If so, you'd know that running them periodically
> is more or less silly (IMHO).

I do not, not completely. I think I just copied the settings from
somewhere and only tweaked them a bit whenever I added a disk.

>
> If you're trying to accomplish a cheap version of disk scrubbing, e.g.
> scanning the entire disk for bad blocks and reporting them or having them
> automatically remapped by the drive, consider using sysutils/diskcheckd,
> which was made for this purpose. However, be aware of a problem I've
> run into with it (still needs someone clueful to figure out why this
> happens):
> http://www.freebsd.org/cgi/query-pr.cgi?pr=ports/115853
>
> I do not advocate the use of periodic offline tests on disks, especially
> at such aggressive intervals (daily). In fact, I don't even know why
> Bruce added that option to smartd. There are only a few attributes in
> SMART which get updated on offline tests, so I fail to see the point.
>
> You shouldn't be doing what you're doing, IMHO. If you want to do
> these tests once every 2 weeks or once a month, that'd be a better idea.
> Stick with the short test, and do it during a time when disk I/O is
> very low (try something like 7am on a Saturday). Don't go with 2am
> if your system/environment honours Daylight Saving Time, because that
> could cause the test to run twice.

OK, I am taking the advice and have set longer intervals between checks.
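
As an illustration of such a schedule (not necessarily the one I ended
up with), here is a small sketch of a hypothetical pattern that asks for
a short test at 07:00 on the first Saturday of each month. It assumes
the same T/MM/DD/d/HH string format from the smartd.conf man page, with
6 meaning Saturday. The pattern and function names are made up for this
example, and the loop uses naive local times, so it says nothing about
how smartd itself behaves across a Daylight Saving Time change.

```python
import re
from datetime import datetime, timedelta

# Hypothetical pattern: short test, any month, day 01-07, Saturday, 07:00.
MONTHLY_SHORT = r"S/../0[1-7]/6/07"

def schedule_string(test_type, when):
    """Build the T/MM/DD/d/HH string assumed to be what smartd matches."""
    return (f"{test_type}/{when.month:02d}/{when.day:02d}/"
            f"{when.isoweekday()}/{when.hour:02d}")

def firing_times(regexp, test_type, year):
    """List every hour of the given year at which the pattern matches."""
    hits = []
    t = datetime(year, 1, 1)
    while t.year == year:
        if re.fullmatch(regexp, schedule_string(test_type, t)):
            hits.append(t)
        t += timedelta(hours=1)
    return hits

hits = firing_times(MONTHLY_SHORT, "S", 2009)
for h in hits:
    print(h.strftime("%a %Y-%m-%d %H:%M"))
```

Restricting the day-of-month field to 01-07 together with a fixed
weekday gives exactly one match per month, and 07:00 stays clear of the
hours that DST transitions usually touch.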

Thanks for such an extensive answer.

- --
VH
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.9 (FreeBSD)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iFYEAREIAAYFAkkF+LoACgkQhQBMvHf/WHmX3ADfTosXsJI0wAKl1MT7PCvBpmOm
WnK9GavuuFsptwDgnjD0+tLGkZ2EEXjiXnvN/6wkz+wMWPCXYcHpGQ==
=oDRL
-----END PGP SIGNATURE-----