Re: LOR on sleepqueue chain locks, Was: LOR sleepq/scrlock
On 23/04/2008, at 3:34 AM, John Baldwin wrote:
>>> The real problem at the bottom of the screen though is a real issue.
>>> It's a LOR of two different sleepqueue chain locks. The problem is
>>> that when setrunnable() encounters a swapped out thread it tries to
>>> wake up proc0, but if proc0 is asleep (which is typical) then its
>>> thread lock is a sleep queue chain lock, so waking up a swapped out
>>> thread from wakeup() will trigger this LOR.
>>> I think the best fix is to not have setrunnable() kick proc0.
>>> Perhaps setrunnable() should return an int and return true if proc0
>>> needs to be awakened and false otherwise. Then the sleepq code (b/c
>>> only threads can be swapped out anyway) can return that value from
>>> sleepq_resume_thread() and can call kick_proc0() directly once it
>>> has dropped all of its own locks.
>>> John Baldwin
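For anyone following along, the fix proposed above can be sketched as a small user-space model. This is only an illustration of the "report the need, act after unlocking" pattern, not kernel code: every name ending in `_model`, the `chain_lock` struct, and the two counters are hypothetical stand-ins, and the real setrunnable(), sleepq_resume_thread() and kick_proc0() have different signatures and do far more work.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a sleepqueue chain lock (a spin mutex in the kernel). */
struct chain_lock {
	bool held;
};

/* Hypothetical, heavily reduced stand-in for a thread. */
struct thread_model {
	bool swapped_out;	/* thread needs proc0 (the swapper) to bring it back */
	bool awake;		/* set once the thread has been taken off the sleep queue */
};

static int chain_locks_held;	/* number of chain locks currently held */
static int proc0_kicks;		/* number of times proc0 has been woken */

static void
chain_lock_acquire(struct chain_lock *l)
{
	assert(!l->held);
	l->held = true;
	chain_locks_held++;
}

static void
chain_lock_release(struct chain_lock *l)
{
	assert(l->held);
	l->held = false;
	chain_locks_held--;
}

/*
 * Waking proc0 may need proc0's own thread lock, which can itself be a
 * chain lock. The model therefore insists that no chain lock is held here.
 */
static void
kick_proc0_model(void)
{
	assert(chain_locks_held == 0);
	proc0_kicks++;
}

/* Returns 1 if the caller must wake proc0 later, 0 otherwise. */
static int
setrunnable_model(struct thread_model *td)
{
	td->awake = true;
	return (td->swapped_out ? 1 : 0);
}

/* Passes the "kick needed" result up instead of acting on it. */
static int
sleepq_resume_thread_model(struct thread_model *td)
{
	return (setrunnable_model(td));
}

/* Wake one thread; proc0 is kicked only after the chain lock is dropped. */
static void
wakeup_model(struct chain_lock *sc, struct thread_model *td)
{
	int need_kick;

	chain_lock_acquire(sc);
	need_kick = sleepq_resume_thread_model(td);
	chain_lock_release(sc);

	if (need_kick)
		kick_proc0_model();
}
```

In this model, calling wakeup_model() on a thread with swapped_out set bumps proc0_kicks exactly once, and moving the kick_proc0_model() call inside the locked region would trip the assertion, which is the model's analogue of the reported LOR.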
>> The way you describe it, it almost sounds like this LOR should be
>> happening for everyone, all the time. To try and eliminate the
>> which trigger it for us, we tried the following: removed PAE from
>> kernel, disabled PF. Neither of these things made any difference and
>> the error is fairly quickly reproducible (within a couple of hours
>> running various things to load the machine). The one thing we did not
>> test yet is removing ZFS from the picture. Note also that this box ran
>> for years and years on FreeBSD 4.x without a hiccup (non PAE, ipfw
>> instead of pf and no ZFS of course).
> There are two things. 1) Most people who run witness (that I know of)
> don't run it on spinlocks because of the overhead, so LORs of spin
> locks are less well-reported than LORs of other locks (mutexes,
> rwlocks, etc.). 2) You have to have enough load on the box to swap out
> active processes to get into this situation. Between those I think
> that is why this is not more widely
Thanks for your efforts so far to track this LOR down. I've been
keeping an eye on cvs logs, but haven't seen anything which looks like
a patch for this.
* is this still outstanding?
* or will it be addressed soon?
* if not, should I create a PR so that it doesn't get forgotten?
* in our case, although we can trigger it quickly with some load, the
  problem occurs (and causes a complete machine lock-up) even under < 10%
  load. Not sure if the combination of PAE/ZFS/SCHED_ULE exacerbates
  that in any way compared to a 'standard' build.