On Mon, 26 Nov 2007, Cristian KLEIN wrote:

> Great to hear this problem was solved. I still have one big fat question.
> Why did the system hang and not allow the kernel debugger show up? I
> strongly believe that this bug would have been easily spotted suppose KDB
> would have responded. Is it perhaps possible to "harden" KDB, so that such
> issues are easier to find and fix in future?

I don't know the details of this particular situation, but I can speak to at
least one known issue in DDB: right now, getting into DDB from a serial
console is a very quick and straightforward path, requiring only the delivery
of the serial interrupt and execution of its fast handler. The regular video
console keypresses take a much more circuitous route: syscons isn't MPSAFE,
so they include the scheduling of an ithread and acquisition of Giant. As such,
I've found breaking into the debugger much easier from a serial console for
several years. As Giant has been pushed off larger and larger parts of the
kernel, the syscons break path has gotten a lot more reliable. There will
always be certain cases where a console break (serial or video) will not work,
and those include cases where interrupts are disabled on all CPUs (such as if
spinlocks are held on all CPUs, perhaps due to one being leaked and then a
cascading deadlock). In that situation, there's nothing like a nice NMI
button or IPMI NMI to get into the debugger :-).

We have a feature on i386 and amd64 called MP_WATCHDOG, which allows one CPU
to be dedicated to being a watchdog for the others--on lower-end hardware this
isn't so useful, as CPUs aren't plentiful, but as the number of cores
increases, it becomes more and more possible to run this without disrupting
normal operation of the machine. When it notices the kernel is no longer
running callouts, it delivers an NMI to the other CPUs and kicks (hopefully)
one of them into DDB. There are a number of issues with the implementation,
not least that we do actually run some other code on the watchdog CPU
sometimes as our interrupt routing and scheduler need a bit more adaptation,
but it can be quite useful nonetheless.
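
As a rough illustration of the watchdog idea described above, here is a toy
simulation in Python. It is not the MP_WATCHDOG code: the class, the
`timeout_ticks` threshold, and the heartbeat counter are all hypothetical names
chosen for the sketch. It only models the logic of "a dedicated CPU polls for
evidence that callouts still run, and fires an NMI once that evidence goes
stale".

```python
from dataclasses import dataclass


@dataclass
class WatchdogModel:
    """Toy model of a dedicated watchdog CPU (all names are hypothetical)."""

    timeout_ticks: int      # assumed threshold: stale polls tolerated before firing
    heartbeat: int = 0      # bumped by a periodic callout on a working CPU
    last_seen: int = 0      # heartbeat value observed at the previous poll
    stale_ticks: int = 0    # consecutive polls with no heartbeat progress
    nmi_sent: bool = False  # the model fires at most once

    def callout_ran(self) -> None:
        """Stand-in for the kernel still running callouts."""
        self.heartbeat += 1

    def watchdog_poll(self) -> bool:
        """One poll on the watchdog CPU; True if an NMI is fired on this poll."""
        if self.nmi_sent:
            return False
        if self.heartbeat != self.last_seen:
            # Progress was made since the last poll, so reset the stale count.
            self.last_seen = self.heartbeat
            self.stale_ticks = 0
            return False
        self.stale_ticks += 1
        if self.stale_ticks >= self.timeout_ticks:
            # In the real system this is where an NMI would go to the other CPUs.
            self.nmi_sent = True
            return True
        return False


def simulate(healthy_ticks: int, hung_ticks: int, timeout_ticks: int):
    """Run callouts for healthy_ticks, then stop them for hung_ticks.

    Returns the tick on which the watchdog fired, or None if it never did.
    """
    wd = WatchdogModel(timeout_ticks)
    for tick in range(healthy_ticks + hung_ticks):
        if tick < healthy_ticks:
            wd.callout_ran()
        if wd.watchdog_poll():
            return tick
    return None
```

In this model, a short stall below the threshold never fires, which is the
property that lets such a watchdog run alongside normal operation. The sketch
also assumes the watchdog CPU does nothing else, which, as noted above, is not
quite true of the real implementation.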

Robert N M Watson
Computer Laboratory
University of Cambridge
freebsd-current@freebsd.org mailing list