Multi-core CPUs & our present Fault Finding capability
Firstly, I hope this is the correct mailing group to post on.
Apologies in advance if I have mailed to the wrong group.
On one site I visit, they are running the 126.96.36.199-0.1 SMP x86_64 GNU/Linux
kernel from an OpenSUSE 10.3 distribution.
The system CPU is the new AMD Phenom(tm) 9600 Quad-Core Processor on a
GA-MA790FX-DQ6 (rev. 1.0) AMD 790FX Chipset motherboard, with 8GB of RAM.
I have noticed that at times (i.e. intermittently over a 96/128 hour
period) one core (or, worst case, two) will be 100% hung, usually
because of some missed event or, at best, some failing/hard-waiting loop.
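
To confirm which core is pegged during one of these episodes, per-core utilization can be derived from two snapshots of /proc/stat. The following is a minimal sketch, not a definitive tool; the function names (`parse_cpu_times`, `busy_percent`, `sample`) are hypothetical, and it assumes the usual /proc/stat layout of user, nice, system, idle, iowait, irq, softirq and steal columns.

```python
import time


def parse_cpu_times(stat_text):
    """Return {cpu_name: (busy_jiffies, total_jiffies)} from /proc/stat text."""
    times = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Keep per-core lines ("cpu0", "cpu1", ...) and skip the aggregate "cpu" line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # Only the first eight columns; later guest columns are already part of user time.
        values = [int(v) for v in fields[1:9]]
        idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
        total = sum(values)
        times[fields[0]] = (total - idle, total)
    return times


def busy_percent(before, after):
    """Percentage of non-idle time per core between two snapshots."""
    result = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before.get(cpu, (0, 0))
        delta_total = total1 - total0
        result[cpu] = 100.0 * (busy1 - busy0) / delta_total if delta_total > 0 else 0.0
    return result


def sample(interval=1.0, path="/proc/stat"):
    """Take two snapshots `interval` seconds apart and return per-core busy %."""
    with open(path) as f:
        before = parse_cpu_times(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_times(f.read())
    return busy_percent(before, after)
```

Calling `sample()` over SSH while the machine is in the failed state should show the stuck core(s) at or near 100% while the others stay low.
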
At these times, the system keyboard is effectively dead or at the very least
unresponsive (my typical test of toggling the caps lock doesn't lead to any
LED status change). All other processes are running correctly, there is no
detectable memory leak, and no device is visibly failing. The
failing processes/tasks are locked solid, so gdb doesn't succeed in attaching
to them as a means of finding the failing party.
One can SSH into the failing system correctly, but a "kill -9" doesn't remove
the suspect task(s); at best they are just zombie'd, according to "top",
which as a matter of course is always running on the console.
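
Whether a task that ignores "kill -9" is a zombie or is stuck in uninterruptible sleep can be read from the state field of /proc/<pid>/stat. A small sketch follows, assuming the standard stat layout; `parse_state` and `describe` are hypothetical helper names, and the state table covers only the common letters.

```python
# Common one-letter task states as reported in /proc/<pid>/stat (not exhaustive).
STATE_NAMES = {
    "R": "running",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep (signals are not acted on until it wakes)",
    "Z": "zombie (already dead, waiting for its parent to reap it)",
    "T": "stopped",
}


def parse_state(stat_line):
    """Extract the one-letter state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST closing parenthesis.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def describe(pid, proc_root="/proc"):
    """Return (state_letter, description) for the given pid."""
    with open(f"{proc_root}/{pid}/stat") as f:
        state = parse_state(f.read())
    return state, STATE_NAMES.get(state, "other")
```

A task reported as "D" is blocked inside the kernel, which would be consistent with a SIGKILL having no visible effect; a "Z" task has already exited and only its parent can clear it.
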
Via SSH one can 'echo "t" > /proc/sysrq-trigger', which dumps individual task
backtraces to syslog as you would expect, but I have so far failed to force any
kernel core dump at all (yes friends, ulimit is set correctly), no matter
what is tried.
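
The same /proc/sysrq-trigger file accepts other keys besides "t". The sketch below wraps the write in a hypothetical helper, `trigger_sysrq`, with a short table of keys taken from the kernel's SysRq documentation. Which keys a given kernel honours depends on its version and configuration, so treat the table as an assumption to check, not a guarantee.

```python
# A few SysRq keys; availability varies by kernel version and configuration.
SYSRQ_KEYS = {
    "t": "dump every task's state and backtrace to the kernel log",
    "w": "dump tasks that are in uninterruptible (blocked) state",
    "l": "show a stack backtrace for all active CPUs",
    "m": "dump current memory information",
    "c": "crash the system on purpose (a dump is taken only if one is configured)",
}


def trigger_sysrq(key, path="/proc/sysrq-trigger"):
    """Write a single SysRq key to the trigger file (needs root on a real system)."""
    if key not in SYSRQ_KEYS:
        raise ValueError(f"unknown or unlisted SysRq key: {key!r}")
    with open(path, "w") as f:
        f.write(key)
```

For a hung core, "l" and "w" may be more telling than "t", since they narrow the output to the active CPUs and the blocked tasks. Note that "c" takes the whole machine down.
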
So, now that you have the situation, I hope you can see why I am interested in:
A> Developing a way to explicitly "reset" an individual CPU core, without
resorting to a complete power down. An in-built kernel "feature" like this
would be an excellent system management device to have in one's toolbox.
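
The nearest existing mechanism to (A) is probably CPU hotplug, where a core is taken offline and brought back through its sysfs "online" file. The sketch below assumes a kernel built with CPU hotplug support and the usual /sys/devices/system/cpu layout; the helper names are hypothetical. It is not a true reset: offlining needs the target core to cooperate, so a core that is hard-locked with interrupts disabled may never complete the request, and the boot CPU often has no "online" file at all.

```python
import os

CPU_SYSFS = "/sys/devices/system/cpu"  # assumed standard location


def online_path(cpu, sysfs_root=CPU_SYSFS):
    """Path of the hotplug control file for the given core number."""
    return os.path.join(sysfs_root, f"cpu{cpu}", "online")


def is_cpu_online(cpu, sysfs_root=CPU_SYSFS):
    """True if the control file currently reads '1'."""
    with open(online_path(cpu, sysfs_root)) as f:
        return f.read().strip() == "1"


def set_cpu_online(cpu, online, sysfs_root=CPU_SYSFS):
    """Write '1' or '0' to the core's control file (needs root on a real system)."""
    path = online_path(cpu, sysfs_root)
    if not os.path.exists(path):
        raise FileNotFoundError(f"cpu{cpu} has no hotplug control file: {path}")
    with open(path, "w") as f:
        f.write("1" if online else "0")


def cycle_cpu(cpu, sysfs_root=CPU_SYSFS):
    """Take a core offline and bring it straight back online."""
    set_cpu_online(cpu, False, sysfs_root)
    set_cpu_online(cpu, True, sysfs_root)
```

On a healthy core, `cycle_cpu(2)` migrates its tasks away and re-initialises the core; whether it helps with a genuinely wedged core is exactly the open question.
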
B> Establishing a kernel mechanism/call that would 100% guarantee a core
dump, irrespective of other kernel/system considerations.
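
For (B), it is worth noting that ulimit only governs core files of user-space processes. A kernel dump normally relies on a crash kernel loaded through kexec/kdump, plus settings that turn an oops or lockup into a panic. The sketch below reads a few such knobs and reports their values. The paths exist on recent kernels but may be absent on older ones, which is reported as `None`. `read_knobs` and `dump_ready` are hypothetical names, and none of this amounts to a 100% guarantee.

```python
import os

# Knobs that influence whether a hang or oops ends in a crash dump.
# Paths are from recent kernels; older kernels may lack some of them.
KNOBS = {
    "/proc/sys/kernel/panic_on_oops": "panic when the kernel oopses",
    "/proc/sys/kernel/softlockup_panic": "panic when a soft lockup is detected",
    "/proc/sys/kernel/hardlockup_panic": "panic when a hard lockup is detected",
    "/sys/kernel/kexec_crash_loaded": "1 if a kdump crash kernel is loaded",
}


def read_knobs(root="/"):
    """Return {path: value or None}; None means the knob is missing or unreadable."""
    report = {}
    for path in KNOBS:
        full = os.path.join(root, path.lstrip("/"))
        try:
            with open(full) as f:
                report[path] = f.read().strip()
        except OSError:
            report[path] = None
    return report


def dump_ready(report):
    """True only if a crash kernel is loaded, i.e. a panic can produce a dump."""
    return report.get("/sys/kernel/kexec_crash_loaded") == "1"
```

If `dump_ready(read_knobs())` is false, forcing a crash by any means will not leave a dump behind, which may explain the failures described above.
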
Your comments and suggestions on how to implement and build such would be
appreciated.
grahame ?aT? wild possum ?com?