v2.6.26-rc9: kernel BUG at kernel/sched.c:5858! - Kernel


Thread: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

  1. v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

    Hi,

    Looks like CPU hotplug still has some problems. Just got this on
    latest mainline, and I couldn't find the exact same report on LKML
    or kerneloops, maybe it can be helpful for debugging the existing
    problem(s)?

    lockdep: fixing up alternatives.
    ------------[ cut here ]------------
    kernel BUG at kernel/sched.c:5858!
    invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Pid: 3934, comm: bash Not tainted (2.6.26-rc9-00057-g60d678c #3)
    EIP: 0060:[] EFLAGS: 00210046 CPU: 0
    EIP is at migration_call+0x495/0x4d0
    EAX: 00000000 EBX: c0803f00 ECX: f6bd0000 EDX: 017b0000
    ESI: e7d24fb0 EDI: c1fb3f00 EBP: f62e7e78 ESP: f62e7e48
    DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    Process bash (pid: 3934, ti=f62e6000 task=f60fbfc0 task.ti=f62e6000)
    Stack: 00000000 c06ddf70 f62e7e6c 00200246 c0803f00 00000001 c1fb3f00 f62e7e6c
    c0581fcf c074ec70 ffffffff 00000000 f62e7e98 c014d5a7 00000001 00000007
    c074ecf4 ffffffff 00000001 e7c86f90 f62e7eac c014d619 ffffffff 00000000
    Call Trace:
    [] ? preempt_schedule+0x3f/0x50
    [] ? notifier_call_chain+0x37/0x70
    [] ? __raw_notifier_call_chain+0x19/0x20
    [] ? raw_notifier_call_chain+0x1a/0x20
    [] ? _cpu_down+0x148/0x240
    [] ? cpu_maps_update_begin+0xf/0x20
    [] ? cpu_down+0x2b/0x40
    [] ? store_online+0x39/0x80
    [] ? store_online+0x0/0x80
    [] ? sysdev_store+0x2b/0x40
    [] ? sysfs_write_file+0xa2/0x100
    [] ? vfs_write+0x96/0x130
    [] ? sysfs_write_file+0x0/0x100
    [] ? sys_write+0x3d/0x70
    [] ? sysenter_past_esp+0x78/0xd1
    =======================
    Code: 45 e8 e8 2f 53 00 00 b8 01 00 00 00 e9 a2 fb ff ff bb 60 36 59 c0 eb 02 8b
    1b 89 f8 ff 53 18 85 c0 89 c6 74 f3 90 e9 89 fe ff ff <0f> 0b eb fe 8d b4 26 00
    00 00 00 e8 8b 82 bd ff 89 f0 50 9d 0f
    EIP: [] migration_call+0x495/0x4d0 SS:ESP 0068:f62e7e48


    Oh, I just saw

    commit dc7fab8b3bb388c57c6c4a43ba68c8a32ca25204
    Author: Dmitry Adamushko
    Date: Thu Jul 10 00:32:40 2008 +0200

    sched: fix cpu hotplug

    will apply and retry. Is this likely to fix the oops I saw, though?


    Vegard


    $ grep SCHED .config
    CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
    CONFIG_GROUP_SCHED=y
    CONFIG_FAIR_GROUP_SCHED=y
    CONFIG_RT_GROUP_SCHED=y
    CONFIG_USER_SCHED=y
    # CONFIG_CGROUP_SCHED is not set
    CONFIG_IOSCHED_NOOP=y
    CONFIG_IOSCHED_AS=y
    CONFIG_IOSCHED_DEADLINE=y
    CONFIG_IOSCHED_CFQ=y
    CONFIG_DEFAULT_IOSCHED="cfq"
    CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER=y
    CONFIG_SCHED_SMT=y
    CONFIG_SCHED_MC=y
    CONFIG_SCHED_HRTICK=y
    # CONFIG_NET_SCHED is not set
    # CONFIG_USB_EHCI_TT_NEWSCHED is not set
    CONFIG_SCHED_DEBUG=y
    CONFIG_SCHEDSTATS=y
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

    On Thu, Jul 10, 2008 at 1:59 PM, Vegard Nossum wrote:
    > [...]
    >
    > Oh, I just saw
    >
    > commit dc7fab8b3bb388c57c6c4a43ba68c8a32ca25204
    > Author: Dmitry Adamushko
    > Date: Thu Jul 10 00:32:40 2008 +0200
    >
    > sched: fix cpu hotplug
    >
    > will apply and retry. Is this likely to fix the oops I saw, though?


    Nope, I get the same thing (just 2 lines offset):

    ------------[ cut here ]------------
    kernel BUG at kernel/sched.c:5860!
    invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Pid: 3879, comm: bash Not tainted (2.6.26-rc9-00058-g2515e04 #4)
    EIP: 0060:[] EFLAGS: 00210046 CPU: 0
    EIP is at migration_call+0x495/0x4d0
    EAX: 00000000 EBX: c0593600 ECX: c1f65e80 EDX: 017b0000
    ESI: f6d08000 EDI: c1fb3f00 EBP: ccdf3e78 ESP: ccdf3e48
    DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    Process bash (pid: 3879, ti=ccdf2000 task=e7bb6f90 task.ti=ccdf2000)
    Stack: 00000000 c06ddf70 ccdf3e6c 00200246 c0803f00 00000001 c1fb3f00 ccdf3e6c
    c0581fdf c074ec70 ffffffff 00000000 ccdf3e98 c014d5b7 00000001 00000007
    c074ecf4 ffffffff 00000001 e7beafd0 ccdf3eac c014d629 ffffffff 00000000
    Call Trace:
    [] ? preempt_schedule+0x3f/0x50
    [] ? notifier_call_chain+0x37/0x70
    [] ? __raw_notifier_call_chain+0x19/0x20
    [] ? raw_notifier_call_chain+0x1a/0x20
    [] ? _cpu_down+0x148/0x240
    [] ? cpu_maps_update_begin+0xf/0x20
    [] ? cpu_down+0x2b/0x40
    [] ? store_online+0x39/0x80
    [] ? store_online+0x0/0x80
    [] ? sysdev_store+0x2b/0x40
    [] ? sysfs_write_file+0xa2/0x100
    [] ? vfs_write+0x96/0x130
    [] ? sysfs_write_file+0x0/0x100
    [] ? sys_write+0x3d/0x70
    [] ? sysenter_past_esp+0x78/0xd1
    =======================
    Code: 45 e8 e8 2f 53 00 00 b8 01 00 00 00 e9 a2 fb ff ff bb 60 36 59
    c0 eb 02 8b 1b 89 f8 ff 53 18 85 c0 89 c6 74 f3 90 e9 89 fe ff ff <0f>
    0b eb fe 8d b4 26 00 00 00 00 e8 8b 82 bd ff 89 f0 50 9d 0f


    Vegard

    --
    "The animistic metaphor of the bug that maliciously sneaked in while
    the programmer was not looking is intellectually dishonest as it
    disguises that the error is the programmer's own creation."
    -- E. W. Dijkstra, EWD1036

  3. Re: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

    2008/7/10 Vegard Nossum :
    > On Thu, Jul 10, 2008 at 1:59 PM, Vegard Nossum wrote:
    >> [...]
    >>
    >> Oh, I just saw
    >>
    >> commit dc7fab8b3bb388c57c6c4a43ba68c8a32ca25204
    >> Author: Dmitry Adamushko
    >> Date: Thu Jul 10 00:32:40 2008 +0200
    >>
    >> sched: fix cpu hotplug
    >>
    >> will apply and retry. Is this likely to fix the oops I saw, though?

    >
    > Nope, I get the same thing (just 2 lines offset):


    No, it shouldn't fix this particular problem.

    Does the patch from Miao Xie, available via the link below, make this
    problem disappear? Both bugs are likely to have the same cause.

    http://lkml.org/lkml/2008/7/7/75


    --
    Best regards,
    Dmitry Adamushko

  4. Re: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

    On Thu, Jul 10, 2008 at 2:50 PM, Dmitry Adamushko
    wrote:
    > 2008/7/10 Vegard Nossum :
    >> On Thu, Jul 10, 2008 at 1:59 PM, Vegard Nossum wrote:
    >>> Hi,
    >>>
    >>> Looks like CPU hotplug still has some problems. Just got this on
    >>> latest mainline, and I couldn't find the exact same report on LKML
    >>> or kerneloops, maybe it can be helpful for debugging the existing
    >>> problem(s)?
    >>>


    [...]

    > Does the patch from Miao Xie, available via the link below, make this
    > problem disappear? Both bugs are likely to have the same cause.
    >
    > http://lkml.org/lkml/2008/7/7/75


    Yep, it does, nice, thanks!

    Will these two patches (yours and Miao Xie's) be queued for 2.6.26 already?


    Vegard


  5. Re: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

    On Thu, Jul 10, 2008 at 3:04 PM, Vegard Nossum wrote:
    >> Does the patch from Miao Xie, available via the link below, make this
    >> problem disappear? Both bugs are likely to have the same cause.
    >>
    >> http://lkml.org/lkml/2008/7/7/75

    >
    > Yep, it does, nice, thanks!


    I got something else now though:

    Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Pid: 19762, comm: grep Not tainted (2.6.26-rc9-00059-gb190333 #5)
    EIP: 0060:[] EFLAGS: 00210203 CPU: 0
    EIP is at kmem_cache_alloc+0xc7/0xe0
    EAX: 00000000 EBX: f6c3d0f0 ECX: 1adabf16 EDX: 6b6b6b6b
    ESI: 00200282 EDI: f6c44000 EBP: e7c2befc ESP: e7c2bedc
    DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    Process grep (pid: 19762, ti=e7c2a000 task=f7875fa0 task.ti=e7c2a000)
    Stack: c02fabd5 f6ca2f20 c02fabd5 000080d0 c075a720 c06f8a4c f78a1000 c02fa7a0
    e7c2bf28 c02fabd5 00000001 c1c987ec 00000000 f5b55000 f6df5f58 f72ea548
    c06f8a4c fffffffb c02fab70 e7c2bf3c c02fa9b6 f6df5f20 c06f8a60 f727df78
    Call Trace:
    [] ? show_uevent+0x65/0xe0
    [] ? show_uevent+0x65/0xe0
    [] ? dev_uevent+0x0/0x1f0
    [] ? show_uevent+0x65/0xe0
    [] ? show_uevent+0x0/0xe0
    [] ? dev_attr_show+0x26/0x50
    [] ? sysfs_read_file+0x7c/0x110
    [] ? putname+0x25/0x40
    [] ? vfs_read+0x94/0x130
    [] ? sysfs_read_file+0x0/0x110
    [] ? sys_read+0x3d/0x70
    [] ? sysenter_past_esp+0x78/0xd1
    =======================
    Code: b9 ff ff ff ff 8b 55 ec 89 7c 24 04 89 04 24 8b 45 f0 e8 4d f6 ff ff 89 c3
    eb a0 85 db 74 bc 8b 57 10 31 c0 89 df 89 d1 c1 e9 02 ab f6 c2 02 74 02 66
    ab f6 c2 01 74 01 aa eb 9f 90 8d b4 26
    EIP: [] kmem_cache_alloc+0xc7/0xe0 SS:ESP 0068:e7c2bedc
    ---[ end trace 348b87fe341cfd2d ]---
    lockdep: fixing up alternatives.
    SMP alternatives: switching to SMP code
    Booting processor 1/1 ip 6000
    Initializing CPU#1
    list_add corruption. prev->next should be next (c0859130), but was 00000000. (prev=f6c41f34).
    ------------[ cut here ]------------
    kernel BUG at lib/list_debug.c:33!
    invalid opcode: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC
    Pid: 9, comm: events/0 Tainted: G D (2.6.26-rc9-00059-gb190333 #5)
    EIP: 0060:[] EFLAGS: 00010082 CPU: 0
    EIP is at __list_add+0x5c/0x60
    EAX: 00000061 EBX: f6c41f34 ECX: f7894000 EDX: 00000002
    ESI: 00000000 EDI: c0858b80 EBP: f7895ed8 ESP: f7895ec0
    DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    Process events/0 (pid: 9, ti=f7894000 task=f7898ff0 task.ti=f7894000)
    Stack: c067fd80 c0859130 00000000 f6c41f34 0000c3b1 f6e90154 f7895ee8 c013e276
    c0858b80 f6e90154 f7895f08 c013ec41 0000c3b1 00000000 00000096 00000000
    f6e90154 ffffffff f7895f20 c0145d6e ffffffff f6e90134 00000088 00000078
    Call Trace:
    [] ? internal_add_timer+0x36/0xb0
    [] ? __mod_timer+0x91/0xe0
    [] ? queue_delayed_work_on+0x8e/0xd0
    [] ? queue_delayed_work+0x22/0x30
    [] ? schedule_delayed_work+0x11/0x20
    [] ? flush_to_ldisc+0x18c/0x1b0
    [] ? run_workqueue+0x15b/0x1f0
    [] ? run_workqueue+0x107/0x1f0
    [] ? flush_to_ldisc+0x0/0x1b0
    [] ? worker_thread+0x99/0xf0
    [] ? autoremove_wake_function+0x0/0x50
    [] ? worker_thread+0x0/0xf0
    [] ? kthread+0x42/0x70
    [] ? kthread+0x0/0x70
    [] ? kernel_thread_helper+0x7/0x14
    =======================
    Code: 5c 24 04 c7 04 24 30 fd 67 c0 e8 80 0c ea ff 0f 0b eb fe 89 5c 24 0c 89 74
    24 08 89 4c 24 04 c7 04 24 80 fd 67 c0 e8 64 0c ea ff <0f> 0b eb fe 8b 0a 55 89
    e5 e8 96 ff ff ff 5d c3 90 90 90 90 55
    EIP: [] __list_add+0x5c/0x60 SS:ESP 0068:f7895ec0
    ---[ end trace 348b87fe341cfd2d ]---
    note: events/0[9] exited with preempt_count 2

    ...and then it died.

    This happened just as I killed syslog, does that seem related somehow?


    Vegard


  6. Re: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

    On Thu, Jul 10, 2008 at 3:17 PM, Vegard Nossum wrote:
    > On Thu, Jul 10, 2008 at 3:04 PM, Vegard Nossum wrote:
    >>> Does the patch from Miao Xie, available via the link below, make this
    >>> problem disappear? Both bugs are likely to have the same cause.
    >>>
    >>> http://lkml.org/lkml/2008/7/7/75

    >>
    >> Yep, it does, nice, thanks!

    >
    > I got something else now though:
    >
    > Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    > Pid: 19762, comm: grep Not tainted (2.6.26-rc9-00059-gb190333 #5)
    > EIP: 0060:[] EFLAGS: 00210203 CPU: 0
    > EIP is at kmem_cache_alloc+0xc7/0xe0


    $ addr2line -e vmlinux -i c01991c7
    include/asm/string_32.h:183
    mm/slub.c:1646

    For reference, the disassembly:
    c01991c7: f3 ab rep stos %eax,%es:(%edi)

    and registers in question:

    ESI: 00200282 EDI: f6c44000 EBP: e7c2befc ESP: e7c2bedc
    DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068


    ....though yet another run gets me this:

    BUG: unable to handle kernel NULL pointer dereference at 00000000
    IP: [<00000000>]
    *pde = 00000000
    Oops: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC
    Pid: 3883, comm: cat Tainted: G D (2.6.26-rc9-00059-gb190333 #5)
    EIP: 0060:[<00000000>] EFLAGS: 00210286 CPU: 0
    EIP is at 0x0
    EAX: f7890400 EBX: 00000000 ECX: 00000102 EDX: 0175d000
    ESI: f7890400 EDI: c07feda0 EBP: f0461e6c ESP: f0461e58
    DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    Process cat (pid: 3883, ti=f0460000 task=f1891fe0 task.ti=f0460000)
    Stack: c0172d0c 00200296 c0752b40 00000001 0000000a f0461e88 c0139e03 c0803d00
    c0803d00 c0803d00 00200046 c0298720 f0461e98 c0139f25 c0803d00 00000000
    f0461ea4 c013a0c5 c1f12140 f0461ebc c011a4a8 c1f60f00 f0461f0c 00000000
    Call Trace:
    [] ? rcu_process_callbacks+0x6c/0xb0
    [] ? __do_softirq+0x83/0x100
    [] ? __rdmsr_safe_on_cpu+0x0/0x60
    [] ? do_softirq+0xa5/0xb0
    [] ? irq_exit+0x95/0xa0
    [] ? smp_apic_timer_interrupt+0x58/0x90
    [] ? apic_timer_interrupt+0x33/0x38
    [] ? __rdmsr_safe_on_cpu+0x0/0x60
    [] ? smp_call_function_single+0x91/0xc0
    [] ? _rdmsr_on_cpu+0x2f/0x70
    [] ? rdmsr_safe_on_cpu+0x1a/0x20
    [] ? msr_read+0x6e/0xa0
    [] ? vfs_read+0x94/0x130
    [] ? msr_read+0x0/0xa0
    [] ? sys_read+0x3d/0x70
    [] ? sysenter_past_esp+0x78/0xd1
    =======================
    Code: Bad EIP value.
    EIP: [<00000000>] 0x0 SS:ESP 0068:f0461e58
    Kernel panic - not syncing: Fatal exception in interrupt


    :-)

    Vegard


  7. Re: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

    On Thu, Jul 10, 2008 at 3:33 PM, Vegard Nossum wrote:
    >> Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    >> Pid: 19762, comm: grep Not tainted (2.6.26-rc9-00059-gb190333 #5)
    >> EIP: 0060:[] EFLAGS: 00210203 CPU: 0
    >> EIP is at kmem_cache_alloc+0xc7/0xe0

    >
    > $ addr2line -e vmlinux -i c01991c7
    > include/asm/string_32.h:183
    > mm/slub.c:1646
    >
    > For reference, the disassembly:
    > c01991c7: f3 ab rep stos %eax,%es:(%edi)
    >
    > and registers in question:
    >
    > ESI: 00200282 EDI: f6c44000 EBP: e7c2befc ESP: e7c2bedc
    > DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068


    Got a similar one (same EIP), but this time with the "top" of the oops as well:

    BUG: unable to handle kernel paging request at e82be000
    IP: [] kmem_cache_alloc+0xc7/0xe0
    *pde = 35657163 *pte = 282be160
    Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Pid: 3879, comm: grep Not tainted (2.6.26-rc9-00059-gb190333 #5)
    EIP: 0060:[] EFLAGS: 00210203 CPU: 0
    EIP is at kmem_cache_alloc+0xc7/0xe0
    EAX: 00000000 EBX: e82bdf00 ECX: 1adada9a EDX: 6b6b6b6b
    ESI: 00200282 EDI: e82be000 EBP: da711e74 ESP: da711e54
    DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    Process grep (pid: 3879, ti=da710000 task=da7f6f90 task.ti=da710000)
    Stack: c019fa36 f7a9cf20 c019fa36 000080d0 f7811c30 00000003 da711f04 ffffffe9
    da711e8c c019fa36 c092e478 000001f6 00000003 da711f04 da711eac c01a8e44
    00001000 e802a100 ffffff9c da711f04 00018801 00000004 da711ec4 c01a8f31
    Call Trace:
    [] ? get_empty_filp+0x36/0x140
    [] ? get_empty_filp+0x36/0x140
    [] ? get_empty_filp+0x36/0x140
    [] ? __path_lookup_intent_open+0x24/0x90
    [] ? path_lookup_open+0x21/0x30
    [] ? do_filp_open+0x93/0x710
    [] ? get_unused_fd_flags+0xc8/0xf0
    [] ? _spin_unlock+0x27/0x50
    [] ? get_unused_fd_flags+0xc8/0xf0
    [] ? do_sys_open+0x49/0xe0
    [] ? fput+0x19/0x20
    [] ? sys_open+0x29/0x40
    [] ? sysenter_past_esp+0x78/0xd1
    =======================
    Code: b9 ff ff ff ff 8b 55 ec 89 7c 24 04 89 04 24 8b 45 f0 e8 4d f6 ff ff 89 c3
    eb a0 85 db 74 bc 8b 57 10 31 c0 89 df 89 d1 c1 e9 02 ab f6 c2 02 74 02 66
    ab f6 c2 01 74 01 aa eb 9f 90 8d b4 26
    EIP: [] kmem_cache_alloc+0xc7/0xe0 SS:ESP 0068:da711e54
    ---[ end trace f3a251e0ce11b6fd ]---

    This seems to be grep both times, the command I used is:

    $ grep -r . /sys
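
In the page-fault reports, the four hex digits after `Oops:` are the x86 page-fault error code. As a minimal sketch (the helper name `decode_page_fault_code` is hypothetical and not part of any kernel tool), its low three bits can be decoded like this:

```python
def decode_page_fault_code(code: int) -> dict:
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "cause": "protection violation" if code & 0x1 else "page not present",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if code & 0x2 else "read",
        # bit 2: 0 = fault taken in kernel mode, 1 = in user mode
        "mode": "user" if code & 0x4 else "kernel",
    }


# "Oops: 0002" from the kmem_cache_alloc reports.
print(decode_page_fault_code(0x0002))
# "Oops: 0000" from the NULL pointer dereference report.
print(decode_page_fault_code(0x0000))
```

Read this way, `0002` is a kernel-mode write to a page that is not mapped. That would be consistent with DEBUG_PAGEALLOC unmapping freed pages, although the oops itself does not say why the page was unmapped.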


    Vegard


  8. Re: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

    2008/7/10 Vegard Nossum :
    > On Thu, Jul 10, 2008 at 2:50 PM, Dmitry Adamushko
    > wrote:
    >> 2008/7/10 Vegard Nossum :
    >>> On Thu, Jul 10, 2008 at 1:59 PM, Vegard Nossum wrote:
    >>>> Hi,
    >>>>
    >>>> Looks like CPU hotplug still has some problems. Just got this on
    >>>> latest mainline, and I couldn't find the exact same report on LKML
    >>>> or kerneloops, maybe it can be helpful for debugging the existing
    >>>> problem(s)?
    >>>>

    >
    > [...]
    >
    >> Does the patch from Miao Xie, available via the link below, make this
    >> problem disappear? Both bugs are likely to have the same cause.
    >>
    >> http://lkml.org/lkml/2008/7/7/75

    >
    > Yep, it does, nice, thanks!
    >
    > Will these two patches (yours and Miao Xie's) be queued for 2.6.26 already?


    Miao Xie's patch addresses the problem by fixing its consequence rather
    than its cause. In general, an 'offline' cpu must not be visible to any
    kind of load-balancing at this point. So I'd rather understand and
    address the root cause of this misbehavior.

    Regarding the new crashes: do you get them

    (1) after a few cpu offline/online cycles?
    (2) on a freshly booted system?
    (3) (1) or (2) but only with Miao Xie's patch (should not be (2) then)
    (4) something else?

    TIA,


    --
    Best regards,
    Dmitry Adamushko

  9. Re: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

    On Thu, Jul 10, 2008 at 4:03 PM, Dmitry Adamushko
    wrote:
    > Miao Xie's patch addresses the problem by fixing its consequence rather
    > than its cause. In general, an 'offline' cpu must not be visible to any
    > kind of load-balancing at this point. So I'd rather understand and
    > address the root cause of this misbehavior.
    >


    Oh, right. I fully agree.

    > Regarding the new crashes: do you get them
    >
    > (1) after a few cpu offline/online cycles?
    > (2) on a freshly booted system?
    > (3) (1) or (2) but only with Miao Xie's patch (should not be (2) then)
    > (4) something else?


    Without Miao Xie's patch, I regularly get a crash on the first cpu-up.
    So I am using it all the time. With this patch applied, the new
    crashes happen anywhere between 2 and 20 minutes into running
    a few different looping scripts simultaneously:

    1. cpu up/down
    2. grep -r . /sys
    3. swapon/swapoff
    4. cat /dev/cpu/*/msr
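
The first loop in the list above can be sketched as follows. This is a minimal sketch, not the script actually used in these tests: the function name `toggle_cpu`, the choice of `cpu1`, and the delay parameter are assumptions. The sysfs `online` attribute is the file whose writes show up as the `store_online` frames in the earlier backtraces; writing to it requires root.

```python
import time
from pathlib import Path

# Assumed target: the hotplug control file of CPU 1.
ONLINE_FILE = Path("/sys/devices/system/cpu/cpu1/online")


def toggle_cpu(online_file: Path, rounds: int, delay: float = 0.0) -> int:
    """Offline and re-online a CPU `rounds` times; return completed rounds."""
    done = 0
    for _ in range(rounds):
        online_file.write_text("0")  # take the CPU down
        time.sleep(delay)
        online_file.write_text("1")  # bring it back up
        time.sleep(delay)
        done += 1
    return done
```

Run as root, a call such as `toggle_cpu(ONLINE_FILE, rounds=500, delay=0.1)` would exercise the cpu up/down path while the other three loops run in separate shells.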

    I also just got a bootup which had some more info just after a
    kmem_cache_alloc failure:

    BUG: unable to handle kernel paging request at da87d000
    IP: [] kmem_cache_alloc+0xc7/0xe0
    *pde = 28180163 *pte = 1a87d160
    Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Pid: 3850, comm: grep Not tainted (2.6.26-rc9-00059-gb190333 #5)
    EIP: 0060:[] EFLAGS: 00210203 CPU: 0
    EIP is at kmem_cache_alloc+0xc7/0xe0
    EAX: 00000000 EBX: da87c100 ECX: 1adad71a EDX: 6b6b6b6b
    ESI: 00200282 EDI: da87d000 EBP: f60bfe74 ESP: f60bfe54
    DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    Process grep (pid: 3850, ti=f60be000 task=e8318ff0 task.ti=f60be000)
    Stack: c019fa36 f739ff20 c019fa36 000080d0 f7811c30 00000001 f60bff04 ffffffe9
    f60bfe8c c019fa36 c09382b0 000001f7 00000001 f60bff04 f60bfeac c01a8e44
    00001000 da9a0000 ffffff9c f60bff04 00008001 00000004 f60bfec4 c01a8f31
    Call Trace:
    [] ? get_empty_filp+0x36/0x140
    [] ? get_empty_filp+0x36/0x140
    [] ? get_empty_filp+0x36/0x140
    [] ? __path_lookup_intent_open+0x24/0x90
    [] ? path_lookup_open+0x21/0x30
    [] ? do_filp_open+0x93/0x710
    [] ? get_unused_fd_flags+0xc8/0xf0
    [] ? _spin_unlock+0x27/0x50
    [] ? get_unused_fd_flags+0xc8/0xf0
    [] ? do_sys_open+0x49/0xe0
    [] ? sys_open+0x29/0x40
    [] ? sysenter_past_esp+0x78/0xd1
    =======================
    Code: b9 ff ff ff ff 8b 55 ec 89 7c 24 04 89 04 24 8b 45 f0 e8 4d f6
    ff ff 89 c3 eb a0 85 db 74 bc 8b 57 10 31 c0 89 df 89 d1 c1 e9 02
    ab f6 c2 02 74 02 66 ab f6 c2 01 74 01 aa eb 9f 90 8d b4 26
    EIP: [] kmem_cache_alloc+0xc7/0xe0 SS:ESP 0068:f60bfe54
    ---[ end trace 0647416e896c69d3 ]---
    Adding 2031608k swap on /dev/mapper/VolGroup00-LogVol01.
    Priority:-10871 extents:1 across:2031608k
    lockdep: fixing up alternatives.
    SMP alternatives: switching to SMP code
    =============================================================================
    BUG filp: Redzone overwritten
    -----------------------------------------------------------------------------
    INFO: 0xda87ceb8-0xda87cebb. First byte 0x0 instead of 0xbb
    INFO: Slab 0xc1639d10 objects=16 used=6 fp=0xda87ce00 flags=0x400000c3
    INFO: Object 0xda87ce00 @offset=3584 fp=0x00000000
    Bytes b4 0xda87cdf0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 .................
    Object 0xda87ce00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 .................
    Object 0xda87ce10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 .................
    Object 0xda87ce20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 .................
    Object 0xda87ce30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 .................
    Object 0xda87ce40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 .................
    Object 0xda87ce50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 .................
    Object 0xda87ce60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 .................
    Object 0xda87ce70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 .................
    Redzone 0xda87ceb8: 00 00 00 00 .....
    Padding 0xda87cee0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 .................
    Padding 0xda87cef0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 .................
    Pid: 18498, comm: swapon Tainted: G D 2.6.26-rc9-00059-gb190333 #5
    [] print_trailer+0xaa/0xe0
    [] check_bytes_and_report+0x9b/0xc0
    [] check_object+0x53/0x1f0
    [] __slab_alloc+0x44b/0x5d0
    [] kmem_cache_alloc+0xb3/0xe0
    [] ? get_empty_filp+0x36/0x140
    [] ? get_empty_filp+0x36/0x140
    [] get_empty_filp+0x36/0x140
    [] __path_lookup_intent_open+0x24/0x90
    [] path_lookup_open+0x21/0x30
    [] do_filp_open+0x93/0x710
    [] ? check_object+0xdf/0x1f0
    [] ? getname+0x25/0xe0
    [] ? get_unused_fd_flags+0x26/0xf0
    [] ? get_unused_fd_flags+0xc8/0xf0
    [] ? _spin_unlock+0x27/0x50
    [] ? get_unused_fd_flags+0xc8/0xf0
    [] do_sys_open+0x49/0xe0
    [] sys_open+0x29/0x40
    [] sysenter_past_esp+0x78/0xd1
    =======================
    FIX filp: Restoring 0xda87ceb8-0xda87cebb=0xbb
    FIX filp: Marking all objects used

    ...and then we had another NULL ptr in the rcu callbacks. Looks like
    something is putting zeroes all over the place where it shouldn't be
    allowed to?
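
One way to check the "stray zeroes" idea against the register dumps: the faulting instruction in all three kmem_cache_alloc oopses is a `rep stos` with EAX = 0, which is a memset to zero. The sketch below assumes (this is a reading of the `Code:` bytes, not something stated in the thread) that EDX holds the requested length in bytes, EBX the start of the object, EDI the current store address, and ECX the number of dwords still to be written. The helper name `memset_progress` is hypothetical.

```python
POISON_FREE = 0x6B  # byte pattern the slab debug code writes into freed objects

# (EBX, ECX, EDX, EDI) copied from the three kmem_cache_alloc oopses.
OOPSES = [
    (0xF6C3D0F0, 0x1ADABF16, 0x6B6B6B6B, 0xF6C44000),
    (0xE82BDF00, 0x1ADADA9A, 0x6B6B6B6B, 0xE82BE000),
    (0xDA87C100, 0x1ADAD71A, 0x6B6B6B6B, 0xDA87D000),
]


def memset_progress(ebx, ecx, edx, edi):
    """Return (dwords requested, dwords written, consistent?).

    Assumes EDX is the byte length, EBX the start address, EDI the
    current store address, and ECX the dwords still to be written.
    """
    requested = edx >> 2          # rep stos counts 4-byte stores
    written = (edi - ebx) // 4    # how far the store pointer has moved
    return requested, written, requested - ecx == written


for regs in OOPSES:
    requested, written, ok = memset_progress(*regs)
    print(f"requested={requested:#x} written={written:#x} consistent={ok}")
```

Under those assumptions the numbers agree in all three reports, and the requested length is 0x6b6b6b6b, i.e. the free-poison byte 0x6b repeated four times. If that reading is right, the zeroing ran far past the object until it reached an unmapped page, and the length itself was read from memory that had already been freed. Where that length was read from is not established here.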

    I'll try stopping the cpu up/down script and see if it makes a difference.


    Vegard


  10. Re: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

    On Thu, Jul 10, 2008 at 4:16 PM, Vegard Nossum wrote:
    >> Regarding the new crashes: do you get them
    >>
    >> (1) after a few cpu offline/online cycles?
    >> (2) on a freshly booted system?
    >> (3) (1) or (2) but only with Miao Xie's patch (should not be (2) then)
    >> (4) something else?

    >
    > Without Miao Xie's patch, I regularly get a crash on the first cpu-up.
    > So I am using it all the time. With this patch applied, the new
    > crashes happen anywhere between 2 and 20 minutes into running
    > a few different looping scripts simultaneously:
    >
    > 1. cpu up/down
    > 2. grep -r . /sys
    > 3. swapon/swapoff
    > 4. cat /dev/cpu/*/msr


    Inhibiting #1 kept the machine alive for at least 25 minutes. Then I
    started it and it hung after 492 rounds of cpu up/down, with this new
    report:

    list_add corruption. next->prev should be prev (f782d090), but was
    00000000. (next=f20b8438).
    ------------[ cut here ]------------
    kernel BUG at lib/list_debug.c:27!
    invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Pid: 3860, comm: bash Not tainted (2.6.26-rc9-00059-gb190333 #5)
    EIP: 0060:[] EFLAGS: 00210086 CPU: 0
    EIP is at __list_add+0x40/0x60
    EAX: 00000061 EBX: f782d090 ECX: 00000002 EDX: 00000002
    ESI: 00200282 EDI: c0a8de8c EBP: e7dd3e84 ESP: e7dd3e6c
    DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    Process bash (pid: 3860, ti=e7dd2000 task=e7e0afd0 task.ti=e7dd2000)
    Stack: c067fd30 f782d090 00000000 f20b8438 f20b81b0 00200282 e7dd3e8c c0294baa
    e7dd3e98 c019bf5e f782d070 e7dd3ec8 c019c4b1 c0a8deac 000000d0 e7de2a00
    c1da1ddc 00000005 00000000 f20b81b0 fff95000 fff98000 fff96000 e7dd3ed4
    Call Trace:
    [] ? list_add+0xa/0x10
    [] ? __mem_cgroup_add_list+0x3e/0x40
    [] ? mem_cgroup_charge_common+0x231/0x260
    [] ? mem_cgroup_charge+0x12/0x20
    [] ? do_wp_page+0x117/0x550
    [] ? handle_mm_fault+0x1b1/0x770
    [] ? handle_mm_fault+0x3e1/0x770
    [] ? down_read_trylock+0x55/0x60
    [] ? do_page_fault+0x298/0x700
    [] ? _spin_unlock_irq+0x36/0x60
    [] ? sigprocmask+0x7b/0xf0
    [] ? restore_nocheck+0x12/0x15
    [] ? do_page_fault+0x0/0x700
    [] ? error_code+0x72/0x78
    =======================
    Code: 75 2d 89 08 89 41 04 89 02 89 50 04 83 c4 10 5b 5e 5d c3 89 4c
    24 0c 89 54 24 08 89 5c 24 04 c7 04 24 30 fd 67 c0 e8 80 0c ea ff
    <0f> 0b eb fe 89 5c 24 0c 89 74 24 08 89 4c 24 04 c7 04 24 80 fd
    EIP: [] __list_add+0x40/0x60 SS:ESP 0068:e7dd3e6c
    ---[ end trace 89a65901b268513f ]---

    This list corruption has a completely different backtrace from the
    earlier one, but in both cases the bad value was 0 instead of the
    expected pointer. This fits the theory that something is being zeroed
    that shouldn't be.
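    For reference, the check that fired here is a neighbour-consistency
    test done before linking a new entry. What follows is a minimal
    user-space sketch of that idea, not the kernel's lib/list_debug.c:
    the name checked_list_add() and its return convention are made up for
    illustration, and the kernel BUG()s where this sketch returns -1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct list_head {
    struct list_head *next, *prev;
};

/*
 * Link 'new' between 'prev' and 'next', but only if the two neighbours
 * still point at each other. Returns 0 on success and -1 if the list
 * looks corrupted (hypothetical convention for this sketch).
 */
static int checked_list_add(struct list_head *new,
                            struct list_head *prev,
                            struct list_head *next)
{
    assert(new != NULL && prev != NULL && next != NULL);

    if (next->prev != prev) {
        fprintf(stderr, "list_add corruption. next->prev should be prev "
                "(%p), but was %p. (next=%p).\n",
                (void *)prev, (void *)next->prev, (void *)next);
        return -1;
    }
    if (prev->next != next) {
        fprintf(stderr, "list_add corruption. prev->next should be next "
                "(%p), but was %p. (prev=%p).\n",
                (void *)next, (void *)prev->next, (void *)prev);
        return -1;
    }

    /* Neighbours are consistent: do the usual four pointer updates. */
    next->prev = new;
    new->next = next;
    new->prev = prev;
    prev->next = new;
    return 0;
}
```

    A neighbour whose prev pointer has been zeroed by an unrelated write
    fails the first test, which matches the "but was 00000000" in the
    report above.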


    Vegard


  11. Re: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

    Okay, some more info on this one...

    On Thu, Jul 10, 2008 at 4:16 PM, Vegard Nossum wrote:
    > BUG: unable to handle kernel paging request at da87d000
    > IP: [] kmem_cache_alloc+0xc7/0xe0
    > *pde = 28180163 *pte = 1a87d160
    > Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    > Pid: 3850, comm: grep Not tainted (2.6.26-rc9-00059-gb190333 #5)
    > EIP: 0060:[] EFLAGS: 00210203 CPU: 0
    > EIP is at kmem_cache_alloc+0xc7/0xe0
    > EAX: 00000000 EBX: da87c100 ECX: 1adad71a EDX: 6b6b6b6b
    > ESI: 00200282 EDI: da87d000 EBP: f60bfe74 ESP: f60bfe54
    > DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068


    The register %ecx looks innocent but is very important here. The disassembly:

    mov %edx,%ecx
    shr $0x2,%ecx
    rep stos %eax,%es:(%edi) <-- the fault

    So %ecx has been loaded from %edx... which is 0x6b6b6b6b/POISON_FREE.
    (0x6b6b6b6b >> 2 == 0x1adadada.)

    %ecx is the counter for the memset, from here:

    memset(object, 0, c->objsize);

    i.e. %ecx was loaded from c->objsize, so "c" must have been freed.
    Where did "c" come from? Uh-oh...

    c = get_cpu_slab(s, smp_processor_id());

    This looks like it has very much to do with CPU hotplug/unplug. Is
    there a race between SLUB/hotplug since the CPU slab is used after it
    has been freed?
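    The register values can be cross-checked with a little arithmetic.
    This is only a sketch using the numbers from the oops above; that
    %ebx (da87c100) holds "object" is an assumption, the oops doesn't say
    so, and the helper names are made up. rep stos decrements %ecx as it
    goes, which would explain why the dump shows 1adad71a rather than
    1adadada.

```c
#include <assert.h>
#include <stdint.h>

/* Four POISON_FREE (0x6b) bytes read back as one 32-bit word. */
#define POISON_FREE_WORD UINT32_C(0x6b6b6b6b)

/* Dword count that "shr $0x2,%ecx" produces from a byte count. */
static uint32_t stos_count(uint32_t nbytes)
{
    return nbytes >> 2;
}

/*
 * Bytes already written by "rep stos", given the initial dword count and
 * the count still left in %ecx when the fault hit.
 */
static uint32_t bytes_stored(uint32_t initial, uint32_t remaining)
{
    assert(remaining <= initial);
    return (initial - remaining) * 4u;
}
```

    With the values from the dump, bytes_stored(0x1adadada, 0x1adad71a)
    is 0xf00, and 0xda87c100 + 0xf00 is 0xda87d000, i.e. exactly the
    faulting address in %edi. If the assumption about %ebx holds, the
    memset ran from the object to the end of its page and faulted on the
    first byte of the next one.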


    Vegard


  12. Re: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

    On Thursday, 10 of July 2008, Vegard Nossum wrote:
    > Okay, some more info on this one...
    >
    > On Thu, Jul 10, 2008 at 4:16 PM, Vegard Nossum wrote:
    > > BUG: unable to handle kernel paging request at da87d000
    > > IP: [] kmem_cache_alloc+0xc7/0xe0
    > > *pde = 28180163 *pte = 1a87d160
    > > Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    > > Pid: 3850, comm: grep Not tainted (2.6.26-rc9-00059-gb190333 #5)
    > > EIP: 0060:[] EFLAGS: 00210203 CPU: 0
    > > EIP is at kmem_cache_alloc+0xc7/0xe0
    > > EAX: 00000000 EBX: da87c100 ECX: 1adad71a EDX: 6b6b6b6b
    > > ESI: 00200282 EDI: da87d000 EBP: f60bfe74 ESP: f60bfe54
    > > DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068

    >
    > The register %ecx looks innocent but is very important here. The disassembly:
    >
    > mov %edx,%ecx
    > shr $0x2,%ecx
    > rep stos %eax,%es:(%edi) <-- the fault
    >
    > So %ecx has been loaded from %edx... which is 0x6b6b6b6b/POISON_FREE.
    > (0x6b6b6b6b >> 2 == 0x1adadada.)
    >
    > %ecx is the counter for the memset, from here:
    >
    > memset(object, 0, c->objsize);
    >
    > i.e. %ecx was loaded from c->objsize, so "c" must have been freed.
    > Where did "c" come from? Uh-oh...
    >
    > c = get_cpu_slab(s, smp_processor_id());
    >
    > This looks like it has very much to do with CPU hotplug/unplug. Is
    > there a race between SLUB/hotplug since the CPU slab is used after it
    > has been freed?


    I wonder if this is related to the fix at:
    http://git.kernel.org/?p=linux/kerne...a1dbde09e2feea

    Rafael

  13. Re: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

    2008/7/10 Vegard Nossum:
    > Okay, some more info on this one...
    >
    > On Thu, Jul 10, 2008 at 4:16 PM, Vegard Nossum wrote:
    >> BUG: unable to handle kernel paging request at da87d000
    >> IP: [] kmem_cache_alloc+0xc7/0xe0
    >> *pde = 28180163 *pte = 1a87d160
    >> Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    >> Pid: 3850, comm: grep Not tainted (2.6.26-rc9-00059-gb190333 #5)
    >> EIP: 0060:[] EFLAGS: 00210203 CPU: 0
    >> EIP is at kmem_cache_alloc+0xc7/0xe0
    >> EAX: 00000000 EBX: da87c100 ECX: 1adad71a EDX: 6b6b6b6b
    >> ESI: 00200282 EDI: da87d000 EBP: f60bfe74 ESP: f60bfe54
    >> DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068

    >
    > The register %ecx looks innocent but is very important here. The disassembly:
    >
    > mov %edx,%ecx
    > shr $0x2,%ecx
    > rep stos %eax,%es:(%edi) <-- the fault
    >
    > So %ecx has been loaded from %edx... which is 0x6b6b6b6b/POISON_FREE.
    > (0x6b6b6b6b >> 2 == 0x1adadada.)
    >
    > %ecx is the counter for the memset, from here:
    >
    > memset(object, 0, c->objsize);
    >
    > i.e. %ecx was loaded from c->objsize, so "c" must have been freed.
    > Where did "c" come from? Uh-oh...
    >
    > c = get_cpu_slab(s, smp_processor_id());
    >
    > This looks like it has very much to do with CPU hotplug/unplug. Is
    > there a race between SLUB/hotplug since the CPU slab is used after it
    > has been freed?


    Good analysis.

    [ quick look ]

    Yeah, it's possible that a caller of kmem_cache_alloc() ->
    slab_alloc() gets migrated to another CPU right after
    local_irq_restore() and before memset(). The initial CPU can go
    offline in the meantime (or the migration is itself a consequence of
    the CPU going offline), so its 'kmem_cache_cpu' structure gets freed
    (slab_cpuup_callback).

    At some point the caller continues on another CPU with a stale
    pointer...

    Does something like this help?

    diff --git a/mm/slub.c b/mm/slub.c
    index 1a427c0..315c392 100644
    --- a/mm/slub.c
    +++ b/mm/slub.c
    @@ -1628,9 +1628,11 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,
             void **object;
             struct kmem_cache_cpu *c;
             unsigned long flags;
    +        unsigned int objsize;

             local_irq_save(flags);
             c = get_cpu_slab(s, smp_processor_id());
    +        objsize = c->objsize;
             if (unlikely(!c->freelist || !node_match(c, node)))

                     object = __slab_alloc(s, gfpflags, node, addr, c);
    @@ -1643,7 +1645,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,
             local_irq_restore(flags);

             if (unlikely((gfpflags & __GFP_ZERO) && object))
    -                memset(object, 0, c->objsize);
    +                memset(object, 0, objsize);

             return object;
     }
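    The idea behind the patch can be tried out in user space. This is a
    rough model only, not SLUB code: struct cpu_slab, cpu_goes_offline()
    and the two size_read_*() helpers are hypothetical names, and the
    structure is overwritten with the poison byte in place instead of
    being freed, so that reading it afterwards stays well-defined.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define POISON_FREE 0x6b /* byte pattern written over freed objects */

/* Hypothetical stand-in for struct kmem_cache_cpu; one field is enough. */
struct cpu_slab {
    uint32_t objsize;
};

/*
 * Models the hotplug path getting rid of the per-CPU structure. The
 * memory is poisoned in place rather than freed (assumption made to keep
 * the sketch free of undefined behaviour).
 */
static void cpu_goes_offline(struct cpu_slab *c)
{
    memset(c, POISON_FREE, sizeof(*c));
}

/* Unpatched order: the size is read after the unprotected window. */
static uint32_t size_read_late(struct cpu_slab *c)
{
    assert(c != NULL);
    cpu_goes_offline(c); /* stands for: irqs back on, task migrated */
    return c->objsize;   /* stale read, sees the poison */
}

/* Patched order: snapshot the size while 'c' is still known to be valid. */
static uint32_t size_read_early(struct cpu_slab *c)
{
    uint32_t objsize;

    assert(c != NULL);
    objsize = c->objsize; /* taken inside the protected section */
    cpu_goes_offline(c);
    return objsize;
}
```

    Starting from an objsize of 64, size_read_early() still returns 64
    while size_read_late() returns 0x6b6b6b6b, the same value that showed
    up in %edx in the oops.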


    --
    Best regards,
    Dmitry Adamushko


  14. Re: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

    On Thu, Jul 10, 2008 at 10:20 PM, Rafael J. Wysocki wrote:
    >> This looks like it has very much to do with CPU hotplug/unplug. Is
    >> there a race between SLUB/hotplug since the CPU slab is used after it
    >> has been freed?

    >
    > I wonder if this is related to the fix at:
    > http://git.kernel.org/?p=linux/kerne...a1dbde09e2feea


    Hi,

    I thought it might as well, but in fact it happened regardless of this patch.


    Vegard


  15. Re: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

    On Thu, Jul 10, 2008 at 10:16 PM, Dmitry Adamushko wrote:
    > Yeah, it's possible that a caller of kmem_cache_alloc() ->
    > slab_alloc() gets migrated to another CPU right after
    > local_irq_restore() and before memset(). The initial CPU can go
    > offline in the meantime (or the migration is itself a consequence of
    > the CPU going offline), so its 'kmem_cache_cpu' structure gets freed
    > (slab_cpuup_callback).
    >
    > At some point the caller continues on another CPU with a stale
    > pointer...
    >
    > Does something like this help?


    Nice :-)

    By the way, this also explains the heavy corruption I was seeing (NULL
    pointers in lists detected by list debugging, etc.); SLUB was doing a
    HUGE memset of 0 on arbitrary memory, i.e. the memset effectively
    became:

    memset(object, 0, 0x6b6b6b6b);

    ...and in some of the cases, the machine didn't crash inside SLUB but
    proceeded...

    I guess I should reload and try the latest -git now :-)

    Thanks!


    Vegard


  16. Re: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

    On Fri, Jul 11, 2008 at 11:02 AM, Dmitry Adamushko wrote:
    > Vegard,
    >
    >
    > regarding the first crash. Would you please run your test with the
    > following debugging patch and let me know its output?
    >
    > The appearance of " * [ pid ] comm (name), orig_cpu() ... " means we
    > hit a problematic case (with Miao Xie's patch it shouldn't crash).
    >
    > I see that you have CONFIG_SCHED_DEBUG=y so I'm also interested in
    > messages from sched_domain_debug() - "CPU# attaching ...". IOW, all
    > the kernel messages appearing while a cpu is going down and up.


    I didn't look into it, but your patch gives the following warnings:

    kernel/sched.c: In function 'try_to_wake_up':
    kernel/sched.c:2080: warning: 'orig_cpu' may be used uninitialized in this function
    kernel/sched.c:2080: warning: 'cpu' may be used uninitialized in this function

    Running on qemu (sorry, real machine has to wait) gets me this:

    __migrate(6 -- migration/1) -> cpu (0) == ret (1)
    __migrate(7 -- ksoftirqd/1) -> cpu (0) == ret (1)
    __migrate(8 -- watchdog/1) -> cpu (0) == ret (1)
    __migrate(10 -- events/1) -> cpu (0) == ret (1)
    __migrate(104 -- kblockd/1) -> cpu (0) == ret (1)
    __migrate(111 -- ata/1) -> cpu (0) == ret (1)
    __migrate(118 -- khubd) -> cpu (0) == ret (1)
    __migrate(178 -- pdflush) -> cpu (0) == ret (1)
    __migrate(262 -- aio/1) -> cpu (0) == ret (1)
    __migrate(278 -- nfsiod) -> cpu (0) == ret (1)
    __migrate(994 -- khpsbpkt) -> cpu (0) == ret (1)
    __migrate(1023 -- kpsmoused) -> cpu (0) == ret (1)
    __migrate(1032 -- kstriped) -> cpu (0) == ret (1)
    __migrate(1036 -- kondemand/1) -> cpu (0) == ret (1)
    __migrate(1061 -- rpciod/1) -> cpu (0) == ret (1)
    __migrate(1209 -- udevd) -> cpu (0) == ret (1)
    __migrate(2793 -- getty) -> cpu (0) == ret (1)
    __migrate(2827 -- syslogd) -> cpu (0) == ret (1)
    __migrate(2849 -- dd) -> cpu (0) == ret (1)
    __migrate(2852 -- klogd) -> cpu (0) == ret (1)
    __migrate(2877 -- named) -> cpu (0) == ret (1)
    __migrate(2878 -- named) -> cpu (0) == ret (1)
    __migrate(2879 -- named) -> cpu (0) == ret (1)
    __migrate(2880 -- named) -> cpu (0) == ret (1)
    __migrate(2881 -- named) -> cpu (0) == ret (1)
    __migrate(2941 -- cupsd) -> cpu (0) == ret (1)
    __migrate(2978 -- mysqld_safe) -> cpu (0) == ret (1)
    __migrate(3039 -- mysqld) -> cpu (0) == ret (1)
    __migrate(3050 -- mysqld) -> cpu (0) == ret (1)
    __migrate(3051 -- mysqld) -> cpu (0) == ret (1)
    __migrate(3052 -- mysqld) -> cpu (0) == ret (1)
    __migrate(3054 -- mysqld) -> cpu (0) == ret (1)
    __migrate(3058 -- mysqld) -> cpu (0) == ret (1)
    __migrate(3059 -- mysqld) -> cpu (0) == ret (1)
    __migrate(3067 -- mysqld) -> cpu (0) == ret (1)
    __migrate(3071 -- mysqld) -> cpu (0) == ret (1)
    __migrate(3040 -- logger) -> cpu (0) == ret (1)
    __migrate(3152 -- postgres) -> cpu (0) == ret (1)
    __migrate(3350 -- bash) -> cpu (0) == ret (1)
    __migrate(3398 -- smbd) -> cpu (0) == ret (1)
    __migrate(3426 -- winbindd) -> cpu (0) == ret (1)
    __migrate(3433 -- winbindd) -> cpu (0) == ret (1)
    __migrate(3454 -- dovecot) -> cpu (0) == ret (1)
    __migrate(3472 -- pop3-login) -> cpu (0) == ret (1)
    __migrate(3474 -- imap-login) -> cpu (0) == ret (1)
    __migrate(3475 -- imap-login) -> cpu (0) == ret (1)
    __migrate(3498 -- cron) -> cpu (0) == ret (1)
    __migrate(3536 -- apache2) -> cpu (0) == ret (1)
    __migrate(3546 -- apache2) -> cpu (0) == ret (1)
    __migrate(3550 -- apache2) -> cpu (0) == ret (1)
    __migrate(3552 -- apache2) -> cpu (0) == ret (1)
    __migrate(3616 -- su) -> cpu (0) == ret (1)
    __migrate(3630 -- bash) -> cpu (0) == ret (1)
    __migrate(3634 -- kstopmachine) -> cpu (0) == ret (1)
    general protection fault: 0600 [#1] PREEMPT SMP DEBUG_PAGEALLOC

    Pid: 0, comm: swapper Not tainted (2.6.26-rc9-00103-g2702484 #18)
    EIP: 0600:[<00000004>] EFLAGS: 00000002 CPU: 1
    EIP is at 0x4
    EAX: 00010600 EBX: c0102d10 ECX: c123a080 EDX: c7858000
    ESI: 00000001 EDI: c07df880 EBP: c7861f94 ESP: c7861f84
    DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    Process swapper (pid: 0, ti=c7860000 task=c7858000 task.ti=c7860000)
    Stack: c0791020 c123a020 00000001 00000000 c7861fb4 c056d12b 00000000 00000000
    00000000 00000000 00000800 00000000 c7861fb0 00000000 00000000 00000000
    00000000 00000000 00000000 00000000 00000000 00000000 000000d8 00000000
    Call Trace:
    [] ? start_secondary+0x14b/0x1b0
    =======================
    Code: Bad EIP value.
    EIP: [<00000004>] 0x4 SS:ESP 0068:c7861f84
    ---[ end trace f8284f363f0b3f16 ]---
    Kernel panic - not syncing: Attempted to kill the idle task!


    Note: This was latest linux-2.6.git/master WITHOUT Miao Xie's patch.

    It seems to have done a "call $0" at one point or another, but that
    might just be qemu, as we didn't see exactly this report on the real
    machine before.

    Ok, now I tested it on my laptop (sorry, no serial console :-)) and I
    get a spinlock recursion. Sorry for bad pics, but it's the best I can
    do with the camera at hand:

    http://folk.uio.no/vegardno/linux/DSC04925.JPG
    http://folk.uio.no/vegardno/linux/DSC04926.JPG
    http://folk.uio.no/vegardno/linux/DSC04927.JPG (probably best pic, but
    some of the msg is cut off, see pic #1 instead)

    Should I use Miao Xie's patch as well?

    Thanks,


    Vegard


  17. Re: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

    2008/7/11 Vegard Nossum:
    > [ ... ]
    >
    > Ok, now I tested it on my laptop (sorry, no serial console :-)) and I
    > get a spinlock recursion.


    Not reproducible without my debugging patch and with Miao's?


    > Sorry for bad pics, but it's the best I can
    > do with the camera at hand:
    >
    > http://folk.uio.no/vegardno/linux/DSC04925.JPG
    > http://folk.uio.no/vegardno/linux/DSC04926.JPG
    > http://folk.uio.no/vegardno/linux/DSC04927.JPG (probably best pic, but
    > some of the msg is cut off, see pic #1 instead)
    >
    > Should I use Miao Xie's patch as well?


    My debugging patch already includes it.

    Ok, so this initial problem is reproducible on your laptop? I did try
    to get it on my dual-core laptop without success. Maybe you could send
    me your config-spec (offline) and your test scripts?
    Then we could stop crashing your PCs :-)


    >
    > Thanks,
    >
    >
    > Vegard
    >


    --
    Best regards,
    Dmitry Adamushko

  18. Re: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

    On Fri, Jul 11, 2008 at 1:04 PM, Vegard Nossum wrote:
    > On Fri, Jul 11, 2008 at 11:02 AM, Dmitry Adamushko
    > wrote:
    >> Vegard,
    >>
    >>
    >> regarding the first crash. Would you please run your test with the
    >> following debugging patch and let me know its output?
    >>
    >> The appearance of " * [ pid ] comm (name), orig_cpu() ... " means we
    >> hit a problematic case (with Miao Xie's patch it shouldn't crash).
    >>
    >> I see that you have CONFIG_SCHED_DEBUG=y so I'm also interested in
    >> messages from sched_domain_debug() - "CPU# attaching ...". IOW, all
    >> the kernel messages appearing while a cpu is going down and up.

    [...]

    > Ok, now I tested it on my laptop (sorry, no serial console :-)) and I


    Now I tested using serial console, but nothing new:

    CPU0 attaching NULL sched-domain.
    CPU1 attaching NULL sched-domain.
    CPU0 attaching sched-domain:
    domain 0: span 0-1
    groups: 0 1
    domain 1: span 0-1
    groups: 0-1
    CPU1 attaching sched-domain:
    domain 0: span 0-1
    groups: 1 0
    domain 1: span 0-1
    groups: 0-1
    * [ 7 ] comm (ksoftirqd/1), orig_cpu (1), dst_cpu (1), cpu (1)
    CPU 1 is now offline
    * [ 1228 ] comm (kjournald), orig_cpu (0), dst_cpu (0), cpu (0)
    * [ 3113 ] comm (klogd), orig_cpu (0), dst_cpu (0), cpu (0)
    BUG: spinlock recursion on CPU#0, syslogd/3110

    and here the output stops. I find this REALLY strange, look at the
    spinlock recursion code:

    printk(KERN_EMERG "BUG: spinlock %s on CPU#%d, %s/%d\n",
            msg, raw_smp_processor_id(),
            current->comm, task_pid_nr(current));
    printk(KERN_EMERG " lock: %p, .magic: %08x, .owner: %s/%d, "
                    ".owner_cpu: %d\n",
            lock, lock->magic,
            owner ? owner->comm : "<none>",
            owner ? task_pid_nr(owner) : -1,
            lock->owner_cpu);

    why would it not be able to print the second line?


    Vegard


  19. Re: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

    On Fri, Jul 11, 2008 at 7:51 PM, Vegard Nossum wrote:
    > Now I tested using serial console, but nothing new:
    >
    > CPU0 attaching NULL sched-domain.
    > CPU1 attaching NULL sched-domain.
    > CPU0 attaching sched-domain:
    > domain 0: span 0-1
    > groups: 0 1
    > domain 1: span 0-1
    > groups: 0-1
    > CPU1 attaching sched-domain:
    > domain 0: span 0-1
    > groups: 1 0
    > domain 1: span 0-1
    > groups: 0-1
    > * [ 7 ] comm (ksoftirqd/1), orig_cpu (1), dst_cpu (1), cpu (1)
    > CPU 1 is now offline
    > * [ 1228 ] comm (kjournald), orig_cpu (0), dst_cpu (0), cpu (0)
    > * [ 3113 ] comm (klogd), orig_cpu (0), dst_cpu (0), cpu (0)
    > BUG: spinlock recursion on CPU#0, syslogd/3110
    >
    > and here the output stops. I find this REALLY strange, look at the
    > spinlock recursion code:
    >
    > printk(KERN_EMERG "BUG: spinlock %s on CPU#%d, %s/%d\n",
    >         msg, raw_smp_processor_id(),
    >         current->comm, task_pid_nr(current));
    > printk(KERN_EMERG " lock: %p, .magic: %08x, .owner: %s/%d, "
    >                 ".owner_cpu: %d\n",
    >         lock, lock->magic,
    >         owner ? owner->comm : "<none>",
    >         owner ? task_pid_nr(owner) : -1,
    >         lock->owner_cpu);
    >
    > why would it not be able to print the second line?


    Enabling NMI watchdog gives us this additional info (whoa, what a huge
    stacktrace):

    BUG: spinlock recursion on CPU#0, bash/3832
    BUG: NMI Watchdog detected LOCKUP on CPU0, ip c010a5a1, registers:
    Pid: 3832, comm: bash Not tainted (2.6.26-rc9-00103-g2702484 #18)
    EIP: 0060:[] EFLAGS: 00200082 CPU: 0
    EIP is at native_read_tsc+0x11/0x20
    EAX: 4633e3b2 EBX: c06c6d00 ECX: 00000001 EDX: 00000086
    ESI: 00000001 EDI: 00000000 EBP: f6307a68 ESP: f6307a68
    DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    Process bash (pid: 3832, ti=f6306000 task=f5b71fe0 task.ti=f6306000)
    Stack: f6307a80 c027d6b2 00000000 c06c6d00 00360dc6 00000000 f6307a88 c027d5c9
    f6307ac0 c028b8ea c0148e0a c06da1a0 c1fd8180 c06da1a0 f6307aac 00000001
    b25fadf4 00000000 b25fadf4 c06c6d10 c06c6d00 00200092 f6307ae0 c0572cb3
    Call Trace:
    [] ? delay_tsc+0x22/0xa4
    [] ? __delay+0x9/0x10
    [] ? _raw_spin_lock+0xea/0x180
    [] ? atomic_notifier_call_chain+0x1a/0x20
    [] ? _spin_lock_irqsave+0x63/0x80
    [] ? __wake_up+0x1b/0x50
    [] ? __wake_up+0x1b/0x50
    [] ? release_console_sem+0x1b5/0x1e0
    [] ? wake_up_klogd+0x3b/0x40
    [] ? release_console_sem+0x1c9/0x1e0
    [] ? vprintk+0x29c/0x410
    [] ? nmi_watchdog_tick+0x90/0x160
    [] ? do_nmi+0xb5/0x2c0
    [] ? nmi_stack_correct+0x26/0x2b
    [] ? printk+0x1b/0x20
    [] ? spin_bug+0x63/0x100
    [] ? _raw_spin_lock+0x16c/0x180
    [] ? atomic_notifier_call_chain+0x1a/0x20
    [] ? __wake_up+0x1b/0x50
    [] ? _spin_lock_irqsave+0x63/0x80
    [] ? __wake_up+0x1b/0x50
    [] ? __wake_up+0x1b/0x50
    [] ? release_console_sem+0x1b5/0x1e0
    [] ? wake_up_klogd+0x3b/0x40
    [] ? release_console_sem+0x1c9/0x1e0
    [] ? vprintk+0x29c/0x410
    [] ? enqueue_entity+0x70/0x210
    [] ? enqueue_task_fair+0x40/0x50
    [] ? try_to_wake_up+0x120/0x2b0
    [] ? printk+0x1b/0x20
    [] ? try_to_wake_up+0x167/0x2b0
    [] ? default_wake_function+0xb/0x10
    [] ? autoremove_wake_function+0x1b/0x50
    [] ? __wake_up_common+0x48/0x70
    [] ? __wake_up+0x37/0x50
    [] ? wake_up_klogd+0x3b/0x40
    [] ? release_console_sem+0x1c9/0x1e0
    [] ? vprintk+0x29c/0x410
    [] ? vprintk+0x309/0x410
    [] ? mark_held_locks+0x65/0x80
    [] ? __mutex_unlock_slowpath+0xb5/0x150
    [] ? printk+0x1b/0x20
    [] ? alternatives_smp_switch+0x18/0x1b0
    [] ? printk+0x1b/0x20
    [] ? __cpu_die+0x74/0x80
    [] ? _cpu_down+0x13a/0x240
    [] ? cpu_maps_update_begin+0xf/0x20
    [] ? cpu_down+0x2b/0x40
    [] ? store_online+0x39/0x80
    [] ? store_online+0x0/0x80
    [] ? sysdev_store+0x2b/0x40
    [] ? sysfs_write_file+0xa2/0x100
    [] ? vfs_write+0x96/0x130
    [] ? sysfs_write_file+0x0/0x100
    [] ? sys_write+0x3d/0x70
    [] ? sysenter_past_esp+0x6a/0xb1
    =======================
    Code: 56 30 17 00 5d c3 8d 74 26 00 e6 ed 5d c3 90 90 90 90 90 90 90 90 90 90 90
    90 55 89 e5 8d 76 00 0f ae e8 0f 31 8d 76 00 0f ae e8 <5d> c3 8d b6 00 00 00 00
    8d bc 27 00 00 00 00 55 89 e5 57 89 c7

    I don't know if it helps too much, but I think it can't really be
    anything but the printk()s from sched code...
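    One possible reading of the trace (an interpretation, not something
    confirmed in this thread): __wake_up() takes the waitqueue lock,
    try_to_wake_up() hits a debugging printk(), and that printk() goes
    through release_console_sem() -> wake_up_klogd() back into
    __wake_up() on the same queue. A single-threaded toy model of a lock
    with owner tracking shows the shape of it; struct debug_lock and all
    function names below are hypothetical, not kernel API.

```c
#include <assert.h>
#include <stddef.h>

struct debug_lock {
    const void *owner; /* NULL while unlocked */
    int recursions;    /* times the owner tried to take it again */
};

/* Returns 0 if the lock was taken, -1 if 'self' already holds it. */
static int debug_lock_acquire(struct debug_lock *l, const void *self)
{
    assert(l != NULL && self != NULL);
    if (l->owner == self) {
        l->recursions++; /* a debug spinlock would report recursion here */
        return -1;
    }
    l->owner = self; /* toy model: nobody else contends, so no spinning */
    return 0;
}

static void debug_lock_release(struct debug_lock *l, const void *self)
{
    assert(l->owner == self);
    l->owner = NULL;
}

/*
 * Model of the suspected chain: the wake-up path holds the queue lock and,
 * if 'print_inside' is set, a print issued while holding it ends up in the
 * same wake-up path again. Returns 0 if nothing nested, -1 on recursion.
 */
static int wake_up_model(struct debug_lock *q, const void *self,
                         int print_inside)
{
    int nested = 0;

    if (debug_lock_acquire(q, self) != 0)
        return -1;
    if (print_inside)
        nested = debug_lock_acquire(q, self); /* print -> wake-up again */
    debug_lock_release(q, self);
    return nested;
}
```

    Without the nested print the model returns 0; with it, the second
    acquire is attempted by the task that already owns the lock. In the
    real kernel the report about that is itself a printk(), which would
    fit the output stopping after the first line.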


    Vegard


  20. Re: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

    2008/7/11 Vegard Nossum:
    > On Fri, Jul 11, 2008 at 1:04 PM, Vegard Nossum wrote:
    >> On Fri, Jul 11, 2008 at 11:02 AM, Dmitry Adamushko
    >> wrote:
    >>> Vegard,
    >>>
    >>>
    >>> regarding the first crash. Would you please run your test with the
    >>> following debugging patch and let me know its output?
    >>>
    >>> The appearance of " * [ pid ] comm (name), orig_cpu() ... " means we
    >>> hit a problematic case (with Miao Xie's patch it shouldn't crash).
    >>>
    >>> I see that you have CONFIG_SCHED_DEBUG=y so I'm also interested in
    >>> messages from sched_domain_debug() - "CPU# attaching ...". IOW, all
    >>> the kernel messages appearing while a cpu is going down and up.

    > [...]
    >
    >> Ok, now I tested it on my laptop (sorry, no serial console :-)) and I

    >
    > Now I tested using serial console, but nothing new:
    >
    > CPU0 attaching NULL sched-domain.
    > CPU1 attaching NULL sched-domain.
    > CPU0 attaching sched-domain:
    > domain 0: span 0-1
    > groups: 0 1
    > domain 1: span 0-1
    > groups: 0-1
    > CPU1 attaching sched-domain:
    > domain 0: span 0-1
    > groups: 1 0
    > domain 1: span 0-1
    > groups: 0-1


    Hmm, the sched-domains have been rebuilt too early: the
    soon-to-be-offline cpu #1 is still included (presumably because it's
    still in cpu_online_map).

    > * [ 7 ] comm (ksoftirqd/1), orig_cpu (1), dst_cpu (1), cpu (1)


    Have you removed "__migrate_dead ... " printk messages? This one
    should be printed after __stop_machine_run(take_cpu_down, ...) and
    before migrate_live_tasks() takes place... so we would have seen
    "__migrate_dead..." message for ksoftirqd/1 a bit later, I guess.

    > CPU 1 is now offline


    migrate_live_tasks() should take place here...

    > * [ 1228 ] comm (kjournald), orig_cpu (0), dst_cpu (0), cpu (0)
    > * [ 3113 ] comm (klogd), orig_cpu (0), dst_cpu (0), cpu (0)


    I guess these were migrated onto cpu#0 by migrate_live_tasks(), but
    now try_to_wake_up() has been called for them. Because cpu#1 is still
    visible in the sched-domains, the load-balancer (select_task_rq())
    erroneously picks it... boom.
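    To make the failure mode concrete, here is a toy CPU chooser working
    on plain bitmasks. It is not select_task_rq(); pick_cpu(), its
    parameters and ALL_CPUS are invented for this sketch. The point is
    only that a domain span built while cpu#1 was still online keeps
    offering cpu#1 unless the choice is also restricted by the current
    online mask.

```c
#include <assert.h>
#include <stdint.h>

#define ALL_CPUS UINT32_MAX /* "no restriction" mask for this sketch */

/*
 * Choose a CPU from 'span' restricted to 'allowed'. The preferred CPU wins
 * if it is a candidate, otherwise the lowest-numbered candidate is used.
 * Returns -1 when no CPU qualifies.
 */
static int pick_cpu(uint32_t span, uint32_t allowed, int prefer)
{
    uint32_t candidates = span & allowed;
    int cpu;

    assert(prefer >= 0 && prefer < 32);
    if (candidates & (UINT32_C(1) << prefer))
        return prefer;
    for (cpu = 0; cpu < 32; cpu++)
        if (candidates & (UINT32_C(1) << cpu))
            return cpu;
    return -1;
}
```

    With a stale span of 0x3 (cpus 0-1) and cpu#1 preferred, passing
    ALL_CPUS yields 1, the offline CPU; passing an online mask of 0x1
    yields 0.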




    >
    > Vegard
    >


    --
    Best regards,
    Dmitry Adamushko
