latest -git: A peculiar case of a stuck process (ext3/sched-related?) - Kernel

This is a discussion on latest -git: A peculiar case of a stuck process (ext3/sched-related?) - Kernel ; Hi, I was running a test which corrupts ext3 filesystem images on purpose. After quite a long time, I have ended up with a grep that runs at 98% CPU and is unkillable even though it is in state R: ...

+ Reply to Thread
Results 1 to 7 of 7

Thread: latest -git: A peculiar case of a stuck process (ext3/sched-related?)

  1. latest -git: A peculiar case of a stuck process (ext3/sched-related?)

    Hi,

    I was running a test which corrupts ext3 filesystem images on purpose.
    After quite a long time, I have ended up with a grep that runs at 98%
    CPU and is unkillable even though it is in state R:

    root 6573 98.6 0.0 4008 820 pts/0 R 11:17 15:48 grep -r . mnt

    It doesn't go away with kill -9 either. A sysrq-t shows this info:

    grep R running 5704 6573 6552
    f4ff3c3c c0747b19 00000000 f4ff3bd4 c01507ba ffffffff 00000000 f4ff3bf0
    f5992fd0 f4ff3c4c 01597000 00000000 c09cd080 f312afd0 f312b248 c1fb2f80
    00000001 00000002 00000000 f312afd0 f312afd0 f4ff3c24 c015ab70 00000000
    Call Trace:
    [] ? schedule+0x459/0x960
    [] ? atomic_notifier_call_chain+0x1a/0x20
    [] ? mark_held_locks+0x40/0x80
    [] ? trace_hardirqs_on+0xb/0x10
    [] ? trace_hardirqs_on_caller+0x116/0x170
    [] preempt_schedule_irq+0x3e/0x70
    [] need_resched+0x1f/0x23
    [] ? ext3_find_entry+0x401/0x6f0
    [] ? __lock_acquire+0x2c9/0x1110
    [] ? slab_pad_check+0x3c/0x120
    [] ? trace_hardirqs_on_caller+0x116/0x170
    [] ? trace_hardirqs_off+0xb/0x10
    [] ext3_lookup+0x3a/0xd0
    [] ? d_alloc+0x133/0x190
    [] do_lookup+0x160/0x1b0
    [] __link_path_walk+0x208/0xdc0
    [] ? lock_release_holdtime+0x83/0x120
    [] ? mnt_want_write+0x4e/0xb0
    [] __link_path_walk+0x8f7/0xdc0
    [] ? trace_hardirqs_off+0xb/0x10
    [] path_walk+0x54/0xb0
    [] do_path_lookup+0x85/0x230
    [] __user_walk_fd+0x38/0x50
    [] vfs_stat_fd+0x21/0x50
    [] ? put_lock_stats+0xd/0x30
    [] ? mntput_no_expire+0x1d/0x110
    [] vfs_stat+0x11/0x20
    [] sys_stat64+0x14/0x30
    [] ? fput+0x1f/0x30
    [] ? trace_hardirqs_on_thunk+0xc/0x10
    [] ? trace_hardirqs_on_caller+0x116/0x170
    [] ? trace_hardirqs_on_thunk+0xc/0x10
    [] sysenter_past_esp+0x78/0xc5
    =======================

    ...so it's clearly related to the corrupted ext3 filesystem. The
    strange thing, in my opinion, is this stack frame:

    [] ext3_lookup+0x3a/0xd0

    ...but this address corresponds to fs/ext3/namei.c:1039:

    bh = ext3_find_entry(dentry, &de);
    inode = NULL;
    if (bh) { /* <--- here */
    unsigned long ino = le32_to_cpu(de->inode);
    brelse (bh);

    What happened? Did the scheduler get stuck? Softlockup detection and
    NMI watchdog are both enabled, but none of them are triggering.

    Trying to strace the problem doesn't really help either:

    # strace -p 6573
    Process 6573 attached - interrupt to quit
    ^C^C^C^C

    (and hangs unkillably too.)

    See full log at:

    http://folk.uio.no/vegardno/linux/log-1216370788.txt

    The machine is still running in the same state and CPU0 is still
    usable. What more info can I provide to help debug this?


    Vegard

    --
    "The animistic metaphor of the bug that maliciously sneaked in while
    the programmer was not looking is intellectually dishonest as it
    disguises that the error is the programmer's own creation."
    -- E. W. Dijkstra, EWD1036
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: latest -git: A peculiar case of a stuck process (ext3/sched-related?)

    On Fri, Jul 18, 2008 at 11:45 AM, Vegard Nossum wrote:
    > Hi,
    >
    > I was running a test which corrupts ext3 filesystem images on purpose.
    > After quite a long time, I have ended up with a grep that runs at 98%
    > CPU and is unkillable even though it is in state R:
    >
    > root 6573 98.6 0.0 4008 820 pts/0 R 11:17 15:48 grep -r . mnt
    >
    > It doesn't go away with kill -9 either. A sysrq-t shows this info:
    >
    > grep R running 5704 6573 6552
    > f4ff3c3c c0747b19 00000000 f4ff3bd4 c01507ba ffffffff 00000000 f4ff3bf0
    > f5992fd0 f4ff3c4c 01597000 00000000 c09cd080 f312afd0 f312b248 c1fb2f80
    > 00000001 00000002 00000000 f312afd0 f312afd0 f4ff3c24 c015ab70 00000000
    > Call Trace:
    > [] ? schedule+0x459/0x960
    > [] ? atomic_notifier_call_chain+0x1a/0x20
    > [] ? mark_held_locks+0x40/0x80
    > [] ? trace_hardirqs_on+0xb/0x10
    > [] ? trace_hardirqs_on_caller+0x116/0x170
    > [] preempt_schedule_irq+0x3e/0x70
    > [] need_resched+0x1f/0x23
    > [] ? ext3_find_entry+0x401/0x6f0
    > [] ? __lock_acquire+0x2c9/0x1110
    > [] ? slab_pad_check+0x3c/0x120
    > [] ? trace_hardirqs_on_caller+0x116/0x170
    > [] ? trace_hardirqs_off+0xb/0x10
    > [] ext3_lookup+0x3a/0xd0
    > [] ? d_alloc+0x133/0x190
    > [] do_lookup+0x160/0x1b0
    > [] __link_path_walk+0x208/0xdc0
    > [] ? lock_release_holdtime+0x83/0x120
    > [] ? mnt_want_write+0x4e/0xb0
    > [] __link_path_walk+0x8f7/0xdc0
    > [] ? trace_hardirqs_off+0xb/0x10
    > [] path_walk+0x54/0xb0
    > [] do_path_lookup+0x85/0x230
    > [] __user_walk_fd+0x38/0x50
    > [] vfs_stat_fd+0x21/0x50
    > [] ? put_lock_stats+0xd/0x30
    > [] ? mntput_no_expire+0x1d/0x110
    > [] vfs_stat+0x11/0x20
    > [] sys_stat64+0x14/0x30
    > [] ? fput+0x1f/0x30
    > [] ? trace_hardirqs_on_thunk+0xc/0x10
    > [] ? trace_hardirqs_on_caller+0x116/0x170
    > [] ? trace_hardirqs_on_thunk+0xc/0x10
    > [] sysenter_past_esp+0x78/0xc5
    > =======================


    Ah, I tried echo l > /proc/sysrq-trigger and it gives this useful information:

    SysRq : Show backtrace of all active CPUs
    CPU1:
    f4ff3bd8 00000000 00000000 c1fadcc0 f4ff3c00 c0106a66 00000000 c083c0d8
    00200096 00000002 f4ff3c14 c0498ccb c08937c4 00000001 e589b050 f4ff3c38
    c0161f88 f4ff3c24 c1fadcc8 f4ff3c24 f4ff3c24 c0a1bd80 00010000 f3f81041
    Call Trace:
    [] ? show_stack+0x36/0x40
    [] ? showacpu+0x4b/0x60
    [] ? generic_smp_call_function_single_interrupt+0x78/0xc0
    [] ? smp_call_function_single_interrupt+0x20/0x40
    [] ? call_function_single_interrupt+0x2d/0x34
    [] ? ext3_find_entry+0x3eb/0x6f0
    [] ? __lock_acquire+0x2c9/0x1110
    [] ? slab_pad_check+0x3c/0x120
    [] ? trace_hardirqs_on_caller+0x116/0x170
    [] ? trace_hardirqs_off+0xb/0x10
    [] ? ext3_lookup+0x3a/0xd0

    And the ext3_find_entry() corresponds to this line:

    for (; de < top; de = ext3_next_entry(de)) /* <--- HERE! */
    if (ext3_match (namelen, name, de)) {
    if (!ext3_check_dir_entry("ext3_find_entry",
    dir, de, bh,
    (block< +((char *)de - bh->b_data))) {
    brelse (bh);
    *err = ERR_BAD_DX_DIR;
    goto errout;
    }
    *res_dir = de;
    dx_release (frames);
    return bh;
    }

    Is it possible that this loop can get stuck with a corrupt filesystem image?

    A few more iterations of this gives that the task is ALWAYS
    interrupted somewhere on line 994:

    for (; de < top; de = ext3_next_entry(de))

    ...but at slightly different EIPs. I find a bit odd as there are no
    loops in ext3_next_entry(), and the for-loop itself isn't that tight
    either. Any ideas?


    Vegard

    --
    "The animistic metaphor of the bug that maliciously sneaked in while
    the programmer was not looking is intellectually dishonest as it
    disguises that the error is the programmer's own creation."
    -- E. W. Dijkstra, EWD1036
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  3. Re: latest -git: A peculiar case of a stuck process (ext3/sched-related?)

    On Fri, 18 Jul 2008 12:17:17 +0200 "Vegard Nossum" wrote:

    > And the ext3_find_entry() corresponds to this line:
    >
    > for (; de < top; de = ext3_next_entry(de)) /* <--- HERE! */
    > if (ext3_match (namelen, name, de)) {
    > if (!ext3_check_dir_entry("ext3_find_entry",
    > dir, de, bh,
    > (block< > +((char *)de - bh->b_data))) {
    > brelse (bh);
    > *err = ERR_BAD_DX_DIR;
    > goto errout;
    > }
    > *res_dir = de;
    > dx_release (frames);
    > return bh;
    > }
    >
    > Is it possible that this loop can get stuck with a corrupt filesystem image?


    yup. ext2 did that a couple of times. See the explicit check for
    de->rec_len == 0 in ext2_find_entry().

    We fixed one filesystem and forgot the others. Again.
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  4. Re: latest -git: A peculiar case of a stuck process (ext3/sched-related?)

    On Fri, Jul 18, 2008 at 12:32 PM, Andrew Morton
    wrote:
    > On Fri, 18 Jul 2008 12:17:17 +0200 "Vegard Nossum" wrote:
    >
    >> And the ext3_find_entry() corresponds to this line:
    >>
    >> for (; de < top; de = ext3_next_entry(de)) /* <--- HERE! */
    >> if (ext3_match (namelen, name, de)) {
    >> if (!ext3_check_dir_entry("ext3_find_entry",
    >> dir, de, bh,
    >> (block< >> +((char *)de - bh->b_data))) {
    >> brelse (bh);
    >> *err = ERR_BAD_DX_DIR;
    >> goto errout;
    >> }
    >> *res_dir = de;
    >> dx_release (frames);
    >> return bh;
    >> }
    >>
    >> Is it possible that this loop can get stuck with a corrupt filesystem image?

    >
    > yup. ext2 did that a couple of times. See the explicit check for
    > de->rec_len == 0 in ext2_find_entry().
    >
    > We fixed one filesystem and forgot the others. Again.


    Wow, that change is OLD, though:

    commit aa4f3f285643956bb614cf7b8f88e15f3a375886
    Author: Andrew Morton
    Date: Mon Apr 29 23:51:34 2002 -0700

    [PATCH] ext2 directory handling

    Will you also send the patch for ext3? :-)


    Vegard

    --
    "The animistic metaphor of the bug that maliciously sneaked in while
    the programmer was not looking is intellectually dishonest as it
    disguises that the error is the programmer's own creation."
    -- E. W. Dijkstra, EWD1036
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  5. Re: latest -git: A peculiar case of a stuck process (ext3/sched-related?)

    2008/7/18 Vegard Nossum :
    > And the ext3_find_entry() corresponds to this line:
    >
    > for (; de < top; de = ext3_next_entry(de)) /* <--- HERE! */
    > if (ext3_match (namelen, name, de)) {
    > if (!ext3_check_dir_entry("ext3_find_entry",
    > dir, de, bh,
    > (block< > +((char *)de - bh->b_data))) {
    > brelse (bh);
    > *err = ERR_BAD_DX_DIR;
    > goto errout;
    > }
    > *res_dir = de;
    > dx_release (frames);
    > return bh;
    > }
    >
    > Is it possible that this loop can get stuck with a corrupt filesystem image?


    It certainly is. This is the same problem as the first case reported
    at http://bugzilla.kernel.org/show_bug.cgi?id=10882. There is a patch
    in -mm for it already (2fde9f7a0faabe821b31ccd982d482c21f7c503f),
    posted here: http://marc.info/?l=linux-kernel&m=121486328013470.

    Hopefully that should fix the problem for you.

    Cheers,
    Duane.

    --
    "I never could learn to drink that blood and call it wine" - Bob Dylan
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  6. Re: latest -git: A peculiar case of a stuck process (ext3/sched-related?)

    On Fri, Jul 18, 2008 at 3:00 PM, Duane Griffin wrote:
    > 2008/7/18 Vegard Nossum :
    >> And the ext3_find_entry() corresponds to this line:
    >>
    >> for (; de < top; de = ext3_next_entry(de)) /* <--- HERE! */
    >> if (ext3_match (namelen, name, de)) {
    >> if (!ext3_check_dir_entry("ext3_find_entry",
    >> dir, de, bh,
    >> (block< >> +((char *)de - bh->b_data))) {
    >> brelse (bh);
    >> *err = ERR_BAD_DX_DIR;
    >> goto errout;
    >> }
    >> *res_dir = de;
    >> dx_release (frames);
    >> return bh;
    >> }
    >>
    >> Is it possible that this loop can get stuck with a corrupt filesystem image?

    >
    > It certainly is. This is the same problem as the first case reported
    > at http://bugzilla.kernel.org/show_bug.cgi?id=10882. There is a patch
    > in -mm for it already (2fde9f7a0faabe821b31ccd982d482c21f7c503f),
    > posted here: http://marc.info/?l=linux-kernel&m=121486328013470.
    >
    > Hopefully that should fix the problem for you.


    Oh, right. Thanks!

    (This patch looks much shorter than the one I found for ext2, though,
    i.e. commit aa4f3f285643956bb614cf7b8f88e15f3a375886 from the
    historical git repository. Does it fix all the right cases? I don't
    mean to troll -- I just wanted to be sure you knew about it.)

    I'll try it and report back if there's any trouble. Thanks,

    Vegard

    --
    "The animistic metaphor of the bug that maliciously sneaked in while
    the programmer was not looking is intellectually dishonest as it
    disguises that the error is the programmer's own creation."
    -- E. W. Dijkstra, EWD1036
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  7. Re: latest -git: A peculiar case of a stuck process (ext3/sched-related?)

    2008/7/18 Vegard Nossum :
    > (This patch looks much shorter than the one I found for ext2, though,
    > i.e. commit aa4f3f285643956bb614cf7b8f88e15f3a375886 from the
    > historical git repository. Does it fix all the right cases? I don't
    > mean to troll -- I just wanted to be sure you knew about it.)


    Ah, the historical git archive, that explains why I couldn't find it!
    Please, don't worry about being seen as trolling, I didn't know about
    that commit and I appreciate the heads-up and your attention to this
    in general.

    I've just taken a look at it and where I could find the corresponding
    ext3 code it all looked OK. The only change the commit makes that I
    couldn't verify was to ext2_add_link, as I couldn't immediately spot
    where the corresponding ext3 functionality is.

    Having said that, as I mentioned in the comment to that patch, there
    may well be similar problems still lurking. I tried to fix up all the
    dodgy looking cases at first, but I screwed up and just introduced
    more problems. After that it seemed safer to do a minimal fix and to
    address other similar issues as and when we got test-cases for them.
    With the sterling efforts of yourself and Sami we'll hopefully get
    them sooner or later.

    > I'll try it and report back if there's any trouble. Thanks,


    Thanks!

    > Vegard


    Cheers,
    Duane.

    --
    "I never could learn to drink that blood and call it wine" - Bob Dylan
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

+ Reply to Thread