[PATCH 2/2] coredump: exit_mm: clear ->mm first, then play with ->core_state - Kernel

This is a discussion on [PATCH 2/2] coredump: exit_mm: clear ->mm first, then play with ->core_state - Kernel ; With the previous changes the sub-threads which participate in coredump do not need to have the valid ->mm when the coredump is in progress, now we can decouple exit_mm() from coredumping code. Change exit_mm() to clear ->mm first, then play ...

+ Reply to Thread
Results 1 to 5 of 5

Thread: [PATCH 2/2] coredump: exit_mm: clear ->mm first, then play with ->core_state

  1. [PATCH 2/2] coredump: exit_mm: clear ->mm first, then play with ->core_state

    With the previous changes the sub-threads which participate in coredump do
    not need to have the valid ->mm when the coredump is in progress, now we
    can decouple exit_mm() from coredumping code.

    Change exit_mm() to clear ->mm first, then play with mm->core_state. This
    simplifies the code because we can avoid unlock/lock games with ->mmap_sem,
    and more importantly this makes the coredumping process visible to oom_kill.
    Currently the PF_EXITING task can sleep with ->mm != NULL "unpredictably"
    long.

    The patch moves the coredumping code to the new function for the readability.

    Signed-off-by: Oleg Nesterov

    kernel/exit.c | 47 +++++++++++++++++++++++++----------------------
    fs/exec.c | 4 ++--
    2 files changed, 27 insertions(+), 24 deletions(-)

    --- 26-rc2/kernel/exit.c~7_CLEAR_MM_FIRST 2008-07-15 20:25:48.000000000 +0400
    +++ 26-rc2/kernel/exit.c 2008-07-15 20:24:50.000000000 +0400
    @@ -646,6 +646,28 @@ assign_new_owner:
    }
    #endif /* CONFIG_MM_OWNER */

    +static void exit_coredump(struct task_struct * tsk,
    + struct core_state *core_state)
    +{
    + struct core_thread self;
    +
    + self.task = tsk;
    + self.next = xchg(&core_state->dumper.next, &self);
    + /*
    + * Implies mb(), the result of xchg() must be visible
    + * to core_state->dumper.
    + */
    + if (atomic_dec_and_test(&core_state->nr_threads))
    + complete(&core_state->startup);
    +
    + for (; {
    + set_task_state(tsk, TASK_UNINTERRUPTIBLE);
    + if (!self.task) /* see coredump_finish() */
    + break;
    + schedule();
    + }
    + __set_task_state(tsk, TASK_RUNNING);
    +}
    /*
    * Turn us into a lazy TLB process if we
    * aren't already..
    @@ -667,28 +689,6 @@ static void exit_mm(struct task_struct *
    */
    down_read(&mm->mmap_sem);
    core_state = mm->core_state;
    - if (core_state) {
    - struct core_thread self;
    - up_read(&mm->mmap_sem);
    -
    - self.task = tsk;
    - self.next = xchg(&core_state->dumper.next, &self);
    - /*
    - * Implies mb(), the result of xchg() must be visible
    - * to core_state->dumper.
    - */
    - if (atomic_dec_and_test(&core_state->nr_threads))
    - complete(&core_state->startup);
    -
    - for (; {
    - set_task_state(tsk, TASK_UNINTERRUPTIBLE);
    - if (!self.task) /* see coredump_finish() */
    - break;
    - schedule();
    - }
    - __set_task_state(tsk, TASK_RUNNING);
    - down_read(&mm->mmap_sem);
    - }
    atomic_inc(&mm->mm_count);
    BUG_ON(mm != tsk->active_mm);
    /* more a memory barrier than a real lock */
    @@ -701,6 +701,9 @@ static void exit_mm(struct task_struct *
    task_unlock(tsk);
    mm_update_next_owner(mm);
    mmput(mm);
    +
    + if (core_state)
    + exit_coredump(tsk, core_state);
    }

    static void
    --- 26-rc2/fs/exec.c~7_CLEAR_MM_FIRST 2008-07-15 17:54:45.000000000 +0400
    +++ 26-rc2/fs/exec.c 2008-07-15 20:24:50.000000000 +0400
    @@ -1632,8 +1632,8 @@ static void coredump_finish(struct mm_st
    next = curr->next;
    task = curr->task;
    /*
    - * see exit_mm(), curr->task must not see
    - * ->task == NULL before we read ->next.
    + * see exit_coredump(), curr->task must not
    + * see ->task == NULL before we read ->next.
    */
    smp_mb();
    curr->task = NULL;

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: [PATCH 2/2] coredump: exit_mm: clear ->mm first, then play with ->core_state

    > With the previous changes the sub-threads which participate in coredump do
    > not need to have the valid ->mm when the coredump is in progress, now we
    > can decouple exit_mm() from coredumping code.


    I'm all for separating the code more cleanly. But I don't think it can
    work to change the order of the operations, i.e. it is not really true that
    core dumps don't need each thread's ->mm link to be valid. Is there a
    benefit to unlinking the mm before waiting for the core dump to finish?

    The issue is that the user_regset calls to get "thread state" might
    actually read some user memory. Those calls use a task_struct pointer and
    you don't get to separately tell them the mm_struct describing the thread's
    address space. For example, the sparc64 "general registers" note for core
    files includes the register window read from user memory.

    So, it's not OK to clear the ->mm before everything examining the thread's
    machine state is really done, i.e. core dump and anything else.


    Thanks,
    Roland
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  3. Re: [PATCH 2/2] coredump: exit_mm: clear ->mm first, then play with ->core_state

    On 07/19, Roland McGrath wrote:
    >
    > > With the previous changes the sub-threads which participate in coredump do
    > > not need to have the valid ->mm when the coredump is in progress, now we
    > > can decouple exit_mm() from coredumping code.

    >
    > I'm all for separating the code more cleanly. But I don't think it can
    > work to change the order of the operations, i.e. it is not really true that
    > core dumps don't need each thread's ->mm link to be valid. Is there a
    > benefit to unlinking the mm before waiting for the core dump to finish?


    If select_bad_process() sees the PF_EXITING task with ->mm != NULL, it
    returns ERR_PTR(-1). This means that any prcoess doing the mt coredump
    blocks oom kill completely. It is not that oom_kill doesn't take this
    process into account, oom_kill just can't work intil ->core_dump()
    completes.

    Yes, oom_kill.c in turn need fixes but still this is not nice, and I
    personally hate this coredump code in the middle of exit_mm().

    However,

    > The issue is that the user_regset calls to get "thread state" might
    > actually read some user memory. Those calls use a task_struct pointer and
    > you don't get to separately tell them the mm_struct describing the thread's
    > address space. For example, the sparc64 "general registers" note for core
    > files includes the register window read from user memory.
    >
    > So, it's not OK to clear the ->mm before everything examining the thread's
    > machine state is really done, i.e. core dump and anything else.


    Oh, thanks Roland.

    Andrew, please drop

    coredump-binfmt_elf_fdpic-dont-use-sub-threads-mm.patch
    coredump-exit_mm-clear-mm-first-then-play-with-core_state.patch





    btw, arch/sparc64/kernel/ptrace.c has a lot of

    if (target == current)
    copy_xxx_user();
    else
    access_process_vm();

    perhaps it make sense to make a helper. Just curious (I don't know what
    regset is), is it possible that ->get() is called when target->mm == NULL?

    Oleg.

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  4. Re: [PATCH 2/2] coredump: exit_mm: clear ->mm first, then play with ->core_state

    > Yes, oom_kill.c in turn need fixes but still this is not nice, and I
    > personally hate this coredump code in the middle of exit_mm().


    I agree. I think what we should work towards is having the coredump
    synchronization take place earlier (the tracehook_report_exit point should
    be fine, i.e. before PF_EXITING). Then have the dumping and the waiting
    for it be killable. I don't think we can get there in only one or two
    steps, though.

    [The rest if quite off-topic, please do it separately.]

    > btw, arch/sparc64/kernel/ptrace.c has a lot of
    >
    > if (target == current)
    > copy_xxx_user();
    > else
    > access_process_vm();
    >
    > perhaps it make sense to make a helper.


    Dave actually has get_from_target and set_to_target helpers for that.
    The places they aren't used, I assume are either just older code not
    yet streamlined, or places where the separate get/put_user calls perform
    especially better than copy_xxx_user (you'd have to ask Dave). If multiple
    arch ports find such helpers useful, they could move into common code later.

    > Just curious (I don't know what regset is), is it possible that ->get()
    > is called when target->mm == NULL?


    It should not happen. It's only kosher to use user_regset calls on a task
    in a known state (like ptrace stop, or current), and never on kernel threads.


    Thanks,
    Roland
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  5. killable/interruptible coredumps

    On 07/20, Roland McGrath wrote:
    >
    > Then have the dumping and the waiting
    > for it be killable.


    I think it is easy to make the coredumping killable right now,

    --- fs/exec.c~ 2008-07-21 19:47:22.000000000 +0400
    +++ fs/exec.c 2008-07-21 19:56:44.000000000 +0400
    @@ -1523,6 +1523,7 @@ static inline int zap_threads(struct tas
    spin_lock_irq(&tsk->sighand->siglock);
    if (!signal_group_exit(tsk->signal)) {
    mm->core_state = core_state;
    + clear_thread_flag(TIF_SIGPENDING);
    tsk->signal->group_exit_code = exit_code;
    nr = zap_process(tsk);
    }
    @@ -1735,12 +1736,6 @@ int do_coredump(long signr, int exit_cod
    goto fail;

    /*
    - * Clear any false indication of pending signals that might
    - * be seen by the filesystem code called to write the core file.
    - */
    - clear_thread_flag(TIF_SIGPENDING);
    -
    - /*
    * lock_kernel() because format_corename() is controlled by sysctl, which
    * uses lock_kernel()
    */
    --- fs/binfmt_elf.c~ 2008-07-13 20:52:25.000000000 +0400
    +++ fs/binfmt_elf.c 2008-07-20 19:53:08.000000000 +0400
    @@ -1105,11 +1105,17 @@ out:
    */
    static int dump_write(struct file *file, const void *addr, int nr)
    {
    + if (fatal_signal_pending(current))
    + return 0;
    +
    return file->f_op->write(file, addr, nr, &file->f_pos) == nr;
    }

    static int dump_seek(struct file *file, loff_t off)
    {
    + if (fatal_signal_pending(current))
    + return 0;
    +
    if (file->f_op->llseek && file->f_op->llseek != no_llseek) {
    if (file->f_op->llseek(file, off, SEEK_CUR) < 0)
    return 0;

    This relies on the fact SIGKILL will find the coredumping thread
    because it is not PF_EXITING, even if SIGNAL_GROUP_EXIT is set. This
    is a bit racy though, SIGKILL may come after zap_process(current)
    but before other threads have "died".


    I wonder if it makes sense to go futher. Currently the coredumping
    thread sets SIGNAL_GROUP_EXIT. We can set signal->group_exit_task
    instead, this way it can be killed by any fatal signal, not only by
    SIGKILL. For example the user can just use ^C (unless it is ignored/etc).

    What is your opinion about the patch below? I can't decide wether
    this change is good or bad. It complicates the code, but otoh it
    makes things more consistent, imho. Note also we can change oom_kill
    to check SIGNAL_GROUP_EXIT instead of PF_EXITING to detect the process
    which should release its ->mm and probably free the memory "soon".

    It is not clear what should be ->group_exit_code if ->core_dump() is
    interrupted. This patch sets it = exit_code, but perhaps we should
    keep the original signr in that case.

    Oleg.

    --- 26-rc2/fs/exec.c~1_MAKE_KILLABLE 2008-07-20 17:31:40.000000000 +0400
    +++ 26-rc2/fs/exec.c 2008-07-20 19:46:21.000000000 +0400
    @@ -1498,7 +1498,6 @@ static int zap_process(struct task_struc
    struct task_struct *t;
    int nr = 0;

    - start->signal->flags = SIGNAL_GROUP_EXIT;
    start->signal->group_stop_count = 0;

    t = start;
    @@ -1523,7 +1522,8 @@ static inline int zap_threads(struct tas
    spin_lock_irq(&tsk->sighand->siglock);
    if (!signal_group_exit(tsk->signal)) {
    mm->core_state = core_state;
    - tsk->signal->group_exit_code = exit_code;
    + clear_thread_flag(TIF_SIGPENDING);
    + tsk->signal->group_exit_task = tsk;
    nr = zap_process(tsk);
    }
    spin_unlock_irq(&tsk->sighand->siglock);
    @@ -1574,6 +1574,7 @@ static inline int zap_threads(struct tas
    if (p->mm) {
    if (unlikely(p->mm == mm)) {
    lock_task_sighand(p, &flags);
    + p->signal->flags = SIGNAL_GROUP_EXIT;
    nr += zap_process(p);
    unlock_task_sighand(p, &flags);
    }
    @@ -1619,7 +1620,7 @@ fail:
    return core_waiters;
    }

    -static void coredump_finish(struct mm_struct *mm)
    +static inline void core_state_finish(struct mm_struct *mm)
    {
    struct core_thread *curr, *next;
    struct task_struct *task;
    @@ -1640,6 +1641,17 @@ static void coredump_finish(struct mm_st
    mm->core_state = NULL;
    }

    +static void coredump_finish(struct mm_struct *mm, int exit_code)
    +{
    + spin_lock_irq(&current->sighand->siglock);
    + current->signal->flags = SIGNAL_GROUP_EXIT;
    + current->signal->group_exit_code = exit_code;
    + current->signal->group_exit_task = NULL;
    + spin_unlock_irq(&current->sighand->siglock);
    +
    + core_state_finish(mm);
    +}
    +
    /*
    * set_dumpable converts traditional three-value dumpable to two flags and
    * stores them into mm->flags. It modifies lower two bits of mm->flags, but
    @@ -1735,12 +1747,6 @@ int do_coredump(long signr, int exit_cod
    goto fail;

    /*
    - * Clear any false indication of pending signals that might
    - * be seen by the filesystem code called to write the core file.
    - */
    - clear_thread_flag(TIF_SIGPENDING);
    -
    - /*
    * lock_kernel() because format_corename() is controlled by sysctl, which
    * uses lock_kernel()
    */
    @@ -1814,9 +1820,8 @@ int do_coredump(long signr, int exit_cod
    goto close_fail;

    retval = binfmt->core_dump(signr, regs, file, core_limit);
    -
    if (retval)
    - current->signal->group_exit_code |= 0x80;
    + exit_code |= 0x80;
    close_fail:
    filp_close(file, NULL);
    fail_unlock:
    @@ -1824,7 +1829,7 @@ fail_unlock:
    argv_free(helper_argv);

    current->fsuid = fsuid;
    - coredump_finish(mm);
    + coredump_finish(mm, exit_code);
    fail:
    return retval;
    }

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

+ Reply to Thread