Re: + fix-procfs-task-exe-symlink.patch added to -mm tree - Kernel

This is a discussion on Re: + fix-procfs-task-exe-symlink.patch added to -mm tree - Kernel ; s/mm-commits/lkml/ On 01/28, Matt Helsley wrote: > > On Sun, 2008-01-27 at 19:25 +0300, Oleg Nesterov wrote: > > > > Why? Linux doesn't allow sys_mmap(MAP_EXECUTABLE), so any VM_EXECUTABLE vma > > should refer to the same bprm->file which was ...

+ Reply to Thread
Results 1 to 4 of 4

Thread: Re: + fix-procfs-task-exe-symlink.patch added to -mm tree

  1. Re: + fix-procfs-task-exe-symlink.patch added to -mm tree

    s/mm-commits/lkml/

    On 01/28, Matt Helsley wrote:
    >
    > On Sun, 2008-01-27 at 19:25 +0300, Oleg Nesterov wrote:
    > >
    > > Why? Linux doesn't allow sys_mmap(MAP_EXECUTABLE), so any VM_EXECUTABLE vma
    > > should refer to the same bprm->file which was mapped by elf_map(), no?

    >
    > You're right. So in this case the kernel wouldn't find any VMA marked
    > VM_EXECUTABLE and it would return with -ENOENT. The same would happen
    > with this patch since we drop the extra file reference once the
    > VM_EXECTUABLE VMAs disappear.


    OK, thanks. This leads to another question which I forgot to ask.

    This patch has a lot of complications because it tries to preserve the
    current behaviour: we release the bprm->file when all VM_EXECTUABLE vmas
    are unmapped. Q: is this so important/useful? I don't think this is very
    common case, and I don't quite understand why it is critical to release
    the file. To unmount fs after starting the app? One can always copy
    the file before execing, or do "/lib/ld-linus.so application" and then
    unmap the vmas.

    (I am not arguing, just curious).

    > > I don't understand why do we need ->exe_file_lock. Afaics, all callers of
    > > added_exe_file_vma/removed_exe_file_vma must hold ->mmap_sem, yes? But this
    > > means get_mm_exe_file() can use down_read(mm->map_sem). No?

    >
    > Yes, I could get the task's ->mmap_sem there too and reuse the mmap_sem
    > rather than add a lock. That allows nearly any task to grab another
    > tasks mmap_sem simply by doing a readlink on /proc/pid/exe. So I thought
    > avoiding reuse of the mmap_sem might be best.
    >
    > Do you still think it would be better to reuse mmap_sem?


    Well, we only need down_read(mmap_sem) for the very short time. /proc/pid/maps
    is much "worse" in this sense.

    > > > @@ -409,6 +410,7 @@ void mmput(struct mm_struct *mm)
    > > > if (atomic_dec_and_test(&mm->mm_users)) {
    > > > exit_aio(mm);
    > > > exit_mmap(mm);
    > > > + set_mm_exe_file(mm, NULL);

    > >
    > > This change looks unneeded. exit_mmap() removes all vmas. The last VM_EXECUTABLE
    > > vma should clear ->exe_file via removed_exe_file_vma() ?

    >
    > You're right -- it's redundant. I'll fix that.


    Sorry, I was wrong.

    mmput() has to release ->exe_file if it is called when exec fails before the
    first do_mmmap(MAP_EXECUTABLE). This also means that it is not completely
    trivial to set ->exe_file before exec_mmap(), it can fail. This is solvable,
    but I'm not sure we should do this.

    Still, the accounting looks a little bit fragile to me. flush_old_exec()
    increments ->f_count but sets ->num_exe_file_vmas = 0 because we know that
    the next elf_map() will bump ->num_exe_file_vmas and thus "sync" 2 counters.
    But I don't see how to do better if we really want to release the file when
    VM_EXECUTABLE disappears.

    Oleg.

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: + fix-procfs-task-exe-symlink.patch added to -mm tree


    On Tue, 2008-01-29 at 14:36 +0300, Oleg Nesterov wrote:
    > s/mm-commits/lkml/
    >
    > On 01/28, Matt Helsley wrote:
    > >
    > > On Sun, 2008-01-27 at 19:25 +0300, Oleg Nesterov wrote:
    > > >
    > > > Why? Linux doesn't allow sys_mmap(MAP_EXECUTABLE), so any VM_EXECUTABLE vma
    > > > should refer to the same bprm->file which was mapped by elf_map(), no?

    > >
    > > You're right. So in this case the kernel wouldn't find any VMA marked
    > > VM_EXECUTABLE and it would return with -ENOENT. The same would happen
    > > with this patch since we drop the extra file reference once the
    > > VM_EXECTUABLE VMAs disappear.

    >
    > OK, thanks. This leads to another question which I forgot to ask.
    >
    > This patch has a lot of complications because it tries to preserve the
    > current behaviour: we release the bprm->file when all VM_EXECTUABLE vmas
    > are unmapped. Q: is this so important/useful? I don't think this is very
    > common case, and I don't quite understand why it is critical to release
    > the file. To unmount fs after starting the app? One can always copy


    Yes. While most programs don't need this it is still very important for
    some critical programs to be able to unmap the executable and thereby
    allow unmounting the filesystem. Unfortunately, I don't have a confirmed
    specific example for you. A wild guess: some distro install or live CDs
    might use this.

    > the file before execing, or do "/lib/ld-linus.so application" and then
    > unmap the vmas.


    So, rather than what ld-linux.so does now:

    $ /lib/ld-linux.so.2 /bin/sleep 100 &
    [1] 12645
    $ cat /proc/12645/maps
    08048000-0804b000 r-xp 00000000 03:04 606421 /bin/sleep
    0804b000-0804c000 rw-p 00003000 03:04 606421 /bin/sleep
    80000000-8001c000 r-xp 00000000 03:04 688355 /lib/ld-2.7.so
    8001c000-8001e000 rw-p 0001b000 03:04 688355 /lib/ld-2.7.so

    have ld-linux.so copy the mmap'd executable areas and then unmap the
    originals. So it would look roughly like:

    0804c000-0804f000 r-xp 00000000 03:04 606421
    0804f000-08050000 rw-p 00003000 03:04 606421
    80000000-8001c000 r-xp 00000000 03:04 688355 /lib/ld-2.7.so
    8001c000-8001e000 rw-p 0001b000 03:04 688355 /lib/ld-2.7.so

    Then there'd be no need to have the extra reference counting this patch
    adds.

    I think these approaches could subtly break existing userspace
    applications which don't already use these techniques. Furthermore, I
    wonder if some applications may wish to unmount 'everything'. This means
    there may be no mount that's acceptable to pin by either copying to or
    using a modified ld-linux.so.

    Fixing the problem in userspace with these techniques also requires a
    non-trivial audit of userspace. There could easily be two tasks that
    have little or no apparent relation to each other. One does the unmap
    trick and the other expects to be able to unmount. The first would then
    need to be modified to cp the executable to a suitable location or
    utilize a modified ld-linux.so.

    I'd be happy to submit a patch removing the extra reference counting if
    there's a way to avoid breaking userspace or if there's consensus that
    breaking userspace this way is acceptable.

    > (I am not arguing, just curious).


    Sure.

    > > > I don't understand why do we need ->exe_file_lock. Afaics, all callers of
    > > > added_exe_file_vma/removed_exe_file_vma must hold ->mmap_sem, yes? But this
    > > > means get_mm_exe_file() can use down_read(mm->map_sem). No?

    > >
    > > Yes, I could get the task's ->mmap_sem there too and reuse the mmap_sem
    > > rather than add a lock. That allows nearly any task to grab another
    > > tasks mmap_sem simply by doing a readlink on /proc/pid/exe. So I thought
    > > avoiding reuse of the mmap_sem might be best.
    > >
    > > Do you still think it would be better to reuse mmap_sem?

    >
    > Well, we only need down_read(mmap_sem) for the very short time. /proc/pid/maps
    > is much "worse" in this sense.


    OK. I'll post a patch to remove the spinlock and replace it with
    mmap_sem.

    > > > > @@ -409,6 +410,7 @@ void mmput(struct mm_struct *mm)
    > > > > if (atomic_dec_and_test(&mm->mm_users)) {
    > > > > exit_aio(mm);
    > > > > exit_mmap(mm);
    > > > > + set_mm_exe_file(mm, NULL);
    > > >
    > > > This change looks unneeded. exit_mmap() removes all vmas. The last VM_EXECUTABLE
    > > > vma should clear ->exe_file via removed_exe_file_vma() ?

    > >
    > > You're right -- it's redundant. I'll fix that.

    >
    > Sorry, I was wrong.
    >
    > mmput() has to release ->exe_file if it is called when exec fails before the
    > first do_mmmap(MAP_EXECUTABLE). This also means that it is not completely
    > trivial to set ->exe_file before exec_mmap(), it can fail. This is solvable,
    > but I'm not sure we should do this.
    >
    > Still, the accounting looks a little bit fragile to me. flush_old_exec()
    > increments ->f_count but sets ->num_exe_file_vmas = 0 because we know that
    > the next elf_map() will bump ->num_exe_file_vmas and thus "sync" 2 counters.
    > But I don't see how to do better if we really want to release the file when
    > VM_EXECUTABLE disappears.


    OK, I'll leave it unless something better comes to mind.

    Thanks for taking a look at this patch and asking questions.

    Cheers,
    -Matt Helsley

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  3. Re: + fix-procfs-task-exe-symlink.patch added to -mm tree

    On 01/29, Matt Helsley wrote:
    >
    > On Tue, 2008-01-29 at 14:36 +0300, Oleg Nesterov wrote:
    > >
    > > This patch has a lot of complications because it tries to preserve the
    > > current behaviour: we release the bprm->file when all VM_EXECTUABLE vmas
    > > are unmapped. Q: is this so important/useful? I don't think this is very
    > > common case, and I don't quite understand why it is critical to release
    > > the file. To unmount fs after starting the app? One can always copy

    >
    > Yes. While most programs don't need this it is still very important for
    > some critical programs to be able to unmap the executable and thereby
    > allow unmounting the filesystem. Unfortunately, I don't have a confirmed
    > specific example for you. A wild guess: some distro install or live CDs
    > might use this.


    OK, thanks. I just wanted to be sure I didn't miss some other reason.

    > > Sorry, I was wrong.
    > >
    > > mmput() has to release ->exe_file if it is called when exec fails before the
    > > first do_mmmap(MAP_EXECUTABLE). This also means that it is not completely
    > > trivial to set ->exe_file before exec_mmap(), it can fail. This is solvable,
    > > but I'm not sure we should do this.
    > >
    > > Still, the accounting looks a little bit fragile to me. flush_old_exec()
    > > increments ->f_count but sets ->num_exe_file_vmas = 0 because we know that
    > > the next elf_map() will bump ->num_exe_file_vmas and thus "sync" 2 counters.
    > > But I don't see how to do better if we really want to release the file when
    > > VM_EXECUTABLE disappears.

    >
    > OK, I'll leave it unless something better comes to mind.


    Err, I was double wrong. It _is_ trivial to set ->exe_file before exec_mmap(),

    flush_old_exec:

    + get_file(bprm->file);
    + set_mm_exe_file(bprm->mm, bprm->file);
    retval = exec_mmap(bprm->mm);
    if (retval)
    goto mmap_failed;

    bprm->mm = NULL; /* We're using it now */

    If exec_mmap() fails, the caller (do_execve) has to mmput(bprm->mm)
    anyway, and this imply set_mm_exe_file(NULL). This way set_mm_exe_file()
    doesn't need any locking.

    Not that this is relly important, but still.

    However. I didn't notice this patch plays with #ifdef CONFIG_PROC_FS.
    Without CONFIG_PROC_FS we seem to leak bprm->file, I'd suggest to move
    get_file(bprm->file) into set_mm_exe_file().

    > Thanks for taking a look at this patch and asking questions.


    Thanks for your answers

    Oleg.

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  4. Re: + fix-procfs-task-exe-symlink.patch added to -mm tree


    On Wed, 2008-01-30 at 14:06 +0300, Oleg Nesterov wrote:



    > Err, I was double wrong. It _is_ trivial to set ->exe_file before exec_mmap(),
    >
    > flush_old_exec:
    >
    > + get_file(bprm->file);
    > + set_mm_exe_file(bprm->mm, bprm->file);
    > retval = exec_mmap(bprm->mm);
    > if (retval)
    > goto mmap_failed;
    >
    > bprm->mm = NULL; /* We're using it now */
    >
    > If exec_mmap() fails, the caller (do_execve) has to mmput(bprm->mm)
    > anyway, and this imply set_mm_exe_file(NULL). This way set_mm_exe_file()
    > doesn't need any locking.
    >
    > Not that this is relly important, but still.


    I've got the patch written and will be testing this shortly.

    > However. I didn't notice this patch plays with #ifdef CONFIG_PROC_FS.


    There are portions in there to deal with the presence or lack of proc
    fs. In include/linux/mm_types.h for example. Also, I placed some of the
    functions in fs/proc/base.c and avoided the need for #ifdefs in the .c
    files.

    > Without CONFIG_PROC_FS we seem to leak bprm->file, I'd suggest to move
    > get_file(bprm->file) into set_mm_exe_file().


    Good catch. Also I noticed that I was still setting the exe_file field
    in fork.c regardles of CONFIG_PROC_FS.

    So based on your feedback I'm working on 3 patches:
    1 - fix the two CONFIG_PROC_FS issues
    a) breaks compilation when CONFIG_PROC_FS is not defined
    b) struct file leak when CONFIG_PROC_FS is not defined
    2 - reuse mmap semaphore
    3 - move the mm initialization bits in the exec path to close the window
    where userspace could see a "NULL" exe_file.

    I will post each after it passes testing.

    Thanks again!

    Cheers,
    -Matt Helsley

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

+ Reply to Thread