[PATCH] Fix procfs task exe symlink - Kernel

This is a discussion on [PATCH] Fix procfs task exe symlink - Kernel ; The kernel implements readlink of /proc/pid/exe by getting the file from the first executable VMA. Then the path to the file is reconstructed and reported as the result. While this method is often correct it does not always identify the ...

+ Reply to Thread
Results 1 to 7 of 7

Thread: [PATCH] Fix procfs task exe symlink

  1. [PATCH] Fix procfs task exe symlink

    The kernel implements readlink of /proc/pid/exe by getting the file from the
    first executable VMA. Then the path to the file is reconstructed and reported as
    the result. While this method is often correct it does not always identify the
    correct path.

    Some applications may play games with the executable mappings. Al Viro mentioned
    an example where the executable region is copied and subsequently unmapped.
    In this case the kernel wanders down the VMA list to the next file-backed,
    executable VMA -- which is not necessarily the destination of the copy. Worse,
    any transformation of the executable could conceivably take place and the
    symlink could be deceptive. In this case, since we can't be certain what is
    being executed it might be better to simply "delete" the symlink.

    Another scenario where the symlink might be broken includes some potential
    implementations of checkpoint/restart. If restart is implemented as a loader
    application then it will be reported in the symlink instead of the application
    being restarted. This would break restarted Java applications for example
    (see the java example below).

    For executables on the stackable MVFS filesystem the current procfs methods for
    implementing a task's exe symlink do not point to the correct file and
    applications relying on the symlink fail (see the java example below).

    So there are multiple ways in which a task's exe symlink in /proc can break
    using the current VMA walk method.

    This patch tries to address the case of running Java installed on an MVFS filesystem.
    However the patch solves the problems with the symlink for more than just
    MVFS.

    Java uses the /proc/self/exe symlink to determine JAVAHOME by reading the link
    and trimming the path. This breaks under MVFS because MVFS reorganizes
    files and directories to enable versioning and stores these on a
    filesystem lower in the "stack". This is further complicated by the need for
    efficient IO because reads and mapping of the file must access the file
    contents from the "lower" filesystem. The symlink points to the mapped/read
    file and not the MVFS file. Because MVFS utilizes a different organization of
    the files in the lower filesystem the symlink cannot be used to
    correctly determine JAVAHOME. This could be a problem for any stacking
    filesystem which reorganizes files and directories -- though I don't know of
    another besides MVFS.

    Top FS (e.g. MVFS) Lower FS (e.g. ext3)
    /foo/bar/jvm/bin/java /qux/baz/java
    /foo/bar/jvm/lib /bee/buzz

    When the executable file is opened MVFS returns its own file struct.
    When the MVFS file is subsequently mmap'd or read MVFS transforms the path to
    /qux/baz/java, opens, and uses the contents of the lower filesystem's file
    to satisfy the read or the mmap.

    Since the bytes of the java executable are at /qux/baz/java, /proc/self/exe
    points to /qux/baz/java instead of /foo/bar/jvm/bin/java. Hence JAVAHOME
    points to the wrong directory and Java subsequently fails to find the files
    it needs to run.

    To solve the problem this patch changes the way that the kernel resolves a
    task's exe symlink. Instead of walking the VMAs to find the first
    executable file-backed VMA we store a reference to the exec'd file in the
    mm_struct -- /foo/bar/jvm/bin/java in the example above.

    That reference would prevent the filesystem holding the executable file from
    being unmounted even if the VMA(s) were unmapped. So we track the number of VMAs
    that mapped the executable file during exec. Then we drop the new reference on
    exit or when the last such VMA is unmapped. This avoids pinning the filesystem.

    A minor added benefit of this approach is we no longer need separate code to
    offer this symlink on mmu-less systems.

    Andrew, please consider this patch for inclusion in -mm.

    Signed-off-by: Matt Helsley
    ---

    Changelog:
    Added a counter to keep track of the number of "exec" VMAs
    Drop the exe_file reference when the last of the "exec" VMAs goes away
    (Requested by Al Viro)
    The nommu case is untested -- I don't have an mmu-less system
    to test this on.
    #ifdef proc fs portion of mm (requested by Andrew)
    Merged patch series into one patch -- avoids introducing point in series
    where file reference in mm_struct pins mountpoints
    Added MAP_EXECUTABLE to mapping in binfmt_flat -- needed this flag to help
    track exec'd file VMAs.

    Changelog during RFC postings:
    Moved fput() calls out of areas holding task_lock
    (pointed out by Andrew Morton. see:
    http://lkml.org/lkml/2007/7/12/402)
    Moved the exe_file reference to the mm_struct from the task struct
    (suggested by Dave Hansen)
    Avoid pinning most mounted fs by dropping both file refs when the VMA is
    removed (problem pointed out by Al Viro. see:
    http://lkml.org/lkml/2007/7/12/398)

    Testing:
    2.6.24-rc6-mm1 - compiles, boots, passes simple regression tests on x86,
    x86_64, and ppc64
    2.6.24-rc8-mm1 - compile tested only -- issues with booting test machines
    on unpatched 2.6.24-rc8-mm1 with hotfixes. Will keep trying to get
    these to boot and complete testing.
    nommu-only code paths are untested -- lacking access to nommu system

    fs/binfmt_flat.c | 3 +
    fs/exec.c | 2 +
    fs/proc/base.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++
    fs/proc/internal.h | 1
    fs/proc/task_mmu.c | 34 --------------------
    fs/proc/task_nommu.c | 34 --------------------
    include/linux/init_task.h | 8 ++++
    include/linux/mm.h | 22 +++++++++++++
    include/linux/mm_types.h | 7 ++++
    include/linux/proc_fs.h | 14 +++++++-
    kernel/fork.c | 3 +
    mm/mmap.c | 22 ++++++++++---
    mm/nommu.c | 15 +++++++-
    13 files changed, 164 insertions(+), 78 deletions(-)

    Index: linux-2.6.24-rc8-mm1-hf/fs/exec.c
    ================================================== =================
    --- linux-2.6.24-rc8-mm1-hf.orig/fs/exec.c
    +++ linux-2.6.24-rc8-mm1-hf/fs/exec.c
    @@ -1022,10 +1022,12 @@ int flush_old_exec(struct linux_binprm *
    current->self_exec_id++;

    flush_signal_handlers(current, 0);
    flush_old_files(current->files);

    + get_file(bprm->file);
    + set_mm_exe_file(current->mm, bprm->file);
    return 0;

    mmap_failed:
    reset_files_struct(current, files);
    out:
    Index: linux-2.6.24-rc8-mm1-hf/fs/proc/base.c
    ================================================== =================
    --- linux-2.6.24-rc8-mm1-hf.orig/fs/proc/base.c
    +++ linux-2.6.24-rc8-mm1-hf/fs/proc/base.c
    @@ -1084,10 +1084,87 @@ static const struct file_operations proc
    .release = single_release,
    };

    #endif

    +/* We added or removed a vma mapping the executable. The vmas are only mapped
    + * during exec and are not mapped with the mmap system call.
    + * Callers must _not_ hold the mm's exe_file_lock for these */
    +void added_exe_file_vma(struct mm_struct *mm)
    +{
    + spin_lock(&mm->exe_file_lock);
    + mm->num_exe_file_vmas++;
    + spin_unlock(&mm->exe_file_lock);
    +}
    +
    +void removed_exe_file_vma(struct mm_struct *mm)
    +{
    + struct file *exe_file = NULL;
    +
    + spin_lock(&mm->exe_file_lock);
    + mm->num_exe_file_vmas--;
    + if (mm->num_exe_file_vmas == 0) {
    + exe_file = mm->exe_file;
    + mm->exe_file = NULL;
    + }
    + spin_unlock(&mm->exe_file_lock);
    +
    + if (exe_file)
    + fput(exe_file);
    +}
    +
    +/* Takes a reference to new_exe_file from caller -- does not get a new
    + * reference; only puts old ones */
    +void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
    +{
    + struct file *old_exe_file;
    +
    + spin_lock(&mm->exe_file_lock);
    + old_exe_file = mm->exe_file;
    + mm->exe_file = new_exe_file;
    + mm->num_exe_file_vmas = 0;
    + spin_unlock(&mm->exe_file_lock);
    + if (old_exe_file)
    + fput(old_exe_file);
    +}
    +
    +struct file *get_mm_exe_file(struct mm_struct *mm)
    +{
    + struct file *exe_file;
    +
    + spin_lock(&mm->exe_file_lock);
    + exe_file = mm->exe_file;
    + if (exe_file)
    + get_file(exe_file);
    + spin_unlock(&mm->exe_file_lock);
    + return exe_file;
    +}
    +
    +static int proc_exe_link(struct inode *inode, struct path *exe_path)
    +{
    + struct task_struct *task;
    + struct mm_struct *mm;
    + struct file *exe_file;
    +
    + task = get_proc_task(inode);
    + if (!task)
    + return -ENOENT;
    + mm = get_task_mm(task);
    + put_task_struct(task);
    + if (!mm)
    + return -ENOENT;
    + exe_file = get_mm_exe_file(mm);
    + mmput(mm);
    + if (exe_file) {
    + *exe_path = exe_file->f_path;
    + path_get(&exe_file->f_path);
    + fput(exe_file);
    + return 0;
    + } else
    + return -ENOENT;
    +}
    +
    static void *proc_pid_follow_link(struct dentry *dentry, struct nameidata *nd)
    {
    struct inode *inode = dentry->d_inode;
    int error = -EACCES;

    Index: linux-2.6.24-rc8-mm1-hf/kernel/fork.c
    ================================================== =================
    --- linux-2.6.24-rc8-mm1-hf.orig/kernel/fork.c
    +++ linux-2.6.24-rc8-mm1-hf/kernel/fork.c
    @@ -356,10 +356,11 @@ static struct mm_struct * mm_init(struct
    rwlock_init(&mm->ioctx_list_lock);
    mm->ioctx_list = NULL;
    mm->free_area_cache = TASK_UNMAPPED_BASE;
    mm->cached_hole_size = ~0UL;
    mm_init_cgroup(mm, p);
    + init_mm_exe_file(mm);

    if (likely(!mm_alloc_pgd(mm))) {
    mm->def_flags = 0;
    return mm;
    }
    @@ -407,10 +408,11 @@ void mmput(struct mm_struct *mm)
    might_sleep();

    if (atomic_dec_and_test(&mm->mm_users)) {
    exit_aio(mm);
    exit_mmap(mm);
    + set_mm_exe_file(mm, NULL);
    if (!list_empty(&mm->mmlist)) {
    spin_lock(&mmlist_lock);
    list_del(&mm->mmlist);
    spin_unlock(&mmlist_lock);
    }
    @@ -523,10 +525,11 @@ static struct mm_struct *dup_mm(struct t

    err = dup_mmap(mm, oldmm);
    if (err)
    goto free_pt;

    + mm->exe_file = get_mm_exe_file(oldmm);
    mm->hiwater_rss = get_mm_rss(mm);
    mm->hiwater_vm = mm->total_vm;

    return mm;

    Index: linux-2.6.24-rc8-mm1-hf/fs/proc/task_mmu.c
    ================================================== =================
    --- linux-2.6.24-rc8-mm1-hf.orig/fs/proc/task_mmu.c
    +++ linux-2.6.24-rc8-mm1-hf/fs/proc/task_mmu.c
    @@ -73,44 +73,10 @@ int task_statm(struct mm_struct *mm, int
    *data = mm->total_vm - mm->shared_vm;
    *resident = *shared + get_mm_counter(mm, anon_rss);
    return mm->total_vm;
    }

    -int proc_exe_link(struct inode *inode, struct path *path)
    -{
    - struct vm_area_struct * vma;
    - int result = -ENOENT;
    - struct task_struct *task = get_proc_task(inode);
    - struct mm_struct * mm = NULL;
    -
    - if (task) {
    - mm = get_task_mm(task);
    - put_task_struct(task);
    - }
    - if (!mm)
    - goto out;
    - down_read(&mm->mmap_sem);
    -
    - vma = mm->mmap;
    - while (vma) {
    - if ((vma->vm_flags & VM_EXECUTABLE) && vma->vm_file)
    - break;
    - vma = vma->vm_next;
    - }
    -
    - if (vma) {
    - *path = vma->vm_file->f_path;
    - path_get(&vma->vm_file->f_path);
    - result = 0;
    - }
    -
    - up_read(&mm->mmap_sem);
    - mmput(mm);
    -out:
    - return result;
    -}
    -
    static void pad_len_spaces(struct seq_file *m, int len)
    {
    len = 25 + sizeof(void*) * 6 - len;
    if (len < 1)
    len = 1;
    Index: linux-2.6.24-rc8-mm1-hf/include/linux/proc_fs.h
    ================================================== =================
    --- linux-2.6.24-rc8-mm1-hf.orig/include/linux/proc_fs.h
    +++ linux-2.6.24-rc8-mm1-hf/include/linux/proc_fs.h
    @@ -7,11 +7,10 @@
    #include
    #include

    struct net;
    struct completion;
    -
    /*
    * The proc filesystem constants/structures
    */

    /*
    @@ -207,10 +206,15 @@ extern struct proc_dir_entry *proc_net_f
    const char *name, mode_t mode, const struct file_operations *fops);
    extern void proc_net_remove(struct net *net, const char *name);
    extern struct proc_dir_entry *proc_net_mkdir(struct net *net, const char *name,
    struct proc_dir_entry *parent);

    +/* While the {get|set}_mm_exe_file functions are for mm_structs, they are
    + * only needed to implement /proc/|self/exe so we define them here. */
    +extern void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file);
    +extern struct file *get_mm_exe_file(struct mm_struct *mm);
    +
    #else

    #define proc_root_driver NULL
    #define proc_bus NULL

    @@ -256,10 +260,18 @@ static inline int pid_ns_prepare_proc(st

    static inline void pid_ns_release_proc(struct pid_namespace *ns)
    {
    }

    +static inline void set_mm_exe_file(struct mm_struct *mm,
    + struct file *new_exe_file)
    +{}
    +
    +static inline struct file *get_mm_exe_file(struct mm_struct *mm)
    +{
    + return NULL;
    +}
    #endif /* CONFIG_PROC_FS */

    #if !defined(CONFIG_PROC_KCORE)
    static inline void kclist_add(struct kcore_list *new, void *addr, size_t size)
    {
    Index: linux-2.6.24-rc8-mm1-hf/mm/mmap.c
    ================================================== =================
    --- linux-2.6.24-rc8-mm1-hf.orig/mm/mmap.c
    +++ linux-2.6.24-rc8-mm1-hf/mm/mmap.c
    @@ -228,12 +228,15 @@ static struct vm_area_struct *remove_vma
    struct vm_area_struct *next = vma->vm_next;

    might_sleep();
    if (vma->vm_ops && vma->vm_ops->close)
    vma->vm_ops->close(vma);
    - if (vma->vm_file)
    + if (vma->vm_file) {
    fput(vma->vm_file);
    + if (vma->vm_flags & VM_EXECUTABLE)
    + removed_exe_file_vma(vma->vm_mm);
    + }
    mpol_free(vma_policy(vma));
    kmem_cache_free(vm_area_cachep, vma);
    return next;
    }

    @@ -621,12 +624,15 @@ again: remove_next = 1 + (end > next->
    spin_unlock(&anon_vma->lock);
    if (mapping)
    spin_unlock(&mapping->i_mmap_lock);

    if (remove_next) {
    - if (file)
    + if (file) {
    fput(file);
    + if (next->vm_flags & VM_EXECUTABLE)
    + removed_exe_file_vma(mm);
    + }
    mm->map_count--;
    mpol_free(vma_policy(next));
    kmem_cache_free(vm_area_cachep, next);
    /*
    * In mprotect's case 6 (see comments on vma_merge),
    @@ -1153,10 +1159,12 @@ munmap_back:
    vma->vm_file = file;
    get_file(file);
    error = file->f_op->mmap(file, vma);
    if (error)
    goto unmap_and_free_vma;
    + if (vm_flags & VM_EXECUTABLE)
    + added_exe_file_vma(mm);
    } else if (vm_flags & VM_SHARED) {
    error = shmem_zero_setup(vma);
    if (error)
    goto free_vma;
    }
    @@ -1818,12 +1826,15 @@ int split_vma(struct mm_struct * mm, str
    kmem_cache_free(vm_area_cachep, new);
    return PTR_ERR(pol);
    }
    vma_set_policy(new, pol);

    - if (new->vm_file)
    + if (new->vm_file) {
    get_file(new->vm_file);
    + if (vma->vm_flags & VM_EXECUTABLE)
    + added_exe_file_vma(mm);
    + }

    if (new->vm_ops && new->vm_ops->open)
    new->vm_ops->open(new);

    if (new_below)
    @@ -2136,12 +2147,15 @@ struct vm_area_struct *copy_vma(struct v
    }
    vma_set_policy(new_vma, pol);
    new_vma->vm_start = addr;
    new_vma->vm_end = addr + len;
    new_vma->vm_pgoff = pgoff;
    - if (new_vma->vm_file)
    + if (new_vma->vm_file) {
    get_file(new_vma->vm_file);
    + if (vma->vm_flags & VM_EXECUTABLE)
    + added_exe_file_vma(mm);
    + }
    if (new_vma->vm_ops && new_vma->vm_ops->open)
    new_vma->vm_ops->open(new_vma);
    vma_link(mm, new_vma, prev, rb_link, rb_parent);
    }
    }
    Index: linux-2.6.24-rc8-mm1-hf/fs/proc/task_nommu.c
    ================================================== =================
    --- linux-2.6.24-rc8-mm1-hf.orig/fs/proc/task_nommu.c
    +++ linux-2.6.24-rc8-mm1-hf/fs/proc/task_nommu.c
    @@ -102,44 +102,10 @@ int task_statm(struct mm_struct *mm, int
    up_read(&mm->mmap_sem);
    *resident = size;
    return size;
    }

    -int proc_exe_link(struct inode *inode, struct path *path)
    -{
    - struct vm_list_struct *vml;
    - struct vm_area_struct *vma;
    - struct task_struct *task = get_proc_task(inode);
    - struct mm_struct *mm = get_task_mm(task);
    - int result = -ENOENT;
    -
    - if (!mm)
    - goto out;
    - down_read(&mm->mmap_sem);
    -
    - vml = mm->context.vmlist;
    - vma = NULL;
    - while (vml) {
    - if ((vml->vma->vm_flags & VM_EXECUTABLE) && vml->vma->vm_file) {
    - vma = vml->vma;
    - break;
    - }
    - vml = vml->next;
    - }
    -
    - if (vma) {
    - *path = vma->vm_file->f_path;
    - path_get(&vma->vm_file->f_path);
    - result = 0;
    - }
    -
    - up_read(&mm->mmap_sem);
    - mmput(mm);
    -out:
    - return result;
    -}
    -
    /*
    * display mapping lines for a particular process's /proc/pid/maps
    */
    static int show_map(struct seq_file *m, void *_vml)
    {
    Index: linux-2.6.24-rc8-mm1-hf/include/linux/mm_types.h
    ================================================== =================
    --- linux-2.6.24-rc8-mm1-hf.orig/include/linux/mm_types.h
    +++ linux-2.6.24-rc8-mm1-hf/include/linux/mm_types.h
    @@ -231,8 +231,15 @@ struct mm_struct {
    rwlock_t ioctx_list_lock;
    struct kioctx *ioctx_list;
    #ifdef CONFIG_CGROUP_MEM_CONT
    struct mem_cgroup *mem_cgroup;
    #endif
    +
    +#ifdef CONFIG_PROC_FS
    + /* store ref to file /proc//exe symlink points to */
    + spinlock_t exe_file_lock;
    + struct file *exe_file;
    + unsigned long num_exe_file_vmas;
    +#endif
    };

    #endif /* _LINUX_MM_TYPES_H */
    Index: linux-2.6.24-rc8-mm1-hf/fs/proc/internal.h
    ================================================== =================
    --- linux-2.6.24-rc8-mm1-hf.orig/fs/proc/internal.h
    +++ linux-2.6.24-rc8-mm1-hf/fs/proc/internal.h
    @@ -46,11 +46,10 @@ extern int nommu_vma_show(struct seq_fil

    extern int maps_protect;

    extern void create_seq_entry(char *name, mode_t mode,
    const struct file_operations *f);
    -extern int proc_exe_link(struct inode *, struct path *);
    extern int proc_tid_stat(struct seq_file *m, struct pid_namespace *ns,
    struct pid *pid, struct task_struct *task);
    extern int proc_tgid_stat(struct seq_file *m, struct pid_namespace *ns,
    struct pid *pid, struct task_struct *task);
    extern int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
    Index: linux-2.6.24-rc8-mm1-hf/fs/binfmt_flat.c
    ================================================== =================
    --- linux-2.6.24-rc8-mm1-hf.orig/fs/binfmt_flat.c
    +++ linux-2.6.24-rc8-mm1-hf/fs/binfmt_flat.c
    @@ -529,11 +529,12 @@ static int load_flat_file(struct linux_b
    * really care
    */
    DBG_FLT("BINFMT_FLAT: ROM mapping of file (we hope)\n");

    down_write(&current->mm->mmap_sem);
    - textpos = do_mmap(bprm->file, 0, text_len, PROT_READ|PROT_EXEC, MAP_PRIVATE, 0);
    + textpos = do_mmap(bprm->file, 0, text_len, PROT_READ|PROT_EXEC,
    + MAP_PRIVATE|MAP_EXECUTABLE, 0);
    up_write(&current->mm->mmap_sem);
    if (!textpos || textpos >= (unsigned long) -4096) {
    if (!textpos)
    textpos = (unsigned long) -ENOMEM;
    printk("Unable to mmap process text, errno %d\n", (int)-textpos);
    Index: linux-2.6.24-rc8-mm1-hf/include/linux/init_task.h
    ================================================== =================
    --- linux-2.6.24-rc8-mm1-hf.orig/include/linux/init_task.h
    +++ linux-2.6.24-rc8-mm1-hf/include/linux/init_task.h
    @@ -44,20 +44,28 @@
    .ctx_lock = __SPIN_LOCK_UNLOCKED(name.ctx_lock), \
    .reqs_active = 0U, \
    .max_reqs = ~0U, \
    }

    +#ifdef CONFIG_PROC_FS
    +#define INIT_MM_EXE_FILE(name) \
    + .exe_file_lock = __SPIN_LOCK_UNLOCKED(name.exe_file_lock),
    +#else
    +#define INIT_MM_EXE_FILE(name)
    +#endif
    +
    #define INIT_MM(name) \
    { \
    .mm_rb = RB_ROOT, \
    .pgd = swapper_pg_dir, \
    .mm_users = ATOMIC_INIT(2), \
    .mm_count = ATOMIC_INIT(1), \
    .mmap_sem = __RWSEM_INITIALIZER(name.mmap_sem), \
    .page_table_lock = __SPIN_LOCK_UNLOCKED(name.page_table_lock), \
    .mmlist = LIST_HEAD_INIT(name.mmlist), \
    .cpu_vm_mask = CPU_MASK_ALL, \
    + INIT_MM_EXE_FILE(name) \
    }

    #define INIT_SIGNALS(sig) { \
    .count = ATOMIC_INIT(1), \
    .wait_chldexit = __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
    Index: linux-2.6.24-rc8-mm1-hf/mm/nommu.c
    ================================================== =================
    --- linux-2.6.24-rc8-mm1-hf.orig/mm/nommu.c
    +++ linux-2.6.24-rc8-mm1-hf/mm/nommu.c
    @@ -960,12 +960,15 @@ unsigned long do_mmap_pgoff(struct file
    if (!vma)
    goto error_getting_vma;

    INIT_LIST_HEAD(&vma->anon_vma_node);
    atomic_set(&vma->vm_usage, 1);
    - if (file)
    + if (file) {
    get_file(file);
    + if (vm_flags & VM_EXECUTABLE)
    + added_exe_file_vma(mm);
    + }
    vma->vm_file = file;
    vma->vm_flags = vm_flags;
    vma->vm_start = addr;
    vma->vm_end = addr + len;
    vma->vm_pgoff = pgoff;
    @@ -1016,12 +1019,15 @@ unsigned long do_mmap_pgoff(struct file

    error:
    up_write(&nommu_vma_sem);
    kfree(vml);
    if (vma) {
    - if (vma->vm_file)
    + if (vma->vm_file) {
    fput(vma->vm_file);
    + if (vma->vm_flags & VM_EXECUTABLE)
    + removed_exe_file_vma(vma->vm_mm);
    + }
    kfree(vma);
    }
    return ret;

    sharing_violation:
    @@ -1069,12 +1075,15 @@ static void put_vma(struct vm_area_struc
    }

    realalloc -= kobjsize(vma);
    askedalloc -= sizeof(*vma);

    - if (vma->vm_file)
    + if (vma->vm_file) {
    fput(vma->vm_file);
    + if (vma->vm_flags & VM_EXECUTABLE)
    + removed_exe_file_vma(vma->vm_mm);
    + }
    kfree(vma);
    }

    up_write(&nommu_vma_sem);
    }
    Index: linux-2.6.24-rc8-mm1-hf/include/linux/mm.h
    ================================================== =================
    --- linux-2.6.24-rc8-mm1-hf.orig/include/linux/mm.h
    +++ linux-2.6.24-rc8-mm1-hf/include/linux/mm.h
    @@ -1026,10 +1026,32 @@ extern void __vma_link_rb(struct mm_stru
    struct rb_node **, struct rb_node *);
    extern void unlink_file_vma(struct vm_area_struct *);
    extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
    unsigned long addr, unsigned long len, pgoff_t pgoff);
    extern void exit_mmap(struct mm_struct *);
    +
    +#ifdef CONFIG_PROC_FS
    +static inline void init_mm_exe_file(struct mm_struct *mm)
    +{
    + /* The exe_file field itself is already correctly set at this point */
    + spin_lock_init(&mm->exe_file_lock);
    +}
    +
    +/* From fs/proc/base.c. callers must _not_ hold the mm's exe_file_lock */
    +extern void added_exe_file_vma(struct mm_struct *mm);
    +extern void removed_exe_file_vma(struct mm_struct *mm);
    +#else
    +static inline void init_mm_exe_file(struct mm_struct *mm)
    +{}
    +
    +static inline void added_exe_file_vma(struct mm_struct *mm)
    +{}
    +
    +static inline void removed_exe_file_vma(struct mm_struct *mm)
    +{}
    +#endif /* CONFIG_PROC_FS */
    +
    extern int may_expand_vm(struct mm_struct *mm, unsigned long npages);
    extern int install_special_mapping(struct mm_struct *mm,
    unsigned long addr, unsigned long len,
    unsigned long flags, struct page **pages);



    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: [PATCH] Fix procfs task exe symlink


    On Jan 23 2008 10:29, Matt Helsley wrote:
    >
    >For executables on the stackable MVFS filesystem the current procfs
    >methods for implementing a task's exe symlink do not point to the
    >correct file and applications relying on the symlink fail (see the
    >java example below).


    This reminds me of unoionfs - it also had this issue, but IIRC,
    it eventually got solved. (Should now display /union/exe rather
    than /lowerlevel/exe.) Unionfs developers should know more.
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  3. Re: [PATCH] Fix procfs task exe symlink

    On Wed, Jan 23, 2008 at 10:29:37AM -0800, Matt Helsley wrote:
    > For executables on the stackable MVFS filesystem the current procfs methods for
    > implementing a task's exe symlink do not point to the correct file and
    > applications relying on the symlink fail (see the java example below).


    Dou you have a pointer to the source code for that filesystem?

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  4. Re: [PATCH] Fix procfs task exe symlink

    Matt Helsley wrote:

    > To solve the problem this patch changes the way that the kernel resolves a
    > task's exe symlink. Instead of walking the VMAs to find the first
    > executable file-backed VMA we store a reference to the exec'd file in the
    > mm_struct -- /foo/bar/jvm/bin/java in the example above.


    Sounds interesting.

    > nommu-only code paths are untested -- lacking access to nommu system


    I could test it, but it'll have to wait until I get back from LCA in a couple
    of weeks time.

    David
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  5. Re: [PATCH] Fix procfs task exe symlink

    In message , Jan Engelhardt writes:
    >
    > On Jan 23 2008 10:29, Matt Helsley wrote:
    > >
    > >For executables on the stackable MVFS filesystem the current procfs
    > >methods for implementing a task's exe symlink do not point to the
    > >correct file and applications relying on the symlink fail (see the
    > >java example below).

    >
    > This reminds me of unoionfs - it also had this issue, but IIRC,
    > it eventually got solved. (Should now display /union/exe rather
    > than /lowerlevel/exe.) Unionfs developers should know more.


    Unionfs resolved this by fully implementing the address_space ops, as well
    as ->mmap (i.e., one cannot "cheat" and inherit the lower address_space).

    Erez.
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  6. Re: [PATCH] Fix procfs task exe symlink

    > On Wed, 23 Jan 2008 10:29:37 -0800 Matt Helsley wrote:
    >
    > Andrew, please consider this patch for inclusion in -mm.
    >
    > ...
    >


    Can't say that we're particularly exercised about mvfs's problems, but the
    current way of doing /proc/pid/exe is indeed a nasty hack.

    >
    > fs/binfmt_flat.c | 3 +
    > fs/exec.c | 2 +
    > fs/proc/base.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++
    > fs/proc/internal.h | 1
    > fs/proc/task_mmu.c | 34 --------------------
    > fs/proc/task_nommu.c | 34 --------------------
    > include/linux/init_task.h | 8 ++++
    > include/linux/mm.h | 22 +++++++++++++
    > include/linux/mm_types.h | 7 ++++
    > include/linux/proc_fs.h | 14 +++++++-
    > kernel/fork.c | 3 +
    > mm/mmap.c | 22 ++++++++++---
    > mm/nommu.c | 15 +++++++-
    > 13 files changed, 164 insertions(+), 78 deletions(-)


    It's a fairly expensive fix though. Can't we just do a strcpy() somewhere
    at exec time?


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  7. Re: [PATCH] Fix procfs task exe symlink


    On Sat, 2008-01-26 at 22:03 -0800, Andrew Morton wrote:
    > > On Wed, 23 Jan 2008 10:29:37 -0800 Matt Helsley wrote:
    > >
    > > Andrew, please consider this patch for inclusion in -mm.
    > >
    > > ...
    > >

    >
    > Can't say that we're particularly exercised about mvfs's problems, but the
    > current way of doing /proc/pid/exe is indeed a nasty hack.
    >
    > >
    > > fs/binfmt_flat.c | 3 +
    > > fs/exec.c | 2 +
    > > fs/proc/base.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++
    > > fs/proc/internal.h | 1
    > > fs/proc/task_mmu.c | 34 --------------------
    > > fs/proc/task_nommu.c | 34 --------------------
    > > include/linux/init_task.h | 8 ++++
    > > include/linux/mm.h | 22 +++++++++++++
    > > include/linux/mm_types.h | 7 ++++
    > > include/linux/proc_fs.h | 14 +++++++-
    > > kernel/fork.c | 3 +
    > > mm/mmap.c | 22 ++++++++++---
    > > mm/nommu.c | 15 +++++++-
    > > 13 files changed, 164 insertions(+), 78 deletions(-)

    >
    > It's a fairly expensive fix though. Can't we just do a strcpy() somewhere
    > at exec time?


    I chose not to do this because I thought it would change the output of
    readlink on /proc/pid/exe under certain circumstances.

    For instance, I think the output of the following would be slightly
    different:

    $ mkdir tmp
    $ cd tmp
    $ cp /bin/sleep ./
    $ ./sleep 10 &
    $ SLEEP_PID=$!
    $ ls -l /proc/4733/exe
    lrwxrwxrwx 1 mhelsley mhelsley 0 Jan 28 15:51 /proc/4733/exe -> /home/mhelsley/tmp/sleep
    $ rm sleep
    $ echo $?

    0
    $ ls -l /proc/4733/exe
    lrwxrwxrwx 1 mhelsley mhelsley 0 Jan 28 15:51 /proc/4733/exe -> /home/mhelsley/tmp/sleep (deleted)

    I think simply storing the string at exec time wouldn't show the latter
    result. Perhaps we could do a lookup during readlink to fix it. That may
    not always work though -- could chroot or mount namespaces break this?

    Al Viro's unmap example might also have different output if we just
    stored the string. When the last VM_EXECUTABLE VMA goes away the symlink
    shouldn't work. So I think we'd still have to track the map/unmap of
    VM_EXECUTABLE VMAs similar to what I do in my patch.

    Using chroot we get a "permission denied" error when doing a readlink
    on /proc/pid/exe symlinks that point outside the chroot. I'm not sure a
    lookup using the stored string will fix all cases here.

    Then there's mount namespaces to consider. I think that the string would
    hold the path in the mount namespace of the executable whose exe link
    we're reading rather than the path in the mount namespace of the task
    reading the link.

    We may be able to work around all of these. I'm not sure that patch
    would be simpler though.

    If you want something a little simpler I could follow Oleg Nesterov's
    suggestions. I think that could trim at least 20 lines at the cost of
    continuing to use the mmap semaphore in the /proc/pid/exe readlink path.

    Cheers,
    -Matt Helsley

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

+ Reply to Thread