Thread: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

  1. [PATCH prototype] [0/8] Predictive bitmaps for ELF executables


    This patchkit is an experimental optimization I played around with
    some time ago.

    This is more a prototype still, but I wanted to push it out
    so that other people can play with it.

    The basic idea is that most programs have the same working set
    over multiple runs. So instead of demand paging all the text pages
    in the order the program happens to execute them, save the working set
    to disk at program exit and prefetch it at the next program start.

    This allows some optimizations:
    - It can avoid unnecessary disk seeks, because the blocks are fetched
    in sorted offset order instead of program execution order.
    - It batches kernel entries (each demand-page fault has some overhead
    just for entering the kernel). This also keeps the caches hot.
    - The prefetch could in theory be done in the background while the
    program runs (although that is not implemented currently).

    Some details on the implementation:

    To do all this we need bitmap space somewhere in the ELF executable.
    I originally hoped to use a standard ELF PHDR for this, since PHDRs
    are already parsed by the Linux ELF loader. The problem is that PHDRs
    are part of the mapped program image, so inserting a new one requires
    relinking the program. Since relinking programs just for this would be
    rather heavy-handed, I used a hack: set an additional bit flag in the
    gnu_execstack header, and when it is set let the kernel look for ELF
    SHDRs at the end of the file. The disadvantage is that this costs a
    seek, but it makes it easy to update existing executables with a
    simple tool.

    The seek overhead would go away if linkers were taught to always
    generate a PBITMAP bitmap header.

    I also considered external bitmap files, but putting the bitmap into
    the ELF file and keeping everything together seemed much nicer
    policy-wise.

    There is also some risk of thrashing the bitmap, e.g. when a program
    runs in different modes with totally different working sets (busybox
    is a good example of this). I haven't found a good heuristic to
    handle this yet (one possibility would be to OR the new bitmap into
    the old one instead of rewriting it on exit); this is something that
    needs further experimentation. One also doesn't want too many bitmap
    updates, of course, so there is a simple heuristic to not update
    bitmaps more often than a sysctl-configurable interval.

    User tools:
    ftp://ftp.firstfloor.org/pub/ak/pbitmap/pbitmap.c
    is a simple program to add pbitmap SHDRs to an existing ELF executable.

    Base kernel:
    Again 2.6.25-rc6

    Drawbacks:
    - No support for dynamic libraries right now (except very clumsily
    through the mmap_slurp hack). This is the main reason it is not
    very useful for speeding up desktops currently.

    - Executable files currently have to be writable by the executing
    user to get bitmap updates. It would be possible to let the kernel
    bypass this, but I haven't thought much about the security
    implications of that.
    However, any user can use the bitmap data written by a user with
    write rights.

    That's currently one of the bigger usability issues (together
    with the missing shared library support) and why it is more
    a prototype than a fully usable solution.

    Possible areas of improvement, if anybody is interested:
    - Background prefetch
    - Tune all the sysctl defaults
    - Implement shared library support (will require glibc support)
    - Do something about the executable access problem
    - Experiment with fancier heuristics to update bitmaps (ORing,
    aging, etc.)

    -Andi
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. [PATCH prototype] [6/8] Core predictive bitmap engine


    Signed-off-by: Andi Kleen

    ---
    fs/binfmt_elf.c | 101 +++++++++++
    include/linux/mm.h | 7
    include/linux/mm_types.h | 3
    kernel/fork.c | 1
    mm/Makefile | 2
    mm/mmap.c | 1
    mm/pbitmap.c | 404 +++++++++++++++++++++++++++++++++++++++++++++++
    7 files changed, 517 insertions(+), 2 deletions(-)

    Index: linux/fs/binfmt_elf.c
    ===================================================================
    --- linux.orig/fs/binfmt_elf.c
    +++ linux/fs/binfmt_elf.c
    @@ -527,6 +527,89 @@ static unsigned long randomize_stack_top
    #endif
    }

    +static void elf_read_pb_phdr(struct file *f, struct elf_phdr *phdr, int nhdr)
    +{
    + int i, found, err;
    + unsigned long *buf;
    +
    + if (!pbitmap_enabled)
    + return;
    + buf = (unsigned long *)__get_free_page(GFP_KERNEL);
    + if (!buf)
    + return;
    + found = 0;
    + for (i = 0; i < nhdr; i++) {
    + struct elf_phdr *ep = &phdr[i];
    + if (ep->p_type != PT_PRESENT_BITMAP)
    + continue;
    +
    + err = pbitmap_load(f, buf, ep->p_vaddr, ep->p_offset,
    + ep->p_filesz);
    + if (err < 0)
    + printk("%s: pbitmap load failed: %d\n",
    + current->comm, err);
    + else
    + found += err;
    + }
    + if (found > 0)
    + printk("%s: %d pages prefetched\n", current->comm, found);
    + free_page((unsigned long)buf);
    +}
    +
    +/* All errors are ignored because the pbitmap is optional */
    +static void elf_read_pb_shdr(struct file *f, struct elfhdr *hdr)
    +{
    + int err;
    + unsigned n;
    + void *buf;
    +
    + if (!pbitmap_enabled)
    + return;
    +
    + /* Need to rate limit them later */
    + if (hdr->e_shentsize != sizeof(struct elf_shdr)) {
    + printk(KERN_WARNING "%s: unexpected shdr size %d\n",
    + current->comm, hdr->e_shentsize);
    + return;
    + }
    + if (hdr->e_shnum >= PAGE_SIZE / sizeof(struct elf_shdr)) {
    + printk(KERN_WARNING "%s: too many shdrs (%u)\n",
    + current->comm, hdr->e_shnum);
    + return;
    + }
    + buf = (void *)__get_free_page(GFP_KERNEL);
    + if (!buf)
    + return;
    + n = hdr->e_shnum * sizeof(struct elf_shdr);
    + err = kernel_read(f, hdr->e_shoff, buf, n);
    + if (err != n) {
    + printk(KERN_WARNING "%s: shdr read failed: %d\n",
    + current->comm, err);
    + } else {
    + int found = 0;
    + struct elf_shdr *shdrs = (struct elf_shdr *)buf;
    +
    + for (n = 0; n < hdr->e_shnum; n++) {
    + struct elf_shdr *sh = &shdrs[n];
    + if (sh->sh_type != SHT_PRESENT_BITMAP)
    + continue;
    + err = pbitmap_load(f, buf, sh->sh_addr,
    + sh->sh_offset,
    + sh->sh_size);
    + if (err < 0)
    + printk("%s: pbitmap load failed: %d\n",
    + current->comm, err);
    + else
    + found += err;
    + }
    + if (found > 0 && pbitmap_enabled > 1)
    + printk("%s: %d pages prefetched\n", current->comm,
    + found);
    +
    + }
    + free_page((unsigned long)buf);
    +}
    +
    static int load_elf_binary(struct linux_binprm *bprm, struct pt_regs *regs)
    {
    struct file *interpreter = NULL; /* to shut gcc up */
    @@ -551,6 +634,8 @@ static int load_elf_binary(struct linux_
    struct elfhdr interp_elf_ex;
    struct exec interp_ex;
    } *loc;
    + int pbitmap_seen = 0;
    + int please_load_shdrs = 0;

    loc = kmalloc(sizeof(*loc), GFP_KERNEL);
    if (!loc) {
    @@ -706,6 +791,10 @@ static int load_elf_binary(struct linux_
    executable_stack = EXSTACK_ENABLE_X;
    else
    executable_stack = EXSTACK_DISABLE_X;
    +
    + /* Hack */
    + if (elf_ppnt->p_flags & PF_PLEASE_LOAD_SHDRS)
    + please_load_shdrs = 1;
    break;
    }

    @@ -768,8 +857,13 @@ static int load_elf_binary(struct linux_
    int elf_prot = 0, elf_flags;
    unsigned long k, vaddr;

    - if (elf_ppnt->p_type != PT_LOAD)
    + if (elf_ppnt->p_type != PT_LOAD) {
    + if (elf_ppnt->p_type == PT_PRESENT_BITMAP) {
    + /* convert to writable */
    + pbitmap_seen = 1;
    + }
    continue;
    + }

    if (unlikely (elf_brk > elf_bss)) {
    unsigned long nbyte;
    @@ -969,6 +1063,11 @@ static int load_elf_binary(struct linux_
    arch_randomize_brk(current->mm);
    #endif

    + if (pbitmap_seen)
    + elf_read_pb_phdr(bprm->file, elf_phdata, loc->elf_ex.e_phnum);
    + if (please_load_shdrs)
    + elf_read_pb_shdr(bprm->file, &loc->elf_ex);
    +
    if (current->personality & MMAP_PAGE_ZERO) {
    /* Why this, you ask??? Well SVr4 maps page 0 as read-only,
    and some applications "depend" upon this behavior.
    Index: linux/include/linux/mm.h
    ===================================================================
    --- linux.orig/include/linux/mm.h
    +++ linux/include/linux/mm.h
    @@ -1215,6 +1215,13 @@ unsigned long shrink_slab(unsigned long
    void drop_pagecache(void);
    void drop_slab(void);

    +void pbitmap_update(struct mm_struct *mm);
    +int pbitmap_load(struct file *f, unsigned long *buf,
    + unsigned long vaddr, unsigned long file_offset, unsigned long filesz);
    +extern int pbitmap_enabled;
    +extern int pbitmap_early_fault;
    +extern unsigned pbitmap_update_interval;
    +
    #ifndef CONFIG_MMU
    #define randomize_va_space 0
    #else
    Index: linux/kernel/fork.c
    ===================================================================
    --- linux.orig/kernel/fork.c
    +++ linux/kernel/fork.c
    @@ -359,6 +359,7 @@ static struct mm_struct * mm_init(struct
    mm->free_area_cache = TASK_UNMAPPED_BASE;
    mm->cached_hole_size = ~0UL;
    mm_init_cgroup(mm, p);
    + INIT_LIST_HEAD(&mm->pbitmap_list);

    if (likely(!mm_alloc_pgd(mm))) {
    mm->def_flags = 0;
    Index: linux/mm/pbitmap.c
    ===================================================================
    --- /dev/null
    +++ linux/mm/pbitmap.c
    @@ -0,0 +1,404 @@
    +/*
    + * Manage bitmaps of the working set of programs in executable files
    + * and use them later for quick prefetching.
    + *
    + * Subject to the GNU General Public License, version 2 only.
    + * Copyright 2007 Andi Kleen, SUSE Labs
    + *
    + * Locking: this file generally doesn't bother much with locking
    + * because during the update we're the last user of the mm just being
    + * destroyed and very little can interfere. And during the bitmap load
    + * the pbitmap object is just being constructed and also 100% private.
    + */
    +
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +
    +#define Pprintk(x...)
    +
    +#define BITS_PER_PAGE (PAGE_SIZE * 8)
    +#define BITS_TO_BYTES(x) (((x) + 8 - 1) / 8)
    +
    +struct pbitmap {
    + struct list_head node;
    + unsigned long start;
    + unsigned long npages;
    + loff_t file_offset; /* offset of bitmap in ELF file */
    + pgoff_t pgoff; /* page cache offset of VMA */
    + struct path backing;
    + /* Temporarily used in the update phase to pass down arguments: */
    + unsigned long *buffer; /* always a PAGE */
    + struct file *fh;
    +};
    +
    +int pbitmap_enabled __read_mostly;
    +int pbitmap_early_fault __read_mostly = 1;
    +unsigned pbitmap_update_interval __read_mostly = 0; /* seconds */
    +
    +static struct pbitmap *
    +alloc_pb(struct vm_area_struct *vma, unsigned long vaddr,
    + unsigned long file_offset, unsigned long filesz)
    +{
    + struct pbitmap *pb;
    + pb = kzalloc(sizeof(struct pbitmap), GFP_KERNEL);
    + if (!pb)
    + return NULL;
    +
    + Pprintk("alloc_pb vaddr %lx filesz %lu\n", vaddr, filesz);
    + pb->file_offset = file_offset;
    + pb->start = vaddr;
    + pb->pgoff = vma->vm_pgoff + ((vaddr - vma->vm_start) >> PAGE_SHIFT);
    + pb->npages = filesz*8;
    + pb->npages = min_t(unsigned long, pb->npages, vma_pages(vma));
    +
    + /* Save away the file to make sure we access the same file later */
    + pb->backing.mnt = mntget(vma->vm_file->f_vfsmnt);
    + pb->backing.dentry = dget(vma->vm_file->f_dentry);
    + return pb;
    +}
    +
    +void free_pb(struct pbitmap *pb)
    +{
    + if (pb->fh)
    + filp_close(pb->fh, 0);
    + else {
    + dput(pb->backing.dentry);
    + mntput(pb->backing.mnt);
    + }
    + kfree(pb);
    +}
    +
    +static int
    +pbitmap_prefetch(struct file *f, unsigned long *buf, struct pbitmap *pb)
    +{
    + int found = 0;
    + long left = BITS_TO_BYTES(pb->npages);
    + unsigned long offset = pb->file_offset;
    +
    + Pprintk("%s prefetch %lx %lu\n", current->comm, offset, left);
    + while (left > 0) {
    + int n = left, k, bit;
    + if (n > PAGE_SIZE)
    + n = PAGE_SIZE;
    +
    + /* Read the bitmap */
    +
    + k = kernel_read(f, offset, (char *)buf, n);
    + if (n != k) {
    + printk("prefetch read failed: %d\n", n);
    + return -EIO;
    + }
    + Pprintk("bitmap [%lx]\n", *(unsigned long *)buf);
    +
    + left -= n;
    + offset += n;
    +
    + n *= 8;
    +
    + /* First do IO on everything */
    + readahead_bitmap(f, pb->pgoff + pb->npages - left, buf, n);
    +
    + if (!pbitmap_early_fault)
    + continue;
    +
    + /* Now map it all in */
    + bit = 0;
    + while ((bit = find_next_bit(buf, n, bit)) < n) {
    + int err, i;
    + for (i = 1; i < n - bit; i++)
    + if (!test_bit(bit + i, buf))
    + break;
    + err = get_user_pages(current, current->mm,
    + pb->start + bit*PAGE_SIZE,
    + i*PAGE_SIZE,
    + 0,
    + 0,
    + NULL, NULL);
    + if (err < 0) {
    + printk("prefetch gup failed %d %lx\n", err,
    + pb->start+bit*PAGE_SIZE);
    + } else
    + found += err;
    + bit += i;
    + }
    + }
    + return found;
    +}
    +
    +/* Prefetch the program's working set from last exit. */
    +int pbitmap_load(struct file *f, unsigned long *buf,
    + unsigned long vaddr, unsigned long file_offset,
    + unsigned long filesz)
    +{
    + int err;
    + struct pbitmap *pb;
    + struct vm_area_struct *vma;
    +
    + vaddr &= PAGE_MASK;
    +
    + vma = find_vma(current->mm, vaddr);
    + if (!vma)
    + return 0;
    +
    + /* Likely BSS */
    + if (vma->vm_file == NULL)
    + return 0;
    +
    + pb = alloc_pb(vma, vaddr, file_offset, filesz);
    + if (!pb)
    + return -ENOMEM;
    +
    + /* For large programs it might be better to start a thread */
    + err = pbitmap_prefetch(f, buf, pb);
    + if (err < 0) {
    + free_pb(pb);
    + return err;
    + }
    + printk("%d prefetched\n", err);
    + list_add_tail(&pb->node, &current->mm->pbitmap_list);
    + return err;
    +}
    +
    +/*
    + * Standard page table walker.
    + * Could later be converted to the .22 generic walker, but let's keep it like
    + * this for now.
    + */
    +
    +static void
    +pbitmap_pte_range(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
    + unsigned long end, struct pbitmap *pb)
    +{
    + pte_t *pte;
    + spinlock_t *ptl;
    +
    + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
    +
    + Pprintk("addr %lx end %lx start %lx\n", addr, end, pb->start);
    +
    + /* Turn addresses into bitmap offsets */
    + addr -= pb->start;
    + end -= pb->start;
    + addr >>= PAGE_SHIFT;
    + end >>= PAGE_SHIFT;
    +
    + do {
    + if (pte_present(*pte))
    + __set_bit(addr, pb->buffer);
    + } while (pte++, addr++, addr != end);
    + pte_unmap_unlock(pte - 1, ptl);
    +}
    +
    +static inline void pbitmap_pmd_range(struct vm_area_struct *vma, pud_t *pud,
    + unsigned long addr, unsigned long end,
    + struct pbitmap *pb)
    +{
    + pmd_t *pmd;
    + unsigned long next;
    +
    + pmd = pmd_offset(pud, addr);
    + do {
    + next = pmd_addr_end(addr, end);
    + if (pmd_none(*pmd))
    + continue;
    + pbitmap_pte_range(vma, pmd, addr, next, pb);
    + } while (pmd++, addr = next, addr != end);
    +}
    +
    +static void pbitmap_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
    + unsigned long addr, unsigned long end,
    + struct pbitmap *pb)
    +{
    + pud_t *pud;
    + unsigned long next;
    +
    + pud = pud_offset(pgd, addr);
    + do {
    + next = pud_addr_end(addr, end);
    + if (pud_none(*pud))
    + continue;
    + pbitmap_pmd_range(vma, pud, addr, next, pb);
    + } while (pud++, addr = next, addr != end);
    +}
    +
    +static void pbitmap_page_range(struct vm_area_struct *vma,
    + unsigned long addr, unsigned long end,
    + struct pbitmap *pb)
    +{
    + pgd_t *pgd;
    + unsigned long next;
    +
    + BUG_ON(addr >= end);
    + pgd = pgd_offset(vma->vm_mm, addr);
    + do {
    + next = pgd_addr_end(addr, end);
    + if (pgd_none(*pgd))
    + continue;
    + pbitmap_pud_range(vma, pgd, addr, next, pb);
    + } while (pgd++, addr = next, addr != end);
    +}
    +
    +/*
    + * Do some paranoid checks on the VMA just in case the user unmaped it and
    + * mapped something else there.
    + * This is not strictly needed for security because the bitmap update will only
    + * write to the cached executable.
    + */
    +static int vma_in_file(struct vm_area_struct *vma, struct pbitmap *pb)
    +{
    + if (!vma->vm_file || vma->vm_file->f_dentry != pb->backing.dentry)
    + return 0;
    + if (vma->vm_start >= pb->start + pb->npages*PAGE_SIZE)
    + return 0;
    + return vma->vm_pgoff + vma_pages(vma) > pb->pgoff &&
    + vma->vm_pgoff < pb->pgoff + pb->npages;
    +}
    +
    +static void pbitmap_walk_vmas(struct pbitmap *pb, struct vm_area_struct *vma)
    +{
    + unsigned long addr = pb->start;
    +
    + for (; vma && vma_in_file(vma, pb); vma = vma->vm_next) {
    + unsigned long end = pb->start + pb->npages*PAGE_SIZE;
    + if (addr < vma->vm_start)
    + addr = vma->vm_start;
    + if (end > vma->vm_end) // off by one?
    + end = vma->vm_end;
    + Pprintk("p_p_r %lx %lx vma %lx-%lx pb start %lx pages %lu\n",
    + addr, end,
    + vma->vm_start, vma->vm_end,
    + pb->start, pb->npages);
    + if (addr == end)
    + continue;
    + pbitmap_page_range(vma, addr, end, pb);
    + }
    +}
    +
    +long bit_count(unsigned char *buffer, int n)
    +{
    + int i;
    + long x = 0;
    + for (i = 0; i < n; i++) {
    + unsigned char v = buffer[i];
    + while (v) {
    + if (v & 1)
    + x++;
    + v >>= 1;
    + }
    + }
    + return x;
    +}
    +
    +/* Avoid thrashing from too frequent updates */
    +static int no_update(struct pbitmap *pb)
    +{
    + struct inode *i;
    + if (!pbitmap_update_interval)
    + return 0;
    + i = pb->backing.dentry->d_inode;
    + return i->i_mtime.tv_sec + pbitmap_update_interval < xtime.tv_sec;
    +}
    +
    +/*
    + * Update the present bitmaps on disk on program exit.
    + *
    + * Doesn't do any mmap_sem locking anywhere because there should be no
    + * other users of the mm at this point. The page table locks are
    + * unfortunately still needed because the mm could still being swapped
    + * out in parallel (TBD teach the swapper to not do that)
    + *
    + * Could be merged with unmap_vmas later; but it's easier to do it this
    + * way for now.
    + */
    +void pbitmap_update(struct mm_struct *mm)
    +{
    + struct pbitmap *pb, *prev;
    + struct vm_area_struct *vma;
    + mm_segment_t oldfs;
    + unsigned long *buffer;
    + int order;
    +
    + if (list_empty(&mm->pbitmap_list))
    + return;
    +
    + buffer = NULL;
    +
    + oldfs = get_fs();
    + prev = NULL;
    + list_for_each_entry (pb, &mm->pbitmap_list, node) {
    + long bytes = BITS_TO_BYTES(pb->npages);
    +
    + if (!bytes)
    + continue;
    +
    + /* This is typically order 0; only extremly large VMAs
    + (> 128MB) would have order >0. Failing is also not fatal.
    + We don't want any swapping so use GFP_NOFS */
    + if (!buffer || bytes > (PAGE_SIZE << order)) {
    + if (buffer)
    + free_pages((unsigned long)buffer, order);
    + order = get_order(bytes);
    + buffer = (unsigned long *)
    + __get_free_pages(GFP_NOFS|__GFP_NOWARN|
    + __GFP_ZERO, order);
    + if (!buffer) {
    + printk("allocation %d failed\n", order);
    + break;
    + }
    + }
    +
    + pb->buffer = buffer;
    +
    + vma = find_vma(mm, pb->start);
    + /* Reuse the last file if possible */
    + if (prev && prev->fh &&
    + pb->backing.dentry == prev->backing.dentry &&
    + pb->backing.mnt == prev->backing.mnt) {
    + pb->fh = prev->fh;
    + prev->fh = NULL;
    + } else {
    + if (no_update(pb))
    + continue;
    + /* Reopen to get a writable file */
    + pb->fh = dentry_open(pb->backing.dentry,
    + pb->backing.mnt,
    + O_WRONLY|O_FORCEWRITE);
    + if (IS_ERR(pb->fh)) {
    + printk("dentry open of %s failed: %ld\n",
    + pb->backing.dentry->d_name.name,
    + PTR_ERR(pb->fh));
    + pb->fh = NULL;
    + }
    + }
    + if (pb->fh) {
    + int n;
    + pbitmap_walk_vmas(pb, vma);
    + Pprintk("%s: %ld bytes end %ld bits [%lx] at %Lx\n",
    + current->comm, bytes,
    + bit_count((unsigned char *)buffer, bytes),
    + buffer[0], pb->file_offset);
    + set_fs(KERNEL_DS);
    + n = vfs_write(pb->fh, (char *)buffer, bytes,
    + &pb->file_offset);
    + set_fs(oldfs);
    + if (n != bytes)
    + printk("%s: bitmap write %d of %ld\n",
    + current->comm, n, bytes);
    +
    + /* fh contains the references now */
    + pb->backing.dentry = NULL;
    + pb->backing.mnt = NULL;
    + }
    + prev = pb;
    + }
    + list_for_each_entry_safe (pb, prev, &mm->pbitmap_list, node)
    + free_pb(pb);
    +
    + if (buffer)
    + free_page((unsigned long)buffer);
    +}
    Index: linux/mm/mmap.c
    ===================================================================
    --- linux.orig/mm/mmap.c
    +++ linux/mm/mmap.c
    @@ -2039,6 +2039,7 @@ void exit_mmap(struct mm_struct *mm)
    /* mm's last user has gone, and its about to be pulled down */
    arch_exit_mmap(mm);

    + pbitmap_update(mm);
    lru_add_drain();
    flush_cache_mm(mm);
    tlb = tlb_gather_mmu(mm, 1);
    Index: linux/mm/Makefile
    ===================================================================
    --- linux.orig/mm/Makefile
    +++ linux/mm/Makefile
    @@ -11,7 +11,7 @@ obj-y := bootmem.o filemap.o mempool.o
    page_alloc.o page-writeback.o pdflush.o \
    readahead.o swap.o truncate.o vmscan.o \
    prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
    - page_isolation.o $(mmu-y)
    + page_isolation.o $(mmu-y) pbitmap.o

    obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
    obj-$(CONFIG_BOUNCE) += bounce.o
    Index: linux/include/linux/mm_types.h
    ===================================================================
    --- linux.orig/include/linux/mm_types.h
    +++ linux/include/linux/mm_types.h
    @@ -222,6 +222,9 @@ struct mm_struct {
    /* aio bits */
    rwlock_t ioctx_list_lock;
    struct kioctx *ioctx_list;
    +
    + struct list_head pbitmap_list;
    +
    #ifdef CONFIG_CGROUP_MEM_RES_CTLR
    struct mem_cgroup *mem_cgroup;
    #endif

  3. [PATCH prototype] [4/8] Add readahead function to read-ahead based on a bitmap


    Signed-off-by: Andi Kleen

    ---
    include/linux/mm.h | 3 ++
    mm/filemap.c | 2 -
    mm/readahead.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++
    3 files changed, 61 insertions(+), 1 deletion(-)

    Index: linux/mm/readahead.c
    ===================================================================
    --- linux.orig/mm/readahead.c
    +++ linux/mm/readahead.c
    @@ -117,6 +117,26 @@ out:
    return ret;
    }

    +static int preallocate_page(struct address_space *mapping,
    + pgoff_t page_offset, struct list_head *page_pool)
    +{
    + struct page *page;
    +
    + /* silently cries for a gang lookup */
    + rcu_read_lock();
    + page = radix_tree_lookup(&mapping->page_tree, page_offset);
    + rcu_read_unlock();
    + if (page)
    + return 0;
    +
    + page = page_cache_alloc_cold(mapping);
    + if (!page)
    + return 0;
    + page->index = page_offset;
    + list_add(&page->lru, page_pool);
    + return 1;
    +}
    +
    /*
    * do_page_cache_readahead actually reads a chunk of disk. It allocates all
    * the pages first, then submits them all for I/O. This avoids the very bad
    @@ -140,6 +160,7 @@ __do_page_cache_readahead(struct address
    int page_idx;
    int ret = 0;
    loff_t isize = i_size_read(inode);
    + int n;

    if (isize == 0)
    goto out;
    @@ -216,6 +237,42 @@ int force_page_cache_readahead(struct ad
    }

    /*
    + * Read-ahead a page for each bit set in the bitmap.
    + */
    +void readahead_bitmap(struct file *f, pgoff_t pgoffset, unsigned long *bitmap,
    + unsigned nbits)
    +{
    + long bit, n;
    + LIST_HEAD(page_pool);
    + loff_t isize = i_size_read(f->f_dentry->d_inode);
    + unsigned long end_index;
    + struct address_space *mapping = f->f_mapping;
    +
    + if (isize == 0)
    + return;
    +
    + end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
    +
    + if (!mapping->a_ops || (!mapping->a_ops->readpages &&
    + !mapping->a_ops->readpage))
    + return;
    +
    + bit = -1;
    + n = 0;
    + while ((bit = find_next_bit(bitmap, nbits, bit + 1)) < nbits) {
    + if (pgoffset + bit >= end_index)
    + break;
    + n += preallocate_page(f->f_mapping, pgoffset + bit,&page_pool);
    + if (n >= (MAX_PINNED_CHUNK / PAGE_CACHE_SIZE)) {
    + read_pages(f->f_mapping, f, &page_pool, n);
    + n = 0;
    + }
    + }
    + if (n > 0)
    + read_pages(f->f_mapping, f, &page_pool, n);
    +}
    +
    +/*
    * This version skips the IO if the queue is read-congested, and will tell the
    * block layer to abandon the readahead if request allocation would block.
    *
    Index: linux/mm/filemap.c
    ===================================================================
    --- linux.orig/mm/filemap.c
    +++ linux/mm/filemap.c
    @@ -1240,7 +1240,7 @@ static ssize_t
    do_readahead(struct address_space *mapping, struct file *filp,
    pgoff_t index, unsigned long nr)
    {
    - if (!mapping || !mapping->a_ops || !mapping->a_ops->readpage)
    + if (!mapping || !mapping->a_ops)
    return -EINVAL;

    force_page_cache_readahead(mapping, filp, index,
    Index: linux/include/linux/mm.h
    ===================================================================
    --- linux.orig/include/linux/mm.h
    +++ linux/include/linux/mm.h
    @@ -1104,6 +1104,9 @@ void page_cache_sync_readahead(struct ad
    pgoff_t offset,
    unsigned long size);

    +void readahead_bitmap(struct file *f, pgoff_t pgoffset, unsigned long *bitmap,
    + unsigned nbits);
    +
    void page_cache_async_readahead(struct address_space *mapping,
    struct file_ra_state *ra,
    struct file *filp,

  4. [PATCH prototype] [8/8] Add mmap_full_slurp support


    This is the non-subtle brother of pbitmaps. Instead of prefetching
    only the pages used last time, always prefetch complete mmap areas
    (subject to the max_ra_chunk limits).

    The main advantage is that it works for shared libraries too (though
    the prefetching is currently done for all file-backed mmaps, not
    just executable mappings).

    The disadvantage is that it uses far more memory and does a lot of
    unnecessary work.

    It is still faster in some circumstances.

    Experimental. Mainly added for comparison. Off by default.
    Probably not a good idea.

    I'm more including it for easier experimentation.

    Signed-off-by: Andi Kleen

    ---
    include/linux/mm.h | 2 ++
    kernel/sysctl.c | 10 ++++++++++
    mm/mmap.c | 10 ++++++++++
    3 files changed, 22 insertions(+)

    Index: linux/mm/mmap.c
===================================================================
    --- linux.orig/mm/mmap.c
    +++ linux/mm/mmap.c
    @@ -40,6 +40,8 @@
    #define arch_rebalance_pgtables(addr, len) (addr)
    #endif

    +int mmap_full_slurp __read_mostly;
    +
    static void unmap_region(struct mm_struct *mm,
    struct vm_area_struct *vma, struct vm_area_struct *prev,
    unsigned long start, unsigned long end);
    @@ -2023,6 +2025,14 @@ out:
    mm->locked_vm += len >> PAGE_SHIFT;
    make_pages_present(addr, addr + len);
    }
    + if (vma->vm_file && mmap_full_slurp) {
    + up_write(&mm->mmap_sem);
    + force_page_cache_readahead(vma->vm_file->f_mapping,
    + vma->vm_file,
    + vma->vm_pgoff,
    + max_sane_readahead(vma_pages(vma)));
    + down_write(&mm->mmap_sem);
    + }
    return addr;
    }

    Index: linux/include/linux/mm.h
===================================================================
    --- linux.orig/include/linux/mm.h
    +++ linux/include/linux/mm.h
    @@ -1077,6 +1077,8 @@ extern int do_munmap(struct mm_struct *,

    extern unsigned long do_brk(unsigned long, unsigned long);

    +extern int mmap_full_slurp;
    +
    /* filemap.c */
    extern unsigned long page_unuse(struct page *);
    extern void truncate_inode_pages(struct address_space *, loff_t);
    Index: linux/kernel/sysctl.c
===================================================================
    --- linux.orig/kernel/sysctl.c
    +++ linux/kernel/sysctl.c
    @@ -909,6 +909,16 @@ static struct ctl_table vm_table[] = {
    .extra2 = &one_hundred,
    },
    {
    + .ctl_name = CTL_UNNUMBERED,
    + .procname = "mmap_full_slurp",
    + .data = &mmap_full_slurp,
    + .maxlen = sizeof(mmap_full_slurp),
    + .mode = 0644,
    + .proc_handler = &proc_dointvec,
    + .strategy = &sysctl_intvec,
    + .extra1 = &zero,
    + },
    + {
    .procname = "dirty_writeback_centisecs",
    .data = &dirty_writeback_interval,
    .maxlen = sizeof(dirty_writeback_interval),

  5. [PATCH prototype] [7/8] Add the sysctls to control pbitmaps


    - pbitmap_enabled: Master switch for pbitmap
    - pbitmap_early_fault: Control whether pbitmap should do
    early page faults or not. Default on.
    - pbitmap_update_interval: How often the pbitmap should
    be updated on disk.

    Signed-off-by: Andi Kleen

    ---
    kernel/sysctl.c | 30 ++++++++++++++++++++++++++++++
    1 file changed, 30 insertions(+)

    Index: linux/kernel/sysctl.c
===================================================================
    --- linux.orig/kernel/sysctl.c
    +++ linux/kernel/sysctl.c
    @@ -1044,6 +1044,36 @@ static struct ctl_table vm_table[] = {
    .extra1 = &zero,
    },
    {
    + .ctl_name = CTL_UNNUMBERED,
    + .procname = "pbitmap_enabled",
    + .data = &pbitmap_enabled,
    + .maxlen = sizeof(pbitmap_enabled),
    + .mode = 0644,
    + .proc_handler = &proc_dointvec,
    + .strategy = &sysctl_intvec,
    + .extra1 = &zero,
    + },
    + {
    + .ctl_name = CTL_UNNUMBERED,
    + .procname = "pbitmap_early_fault",
    + .data = &pbitmap_early_fault,
    + .maxlen = sizeof(pbitmap_early_fault),
    + .mode = 0644,
    + .proc_handler = &proc_dointvec,
    + .strategy = &sysctl_intvec,
    + .extra1 = &zero,
    + },
    + {
    + .ctl_name = CTL_UNNUMBERED,
    + .procname = "pbitmap_update_interval",
    + .data = &pbitmap_update_interval,
    + .maxlen = sizeof(pbitmap_update_interval),
    + .mode = 0644,
    + .proc_handler = &proc_dointvec,
    + .strategy = &sysctl_intvec,
    + .extra1 = &zero,
    + },
    + {
    .ctl_name = VM_VFS_CACHE_PRESSURE,
    .procname = "vfs_cache_pressure",
    .data = &sysctl_vfs_cache_pressure,

  6. [PATCH prototype] [1/8] Give ELF shdr types a name


    The pbitmap code is the first code in Linux to know
    about ELF shdrs. They were already in the include files, but
    not given a suitable type name. Fix that.

    Signed-off-by: Andi Kleen

    ---
    include/linux/elf.h | 4 +++-
    1 file changed, 3 insertions(+), 1 deletion(-)

    Index: linux/include/linux/elf.h
===================================================================
    --- linux.orig/include/linux/elf.h
    +++ linux/include/linux/elf.h
    @@ -286,7 +286,7 @@ typedef struct elf64_phdr {
    #define SHN_COMMON 0xfff2
    #define SHN_HIRESERVE 0xffff

    -typedef struct {
    +typedef struct elf32_shdr {
    Elf32_Word sh_name;
    Elf32_Word sh_type;
    Elf32_Word sh_flags;
    @@ -382,6 +382,7 @@ extern Elf32_Dyn _DYNAMIC [];
    #define elf_phdr elf32_phdr
    #define elf_note elf32_note
    #define elf_addr_t Elf32_Off
    +#define elf_shdr elf32_shdr

    #else

    @@ -390,6 +391,7 @@ extern Elf64_Dyn _DYNAMIC [];
    #define elf_phdr elf64_phdr
    #define elf_note elf64_note
    #define elf_addr_t Elf64_Off
    +#define elf_shdr elf64_shdr

    #endif


  7. [PATCH prototype] [3/8] Make readahead max pinned value a sysctl


Previously it was hard-coded to 2MB, which seems dubious as systems
are getting more and more memory. A sysctl makes it easier to play
around with and tune.

    Signed-off-by: Andi Kleen

    ---
    include/linux/mm.h | 2 ++
    kernel/sysctl.c | 10 ++++++++++
    mm/readahead.c | 14 ++++++++++++--
    3 files changed, 24 insertions(+), 2 deletions(-)

    Index: linux/mm/readahead.c
===================================================================
    --- linux.orig/mm/readahead.c
    +++ linux/mm/readahead.c
    @@ -17,6 +17,14 @@
    #include
    #include

    +/*
    + * Max memory pinned during forced readahead.
    + * Should be probably auto scaled with available memory.
    + */
    +unsigned ra_max_pinned = 2*1024*1024;
    +
    +#define MAX_PINNED_CHUNK ra_max_pinned
    +
    void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
    {
    }
    @@ -176,7 +184,7 @@ out:
    }

    /*
    - * Chunk the readahead into 2 megabyte units, so that we don't pin too much
    + * Chunk the readahead into MAX_PINNED_CHUNK units, so that we don't pin too much
    * memory at once.
    */
    int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
    @@ -190,7 +198,7 @@ int force_page_cache_readahead(struct ad
    while (nr_to_read) {
    int err;

    - unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_CACHE_SIZE;
    + unsigned long this_chunk = MAX_PINNED_CHUNK / PAGE_CACHE_SIZE;

    if (this_chunk > nr_to_read)
    this_chunk = nr_to_read;
    @@ -229,6 +237,8 @@ int do_page_cache_readahead(struct addre
    */
    unsigned long max_sane_readahead(unsigned long nr)
    {
    + /* AK: RED-PEN for file cached pages there is no reason
    + to limit ourselves to the current node */
    return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE)
    + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
    }
    Index: linux/include/linux/mm.h
===================================================================
    --- linux.orig/include/linux/mm.h
    +++ linux/include/linux/mm.h
    @@ -1113,6 +1113,8 @@ void page_cache_async_readahead(struct a

    unsigned long max_sane_readahead(unsigned long nr);

    +extern unsigned ra_max_pinned;
    +
    /* Do stack extension */
    extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
    #ifdef CONFIG_IA64
    Index: linux/kernel/sysctl.c
===================================================================
    --- linux.orig/kernel/sysctl.c
    +++ linux/kernel/sysctl.c
    @@ -877,6 +877,16 @@ static struct ctl_table vm_table[] = {
    .proc_handler = &proc_dointvec,
    },
    {
    + .ctl_name = CTL_UNNUMBERED,
    + .procname = "ra_max_pinned",
    + .data = &ra_max_pinned,
    + .maxlen = sizeof(ra_max_pinned),
    + .mode = 0644,
    + .proc_handler = &proc_dointvec,
    + .strategy = &sysctl_intvec,
    + .extra1 = &zero,
    + },
    + {
    .ctl_name = VM_DIRTY_BACKGROUND,
    .procname = "dirty_background_ratio",
    .data = &dirty_background_ratio,

  8. [PATCH prototype] [5/8] Add ELF constants for pbitmaps


    I made up some numbers for the present bitmap phdrs and shdrs.

    For serious use this would probably require an allocation somewhere.


    Signed-off-by: Andi Kleen

    ---
    include/linux/elf.h | 4 ++++
    1 file changed, 4 insertions(+)

    Index: linux/include/linux/elf.h
===================================================================
    --- linux.orig/include/linux/elf.h
    +++ linux/include/linux/elf.h
    @@ -49,6 +49,7 @@ typedef __s64 Elf64_Sxword;
    #define PT_GNU_EH_FRAME 0x6474e550

    #define PT_GNU_STACK (PT_LOOS + 0x474e551)
    +#define PT_PRESENT_BITMAP (PT_GNU_STACK + 1)

    /* These constants define the different elf file types */
    #define ET_NONE 0
    @@ -230,6 +231,8 @@ typedef struct elf64_hdr {
    #define PF_W 0x2
    #define PF_X 0x1

    +#define PF_PLEASE_LOAD_SHDRS 0x8 /* hack. checked on PT_GNU_STACK */
    +
    typedef struct elf32_phdr{
    Elf32_Word p_type;
    Elf32_Off p_offset;
    @@ -270,6 +273,7 @@ typedef struct elf64_phdr {
    #define SHT_HIPROC 0x7fffffff
    #define SHT_LOUSER 0x80000000
    #define SHT_HIUSER 0xffffffff
    +#define SHT_PRESENT_BITMAP (SHT_LOPROC - 1000)

    /* sh_flags */
    #define SHF_WRITE 0x1

  9. [PATCH prototype] [2/8] Add support to override mmap exec write protection with O_FORCEWRITE


    pbitmaps need to write to files mapped as executable, so add
    a way to bypass the normal write protection using a new O_FORCEWRITE
    flag.

Right now the flag can be set from user space too. I have not
made up my mind whether that is a good or a bad thing (I don't think
it has any real security implications). Probably it's more bad
than good.

    Signed-off-by: Andi Kleen

    ---
    fs/namei.c | 9 +++++++--
    fs/open.c | 2 +-
    include/asm-generic/fcntl.h | 4 ++++
    include/linux/fs.h | 1 +
    4 files changed, 13 insertions(+), 3 deletions(-)

    Index: linux/fs/namei.c
===================================================================
    --- linux.orig/fs/namei.c
    +++ linux/fs/namei.c
    @@ -334,10 +334,10 @@ int file_permission(struct file *file, i
    * the inode->i_lock spinlock.
    */

    -int get_write_access(struct inode * inode)
    +int __get_write_access(struct inode * inode, int flags)
    {
    spin_lock(&inode->i_lock);
    - if (atomic_read(&inode->i_writecount) < 0) {
    + if (atomic_read(&inode->i_writecount) < 0 && !(flags&O_FORCEWRITE)) {
    spin_unlock(&inode->i_lock);
    return -ETXTBSY;
    }
    @@ -347,6 +347,11 @@ int get_write_access(struct inode * inod
    return 0;
    }

    +int get_write_access(struct inode * inode)
    +{
    + return __get_write_access(inode, 0);
    +}
    +
    int deny_write_access(struct file * file)
    {
    struct inode *inode = file->f_path.dentry->d_inode;
    Index: linux/fs/open.c
===================================================================
    --- linux.orig/fs/open.c
    +++ linux/fs/open.c
    @@ -742,7 +742,7 @@ static struct file *__dentry_open(struct
    FMODE_PREAD | FMODE_PWRITE;
    inode = dentry->d_inode;
    if (f->f_mode & FMODE_WRITE) {
    - error = get_write_access(inode);
    + error = __get_write_access(inode, f->f_flags & O_FORCEWRITE);
    if (error)
    goto cleanup_file;
    }
    Index: linux/include/linux/fs.h
===================================================================
    --- linux.orig/include/linux/fs.h
    +++ linux/include/linux/fs.h
    @@ -1720,6 +1720,7 @@ extern int generic_permission(struct ino
    int (*check_acl)(struct inode *, int));

    extern int get_write_access(struct inode *);
    +extern int __get_write_access(struct inode *i, int flags);
    extern int deny_write_access(struct file *);
    static inline void put_write_access(struct inode * inode)
    {
    Index: linux/include/asm-generic/fcntl.h
===================================================================
    --- linux.orig/include/asm-generic/fcntl.h
    +++ linux/include/asm-generic/fcntl.h
    @@ -54,6 +54,10 @@
    #ifndef O_NDELAY
    #define O_NDELAY O_NONBLOCK
    #endif
    +/* ignore ETXTBSY -- should probably hide it from user space */
    +#ifndef O_FORCEWRITE
    +#define O_FORCEWRITE 02000000
    +#endif

    #define F_DUPFD 0 /* dup */
    #define F_GETFD 1 /* get close_on_exec */

  10. Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

    On Tue, 18 Mar 2008 02:09:34 +0100 (CET) Andi Kleen wrote:

    > This patchkit is an experimental optimization I played around with
    > some time ago.
    >
    > This is more a prototype still, but I wanted to push it out
    > so that other people can play with it.
    >
    > The basic idea is that most programs have the same working set
    > over multiple runs. So instead of demand paging all the text pages
    > in the order the program runs save the working set to disk and prefetch
    > it at program start and then save it at program exit.
    >
    > This allows some optimizations:
    > - it can avoid unnecessary disk seeks because the blocks will be fetched in
    > sorted offset order instead of program execution order.
    > - batch kernel entries (each demand page exception has some
    > overhead just for entering the kernel). This keeps the caches hot too.
    > - The prefetch could be in theory done in the background while the program
    > runs (although that is not implemented currently)


    Should be worthwhile for some things.

    > Some details on the implementation:


    Can't this all be done in userspace? Hook into exit() with an LD_PRELOAD,
    use /proc/self/maps and the new pagemap code to work out which pages of
    which files were faulted in, write that info into the elf file (or a
    separate per-executable shadow file), then use that info the next time the
    app is executed, either with an LD_PRELOAD or just a wrapper.

    > Drawbacks:
    > - No support for dynamic libraries right now (except very clumsily
    > through the mmap_slurp hack). This is the main reason it is not
    > very useful for speeding up desktops currently.
    >
    > - Executable files have to be writable by the user executing it
    > currently to get bitmap updates. It would be possible to let the
    > kernel bypass this, but I haven't thought too much about the security
    > implications of it.
    > However any user can use the bitmap data written by a user with
    > write rights.


    Those all get fixed with the userspace version?

  11. Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

    On Tue, Mar 18, 2008 at 12:36:20AM -0700, Andrew Morton wrote:
    > On Tue, 18 Mar 2008 02:09:34 +0100 (CET) Andi Kleen wrote:
    >
    > > This patchkit is an experimental optimization I played around with
    > > some time ago.
    > >
    > > This is more a prototype still, but I wanted to push it out
    > > so that other people can play with it.
    > >
    > > The basic idea is that most programs have the same working set
    > > over multiple runs. So instead of demand paging all the text pages
    > > in the order the program runs save the working set to disk and prefetch
    > > it at program start and then save it at program exit.
    > >
    > > This allows some optimizations:
    > > - it can avoid unnecessary disk seeks because the blocks will be fetched in
    > > sorted offset order instead of program execution order.
    > > - batch kernel entries (each demand page exception has some
    > > overhead just for entering the kernel). This keeps the caches hot too.
    > > - The prefetch could be in theory done in the background while the program
    > > runs (although that is not implemented currently)

    >
    > Should be worthwhile for some things.
    >
    > > Some details on the implementation:

    >
    > Can't this all be done in userspace? Hook into exit() with an LD_PRELOAD,


    In theory yes, but it would have much more overhead. Also I think prefetching
    algorithms like this really belong in the kernel.

    > > - Executable files have to be writable by the user executing it
    > > currently to get bitmap updates. It would be possible to let the
    > > kernel bypass this, but I haven't thought too much about the security
    > > implications of it.
    > > However any user can use the bitmap data written by a user with
    > > write rights.

    >
    > Those all get fixed with the userspace version?


No, in fact the permission problem is much harder to fix in user space.

Shared libraries would need some user-space support: basically a way
for ld.so to tell the kernel the offsets/locations of new
bitmaps to add to the pbitmap list.

    -Andi



  12. Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

    On Wed, 19 Mar 2008 09:32:28 +0100 Andi Kleen wrote:

    > On Tue, Mar 18, 2008 at 10:44:37AM -0700, Andrew Morton wrote:
    > > On Tue, 18 Mar 2008 18:20:45 +0100 Andi Kleen wrote:
    > >
    > > > > What's the permission problem? executable-but-not-readable files? Could
    > > >
    > > > Not writable.

    > >
    > > Oh.
    > >
    > > I doubt if a userspace implementation would even try to alter the ELF
    > > files, really - there seems to be no point in it. This is just complexity

    >
    > Well, the information has to be somewhere and I think the ELF file
    > is the best location for it. It makes it the most transparent to the user.


    Adopt a standard, stick with it.

Assuming that all users have the same access pattern might be a little
inefficient. There might be some advantage to making it per-user, dunno.

    The requirement to write to an executable sounds like a bit of a
    showstopper.

    > > > Yes it could, but I don't even want to think about all the issues of
    > > > doing such an interface. It is basically a microkernelish approach.
    > > > I prefer monolithic simplicity.

    > >
    > > It's not complex at all. Pass a null-terminated pathname to the server and
    > > keep running. The server will asynchronously read your pages for you.

    >
    > But how do you update the bitmap in your scheme?


    umm,

    BITMAP_TRAINING_RUN=1 /usr/lib64/firefox-2.0.0.12/firefox-bin

    will write the bitmap to ~/.bitmaps/usr/lib64/firefox-2.0.0.12/firefox-bin ?

    if it proves useful, build it all into libc..


    I'm assuming that the per-page minor fault cost is relatively low and that
    the major benefit is in disk scheduling[*]. If that's false then we'd need
    kernel support I guess - some sort of gang-fault syscall?


    * solid-state disks are going to put a lot of code out of a job.

  13. Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

    On Tue, 18 Mar 2008 18:20:45 +0100 Andi Kleen wrote:

    > > What's the permission problem? executable-but-not-readable files? Could

    >
    > Not writable.


    Oh.

    I doubt if a userspace implementation would even try to alter the ELF
    files, really - there seems to be no point in it. This is just complexity
    which was added by trying to do it in the kernel.

    > > be handled by passing your request to a suitable-privileged server process,
    > > I guess.

    >
    > Yes it could, but I don't even want to think about all the issues of
    > doing such an interface. It is basically a microkernelish approach.
    > I prefer monolithic simplicity.


    It's not complex at all. Pass a null-terminated pathname to the server and
    keep running. The server will asynchronously read your pages for you.

    That's assuming executable+unreadable libraries and binaries actually need
    to be handled. If not: no server needed.

    > E.g. I am pretty sure your user-space implementation would be far
    > more complicated than a nicely streamlined kernel implementation.
    > And I am really not a friend of unnecessary complexity. In the end
    > complexity hurts you, no matter if it is in ring 3 or ring 0.


    There is no complexity here.

  14. Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

    On Tue, 18 Mar 2008 15:18:28 +0100 Andi Kleen wrote:

    > On Tue, Mar 18, 2008 at 12:36:20AM -0700, Andrew Morton wrote:
    > > On Tue, 18 Mar 2008 02:09:34 +0100 (CET) Andi Kleen wrote:
    > >
    > > > This patchkit is an experimental optimization I played around with
    > > > some time ago.
    > > >
    > > > This is more a prototype still, but I wanted to push it out
    > > > so that other people can play with it.
    > > >
    > > > The basic idea is that most programs have the same working set
    > > > over multiple runs. So instead of demand paging all the text pages
    > > > in the order the program runs save the working set to disk and prefetch
    > > > it at program start and then save it at program exit.
    > > >
    > > > This allows some optimizations:
    > > > - it can avoid unnecessary disk seeks because the blocks will be fetched in
    > > > sorted offset order instead of program execution order.
    > > > - batch kernel entries (each demand page exception has some
    > > > overhead just for entering the kernel). This keeps the caches hot too.
    > > > - The prefetch could be in theory done in the background while the program
    > > > runs (although that is not implemented currently)

    > >
    > > Should be worthwhile for some things.
    > >
    > > > Some details on the implementation:

    > >
    > > Can't this all be done in userspace? Hook into exit() with an LD_PRELOAD,

    >
    > In theory yes, but it would have much more overhead.


    s/much/slightly/. The main win will be in optimising disk patterns. We
    can fault in cached pages at, what? A few GB/sec?

    > Also I think prefetching
    > algorithms like this really belong in the kernel.


    hrmph.

    > > > - Executable files have to be writable by the user executing it
    > > > currently to get bitmap updates. It would be possible to let the
    > > > kernel bypass this, but I haven't thought too much about the security
    > > > implications of it.
    > > > However any user can use the bitmap data written by a user with
    > > > write rights.

    > >
    > > Those all get fixed with the userspace version?

    >
    > No in fact the permission problem is much harder to fix in user space.


    What's the permission problem? executable-but-not-readable files? Could
    be handled by passing your request to a suitable-privileged server process,
    I guess.


  15. Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

    > What's the permission problem? executable-but-not-readable files? Could

    Not writable.

    > be handled by passing your request to a suitable-privileged server process,
    > I guess.


Yes it could, but I don't even want to think about all the issues of
doing such an interface. It is basically a microkernelish approach.
I prefer monolithic simplicity.

E.g. I am pretty sure your user-space implementation would be far
more complicated than a nicely streamlined kernel implementation.
And I am really not a friend of unnecessary complexity. In the end
complexity hurts you, no matter if it is in ring 3 or ring 0.

    -Andi

  16. Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

    On Tue, Mar 18, 2008 at 10:44:37AM -0700, Andrew Morton wrote:
    > On Tue, 18 Mar 2008 18:20:45 +0100 Andi Kleen wrote:
    >
    > > > What's the permission problem? executable-but-not-readable files? Could

    > >
    > > Not writable.

    >
    > Oh.
    >
    > I doubt if a userspace implementation would even try to alter the ELF
    > files, really - there seems to be no point in it. This is just complexity


Well, the information has to be somewhere and I think the ELF file
is the best location for it. It makes it the most transparent to the user.

    > > Yes it could, but I don't even want to think about all the issues of
    > > doing such an interface. It is basically a microkernelish approach.
    > > I prefer monolithic simplicity.

    >
    > It's not complex at all. Pass a null-terminated pathname to the server and
    > keep running. The server will asynchronously read your pages for you.


    But how do you update the bitmap in your scheme?

    > > And I am really not a friend of unnecessary complexity. In the end
    > > complexity hurts you, no matter if it is in ring 3 or ring 0.

    >
    > There is no complexity here.


    I have my doubts on that if you consider update too.

    -Andi


  17. Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

On Wed, Mar 19, 2008 at 2:04 AM, Andrew Morton wrote:
    > The requirement to write to an executable sounds like a bit of a
    > showstopper.


    Agreed. In addition, it makes all kinds of tools misbehave. I can
    see extensions to the ELF format and those would require relinking
    binaries, but without this you create chaos. ELF is so nice because
    it describes the entire file and we can handle everything according to
rules. If you just somehow magically patch some new data structure in,
this will break things.

    Furthermore, by adding all this data to the end of the file you'll
    normally create unnecessary costs. The parts of the binaries which
    are used at runtime start at the front. At the end there might be a
    lot of data which isn't needed at runtime and is therefore normally
    not read.


    > if it proves useful, build it all into libc..


    I could see that. But to handle it efficiently kernel support is
    needed. Especially since I don't think the currently proposed
    "learning mode" is adequate. Over many uses of a program all kinds of
    pages will be needed. Far more than in most cases. The prefetching
    should really only cover the commonly used code paths in the program.
    If you pull in everything, this will have advantages if you have that
    much page cache to spare. In that case just prefetching the entire
    file is even easier. No, such an improved method has to be more
    selective.

But if we're selective in the loading of the pages we'll
(unfortunately) end up with holes. Possibly many of them. This would
mean with today's interfaces a large number of madvise() calls. What
would be needed is kernel support which takes a bitmap, each bit
representing a page. This bitmap could be allocated as part of the
binary by the linker, with appropriate ELF data structures supporting
it so that strip et al. won't stumble. To fill in the bitmaps one can
have a separate tool which is explicitly asked to update the
bitmap data. To collect the page fault data one could use systemtap.
It's easy enough to write a script which monitors the minor page
faults for each binary and writes the data into a file. The binary
update tool can then use the information from that file to generate the
bitmap.

  18. Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

    On Wed, 19 Mar 2008 15:45:16 -0700
    "Ulrich Drepper" wrote:

    > On Wed, Mar 19, 2008 at 2:04 AM, Andrew Morton wrote:
    > > The requirement to write to an executable sounds like a bit of a
    > > showstopper.

    >
    > Agreed. In addition, it makes all kinds of tools misbehave. I can
    > see extensions to the ELF format and those would require relinking
    > binaries, but without this you create chaos. ELF is so nice because
    > it describes the entire file and we can handle everything according to
    > rules. If you just somehow magically patch some new data structure in
    > this will break things.
    >
    > Furthermore, by adding all this data to the end of the file you'll
    > normally create unnecessary costs. The parts of the binaries which
    > are used at runtime start at the front. At the end there might be a
    > lot of data which isn't needed at runtime and is therefore normally
    > not read.
    >
    >
    > > if it proves useful, build it all into libc..

    >
    > I could see that. But to handle it efficiently kernel support is
    > needed. Especially since I don't think the currently proposed
    > "learning mode" is adequate. Over many uses of a program all kinds of
    > pages will be needed. Far more than in most cases. The prefetching
    > should really only cover the commonly used code paths in the program.
    > If you pull in everything, this will have advantages if you have that
    > much page cache to spare. In that case just prefetching the entire
    > file is even easier. No, such an improved method has to be more
    > selective.


    yes, ultimately we'd end up pulling in most of the executable that way, and
    a 90%-good-enough solution is perhaps to just suck the whole thing into
    pagecache.
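
    That "suck the whole thing into pagecache" approach is easy to try
    from userspace. A minimal sketch, assuming posix_fadvise() is
    available and using /bin/sh purely as a stand-in for some executable
    (readahead(2) would also work on Linux):

    ```c
    /* Sketch of the 90%-good-enough solution: hint the kernel to pull
     * an entire file into the page cache before it is demand-paged. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int prefetch_whole_file(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return -1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return -1; }

        /* Advise that the whole range [0, st_size) will be needed soon;
         * the kernel can then read it in sorted offset order instead of
         * seeking around in demand-fault order. */
        int err = posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);
        close(fd);
        return err ? -1 : 0;
    }

    int main(void)
    {
        if (prefetch_whole_file("/bin/sh") == 0)
            printf("prefetch requested\n");
        return 0;
    }
    ```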

    I did some work on that many years ago and I do recall that it helped, but
    I forget how much.

    There's a very very easy way of testing this though. filemap_fault()
    _already_ does readaround when it gets a major fault. And this is tunable
    via /sys/block/sda/queue/read_ahead_kb. So set that to "infinity" to pull
    the whole file into pagecache at the first major fault and run some
    benchmarks.

    If that proves useful we could look at separating read()'s readahead
    tunable from filemap_fault()'s tunable.

    Note! Due to interaction between the linker and common filesystems,
    executables tend to be very poorly laid out on disk: the blocks are
    out-of-order, often grossly. So one shouldn't do performance testing of
    this form against executable files which were written directly by
    /usr/bin/ld. First use `cp' to straighten the file out.

    > But if we're selective in the loading of the pages we'll
    > (unfortunately) end up with holes. Possibly many of them. This would
    > mean with today's interfaces a large number of madvise() calls. What
    > would be needed is kernel support which takes a bitmap, each bit
    > representing a page. This bitmap could be allocated as part of the
    > binary by the linker, with appropriate ELF data structures supporting
    > it so that strip et al. won't stumble. To fill in the bitmaps one can
    > have a separate tool which is explicitly asked to update the
    > bitmap data. To collect the page fault data one could use systemtap.
    > It's easy enough to write a script which monitors the minor page
    > faults for each binary and writes the data into a file. The binary
    > update tool can then use the information from that file to generate
    > the bitmap.


    bitmap-based madvise() or fadvise() sounds pretty easy to do.
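
    In today's userspace terms, a bitmap-based fadvise() could be
    approximated by walking the bitmap over file offsets and issuing one
    POSIX_FADV_WILLNEED per contiguous run of set pages. A sketch, with
    the file name and bitmap contents invented for illustration:

    ```c
    /* Sketch of a userspace stand-in for a bitmap-based fadvise():
     * one POSIX_FADV_WILLNEED per contiguous run of set bits. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Returns the number of ranges advised, or -1 on open failure. */
    static int fadvise_bitmap(const char *path, const unsigned char *bm,
                              int npages, long pagesz)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;

        int ranges = 0;
        for (int i = 0; i < npages; ) {
            if (!(bm[i / 8] >> (i % 8) & 1)) { i++; continue; }
            int start = i;
            while (i < npages && (bm[i / 8] >> (i % 8) & 1))
                i++;
            if (posix_fadvise(fd, (off_t)start * pagesz,
                              (off_t)(i - start) * pagesz,
                              POSIX_FADV_WILLNEED) == 0)
                ranges++;
        }
        close(fd);
        return ranges;
    }

    int main(void)
    {
        /* Hypothetical working set: pages 0-2, 5, and 8-10. */
        unsigned char bm[2] = { 0x27, 0x07 };
        int n = fadvise_bitmap("/bin/sh", bm, 16, sysconf(_SC_PAGESIZE));
        printf("advised %d ranges\n", n);
        return 0;
    }
    ```

    A real kernel interface taking the bitmap directly would save the
    per-range system call, but the hole-skipping logic would look much
    the same.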

  19. Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

    On Wed, 19 Mar 2008 02:04:40 -0700, Andrew Morton wrote:

    > Assuming that all users have the same access pattern might be a little
    > inefficient. There might be some advantage to making it per-user, dunno.


    In the Dark Side of operating systems, the prefetching system they use
    can log several access patterns for a single executable, because a single
    executable can have different behaviours even for the same user, depending
    on what parameters the executable is passed and what COM machinery it
    uses. For example, wmplayer.exe can play a DVD, rip a CD, listen to a
    music stream, etc. Different usages, different access patterns. Linux
    probably faces the same problem (bash, cat...)

    An alternative design for a userspace solution that doesn't need
    LD_PRELOAD is to use CONFIG_PROC_EVENTS to get notifications of which
    processes are started; those can be used to poll each process's /proc
    files or to try to preload data (asynchronously, and perhaps a bit
    hacky).

    But if a kernel patch is really needed to implement this properly, maybe
    it'd be worth taking a look at the prefetch project that the Ubuntu guys
    are apparently going to merge in the next Ubuntu development release (8.10)...
    https://wiki.ubuntu.com/DesktopTeam/Specs/Prefetch

    There are even kernel patches:
    http://code.google.com/p/prefetch/so...etch-core.diff
    http://code.google.com/p/prefetch/so...etch-boot.diff

  20. Re: [PATCH prototype] [0/8] Predictive bitmaps for ELF executables

    From: Andrew Morton
    Date: Wed, 19 Mar 2008 16:12:11 -0700

    > I did some work on that many years ago and I do recall that it
    > helped, but I forget how much.


    I wrote such a patch ages ago as well.

    Frankly, based upon my experiences then and what I know now, I think
    it's a lose to do this.

    Better to 1) have enough RAM and 2) make the reclaim smarter about
    important "executable" page cache pages.
