2.6.24 regression: deadlock on coredump of big process - Kernel


Thread: 2.6.24 regression: deadlock on coredump of big process

  1. 2.6.24 regression: deadlock on coredump of big process

    Here is a program that can deadlock any kernel from 2.6.24-rc1 to
    current (2.6.25-git11). The deadlock happens due to oom during a
    coredump of a large process with multiple threads.

    git-bisect reveals the following patch as the culprit:

    commit 557ed1fa2620dc119adb86b34c614e152a629a80
    Author: Nick Piggin
    Date: Tue Oct 16 01:24:40 2007 -0700

    remove ZERO_PAGE

    The commit b5810039a54e5babf428e9a1e89fc1940fabff11 contains the note

    A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
    (and thus mapcounted and count towards shared rss). These writes to
    the struct page could cause excessive cacheline bouncing on big
    systems. There are a number of ways this could be addressed if it is
    an issue.

    And indeed this cacheline bouncing has shown up on large SGI systems.
    There was a situation where an Altix system was essentially livelocked
    tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
    This situation can be avoided in userspace, but it does highlight the
    potential scalability problem with refcounting ZERO_PAGE, and corner
    cases where it can really hurt (we don't want the system to livelock!).

    There are several broad ways to fix this problem:
    1. add back some special casing to avoid refcounting ZERO_PAGE
    2. per-node or per-cpu ZERO_PAGES
    3. remove the ZERO_PAGE completely

    I will argue for 3. The others should also fix the problem, but they
    result in more complex code than does 3, with little or no real benefit
    that I can see.

    Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
    false optimisation: if an application is performance critical, it would
    not be doing many read faults of new memory, or at least it could be
    expected to write to that memory soon afterwards. If cache or memory use
    is critical, it should not be working with a significant number of
    ZERO_PAGEs anyway (a more compact representation of zeroes should be
    used).

    As a sanity check -- measuring on my desktop system, there are never many
    mappings to the ZERO_PAGE (e.g. 2 or 3), thus memory usage here should not
    increase much without it.

    When running a make -j4 kernel compile on my dual core system, there are
    about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
    ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
    is torn down without being COWed). So removing ZERO_PAGE will save 1,000
    page faults per second when running kbuild, while keeping it only saves
    less than 1 page clearing operation per second. 1 page clear is cheaper
    than a thousand faults, presumably, so there isn't an obvious loss.

    Neither the logical argument nor these basic tests give a guarantee of no
    regressions. However, this is a reasonable opportunity to try to remove
    the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
    we can reintroduce it and just avoid refcounting it.

    The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked. I don't see
    much use to them except on benchmarks. All other users of ZERO_PAGE are
    converted just to use ZERO_PAGE(0) for simplicity. We can look at
    replacing them all and maybe ripping out ZERO_PAGE completely when we are
    more satisfied with this solution.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus "snif" Torvalds


    I have verified that 2.6.24.5 with the above patch reverted coredumps
    successfully instead of deadlocking. The patch doesn't revert cleanly
    on 2.6.25, so I didn't test that.

    Before finding the above patch with git-bisect, I tested Daniel
    Phillips' bio throttling patch
    (http://zumastor.googlecode.com/svn/t...throttle.patch),
    but it didn't prevent the deadlock.

    I am testing on a simple 32-bit x86 system with a Pentium 4 CPU, 256 MB
    of DRAM, an IDE hard drive, and an ext3 filesystem. The software is a
    bare-bones embedded environment; I am not running X, device-mapper,
    RAID, or anything fancy. I am not using any network file systems. The
    system has no swap space, and in fact swap support is disabled in the
    kernel configuration.

    When the kernel is deadlocked, I can still switch VTs.
    When typing characters on the keyboard, the characters are printed to the
    screen if I am on the VT that the core-dumping program was using, but
    keypresses on other VTs do not show up. The system is basically unusable.

    If I let the kernel write the core file to disk directly (the default
    behavior), then pressing Alt-SysRq-I to kill all tasks and free up some
    memory will un-deadlock the coredump for a short while, but then it
    deadlocks again. If I pipe the core file to a program
    (via /proc/sys/kernel/core_pattern) which doesn't write it to disk
    (e.g. cat > /dev/null), then the kernel still deadlocks, but Alt-SysRq-I
    kills the program and breaks the pipe, which un-deadlocks the system and
    allows me to log back in.

    Below is the program that triggers the deadlock; compile with
    -D_REENTRANT -lpthread.

    Tony Battersby
    Cybernetics

    ---------------------------------------------------------------------

    #include <assert.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/resource.h>
    #include <sys/time.h>
    #include <unistd.h>

    static void allow_coredump(void)
    {
        struct rlimit rlim;

        rlim.rlim_cur = RLIM_INFINITY;
        rlim.rlim_max = RLIM_INFINITY;
        if (setrlimit(RLIMIT_CORE, &rlim))
        {
            perror("setrlimit");
            exit(EXIT_FAILURE);
        }
    }

    static void *thread_func(void *arg)
    {
        for (;;)
        {
            sleep(100);
        }
        return NULL;
    }

    static void spawn_threads(int n_threads)
    {
        pthread_attr_t thread_attr;
        int i;

        pthread_attr_init(&thread_attr);
        printf("spawn %d threads\n", n_threads);
        for (i = 0; i < n_threads; i++)
        {
            pthread_t thread;

            if (pthread_create(&thread, &thread_attr, &thread_func, NULL))
            {
                perror("pthread_create");
                exit(EXIT_FAILURE);
            }
        }
        sleep(1);
    }

    static size_t get_max_malloc_len(void)
    {
        size_t min = 1;
        size_t max = ~((size_t) 0);

        do
        {
            size_t len = min + (max - min) / 2;
            void *ptr;

            ptr = malloc(len);
            if (ptr == NULL)
            {
                max = len - 1;
            }
            else
            {
                free(ptr);
                min = len + 1;
            }
        } while (min < max);

        return min;
    }

    static void malloc_all_but_x_mb(unsigned free_mb)
    {
        size_t len = get_max_malloc_len();
        void *ptr;

        assert(len > free_mb << 20);
        len -= free_mb << 20;

        printf("allocate %zu MB\n", len >> 20);
        ptr = malloc(len);
        assert(ptr != NULL);

        /* if this triggers the oom killer, then use a larger free_mb */
        memset(ptr, 0xab, len);
    }

    static void trigger_segfault(void)
    {
        printf("trigger segfault\n");
        *(int *) 0 = 0;
    }

    int main(int argc, char *argv[])
    {
        allow_coredump();
        malloc_all_but_x_mb(16);
        spawn_threads(10);
        trigger_segfault();
        return 0;
    }


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: 2.6.24 regression: deadlock on coredump of big process

    KAMEZAWA Hiroyuki wrote:
    > On Mon, 28 Apr 2008 11:11:46 -0400
    > Tony Battersby wrote:
    >
    >
    >> Below is the program that triggers the deadlock; compile with
    >> -D_REENTRANT -lpthread.
    >>
    >>

    > What happens if you change the stack size (of pthreads) to be smaller?
    > (maybe ulimit -s will also work for threads.)
    >
    > Thanks,
    > -Kame
    >
    >
    >


    If I leave more memory free by changing the argument to
    malloc_all_but_x_mb(), then I have to increase the number of threads
    required to trigger the deadlock. Changing the thread stack size via
    setrlimit(RLIMIT_STACK) also changes the number of threads that are
    required to trigger the deadlock. For example, with
    malloc_all_but_x_mb(16) and the default stack size of 8 MB, <= 5 threads
    will coredump successfully, and >= 6 threads will deadlock. With
    malloc_all_but_x_mb(16) and a reduced stack size of 4096 bytes, <= 8
    threads will coredump successfully, and >= 9 threads will deadlock.

    Also note that the "free" command reports 10 MB free memory while the
    program is running before the segfault is triggered.

    Tony


  3. [PATCH] more ZERO_PAGE handling ( was 2.6.24 regression: deadlock on coredump of big process)

    On Tue, 29 Apr 2008 10:10:58 -0400
    Tony Battersby wrote:
    >
    > If I leave more memory free by changing the argument to
    > malloc_all_but_x_mb(), then I have to increase the number of threads
    > required to trigger the deadlock. Changing the thread stack size via
    > setrlimit(RLIMIT_STACK) also changes the number of threads that are
    > required to trigger the deadlock. For example, with
    > malloc_all_but_x_mb(16) and the default stack size of 8 MB, <= 5 threads
    > will coredump successfully, and >= 6 threads will deadlock. With
    > malloc_all_but_x_mb(16) and a reduced stack size of 4096 bytes, <= 8
    > threads will coredump successfully, and >= 9 threads will deadlock.
    >
    > Also note that the "free" command reports 10 MB free memory while the
    > program is running before the segfault is triggered.
    >

    Hmm, my idea is below.

    Nick's remove ZERO_PAGE patch includes following change

    ==
    @@ -2252,39 +2158,24 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
    spinlock_t *ptl;
    {

    - page_add_new_anon_rmap(page, vma, address);
    - } else {
    - /* Map the ZERO_PAGE - vm_page_prot is readonly */
    - page = ZERO_PAGE(address);
    - page_cache_get(page);
    - entry = mk_pte(page, vma->vm_page_prot);
    + if (unlikely(anon_vma_prepare(vma)))
    + goto oom;
    + page = alloc_zeroed_user_highpage_movable(vma, address);
    ==

    The change above avoids using ZERO_PAGE for read page faults on anonymous
    vmas. This is reasonable, I think. But at coredump time, tons of
    read-but-never-written pages can be allocated.
    ==
    coredump
    -> get_user_pages()
    -> follow_page() returns NULL
    -> handle mm fault
    -> do_anonymous page.
    ==
    follow_page() returns ZERO_PAGE only when the page table is not available.

    So, making follow_page() return ZERO_PAGE can fix the extra memory
    consumption at core dump. (Maybe someone can think of another fix.)

    How about this patch? Could you try it?

    (I'm sorry but I'll not be active for a week because my servers are powered off.)

    -Kame

    ==
    follow_page() returns ZERO_PAGE if the page table is not available,
    but returns NULL if the pte is not present.

    Signed-off-by: KAMEZAWA Hiroyuki

    Index: linux-2.6.25/mm/memory.c
    ===================================================================
    --- linux-2.6.25.orig/mm/memory.c
    +++ linux-2.6.25/mm/memory.c
    @@ -926,15 +926,15 @@ struct page *follow_page(struct vm_area_
    page = NULL;
    pgd = pgd_offset(mm, address);
    if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
    - goto no_page_table;
    + goto null_or_zeropage;

    pud = pud_offset(pgd, address);
    if (pud_none(*pud) || unlikely(pud_bad(*pud)))
    - goto no_page_table;
    + goto null_or_zeropage;

    pmd = pmd_offset(pud, address);
    if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
    - goto no_page_table;
    + goto null_or_zeropage;

    if (pmd_huge(*pmd)) {
    BUG_ON(flags & FOLL_GET);
    @@ -947,8 +947,10 @@ struct page *follow_page(struct vm_area_
    goto out;

    pte = *ptep;
    - if (!pte_present(pte))
    - goto unlock;
    + if (!(flags & FOLL_WRITE) && !pte_present(pte)) {
    + pte_unmap_unlock(ptep, ptl);
    + goto null_or_zeropage;
    + }
    if ((flags & FOLL_WRITE) && !pte_write(pte))
    goto unlock;
    page = vm_normal_page(vma, address, pte);
    @@ -968,7 +970,7 @@ unlock:
    out:
    return page;

    -no_page_table:
    +null_or_zeropage:
    /*
    * When core dumping an enormous anonymous area that nobody
    * has touched so far, we don't want to allocate page tables.


  4. Re: [PATCH] more ZERO_PAGE handling ( was 2.6.24 regression: deadlock on coredump of big process)

    On Wed, Apr 30, 2008 at 01:25:16PM +0900, KAMEZAWA Hiroyuki wrote:
    > On Tue, 29 Apr 2008 10:10:58 -0400
    > Tony Battersby wrote:
    > >
    > > If I leave more memory free by changing the argument to
    > > malloc_all_but_x_mb(), then I have to increase the number of threads
    > > required to trigger the deadlock. Changing the thread stack size via
    > > setrlimit(RLIMIT_STACK) also changes the number of threads that are
    > > required to trigger the deadlock. For example, with
    > > malloc_all_but_x_mb(16) and the default stack size of 8 MB, <= 5 threads
    > > will coredump successfully, and >= 6 threads will deadlock. With
    > > malloc_all_but_x_mb(16) and a reduced stack size of 4096 bytes, <= 8
    > > threads will coredump successfully, and >= 9 threads will deadlock.
    > >
    > > Also note that the "free" command reports 10 MB free memory while the
    > > program is running before the segfault is triggered.
    > >

    > Hmm, my idea is below.
    >
    > Nick's remove ZERO_PAGE patch includes following change
    >
    > ==
    > @@ -2252,39 +2158,24 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
    > spinlock_t *ptl;
    > {
    >
    > - page_add_new_anon_rmap(page, vma, address);
    > - } else {
    > - /* Map the ZERO_PAGE - vm_page_prot is readonly */
    > - page = ZERO_PAGE(address);
    > - page_cache_get(page);
    > - entry = mk_pte(page, vma->vm_page_prot);
    > + if (unlikely(anon_vma_prepare(vma)))
    > + goto oom;
    > + page = alloc_zeroed_user_highpage_movable(vma, address);
    > ==
    >
    > above change is for avoiding to use ZERO_PAGE at read-page-fault to anonymous
    > vma. This is reasonable I think. But at coredump, tons of read-but-never-written
    > pages can be allocated.
    > ==
    > coredump
    > -> get_user_pages()
    > -> follow_page() returns NULL
    > -> handle mm fault
    > -> do_anonymous page.
    > ==
    > follow_page() returns ZERO_PAGE only when page table is not avaiable.
    >
    > So, making follow_page() return ZERO_PAGE can be a fix of extra memory
    > consumpstion at core dump. (Maybe someone can think of other fix.)
    >
    > how about this patch ? Could you try ?
    >


    Ah, yes I stupidly missed this detail of follow_page. Definitely your
    patch is a good idea, and I think it would be a good idea even when
    we still had ZERO_PAGE, because it would prevent pagetable clearing
    from having to do extra teardown work here.

    Good catch, and I agree with your patch. Thanks


    > (I'm sorry but I'll not be active for a week because my servers are powered off.)
    >
    > -Kame
    >
    > ==
    > follow_page() returns ZERO_PAGE if page table is not available.
    > but returns NULL pte is not presentl.
    >
    > Signed-off-by: KAMEZAWA Hiroyuki
    >
    > Index: linux-2.6.25/mm/memory.c
    > ================================================== =================
    > --- linux-2.6.25.orig/mm/memory.c
    > +++ linux-2.6.25/mm/memory.c
    > @@ -926,15 +926,15 @@ struct page *follow_page(struct vm_area_
    > page = NULL;
    > pgd = pgd_offset(mm, address);
    > if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
    > - goto no_page_table;
    > + goto null_or_zeropage;
    >
    > pud = pud_offset(pgd, address);
    > if (pud_none(*pud) || unlikely(pud_bad(*pud)))
    > - goto no_page_table;
    > + goto null_or_zeropage;
    >
    > pmd = pmd_offset(pud, address);
    > if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
    > - goto no_page_table;
    > + goto null_or_zeropage;
    >
    > if (pmd_huge(*pmd)) {
    > BUG_ON(flags & FOLL_GET);
    > @@ -947,8 +947,10 @@ struct page *follow_page(struct vm_area_
    > goto out;
    >
    > pte = *ptep;
    > - if (!pte_present(pte))
    > - goto unlock;
    > + if (!(flags & FOLL_WRITE) && !pte_present(pte)) {
    > + pte_unmap_unlock(ptep, ptl);
    > + goto null_or_zeropage;
    > + }
    > if ((flags & FOLL_WRITE) && !pte_write(pte))
    > goto unlock;
    > page = vm_normal_page(vma, address, pte);
    > @@ -968,7 +970,7 @@ unlock:
    > out:
    > return page;
    >
    > -no_page_table:
    > +null_or_zeropage:
    > /*
    > * When core dumping an enormous anonymous area that nobody
    > * has touched so far, we don't want to allocate page tables.


  5. Re: [PATCH] more ZERO_PAGE handling ( was 2.6.24 regression: deadlock on coredump of big process)

    On Wed, Apr 30, 2008 at 08:03:33AM +0300, Mika Penttilä wrote:
    > KAMEZAWA Hiroyuki wrote:
    > >On Tue, 29 Apr 2008 10:10:58 -0400
    > >Tony Battersby wrote:
    > >
    > >>If I leave more memory free by changing the argument to
    > >>malloc_all_but_x_mb(), then I have to increase the number of threads
    > >>required to trigger the deadlock. Changing the thread stack size via
    > >>setrlimit(RLIMIT_STACK) also changes the number of threads that are
    > >>required to trigger the deadlock. For example, with
    > >>malloc_all_but_x_mb(16) and the default stack size of 8 MB, <= 5 threads
    > >>will coredump successfully, and >= 6 threads will deadlock. With
    > >>malloc_all_but_x_mb(16) and a reduced stack size of 4096 bytes, <= 8
    > >>threads will coredump successfully, and >= 9 threads will deadlock.
    > >>
    > >>Also note that the "free" command reports 10 MB free memory while the
    > >>program is running before the segfault is triggered.
    > >>
    > >>

    > >Hmm, my idea is below.
    > >
    > >Nick's remove ZERO_PAGE patch includes following change
    > >
    > >==
    > >@@ -2252,39 +2158,24 @@ static int do_anonymous_page(struct mm_struct *mm,
    > >struct vm_area_struct *vma,
    > > spinlock_t *ptl;
    > > {
    > >
    > >- page_add_new_anon_rmap(page, vma, address);
    > >- } else {
    > >- /* Map the ZERO_PAGE - vm_page_prot is readonly */
    > >- page = ZERO_PAGE(address);
    > >- page_cache_get(page);
    > >- entry = mk_pte(page, vma->vm_page_prot);
    > >+ if (unlikely(anon_vma_prepare(vma)))
    > >+ goto oom;
    > >+ page = alloc_zeroed_user_highpage_movable(vma, address);
    > >==
    > >
    > >above change is for avoiding to use ZERO_PAGE at read-page-fault to
    > >anonymous
    > >vma. This is reasonable I think. But at coredump, tons of
    > >read-but-never-written pages can be allocated.
    > >==
    > >coredump
    > > -> get_user_pages()
    > > -> follow_page() returns NULL
    > > -> handle mm fault
    > > -> do_anonymous page.
    > >==
    > >follow_page() returns ZERO_PAGE only when page table is not avaiable.
    > >
    > >So, making follow_page() return ZERO_PAGE can be a fix of extra memory
    > >consumpstion at core dump. (Maybe someone can think of other fix.)
    > >
    > >how about this patch ? Could you try ?
    > >
    > >(I'm sorry but I'll not be active for a week because my servers are
    > >powered off.)
    > >
    > >-Kame
    > >
    > >

    >
    >
    > But surely we still have to handle the fault, for instance for swapped
    > pages, for other uses of get_user_pages().


    Yeah, it does need to test for pte_none.


  6. Re: [PATCH] more ZERO_PAGE handling ( was 2.6.24 regression: deadlock on coredump of big process)

    On Wed, 30 Apr 2008 08:03:33 +0300
    Mika Penttilä wrote:

    > > ==
    > > @@ -2252,39 +2158,24 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
    > > spinlock_t *ptl;
    > > {
    > >
    > > - page_add_new_anon_rmap(page, vma, address);
    > > - } else {
    > > - /* Map the ZERO_PAGE - vm_page_prot is readonly */
    > > - page = ZERO_PAGE(address);
    > > - page_cache_get(page);
    > > - entry = mk_pte(page, vma->vm_page_prot);
    > > + if (unlikely(anon_vma_prepare(vma)))
    > > + goto oom;
    > > + page = alloc_zeroed_user_highpage_movable(vma, address);
    > > ==
    > >
    > > above change is for avoiding to use ZERO_PAGE at read-page-fault to anonymous
    > > vma. This is reasonable I think. But at coredump, tons of read-but-never-written
    > > pages can be allocated.
    > > ==
    > > coredump
    > > -> get_user_pages()
    > > -> follow_page() returns NULL
    > > -> handle mm fault
    > > -> do_anonymous page.
    > > ==
    > > follow_page() returns ZERO_PAGE only when page table is not avaiable.
    > >
    > > So, making follow_page() return ZERO_PAGE can be a fix of extra memory
    > > consumpstion at core dump. (Maybe someone can think of other fix.)
    > >
    > > how about this patch ? Could you try ?
    > >
    > > (I'm sorry but I'll not be active for a week because my servers are powered off.)
    > >
    > > -Kame
    > >
    > >

    >
    >
    > But sure we still have to handle the fault for instance swapped pages,
    > for other uses of get_user_pages();
    >

    Ah, my bad... how about this? I changed !pte_present() to pte_none().

    -Kame
    ==
    follow_page() returns ZERO_PAGE if a page table is not available,
    but returns NULL if a page table exists. If NULL, handle_mm_fault()
    allocates a new page.

    This behavior increases page consumption at coredump, which tends
    to do read-once-but-never-written page faults. This patch avoids
    that.

    Changelog:
    - fixed to check pte_none(), not !pte_present().


    Signed-off-by: KAMEZAWA Hiroyuki

    Index: linux-2.6.25/mm/memory.c
    ===================================================================
    --- linux-2.6.25.orig/mm/memory.c
    +++ linux-2.6.25/mm/memory.c
    @@ -926,15 +926,15 @@ struct page *follow_page(struct vm_area_
    page = NULL;
    pgd = pgd_offset(mm, address);
    if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
    - goto no_page_table;
    + goto null_or_zeropage;

    pud = pud_offset(pgd, address);
    if (pud_none(*pud) || unlikely(pud_bad(*pud)))
    - goto no_page_table;
    + goto null_or_zeropage;

    pmd = pmd_offset(pud, address);
    if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
    - goto no_page_table;
    + goto null_or_zeropage;

    if (pmd_huge(*pmd)) {
    BUG_ON(flags & FOLL_GET);
    @@ -947,8 +947,10 @@ struct page *follow_page(struct vm_area_
    goto out;

    pte = *ptep;
    - if (!pte_present(pte))
    - goto unlock;
    + if (!(flags & FOLL_WRITE) && pte_none(pte)) {
    + pte_unmap_unlock(ptep, ptl);
    + goto null_or_zeropage;
    + }
    if ((flags & FOLL_WRITE) && !pte_write(pte))
    goto unlock;
    page = vm_normal_page(vma, address, pte);
    @@ -968,7 +970,7 @@ unlock:
    out:
    return page;

    -no_page_table:
    +null_or_zeropage:
    /*
    * When core dumping an enormous anonymous area that nobody
    * has touched so far, we don't want to allocate page tables.




  7. Re: [PATCH] more ZERO_PAGE handling ( was 2.6.24 regression: deadlock on coredump of big process)

    On Wed, Apr 30, 2008 at 02:17:38PM +0900, KAMEZAWA Hiroyuki wrote:
    > On Wed, 30 Apr 2008 08:03:33 +0300
    > Mika Penttilä wrote:
    >
    > > > ==
    > > > @@ -2252,39 +2158,24 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
    > > > spinlock_t *ptl;
    > > > {
    > > >
    > > > - page_add_new_anon_rmap(page, vma, address);
    > > > - } else {
    > > > - /* Map the ZERO_PAGE - vm_page_prot is readonly */
    > > > - page = ZERO_PAGE(address);
    > > > - page_cache_get(page);
    > > > - entry = mk_pte(page, vma->vm_page_prot);
    > > > + if (unlikely(anon_vma_prepare(vma)))
    > > > + goto oom;
    > > > + page = alloc_zeroed_user_highpage_movable(vma, address);
    > > > ==
    > > >
    > > > above change is for avoiding to use ZERO_PAGE at read-page-fault to anonymous
    > > > vma. This is reasonable I think. But at coredump, tons of read-but-never-written
    > > > pages can be allocated.
    > > > ==
    > > > coredump
    > > > -> get_user_pages()
    > > > -> follow_page() returns NULL
    > > > -> handle mm fault
    > > > -> do_anonymous page.
    > > > ==
    > > > follow_page() returns ZERO_PAGE only when page table is not avaiable.
    > > >
    > > > So, making follow_page() return ZERO_PAGE can be a fix of extra memory
    > > > consumpstion at core dump. (Maybe someone can think of other fix.)
    > > >
    > > > how about this patch ? Could you try ?
    > > >
    > > > (I'm sorry but I'll not be active for a week because my servers are powered off.)
    > > >
    > > > -Kame
    > > >
    > > >

    > >
    > >
    > > But sure we still have to handle the fault for instance swapped pages,
    > > for other uses of get_user_pages();
    > >

    > Ah, my bad.....how about this ? I changed !pte_present() to pte_none().
    >
    > -Kame
    > ==
    > follow_page() returns ZERO_PAGE if a page table is not available.
    > but returns NULL if a page table exists. If NULL, handle_mm_fault()
    > allocates a new page.
    >
    > This behavior increases page consumption at coredump, which tend
    > to do read-once-but-never-written page fault. This patch is
    > for avoiding this.


    I think you still need the pte_present test too, otherwise !present and
    !none ptes can slip through and be treated as present.

    Something like this should do:

        if (!pte_present(pte)) {
            if (pte_none(pte)) {
                pte_unmap_unlock
                goto null_or_zeropage;
            }
            goto unlock;
        }


    >
    > Changelog:
    > - fixed to check pte_none() not !pte_present().
    >
    >
    > Signed-off-by: KAMEZAWA Hiroyuki
    >
    > Index: linux-2.6.25/mm/memory.c
    > ================================================== =================
    > --- linux-2.6.25.orig/mm/memory.c
    > +++ linux-2.6.25/mm/memory.c
    > @@ -926,15 +926,15 @@ struct page *follow_page(struct vm_area_
    > page = NULL;
    > pgd = pgd_offset(mm, address);
    > if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
    > - goto no_page_table;
    > + goto null_or_zeropage;
    >
    > pud = pud_offset(pgd, address);
    > if (pud_none(*pud) || unlikely(pud_bad(*pud)))
    > - goto no_page_table;
    > + goto null_or_zeropage;
    >
    > pmd = pmd_offset(pud, address);
    > if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
    > - goto no_page_table;
    > + goto null_or_zeropage;
    >
    > if (pmd_huge(*pmd)) {
    > BUG_ON(flags & FOLL_GET);
    > @@ -947,8 +947,10 @@ struct page *follow_page(struct vm_area_
    > goto out;
    >
    > pte = *ptep;
    > - if (!pte_present(pte))
    > - goto unlock;
    > + if (!(flags & FOLL_WRITE) && pte_none(pte)) {
    > + pte_unmap_unlock(ptep, ptl);
    > + goto null_or_zeropage;
    > + }
    > if ((flags & FOLL_WRITE) && !pte_write(pte))
    > goto unlock;
    > page = vm_normal_page(vma, address, pte);
    > @@ -968,7 +970,7 @@ unlock:
    > out:
    > return page;
    >
    > -no_page_table:
    > +null_or_zeropage:
    > /*
    > * When core dumping an enormous anonymous area that nobody
    > * has touched so far, we don't want to allocate page tables.
    >
    >


  8. Re: [PATCH] more ZERO_PAGE handling ( was 2.6.24 regression: deadlock on coredump of big process)

    On Wed, 30 Apr 2008 07:19:32 +0200
    Nick Piggin wrote:

    >
    > Something like this should do:
    > if (!pte_present(pte)) {
    > if (pte_none(pte)) {
    > pte_unmap_unlock
    > goto null_or_zeropage;
    > }
    > goto unlock;
    > }
    >

    Sorry for the broken work, and thank you for the advice.
    Updated.

    Regards,
    -Kame
    ==
    follow_page() returns ZERO_PAGE if a page table is not available,
    but returns NULL if a page table exists. If NULL, handle_mm_fault()
    allocates a new page.

    This behavior increases page consumption at coredump, which tends
    to do read-once-but-never-written page faults. This patch avoids
    that.

    Changelog:
    - fixed to check pte_present()/pte_none() in the proper way.


    Signed-off-by: KAMEZAWA Hiroyuki

    Index: linux-2.6.25/mm/memory.c
    ===================================================================
    --- linux-2.6.25.orig/mm/memory.c
    +++ linux-2.6.25/mm/memory.c
    @@ -926,15 +926,15 @@ struct page *follow_page(struct vm_area_
    page = NULL;
    pgd = pgd_offset(mm, address);
    if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
    - goto no_page_table;
    + goto null_or_zeropage;

    pud = pud_offset(pgd, address);
    if (pud_none(*pud) || unlikely(pud_bad(*pud)))
    - goto no_page_table;
    + goto null_or_zeropage;

    pmd = pmd_offset(pud, address);
    if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
    - goto no_page_table;
    + goto null_or_zeropage;

    if (pmd_huge(*pmd)) {
    BUG_ON(flags & FOLL_GET);
    @@ -947,8 +947,13 @@ struct page *follow_page(struct vm_area_
    goto out;

    pte = *ptep;
    - if (!pte_present(pte))
    + if (!pte_present(pte)) {
    + if (!(flags & FOLL_WRITE) && pte_none(pte)) {
    + pte_unmap_unlock(ptep, ptl);
    + goto null_or_zeropage;
    + }
    goto unlock;
    + }
    if ((flags & FOLL_WRITE) && !pte_write(pte))
    goto unlock;
    page = vm_normal_page(vma, address, pte);
    @@ -968,7 +973,7 @@ unlock:
    out:
    return page;

    -no_page_table:
    +null_or_zeropage:
    /*
    * When core dumping an enormous anonymous area that nobody
    * has touched so far, we don't want to allocate page tables.


  9. Re: [PATCH] more ZERO_PAGE handling ( was 2.6.24 regression: deadlock on coredump of big process)

    On Wed, Apr 30, 2008 at 02:35:42PM +0900, KAMEZAWA Hiroyuki wrote:
    > On Wed, 30 Apr 2008 07:19:32 +0200
    > Nick Piggin wrote:
    >
    > >
    > > Something like this should do:
    > >         if (!pte_present(pte)) {
    > >                 if (pte_none(pte)) {
    > >                         pte_unmap_unlock
    > >                         goto null_or_zeropage;
    > >                 }
    > >                 goto unlock;
    > >         }
    > >

    > Sorry for the broken work, and thank you for the advice.
    > Updated.


    Don't be sorry. The most important thing is that you found this tricky
    problem. That's very good work IMO!

    >
    > Regards,
    > -Kame
    > ==
    > follow_page() returns ZERO_PAGE if a page table is not available,
    > but returns NULL if a page table exists. If NULL, handle_mm_fault()
    > allocates a new page.
    >
    > This behavior increases page consumption at coredump, which tends
    > to do read-once-but-never-written page faults. This patch avoids
    > that.
    >
    > Changelog:
    > - fixed to check pte_present()/pte_none() in proper way.
    >
    >
    > Signed-off-by: KAMEZAWA Hiroyuki
    >
    > Index: linux-2.6.25/mm/memory.c
    > ===================================================================
    > --- linux-2.6.25.orig/mm/memory.c
    > +++ linux-2.6.25/mm/memory.c
    > @@ -926,15 +926,15 @@ struct page *follow_page(struct vm_area_
    >          page = NULL;
    >          pgd = pgd_offset(mm, address);
    >          if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
    > -                goto no_page_table;
    > +                goto null_or_zeropage;
    >
    >          pud = pud_offset(pgd, address);
    >          if (pud_none(*pud) || unlikely(pud_bad(*pud)))
    > -                goto no_page_table;
    > +                goto null_or_zeropage;
    >
    >          pmd = pmd_offset(pud, address);
    >          if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
    > -                goto no_page_table;
    > +                goto null_or_zeropage;
    >
    >          if (pmd_huge(*pmd)) {
    >                  BUG_ON(flags & FOLL_GET);
    > @@ -947,8 +947,13 @@ struct page *follow_page(struct vm_area_
    >                  goto out;
    >
    >          pte = *ptep;
    > -        if (!pte_present(pte))
    > +        if (!pte_present(pte)) {
    > +                if (!(flags & FOLL_WRITE) && pte_none(pte)) {
    > +                        pte_unmap_unlock(ptep, ptl);
    > +                        goto null_or_zeropage;
    > +                }
    >                  goto unlock;
    > +        }


    Just a small nitpick: I guess you don't need this FOLL_WRITE test because
    null_or_zeropage will test FOLL_ANON which implies !FOLL_WRITE. It should give
    slightly smaller code.

    Otherwise, looks good to me:

    Acked-by: Nick Piggin


    >          if ((flags & FOLL_WRITE) && !pte_write(pte))
    >                  goto unlock;
    >          page = vm_normal_page(vma, address, pte);
    > @@ -968,7 +973,7 @@ unlock:
    >  out:
    >          return page;
    >
    > -no_page_table:
    > +null_or_zeropage:
    >          /*
    >           * When core dumping an enormous anonymous area that nobody
    >           * has touched so far, we don't want to allocate page tables.


  10. Re: [PATCH] more ZERO_PAGE handling ( was 2.6.24 regression: deadlock on coredump of big process)

    KAMEZAWA Hiroyuki wrote:
    > On Tue, 29 Apr 2008 10:10:58 -0400
    > Tony Battersby wrote:
    >
    >> If I leave more memory free by changing the argument to
    >> malloc_all_but_x_mb(), then I have to increase the number of threads
    >> required to trigger the deadlock. Changing the thread stack size via
    >> setrlimit(RLIMIT_STACK) also changes the number of threads that are
    >> required to trigger the deadlock. For example, with
    >> malloc_all_but_x_mb(16) and the default stack size of 8 MB, <= 5 threads
    >> will coredump successfully, and >= 6 threads will deadlock. With
    >> malloc_all_but_x_mb(16) and a reduced stack size of 4096 bytes, <= 8
    >> threads will coredump successfully, and >= 9 threads will deadlock.
    >>
    >> Also note that the "free" command reports 10 MB free memory while the
    >> program is running before the segfault is triggered.
    >>
    >>

    > Hmm, my idea is below.
    >
    > Nick's remove ZERO_PAGE patch includes following change
    >
    > ==
    > @@ -2252,39 +2158,24 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
    >          spinlock_t *ptl;
    > {
    >
    > -                page_add_new_anon_rmap(page, vma, address);
    > -        } else {
    > -                /* Map the ZERO_PAGE - vm_page_prot is readonly */
    > -                page = ZERO_PAGE(address);
    > -                page_cache_get(page);
    > -                entry = mk_pte(page, vma->vm_page_prot);
    > +        if (unlikely(anon_vma_prepare(vma)))
    > +                goto oom;
    > +        page = alloc_zeroed_user_highpage_movable(vma, address);
    > ==
    >
    > The above change avoids using ZERO_PAGE at a read page fault on an
    > anonymous vma. That is reasonable, I think. But at coredump, tons of
    > read-but-never-written pages can be allocated.
    > ==
    > coredump
    > -> get_user_pages()
    > -> follow_page() returns NULL
    > -> handle mm fault
    > -> do_anonymous page.
    > ==
    > follow_page() returns ZERO_PAGE only when the page table is not available.
    >
    > So, making follow_page() return ZERO_PAGE can be a fix for the extra
    > memory consumption at core dump. (Maybe someone can think of another fix.)
    >
    > how about this patch ? Could you try ?
    >
    > (I'm sorry but I'll not be active for a week because my servers are powered off.)
    >
    > -Kame
    >
    >



    But surely we still have to handle the fault (for instance for swapped
    pages) for the other uses of get_user_pages().

    --Mika



    > ==
    > follow_page() returns ZERO_PAGE if the page table is not available,
    > but returns NULL if the pte is not present.
    >
    > Signed-off-by: KAMEZAWA Hiroyuki
    >
    > Index: linux-2.6.25/mm/memory.c
    > ===================================================================
    > --- linux-2.6.25.orig/mm/memory.c
    > +++ linux-2.6.25/mm/memory.c
    > @@ -926,15 +926,15 @@ struct page *follow_page(struct vm_area_
    >          page = NULL;
    >          pgd = pgd_offset(mm, address);
    >          if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
    > -                goto no_page_table;
    > +                goto null_or_zeropage;
    >
    >          pud = pud_offset(pgd, address);
    >          if (pud_none(*pud) || unlikely(pud_bad(*pud)))
    > -                goto no_page_table;
    > +                goto null_or_zeropage;
    >
    >          pmd = pmd_offset(pud, address);
    >          if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
    > -                goto no_page_table;
    > +                goto null_or_zeropage;
    >
    >          if (pmd_huge(*pmd)) {
    >                  BUG_ON(flags & FOLL_GET);
    > @@ -947,8 +947,10 @@ struct page *follow_page(struct vm_area_
    >                  goto out;
    >
    >          pte = *ptep;
    > -        if (!pte_present(pte))
    > -                goto unlock;
    > +        if (!(flags & FOLL_WRITE) && !pte_present(pte)) {
    > +                pte_unmap_unlock(ptep, ptl);
    > +                goto null_or_zeropage;
    > +        }
    >          if ((flags & FOLL_WRITE) && !pte_write(pte))
    >                  goto unlock;
    >          page = vm_normal_page(vma, address, pte);
    > @@ -968,7 +970,7 @@ unlock:
    >  out:
    >          return page;
    >
    > -no_page_table:
    > +null_or_zeropage:
    >          /*
    >           * When core dumping an enormous anonymous area that nobody
    >           * has touched so far, we don't want to allocate page tables.
    >



  11. Re: [PATCH] more ZERO_PAGE handling ( was 2.6.24 regression: deadlock on coredump of big process)

    KAMEZAWA Hiroyuki wrote:
    > follow_page() returns ZERO_PAGE if a page table is not available,
    > but returns NULL if a page table exists. If NULL, handle_mm_fault()
    > allocates a new page.
    >
    > This behavior increases page consumption at coredump, which tends
    > to do read-once-but-never-written page faults. This patch avoids
    > that.
    >
    > Changelog:
    > - fixed to check pte_present()/pte_none() in proper way.
    >
    >
    > Signed-off-by: KAMEZAWA Hiroyuki
    >
    > Index: linux-2.6.25/mm/memory.c
    > ===================================================================
    > --- linux-2.6.25.orig/mm/memory.c
    > +++ linux-2.6.25/mm/memory.c
    > @@ -926,15 +926,15 @@ struct page *follow_page(struct vm_area_
    >          page = NULL;
    >          pgd = pgd_offset(mm, address);
    >          if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
    > -                goto no_page_table;
    > +                goto null_or_zeropage;
    >
    >          pud = pud_offset(pgd, address);
    >          if (pud_none(*pud) || unlikely(pud_bad(*pud)))
    > -                goto no_page_table;
    > +                goto null_or_zeropage;
    >
    >          pmd = pmd_offset(pud, address);
    >          if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
    > -                goto no_page_table;
    > +                goto null_or_zeropage;
    >
    >          if (pmd_huge(*pmd)) {
    >                  BUG_ON(flags & FOLL_GET);
    > @@ -947,8 +947,13 @@ struct page *follow_page(struct vm_area_
    >                  goto out;
    >
    >          pte = *ptep;
    > -        if (!pte_present(pte))
    > +        if (!pte_present(pte)) {
    > +                if (!(flags & FOLL_WRITE) && pte_none(pte)) {
    > +                        pte_unmap_unlock(ptep, ptl);
    > +                        goto null_or_zeropage;
    > +                }
    >                  goto unlock;
    > +        }
    >          if ((flags & FOLL_WRITE) && !pte_write(pte))
    >                  goto unlock;
    >          page = vm_normal_page(vma, address, pte);
    > @@ -968,7 +973,7 @@ unlock:
    >  out:
    >          return page;
    >
    > -no_page_table:
    > +null_or_zeropage:
    >          /*
    >           * When core dumping an enormous anonymous area that nobody
    >           * has touched so far, we don't want to allocate page tables.
    >
    >
    >

    This patch fixes the deadlock. Tested on 2.6.24.5. Thanks!

    Tested-by: Tony Battersby

    Tony


  12. Re: Re: [PATCH] more ZERO_PAGE handling ( was 2.6.24 regression: deadlock on coredump of big process)

    >This patch fixes the deadlock. Tested on 2.6.24.5. Thanks!
    >
    >Tested-by: Tony Battersby
    >

    Thank you for testing. I'll post this again when I'm back.

    Regards,
    -Kame

  13. Re: [PATCH] more ZERO_PAGE handling ( was 2.6.24 regression: deadlock on coredump of big process)

    On Wed, 30 Apr 2008 08:11:25 +0200
    Nick Piggin wrote:

    > >          pte = *ptep;
    > > -        if (!pte_present(pte))
    > > +        if (!pte_present(pte)) {
    > > +                if (!(flags & FOLL_WRITE) && pte_none(pte)) {
    > > +                        pte_unmap_unlock(ptep, ptl);
    > > +                        goto null_or_zeropage;
    > > +                }
    > >                  goto unlock;
    > > +        }

    >
    > Just a small nitpick: I guess you don't need this FOLL_WRITE test because
    > null_or_zeropage will test FOLL_ANON which implies !FOLL_WRITE. It should give
    > slightly smaller code.
    >
    > Otherwise, looks good to me:
    >

    Hmm, but

    do_execve()
    -> copy_strings()
    -> get_arg_page()
    -> get_user_pages()

    can take a write page fault on anonymous memory (and that's a valid
    operation).

    So, I think it's safer to keep the FOLL_WRITE check here.

    Thanks,
    -Kame


  14. Re: [PATCH] more ZERO_PAGE handling ( was 2.6.24 regression: deadlock on coredump of big process)

    On Wed, 7 May 2008 11:14:04 +0900
    KAMEZAWA Hiroyuki wrote:

    > > >          pte = *ptep;
    > > > -        if (!pte_present(pte))
    > > > +        if (!pte_present(pte)) {
    > > > +                if (!(flags & FOLL_WRITE) && pte_none(pte)) {
    > > > +                        pte_unmap_unlock(ptep, ptl);
    > > > +                        goto null_or_zeropage;
    > > > +                }
    > > >                  goto unlock;
    > > > +        }

    > >
    > > Just a small nitpick: I guess you don't need this FOLL_WRITE test because
    > > null_or_zeropage will test FOLL_ANON which implies !FOLL_WRITE. It should give
    > > slightly smaller code.
    > >
    > > Otherwise, looks good to me:
    > >

    > Hmm, but
    >
    > do_execve()
    > -> copy_strings()
    > -> get_arg_page()
    > -> get_user_pages()
    >
    > can take a write page fault on anonymous memory (and that's a valid
    > operation).
    >
    > So, I think it's safer to keep the FOLL_WRITE check here.
    >

    BTW, in the above case, is it safe to return ZERO_PAGE() when the
    pgd/pud/pmd is not available? (The above path expands the stack at exec.)

    Thanks,
    -Kame





+ Reply to Thread