
Thread: [PATCH] mm: use a radix-tree to make do_move_pages() complexity linear

  1. [PATCH] mm: use a radix-tree to make do_move_pages() complexity linear

    Add a radix-tree in do_move_pages() to associate each page with
    the struct page_to_node that describes its migration.
    new_page_node() can now easily find out the page_to_node of the
    given page instead of traversing the whole page_to_node array.
    So the overall complexity is linear instead of quadratic.

    We still need the page_to_node array since it is allocated by the
    caller (sys_move_pages()) and used by do_pages_stat() when no target
    nodes are given by the application. And we need room to store all
    these page_to_node entries for do_move_pages() as well anyway.

    If a page is given twice by the application, the old code would
    return -EBUSY (failure from the second isolate_lru_page()). Now,
    radix_tree_insert() will return -EEXIST, and we convert it back
    to -EBUSY to keep the user-space ABI.

    The radix-tree is emptied at the end of do_move_pages() since
    new_page_node() doesn't know when an entry is used for the last
    time (unmap_and_move() could try another pass later).
    Marking pp->page as ZERO_PAGE(0) was actually never used. We now
    set it to NULL when pp is not in the radix-tree. It is faster
    than doing a loop of radix_tree_gang_lookup()+delete().

    Signed-off-by: Brice Goglin
    Signed-off-by: Nathalie Furmento

    --- a/mm/migrate.c
    +++ b/mm/migrate.c
    @@ -31,6 +31,7 @@
    #include
    #include
    #include
    +#include <linux/radix-tree.h>

    #include "internal.h"

    @@ -840,12 +841,10 @@ struct page_to_node {
    static struct page *new_page_node(struct page *p, unsigned long private,
    int **result)
    {
    - struct page_to_node *pm = (struct page_to_node *)private;
    + struct radix_tree_root *root = (struct radix_tree_root *) private;
    + struct page_to_node *pm = (struct page_to_node *) radix_tree_lookup(root, (unsigned long) p);

    - while (pm->node != MAX_NUMNODES && pm->page != p)
    - pm++;
    -
    - if (pm->node == MAX_NUMNODES)
    + if (!pm)
    return NULL;

    *result = &pm->status;
    @@ -865,6 +864,7 @@ static int do_move_pages(struct mm_struct *mm, struct page_to_node *pm,
    int err;
    struct page_to_node *pp;
    LIST_HEAD(pagelist);
    + RADIX_TREE(pmroot, GFP_KERNEL);

    down_read(&mm->mmap_sem);

    @@ -876,11 +876,8 @@ static int do_move_pages(struct mm_struct *mm, struct page_to_node *pm,
    struct vm_area_struct *vma;
    struct page *page;

    - /*
    - * A valid page pointer that will not match any of the
    - * pages that will be moved.
    - */
    - pp->page = ZERO_PAGE(0);
    + /* set to NULL as long as pp is not in the radix-tree */
    + pp->page = NULL;

    err = -EFAULT;
    vma = find_vma(mm, pp->addr);
    @@ -900,9 +897,7 @@ static int do_move_pages(struct mm_struct *mm, struct page_to_node *pm,
    if (PageReserved(page)) /* Check for zero page */
    goto put_and_set;

    - pp->page = page;
    err = page_to_nid(page);
    -
    if (err == pp->node)
    /*
    * Node already in the right place
    @@ -914,6 +909,23 @@ static int do_move_pages(struct mm_struct *mm, struct page_to_node *pm,
    !migrate_all)
    goto put_and_set;

    + /*
    + * Insert pp in the radix-tree so that new_page_node() can find it
    + * while only knowing the page.
    + * There cannot be any duplicate since isolate_lru_page() would fail
    + * below anyway.
    + */
    + err = radix_tree_insert(&pmroot, (unsigned long) page, pp);
    + if (err < 0) {
    + if (err == -EEXIST)
    + /* the page cannot be migrated twice */
    + err = -EBUSY;
    + goto put_and_set;
    + }
    +
    + /* set pp->page for real now that pp is in the radix-tree */
    + pp->page = page;
    +
    err = isolate_lru_page(page, &pagelist);
    put_and_set:
    /*
    @@ -927,12 +939,17 @@ set_status:
    }

    if (!list_empty(&pagelist))
    - err = migrate_pages(&pagelist, new_page_node,
    - (unsigned long)pm);
    + err = migrate_pages(&pagelist, new_page_node, (unsigned long) &pmroot);
    else
    err = -ENOENT;

    up_read(&mm->mmap_sem);
    +
    + /* empty the radix-tree now that new_page_node() will not be called anymore */
    + for(pp = pm; pp->node != MAX_NUMNODES; pp++)
    + if (pp->page)
    + radix_tree_delete(&pmroot, (unsigned long) pp->page);
    +
    return err;
    }



    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: [PATCH] mm: use a radix-tree to make do_move_pages() complexity linear

    On Thu, 09 Oct 2008 14:32:26 +0200
    Brice Goglin wrote:

    > Add a radix-tree in do_move_pages() to associate each page with
    > the struct page_to_node that describes its migration.
    > new_page_node() can now easily find out the page_to_node of the
    > given page instead of traversing the whole page_to_node array.
    > So the overall complexity is linear instead of quadratic.
    >
    > We still need the page_to_node array since it is allocated by the
    > caller (sys_move_pages()) and used by do_pages_stat() when no target
    > nodes are given by the application. And we need room to store all
    > these page_to_node entries for do_move_pages() as well anyway.
    >
    > If a page is given twice by the application, the old code would
    > return -EBUSY (failure from the second isolate_lru_page()). Now,
    > radix_tree_insert() will return -EEXIST, and we convert it back
    > to -EBUSY to keep the user-space ABI.
    >
    > The radix-tree is emptied at the end of do_move_pages() since
    > new_page_node() doesn't know when an entry is used for the last
    > time (unmap_and_move() could try another pass later).
    > Marking pp->page as ZERO_PAGE(0) was actually never used. We now
    > set it to NULL when pp is not in the radix-tree. It is faster
    > than doing a loop of radix_tree_gang_lookup()+delete().


    Any O(n*n) code always catches up with us in the end. But I do think
    that to merge this code we'd need some description of the problem which
    we fixed.

    Please send a description of the situation under which the current code
    performs unacceptably. Some before-and-after quantitative measurements
    would be good.

    Because it could be (as far as I know) that the problem is purely
    theoretical, in which case we might not want the patch at all.

    Thanks.

  3. Re: [PATCH] mm: use a radix-tree to make do_move_pages() complexity linear

    Andrew Morton wrote:
    > On Thu, 09 Oct 2008 14:32:26 +0200
    > Brice Goglin wrote:
    >
    >
    >> Add a radix-tree in do_move_pages() to associate each page with
    >> the struct page_to_node that describes its migration.
    >> new_page_node() can now easily find out the page_to_node of the
    >> given page instead of traversing the whole page_to_node array.
    >> So the overall complexity is linear instead of quadratic.
    >>
    >> We still need the page_to_node array since it is allocated by the
    >> caller (sys_move_pages()) and used by do_pages_stat() when no target
    >> nodes are given by the application. And we need room to store all
    >> these page_to_node entries for do_move_pages() as well anyway.
    >>
    >> If a page is given twice by the application, the old code would
    >> return -EBUSY (failure from the second isolate_lru_page()). Now,
    >> radix_tree_insert() will return -EEXIST, and we convert it back
    >> to -EBUSY to keep the user-space ABI.
    >>
    >> The radix-tree is emptied at the end of do_move_pages() since
    >> new_page_node() doesn't know when an entry is used for the last
    >> time (unmap_and_move() could try another pass later).
    >> Marking pp->page as ZERO_PAGE(0) was actually never used. We now
    >> set it to NULL when pp is not in the radix-tree. It is faster
    >> than doing a loop of radix_tree_gang_lookup()+delete().
    >>

    >
    > Any O(n*n) code always catches up with us in the end. But I do think
    > that to merge this code we'd need some description of the problem which
    > we fixed.
    >
    > Please send a description of the situation under which the current code
    > performs unacceptably. Some before-and-after quantitative measurements
    > would be good.
    >
    > Because it could be (as far as I know) that the problem is purely
    > theoretical, in which case we might not want the patch at all.
    >


    Just try sys_move_pages() on a 10-100MB buffer, you'll get something
    like 50MB/s on a recent Opteron machine. This throughput decreases
    significantly with the number of pages. With this patch, we get about
    350MB/s and the throughput is stable when the migrated buffer gets
    larger. I don't have detailed numbers at hand; I'll send them by Monday.

    Brice


  4. Re: [PATCH] mm: use a radix-tree to make do_move_pages() complexity linear

    Brice Goglin wrote:

    > Just try sys_move_pages() on a 10-100MB buffer, you'll get something
    > like 50MB/s on a recent Opteron machine. This throughput decreases
    > significantly with the number of pages. With this patch, we get about
    > 350MB/s and the throughput is stable when the migrated buffer gets
    > larger. I don't have detailed numbers at hand; I'll send them by Monday.


    Migration throughput is optimal for sys_move_pages() and the cpuset migration.
    Some comparison would be useful.

    With 100MB you have ~250k pages, which will require a vmalloc of 4MB for
    the struct pm_struct array to control the migration of each individual page.

    Would it be possible to restructure this in such a way that we work in chunks
    of 100 or so pages each so that we can avoid the vmalloc?

    We also could do a kmalloc for each individual struct pm_struct with the radix
    tree which would also avoid the vmalloc but still keep the need to allocate
    4MB for temporary struct pm_structs.

    Or get rid of the pm_struct altogether by storing the address of the node
    vector somewhere and retrieve the node as needed from the array. This would
    allow storing the struct page * pointers in the radix tree.

  5. Re: [PATCH] mm: use a radix-tree to make do_move_pages() complexity linear

    Brice Goglin wrote:
    > Andrew Morton wrote:
    >
    >> On Thu, 09 Oct 2008 14:32:26 +0200
    >> Brice Goglin wrote:
    >>
    >>
    >>
    >>> Add a radix-tree in do_move_pages() to associate each page with
    >>> the struct page_to_node that describes its migration.
    >>> new_page_node() can now easily find out the page_to_node of the
    >>> given page instead of traversing the whole page_to_node array.
    >>> So the overall complexity is linear instead of quadratic.
    >>>
    >>> We still need the page_to_node array since it is allocated by the
    >>> caller (sys_move_pages()) and used by do_pages_stat() when no target
    >>> nodes are given by the application. And we need room to store all
    >>> these page_to_node entries for do_move_pages() as well anyway.
    >>>
    >>> If a page is given twice by the application, the old code would
    >>> return -EBUSY (failure from the second isolate_lru_page()). Now,
    >>> radix_tree_insert() will return -EEXIST, and we convert it back
    >>> to -EBUSY to keep the user-space ABI.
    >>>
    >>> The radix-tree is emptied at the end of do_move_pages() since
    >>> new_page_node() doesn't know when an entry is used for the last
    >>> time (unmap_and_move() could try another pass later).
    >>> Marking pp->page as ZERO_PAGE(0) was actually never used. We now
    >>> set it to NULL when pp is not in the radix-tree. It is faster
    >>> than doing a loop of radix_tree_gang_lookup()+delete().
    >>>
    >>>

    >> Any O(n*n) code always catches up with us in the end. But I do think
    >> that to merge this code we'd need some description of the problem which
    >> we fixed.
    >>
    >> Please send a description of the situation under which the current code
    >> performs unacceptably. Some before-and-after quantitative measurements
    >> would be good.
    >>
    >> Because it could be (as far as I know) that the problem is purely
    >> theoretical, in which case we might not want the patch at all.
    >>
    >>

    >
    > Just try sys_move_pages() on a 10-100MB buffer, you'll get something
    > like 50MB/s on a recent Opteron machine. This throughput decreases
    > significantly with the number of pages. With this patch, we get about
    > 350MB/s and the throughput is stable when the migrated buffer gets
    > larger. I don't have detailed numbers at hand; I'll send them by Monday.
    >


    Here are some quickly gathered numbers for the duration of move_pages().
    It's between nodes #2 and #3 of a quad-quad-core Opteron 2347HE with
    2.6.27-rc5 + perfmon2:

    buffer (kB)   move_pages (us)   move_pages with patch (us)
    4000          12351             12580
    40000         223975            123024

    As you can see, with the patch applied, migrating 10x more pages takes
    about 10x longer. Without the patch, it takes 18x longer.

    I'll see if I can implement Christoph's ideas.

    Brice


  6. Re: [PATCH] mm: use a radix-tree to make do_move_pages() complexity linear

    On Saturday 11 October 2008 19:54, Brice Goglin wrote:
    > Christoph Lameter wrote:
    > > Would it be possible to restructure this in such a way that we work in
    > > chunks of 100 or so pages each so that we can avoid the vmalloc?
    > >
    > > We also could do a kmalloc for each individual struct pm_struct with the
    > > radix tree which would also avoid the vmalloc but still keep the need to
    > > allocate 4MB for temporary struct pm_structs.
    > >
    > > Or get rid of the pm_struct altogether by storing the address of the node
    > > vector somewhere and retrieve the node as needed from the array. This
    > > would allow storing the struct page * pointers in the radix tree.

    >
    > I have been thinking about all this, and here are some ideas:
    > * do_pages_stat() may easily be rewritten without the huge pm array. It
    > just needs the user-space pointers to the page and status arrays. It
    > traverses the first array and, for each page, does get_user() on the
    > address, retrieves the page's node, and put_user()s the result into the
    > status array. No need to allocate a single page_to_node structure.
    > * If we split the migration into small chunks (such as one page's worth
    > of pm's), the quadratic complexity doesn't matter that much. There will
    > be at most several dozen pm's in the chunk array, so the linear search
    > in new_page_node() won't be that slow; it may even be faster than the
    > overall cost of adding a radix-tree to improve this search. So we can
    > keep the internal code unchanged and just add the chunking around it.
    > * One thing that bothers me is move_pages() returning -ENOENT when no
    > pages are given to migrate_pages(). I don't see why having 100/100 pages
    > not migrated would return a different error than having only 99/100
    > pages not migrated. We have the status array to place -ENOENT for all
    > these pages. If the user doesn't know where his pages are allocated, he
    > shouldn't get a different return value depending on how many pages were
    > already on the right node. And actually, this convention makes
    > user-space applications harder to write, since you need to treat -ENOENT
    > as a success unless you already knew for sure where your pages were
    > allocated. And the big thing is that this convention makes the chunking
    > painfully/uselessly more complex. Breaking user-ABI is bad, but fixing a
    > crazy ABI...
    >
    > Here are some other numbers from the above (dirty) implementation,
    > migrating from node #2 to #3 on a quad-quad-core Opteron 2347HE with
    > vanilla 2.6.27:
    >
    > length   move_pages (us)   move_pages with patch (us)
    > 4kB      126               98
    > 40kB     198               168
    > 400kB    963               937
    > 4MB      12503             11930
    > 40MB     246867            11848
    >
    > It seems to be even slightly better than the previous patch (but the
    > kernels are a bit different). And I quickly checked that it scales well
    > up to 4GB buffers.


    If you are worried about vmalloc overhead, I'd suggest testing with -mm.
    I've rewritten the vmap code so it is now slightly scalable and sane to
    use.


  7. Re: [PATCH] mm: use a radix-tree to make do_move_pages() complexity linear

    Christoph Lameter wrote:
    > Would it be possible to restructure this in such a way that we work in chunks
    > of 100 or so pages each so that we can avoid the vmalloc?
    >
    > We also could do a kmalloc for each individual struct pm_struct with the radix
    > tree which would also avoid the vmalloc but still keep the need to allocate
    > 4MB for temporary struct pm_structs.
    >
    > Or get rid of the pm_struct altogether by storing the address of the node
    > vector somewhere and retrieve the node as needed from the array. This would
    > allow storing the struct page * pointers in the radix tree.
    >


    I have been thinking about all this, and here are some ideas:
    * do_pages_stat() may easily be rewritten without the huge pm array. It
    just needs the user-space pointers to the page and status arrays. It
    traverses the first array and, for each page, does get_user() on the
    address, retrieves the page's node, and put_user()s the result into the
    status array. No need to allocate a single page_to_node structure.
    * If we split the migration into small chunks (such as one page's worth
    of pm's), the quadratic complexity doesn't matter that much. There will
    be at most several dozen pm's in the chunk array, so the linear search
    in new_page_node() won't be that slow; it may even be faster than the
    overall cost of adding a radix-tree to improve this search. So we can
    keep the internal code unchanged and just add the chunking around it.
    * One thing that bothers me is move_pages() returning -ENOENT when no
    pages are given to migrate_pages(). I don't see why having 100/100 pages
    not migrated would return a different error than having only 99/100
    pages not migrated. We have the status array to place -ENOENT for all
    these pages. If the user doesn't know where his pages are allocated, he
    shouldn't get a different return value depending on how many pages were
    already on the right node. And actually, this convention makes
    user-space applications harder to write, since you need to treat -ENOENT
    as a success unless you already knew for sure where your pages were
    allocated. And the big thing is that this convention makes the chunking
    painfully/uselessly more complex. Breaking user-ABI is bad, but fixing a
    crazy ABI...

    Here are some other numbers from the above (dirty) implementation,
    migrating from node #2 to #3 on a quad-quad-core Opteron 2347HE with
    vanilla 2.6.27:

    length   move_pages (us)   move_pages with patch (us)
    4kB      126               98
    40kB     198               168
    400kB    963               937
    4MB      12503             11930
    40MB     246867            11848

    It seems to be even slightly better than the previous patch (but the
    kernels are a bit different). And I quickly checked that it scales well
    up to 4GB buffers.

    Brice


  8. Re: [PATCH] mm: use a radix-tree to make do_move_pages() complexity linear

    Nick Piggin wrote:
    > If you are worried about vmalloc overhead, I'd suggest testing with -mm.
    > I've rewritten the vmap code so it is now slightly scalable and sane to
    > use.
    >


    I am actually only worried about move_pages() performance for large
    buffers. The vmalloc overhead is probably negligible against the
    quadratic duration of move_pages() for dozens of MB.

    Brice


  9. Re: [PATCH] mm: use a radix-tree to make do_move_pages() complexity linear

    Brice Goglin wrote:
    > * One thing that bothers me is move_pages() returning -ENOENT when no
    > pages are given to migrate_pages(). I don't see why having 100/100 pages
    > not migrated would return a different error than having only 99/100
    > pages not migrated. We have the status array to place -ENOENT for all
    > these pages. If the user doesn't know where his pages are allocated, he
    > shouldn't get a different return value depending on how many pages were
    > already on the right node. And actually, this convention makes
    > user-space applications harder to write, since you need to treat -ENOENT
    > as a success unless you already knew for sure where your pages were
    > allocated. And the big thing is that this convention makes the chunking
    > painfully/uselessly more complex. Breaking user-ABI is bad, but fixing a
    > crazy ABI...
    >

    I do not think that move_pages() is used that frequently. Changing the
    API slightly as you suggest would not be that big of a deal.

