[RFC PATCH 0/4] Reclaim page capture v1 - Kernel

Thread: [RFC PATCH 0/4] Reclaim page capture v1

  1. [RFC PATCH 0/4] Reclaim page capture v1

    For some time we have been looking at mechanisms for improving the
    availability of larger allocations under load. One of the options we
    have explored is capturing pages freed under direct reclaim, in order
    to increase the chances of free pages coalescing before they are
    subject to reallocation by racing allocators.
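
    To make the shape of the idea concrete, here is a very rough sketch
    (shrink_zone_capturing() is an illustrative name only; the real
    interfaces are those added in patch 4/4):

        /*
         * Sketch: reclaim with capture.  Pages freed by this reclaim
         * pass go to a private list rather than the buddy free lists,
         * so they can coalesce without racing allocators taking them.
         */
        struct page *reclaim_and_capture(struct zone *zone, unsigned int order)
        {
                LIST_HEAD(capture_list);
                struct page *page;

                shrink_zone_capturing(zone, order, &capture_list);

                list_for_each_entry(page, &capture_list, lru) {
                        /*
                         * A real version would unlink the page here and
                         * release the rest of the list back to buddy.
                         */
                        if (page_order(page) >= order)
                                return page;
                }

                return NULL;    /* fall back to the normal allocator */
        }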

    Following this email is a patch stack implementing page capture during
    direct reclaim. It consists of four patches. The first two simply pull
    out existing code into helpers for reuse. The third makes buddy's use
    of struct page explicit. The fourth contains the meat of the changes,
    and its leader contains a much fuller description of the feature.

    I have done a fair amount of comparative testing with and without
    this patch set, and in broad brush I am seeing improvements in
    hugepage allocation success (at the worst-case size) of the order
    of 5%, which under load for systems with larger hugepages represents
    a doubling of the number of pages available. Testing is still ongoing
    to confirm these results.

    Against: 2.6.26-rc6 (with the explicit page flags patches)

    Comments?

    -apw

  2. [PATCH 2/4] pull out zone cpuset and watermark checks for reuse

    When allocating we need to confirm that the zone we are about to
    allocate from is acceptable to the cpuset we are in, and that the
    allocation does not violate the zone watermarks. Pull these checks
    out so we can reuse them in a later patch.

    Signed-off-by: Andy Whitcroft
    ---
    mm/page_alloc.c | 62 ++++++++++++++++++++++++++++++++++++++----------------
    1 files changed, 43 insertions(+), 19 deletions(-)

    diff --git a/mm/page_alloc.c b/mm/page_alloc.c
    index 758ecf1..4d9c4e8 100644
    --- a/mm/page_alloc.c
    +++ b/mm/page_alloc.c
    @@ -1248,6 +1248,44 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
            return 1;
    }

    +/*
    + * Return 1 if this zone is an acceptable source given the cpuset
    + * constraints.
    + */
    +static inline int zone_cpuset_ok(struct zone *zone,
    +                                int alloc_flags, gfp_t gfp_mask)
    +{
    +        if ((alloc_flags & ALLOC_CPUSET) &&
    +            !cpuset_zone_allowed_softwall(zone, gfp_mask))
    +                return 0;
    +        return 1;
    +}
    +
    +/*
    + * Return 1 if this zone is within the watermarks specified by the
    + * allocation flags.
    + */
    +static inline int zone_alloc_ok(struct zone *zone, int order,
    +                int classzone_idx, int alloc_flags, gfp_t gfp_mask)
    +{
    +        if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
    +                unsigned long mark;
    +                if (alloc_flags & ALLOC_WMARK_MIN)
    +                        mark = zone->pages_min;
    +                else if (alloc_flags & ALLOC_WMARK_LOW)
    +                        mark = zone->pages_low;
    +                else
    +                        mark = zone->pages_high;
    +                if (!zone_watermark_ok(zone, order, mark,
    +                                classzone_idx, alloc_flags)) {
    +                        if (!zone_reclaim_mode ||
    +                            !zone_reclaim(zone, gfp_mask, order))
    +                                return 0;
    +                }
    +        }
    +        return 1;
    +}
    +
    #ifdef CONFIG_NUMA
    /*
     * zlc_setup - Setup for "zonelist cache". Uses cached zone data to
    @@ -1401,25 +1439,11 @@ zonelist_scan:
            if (NUMA_BUILD && zlc_active &&
                    !zlc_zone_worth_trying(zonelist, z, allowednodes))
                            continue;
    -        if ((alloc_flags & ALLOC_CPUSET) &&
    -            !cpuset_zone_allowed_softwall(zone, gfp_mask))
    -                goto try_next_zone;
    -
    -        if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
    -                unsigned long mark;
    -                if (alloc_flags & ALLOC_WMARK_MIN)
    -                        mark = zone->pages_min;
    -                else if (alloc_flags & ALLOC_WMARK_LOW)
    -                        mark = zone->pages_low;
    -                else
    -                        mark = zone->pages_high;
    -                if (!zone_watermark_ok(zone, order, mark,
    -                                classzone_idx, alloc_flags)) {
    -                        if (!zone_reclaim_mode ||
    -                            !zone_reclaim(zone, gfp_mask, order))
    -                                goto this_zone_full;
    -                }
    -        }
    +        if (!zone_cpuset_ok(zone, alloc_flags, gfp_mask))
    +                goto try_next_zone;
    +        if (!zone_alloc_ok(zone, order, classzone_idx,
    +                        alloc_flags, gfp_mask))
    +                goto this_zone_full;

            page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask);
            if (page)
    --
    1.5.6.1.201.g3e7d3


  3. [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse

    When we are about to release a page we perform a number of actions
    on that page. We clear down any anonymous mappings, confirm that
    the page is safe to release, run the debug checks for freed locks
    and objects, and update the kernel page mapping should that be
    required. Pull this processing out into a helper function for reuse
    in a later patch.

    Note that we do not convert the similar cleardown in
    free_hot_cold_page() as the optimiser is unable to squash the loops
    when the helper is inlined there.

    Signed-off-by: Andy Whitcroft
    ---
    mm/page_alloc.c | 43 ++++++++++++++++++++++++++++++-------------
    1 files changed, 30 insertions(+), 13 deletions(-)

    diff --git a/mm/page_alloc.c b/mm/page_alloc.c
    index 8aa93f3..758ecf1 100644
    --- a/mm/page_alloc.c
    +++ b/mm/page_alloc.c
    @@ -468,6 +468,35 @@ static inline int free_pages_check(struct page *page)
    }

    /*
    + * Prepare this page for release to the buddy. Sanity check the page.
    + * Returns 1 if the page is safe to free.
    + */
    +static inline int free_page_prepare(struct page *page, int order)
    +{
    +        int i;
    +        int reserved = 0;
    +
    +        if (PageAnon(page))
    +                page->mapping = NULL;
    +
    +        for (i = 0 ; i < (1 << order) ; ++i)
    +                reserved += free_pages_check(page + i);
    +        if (reserved)
    +                return 0;
    +
    +        if (!PageHighMem(page)) {
    +                debug_check_no_locks_freed(page_address(page),
    +                                PAGE_SIZE << order);
    +                debug_check_no_obj_freed(page_address(page),
    +                                PAGE_SIZE << order);
    +        }
    +        arch_free_page(page, order);
    +        kernel_map_pages(page, 1 << order, 0);
    +
    +        return 1;
    +}
    +
    +/*
     * Frees a list of pages.
     * Assumes all pages on list are in same zone, and of same order.
     * count is the number of pages to free.
    @@ -508,22 +537,10 @@ static void free_one_page(struct zone *zone, struct page *page, int order)
    static void __free_pages_ok(struct page *page, unsigned int order)
    {
            unsigned long flags;
    -        int i;
    -        int reserved = 0;

    -        for (i = 0 ; i < (1 << order) ; ++i)
    -                reserved += free_pages_check(page + i);
    -        if (reserved)
    +        if (!free_page_prepare(page, order))
                    return;

    -        if (!PageHighMem(page)) {
    -                debug_check_no_locks_freed(page_address(page),
    -                                PAGE_SIZE << order);
    -                debug_check_no_obj_freed(page_address(page),
    -                                PAGE_SIZE << order);
    -        }
    -        arch_free_page(page, order);
    -        kernel_map_pages(page, 1 << order, 0);
    -
            local_irq_save(flags);
            __count_vm_events(PGFREE, 1 << order);
            free_one_page(page_zone(page), page, order);
    --
    1.5.6.1.201.g3e7d3


  4. [PATCH 3/4] buddy: explicitly identify buddy field use in struct page

    Explicitly define the struct page fields which the buddy allocator
    uses when it owns pages. Define a new anonymous struct to allow
    additional fields to be added in a later patch.

    Signed-off-by: Andy Whitcroft
    ---
    include/linux/mm_types.h | 3 +++
    mm/internal.h | 2 +-
    mm/page_alloc.c | 4 ++--
    3 files changed, 6 insertions(+), 3 deletions(-)

    diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
    index 02a27ae..45eb71f 100644
    --- a/include/linux/mm_types.h
    +++ b/include/linux/mm_types.h
    @@ -69,6 +69,9 @@ struct page {
    #endif
                    struct kmem_cache *slab;        /* SLUB: Pointer to slab */
                    struct page *first_page;        /* Compound tail pages */
    +                struct {
    +                        unsigned long buddy_order; /* buddy: free page order */
    +                };
            };
            union {
                    pgoff_t index;          /* Our offset within mapping. */
    diff --git a/mm/internal.h b/mm/internal.h
    index 0034e94..ac0f600 100644
    --- a/mm/internal.h
    +++ b/mm/internal.h
    @@ -44,7 +44,7 @@ extern void __free_pages_bootmem(struct page *page, unsigned int order);
    static inline unsigned long page_order(struct page *page)
    {
            VM_BUG_ON(!PageBuddy(page));
    -        return page_private(page);
    +        return page->buddy_order;
    }

    /*
    diff --git a/mm/page_alloc.c b/mm/page_alloc.c
    index 4d9c4e8..d73e1e1 100644
    --- a/mm/page_alloc.c
    +++ b/mm/page_alloc.c
    @@ -316,14 +316,14 @@ static inline void prep_zero_page(struct page *page, int order, gfp_t gfp_flags)

    static inline void set_page_order(struct page *page, int order)
    {
    -        set_page_private(page, order);
    +        page->buddy_order = order;
            __SetPageBuddy(page);
    }

    static inline void rmv_page_order(struct page *page)
    {
            __ClearPageBuddy(page);
    -        set_page_private(page, 0);
    +        page->buddy_order = 0;
    }

    /*
    --
    1.5.6.1.201.g3e7d3


  5. Re: [PATCH 2/4] pull out zone cpuset and watermark checks for reuse

    Hi Andy,

    This is a nit.


    > +/*
    > + * Return 1 if this zone is an acceptable source given the cpuset
    > + * constraints.
    > + */
    > +static inline int zone_cpuset_ok(struct zone *zone,
    > +                                int alloc_flags, gfp_t gfp_mask)
    > +{
    > +        if ((alloc_flags & ALLOC_CPUSET) &&
    > +            !cpuset_zone_allowed_softwall(zone, gfp_mask))
    > +                return 0;
    > +        return 1;
    > +}


    zone_cpuset_ok() sounds like a cpuset sanity check, but it is
    actually an "is allocation from this zone allowed?" check.

    In addition, "ok" is a slightly vague name, IMHO.
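
    For example (the name is only a suggestion):

        static inline int zone_allowed_by_cpuset(struct zone *zone,
                        int alloc_flags, gfp_t gfp_mask)
        {
                if ((alloc_flags & ALLOC_CPUSET) &&
                    !cpuset_zone_allowed_softwall(zone, gfp_mask))
                        return 0;
                return 1;
        }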


    > +/*
    > + * Return 1 if this zone is within the watermarks specified by the
    > + * allocation flags.
    > + */
    > +static inline int zone_alloc_ok(struct zone *zone, int order,
    > +                int classzone_idx, int alloc_flags, gfp_t gfp_mask)
    > +{
    > +        if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
    > +                unsigned long mark;
    > +                if (alloc_flags & ALLOC_WMARK_MIN)
    > +                        mark = zone->pages_min;
    > +                else if (alloc_flags & ALLOC_WMARK_LOW)
    > +                        mark = zone->pages_low;
    > +                else
    > +                        mark = zone->pages_high;
    > +                if (!zone_watermark_ok(zone, order, mark,
    > +                                classzone_idx, alloc_flags)) {
    > +                        if (!zone_reclaim_mode ||
    > +                            !zone_reclaim(zone, gfp_mask, order))
    > +                                return 0;
    > +                }
    > +        }
    > +        return 1;
    > +}


    zone_alloc_ok() looks like a pure "allocatable or not?" check, so I
    would like the zone_reclaim() call to move out of this function.
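
    i.e. zone_alloc_ok() would do only the watermark check, and the
    call site would do something like this (untested sketch):

        if (!zone_alloc_ok(zone, order, classzone_idx,
                        alloc_flags, gfp_mask)) {
                if (!zone_reclaim_mode ||
                    !zone_reclaim(zone, gfp_mask, order))
                        goto this_zone_full;
        }

    Then the "is this zone usable?" test and the attempt to make it
    usable are clearly separate.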




  6. Re: [PATCH 4/4] capture pages freed during direct reclaim for allocation by the reclaimer

    Hi Andy,

    I feel this is an interesting patch.

    But I worry that it could increase the number of OOMs that occur.
    What do you think?

    Also, why don't you make the patch against the -mm tree?


    > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
    > index d73e1e1..1ac703d 100644
    > --- a/mm/page_alloc.c
    > +++ b/mm/page_alloc.c
    > @@ -410,6 +410,51 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
    >   * -- wli
    >   */
    >
    > +static inline void __capture_one_page(struct list_head *capture_list,
    > +                struct page *page, struct zone *zone, unsigned int order)
    > +{
    > +        unsigned long page_idx;
    > +        unsigned long order_size = 1UL << order;
    > +
    > +        if (unlikely(PageCompound(page)))
    > +                destroy_compound_page(page, order);
    > +
    > +        page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
    > +
    > +        VM_BUG_ON(page_idx & (order_size - 1));
    > +        VM_BUG_ON(bad_range(zone, page));
    > +
    > +        while (order < MAX_ORDER-1) {
    > +                unsigned long combined_idx;
    > +                struct page *buddy;
    > +
    > +                buddy = __page_find_buddy(page, page_idx, order);
    > +                if (!page_is_buddy(page, buddy, order))
    > +                        break;
    > +
    > +                /* Our buddy is free, merge with it and move up one order. */
    > +                list_del(&buddy->lru);
    > +                if (PageBuddyCapture(buddy)) {
    > +                        buddy->buddy_free = 0;
    > +                        __ClearPageBuddyCapture(buddy);
    > +                } else {
    > +                        zone->free_area[order].nr_free--;
    > +                        __mod_zone_page_state(zone,
    > +                                        NR_FREE_PAGES, -(1UL << order));
    > +                }
    > +                rmv_page_order(buddy);
    > +                combined_idx = __find_combined_index(page_idx, order);
    > +                page = page + (combined_idx - page_idx);
    > +                page_idx = combined_idx;
    > +                order++;
    > +        }
    > +        set_page_order(page, order);
    > +        __SetPageBuddyCapture(page);
    > +        page->buddy_free = capture_list;
    > +
    > +        list_add(&page->lru, capture_list);
    > +}


    If we already have a captured page of sufficient size, shouldn't we
    release further pages to the buddy list?

    Otherwise, the OOM risk increases.
    Or am I misunderstanding?
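
    To illustrate what I mean (untested sketch; needed_order is
    hypothetical, the order the reclaiming allocator actually asked for):

        static inline int capture_list_satisfied(struct list_head *capture_list,
                        unsigned int needed_order)
        {
                struct page *page;

                /* Do we already hold a big enough captured page? */
                list_for_each_entry(page, capture_list, lru)
                        if (page_order(page) >= needed_order)
                                return 1;
                return 0;
        }

    When this returns 1, freed pages could go back to the buddy lists
    as normal instead of being captured.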


    > static inline void __free_one_page(struct page *page,
    >                 struct zone *zone, unsigned int order)
    > {
    > @@ -433,6 +478,12 @@ static inline void __free_one_page(struct page *page,
    >                 buddy = __page_find_buddy(page, page_idx, order);
    >                 if (!page_is_buddy(page, buddy, order))
    >                         break;
    > +                if (PageBuddyCapture(buddy)) {
    > +                        __mod_zone_page_state(zone,
    > +                                        NR_FREE_PAGES, -(1UL << order));
    > +                        return __capture_one_page(buddy->buddy_free,
    > +                                        page, zone, order);
    > +                }


    Shouldn't you account captured pages in the zone statistics?
    Otherwise, administrators cannot troubleshoot.
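
    For example (sketch only; NR_PAGES_CAPTURED would be a new
    zone_stat_item, so the total shows up in /proc/zoneinfo):

        /* when a page is captured */
        __mod_zone_page_state(zone, NR_PAGES_CAPTURED, 1UL << order);

        /* when a captured page is taken by the reclaimer or released */
        __mod_zone_page_state(zone, NR_PAGES_CAPTURED, -(1UL << order));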


    >         /* Can pages be swapped as part of reclaim? */
    > @@ -78,6 +80,12 @@ struct scan_control {
    >                         unsigned long *scanned, int order, int mode,
    >                         struct zone *z, struct mem_cgroup *mem_cont,
    >                         int active);
    > +
    > +        /* Captured page. */
    > +        struct page **capture;
    > +
    > +        /* Nodemask for acceptable allocations. */
    > +        nodemask_t *nodemask;
    > };


    Please write a longer comment here. Anybody reading this thinks of
    scan_control as a structure for reclaim purposes, so they will
    probably wonder why this member is needed.
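
    For example, something like this (wording is only a suggestion):

        /*
         * When direct reclaim works on behalf of a specific failed
         * allocation, pages freed during the scan are not returned to
         * the buddy allocator.  They accumulate on a private capture
         * list and, once one has coalesced to the required order, it
         * is handed straight back to the reclaimer through here.
         */
        struct page **capture;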




    > @@ -1314,8 +1360,14 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
    >         unsigned long lru_pages = 0;
    >         struct zoneref *z;
    >         struct zone *zone;
    > +        struct zone *preferred_zone;
    >         enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
    >
    > +        /* This should never fail as we should be scanning a real zonelist. */
    > +        (void)first_zones_zonelist(zonelist, high_zoneidx, sc->nodemask,
    > +                        &preferred_zone);


    Nit: the (void) cast is unnecessary.




  7. Re: [PATCH 4/4] capture pages freed during direct reclaim for allocation by the reclaimer

    On Wed, Jul 02, 2008 at 09:01:59PM +0900, KOSAKI Motohiro wrote:
    > Hi Andy,
    >
    > I feel this is an interesting patch.
    >
    > But I worry that it could increase the number of OOMs that occur.
    > What do you think?


    We do hold onto some nearly free pages for a while longer, but only
    in direct reclaim; assuming kswapd is running, its pages should not
    get captured. I am pushing our machines in test pretty hard, to the
    unusable stage, mostly without OOM'ing, but that is still an
    artificial test. The amount of memory under capture is proportional
    to the size of the allocations in progress at the time of capture,
    so one would hope this would only be significant at very high orders.

    > Also, why don't you make the patch against the -mm tree?


    That is mostly historical: there was major churn in the same place
    when I was originally making these patches, plus -mm was not bootable
    on any of my test systems. I am not sure if that is still true. I
    will have a look at a recent -mm and see if the patches will rebase
    and boot.

    Thanks for looking.

    -apw
