[RFC][mm][PATCH 0/4] Memory cgroup hierarchy introduction (v2) - Kernel


  1. [RFC][mm][PATCH 0/4] Memory cgroup hierarchy introduction (v2)

    This patch follows several iterations of the memory controller hierarchy
    patches: the hardwall approach by Kamezawa-San [1] and version 1 of this
    patchset [2].

    The current approach is based on [2] and has the following properties:

    1. Hierarchies are very natural in a filesystem like the cgroup filesystem.
    Multi-level tree hierarchies have been supported in filesystems for a long
    time. When the feature is turned on, we honor hierarchies, such that the
    root accounts for the resource usage of all its children and limits can be
    set at any point in the hierarchy. Any memory cgroup is constrained by the
    limits along its path to the root: the total usage of all children of a
    node cannot exceed the limit of the node.
    2. The hierarchy feature is selectable and off by default.
    3. Hierarchies are expensive, and the trade-off is depth versus performance.
    Hierarchies can also be completely turned off.

    The patches are against 2.6.28-rc2-mm1 and were tested in a KVM instance
    with SMP and swap turned on.

    Signed-off-by: Balbir Singh

    v1..v2
    ======
    1. The hierarchy is now selectable per subtree
    2. The features file has been renamed to use_hierarchy
    3. Reclaim now holds the cgroup lock and does a recursive walk and reclaim
    of the hierarchy

    Series
    ------

    memcg-hierarchy-documentation.patch
    resource-counters-hierarchy-support.patch
    memcg-hierarchical-reclaim.patch
    memcg-add-hierarchy-selector.patch

    Reviews? Comments?

    References

    1. http://linux.derkeiler.com/Mailing-L.../msg05417.html
    2. http://kerneltrap.org/mailarchive/li...1513644/thread

    --
    Balbir
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. [RFC][mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy (v2)


    Add support for building hierarchies in resource counters. Cgroups allow us
    to build a deep hierarchy, but we currently don't link the resource counters
    belonging to the memory controller cgroups in the same fashion as the
    corresponding entries in the cgroup hierarchy. This patch provides the
    infrastructure for resource counters that have the same hierarchy as their
    cgroup counterparts.

    This set of patches is based on the resource counter hierarchy patches posted
    by Pavel Emelianov.

    NOTE: Building hierarchies is expensive: deeper hierarchies imply charging
    all the way up to the root. Since hierarchies are known to be expensive,
    the user needs to be careful and aware of the trade-offs before creating
    very deep ones.
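
    The charge walk with rollback can be modeled in plain userspace C. The
    sketch below mirrors only the accounting logic of the patch; locking, IRQ
    handling, and the kernel types are omitted, and the simplified names are
    stand-ins, not the kernel API:

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Userspace model of a hierarchical resource counter. */
    struct res_counter {
        unsigned long usage;
        unsigned long limit;
        struct res_counter *parent;
    };

    static void res_counter_init(struct res_counter *c, unsigned long limit,
                                 struct res_counter *parent)
    {
        c->usage = 0;
        c->limit = limit;
        c->parent = parent;
    }

    /*
     * Charge val against counter and every ancestor. Returns 0 on success;
     * on failure, sets *limit_fail_at to the counter whose limit was hit
     * and undoes the charges applied so far.
     */
    static int res_counter_charge(struct res_counter *counter, unsigned long val,
                                  struct res_counter **limit_fail_at)
    {
        struct res_counter *c, *u;

        *limit_fail_at = NULL;
        for (c = counter; c != NULL; c = c->parent) {
            if (c->usage + val > c->limit) {
                *limit_fail_at = c;
                goto undo;
            }
            c->usage += val;
        }
        return 0;
    undo:
        /* Roll back the counters charged before the failure point. */
        for (u = counter; u != c; u = u->parent)
            u->usage -= val;
        return -1;
    }
    ```

    This also shows why deep hierarchies cost more: every charge touches one
    counter per level up to the root.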


    Signed-off-by: Balbir Singh
    ---

    include/linux/res_counter.h | 8 ++++++--
    kernel/res_counter.c | 42 ++++++++++++++++++++++++++++++++++--------
    mm/memcontrol.c | 9 ++++++---
    3 files changed, 46 insertions(+), 13 deletions(-)

    diff -puN include/linux/res_counter.h~resource-counters-hierarchy-support include/linux/res_counter.h
    --- linux-2.6.28-rc2/include/linux/res_counter.h~resource-counters-hierarchy-support 2008-11-08 14:09:31.000000000 +0530
    +++ linux-2.6.28-rc2-balbir/include/linux/res_counter.h 2008-11-08 14:09:31.000000000 +0530
    @@ -43,6 +43,10 @@ struct res_counter {
    * the routines below consider this to be IRQ-safe
    */
    spinlock_t lock;
    + /*
    + * Parent counter, used for hierarchial resource accounting
    + */
    + struct res_counter *parent;
    };

    /**
    @@ -87,7 +91,7 @@ enum {
    * helpers for accounting
    */

    -void res_counter_init(struct res_counter *counter);
    +void res_counter_init(struct res_counter *counter, struct res_counter *parent);

    /*
    * charge - try to consume more resource.
    @@ -103,7 +107,7 @@ void res_counter_init(struct res_counter
    int __must_check res_counter_charge_locked(struct res_counter *counter,
    unsigned long val);
    int __must_check res_counter_charge(struct res_counter *counter,
    - unsigned long val);
    + unsigned long val, struct res_counter **limit_fail_at);

    /*
    * uncharge - tell that some portion of the resource is released
    diff -puN kernel/res_counter.c~resource-counters-hierarchy-support kernel/res_counter.c
    --- linux-2.6.28-rc2/kernel/res_counter.c~resource-counters-hierarchy-support 2008-11-08 14:09:31.000000000 +0530
    +++ linux-2.6.28-rc2-balbir/kernel/res_counter.c 2008-11-08 14:09:31.000000000 +0530
    @@ -15,10 +15,11 @@
    #include
    #include

    -void res_counter_init(struct res_counter *counter)
    +void res_counter_init(struct res_counter *counter, struct res_counter *parent)
    {
    spin_lock_init(&counter->lock);
    counter->limit = (unsigned long long)LLONG_MAX;
    + counter->parent = parent;
    }

    int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
    @@ -34,14 +35,34 @@ int res_counter_charge_locked(struct res
    return 0;
    }

    -int res_counter_charge(struct res_counter *counter, unsigned long val)
    +int res_counter_charge(struct res_counter *counter, unsigned long val,
    + struct res_counter **limit_fail_at)
    {
    int ret;
    unsigned long flags;
    + struct res_counter *c, *u;

    - spin_lock_irqsave(&counter->lock, flags);
    - ret = res_counter_charge_locked(counter, val);
    - spin_unlock_irqrestore(&counter->lock, flags);
    + *limit_fail_at = NULL;
    + local_irq_save(flags);
    + for (c = counter; c != NULL; c = c->parent) {
    + spin_lock(&c->lock);
    + ret = res_counter_charge_locked(c, val);
    + spin_unlock(&c->lock);
    + if (ret < 0) {
    + *limit_fail_at = c;
    + goto undo;
    + }
    + }
    + ret = 0;
    + goto done;
    +undo:
    + for (u = counter; u != c; u = u->parent) {
    + spin_lock(&u->lock);
    + res_counter_uncharge_locked(u, val);
    + spin_unlock(&u->lock);
    + }
    +done:
    + local_irq_restore(flags);
    return ret;
    }

    @@ -56,10 +77,15 @@ void res_counter_uncharge_locked(struct
    void res_counter_uncharge(struct res_counter *counter, unsigned long val)
    {
    unsigned long flags;
    + struct res_counter *c;

    - spin_lock_irqsave(&counter->lock, flags);
    - res_counter_uncharge_locked(counter, val);
    - spin_unlock_irqrestore(&counter->lock, flags);
    + local_irq_save(flags);
    + for (c = counter; c != NULL; c = c->parent) {
    + spin_lock(&c->lock);
    + res_counter_uncharge_locked(c, val);
    + spin_unlock(&c->lock);
    + }
    + local_irq_restore(flags);
    }


    diff -puN mm/memcontrol.c~resource-counters-hierarchy-support mm/memcontrol.c
    --- linux-2.6.28-rc2/mm/memcontrol.c~resource-counters-hierarchy-support 2008-11-08 14:09:31.000000000 +0530
    +++ linux-2.6.28-rc2-balbir/mm/memcontrol.c 2008-11-08 14:09:31.000000000 +0530
    @@ -485,6 +485,7 @@ int mem_cgroup_try_charge(struct mm_stru
    {
    struct mem_cgroup *mem;
    int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
    + struct res_counter *fail_res;
    /*
    * We always charge the cgroup the mm_struct belongs to.
    * The mm_struct's mem_cgroup changes on task migration if the
    @@ -510,7 +511,7 @@ int mem_cgroup_try_charge(struct mm_stru
    }


    - while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
    + while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))) {
    if (!(gfp_mask & __GFP_WAIT))
    goto nomem;

    @@ -1175,18 +1176,20 @@ static void mem_cgroup_free(struct mem_c
    static struct cgroup_subsys_state *
    mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
    {
    - struct mem_cgroup *mem;
    + struct mem_cgroup *mem, *parent;
    int node;

    if (unlikely((cont->parent) == NULL)) {
    mem = &init_mem_cgroup;
    + parent = NULL;
    } else {
    mem = mem_cgroup_alloc();
    + parent = mem_cgroup_from_cont(cont->parent);
    if (!mem)
    return ERR_PTR(-ENOMEM);
    }

    - res_counter_init(&mem->res);
    + res_counter_init(&mem->res, parent ? &parent->res : NULL);

    for_each_node_state(node, N_POSSIBLE)
    if (alloc_mem_cgroup_per_zone_info(mem, node))
    _

    --
    Balbir

  3. [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v2)


    This patch introduces hierarchical reclaim. When an ancestor goes over its
    limit, the charging routine points to the ancestor that is above its limit.
    The reclaim process then starts from the last scanned child of that ancestor
    and reclaims until the ancestor goes below its limit.
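
    The effect of caching last_scanned_child can be sketched with a simplified
    userspace model. The kernel walks the cgroup sibling list; here an index
    over an array of children illustrates the same round-robin fairness (the
    names are illustrative, not the kernel's):

    ```c
    #include <assert.h>

    /* Simplified stand-in for a mem_cgroup with a cached scan position. */
    struct memcg_model {
        int nr_children;
        int last_scanned;   /* -1 means "no child scanned yet" */
    };

    /*
     * Return the index of the next child to reclaim from, resuming after
     * the last one scanned and wrapping around, so the first child in the
     * list is not penalized on every reclaim pass.
     */
    static int next_child_to_scan(struct memcg_model *m)
    {
        int next;

        if (m->nr_children == 0)
            return -1;
        next = (m->last_scanned + 1) % m->nr_children;
        m->last_scanned = next;
        return next;
    }
    ```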

    Signed-off-by: Balbir Singh
    ---

    mm/memcontrol.c | 152 +++++++++++++++++++++++++++++++++++++++++++++++---------
    1 file changed, 128 insertions(+), 24 deletions(-)

    diff -puN mm/memcontrol.c~memcg-hierarchical-reclaim mm/memcontrol.c
    --- linux-2.6.28-rc2/mm/memcontrol.c~memcg-hierarchical-reclaim 2008-11-08 14:09:32.000000000 +0530
    +++ linux-2.6.28-rc2-balbir/mm/memcontrol.c 2008-11-08 14:09:32.000000000 +0530
    @@ -132,6 +132,11 @@ struct mem_cgroup {
    * statistics.
    */
    struct mem_cgroup_stat stat;
    + /*
    + * While reclaiming in a hiearchy, we cache the last child we
    + * reclaimed from.
    + */
    + struct mem_cgroup *last_scanned_child;
    };
    static struct mem_cgroup init_mem_cgroup;

    @@ -467,6 +472,124 @@ unsigned long mem_cgroup_isolate_pages(u
    return nr_taken;
    }

    +static struct mem_cgroup *
    +mem_cgroup_from_res_counter(struct res_counter *counter)
    +{
    + return container_of(counter, struct mem_cgroup, res);
    +}
    +
    +/*
    + * Dance down the hierarchy if needed to reclaim memory. We remember the
    + * last child we reclaimed from, so that we don't end up penalizing
    + * one child extensively based on its position in the children list.
    + *
    + * root_mem is the original ancestor that we've been reclaim from.
    + */
    +static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *mem,
    + struct mem_cgroup *root_mem,
    + gfp_t gfp_mask)
    +{
    + struct cgroup *cg_current, *cgroup;
    + struct mem_cgroup *mem_child;
    + int ret = 0;
    +
    + /*
    + * Reclaim unconditionally and don't check for return value.
    + * We need to reclaim in the current group and down the tree.
    + * One might think about checking for children before reclaiming,
    + * but there might be left over accounting, even after children
    + * have left.
    + */
    + try_to_free_mem_cgroup_pages(mem, gfp_mask);
    +
    + if (res_counter_check_under_limit(&root_mem->res))
    + return 0;
    +
    + if (list_empty(&mem->css.cgroup->children))
    + return 0;
    +
    + /*
    + * Scan all children under the mem_cgroup mem
    + */
    + if (!mem->last_scanned_child)
    + cgroup = list_first_entry(&mem->css.cgroup->children,
    + struct cgroup, sibling);
    + else
    + cgroup = mem->last_scanned_child->css.cgroup;
    +
    + cg_current = cgroup;
    + cgroup_lock();
    +
    + do {
    + struct list_head *next;
    +
    + mem_child = mem_cgroup_from_cont(cgroup);
    + cgroup_unlock();
    +
    + ret = mem_cgroup_hierarchical_reclaim(mem_child, root_mem,
    + gfp_mask);
    + mem->last_scanned_child = mem_child;
    +
    + cgroup_lock();
    + if (res_counter_check_under_limit(&root_mem->res)) {
    + ret = 0;
    + goto done;
    + }
    +
    + /*
    + * Since we gave up the lock, it is time to
    + * start from last cgroup
    + */
    + cgroup = mem->last_scanned_child->css.cgroup;
    + next = cgroup->sibling.next;
    +
    + if (next == &cg_current->parent->children)
    + cgroup = list_first_entry(&mem->css.cgroup->children,
    + struct cgroup, sibling);
    + else
    + cgroup = container_of(next, struct cgroup, sibling);
    + } while (cgroup != cg_current);
    +
    +done:
    + cgroup_unlock();
    + return ret;
    +}
    +
    +/*
    + * Charge memory cgroup mem and check if it is over its limit. If so, reclaim
    + * from mem.
    + */
    +static int mem_cgroup_charge_and_reclaim(struct mem_cgroup *mem, gfp_t gfp_mask)
    +{
    + int ret = 0;
    + unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
    + struct res_counter *fail_res;
    + struct mem_cgroup *mem_over_limit;
    +
    + while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))) {
    + if (!(gfp_mask & __GFP_WAIT))
    + goto out;
    +
    + /*
    + * Is one of our ancestors over their limit?
    + */
    + if (fail_res)
    + mem_over_limit = mem_cgroup_from_res_counter(fail_res);
    + else
    + mem_over_limit = mem;
    +
    + ret = mem_cgroup_hierarchical_reclaim(mem_over_limit,
    + mem_over_limit,
    + gfp_mask);
    +
    + if (!nr_retries--) {
    + mem_cgroup_out_of_memory(mem, gfp_mask);
    + goto out;
    + }
    + }
    +out:
    + return ret;
    +}

    /**
    * mem_cgroup_try_charge - get charge of PAGE_SIZE.
    @@ -484,8 +607,7 @@ int mem_cgroup_try_charge(struct mm_stru
    gfp_t gfp_mask, struct mem_cgroup **memcg)
    {
    struct mem_cgroup *mem;
    - int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
    - struct res_counter *fail_res;
    +
    /*
    * We always charge the cgroup the mm_struct belongs to.
    * The mm_struct's mem_cgroup changes on task migration if the
    @@ -510,29 +632,9 @@ int mem_cgroup_try_charge(struct mm_stru
    css_get(&mem->css);
    }

    + if (mem_cgroup_charge_and_reclaim(mem, gfp_mask))
    + goto nomem;

    - while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))) {
    - if (!(gfp_mask & __GFP_WAIT))
    - goto nomem;
    -
    - if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
    - continue;
    -
    - /*
    - * try_to_free_mem_cgroup_pages() might not give us a full
    - * picture of reclaim. Some pages are reclaimed and might be
    - * moved to swap cache or just unmapped from the cgroup.
    - * Check the limit again to see if the reclaim reduced the
    - * current usage of the cgroup before giving up
    - */
    - if (res_counter_check_under_limit(&mem->res))
    - continue;
    -
    - if (!nr_retries--) {
    - mem_cgroup_out_of_memory(mem, gfp_mask);
    - goto nomem;
    - }
    - }
    return 0;
    nomem:
    css_put(&mem->css);
    @@ -1195,6 +1297,8 @@ mem_cgroup_create(struct cgroup_subsys *
    if (alloc_mem_cgroup_per_zone_info(mem, node))
    goto free_out;

    + mem->last_scanned_child = NULL;
    +
    return &mem->css;
    free_out:
    for_each_node_state(node, N_POSSIBLE)
    _

    --
    Balbir

  4. [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v2)


    Don't enable multiple-hierarchy support by default. This patch introduces
    a features file that can be set to enable the nested hierarchy feature.
    The feature can only be enabled when the cgroup for which it is being
    enabled has no children.

    Signed-off-by: Balbir Singh
    ---

    mm/memcontrol.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
    1 file changed, 42 insertions(+), 1 deletion(-)

    diff -puN mm/memcontrol.c~memcg-add-hierarchy-selector mm/memcontrol.c
    --- linux-2.6.28-rc2/mm/memcontrol.c~memcg-add-hierarchy-selector 2008-11-08 14:09:33.000000000 +0530
    +++ linux-2.6.28-rc2-balbir/mm/memcontrol.c 2008-11-08 14:10:53.000000000 +0530
    @@ -137,6 +137,11 @@ struct mem_cgroup {
    * reclaimed from.
    */
    struct mem_cgroup *last_scanned_child;
    + /*
    + * Should the accounting and control be hierarchical, per subtree?
    + */
    + unsigned long use_hierarchy;
    +
    };
    static struct mem_cgroup init_mem_cgroup;

    @@ -1079,6 +1084,33 @@ out:
    return ret;
    }

    +static u64 mem_cgroup_hierarchy_read(struct cgroup *cont, struct cftype *cft)
    +{
    + return mem_cgroup_from_cont(cont)->use_hierarchy;
    +}
    +
    +static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
    + u64 val)
    +{
    + int retval = 0;
    + struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
    +
    + if (val == 1) {
    + if (list_empty(&cont->children))
    + mem->use_hierarchy = val;
    + else
    + retval = -EBUSY;
    + } else if (val == 0) {
    + if (list_empty(&cont->children))
    + mem->use_hierarchy = val;
    + else
    + retval = -EBUSY;
    + } else
    + retval = -EINVAL;
    +
    + return retval;
    +}
    +
    static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
    {
    return res_counter_read_u64(&mem_cgroup_from_cont(cont)->res,
    @@ -1213,6 +1245,11 @@ static struct cftype mem_cgroup_files[]
    .name = "stat",
    .read_map = mem_control_stat_show,
    },
    + {
    + .name = "use_hierarchy",
    + .write_u64 = mem_cgroup_hierarchy_write,
    + .read_u64 = mem_cgroup_hierarchy_read,
    + },
    };

    static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
    @@ -1289,9 +1326,13 @@ mem_cgroup_create(struct cgroup_subsys *
    parent = mem_cgroup_from_cont(cont->parent);
    if (!mem)
    return ERR_PTR(-ENOMEM);
    + mem->use_hierarchy = parent->use_hierarchy;
    }

    - res_counter_init(&mem->res, parent ? &parent->res : NULL);
    + if (parent && parent->use_hierarchy)
    + res_counter_init(&mem->res, &parent->res);
    + else
    + res_counter_init(&mem->res, NULL);

    for_each_node_state(node, N_POSSIBLE)
    if (alloc_mem_cgroup_per_zone_info(mem, node))
    _

    --
    Balbir

  5. Re: [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v2)

    > +static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
    > + u64 val)
    > +{
    > + int retval = 0;
    > + struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
    > +
    > + if (val == 1) {
    > + if (list_empty(&cont->children))


    cgroup_lock should be held before checking cont->children.

    > + mem->use_hierarchy = val;
    > + else
    > + retval = -EBUSY;
    > + } else if (val == 0) {


    And there is duplicated code.

    > + if (list_empty(&cont->children))
    > + mem->use_hierarchy = val;
    > + else
    > + retval = -EBUSY;
    > + } else
    > + retval = -EINVAL;
    > +
    > + return retval;
    > +}
    > +



  6. Re: [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v2)

    Li Zefan wrote:
    >> +static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
    >> + u64 val)
    >> +{
    >> + int retval = 0;
    >> + struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
    >> +
    >> + if (val == 1) {
    >> + if (list_empty(&cont->children))

    >
    > cgroup_lock should be held before checking cont->children.
    >


    Good point, I'll look at that aspect.

    >> + mem->use_hierarchy = val;
    >> + else
    >> + retval = -EBUSY;
    >> + } else if (val == 0) {

    >
    > And there is duplicated code.


    Yes, this can be optimized better. I'll fix that in v3.
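
    A deduplicated shape might look like the following userspace sketch (a
    hypothetical helper, not the v3 code; numeric values stand in for -EBUSY
    and -EINVAL, and the cgroup locking Li mentions is still required around
    the children check):

    ```c
    #include <assert.h>

    /*
     * Validate val once, check for children once, then store the value:
     * the val == 0 and val == 1 branches collapse into a single path.
     */
    static int hierarchy_write(int has_children, unsigned long *use_hierarchy,
                               unsigned long long val)
    {
        if (val != 0 && val != 1)
            return -22;     /* -EINVAL */
        if (has_children)
            return -16;     /* -EBUSY */
        *use_hierarchy = val;
        return 0;
    }
    ```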

    >
    >> + if (list_empty(&cont->children))
    >> + mem->use_hierarchy = val;
    >> + else
    >> + retval = -EBUSY;
    >> + } else
    >> + retval = -EINVAL;
    >> +
    >> + return retval;
    >> +}
    >> +

    >



    --
    Balbir

  7. Re: [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v2)

    On Sat, 08 Nov 2008 14:41:00 +0530
    Balbir Singh wrote:

    >
    > This patch introduces hierarchical reclaim. When an ancestor goes over its
    > limit, the charging routine points to the ancestor that is above its limit.
    > The reclaim process then starts from the last scanned child of that ancestor
    > and reclaims until the ancestor goes below its limit.
    >
    > Signed-off-by: Balbir Singh
    > ---
    >
    > mm/memcontrol.c | 152 +++++++++++++++++++++++++++++++++++++++++++++++---------
    > 1 file changed, 128 insertions(+), 24 deletions(-)
    >
    > diff -puN mm/memcontrol.c~memcg-hierarchical-reclaim mm/memcontrol.c
    > --- linux-2.6.28-rc2/mm/memcontrol.c~memcg-hierarchical-reclaim 2008-11-08 14:09:32.000000000 +0530
    > +++ linux-2.6.28-rc2-balbir/mm/memcontrol.c 2008-11-08 14:09:32.000000000 +0530
    > @@ -132,6 +132,11 @@ struct mem_cgroup {
    > * statistics.
    > */
    > struct mem_cgroup_stat stat;
    > + /*
    > + * While reclaiming in a hiearchy, we cache the last child we
    > + * reclaimed from.
    > + */
    > + struct mem_cgroup *last_scanned_child;
    > };
    > static struct mem_cgroup init_mem_cgroup;
    >
    > @@ -467,6 +472,124 @@ unsigned long mem_cgroup_isolate_pages(u
    > return nr_taken;
    > }
    >
    > +static struct mem_cgroup *
    > +mem_cgroup_from_res_counter(struct res_counter *counter)
    > +{
    > + return container_of(counter, struct mem_cgroup, res);
    > +}
    > +
    > +/*
    > + * Dance down the hierarchy if needed to reclaim memory. We remember the
    > + * last child we reclaimed from, so that we don't end up penalizing
    > + * one child extensively based on its position in the children list.
    > + *
    > + * root_mem is the original ancestor that we've been reclaim from.
    > + */
    > +static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *mem,
    > + struct mem_cgroup *root_mem,
    > + gfp_t gfp_mask)
    > +{
    > + struct cgroup *cg_current, *cgroup;
    > + struct mem_cgroup *mem_child;
    > + int ret = 0;
    > +
    > + /*
    > + * Reclaim unconditionally and don't check for return value.
    > + * We need to reclaim in the current group and down the tree.
    > + * One might think about checking for children before reclaiming,
    > + * but there might be left over accounting, even after children
    > + * have left.
    > + */
    > + try_to_free_mem_cgroup_pages(mem, gfp_mask);
    > +
    > + if (res_counter_check_under_limit(&root_mem->res))
    > + return 0;
    > +
    > + if (list_empty(&mem->css.cgroup->children))
    > + return 0;
    > +
    > + /*
    > + * Scan all children under the mem_cgroup mem
    > + */
    > + if (!mem->last_scanned_child)
    > + cgroup = list_first_entry(&mem->css.cgroup->children,
    > + struct cgroup, sibling);
    > + else
    > + cgroup = mem->last_scanned_child->css.cgroup;
    > +


    Who guarantees that last_scanned_child is accessible at this point?

    Thanks,
    -Kame
    > + cg_current = cgroup;
    > + cgroup_lock();
    > +
    > + do {
    > + struct list_head *next;
    > +
    > + mem_child = mem_cgroup_from_cont(cgroup);
    > + cgroup_unlock();
    > +
    > + ret = mem_cgroup_hierarchical_reclaim(mem_child, root_mem,
    > + gfp_mask);
    > + mem->last_scanned_child = mem_child;
    > +
    > + cgroup_lock();
    > + if (res_counter_check_under_limit(&root_mem->res)) {
    > + ret = 0;
    > + goto done;
    > + }
    > +
    > + /*
    > + * Since we gave up the lock, it is time to
    > + * start from last cgroup
    > + */
    > + cgroup = mem->last_scanned_child->css.cgroup;
    > + next = cgroup->sibling.next;
    > +
    > + if (next == &cg_current->parent->children)
    > + cgroup = list_first_entry(&mem->css.cgroup->children,
    > + struct cgroup, sibling);
    > + else
    > + cgroup = container_of(next, struct cgroup, sibling);
    > + } while (cgroup != cg_current);
    > +
    > +done:
    > + cgroup_unlock();
    > + return ret;
    > +}
    > +
    > +/*
    > + * Charge memory cgroup mem and check if it is over its limit. If so, reclaim
    > + * from mem.
    > + */
    > +static int mem_cgroup_charge_and_reclaim(struct mem_cgroup *mem, gfp_t gfp_mask)
    > +{
    > + int ret = 0;
    > + unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
    > + struct res_counter *fail_res;
    > + struct mem_cgroup *mem_over_limit;
    > +
    > + while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))) {
    > + if (!(gfp_mask & __GFP_WAIT))
    > + goto out;
    > +
    > + /*
    > + * Is one of our ancestors over their limit?
    > + */
    > + if (fail_res)
    > + mem_over_limit = mem_cgroup_from_res_counter(fail_res);
    > + else
    > + mem_over_limit = mem;
    > +
    > + ret = mem_cgroup_hierarchical_reclaim(mem_over_limit,
    > + mem_over_limit,
    > + gfp_mask);
    > +
    > + if (!nr_retries--) {
    > + mem_cgroup_out_of_memory(mem, gfp_mask);
    > + goto out;
    > + }
    > + }
    > +out:
    > + return ret;
    > +}
    >
    > /**
    > * mem_cgroup_try_charge - get charge of PAGE_SIZE.
    > @@ -484,8 +607,7 @@ int mem_cgroup_try_charge(struct mm_stru
    > gfp_t gfp_mask, struct mem_cgroup **memcg)
    > {
    > struct mem_cgroup *mem;
    > - int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
    > - struct res_counter *fail_res;
    > +
    > /*
    > * We always charge the cgroup the mm_struct belongs to.
    > * The mm_struct's mem_cgroup changes on task migration if the
    > @@ -510,29 +632,9 @@ int mem_cgroup_try_charge(struct mm_stru
    > css_get(&mem->css);
    > }
    >
    > + if (mem_cgroup_charge_and_reclaim(mem, gfp_mask))
    > + goto nomem;
    >
    > - while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))) {
    > - if (!(gfp_mask & __GFP_WAIT))
    > - goto nomem;
    > -
    > - if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
    > - continue;
    > -
    > - /*
    > - * try_to_free_mem_cgroup_pages() might not give us a full
    > - * picture of reclaim. Some pages are reclaimed and might be
    > - * moved to swap cache or just unmapped from the cgroup.
    > - * Check the limit again to see if the reclaim reduced the
    > - * current usage of the cgroup before giving up
    > - */
    > - if (res_counter_check_under_limit(&mem->res))
    > - continue;
    > -
    > - if (!nr_retries--) {
    > - mem_cgroup_out_of_memory(mem, gfp_mask);
    > - goto nomem;
    > - }
    > - }
    > return 0;
    > nomem:
    > css_put(&mem->css);
    > @@ -1195,6 +1297,8 @@ mem_cgroup_create(struct cgroup_subsys *
    > if (alloc_mem_cgroup_per_zone_info(mem, node))
    > goto free_out;
    >
    > + mem->last_scanned_child = NULL;
    > +
    > return &mem->css;
    > free_out:
    > for_each_node_state(node, N_POSSIBLE)
    > _
    >



  8. Re: [RFC][mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy (v2)

    On Sat, 08 Nov 2008 14:40:47 +0530
    Balbir Singh wrote:

    >
    > Add support for building hierarchies in resource counters. Cgroups allow us
    > to build a deep hierarchy, but we currently don't link the resource counters
    > belonging to the memory controller cgroups in the same fashion as the
    > corresponding entries in the cgroup hierarchy. This patch provides the
    > infrastructure for resource counters that have the same hierarchy as their
    > cgroup counterparts.
    >
    > This set of patches is based on the resource counter hierarchy patches posted
    > by Pavel Emelianov.
    >
    > NOTE: Building hierarchies is expensive: deeper hierarchies imply charging
    > all the way up to the root. Since hierarchies are known to be expensive,
    > the user needs to be careful and aware of the trade-offs before creating
    > very deep ones.
    >

    Do you have numbers?






    >
    > Signed-off-by: Balbir Singh
    > ---
    >
    > include/linux/res_counter.h | 8 ++++++--
    > kernel/res_counter.c | 42 ++++++++++++++++++++++++++++++++++--------
    > mm/memcontrol.c | 9 ++++++---
    > 3 files changed, 46 insertions(+), 13 deletions(-)
    >
    > diff -puN include/linux/res_counter.h~resource-counters-hierarchy-support include/linux/res_counter.h
    > --- linux-2.6.28-rc2/include/linux/res_counter.h~resource-counters-hierarchy-support 2008-11-08 14:09:31.000000000 +0530
    > +++ linux-2.6.28-rc2-balbir/include/linux/res_counter.h 2008-11-08 14:09:31.000000000 +0530
    > @@ -43,6 +43,10 @@ struct res_counter {
    > * the routines below consider this to be IRQ-safe
    > */
    > spinlock_t lock;
    > + /*
    > + * Parent counter, used for hierarchial resource accounting
    > + */
    > + struct res_counter *parent;
    > };
    >
    > /**
    > @@ -87,7 +91,7 @@ enum {
    > * helpers for accounting
    > */
    >
    > -void res_counter_init(struct res_counter *counter);
    > +void res_counter_init(struct res_counter *counter, struct res_counter *parent);
    >
    > /*
    > * charge - try to consume more resource.
    > @@ -103,7 +107,7 @@ void res_counter_init(struct res_counter
    > int __must_check res_counter_charge_locked(struct res_counter *counter,
    > unsigned long val);
    > int __must_check res_counter_charge(struct res_counter *counter,
    > - unsigned long val);
    > + unsigned long val, struct res_counter **limit_fail_at);
    >
    > /*
    > * uncharge - tell that some portion of the resource is released
    > diff -puN kernel/res_counter.c~resource-counters-hierarchy-support kernel/res_counter.c
    > --- linux-2.6.28-rc2/kernel/res_counter.c~resource-counters-hierarchy-support 2008-11-08 14:09:31.000000000 +0530
    > +++ linux-2.6.28-rc2-balbir/kernel/res_counter.c 2008-11-08 14:09:31.000000000 +0530
    > @@ -15,10 +15,11 @@
    > #include
    > #include
    >
    > -void res_counter_init(struct res_counter *counter)
    > +void res_counter_init(struct res_counter *counter, struct res_counter *parent)
    > {
    > spin_lock_init(&counter->lock);
    > counter->limit = (unsigned long long)LLONG_MAX;
    > + counter->parent = parent;
    > }
    >
    > int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
    > @@ -34,14 +35,34 @@ int res_counter_charge_locked(struct res
    > return 0;
    > }
    >
    > -int res_counter_charge(struct res_counter *counter, unsigned long val)
    > +int res_counter_charge(struct res_counter *counter, unsigned long val,
    > + struct res_counter **limit_fail_at)
    > {
    > int ret;
    > unsigned long flags;
    > + struct res_counter *c, *u;
    >
    > - spin_lock_irqsave(&counter->lock, flags);
    > - ret = res_counter_charge_locked(counter, val);
    > - spin_unlock_irqrestore(&counter->lock, flags);
    > + *limit_fail_at = NULL;
    > + local_irq_save(flags);
    > + for (c = counter; c != NULL; c = c->parent) {
    > + spin_lock(&c->lock);
    > + ret = res_counter_charge_locked(c, val);
    > + spin_unlock(&c->lock);
    > + if (ret < 0) {
    > + *limit_fail_at = c;
    > + goto undo;
    > + }
    > + }
    > + ret = 0;
    > + goto done;
    > +undo:
    > + for (u = counter; u != c; u = u->parent) {
    > + spin_lock(&u->lock);
    > + res_counter_uncharge_locked(u, val);
    > + spin_unlock(&u->lock);
    > + }
    > +done:
    > + local_irq_restore(flags);
    > return ret;
    > }
    >

    IMHO, dividing the function into

    - res_counter_charge() for res_counters that don't need hierarchy
    - res_counter_charge_hierarchy() for res_counters with hierarchy

    will reduce the footprint for users other than memcg.

    Are all users of res_counter forced to use the hierarchy version?

    Thanks,
    -Kame


    > @@ -56,10 +77,15 @@ void res_counter_uncharge_locked(struct
    > void res_counter_uncharge(struct res_counter *counter, unsigned long val)
    > {
    > unsigned long flags;
    > + struct res_counter *c;
    >
    > - spin_lock_irqsave(&counter->lock, flags);
    > - res_counter_uncharge_locked(counter, val);
    > - spin_unlock_irqrestore(&counter->lock, flags);
    > + local_irq_save(flags);
    > + for (c = counter; c != NULL; c = c->parent) {
    > + spin_lock(&c->lock);
    > + res_counter_uncharge_locked(c, val);
    > + spin_unlock(&c->lock);
    > + }
    > + local_irq_restore(flags);
    > }
    >
    >
    > diff -puN mm/memcontrol.c~resource-counters-hierarchy-support mm/memcontrol.c
    > --- linux-2.6.28-rc2/mm/memcontrol.c~resource-counters-hierarchy-support 2008-11-08 14:09:31.000000000 +0530
    > +++ linux-2.6.28-rc2-balbir/mm/memcontrol.c 2008-11-08 14:09:31.000000000 +0530
    > @@ -485,6 +485,7 @@ int mem_cgroup_try_charge(struct mm_stru
    > {
    > struct mem_cgroup *mem;
    > int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
    > + struct res_counter *fail_res;
    > /*
    > * We always charge the cgroup the mm_struct belongs to.
    > * The mm_struct's mem_cgroup changes on task migration if the
    > @@ -510,7 +511,7 @@ int mem_cgroup_try_charge(struct mm_stru
    > }
    >
    >
    > - while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
    > + while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))) {
    > if (!(gfp_mask & __GFP_WAIT))
    > goto nomem;
    >
    > @@ -1175,18 +1176,20 @@ static void mem_cgroup_free(struct mem_c
    > static struct cgroup_subsys_state *
    > mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
    > {
    > - struct mem_cgroup *mem;
    > + struct mem_cgroup *mem, *parent;
    > int node;
    >
    > if (unlikely((cont->parent) == NULL)) {
    > mem = &init_mem_cgroup;
    > + parent = NULL;
    > } else {
    > mem = mem_cgroup_alloc();
    > + parent = mem_cgroup_from_cont(cont->parent);
    > if (!mem)
    > return ERR_PTR(-ENOMEM);
    > }
    >
    > - res_counter_init(&mem->res);
    > + res_counter_init(&mem->res, parent ? &parent->res : NULL);
    >
    > for_each_node_state(node, N_POSSIBLE)
    > if (alloc_mem_cgroup_per_zone_info(mem, node))
    > _
    >
    > --
    > Balbir
    > --
    > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    > the body of a message to majordomo@vger.kernel.org
    > More majordomo info at http://vger.kernel.org/majordomo-info.html
    > Please read the FAQ at http://www.tux.org/lkml/
    >



  9. Re: [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v2)

    On Sat, 08 Nov 2008 14:41:13 +0530
    Balbir Singh wrote:

    >
    > Don't enable multiple hierarchy support by default. This patch introduces
    > a features element that can be set to enable the nested hierarchy
    > feature. The feature can only be enabled for a cgroup that has no
    > children.
    >
    > Signed-off-by: Balbir Singh
    > ---
    >

    IMHO, this permission check alone is not enough.

    I think the following is a sane way:
    ==
    When parent->use_hierarchy == 1:
    my->use_hierarchy must be "1" and cannot be turned to "0", even with no children.
    When parent->use_hierarchy == 0:
    my->use_hierarchy can be either "0" or "1";
    the value can be changed only if we have no children.
    ==

    Thanks,
    -Kame


    > mm/memcontrol.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
    > 1 file changed, 42 insertions(+), 1 deletion(-)
    >
    > diff -puN mm/memcontrol.c~memcg-add-hierarchy-selector mm/memcontrol.c
    > --- linux-2.6.28-rc2/mm/memcontrol.c~memcg-add-hierarchy-selector 2008-11-08 14:09:33.000000000 +0530
    > +++ linux-2.6.28-rc2-balbir/mm/memcontrol.c 2008-11-08 14:10:53.000000000 +0530
    > @@ -137,6 +137,11 @@ struct mem_cgroup {
    > * reclaimed from.
    > */
    > struct mem_cgroup *last_scanned_child;
    > + /*
    > + * Should the accounting and control be hierarchical, per subtree?
    > + */
    > + unsigned long use_hierarchy;
    > +
    > };
    > static struct mem_cgroup init_mem_cgroup;
    >
    > @@ -1079,6 +1084,33 @@ out:
    > return ret;
    > }
    >
    > +static u64 mem_cgroup_hierarchy_read(struct cgroup *cont, struct cftype *cft)
    > +{
    > + return mem_cgroup_from_cont(cont)->use_hierarchy;
    > +}
    > +
    > +static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
    > + u64 val)
    > +{
    > + int retval = 0;
    > + struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
    > +
    > + if (val == 1) {
    > + if (list_empty(&cont->children))
    > + mem->use_hierarchy = val;
    > + else
    > + retval = -EBUSY;
    > + } else if (val == 0) {
    > + if (list_empty(&cont->children))
    > + mem->use_hierarchy = val;
    > + else
    > + retval = -EBUSY;
    > + } else
    > + retval = -EINVAL;
    > +
    > + return retval;
    > +}
    > +
    > static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
    > {
    > return res_counter_read_u64(&mem_cgroup_from_cont(cont)->res,
    > @@ -1213,6 +1245,11 @@ static struct cftype mem_cgroup_files[]
    > .name = "stat",
    > .read_map = mem_control_stat_show,
    > },
    > + {
    > + .name = "use_hierarchy",
    > + .write_u64 = mem_cgroup_hierarchy_write,
    > + .read_u64 = mem_cgroup_hierarchy_read,
    > + },
    > };
    >
    > static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
    > @@ -1289,9 +1326,13 @@ mem_cgroup_create(struct cgroup_subsys *
    > parent = mem_cgroup_from_cont(cont->parent);
    > if (!mem)
    > return ERR_PTR(-ENOMEM);
    > + mem->use_hierarchy = parent->use_hierarchy;
    > }
    >
    > - res_counter_init(&mem->res, parent ? &parent->res : NULL);
    > + if (parent && parent->use_hierarchy)
    > + res_counter_init(&mem->res, &parent->res);
    > + else
    > + res_counter_init(&mem->res, NULL);
    >
    > for_each_node_state(node, N_POSSIBLE)
    > if (alloc_mem_cgroup_per_zone_info(mem, node))
    > _
    >
    > --
    > Balbir
    >



  10. Re: [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v2)

    KAMEZAWA Hiroyuki wrote:
    > On Sat, 08 Nov 2008 14:41:00 +0530
    > Balbir Singh wrote:
    >
    >> This patch introduces hierarchical reclaim. When an ancestor goes over its
    >> limit, the charging routine points to the parent that is above its limit.
    >> The reclaim process then starts from the last scanned child of the ancestor
    >> and reclaims until the ancestor goes below its limit.
    >>
    >> Signed-off-by: Balbir Singh
    >> ---
    >>
    >> mm/memcontrol.c | 152 +++++++++++++++++++++++++++++++++++++++++++++++---------
    >> 1 file changed, 128 insertions(+), 24 deletions(-)
    >>
    >> diff -puN mm/memcontrol.c~memcg-hierarchical-reclaim mm/memcontrol.c
    >> --- linux-2.6.28-rc2/mm/memcontrol.c~memcg-hierarchical-reclaim 2008-11-08 14:09:32.000000000 +0530
    >> +++ linux-2.6.28-rc2-balbir/mm/memcontrol.c 2008-11-08 14:09:32.000000000 +0530
    >> @@ -132,6 +132,11 @@ struct mem_cgroup {
    >> * statistics.
    >> */
    >> struct mem_cgroup_stat stat;
    >> + /*
    >> + * While reclaiming in a hierarchy, we cache the last child we
    >> + * reclaimed from.
    >> + */
    >> + struct mem_cgroup *last_scanned_child;
    >> };
    >> static struct mem_cgroup init_mem_cgroup;
    >>
    >> @@ -467,6 +472,124 @@ unsigned long mem_cgroup_isolate_pages(u
    >> return nr_taken;
    >> }
    >>
    >> +static struct mem_cgroup *
    >> +mem_cgroup_from_res_counter(struct res_counter *counter)
    >> +{
    >> + return container_of(counter, struct mem_cgroup, res);
    >> +}
    >> +
    >> +/*
    >> + * Dance down the hierarchy if needed to reclaim memory. We remember the
    >> + * last child we reclaimed from, so that we don't end up penalizing
    >> + * one child extensively based on its position in the children list.
    >> + *
    >> + * root_mem is the original ancestor that we've been reclaiming from.
    >> + */
    >> +static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *mem,
    >> + struct mem_cgroup *root_mem,
    >> + gfp_t gfp_mask)
    >> +{
    >> + struct cgroup *cg_current, *cgroup;
    >> + struct mem_cgroup *mem_child;
    >> + int ret = 0;
    >> +
    >> + /*
    >> + * Reclaim unconditionally and don't check for return value.
    >> + * We need to reclaim in the current group and down the tree.
    >> + * One might think about checking for children before reclaiming,
    >> + * but there might be left over accounting, even after children
    >> + * have left.
    >> + */
    >> + try_to_free_mem_cgroup_pages(mem, gfp_mask);
    >> +
    >> + if (res_counter_check_under_limit(&root_mem->res))
    >> + return 0;
    >> +
    >> + if (list_empty(&mem->css.cgroup->children))
    >> + return 0;
    >> +
    >> + /*
    >> + * Scan all children under the mem_cgroup mem
    >> + */
    >> + if (!mem->last_scanned_child)
    >> + cgroup = list_first_entry(&mem->css.cgroup->children,
    >> + struct cgroup, sibling);
    >> + else
    >> + cgroup = mem->last_scanned_child->css.cgroup;
    >> +

    >
    > Who guarantees that last_scanned_child is accessible at this point?
    >


    Good catch! I'll fix this in mem_cgroup_destroy. It'll need some locking around
    it as well.

    --
    Balbir

  11. Re: [RFC][mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy (v2)

    KAMEZAWA Hiroyuki wrote:
    > On Sat, 08 Nov 2008 14:40:47 +0530
    > Balbir Singh wrote:
    >
    >> Add support for building hierarchies in resource counters. Cgroups allow us
    >> to build a deep hierarchy, but we currently don't link the resource counters
    >> belonging to the memory controller's control groups in the same fashion
    >> as the corresponding cgroup entries in the cgroup hierarchy. This patch
    >> provides the infrastructure for resource counters that have the same hierarchy
    >> as their cgroup counterparts.
    >>
    >> This set of patches is based on the resource counter hierarchy patches posted
    >> by Pavel Emelianov.
    >>
    >> NOTE: Building hierarchies is expensive; deeper hierarchies imply charging
    >> all the way up to the root. Since hierarchies are known to be expensive,
    >> the user needs to be careful and aware of the trade-offs before creating
    >> very deep ones.
    >>

    > Do you have numbers ?
    >


    I am getting those numbers, I'll post them soon.

    --
    Balbir

  12. Re: [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v2)

    KAMEZAWA Hiroyuki wrote:
    > On Sat, 08 Nov 2008 14:41:13 +0530
    > Balbir Singh wrote:
    >
    >> Don't enable multiple hierarchy support by default. This patch introduces
    >> a features element that can be set to enable the nested hierarchy
    >> feature. The feature can only be enabled for a cgroup that has no
    >> children.
    >>
    >> Signed-off-by: Balbir Singh
    >> ---
    >>

    > IMHO, this permission check alone is not enough.
    >
    > I think the following is a sane way:
    > ==
    > When parent->use_hierarchy == 1:
    > my->use_hierarchy must be "1" and cannot be turned to "0", even with no children.
    > When parent->use_hierarchy == 0:
    > my->use_hierarchy can be either "0" or "1";
    > the value can be changed only if we have no children.
    > ==


    Sounds reasonable, will fix in v3.

    --
    Balbir

  13. Re: [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v2)

    On Tue, 11 Nov 2008 10:17:27 +0530
    Balbir Singh wrote:

    > KAMEZAWA Hiroyuki wrote:
    > > On Sat, 08 Nov 2008 14:41:00 +0530
    > > Balbir Singh wrote:
    > >
    > >> This patch introduces hierarchical reclaim. When an ancestor goes over its
    > >> limit, the charging routine points to the parent that is above its limit.
    > >> The reclaim process then starts from the last scanned child of the ancestor
    > >> and reclaims until the ancestor goes below its limit.
    > >>
    > >> Signed-off-by: Balbir Singh
    > >> ---
    > >>
    > >> mm/memcontrol.c | 152 +++++++++++++++++++++++++++++++++++++++++++++++---------
    > >> 1 file changed, 128 insertions(+), 24 deletions(-)
    > >>
    > >> diff -puN mm/memcontrol.c~memcg-hierarchical-reclaim mm/memcontrol.c
    > >> --- linux-2.6.28-rc2/mm/memcontrol.c~memcg-hierarchical-reclaim 2008-11-08 14:09:32.000000000 +0530
    > >> +++ linux-2.6.28-rc2-balbir/mm/memcontrol.c 2008-11-08 14:09:32.000000000 +0530
    > >> @@ -132,6 +132,11 @@ struct mem_cgroup {
    > >> * statistics.
    > >> */
    > >> struct mem_cgroup_stat stat;
    > >> + /*
    > >> + * While reclaiming in a hierarchy, we cache the last child we
    > >> + * reclaimed from.
    > >> + */
    > >> + struct mem_cgroup *last_scanned_child;
    > >> };
    > >> static struct mem_cgroup init_mem_cgroup;
    > >>
    > >> @@ -467,6 +472,124 @@ unsigned long mem_cgroup_isolate_pages(u
    > >> return nr_taken;
    > >> }
    > >>
    > >> +static struct mem_cgroup *
    > >> +mem_cgroup_from_res_counter(struct res_counter *counter)
    > >> +{
    > >> + return container_of(counter, struct mem_cgroup, res);
    > >> +}
    > >> +
    > >> +/*
    > >> + * Dance down the hierarchy if needed to reclaim memory. We remember the
    > >> + * last child we reclaimed from, so that we don't end up penalizing
    > >> + * one child extensively based on its position in the children list.
    > >> + *
    > >> + * root_mem is the original ancestor that we've been reclaiming from.
    > >> + */
    > >> +static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *mem,
    > >> + struct mem_cgroup *root_mem,
    > >> + gfp_t gfp_mask)
    > >> +{
    > >> + struct cgroup *cg_current, *cgroup;
    > >> + struct mem_cgroup *mem_child;
    > >> + int ret = 0;
    > >> +
    > >> + /*
    > >> + * Reclaim unconditionally and don't check for return value.
    > >> + * We need to reclaim in the current group and down the tree.
    > >> + * One might think about checking for children before reclaiming,
    > >> + * but there might be left over accounting, even after children
    > >> + * have left.
    > >> + */
    > >> + try_to_free_mem_cgroup_pages(mem, gfp_mask);
    > >> +
    > >> + if (res_counter_check_under_limit(&root_mem->res))
    > >> + return 0;
    > >> +
    > >> + if (list_empty(&mem->css.cgroup->children))
    > >> + return 0;
    > >> +
    > >> + /*
    > >> + * Scan all children under the mem_cgroup mem
    > >> + */
    > >> + if (!mem->last_scanned_child)
    > >> + cgroup = list_first_entry(&mem->css.cgroup->children,
    > >> + struct cgroup, sibling);
    > >> + else
    > >> + cgroup = mem->last_scanned_child->css.cgroup;
    > >> +

    > >
    > > Who guarantees that last_scanned_child is accessible at this point?
    > >

    >
    > Good catch! I'll fix this in mem_cgroup_destroy. It'll need some locking around
    > it as well.
    >

    Please see the mem+swap controller's refcnt-to-memcg for delaying the freeing
    of a memcg. It may serve as a hint.

    Thanks,
    -Kame


    > --
    > Balbir
    >



  14. Re: [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v2)

    KAMEZAWA Hiroyuki wrote:
    >>

    > Please see the mem+swap controller's refcnt-to-memcg for delaying the freeing
    > of a memcg. It may serve as a hint.
    >


    I'll integrate with those patches later. I see a memcg->swapref, but we don't
    need to delay deletion for last_scanned_child; we need to make sure that the
    parent does not hold any stale entries.

    --
    Balbir
