[patch 0/7] cpuset writeback throttling - Kernel


  1. [patch 0/7] cpuset writeback throttling

    Andrew,

    This is the revised cpuset writeback throttling patchset posted to LKML
    on Tuesday, October 27.

    The comments from Peter Zijlstra have been addressed. His concurrent
    page cache patchset is not currently in -mm, so we can still serialize
    updating a struct address_space's dirty_nodes on its tree_lock. When his
    patchset is merged, the patch at the end of this message can be used to
    introduce the necessary synchronization.

    This patchset applies nicely to 2.6.28-rc2-mm1 with the exception of the
    first patch due to the alloc_inode() refactoring to inode_init_always() in
    e9110864c440736beb484c2c74dedc307168b14e from linux-next and additions to
    include/linux/cpuset.h from
    oom-print-triggering-tasks-cpuset-and-mems-allowed.patch.

    Please consider this for inclusion in the -mm tree.

    A simple way of testing this change is to create a large file that exceeds
    the amount of memory allocated to a specific cpuset. Then, mmap and
    modify the large file (such as in the following program) while running a
    latency sensitive task in a disjoint cpuset. Notice the writeout
    throttling that doesn't interfere with the latency sensitive task.

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <sys/mman.h>

    int main(int argc, char **argv)
    {
            void *addr;
            unsigned long length;
            unsigned long i;
            int fd;

            if (argc != 3) {
                    fprintf(stderr, "usage: %s <file> <length>\n",
                            argv[0]);
                    exit(1);
            }

            fd = open(argv[1], O_RDWR, 0644);
            if (fd < 0) {
                    fprintf(stderr, "Cannot open file %s\n", argv[1]);
                    exit(1);
            }

            length = strtoul(argv[2], NULL, 0);
            if (!length) {
                    fprintf(stderr, "Invalid length %s\n", argv[2]);
                    exit(1);
            }

            addr = mmap(0, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
                        0);
            if (addr == MAP_FAILED) {
                    fprintf(stderr, "mmap() failed\n");
                    exit(1);
            }

            for (;;) {
                    for (i = 0; i < length; i++)
                            (*((char *)addr + i))++;
                    msync(addr, length, MS_SYNC);
            }
            return 0;
    }



    The following patch can be applied once the struct address_space's
    tree_lock is removed to protect the attachment of mapping->dirty_nodes.
    ---
    diff --git a/fs/inode.c b/fs/inode.c
    --- a/fs/inode.c
    +++ b/fs/inode.c
    @@ -223,6 +223,9 @@ void inode_init_once(struct inode *inode)
    INIT_LIST_HEAD(&inode->inotify_watches);
    mutex_init(&inode->inotify_mutex);
    #endif
    +#if MAX_NUMNODES > BITS_PER_LONG
    + spin_lock_init(&inode->i_data.dirty_nodes_lock);
    +#endif
    }

    EXPORT_SYMBOL(inode_init_once);
    diff --git a/include/linux/fs.h b/include/linux/fs.h
    --- a/include/linux/fs.h
    +++ b/include/linux/fs.h
    @@ -554,6 +554,7 @@ struct address_space {
    nodemask_t dirty_nodes; /* nodes with dirty pages */
    #else
    nodemask_t *dirty_nodes; /* pointer to mask, if dirty */
    + spinlock_t dirty_nodes_lock; /* protects the above */
    #endif
    #endif
    } __attribute__((aligned(sizeof(long))));
    diff --git a/kernel/cpuset.c b/kernel/cpuset.c
    --- a/kernel/cpuset.c
    +++ b/kernel/cpuset.c
    @@ -2413,25 +2413,27 @@ EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
    #if MAX_NUMNODES > BITS_PER_LONG
    /*
    * Special functions for NUMA systems with a large number of nodes. The
    - * nodemask is pointed to from the address_space structure. The attachment of
    - * the dirty_nodes nodemask is protected by the tree_lock. The nodemask is
    - * freed only when the inode is cleared (and therefore unused, thus no locking
    - * is necessary).
    + * nodemask is pointed to from the address_space structure.
    */
    void cpuset_update_dirty_nodes(struct address_space *mapping,
    struct page *page)
    {
    - nodemask_t *nodes = mapping->dirty_nodes;
    + nodemask_t *nodes;
    int node = page_to_nid(page);

    + spin_lock_irq(&mapping->dirty_nodes_lock);
    + nodes = mapping->dirty_nodes;
    if (!nodes) {
    nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
    - if (!nodes)
    + if (!nodes) {
    + spin_unlock_irq(&mapping->dirty_nodes_lock);
    return;
    + }

    *nodes = NODE_MASK_NONE;
    mapping->dirty_nodes = nodes;
    }
    + spin_unlock_irq(&mapping->dirty_nodes_lock);
    node_set(node, *nodes);
    }

    @@ -2446,8 +2448,8 @@ void cpuset_clear_dirty_nodes(struct address_space *mapping)
    }

    /*
    - * Called without tree_lock. The nodemask is only freed when the inode is
    - * cleared and therefore this is safe.
    + * The nodemask is only freed when the inode is cleared and therefore this
    + * requires no locking.
    */
    int cpuset_intersects_dirty_nodes(struct address_space *mapping,
    nodemask_t *mask)
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: [patch 0/7] cpuset writeback throttling

    On Thu, Oct 30, 2008 at 12:23:10PM -0700, David Rientjes wrote:
    > Andrew,
    >
    > This is the revised cpuset writeback throttling patchset posted to LKML
    > on Tuesday, October 27.
    >
    > The comments from Peter Zijlstra have been addressed. His concurrent
    > page cache patchset is not currently in -mm, so we can still serialize
    > updating a struct address_space's dirty_nodes on its tree_lock. When his
    > patchset is merged, the patch at the end of this message can be used to
    > introduce the necessary synchronization.
    >
    > This patchset applies nicely to 2.6.28-rc2-mm1 with the exception of the
    > first patch due to the alloc_inode() refactoring to inode_init_always() in
    > e9110864c440736beb484c2c74dedc307168b14e from linux-next and additions to
    > include/linux/cpuset.h from
    > oom-print-triggering-tasks-cpuset-and-mems-allowed.patch.
    >
    > Please consider this for inclusion in the -mm tree.
    >
    > A simple way of testing this change is to create a large file that exceeds
    > the amount of memory allocated to a specific cpuset. Then, mmap and
    > modify the large file (such as in the following program) while running a
    > latency sensitive task in a disjoint cpuset. Notice the writeout
    > throttling that doesn't interfere with the latency sensitive task.


    What sort of validation/regression testing has this been through?

    Cheers,

    Dave.
    --
    Dave Chinner
    david@fromorbit.com

  3. Re: [patch 0/7] cpuset writeback throttling

    On Fri, 31 Oct 2008, Dave Chinner wrote:

    > What sort of validation/regression testing has this been through?


    There were several tests run more than a year ago by me on large SGI
    machines. Then Solomita ran various tests and posted revisions of the
    patchset over the last year.



  4. Re: [patch 0/7] cpuset writeback throttling

    On Thu, Oct 30, 2008 at 04:33:34PM -0500, Christoph Lameter wrote:
    > On Fri, 31 Oct 2008, Dave Chinner wrote:
    >
    >> What sort of validation/regression testing has this been through?

    >
    > There were several tests run more than a year ago by me on large SGI
    > machines. Then Solomita ran various tests and posted revisions of the
    > patchset over the last year.


    Yes, I know about those tests at SGI - they were to demonstrate the
    feature worked (i.e. proof-of-concept demonstration), not regression
    test the change.

    This is a fairly major change in behaviour to the writeback path on
    NUMA systems and so has the potential to introduce subtle new
    issues. Hence I'm asking about the level of testing and exposure
    it has had. It doesn't really sound like it has had much coverage
    to me....

    I'm concerned right now because Nick and others (including myself)
    have recently found lots of nasty data integrity problems in the
    writeback path that we are currently trying to test fixes for.
    It's not a good time to introduce new behaviours as that will
    definitely perturb any regression testing we are running....

    Cheers,

    Dave.
    --
    Dave Chinner
    david@fromorbit.com

  5. Re: [patch 0/7] cpuset writeback throttling

    On Fri, 31 Oct 2008, Dave Chinner wrote:

    > Yes, I know about those tests at SGI - they were to demonstrate the
    > feature worked (i.e. proof-of-concept demonstration), not regression
    > test the change.


    I did some regression tests at that time before publishing the initial
    patchsets. That was mainly using standard tests (AIM7). There have been
    some changes since then though.

    > This is a fairly major change in behaviour to the writeback path on
    > NUMA systems and so has the potential to introduce subtle new
    > issues. Hence I'm asking about the level of testing and exposure
    > it has had. It doesn't really sound like it has had much coverage
    > to me....


    Sure, we need more testing. Solomita did some testing, as is evident from
    his messages. See f.e.
    http://linux.derkeiler.com/Mailing-L.../msg02939.html

    > I'm concerned right now because Nick and others (including myself)
    > have recently found lots of nasty data integrity problems in the
    > writeback path that we are currently trying to test fixes for.
    > It's not a good time to introduce new behaviours as that will
    > definitely perturb any regression testing we are running....


    The writeback throttling addressed in this patchset mainly changes the
    calculation of the limits that determine when writeback is started, and
    selects inodes based on which nodes their pages are allocated on. That is
    relatively independent of the locking scheme for writeback.

  6. Re: [patch 0/7] cpuset writeback throttling

    On Thu, 30 Oct 2008 12:23:10 -0700 (PDT)
    David Rientjes wrote:

    > This is the revised cpuset writeback throttling patchset


    I'm all confused about why this is a cpuset thing rather than a cgroups
    thing. What are the relationships here?

    I mean, writeback throttling _should_ operate upon a control group
    (more specifically: a memcg), yes? I guess you're assuming a 1:1
    relationship here?


  7. Re: [patch 0/7] cpuset writeback throttling

    On Tue, 2008-11-04 at 12:47 -0800, Andrew Morton wrote:
    > On Thu, 30 Oct 2008 12:23:10 -0700 (PDT)
    > David Rientjes wrote:
    >
    > > This is the revised cpuset writeback throttling patchset

    >
    > I'm all confused about why this is a cpuset thing rather than a cgroups
    > thing. What are the relationships here?
    >
    > I mean, writeback throttling _should_ operate upon a control group
    > (more specifically: a memcg), yes? I guess you're assuming a 1:1
    > relationship here?


    I think the main reason is that we have per-node vmstats so the cpuset
    extension is relatively easy, whereas we do not currently maintain
    vmstats on a cgroup level - although I imagine that could be remedied.

  8. Re: [patch 0/7] cpuset writeback throttling

    On Tue, 4 Nov 2008, Peter Zijlstra wrote:

    > I think the main reason is that we have per-node vmstats so the cpuset
    > extension is relatively easy, whereas we do not currently maintain
    > vmstats on a cgroup level - although I imagine that could be remedied.


    We have a separate LRU for the control groups so we are on the way
    there.



  9. Re: [patch 1/7] cpusets: add dirty map to struct address_space

    On Thu, 30 Oct 2008 12:23:11 -0700 (PDT)
    David Rientjes wrote:

    > From: Christoph Lameter
    >
    > In a NUMA system it is helpful to know where the dirty pages of a mapping
    > are located. That way we will be able to implement writeout for
    > applications that are constrained to a portion of the memory of the
    > system as required by cpusets.
    >
    > This patch implements the management of dirty node maps for an address
    > space through the following functions:
    >
    > cpuset_clear_dirty_nodes(mapping) Clear the map of dirty nodes
    >
    > cpuset_update_nodes(mapping, page) Record a node in the dirty nodes
    > map
    >
    > cpuset_init_dirty_nodes(mapping) Initialization of the map
    >
    >
    > The dirty map may be stored either directly in the mapping (for NUMA
    > systems with less than BITS_PER_LONG nodes) or separately allocated for
    > systems with a large number of nodes (f.e. ia64 with 1024 nodes).
    >
    > Updating the dirty map may involve allocating it first for large
    > configurations. Therefore, we protect the allocation and setting of a
    > node in the map through the tree_lock. The tree_lock is already taken
    > when a page is dirtied so there is no additional locking overhead if we
    > insert the updating of the nodemask there.
    >
    > The dirty map is only cleared (or freed) when the inode is cleared. At
    > that point no pages are attached to the inode anymore and therefore it
    > can be done without any locking. The dirty map therefore records all nodes
    > that have been used for dirty pages by that inode until the inode is no
    > longer used.
    >
    > If the mapping for an inode's set of dirty nodes does not intersect the
    > set under writeback, the inode is requeued for later. This is the
    > mechanism for enforcing the writeback control's set of nodes policy.
    >
    > ...
    >
    > +/*
    > + * We need macros since struct address_space is not defined yet


    sigh. Thanks for adding the comment.

    > + */
    > +#if MAX_NUMNODES <= BITS_PER_LONG
    > +#define cpuset_init_dirty_nodes(__mapping) \
    > + (__mapping)->dirty_nodes = NODE_MASK_NONE
    > +
    > +#define cpuset_update_dirty_nodes(__mapping, __page) \
    > + node_set(page_to_nid(__page), (__mapping)->dirty_nodes);
    > +
    > +#define cpuset_clear_dirty_nodes(__mapping) \
    > + (__mapping)->dirty_nodes = NODE_MASK_NONE
    > +
    > +#define cpuset_intersects_dirty_nodes(__mapping, __nodemask_ptr) \
    > + (!(__nodemask_ptr) || \
    > + nodes_intersects((__mapping)->dirty_nodes, *(__nodemask_ptr)))


    This macro can evaluate __nodemask_ptr one or two times, possibly
    causing code bloat, performance loss or bugs. Copying it into a local
    would fix that.

    > +#else
    > +struct address_space;


    When forward-declaring structures like this I'd suggest that the
    declaration be put right at the top of the header file, outside ifdefs.

    Because putting it inside an ifdef is just inviting compilation errors
    later on when someone assumes that address_space was declared and
    didn't test all config options.

    And put it at the top of the file because some unobservant person might
    later add more code to the header ahead of the address_space
    declaration which needs address_space. So they go and add a second
    forward declaration of the same thing. We've had that happen before.

    > +#define cpuset_init_dirty_nodes(__mapping) \
    > + (__mapping)->dirty_nodes = NULL


    I think there are ways in which this sort of macro can cause
    compilation errors or even bugs. I forget all the scenarios, but we do
    know that whapping a `do { ... } while (0)' around it fixes them.

    > +extern void cpuset_update_dirty_nodes(struct address_space *mapping,
    > + struct page *page);
    > +extern void cpuset_clear_dirty_nodes(struct address_space *mapping);
    > +extern int cpuset_intersects_dirty_nodes(struct address_space *mapping,
    > + nodemask_t *mask);
    > +#endif
    > +
    > #else /* !CONFIG_CPUSETS */
    >
    > static inline int cpuset_init_early(void) { return 0; }
    > @@ -163,6 +193,27 @@ static inline void rebuild_sched_domains(void)
    > partition_sched_domains(1, NULL, NULL);
    > }
    >
    > +struct address_space;
    > +
    > +static inline void cpuset_init_dirty_nodes(struct address_space *mapping)
    > +{
    > +}
    > +
    > +static inline void cpuset_update_dirty_nodes(struct address_space *mapping,
    > + struct page *page)
    > +{
    > +}


    OK, what's going on here? I count _three_ definitions of
    cpuset_update_dirty_nodes(). One is a macro, one is a static inline
    stub, one is an out-of-line C function.

    What's up with that? Maybe `#if MAX_NUMNODES > BITS_PER_LONG' added
    the third leg?

    > +static inline void cpuset_clear_dirty_nodes(struct address_space *mapping)
    > +{
    > +}
    > +
    > +static inline int cpuset_intersects_dirty_nodes(struct address_space *mapping,
    > + nodemask_t *mask)
    > +{
    > + return 1;
    > +}
    > +
    > #endif /* !CONFIG_CPUSETS */
    >
    > #endif /* _LINUX_CPUSET_H */
    > diff --git a/include/linux/fs.h b/include/linux/fs.h
    > --- a/include/linux/fs.h
    > +++ b/include/linux/fs.h
    > @@ -544,6 +544,13 @@ struct address_space {
    > spinlock_t private_lock; /* for use by the address_space */
    > struct list_head private_list; /* ditto */
    > struct address_space *assoc_mapping; /* ditto */
    > +#ifdef CONFIG_CPUSETS
    > +#if MAX_NUMNODES <= BITS_PER_LONG
    > + nodemask_t dirty_nodes; /* nodes with dirty pages */
    > +#else
    > + nodemask_t *dirty_nodes; /* pointer to mask, if dirty */
    > +#endif
    > +#endif
    > } __attribute__((aligned(sizeof(long))));


    eek. Increasing the size of the address_space (and hence of the inode)
    is a moderately big deal - there can be millions of these in memory.

    Which gets me thinking...

    Many inodes don't have any pagecache. ext3 directories, anything which
    was read via stat(), for a start. Would it be worthwhile optimising
    for this?

    Options would include

    - remove inode.i_data and dynamically allocate it.

    - move your dirty_nodes thing into radix_tree_node, dynamically
    allocate address_space.page_tree (and tree_lock).

    - etc.

    There's a little project for someone to be going on with.


    > /*
    > * On most architectures that alignment is already the case; but
    > diff --git a/include/linux/writeback.h b/include/linux/writeback.h
    > --- a/include/linux/writeback.h
    > +++ b/include/linux/writeback.h
    > @@ -72,6 +72,8 @@ struct writeback_control {
    > * so we use a single control to update them
    > */
    > unsigned no_nrwrite_index_update:1;
    > +
    > + nodemask_t *nodes; /* Nodemask to writeback */


    This one doesn't get ifdefs?

    > };
    >
    > /*
    > diff --git a/kernel/cpuset.c b/kernel/cpuset.c
    > --- a/kernel/cpuset.c
    > +++ b/kernel/cpuset.c
    > @@ -16,6 +16,7 @@
    > * 2006 Rework by Paul Menage to use generic cgroups
    > * 2008 Rework of the scheduler domains and CPU hotplug handling
    > * by Max Krasnyansky
    > + * 2008 Cpuset writeback by Christoph Lameter
    > *
    > * This file is subject to the terms and conditions of the GNU General Public
    > * License. See the file COPYING in the main directory of the Linux
    > @@ -2323,6 +2324,58 @@ int cpuset_mem_spread_node(void)
    > }
    > EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
    >
    > +#if MAX_NUMNODES > BITS_PER_LONG
    > +/*
    > + * Special functions for NUMA systems with a large number of nodes. The
    > + * nodemask is pointed to from the address_space structure. The attachment of
    > + * the dirty_nodes nodemask is protected by the tree_lock. The nodemask is
    > + * freed only when the inode is cleared (and therefore unused, thus no locking
    > + * is necessary).
    > + */
    > +void cpuset_update_dirty_nodes(struct address_space *mapping,
    > + struct page *page)
    > +{
    > + nodemask_t *nodes = mapping->dirty_nodes;
    > + int node = page_to_nid(page);
    > +
    > + if (!nodes) {
    > + nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);


    erk, OK, called from __set_page_dirty, needs to be atomic.

    What are the consequences when this allocation fails?

    I prefer `foo = kmalloc(sizeof(*foo), ...)', personally. It avoids
    duplicating information and means that PoorOldReviewer doesn't have to
    crawl back through the code checking that NastyYoungCoder got the type
    right.



  10. Re: [patch 0/7] cpuset writeback throttling

    On Tue, 04 Nov 2008 21:53:08 +0100
    Peter Zijlstra wrote:

    > On Tue, 2008-11-04 at 12:47 -0800, Andrew Morton wrote:
    > > On Thu, 30 Oct 2008 12:23:10 -0700 (PDT)
    > > David Rientjes wrote:
    > >
    > > > This is the revised cpuset writeback throttling patchset

    > >
    > > I'm all confused about why this is a cpuset thing rather than a cgroups
    > > thing. What are the relationships here?
    > >
    > > I mean, writeback throttling _should_ operate upon a control group
    > > (more specifically: a memcg), yes? I guess you're assuming a 1:1
    > > relationship here?

    >
    > I think the main reason is that we have per-node vmstats so the cpuset
    > extension is relatively easy, whereas we do not currently maintain
    > vmstats on a cgroup level - although I imagine that could be remedied.


    It didn't look easy to me - it added a lot more code in places which are
    already wicked complex.

    I'm trying to understand where this is all coming from and what fits
    into where. Fiddling with a cpuset's mems_allowed for purposes of
    memory partitioning is all nasty 2007 technology, isn't it? Does a raw
    cpuset-based control such as this have a future?



  11. Re: [patch 0/7] cpuset writeback throttling

    On Tue, 2008-11-04 at 13:16 -0800, Andrew Morton wrote:
    > On Tue, 04 Nov 2008 21:53:08 +0100
    > Peter Zijlstra wrote:
    >
    > > On Tue, 2008-11-04 at 12:47 -0800, Andrew Morton wrote:
    > > > On Thu, 30 Oct 2008 12:23:10 -0700 (PDT)
    > > > David Rientjes wrote:
    > > >
    > > > > This is the revised cpuset writeback throttling patchset
    > > >
    > > > I'm all confused about why this is a cpuset thing rather than a cgroups
    > > > thing. What are the relationships here?
    > > >
    > > > I mean, writeback throttling _should_ operate upon a control group
    > > > (more specifically: a memcg), yes? I guess you're assuming a 1:1
    > > > relationship here?

    > >
    > > I think the main reason is that we have per-node vmstats so the cpuset
    > > extension is relatively easy, whereas we do not currently maintain
    > > vmstats on a cgroup level - although I imagine that could be remedied.

    >
    > It didn't look easy to me - it added a lot more code in places which are
    > already wicked complex.
    >
    > I'm trying to understand where this is all coming from and what fits
    > into where. Fiddling with a cpuset's mems_allowed for purposes of
    > memory partitioning is all nasty 2007 technology, isn't it? Does a raw
    > cpuset-based control such as this have a future?


    Yes, cpusets are making a come-back on the embedded multi-core Real-Time
    side. Folks love to isolate stuff..

    Not saying I really like it...

    Also, there seems to be talk about node aware pdflush from the
    filesystems folks, not sure we need cpusets for that, but this does seem
    to add some node information into it.


  12. Re: [patch 1/7] cpusets: add dirty map to struct address_space

    On Tue, 4 Nov 2008, Andrew Morton wrote:

    >> +#ifdef CONFIG_CPUSETS
    >> +#if MAX_NUMNODES <= BITS_PER_LONG
    >> + nodemask_t dirty_nodes; /* nodes with dirty pages */
    >> +#else
    >> + nodemask_t *dirty_nodes; /* pointer to mask, if dirty */
    >> +#endif
    >> +#endif
    >> } __attribute__((aligned(sizeof(long))));

    >
    > eek. Increasing the size of the address_space (and hence of the inode)
    > is a moderately big deal - there can be millions of these in memory.


    Well this is adding only a single word to the inode structure.

    >> @@ -72,6 +72,8 @@ struct writeback_control {
    >> * so we use a single control to update them
    >> */
    >> unsigned no_nrwrite_index_update:1;
    >> +
    >> + nodemask_t *nodes; /* Nodemask to writeback */

    >
    > This one doesn't get ifdefs?


    The structure is typically allocated temporarily on the stack.

    >> + nodemask_t *nodes = mapping->dirty_nodes;
    >> + int node = page_to_nid(page);
    >> +
    >> + if (!nodes) {
    >> + nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);

    >
    > erk, OK, called from __set_page_dirty, needs to be atomic.
    >
    > What are the consequences when this allocation fails?


    Dirty tracking will not occur. All nodes are assumed to be dirty.

    We discussed this earlier
    http://www.ussg.iu.edu/hypermail/lin...09.1/2291.html


  13. Re: [patch 1/7] cpusets: add dirty map to struct address_space

    On Tue, 4 Nov 2008 15:20:56 -0600 (CST)
    Christoph Lameter wrote:

    > On Tue, 4 Nov 2008, Andrew Morton wrote:
    >
    > >> +#ifdef CONFIG_CPUSETS
    > >> +#if MAX_NUMNODES <= BITS_PER_LONG
    > >> + nodemask_t dirty_nodes; /* nodes with dirty pages */
    > >> +#else
    > >> + nodemask_t *dirty_nodes; /* pointer to mask, if dirty */
    > >> +#endif
    > >> +#endif
    > >> } __attribute__((aligned(sizeof(long))));

    > >
    > > eek. Increasing the size of the address_space (and hence of the inode)
    > > is a moderately big deal - there can be millions of these in memory.

    >
    > Well this is adding only a single word to the inode structure.


    multiplied by millions.

    > >> @@ -72,6 +72,8 @@ struct writeback_control {
    > >> * so we use a single control to update them
    > >> */
    > >> unsigned no_nrwrite_index_update:1;
    > >> +
    > >> + nodemask_t *nodes; /* Nodemask to writeback */

    > >
    > > This one doesn't get ifdefs?

    >
    > The structure is typically allocated temporarily on the stack.


    An ifdef also helps catch the presence of unnecessary code.

    > >> + nodemask_t *nodes = mapping->dirty_nodes;
    > >> + int node = page_to_nid(page);
    > >> +
    > >> + if (!nodes) {
    > >> + nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);

    > >
    > > erk, OK, called from __set_page_dirty, needs to be atomic.
    > >
    > > What are the consequences when this allocation fails?

    >
    > Dirty tracking will not occur. All nodes are assumed to be dirty.


    It won't oops?

    > We discussed this earlier
    > http://www.ussg.iu.edu/hypermail/lin...09.1/2291.html


    Well, if it isn't in the changelog and isn't in code comments, we get
    to discuss it again.

    A great amount of mailing list discussion is a Huge Honking Big Fat
    Hint that the original code was insufficiently understandable.

  14. Re: [patch 0/7] cpuset writeback throttling

    On Tue, 04 Nov 2008 22:21:50 +0100
    Peter Zijlstra wrote:

    > On Tue, 2008-11-04 at 13:16 -0800, Andrew Morton wrote:
    > > On Tue, 04 Nov 2008 21:53:08 +0100
    > > Peter Zijlstra wrote:
    > >
    > > > On Tue, 2008-11-04 at 12:47 -0800, Andrew Morton wrote:
    > > > > On Thu, 30 Oct 2008 12:23:10 -0700 (PDT)
    > > > > David Rientjes wrote:
    > > > >
    > > > > > This is the revised cpuset writeback throttling patchset
    > > > >
    > > > > I'm all confused about why this is a cpuset thing rather than a cgroups
    > > > > thing. What are the relationships here?
    > > > >
    > > > > I mean, writeback throttling _should_ operate upon a control group
    > > > > (more specifically: a memcg), yes? I guess you're assuming a 1:1
    > > > > relationship here?
    > > >
    > > > I think the main reason is that we have per-node vmstats so the cpuset
    > > > extension is relatively easy, whereas we do not currently maintain
    > > > vmstats on a cgroup level - although I imagine that could be remedied.

    > >
    > > It didn't look easy to me - it added a lot more code in places which are
    > > already wicked complex.
    > >
    > > I'm trying to understand where this is all coming from and what fits
    > > into where. Fiddling with a cpuset's mems_allowed for purposes of
    > > memory partitioning is all nasty 2007 technology, isn't it? Does a raw
    > > cpuset-based control such as this have a future?

    >
    > Yes, cpusets are making a come-back on the embedded multi-core Real-Time
    > side. Folks love to isolate stuff..
    >
    > Not saying I really like it...
    >
    > Also, there seems to be talk about node aware pdflush from the
    > filesystems folks, not sure we need cpusets for that, but this does seem
    > to add some node information into it.


    Sorry, but I'm not seeing enough solid justification here for merging a
    fairly large amount of fairly tricksy code into core kernel. Code
    which, afaict, is heading in the opposite direction from where we've
    all been going for a year or two.

    What are the alternatives here? What do we need to do to make
    throttling a per-memcg thing?


    The patchset is badly misnamed, btw. It doesn't throttle writeback -
    in fact several people are working on IO bandwidth controllers and
    calling this thing "writeback throttling" risks confusion.

    What we're in fact throttling is rate-of-memory-dirtying. The last
    thing we want to throttle is writeback - we want it to go as fast as
    possible!

    Only I can't think of a suitable handy-dandy moniker for this concept.

  15. Re: [patch 0/7] cpuset writeback throttling

    On Tue, 4 Nov 2008, Andrew Morton wrote:

    > What are the alternatives here? What do we need to do to make
    > throttling a per-memcg thing?


    Add statistics to the memcg LRU, and then you need some kind of set of
    memcgs, represented by bitmaps or similar, attached to each inode.

    > The patchset is badly misnamed, btw. It doesn't throttle writeback -
    > in fact several people are working on IO bandwidth controllers and
    > calling this thing "writeback throttling" risks confusion.


    It is limiting dirty pages and throttling the dirty rate of applications
    in a NUMA system (same procedure as we do in non-NUMA). The excessive
    dirtying without this patchset can cause OOMs to occur on NUMA systems.

    > What we're in fact throttling is rate-of-memory-dirtying. The last
    > thing we want to throttle is writeback - we want it to go as fast as
    > possible!


    We want to limit the amount of dirty pages such that I/O is progressing
    with optimal speeds for an application that significantly dirties memory.
    Other processes need to be still able to do their work without being
    swapped out because of excessive memory use for dirty pages by the first
    process.

  16. Re: [patch 0/7] cpuset writeback throttling

    On Tue, 4 Nov 2008 16:17:52 -0600 (CST)
    Christoph Lameter wrote:

    > On Tue, 4 Nov 2008, Andrew Morton wrote:
    >
    > > What are the alternatives here? What do we need to do to make
    > > throttling a per-memcg thing?

    >
    > Add statistics to the memcg LRU, and then you need some kind of set of
    > memcgs, represented by bitmaps or similar, attached to each inode.
    >
    > > The patchset is badly misnamed, btw. It doesn't throttle writeback -
    > > in fact several people are working on IO bandwidth controllers and
    > > calling this thing "writeback throttling" risks confusion.

    >
    > It is limiting dirty pages and throttling the dirty rate of applications
    > in a NUMA system (same procedure as we do in non NUMA). The excessive
    > dirtying without this patchset can cause OOMs to occur on NUMA systems.


    yup.

    To fix this with a memcg-based throttling, the operator would need to
    be able to create memcg's which have pages only from particular nodes.
    (That's a bit indirect relative to what they want to do, but is
    presumably workable).

    But do we even have that capability now?


  17. Re: [patch 0/7] cpuset writeback throttling

    On Tue, 4 Nov 2008, Andrew Morton wrote:

    > To fix this with a memcg-based throttling, the operator would need to
    > be able to create memcg's which have pages only from particular nodes.
    > (That's a bit indirect relative to what they want to do, but is
    > presumably workable).


    The system would need to have the capability to find the memcg groups that
    have dirty pages for a certain inode. Files are not constrained to nodes
    or memcg groups.


  18. Re: [patch 0/7] cpuset writeback throttling

    On Tue, 4 Nov 2008 16:52:48 -0600 (CST)
    Christoph Lameter wrote:

    > On Tue, 4 Nov 2008, Andrew Morton wrote:
    >
    > > To fix this with a memcg-based throttling, the operator would need to
    > > be able to create memcg's which have pages only from particular nodes.
    > > (That's a bit indirect relative to what they want to do, but is
    > > presumably workable).

    >
    > The system would need to have the capability to find the memcg groups that
    > have dirty pages for a certain inode. Files are not constrained to nodes
    > or memcg groups.


    Ah, we're talking about different things.

    In a memcg implementation what we would implement is "throttle
    page-dirtying tasks in this memcg when the memcg's dirty memory reaches
    40% of its total".

    But that doesn't solve the problem which this patchset is trying to
    solve, which is "don't let all the memory in all this group of nodes
    get dirty".


    Yes? Someone help me out here. I don't yet have my head around the
    overlaps and incompatibilities here. Perhaps the containers guys will
    wake up and put their thinking caps on?



    What happens if cpuset A uses nodes 0,1,2,3,4,5,6,7,8,9 and cpuset B
    uses nodes 0,1? Can activity in cpuset A cause ooms in cpuset B?


  19. Re: [patch 0/7] cpuset writeback throttling

    On Tue, 4 Nov 2008 15:36:10 -0800
    Andrew Morton wrote:

    > On Tue, 4 Nov 2008 16:52:48 -0600 (CST)
    > Christoph Lameter wrote:
    >
    > > On Tue, 4 Nov 2008, Andrew Morton wrote:
    > >
    > > > To fix this with a memcg-based throttling, the operator would need to
    > > > be able to create memcg's which have pages only from particular nodes.
    > > > (That's a bit indirect relative to what they want to do, but is
    > > > presumably workable).

    > >
    > > The system would need to have the capability to find the memcg groups that
    > > have dirty pages for a certain inode. Files are not constrained to nodes
    > > or memcg groups.

    >
    > Ah, we're talking about different things.
    >
    > In a memcg implementation what we would implement is "throttle
    > page-dirtying tasks in this memcg when the memcg's dirty memory reaches
    > 40% of its total".
    >

    Yes. Andrea posted that.


    > But that doesn't solve the problem which this patchset is trying to
    > solve, which is "don't let all the memory in all this group of nodes
    > get dirty".
    >

    Yes, but this patch doesn't help the case you mention below.

    >
    > Yes? Someone help me out here. I don't yet have my head around the
    > overlaps and incompatibilities here. Perhaps the containers guys will
    > wake up and put their thinking caps on?
    >
    >
    >
    > What happens if cpuset A uses nodes 0,1,2,3,4,5,6,7,8,9 and cpuset B
    > uses nodes 0,1? Can activity in cpuset A cause ooms in cpuset B?
    >

    To help with this, per-node dirty-ratio throttling is necessary.

    Shouldn't we just have a new parameter, /proc/sys/vm/dirty_ratio_per_node?

    /proc/sys/vm/dirty_ratio works for throttling dirty pages across the whole system.
    /proc/sys/vm/dirty_ratio_per_node would work for throttling dirty pages within a node.

    Implementation would not be difficult, and it would work well enough against OOM.

    Thanks,
    -Kame









  20. Re: [patch 0/7] cpuset writeback throttling

    On Tue, 4 Nov 2008, Andrew Morton wrote:

    > In a memcg implementation what we would implement is "throttle
    > page-dirtying tasks in this memcg when the memcg's dirty memory reaches
    > 40% of its total".


    Right, that is similar to what this patch does for cpusets. A memcg
    implementation would need to figure out if we are currently part of a
    memcg and then determine the percentage of memory that is dirty.

    That is one aspect. When performing writeback then we need to figure out
    which inodes have dirty pages in the memcg and we need to start writeout
    on those inodes and not on others that have their dirty pages elsewhere.
    There are two components of this that are in this patch and that would
    also have to be implemented for a memcg.

    > But that doesn't solve the problem which this patchset is trying to
    > solve, which is "don't let all the memory in all this group of nodes
    > get dirty".


    This patch would solve the problem if the calculation of the dirty pages
    would consider the active memcg and be able to determine the amount of
    dirty pages (through some sort of additional memcg counters). That is just
    the first part though. The second part of finding the inodes that have
    dirty pages for writeback would require an association between memcgs and
    inodes.

    > What happens if cpuset A uses nodes 0,1,2,3,4,5,6,7,8,9 and cpuset B
    > uses nodes 0,1? Can activity in cpuset A cause ooms in cpuset B?


    Yes, if the activities of cpuset A cause all pages in cpuset B to be
    dirtied and cpuset B then attempts writeback. Writeback will fail to
    acquire enough memory, making reclaim impossible.

    Typically cpusets are not overlapped like that but used to segment the
    system.

    The system would work correctly if the dirty ratio calculation were
    done across all overlapping cpusets/memcg groups that contain nodes
    from which allocations are permitted.
