Thread: Tiny cpusets -- cpusets for small systems?

  1. Tiny cpusets -- cpusets for small systems?

    A couple of proposals have been made recently by people working Linux
    on smaller systems, for improving realtime isolation and memory
    pressure handling:

    (1) cpu isolation for hard(er) realtime
    http://lkml.org/lkml/2008/2/21/517
    Max Krasnyanskiy
    [PATCH sched-devel 0/7] CPU isolation extensions

    (2) notify user space of tight memory
    http://lkml.org/lkml/2008/2/9/144
    KOSAKI Motohiro
    [PATCH 0/8][for -mm] mem_notify v6

    In both cases, some of us have responded "why not use cpusets", and the
    original submitters have replied "cpusets are too fat" (well, they
    were more diplomatic than that, but I guess I can say that of my own code ;).

    I wonder if there might be room for a "tiny cpusets" configuration
    option:
    * provide the same hooks to the rest of the kernel, and
    * provide the same syntactic interface to user space, but
    * with more limited semantics.

    The primary semantic limit I'd suggest would be supporting exactly
    one layer depth of cpusets, not a full hierarchy. So one could still
    successfully issue from user space 'mkdir /dev/cpuset/foo', but trying
    to do 'mkdir /dev/cpuset/foo/bar' would fail. This reminds me of
    very early FAT file systems, which had just a single, fixed size
    root directory. There might even be a configurable fixed upper
    limit on how many /dev/cpuset/* directories were allowed, further
    simplifying the locking and dynamic memory behavior of this apparatus.

    Some other features that aren't so easy to implement, and which have
    less value on small systems, such as notify_on_release, could also be
    stubbed out and always disabled, simply returning error if requested
    to be enabled from user space. The recent, chunky piece of code
    needed to compute dynamic sched domains from the cpuset hierarchy
    probably admits of a simpler variant in the tiny cpuset configuration.

    I suppose it would still be a vfs-based pseudo file system (even
    embedded Linux still has that infrastructure), except that the vfs
    operator functions could be simpler, as this would really be just
    a flat set of cpumask_t's and nodemask_t's at the core of the
    implementation, not an arbitrarily nested hierarchy of them. See
    further my comments on cgroups, below.
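
    To make that concrete, a minimal sketch of what such a flat core
    might look like -- every name here is hypothetical, not taken from
    any existing patch:

        #define TINY_CPUSETS_MAX 8  /* hypothetical fixed upper limit */

        struct tiny_cpuset {
            char        name[16];       /* /dev/cpuset/<name> */
            cpumask_t   cpus_allowed;   /* CPUs this set may use */
            nodemask_t  mems_allowed;   /* memory nodes this set may use */
            bool        in_use;         /* slot allocated? */
        };

        /* one flat, fixed-size table guarded by a single lock -- no
         * hierarchy, no refcounting, no dynamic allocation */
        static struct tiny_cpuset tiny_cpusets[TINY_CPUSETS_MAX];
        static DEFINE_SPINLOCK(tiny_cpuset_lock);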

    The rest of the kernel would see no difference ... except that some
    of the cpuset_*() hooks would return more quickly. This tiny cpuset
    option would provide the same kernel hooks as are now provided by
    the defines and inline stubs, in the "#else" to "#endif" half of the
    "#ifdef CONFIG_CPUSETS" code lines in linux/cpuset.h.

    User space would see the same API, except that some valid operations
    on full cpusets, such as a nested mkdir, would fail on tiny cpusets.

    How this extends to cgroups I don't know; for now I suspect that most
    cgroup module development is motivated by the needs of larger systems,
    not smaller systems. However, cpusets is now a module client of
    cgroups, and it is cgroups that now provides cpusets with its interface
    to the vfs infrastructure. It would seem unfortunate if this relation
    was not continued with tiny cpusets. Perhaps someone can imagine a tiny
    cgroups? This might be the most difficult part of this proposal.

    Looking at some IA64 sn2 config builds I have laying about, I see the
    following text sizes for a couple of versions, showing the growth of
    the cpuset/cgroup apparatus over time:

    25933 2.6.18-rc3-mm1/kernel/cpuset.o (Aug 2006)
    vs.
    37823 2.6.25-rc2-mm1/kernel/cgroup.o (Feb 2008)
    19558 2.6.25-rc2-mm1/kernel/cpuset.o

    So the total has grown from 25933 to 57381 text bytes (note that
    this is IA64 arch; most arch's will have proportionately smaller
    text sizes.)

    Unfortunately, ideas without code are usually met with the sound of
    silence, as well they should be. Furthermore, I can promise that I
    have no time to design or develop this myself; my good employer is
    quite focused on the other end of things - the big honkin NUMA and
    cluster systems.


    --
    I won't rest till it's the best ...
    Programmer, Linux Scalability
    Paul Jackson 1.940.382.4214

  2. Re: Tiny cpusets -- cpusets for small systems?

    On Sat, Feb 23, 2008 at 4:09 AM, Paul Jackson wrote:
    > A couple of proposals have been made recently by people working Linux
    > on smaller systems, for improving realtime isolation and memory
    > pressure handling:
    >
    > (1) cpu isolation for hard(er) realtime
    > http://lkml.org/lkml/2008/2/21/517
    > Max Krasnyanskiy
    > [PATCH sched-devel 0/7] CPU isolation extensions
    >
    > (2) notify user space of tight memory
    > http://lkml.org/lkml/2008/2/9/144
    > KOSAKI Motohiro
    > [PATCH 0/8][for -mm] mem_notify v6
    >
    > In both cases, some of us have responded "why not use cpusets", and the
    > original submitters have replied "cpusets are too fat" (well, they
    > were more diplomatic than that, but I guess I can say that of my own code ;).


    Having read those threads, it looks to me as though:

    - the parts of Max's problem that would be solved by cpusets can be
    mostly accomplished just via sched_setaffinity() (see the sketch below)

    - Motohiro wants to add a new system-wide API that you would also like
    to have available on a per-cpuset basis. (Why not just add two access
    points for the same feature?)
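
    To make the sched_setaffinity() point concrete, a minimal userspace
    sketch (roughly what "taskset" does under the hood):

        #define _GNU_SOURCE
        #include <sched.h>
        #include <stdio.h>

        int main(void)
        {
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(1, &set);   /* confine this task to CPU 1 */
            if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                perror("sched_setaffinity");
                return 1;
            }
            /* ... the (soft-)RT work now runs only on CPU 1 ... */
            return 0;
        }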

    I don't think that either of these would be enough to justify big
    changes to cpusets or cgroups, although eliminating bloat is always a
    good thing.

    > The primary semantic limit I'd suggest would be supporting exactly
    > one layer depth of cpusets, not a full hierarchy. So one could still
    > successfully issue from user space 'mkdir /dev/cpuset/foo', but trying
    > to do 'mkdir /dev/cpuset/foo/bar' would fail. This reminds me of
    > very early FAT file systems, which had just a single, fixed size
    > root directory. There might even be a configurable fixed upper
    > limit on how many /dev/cpuset/* directories were allowed, further
    > simplifying the locking and dynamic memory behavior of this apparatus.


    I'm not sure that either of these would make much difference to the
    overall footprint.

    A single layer of cpusets would allow you to simplify
    validate_change() but not much else.
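
    For illustration, with a single level the check might collapse to
    little more than this (hypothetical names, building on the flat
    table sketched in the original post):

        /* all that's left to validate with one level below the root */
        static int tiny_validate_change(const struct tiny_cpuset *trial)
        {
            /* a set may only claim CPUs that are actually online */
            if (!cpus_subset(trial->cpus_allowed, cpu_online_map))
                return -EINVAL;
            return 0;
        }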

    I don't see how a fixed upper limit on the number of cpusets makes the
    locking sufficiently simpler to save much code.

    >
    > How this extends to cgroups I don't know; for now I suspect that most
    > cgroup module development is motivated by the needs of larger systems,
    > not smaller systems. However, cpusets is now a module client of
    > cgroups, and it is cgroups that now provides cpusets with its interface
    > to the vfs infrastructure. It would seem unfortunate if this relation
    > was not continued with tiny cpusets. Perhaps someone can imagine a tiny
    > cgroups? This might be the most difficult part of this proposal.


    If we wanted to go this way, I can imagine a cgroups config option
    that forces just a single hierarchy, which would allow a bunch of
    simplifications that would save plenty of text.

    >
    > Looking at some IA64 sn2 config builds I have laying about, I see the
    > following text sizes for a couple of versions, showing the growth of
    > the cpuset/cgroup apparatus over time:
    >
    > 25933 2.6.18-rc3-mm1/kernel/cpuset.o (Aug 2006)
    > vs.
    > 37823 2.6.25-rc2-mm1/kernel/cgroup.o (Feb 2008)
    > 19558 2.6.25-rc2-mm1/kernel/cpuset.o
    >
    > So the total has grown from 25933 to 57381 text bytes (note that
    > this is IA64 arch; most arch's will have proportionately smaller
    > text sizes.)


    On x86_64 they're:

    cgroup.o: 17348
    cpuset.o: 8533

    Paul

  3. Re: Tiny cpusets -- cpusets for small systems?

    Paul M wrote:
    > I don't think that either of these would be enough to justify big
    > changes to cpusets or cgroups, although eliminating bloat is always a
    > good thing.


    My "tiny cpuset" idea doesn't so much eliminate bloat, as provide a
    thin alternative, alongside the existing fat alternative. So
    far as kernel source goes, it would get bigger, not smaller, with now
    two CONFIG choices for cpusets, fat or tiny.

    The odds are, however, given that one of us has just promised not to
    code this, and the other of us doesn't figure it's worth it, this
    idea will not live long. Someone would have to step up from the
    embedded side with a coded version that saved a nice chunk of memory
    (from their perspective) to get this off the ground, and no telling
    whether even that would meet with a warm reception.

    --
    I won't rest till it's the best ...
    Programmer, Linux Scalability
    Paul Jackson 1.940.382.4214

  4. Re: Tiny cpusets -- cpusets for small systems?

    Hi Paul,

    > A couple of proposals have been made recently by people working Linux
    > on smaller systems, for improving realtime isolation and memory
    > pressure handling:
    >
    > (1) cpu isolation for hard(er) realtime
    > http://lkml.org/lkml/2008/2/21/517
    > Max Krasnyanskiy
    > [PATCH sched-devel 0/7] CPU isolation extensions
    >
    > (2) notify user space of tight memory
    > http://lkml.org/lkml/2008/2/9/144
    > KOSAKI Motohiro
    > [PATCH 0/8][for -mm] mem_notify v6
    >
    > In both cases, some of us have responded "why not use cpusets", and the
    > original submitters have replied "cpusets are too fat" (well, they
    > were more diplomatic than that, but I guess I can say that of my own code ;).


    My primary issue with cpusets (from the CPU isolation perspective, that is) was
    not the fatness. I did make a couple of comments like "On a dual-cpu box
    I do not need cpusets to manage the CPUs", but that's not directly related
    to CPU isolation.
    For CPU isolation in particular I need code like this:

    int select_irq_affinity(unsigned int irq)
    {
            cpumask_t usable_cpus;

            /* steer the irq to every online CPU that is not isolated */
            cpus_andnot(usable_cpus, cpu_online_map, cpu_isolated_map);
            irq_desc[irq].affinity = usable_cpus;
            irq_desc[irq].chip->set_affinity(irq, usable_cpus);
            return 0;
    }

    How would you implement that with cpusets?
    I haven't seen your patches, but I'd imagine that they will still need locks and
    iterators for the "Is CPU N isolated" functionality.

    So: I see cpusets as a higher-level API/mechanism and cpu_isolated_map as a lower-
    level mechanism that actually makes the kernel aware of what's isolated and what's
    not. Kind of like the sched domain/cpuset relationship, i.e. cpusets affect sched
    domains but the scheduler does not use cpusets directly.

    > I wonder if there might be room for a "tiny cpusets" configuration option:
    > * provide the same hooks to the rest of the kernel, and
    > * provide the same syntactic interface to user space, but
    > * with more limited semantics.
    >
    > The primary semantic limit I'd suggest would be supporting exactly
    > one layer depth of cpusets, not a full hierarchy. So one could still
    > successfully issue from user space 'mkdir /dev/cpuset/foo', but trying
    > to do 'mkdir /dev/cpuset/foo/bar' would fail. This reminds me of
    > very early FAT file systems, which had just a single, fixed size
    > root directory . There might even be a configurable fixed upper
    > limit on how many /dev/cpuset/* directories were allowed, further
    > simplifying the locking and dynamic memory behavior of this apparatus.

    In the foreseeable future, 2-8 cores will be the most common configuration.
    Do you think that cpusets are needed/useful for those machines?
    The reason I'm asking is because given the restrictions you mentioned
    above it seems that you might as well just do
    taskset -c 1,2,3 app1
    taskset -c 3,4,5 app2
    Yes, it's not quite the same of course, but imo it covers most cases. That's what
    we do on 2-4 cores these days, and we are quite happy with that. i.e. We either
    let the specialized apps manage their thread affinities themselves or use
    "taskset" to manage the apps.

    > User space would see the same API, except that some valid operations
    > on full cpusets, such as a nested mkdir, would fail on tiny cpusets.

    Speaking of the user-space API: I guess it's not directly related to the
    tiny-cpusets proposal but rather to cpusets in general.
    Stuff that I'm working on these days (wireless basestations) is designed with the
    following model:
    cpuN - runs soft-RT networking and management code
    cpuN+1 to cpuN+x - are used as dedicated engines
    i.e. the simplest example would be
    cpu0 - runs IP, L2 and control plane
    cpu1 - runs hard-RT MAC

    So if CPU isolation is implemented on top of cpusets, what kind of API do
    you envision for such an app? I mean, currently cpusets seem to be mostly dealing
    with entire processes, whereas in this case we're really dealing with threads.
    i.e. Different threads of the same process require different policies; some must
    run on isolated cpus, some must not. I guess one could write a thread's pid into
    the cpuset fs, but that's not very convenient. pthread_setaffinity_np() is exactly
    what's needed.
    Personally I do not see much use for cpusets in those kinds of designs. But maybe
    I'm missing something. I got really excited when cpusets were first merged into
    mainline, but after looking closer I could not really find a use for them, at least
    not for our apps.
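
    (For illustration, the pthread_setaffinity_np() route mentioned above
    looks like this minimal sketch; the thread and function names are
    made up:)

        #define _GNU_SOURCE
        #include <pthread.h>

        static void *mac_thread(void *arg)
        {
            /* hard-RT MAC processing, pinned to cpu1 by the spawner */
            return NULL;
        }

        static int spawn_mac_thread(void)
        {
            pthread_t tid;
            cpu_set_t set;
            int err;

            err = pthread_create(&tid, NULL, mac_thread, NULL);
            if (err)
                return err;

            CPU_ZERO(&set);
            CPU_SET(1, &set);   /* cpu1: the dedicated engine */
            return pthread_setaffinity_np(tid, sizeof(set), &set);
        }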

    Max

  5. Re: Tiny cpusets -- cpusets for small systems?

    Hi Paul,

    > Looking at some IA64 sn2 config builds I have laying about, I see the
    > following text sizes for a couple of versions, showing the growth of
    > the cpuset/cgroup apparatus over time:
    >
    > 25933 2.6.18-rc3-mm1/kernel/cpuset.o (Aug 2006)
    > vs.
    > 37823 2.6.25-rc2-mm1/kernel/cgroup.o (Feb 2008)
    > 19558 2.6.25-rc2-mm1/kernel/cpuset.o
    >
    > So the total has grown from 25933 to 57381 text bytes (note that
    > this is IA64 arch; most arch's will have proportionately smaller
    > text sizes.)


    hm, interesting.
    But unfortunately cpusets have more dependencies than that (i.e. CONFIG_SMP).

    Worse, some embedded CPUs have poor or no atomic instruction support;
    for those, turning on CONFIG_SMP becomes a large performance regression.

    I am not an embedded engineer any more, so I might be mistaken.
    (BTW: I am a large-server engineer now.)

    But ignoring that dependency is probably wrong.

    Pavel, what do you think?

    - kosaki



  6. Re: Tiny cpusets -- cpusets for small systems?

    > So: I see cpusets as a higher-level API/mechanism and cpu_isolated_map as a lower-
    > level mechanism that actually makes the kernel aware of what's isolated and what's
    > not. Kind of like the sched domain/cpuset relationship, i.e. cpusets affect sched
    > domains but the scheduler does not use cpusets directly.


    One could use cpusets to control the setting of cpu_isolated_map,
    separate from the code such as your select_irq_affinity() that
    uses it.
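
    (A hypothetical sketch of that control point, using 2.6.25-era
    cpumask operators -- nothing like this exists in mainline:)

        /* called when a cpuset's "isolated" flag changes, folding its
         * CPUs into or out of the system-wide cpu_isolated_map */
        static void update_cpu_isolated_map(struct cpuset *cs, int isolate)
        {
            if (isolate)
                cpus_or(cpu_isolated_map, cpu_isolated_map,
                        cs->cpus_allowed);
            else
                cpus_andnot(cpu_isolated_map, cpu_isolated_map,
                            cs->cpus_allowed);
            /* consumers such as select_irq_affinity() see the update */
        }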


    > In the foreseeable future, 2-8 cores will be the most common configuration.
    > Do you think that cpusets are needed/useful for those machines?
    > The reason I'm asking is because given the restrictions you mentioned
    > above it seems that you might as well just do
    > taskset -c 1,2,3 app1
    > taskset -c 3,4,5 app2


    People tend to manage the CPU and memory placement of the threads
    and processes within a single co-operating job using taskset
    (sched_setaffinity) and numactl (mbind, set_mempolicy.)

    They tend to manage the placement of multiple unrelated jobs onto
    a single system, whether on separate or shared CPUs and nodes,
    using cpusets.

    Something like cpu_isolated_map looks to me like a system-wide
    mechanism, which should, like sched_domains, be managed system-wide.
    Managing it with a mechanism that encourages each thread to update
    it directly, as if that thread owned the system, will break down,
    resulting in conflicting updates, as multiple, insufficiently
    co-operating threads issue conflicting settings.


    > Stuff that I'm working on these days (wireless basestations) is designed
    > with the following model:
    > cpuN - runs soft-RT networking and management code
    > cpuN+1 to cpuN+x - are used as dedicated engines
    > i.e. the simplest example would be
    > cpu0 - runs IP, L2 and control plane
    > cpu1 - runs hard-RT MAC
    >
    > So if CPU isolation is implemented on top of cpusets, what kind of API do
    > you envision for such an app?


    That depends on what more API is needed. Do we need to place
    irqs better ... cpusets might not be a natural for that use.
    Aren't irqs directed to specific CPUs, not to hierarchically
    nested subsets of CPUs?

    Separate question:
    Is it desired that the dedicated CPUs cpuN+1 ... cpuN+x even appear
    as general purpose systems running a Linux kernel in your systems?
    These dedicated engines seem more like intelligent devices to me,
    such as disk controllers, which the kernel controls via device
    drivers, not by loading itself on them too.

    --
    I won't rest till it's the best ...
    Programmer, Linux Scalability
    Paul Jackson 1.940.382.4214

  7. Re: Tiny cpusets -- cpusets for small systems?

    Paul Jackson wrote:
    >> So: I see cpusets as a higher-level API/mechanism and cpu_isolated_map as a lower-
    >> level mechanism that actually makes the kernel aware of what's isolated and what's
    >> not. Kind of like the sched domain/cpuset relationship, i.e. cpusets affect sched
    >> domains but the scheduler does not use cpusets directly.

    >
    > One could use cpusets to control the setting of cpu_isolated_map,
    > separate from the code such as your select_irq_affinity() that
    > uses it.

    Yes. That's what I proposed too, in one of the CPU isolation threads with
    Peter. The only issue is that you need to simulate a CPU_DOWN hotplug event in
    order to clean up what's already running on those CPUs.

    >> In the foreseeable future, 2-8 cores will be the most common configuration.
    >> Do you think that cpusets are needed/useful for those machines?
    >> The reason I'm asking is because given the restrictions you mentioned
    >> above it seems that you might as well just do
    >> taskset -c 1,2,3 app1
    >> taskset -c 3,4,5 app2

    >
    > People tend to manage the CPU and memory placement of the threads
    > and processes within a single co-operating job using taskset
    > (sched_setaffinity) and numactl (mbind, set_mempolicy.)
    >
    > They tend to manage the placement of multiple unrelated jobs onto
    > a single system, whether on separate or shared CPUs and nodes,
    > using cpusets.
    >
    > Something like cpu_isolated_map looks to me like a system-wide
    > mechanism, which should, like sched_domains, be managed system-wide.
    > Managing it with a mechanism that encourages each thread to update
    > it directly, as if that thread owned the system, will break down,
    > resulting in conflicting updates, as multiple, insufficiently
    > co-operating threads issue conflicting settings.

    I'm not sure how to interpret that. I think you might have mixed a couple of
    things I asked about in one reply ;-).
    The question was, given the restrictions you talked about when you
    explained the tiny-cpusets functionality, how much does one gain from using
    them compared to taskset/numactl? i.e. On machines with 2-8 cores it's
    fairly easy to manage cpus with simple affinity masks.

    The second part of your reply seems to imply that I somehow made you think
    that I suggested that cpu_isolated_map is managed per thread. That is of
    course not the case. It's definitely a system-wide mechanism and individual
    threads have nothing to do with it.
    btw I just re-read my prev reply. I definitely did not say anything about
    threads managing cpu_isolated_map.

    >> Stuff that I'm working on these days (wireless basestations) is designed
    >> with the following model:
    >> cpuN - runs soft-RT networking and management code
    >> cpuN+1 to cpuN+x - are used as dedicated engines
    >> i.e. the simplest example would be
    >> cpu0 - runs IP, L2 and control plane
    >> cpu1 - runs hard-RT MAC
    >>
    >> So if CPU isolation is implemented on top of cpusets, what kind of API do
    >> you envision for such an app?

    >
    > That depends on what more API is needed. Do we need to place
    > irqs better ... cpusets might not be a natural for that use.
    > Aren't irqs directed to specific CPUs, not to hierarchically
    > nested subsets of CPUs.


    You clipped the part where I elaborated. Which was:
    >> So if CPU isolation is implemented on top of cpusets, what kind of API do
    >> you envision for such an app? I mean, currently cpusets seem to be mostly dealing
    >> with entire processes, whereas in this case we're really dealing with threads.
    >> i.e. Different threads of the same process require different policies; some must
    >> run on isolated cpus, some must not. I guess one could write a thread's pid into
    >> the cpuset fs, but that's not very convenient. pthread_setaffinity_np() is exactly
    >> what's needed.

    In other words, how would an app place its individual threads into the
    different cpusets?
    The IRQ stuff is separate; like we said above, cpusets could simply update
    cpu_isolated_map, which would take care of IRQs. I was talking specifically
    about the thread management.

    > Separate question:
    > Is it desired that the dedicated CPUs cpuN+1 ... cpuN+x even appear
    > as general purpose systems running a Linux kernel in your systems?
    > These dedicated engines seem more like intelligent devices to me,
    > such as disk controllers, which the kernel controls via device
    > drivers, not by loading itself on them too.

    We still want to be able to run normal threads on them, which means IPIs,
    memory management, etc. are still needed. So yes, they had better show up as
    normal CPUs.
    Also, with dynamic isolation you can, for example, un-isolate a cpu when you're
    compiling stuff on the machine and then isolate it when you're running special
    app(s).

    Max

  8. Re: Tiny cpusets -- cpusets for small systems?

    On Sat, Feb 23, 2008 at 09:57:52AM -0600, Paul Jackson wrote:
    > Paul M wrote:
    > > I don't think that either of these would be enough to justify big
    > > changes to cpusets or cgroups, although eliminating bloat is always a
    > > good thing.

    >
    > My "tiny cpuset" idea doesn't so much eliminate bloat, as provide a
    > thin alternative, alongside the existing fat alternative. So
    > far as kernel source goes, it would get bigger, not smaller, with now
    > two CONFIG choices for cpusets, fat or tiny.
    >
    > The odds are, however, given that one of us has just promised not to
    > code this, and the other of us doesn't figure it's worth it, this
    > idea will not live long. Someone would have to step up from the
    > embedded side with a coded version that saved a nice chunk of memory
    > (from their perspective) to get this off the ground, and no telling
    > whether even that would meet with a warm reception.
    >

    This has actually been on my TODO list for a while, though not quite in
    the way that you outlined in your initial post. Cpusets are fat at the
    moment largely because there's no isolation or configurability of the
    individual features supported and exposed, leaving one with an
    all-or-nothing scenario.

    Both the SMP and NUMA bits are fairly orthogonal, and I think isolating
    those and making each configurable would already help trim things down
    quite a bit (i.e. nodemask handling, scheduler domains, etc.). The
    filesystem bits are not really that heavy comparatively, so rather than
    working on a tiny cpuset implementation, simply splitting up the existing
    implementation seems like a much saner approach, and one that can be done
    incrementally.
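
    (As a hypothetical illustration of that split, with invented config
    symbols -- the declarations mirror the existing stub pattern in
    linux/cpuset.h:)

        /* hypothetical: separate symbols for the SMP and NUMA halves */
        #ifdef CONFIG_CPUSETS_SCHED
        extern void rebuild_sched_domains(void);
        #else
        static inline void rebuild_sched_domains(void) {}
        #endif

        #ifdef CONFIG_CPUSETS_NUMA
        extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
        #else
        static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
        {
            return node_possible_map;
        }
        #endif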

    While we're on the topic of cpuset reform, co-processors are another
    thing that cpusets is in pretty good shape to handle (particularly in
    terms of carving up large grid and dataflow processors and things of that
    nature, which we do have customer use cases for in embedded space today).

    I'll try to follow up to this thread with an initial patch series in the
    not-too-distant future.
