--rS8CxjVDS/+yyDmU
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Fri, Feb 22, 2008 at 01:52:54PM -1000, Jeff Roberson wrote:
> On Fri, 22 Feb 2008, Brooks Davis wrote:
>
>> On Fri, Feb 22, 2008 at 12:34:13PM -1000, Jeff Roberson wrote:
>>>
>>> On Thu, 21 Feb 2008, Robert Watson wrote:
>>>
>>>> On Wed, 20 Feb 2008, Jeff Roberson wrote:


>>>> - It would be nice to be able to use CPU sets in jail as well,
>>>> suggesting a hierarchical model with some sort of tagging so you know
>>>> what CPU sets were created in a jail such that you know whether they
>>>> can be changed in a jail. While I recognize this makes things a lot
>>>> more tricky, I think we should basically be planning more carefully
>>>> with respect to virtualization when we add new interfaces, since it's
>>>> a widely used feature, and the current set of "stragglers" unsupported
>>>> in Jail is growing rather than shrinking.
>>>
>>> I have implemented a hierarchical model. Each thread has a pointer to
>>> the cpuset that it's in. If it makes a local modification via
>>> setaffinity() it gets an anonymous cpuset that is a child of the set
>>> assigned to the process. This anonymous set will also be inherited
>>> across fork/thread creation.
>>>
>>> In this model presently there are nodes marked as root. To query the
>>> 'system' cpus available we walk up from the current node until we find a
>>> root. These are the 'system' set. A thread may not break out of its
>>> system set. A process may join the root set but it may not modify a root
>>> that is a parent. Jails would create a new root. A process outside of
>>> the jail can modify the set of processors in the jail but a process
>>> within the jail/root may not.
>>>
>>> The next level down from the root is the assigned set. The root may be
>>> an assigned set or this may be a subset of the root. Processes may create
>>> sets which are parented back to their root and may include any
>>> processors within their root. The mask of the assigned set is returned
>>> as 'available' processors.
>>>
>>> This gives a 1 to 3 level hierarchy: the root, an assigned set, and an
>>> anonymous set. Any of these but the root may be omitted. There is no
>>> current way for userland to create subsets of assigned sets to permit
>>> further nesting. I'm not sure I see value in it right now and it gives
>>> the possibility of unbounded tree depth.
>>>
>>> Anonymous sets are immutable as they are shared and changes only apply to
>>> the thread/pid in the WHICH argument and not others which have inherited
>>> from it. Anonymous sets have no id and may not be specifically
>>> manipulated via a setid. You must refer to the process/thread. From the
>>> administration point of view they don't exist.
>>>
>>> When a set is modified we walk down the children recursively and apply
>>> the new mask. This is done with a global set lock under which all
>>> modifications and tree operations are performed. The td_cpuset pointer
>>> is protected under the thread_lock() and may read the set without a lock.
>>> This gives the possibility for certain kinds of races but I believe they
>>> are all safe.
>>>
>>> Hopefully I explained that well enough for people to follow. I realize
>>> it's a lot of text but it's fairly simple bookkeeping code. This is all
>>> implemented and I'm debugging now.

>>
>> One place I'd like to implement CPU affinity is in the Sun Grid Engine
>> execution daemon. I think anonymous sets would not be sufficient there
>> because the model allows new tasks to be started on a particular node at
>> any time during a parallel job. I'd have to do some more digging in the
>> code to be entirely certain. I think the fewer limits we place on the
>> hierarchy, the better off we'll be unless there are compelling complexity
>> reasons to avoid them.

>
> With the anonymous set you can bind any thread to any cpu that is visible
> to it. How would this not work?


I'm still trying to wrap my head around the anonymous sets. Is the idea
that once you are in an anonymous set, you can't expand it, or can you
expand out as far as the assigned set? I'd like for parallel jobs to
be allocated a set of cpus that they can't change, but still be able
to make their own decisions about thread affinity if they desire (for
example OpenMPI has some support for this so processes stay put and in
theory benefit from positive cache effects). If that's feasible in
this model, I'm happy with it. I think we should keep in mind that these
SGE execution daemons might be sitting inside jails. ;-)
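
One reading of the model, as a minimal sketch: a job is handed an
assigned set it cannot widen, and each thread may pick any non-empty
subset of that assigned set, including widening back out from an earlier
anonymous set. This is a hypothetical userland model for discussion, not
the kernel code. The names cpumask_t, struct cpuset_node,
cpuset_find_root(), cpuset_find_assigned() and thread_setaffinity() are
made up here, and the 64-CPU bitmask is an assumption for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model only: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpumask_t;

struct cpuset_node {
	struct cpuset_node *parent;  /* NULL only for the topmost root */
	cpumask_t mask;              /* CPUs this set may run on */
	bool is_root;                /* the system set, or a jail's set */
	bool is_anonymous;           /* made by a per-thread affinity call */
};

/* Walk up from a node to the nearest root: the 'system' set. */
static const struct cpuset_node *
cpuset_find_root(const struct cpuset_node *set)
{
	assert(set != NULL);
	while (!set->is_root && set->parent != NULL)
		set = set->parent;
	return (set);
}

/* Nearest non-anonymous set, starting at the node itself. */
static struct cpuset_node *
cpuset_find_assigned(struct cpuset_node *set)
{
	assert(set != NULL);
	while (set->is_anonymous && set->parent != NULL)
		set = set->parent;
	return (set);
}

/*
 * Model of a thread-local affinity change. On success, fills in a
 * fresh anonymous child of the assigned set; the old anonymous set
 * (if any) is left untouched, since it may be shared. Fails if the
 * request is empty or names a CPU outside the assigned set.
 */
static bool
thread_setaffinity(struct cpuset_node *current, cpumask_t want,
    struct cpuset_node *anon_out)
{
	struct cpuset_node *assigned = cpuset_find_assigned(current);

	if (want == 0 || (want & ~assigned->mask) != 0)
		return (false);
	anon_out->parent = assigned;
	anon_out->mask = want;
	anon_out->is_root = false;
	anon_out->is_anonymous = true;
	return (true);
}
```

With an 8-cpu root and a job assigned cpus 0-3, a thread in this sketch
can pin itself to cpu 0 and later move to cpus 1-3, but a request for
cpu 4 is refused. Whether the real implementation checks against the
assigned set or against the thread's current anonymous set is exactly
the open question above.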

>>>> - There's still no way to specify an affinity policy rather than
>>>> explicit affinity, but if our CPU set model is sufficiently general,
>>>> that might be a vehicle to do that. I.e., cpuset_setpolicy() rather
>>>> than setting a mask.
>>>
>>> Yes, I think this is orthogonal and can be addressed separately. I'm not
>>> sure how many userland programs are smart enough or even capable of
>>> making determinations about their cache behavior however. We should open
>>> another discussion once this one is done.
>>>
>>>>
>>>> - In the interests of boring API changes, recent APIs tend to prefix the
>>>> method on the object name. Have you thought about cpuset_create(),
>>>> cpuset_foo(), etc? That reduces the chances of interfering with
>>>> application namespaces. I think, anyway. :-).
>>>
>>> Yes, I prefer that as well, as I mentioned syscalls tended to favor
>>> brevity. I'm fine with changing that trend.
>>>
>>>>
>>>> I need to ponder the proposal a little more, ideally over a hot beverage
>>>> this morning, and will follow up if I have further thoughts. Thanks for
>>>> working on this, BTW -- affinity is well-overdue for FreeBSD.
>>>
>>> A little more to ponder now! Your feedback is much appreciated.
>>>
>>> I believe the present hierarchical model satisfies the jail requirements
>>> of restricting cpus in the jail while still allowing the jail to create
>>> sets.
>>>
>>> The unanswered questions are:
>>>
>>> 1) What to do about sets that strand threads, options described above.
>>> 2) Are people ok with the transient nature of sets?
>>> 3) Does anyone want to help with man pages, administrative tools, etc?
>>> I have a prototype tool called 'cpuset' that fully exercises the api but
>>> is probably ugly. Will post details soon.

>>
>> I could help with some of this as it furthers a funded project at work.

>
> I will provide patches soon. It would be great to have a developer with a
> user's perspective to look at some of the details and especially the
> administration side of things. I think someone else has offered to help
> with man pages but I need to double check.


Cool. If you can get some basics out by late Sunday afternoon (CST) I
should be able to look at it and think about it on the plane Monday.

-- Brooks

--rS8CxjVDS/+yyDmU
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (FreeBSD)

iD8DBQFHwHa/XY6L6fI4GtQRAqmjAJ48y/n2UVTEOA723K6tYv1RtK112gCfSvYK
aArGS4pjj474J94hq+iskLA=
=CIob
-----END PGP SIGNATURE-----

--rS8CxjVDS/+yyDmU--