--1sNVjLsmu1MXqwQ/
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sat, Feb 23, 2008 at 11:21:33AM -1000, Jeff Roberson wrote:
>
> On Sat, 23 Feb 2008, Brooks Davis wrote:
>
>> On Fri, Feb 22, 2008 at 01:52:54PM -1000, Jeff Roberson wrote:
>>> On Fri, 22 Feb 2008, Brooks Davis wrote:
>>>
>>>> On Fri, Feb 22, 2008 at 12:34:13PM -1000, Jeff Roberson wrote:
>>>>>
>>>>> On Thu, 21 Feb 2008, Robert Watson wrote:
>>>>>
>>>>>> On Wed, 20 Feb 2008, Jeff Roberson wrote:

>>
>>>>>> - It would be nice to be able to use CPU sets in jail as well,
>>>>>> suggesting a hierarchical model with some sort of tagging so you
>>>>>> know what CPU sets were created in a jail such that you know
>>>>>> whether they can be changed in a jail. While I recognize this
>>>>>> makes things a lot more tricky, I think we should basically be
>>>>>> planning more carefully with respect to virtualization when we
>>>>>> add new interfaces, since it's a widely used feature, and the
>>>>>> current set of "stragglers" unsupported in Jail is growing rather
>>>>>> than shrinking.
>>>>>
>>>>> I have implemented a hierarchical model. Each thread has a pointer
>>>>> to the cpuset that it's in. If it makes a local modification via
>>>>> setaffinity() it gets an anonymous cpuset that is a child of the
>>>>> set assigned to the process. This anonymous set will also be
>>>>> inherited across fork/thread creation.
>>>>>
>>>>> In this model presently there are nodes marked as root. To query
>>>>> the 'system' cpus available we walk up from the current node until
>>>>> we find a root. These are the 'system' set. A thread may not break
>>>>> out of its system set. A process may join the root set but it may
>>>>> not modify a root that is a parent. Jails would create a new root.
>>>>> A process outside of the jail can modify the set of processors in
>>>>> the jail but a process within the jail/root may not.
>>>>>
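
A minimal sketch of the walk-up described above, written in Python rather
than kernel C. The CpuSet class, its is_root flag, and system_set() are
hypothetical names used only for illustration; they are not taken from
the actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CpuSet:
    mask: frozenset                      # CPU ids contained in this set
    parent: Optional["CpuSet"] = None    # None only for the topmost root
    is_root: bool = False                # hypothetical marker for a root node


def system_set(node: CpuSet) -> CpuSet:
    """Walk up from node until a root is found; that root is the 'system' set."""
    while not node.is_root:
        if node.parent is None:
            raise ValueError("no root above this set")
        node = node.parent
    return node


# A host root, a jail root carved out of it, and an assigned set in the jail.
host = CpuSet(frozenset(range(8)), is_root=True)
jail = CpuSet(frozenset({0, 1, 2, 3}), parent=host, is_root=True)
assigned = CpuSet(frozenset({0, 1}), parent=jail)
```

With these objects, system_set(assigned) stops at the jail root instead of
the host root, which is what would keep a jailed process from seeing CPUs
outside its jail.
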
>>>>> The next level down from the root is the assigned set. The root may
>>>>> be an assigned set or this may be a subset of the root. Processes
>>>>> may create sets which are parented back to their root and may
>>>>> include any processors within their root. The mask of the assigned
>>>>> set is returned as 'available' processors.
>>>>>
>>>>> This gives a 1 to 3 level hierarchy. The root, an assigned set, and
>>>>> an anonymous set. Any of these but the root may be omitted. There is
>>>>> no current way for userland to create subsets of assigned sets to
>>>>> permit further nesting. I'm not sure I see value in it right now and
>>>>> it gives the possibility of unbounded tree depth.
>>>>>
>>>>> Anonymous sets are immutable as they are shared and changes only
>>>>> apply to the thread/pid in the WHICH argument and not others which
>>>>> have inherited from it. Anonymous sets have no id and may not be
>>>>> specifically manipulated via a setid. You must refer to the
>>>>> process/thread. From the administration point of view they don't
>>>>> exist.
>>>>>
>>>>> When a set is modified we walk down the children recursively and
>>>>> apply the new mask. This is done with a global set lock under which
>>>>> all modifications and tree operations are performed. The td_cpuset
>>>>> pointer is protected under the thread_lock() and may read the set
>>>>> without a lock. This gives the possibility for certain kinds of
>>>>> races but I believe they are all safe.
>>>>>
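
The recursive update under a single lock might look roughly like the
following Python sketch. It assumes that "apply the new mask" means
intersecting the new mask into every descendant, which the text above does
not spell out. The names CpuSet, modify() and _set_lock are hypothetical,
and the sketch does not handle a child whose mask would become empty.

```python
import threading

# Stand-in for the single global set lock mentioned above.
_set_lock = threading.Lock()


class CpuSet:
    def __init__(self, mask, parent=None):
        self.mask = set(mask)
        self.children = []               # derivative sets hanging off this one
        if parent is not None:
            parent.children.append(self)


def _narrow(node, mask):
    # Assumption: descendants keep only the CPUs still present in the parent.
    node.mask &= mask
    for child in node.children:
        _narrow(child, node.mask)


def modify(node, new_mask):
    """Replace node's mask, then push the change down to every descendant."""
    with _set_lock:
        node.mask = set(new_mask)
        for child in node.children:
            _narrow(child, node.mask)
```

For example, if an assigned set {0, 1, 2, 3} has an anonymous child {2, 3}
and the assigned set is changed to {0, 1, 2}, the child ends up as {2}.
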
>>>>> Hopefully I explained that well enough for people to follow. I
>>>>> realize it's a lot of text but it's fairly simple bookkeeping code.
>>>>> This is all implemented and I'm debugging now.
>>>>
>>>> One place I'd like to implement CPU affinity is in the Sun Grid Engine
>>>> execution daemon. I think anonymous sets would not be sufficient there
>>>> because the model allows new tasks to be started on a particular node
>>>> at any time during a parallel job. I'd have to do some more digging in
>>>> the code to be entirely certain. I think the fewer limits we place on
>>>> the hierarchy, the better off we'll be unless there are compelling
>>>> complexity reasons to avoid them.
>>>
>>> With the anonymous set you can bind any thread to any cpu that is
>>> visible to it. How would this not work?

>>
>> I'm still trying to wrap my head around the anonymous sets. Is the idea
>> that once you are in an anonymous set, you can't expand it, or can you
>> expand out as far as the assigned set? I'd like for parallel jobs to
>> be allocated a set of cpus that they can't change, but still be able
>> to make their own decisions about thread affinity if they desire (for
>> example OpenMPI has some support for this so processes stay put and in
>> theory benefit from positive cache effects). If that's feasible in
>> this model, I'm happy with it. I think we should keep in mind that these
>> SGE execution daemons might be sitting inside jails. ;-)

>
> Ah, when I said the anonymous sets were immutable, that only means that
> they are copy-on-write. Because you can't know who shares a copy via
> fork or thread creation you must make a new set each time you write.
>
> I made the anonymous sets so that the parent would have a list of all
> derivative children sets so that modifications to the parent would be
> reflected in the child. This also means that the scheduler only has to
> look at one bitmap to determine the available cpus for a thread.
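
To check my understanding of the copy-on-write behaviour, here is a small
Python sketch. Everything in it (CpuSet, Thread, fork(), setaffinity()) is
a hypothetical model, not the kernel code. It also assumes a thread may
widen its mask again up to its assigned set, which is my reading of "any
cpu that is visible to it". Reference counting and freeing of unused
anonymous sets are left out.

```python
class CpuSet:
    def __init__(self, mask, parent=None, anonymous=False):
        self.mask = frozenset(mask)
        self.parent = parent
        self.anonymous = anonymous
        self.children = []               # parent keeps a list of derived sets
        if parent is not None:
            parent.children.append(self)


class Thread:
    def __init__(self, cpuset):
        self.cpuset = cpuset             # the one set the scheduler consults


def fork(thread):
    # The new thread shares the same set, anonymous or not.
    return Thread(thread.cpuset)


def setaffinity(thread, mask):
    # Never mutate a possibly shared anonymous set; build a fresh one.
    cur = thread.cpuset
    base = cur.parent if cur.anonymous else cur
    mask = frozenset(mask)
    if not mask <= base.mask:
        raise ValueError("mask must stay within the assigned set")
    thread.cpuset = CpuSet(mask, parent=base, anonymous=True)
```

A thread that forks and then calls setaffinity() in the child gets a new
anonymous set for the child, while the original thread keeps the set it
already had.
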


I think the anonymous sets seem like a good idea. One solution to my
problem might be to make changing your current set to something that is
not a subset of your parent (or maybe your current set?) a privileged
operation.
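
As a rough sketch of the check I have in mind, in Python for brevity:
may_change_set() and its parameters are hypothetical, and bound_mask can
be either the parent's mask or the current set's mask, since I have not
settled on which one is the right bound.

```python
def may_change_set(bound_mask, requested_mask, privileged):
    """Allow any subset of the bound; anything wider requires privilege."""
    requested = frozenset(requested_mask)
    if requested <= frozenset(bound_mask):
        return True
    return bool(privileged)
```

An unprivileged parallel job could then still shuffle its threads around
inside the CPUs it was allocated, but could not grow past them.
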

-- Brooks

--1sNVjLsmu1MXqwQ/
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (FreeBSD)

iD8DBQFHwJGKXY6L6fI4GtQRAl3iAKDXYMD6U6rx87OVqGsDfQ gQk/GVfACfXlra
EDNQLEYWfYoI6H5v7YsDBWM=
=YC+R
-----END PGP SIGNATURE-----

--1sNVjLsmu1MXqwQ/--