[PATCH 0/9] Devices accessibility control group (v4) - Kernel

This is a discussion on [PATCH 0/9] Devices accessibility control group (v4) - Kernel ; Andrew Morton wrote: > On Fri, 07 Mar 2008 12:22:01 +0300 Pavel Emelyanov wrote: > >>> This doesn't include sufficient headers to be compileable. >>> >>> I'm sure there are lots of headers like this. But we regularly need >>> ...

+ Reply to Thread
Page 2 of 3 FirstFirst 1 2 3 LastLast
Results 21 to 40 of 47

Thread: [PATCH 0/9] Devices accessibility control group (v4)

  1. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    Andrew Morton wrote:
    > On Fri, 07 Mar 2008 12:22:01 +0300 Pavel Emelyanov wrote:
    >
    >>> This doesn't include sufficient headers to be compileable.
    >>>
    >>> I'm sure there are lots of headers like this. But we regularly need
    >>> to fix them.
    >>>

    >> Not sure, whether this is still relevant after Greg's comments, but that's
    >> the -fix patch for this one. (It will cause a conflict with the 9th patch.)

    >
    > Well. Where do we stand with this? afaict the state of play is:
    >
    > Greg: do it in udev
    > Pavel: but people want to run old distros in containers


    Actually no.

    Greg: Use LSM for this
    Pavel: My approach just makes maps per-group, while LSM will
    bring a new level of filtering/lookup on device open path

    > Realistically, when is the mainline kernel likely to have sufficient
    > container functionality which is sufficiently well-tested for people to
    > actually be able to do that? And how much longer will it take for that
    > kernel.org functionality to propagate out into non-bleeding-edge distros?


    The fact is that we have users of OpenVZ and even Virtuozzo, that still use
    redhat-9 as in containers. So even if this is ready in 5 years, there will
    always be someone who sets the outdated (by that time) fedora-core-8 and find
    out, that his udev refuses to work.

    > Altogether we're looking at one to three years, aren't we? By then,
    > perhaps a lot of "old" distros are already udev-based.
    >
    > otoh, my experience upgrading old kernels to new udev has not been a
    > good one (ie: it didn't work).


    Agree, but we're talking about making old udev working with new kernel.

    >


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    On Fri, Mar 07, 2008 at 12:52:40PM +0300, Pavel Emelyanov wrote:
    > Andrew Morton wrote:
    > > On Fri, 07 Mar 2008 12:22:01 +0300 Pavel Emelyanov wrote:
    > >
    > >>> This doesn't include sufficient headers to be compileable.
    > >>>
    > >>> I'm sure there are lots of headers like this. But we regularly need
    > >>> to fix them.
    > >>>
    > >> Not sure, whether this is still relevant after Greg's comments, but that's
    > >> the -fix patch for this one. (It will cause a conflict with the 9th patch.)

    > >
    > > Well. Where do we stand with this? afaict the state of play is:
    > >
    > > Greg: do it in udev
    > > Pavel: but people want to run old distros in containers

    >
    > Actually no.
    >
    > Greg: Use LSM for this


    Yes, that is my recommendation.

    > Pavel: My approach just makes maps per-group, while LSM will
    > bring a new level of filtering/lookup on device open path


    Huh? You are still doing that same "filtering/lookup" by modifying the
    maps code. The result should be exactly the same.

    Why do you not want to use the LSM interface? That is exactly what it
    is there for, don't go creating new hooks into the kernel for the exact
    same functionality.

    Opening a dev node is not on any "fast path" that you need to be
    concerned about a few extra calls within the kernel.

    And, I think in the end your patch would be much smaller and easier to
    understand and review and maintain overall.

    > > Realistically, when is the mainline kernel likely to have sufficient
    > > container functionality which is sufficiently well-tested for people to
    > > actually be able to do that? And how much longer will it take for that
    > > kernel.org functionality to propagate out into non-bleeding-edge distros?

    >
    > The fact is that we have users of OpenVZ and even Virtuozzo, that still use
    > redhat-9 as in containers. So even if this is ready in 5 years, there will
    > always be someone who sets the outdated (by that time) fedora-core-8 and find
    > out, that his udev refuses to work.


    That's fine, use the LSM interface, no need to change userspace at all.

    Although I think your requirement of using new kernels on very old
    distros is going to cause you more problems than you realize in the
    end...

    thanks,

    greg k-h
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  3. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    Greg KH wrote:
    > On Fri, Mar 07, 2008 at 12:52:40PM +0300, Pavel Emelyanov wrote:
    >> Andrew Morton wrote:
    >>> On Fri, 07 Mar 2008 12:22:01 +0300 Pavel Emelyanov wrote:
    >>>
    >>>>> This doesn't include sufficient headers to be compileable.
    >>>>>
    >>>>> I'm sure there are lots of headers like this. But we regularly need
    >>>>> to fix them.
    >>>>>
    >>>> Not sure, whether this is still relevant after Greg's comments, but that's
    >>>> the -fix patch for this one. (It will cause a conflict with the 9th patch.)
    >>> Well. Where do we stand with this? afaict the state of play is:
    >>>
    >>> Greg: do it in udev
    >>> Pavel: but people want to run old distros in containers

    >> Actually no.
    >>
    >> Greg: Use LSM for this

    >
    > Yes, that is my recommendation.
    >
    >> Pavel: My approach just makes maps per-group, while LSM will
    >> bring a new level of filtering/lookup on device open path

    >
    > Huh? You are still doing that same "filtering/lookup" by modifying the
    > maps code. The result should be exactly the same.


    No - this lookup was there before to get struct kobject from the dev_t,
    I just make it look up in another map.

    > Why do you not want to use the LSM interface? That is exactly what it
    > is there for, don't go creating new hooks into the kernel for the exact
    > same functionality.


    _No_new_hooks_ - just the map is get from task, not from a static variable.

    I have basically three objections against LSM.

    1. LSM is not stackable, so loading this tiny module with devices
    access rights will block other security modules;

    2. Turning CONFIG_SECURITY on immediately causes all the other hooks
    to get called. This affects performance on critical paths, like
    process creation/destruction, network flow and so on. This impact
    is small, but noticeable;

    3. With LSM turned on we'll have to "virtualize" it, i.e. make its
    work safe in a container. I don't presume to judge how much work
    will have to be done in this area, so the result patch would be
    even larger and maybe will duplicate functionality, which is currently
    in cgroups. OTOH, cgroups already provide the ways to correctly
    delegate proper rights to containers.

    > Opening a dev node is not on any "fast path" that you need to be
    > concerned about a few extra calls within the kernel.
    >
    > And, I think in the end your patch would be much smaller and easier to
    > understand and review and maintain overall.


    Hardly - the largest part of my patch is cgroup manipulations. The part
    that makes the char and block layers switch to new map ac check the
    permissions is 10-20 lines of new code.

    But with LSM I will still need this API.

    >>> Realistically, when is the mainline kernel likely to have sufficient
    >>> container functionality which is sufficiently well-tested for people to
    >>> actually be able to do that? And how much longer will it take for that
    >>> kernel.org functionality to propagate out into non-bleeding-edge distros?

    >> The fact is that we have users of OpenVZ and even Virtuozzo, that still use
    >> redhat-9 as in containers. So even if this is ready in 5 years, there will
    >> always be someone who sets the outdated (by that time) fedora-core-8 and find
    >> out, that his udev refuses to work.

    >
    > That's fine, use the LSM interface, no need to change userspace at all.
    >
    > Although I think your requirement of using new kernels on very old
    > distros is going to cause you more problems than you realize in the
    > end...
    >
    > thanks,
    >
    > greg k-h
    >


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  4. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    On Fri, Mar 07, 2008 at 09:01:04AM -0800, Greg KH wrote:

    > Again, I object to this as you are driving a new security policy
    > infrastructure into the device node logic where it does not belong as we
    > already have this functionality in the LSM interface today. Please use
    > that one instead and don't clutter up the kernel with "one-off" security
    > changes like this one.
    >
    > Please try the LSM interface and see what happens. If, after you have
    > created a patch, you still have objections, please post it for review
    > and I will be glad to revisit my opinion at that time.


    Not that per-container mappings on that level made any kind of sense in
    the first place...
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  5. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    On Fri, Mar 07, 2008 at 07:38:51PM +0300, Pavel Emelyanov wrote:
    > Greg KH wrote:
    > > On Fri, Mar 07, 2008 at 12:52:40PM +0300, Pavel Emelyanov wrote:
    > >> Andrew Morton wrote:
    > >>> On Fri, 07 Mar 2008 12:22:01 +0300 Pavel Emelyanov wrote:
    > >>>
    > >>>>> This doesn't include sufficient headers to be compileable.
    > >>>>>
    > >>>>> I'm sure there are lots of headers like this. But we regularly need
    > >>>>> to fix them.
    > >>>>>
    > >>>> Not sure, whether this is still relevant after Greg's comments, but that's
    > >>>> the -fix patch for this one. (It will cause a conflict with the 9th patch.)
    > >>> Well. Where do we stand with this? afaict the state of play is:
    > >>>
    > >>> Greg: do it in udev
    > >>> Pavel: but people want to run old distros in containers
    > >> Actually no.
    > >>
    > >> Greg: Use LSM for this

    > >
    > > Yes, that is my recommendation.
    > >
    > >> Pavel: My approach just makes maps per-group, while LSM will
    > >> bring a new level of filtering/lookup on device open path

    > >
    > > Huh? You are still doing that same "filtering/lookup" by modifying the
    > > maps code. The result should be exactly the same.

    >
    > No - this lookup was there before to get struct kobject from the dev_t,
    > I just make it look up in another map.
    >
    > > Why do you not want to use the LSM interface? That is exactly what it
    > > is there for, don't go creating new hooks into the kernel for the exact
    > > same functionality.

    >
    > _No_new_hooks_ - just the map is get from task, not from a static variable.


    The new "hook" is that we now have ways of setting different maps, and
    you added a whole infrastructure to set these permissions from
    userspace, as well as track them within the kernel. That's a security
    policy that you are putting into place within the kernel, just like LSM
    is supposed to be used for.

    > I have basically three objections against LSM.
    >
    > 1. LSM is not stackable, so loading this tiny module with devices
    > access rights will block other security modules;


    Do you really want to run other LSMs within a containerd kernel? Is
    that a requirement? It would seem to run counter to the main goal of
    containers to me.

    > 2. Turning CONFIG_SECURITY on immediately causes all the other hooks
    > to get called. This affects performance on critical paths, like
    > process creation/destruction, network flow and so on. This impact
    > is small, but noticeable;


    The last time this was brought up, it was not noticable, except for some
    network paths. And even then, the number was lost in the noise from
    what I saw. I think with a containered machine, you have bigger things
    to be worried about

    > 3. With LSM turned on we'll have to "virtualize" it, i.e. make its
    > work safe in a container. I don't presume to judge how much work
    > will have to be done in this area, so the result patch would be
    > even larger and maybe will duplicate functionality, which is currently
    > in cgroups. OTOH, cgroups already provide the ways to correctly
    > delegate proper rights to containers.


    No, your lsm would be your "virtualize" policy. I don't think you would
    have to do any additional work here, but could be wrong. Would like to
    see the code to prove it.

    > > Opening a dev node is not on any "fast path" that you need to be
    > > concerned about a few extra calls within the kernel.
    > >
    > > And, I think in the end your patch would be much smaller and easier to
    > > understand and review and maintain overall.

    >
    > Hardly - the largest part of my patch is cgroup manipulations. The part
    > that makes the char and block layers switch to new map ac check the
    > permissions is 10-20 lines of new code.
    >
    > But with LSM I will still need this API.


    Yes, but your LSM hooks will be smaller than the code modifications to
    the map logic

    Again, I object to this as you are driving a new security policy
    infrastructure into the device node logic where it does not belong as we
    already have this functionality in the LSM interface today. Please use
    that one instead and don't clutter up the kernel with "one-off" security
    changes like this one.

    Please try the LSM interface and see what happens. If, after you have
    created a patch, you still have objections, please post it for review
    and I will be glad to revisit my opinion at that time.

    thanks,

    greg k-h
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  6. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    Quoting Greg KH (greg@kroah.com):
    > On Fri, Mar 07, 2008 at 07:38:51PM +0300, Pavel Emelyanov wrote:
    > > Greg KH wrote:
    > > > On Fri, Mar 07, 2008 at 12:52:40PM +0300, Pavel Emelyanov wrote:
    > > >> Andrew Morton wrote:
    > > >>> On Fri, 07 Mar 2008 12:22:01 +0300 Pavel Emelyanov wrote:
    > > >>>
    > > >>>>> This doesn't include sufficient headers to be compileable.
    > > >>>>>
    > > >>>>> I'm sure there are lots of headers like this. But we regularly need
    > > >>>>> to fix them.
    > > >>>>>
    > > >>>> Not sure, whether this is still relevant after Greg's comments, but that's
    > > >>>> the -fix patch for this one. (It will cause a conflict with the 9th patch.)
    > > >>> Well. Where do we stand with this? afaict the state of play is:
    > > >>>
    > > >>> Greg: do it in udev
    > > >>> Pavel: but people want to run old distros in containers
    > > >> Actually no.
    > > >>
    > > >> Greg: Use LSM for this
    > > >
    > > > Yes, that is my recommendation.
    > > >
    > > >> Pavel: My approach just makes maps per-group, while LSM will
    > > >> bring a new level of filtering/lookup on device open path
    > > >
    > > > Huh? You are still doing that same "filtering/lookup" by modifying the
    > > > maps code. The result should be exactly the same.

    > >
    > > No - this lookup was there before to get struct kobject from the dev_t,
    > > I just make it look up in another map.
    > >
    > > > Why do you not want to use the LSM interface? That is exactly what it
    > > > is there for, don't go creating new hooks into the kernel for the exact
    > > > same functionality.

    > >
    > > _No_new_hooks_ - just the map is get from task, not from a static variable.

    >
    > The new "hook" is that we now have ways of setting different maps, and
    > you added a whole infrastructure to set these permissions from
    > userspace, as well as track them within the kernel. That's a security
    > policy that you are putting into place within the kernel, just like LSM
    > is supposed to be used for.
    >
    > > I have basically three objections against LSM.
    > >
    > > 1. LSM is not stackable, so loading this tiny module with devices
    > > access rights will block other security modules;

    >
    > Do you really want to run other LSMs within a containerd kernel? Is
    > that a requirement? It would seem to run counter to the main goal of
    > containers to me.


    Until user namespaces are complete, selinux seems the only good solution
    to offer isolation.

    > > 2. Turning CONFIG_SECURITY on immediately causes all the other hooks
    > > to get called. This affects performance on critical paths, like
    > > process creation/destruction, network flow and so on. This impact
    > > is small, but noticeable;

    >
    > The last time this was brought up, it was not noticable, except for some
    > network paths. And even then, the number was lost in the noise from
    > what I saw. I think with a containered machine, you have bigger things
    > to be worried about
    >
    > > 3. With LSM turned on we'll have to "virtualize" it, i.e. make its
    > > work safe in a container. I don't presume to judge how much work
    > > will have to be done in this area, so the result patch would be
    > > even larger and maybe will duplicate functionality, which is currently
    > > in cgroups. OTOH, cgroups already provide the ways to correctly
    > > delegate proper rights to containers.

    >
    > No, your lsm would be your "virtualize" policy. I don't think you would
    > have to do any additional work here, but could be wrong. Would like to
    > see the code to prove it.
    >
    > > > Opening a dev node is not on any "fast path" that you need to be
    > > > concerned about a few extra calls within the kernel.
    > > >
    > > > And, I think in the end your patch would be much smaller and easier to
    > > > understand and review and maintain overall.

    > >
    > > Hardly - the largest part of my patch is cgroup manipulations. The part
    > > that makes the char and block layers switch to new map ac check the
    > > permissions is 10-20 lines of new code.
    > >
    > > But with LSM I will still need this API.

    >
    > Yes, but your LSM hooks will be smaller than the code modifications to
    > the map logic
    >
    > Again, I object to this as you are driving a new security policy
    > infrastructure into the device node logic where it does not belong as we
    > already have this functionality in the LSM interface today. Please use
    > that one instead and don't clutter up the kernel with "one-off" security
    > changes like this one.
    >
    > Please try the LSM interface and see what happens. If, after you have


    https://lists.linux-foundation.org/p...er/008589.html

    This was just a proof of concept for discussion.

    Our conclusion (including my own) was that it would be nicer to have the
    controls not be shoed in using lsm but rather at the level Pavel was
    doing.

    > created a patch, you still have objections, please post it for review
    > and I will be glad to revisit my opinion at that time.
    >
    > thanks,
    >
    > greg k-h

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  7. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup


    --- "Serge E. Hallyn" wrote:

    > ...
    >
    > Until user namespaces are complete, selinux seems the only good solution
    > to offer isolation.


    Smack does it better and cheaper. (Unless you define good==selinux)
    (insert smiley)



    Casey Schaufler
    casey@schaufler-ca.com
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  8. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    On Fri, Mar 07, 2008 at 11:35:42AM -0600, Serge E. Hallyn wrote:
    > > Do you really want to run other LSMs within a containerd kernel? Is
    > > that a requirement? It would seem to run counter to the main goal of
    > > containers to me.

    >
    > Until user namespaces are complete, selinux seems the only good solution
    > to offer isolation.


    Great, use that instead

    > > > 2. Turning CONFIG_SECURITY on immediately causes all the other hooks
    > > > to get called. This affects performance on critical paths, like
    > > > process creation/destruction, network flow and so on. This impact
    > > > is small, but noticeable;

    > >
    > > The last time this was brought up, it was not noticable, except for some
    > > network paths. And even then, the number was lost in the noise from
    > > what I saw. I think with a containered machine, you have bigger things
    > > to be worried about
    > >
    > > > 3. With LSM turned on we'll have to "virtualize" it, i.e. make its
    > > > work safe in a container. I don't presume to judge how much work
    > > > will have to be done in this area, so the result patch would be
    > > > even larger and maybe will duplicate functionality, which is currently
    > > > in cgroups. OTOH, cgroups already provide the ways to correctly
    > > > delegate proper rights to containers.

    > >
    > > No, your lsm would be your "virtualize" policy. I don't think you would
    > > have to do any additional work here, but could be wrong. Would like to
    > > see the code to prove it.
    > >
    > > > > Opening a dev node is not on any "fast path" that you need to be
    > > > > concerned about a few extra calls within the kernel.
    > > > >
    > > > > And, I think in the end your patch would be much smaller and easier to
    > > > > understand and review and maintain overall.
    > > >
    > > > Hardly - the largest part of my patch is cgroup manipulations. The part
    > > > that makes the char and block layers switch to new map ac check the
    > > > permissions is 10-20 lines of new code.
    > > >
    > > > But with LSM I will still need this API.

    > >
    > > Yes, but your LSM hooks will be smaller than the code modifications to
    > > the map logic
    > >
    > > Again, I object to this as you are driving a new security policy
    > > infrastructure into the device node logic where it does not belong as we
    > > already have this functionality in the LSM interface today. Please use
    > > that one instead and don't clutter up the kernel with "one-off" security
    > > changes like this one.
    > >
    > > Please try the LSM interface and see what happens. If, after you have

    >
    > https://lists.linux-foundation.org/p...er/008589.html
    >
    > This was just a proof of concept for discussion.


    I don't see the LSM patch in that posting, only a Makefile change that
    adds it to the build.

    > Our conclusion (including my own) was that it would be nicer to have the
    > controls not be shoed in using lsm but rather at the level Pavel was
    > doing.


    Why? What is the difference? I don't see that discussion anywhere in
    that post.

    Why add add-hock security stuff to the kernel all over the place and not
    use the LSM interface that we already have? If LSM doesn't provide the
    exact right hooks needed, we can always change it (and no, we should not
    be adding kobj_maps hooks to LSM).

    thanks,

    greg k-h
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  9. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    Quoting Casey Schaufler (casey@schaufler-ca.com):
    >
    > --- "Serge E. Hallyn" wrote:
    >
    > > ...
    > >
    > > Until user namespaces are complete, selinux seems the only good solution
    > > to offer isolation.

    >
    > Smack does it better and cheaper. (Unless you define good==selinux)
    > (insert smiley)


    Ah, thanks - I hadn't looked into it, but yes IIUC smack should
    definately work. I'll have to give that a shot.

    (A basic selinux policy module to isolate a container was pretty simple,
    but providing finer-grained intra-container access seems to take some
    changes to the base refpolicy. I've been waiting a few weeks to find
    time to work on that.)

    thanks,
    -serge
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  10. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    Quoting Greg KH (greg@kroah.com):
    > On Fri, Mar 07, 2008 at 11:35:42AM -0600, Serge E. Hallyn wrote:
    > > > Do you really want to run other LSMs within a containerd kernel? Is
    > > > that a requirement? It would seem to run counter to the main goal of
    > > > containers to me.

    > >
    > > Until user namespaces are complete, selinux seems the only good solution
    > > to offer isolation.

    >
    > Great, use that instead


    That can't work as is since you can't specify major:minor in policy.
    So all we could do again is simply refuse all mknod, which we can
    already do with per-process capability bounding sets.

    > > > > 2. Turning CONFIG_SECURITY on immediately causes all the other hooks
    > > > > to get called. This affects performance on critical paths, like
    > > > > process creation/destruction, network flow and so on. This impact
    > > > > is small, but noticeable;
    > > >
    > > > The last time this was brought up, it was not noticable, except for some
    > > > network paths. And even then, the number was lost in the noise from
    > > > what I saw. I think with a containered machine, you have bigger things
    > > > to be worried about
    > > >
    > > > > 3. With LSM turned on we'll have to "virtualize" it, i.e. make its
    > > > > work safe in a container. I don't presume to judge how much work
    > > > > will have to be done in this area, so the result patch would be
    > > > > even larger and maybe will duplicate functionality, which is currently
    > > > > in cgroups. OTOH, cgroups already provide the ways to correctly
    > > > > delegate proper rights to containers.
    > > >
    > > > No, your lsm would be your "virtualize" policy. I don't think you would
    > > > have to do any additional work here, but could be wrong. Would like to
    > > > see the code to prove it.
    > > >
    > > > > > Opening a dev node is not on any "fast path" that you need to be
    > > > > > concerned about a few extra calls within the kernel.
    > > > > >
    > > > > > And, I think in the end your patch would be much smaller and easier to
    > > > > > understand and review and maintain overall.
    > > > >
    > > > > Hardly - the largest part of my patch is cgroup manipulations. The part
    > > > > that makes the char and block layers switch to new map ac check the
    > > > > permissions is 10-20 lines of new code.
    > > > >
    > > > > But with LSM I will still need this API.
    > > >
    > > > Yes, but your LSM hooks will be smaller than the code modifications to
    > > > the map logic
    > > >
    > > > Again, I object to this as you are driving a new security policy
    > > > infrastructure into the device node logic where it does not belong as we
    > > > already have this functionality in the LSM interface today. Please use
    > > > that one instead and don't clutter up the kernel with "one-off" security
    > > > changes like this one.
    > > >
    > > > Please try the LSM interface and see what happens. If, after you have

    > >
    > > https://lists.linux-foundation.org/p...er/008589.html
    > >
    > > This was just a proof of concept for discussion.

    >
    > I don't see the LSM patch in that posting, only a Makefile change that
    > adds it to the build.


    Aw, crap! That's right, I remember noticing a few days later that it
    wasn't in my git index any more.

    An older version (by about a month) of the patch with the LSM coded
    straight into the cgroup file is appended below.

    > > Our conclusion (including my own) was that it would be nicer to have the
    > > controls not be shoed in using lsm but rather at the level Pavel was
    > > doing.

    >
    > Why? What is the difference? I don't see that discussion anywhere in
    > that post.


    That happened in a later thread in december:
    https://lists.linux-foundation.org/p...er/009337.html

    > Why add add-hock security stuff to the kernel all over the place and not
    > use the LSM interface that we already have? If LSM doesn't provide the
    > exact right hooks needed, we can always change it (and no, we should not
    > be adding kobj_maps hooks to LSM).
    >
    > thanks,
    >
    > greg k-h



    From 4266131c40b629e3b04c0d9d01569a95fa967e3e Mon Sep 17 00:00:00 2001
    From: Serge E. Hallyn
    Date: Thu, 11 Oct 2007 15:27:48 -0400
    Subject: [PATCH 1/1] cgroups: implement device whitelist cgroup+lsm

    Implement a cgroup using the LSM interface to enforce open and mknod
    on device files. Not a line of this code is expected to be used in a
    final version, this is just a proof of concept.

    No stacking is implemented, so to test this you must have

    CGROUPS=y
    SECURITY=y

    but all other LSMs =n (no capabilities, no selinux, no rootplug).

    This implements a simple device access whitelist. A whitelist entry
    has 4 fields. 'type' is a (all), c (char), or b (block). 'all' means it
    applies to all types, all major numbers, and all minor numbers. Major and
    minor are obvious. Access is a composition of r (read), w (write), and
    m (mknod).

    The root devcgroup starts with rwm to 'all'. A child devcg gets a copy
    of the parent. Admins can then add and remove devices to the whitelist.
    Once CAP_HOST_ADMIN is introduced it will be needed to add entries as
    well or remove entries from another cgroup, though just CAP_SYS_ADMIN
    will suffice to remove entries for your own group.

    An entry is added by doing "echo " > devcg.allow,
    for instance:

    echo b 7 0 mrw > /cgroups/1/devcg.allow

    An entry is removed by doing likewise into devcg.deny. Since this is a
    pure whitelist, not acls, you can only remove entries which exist in the
    whitelist. You must explicitly

    echo a 0 0 mrw > /cgroups/1/devcg.deny

    to remove the "allow all" entry which is automatically inherited from
    the root cgroup.

    While composing this with the ns_cgroup may seem logical, it may not
    be the right thing to do. Note that each newly created devcg gets
    a copy of the parent whitelist. So if you had done

    mount -t cgroup -o ns,devcg none /cgroups

    then once a process in /cgroup/1 had done an unshare(CLONE_NEWNS)
    it would be under /cgroup/1/node_
    if an admin did

    echo b 7 0 m > /cgroups/1/devcg.deny

    then the entry would still be in the whitelist for /cgroups/1/node_.
    Something to discuss if we get that far before nixing this whole idea.

    Signed-off-by: Serge E. Hallyn
    ---
    include/linux/cgroup_subsys.h | 6 +
    init/Kconfig | 7 +
    kernel/Makefile | 1 +
    kernel/dev_cgroup.c | 554 +++++++++++++++++++++++++++++++++++++++++
    4 files changed, 568 insertions(+), 0 deletions(-)
    create mode 100644 kernel/dev_cgroup.c

    diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
    index d822977..cf55cb2 100644
    --- a/include/linux/cgroup_subsys.h
    +++ b/include/linux/cgroup_subsys.h
    @@ -36,3 +36,9 @@ SUBSYS(mem_cgroup)
    #endif

    /* */
    +
    +#ifdef CONFIG_CGROUP_DEV
    +SUBSYS(devcg)
    +#endif
    +
    +/* */
    diff --git a/init/Kconfig b/init/Kconfig
    index 6bb603a..0b3b684 100644
    --- a/init/Kconfig
    +++ b/init/Kconfig
    @@ -319,6 +319,13 @@ config CPUSETS

    Say N if unsure.

    +config CGROUP_DEV
    + bool "Device controller for cgroups"
    + depends on CGROUPS && SECURITY && EXPERIMENTAL
    + help
    + Provides a cgroup implementing whitelists for devices which
    + a process in the cgroup can mknod or open.
    +
    config FAIR_GROUP_SCHED
    bool "Fair group CPU scheduler"
    default y
    diff --git a/kernel/Makefile b/kernel/Makefile
    index 76f782f..6ded46d 100644
    --- a/kernel/Makefile
    +++ b/kernel/Makefile
    @@ -43,6 +43,7 @@ obj-$(CONFIG_CGROUP_DEBUG) += cgroup_debug.o
    obj-$(CONFIG_CPUSETS) += cpuset.o
    obj-$(CONFIG_CGROUP_CPUACCT) += cpu_acct.o
    obj-$(CONFIG_CGROUP_NS) += ns_cgroup.o
    +obj-$(CONFIG_CGROUP_DEV) += dev_cgroup.o
    obj-$(CONFIG_IKCONFIG) += configs.o
    obj-$(CONFIG_STOP_MACHINE) += stop_machine.o
    obj-$(CONFIG_AUDIT) += audit.o auditfilter.o
    diff --git a/kernel/dev_cgroup.c b/kernel/dev_cgroup.c
    new file mode 100644
    index 0000000..87c8fb4
    --- /dev/null
    +++ b/kernel/dev_cgroup.c
    @@ -0,0 +1,554 @@
    +/*
    + * dev_cgroup.c - device cgroup subsystem
    + *
    + * Copyright 2007 IBM Corp
    + */
    +
    +#include
    +#include
    +#include
    +#include
    +#include
    +
    +#include
    +
    +#define ACC_MKNOD 1
    +#define ACC_READ 2
    +#define ACC_WRITE 4
    +
    +#define DEV_BLOCK 1
    +#define DEV_CHAR 2
    +#define DEV_ALL 4 /* this represents all devices */
    +
    +/*
    + * whitelist locking rules:
    + * cgroup_lock() cannot be taken under cgroup->lock.
    + * cgroup->lock can be taken with or without cgroup_lock().
    + *
    + * modifications always require cgroup_lock
    + * modifications to a list which is visible require the
    + * cgroup->lock *and* cgroup_lock()
    + * walking the list requires cgroup->lock or cgroup_lock().
    + *
    + * reasoning: dev_whitelist_copy() needs to kmalloc, so needs
    + * a mutex, which the cgroup_lock() is. Since modifying
    + * a visible list requires both locks, either lock can be
    + * taken for walking the list. Since the wh->spinlock is taken
    + * for modifying a public-accessible list, the spinlock is
    + * sufficient for just walking the list.
    + */
    +
    +struct dev_whitelist_item {
    + u32 major, minor;
    + short type;
    + short access;
    + struct list_head list;
    +};
    +
    +struct dev_cgroup {
    + struct cgroup_subsys_state css;
    + struct list_head whitelist;
    + spinlock_t lock;
    +};
    +
    +struct cgroup_subsys devcg_subsys;
    +
    +static inline struct dev_cgroup *cgroup_to_devcg(
    + struct cgroup *cgroup)
    +{
    + return container_of(cgroup_subsys_state(cgroup, devcg_subsys_id),
    + struct dev_cgroup, css);
    +}
    +
    +/*
    + * Once 64-bit caps and CAP_HOST_ADMIN exist, we will be
    + * requiring (CAP_HOST_ADMIN|CAP_MKNOD) to create a device
    + * not in the whitelist, * (CAP_HOST_ADMIN|CAP_SYS_ADMIN)
    + * to edit the whitelist,
    + */
    +static int devcg_can_attach(struct cgroup_subsys *ss,
    + struct cgroup *new_cgroup, struct task_struct *task)
    +{
    + struct cgroup *orig;
    +
    + if (current != task) {
    + if (!cgroup_is_descendant(new_cgroup))
    + return -EPERM;
    + }
    +
    + if (atomic_read(&new_cgroup->count) != 0)
    + return -EPERM;
    +
    + orig = task_cgroup(task, devcg_subsys_id);
    + if (orig && orig != new_cgroup->parent)
    + return -EPERM;
    +
    + return 0;
    +}
    +
    +/*
    + * called under cgroup_lock()
    + */
    +int dev_whitelist_copy(struct list_head *dest, struct list_head *orig)
    +{
    + struct dev_whitelist_item *wh, *tmp, *new;
    +
    + list_for_each_entry(wh, orig, list) {
    + new = kmalloc(sizeof(*wh), GFP_KERNEL);
    + if (!new)
    + goto free_and_exit;
    + new->major = wh->major;
    + new->minor = wh->minor;
    + new->type = wh->type;
    + new->access = wh->access;
    + list_add_tail(&new->list, dest);
    + }
    +
    + return 0;
    +
    +free_and_exit:
    + list_for_each_entry_safe(wh, tmp, dest, list) {
    + list_del(&wh->list);
    + kfree(wh);
    + }
    + return -ENOMEM;
    +}
    +
    +/* Stupid prototype - don't bother combining existing entries */
    +/*
    + * called under cgroup_lock()
    + * since the list is visible to other tasks, we need the spinlock also
    + */
    +void dev_whitelist_add(struct dev_cgroup *dev_cgroup, struct dev_whitelist_item *wh)
    +{
    + spin_lock(&dev_cgroup->lock);
    + list_add_tail(&wh->list, &dev_cgroup->whitelist);
    + spin_unlock(&dev_cgroup->lock);
    +}
    +
    +/*
    + * called under cgroup_lock()
    + * since the list is visible to other tasks, we need the spinlock also
    + */
    +void dev_whitelist_rm(struct dev_cgroup *dev_cgroup, struct dev_whitelist_item *wh)
    +{
    + struct dev_whitelist_item *walk, *tmp;
    +
    + spin_lock(&dev_cgroup->lock);
    + list_for_each_entry_safe(walk, tmp, &dev_cgroup->whitelist, list) {
    + if (walk->type & DEV_ALL) {
    + list_del(&walk->list);
    + kfree(walk);
    + continue;
    + }
    + if (walk->type != wh->type)
    + continue;
    + if (walk->major != wh->major || walk->minor != wh->minor)
    + continue;
    + walk->access &= ~wh->access;
    + if (!walk->access) {
    + list_del(&walk->list);
    + kfree(walk);
    + }
    + }
    + spin_unlock(&dev_cgroup->lock);
    +}
    +
    +/*
    + * Rules: you can only create a cgroup if
    + * 1. you are capable(CAP_SYS_ADMIN)
    + * 2. the target cgroup is a descendant of your own cgroup
    + *
    + * Note: called from kernel/cgroup.c with cgroup_lock() held.
    + */
    +static struct cgroup_subsys_state *devcg_create(struct cgroup_subsys *ss,
    + struct cgroup *cgroup)
    +{
    + struct dev_cgroup *dev_cgroup, *parent_dev_cgroup;
    + struct cgroup *parent_cgroup;
    + int ret;
    +
    + if (!capable(CAP_SYS_ADMIN))
    + return ERR_PTR(-EPERM);
    + if (!cgroup_is_descendant(cgroup))
    + return ERR_PTR(-EPERM);
    +
    + dev_cgroup = kzalloc(sizeof(*dev_cgroup), GFP_KERNEL);
    + if (!dev_cgroup)
    + return ERR_PTR(-ENOMEM);
    + INIT_LIST_HEAD(&dev_cgroup->whitelist);
    + parent_cgroup = cgroup->parent;
    +
    + if (parent_cgroup == NULL) {
    + struct dev_whitelist_item *wh;
    + wh = kmalloc(sizeof(*wh), GFP_KERNEL);
    + wh->minor = wh->major = 0;
    + wh->type = DEV_ALL;
    + wh->access = ACC_MKNOD | ACC_READ | ACC_WRITE;
    + list_add(&wh->list, &dev_cgroup->whitelist);
    + } else {
    + parent_dev_cgroup = cgroup_to_devcg(parent_cgroup);
    + ret = dev_whitelist_copy(&dev_cgroup->whitelist,
    + &parent_dev_cgroup->whitelist);
    + if (ret) {
    + kfree(dev_cgroup);
    + return ERR_PTR(ret);
    + }
    + }
    +
    + spin_lock_init(&dev_cgroup->lock);
    + return &dev_cgroup->css;
    +}
    +
    +static void devcg_destroy(struct cgroup_subsys *ss,
    + struct cgroup *cgroup)
    +{
    + struct dev_cgroup *dev_cgroup;
    + struct dev_whitelist_item *wh, *tmp;
    +
    + dev_cgroup = cgroup_to_devcg(cgroup);
    + list_for_each_entry_safe(wh, tmp, &dev_cgroup->whitelist, list) {
    + list_del(&wh->list);
    + kfree(wh);
    + }
    + kfree(dev_cgroup);
    +}
    +
    +#define DEVCG_ALLOW 1
    +#define DEVCG_DENY 2
    +
    +void set_access(char *acc, short access)
    +{
    + int idx = 0;
    + memset(acc, 0, 4);
    + if (access & ACC_READ)
    + acc[idx++] = 'r';
    + if (access & ACC_WRITE)
    + acc[idx++] = 'w';
    + if (access & ACC_MKNOD)
    + acc[idx++] = 'm';
    +}
    +
    +char type_to_char(short type)
    +{
    + if (type == DEV_ALL)
    + return 'a';
    + if (type == DEV_CHAR)
    + return 'c';
    + if (type == DEV_BLOCK)
    + return 'b';
    + return 'X';
    +}
    +
    +char *print_whitelist(struct dev_cgroup *devcgroup, int *len)
    +{
    + char *buf, *s, acc[4];
    + struct dev_whitelist_item *wh;
    + int ret;
    + int count = 0;
    +
    + buf = kmalloc(4096, GFP_KERNEL);
    + if (!buf)
    + return ERR_PTR(-ENOMEM);
    + s = buf;
    + *s = '\0';
    + *len = 0;
    +
    + spin_lock(&devcgroup->lock);
    + list_for_each_entry(wh, &devcgroup->whitelist, list) {
    + set_access(acc, wh->access);
    + printk(KERN_NOTICE "%s (count%d): whtype %hd maj %u min %u acc %hd\n", __FUNCTION__,
    + count, wh->type, wh->major, wh->minor, wh->access);
    + ret = snprintf(s, 4095-(s-buf), "%c %u %u %s\n",
    + type_to_char(wh->type), wh->major, wh->minor, acc);
    + if (s+ret >= buf+4095) {
    + kfree(buf);
    + buf = ERR_PTR(-ENOMEM);
    + break;
    + }
    + s += ret;
    + *len += ret;
    + count++;
    + }
    + spin_unlock(&devcgroup->lock);
    +
    + return buf;
    +}
    +
    +static ssize_t devcg_access_read(struct cgroup *cgroup,
    + struct cftype *cft, struct file *file,
    + char __user *userbuf, size_t nbytes, loff_t *ppos)
    +{
    + struct dev_cgroup *devcgrp = cgroup_to_devcg(cgroup);
    + int filetype = cft->private;
    + char *buffer;
    + int len, retval;
    +
    + if (filetype != DEVCG_ALLOW)
    + return -EINVAL;
    + buffer = print_whitelist(devcgrp, &len);
    + if (IS_ERR(buffer))
    + return PTR_ERR(buffer);
    +
    + retval = simple_read_from_buffer(userbuf, nbytes, ppos, buffer, len);
    + kfree(buffer);
    + return retval;
    +}
    +
    +static inline short convert_access(char *acc)
    +{
    + short access = 0;
    +
    + while (*acc) {
    + switch (*acc) {
    + case 'r':
    + case 'R': access |= ACC_READ; break;
    + case 'w':
    + case 'W': access |= ACC_WRITE; break;
    + case 'm':
    + case 'M': access |= ACC_MKNOD; break;
    + case '\n': break;
    + default:
    + return -EINVAL;
    + }
    + acc++;
    + }
    +
    + return access;
    +}
    +
    +static inline short convert_type(char intype)
    +{
    + short type = 0;
    + switch(intype) {
    + case 'a': type = DEV_ALL; break;
    + case 'c': type = DEV_CHAR; break;
    + case 'b': type = DEV_BLOCK; break;
    + default: type = -EACCES; break;
    + }
    + return type;
    +}
    +
    +/*
    + * Current rules:
    + * CAP_SYS_ADMIN needed for all writes.
    + * when we have CAP_HOST_ADMIN, the rules will become:
    + * if (!writetoself)
    + * require capable(CAP_HOST_ADMIN | CAP_SYS_ADMIN);
    + * if (is_allow)
    + * require capable(CAP_HOST_ADMIN | CAP_SYS_ADMIN);
    + * require capable(CAP_SYS_ADMIN);
    + */
    +static int have_write_permission(int is_allow, int writetoself)
    +{
    + if (!capable(CAP_SYS_ADMIN))
    + return 0;
    + return 1;
    +}
    +
    +static ssize_t devcg_access_write(struct cgroup *cgroup, struct cftype *cft,
    + struct file *file, const char __user *userbuf,
    + size_t nbytes, loff_t *ppos)
    +{
    + struct cgroup *cur_cgroup;
    + struct dev_cgroup *devcgrp, *cur_devcgroup;
    + int filetype = cft->private;
    + char *buffer, acc[4];
    + int retval = 0;
    + int nitems;
    + char type;
    + struct dev_whitelist_item *wh;
    +
    + devcgrp = cgroup_to_devcg(cgroup);
    + cur_cgroup = task_cgroup(current, devcg_subsys.subsys_id);
    + cur_devcgroup = cgroup_to_devcg(cur_cgroup);
    +
    + if (!have_write_permission(filetype==DEVCG_ALLOW, cur_devcgroup==devcgrp))
    + return -EPERM;
    +
    + buffer = kmalloc(nbytes+1, GFP_KERNEL);
    + if (!buffer)
    + return -ENOMEM;
    +
    + wh = kmalloc(sizeof(*wh), GFP_KERNEL);
    + if (!wh) {
    + kfree(buffer);
    + return -ENOMEM;
    + }
    +
    + if (copy_from_user(buffer, userbuf, nbytes)) {
    + retval = -EFAULT;
    + goto out1;
    + }
    + buffer[nbytes] = 0; /* nul-terminate */
    +
    + cgroup_lock();
    + if (cgroup_is_removed(cgroup)) {
    + retval = -ENODEV;
    + goto out2;
    + }
    +
    + memset(wh, 0, sizeof(*wh));
    + memset(acc, 0, 4);
    + nitems = sscanf(buffer, "%c %u %u %3s", &type, &wh->major, &wh->minor, acc);
    + retval = -EINVAL;
    + if (nitems != 4)
    + goto out2;
    + wh->type = convert_type(type);
    + if (wh->type < 0)
    + goto out2;
    + wh->access = convert_access(acc);
    + if (wh->access < 0)
    + goto out2;
    + retval = 0;
    + switch (filetype) {
    + case DEVCG_ALLOW:
    + printk(KERN_NOTICE "%s: add whtype %hd maj %u min %u acc %hd\n", __FUNCTION__,
    + wh->type, wh->major, wh->minor, wh->access);
    + dev_whitelist_add(devcgrp, wh);
    + break;
    + case DEVCG_DENY:
    + dev_whitelist_rm(devcgrp, wh);
    + break;
    + default:
    + retval = -EINVAL;
    + goto out2;
    + }
    +
    + if (retval == 0)
    + retval = nbytes;
    +
    +out2:
    + cgroup_unlock();
    +out1:
    + kfree(buffer);
    + return retval;
    +}
    +
    +static struct cftype dev_cgroup_files[] = {
    + {
    + .name = "allow",
    + .read = devcg_access_read,
    + .write = devcg_access_write,
    + .private = DEVCG_ALLOW,
    + },
    + {
    + .name = "deny",
    + .write = devcg_access_write,
    + .private = DEVCG_DENY,
    + },
    +};
    +
    +static int devcg_populate(struct cgroup_subsys *ss,
    + struct cgroup *cont)
    +{
    + return cgroup_add_files(cont, ss, dev_cgroup_files,
    + ARRAY_SIZE(dev_cgroup_files));
    +}
    +
    +struct cgroup_subsys devcg_subsys = {
    + .name = "devcg",
    + .can_attach = devcg_can_attach,
    + .create = devcg_create,
    + .destroy = devcg_destroy,
    + .populate = devcg_populate,
    + .subsys_id = devcg_subsys_id,
    +};
    +
    +static int devcgroup_inode_permission(struct inode *inode, int mask,
    + struct nameidata *nd)
    +{
    + struct cgroup *cgroup;
    + struct dev_cgroup *dev_cgroup;
    + struct dev_whitelist_item *wh;
    +
    + dev_t device = inode->i_rdev;
    + if (!device)
    + return 0;
    + if (!S_ISBLK(inode->i_mode) && !S_ISCHR(inode->i_mode))
    + return 0;
    + cgroup = task_cgroup(current, devcg_subsys.subsys_id);
    + dev_cgroup = cgroup_to_devcg(cgroup);
    + if (!dev_cgroup)
    + return 0;
    +
    + spin_lock(&dev_cgroup->lock);
    + /* if capable(CAP_HOST_ADMIN) return 0; */
    + list_for_each_entry(wh, &dev_cgroup->whitelist, list) {
    + if (wh->type & DEV_ALL)
    + goto acc_check;
    + if (S_ISBLK(inode->i_mode) && !(wh->type & DEV_BLOCK))
    + continue;
    + if (S_ISCHR(inode->i_mode) && !(wh->type & DEV_CHAR))
    + continue;
    + if (wh->major != imajor(inode) || wh->minor != iminor(inode))
    + continue;
    +acc_check:
    + if ((mask & MAY_WRITE) && !(wh->access & ACC_WRITE))
    + continue;
    + if ((mask & MAY_READ) && !(wh->access & ACC_READ))
    + continue;
    + spin_unlock(&dev_cgroup->lock);
    + return 0;
    + }
    + spin_unlock(&dev_cgroup->lock);
    +
    + printk(KERN_NOTICE "%s: %d denied %d access to %s (%lu)\n", __FUNCTION__,
    + current->pid, mask, nd ? nd->dentry->d_name.name : "null",
    + inode->i_ino);
    + return -EPERM;
    +}
    +
    +static int devcgroup_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev)
    +{
    + struct cgroup *cgroup;
    + struct dev_cgroup *dev_cgroup;
    + struct dev_whitelist_item *wh;
    +
    + /* if capable(CAP_HOST_ADMIN) return 0; */
    + cgroup = task_cgroup(current, devcg_subsys.subsys_id);
    + dev_cgroup = cgroup_to_devcg(cgroup);
    + if (!dev_cgroup)
    + return 0;
    +
    + spin_lock(&dev_cgroup->lock);
    + list_for_each_entry(wh, &dev_cgroup->whitelist, list) {
    + if (wh->type & DEV_ALL)
    + goto ok;
    + if (S_ISBLK(mode) && !(wh->type & DEV_BLOCK))
    + continue;
    + if (S_ISCHR(mode) && !(wh->type & DEV_CHAR))
    + continue;
    + if (wh->major != MAJOR(dev) || wh->minor != MINOR(dev))
    + continue;
    + if (wh->access & ACC_MKNOD)
    + goto ok;
    + }
    + spin_unlock(&dev_cgroup->lock);
    +
    + printk(KERN_NOTICE "%s: %d denied %d access to (%d %d)\n", __FUNCTION__,
    + current->pid, mode, MAJOR(dev), MINOR(dev));
    + return -EPERM;
    +
    +ok:
    + spin_unlock(&dev_cgroup->lock);
    + return 0;
    +}
    +
    +static struct security_operations devcgroup_security_ops = {
    + .inode_mknod = devcgroup_inode_mknod,
    + .inode_permission = devcgroup_inode_permission,
    +};
    +
    +static int __init dev_cgroup_security_init (void)
    +{
    + /* register ourselves with the security framework */
    + if (register_security (&devcgroup_security_ops)) {
    + printk (KERN_INFO "Failure registering device cgroup lsm\n");
    + return -EINVAL;
    + }
    + printk (KERN_INFO "Device cgroup LSM initialized\n");
    + return 0;
    +}
    +
    +security_initcall (dev_cgroup_security_init);
    --
    1.5.1.1.GIT

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  11. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup


    On Fri, 2008-03-07 at 12:30 -0600, Serge E. Hallyn wrote:
    > Quoting Casey Schaufler (casey@schaufler-ca.com):
    > >
    > > --- "Serge E. Hallyn" wrote:
    > >
    > > > ...
    > > >
    > > > Until user namespaces are complete, selinux seems the only good solution
    > > > to offer isolation.

    > >
    > > Smack does it better and cheaper. (Unless you define good==selinux)
    > > (insert smiley)

    >
    > Ah, thanks - I hadn't looked into it, but yes IIUC smack should
    > definately work. I'll have to give that a shot.


    Not if you want to confine uid 0. smack doesn't control capabilities,
    even the ones used to override it.

    So you'd have to at least configure your per-process bset and file caps
    rather carefully. And even then you have to watch out for things with
    CAP_MAC* or CAP_SETPCAP.

    > (A basic selinux policy module to isolate a container was pretty simple,
    > but providing finer-grained intra-container access seems to take some
    > changes to the base refpolicy. I've been waiting a few weeks to find
    > time to work on that.)


    --
    Stephen Smalley
    National Security Agency

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  12. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup


    --- Stephen Smalley wrote:

    >
    > On Fri, 2008-03-07 at 12:30 -0600, Serge E. Hallyn wrote:
    > > Quoting Casey Schaufler (casey@schaufler-ca.com):
    > > >
    > > > --- "Serge E. Hallyn" wrote:
    > > >
    > > > > ...
    > > > >
    > > > > Until user namespaces are complete, selinux seems the only good

    > solution
    > > > > to offer isolation.
    > > >
    > > > Smack does it better and cheaper. (Unless you define good==selinux)
    > > > (insert smiley)

    > >
    > > Ah, thanks - I hadn't looked into it, but yes IIUC smack should
    > > definately work. I'll have to give that a shot.

    >
    > Not if you want to confine uid 0. smack doesn't control capabilities,
    > even the ones used to override it.
    >
    > So you'd have to at least configure your per-process bset and file caps
    > rather carefully. And even then you have to watch out for things with
    > CAP_MAC* or CAP_SETPCAP.


    Shrug. As if getting 800,000 lines of policy definition
    for a thousand applications completely correct is going to be easier.


    Casey Schaufler
    casey@schaufler-ca.com
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  13. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    Quoting Casey Schaufler (casey@schaufler-ca.com):
    >
    > --- Stephen Smalley wrote:
    >
    > >
    > > On Fri, 2008-03-07 at 12:30 -0600, Serge E. Hallyn wrote:
    > > > Quoting Casey Schaufler (casey@schaufler-ca.com):
    > > > >
    > > > > --- "Serge E. Hallyn" wrote:
    > > > >
    > > > > > ...
    > > > > >
    > > > > > Until user namespaces are complete, selinux seems the only good

    > > solution
    > > > > > to offer isolation.
    > > > >
    > > > > Smack does it better and cheaper. (Unless you define good==selinux)
    > > > > (insert smiley)
    > > >
    > > > Ah, thanks - I hadn't looked into it, but yes IIUC smack should
    > > > definately work. I'll have to give that a shot.

    > >
    > > Not if you want to confine uid 0. smack doesn't control capabilities,
    > > even the ones used to override it.
    > >
    > > So you'd have to at least configure your per-process bset and file caps
    > > rather carefully. And even then you have to watch out for things with
    > > CAP_MAC* or CAP_SETPCAP.

    >
    > Shrug. As if getting 800,000 lines of policy definition
    > for a thousand applications completely correct is going to be easier.


    Folks, as I get time I will try with both

    I suspect the CAP_MAC_ADMIN will mean containers won't be able to do any
    policy updates without an update to smack to do a CAP_NS_OVERRIDE type
    of thing. For SELinux I've got my hopes on the userspace policy daemon,
    but first (my next step atm) I need to get the namespace thing going
    (vserver1.root_t etc).

    -serge
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  14. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    On Fri, Mar 07, 2008 at 12:50:52PM -0600, Serge E. Hallyn wrote:
    > Quoting Greg KH (greg@kroah.com):
    > > On Fri, Mar 07, 2008 at 11:35:42AM -0600, Serge E. Hallyn wrote:
    > > > > Do you really want to run other LSMs within a containerd kernel? Is
    > > > > that a requirement? It would seem to run counter to the main goal of
    > > > > containers to me.
    > > >
    > > > Until user namespaces are complete, selinux seems the only good solution
    > > > to offer isolation.

    > >
    > > Great, use that instead

    >
    > That can't work as is since you can't specify major:minor in policy.


    Your LSM can not, or the LSM interface does not allow this to happen?

    > So all we could do again is simply refuse all mknod, which we can
    > already do with per-process capability bounding sets.


    I thought we passed that info down to the LSM module, can't you do your
    selection at that point in time?

    And then, just mediate open() like always, right?

    thanks,

    greg k-h
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  15. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    Quoting Greg KH (greg@kroah.com):
    > On Fri, Mar 07, 2008 at 12:50:52PM -0600, Serge E. Hallyn wrote:
    > > Quoting Greg KH (greg@kroah.com):
    > > > On Fri, Mar 07, 2008 at 11:35:42AM -0600, Serge E. Hallyn wrote:
    > > > > > Do you really want to run other LSMs within a containerd kernel? Is
    > > > > > that a requirement? It would seem to run counter to the main goal of
    > > > > > containers to me.
    > > > >
    > > > > Until user namespaces are complete, selinux seems the only good solution
    > > > > to offer isolation.
    > > >
    > > > Great, use that instead

    > >
    > > That can't work as is since you can't specify major:minor in policy.

    >
    > Your LSM can not, or the LSM interface does not allow this to happen?


    No my lsm in fact does, you just can't do it with selinux policy at the
    moment. I was still responding to your "just use selinux"

    > > So all we could do again is simply refuse all mknod, which we can
    > > already do with per-process capability bounding sets.

    >
    > I thought we passed that info down to the LSM module, can't you do your
    > selection at that point in time?
    >
    > And then, just mediate open() like always, right?


    Yup, the patch I included inline does that.

    An LSM can address the problem. It just felt like more of a
    patch-over-the-real-problem kind of solution.

    thanks,
    -serge
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  16. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    On Sat, Mar 08, 2008 at 03:47:57PM -0600, Serge E. Hallyn wrote:
    > Quoting Greg KH (greg@kroah.com):
    > > On Fri, Mar 07, 2008 at 12:50:52PM -0600, Serge E. Hallyn wrote:
    > > > Quoting Greg KH (greg@kroah.com):
    > > > > On Fri, Mar 07, 2008 at 11:35:42AM -0600, Serge E. Hallyn wrote:
    > > > > > > Do you really want to run other LSMs within a containerd kernel? Is
    > > > > > > that a requirement? It would seem to run counter to the main goal of
    > > > > > > containers to me.
    > > > > >
    > > > > > Until user namespaces are complete, selinux seems the only good solution
    > > > > > to offer isolation.
    > > > >
    > > > > Great, use that instead
    > > >
    > > > That can't work as is since you can't specify major:minor in policy.

    > >
    > > Your LSM can not, or the LSM interface does not allow this to happen?

    >
    > No my lsm in fact does, you just can't do it with selinux policy at the
    > moment. I was still responding to your "just use selinux"


    I never said "use selinux", do you think I am crazy?

    Just use your own lsm, that's all I recommended.

    > > > So all we could do again is simply refuse all mknod, which we can
    > > > already do with per-process capability bounding sets.

    > >
    > > I thought we passed that info down to the LSM module, can't you do your
    > > selection at that point in time?
    > >
    > > And then, just mediate open() like always, right?

    >
    > Yup, the patch I included inline does that.


    Great. But don't put that other file in the core kernel, put it in
    security/ please.

    > An LSM can address the problem. It just felt like more of a
    > patch-over-the-real-problem kind of solution.


    I disagree, it sounds exactly like what LSM is for.

    thanks,

    greg k-h
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  17. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    Quoting Greg KH (greg@kroah.com):
    > On Sat, Mar 08, 2008 at 03:47:57PM -0600, Serge E. Hallyn wrote:
    > > Quoting Greg KH (greg@kroah.com):
    > > > On Fri, Mar 07, 2008 at 12:50:52PM -0600, Serge E. Hallyn wrote:
    > > > > Quoting Greg KH (greg@kroah.com):
    > > > > > On Fri, Mar 07, 2008 at 11:35:42AM -0600, Serge E. Hallyn wrote:
    > > > > > > > Do you really want to run other LSMs within a containerd kernel? Is
    > > > > > > > that a requirement? It would seem to run counter to the main goal of
    > > > > > > > containers to me.
    > > > > > >
    > > > > > > Until user namespaces are complete, selinux seems the only good solution
    > > > > > > to offer isolation.
    > > > > >
    > > > > > Great, use that instead
    > > > >
    > > > > That can't work as is since you can't specify major:minor in policy.
    > > >
    > > > Your LSM can not, or the LSM interface does not allow this to happen?

    > >
    > > No my lsm in fact does, you just can't do it with selinux policy at the
    > > moment. I was still responding to your "just use selinux"

    >
    > I never said "use selinux", do you think I am crazy?
    >
    > Just use your own lsm, that's all I recommended.
    >
    > > > > So all we could do again is simply refuse all mknod, which we can
    > > > > already do with per-process capability bounding sets.
    > > >
    > > > I thought we passed that info down to the LSM module, can't you do your
    > > > selection at that point in time?
    > > >
    > > > And then, just mediate open() like always, right?

    > >
    > > Yup, the patch I included inline does that.

    >
    > Great. But don't put that other file in the core kernel, put it in
    > security/ please.
    >
    > > An LSM can address the problem. It just felt like more of a
    > > patch-over-the-real-problem kind of solution.

    >
    > I disagree, it sounds exactly like what LSM is for.


    Ok, I went ahead and recreated the two files I had lost by not
    git-adding them. I suspect if we were to use this in place of Pavel's
    patch, we'd want to switch the API over to what he was using? I think
    Pavel and Paul Menage had fine-tuned his somewhat... If Pavel doesn't
    gag, maybe we could just use his cgroup code minus the kobject code
    plus the two LSM hooks?

    -serge

    From ff9f9a190d8664275fb86c7d5b2153fd79b990d7 Mon Sep 17 00:00:00 2001
    From: Serge E. Hallyn
    Date: Mon, 10 Mar 2008 13:01:33 -0700
    Subject: [PATCH 1/1] cgroups: implement device whitelist cgroup+lsm

    Implement a cgroup using the LSM interface to enforce open and mknod
    on device files. Not a line of this code is expected to be used in a
    final version, this is just a proof of concept to explore whether we
    can or should use an LSM for this until device namespaces are really
    needed.

    This implements a simple device access whitelist. A whitelist entry
    has 4 fields. 'type' is a (all), c (char), or b (block). 'all' means it
    applies to all types, all major numbers, and all minor numbers. Major and
    minor are obvious. Access is a composition of r (read), w (write), and
    m (mknod).

    The root devcgroup starts with rwm to 'all'. A child devcg gets a copy
    of the parent. Admins can then add and remove devices to the whitelist.
    Once CAP_HOST_ADMIN is introduced it will be needed to add entries as
    well or remove entries from another cgroup, though just CAP_SYS_ADMIN
    will suffice to remove entries for your own group.

    An entry is added by doing "echo " > devcg.allow,
    for instance:

    echo b 7 0 mrw > /cgroups/1/devcg.allow

    An entry is removed by doing likewise into devcg.deny. Since this is a
    pure whitelist, not acls, you can only remove entries which exist in the
    whitelist. You must explicitly

    echo a 0 0 mrw > /cgroups/1/devcg.deny

    to remove the "allow all" entry which is automatically inherited from
    the root cgroup.

    While composing this with the ns_cgroup may seem logical, it may not
    be the right thing to do. Note that each newly created devcg gets
    a copy of the parent whitelist. So if you had done

    mount -t cgroup -o ns,devcg none /cgroups

    then once a process in /cgroup/1 had done an unshare(CLONE_NEWNS)
    it would be under /cgroup/1/node_
    if an admin did

    echo b 7 0 m > /cgroups/1/devcg.deny

    then the entry would still be in the whitelist for /cgroups/1/node_.
    Something to discuss if we get that far before nixing this whole idea.

    CAP_NS_OVERRIDE is defined as a capability needed to tweak device
    access.

    CONFIG_COMMONCAP is defined whenever security/commoncap.c should
    be compiled, so that the decision of whether to show the option
    for FILE_CAPABILITIES can be a bit cleaner.

    Signed-off-by: Serge E. Hallyn
    ---
    include/linux/capability.h | 11 +-
    include/linux/cgroup_subsys.h | 6 +
    include/linux/devcg.h | 53 ++++++
    init/Kconfig | 7 +
    kernel/Makefile | 1 +
    kernel/dev_cgroup.c | 386 +++++++++++++++++++++++++++++++++++++++++
    security/Kconfig | 8 +-
    security/Makefile | 12 +-
    security/dev_cgroup.c | 132 ++++++++++++++
    9 files changed, 606 insertions(+), 10 deletions(-)
    create mode 100644 include/linux/devcg.h
    create mode 100644 kernel/dev_cgroup.c
    create mode 100644 security/dev_cgroup.c

    diff --git a/include/linux/capability.h b/include/linux/capability.h
    index eaab759..f8ecba1 100644
    --- a/include/linux/capability.h
    +++ b/include/linux/capability.h
    @@ -333,7 +333,16 @@ typedef struct kernel_cap_struct {

    #define CAP_MAC_ADMIN 33

    -#define CAP_LAST_CAP CAP_MAC_ADMIN
    +/* Allow acting on resources in another namespace. In particular:
    + * 1. when combined with CAP_MKNOD and dev_cgroup is enabled,
    + * allow creation of devices not in the device whitelist.
    + * 2. whencombined with CAP_SYS_ADMIN and dev_cgroup is enabled,
    + * allow editing device cgroup whitelist
    + */
    +
    +#define CAP_NS_OVERRIDE 34
    +
    +#define CAP_LAST_CAP CAP_NS_OVERRIDE

    #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)

    diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
    index 1ddebfc..01e8034 100644
    --- a/include/linux/cgroup_subsys.h
    +++ b/include/linux/cgroup_subsys.h
    @@ -42,3 +42,9 @@ SUBSYS(mem_cgroup)
    #endif

    /* */
    +
    +#ifdef CONFIG_CGROUP_DEV
    +SUBSYS(devcg)
    +#endif
    +
    +/* */
    diff --git a/include/linux/devcg.h b/include/linux/devcg.h
    new file mode 100644
    index 0000000..764a734
    --- /dev/null
    +++ b/include/linux/devcg.h
    @@ -0,0 +1,53 @@
    +#include
    +#include
    +#include
    +#include
    +#include
    +
    +#include
    +
    +#define ACC_MKNOD 1
    +#define ACC_READ 2
    +#define ACC_WRITE 4
    +
    +#define DEV_BLOCK 1
    +#define DEV_CHAR 2
    +#define DEV_ALL 4 /* this represents all devices */
    +
    +/*
    + * whitelist locking rules:
    + * cgroup_lock() cannot be taken under cgroup->lock.
    + * cgroup->lock can be taken with or without cgroup_lock().
    + *
    + * modifications always require cgroup_lock
    + * modifications to a list which is visible require the
    + * cgroup->lock *and* cgroup_lock()
    + * walking the list requires cgroup->lock or cgroup_lock().
    + *
    + * reasoning: dev_whitelist_copy() needs to kmalloc, so needs
    + * a mutex, which the cgroup_lock() is. Since modifying
    + * a visible list requires both locks, either lock can be
    + * taken for walking the list. Since the wh->spinlock is taken
    + * for modifying a public-accessible list, the spinlock is
    + * sufficient for just walking the list.
    + */
    +
    +struct dev_whitelist_item {
    + u32 major, minor;
    + short type;
    + short access;
    + struct list_head list;
    +};
    +
    +struct dev_cgroup {
    + struct cgroup_subsys_state css;
    + struct list_head whitelist;
    + spinlock_t lock;
    +};
    +
    +static inline struct dev_cgroup *cgroup_to_devcg(
    + struct cgroup *cgroup)
    +{
    + return container_of(cgroup_subsys_state(cgroup, devcg_subsys_id),
    + struct dev_cgroup, css);
    +}
    diff --git a/init/Kconfig b/init/Kconfig
    index 77b2ac8..db302fa 100644
    --- a/init/Kconfig
    +++ b/init/Kconfig
    @@ -298,6 +298,13 @@ config CGROUP_NS
    for instance virtual servers and checkpoint/restart
    jobs.

    +config CGROUP_DEV
    + bool "Device controller for cgroups"
    + depends on CGROUPS && SECURITY && EXPERIMENTAL
    + help
    + Provides a cgroup implementing whitelists for devices which
    + a process in the cgroup can mknod or open.
    +
    config CPUSETS
    bool "Cpuset support"
    depends on SMP && CGROUPS
    diff --git a/kernel/Makefile b/kernel/Makefile
    index 4336031..2e93d27 100644
    --- a/kernel/Makefile
    +++ b/kernel/Makefile
    @@ -42,6 +42,7 @@ obj-$(CONFIG_CGROUPS) += cgroup.o
    obj-$(CONFIG_CGROUP_DEBUG) += cgroup_debug.o
    obj-$(CONFIG_CPUSETS) += cpuset.o
    obj-$(CONFIG_CGROUP_NS) += ns_cgroup.o
    +obj-$(CONFIG_CGROUP_DEV) += dev_cgroup.o
    obj-$(CONFIG_UTS_NS) += utsname.o
    obj-$(CONFIG_USER_NS) += user_namespace.o
    obj-$(CONFIG_PID_NS) += pid_namespace.o
    diff --git a/kernel/dev_cgroup.c b/kernel/dev_cgroup.c
    new file mode 100644
    index 0000000..a0f4371
    --- /dev/null
    +++ b/kernel/dev_cgroup.c
    @@ -0,0 +1,386 @@
    +/*
    + * dev_cgroup.c - device cgroup subsystem
    + *
    + * Copyright 2007 IBM Corp
    + */
    +
    +#include
    +
    +static int devcg_can_attach(struct cgroup_subsys *ss,
    + struct cgroup *new_cgroup, struct task_struct *task)
    +{
    + struct cgroup *orig;
    +
    + if (current != task) {
    + if (!cgroup_is_descendant(new_cgroup))
    + return -EPERM;
    + }
    +
    + if (atomic_read(&new_cgroup->count) != 0)
    + return -EPERM;
    +
    + orig = task_cgroup(task, devcg_subsys_id);
    + if (orig && orig != new_cgroup->parent)
    + return -EPERM;
    +
    + return 0;
    +}
    +
    +/*
    + * called under cgroup_lock()
    + */
    +int dev_whitelist_copy(struct list_head *dest, struct list_head *orig)
    +{
    + struct dev_whitelist_item *wh, *tmp, *new;
    +
    + list_for_each_entry(wh, orig, list) {
    + new = kmalloc(sizeof(*wh), GFP_KERNEL);
    + if (!new)
    + goto free_and_exit;
    + new->major = wh->major;
    + new->minor = wh->minor;
    + new->type = wh->type;
    + new->access = wh->access;
    + list_add_tail(&new->list, dest);
    + }
    +
    + return 0;
    +
    +free_and_exit:
    + list_for_each_entry_safe(wh, tmp, dest, list) {
    + list_del(&wh->list);
    + kfree(wh);
    + }
    + return -ENOMEM;
    +}
    +
    +/* Stupid prototype - don't bother combining existing entries */
    +/*
    + * called under cgroup_lock()
    + * since the list is visible to other tasks, we need the spinlock also
    + */
    +void dev_whitelist_add(struct dev_cgroup *dev_cgroup,
    + struct dev_whitelist_item *wh)
    +{
    + spin_lock(&dev_cgroup->lock);
    + list_add_tail(&wh->list, &dev_cgroup->whitelist);
    + spin_unlock(&dev_cgroup->lock);
    +}
    +
    +/*
    + * called under cgroup_lock()
    + * since the list is visible to other tasks, we need the spinlock also
    + */
    +void dev_whitelist_rm(struct dev_cgroup *dev_cgroup,
    + struct dev_whitelist_item *wh)
    +{
    + struct dev_whitelist_item *walk, *tmp;
    +
    + spin_lock(&dev_cgroup->lock);
    + list_for_each_entry_safe(walk, tmp, &dev_cgroup->whitelist, list) {
    + if (walk->type & DEV_ALL) {
    + list_del(&walk->list);
    + kfree(walk);
    + continue;
    + }
    + if (walk->type != wh->type)
    + continue;
    + if (walk->major != wh->major || walk->minor != wh->minor)
    + continue;
    + walk->access &= ~wh->access;
    + if (!walk->access) {
    + list_del(&walk->list);
    + kfree(walk);
    + }
    + }
    + spin_unlock(&dev_cgroup->lock);
    +}
    +
    +/*
    + * Rules: you can only create a cgroup if
    + * 1. you are capable(CAP_SYS_ADMIN|CAP_NS_OVERRIDE)
    + * 2. the target cgroup is a descendant of your own cgroup
    + *
    + * Note: called from kernel/cgroup.c with cgroup_lock() held.
    + */
    +static struct cgroup_subsys_state *devcg_create(struct cgroup_subsys *ss,
    + struct cgroup *cgroup)
    +{
    + struct dev_cgroup *dev_cgroup, *parent_dev_cgroup;
    + struct cgroup *parent_cgroup;
    + int ret;
    +
    + if (!capable(CAP_SYS_ADMIN) || !capable(CAP_NS_OVERRIDE))
    + return ERR_PTR(-EPERM);
    + if (!cgroup_is_descendant(cgroup))
    + return ERR_PTR(-EPERM);
    +
    + dev_cgroup = kzalloc(sizeof(*dev_cgroup), GFP_KERNEL);
    + if (!dev_cgroup)
    + return ERR_PTR(-ENOMEM);
    + INIT_LIST_HEAD(&dev_cgroup->whitelist);
    + parent_cgroup = cgroup->parent;
    +
    + if (parent_cgroup == NULL) {
    + struct dev_whitelist_item *wh;
    + wh = kmalloc(sizeof(*wh), GFP_KERNEL);
    + wh->minor = wh->major = 0;
    + wh->type = DEV_ALL;
    + wh->access = ACC_MKNOD | ACC_READ | ACC_WRITE;
    + list_add(&wh->list, &dev_cgroup->whitelist);
    + } else {
    + parent_dev_cgroup = cgroup_to_devcg(parent_cgroup);
    + ret = dev_whitelist_copy(&dev_cgroup->whitelist,
    + &parent_dev_cgroup->whitelist);
    + if (ret) {
    + kfree(dev_cgroup);
    + return ERR_PTR(ret);
    + }
    + }
    +
    + spin_lock_init(&dev_cgroup->lock);
    + return &dev_cgroup->css;
    +}
    +
    +static void devcg_destroy(struct cgroup_subsys *ss,
    + struct cgroup *cgroup)
    +{
    + struct dev_cgroup *dev_cgroup;
    + struct dev_whitelist_item *wh, *tmp;
    +
    + dev_cgroup = cgroup_to_devcg(cgroup);
    + list_for_each_entry_safe(wh, tmp, &dev_cgroup->whitelist, list) {
    + list_del(&wh->list);
    + kfree(wh);
    + }
    + kfree(dev_cgroup);
    +}
    +
    +#define DEVCG_ALLOW 1
    +#define DEVCG_DENY 2
    +
    +void set_access(char *acc, short access)
    +{
    + int idx = 0;
    + memset(acc, 0, 4);
    + if (access & ACC_READ)
    + acc[idx++] = 'r';
    + if (access & ACC_WRITE)
    + acc[idx++] = 'w';
    + if (access & ACC_MKNOD)
    + acc[idx++] = 'm';
    +}
    +
    +char type_to_char(short type)
    +{
    + if (type == DEV_ALL)
    + return 'a';
    + if (type == DEV_CHAR)
    + return 'c';
    + if (type == DEV_BLOCK)
    + return 'b';
    + return 'X';
    +}
    +
    +char *print_whitelist(struct dev_cgroup *devcgroup, int *len)
    +{
    + char *buf, *s, acc[4];
    + struct dev_whitelist_item *wh;
    + int ret;
    + int count = 0;
    +
    + buf = kmalloc(4096, GFP_KERNEL);
    + if (!buf)
    + return ERR_PTR(-ENOMEM);
    + s = buf;
    + *s = '\0';
    + *len = 0;
    +
    + spin_lock(&devcgroup->lock);
    + list_for_each_entry(wh, &devcgroup->whitelist, list) {
    + set_access(acc, wh->access);
    + printk(KERN_NOTICE
    + "%s (count%d): whtype %hd maj %u min %u acc %hd\n",
    + __FUNCTION__, count, wh->type, wh->major, wh->minor,
    + wh->access);
    + ret = snprintf(s, 4095-(s-buf), "%c %u %u %s\n",
    + type_to_char(wh->type), wh->major, wh->minor, acc);
    + if (s+ret >= buf+4095) {
    + kfree(buf);
    + buf = ERR_PTR(-ENOMEM);
    + break;
    + }
    + s += ret;
    + *len += ret;
    + count++;
    + }
    + spin_unlock(&devcgroup->lock);
    +
    + return buf;
    +}
    +
    +static ssize_t devcg_access_read(struct cgroup *cgroup,
    + struct cftype *cft, struct file *file,
    + char __user *userbuf, size_t nbytes, loff_t *ppos)
    +{
    + struct dev_cgroup *devcgrp = cgroup_to_devcg(cgroup);
    + int filetype = cft->private;
    + char *buffer;
    + int len, retval;
    +
    + if (filetype != DEVCG_ALLOW)
    + return -EINVAL;
    + buffer = print_whitelist(devcgrp, &len);
    + if (IS_ERR(buffer))
    + return PTR_ERR(buffer);
    +
    + retval = simple_read_from_buffer(userbuf, nbytes, ppos, buffer, len);
    + kfree(buffer);
    + return retval;
    +}
    +
    +static inline short convert_access(char *acc)
    +{
    + short access = 0;
    +
    + while (*acc) {
    + switch (*acc) {
    + case 'r':
    + case 'R': access |= ACC_READ; break;
    + case 'w':
    + case 'W': access |= ACC_WRITE; break;
    + case 'm':
    + case 'M': access |= ACC_MKNOD; break;
    + case '\n': break;
    + default:
    + return -EINVAL;
    + }
    + acc++;
    + }
    +
    + return access;
    +}
    +
    +static inline short convert_type(char intype)
    +{
    + short type = 0;
    + switch (intype) {
    + case 'a': type = DEV_ALL; break;
    + case 'c': type = DEV_CHAR; break;
    + case 'b': type = DEV_BLOCK; break;
    + default: type = -EACCES; break;
    + }
    + return type;
    +}
    +
    +static ssize_t devcg_access_write(struct cgroup *cgroup, struct cftype *cft,
    + struct file *file, const char __user *userbuf,
    + size_t nbytes, loff_t *ppos)
    +{
    + struct cgroup *cur_cgroup;
    + struct dev_cgroup *devcgrp, *cur_devcgroup;
    + int filetype = cft->private;
    + char *buffer, acc[4];
    + int retval = 0;
    + int nitems;
    + char type;
    + struct dev_whitelist_item *wh;
    +
    + if (!capable(CAP_SYS_ADMIN) || !capable(CAP_NS_OVERRIDE))
    + return -EPERM;
    +
    + devcgrp = cgroup_to_devcg(cgroup);
    + cur_cgroup = task_cgroup(current, devcg_subsys.subsys_id);
    + cur_devcgroup = cgroup_to_devcg(cur_cgroup);
    +
    + buffer = kmalloc(nbytes+1, GFP_KERNEL);
    + if (!buffer)
    + return -ENOMEM;
    +
    + wh = kmalloc(sizeof(*wh), GFP_KERNEL);
    + if (!wh) {
    + kfree(buffer);
    + return -ENOMEM;
    + }
    +
    + if (copy_from_user(buffer, userbuf, nbytes)) {
    + retval = -EFAULT;
    + goto out1;
    + }
    + buffer[nbytes] = 0; /* nul-terminate */
    +
    + cgroup_lock();
    + if (cgroup_is_removed(cgroup)) {
    + retval = -ENODEV;
    + goto out2;
    + }
    +
    + memset(wh, 0, sizeof(*wh));
    + memset(acc, 0, 4);
    + nitems = sscanf(buffer, "%c %u %u %3s", &type, &wh->major, &wh->minor,
    + acc);
    + retval = -EINVAL;
    + if (nitems != 4)
    + goto out2;
    + wh->type = convert_type(type);
    + if (wh->type < 0)
    + goto out2;
    + wh->access = convert_access(acc);
    + if (wh->access < 0)
    + goto out2;
    + retval = 0;
    + switch (filetype) {
    + case DEVCG_ALLOW:
    + printk(KERN_NOTICE
    + "%s: add whtype %hd maj %u min %u acc %hd\n",
    + __FUNCTION__, wh->type, wh->major, wh->minor,
    + wh->access);
    + dev_whitelist_add(devcgrp, wh);
    + break;
    + case DEVCG_DENY:
    + dev_whitelist_rm(devcgrp, wh);
    + break;
    + default:
    + retval = -EINVAL;
    + goto out2;
    + }
    +
    + if (retval == 0)
    + retval = nbytes;
    +
    +out2:
    + cgroup_unlock();
    +out1:
    + kfree(buffer);
    + return retval;
    +}
    +
    +static struct cftype dev_cgroup_files[] = {
    + {
    + .name = "allow",
    + .read = devcg_access_read,
    + .write = devcg_access_write,
    + .private = DEVCG_ALLOW,
    + },
    + {
    + .name = "deny",
    + .write = devcg_access_write,
    + .private = DEVCG_DENY,
    + },
    +};
    +
    +static int devcg_populate(struct cgroup_subsys *ss,
    + struct cgroup *cont)
    +{
    + return cgroup_add_files(cont, ss, dev_cgroup_files,
    + ARRAY_SIZE(dev_cgroup_files));
    +}
    +
    +struct cgroup_subsys devcg_subsys = {
    + .name = "devcg",
    + .can_attach = devcg_can_attach,
    + .create = devcg_create,
    + .destroy = devcg_destroy,
    + .populate = devcg_populate,
    + .subsys_id = devcg_subsys_id,
    +};
    diff --git a/security/Kconfig b/security/Kconfig
    index 5dfc206..8082edc 100644
    --- a/security/Kconfig
    +++ b/security/Kconfig
    @@ -75,15 +75,19 @@ config SECURITY_NETWORK_XFRM

    config SECURITY_CAPABILITIES
    bool "Default Linux Capabilities"
    - depends on SECURITY
    + depends on SECURITY && !CGROUP_DEV
    default y
    help
    This enables the "default" Linux capabilities functionality.
    If you are unsure how to answer this question, answer Y.

    +config COMMONCAP
    + bool
    + default !SECURITY || SECURITY_CAPABILITIES || SECURITY_ROOTPLUG || SECURITY_SMACK || CGROUP_DEV
    +
    config SECURITY_FILE_CAPABILITIES
    bool "File POSIX Capabilities (EXPERIMENTAL)"
    - depends on (SECURITY=n || SECURITY_CAPABILITIES!=n) && EXPERIMENTAL
    + depends on COMMONCAP && EXPERIMENTAL
    default n
    help
    This enables filesystem capabilities, allowing you to give
    diff --git a/security/Makefile b/security/Makefile
    index 9e8b025..6093003 100644
    --- a/security/Makefile
    +++ b/security/Makefile
    @@ -6,15 +6,13 @@ obj-$(CONFIG_KEYS) += keys/
    subdir-$(CONFIG_SECURITY_SELINUX) += selinux
    subdir-$(CONFIG_SECURITY_SMACK) += smack

    -# if we don't select a security model, use the default capabilities
    -ifneq ($(CONFIG_SECURITY),y)
    -obj-y += commoncap.o
    -endif
    +obj-$(CONFIG_COMMONCAP) += commoncap.o

    # Object file lists
    obj-$(CONFIG_SECURITY) += security.o dummy.o inode.o
    # Must precede capability.o in order to stack properly.
    obj-$(CONFIG_SECURITY_SELINUX) += selinux/built-in.o
    -obj-$(CONFIG_SECURITY_SMACK) += commoncap.o smack/built-in.o
    -obj-$(CONFIG_SECURITY_CAPABILITIES) += commoncap.o capability.o
    -obj-$(CONFIG_SECURITY_ROOTPLUG) += commoncap.o root_plug.o
    +obj-$(CONFIG_SECURITY_SMACK) += smack/built-in.o
    +obj-$(CONFIG_SECURITY_CAPABILITIES) += capability.o
    +obj-$(CONFIG_SECURITY_ROOTPLUG) += root_plug.o
    +obj-$(CONFIG_CGROUP_DEV) += dev_cgroup.o
    diff --git a/security/dev_cgroup.c b/security/dev_cgroup.c
    new file mode 100644
    index 0000000..2719cdb
    --- /dev/null
    +++ b/security/dev_cgroup.c
    @@ -0,0 +1,132 @@
    +/*
    + * LSM portion of the device cgroup subsystem.
    + *
    + * Copyright 2007 IBM Corp
    + */
    +
    +#include
    +
    +extern struct cgroup_subsys devcg_subsys;
    +
    +static int devcgroup_inode_permission(struct inode *inode, int mask,
    + struct nameidata *nd)
    +{
    + struct cgroup *cgroup;
    + struct dev_cgroup *dev_cgroup;
    + struct dev_whitelist_item *wh;
    +
    + dev_t device = inode->i_rdev;
    + if (!device)
    + return 0;
    + if (!S_ISBLK(inode->i_mode) && !S_ISCHR(inode->i_mode))
    + return 0;
    + cgroup = task_cgroup(current, devcg_subsys.subsys_id);
    + dev_cgroup = cgroup_to_devcg(cgroup);
    + if (!dev_cgroup)
    + return 0;
    +
    + spin_lock(&dev_cgroup->lock);
    + list_for_each_entry(wh, &dev_cgroup->whitelist, list) {
    + if (wh->type & DEV_ALL)
    + goto acc_check;
    + if (S_ISBLK(inode->i_mode) && !(wh->type & DEV_BLOCK))
    + continue;
    + if (S_ISCHR(inode->i_mode) && !(wh->type & DEV_CHAR))
    + continue;
    + if (wh->major != imajor(inode) || wh->minor != iminor(inode))
    + continue;
    +acc_check:
    + if ((mask & MAY_WRITE) && !(wh->access & ACC_WRITE))
    + continue;
    + if ((mask & MAY_READ) && !(wh->access & ACC_READ))
    + continue;
    + spin_unlock(&dev_cgroup->lock);
    + return 0;
    + }
    + spin_unlock(&dev_cgroup->lock);
    +
    + return -EPERM;
    +}
    +
    +static int devcgroup_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev)
    +{
    + struct cgroup *cgroup;
    + struct dev_cgroup *dev_cgroup;
    + struct dev_whitelist_item *wh;
    +
    + cgroup = task_cgroup(current, devcg_subsys.subsys_id);
    + dev_cgroup = cgroup_to_devcg(cgroup);
    + if (!dev_cgroup)
    + return 0;
    +
    + spin_lock(&dev_cgroup->lock);
    + list_for_each_entry(wh, &dev_cgroup->whitelist, list) {
    + if (wh->type & DEV_ALL)
    + goto ok;
    + if (S_ISBLK(mode) && !(wh->type & DEV_BLOCK))
    + continue;
    + if (S_ISCHR(mode) && !(wh->type & DEV_CHAR))
    + continue;
    + if (wh->major != MAJOR(dev) || wh->minor != MINOR(dev))
    + continue;
    + if (wh->access & ACC_MKNOD)
    + goto ok;
    + }
    + spin_unlock(&dev_cgroup->lock);
    +
    + printk(KERN_NOTICE "%s: %d denied %d access to (%d %d)\n", __FUNCTION__,
    + current->pid, mode, MAJOR(dev), MINOR(dev));
    + return -EPERM;
    +
    +ok:
    + spin_unlock(&dev_cgroup->lock);
    + return 0;
    +}
    +
    +static struct security_operations devcgroup_security_ops = {
    + .inode_mknod = devcgroup_inode_mknod,
    + .inode_permission = devcgroup_inode_permission,
    +
    + .ptrace = cap_ptrace,
    + .capget = cap_capget,
    + .capset_check = cap_capset_check,
    + .capset_set = cap_capset_set,
    + .capable = cap_capable,
    + .settime = cap_settime,
    + .netlink_send = cap_netlink_send,
    + .netlink_recv = cap_netlink_recv,
    +
    + .bprm_apply_creds = cap_bprm_apply_creds,
    + .bprm_set_security = cap_bprm_set_security,
    + .bprm_secureexec = cap_bprm_secureexec,
    +
    + .inode_setxattr = cap_inode_setxattr,
    + .inode_removexattr = cap_inode_removexattr,
    + .inode_need_killpriv = cap_inode_need_killpriv,
    + .inode_killpriv = cap_inode_killpriv,
    +
    + .task_kill = cap_task_kill,
    + .task_setscheduler = cap_task_setscheduler,
    + .task_setioprio = cap_task_setioprio,
    + .task_setnice = cap_task_setnice,
    + .task_post_setuid = cap_task_post_setuid,
    + .task_prctl = cap_task_prctl,
    + .task_reparent_to_init = cap_task_reparent_to_init,
    +
    + .syslog = cap_syslog,
    +
    + .vm_enough_memory = cap_vm_enough_memory,
    +};
    +
    +static int __init dev_cgroup_security_init (void)
    +{
    + /* register ourselves with the security framework */
    + if (register_security (&devcgroup_security_ops)) {
    + printk (KERN_INFO "Failure registering device cgroup lsm\n");
    + return -EINVAL;
    + }
    + printk (KERN_INFO "Device cgroup LSM initialized\n");
    + return 0;
    +}
    +
    +security_initcall (dev_cgroup_security_init);
    --
    1.5.1

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  18. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    [snip]

    >> I have basically three objections against LSM.
    >>
    >> 1. LSM is not stackable, so loading this tiny module with devices
    >> access rights will block other security modules;

    >
    > Do you really want to run other LSMs within a containerd kernel? Is
    > that a requirement? It would seem to run counter to the main goal of
    > containers to me.


    _I_ do not want to run LSM with containers, but this was a requirement
    from some people when we decided which approach to chose - LSM of mine.

    Nevertheless, this is not very important, indeed.

    >> 2. Turning CONFIG_SECURITY on immediately causes all the other hooks
    >> to get called. This affects performance on critical paths, like
    >> process creation/destruction, network flow and so on. This impact
    >> is small, but noticeable;

    >
    > The last time this was brought up, it was not noticable, except for some
    > network paths. And even then, the number was lost in the noise from
    > what I saw. I think with a containered machine, you have bigger things
    > to be worried about


    The namespace-based virtualization implies no performance overhead for
    a container, actually. This is not xen or vmware where performance
    loss is OK.

    Besides, I've measured some things - the lat_syscall test for open from
    lmbench test suite and the nptl perf test. Here are the results:

    sec nosec
    open 3.0980s 3.0709s
    nptl 2.7746s 2.7710s

    So we have 0.88% loss in open and ~0.15% with nptl. I know, this is not that
    much, but it is noticeable. Besides, this is only two tests, digging deeper
    may reveal more.

    Let alone the fact that simply turning the CONFIG_SECURITY to 'y' puts +8Kb
    to the vmlinux...

    I think, I finally agree with you and Al Viro, that the kobj mapper is
    not the right place to put the filtering, but taking the above numbers
    into account, can we put the "hooks" into the #else /* CONFIG_SECURITY */
    versions of security_inode_permission/security_file_permission/etc?

    >> 3. With LSM turned on we'll have to "virtualize" it, i.e. make its
    >> work safe in a container. I don't presume to judge how much work
    >> will have to be done in this area, so the result patch would be
    >> even larger and maybe will duplicate functionality, which is currently
    >> in cgroups. OTOH, cgroups already provide the ways to correctly
    >> delegate proper rights to containers.

    >
    > No, your lsm would be your "virtualize" policy. I don't think you would
    > have to do any additional work here, but could be wrong. Would like to
    > see the code to prove it.
    >
    >>> Opening a dev node is not on any "fast path" that you need to be
    >>> concerned about a few extra calls within the kernel.
    >>>
    >>> And, I think in the end your patch would be much smaller and easier to
    >>> understand and review and maintain overall.

    >> Hardly - the largest part of my patch is cgroup manipulations. The part
    >> that makes the char and block layers switch to new map ac check the
    >> permissions is 10-20 lines of new code.
    >>
    >> But with LSM I will still need this API.

    >
    > Yes, but your LSM hooks will be smaller than the code modifications to
    > the map logic
    >
    > Again, I object to this as you are driving a new security policy
    > infrastructure into the device node logic where it does not belong as we
    > already have this functionality in the LSM interface today. Please use
    > that one instead and don't clutter up the kernel with "one-off" security
    > changes like this one.
    >
    > Please try the LSM interface and see what happens. If, after you have
    > created a patch, you still have objections, please post it for review
    > and I will be glad to revisit my opinion at that time.
    >
    > thanks,
    >
    > greg k-h
    >


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  19. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    On Tue, Mar 11, 2008 at 12:57:55PM +0300, Pavel Emelyanov wrote:
    >
    > Besides, I've measured some things - the lat_syscall test for open from
    > lmbench test suite and the nptl perf test. Here are the results:
    >
    > sec nosec
    > open 3.0980s 3.0709s
    > nptl 2.7746s 2.7710s
    >
    > So we have 0.88% loss in open and ~0.15% with nptl. I know, this is not that
    > much, but it is noticeable. Besides, this is only two tests, digging deeper
    > may reveal more.


    I think that is in the noise of sampling if you run that test many more
    times.

    > Let alone the fact that simply turning the CONFIG_SECURITY to 'y' puts +8Kb
    > to the vmlinux...
    >
    > I think, I finally agree with you and Al Viro, that the kobj mapper is
    > not the right place to put the filtering, but taking the above numbers
    > into account, can we put the "hooks" into the #else /* CONFIG_SECURITY */
    > versions of security_inode_permission/security_file_permission/etc?


    Ask the security module interface maintainers about this, not me

    good luck,

    greg k-h
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  20. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    Greg KH wrote:
    > On Tue, Mar 11, 2008 at 12:57:55PM +0300, Pavel Emelyanov wrote:
    >> Besides, I've measured some things - the lat_syscall test for open from
    >> lmbench test suite and the nptl perf test. Here are the results:
    >>
    >> sec nosec
    >> open 3.0980s 3.0709s
    >> nptl 2.7746s 2.7710s
    >>
    >> So we have 0.88% loss in open and ~0.15% with nptl. I know, this is not that
    >> much, but it is noticeable. Besides, this is only two tests, digging deeper
    >> may reveal more.

    >
    > I think that is in the noise of sampling if you run that test many more
    > times.


    These numbers are average values of 20 runs of each test. I didn't
    provide the measurement accuracy, but the abs(open.sec - open.nosec)
    is greater than it.

    >> Let alone the fact that simply turning the CONFIG_SECURITY to 'y' puts +8Kb
    >> to the vmlinux...
    >>
    >> I think, I finally agree with you and Al Viro, that the kobj mapper is
    >> not the right place to put the filtering, but taking the above numbers
    >> into account, can we put the "hooks" into the #else /* CONFIG_SECURITY */
    >> versions of security_inode_permission/security_file_permission/etc?

    >
    > Ask the security module interface maintainers about this, not me


    OK Thanks for your time, Greg.

    So, Serge, since you already have a LSM-based version, maybe you can
    change it with the proposed "fix" and send it to LSM maintainers for
    review?

    > good luck,
    >
    > greg k-h
    >


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

+ Reply to Thread
Page 2 of 3 FirstFirst 1 2 3 LastLast