[PATCH 0/9] Devices accessibility control group (v4) - Kernel

This is a discussion on [PATCH 0/9] Devices accessibility control group (v4) - Kernel ; Changes from v3: * Ported on 2.6.25-rc3-mm1; * Re-splitted into smaller pieces; * Added more comments to tricky places. This controller allows to tune the devices accessibility by tasks, i.e. grant full access for /dev/null, /dev/zero etc, grant read-only access ...

+ Reply to Thread
Page 1 of 3 1 2 3 LastLast
Results 1 to 20 of 47

Thread: [PATCH 0/9] Devices accessibility control group (v4)

  1. [PATCH 0/9] Devices accessibility control group (v4)

    Changes from v3:
    * Ported on 2.6.25-rc3-mm1;
    * Re-splitted into smaller pieces;
    * Added more comments to tricky places.

    This controller allows to tune the devices accessibility by tasks,
    i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
    access to IDE devices and completely hide SCSI disks.

    Tasks still can call mknod to create device files, regardless of
    whether the particular device is visible or accessible, but they
    may not be able to open it later.

    This one hides under CONFIG_CGROUP_DEVS option.

    To play with it - run a standard procedure:

    # mount -t container none /cont/devs -o devices
    # mkdir /cont/devs/0
    # echo -n $$ > /cont/devs/0/tasks

    and tune device permissions.

    The only configuration file called devices.permissions accepts
    strings like '[cb] |*) [r-][w-]' to provide read,
    write or read-write access to a particular device. Asterisk as the
    minor means "all devices with a given major".

    This will be described in Documentation/controllers/devices.txt
    file in more details.

    Here are some historical notes.

    The third version was here:
    http://openvz.org/pipermail/devel/20...ry/010832.html
    Changes from v2:
    * Fixed problems pointed out by Sukadev with permissions
    revoke. Now we have to perform kobject re-lookup on
    each char device open, just like for block ones, so I
    think this is OK.

    The second version was here:
    http://openvz.org/pipermail/devel/20...ry/010160.html
    Changes from v1:
    * Added the block devices support It turned out to
    be a bit simpler than the char one (or I missed
    something significant);
    * Now we can enable/disable not just individual devices,
    but the whole major with all its minors (see the TODO
    list beyond as well);
    * Added the ability to restrict the read/write permissions
    to devices, not just visible/invisible state.

    The first version was here:
    http://openvz.org/pipermail/devel/20...er/007647.html

    Thanks,
    Pavel
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. [PATCH 3/9] Add a mode on the struct probe

    The idea of the set is to make the dev_t to struct kobj map
    be per-cgroup, and provide a permissions for this mapping.
    This mode mask will operate with FMODE_xxx modes.

    To achieve this, I add the mode_t mode on the struct probe,
    which sets this mapping, and move the locked part of the
    kobj_map() into a separate function, which also accepts the
    mode to be set on the probe.

    By default all the modes are provided for a map.

    Signed-off-by: Pavel Emelyanov

    ---
    drivers/base/map.c | 31 ++++++++++++++++++++++++++-----
    1 files changed, 26 insertions(+), 5 deletions(-)

    diff --git a/drivers/base/map.c b/drivers/base/map.c
    index 7de565f..c7e28a1 100644
    --- a/drivers/base/map.c
    +++ b/drivers/base/map.c
    @@ -15,6 +15,7 @@
    #include
    #include
    #include
    +#include

    #define KOBJ_MAP_PROBES 255

    @@ -22,6 +23,7 @@ struct kobj_map {
    struct probe {
    struct probe *next;
    dev_t dev;
    + mode_t mode;
    unsigned long range;
    struct module *owner;
    kobj_probe_t *get;
    @@ -31,9 +33,9 @@ struct kobj_map {
    struct mutex *lock;
    };

    -int kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range,
    - struct module *module, kobj_probe_t *probe,
    - int (*lock)(dev_t, void *), void *data)
    +static int __kobj_map(struct kobj_map *domain, dev_t dev, mode_t mode,
    + unsigned long range, struct module *module,
    + kobj_probe_t *probe, int (*lock)(dev_t, void *), void *data)
    {
    unsigned n = MAJOR(dev + range - 1) - MAJOR(dev) + 1;
    unsigned index = MAJOR(dev);
    @@ -55,8 +57,15 @@ int kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range,
    p->dev = dev;
    p->range = range;
    p->data = data;
    + /*
    + * When opening a device we want to (dis)allow only
    + * read or write. But sys_open() can provide more
    + * modes like lseek or pread. So, these FMODE-s are
    + * always 'on'.
    + */
    + p->mode = mode | FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
    }
    - mutex_lock(domain->lock);
    +
    for (i = 0, p -= n; i < n; i++, p++, index++) {
    struct probe **s = &domain->probes[index % KOBJ_MAP_PROBES];
    while (*s && (*s)->range < range)
    @@ -64,10 +73,22 @@ int kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range,
    p->next = *s;
    *s = p;
    }
    - mutex_unlock(domain->lock);
    return 0;
    }

    +int kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range,
    + struct module *module, kobj_probe_t *probe,
    + int (*lock)(dev_t, void *), void *data)
    +{
    + int err;
    +
    + mutex_lock(domain->lock);
    + err = __kobj_map(domain, dev, FMODE_READ | FMODE_WRITE, range,
    + module, probe, lock, data);
    + mutex_unlock(domain->lock);
    + return err;
    +}
    +
    void kobj_unmap(struct kobj_map *domain, dev_t dev, unsigned long range)
    {
    unsigned n = MAJOR(dev + range - 1) - MAJOR(dev) + 1;
    --
    1.5.3.4

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  3. [PATCH 2/9] Cleanup the get_gendisk() a bit

    The current implementation first converts kobj to device and
    then checks for kobj not to be NULL. This is OK, since the
    converting macros is a container_of one, but this does not
    look very nice.

    Besides, I'll have to add some code _before_ the kobj_lookup()
    and thus will have to move the call to kobj_lookup lower.

    So do this job here to make the further patching cleaner.

    Signed-off-by: Pavel Emelyanov

    ---
    block/genhd.c | 11 ++++++++---
    1 files changed, 8 insertions(+), 3 deletions(-)

    diff --git a/block/genhd.c b/block/genhd.c
    index e8cf05a..9a7a903 100644
    --- a/block/genhd.c
    +++ b/block/genhd.c
    @@ -212,10 +212,15 @@ void unlink_gendisk(struct gendisk *disk)
    */
    struct gendisk *get_gendisk(dev_t devt, int *part)
    {
    - struct kobject *kobj = kobj_lookup(bdev_map, devt, part);
    - struct device *dev = kobj_to_dev(kobj);
    + struct kobject *kobj;
    + struct device *dev;
    +
    + kobj = kobj_lookup(bdev_map, devt, part);
    + if (kobj == NULL)
    + return NULL;

    - return kobj ? dev_to_disk(dev) : NULL;
    + dev = kobj_to_dev(kobj);
    + return dev_to_disk(dev);
    }

    /*
    --
    1.5.3.4

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  4. [PATCH 9/9] Devices accessibility control group itself

    Finally, here's the control group, which makes full use of the
    interfaces, declared in the previous patches.

    Signed-off-by: Pavel Emelyanov

    ---
    Documentation/controllers/devices.txt | 61 +++++++
    fs/Makefile | 2 +
    fs/devscontrol.c | 314 +++++++++++++++++++++++++++++++++
    include/linux/cgroup_subsys.h | 6 +
    include/linux/devscontrol.h | 14 ++
    init/Kconfig | 13 ++
    6 files changed, 410 insertions(+), 0 deletions(-)
    create mode 100644 Documentation/controllers/devices.txt
    create mode 100644 fs/devscontrol.c

    diff --git a/Documentation/controllers/devices.txt b/Documentation/controllers/devices.txt
    new file mode 100644
    index 0000000..630586c
    --- /dev/null
    +++ b/Documentation/controllers/devices.txt
    @@ -0,0 +1,61 @@
    +
    + Devices visibility controller
    +
    +This controller allows to tune the devices accessibility by tasks,
    +i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
    +access to IDE devices and completely hide SCSI disks.
    +
    +Tasks still can call mknod to create device files, regardless of
    +whether the particular device is visible or accessible, but they
    +may not be able to open it later.
    +
    +This one hides under CONFIG_CGROUP_DEVS option.
    +
    +
    +Configuring
    +
    +The controller provides a single file to configure itself -- the
    +devices.permissions one. To change the accessibility level for some
    +device write the following string into it:
    +
    +[cb] |*) [r-][w-]
    + ^ ^ ^
    + | | |
    + | | +--- access rights (1)
    + | |
    + | +-- device major and minor numbers (2)
    + |
    + +-- device type (character / block)
    +
    +1) The access rights set to '--' remove the device from the group's
    +access list, so that it will not even be shown in this file later.
    +
    +2) Setting the minor to '*' grants access to all the minors for
    +particular major.
    +
    +When reading from it, one may see something like
    +
    + c 1:5 rw
    + b 8:* r-
    +
    +Security issues, concerning who may grant access to what are governed
    +at the cgroup infrastructure level.
    +
    +
    +Examples:
    +
    +1. Grant full access to /dev/null
    + # echo 'c 1:3 rw' > /cgroups//devices.permissions
    +
    +2. Grant the read-only access to /dev/sda and partitions
    + # echo 'b 8:* r-' > ...
    +
    +3. Change the /dev/null access to write-only
    + # echo 'c 1:3 -w' > ...
    +
    +4. Revoke access to /dev/sda
    + # echo 'b 8:* --' > ...
    +
    +
    + Written by Pavel Emelyanov
    +
    diff --git a/fs/Makefile b/fs/Makefile
    index 7996220..5ad03be 100644
    --- a/fs/Makefile
    +++ b/fs/Makefile
    @@ -64,6 +64,8 @@ obj-y += devpts/

    obj-$(CONFIG_PROFILING) += dcookies.o
    obj-$(CONFIG_DLM) += dlm/
    +
    +obj-$(CONFIG_CGROUP_DEVS) += devscontrol.o

    # Do not add any filesystems before this line
    obj-$(CONFIG_REISERFS_FS) += reiserfs/
    diff --git a/fs/devscontrol.c b/fs/devscontrol.c
    new file mode 100644
    index 0000000..4a664cf
    --- /dev/null
    +++ b/fs/devscontrol.c
    @@ -0,0 +1,314 @@
    +/*
    + * devscontrol.c - Device Controller
    + *
    + * Copyright 2008 OpenVZ Parallels Inc
    + * Author: Pavel Emelyanov
    + *
    + * This program is free software; you can redistribute it and/or modify
    + * it under the terms of the GNU General Public License as published by
    + * the Free Software Foundation; either version 2 of the License, or
    + * (at your option) any later version.
    + *
    + * This program is distributed in the hope that it will be useful,
    + * but WITHOUT ANY WARRANTY; without even the implied warranty of
    + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    + * GNU General Public License for more details.
    + */
    +
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +
    +struct devs_cgroup {
    + /*
    + * The subsys state to build into cgroups infrastructure
    + */
    + struct cgroup_subsys_state css;
    +
    + /*
    + * The maps of character and block devices. They provide a
    + * map from dev_t-s to struct cdev/gendisk. See fs/char_dev.c
    + * and block/genhd.c to find out how the ->open() callbacks
    + * work when opening a device.
    + *
    + * Each group will have its own maps, and at the open()
    + * time code will lookup in this map to get the device
    + * and permissions by its dev_t.
    + */
    + struct kobj_map *cdev_map;
    + struct kobj_map *bdev_map;
    +};
    +
    +static inline
    +struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css)
    +{
    + return container_of(css, struct devs_cgroup, css);
    +}
    +
    +static inline
    +struct devs_cgroup *cgroup_to_devs(struct cgroup *cont)
    +{
    + return css_to_devs(cgroup_subsys_state(cont, devs_subsys_id));
    +}
    +
    +struct kobj_map *task_cdev_map(struct task_struct *tsk)
    +{
    + struct cgroup_subsys_state *css;
    +
    + css = task_subsys_state(tsk, devs_subsys_id);
    + if (css->cgroup->parent == NULL)
    + return NULL;
    + else
    + return css_to_devs(css)->cdev_map;
    +}
    +
    +struct kobj_map *task_bdev_map(struct task_struct *tsk)
    +{
    + struct cgroup_subsys_state *css;
    +
    + css = task_subsys_state(tsk, devs_subsys_id);
    + if (css->cgroup->parent == NULL)
    + return NULL;
    + else
    + return css_to_devs(css)->bdev_map;
    +}
    +
    +static struct cgroup_subsys_state *
    +devs_create(struct cgroup_subsys *ss, struct cgroup *cont)
    +{
    + struct devs_cgroup *devs;
    +
    + devs = kzalloc(sizeof(struct devs_cgroup), GFP_KERNEL);
    + if (devs == NULL)
    + goto out;
    +
    + devs->cdev_map = cdev_map_init();
    + if (devs->cdev_map == NULL)
    + goto out_free;
    +
    + devs->bdev_map = bdev_map_init();
    + if (devs->bdev_map == NULL)
    + goto out_free_cdev;
    +
    + return &devs->css;
    +
    +out_free_cdev:
    + cdev_map_fini(devs->cdev_map);
    +out_free:
    + kfree(devs);
    +out:
    + return ERR_PTR(-ENOMEM);
    +}
    +
    +static void devs_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
    +{
    + struct devs_cgroup *devs;
    +
    + devs = cgroup_to_devs(cont);
    + bdev_map_fini(devs->bdev_map);
    + cdev_map_fini(devs->cdev_map);
    + kfree(devs);
    +}
    +
    +/*
    + * The devices.permissions file read/write functionality
    + *
    + * The following two routines parse and print the strings like
    + * [cb] |*) [r-][w-]
    + */
    +
    +static int decode_perms_str(char *buf, int *chrdev, dev_t *dev,
    + int *all, mode_t *mode)
    +{
    + unsigned int major, minor;
    + char *end;
    + mode_t tmp;
    +
    + if (buf[0] == 'c')
    + *chrdev = 1;
    + else if (buf[0] == 'b')
    + *chrdev = 0;
    + else
    + return -EINVAL;
    +
    + if (buf[1] != ' ')
    + return -EINVAL;
    +
    + major = simple_strtoul(buf + 2, &end, 10);
    + if (*end != ':')
    + return -EINVAL;
    +
    + if (end[1] == '*') {
    + if (end[2] != ' ')
    + return -EINVAL;
    +
    + *all = 1;
    + minor = 0;
    + end += 2;
    + } else {
    + minor = simple_strtoul(end + 1, &end, 10);
    + if (*end != ' ')
    + return -EINVAL;
    +
    + *all = 0;
    + }
    +
    + tmp = 0;
    +
    + if (end[1] == 'r')
    + tmp |= FMODE_READ;
    + else if (end[1] != '-')
    + return -EINVAL;
    + if (end[2] == 'w')
    + tmp |= FMODE_WRITE;
    + else if (end[2] != '-')
    + return -EINVAL;
    +
    + *dev = MKDEV(major, minor);
    + *mode = tmp;
    + return 0;
    +}
    +
    +static int encode_perms_str(char *buf, int len, int chrdev, dev_t dev,
    + int all, mode_t mode)
    +{
    + int ret;
    +
    + ret = snprintf(buf, len, "%c %d:", chrdev ? 'c' : 'b', MAJOR(dev));
    + if (all)
    + ret += snprintf(buf + ret, len - ret, "*");
    + else
    + ret += snprintf(buf + ret, len - ret, "%d", MINOR(dev));
    +
    + ret += snprintf(buf + ret, len - ret, " %c%c\n",
    + (mode & FMODE_READ) ? 'r' : '-',
    + (mode & FMODE_WRITE) ? 'w' : '-');
    +
    + return ret;
    +}
    +
    +static ssize_t devs_write(struct cgroup *cont, struct cftype *cft,
    + struct file *f, const char __user *ubuf,
    + size_t nbytes, loff_t *pos)
    +{
    + int err, all, chrdev;
    + dev_t dev;
    + char buf[64];
    + struct devs_cgroup *devs;
    + mode_t mode;
    +
    + if (copy_from_user(buf, ubuf, sizeof(buf)))
    + return -EFAULT;
    +
    + buf[sizeof(buf) - 1] = 0;
    + err = decode_perms_str(buf, &chrdev, &dev, &all, &mode);
    + if (err < 0)
    + return err;
    +
    + devs = cgroup_to_devs(cont);
    +
    + /*
    + * No locking here is required - all that we need
    + * is provided inside the kobject mapping code
    + */
    +
    + if (mode == 0) {
    + if (chrdev)
    + err = cdev_del_from_map(devs->cdev_map, dev, all);
    + else
    + err = bdev_del_from_map(devs->bdev_map, dev, all);
    +
    + if (err < 0)
    + return err;
    +
    + css_put(&devs->css);
    + } else {
    + if (chrdev)
    + err = cdev_add_to_map(devs->cdev_map, dev, all, mode);
    + else
    + err = bdev_add_to_map(devs->bdev_map, dev, all, mode);
    +
    + if (err < 0)
    + return err;
    +
    + css_get(&devs->css);
    + }
    +
    + return nbytes;
    +}
    +
    +struct devs_dump_arg {
    + char *buf;
    + int pos;
    + int chrdev;
    +};
    +
    +static int devs_dump_one(dev_t dev, int range, mode_t mode, void *x)
    +{
    + struct devs_dump_arg *arg = x;
    + char tmp[64];
    + int len;
    +
    + len = encode_perms_str(tmp, sizeof(tmp), arg->chrdev, dev,
    + range != 1, mode);
    +
    + if (arg->pos >= PAGE_SIZE - len)
    + return 1;
    +
    + memcpy(arg->buf + arg->pos, tmp, len);
    + arg->pos += len;
    + return 0;
    +}
    +
    +static ssize_t devs_read(struct cgroup *cont, struct cftype *cft,
    + struct file *f, char __user *ubuf, size_t nbytes, loff_t *pos)
    +{
    + struct devs_dump_arg arg;
    + struct devs_cgroup *devs;
    + ssize_t ret;
    +
    + arg.buf = (char *)__get_free_page(GFP_KERNEL);
    + if (arg.buf == NULL)
    + return -ENOMEM;
    +
    + devs = cgroup_to_devs(cont);
    + arg.pos = 0;
    +
    + arg.chrdev = 1;
    + cdev_iterate_map(devs->cdev_map, devs_dump_one, &arg);
    +
    + arg.chrdev = 0;
    + bdev_iterate_map(devs->bdev_map, devs_dump_one, &arg);
    +
    + ret = simple_read_from_buffer(ubuf, nbytes, pos,
    + arg.buf, arg.pos);
    +
    + free_page((unsigned long)arg.buf);
    + return ret;
    +}
    +
    +static struct cftype devs_files[] = {
    + {
    + .name = "permissions",
    + .write = devs_write,
    + .read = devs_read,
    + },
    +};
    +
    +static int devs_populate(struct cgroup_subsys *ss, struct cgroup *cont)
    +{
    + return cgroup_add_files(cont, ss,
    + devs_files, ARRAY_SIZE(devs_files));
    +}
    +
    +struct cgroup_subsys devs_subsys = {
    + .name = "devices",
    + .subsys_id = devs_subsys_id,
    + .create = devs_create,
    + .destroy = devs_destroy,
    + .populate = devs_populate,
    +};
    diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
    index 1ddebfc..212c735 100644
    --- a/include/linux/cgroup_subsys.h
    +++ b/include/linux/cgroup_subsys.h
    @@ -42,3 +42,9 @@ SUBSYS(mem_cgroup)
    #endif

    /* */
    +
    +#ifdef CONFIG_CGROUP_DEVS
    +SUBSYS(devs)
    +#endif
    +
    +/* */
    diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h
    index 04c168b..ae5f801 100644
    --- a/include/linux/devscontrol.h
    +++ b/include/linux/devscontrol.h
    @@ -1,5 +1,18 @@
    #ifndef __DEVS_CONTROL_H__
    #define __DEVS_CONTROL_H__
    +struct kobj_map;
    +struct task_struct;
    +
    +/*
    + * task_[cb]dev_map - get a map from task. Both calls may return
    + * NULL, to indicate, that task doesn't belong to any group and
    + * that the global map is to be used.
    + */
    +
    +#ifdef CONFIG_CGROUP_DEVS
    +struct kobj_map *task_cdev_map(struct task_struct *);
    +struct kobj_map *task_bdev_map(struct task_struct *);
    +#else
    static inline struct kobj_map *task_cdev_map(struct task_struct *tsk)
    {
    return NULL;
    @@ -10,3 +23,4 @@ static inline struct kobj_map *task_bdev_map(struct task_struct *tsk)
    return NULL;
    }
    #endif
    +#endif
    diff --git a/init/Kconfig b/init/Kconfig
    index 77b2ac8..1b7e671 100644
    --- a/init/Kconfig
    +++ b/init/Kconfig
    @@ -289,6 +289,19 @@ config CGROUP_DEBUG

    Say N if unsure

    +config CGROUP_DEVS
    + bool "Devices cgroup subsystem"
    + depends on CGROUPS
    + help
    + Controlls the access rights to devices, i.e. you may hide
    + some of them from tasks, so that they will not be able
    + to open them, or you may grant a read-only access to some
    + of them.
    +
    + See Documentation/controllers/devices.txt for details.
    +
    + This is harmless to say N here, so do it if unsure.
    +
    config CGROUP_NS
    bool "Namespace cgroup subsystem"
    depends on CGROUPS
    --
    1.5.3.4

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  5. [PATCH 6/9] Extend the drivers/base/map.c functionality

    This includes the following functions:

    1. kobj_remap - this one will change the mapping's permissions
    or add a new one if required.

    2. kobj_map_iterate will walk the map, calling the callback
    on each element.

    3. kobj_map_fini is a rollback for kobj_map_finid - cleans all
    the mappings.

    Signed-off-by: Pavel Emelyanov

    ---
    drivers/base/map.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++
    include/linux/kobj_map.h | 6 +++
    2 files changed, 90 insertions(+), 0 deletions(-)

    diff --git a/drivers/base/map.c b/drivers/base/map.c
    index 285a2d2..8e2e1c2 100644
    --- a/drivers/base/map.c
    +++ b/drivers/base/map.c
    @@ -181,3 +181,87 @@ struct kobj_map *kobj_map_init(kobj_probe_t *base_probe, struct mutex *lock)
    p->lock = lock;
    return p;
    }
    +
    +#ifdef CONFIG_CGROUP_DEVS
    +int kobj_remap(struct kobj_map *domain, dev_t dev, mode_t mode,
    + unsigned long range, struct module *module,
    + kobj_probe_t *probe, int (*lock)(dev_t, void *), void *data)
    +{
    + unsigned n = MAJOR(dev + range - 1) - MAJOR(dev) + 1;
    + unsigned index = MAJOR(dev);
    + unsigned i;
    + int err = -ESRCH;
    +
    + if (n > KOBJ_MAP_PROBES)
    + n = KOBJ_MAP_PROBES;
    +
    + mutex_lock(domain->lock);
    + for (i = 0; i < n; i++, index++) {
    + struct probe **s;
    + for (s = &domain->probes[index % KOBJ_MAP_PROBES];
    + *s; s = &(*s)->next) {
    + struct probe *p = *s;
    + if (p->dev == dev) {
    + p->mode = mode | FMODE_LSEEK |
    + FMODE_PREAD | FMODE_PWRITE;
    + err = 0;
    + break;
    + }
    + }
    + }
    +
    + if (err)
    + err = __kobj_map(domain, dev, mode, range, module,
    + probe, lock, data);
    + mutex_unlock(domain->lock);
    + return err;
    +}
    +
    +void kobj_map_iterate(struct kobj_map *domain,
    + int (*fn)(dev_t, int, mode_t, void *), void *arg)
    +{
    + int i;
    + struct probe *p;
    + dev_t skip = MKDEV(0, 0);
    +
    + mutex_lock(domain->lock);
    + for (i = 0; i < KOBJ_MAP_PROBES; i++) {
    + p = domain->probes[i];
    + while (p != NULL) {
    + if (p->dev == skip)
    + goto next;
    + /*
    + * this 'device' is special - see the kobj_map_init why
    + */
    + if (p->dev == 1)
    + goto next;
    +
    + skip = p->dev;
    + if (fn(p->dev, p->range, p->mode, arg))
    + goto done;
    +next:
    + p = p->next;
    + }
    + }
    +done:
    + mutex_unlock(domain->lock);
    +}
    +
    +void kobj_map_fini(struct kobj_map *map)
    +{
    + int i;
    + struct probe *p, *next;
    +
    + for (i = 0; i < KOBJ_MAP_PROBES; i++) {
    + p = map->probes[i];
    + while (p->next != NULL) {
    + next = p->next;
    + kfree(p);
    + p = next;
    + }
    + }
    +
    + kfree(p);
    + kfree(map);
    +}
    +#endif
    diff --git a/include/linux/kobj_map.h b/include/linux/kobj_map.h
    index 867f307..35a670d 100644
    --- a/include/linux/kobj_map.h
    +++ b/include/linux/kobj_map.h
    @@ -11,4 +11,10 @@ void kobj_unmap(struct kobj_map *, dev_t, unsigned long);
    struct kobject *kobj_lookup(struct kobj_map *, dev_t, mode_t *, int *);
    struct kobj_map *kobj_map_init(kobj_probe_t *, struct mutex *);

    +void kobj_map_iterate(struct kobj_map *, int (*fn)(dev_t, int, mode_t, void *),
    + void *);
    +int kobj_remap(struct kobj_map *, dev_t, mode_t, unsigned long, struct module *,
    + kobj_probe_t *, int (*)(dev_t, void *), void *);
    +void kobj_map_fini(struct kobj_map *);
    +
    #endif
    --
    1.5.3.4

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  6. [PATCH 7/9] Provide functions to manipulate char device mappings

    The character devices currently have a global map for
    devices and its own locks and callbacks for devices
    probing. So here are the functions that provide the way
    to play with character devices maps.

    They are cdev_map_init/cdev_map_fini to create/destroy
    one, cdev_add_to_map/cdev_del_from_map to map/unmap devices,
    and cdev_iterate_map for uniformity.

    Signed-off-by: Pavel Emelyanov

    ---
    fs/char_dev.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
    include/linux/cdev.h | 7 +++++
    2 files changed, 70 insertions(+), 0 deletions(-)

    diff --git a/fs/char_dev.c b/fs/char_dev.c
    index d042446..7a4c1bd 100644
    --- a/fs/char_dev.c
    +++ b/fs/char_dev.c
    @@ -476,6 +476,69 @@ int cdev_add(struct cdev *p, dev_t dev, unsigned count)
    return kobj_map(cdev_map, dev, count, NULL, exact_match, exact_lock, p);
    }

    +#ifdef CONFIG_CGROUP_DEVS
    +int cdev_add_to_map(struct kobj_map *map, dev_t dev, int all, mode_t mode)
    +{
    + int tmp;
    + struct kobject *k;
    + struct cdev *c;
    +
    + k = kobj_lookup(cdev_map, dev, NULL, &tmp);
    + if (k == NULL)
    + return -ENODEV;
    +
    + c = container_of(k, struct cdev, kobj);
    + tmp = kobj_remap(map, dev, mode, all ? MINORMASK : 1, NULL,
    + exact_match, exact_lock, c);
    + if (tmp < 0) {
    + cdev_put(c);
    + return tmp;
    + }
    +
    + return 0;
    +}
    +
    +int cdev_del_from_map(struct kobj_map *map, dev_t dev, int all)
    +{
    + int tmp;
    + struct kobject *k;
    + struct cdev *c;
    +
    + k = kobj_lookup(map, dev, NULL, &tmp);
    + if (k == NULL)
    + return -ENODEV;
    +
    + c = container_of(k, struct cdev, kobj);
    + kobj_unmap(map, dev, all ? MINORMASK : 1);
    +
    + /*
    + * one put for the kobj_lookup above and one for
    + * the kobj_lookup in cdev_add_to_map
    + */
    + cdev_put(c);
    + cdev_put(c);
    + return 0;
    +}
    +
    +void cdev_iterate_map(struct kobj_map *map,
    + int (*fn)(dev_t, int, mode_t, void *), void *x)
    +{
    + kobj_map_iterate(map, fn, x);
    +}
    +
    +static struct kobject *base_probe(dev_t dev, int *part, void *data);
    +
    +struct kobj_map *cdev_map_init(void)
    +{
    + return kobj_map_init(base_probe, &chrdevs_lock);
    +}
    +
    +void cdev_map_fini(struct kobj_map *map)
    +{
    + kobj_map_fini(map);
    +}
    +#endif
    +
    static void cdev_unmap(dev_t dev, unsigned count)
    {
    kobj_unmap(cdev_map, dev, count);
    diff --git a/include/linux/cdev.h b/include/linux/cdev.h
    index 1e29b13..14d377b 100644
    --- a/include/linux/cdev.h
    +++ b/include/linux/cdev.h
    @@ -33,5 +33,12 @@ void cd_forget(struct inode *);

    extern struct backing_dev_info directly_mappable_cdev_bdi;

    +struct kobj_map;
    +int cdev_add_to_map(struct kobj_map *map, dev_t dev, int all, mode_t mode);
    +int cdev_del_from_map(struct kobj_map *map, dev_t dev, int all);
    +struct kobj_map *cdev_map_init(void);
    +void cdev_map_fini(struct kobj_map *map);
    +void cdev_iterate_map(struct kobj_map *,
    + int (*fn)(dev_t, int, mode_t, void *), void *);
    #endif
    #endif
    --
    1.5.3.4

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  7. [PATCH 8/9] Provide functions to manipulate block device mappings

    This routines were already done for char devices, these ones
    are for block devices. Everything is the same, but the locks,
    probe and match callbacks and kobj-to-device conversions.

    Signed-off-by: Pavel Emelyanov

    ---
    block/genhd.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++
    include/linux/genhd.h | 8 ++++++
    2 files changed, 75 insertions(+), 0 deletions(-)

    diff --git a/block/genhd.c b/block/genhd.c
    index a619158..f407f62 100644
    --- a/block/genhd.c
    +++ b/block/genhd.c
    @@ -204,6 +204,73 @@ void unlink_gendisk(struct gendisk *disk)
    disk->minors);
    }

    +#ifdef CONFIG_CGROUP_DEVS
    +int bdev_add_to_map(struct kobj_map *map, dev_t dev, int all, mode_t mode)
    +{
    + int tmp;
    + struct kobject *kobj;
    + struct device *d;
    + struct gendisk *disk;
    +
    + kobj = kobj_lookup(bdev_map, dev, NULL, &tmp);
    + if (kobj == NULL)
    + return -ENODEV;
    +
    + d = kobj_to_dev(kobj);
    + disk = dev_to_disk(d);
    + tmp = kobj_remap(map, dev, mode, all ? MINORBITS : 1, NULL,
    + exact_match, exact_lock, disk);
    + if (tmp < 0) {
    + put_disk(disk);
    + return tmp;
    + }
    +
    + return 0;
    +}
    +
    +int bdev_del_from_map(struct kobj_map *map, dev_t dev, int all)
    +{
    + int tmp;
    + struct kobject *kobj;
    + struct device *d;
    + struct gendisk *disk;
    +
    + kobj = kobj_lookup(map, dev, NULL, &tmp);
    + if (kobj == NULL)
    + return -ENODEV;
    +
    + d = kobj_to_dev(kobj);
    + disk = dev_to_disk(d);
    + kobj_unmap(map, dev, all ? MINORBITS : 1);
    +
    + /*
    + * one put for the kobj_lookup above and one for
    + * the kobj_lookup in bdev_add_to_map
    + */
    + put_disk(disk);
    + put_disk(disk);
    + return 0;
    +}
    +
    +void bdev_iterate_map(struct kobj_map *map,
    + int (*fn)(dev_t, int, mode_t, void *), void *x)
    +{
    + kobj_map_iterate(map, fn, x);
    +}
    +
    +static struct kobject *base_probe(dev_t devt, int *part, void *data);
    +
    +struct kobj_map *bdev_map_init(void)
    +{
    + return kobj_map_init(base_probe, &block_class_lock);
    +}
    +
    +void bdev_map_fini(struct kobj_map *map)
    +{
    + kobj_map_fini(map);
    +}
    +#endif
    +
    /**
    * get_gendisk - get partitioning information for a given device
    * @dev: device to get partitioning information for
    diff --git a/include/linux/genhd.h b/include/linux/genhd.h
    index e09df44..c5483fc 100644
    --- a/include/linux/genhd.h
    +++ b/include/linux/genhd.h
    @@ -373,6 +373,14 @@ extern void del_gendisk(struct gendisk *gp);
    extern void unlink_gendisk(struct gendisk *gp);
    extern struct gendisk *get_gendisk(dev_t dev, mode_t *mode, int *part);

    +struct kobj_map;
    +extern int bdev_add_to_map(struct kobj_map *, dev_t dev, int all, mode_t mode);
    +extern int bdev_del_from_map(struct kobj_map *map, dev_t dev, int all);
    +extern void bdev_iterate_map(struct kobj_map *map,
    + int (*fn)(dev_t, int, mode_t, void *), void *x);
    +extern struct kobj_map *bdev_map_init(void);
    +extern void bdev_map_fini(struct kobj_map *map);
    +
    extern void set_device_ro(struct block_device *bdev, int flag);
    extern void set_disk_ro(struct gendisk *disk, int flag);

    --
    1.5.3.4

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  8. [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    Now check the requesting permissions against the granted
    (with the dev_t-to-kobj map) ones.

    The tricky place is chrdev_open - it caches the struct cdev
    on inode and thus, we have to perform lookup each time
    if we are in a restricted mapping.

    The task_cdev_map and task_bdev_map provide the map which
    the current task is in, but now they just return NULL, which
    means, that the task is not in any.

    Signed-off-by: Pavel Emelyanov

    ---
    block/genhd.c | 8 +++++++-
    fs/block_dev.c | 8 ++++++++
    fs/char_dev.c | 18 ++++++++++++++++--
    include/linux/devscontrol.h | 12 ++++++++++++
    4 files changed, 43 insertions(+), 3 deletions(-)
    create mode 100644 include/linux/devscontrol.h

    diff --git a/block/genhd.c b/block/genhd.c
    index 1d1d0f2..a619158 100644
    --- a/block/genhd.c
    +++ b/block/genhd.c
    @@ -8,6 +8,7 @@
    #include
    #include
    #include
    +#include
    #include
    #include
    #include
    @@ -212,10 +213,15 @@ void unlink_gendisk(struct gendisk *disk)
    */
    struct gendisk *get_gendisk(dev_t devt, mode_t *mode, int *part)
    {
    + struct kobj_map *map;
    struct kobject *kobj;
    struct device *dev;

    - kobj = kobj_lookup(bdev_map, devt, mode, part);
    + map = task_bdev_map(current);
    + if (map == NULL)
    + map = bdev_map;
    +
    + kobj = kobj_lookup(map, devt, mode, part);
    if (kobj == NULL)
    return NULL;

    diff --git a/fs/block_dev.c b/fs/block_dev.c
    index 00dda91..34dc607 100644
    --- a/fs/block_dev.c
    +++ b/fs/block_dev.c
    @@ -945,6 +945,14 @@ static int do_open(struct block_device *bdev, struct file *file, int for_part)
    bdput(bdev);
    return ret;
    }
    +
    + if ((file->f_mode & mode) != file->f_mode) {
    + unlock_kernel();
    + bdput(bdev);
    + put_disk(disk);
    + return -EACCES;
    + }
    +
    owner = disk->fops->owner;

    mutex_lock_nested(&bdev->bd_mutex, for_part);
    diff --git a/fs/char_dev.c b/fs/char_dev.c
    index dceb579..d042446 100644
    --- a/fs/char_dev.c
    +++ b/fs/char_dev.c
    @@ -22,6 +22,8 @@
    #include
    #include

    +#include
    +
    #ifdef CONFIG_KMOD
    #include
    #endif
    @@ -361,19 +363,31 @@ static int chrdev_open(struct inode *inode, struct file *filp)
    struct cdev *p;
    struct cdev *new = NULL;
    int ret = 0;
    + struct kobj_map *map;
    +
    + map = task_cdev_map(current);
    + if (map == NULL)
    + map = cdev_map;

    spin_lock(&cdev_lock);
    p = inode->i_cdev;
    - if (!p) {
    + if (!p || map != cdev_map) {
    struct kobject *kobj;
    int idx;
    mode_t mode;

    spin_unlock(&cdev_lock);
    - kobj = kobj_lookup(cdev_map, inode->i_rdev, &mode, &idx);
    + kobj = kobj_lookup(map, inode->i_rdev, &mode, &idx);
    if (!kobj)
    return -ENXIO;
    new = container_of(kobj, struct cdev, kobj);
    + BUG_ON(p != NULL && p != new);
    +
    + if ((filp->f_mode & mode) != filp->f_mode) {
    + cdev_put(new);
    + return -EACCES;
    + }
    +
    spin_lock(&cdev_lock);
    p = inode->i_cdev;
    if (!p) {
    diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h
    new file mode 100644
    index 0000000..04c168b
    --- /dev/null
    +++ b/include/linux/devscontrol.h
    @@ -0,0 +1,12 @@
    +#ifndef __DEVS_CONTROL_H__
    +#define __DEVS_CONTROL_H__
    +static inline struct kobj_map *task_cdev_map(struct task_struct *tsk)
    +{
    + return NULL;
    +}
    +
    +static inline struct kobj_map *task_bdev_map(struct task_struct *tsk)
    +{
    + return NULL;
    +}
    +#endif
    --
    1.5.3.4

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  9. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    On Wed, 05 Mar 2008 20:37:40 +0300
    Pavel Emelyanov wrote:

    > diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h
    > new file mode 100644
    > index 0000000..04c168b
    > --- /dev/null
    > +++ b/include/linux/devscontrol.h
    > @@ -0,0 +1,12 @@
    > +#ifndef __DEVS_CONTROL_H__
    > +#define __DEVS_CONTROL_H__
    > +static inline struct kobj_map *task_cdev_map(struct task_struct *tsk)
    > +{
    > + return NULL;
    > +}
    > +
    > +static inline struct kobj_map *task_bdev_map(struct task_struct *tsk)
    > +{
    > + return NULL;
    > +}
    > +#endif


    This doesn't include sufficient headers to be compileable.

    I'm sure there are lots of headers like this. But we regularly need
    to fix them.
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  10. Re: [PATCH 0/9] Devices accessibility control group (v4)

    On Wed, Mar 05, 2008 at 08:23:35PM +0300, Pavel Emelyanov wrote:
    > Changes from v3:
    > * Ported on 2.6.25-rc3-mm1;
    > * Re-splitted into smaller pieces;
    > * Added more comments to tricky places.
    >
    > This controller allows to tune the devices accessibility by tasks,
    > i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
    > access to IDE devices and completely hide SCSI disks.


    From within the kernel itself? The kernel should not be keeping track
    of the mode of devices, that's what the filesystem holding /dev is for.
    Those modes change all the time depending on the device plugged in, and
    the user using the "console". Why should the kernel need to worry about
    any of this?

    > Tasks still can call mknod to create device files, regardless of
    > whether the particular device is visible or accessible, but they
    > may not be able to open it later.
    >
    > This one hides under CONFIG_CGROUP_DEVS option.
    >
    > To play with it - run a standard procedure:
    >
    > # mount -t container none /cont/devs -o devices
    > # mkdir /cont/devs/0
    > # echo -n $$ > /cont/devs/0/tasks


    What is /cont/ for?

    > and tune device permissions.


    How is this done?

    Why would the kernel care about this stuff?

    confused,

    greg k-h
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  11. Re: [PATCH 9/9] Devices accessibility control group itself

    On Wed, Mar 05, 2008 at 08:47:26PM +0300, Pavel Emelyanov wrote:
    > Finally, here's the control group, which makes full use of the
    > interfaces, declared in the previous patches.
    >
    > Signed-off-by: Pavel Emelyanov
    >
    > ---
    > Documentation/controllers/devices.txt | 61 +++++++


    The information here should be in Documentation/ABI/ right?

    I still don't see how changing these permissions within the kernel is
    going to affect anything outside of it, nor why you want to be tracking
    this within the kernel itself.

    What's wrong with just having different /dev trees for the different
    containers? That would not require any kernel changes, and you don't
    have to write a tool to be sending these odd "mode settings" into the
    kernel.

    thanks,

    greg k-h
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  12. Re: [PATCH 0/9] Devices accessibility control group (v4)

    Quoting Greg KH (greg@kroah.com):
    > On Wed, Mar 05, 2008 at 08:23:35PM +0300, Pavel Emelyanov wrote:
    > > Changes from v3:
    > > * Ported on 2.6.25-rc3-mm1;
    > > * Re-splitted into smaller pieces;
    > > * Added more comments to tricky places.
    > >
    > > This controller allows to tune the devices accessibility by tasks,
    > > i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
    > > access to IDE devices and completely hide SCSI disks.

    >
    > From within the kernel itself? The kernel should not be keeping track
    > of the mode of devices, that's what the filesystem holding /dev is for.
    > Those modes change all the time depending on the device plugged in, and
    > the user using the "console". Why should the kernel need to worry about
    > any of this?


    These are distinct from the permissions on device files. No matter what
    the permissions on the device files, a task in a devcg cgroup which
    isn't allowed write to chardev 4:64 will not be able to write to
    /dev/ttyS0.

    The purpose is to prevent a root task from granting itself access to
    certain devices. Without this, the only option currently is to take
    CAP_MKNOD out of the capability bounding set for a container and make
    sure that /dev is set up right (and enforce nodev for mounts). In
    itself that doesn't sound so bad and it was my preference at first, but
    the argument is that things like udev should be able to run in a
    container, and will object about not being able to create devices.

    > > Tasks still can call mknod to create device files, regardless of
    > > whether the particular device is visible or accessible, but they
    > > may not be able to open it later.
    > >
    > > This one hides under CONFIG_CGROUP_DEVS option.
    > >
    > > To play with it - run a standard procedure:
    > >
    > > # mount -t container none /cont/devs -o devices
    > > # mkdir /cont/devs/0
    > > # echo -n $$ > /cont/devs/0/tasks

    >
    > What is /cont/ for?


    cgroups used to be called containers, so 'cont' is presumably shorthand
    for container.

    > > and tune device permissions.

    >
    > How is this done?
    >
    > Why would the kernel care about this stuff?


    Because there is no way for userspace to restrict a root process in a
    container from accessing whatever devices it wants.

    > confused,
    >
    > greg k-h


    thanks,
    -serge
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  13. Re: [PATCH 0/9] Devices accessibility control group (v4)

    On Wed, Mar 05, 2008 at 09:15:25PM -0600, Serge E. Hallyn wrote:
    > Quoting Greg KH (greg@kroah.com):
    > > On Wed, Mar 05, 2008 at 08:23:35PM +0300, Pavel Emelyanov wrote:
    > > > Changes from v3:
    > > > * Ported on 2.6.25-rc3-mm1;
    > > > * Re-splitted into smaller pieces;
    > > > * Added more comments to tricky places.
    > > >
    > > > This controller allows to tune the devices accessibility by tasks,
    > > > i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
    > > > access to IDE devices and completely hide SCSI disks.

    > >
    > > From within the kernel itself? The kernel should not be keeping track
    > > of the mode of devices, that's what the filesystem holding /dev is for.
    > > Those modes change all the time depending on the device plugged in, and
    > > the user using the "console". Why should the kernel need to worry about
    > > any of this?

    >
    > These are distinct from the permissions on device files. No matter what
    > the permissions on the device files, a task in a devcg cgroup which
    > isn't allowed write to chardev 4:64 will not be able to write to
    > /dev/ttyS0.


    Then why not do that from userspace with a different /dev, or with a
    LSM?

    > The purpose is to prevent a root task from granting itself access to
    > certain devices. Without this, the only option currently is to take
    > CAP_MKNOD out of the capability bounding set for a container and make
    > sure that /dev is set up right (and enforce nodev for mounts). In
    > itself that doesn't sound so bad and it was my preference at first,


    that would be my preference as well.

    > but the argument is that things like udev should be able to run in a
    > container, and will object about not being able to create devices.


    No reason you can't modify udev to do something like this. At the
    worse, just disable udev warning messages, that is pretty trivial to do.

    This really makes it seem like this kernel change is not needed at all.

    > > > Tasks still can call mknod to create device files, regardless of
    > > > whether the particular device is visible or accessible, but they
    > > > may not be able to open it later.
    > > >
    > > > This one hides under CONFIG_CGROUP_DEVS option.
    > > >
    > > > To play with it - run a standard procedure:
    > > >
    > > > # mount -t container none /cont/devs -o devices
    > > > # mkdir /cont/devs/0
    > > > # echo -n $$ > /cont/devs/0/tasks

    > >
    > > What is /cont/ for?

    >
    > cgroups used to be called containers, so 'cont' is presumably shorthand
    > for container.
    >
    > > > and tune device permissions.

    > >
    > > How is this done?
    > >
    > > Why would the kernel care about this stuff?

    >
    > Because there is no way for userspace to restrict a root process in a
    > container from accessing whatever devices it wants.


    LSM does this quite well today, why reinvent the wheel and put wierd
    hooks in other parts of the kernel that do not need them at all?

    thanks,

    greg k-h
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  14. Re: [PATCH 0/9] Devices accessibility control group (v4)

    Greg KH wrote:
    > On Wed, Mar 05, 2008 at 09:15:25PM -0600, Serge E. Hallyn wrote:
    >> Quoting Greg KH (greg@kroah.com):
    >>> On Wed, Mar 05, 2008 at 08:23:35PM +0300, Pavel Emelyanov wrote:
    >>>> Changes from v3:
    >>>> * Ported on 2.6.25-rc3-mm1;
    >>>> * Re-splitted into smaller pieces;
    >>>> * Added more comments to tricky places.
    >>>>
    >>>> This controller allows to tune the devices accessibility by tasks,
    >>>> i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
    >>>> access to IDE devices and completely hide SCSI disks.
    >>> From within the kernel itself? The kernel should not be keeping track
    >>> of the mode of devices, that's what the filesystem holding /dev is for.
    >>> Those modes change all the time depending on the device plugged in, and
    >>> the user using the "console". Why should the kernel need to worry about
    >>> any of this?

    >> These are distinct from the permissions on device files. No matter what
    >> the permissions on the device files, a task in a devcg cgroup which
    >> isn't allowed write to chardev 4:64 will not be able to write to
    >> /dev/ttyS0.

    >
    > Then why not do that from userspace with a different /dev, or with a
    > LSM?


    Different dev is not suitable, since task may still call mknod to
    create device it needs and use it. This is not about comfortable
    use, this is about security.

    LSM approach was proposed, but that required some API to configure
    the permissions. This API done via control groups, so there were no
    difference between this approach and that. Except for this one doesn't
    create one more level of filtering at the top of kobject lookup and
    thus is simpler and faster.

    >> The purpose is to prevent a root task from granting itself access to
    >> certain devices. Without this, the only option currently is to take
    >> CAP_MKNOD out of the capability bounding set for a container and make
    >> sure that /dev is set up right (and enforce nodev for mounts). In
    >> itself that doesn't sound so bad and it was my preference at first,

    >
    > that would be my preference as well.


    That approach relies on a proper user space setup inside a container,
    but this creates security holes, since container user may ignore all
    these "requirements".

    >> but the argument is that things like udev should be able to run in a
    >> container, and will object about not being able to create devices.

    >
    > No reason you can't modify udev to do something like this. At the


    Sure we _can_ modify udev, but the problem is that users of virtualisation
    solutions often (very often) use old software (e.g. set up some out-dated
    distribution inside a container), so trick with modified udev simply won't
    work in many cases.

    > worse, just disable udev warning messages, that is pretty trivial to do.
    >
    > This really makes it seem like this kernel change is not needed at all.
    >
    >>>> Tasks still can call mknod to create device files, regardless of
    >>>> whether the particular device is visible or accessible, but they
    >>>> may not be able to open it later.
    >>>>
    >>>> This one hides under CONFIG_CGROUP_DEVS option.
    >>>>
    >>>> To play with it - run a standard procedure:
    >>>>
    >>>> # mount -t container none /cont/devs -o devices
    >>>> # mkdir /cont/devs/0
    >>>> # echo -n $$ > /cont/devs/0/tasks
    >>> What is /cont/ for?

    >> cgroups used to be called containers, so 'cont' is presumably shorthand
    >> for container.
    >>
    >>>> and tune device permissions.
    >>> How is this done?
    >>>
    >>> Why would the kernel care about this stuff?

    >> Because there is no way for userspace to restrict a root process in a
    >> container from accessing whatever devices it wants.

    >
    > LSM does this quite well today, why reinvent the wheel and put wierd
    > hooks in other parts of the kernel that do not need them at all?


    These are not hooks actually. I just made kobj_map-s per-group.

    > thanks,
    >
    > greg k-h
    >


    Thanks,
    Pavel
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  15. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    Andrew Morton wrote:
    > On Wed, 05 Mar 2008 20:37:40 +0300
    > Pavel Emelyanov wrote:
    >
    >> diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h
    >> new file mode 100644
    >> index 0000000..04c168b
    >> --- /dev/null
    >> +++ b/include/linux/devscontrol.h
    >> @@ -0,0 +1,12 @@
    >> +#ifndef __DEVS_CONTROL_H__
    >> +#define __DEVS_CONTROL_H__
    >> +static inline struct kobj_map *task_cdev_map(struct task_struct *tsk)
    >> +{
    >> + return NULL;
    >> +}
    >> +
    >> +static inline struct kobj_map *task_bdev_map(struct task_struct *tsk)
    >> +{
    >> + return NULL;
    >> +}
    >> +#endif

    >
    > This doesn't include sufficient headers to be compileable.


    Indeed The patch #9 actually declares two needed struct-s, so the final
    compilation is clean, but this set is still bad from the git-bisect point
    of view, sorry.

    Shall I provide a devscontrol-make-use-of-permissions-returned-by-kobj_lookup-fix.patch
    for this one?

    > I'm sure there are lots of headers like this. But we regularly need
    > to fix them.
    >


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  16. Re: [PATCH 0/9] Devices accessibility control group (v4)

    On Thu, Mar 06, 2008 at 11:36:22AM +0300, Pavel Emelyanov wrote:
    > Greg KH wrote:
    > > On Wed, Mar 05, 2008 at 09:15:25PM -0600, Serge E. Hallyn wrote:
    > >> Quoting Greg KH (greg@kroah.com):
    > >>> On Wed, Mar 05, 2008 at 08:23:35PM +0300, Pavel Emelyanov wrote:
    > >>>> Changes from v3:
    > >>>> * Ported on 2.6.25-rc3-mm1;
    > >>>> * Re-splitted into smaller pieces;
    > >>>> * Added more comments to tricky places.
    > >>>>
    > >>>> This controller allows to tune the devices accessibility by tasks,
    > >>>> i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
    > >>>> access to IDE devices and completely hide SCSI disks.
    > >>> From within the kernel itself? The kernel should not be keeping track
    > >>> of the mode of devices, that's what the filesystem holding /dev is for.
    > >>> Those modes change all the time depending on the device plugged in, and
    > >>> the user using the "console". Why should the kernel need to worry about
    > >>> any of this?
    > >> These are distinct from the permissions on device files. No matter what
    > >> the permissions on the device files, a task in a devcg cgroup which
    > >> isn't allowed write to chardev 4:64 will not be able to write to
    > >> /dev/ttyS0.

    > >
    > > Then why not do that from userspace with a different /dev, or with a
    > > LSM?

    >
    > Different dev is not suitable, since task may still call mknod to
    > create device it needs and use it. This is not about comfortable
    > use, this is about security.


    LSM can control mknod quite well. And it is _all_ about security, don't
    try to add "security" to some other portion of the kernel that it not
    expecting it

    > LSM approach was proposed, but that required some API to configure
    > the permissions.


    Much like the API you already created to configure the permissions...

    > This API done via control groups, so there were no
    > difference between this approach and that. Except for this one doesn't
    > create one more level of filtering at the top of kobject lookup and
    > thus is simpler and faster.


    How can it be "faster"? There is no speed problems here.

    Please use LSM for this, that is what it is explicitly there for.

    > >> The purpose is to prevent a root task from granting itself access to
    > >> certain devices. Without this, the only option currently is to take
    > >> CAP_MKNOD out of the capability bounding set for a container and make
    > >> sure that /dev is set up right (and enforce nodev for mounts). In
    > >> itself that doesn't sound so bad and it was my preference at first,

    > >
    > > that would be my preference as well.

    >
    > That approach relies on a proper user space setup inside a container,
    > but this creates security holes, since container user may ignore all
    > these "requirements".


    Ok then, use LSM to enforce those requirements, again, that is what it
    is there for.

    > >> but the argument is that things like udev should be able to run in a
    > >> container, and will object about not being able to create devices.

    > >
    > > No reason you can't modify udev to do something like this. At the

    >
    > Sure we _can_ modify udev, but the problem is that users of virtualisation
    > solutions often (very often) use old software (e.g. set up some out-dated
    > distribution inside a container), so trick with modified udev simply won't
    > work in many cases.


    So the kernel's job is to enforce policy that an old userspace can not
    seem to get straight? That bit of logic doesn't go over so well...

    Andrew, please drop these patches from your tree, this all should be
    done with a simple LSM interface, which would be much smaller and much
    less intrusive than this patchset.

    thanks,

    greg k-h
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  17. Re: [PATCH 0/9] Devices accessibility control group (v4)

    On Thu 2008-03-06 11:36:22, Pavel Emelyanov wrote:
    > Greg KH wrote:
    > > On Wed, Mar 05, 2008 at 09:15:25PM -0600, Serge E. Hallyn wrote:
    > >> Quoting Greg KH (greg@kroah.com):
    > >>> On Wed, Mar 05, 2008 at 08:23:35PM +0300, Pavel Emelyanov wrote:
    > >>>> Changes from v3:
    > >>>> * Ported on 2.6.25-rc3-mm1;
    > >>>> * Re-splitted into smaller pieces;
    > >>>> * Added more comments to tricky places.
    > >>>>
    > >>>> This controller allows to tune the devices accessibility by tasks,
    > >>>> i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
    > >>>> access to IDE devices and completely hide SCSI disks.
    > >>> From within the kernel itself? The kernel should not be keeping track
    > >>> of the mode of devices, that's what the filesystem holding /dev is for.
    > >>> Those modes change all the time depending on the device plugged in, and
    > >>> the user using the "console". Why should the kernel need to worry about
    > >>> any of this?
    > >> These are distinct from the permissions on device files. No matter what
    > >> the permissions on the device files, a task in a devcg cgroup which
    > >> isn't allowed write to chardev 4:64 will not be able to write to
    > >> /dev/ttyS0.

    > >
    > > Then why not do that from userspace with a different /dev, or with a
    > > LSM?

    >
    > Different dev is not suitable, since task may still call mknod to
    > create device it needs and use it. This is not about comfortable
    > use, this is about security.


    And you may still take out mknod capability...

    --
    (english) http://www.livejournal.com/~pavelmachek
    (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pav...rses/blog.html
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  18. Re: [PATCH 0/9] Devices accessibility control group (v4)

    Pavel Machek wrote:
    > On Thu 2008-03-06 11:36:22, Pavel Emelyanov wrote:
    >> Greg KH wrote:
    >>> On Wed, Mar 05, 2008 at 09:15:25PM -0600, Serge E. Hallyn wrote:
    >>>> Quoting Greg KH (greg@kroah.com):
    >>>>> On Wed, Mar 05, 2008 at 08:23:35PM +0300, Pavel Emelyanov wrote:
    >>>>>> Changes from v3:
    >>>>>> * Ported on 2.6.25-rc3-mm1;
    >>>>>> * Re-splitted into smaller pieces;
    >>>>>> * Added more comments to tricky places.
    >>>>>>
    >>>>>> This controller allows to tune the devices accessibility by tasks,
    >>>>>> i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
    >>>>>> access to IDE devices and completely hide SCSI disks.
    >>>>> From within the kernel itself? The kernel should not be keeping track
    >>>>> of the mode of devices, that's what the filesystem holding /dev is for.
    >>>>> Those modes change all the time depending on the device plugged in, and
    >>>>> the user using the "console". Why should the kernel need to worry about
    >>>>> any of this?
    >>>> These are distinct from the permissions on device files. No matter what
    >>>> the permissions on the device files, a task in a devcg cgroup which
    >>>> isn't allowed write to chardev 4:64 will not be able to write to
    >>>> /dev/ttyS0.
    >>> Then why not do that from userspace with a different /dev, or with a
    >>> LSM?

    >> Different dev is not suitable, since task may still call mknod to
    >> create device it needs and use it. This is not about comfortable
    >> use, this is about security.

    >
    > And you may still take out mknod capability...


    Sure we can. In that case we can pre-create only alowed devices for a container
    and be happy, but this is not a flexible way of doing things. The main problem is
    that udev will stop working in such a container. Patching one is not an option
    either, for the reasons I have described before.

    Many virtualization functionality we're submitting can be achieved via proper
    userpace modifications, but we want to be able to run old software (from old
    and even historic distributions) inside containers, so we have to make this
    at the kernel level.

    Thanks,
    Pavel
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  19. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    Andrew Morton wrote:
    > On Wed, 05 Mar 2008 20:37:40 +0300
    > Pavel Emelyanov wrote:
    >
    >> diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h
    >> new file mode 100644
    >> index 0000000..04c168b
    >> --- /dev/null
    >> +++ b/include/linux/devscontrol.h
    >> @@ -0,0 +1,12 @@
    >> +#ifndef __DEVS_CONTROL_H__
    >> +#define __DEVS_CONTROL_H__
    >> +static inline struct kobj_map *task_cdev_map(struct task_struct *tsk)
    >> +{
    >> + return NULL;
    >> +}
    >> +
    >> +static inline struct kobj_map *task_bdev_map(struct task_struct *tsk)
    >> +{
    >> + return NULL;
    >> +}
    >> +#endif

    >
    > This doesn't include sufficient headers to be compileable.
    >
    > I'm sure there are lots of headers like this. But we regularly need
    > to fix them.
    >


    Not sure, whether this is still relevant after Greg's comments, but that's
    the -fix patch for this one. (It will cause a conflict with the 9th patch.)

    diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h
    index 04c168b..52cac64 100644
    --- a/include/linux/devscontrol.h
    +++ b/include/linux/devscontrol.h
    @@ -1,5 +1,8 @@
    #ifndef __DEVS_CONTROL_H__
    #define __DEVS_CONTROL_H__
    +struct kobj_map;
    +struct task_struct;
    +
    static inline struct kobj_map *task_cdev_map(struct task_struct *tsk)
    {
    return NULL;
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  20. Re: [PATCH 5/9] Make use of permissions, returned by kobj_lookup

    On Fri, 07 Mar 2008 12:22:01 +0300 Pavel Emelyanov wrote:

    > > This doesn't include sufficient headers to be compileable.
    > >
    > > I'm sure there are lots of headers like this. But we regularly need
    > > to fix them.
    > >

    >
    > Not sure, whether this is still relevant after Greg's comments, but that's
    > the -fix patch for this one. (It will cause a conflict with the 9th patch.)


    Well. Where do we stand with this? afaict the state of play is:

    Greg: do it in udev
    Pavel: but people want to run old distros in containers


    Realistically, when is the mainline kernel likely to have sufficient
    container functionality which is sufficiently well-tested for people to
    actually be able to do that? And how much longer will it take for that
    kernel.org functionality to propagate out into non-bleeding-edge distros?

    Altogether we're looking at one to three years, aren't we? By then,
    perhaps a lot of "old" distros are already udev-based.

    otoh, my experience upgrading old kernels to new udev has not been a
    good one (ie: it didn't work).

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

+ Reply to Thread
Page 1 of 3 1 2 3 LastLast