Thread: [patch 0/4] [RFC] Another proportional weight IO controller

  1. [patch 0/4] [RFC] Another proportional weight IO controller

    Hi,

    If you are not already tired of so many io controller implementations, here
    is another one.

    This is a very early, very crude implementation, posted to get early
    feedback on whether this approach makes any sense or not.

    This controller is a proportional weight IO controller, primarily based
    on/inspired by dm-ioband. One of the things I personally found a little
    odd about dm-ioband was the need for a dm-ioband device on top of every
    device we want to control. I thought we could probably make this control
    per request queue instead and get rid of the device mapper driver, which
    should make the configuration aspect easier.

    I have picked up a fair amount of code from dm-ioband, especially for the
    biocgroup implementation.

    I have done only very basic testing so far: running 2-3 dd commands in
    different cgroups on x86_64. I wanted to throw the code out early to get
    some feedback.

    More details about the design and how to use it are in the documentation
    patch.

    Your comments are welcome.

    Thanks
    Vivek


  2. [patch 3/4] io controller: Core IO controller implementation logic


    o Core IO controller implementation

    Signed-off-by: Vivek Goyal

    Index: linux2/mm/biocontrol.c
     ===================================================================
    --- linux2.orig/mm/biocontrol.c 2008-11-06 05:27:36.000000000 -0500
    +++ linux2/mm/biocontrol.c 2008-11-06 05:33:27.000000000 -0500
    @@ -33,6 +33,7 @@
    #include
    #include

    +void bio_group_inactive_timeout(unsigned long data);

    /* return corresponding bio_cgroup object of a cgroup */
    static inline struct bio_cgroup *cgroup_bio(struct cgroup *cgrp)
    @@ -407,3 +408,706 @@ struct cgroup_subsys bio_cgroup_subsys =
    .attach = bio_cgroup_move_task,
    .early_init = 0,
    };
    +
    +struct bio_group* create_bio_group(struct bio_cgroup *biocg,
    + struct request_queue *q)
    +{
    + unsigned long flags;
    + struct bio_group *biog = NULL;
    +
    + biog = kzalloc(sizeof(struct bio_group), GFP_ATOMIC);
    + if (!biog)
    + return biog;
    +
    + spin_lock_init(&biog->bio_group_lock);
    + biog->q = q;
    + biog->biocg = biocg;
    + INIT_LIST_HEAD(&biog->next);
    + biog->biog_inactive_timer.function = bio_group_inactive_timeout;
    + biog->biog_inactive_timer.data = (unsigned long)biog;
    + init_timer(&biog->biog_inactive_timer);
    + atomic_set(&biog->refcnt, 0);
    + spin_lock_irqsave(&biocg->biog_list_lock, flags);
    + list_add(&biog->next, &biocg->bio_group_list);
    + bio_group_get(biog);
    + spin_unlock_irqrestore(&biocg->biog_list_lock, flags);
    + return biog;
    +}
    +
    +void* alloc_biog_io(void)
    +{
    + return kzalloc(sizeof(struct biog_io), GFP_ATOMIC);
    +}
    +
    +void free_biog_io(struct biog_io *biog_io)
    +{
    + kfree(biog_io);
    +}
    +
    +/*
     + * Upon successful completion of a bio, this function starts the inactive timer
    + * so that if a bio group stops contending for disk bandwidth, it is removed
    + * from the token allocation race.
    + */
    +void biog_io_end(struct bio *bio, int error)
    +{
    + struct biog_io *biog_io;
    + struct bio_group *biog;
    + unsigned long flags;
    + struct request_queue *q;
    +
    + biog_io = bio->bi_private;
    + biog = biog_io->biog;
    + BUG_ON(!biog);
    +
    + spin_lock_irqsave(&biog->bio_group_lock, flags);
    + q = biog->q;
    + BUG_ON(!q);
    +
    + /* Restore the original bio fields */
    + bio->bi_end_io = biog_io->bi_end_io;
    + bio->bi_private = biog_io->bi_private;
    +
    + /* If bio group is still empty, then start the inactive timer */
    + if (bio_group_on_queue(biog) && bio_group_empty(biog)) {
    + mod_timer(&biog->biog_inactive_timer,
    + jiffies + msecs_to_jiffies(q->biogroup_idletime));
    + bio_group_flag_set(BIOG_FLAG_TIMER_ACTIVE, biog);
    + }
    +
    + spin_unlock_irqrestore(&biog->bio_group_lock, flags);
    + free_biog_io(biog_io);
    + bio_group_put(biog);
    + bio_endio(bio, error);
    +}
    +
    +/* Calculate how many tokens should be allocated to new group based on
    + * the number of share/weight of this group and the number of tokens and
    + * load which is already present on the queue.
    + */
    +unsigned long calculate_nr_tokens(struct bio_group *biog,
    + struct request_queue *q)
    +{
    + unsigned long nr_tokens, total_slice;
    +
    + total_slice = q->biogroup_deftoken * q->nr_biog;
    + nr_tokens = total_slice * biog->biocg->shares/q->total_weight;
    +
    + BUG_ON(!nr_tokens);
    + return nr_tokens;
    +}
    +
    +unsigned long alloc_bio_group_key(struct request_queue *q)
    +{
    + unsigned long key = 0;
    +
    + if (!q->bio_groups.rb.rb_node)
    + return key;
    +
    + /* Insert element at the end of tree */
    + key = q->max_key + 1;
    + return key;
    +}
    +
    +/*
    + * The below is leftmost cache rbtree addon
    + */
    +struct bio_group *bio_group_rb_first(struct group_rb_root *root)
    +{
    + if (!root->left)
    + root->left = rb_first(&root->rb);
    +
    + if (root->left)
    + return rb_entry(root->left, struct bio_group, rb_node);
    +
    + return NULL;
    +}
    +
    +void remove_bio_group_from_rbtree(struct bio_group *biog,
    + struct request_queue *q)
    +{
    + struct group_rb_root *root;
    + struct rb_node *n;
    +
    + root = &q->bio_groups;
    + n = &biog->rb_node;
    +
    + if (root->left == n)
    + root->left = NULL;
    +
    + rb_erase(n, &root->rb);
    + RB_CLEAR_NODE(n);
    +
    + if (bio_group_blocked(biog))
    + q->nr_biog_blocked--;
    +
    + q->nr_biog--;
    + q->total_weight -= biog->biocg->shares;
    +
    + if (!q->total_weight)
    + q->max_key = 0;
    +}
    +
    +
    +void insert_bio_group_into_rbtree(struct bio_group *biog,
    + struct request_queue *q)
    +{
    + struct rb_node **p;
    + struct rb_node *parent = NULL;
    + struct bio_group *__biog;
    + int leftmost = 1;
    +
    + /* Check if any element being inserted has key less than max key */
    + if (biog->key < q->max_key)
    + BUG();
    +
    + p = &q->bio_groups.rb.rb_node;
    + while (*p) {
    + parent = *p;
    + __biog = rb_entry(parent, struct bio_group, rb_node);
    +
    + /* Should equal key case be a warning? */
    + if (biog->key < __biog->key)
    + p = &(*p)->rb_left;
    + else {
    + p = &(*p)->rb_right;
    + leftmost = 0;
    + }
    + }
    +
    + /* Cache the leftmost element */
    + if (leftmost)
    + q->bio_groups.left = &biog->rb_node;
    +
    + rb_link_node(&biog->rb_node, parent, p);
    + rb_insert_color(&biog->rb_node, &q->bio_groups.rb);
    +
    + /* Update the tokens and weight in request_queue */
    + q->nr_biog++;
    + q->total_weight += biog->biocg->shares;
    + q->max_key = biog->key;
    + if (bio_group_blocked(biog))
    + q->nr_biog_blocked++;
    +}
    +
    +void queue_bio_group(struct bio_group *biog, struct request_queue *q)
    +{
    + biog->key = alloc_bio_group_key(q);
    + /* Take another reference on biog. will be decremented once biog
    + * is off the tree */
    + bio_group_get(biog);
    + insert_bio_group_into_rbtree(biog, q);
    + bio_group_flag_set(BIOG_FLAG_ON_QUEUE, biog);
    + bio_group_flag_clear(BIOG_FLAG_BLOCKED, biog);
    + biog->slice_stamp = q->current_slice;
    +}
    +
    +void start_new_token_slice(struct request_queue *q)
    +{
    + struct rb_node *n;
    + struct bio_group *biog = NULL;
    + struct group_rb_root *root;
    + unsigned long flags;
    +
    + q->current_slice++;
    +
    + /* Traverse the tree and reset the blocked count to zero of all the
    + * biogs */
    +
    + root = &q->bio_groups;
    +
    + if (!root->left)
    + root->left = rb_first(&root->rb);
    +
    + if (root->left)
    + biog = rb_entry(root->left, struct bio_group, rb_node);
    +
    + if (!biog)
    + return;
    +
    + n = &biog->rb_node;
    +
    + /* Reset blocked count */
    + q->nr_biog_blocked = 0;
    + q->newslice_count++;
    +
    + do {
    + biog = rb_entry(n, struct bio_group, rb_node);
    + spin_lock_irqsave(&biog->bio_group_lock, flags);
    + bio_group_flag_clear(BIOG_FLAG_BLOCKED, biog);
    + spin_unlock_irqrestore(&biog->bio_group_lock, flags);
    + n = rb_next(n);
    + } while (n);
    +
    +}
    +
    +int should_start_new_token_slice(struct request_queue *q)
    +{
    + /*
    + * if all the biog on the queue are blocked, then start a new
    + * token slice
    + */
    + if (q->nr_biog_blocked == q->nr_biog)
    + return 1;
    + return 0;
    +}
    +
    +int is_bio_group_blocked(struct bio_group *biog)
    +{
    + unsigned long flags, status = 0;
    +
    + /* Do I really need to lock bio group */
    + spin_lock_irqsave(&biog->bio_group_lock, flags);
    + if (bio_group_blocked(biog))
    + status = 1;
    + spin_unlock_irqrestore(&biog->bio_group_lock, flags);
    + return status;
    +}
    +
    +int can_bio_group_dispatch(struct bio_group *biog, struct bio *bio)
    +{
    + unsigned long temp = 0, flags;
    + struct request_queue *q;
    + long nr_sectors;
    + int can_dispatch = 0;
    +
    + BUG_ON(!biog);
    + BUG_ON(!bio);
    +
    + spin_lock_irqsave(&biog->bio_group_lock, flags);
    + nr_sectors = bio_sectors(bio);
    + q = biog->q;
    +
    + if (time_after(q->current_slice, biog->slice_stamp)) {
    + temp = calculate_nr_tokens(biog, q);
    + biog->credit_tokens += temp;
    + biog->slice_stamp = q->current_slice;
    + biog->biocg->nr_token_slices++;
    + }
    +
    + if ((biog->credit_tokens > 0) && (biog->credit_tokens > nr_sectors)) {
    + if (bio_group_flag_test_and_clear(BIOG_FLAG_BLOCKED, biog))
    + q->nr_biog_blocked--;
    + can_dispatch = 1;
    + goto out;
    + }
    +
    + if (!bio_group_flag_test_and_set(BIOG_FLAG_BLOCKED, biog))
    + q->nr_biog_blocked++;
    +
    +out:
    + spin_unlock_irqrestore(&biog->bio_group_lock, flags);
    + return can_dispatch;
    +}
    +
    +/* Should be called without queue lock held */
    +void bio_group_deactivate_timer(struct bio_group *biog)
    +{
    + unsigned long flags;
    +
    + spin_lock_irqsave(&biog->bio_group_lock, flags);
     + if (bio_group_flag_test_and_clear(BIOG_FLAG_TIMER_ACTIVE, biog)) {
    + /* Drop the bio group lock so that timer routine could
    + * finish in case it fires */
    + spin_unlock_irqrestore(&biog->bio_group_lock, flags);
    + del_timer_sync(&biog->biog_inactive_timer);
    + return;
    + }
    + spin_unlock_irqrestore(&biog->bio_group_lock, flags);
    +}
    +
    +int attach_bio_group_io(struct bio_group *biog, struct bio *bio)
    +{
    + int err = 0;
    + struct biog_io *biog_io;
    +
    + biog_io = alloc_biog_io();
    + if (!biog_io) {
    + err = -ENOMEM;
    + goto out;
    + }
    +
    + /* I already have a valid pointer to biog. So it should be ok
    + * to get a reference to it. */
    + bio_group_get(biog);
    + biog_io->biog = biog;
    + biog_io->bi_end_io = bio->bi_end_io;
    + biog_io->bi_private = bio->bi_private;
    +
    + bio->bi_end_io = biog_io_end;
    + bio->bi_private = biog_io;
    +out:
    + return err;
    +}
    +
    +int account_bio_to_bio_group(struct bio_group *biog, struct bio *bio)
    +{
    + int err = 0;
    + unsigned long flags;
    + struct request_queue *q;
    +
    + spin_lock_irqsave(&biog->bio_group_lock, flags);
    + err = attach_bio_group_io(biog, bio);
    + if (err)
    + goto out;
    +
    + biog->nr_bio++;
    + q = biog->q;
    + if (!bio_group_on_queue(biog))
    + queue_bio_group(biog, q);
    +
    +out:
    + spin_unlock_irqrestore(&biog->bio_group_lock, flags);
    + return err;
    +}
    +
    +int add_bio_to_bio_group_queue(struct bio_group *biog, struct bio *bio)
    +{
    + unsigned long flags;
    + struct request_queue *q;
    +
    + spin_lock_irqsave(&biog->bio_group_lock, flags);
    + __bio_group_queue_bio_tail(biog, bio);
    + q = biog->q;
    + q->nr_queued_bio++;
    + queue_delayed_work(q->biogroup_workqueue, &q->biogroup_work, 0);
    + spin_unlock_irqrestore(&biog->bio_group_lock, flags);
    + return 0;
    +}
    +
    +/*
    + * It determines if the thread submitting the bio can itself continue to
    + * submit the bio or this bio needs to be buffered for later submission
    + */
    +int can_biog_do_direct_dispatch(struct bio_group *biog)
    +{
    + unsigned long flags, dispatch = 1;
    +
    + spin_lock_irqsave(&biog->bio_group_lock, flags);
    + if (bio_group_blocked(biog)) {
    + dispatch = 0;
    + goto out;
    + }
    +
    + /* Make sure there are not other queued bios on the biog. These
    + * queued bios should get a chance to dispatch first */
    + if (!bio_group_queued_empty(biog))
    + dispatch = 0;
    +out:
    + spin_unlock_irqrestore(&biog->bio_group_lock, flags);
    + return dispatch;
    +}
    +
    +void charge_bio_group_for_tokens(struct bio_group *biog, struct bio *bio)
    +{
    + unsigned long flags;
    + long dispatched_tokens;
    +
    + spin_lock_irqsave(&biog->bio_group_lock, flags);
    + dispatched_tokens = bio_sectors(bio);
    + biog->nr_bio--;
    +
    + biog->credit_tokens -= dispatched_tokens;
    +
    + /* debug aid. also update aggregate tokens and jiffies in biocg */
    + biog->biocg->aggregate_tokens += dispatched_tokens;
    + biog->biocg->jiffies = jiffies;
    +
    + spin_unlock_irqrestore(&biog->bio_group_lock, flags);
    +}
    +
    +unsigned long __bio_group_try_to_dispatch(struct bio_group *biog,
    + struct bio *bio)
    +{
    + struct request_queue *q;
    + int dispatched = 0;
    +
    + BUG_ON(!biog);
    + BUG_ON(!bio);
    +
    + q = biog->q;
    + BUG_ON(!q);
    +retry:
    + if (!can_bio_group_dispatch(biog, bio)) {
    + if (should_start_new_token_slice(q)) {
    + start_new_token_slice(q);
    + goto retry;
    + }
    + goto out;
    + }
    +
    + charge_bio_group_for_tokens(biog, bio);
    + dispatched = 1;
    +out:
    + return dispatched;
    +}
    +
    +unsigned long bio_group_try_to_dispatch(struct bio_group *biog, struct bio *bio)
    +{
    + struct request_queue *q;
    + int dispatched = 0;
    + unsigned long flags;
    +
    + q = biog->q;
    + BUG_ON(!q);
    +
    + spin_lock_irqsave(q->queue_lock, flags);
    + dispatched = __bio_group_try_to_dispatch(biog, bio);
    + spin_unlock_irqrestore(q->queue_lock, flags);
    +
    + return dispatched;
    +}
    +
    +/* Should be called with queue lock and bio group lock held */
    +void requeue_bio_group(struct request_queue *q, struct bio_group *biog)
    +{
    + remove_bio_group_from_rbtree(biog, q);
    + biog->key = alloc_bio_group_key(q);
    + insert_bio_group_into_rbtree(biog, q);
    +}
    +
    +/* Make a list of queued bios in this bio group which can be dispatched. */
    +void make_release_bio_list(struct bio_group *biog,
    + struct bio_list *release_list)
    +{
    + unsigned long flags, dispatched = 0;
    + struct bio *bio;
    + struct request_queue *q;
    +
    + spin_lock_irqsave(&biog->bio_group_lock, flags);
    +
    + while (1) {
    + if (bio_group_queued_empty(biog))
    + goto out;
    +
    + if (bio_group_blocked(biog))
    + goto out;
    +
    + /* Dequeue one bio from bio group */
    + bio = __bio_group_dequeue_bio(biog);
    + BUG_ON(!bio);
    + q = biog->q;
    + q->nr_queued_bio--;
    +
    + /* Releasing lock as try to dispatch will acquire it again */
    + spin_unlock_irqrestore(&biog->bio_group_lock, flags);
    + dispatched = __bio_group_try_to_dispatch(biog, bio);
    + spin_lock_irqsave(&biog->bio_group_lock, flags);
    +
    + if (dispatched) {
    + /* Add the bio to release list */
    + bio_list_add(release_list, bio);
    + continue;
    + } else {
    + /* Put the bio back into biog */
    + __bio_group_queue_bio_head(biog, bio);
    + q->nr_queued_bio++;
    + goto out;
    + }
    + }
    +out:
    + spin_unlock_irqrestore(&biog->bio_group_lock, flags);
    + return;
    +}
    +
    +/*
    + * If a bio group is inactive for q->inactive_timeout, then this group is
    + * considered to be no more contending for the disk bandwidth and removed
    + * from the tree.
    + */
    +void bio_group_inactive_timeout(unsigned long data)
    +{
    + struct bio_group *biog = (struct bio_group *)data;
    + unsigned long flags, flags1;
    + struct request_queue *q;
    +
    + q = biog->q;
    + BUG_ON(!q);
    +
    + spin_lock_irqsave(q->queue_lock, flags);
    + spin_lock_irqsave(&biog->bio_group_lock, flags1);
    +
    + BUG_ON(!bio_group_on_queue(biog));
    + BUG_ON(biog->nr_bio);
    +
    + BUG_ON((biog->bio_group_flags > 7));
    + /* Remove biog from tree */
    + biog->biocg->nr_off_the_tree++;
    + remove_bio_group_from_rbtree(biog, q);
    + bio_group_flag_clear(BIOG_FLAG_ON_QUEUE, biog);
    + bio_group_flag_clear(BIOG_FLAG_BLOCKED, biog);
    + bio_group_flag_clear(BIOG_FLAG_TIMER_ACTIVE, biog);
    +
    + /* dm_start_new_slice() takes bio_group_lock. Release it now */
    + spin_unlock_irqrestore(&biog->bio_group_lock, flags1);
    +
    + /* Also check if new slice should be started */
    + if ((q->nr_biog) && should_start_new_token_slice(q))
    + start_new_token_slice(q);
    +
    + spin_unlock_irqrestore(q->queue_lock, flags);
    + /* Drop the reference to biog */
    + bio_group_put(biog);
    + return;
    +}
    +
    +/*
    + * It is called through worker thread and it takes care of releasing queued
    + * bios to underlying layer
    + */
    +void bio_group_dispatch_queued_bio(struct request_queue *q)
    +{
    + struct bio_group *biog;
    + unsigned long biog_scanned = 0;
    + unsigned long flags, flags1;
    + struct bio *bio = NULL;
    + int ret;
    + struct bio_list release_list;
    +
    + bio_list_init(&release_list);
    +
    + spin_lock_irqsave(q->queue_lock, flags);
    +
    + while (1) {
    +
    + if (!q->nr_biog)
    + goto out;
    +
    + if (!q->nr_queued_bio)
    + goto out;
    +
    + if (biog_scanned == q->nr_biog) {
    + /* Scanned the whole tree. No eligible biog found */
    + if (q->nr_queued_bio) {
    + queue_delayed_work(q->biogroup_workqueue,
    + &q->biogroup_work, 1);
    + }
    + goto out;
    + }
    +
    + biog = bio_group_rb_first(&q->bio_groups);
    + BUG_ON(!biog);
    +
    + make_release_bio_list(biog, &release_list);
    +
    + /* If there are bios to dispatch, release these */
    + if (!bio_list_empty(&release_list)) {
    + if (q->nr_queued_bio)
    + queue_delayed_work(q->biogroup_workqueue,
    + &q->biogroup_work, 0);
    + goto dispatch_bio;
    + } else {
    + spin_lock_irqsave(&biog->bio_group_lock, flags1);
    + requeue_bio_group(q, biog);
    + biog_scanned++;
    + spin_unlock_irqrestore(&biog->bio_group_lock, flags1);
    + continue;
    + }
    + }
    +
    +dispatch_bio:
    + spin_unlock_irqrestore(q->queue_lock, flags);
    + bio = bio_list_pop(&release_list);
    + BUG_ON(!bio);
    +
    + do {
    + /* Taint the bio with pass through flag */
    + bio->bi_flags |= (1UL << BIO_NOBIOGROUP);
    + do {
    + ret = q->make_request_fn(q, bio);
    + } while (ret);
    + bio = bio_list_pop(&release_list);
    + } while (bio);
    +
    + return;
    +out:
    + spin_unlock_irqrestore(q->queue_lock, flags);
    + return;
    +}
    +
    +void blk_biogroup_work(struct work_struct *work)
    +{
    + struct delayed_work *dw = container_of(work, struct delayed_work, work);
    + struct request_queue *q =
    + container_of(dw, struct request_queue, biogroup_work);
    +
    + bio_group_dispatch_queued_bio(q);
    +}
    +
    +/*
    + * This is core IO controller function which tries to dispatch bios to
    + * underlying layers based on cgroup weights.
    + *
    + * If the cgroup bio belongs to has got sufficient tokens, submitting
    + * task/thread is allowed to continue to submit the bio otherwise, bio
    + * is buffered here and submitting thread returns. This buffered bio will
    + * be dispatched to lower layers when cgroup has sufficient tokens.
    + *
    + * Return code:
    + * 0 --> continue submit the bio
    + * 1---> bio buffered by bio group layer. return
    + */
    +int bio_group_controller(struct request_queue *q, struct bio *bio)
    +{
    +
    + struct bio_group *biog;
    + struct bio_cgroup *biocg;
    + int err = 0;
    + unsigned long flags, dispatched = 0;
    +
    + /* This bio has already been subjected to resource constraints.
    + * Let it pass through unconditionally. */
    + if (bio_flagged(bio, BIO_NOBIOGROUP)) {
    + bio->bi_flags &= ~(1UL << BIO_NOBIOGROUP);
    + return 0;
    + }
    +
    + spin_lock_irqsave(q->queue_lock, flags);
    + biocg = bio_cgroup_from_bio(bio);
    + BUG_ON(!biocg);
    +
    + /* If a biog is found, we also take a reference to it */
    + biog = bio_group_from_cgroup(biocg, q);
    + if (!biog) {
    + /* In case of success, returns with reference to biog */
    + biog = create_bio_group(biocg, q);
    + if (!biog) {
    + err = -ENOMEM;
    + goto end_io;
    + }
    + }
    +
    + spin_unlock_irqrestore(q->queue_lock, flags);
    + bio_group_deactivate_timer(biog);
    + spin_lock_irqsave(q->queue_lock, flags);
    +
    + err = account_bio_to_bio_group(biog, bio);
    + if (err)
    + goto end_io;
    +
    + if (!can_biog_do_direct_dispatch(biog)) {
    + add_bio_to_bio_group_queue(biog, bio);
    + goto buffered;
    + }
    +
    + dispatched = __bio_group_try_to_dispatch(biog, bio);
    +
    + if (!dispatched) {
    + add_bio_to_bio_group_queue(biog, bio);
    + goto buffered;
    + }
    +
    + bio_group_put(biog);
    + spin_unlock_irqrestore(q->queue_lock, flags);
    + return 0;
    +
    +buffered:
    + bio_group_put(biog);
    + spin_unlock_irqrestore(q->queue_lock, flags);
    + return 1;
    +end_io:
    + bio_group_put(biog);
    + spin_unlock_irqrestore(q->queue_lock, flags);
    + bio_endio(bio, err);
    + return 1;
    +}
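
     The controller entry point above, bio_group_controller(), is meant to be
     called from the block layer's bio submission path; the actual hook is in
     another patch of this series and is not shown here. The sketch below only
     illustrates the contract documented above (return 0: the caller keeps
     submitting the bio, return 1: the bio was buffered and the submitter
     returns); the wrapper name is made up.

     /*
      * Illustration only, not part of this patch: how a make_request-path
      * caller could use the entry point, also honouring the per-disk
      * /sys/block/<dev>/biogroup enable flag added in this patch.
      */
     static int biogroup_throttle_bio(struct request_queue *q, struct bio *bio)
     {
             if (!blk_queue_bio_group_enabled(q))
                     return 0;       /* controller disabled on this queue */

             /* May buffer the bio; it is resubmitted later by the worker */
             return bio_group_controller(q, bio);
     }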
    Index: linux2/include/linux/bio.h
     ===================================================================
    --- linux2.orig/include/linux/bio.h 2008-11-06 05:27:05.000000000 -0500
    +++ linux2/include/linux/bio.h 2008-11-06 05:27:37.000000000 -0500
    @@ -131,6 +131,7 @@ struct bio {
    #define BIO_BOUNCED 5 /* bio is a bounce bio */
    #define BIO_USER_MAPPED 6 /* contains user pages */
    #define BIO_EOPNOTSUPP 7 /* not supported */
     +#define BIO_NOBIOGROUP 8 /* Don't do bio group control on this bio */
    #define bio_flagged(bio, flag) ((bio)->bi_flags & (1 << (flag)))

    /*
    Index: linux2/block/genhd.c
     ===================================================================
    --- linux2.orig/block/genhd.c 2008-11-06 05:27:05.000000000 -0500
    +++ linux2/block/genhd.c 2008-11-06 05:27:37.000000000 -0500
    @@ -440,6 +440,120 @@ static ssize_t disk_removable_show(struc
    (disk->flags & GENHD_FL_REMOVABLE ? 1 : 0));
    }

    +static ssize_t disk_biogroup_show(struct device *dev,
    + struct device_attribute *attr, char *buf)
    +{
    + struct gendisk *disk = dev_to_disk(dev);
    + struct request_queue *q = disk->queue;
    +
    + return sprintf(buf, "%d\n", blk_queue_bio_group_enabled(q));
    +}
    +
    +static ssize_t disk_biogroup_store(struct device *dev,
    + struct device_attribute *attr,
    + const char *buf, size_t count)
    +{
    + struct gendisk *disk = dev_to_disk(dev);
    + struct request_queue *q = disk->queue;
    + int i = 0;
    +
    + if (count > 0 && sscanf(buf, "%d", &i) > 0) {
    + spin_lock_irq(q->queue_lock);
    + if (i)
    + queue_flag_set(QUEUE_FLAG_BIOG_ENABLED, q);
    + else
    + queue_flag_clear(QUEUE_FLAG_BIOG_ENABLED, q);
    +
    + spin_unlock_irq(q->queue_lock);
    + }
    + return count;
    +}
    +
    +static ssize_t disk_newslice_count_show(struct device *dev,
    + struct device_attribute *attr, char *buf)
    +{
    + struct gendisk *disk = dev_to_disk(dev);
    + struct request_queue *q = disk->queue;
    +
    + return sprintf(buf, "%lu\n", q->newslice_count);
    +}
    +
    +static ssize_t disk_newslice_count_store(struct device *dev,
    + struct device_attribute *attr,
    + const char *buf, size_t count)
    +{
    + struct gendisk *disk = dev_to_disk(dev);
    + struct request_queue *q = disk->queue;
    + unsigned long flags;
    + int i = 0;
    +
    + if (count > 0 && sscanf(buf, "%d", &i) > 0) {
    + spin_lock_irqsave(q->queue_lock, flags);
    + q->newslice_count = i;
    + spin_unlock_irqrestore(q->queue_lock, flags);
    + }
    + return count;
    +}
    +
    +static ssize_t disk_idletime_show(struct device *dev,
    + struct device_attribute *attr, char *buf)
    +{
    + struct gendisk *disk = dev_to_disk(dev);
    + struct request_queue *q = disk->queue;
    +
    + return sprintf(buf, "%lu\n", q->biogroup_idletime);
    +}
    +
    +static ssize_t disk_idletime_store(struct device *dev,
    + struct device_attribute *attr,
    + const char *buf, size_t count)
    +{
    + struct gendisk *disk = dev_to_disk(dev);
    + struct request_queue *q = disk->queue;
    + int i = 0;
    +
    + if (count > 0 && sscanf(buf, "%d", &i) > 0) {
    + spin_lock_irq(q->queue_lock);
    + if (i)
    + q->biogroup_idletime = i;
    + else
    + q->biogroup_idletime = 0;
    +
    + spin_unlock_irq(q->queue_lock);
    + }
    + return count;
    +}
    +
    +static ssize_t disk_deftoken_show(struct device *dev,
    + struct device_attribute *attr, char *buf)
    +{
    + struct gendisk *disk = dev_to_disk(dev);
    + struct request_queue *q = disk->queue;
    +
    + return sprintf(buf, "%lu\n", q->biogroup_deftoken);
    +}
    +
    +static ssize_t disk_deftoken_store(struct device *dev,
    + struct device_attribute *attr,
    + const char *buf, size_t count)
    +{
    + struct gendisk *disk = dev_to_disk(dev);
    + struct request_queue *q = disk->queue;
    + int i = 0;
    +
    + if (count > 0 && sscanf(buf, "%d", &i) > 0) {
    + spin_lock_irq(q->queue_lock);
    + if (i) {
    + if (i > 0x30)
    + q->biogroup_deftoken = i;
    + } else
    + q->biogroup_deftoken = 0;
    +
    + spin_unlock_irq(q->queue_lock);
    + }
    + return count;
    +}
    +
    static ssize_t disk_ro_show(struct device *dev,
    struct device_attribute *attr, char *buf)
    {
    @@ -524,6 +638,10 @@ static DEVICE_ATTR(ro, S_IRUGO, disk_ro_
    static DEVICE_ATTR(size, S_IRUGO, disk_size_show, NULL);
    static DEVICE_ATTR(capability, S_IRUGO, disk_capability_show, NULL);
    static DEVICE_ATTR(stat, S_IRUGO, disk_stat_show, NULL);
    +static DEVICE_ATTR(biogroup, S_IRUGO | S_IWUSR, disk_biogroup_show, disk_biogroup_store);
    +static DEVICE_ATTR(idletime, S_IRUGO | S_IWUSR, disk_idletime_show, disk_idletime_store);
    +static DEVICE_ATTR(deftoken, S_IRUGO | S_IWUSR, disk_deftoken_show, disk_deftoken_store);
    +static DEVICE_ATTR(newslice_count, S_IRUGO | S_IWUSR, disk_newslice_count_show, disk_newslice_count_store);
    #ifdef CONFIG_FAIL_MAKE_REQUEST
    static struct device_attribute dev_attr_fail =
    __ATTR(make-it-fail, S_IRUGO|S_IWUSR, disk_fail_show, disk_fail_store);
    @@ -539,6 +657,10 @@ static struct attribute *disk_attrs[] =
    #ifdef CONFIG_FAIL_MAKE_REQUEST
    &dev_attr_fail.attr,
    #endif
    + &dev_attr_biogroup.attr,
    + &dev_attr_idletime.attr,
    + &dev_attr_deftoken.attr,
    + &dev_attr_newslice_count.attr,
    NULL
    };

    Index: linux2/include/linux/blkdev.h
     ===================================================================
    --- linux2.orig/include/linux/blkdev.h 2008-11-06 05:27:05.000000000 -0500
    +++ linux2/include/linux/blkdev.h 2008-11-06 05:29:51.000000000 -0500
    @@ -289,6 +289,11 @@ struct blk_cmd_filter {
    struct kobject kobj;
    };

    +struct group_rb_root {
    + struct rb_root rb;
    + struct rb_node *left;
    +};
    +
    struct request_queue
    {
    /*
    @@ -298,6 +303,33 @@ struct request_queue
    struct request *last_merge;
    elevator_t *elevator;

    + /* rb-tree which contains all the contending bio groups */
    + struct group_rb_root bio_groups;
    +
    + /* Total number of bio_group currently on the request queue */
    + unsigned long nr_biog;
    + unsigned long current_slice;
    +
    + struct workqueue_struct *biogroup_workqueue;
    + struct delayed_work biogroup_work;
    + unsigned long nr_queued_bio;
    +
    + /* What's the idletime after which a bio group is considered idle and
    + * considered no more contending for the bandwidth. */
    + unsigned long biogroup_idletime;
    + unsigned long biogroup_deftoken;
    +
    + /* Number of biog which can't issue IO because they don't have
     + sufficient tokens */
    + unsigned long nr_biog_blocked;
    +
    + /* Sum of weight of all the cgroups present on this queue */
    + unsigned long total_weight;
    +
    + /* Debug Aid */
    + unsigned long max_key;
    + unsigned long newslice_count;
    +
    /*
    * the queue request freelist, one for reads and one for writes
    */
    @@ -421,6 +453,7 @@ struct request_queue
    #define QUEUE_FLAG_ELVSWITCH 8 /* don't use elevator, just do FIFO */
    #define QUEUE_FLAG_BIDI 9 /* queue supports bidi requests */
    #define QUEUE_FLAG_NOMERGES 10 /* disable merge attempts */
    +#define QUEUE_FLAG_BIOG_ENABLED 11 /* bio group enabled */

    static inline int queue_is_locked(struct request_queue *q)
    {
    @@ -527,6 +560,7 @@ enum {
    #define blk_queue_stopped(q) test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
    #define blk_queue_nomerges(q) test_bit(QUEUE_FLAG_NOMERGES, &(q)->queue_flags)
    #define blk_queue_flushing(q) ((q)->ordseq)
    +#define blk_queue_bio_group_enabled(q) test_bit(QUEUE_FLAG_BIOG_ENABLED, &(q)->queue_flags)

    #define blk_fs_request(rq) ((rq)->cmd_type == REQ_TYPE_FS)
    #define blk_pc_request(rq) ((rq)->cmd_type == REQ_TYPE_BLOCK_PC)
    Index: linux2/block/blk-core.c
     ===================================================================
    --- linux2.orig/block/blk-core.c 2008-11-06 05:27:05.000000000 -0500
    +++ linux2/block/blk-core.c 2008-11-06 05:27:40.000000000 -0500
    @@ -30,6 +30,7 @@
    #include
    #include
    #include
    +#include

    #include "blk.h"

    @@ -502,6 +503,20 @@ struct request_queue *blk_alloc_queue_no
    mutex_init(&q->sysfs_lock);
    spin_lock_init(&q->__queue_lock);

    +#ifdef CONFIG_CGROUP_BIO
    + /* Initialize default idle time */
    + q->biogroup_idletime = DEFAULT_IDLE_PERIOD;
    + q->biogroup_deftoken = DEFAULT_NR_TOKENS;
    +
    + /* Also create biogroup worker threads. It needs to be conditional */
    + if (!bio_cgroup_disabled()) {
    + q->biogroup_workqueue = create_workqueue("biogroup");
    + if (!q->biogroup_workqueue)
    + panic("Failed to create biogroup\n");
    + }
    + INIT_DELAYED_WORK(&q->biogroup_work, blk_biogroup_work);
    +#endif
    +
    return q;
    }
    EXPORT_SYMBOL(blk_alloc_queue_node);
    Index: linux2/include/linux/biocontrol.h
     ===================================================================
    --- linux2.orig/include/linux/biocontrol.h 2008-11-06 05:27:36.000000000 -0500
    +++ linux2/include/linux/biocontrol.h 2008-11-06 05:27:37.000000000 -0500
    @@ -12,6 +12,17 @@
    struct io_context;
    struct block_device;

    +/* what's a good value. starting with 8 ms */
    +#define DEFAULT_IDLE_PERIOD 8
    +/* what's a good value. starting with 2000 */
    +#define DEFAULT_NR_TOKENS 2000
    +
    +struct biog_io {
    + struct bio_group *biog;
    + bio_end_io_t *bi_end_io;
    + void *bi_private;
    +};
    +
    struct bio_cgroup {
    struct cgroup_subsys_state css;
    /* Share/weight of the cgroup */
    @@ -32,6 +43,46 @@ struct bio_cgroup {
    unsigned long nr_token_slices;
    };

    +/*
    + * This object keeps track of a group of bios on a particular request queue.
    + * A cgroup will have one bio_group on each block device request queue it
    + * is doing IO to.
    + */
    +struct bio_group {
    + spinlock_t bio_group_lock;
    +
    + unsigned long bio_group_flags;
    +
    + /* reference counting. use bio_group_get() and bio_group_put() */
    + atomic_t refcnt;
    +
    + /* Pointer to the request queue this bio-group is currently associated
    + * with */
    + struct request_queue *q;
    +
    + /* Pointer to parent bio_cgroup */
    + struct bio_cgroup *biocg;
    +
    + /* bio_groups are connected through a linked list in parent cgroup */
    + struct list_head next;
    +
    + long credit_tokens;
    +
    + /* Node which hangs in per request queue rb tree */
    + struct rb_node rb_node;
    +
    + /* Key to index inside rb-tree rooted at devices's request_queue. */
    + unsigned long key;
    +
    + unsigned long slice_stamp;
    +
    + struct timer_list biog_inactive_timer;
    + unsigned long nr_bio;
    +
    + /* List where buffered bios are queued */
    + struct bio_list bio_queue;
    +};
    +
    static inline int bio_cgroup_disabled(void)
    {
    return bio_cgroup_subsys.disabled;
    @@ -110,6 +161,69 @@ static inline void bio_cgroup_remove_pag
    spin_unlock_irqrestore(&biocg->page_list_lock, flags);
    }

    +static inline void bio_group_get(struct bio_group *biog)
    +{
    + atomic_inc(&biog->refcnt);
    +}
    +
    +static inline void bio_group_put(struct bio_group *biog)
    +{
    + atomic_dec(&biog->refcnt);
    +}
    +
    +#define BIOG_FLAG_TIMER_ACTIVE 0 /* Inactive timer armed status */
    +#define BIOG_FLAG_ON_QUEUE 1 /* If biog is on request queue */
    +#define BIOG_FLAG_BLOCKED 2 /* bio group is blocked */
    +
    +#define bio_group_timer_active(biog) test_bit(BIOG_FLAG_TIMER_ACTIVE, &(biog)->bio_group_flags)
    +#define bio_group_on_queue(biog) test_bit(BIOG_FLAG_ON_QUEUE, &(biog)->bio_group_flags)
    +#define bio_group_blocked(biog) test_bit(BIOG_FLAG_BLOCKED, &(biog)->bio_group_flags)
    +
    +static inline void bio_group_flag_set(unsigned int flag, struct bio_group *biog)
    +{
    + __set_bit(flag, &biog->bio_group_flags);
    +}
    +
    +static inline void bio_group_flag_clear(unsigned int flag,
    + struct bio_group *biog)
    +{
    + __clear_bit(flag, &biog->bio_group_flags);
    +}
    +
    +static inline int bio_group_flag_test_and_clear(unsigned int flag,
    + struct bio_group *biog)
    +{
    + if (test_bit(flag, &biog->bio_group_flags)) {
    + __clear_bit(flag, &biog->bio_group_flags);
    + return 1;
    + }
    +
    + return 0;
    +}
    +
    +static inline int bio_group_flag_test_and_set(unsigned int flag,
    + struct bio_group *biog)
    +{
    + if (!test_bit(flag, &biog->bio_group_flags)) {
    + __set_bit(flag, &biog->bio_group_flags);
    + return 0;
    + }
    +
    + return 1;
    +}
    +
    +static inline int bio_group_empty(struct bio_group *biog)
    +{
    + return !biog->nr_bio;
    +}
    +
    +static inline int bio_group_queued_empty(struct bio_group *biog)
    +{
    + if (bio_list_empty(&biog->bio_queue))
    + return 1;
    + return 0;
    +}
    +
    extern void clear_bio_cgroup(struct page_cgroup *pc);

    extern int bio_group_controller(struct request_queue *q, struct bio *bio);


  3. [patch 2/4] io controller: biocgroup implementation


    o biocgroup functionality.
    o Implemented new controller "bio"
    o Most of it picked from dm-ioband biocgroup implementation patches.

    Signed-off-by: Vivek Goyal

    Index: linux17/include/linux/cgroup_subsys.h
     ===================================================================
    --- linux17.orig/include/linux/cgroup_subsys.h 2008-10-09 18:13:53.000000000 -0400
    +++ linux17/include/linux/cgroup_subsys.h 2008-11-05 18:12:32.000000000 -0500
    @@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)

    /* */

    +#ifdef CONFIG_CGROUP_BIO
    +SUBSYS(bio_cgroup)
    +#endif
    +
    +/* */
    +
    #ifdef CONFIG_CGROUP_DEVICE
    SUBSYS(devices)
    #endif
    Index: linux17/init/Kconfig
     ===================================================================
    --- linux17.orig/init/Kconfig 2008-10-09 18:13:53.000000000 -0400
    +++ linux17/init/Kconfig 2008-11-05 18:12:32.000000000 -0500
    @@ -408,6 +408,13 @@ config CGROUP_MEM_RES_CTLR
    This config option also selects MM_OWNER config option, which
    could in turn add some fork/exit overhead.

    +config CGROUP_BIO
    + bool "Block I/O cgroup subsystem"
    + depends on CGROUP_MEM_RES_CTLR
    + select MM_OWNER
    + help
     + A generic proportional weight IO controller.
    +
    config SYSFS_DEPRECATED
    bool

    Index: linux17/mm/biocontrol.c
     ===================================================================
    --- /dev/null 1970-01-01 00:00:00.000000000 +0000
    +++ linux17/mm/biocontrol.c 2008-11-05 18:12:44.000000000 -0500
    @@ -0,0 +1,409 @@
    +/* biocontrol.c - Block I/O Controller
    + *
    + * Copyright IBM Corporation, 2007
    + * Author Balbir Singh
    + *
    + * Copyright 2007 OpenVZ SWsoft Inc
    + * Author: Pavel Emelianov
    + *
    + * Copyright VA Linux Systems Japan, 2008
    + * Author Hirokazu Takahashi
    + *
    + * Copyright RedHat Inc, 2008
    + * Author Vivek Goyal
    + *
    + * This program is free software; you can redistribute it and/or modify
    + * it under the terms of the GNU General Public License as published by
    + * the Free Software Foundation; either version 2 of the License, or
    + * (at your option) any later version.
    + *
    + * This program is distributed in the hope that it will be useful,
    + * but WITHOUT ANY WARRANTY; without even the implied warranty of
    + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    + * GNU General Public License for more details.
    + */
    +
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +
    +
    +/* return corresponding bio_cgroup object of a cgroup */
    +static inline struct bio_cgroup *cgroup_bio(struct cgroup *cgrp)
    +{
    + return container_of(cgroup_subsys_state(cgrp, bio_cgroup_subsys_id),
    + struct bio_cgroup, css);
    +}
    +
    +static inline void bio_list_add_head(struct bio_list *bl, struct bio *bio)
    +{
    + bio->bi_next = NULL;
    +
    + if (bl->head)
    + bio->bi_next = bl->head;
    + else
    + bl->tail = bio;
    +
    + bl->head = bio;
    +}
    +
    +void __bio_group_queue_bio_head(struct bio_group *biog, struct bio *bio)
    +{
    + bio_list_add_head(&biog->bio_queue, bio);
    +}
    +
    +void bio_group_queue_bio_head(struct bio_group *biog, struct bio *bio)
    +{
    + unsigned long flags;
    +
    + spin_lock_irqsave(&biog->bio_group_lock, flags);
    + __bio_group_queue_bio_head(biog, bio);
    + spin_unlock_irqrestore(&biog->bio_group_lock, flags);
    +}
    +
    +void __bio_group_queue_bio_tail(struct bio_group *biog, struct bio *bio)
    +{
    + bio_list_add(&biog->bio_queue, bio);
    +}
    +
    +void bio_group_queue_bio_tail(struct bio_group *biog, struct bio *bio)
    +{
    + unsigned long flags;
    +
    + spin_lock_irqsave(&biog->bio_group_lock, flags);
    + __bio_group_queue_bio_tail(biog, bio);
    + spin_unlock_irqrestore(&biog->bio_group_lock, flags);
    +}
    +
    +/* Removes first request from the bio-cgroup request list */
    +struct bio* __bio_group_dequeue_bio(struct bio_group *biog)
    +{
    + struct bio *bio = NULL;
    +
    + if (bio_list_empty(&biog->bio_queue))
    + return NULL;
    + bio = bio_list_pop(&biog->bio_queue);
    + return bio;
    +}
    +
    +struct bio* bio_group_dequeue_bio(struct bio_group *biog)
    +{
    + unsigned long flags;
    + struct bio *bio;
    + spin_lock_irqsave(&biog->bio_group_lock, flags);
    + bio = __bio_group_dequeue_bio(biog);
    + spin_unlock_irqrestore(&biog->bio_group_lock, flags);
    + return bio;
    +}
    +
    +/* Traverse through all the active bio_group list of this cgroup and see
    + * if there is an active bio_group for the request queue. */
    +struct bio_group* bio_group_from_cgroup(struct bio_cgroup *biocg,
    + struct request_queue *q)
    +{
    + unsigned long flags;
    + struct bio_group *biog = NULL;
    +
    + spin_lock_irqsave(&biocg->biog_list_lock, flags);
    + if (list_empty(&biocg->bio_group_list))
    + goto out;
    + list_for_each_entry(biog, &biocg->bio_group_list, next) {
    + if (biog->q == q) {
    + bio_group_get(biog);
    + goto out;
    + }
    + }
    +
    + /* did not find biog */
    + spin_unlock_irqrestore(&biocg->biog_list_lock, flags);
    + return NULL;
    +out:
    + spin_unlock_irqrestore(&biocg->biog_list_lock, flags);
    + return biog;
    +}
    +
    +struct bio_cgroup *bio_cgroup_from_bio(struct bio *bio)
    +{
    + struct page_cgroup *pc;
    + struct bio_cgroup *biocg = NULL;
    + struct page *page = bio_iovec_idx(bio, 0)->bv_page;
    +
    + lock_page_cgroup(page);
    + pc = page_get_page_cgroup(page);
    + if (pc)
    + biocg = pc->bio_cgroup;
    + if (!biocg)
     + biocg = bio_cgroup_from_task(rcu_dereference(init_mm.owner));
    + unlock_page_cgroup(page);
    + return biocg;
    +}
    +
    +static struct cgroup_subsys_state * bio_cgroup_create(struct cgroup_subsys *ss,
    + struct cgroup *cgrp)
    +{
    + struct bio_cgroup *biocg;
    + int error;
    +
    + if (!cgrp->parent) {
    + static struct bio_cgroup default_bio_cgroup;
    +
    + biocg = &default_bio_cgroup;
    + } else {
    + biocg = kzalloc(sizeof(*biocg), GFP_KERNEL);
    + if (!biocg) {
    + error = -ENOMEM;
    + goto out;
    + }
    + }
    +
    + /* Bind the cgroup to bio_cgroup object we just created */
    + biocg->css.cgroup = cgrp;
    + spin_lock_init(&biocg->biog_list_lock);
    + spin_lock_init(&biocg->page_list_lock);
    + /* Assign default shares */
    + biocg->shares = 1024;
    + INIT_LIST_HEAD(&biocg->bio_group_list);
    + INIT_LIST_HEAD(&biocg->page_list);
    +
    + return &biocg->css;
    +out:
    + kfree(biocg);
    + return ERR_PTR(error);
    +}
    +
    +void free_biog_elements(struct bio_cgroup *biocg)
    +{
    + unsigned long flags, flags1;
    + struct bio_group *biog = NULL;
    +
    + spin_lock_irqsave(&biocg->biog_list_lock, flags);
    + while (1) {
    + if (list_empty(&biocg->bio_group_list))
    + goto out;
    +
    + list_for_each_entry(biog, &biocg->bio_group_list, next) {
    + spin_lock_irqsave(&biog->bio_group_lock, flags1);
    + if (!atomic_read(&biog->refcnt)) {
    + list_del(&biog->next);
    + BUG_ON(bio_group_on_queue(biog));
    + spin_unlock_irqrestore(&biog->bio_group_lock,
    + flags1);
    + kfree(biog);
    + break;
    + } else {
    + /* Drop the locks and schedule out. */
    + spin_unlock_irqrestore(&biog->bio_group_lock,
    + flags1);
    + spin_unlock_irqrestore(&biocg->biog_list_lock,
    + flags);
    + msleep(1);
    +
    + /* Re-acquire the lock */
    + spin_lock_irqsave(&biocg->biog_list_lock,
    + flags);
    + break;
    + }
    + }
    + }
    +
    +out:
    + spin_unlock_irqrestore(&biocg->biog_list_lock, flags);
    + return;
    +}
    +
    +void free_bio_cgroup(struct bio_cgroup *biocg)
    +{
    + free_biog_elements(biocg);
    +}
    +
    +static void __clear_bio_cgroup(struct page_cgroup *pc)
    +{
    + struct bio_cgroup *biocg = pc->bio_cgroup;
    + pc->bio_cgroup = NULL;
    + /* Respective bio group got deleted hence reference to
    + * bio cgroup removed from page during force empty. But page
     + * is being freed now. Ignore it. */
    + if (!biocg)
    + return;
    + put_bio_cgroup(biocg);
    +}
    +
    +void clear_bio_cgroup(struct page_cgroup *pc)
    +{
    + __clear_bio_cgroup(pc);
    +}
    +
    +#define FORCE_UNCHARGE_BATCH (128)
    +void bio_cgroup_force_empty(struct bio_cgroup *biocg)
    +{
    + struct page_cgroup *pc;
    + struct page *page;
    + int count = FORCE_UNCHARGE_BATCH;
    + struct list_head *list = &biocg->page_list;
    + unsigned long flags;
    +
    + spin_lock_irqsave(&biocg->page_list_lock, flags);
    + while (!list_empty(list)) {
    + pc = list_entry(list->prev, struct page_cgroup, blist);
    + page = pc->page;
    + get_page(page);
    + __bio_cgroup_remove_page(pc);
    + __clear_bio_cgroup(pc);
    + spin_unlock_irqrestore(&biocg->page_list_lock, flags);
    + put_page(page);
    + if (--count <= 0) {
    + count = FORCE_UNCHARGE_BATCH;
    + cond_resched();
    + }
    + spin_lock_irqsave(&biocg->page_list_lock, flags);
    + }
    + spin_unlock_irqrestore(&biocg->page_list_lock, flags);
     + /* Now free up all the bio groups related to this cgroup */
    + free_bio_cgroup(biocg);
    + return;
    +}
    +
    +static void bio_cgroup_pre_destroy(struct cgroup_subsys *ss,
    + struct cgroup *cgrp)
    +{
    + struct bio_cgroup *biocg = cgroup_bio(cgrp);
    + bio_cgroup_force_empty(biocg);
    +}
    +
    +static void bio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
    +{
    + struct bio_cgroup *biocg = cgroup_bio(cgrp);
    + kfree(biocg);
    +}
    +
    +static u64 bio_shares_read(struct cgroup *cgrp, struct cftype *cft)
    +{
    + struct bio_cgroup *biog = cgroup_bio(cgrp);
    +
    + return (u64) biog->shares;
    +}
    +
    +static int bio_shares_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
    +{
    + struct bio_cgroup *biog = cgroup_bio(cgrp);
    +
    + biog->shares = val;
    + return 0;
    +}
    +
    +static u64 bio_aggregate_tokens_read(struct cgroup *cgrp, struct cftype *cft)
    +{
    + struct bio_cgroup *biocg = cgroup_bio(cgrp);
    +
    + return (u64) biocg->aggregate_tokens;
    +}
    +
    +static int bio_aggregate_tokens_write(struct cgroup *cgrp, struct cftype *cft,
    + u64 val)
    +{
    + struct bio_cgroup *biocg = cgroup_bio(cgrp);
    +
    + biocg->aggregate_tokens = val;
    + return 0;
    +}
    +
    +static u64 bio_jiffies_read(struct cgroup *cgrp, struct cftype *cft)
    +{
    + struct bio_cgroup *biocg = cgroup_bio(cgrp);
    +
    + return (u64) biocg->jiffies;
    +}
    +
    +static u64 bio_nr_off_the_tree_read(struct cgroup *cgrp, struct cftype *cft)
    +{
    + struct bio_cgroup *biocg = cgroup_bio(cgrp);
    +
    + return (u64) biocg->nr_off_the_tree;
    +}
    +
    +static int bio_nr_off_the_tree_write(struct cgroup *cgrp, struct cftype *cft,
    + u64 val)
    +{
    + struct bio_cgroup *biocg = cgroup_bio(cgrp);
    +
    + biocg->nr_off_the_tree = val;
    + return 0;
    +}
    +
    +static u64 bio_nr_token_slices_read(struct cgroup *cgrp, struct cftype *cft)
    +{
    + struct bio_cgroup *biocg = cgroup_bio(cgrp);
    +
    + return (u64) biocg->nr_token_slices;
    +}
    +
    +static int bio_nr_token_slices_write(struct cgroup *cgrp,
    + struct cftype *cft, u64 val)
    +{
    + struct bio_cgroup *biocg = cgroup_bio(cgrp);
    +
    + biocg->nr_token_slices = val;
    + return 0;
    +}
    +
    +
    +
    +static struct cftype bio_files[] = {
    + {
    + .name = "shares",
    + .read_u64 = bio_shares_read,
    + .write_u64 = bio_shares_write,
    + },
    + {
    + .name = "aggregate_tokens",
    + .read_u64 = bio_aggregate_tokens_read,
    + .write_u64 = bio_aggregate_tokens_write,
    + },
    + {
    + .name = "jiffies",
    + .read_u64 = bio_jiffies_read,
    + },
    + {
    + .name = "nr_off_the_tree",
    + .read_u64 = bio_nr_off_the_tree_read,
    + .write_u64 = bio_nr_off_the_tree_write,
    + },
    + {
    + .name = "nr_token_slices",
    + .read_u64 = bio_nr_token_slices_read,
    + .write_u64 = bio_nr_token_slices_write,
    + },
    +};
    +
    +static int bio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont)
    +{
    + if (bio_cgroup_disabled())
    + return 0;
    + return cgroup_add_files(cont, ss, bio_files, ARRAY_SIZE(bio_files));
    +}
    +
    +static void bio_cgroup_move_task(struct cgroup_subsys *ss,
    + struct cgroup *cont,
    + struct cgroup *old_cont,
    + struct task_struct *p)
    +{
    + /* do nothing */
    +}
    +
    +
    +struct cgroup_subsys bio_cgroup_subsys = {
    + .name = "bio",
    + .subsys_id = bio_cgroup_subsys_id,
    + .create = bio_cgroup_create,
    + .destroy = bio_cgroup_destroy,
    + .pre_destroy = bio_cgroup_pre_destroy,
    + .populate = bio_cgroup_populate,
    + .attach = bio_cgroup_move_task,
    + .early_init = 0,
    +};
    Index: linux17/include/linux/biocontrol.h
     ===================================================================
    --- /dev/null 1970-01-01 00:00:00.000000000 +0000
    +++ linux17/include/linux/biocontrol.h 2008-11-05 18:12:44.000000000 -0500
    @@ -0,0 +1,174 @@
    +#include
    +#include
    +#include
    +#include
    +#include "../../drivers/md/dm-bio-list.h"
    +
    +#ifndef _LINUX_BIOCONTROL_H
    +#define _LINUX_BIOCONTROL_H
    +
    +#ifdef CONFIG_CGROUP_BIO
    +
    +struct io_context;
    +struct block_device;
    +
    +struct bio_cgroup {
    + struct cgroup_subsys_state css;
    + /* Share/weight of the cgroup */
    + unsigned long shares;
    +
    + /* list of bio-groups associated with this cgroup. */
    + struct list_head bio_group_list;
    + spinlock_t biog_list_lock;
    +
    + /* list of pages associated with this bio cgroup */
    + spinlock_t page_list_lock;
    + struct list_head page_list;
    +
    + /* Debug Aid */
    + unsigned long aggregate_tokens;
    + unsigned long jiffies;
    + unsigned long nr_off_the_tree;
    + unsigned long nr_token_slices;
    +};
    +
    +static inline int bio_cgroup_disabled(void)
    +{
    + return bio_cgroup_subsys.disabled;
    +}
    +
    +static inline struct bio_cgroup *bio_cgroup_from_task(struct task_struct *p)
    +{
    + return container_of(task_subsys_state(p, bio_cgroup_subsys_id),
    + struct bio_cgroup, css);
    +}
    +
    +static inline void get_bio_cgroup(struct bio_cgroup *biocg)
    +{
    + css_get(&biocg->css);
    +}
    +
    +static inline void put_bio_cgroup(struct bio_cgroup *biocg)
    +{
    + css_put(&biocg->css);
    +}
    +
    +static inline void set_bio_cgroup(struct page_cgroup *pc,
    + struct bio_cgroup *biog)
    +{
    + pc->bio_cgroup = biog;
    +}
    +
    +static inline struct bio_cgroup *get_bio_page_cgroup(struct page_cgroup *pc)
    +{
    + struct bio_cgroup *biog = pc->bio_cgroup;
    + get_bio_cgroup(biog);
    + return biog;
    +}
    +
     +/* This should be called in an RCU-protected section. */
    +static inline struct bio_cgroup *mm_get_bio_cgroup(struct mm_struct *mm)
    +{
    + struct bio_cgroup *biog;
    + biog = bio_cgroup_from_task(rcu_dereference(mm->owner));
    + get_bio_cgroup(biog);
    + return biog;
    +}
    +
    +static inline void __bio_cgroup_add_page(struct page_cgroup *pc)
    +{
    + struct bio_cgroup *biocg = pc->bio_cgroup;
    + list_add(&pc->blist, &biocg->page_list);
    +}
    +
    +static inline void bio_cgroup_add_page(struct page_cgroup *pc)
    +{
    + struct bio_cgroup *biocg = pc->bio_cgroup;
    + unsigned long flags;
    + spin_lock_irqsave(&biocg->page_list_lock, flags);
    + __bio_cgroup_add_page(pc);
    + spin_unlock_irqrestore(&biocg->page_list_lock, flags);
    +}
    +
    +static inline void __bio_cgroup_remove_page(struct page_cgroup *pc)
    +{
    + list_del_init(&pc->blist);
    +}
    +
    +static inline void bio_cgroup_remove_page(struct page_cgroup *pc)
    +{
    + struct bio_cgroup *biocg = pc->bio_cgroup;
    + unsigned long flags;
    +
    + /* Respective bio group got deleted hence reference to
    + * bio cgroup removed from page during force empty. But page
     + * is being freed now. Ignore it. */
    + if (!biocg)
    + return;
    + spin_lock_irqsave(&biocg->page_list_lock, flags);
    + __bio_cgroup_remove_page(pc);
    + spin_unlock_irqrestore(&biocg->page_list_lock, flags);
    +}
    +
    +extern void clear_bio_cgroup(struct page_cgroup *pc);
    +
    +extern int bio_group_controller(struct request_queue *q, struct bio *bio);
    +extern void blk_biogroup_work(struct work_struct *work);
    +#else /* CONFIG_CGROUP_BIO */
    +
    +struct bio_cgroup;
    +
    +static inline int bio_cgroup_disabled(void)
    +{
    + return 1;
    +}
    +
    +static inline void get_bio_cgroup(struct bio_cgroup *biocg)
    +{
    +}
    +
    +static inline void put_bio_cgroup(struct bio_cgroup *biocg)
    +{
    +}
    +
    +static inline void set_bio_cgroup(struct page_cgroup *pc,
    + struct bio_cgroup *biog)
    +{
    +}
    +
    +static inline void clear_bio_cgroup(struct page_cgroup *pc)
    +{
    +}
    +
    +static inline struct bio_cgroup *get_bio_page_cgroup(struct page_cgroup *pc)
    +{
    + return NULL;
    +}
    +
    +static inline struct bio_cgroup *mm_get_bio_cgroup(struct mm_struct *mm)
    +{
    + return NULL;
    +}
    +
    +static inline void bio_cgroup_add_page(struct page_cgroup *pc)
    +{
    + return;
    +}
    +
    +static inline void bio_cgroup_remove_page(struct page_cgroup *pc)
    +{
    + return;
    +}
    +
    +static inline int bio_group_controller(struct request_queue *q, struct bio *bio)
    +{
    + return 0;
    +}
    +static inline void blk_biogroup_work(struct work_struct *work)
    +{
    +}
    +
    +
    +#endif /* CONFIG_CGROUP_BIO */
    +
    +#endif /* _LINUX_BIOCONTROL_H */
    Index: linux17/mm/Makefile
     ===================================================================
    --- linux17.orig/mm/Makefile 2008-10-09 18:13:53.000000000 -0400
    +++ linux17/mm/Makefile 2008-11-05 18:12:32.000000000 -0500
    @@ -34,4 +34,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
    obj-$(CONFIG_SMP) += allocpercpu.o
    obj-$(CONFIG_QUICKLIST) += quicklist.o
    obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
    +obj-$(CONFIG_CGROUP_BIO) += biocontrol.o

    Index: linux17/mm/memcontrol.c
     ===================================================================
    --- linux17.orig/mm/memcontrol.c 2008-10-09 18:13:53.000000000 -0400
    +++ linux17/mm/memcontrol.c 2008-11-05 18:12:32.000000000 -0500
    @@ -32,6 +32,7 @@
    #include
    #include
    #include
    +#include

    #include

    @@ -144,30 +145,6 @@ struct mem_cgroup {
    };
    static struct mem_cgroup init_mem_cgroup;

    -/*
    - * We use the lower bit of the page->page_cgroup pointer as a bit spin
    - * lock. We need to ensure that page->page_cgroup is at least two
    - * byte aligned (based on comments from Nick Piggin). But since
    - * bit_spin_lock doesn't actually set that lock bit in a non-debug
    - * uniprocessor kernel, we should avoid setting it here too.
    - */
    -#define PAGE_CGROUP_LOCK_BIT 0x0
    -#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
    -#define PAGE_CGROUP_LOCK (1 << PAGE_CGROUP_LOCK_BIT)
    -#else
    -#define PAGE_CGROUP_LOCK 0x0
    -#endif
    -
    -/*
    - * A page_cgroup page is associated with every page descriptor. The
    - * page_cgroup helps us identify information about the cgroup
    - */
    -struct page_cgroup {
    - struct list_head lru; /* per cgroup LRU list */
    - struct page *page;
    - struct mem_cgroup *mem_cgroup;
    - int flags;
    -};
    #define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
    #define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */

    @@ -278,21 +255,6 @@ struct page_cgroup *page_get_page_cgroup
    return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
    }

    -static void lock_page_cgroup(struct page *page)
    -{
    - bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
    -}
    -
    -static int try_lock_page_cgroup(struct page *page)
    -{
    - return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
    -}
    -
    -static void unlock_page_cgroup(struct page *page)
    -{
    - bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
    -}
    -
    static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
    struct page_cgroup *pc)
    {
    @@ -535,14 +497,15 @@ unsigned long mem_cgroup_isolate_pages(u
    * < 0 if the cgroup is over its limit
    */
    static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
    - gfp_t gfp_mask, enum charge_type ctype,
    - struct mem_cgroup *memcg)
    + gfp_t gfp_mask, enum charge_type ctype,
    + struct mem_cgroup *memcg, struct bio_cgroup *biocg)
    {
    struct mem_cgroup *mem;
    struct page_cgroup *pc;
    unsigned long flags;
    unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
    struct mem_cgroup_per_zone *mz;
    + struct bio_cgroup *biocg_temp;

    pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
    if (unlikely(pc == NULL))
    @@ -572,6 +535,10 @@ static int mem_cgroup_charge_common(stru
    css_get(&memcg->css);
    }

    + rcu_read_lock();
    + biocg_temp = biocg ? biocg : mm_get_bio_cgroup(mm);
    + rcu_read_unlock();
    +
    while (res_counter_charge(&mem->res, PAGE_SIZE)) {
    if (!(gfp_mask & __GFP_WAIT))
    goto out;
    @@ -597,6 +564,7 @@ static int mem_cgroup_charge_common(stru

    pc->mem_cgroup = mem;
    pc->page = page;
    + set_bio_cgroup(pc, biocg_temp);
    /*
    * If a page is accounted as a page cache, insert to inactive list.
    * If anon, insert to active list.
    @@ -611,21 +579,22 @@ static int mem_cgroup_charge_common(stru
    unlock_page_cgroup(page);
    res_counter_uncharge(&mem->res, PAGE_SIZE);
    css_put(&mem->css);
    + clear_bio_cgroup(pc);
    kmem_cache_free(page_cgroup_cache, pc);
    goto done;
    }
    page_assign_page_cgroup(page, pc);
    -
    mz = page_cgroup_zoneinfo(pc);
    spin_lock_irqsave(&mz->lru_lock, flags);
    __mem_cgroup_add_list(mz, pc);
    spin_unlock_irqrestore(&mz->lru_lock, flags);
    -
    + bio_cgroup_add_page(pc);
    unlock_page_cgroup(page);
    done:
    return 0;
    out:
    css_put(&mem->css);
    + put_bio_cgroup(biocg_temp);
    kmem_cache_free(page_cgroup_cache, pc);
    err:
    return -ENOMEM;
    @@ -648,7 +617,7 @@ int mem_cgroup_charge(struct page *page,
    if (unlikely(!mm))
    mm = &init_mm;
    return mem_cgroup_charge_common(page, mm, gfp_mask,
    - MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
    + MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL, NULL);
    }

    int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
    @@ -684,7 +653,7 @@ int mem_cgroup_cache_charge(struct page
    mm = &init_mm;

    return mem_cgroup_charge_common(page, mm, gfp_mask,
    - MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
    + MEM_CGROUP_CHARGE_TYPE_CACHE, NULL, NULL);
    }

    /*
    @@ -720,14 +689,14 @@ __mem_cgroup_uncharge_common(struct page
    spin_lock_irqsave(&mz->lru_lock, flags);
    __mem_cgroup_remove_list(mz, pc);
    spin_unlock_irqrestore(&mz->lru_lock, flags);
    -
    + bio_cgroup_remove_page(pc);
    page_assign_page_cgroup(page, NULL);
    unlock_page_cgroup(page);

    mem = pc->mem_cgroup;
    res_counter_uncharge(&mem->res, PAGE_SIZE);
    css_put(&mem->css);
    -
    + clear_bio_cgroup(pc);
    kmem_cache_free(page_cgroup_cache, pc);
    return;
    unlock:
    @@ -754,6 +723,7 @@ int mem_cgroup_prepare_migration(struct
    struct mem_cgroup *mem = NULL;
    enum charge_type ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
    int ret = 0;
    + struct bio_cgroup *biocg = NULL;

    if (mem_cgroup_subsys.disabled)
    return 0;
    @@ -765,12 +735,15 @@ int mem_cgroup_prepare_migration(struct
    css_get(&mem->css);
    if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
    ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
    + biocg = get_bio_page_cgroup(pc);
    }
    unlock_page_cgroup(page);
    if (mem) {
    ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
    - ctype, mem);
    + ctype, mem, biocg);
    css_put(&mem->css);
    + if (biocg)
    + put_bio_cgroup(biocg);
    }
    return ret;
    }
    Index: linux17/include/linux/memcontrol.h
    ===================================================================
    --- linux17.orig/include/linux/memcontrol.h 2008-10-09 18:13:53.000000000 -0400
    +++ linux17/include/linux/memcontrol.h 2008-11-05 18:12:32.000000000 -0500
    @@ -17,16 +17,47 @@
    * GNU General Public License for more details.
    */

    +#include
    +#include
    +
    #ifndef _LINUX_MEMCONTROL_H
    #define _LINUX_MEMCONTROL_H

    struct mem_cgroup;
    -struct page_cgroup;
    struct page;
    struct mm_struct;

    #ifdef CONFIG_CGROUP_MEM_RES_CTLR

    +/*
    + * We use the lower bit of the page->page_cgroup pointer as a bit spin
    + * lock. We need to ensure that page->page_cgroup is at least two
    + * byte aligned (based on comments from Nick Piggin). But since
    + * bit_spin_lock doesn't actually set that lock bit in a non-debug
    + * uniprocessor kernel, we should avoid setting it here too.
    + */
    +#define PAGE_CGROUP_LOCK_BIT 0x0
    +#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
    +#define PAGE_CGROUP_LOCK (1 << PAGE_CGROUP_LOCK_BIT)
    +#else
    +#define PAGE_CGROUP_LOCK 0x0
    +#endif
    +
    +/*
    + * A page_cgroup page is associated with every page descriptor. The
    + * page_cgroup helps us identify information about the cgroup
    + */
    +struct page_cgroup {
    + struct list_head lru; /* per cgroup LRU list */
    + struct page *page;
    + struct mem_cgroup *mem_cgroup;
    + int flags;
    +#ifdef CONFIG_CGROUP_BIO
    + struct list_head blist; /* for bio_cgroup page list */
    + struct bio_cgroup *bio_cgroup;
    +#endif
    +};
    +
    #define page_reset_bad_cgroup(page) ((page)->page_cgroup = 0)

    extern struct page_cgroup *page_get_page_cgroup(struct page *page);
    @@ -74,6 +105,20 @@ extern long mem_cgroup_calc_reclaim_acti
    extern long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem,
    struct zone *zone, int priority);

    +static inline void lock_page_cgroup(struct page *page)
    +{
    + bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
    +}
    +
    +static inline int try_lock_page_cgroup(struct page *page)
    +{
    + return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
    +}
    +
    +static inline void unlock_page_cgroup(struct page *page)
    +{
    + bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
    +}
    #else /* CONFIG_CGROUP_MEM_RES_CTLR */
    static inline void page_reset_bad_cgroup(struct page *page)
    {


  4. Re: [patch 0/4] [RFC] Another proportional weight IO controller

    On Thu, 2008-11-06 at 10:30 -0500, vgoyal@redhat.com wrote:
    > Hi,
    >
    > If you are not already tired of so many io controller implementations, here
    > is another one.
    >
    > This is a very eary very crude implementation to get early feedback to see
    > if this approach makes any sense or not.
    >
    > This controller is a proportional weight IO controller primarily
    > based on/inspired by dm-ioband. One of the things I personally found little
    > odd about dm-ioband was need of a dm-ioband device for every device we want
    > to control. I thought that probably we can make this control per request
    > queue and get rid of device mapper driver. This should make configuration
    > aspect easy.
    >
    > I have picked up quite some amount of code from dm-ioband especially for
    > biocgroup implementation.
    >
    > I have done very basic testing and that is running 2-3 dd commands in different
    > cgroups on x86_64. Wanted to throw out the code early to get some feedback.
    >
    > More details about the design and how to are in documentation patch.
    >
    > Your comments are welcome.


    please include

    QUILT_REFRESH_ARGS="--diffstat --strip-trailing-whitespace"

    in your environment or .quiltrc

    I would expect all those bio* files to be placed in block/ not mm/

    Does this still require I use dm, or does it also work on regular block
    devices? Patch 4/4 isn't quite clear on this.

  5. Re: [patch 0/4] [RFC] Another proportional weight IO controller

    On Thu, Nov 06, 2008 at 04:49:53PM +0100, Peter Zijlstra wrote:
    > On Thu, 2008-11-06 at 10:30 -0500, vgoyal@redhat.com wrote:
    > > Hi,
    > >
    > > If you are not already tired of so many io controller implementations, here
    > > is another one.
    > >
    > > This is a very eary very crude implementation to get early feedback to see
    > > if this approach makes any sense or not.
    > >
    > > This controller is a proportional weight IO controller primarily
    > > based on/inspired by dm-ioband. One of the things I personally found little
    > > odd about dm-ioband was need of a dm-ioband device for every device we want
    > > to control. I thought that probably we can make this control per request
    > > queue and get rid of device mapper driver. This should make configuration
    > > aspect easy.
    > >
    > > I have picked up quite some amount of code from dm-ioband especially for
    > > biocgroup implementation.
    > >
    > > I have done very basic testing and that is running 2-3 dd commands in different
    > > cgroups on x86_64. Wanted to throw out the code early to get some feedback.
    > >
    > > More details about the design and how to are in documentation patch.
    > >
    > > Your comments are welcome.

    >
    > please include
    >
    > QUILT_REFRESH_ARGS="--diffstat --strip-trailing-whitespace"
    >
    > in your environment or .quiltrc
    >


    Sure, I will do that. I am a first-time user of quilt. :-)

    > I would expect all those bio* files to be placed in block/ not mm/
    >


    Thinking more about it, block/ is probably the more appropriate place.
    I will do that.

    > Does this still require I use dm, or does it also work on regular block
    > devices? Patch 4/4 isn't quite clear on this.


    No, you don't have to use dm. It simply works on regular block devices. We
    will have to add a few lines of code for it to work on devices which don't
    use the standard __make_request() function and instead provide their own
    make_request function.

    For example, I have added those few lines of code so that it can work
    with dm devices. I will have to do something similar for md too.

    That said, I am not very sure why we need to do IO control on higher level
    devices at all. Would it be sufficient if we controlled only the bottom-most
    physical block devices?

    Anyway, this approach should work at any level.
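
    To make those "few lines of code" concrete, here is a minimal sketch of the
    kind of hook involved. It reuses the bio_group_controller() stub from patch
    3/4, but the call site, the return-value convention and the function name
    my_make_request() are illustrative assumptions, not the patch's actual code:

        /*
         * Illustrative sketch only: a hook called early in a queue's
         * make_request path.  A driver that supplies its own
         * make_request function would need to add a similar call
         * itself; that is the handful of lines referred to above.
         */
        static int my_make_request(struct request_queue *q, struct bio *bio)
        {
                /*
                 * Assumed convention: a non-zero return means the bio was
                 * queued by the bio group controller for later dispatch.
                 */
                if (bio_group_controller(q, bio))
                        return 0;

                /* ... normal request building and dispatch continues ... */
                return 0;
        }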

    Thanks
    Vivek

  6. Re: [patch 0/4] [RFC] Another proportional weight IO controller

    On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote:

    > > Does this still require I use dm, or does it also work on regular block
    > > devices? Patch 4/4 isn't quite clear on this.

    >
    > No. You don't have to use dm. It will simply work on regular devices. We
    > shall have to put few lines of code for it to work on devices which don't
    > make use of standard __make_request() function and provide their own
    > make_request function.
    >
    > Hence for example, I have put that few lines of code so that it can work
    > with dm device. I shall have to do something similar for md too.
    >
    > Though, I am not very sure why do I need to do IO control on higher level
    > devices. Will it be sufficient if we just control only bottom most
    > physical block devices?
    >
    > Anyway, this approach should work at any level.


    Nice, although I would think only doing the higher level devices makes
    more sense than only doing the leaves.

    Is there any reason we cannot merge this with the regular io-scheduler
    interface? afaik the only problem with doing group scheduling in the
    io-schedulers is the stacked devices issue.

    Could we make the io-schedulers aware of this hierarchy?

  7. Re: [patch 0/4] [RFC] Another proportional weight IO controller

    On Thu, Nov 06, 2008 at 05:16:13PM +0100, Peter Zijlstra wrote:
    > On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote:
    >
    > > > Does this still require I use dm, or does it also work on regular block
    > > > devices? Patch 4/4 isn't quite clear on this.

    > >
    > > No. You don't have to use dm. It will simply work on regular devices. We
    > > shall have to put few lines of code for it to work on devices which don't
    > > make use of standard __make_request() function and provide their own
    > > make_request function.
    > >
    > > Hence for example, I have put that few lines of code so that it can work
    > > with dm device. I shall have to do something similar for md too.
    > >
    > > Though, I am not very sure why do I need to do IO control on higher level
    > > devices. Will it be sufficient if we just control only bottom most
    > > physical block devices?
    > >
    > > Anyway, this approach should work at any level.

    >
    > Nice, although I would think only doing the higher level devices makes
    > more sense than only doing the leafs.
    >


    I thought that we should be doing any kind of resource management only at
    the level where there is actual contention for the resources. In this case it
    looks like only the bottom-most devices are slow and don't have infinite
    bandwidth, hence the contention. (I am not taking into account contention at
    the bus level or at the interconnect level for external storage, assuming the
    interconnect is not the bottleneck.)

    For example, let's say there is one linear device mapper device dm-0 on
    top of physical devices sda and sdb, and two tasks in two different
    cgroups are reading two different files from device dm-0. Now if these
    files both fall on the same physical device (either sda or sdb), then they
    will be contending for resources. But if the files being read are on different
    physical devices, then practically there is no device contention (even if on
    the surface it might look like dm-0 is being contended for). If the files
    are on different physical devices, the IO controller will not know it.
    It will simply dispatch one group at a time and the other device might remain
    idle.

    Keeping that in mind, I thought we would be able to make use of the full
    available bandwidth if we do IO control only at the bottom-most device. Doing
    it at a higher layer risks not making use of the full available bandwidth.
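
    As a rough illustration of what proportional dispatch at the bottom-most
    queue could look like, here is a sketch of weight-based token accounting.
    The structure, field names and arithmetic are made up for this example and
    are not taken from the patch:

        /*
         * Illustrative sketch only: each group attached to a request
         * queue earns dispatch tokens in proportion to its weight; a
         * group may dispatch bios only while it has tokens left.
         */
        struct example_group {
                unsigned int weight;    /* configured cgroup weight  */
                long tokens;            /* remaining dispatch credit */
        };

        static void example_refill_tokens(struct example_group *grp, int nr,
                                          long tokens_per_round)
        {
                unsigned int total_weight = 0;
                int i;

                for (i = 0; i < nr; i++)
                        total_weight += grp[i].weight;
                if (!total_weight)
                        return;

                for (i = 0; i < nr; i++)
                        grp[i].tokens += tokens_per_round * grp[i].weight
                                                / total_weight;
        }

    Because the accounting lives on each physical queue, a group that touches
    only sda never holds back a group that touches only sdb, which is the point
    being made above.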

    > Is there any reason we cannot merge this with the regular io-scheduler
    > interface? afaik the only problem with doing group scheduling in the
    > io-schedulers is the stacked devices issue.


    I think we should be able to merge it with the regular io schedulers. Apart
    from the stacked device issue, people also mentioned that it is so closely
    tied to the IO schedulers that we would end up doing four implementations for
    four schedulers, and that is not very good from a maintenance perspective.

    But I will spend more time finding out whether there is common ground
    between the schedulers so that a lot of common IO control code can be shared
    by all of them.

    >
    > Could we make the io-schedulers aware of this hierarchy?


    You mean the IO schedulers knowing that there is somebody above them doing
    proportional weight dispatching of bios? If so, how would that help?

    Thanks
    Vivek

  8. Re: [patch 0/4] [RFC] Another proportional weight IO controller

    Peter Zijlstra wrote:

    > Nice, although I would think only doing the higher level devices makes
    > more sense than only doing the leafs.


    I'm not convinced.

    Say that you have two resource groups on a bunch of LVM
    volumes across two disks.

    If one of the resource groups only sends requests to one
    of the disks, the other resource group should be able to
    get all of its requests through immediately at the other
    disk.

    Holding up the second resource group's requests could
    result in a disk being idle. Worse, once that cgroup's
    requests finally make it through, the other cgroup might
    also want to use the disk and they both get slowed down.

    When a resource is uncontended, should a potential user
    be made to wait?

    --
    All rights reversed.

  9. Re: [patch 0/4] [RFC] Another proportional weight IO controller

    Peter Zijlstra wrote:

    > The only real issue I can see is with linear volumes, but those are
    > stupid anyway - non of the gains but all the risks.


    Linear volumes may well be the most common ones.

    People start out with the filesystems at a certain size,
    increasing onto a second (new) disk later, when more space
    is required.

    --
    All rights reversed.

  10. Re: [patch 0/4] [RFC] Another proportional weight IO controller

    On Thu, 2008-11-06 at 11:39 -0500, Vivek Goyal wrote:
    > On Thu, Nov 06, 2008 at 05:16:13PM +0100, Peter Zijlstra wrote:
    > > On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote:
    > >
    > > > > Does this still require I use dm, or does it also work on regular block
    > > > > devices? Patch 4/4 isn't quite clear on this.
    > > >
    > > > No. You don't have to use dm. It will simply work on regular devices. We
    > > > shall have to put few lines of code for it to work on devices which don't
    > > > make use of standard __make_request() function and provide their own
    > > > make_request function.
    > > >
    > > > Hence for example, I have put that few lines of code so that it can work
    > > > with dm device. I shall have to do something similar for md too.
    > > >
    > > > Though, I am not very sure why do I need to do IO control on higher level
    > > > devices. Will it be sufficient if we just control only bottom most
    > > > physical block devices?
    > > >
    > > > Anyway, this approach should work at any level.

    > >
    > > Nice, although I would think only doing the higher level devices makes
    > > more sense than only doing the leafs.
    > >

    >
    > I thought that we should be doing any kind of resource management only at
    > the level where there is actual contention for the resources.So in this case
    > looks like only bottom most devices are slow and don't have infinite bandwidth
    > hence the contention.(I am not taking into account the contention at
    > bus level or contention at interconnect level for external storage,
    > assuming interconnect is not the bottleneck).
    >
    > For example, lets say there is one linear device mapper device dm-0 on
    > top of physical devices sda and sdb. Assuming two tasks in two different
    > cgroups are reading two different files from deivce dm-0. Now if these
    > files both fall on same physical device (either sda or sdb), then they
    > will be contending for resources. But if files being read are on different
    > physical deivces then practically there is no device contention (Even on
    > the surface it might look like that dm-0 is being contended for). So if
    > files are on different physical devices, IO controller will not know it.
    > He will simply dispatch one group at a time and other device might remain
    > idle.
    >
    > Keeping that in mind I thought we will be able to make use of full
    > available bandwidth if we do IO control only at bottom most device. Doing
    > it at higher layer has potential of not making use of full available bandwidth.
    >
    > > Is there any reason we cannot merge this with the regular io-scheduler
    > > interface? afaik the only problem with doing group scheduling in the
    > > io-schedulers is the stacked devices issue.

    >
    > I think we should be able to merge it with regular io schedulers. Apart
    > from stacked device issue, people also mentioned that it is so closely
    > tied to IO schedulers that we will end up doing four implementations for
    > four schedulers and that is not very good from maintenance perspective.
    >
    > But I will spend more time in finding out if there is a common ground
    > between schedulers so that a lot of common IO control code can be used
    > in all the schedulers.
    >
    > >
    > > Could we make the io-schedulers aware of this hierarchy?

    >
    > You mean IO schedulers knowing that there is somebody above them doing
    > proportional weight dispatching of bios? If yes, how would that help?


    Well, take the slightly more elaborate example of a raid[56] setup. This
    will sometimes need to issue multiple leaf level ios to satisfy one top
    level io (a single RAID5 write, for instance, may require reading the old
    data and parity and then writing new data and parity on several member
    disks).

    How are you going to attribute this fairly?

    I don't think the bandwidth availability issue above will really be a
    problem: if your stripe is set up symmetrically, the contention should
    average out across both (all) disks in equal measure.

    The only real issue I can see is with linear volumes, but those are
    stupid anyway - none of the gains but all of the risks.

  11. Re: [patch 0/4] [RFC] Another proportional weight IO controller

    On Thu, 2008-11-06 at 11:57 -0500, Rik van Riel wrote:
    > Peter Zijlstra wrote:
    >
    > > The only real issue I can see is with linear volumes, but those are
    > > stupid anyway - non of the gains but all the risks.

    >
    > Linear volumes may well be the most common ones.
    >
    > People start out with the filesystems at a certain size,
    > increasing onto a second (new) disk later, when more space
    > is required.


    Are they aware of how risky linear volumes are? I would discourage
    anyone from using them.

  12. Re: [patch 0/4] [RFC] Another proportional weight IO controller

    On Thu, Nov 06, 2008 at 05:52:07PM +0100, Peter Zijlstra wrote:
    > On Thu, 2008-11-06 at 11:39 -0500, Vivek Goyal wrote:
    > > On Thu, Nov 06, 2008 at 05:16:13PM +0100, Peter Zijlstra wrote:
    > > > On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote:
    > > >
    > > > > > Does this still require I use dm, or does it also work on regular block
    > > > > > devices? Patch 4/4 isn't quite clear on this.
    > > > >
    > > > > No. You don't have to use dm. It will simply work on regular devices. We
    > > > > shall have to put few lines of code for it to work on devices which don't
    > > > > make use of standard __make_request() function and provide their own
    > > > > make_request function.
    > > > >
    > > > > Hence for example, I have put that few lines of code so that it can work
    > > > > with dm device. I shall have to do something similar for md too.
    > > > >
    > > > > Though, I am not very sure why do I need to do IO control on higher level
    > > > > devices. Will it be sufficient if we just control only bottom most
    > > > > physical block devices?
    > > > >
    > > > > Anyway, this approach should work at any level.
    > > >
    > > > Nice, although I would think only doing the higher level devices makes
    > > > more sense than only doing the leafs.
    > > >

    > >
    > > I thought that we should be doing any kind of resource management only at
    > > the level where there is actual contention for the resources.So in this case
    > > looks like only bottom most devices are slow and don't have infinite bandwidth
    > > hence the contention.(I am not taking into account the contention at
    > > bus level or contention at interconnect level for external storage,
    > > assuming interconnect is not the bottleneck).
    > >
    > > For example, lets say there is one linear device mapper device dm-0 on
    > > top of physical devices sda and sdb. Assuming two tasks in two different
    > > cgroups are reading two different files from deivce dm-0. Now if these
    > > files both fall on same physical device (either sda or sdb), then they
    > > will be contending for resources. But if files being read are on different
    > > physical deivces then practically there is no device contention (Even on
    > > the surface it might look like that dm-0 is being contended for). So if
    > > files are on different physical devices, IO controller will not know it.
    > > He will simply dispatch one group at a time and other device might remain
    > > idle.
    > >
    > > Keeping that in mind I thought we will be able to make use of full
    > > available bandwidth if we do IO control only at bottom most device. Doing
    > > it at higher layer has potential of not making use of full available bandwidth.
    > >
    > > > Is there any reason we cannot merge this with the regular io-scheduler
    > > > interface? afaik the only problem with doing group scheduling in the
    > > > io-schedulers is the stacked devices issue.

    > >
    > > I think we should be able to merge it with regular io schedulers. Apart
    > > from stacked device issue, people also mentioned that it is so closely
    > > tied to IO schedulers that we will end up doing four implementations for
    > > four schedulers and that is not very good from maintenance perspective.
    > >
    > > But I will spend more time in finding out if there is a common ground
    > > between schedulers so that a lot of common IO control code can be used
    > > in all the schedulers.
    > >
    > > >
    > > > Could we make the io-schedulers aware of this hierarchy?

    > >
    > > You mean IO schedulers knowing that there is somebody above them doing
    > > proportional weight dispatching of bios? If yes, how would that help?

    >
    > Well, take the slightly more elaborate example or a raid[56] setup. This
    > will need to sometimes issue multiple leaf level ios to satisfy one top
    > level io.
    >
    > How are you going to attribute this fairly?
    >


    I think in this case the definition of fair allocation will be a little
    different. We will do fair allocation only at the leaf nodes where there is
    actual contention, irrespective of the higher level setup.

    So if a higher level block device issues multiple ios to satisfy one top
    level io, we will actually do the bandwidth allocation only on those
    multiple ios, because that is the real IO contending for disk bandwidth.
    And if these multiple ios are going to different physical devices, then
    contention management will take place on those devices.

    IOW, we will not worry about providing fairness for bios submitted to
    higher level devices. We will step in for contention management only when
    requests from various cgroups are contending for a physical device at the
    bottom-most layer. Isn't that fair?

    Thanks
    Vivek

    > I don't think the issue of bandwidth availability like above will really
    > be an issue, if your stripe is set up symmetrically, the contention
    > should average out to both (all) disks in equal measures.
    >
    > The only real issue I can see is with linear volumes, but those are
    > stupid anyway - non of the gains but all the risks.


  13. Re: [patch 0/4] [RFC] Another proportional weight IO controller

    It seems that approaches with two level scheduling (DM-IOBand or this
    patch set on top, and another scheduler at the elevator level) open up the
    possibility of undesirable interactions (see the "issues" listed at the
    end of the second patch). For example, a request submitted as RT might
    get delayed at the higher layers, even if cfq at the elevator level is doing
    the right thing.

    Moreover, if the requests in the higher level scheduler are dispatched
    as soon as they come, there would be no queuing at the higher layers,
    unless the request queue at the lower level fills up and causes a
    backlog. And in the absence of queuing, any work-conserving scheduler
    would behave as a no-op scheduler.

    These issues motivate taking a second look at two level scheduling.
    The main motivations for two level scheduling seem to be:
    (1) Support bandwidth division across multiple devices for RAID and LVMs.
    (2) Divide bandwidth between different cgroups without modifying each
    of the existing schedulers (and without replicating the code).

    One possible approach to handle (1) is to keep track of bandwidth
    utilized by each cgroup in a per cgroup data structure (instead of a
    per cgroup per device data structure) and use that information to make
    scheduling decisions within the elevator level schedulers. Such a
    patch can be made flag-disabled if co-ordination across different
    device schedulers is not required.
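
    A rough sketch of what such a per-cgroup record could look like; the
    structure and field names are purely illustrative and not from any posted
    patch:

        /*
         * Illustrative sketch only: one record per cgroup, shared by
         * the elevator-level schedulers of all devices, so a scheduling
         * decision can consider the cgroup's aggregate usage.
         */
        struct example_cgroup_io_stats {
                spinlock_t lock;
                unsigned int weight;            /* configured share     */
                u64 sectors_dispatched;         /* total across devices */
                u64 service_ns;                 /* device time consumed */
        };

        static void example_account_io(struct example_cgroup_io_stats *st,
                                       unsigned int sectors, u64 ns)
        {
                spin_lock(&st->lock);
                st->sectors_dispatched += sectors;
                st->service_ns += ns;
                spin_unlock(&st->lock);
        }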

    And (2) can probably be handled by having one scheduler support
    different modes. For example, one possible mode is "proportional
    division between cgroups + no-op between threads of a cgroup" or "cfq
    between cgroups + cfq between threads of a cgroup". That would also
    help avoid combinations which might not work, e.g. the RT request issue
    mentioned earlier in this email. And this unified scheduler can re-use
    code from all the existing patches.
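
    For (2), the "modes" could boil down to a simple per-scheduler policy
    switch; the names below are illustrative only:

        /* Illustrative sketch only: policy selector for a unified scheduler. */
        enum example_group_policy {
                EXAMPLE_PROP_CGROUP_NOOP_TASK,  /* proportional between cgroups,
                                                   no-op within a cgroup */
                EXAMPLE_CFQ_CGROUP_CFQ_TASK,    /* cfq between cgroups,
                                                   cfq within a cgroup  */
        };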

    Thanks.
    --
    Nauman

    On Thu, Nov 6, 2008 at 9:08 AM, Vivek Goyal wrote:
    > On Thu, Nov 06, 2008 at 05:52:07PM +0100, Peter Zijlstra wrote:
    >> On Thu, 2008-11-06 at 11:39 -0500, Vivek Goyal wrote:
    >> > On Thu, Nov 06, 2008 at 05:16:13PM +0100, Peter Zijlstra wrote:
    >> > > On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote:
    >> > >
    >> > > > > Does this still require I use dm, or does it also work on regular block
    >> > > > > devices? Patch 4/4 isn't quite clear on this.
    >> > > >
    >> > > > No. You don't have to use dm. It will simply work on regular devices. We
    >> > > > shall have to put few lines of code for it to work on devices which don't
    >> > > > make use of standard __make_request() function and provide their own
    >> > > > make_request function.
    >> > > >
    >> > > > Hence for example, I have put that few lines of code so that it can work
    >> > > > with dm device. I shall have to do something similar for md too.
    >> > > >
    >> > > > Though, I am not very sure why do I need to do IO control on higher level
    >> > > > devices. Will it be sufficient if we just control only bottom most
    >> > > > physical block devices?
    >> > > >
    >> > > > Anyway, this approach should work at any level.
    >> > >
    >> > > Nice, although I would think only doing the higher level devices makes
    >> > > more sense than only doing the leafs.
    >> > >
    >> >
    >> > I thought that we should be doing any kind of resource management only at
    >> > the level where there is actual contention for the resources.So in this case
    >> > looks like only bottom most devices are slow and don't have infinite bandwidth
    >> > hence the contention.(I am not taking into account the contention at
    >> > bus level or contention at interconnect level for external storage,
    >> > assuming interconnect is not the bottleneck).
    >> >
    >> > For example, lets say there is one linear device mapper device dm-0 on
    >> > top of physical devices sda and sdb. Assuming two tasks in two different
    >> > cgroups are reading two different files from deivce dm-0. Now if these
    >> > files both fall on same physical device (either sda or sdb), then they
    >> > will be contending for resources. But if files being read are on different
    >> > physical deivces then practically there is no device contention (Even on
    >> > the surface it might look like that dm-0 is being contended for). So if
    >> > files are on different physical devices, IO controller will not know it.
    >> > He will simply dispatch one group at a time and other device might remain
    >> > idle.
    >> >
    >> > Keeping that in mind I thought we will be able to make use of full
    >> > available bandwidth if we do IO control only at bottom most device. Doing
    >> > it at higher layer has potential of not making use of full available bandwidth.
    >> >
    >> > > Is there any reason we cannot merge this with the regular io-scheduler
    >> > > interface? afaik the only problem with doing group scheduling in the
    >> > > io-schedulers is the stacked devices issue.
    >> >
    >> > I think we should be able to merge it with regular io schedulers. Apart
    >> > from stacked device issue, people also mentioned that it is so closely
    >> > tied to IO schedulers that we will end up doing four implementations for
    >> > four schedulers and that is not very good from maintenance perspective.
    >> >
    >> > But I will spend more time in finding out if there is a common ground
    >> > between schedulers so that a lot of common IO control code can be used
    >> > in all the schedulers.
    >> >
    >> > >
    >> > > Could we make the io-schedulers aware of this hierarchy?
    >> >
    >> > You mean IO schedulers knowing that there is somebody above them doing
    >> > proportional weight dispatching of bios? If yes, how would that help?

    >>
    >> Well, take the slightly more elaborate example or a raid[56] setup. This
    >> will need to sometimes issue multiple leaf level ios to satisfy one top
    >> level io.
    >>
    >> How are you going to attribute this fairly?
    >>

    >
    > I think in this case, definition of fair allocation will be little
    > different. We will do fair allocation only at the leaf nodes where
    > there is actual contention, irrespective of higher level setup.
    >
    > So if higher level block device issues multiple ios to satisfy one top
    > level io, we will actually do the bandwidth allocation only on
    > those multiple ios because that's the real IO contending for disk
    > bandwidth. And if these multiple ios are going to different physical
    > devices, then contention management will take place on those devices.
    >
    > IOW, we will not worry about providing fairness at bios submitted to
    > higher level devices. We will just pitch in for contention management
    > only when request from various cgroups are contending for physical
    > device at bottom most layers. Isn't if fair?
    >
    > Thanks
    > Vivek
    >
    >> I don't think the issue of bandwidth availability like above will really
    >> be an issue, if your stripe is set up symmetrically, the contention
    >> should average out to both (all) disks in equal measures.
    >>
    >> The only real issue I can see is with linear volumes, but those are
    >> stupid anyway - non of the gains but all the risks.


  14. Re: [patch 0/4] [RFC] Another proportional weight IO controller

    On Thu, Nov 06, 2008 at 06:11:27PM +0100, Peter Zijlstra wrote:
    > On Thu, 2008-11-06 at 11:57 -0500, Rik van Riel wrote:
    > > Peter Zijlstra wrote:
    > >
    > > > The only real issue I can see is with linear volumes, but those are
    > > > stupid anyway - non of the gains but all the risks.

    > >
    > > Linear volumes may well be the most common ones.
    > >
    > > People start out with the filesystems at a certain size,
    > > increasing onto a second (new) disk later, when more space
    > > is required.

    >
    > Are they aware of how risky linear volumes are? I would discourage
    > anyone from using them.


    In what way are they risky?

    Cheers,

    Dave.
    --
    Dave Chinner
    david@fromorbit.com

  15. Re: [patch 0/4] [RFC] Another proportional weight IO controller

    vgoyal@redhat.com wrote:
    > Hi,
    >
    > If you are not already tired of so many io controller implementations, here
    > is another one.
    >
    > This is a very eary very crude implementation to get early feedback to see
    > if this approach makes any sense or not.
    >
    > This controller is a proportional weight IO controller primarily
    > based on/inspired by dm-ioband. One of the things I personally found little
    > odd about dm-ioband was need of a dm-ioband device for every device we want
    > to control. I thought that probably we can make this control per request
    > queue and get rid of device mapper driver. This should make configuration
    > aspect easy.
    >
    > I have picked up quite some amount of code from dm-ioband especially for
    > biocgroup implementation.
    >
    > I have done very basic testing and that is running 2-3 dd commands in different
    > cgroups on x86_64. Wanted to throw out the code early to get some feedback.
    >
    > More details about the design and how to are in documentation patch.
    >
    > Your comments are welcome.


    Which kernel version is this patch set based on?

    >
    > Thanks
    > Vivek
    >


    --
    Regards
    Gui Jianfeng


  16. Re: [patch 2/4] io controller: biocgroup implementation

    On Thu, 06 Nov 2008 10:30:24 -0500
    vgoyal@redhat.com wrote:

    >
    > o biocgroup functionality.
    > o Implemented new controller "bio"
    > o Most of it picked from dm-ioband biocgroup implementation patches.
    >

    The page_cgroup implementation has changed and most of this patch needs
    rework. Please see the latest version. (I think most of the new
    characteristics will be useful for you.)

    One comment from me is
    ==
    > +struct page_cgroup {
    > + struct list_head lru; /* per cgroup LRU list */
    > + struct page *page;
    > + struct mem_cgroup *mem_cgroup;
    > + int flags;
    > +#ifdef CONFIG_CGROUP_BIO
    > + struct list_head blist; /* for bio_cgroup page list */
    > + struct bio_cgroup *bio_cgroup;
    > +#endif
    > +};

    ==

    This blist is too heavy. Please keep this object small...

    Maybe the dm-ioband people will post their own new version; just making use of that is an idea.
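
    One illustrative way to keep page_cgroup small is to drop the embedded
    list_head and store only a small identifier for the owning bio_cgroup,
    resolving it through a lookup table when the I/O is issued. This is just a
    sketch of the idea, not the layout of any posted patch:

        /*
         * Illustrative sketch only: page_cgroup without the per-page
         * blist; the bio_cgroup is found via a small id instead.
         */
        struct example_page_cgroup {
                struct list_head lru;           /* per memcg LRU list */
                struct page *page;
                struct mem_cgroup *mem_cgroup;
                int flags;
        #ifdef CONFIG_CGROUP_BIO
                unsigned short bio_cgroup_id;   /* index into a bio_cgroup table */
        #endif
        };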

    Thanks,
    -Kame



  17. Re: [patch 3/4] io controller: Core IO controller implementation logic

    On Thu, 06 Nov 2008 10:30:25 -0500
    vgoyal@redhat.com wrote:

    >
    > o Core IO controller implementation
    >
    > Signed-off-by: Vivek Goyal
    >


    2 comments after a quick look.

    - I don't recommend using the generic workqueue. More stacked dependencies
    between "work" items are not good. (I think disk drivers use "work" for
    their own jobs.)

    - It seems this bio-cgroup can queue bios without bound. Then a process can
    submit io until it causes an OOM.
    (IIUC, the dirty bit of the page is cleared when the I/O is submitted,
    so dirty_ratio can't help us.)
    Please add "wait for congestion by sleeping" code to bio-cgroup.
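
    A minimal sketch of that kind of throttling, with made-up fields and a
    made-up limit (the posted struct bio_group has neither a congestion
    waitqueue nor an nr_queued counter, so these are assumptions):

        /*
         * Illustrative sketch only: make the submitter sleep while its
         * group already has "too many" bios queued, so one cgroup
         * cannot queue bios without bound and push the system to OOM.
         */
        #define EXAMPLE_MAX_QUEUED      128

        struct example_biog_throttle {
                wait_queue_head_t congestion_waitq;
                atomic_t nr_queued;
        };

        static void example_wait_for_room(struct example_biog_throttle *t)
        {
                wait_event(t->congestion_waitq,
                           atomic_read(&t->nr_queued) < EXAMPLE_MAX_QUEUED);
        }

        /* called when a previously queued bio completes */
        static void example_queued_bio_done(struct example_biog_throttle *t)
        {
                if (atomic_dec_return(&t->nr_queued) < EXAMPLE_MAX_QUEUED)
                        wake_up(&t->congestion_waitq);
        }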


    Thanks,
    -Kame


  18. Re: [patch 2/4] io controller: biocgroup implementation

    Hi,

    I'm going to release a new version of bio_cgroup soon, which doesn't
    have "struct list_head blist" anymore and whose overhead is minimized.

    > > o biocgroup functionality.
    > > o Implemented new controller "bio"
    > > o Most of it picked from dm-ioband biocgroup implementation patches.
    > >

    > page_cgroup implementation is changed and most of this patch needs rework.
    > please see the latest one. (I think most of new characteristics are useful
    > for you.)
    >
    > One comment from me is
    > ==
    > > +struct page_cgroup {
    > > + struct list_head lru; /* per cgroup LRU list */
    > > + struct page *page;
    > > + struct mem_cgroup *mem_cgroup;
    > > + int flags;
    > > +#ifdef CONFIG_CGROUP_BIO
    > > + struct list_head blist; /* for bio_cgroup page list */
    > > + struct bio_cgroup *bio_cgroup;
    > > +#endif
    > > +};

    > ==
    >
    > this blist is too bad. please keep this object small...
    >
    > Maybe dm-ioband people will post his own new one. just making use of it is an idea.
    >
    > Thanks,
    > -Kame
    >
    >


  19. Re: [patch 0/4] [RFC] Another proportional weight IO controller

    On Fri, 2008-11-07 at 11:41 +1100, Dave Chinner wrote:
    > On Thu, Nov 06, 2008 at 06:11:27PM +0100, Peter Zijlstra wrote:
    > > On Thu, 2008-11-06 at 11:57 -0500, Rik van Riel wrote:
    > > > Peter Zijlstra wrote:
    > > >
    > > > > The only real issue I can see is with linear volumes, but those are
    > > > > stupid anyway - non of the gains but all the risks.
    > > >
    > > > Linear volumes may well be the most common ones.
    > > >
    > > > People start out with the filesystems at a certain size,
    > > > increasing onto a second (new) disk later, when more space
    > > > is required.

    > >
    > > Are they aware of how risky linear volumes are? I would discourage
    > > anyone from using them.

    >
    > In what way are they risky?


    You lose all your data when one disk dies, so your MTBF decreases with
    the number of disks in your linear span.
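
    For example, assuming independent failures and an MTBF of roughly 500,000
    hours per disk (an illustrative figure), a linear span over two disks is
    lost when either disk fails, so its combined MTBF is roughly 250,000 hours,
    and every additional disk in the span divides it further.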

    And you get none of the benefits of having multiple disks, like extra
    speed from striping or redundancy from RAID.

    Therefore I say that linear volumes are the absolute worst choice.

  20. Re: [patch 0/4] [RFC] Another proportional weight IO controller

    On Fri, Nov 07, 2008 at 10:36:50AM +0800, Gui Jianfeng wrote:
    > vgoyal@redhat.com wrote:
    > > Hi,
    > >
    > > If you are not already tired of so many io controller implementations, here
    > > is another one.
    > >
    > > This is a very eary very crude implementation to get early feedback to see
    > > if this approach makes any sense or not.
    > >
    > > This controller is a proportional weight IO controller primarily
    > > based on/inspired by dm-ioband. One of the things I personally found little
    > > odd about dm-ioband was need of a dm-ioband device for every device we want
    > > to control. I thought that probably we can make this control per request
    > > queue and get rid of device mapper driver. This should make configuration
    > > aspect easy.
    > >
    > > I have picked up quite some amount of code from dm-ioband especially for
    > > biocgroup implementation.
    > >
    > > I have done very basic testing and that is running 2-3 dd commands in different
    > > cgroups on x86_64. Wanted to throw out the code early to get some feedback.
    > >
    > > More details about the design and how to are in documentation patch.
    > >
    > > Your comments are welcome.

    >
    > Which kernel version is this patch set based on?
    >


    2.6.27

    Thanks
    Vivek
