The objective of the i/o controller is to improve the i/o performance
predictability of different cgroups that share the same block devices.

Compared to other priority/weight-based solutions, the approach used by this
controller is to explicitly choke applications' requests that directly (or
indirectly) generate i/o activity in the system.

The direct bandwidth and/or iops limiting method has the advantage of improving
the performance predictability at the cost of reducing, in general, the overall
performance of the system (in terms of throughput).
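
As an illustration of direct bandwidth limiting, here is a minimal userspace
sketch of a token-bucket style limiter. It is not taken from the patch: struct
bw_limit, bw_limit_charge() and the millisecond time base are hypothetical
names and choices made only for this example. It assumes, as in the changelog,
that a limit of 0 means unlimited.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-cgroup, per-device limit state (not struct iothrottle). */
struct bw_limit {
	uint64_t rate;    /* allowed bytes per second; 0 means unlimited */
	uint64_t bucket;  /* maximum burst size, in bytes */
	int64_t  tokens;  /* bytes currently available; may go negative */
	uint64_t last_ms; /* time of the last refill, in milliseconds */
};

/*
 * Charge `bytes` of i/o at time `now_ms` and return how long the caller
 * should sleep (in milliseconds) before generating more i/o.
 */
static uint64_t bw_limit_charge(struct bw_limit *l, uint64_t bytes,
				uint64_t now_ms)
{
	if (l->rate == 0)
		return 0; /* unlimited: never throttle */

	assert(now_ms >= l->last_ms);

	/* Refill tokens for the elapsed time, capped at the bucket size. */
	l->tokens += (int64_t)((now_ms - l->last_ms) * l->rate / 1000);
	if (l->tokens > (int64_t)l->bucket)
		l->tokens = (int64_t)l->bucket;
	l->last_ms = now_ms;

	/* Charge the request; a negative balance is a deficit to sleep off. */
	l->tokens -= (int64_t)bytes;
	if (l->tokens >= 0)
		return 0;

	/* Time needed to earn back the missing tokens, rounded up. */
	return ((uint64_t)(-l->tokens) * 1000 + l->rate - 1) / l->rate;
}
```

The caller sleeps for the returned time, which is where the trade-off shows up:
the cgroup's rate becomes predictable, but the device may sit idle while the
task is asleep.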

Detailed information about the design, its goals and usage is described in the

Tested against 2.6.27-rc5-mm1.

The all-in-one patch (and previous versions) can be found at:

Changelog: (v9 -> v10)
* fix a bug to correctly throttle small direct-IO writes
* fix: do not add a new limiting rule if the limit is 0 (unlimited)
* do not report time values directly in jiffies, always use clock_t
* remove a spinlock in struct iothrottle (we always hold cgroup_lock() when
  using it for RCU update, so an additional spinlock is not needed)
* use page_cgroup functionality provided by memory cgroup controller to charge
  asynchronous i/o activity (e.g. pdflush writebacks) to the right cgroup
* code simplification in cgroup_io_throttle()
* removed a lot of experimental stuff introduced in the previous version
* update documentation

TODO:
* Implement an rbtree per request queue; all the requests queued to the I/O
  subsystem will first go into this rbtree. Then, based on cgroup grouping and
  control policy, dispatch the requests and pass them to the elevator
  associated with the queue. This would make it possible to provide both
  bandwidth limiting and proportional bandwidth functionalities using a quite
  generic approach (suggested by Vivek Goyal)

* Improve fair throttling: distribute the time to sleep among all the tasks of
  a cgroup that exceeded the I/O limits, depending on the amount of I/O
  activity previously generated by each task (see task_io_accounting)
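
One possible reading of the fair throttling item above, as a userspace sketch
only: split the total sleep time in proportion to the bytes each task has
generated. split_sleep() and the io_bytes[] array are hypothetical; io_bytes[]
stands in for whatever task_io_accounting records per task.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Split total_sleep_ms among n tasks in proportion to io_bytes[i].
 * Results are written to sleep_ms[]. Integer division rounds down, so the
 * shares may add up to slightly less than total_sleep_ms.
 * Note: total_sleep_ms * io_bytes[i] is assumed not to overflow 64 bits.
 */
static void split_sleep(uint64_t total_sleep_ms, const uint64_t *io_bytes,
			uint64_t *sleep_ms, size_t n)
{
	uint64_t total_io = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total_io += io_bytes[i];

	for (i = 0; i < n; i++) {
		if (total_io == 0)
			/* No i/o history at all: split evenly. */
			sleep_ms[i] = total_sleep_ms / n;
		else
			sleep_ms[i] = total_sleep_ms * io_bytes[i] / total_io;
	}
}
```

With this split a task that generated no I/O is not put to sleep at all, while
the heaviest writer absorbs most of the delay.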

* Try to reduce the cost of calling cgroup_io_throttle() on every submit_bio();
  this is not too expensive, but the call to task_subsys_state() surely has a
  cost. A possible solution could be to temporarily account I/O in the current
  task_struct and call cgroup_io_throttle() only every X MB of I/O, or every
  Y I/O requests. Better if both X and Y can be tuned at runtime by a
  userspace tool
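
The batching idea above could look roughly like the following userspace
sketch. All names here (struct task_io_batch, struct batch_limits,
batch_account()) are hypothetical; in the kernel the pending counters would
live in task_struct and the returned byte count would be handed to
cgroup_io_throttle().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-task state; would be embedded in task_struct. */
struct task_io_batch {
	uint64_t pending_bytes; /* I/O accounted locally, not yet charged */
	uint64_t pending_reqs;  /* requests accounted locally */
};

/* The X and Y tunables from the text, assumed to be set by userspace. */
struct batch_limits {
	uint64_t max_bytes; /* X, expressed in bytes */
	uint64_t max_reqs;  /* Y */
};

/*
 * Account one request locally. Returns the number of bytes that should be
 * charged through the expensive throttling call now, or 0 if the call can
 * be skipped for this request.
 */
static uint64_t batch_account(struct task_io_batch *t,
			      const struct batch_limits *lim, uint64_t bytes)
{
	uint64_t flush;

	assert(lim->max_bytes > 0 && lim->max_reqs > 0);

	t->pending_bytes += bytes;
	t->pending_reqs++;

	/* Below both thresholds: keep accumulating. */
	if (t->pending_bytes < lim->max_bytes &&
	    t->pending_reqs < lim->max_reqs)
		return 0;

	/* A threshold was reached: hand over the whole batch and reset. */
	flush = t->pending_bytes;
	t->pending_bytes = 0;
	t->pending_reqs = 0;
	return flush;
}
```

The price of batching is coarser control: a task can exceed its limit by up to
one batch before the throttling code notices.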

* Think about an alternative design for general purpose usage; special purpose
  usage right now is restricted to improving I/O performance predictability
  and evaluating more precise response timings for applications doing I/O. To
  a large degree the block I/O bandwidth controller should implement a more
  complex logic to better evaluate the real cost of I/O operations, depending
  also on the particular block device profile (e.g. USB stick, optical drive,
  hard disk, etc.). This would also make it possible to appropriately account
  the I/O cost of seeky workloads, compared to large streaming workloads.
  Instead of looking at the request stream and trying to predict how expensive
  the I/O will be, a totally different approach could be to collect request
  timings (start time / elapsed time) and, based on the collected information,
  try to estimate the I/O cost and usage
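
To make the timing-based approach above a bit more concrete, here is a
userspace sketch that keeps a moving average of the measured cost per KB and
uses it to estimate the cost of future requests. This is only one way it could
be done; struct io_cost_est, io_cost_update(), io_cost_estimate() and the 7/8
smoothing weight are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-device (or per-cgroup) cost estimator. */
struct io_cost_est {
	uint64_t avg_ns_per_kb; /* smoothed cost of 1 KB of I/O */
	int primed;             /* 0 until the first sample is seen */
};

/* Round a byte count up to whole KB. */
static uint64_t bytes_to_kb(uint64_t bytes)
{
	return (bytes + 1023) / 1024;
}

/* Feed one completed request: elapsed time and size. */
static void io_cost_update(struct io_cost_est *e, uint64_t elapsed_ns,
			   uint64_t bytes)
{
	uint64_t kb = bytes_to_kb(bytes);
	uint64_t sample;

	assert(kb > 0);
	sample = elapsed_ns / kb;

	if (!e->primed) {
		e->avg_ns_per_kb = sample;
		e->primed = 1;
	} else {
		/* Exponential moving average: 7/8 old value, 1/8 new sample. */
		e->avg_ns_per_kb = (7 * e->avg_ns_per_kb + sample) / 8;
	}
}

/* Estimated cost, in nanoseconds, of a request of the given size. */
static uint64_t io_cost_estimate(const struct io_cost_est *e, uint64_t bytes)
{
	return e->avg_ns_per_kb * bytes_to_kb(bytes);
}
```

Since the samples are measured on the real device, a seeky workload with small,
slow requests naturally pushes the per-KB cost up, without the controller
having to know the device profile in advance.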
