[patch 0/4] [RFC] Another proportional weight IO controller - Kernel


Thread: [patch 0/4] [RFC] Another proportional weight IO controller

  1. Re: [patch 0/4] [RFC] Another proportional weight IO controller

    On Thu, Nov 06, 2008 at 03:07:57PM -0800, Nauman Rafique wrote:
    > It seems that approaches with two level scheduling (DM-IOBand or this
    > patch set on top and another scheduler at elevator) will have the
    > possibility of undesirable interactions (see "issues" listed at the
    > end of the second patch). For example, a request submitted as RT might
    > get delayed at higher layers, even if cfq at elevator level is doing
    > the right thing.
    >


    Yep. Buffering of bios at a higher layer can break the underlying
    elevator's assumptions.

    What if we start keeping track of task priorities and RT tasks in the
    higher level schedulers and dispatch the bios accordingly? Will it break
    the underlying noop, deadline or AS?
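
    Something along these lines is what I mean -- a rough sketch with
    hypothetical names, not actual patch code. The higher level scheduler
    keeps each group's buffered bios split by priority class and always
    drains the RT class first, so an RT bio is never held behind
    best-effort ones:

        #include <linux/list.h>

        /* Hypothetical per-group buffer in the higher level scheduler. */
        struct group_queue {
                struct list_head rt_bios;   /* queued RT-class bios */
                struct list_head be_bios;   /* queued best-effort bios */
        };

        /* Hypothetical wrapper linking a buffered bio into a list. */
        struct bio_entry {
                struct list_head list;
                struct bio *bio;
        };

        /* Pick the next buffered bio to dispatch: RT first, so the
         * underlying elevator still sees RT requests promptly. */
        static struct bio_entry *group_next_bio(struct group_queue *gq)
        {
                if (!list_empty(&gq->rt_bios))
                        return list_first_entry(&gq->rt_bios,
                                                struct bio_entry, list);
                if (!list_empty(&gq->be_bios))
                        return list_first_entry(&gq->be_bios,
                                                struct bio_entry, list);
                return NULL;
        }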

    > Moreover, if the requests in the higher level scheduler are dispatched
    > as soon as they come, there would be no queuing at the higher layers,
    > unless the request queue at the lower level fills up and causes a
    > backlog. And in the absence of queuing, any work-conserving scheduler
    > would behave as a no-op scheduler.
    >
    > These issues motivate taking a second look at two level scheduling.
    > The main motivations for two level scheduling seem to be:
    > (1) Support bandwidth division across multiple devices for RAID and LVMs.


    Nauman, can you give an example where we really need bandwidth division
    for higher level devices?

    I am beginning to think that the real contention is at the leaf level
    physical devices and not at the higher level logical devices, hence we
    should be doing resource management only at the leaf level and not worry
    about the higher level logical devices.

    If this requirement goes away, then the case for a two level scheduler
    weakens and one needs to think about making changes in the leaf level IO
    schedulers.

    > (2) Divide bandwidth between different cgroups without modifying each
    > of the existing schedulers (and without replicating the code).
    >
    > One possible approach to handle (1) is to keep track of bandwidth
    > utilized by each cgroup in a per cgroup data structure (instead of a
    > per cgroup per device data structure) and use that information to make
    > scheduling decisions within the elevator level schedulers. Such a
    > patch can be made flag-disabled if co-ordination across different
    > device schedulers is not required.
    >


    Can you give more details about it? I am not sure I understand it. Exactly
    what information should be stored in each cgroup?

    I think per cgroup per device data structures are good so that a scheduler
    will not worry about other devices present in the system and will just try
    to arbitrate between the various cgroups contending for that device. This
    goes back to the same issue of getting rid of requirement (1) from the io
    controller.

    > And (2) can probably be handled by having one scheduler support
    > different modes. For example, one possible mode is "proportional
    > division between cgroups + no-op between threads of a cgroup" or "cfq
    > between cgroups + cfq between threads of a cgroup". That would also
    > help avoid combinations which might not work, e.g. the RT request issue
    > mentioned earlier in this email. And this unified scheduler can re-use
    > code from all the existing patches.
    >


    IIUC, you are suggesting some kind of unification among the four IO
    schedulers so that the proportional weight code is not replicated and the
    user can switch modes on the fly based on tunables?

    Thanks
    Vivek

    > Thanks.
    > --
    > Nauman
    >
    > On Thu, Nov 6, 2008 at 9:08 AM, Vivek Goyal wrote:
    > > On Thu, Nov 06, 2008 at 05:52:07PM +0100, Peter Zijlstra wrote:
    > >> On Thu, 2008-11-06 at 11:39 -0500, Vivek Goyal wrote:
    > >> > On Thu, Nov 06, 2008 at 05:16:13PM +0100, Peter Zijlstra wrote:
    > >> > > On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote:
    > >> > >
    > >> > > > > Does this still require I use dm, or does it also work on regular block
    > >> > > > > devices? Patch 4/4 isn't quite clear on this.
    > >> > > >
    > >> > > > No. You don't have to use dm. It will simply work on regular devices. We
    > >> > > > shall have to put in a few lines of code for it to work on devices which
    > >> > > > don't make use of the standard __make_request() function and provide
    > >> > > > their own make_request function.
    > >> > > >
    > >> > > > Hence, for example, I have put in those few lines of code so that it can
    > >> > > > work with dm devices. I shall have to do something similar for md too.
    > >> > > >
    > >> > > > Though, I am not very sure why I need to do IO control on higher level
    > >> > > > devices. Will it be sufficient if we just control the bottom-most
    > >> > > > physical block devices?
    > >> > > >
    > >> > > > Anyway, this approach should work at any level.
    > >> > >
    > >> > > Nice, although I would think only doing the higher level devices makes
    > >> > > more sense than only doing the leaves.
    > >> > >
    > >> >
    > >> > I thought that we should be doing any kind of resource management only at
    > >> > the level where there is actual contention for the resources. So in this
    > >> > case it looks like only the bottom-most devices are slow and don't have
    > >> > infinite bandwidth, hence the contention. (I am not taking into account
    > >> > contention at the bus level or at the interconnect level for external
    > >> > storage, assuming the interconnect is not the bottleneck.)
    > >> >
    > >> > For example, let's say there is one linear device mapper device dm-0 on
    > >> > top of physical devices sda and sdb. Assume two tasks in two different
    > >> > cgroups are reading two different files from device dm-0. Now if these
    > >> > files both fall on the same physical device (either sda or sdb), then they
    > >> > will be contending for resources. But if the files being read are on
    > >> > different physical devices then practically there is no device contention
    > >> > (even if on the surface it might look like dm-0 is being contended for).
    > >> > So if the files are on different physical devices, the IO controller will
    > >> > not know it. It will simply dispatch one group at a time and the other
    > >> > device might remain idle.
    > >> >
    > >> > Keeping that in mind, I thought we would be able to make use of the full
    > >> > available bandwidth if we do IO control only at the bottom-most device.
    > >> > Doing it at a higher layer has the potential of not using the full
    > >> > available bandwidth.
    > >> >
    > >> > > Is there any reason we cannot merge this with the regular io-scheduler
    > >> > > interface? afaik the only problem with doing group scheduling in the
    > >> > > io-schedulers is the stacked devices issue.
    > >> >
    > >> > I think we should be able to merge it with the regular io schedulers.
    > >> > Apart from the stacked device issue, people also mentioned that it is so
    > >> > closely tied to the IO schedulers that we will end up doing four
    > >> > implementations for four schedulers, and that is not very good from a
    > >> > maintenance perspective.
    > >> >
    > >> > But I will spend more time finding out if there is common ground
    > >> > between the schedulers so that a lot of common IO control code can be
    > >> > used in all of them.
    > >> >
    > >> > >
    > >> > > Could we make the io-schedulers aware of this hierarchy?
    > >> >
    > >> > You mean IO schedulers knowing that there is somebody above them doing
    > >> > proportional weight dispatching of bios? If yes, how would that help?
    > >>
    > >> Well, take the slightly more elaborate example of a raid[56] setup. This
    > >> will sometimes need to issue multiple leaf level ios to satisfy one top
    > >> level io.
    > >>
    > >> How are you going to attribute this fairly?
    > >>

    > >
    > > I think in this case, the definition of fair allocation will be a little
    > > different. We will do fair allocation only at the leaf nodes where
    > > there is actual contention, irrespective of the higher level setup.
    > >
    > > So if a higher level block device issues multiple ios to satisfy one top
    > > level io, we will actually do the bandwidth allocation only on
    > > those multiple ios, because that is the real IO contending for disk
    > > bandwidth. And if these multiple ios are going to different physical
    > > devices, then contention management will take place on those devices.
    > >
    > > IOW, we will not worry about providing fairness for bios submitted to
    > > higher level devices. We will just pitch in for contention management
    > > only when requests from various cgroups are contending for a physical
    > > device at the bottom-most layer. Isn't it fair?
    > >
    > > Thanks
    > > Vivek
    > >
    > >> I don't think the issue of bandwidth availability like the above will
    > >> really be an issue; if your stripe is set up symmetrically, the contention
    > >> should average out over both (all) disks in equal measure.
    > >>
    > >> The only real issue I can see is with linear volumes, but those are
    > >> stupid anyway - none of the gains but all the risks.



  2. Re: [patch 2/4] io controller: biocgroup implementation

    On Fri, Nov 07, 2008 at 11:50:30AM +0900, KAMEZAWA Hiroyuki wrote:
    > On Thu, 06 Nov 2008 10:30:24 -0500
    > vgoyal@redhat.com wrote:
    >
    > >
    > > o biocgroup functionality.
    > > o Implemented new controller "bio"
    > > o Most of it picked from dm-ioband biocgroup implementation patches.
    > >

    > The page_cgroup implementation has changed and most of this patch needs
    > rework. Please see the latest one. (I think most of the new characteristics
    > are useful for you.)
    >


    Sure I will have a look.

    > One comment from me is
    > ==
    > > +struct page_cgroup {
    > > + struct list_head lru; /* per cgroup LRU list */
    > > + struct page *page;
    > > + struct mem_cgroup *mem_cgroup;
    > > + int flags;
    > > +#ifdef CONFIG_CGROUP_BIO
    > > + struct list_head blist; /* for bio_cgroup page list */
    > > + struct bio_cgroup *bio_cgroup;
    > > +#endif
    > > +};

    > ==
    >
    > This blist is too bad. Please keep this object small...
    >


    This is just another linking element so that a page_cgroup can be on another
    list as well. It is useful in making sure that IO on all the pages of a bio
    cgroup has completed before that bio cgroup is deleted.
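
    For reference, this is the usual pattern of one object living on two
    independent lists through two embedded list_heads -- a minimal sketch
    (the list names here are hypothetical; only the fields come from the
    quoted struct):

        #include <linux/list.h>

        /* 'lru' links the page into the memory controller's per-cgroup
         * LRU; 'blist' independently links the same page_cgroup into
         * the bio cgroup's page list. */
        void track_page(struct page_cgroup *pc,
                        struct list_head *memcg_lru,
                        struct list_head *biocg_pages)
        {
                list_add(&pc->lru, memcg_lru);     /* mem cgroup LRU */
                list_add(&pc->blist, biocg_pages); /* bio cgroup list */
        }

    Deleting a bio cgroup can then walk its page list via 'blist' and wait
    for IO on those pages to complete, without disturbing the LRU linkage.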

    > Maybe the dm-ioband people will post their own new one. Just making use of it is an idea.


    Sure, I will have a look when the dm-ioband people post a new version of
    the patch, and see how they have optimized it further.

    Thanks
    Vivek

  3. Re: [patch 3/4] io controller: Core IO controller implementation logic

    On Fri, Nov 07, 2008 at 12:21:45PM +0900, KAMEZAWA Hiroyuki wrote:
    > On Thu, 06 Nov 2008 10:30:25 -0500
    > vgoyal@redhat.com wrote:
    >
    > >
    > > o Core IO controller implementation
    > >
    > > Signed-off-by: Vivek Goyal
    > >

    >
    > 2 comments after a quick look.
    >
    > - I don't recommend a generic work queue. More stacked dependencies between
    > "work" items are not good. (I think disk drivers use "work" for their jobs.)


    Sorry, I did not get this. Are you recommending that I not create a new
    work queue, and instead use an existing work queue (say kblockd) to submit
    the bios here?

    I will look into it. I was a little worried about kblockd being overworked
    in case too many logical devices enable the IO controller.

    >
    > - It seems this bio-cgroup can queue bios without limit. Then, a process
    > can submit io until it causes an OOM.
    > (IIUC, the dirty bit of the page is cleared when the I/O is submitted,
    > so dirty_ratio can't help us.)
    > Please add "wait for congestion by sleeping" code to bio-cgroup.


    Yes, you are right. I need to put some kind of limit on the max number of
    bios I can queue in a cgroup, and after crossing the limit I should put
    the submitting task to sleep (something like the request descriptor style
    of flow control implemented by the elevators).
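
    Roughly like this -- a minimal sketch with hypothetical names, not
    actual patch code, mirroring the way the block layer throttles on
    request descriptors:

        #include <linux/wait.h>

        #define BIOG_MAX_QUEUED 128     /* hypothetical per-cgroup limit */

        struct bio_group {
                atomic_t nr_queued;     /* bios buffered for this cgroup */
                wait_queue_head_t wait; /* submitters sleep here */
        };

        /* Called on submission: sleep until there is room, then account
         * the bio. (The check-then-increment race can over-admit a bit;
         * tolerable for a sketch.) */
        static void biog_wait_for_room(struct bio_group *biog)
        {
                wait_event(biog->wait,
                           atomic_read(&biog->nr_queued) < BIOG_MAX_QUEUED);
                atomic_inc(&biog->nr_queued);
        }

        /* Called from the bio completion path: release one slot and wake
         * any sleeping submitters. */
        static void biog_bio_done(struct bio_group *biog)
        {
                if (atomic_dec_return(&biog->nr_queued) < BIOG_MAX_QUEUED)
                        wake_up(&biog->wait);
        }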

    Thanks
    Vivek

  4. Re: [patch 0/4] [RFC] Another proportional weight IO controller

    On Fri, Nov 7, 2008 at 6:19 AM, Vivek Goyal wrote:
    > On Thu, Nov 06, 2008 at 03:07:57PM -0800, Nauman Rafique wrote:
    >> It seems that approaches with two level scheduling (DM-IOBand or this
    >> patch set on top and another scheduler at elevator) will have the
    >> possibility of undesirable interactions (see "issues" listed at the
    >> end of the second patch). For example, a request submitted as RT might
    >> get delayed at higher layers, even if cfq at elevator level is doing
    >> the right thing.
    >>

    >
    > Yep. Buffering of bios at higher layer can break underlying elevator's
    > assumptions.
    >
    > What if we start keeping track of task priorities and RT tasks in higher
    > level schedulers and dispatch the bios accordingly. Will it break the
    > underlying noop, deadline or AS?


    It probably will not. But then we have a cfq-like scheduler at the higher
    level, and we can agree that the combinations "cfq (higher
    level)-noop (lower level)", "cfq-deadline", "cfq-as" and "cfq-cfq"
    would probably work. But if we implement one cfq-like
    scheduler at the higher level, we would not take care of somebody who
    wants noop-noop or proportional-noop. The point I am trying to make is
    that there is probably no single one-size-fits-all solution for a
    higher level scheduler. And we should limit the arbitrary mixing and
    matching of higher level schedulers and elevator schedulers. That
    being said, the existence of a higher level scheduler is still a point
    of debate I guess; see my comments below.


    >
    >> Moreover, if the requests in the higher level scheduler are dispatched
    >> as soon as they come, there would be no queuing at the higher layers,
    >> unless the request queue at the lower level fills up and causes a
    >> backlog. And in the absence of queuing, any work-conserving scheduler
    >> would behave as a no-op scheduler.
    >>
    >> These issues motivate taking a second look at two level scheduling.
    >> The main motivations for two level scheduling seem to be:
    >> (1) Support bandwidth division across multiple devices for RAID and LVMs.

    >
    > Nauman, can you give an example where we really need bandwidth division
    > for higher level devices?
    >
    > I am beginning to think that the real contention is at the leaf level
    > physical devices and not at the higher level logical devices, hence we
    > should be doing resource management only at the leaf level and not worry
    > about the higher level logical devices.
    >
    > If this requirement goes away, then the case for a two level scheduler
    > weakens and one needs to think about making changes in the leaf level IO
    > schedulers.


    I cannot agree with you more that contention exists only at
    the leaf level physical devices and that bandwidth should be managed only
    there. But having seen earlier posts on this list, I feel some folks
    might not agree with us. For example, if we have RAID-0 striping, we
    might want to schedule requests based on the cumulative bandwidth used
    over all devices. Again, I myself don't agree with moving scheduling
    to a higher level just to support that.

    >
    >> (2) Divide bandwidth between different cgroups without modifying each
    >> of the existing schedulers (and without replicating the code).
    >>
    >> One possible approach to handle (1) is to keep track of bandwidth
    >> utilized by each cgroup in a per cgroup data structure (instead of a
    >> per cgroup per device data structure) and use that information to make
    >> scheduling decisions within the elevator level schedulers. Such a
    >> patch can be made flag-disabled if co-ordination across different
    >> device schedulers is not required.
    >>

    >
    > Can you give more details about it? I am not sure I understand it. Exactly
    > what information should be stored in each cgroup?
    >
    > I think per cgroup per device data structures are good so that a scheduler
    > will not worry about other devices present in the system and will just try
    > to arbitrate between the various cgroups contending for that device. This
    > goes back to the same issue of getting rid of requirement (1) from the io
    > controller.


    I was thinking that we could keep track of the disk time used at each
    device, and keep the cumulative number in a per cgroup data structure.
    But that is only needed if we want to support bandwidth division across
    devices. You and I both agree that we probably do not need to do
    that.
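
    As a rough sketch of what that bookkeeping could look like (hypothetical
    names, not from any posted patch): each device-level scheduler charges
    the time a cgroup's IO occupied its disk into one cross-device counter
    in the cgroup, which a scheduler could then consult when picking the
    next group.

        #include <linux/spinlock.h>
        #include <linux/types.h>

        /* One instance per cgroup, shared by all device schedulers. */
        struct io_cgroup_stats {
                spinlock_t lock;
                u64 disk_time_ns;   /* cumulative disk time, all devices */
        };

        /* Called by a device scheduler after a cgroup's slice ends. */
        static void charge_disk_time(struct io_cgroup_stats *st, u64 delta_ns)
        {
                spin_lock(&st->lock);
                st->disk_time_ns += delta_ns;
                spin_unlock(&st->lock);
        }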

    >
    >> And (2) can probably be handled by having one scheduler support
    >> different modes. For example, one possible mode is "proportional
    >> division between cgroups + no-op between threads of a cgroup" or "cfq
    >> between cgroups + cfq between threads of a cgroup". That would also
    >> help avoid combinations which might not work, e.g. the RT request issue
    >> mentioned earlier in this email. And this unified scheduler can re-use
    >> code from all the existing patches.
    >>

    >
    > IIUC, you are suggesting some kind of unification among the four IO
    > schedulers so that the proportional weight code is not replicated and the
    > user can switch modes on the fly based on tunables?


    Yes, that seems to be a solution to avoid replication of code. But we
    should also look at any other solutions that avoid replication of
    code, and also avoid scheduling in two different layers.
    In my opinion, scheduling at two different layers is problematic because
    (a) Any buffering done at a higher level will be artificial, unless
    the queues at lower levels are completely full. And if there is no
    buffering at a higher level, any scheduling scheme would be
    ineffective.
    (b) We cannot have an arbitrary mixing and matching of higher and
    lower level schedulers.

    (a) would exist in any solution in which requests are queued at
    multiple levels. Can you please comment on this with respect to the
    patch that you have posted?

    Thanks.
    --
    Nauman


  5. Re: [patch 3/4] io controller: Core IO controller implementation logic

    Vivek Goyal said:
    > On Fri, Nov 07, 2008 at 12:21:45PM +0900, KAMEZAWA Hiroyuki wrote:
    >> On Thu, 06 Nov 2008 10:30:25 -0500
    >> vgoyal@redhat.com wrote:
    >>
    >> >
    >> > o Core IO controller implementation
    >> >
    >> > Signed-off-by: Vivek Goyal
    >> >

    >>
    >> 2 comments after a quick look.
    >>
    >> - I don't recommend a generic work queue. More stacked dependencies
    >> between "work" items are not good. (I think disk drivers use "work"
    >> for their jobs.)

    >
    > Sorry, I did not get this. Are you recommending that I not create a new
    > work queue, and instead use an existing work queue (say kblockd) to
    > submit the bios here?
    >

    Ah, no, I was recommending a new, original workqueue of its own. I'm sorry,
    it seems I missed something when reading your patch.
    (Other people may have other opinions here.)

    > I will look into it. I was a little worried about kblockd being overworked
    > in case too many logical devices enable the IO controller.
    >


    Thanks,
    -Kame


  6. Re: [patch 0/4] [RFC] Another proportional weight IO controller

    On Fri, Nov 07, 2008 at 11:31:44AM +0100, Peter Zijlstra wrote:
    > On Fri, 2008-11-07 at 11:41 +1100, Dave Chinner wrote:
    > > On Thu, Nov 06, 2008 at 06:11:27PM +0100, Peter Zijlstra wrote:
    > > > On Thu, 2008-11-06 at 11:57 -0500, Rik van Riel wrote:
    > > > > Peter Zijlstra wrote:
    > > > >
    > > > > > The only real issue I can see is with linear volumes, but
    > > > > > those are stupid anyway - none of the gains but all the
    > > > > > risks.
    > > > >
    > > > > Linear volumes may well be the most common ones.
    > > > >
    > > > > People start out with the filesystems at a certain size,
    > > > > increasing onto a second (new) disk later, when more space
    > > > > is required.
    > > >
    > > > Are they aware of how risky linear volumes are? I would
    > > > discourage anyone from using them.

    > >
    > > In what way are they risky?

    >
    > You lose all your data when one disk dies, so your mtbf decreases
    > with the number of disks in your linear span. And you get none of
    > the benefits from having multiple disks, like extra speed from
    > striping, or redundancy from raid.


    Fmeh. Step back and think for a moment. How does every major
    distro build redundant root drives?

    Yeah, they build a mirror and then put LVM on top of the mirror
    to partition it. Each partition is a *linear volume*, but
    no single disk failure is going to lose data because it's
    been put on top of a mirror.

    IOWs, reliability of linear volumes is only an issue if you don't
    build redundancy into your storage stack. Just like RAID0, a single
    disk failure will lose data. So, most people use linear volumes on
    top of RAID1 or RAID5 to avoid such a single disk failure problem.
    People do the same thing with RAID0 - it's what RAID10 and RAID50
    do....

    Also, linear volume performance scalability is on a different axis
    to striping. Striping improves bandwidth, but each disk in a stripe
    tends to make the same head movements. Hence striping improves
    sequential throughput but only provides limited iops scalability.
    Effectively, striping only improves throughput while the disks are
    not seeking a lot. Add a few parallel I/O streams, and a stripe will
    start to slow down as each disk seeks between streams. i.e. disks
    in stripes cannot be considered to be able to operate independently.

    Linear volumes create independent regions within the address space -
    the regions can seek independently when under concurrent I/O and
    hence iops scalability is much greater. Aggregate bandwidth is the
    same as striping; it's just that a single stream is limited in
    throughput. If you want to improve single stream throughput,
    you stripe before you concatenate.

    That's why people create layered storage systems like this:

    linear volume
      |-> stripe
      |     |-> md RAID5
      |     |     |-> disk
      |     |     |-> disk
      |     |     |-> disk
      |     |     |-> disk
      |     |     |-> disk
      |     |-> md RAID5
      |           |-> disk
      |           |-> disk
      |           |-> disk
      |           |-> disk
      |           |-> disk
      |-> stripe
      |     |-> md RAID5
      |     ......
      |-> stripe
            ......

    What you then need is a filesystem that can spread the load over
    such a layout. Let's use, for argument's sake, XFS and tell it the
    geometry of the RAID5 luns that make up the volume so that its
    allocation is all nicely aligned. Then we match the allocation
    group size to the size of each independent part of the linear
    volume. Now when XFS spreads its inodes and data over multiple
    AGs, it's spreading the load across disks that can operate
    concurrently....

    Effectively, linear volumes are about as dangerous as striping.
    If you don't build in redundancy at a level below the linear
    volume or stripe, then you lose when something fails.

    Cheers,

    Dave.
    --
    Dave Chinner
    david@fromorbit.com

  7. Re: [patch 0/4] [RFC] Another proportional weight IO controller

    On Fri, Nov 07, 2008 at 01:36:20PM -0800, Nauman Rafique wrote:
    > On Fri, Nov 7, 2008 at 6:19 AM, Vivek Goyal wrote:
    > > On Thu, Nov 06, 2008 at 03:07:57PM -0800, Nauman Rafique wrote:
    > >> It seems that approaches with two level scheduling (DM-IOBand or this
    > >> patch set on top and another scheduler at elevator) will have the
    > >> possibility of undesirable interactions (see "issues" listed at the
    > >> end of the second patch). For example, a request submitted as RT might
    > >> get delayed at higher layers, even if cfq at elevator level is doing
    > >> the right thing.
    > >>

    > >
    > > Yep. Buffering of bios at higher layer can break underlying elevator's
    > > assumptions.
    > >
    > > What if we start keeping track of task priorities and RT tasks in higher
    > > level schedulers and dispatch the bios accordingly. Will it break the
    > > underlying noop, deadline or AS?

    >
    > It will probably not. But then we have a cfq-like scheduler at higher
    > level and we can agree that the combinations "cfq(higher
    > level)-noop(lower level)", "cfq-deadline", "cfq-as" and "cfq-cfq"
    > would probably work. But if we implement one high level cfq-like
    > scheduler at a higher level, we would not take care of somebody who
    > wants noop-noop or propotional-noop. The point I am trying to make is
    > that there is probably no single one-size-fits-all solution for a
    > higher level scheduler. And we should limit the arbitrary mixing and
    > matching of higher level schedulers and elevator schedulers. That
    > being said, the existence of a higher level scheduler is still a point
    > of debate I guess, see my comments below.
    >


    Yeah, implementing a CFQ-like thing in a higher level scheduler will make
    things complex.

    >
    > >
    > >> Moreover, if the requests in the higher level scheduler are dispatched
    > >> as soon as they come, there would be no queuing at the higher layers,
    > >> unless the request queue at the lower level fills up and causes a
    > >> backlog. And in the absence of queuing, any work-conserving scheduler
    > >> would behave as a no-op scheduler.
    > >>
    > >> These issues motivate taking a second look at two level scheduling.
    > >> The main motivations for two level scheduling seem to be:
    > >> (1) Support bandwidth division across multiple devices for RAID and LVMs.

    > >
    > > Nauman, can you give an example where we really need bandwidth division
    > > for higher level devices?
    > >
    > > I am beginning to think that the real contention is at the leaf level
    > > physical devices and not at the higher level logical devices, hence we
    > > should be doing resource management only at the leaf level and not worry
    > > about the higher level logical devices.
    > >
    > > If this requirement goes away, then the case for a two level scheduler
    > > weakens and one needs to think about making changes in the leaf level IO
    > > schedulers.

    >
    > I cannot agree with you more that contention exists only at
    > the leaf level physical devices and that bandwidth should be managed only
    > there. But having seen earlier posts on this list, I feel some folks
    > might not agree with us. For example, if we have RAID-0 striping, we
    > might want to schedule requests based on the cumulative bandwidth used
    > over all devices. Again, I myself don't agree with moving scheduling
    > to a higher level just to support that.
    >


    Hmm, I am not very convinced that we need to do resource management
    at the RAID0 device. The common case for resource management is that a
    higher priority task group is not deprived of resources because of a lower
    priority task group. So if there is no contention between two task groups
    (at the leaf node), then I might as well give them full access to the
    RAID0 logical device without any control.

    I hope people who have a requirement for control at higher level devices
    can pitch in now and share their perspective.

    > >
    > >> (2) Divide bandwidth between different cgroups without modifying each
    > >> of the existing schedulers (and without replicating the code).
    > >>
    > >> One possible approach to handle (1) is to keep track of bandwidth
    > >> utilized by each cgroup in a per cgroup data structure (instead of a
    > >> per cgroup per device data structure) and use that information to make
    > >> scheduling decisions within the elevator level schedulers. Such a
    > >> patch can be made flag-disabled if co-ordination across different
    > >> device schedulers is not required.
    > >>

    > >
    > > Can you give more details about it? I am not sure I understand it. Exactly
    > > what information should be stored in each cgroup?
    > >
    > > I think per cgroup per device data structures are good so that a scheduler
    > > will not worry about other devices present in the system and will just try
    > > to arbitrate between the various cgroups contending for that device. This
    > > goes back to the same issue of getting rid of requirement (1) from the io
    > > controller.

    >
    > I was thinking that we could keep track of the disk time used at each
    > device, and keep the cumulative number in a per cgroup data structure.
    > But that is only needed if we want to support bandwidth division across
    > devices. You and I both agree that we probably do not need to do
    > that.
    >
    > >
    > >> And (2) can probably be handled by having one scheduler support
    > >> different modes. For example, one possible mode is "proportional
    > >> division between cgroups + no-op between threads of a cgroup" or "cfq
    > >> between cgroups + cfq between threads of a cgroup". That would also
    > >> help avoid combinations which might not work, e.g. the RT request issue
    > >> mentioned earlier in this email. And this unified scheduler can re-use
    > >> code from all the existing patches.
    > >>

    > >
    > > IIUC, you are suggesting some kind of unification among the four IO
    > > schedulers so that the proportional weight code is not replicated and the
    > > user can switch modes on the fly based on tunables?

    >
    > Yes, that seems to be a solution to avoid replication of code. But we
    > should also look at any other solutions that avoid replication of
    > code, and also avoid scheduling in two different layers.
    > In my opinion, scheduling at two different layers is problematic because
    > (a) Any buffering done at a higher level will be artificial, unless
    > the queues at lower levels are completely full. And if there is no
    > buffering at a higher level, any scheduling scheme would be
    > ineffective.
    > (b) We cannot have an arbitrary mixing and matching of higher and
    > lower level schedulers.
    >
    > (a) would exist in any solution in which requests are queued at
    > multiple levels. Can you please comment on this with respect to the
    > patch that you have posted?
    >


    I am not very sure about the question, but in my patch, buffering at
    the higher layer is irrespective of the status of the underlying queue. We
    try our best to fill the underlying queue with requests, subject only to
    the criteria of proportional bandwidth.

    So, say there are two cgroups A and B and we allocate each of them 2000
    tokens to begin with. If A consumes all its tokens quickly and B
    has not, then we will stop A from dispatching more requests and wait for
    B to either issue more IO and consume its tokens or get out of contention.
    This can leave the disk idle for some time. We can probably do some
    optimizations here.
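
    In pseudo-C, the dispatch decision is roughly the following (a sketch
    with hypothetical names and size-based charging, not the actual patch
    code, which has more bookkeeping):

        #include <linux/bio.h>

        struct io_group {
                long tokens;    /* tokens left in the current allocation */
        };

        /* Charge a bio against the group's tokens. A group that runs
         * out is held back until tokens are reallocated, even if the
         * disk goes idle meanwhile -- the idleness problem above. */
        static int group_may_dispatch(struct io_group *iog, struct bio *bio)
        {
                long cost = bio_sectors(bio);   /* size-based cost */

                if (iog->tokens < cost)
                        return 0;               /* hold this group back */
                iog->tokens -= cost;
                return 1;
        }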

    Thanks
    Vivek


  8. Re: [patch 3/4] io controller: Core IO controller implementation logic

    vgoyal@redhat.com wrote:

    Hi vivek,

    I think bio_group_controller() needs to be exported with EXPORT_SYMBOL().

    --
    Regards
    Gui Jianfeng


  9. Re: [patch 0/4] [RFC] Another proportional weight IO controller

    On Mon, Nov 10, 2008 at 6:11 AM, Vivek Goyal wrote:
    > On Fri, Nov 07, 2008 at 01:36:20PM -0800, Nauman Rafique wrote:
    >> On Fri, Nov 7, 2008 at 6:19 AM, Vivek Goyal wrote:
    >> > On Thu, Nov 06, 2008 at 03:07:57PM -0800, Nauman Rafique wrote:
    >> >> It seems that approaches with two level scheduling (DM-IOBand or this
    >> >> patch set on top and another scheduler at elevator) will have the
    >> >> possibility of undesirable interactions (see "issues" listed at the
    >> >> end of the second patch). For example, a request submitted as RT might
    >> >> get delayed at higher layers, even if cfq at elevator level is doing
    >> >> the right thing.
    >> >>
    >> >
    >> > Yep. Buffering of bios at higher layer can break underlying elevator's
    >> > assumptions.
    >> >
    >> > What if we start keeping track of task priorities and RT tasks in higher
    >> > level schedulers and dispatch the bios accordingly. Will it break the
    >> > underlying noop, deadline or AS?

    >>
    >> It will probably not. But then we have a cfq-like scheduler at higher
    >> level and we can agree that the combinations "cfq(higher
    >> level)-noop(lower level)", "cfq-deadline", "cfq-as" and "cfq-cfq"
    >> would probably work. But if we implement one high level cfq-like
    >> scheduler at a higher level, we would not take care of somebody who
    >> wants noop-noop or proportional-noop. The point I am trying to make is
    >> that there is probably no single one-size-fits-all solution for a
    >> higher level scheduler. And we should limit the arbitrary mixing and
    >> matching of higher level schedulers and elevator schedulers. That
    >> being said, the existence of a higher level scheduler is still a point
    >> of debate I guess, see my comments below.
    >>

    >
    > Yeah, implementing a CFQ-like thing in a higher level scheduler will make
    > things complex.
    >
    >>
    >> >
    >> >> Moreover, if the requests in the higher level scheduler are dispatched
    >> >> as soon as they come, there would be no queuing at the higher layers,
    >> >> unless the request queue at the lower level fills up and causes a
    >> >> backlog. And in the absence of queuing, any work-conserving scheduler
    >> >> would behave as a no-op scheduler.
    >> >>
    >> >> These issues motivate taking a second look at two level scheduling.
    >> >> The main motivations for two level scheduling seem to be:
    >> >> (1) Support bandwidth division across multiple devices for RAID and LVMs.
    >> >
    >> > Nauman, can you give an example where we really need bandwidth division
    >> > for higher level devices?
    >> >
    >> > I am beginning to think that the real contention is at the leaf level
    >> > physical devices and not at the higher level logical devices, hence we
    >> > should be doing resource management only at the leaf level and not worry
    >> > about the higher level logical devices.
    >> >
    >> > If this requirement goes away, then the case for a two level scheduler
    >> > weakens and one needs to think about making changes in the leaf level IO
    >> > schedulers.

    >>
    >> I cannot agree with you more that contention exists only at
    >> the leaf level physical devices and that bandwidth should be managed only
    >> there. But having seen earlier posts on this list, I feel some folks
    >> might not agree with us. For example, if we have RAID-0 striping, we
    >> might want to schedule requests based on the cumulative bandwidth used
    >> over all devices. Again, I myself don't agree with moving scheduling
    >> to a higher level just to support that.
    >>

    >
    > Hmm, I am not very convinced that we need to do resource management
    > at the RAID0 device. The common case for resource management is that a
    > higher priority task group is not deprived of resources because of a lower
    > priority task group. So if there is no contention between two task groups
    > (at the leaf node), then I might as well give them full access to the
    > RAID0 logical device without any control.
    >
    > I hope people who have a requirement for control at higher level devices
    > can pitch in now and share their perspective.
    >
    >> >
    >> >> (2) Divide bandwidth between different cgroups without modifying each
    >> >> of the existing schedulers (and without replicating the code).
    >> >>
    >> >> One possible approach to handle (1) is to keep track of bandwidth
    >> >> utilized by each cgroup in a per cgroup data structure (instead of a
    >> >> per cgroup per device data structure) and use that information to make
    >> >> scheduling decisions within the elevator level schedulers. Such a
    >> >> patch can be made flag-disabled if co-ordination across different
    >> >> device schedulers is not required.
    >> >>
    >> >
    >> > Can you give more details about it? I am not sure I understand it. Exactly
    >> > what information should be stored in each cgroup?
    >> >
    >> > I think per cgroup per device data structures are good so that a scheduler
    >> > will not worry about other devices present in the system and will just try
    >> > to arbitrate between the various cgroups contending for that device. This
    >> > goes back to the same issue of getting rid of requirement (1) from the io
    >> > controller.

    >>
    >> I was thinking that we could keep track of the disk time used at each
    >> device, and keep the cumulative number in a per cgroup data structure.
    >> But that is only needed if we want to support bandwidth division across
    >> devices. You and I both agree that we probably do not need to do
    >> that.
    >>
    >> >
    >> >> And (2) can probably be handled by having one scheduler support
    >> >> different modes. For example, one possible mode is "proportional
    >> >> division between cgroups + no-op between threads of a cgroup" or "cfq
    >> >> between cgroups + cfq between threads of a cgroup". That would also
    >> >> help avoid combinations which might not work, e.g. the RT request issue
    >> >> mentioned earlier in this email. And this unified scheduler can re-use
    >> >> code from all the existing patches.
    >> >>
    >> >
    >> > IIUC, you are suggesting some kind of unification among the four IO
    >> > schedulers so that the proportional weight code is not replicated and the
    >> > user can switch modes on the fly based on tunables?

    >>
    >> Yes, that seems to be a solution to avoid replication of code. But we
    >> should also look at any other solutions that avoid replication of
    >> code, and also avoid scheduling in two different layers.
    >> In my opinion, scheduling at two different layers is problematic because
    >> (a) Any buffering done at a higher level will be artificial, unless
    >> the queues at lower levels are completely full. And if there is no
    >> buffering at a higher level, any scheduling scheme would be
    >> ineffective.
    >> (b) We cannot have an arbitrary mixing and matching of higher and
    >> lower level schedulers.
    >>
    >> (a) would exist in any solution in which requests are queued at
    >> multiple levels. Can you please comment on this with respect to the
    >> patch that you have posted?
    >>

    >
    > I am not very sure about the question, but in my patch, buffering at
    > the higher layer is irrespective of the status of the underlying queue. We
    > try our best to fill the underlying queue with requests, subject only to
    > the criteria of proportional bandwidth.
    >
    > So, say there are two cgroups A and B and we allocate each of them 2000
    > tokens to begin with. If A consumes all its tokens quickly and B
    > has not, then we will stop A from dispatching more requests and wait for
    > B to either issue more IO and consume its tokens or get out of contention.
    > This can leave the disk idle for some time. We can probably do some
    > optimizations here.


    What do you think about elevator based solutions like the two level cfq
    patches submitted by Satoshi and Vasily earlier? CFQ can be trivially
    modified to do proportional division (i.e. give time slices in
    proportion to weight instead of priority). Such a solution would
    avoid idleness problems like the one you mentioned above and can also
    avoid the burstiness issues (see the smoothing patches -- v1.2.0 and
    v1.3.0 -- of dm-ioband) in token based schemes.

    Also, doing time based token allocation (as you mentioned in the TODO
    list) sounds very interesting. Can we look at the disk time taken by each
    bio and use that to account for tokens? The problem is that the time
    taken is not available when the requests are sent to the disk, but we can
    do delayed token charging (i.e. deduct tokens after the request is
    completed?). It seems that such an approach should work. What do you
    think?
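
    As a sketch of the delayed charging idea (hypothetical names and
    fields, not from any posted patch): stamp the request when it is
    dispatched, and deduct tokens from the owning group only when the
    completion tells us the actual disk time consumed.

        #include <linux/ktime.h>

        struct io_group {
                long long tokens;       /* may go negative temporarily */
        };

        /* Hypothetical per-request bookkeeping. */
        struct rq_charge {
                struct io_group *iog;
                s64 start_ns;
        };

        static void charge_on_dispatch(struct rq_charge *rqc)
        {
                rqc->start_ns = ktime_to_ns(ktime_get());
        }

        /* Called from the completion path: charge after the fact. A
         * group that goes negative simply cannot dispatch again until
         * its token balance is replenished. */
        static void charge_on_completion(struct rq_charge *rqc)
        {
                s64 disk_time = ktime_to_ns(ktime_get()) - rqc->start_ns;

                rqc->iog->tokens -= disk_time;
        }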

    >
    > Thanks
    > Vivek
    >
    >> >>
    >> >> On Thu, Nov 6, 2008 at 9:08 AM, Vivek Goyal wrote:
    >> >> > On Thu, Nov 06, 2008 at 05:52:07PM +0100, Peter Zijlstra wrote:
    >> >> >> On Thu, 2008-11-06 at 11:39 -0500, Vivek Goyal wrote:
    >> >> >> > On Thu, Nov 06, 2008 at 05:16:13PM +0100, Peter Zijlstra wrote:
    >> >> >> > > On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote:
    >> >> >> > >
    >> >> >> > > > > Does this still require I use dm, or does it also work on regular block
    >> >> >> > > > > devices? Patch 4/4 isn't quite clear on this.
    >> >> >> > > >
    >> >> >> > > > No. You don't have to use dm. It will simply work on regular devices. We
    >> >> >> > > > shall have to put a few lines of code in place for it to work on devices
    >> >> >> > > > which don't use the standard __make_request() function and provide their
    >> >> >> > > > own make_request function.
    >> >> >> > > >
    >> >> >> > > > For example, I have put those few lines of code in place so that it works
    >> >> >> > > > with dm devices. I shall have to do something similar for md too.
    >> >> >> > > >
    >> >> >> > > > Though, I am not very sure why I need to do IO control on higher level
    >> >> >> > > > devices. Will it be sufficient if we control only the bottom-most
    >> >> >> > > > physical block devices?
    >> >> >> > > >
    >> >> >> > > > Anyway, this approach should work at any level.
    >> >> >> > >
    >> >> >> > > Nice, although I would think doing only the higher level devices makes
    >> >> >> > > more sense than doing only the leaves.
    >> >> >> > >
    >> >> >> >
    >> >> >> > I thought that we should be doing any kind of resource management only
    >> >> >> > at the level where there is actual contention for the resources. So in
    >> >> >> > this case it looks like only the bottom-most devices are slow and don't
    >> >> >> > have infinite bandwidth, hence the contention. (I am not taking into
    >> >> >> > account contention at the bus level or at the interconnect level for
    >> >> >> > external storage, assuming the interconnect is not the bottleneck.)
    >> >> >> >
    >> >> >> > For example, let's say there is one linear device mapper device dm-0 on
    >> >> >> > top of physical devices sda and sdb, and two tasks in two different
    >> >> >> > cgroups are reading two different files from device dm-0. Now if these
    >> >> >> > files both fall on the same physical device (either sda or sdb), then
    >> >> >> > they will be contending for resources. But if the files being read are
    >> >> >> > on different physical devices, then practically there is no device
    >> >> >> > contention (even if on the surface it might look like dm-0 is being
    >> >> >> > contended for). If the files are on different physical devices, an IO
    >> >> >> > controller sitting at dm-0 will not know it; it will simply dispatch
    >> >> >> > one group at a time, and the other device might remain idle.
    >> >> >> >
    >> >> >> > Keeping that in mind, I thought we would be able to make use of the full
    >> >> >> > available bandwidth if we do IO control only at the bottom-most devices.
    >> >> >> > Doing it at a higher layer has the potential of not making use of the
    >> >> >> > full available bandwidth.
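
    To see why the contention is per physical device, consider how a linear
    mapping splits the logical sector range. The sketch below is a toy
    simplification with invented names, not dm-linear's real interface.

        #include <stdio.h>

        typedef unsigned long long sector_t;

        /* Toy linear mapping: the first 'split' sectors of dm-0 live on
         * sda, the rest on sdb. */
        static const char *linear_target(sector_t sector, sector_t split)
        {
            return sector < split ? "sda" : "sdb";
        }

        int main(void)
        {
            sector_t split = 1ULL << 20;
            sector_t file1 = 4096;                /* lands on sda */
            sector_t file2 = (1ULL << 20) + 4096; /* lands on sdb */

            /* Readers of file1 and file2 hit different physical devices,
             * so there is no real contention even though both IOs pass
             * through dm-0. */
            printf("file1 -> %s, file2 -> %s\n",
                   linear_target(file1, split), linear_target(file2, split));
            return 0;
        }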
    >> >> >> >
    >> >> >> > > Is there any reason we cannot merge this with the regular io-scheduler
    >> >> >> > > interface? afaik the only problem with doing group scheduling in the
    >> >> >> > > io-schedulers is the stacked devices issue.
    >> >> >> >
    >> >> >> > I think we should be able to merge it with the regular IO schedulers.
    >> >> >> > Apart from the stacked device issue, people also mentioned that it is so
    >> >> >> > closely tied to the IO schedulers that we would end up doing four
    >> >> >> > implementations for four schedulers, and that is not very good from a
    >> >> >> > maintenance perspective.
    >> >> >> >
    >> >> >> > But I will spend more time finding out whether there is common ground
    >> >> >> > between the schedulers so that a lot of common IO control code can be
    >> >> >> > used in all of them.
    >> >> >> >
    >> >> >> > >
    >> >> >> > > Could we make the io-schedulers aware of this hierarchy?
    >> >> >> >
    >> >> >> > You mean IO schedulers knowing that there is somebody above them doing
    >> >> >> > proportional weight dispatching of bios? If yes, how would that help?
    >> >> >>
    >> >> >> Well, take the slightly more elaborate example of a raid[56] setup. This
    >> >> >> will sometimes need to issue multiple leaf level ios to satisfy one top
    >> >> >> level io.
    >> >> >>
    >> >> >> How are you going to attribute this fairly?
    >> >> >>
    >> >> >
    >> >> > I think in this case, the definition of fair allocation will be a little
    >> >> > different. We will do fair allocation only at the leaf nodes where
    >> >> > there is actual contention, irrespective of the higher level setup.
    >> >> >
    >> >> > So if a higher level block device issues multiple ios to satisfy one top
    >> >> > level io, we will actually do the bandwidth allocation on those
    >> >> > multiple ios, because that's the real IO contending for disk
    >> >> > bandwidth. And if these multiple ios are going to different physical
    >> >> > devices, then contention management will take place on those devices.
    >> >> >
    >> >> > IOW, we will not worry about providing fairness for bios submitted to
    >> >> > higher level devices. We will just pitch in for contention management
    >> >> > only when requests from various cgroups are contending for a physical
    >> >> > device at the bottom-most layer. Isn't that fair?
    >> >> >
    >> >> > Thanks
    >> >> > Vivek
    >> >> >
    >> >> >> I don't think the issue of bandwidth availability like the above will
    >> >> >> really be a problem: if your stripe is set up symmetrically, the
    >> >> >> contention should average out over both (all) disks in equal measure.
    >> >> >>
    >> >> >> The only real issue I can see is with linear volumes, but those are
    >> >> >> stupid anyway - none of the gains but all the risks.

  10. Re: [patch 0/4] [RFC] Another proportional weight IO controller

    On Tue, Nov 11, 2008 at 11:55:53AM -0800, Nauman Rafique wrote:
    > On Mon, Nov 10, 2008 at 6:11 AM, Vivek Goyal wrote:
    > > On Fri, Nov 07, 2008 at 01:36:20PM -0800, Nauman Rafique wrote:
    > >> On Fri, Nov 7, 2008 at 6:19 AM, Vivek Goyal wrote:
    > >> > On Thu, Nov 06, 2008 at 03:07:57PM -0800, Nauman Rafique wrote:
    > >> >> It seems that approaches with two level scheduling (DM-IOBand or this
    > >> >> patch set on top and another scheduler at elevator) will have the
    > >> >> possibility of undesirable interactions (see "issues" listed at the
    > >> >> end of the second patch). For example, a request submitted as RT might
    > >> >> get delayed at higher layers, even if cfq at elevator level is doing
    > >> >> the right thing.
    > >> >>
    > >> >
    > >> > Yep. Buffering of bios at higher layer can break underlying elevator's
    > >> > assumptions.
    > >> >
    > >> > What if we start keeping track of task priorities and RT tasks in higher
    > >> > level schedulers and dispatch the bios accordingly. Will it break the
    > >> > underlying noop, deadline or AS?
    > >>
    > >> It probably will not. But then we have a CFQ-like scheduler at a
    > >> higher level, and we can agree that the combinations "cfq (higher
    > >> level) - noop (lower level)", "cfq-deadline", "cfq-as" and "cfq-cfq"
    > >> would probably work. But if we implement a CFQ-like scheduler at a
    > >> higher level, we would not take care of somebody who wants noop-noop
    > >> or proportional-noop. The point I am trying to make is that there is
    > >> probably no single one-size-fits-all solution for a higher level
    > >> scheduler, and we should limit the arbitrary mixing and matching of
    > >> higher level schedulers and elevator schedulers. That being said, the
    > >> existence of a higher level scheduler is still a point of debate I
    > >> guess; see my comments below.
    > >>

    > >
    > > Ya, implementing a CFQ-like thing in a higher level scheduler will make
    > > things complex.
    > >
    > >>
    > >> >
    > >> >> Moreover, if the requests in the higher level scheduler are dispatched
    > >> >> as soon as they come, there would be no queuing at the higher layers,
    > >> >> unless the request queue at the lower level fills up and causes a
    > >> >> backlog. And in the absence of queuing, any work-conserving scheduler
    > >> >> would behave as a no-op scheduler.
    > >> >>
    > >> >> These issues motivate taking a second look at two level scheduling.
    > >> >> The main motivations for two level scheduling seem to be:
    > >> >> (1) Support bandwidth division across multiple devices for RAID and LVMs.
    > >> >
    > >> > Nauman, can you give an example where we really need bandwidth division
    > >> > for higher level devices.
    > >> >
    > >> > I am beginning to think that real contention is at leaf level physical
    > >> > devices and not at higher level logical devices hence we should be doing
    > >> > any resource management only at leaf level and not worry about higher
    > >> > level logical devices.
    > >> >
    > >> > If this requirement goes away, then case of two level scheduler weakens
    > >> > and one needs to think about doing changes at leaf level IO schedulers.
    > >>
    > >> I cannot agree with you more that there is contention only at the
    > >> leaf level physical devices and bandwidth should be managed only
    > >> there. But having seen earlier posts on this list, I feel some folks
    > >> might not agree with us. For example, if we have RAID-0 striping, we
    > >> might want to schedule requests based on the cumulative bandwidth used
    > >> over all devices. Again, I myself don't agree with moving scheduling
    > >> to a higher level just to support that.
    > >>

    > >
    > > Hmm.., I am not very convinced that we need to do resource management
    > > at the RAID0 device. The common case of resource management is that a
    > > higher priority task group is not deprived of resources because of a
    > > lower priority task group. So if there is no contention between two
    > > task groups (at the leaf nodes), then I might as well give them full
    > > access to the RAID 0 logical device without any control.
    > >
    > > Hope people who have requirement of control at higher level devices can
    > > pitch in now and share their perspective.
    > >
    > >> >
    > >> >> (2) Divide bandwidth between different cgroups without modifying each
    > >> >> of the existing schedulers (and without replicating the code).
    > >> >>
    > >> >> One possible approach to handle (1) is to keep track of bandwidth
    > >> >> utilized by each cgroup in a per cgroup data structure (instead of a
    > >> >> per cgroup per device data structure) and use that information to make
    > >> >> scheduling decisions within the elevator level schedulers. Such a
    > >> >> patch can be made flag-disabled if co-ordination across different
    > >> >> device schedulers is not required.
    > >> >>
    > >> >
    > >> > Can you give more details about it. I am not sure I understand it. Exactly
    > >> > what information should be stored in each cgroup.
    > >> >
    > >> > I think per cgroup per device data structures are good so that a scheduler
    > >> > will not worry about other devices present in the system and will just try
    > >> > to arbitrate between the various cgroups contending for that device. This goes
    > >> > back to same issue of getting rid of requirement (1) from io controller.
    > >>
    > >> I was thinking that we can keep track of the disk time used at each
    > >> device, and keep the cumulative number in a per cgroup data structure.
    > >> But that is only if we want to support bandwidth division across
    > >> devices. You and I both agree that we probably do not need to do
    > >> that.
    > >>
    > >> >
    > >> >> And (2) can probably be handled by having one scheduler support
    > >> >> different modes. For example, one possible mode is "proportional
    > >> >> division between cgroups + no-op between threads of a cgroup" or "cfq
    > >> >> between cgroups + cfq between threads of a cgroup". That would also
    > >> >> help avoid combinations which might not work, e.g. the RT request issue
    > >> >> mentioned earlier in this email. And this unified scheduler can re-use
    > >> >> code from all the existing patches.
    > >> >>
    > >> >
    > >> > IIUC, you are suggesting some kind of unification between the four IO
    > >> > schedulers so that the proportional weight code is not replicated and the
    > >> > user can switch modes on the fly based on tunables?
    > >>
    > >> Yes, that seems to be a solution to avoid replication of code. But we
    > >> should also look at any other solutions that avoid replication of
    > >> code, and also avoid scheduling in two different layers.
    > >> In my opinion, scheduling at two different layers is problematic because
    > >> (a) Any buffering done at a higher level will be artificial, unless
    > >> the queues at lower levels are completely full. And if there is no
    > >> buffering at a higher level, any scheduling scheme would be
    > >> ineffective.
    > >> (b) We cannot have an arbitrary mixing and matching of higher and
    > >> lower level schedulers.
    > >>
    > >> (a) would exist in any solution in which requests are queued at
    > >> multiple levels. Can you please comment on this with respect to the
    > >> patch that you have posted?
    > >>

    > >
    > > I am not very sure about the question, but in my patch, buffering at
    > > the higher layer is done irrespective of the status of the underlying
    > > queue. We try our best to fill the underlying queue with requests,
    > > subject only to the criteria of proportional bandwidth.
    > >
    > > So, say there are two cgroups A and B, and we allocate each of them
    > > 2000 tokens to begin with. If A consumes all its tokens quickly and B
    > > does not, then we will stop A from dispatching more requests and wait
    > > for B to either issue more IO and consume its tokens or get out of
    > > contention. This can leave the disk idle for some time. We can
    > > probably do some optimizations here.

    >
    > What do you think about elevator-based solutions like the two-level CFQ
    > patches submitted by Satoshi and Vasily earlier?


    I have had a very high level look at Satoshi's patch and will go into
    details soon. I was thinking that this patch solves the problem only
    for CFQ. Can we create a common layer which can be shared by all
    four IO schedulers?

    This one common layer could take care of all the management w.r.t.
    per-device per-cgroup data structures, track all the groups and their
    limits (in either a token-based or a time-based scheme), and control
    the dispatch of requests.

    This way we can enable the IO controller not only for CFQ but for all
    the IO schedulers without duplicating too much code.

    This is what I am playing around with currently. At this point I am
    not sure how much common ground I can find between all the IO
    schedulers.
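
    As a rough picture of what such a common layer could look like, the
    sketch below shows per-device per-cgroup structures that any elevator
    might embed. Every identifier here is hypothetical; none of this is
    code from the posted patches.

        /* Sketch of a fairness layer shared by noop/deadline/AS/CFQ.
         * All identifiers are invented for illustration. */
        struct io_cgroup_stats {
            unsigned int weight;     /* cgroup's proportional share */
            long         budget;     /* tokens or time remaining */
        };

        /* One instance per (device, cgroup) pair, so each elevator only
         * arbitrates among the cgroups contending for its own device. */
        struct io_group_on_device {
            struct io_cgroup_stats stats;
            void *sched_queue;       /* elevator-private request queue */
        };

        /* Hooks a scheduler would call into the shared layer: pick the
         * next group to serve, then charge it for the work done. */
        struct io_fairness_ops {
            struct io_group_on_device *(*pick_group)(void *device_data);
            void (*charge)(struct io_group_on_device *g, long cost);
        };

    The point of the indirection is that the dispatch policy (token-based
    today, time-based later) lives behind io_fairness_ops, so swapping the
    algorithm would not touch the individual elevators.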

    > CFQ can be trivially
    > modified to do proportional division (i.e. give time slices in
    > proportion to weight instead of priority).
    > And such a solution would
    > avoid the idleness problem you mentioned above.


    Can you elaborate a little on how you get around the idleness problem?
    If you don't create idleness, then if two tasks in two cgroups are doing
    sequential IO, they might simply get into lockstep and we will not achieve
    any differentiated service proportional to their weights.
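
    For context, the usual CFQ-style answer to this lockstep is to idle
    briefly on a group after one of its requests completes, waiting for its
    next sequential request instead of switching away at once. A rough
    sketch of that decision, with invented names:

        #include <stdbool.h>
        #include <stdio.h>

        /* Hypothetical per-group state for an idling decision. */
        struct sched_group {
            unsigned long last_completion;  /* jiffies-like timestamp */
            unsigned long idle_window;      /* how long we will wait */
            bool          seeky;            /* random IO? idling buys nothing */
        };

        /* After a group's request completes and its queue is empty, keep
         * the disk idle for up to idle_window instead of switching to
         * another group; a sequential reader usually submits its next IO
         * within that window, so the group gets its weighted share. */
        static bool should_idle(const struct sched_group *g, unsigned long now)
        {
            if (g->seeky)
                return false;
            return now - g->last_completion < g->idle_window;
        }

        int main(void)
        {
            struct sched_group g = { .last_completion = 100,
                                     .idle_window = 8, .seeky = false };

            printf("idle at t=105: %d, idle at t=120: %d\n",
                   should_idle(&g, 105), should_idle(&g, 120));
            return 0;
        }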

    > and can also
    > avoid the burstiness issues (see smoothing patches -- v1.2.0 and v1.3.0 --
    > of dm-ioband) seen in token-based schemes.
    >
    > Also, doing time-based token allocation (as you mentioned in the TODO
    > list) sounds very interesting. Can we look at the disk time taken by
    > each bio and use that to account for tokens? The problem is that the
    > time taken is not available when the requests are sent to disk, but we
    > can do delayed token charging (i.e. deduct tokens after the request is
    > completed). It seems that such an approach should work. What do you
    > think?


    This is a good idea. Charging the cgroup based on the time actually
    consumed should be doable. I will look into it. I think in the past
    somebody asked how you account for the seek time incurred by
    switching between cgroups. Maybe an average time per cgroup can help
    here a bit.

    This is more about refining the dispatch algorithm once we have agreed
    upon the other semantics, like the two-level scheduler question and
    whether we can come up with a common layer shared by all four IO
    schedulers. Once a common layer is possible, we can always change its
    algorithm from token-based to time-based to achieve better accuracy.
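
    A minimal userspace sketch of the delayed charging idea discussed
    above: stamp the request at dispatch and deduct the group's budget at
    completion, once the actual disk time is known. The names are invented
    for illustration; this is a model, not kernel code.

        #include <stdio.h>

        struct request_stamp {
            unsigned long dispatch_time;  /* set when the request goes to disk */
        };

        struct time_group {
            long time_budget;             /* remaining disk-time allowance */
        };

        static void on_dispatch(struct request_stamp *rq, unsigned long now)
        {
            rq->dispatch_time = now;
        }

        /* The cost is only known at completion, so the deduction is
         * delayed: a group can transiently overrun its budget by its
         * in-flight requests, and later dispatch decisions compensate. */
        static void on_completion(struct time_group *g,
                                  const struct request_stamp *rq,
                                  unsigned long now)
        {
            g->time_budget -= (long)(now - rq->dispatch_time);
        }

        int main(void)
        {
            struct time_group grp = { .time_budget = 100 };
            struct request_stamp rq;

            on_dispatch(&rq, 0);
            on_completion(&grp, &rq, 8);  /* the request took 8 time units */
            printf("remaining budget: %ld\n", grp.time_budget);
            return 0;
        }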

    Thanks
    Vivek
