Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)


Thread: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

  1. Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

    Hi

    The bug is correctly analyzed.

    Regarding the patch --- you need to wake up md->wait when the pending
    count drops to zero in dec_pending, so that dm_suspend does not wait
    forever.

    Otherwise the patch is correct, but it contains some unneeded parts: if
    md->pending is used to block removal, we no longer need to hold the
    reference on the table at all. If the code is already called under a
    spinlock (in the upstream kernel), we could execute everything under the
    md->map_lock spinlock and drop the reference counting altogether. But the
    code can take an indefinite amount of time, so it would be best not to
    call it under a spinlock in the first place.

    For upstream Linux developers: you are holding a spinlock and calling the
    bdi*_congested functions, which can take an indefinite amount of time
    (there are even users reporting 50 disks in one logical volume or so). I
    think it would be good to move these calls out of spinlocks.

    What are the general locking rules for these block queue upcalls
    (congested_fn, unplug_fn, merge_bvec_fn and others)? It looks like there
    may be more bugs like this in other upcalls. Can you please specify
    somehow what an upcall can and cannot do?

    Mikulas

    On Wed, 5 Nov 2008, Chandra Seetharaman wrote:

    > dm_any_congested() just checks for the DMF_BLOCK_IO and has no
    > code to make sure that suspend waits for dm_any_congested() to
    > complete.
    >
    > This leads to a case where the table_destroy() and free_multipath()
    > and some other sleeping functions are called (thru dm_table_put())
    > from dm_any_congested.
    >
    > This leads to 2 problems:
    > 1. Sleeping functions called from congested code, whose caller
    > holds a spin lock.
    > 2. An ABBA deadlock between pdflush and multipathd. The two locks
    > in contention are inode lock and kernel lock.
    > Here is the crash analysis:
    > PID: 16879 TASK: ffff81013a06a140 CPU: 3 COMMAND: "pdflush"
    > Owns inode_lock and waiting for kernel_sem
    >
    > PID: 8299 TASK: ffff81024f03e100 CPU: 2 COMMAND: "multipathd"
    > Owns kernel_sem and waiting for inode_lock.
    >
    >
    > PID: 8299 TASK: ffff81024f03e100 CPU: 2 COMMAND: "multipathd"
    > #0 [ffff81024ad338c8] schedule at ffffffff8128534c
    > #1 [ffff81024ad33980] rt_spin_lock_slowlock at ffffffff81286d15 << Waiting for inode_lock
    > #2 [ffff81024ad33a40] __rt_spin_lock at ffffffff812873b0
    > #3 [ffff81024ad33a50] rt_spin_lock at ffffffff812873bb
    > #4 [ffff81024ad33a60] ifind_fast at ffffffff810c32a1
    > #5 [ffff81024ad33a90] iget_locked at ffffffff810c3be8
    > #6 [ffff81024ad33ad0] proc_get_inode at ffffffff810ebaff
    > #7 [ffff81024ad33b10] proc_lookup at ffffffff810f082a
    > #8 [ffff81024ad33b40] proc_root_lookup at ffffffff810ec312
    > #9 [ffff81024ad33b70] do_lookup at ffffffff810b78f7
    > #10 [ffff81024ad33bc0] __link_path_walk at ffffffff810b9a93
    > #11 [ffff81024ad33c60] link_path_walk at ffffffff810b9fc1
    > #12 [ffff81024ad33d30] path_walk at ffffffff810ba073
    > #13 [ffff81024ad33d40] do_path_lookup at ffffffff810ba37a
    > #14 [ffff81024ad33d90] __path_lookup_intent_open at ffffffff810baeb0
    > #15 [ffff81024ad33de0] path_lookup_open at ffffffff810baf60
    > #16 [ffff81024ad33df0] open_namei at ffffffff810bb071
    > #17 [ffff81024ad33e80] do_filp_open at ffffffff810ae610
    > #18 [ffff81024ad33f30] do_sys_open at ffffffff810ae67f
    > #19 [ffff81024ad33f70] sys_open at ffffffff810ae729
    >
    >
    > PID: 16879 TASK: ffff81013a06a140 CPU: 3 COMMAND: "pdflush"
    > #0 [ffff810023063aa0] schedule at ffffffff8128534c
    > #1 [ffff810023063b58] rt_mutex_slowlock at ffffffff81286ac5 << Waiting for Kernel Lock
    > #2 [ffff810023063c28] rt_mutex_lock at ffffffff81285fb4
    > #3 [ffff810023063c38] rt_down at ffffffff8105fec7
    > #4 [ffff810023063c58] lock_kernel at ffffffff81287b8c
    > #5 [ffff810023063c78] __blkdev_put at ffffffff810d5d31
    > #6 [ffff810023063cb8] blkdev_put at ffffffff810d5e68
    > #7 [ffff810023063cc8] close_dev at ffffffff8819e547
    > #8 [ffff810023063ce8] dm_put_device at ffffffff8819e579
    > #9 [ffff810023063d08] free_priority_group at ffffffff881c0e86
    > #10 [ffff810023063d58] free_multipath at ffffffff881c0f11
    > #11 [ffff810023063d78] multipath_dtr at ffffffff881c0f73
    > #12 [ffff810023063d98] dm_table_put at ffffffff8819e347
    > #13 [ffff810023063dc8] dm_any_congested at ffffffff8819d074
    > #14 [ffff810023063df8] sync_sb_inodes at ffffffff810cd451
    > #15 [ffff810023063e38] writeback_inodes at ffffffff810cd7b5
    > #16 [ffff810023063e68] background_writeout at ffffffff8108be38
    > #17 [ffff810023063ed8] pdflush at ffffffff8108c79a
    > #18 [ffff810023063f28] kthread at ffffffff81051477
    > #19 [ffff810023063f48] kernel_thread at ffffffff8100d048
    >
    > crash> kernel_sem
    > kernel_sem = $5 = {
    >   count = {
    >     counter = 0
    >   },
    >   lock = {
    >     wait_lock = {
    >       raw_lock = {
    >         slock = 34952
    >       },
    >       break_lock = 0
    >     },
    >     wait_list = {
    >       prio_list = {
    >         next = 0xffff810023063b88,
    >         prev = 0xffff81014941fdc0
    >       },
    >       node_list = {
    >         next = 0xffff810023063b98,
    >         prev = 0xffff81014d04ddd0
    >       }
    >     },
    >     owner = 0xffff81024f03e102 << multipathd owns it (task 0xffff81024f03e100;
    >       the last two bits are flags, so mask them to zero)
    >   }
    > }
    > crash> inode_lock
    > inode_lock = $6 = {
    >   lock = {
    >     wait_lock = {
    >       raw_lock = {
    >         slock = 3341
    >       },
    >       break_lock = 0
    >     },
    >     wait_list = {
    >       prio_list = {
    >         next = 0xffff81024ad339a0,
    >         prev = 0xffff8100784fdd78
    >       },
    >       node_list = {
    >         next = 0xffff81024ad339b0,
    >         prev = 0xffff81024afb3a70
    >       }
    >     },
    >     owner = 0xffff81013a06a142 << pdflush owns it (task 0xffff81013a06a140)
    >   },
    >   break_lock = 0
    > }
    > crash>
    > ----------------
    >
    > Signed-off-by: Chandra Seetharaman
    > ----
    >
    > Index: linux-2.6.28-rc3/drivers/md/dm.c
    > ===================================================================
    > --- linux-2.6.28-rc3.orig/drivers/md/dm.c
    > +++ linux-2.6.28-rc3/drivers/md/dm.c
    > @@ -937,16 +937,21 @@ static void dm_unplug_all(struct request
    >
    > static int dm_any_congested(void *congested_data, int bdi_bits)
    > {
    > - int r;
    > + int r = bdi_bits;
    > struct mapped_device *md = (struct mapped_device *) congested_data;
    > - struct dm_table *map = dm_get_table(md);
    > + struct dm_table *map;
    >
    > - if (!map || test_bit(DMF_BLOCK_IO, &md->flags))
    > - r = bdi_bits;
    > - else
    > - r = dm_table_any_congested(map, bdi_bits);
    > + atomic_inc(&md->pending);
    > + if (test_bit(DMF_BLOCK_IO, &md->flags))
    > + goto done;
    >
    > - dm_table_put(map);
    > + map = dm_get_table(md);
    > + if (map) {
    > + r = dm_table_any_congested(map, bdi_bits);
    > + dm_table_put(map);
    > + }
    > +done:
    > + atomic_dec(&md->pending);
    > return r;
    > }
    >
    >
    >
    > --
    > dm-devel mailing list
    > dm-devel@redhat.com
    > https://www.redhat.com/mailman/listinfo/dm-devel
    >

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

    On Mon, Nov 10, 2008 at 08:11:51AM -0500, Mikulas Patocka wrote:
    > For upstream Linux developers: you are holding a spinlock and calling
    > bdi*_congested functions that can take indefinite amount of time (there
    > are even users reporting having 50 disks in one logical volume or so). I
    > think it would be good to move these calls out of spinlocks.


    Umm, they shouldn't block that long, as that completely defeats their
    purpose. These functions are mostly used to avoid throwing more I/O at
    a congested device if pdflush could do more useful things instead. But
    if it blocks in those functions anyway, we wouldn't have to bother using
    them. Do you have more details about the use cases where this happens
    and where the routines spend so much time?


  3. Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

    On Mon, 10 Nov 2008, Christoph Hellwig wrote:

    > On Mon, Nov 10, 2008 at 08:11:51AM -0500, Mikulas Patocka wrote:
    > > For upstream Linux developers: you are holding a spinlock and calling
    > > bdi*_congested functions that can take indefinite amount of time (there
    > > are even users reporting having 50 disks in one logical volume or so). I
    > > think it would be good to move these calls out of spinlocks.

    >
    > Umm, they shouldn't block that long, as that completely defeats their
    > purpose. These functions are mostly used to avoid throwing more I/O at
    > a congested device if pdflush could do more useful things instead. But
    > if it blocks in those functions anyway we wouldn't have to bother using
    > them. Do you have more details about the uses cases when this happens
    > and where the routines spend so much time?


    For device mapper, congested_fn asks every device in the tree and ORs
    their bits together --- so if the user has 50 devices, it asks them all.

    For md-linear, md-raid0, md-raid1, md-raid10 and md-multipath it does the
    same --- asking every device.

    If you have a better idea how to implement congested_fn, say it.

    Mikulas

  4. Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

    On Mon, Nov 10, 2008 at 08:54:01AM -0500, Christoph Hellwig wrote:
    > On Mon, Nov 10, 2008 at 08:11:51AM -0500, Mikulas Patocka wrote:
    > > For upstream Linux developers: you are holding a spinlock and calling
    > > bdi*_congested functions that can take indefinite amount of time (there
    > > are even users reporting having 50 disks in one logical volume or so). I
    > > think it would be good to move these calls out of spinlocks.

    > Umm, they shouldn't block that long, as that completely defeats their
    > purpose.


    Indeed - the blocking was a bug for which there's a patch, but that doesn't
    deal with how the function should be operating in the first place.

    - If one device is found to be congested, why bother checking the remaining
    devices?

    - If the device is suspended, the response should be that it is congested, I'd
    have thought.

    Alasdair
    --
    agk@redhat.com

  5. Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

    On Mon, 2008-11-10 at 09:19 -0500, Mikulas Patocka wrote:
    > On Mon, 10 Nov 2008, Christoph Hellwig wrote:
    >
    > > On Mon, Nov 10, 2008 at 08:11:51AM -0500, Mikulas Patocka wrote:
    > > > For upstream Linux developers: you are holding a spinlock and calling
    > > > bdi*_congested functions that can take indefinite amount of time (there
    > > > are even users reporting having 50 disks in one logical volume or so). I
    > > > think it would be good to move these calls out of spinlocks.

    > >
    > > Umm, they shouldn't block that long, as that completely defeats their
    > > purpose. These functions are mostly used to avoid throwing more I/O at
    > > a congested device if pdflush could do more useful things instead. But
    > > if it blocks in those functions anyway we wouldn't have to bother using
    > > them. Do you have more details about the uses cases when this happens
    > > and where the routines spend so much time?

    >
    > For device mapper, congested_fn asks every device in the tree and make OR
    > of their bits --- so if the user has 50 devices, it asks them all.
    >
    > For md-linear, md-raid0, md-raid1, md-raid10 and md-multipath it does the
    > same --- asking every device.
    >
    > If you have a better idea how to implement congested_fn, say it.


    Fix the infrastructure by adding a function call so that the individual
    devices can report their congestion state to the aggregate.

    Then congested_fn can return a valid state in O(1), because the state is
    kept up-to-date by the individual state changes.

    IOW, add a set_congested_fn() and clear_congested_fn().

  6. Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)



    On Mon, 10 Nov 2008, Peter Zijlstra wrote:

    > On Mon, 2008-11-10 at 09:19 -0500, Mikulas Patocka wrote:
    > > On Mon, 10 Nov 2008, Christoph Hellwig wrote:
    > >
    > > > On Mon, Nov 10, 2008 at 08:11:51AM -0500, Mikulas Patocka wrote:
    > > > > For upstream Linux developers: you are holding a spinlock and calling
    > > > > bdi*_congested functions that can take indefinite amount of time (there
    > > > > are even users reporting having 50 disks in one logical volume or so). I
    > > > > think it would be good to move these calls out of spinlocks.
    > > >
    > > > Umm, they shouldn't block that long, as that completely defeats their
    > > > purpose. These functions are mostly used to avoid throwing more I/O at
    > > > a congested device if pdflush could do more useful things instead. But
    > > > if it blocks in those functions anyway we wouldn't have to bother using
    > > > them. Do you have more details about the uses cases when this happens
    > > > and where the routines spend so much time?

    > >
    > > For device mapper, congested_fn asks every device in the tree and make OR
    > > of their bits --- so if the user has 50 devices, it asks them all.
    > >
    > > For md-linear, md-raid0, md-raid1, md-raid10 and md-multipath it does the
    > > same --- asking every device.
    > >
    > > If you have a better idea how to implement congested_fn, say it.

    >
    > Fix the infrastructure by adding a function call so that you can have
    > the individual devices report their congestion state to the aggregate.
    >
    > Then congestion_fn can return a valid state in O(1) because the state is
    > keps up-to-date by the individual state changes.
    >
    > IOW, add a set_congested_fn() and clear_congested_fn().


    If you have a physical disk with many LVM volumes on it, you end up in a
    situation where a change in the disk's congestion state is reported to
    all the volumes. So it creates an O(n) problem on the other side.

    Mikulas

  7. Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

    On Mon, Nov 10, 2008 at 03:26:50PM +0100, Peter Zijlstra wrote:
    > Fix the infrastructure by adding a function call so that you can have
    > the individual devices report their congestion state to the aggregate.


    Which requires knowing/accessing some device hierarchy, which would be
    nice for other things too (like propagating notifications of changes to
    device properties such as size, hardsect size, etc.).

    Alasdair
    --
    agk@redhat.com

  8. Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

    On Mon, 2008-11-10 at 09:32 -0500, Mikulas Patocka wrote:
    >
    > On Mon, 10 Nov 2008, Peter Zijlstra wrote:
    >
    > > On Mon, 2008-11-10 at 09:19 -0500, Mikulas Patocka wrote:
    > > > On Mon, 10 Nov 2008, Christoph Hellwig wrote:
    > > >
    > > > > On Mon, Nov 10, 2008 at 08:11:51AM -0500, Mikulas Patocka wrote:
    > > > > > For upstream Linux developers: you are holding a spinlock and calling
    > > > > > bdi*_congested functions that can take indefinite amount of time (there
    > > > > > are even users reporting having 50 disks in one logical volume or so). I
    > > > > > think it would be good to move these calls out of spinlocks.
    > > > >
    > > > > Umm, they shouldn't block that long, as that completely defeats their
    > > > > purpose. These functions are mostly used to avoid throwing more I/O at
    > > > > a congested device if pdflush could do more useful things instead. But
    > > > > if it blocks in those functions anyway we wouldn't have to bother using
    > > > > them. Do you have more details about the uses cases when this happens
    > > > > and where the routines spend so much time?
    > > >
    > > > For device mapper, congested_fn asks every device in the tree and make OR
    > > > of their bits --- so if the user has 50 devices, it asks them all.
    > > >
    > > > For md-linear, md-raid0, md-raid1, md-raid10 and md-multipath it does the
    > > > same --- asking every device.
    > > >
    > > > If you have a better idea how to implement congested_fn, say it.

    > >
    > > Fix the infrastructure by adding a function call so that you can have
    > > the individual devices report their congestion state to the aggregate.
    > >
    > > Then congestion_fn can return a valid state in O(1) because the state is
    > > keps up-to-date by the individual state changes.
    > >
    > > IOW, add a set_congested_fn() and clear_congested_fn().

    >
    > If you have a physical disk that has many LVM volumes on it, you end up in
    > a situation when disk congestion state change is reported to all the
    > volumes. So it will create O(n) problem at the other side.


    *sigh* I can almost understand why people want to use lvm to combine
    multiple disks, but why make the partition thing even worse...

  9. Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)



    On Mon, 10 Nov 2008, Alasdair G Kergon wrote:

    > On Mon, Nov 10, 2008 at 08:54:01AM -0500, Christoph Hellwig wrote:
    > > On Mon, Nov 10, 2008 at 08:11:51AM -0500, Mikulas Patocka wrote:
    > > > For upstream Linux developers: you are holding a spinlock and calling
    > > > bdi*_congested functions that can take indefinite amount of time (there
    > > > are even users reporting having 50 disks in one logical volume or so). I
    > > > think it would be good to move these calls out of spinlocks.

    > > Umm, they shouldn't block that long, as that completely defeats their
    > > purpose.

    >
    > Indeed - the blocking was a bug for which there's a patch, but that doesn't
    > deal with how the function should be operating in the first place.


    To me it looks like the bug is more severe --- holding a reference to the
    table can happen in other upcalls too, and in many other dm places.

    I'm considering whether we could call the table destructor at just one
    place (which would wait until all references are gone), instead of
    calling the destructor at each dm_table_put point (there are many of
    these points and it's hard to check all of them for correctness).

    > - If one device is found to be congested, why bother checking the remaining
    > devices?


    Yes, that could be improved. But it doesn't solve the O(n) problem.

    > - If the device is suspended, the response should be that it is
    > congested, I'd have thought.


    Yes, but these congestion upcalls are used only as an optimization, and
    the device is suspended for such a short time that it doesn't matter
    whether we optimize I/O access during that brief moment or not.

    Mikulas

    > Alasdair
    > --
    > agk@redhat.com
    >


  10. Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

    On Mon, 10 Nov 2008, Peter Zijlstra wrote:

    > > If you have a physical disk that has many LVM volumes on it, you end up in
    > > a situation when disk congestion state change is reported to all the
    > > volumes. So it will create O(n) problem at the other side.

    >
    > *sigh* I can almost understand why people want to use lvm to combine
    > multiple disks, but why make the partition thing even worse...


    To protect the services from each other --- so that a mail storm won't
    overflow users' directories, users filling up their space can't corrupt
    the system updates, logs are kept safe even if another partition
    overflows, etc.

    I just know some admins who prefer separate filesystems to quotas for
    this purpose.

    Mikulas

  11. Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

    On Mon, Nov 10, 2008 at 09:19:21AM -0500, Mikulas Patocka wrote:
    > For device mapper, congested_fn asks every device in the tree and make OR
    > of their bits --- so if the user has 50 devices, it asks them all.
    >
    > For md-linear, md-raid0, md-raid1, md-raid10 and md-multipath it does the
    > same --- asking every device.
    >
    > If you have a better idea how to implement congested_fn, say it.


    Well, even asking 50 devices shouldn't take that long normally. Do you
    take some sleeping lock that might block forever? In that case I would
    just return congested for that device as it would delay pdflush
    otherwise.


  12. Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

    On Mon, Nov 10, 2008 at 09:40:18AM -0500, Mikulas Patocka wrote:
    > > - If the device is suspended, the response should be that it is
    > > congested, I'd have thought.

    >
    > Yes, but these congestion upcalls are used only for optimization and the
    > device is suspended for so small time, that it doesn't matter if we
    > optimize io acces in small moment or not.


    We use device congested to avoid blocking pdflush on a device that can't
    currently accept I/O. For a suspended device, congested is always
    correct; pdflush will just come back a little later and in the meantime
    write out stuff that doesn't delay it.


  13. Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

    On Mon, Nov 10, 2008 at 02:27:15PM +0000, Alasdair G Kergon wrote:
    > Indeed - the blocking was a bug for which there's a patch, but that doesn't
    > deal with how the function should be operating in the first place.
    >
    > - If one device is found to be congested, why bother checking the remaining
    > devices?
    >
    > - If the device is suspended, the response should be that it is congested, I'd
    > have thought.


    Yes. Device congested is a quick and dirty check - if you encounter
    anything that takes long, just return congested if I/O would hit that
    long delay too, or not congested if you can't really know.


  14. Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

    On Tue, 11 Nov 2008, Christoph Hellwig wrote:

    > On Mon, Nov 10, 2008 at 09:19:21AM -0500, Mikulas Patocka wrote:
    > > For device mapper, congested_fn asks every device in the tree and make OR
    > > of their bits --- so if the user has 50 devices, it asks them all.
    > >
    > > For md-linear, md-raid0, md-raid1, md-raid10 and md-multipath it does the
    > > same --- asking every device.
    > >
    > > If you have a better idea how to implement congested_fn, say it.

    >
    > Well, even asking 50 devices shouldn't take that long normally.


    There are some atomic operations inside congested_fn.

    I remember there was some database benchmark on a setup with 48 devices.
    Just dropping one spinlock from the unplug function resulted in an
    overall improvement (not in microbenchmarks, but in the whole run) of
    18%. The slowdown with unplug is similar to that with congested_fn ---
    unplug walks all the physical devices for a given logical device and
    unplugs them.

    So simplifying congested_fn might help too. Do you have an idea how often
    it is called? I thought about caching the return value for one jiffy.

    > Do you take some sleeping lock that might block forever? In that case I
    > would just return congested for that device as it would delay pdflush
    > otherwise.


    Sometimes we do, and we shouldn't, because congested_fn is called under a
    spinlock. That's what I'm working on now.

    BTW, how "realtime" do you expect the Linux kernel to be? Is there some
    effort to make all spinlocks held for a finite time (so that someone
    could one day calculate the times of all spinlocks, take the maximum,
    and give Linux a "realtime" label)? If so, then the bdi_congested call
    should be moved out of spinlocks. If there are already many cases where a
    spinlock can be taken and a list of indefinite length walked inside it,
    it isn't worth optimizing for this.

    Mikulas
