[PATCH 00/13] request-based dm-multipath - Kernel

Thread: [PATCH 00/13] request-based dm-multipath

  1. [PATCH 00/13] request-based dm-multipath

    Hi Jens, James and Alasdair,

    This is a new version of request-based dm-multipath patches.
    The patches are created on top of 2.6.27-rc6 + Alasdair's dm patches
    for linux-next below:
    dm-mpath-use-more-error-codes.patch
    dm-mpath-remove-is_active-from-struct-dm_path.patch

    Major changes from the previous version (*) are:
    - Moved busy state information for device/host to
    q->backing_dev_info from q->queue_flags, since backing_dev_info
    seems to be a more appropriate location. (PATCH 03)
    Made the corresponding changes to the scsi driver. (PATCH 04)

    - Added a queue flag to indicate whether the block device is
    request stackable or not, so that request stacking drivers
    can avoid stacking a request-based device on a bio-based device.
    (PATCH 05)

    - Fixed a problem where requests were not flushed on flush suspend.
    (PATCH 10)

    - Changed the queue initialization method for bio-based dm devices
    from blk_alloc_queue() to blk_init_queue(). (PATCH 11)

    - Changed the congestion check in dm-multipath so that it does not
    invoke __choose_pgpath(). (PATCH 13)

    (*) http://lkml.org/lkml/2008/3/19/478

    Basic functional and performance testing was done with NEC iStorage
    (active-active multipath), and no problems were found.
    Please review and apply if there are no problems.


    Summary of the patch-set:
    01/13: block: add request data completion interface
    02/13: block: add request submission interface
    03/13: mm: export driver's busy state via backing_dev_info
    04/13: scsi: export busy status
    05/13: block: add a queue flag for request stacking support
    06/13: dm: remove unused code (preparation for request-based dm)
    07/13: dm: tidy local_init (preparation for request-based dm)
    08/13: dm: prepare mempools on module init for request-based dm
    09/13: dm: add target interface for request-based dm
    10/13: dm: add core functions for request-based dm
    11/13: dm: add a switch to enable request-based dm if target is ready
    12/13: dm: reject requests violating limits for request-based dm
    13/13: dm-mpath: convert to request-based from bio-based


    A summary of the design of request-based dm-multipath follows.

    BACKGROUND
    ==========
    Currently, device-mapper (dm) is implemented as a stacking block device
    at the bio level. This bio-based implementation has the following
    issue for dm-multipath.

    Because the hook for I/O mapping is above the block layer's
    __make_request(), contiguous bios can be mapped to different
    underlying devices, and these bios are then not merged into a request.
    This situation can arise with dynamic load balancing, though
    it has not been implemented yet.
    Therefore, I/O mapping after bio merging is needed for better
    dynamic load balancing.
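The merging problem described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: `struct toy_bio`, `toy_map_bio_based()` and `toy_map_request_based()` are hypothetical names, round-robin stands in for whatever path selector is used, and merging is simplified to "neighbouring bios merge only if contiguous and on the same path".

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NPATHS 2

/* Hypothetical stand-in for a bio: a contiguous run of sectors. */
struct toy_bio {
    unsigned long sector;     /* first sector */
    unsigned int nr_sectors;  /* length in sectors */
    int path;                 /* underlying path chosen by the mapper */
};

/* Two neighbouring bios can merge only if they are contiguous on disk. */
static int toy_contiguous(const struct toy_bio *prev,
                          const struct toy_bio *next)
{
    return prev->sector + prev->nr_sectors == next->sector;
}

/*
 * bio-based model: a path is chosen per bio (round-robin) before merging,
 * so neighbours that land on different paths can no longer merge.
 * Returns the number of requests the lower devices end up with.
 */
static size_t toy_map_bio_based(struct toy_bio *bios, size_t n)
{
    size_t i, nreq = 0;

    assert(bios != NULL);
    for (i = 0; i < n; i++)
        bios[i].path = (int)(i % TOY_NPATHS);

    for (i = 0; i < n; i++) {
        if (i == 0 || bios[i].path != bios[i - 1].path ||
            !toy_contiguous(&bios[i - 1], &bios[i]))
            nreq++;
    }
    return nreq;
}

/*
 * request-based model: contiguous bios are merged first, then a path is
 * chosen once per merged request.
 */
static size_t toy_map_request_based(struct toy_bio *bios, size_t n)
{
    size_t i, nreq = 0;
    int path = 0;

    assert(bios != NULL);
    for (i = 0; i < n; i++) {
        if (i == 0 || !toy_contiguous(&bios[i - 1], &bios[i])) {
            path = (int)(nreq % TOY_NPATHS);
            nreq++;
        }
        bios[i].path = path;
    }
    return nreq;
}
```

In this model, four contiguous 8-sector bios become four separate requests when the path is picked per bio, but a single request when the path is picked after merging.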

    The basic idea to resolve the issue is to move the multipathing layer
    down below the I/O scheduler. This was proposed by Mike Christie
    as block layer (request-based) multipath:
    http://marc.info/?l=linux-scsi&m=115520444515914&w=2

    Mike's patch added a new block layer device for multipath and didn't
    have a dm interface, so I modified his patch to be usable from dm.
    The result is request-based dm-multipath.


    DESIGN OVERVIEW
    ===============
    While dm and md currently stack block devices at the bio level,
    request-based dm stacks at the request level and submits/completes
    struct request instead of struct bio.


    Overview of the request-based dm patch:
    - Mapping is done in units of struct request, instead of struct bio.
    - The hook for I/O mapping is at q->request_fn(), after merging and
    sorting by the I/O scheduler, instead of q->make_request_fn().
    - The hooks for I/O completion are at bio->bi_end_io() and rq->end_io(),
    instead of only bio->bi_end_io().

                  bio-based (current)     request-based (this patch)
    ------------------------------------------------------------------
    submission    q->make_request_fn()    q->request_fn()
    completion    bio->bi_end_io()        bio->bi_end_io(), rq->end_io()

    - Whether the dm device is bio-based or request-based is determined
    at table loading time.
    - The user interface is kept the same (table/message/status/ioctl).
    - Any bio-based device (like dm/md) can be stacked on a request-based
    dm device.
    A request-based dm device *cannot* be stacked on any bio-based device.


    Expected benefit:
    - better load balancing


    Additional explanations:

    Why does request-based dm use bio->bi_end_io(), too?
    Because:
    - dm needs to keep not only the request but also the bios of the request
    if dm target drivers want to retry or do something else with the request.
    For example, dm-multipath has to check errors and retry on other
    paths if necessary before returning the I/O result to the upper layer.

    - But rq->end_io() is called at a very late stage of completion
    handling, where all bios in the request have been completed and
    the I/O results are already visible to the upper layer.
    So request-based dm hooks bio->bi_end_io(), doesn't complete the bio
    in error cases, and hands the error handling over to the rq->end_io() hook.


    Thanks,
    Kiyoshi Ueda
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. [PATCH 12/13] dm: reject I/O violating new queue limits

    This patch detects requests violating the queue limitations
    and rejects them.

    The same limitation checks are done when requests are submitted
    to the queue by blk_submit_request().
    However, a violation can still happen if a table is swapped and
    the queue limitations are shrunk while some requests are
    in the queue.

    Since the block layer and device drivers trust struct request
    to be valid, dispatching such requests is dangerous
    (e.g. it may easily cause a kernel panic).
    So avoid dispatching such problematic requests in request-based dm.


    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: Alasdair G Kergon
    ---
    drivers/md/dm.c | 24 ++++++++++++++++++++++++
    1 files changed, 24 insertions(+)

    Index: 2.6.27-rc6/drivers/md/dm.c
    ===================================================================
    --- 2.6.27-rc6.orig/drivers/md/dm.c
    +++ 2.6.27-rc6/drivers/md/dm.c
    @@ -1469,6 +1469,30 @@ static void map_request(struct dm_target

    tio->ti = ti;
    atomic_inc(&md->pending);
    +
    + /*
    + * Although submitted requests to the md->queue are checked against
    + * the table/queue limitations at the submission time, the limitations
    + * may be changed by a table swapping while those already checked
    + * requests are in the md->queue.
    + * If the limitations have been shrunk in such situations, we may be
    + * dispatching requests violating the current limitations here.
    + * Since struct request is a reliable one in the block-layer
    + * and device drivers, dispatching such requests is dangerous.
    + * (e.g. it may cause kernel panic easily.)
    + * Avoid to dispatch such problematic requests in request-based dm.
    + *
    + * Since dm_kill_request() decrements the md->pending, this have to
    + * be done after incrementing the md->pending.
    + */
    + r = blk_rq_check_limits(rq->q, rq);
    + if (unlikely(r)) {
    + DMWARN("violating the queue limitation. the limitation may be"
    + " shrunk while there are some requests in the queue.");
    + dm_kill_request(clone, r);
    + return;
    + }
    +
    r = ti->type->map_rq(ti, clone, &tio->info);
    switch (r) {
    case DM_MAPIO_SUBMITTED:

  3. [PATCH 09/13] dm: add target interfaces for request-based dm

    This patch adds the following target interfaces for request-based dm.

    map_rq : for mapping a request

    rq_end_io : for finishing a request

    busy : for avoiding performance regression from bio-based dm.
    The target can tell dm core not to map requests now, which
    may help requests in the block layer queue grow
    bigger through I/O merging.
    In bio-based dm, this is done by the device
    drivers which manage the block layer queue.
    But in request-based dm, dm core has to do it,
    since dm core manages the block layer queue, and
    it needs the target driver's help to do so.
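How a core dispatch loop might consult such a busy hook can be sketched in userspace as below. This is an illustrative model, not dm code: `struct toy_target`, `toy_busy()` and `toy_dispatch()` are hypothetical, and "busy" is assumed to mean "in-flight count has reached a fixed maximum".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical target with an optional busy hook. */
struct toy_target {
    int (*busy)(struct toy_target *ti);  /* 0: can take I/O, 1: cannot */
    int inflight;
    int max_inflight;
};

/* Example busy hook: busy once the in-flight limit is reached. */
static int toy_busy(struct toy_target *ti)
{
    return ti->inflight >= ti->max_inflight;
}

/*
 * Dispatch up to nr_queued requests, stopping as soon as the target
 * reports busy. Requests that are not dispatched stay in the queue,
 * where they can still be merged. Returns the number dispatched.
 */
static int toy_dispatch(struct toy_target *ti, int nr_queued)
{
    int sent = 0;

    assert(ti != NULL);
    while (sent < nr_queued) {
        if (ti->busy && ti->busy(ti))
            break;
        ti->inflight++;
        sent++;
    }
    return sent;
}
```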


    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: Alasdair G Kergon
    ---
    include/linux/device-mapper.h | 15 +++++++++++++++
    1 files changed, 15 insertions(+)

    Index: 2.6.27-rc6/include/linux/device-mapper.h
    ===================================================================
    --- 2.6.27-rc6.orig/include/linux/device-mapper.h
    +++ 2.6.27-rc6/include/linux/device-mapper.h
    @@ -46,6 +46,8 @@ typedef void (*dm_dtr_fn) (struct dm_tar
    */
    typedef int (*dm_map_fn) (struct dm_target *ti, struct bio *bio,
    union map_info *map_context);
    +typedef int (*dm_map_request_fn) (struct dm_target *ti, struct request *clone,
    + union map_info *map_context);

    /*
    * Returns:
    @@ -58,6 +60,9 @@ typedef int (*dm_map_fn) (struct dm_targ
    typedef int (*dm_endio_fn) (struct dm_target *ti,
    struct bio *bio, int error,
    union map_info *map_context);
    +typedef int (*dm_request_endio_fn) (struct dm_target *ti,
    + struct request *clone, int error,
    + union map_info *map_context);

    typedef void (*dm_flush_fn) (struct dm_target *ti);
    typedef void (*dm_presuspend_fn) (struct dm_target *ti);
    @@ -77,6 +82,13 @@ typedef int (*dm_ioctl_fn) (struct dm_ta
    typedef int (*dm_merge_fn) (struct dm_target *ti, struct bvec_merge_data *bvm,
    struct bio_vec *biovec, int max_size);

    +/*
    + * Returns:
    + * 0: The target can handle the next I/O immediately.
    + * 1: The target can't handle the next I/O immediately.
    + */
    +typedef int (*dm_busy_fn) (struct dm_target *ti);
    +
    void dm_error(const char *message);

    /*
    @@ -103,7 +115,9 @@ struct target_type {
    dm_ctr_fn ctr;
    dm_dtr_fn dtr;
    dm_map_fn map;
    + dm_map_request_fn map_rq;
    dm_endio_fn end_io;
    + dm_request_endio_fn rq_end_io;
    dm_flush_fn flush;
    dm_presuspend_fn presuspend;
    dm_postsuspend_fn postsuspend;
    @@ -113,6 +127,7 @@ struct target_type {
    dm_message_fn message;
    dm_ioctl_fn ioctl;
    dm_merge_fn merge;
    + dm_busy_fn busy;
    };

    struct io_restrictions {

  4. [PATCH 06/13] dm: remove unused DM_WQ_FLUSH_ALL

    This patch removes dead code for the noflush suspend.
    No functional change.

    This patch is just a cleanup of the code and is not functionally
    related to request-based dm. It is included here because later
    patches depend on it textually.

    The dm_queue_flush(md, DM_WQ_FLUSH_ALL, NULL) in dm_suspend()
    is never invoked because:
    - The 'goto flush_and_out' is the same as 'goto out', because
    the 'goto flush_and_out' is taken only when '!noflush'
    - If 'r && noflush' is true, the interrupt check code above
    is invoked and does 'goto out'

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Milan Broz
    Cc: Alasdair G Kergon
    ---
    drivers/md/dm.c | 14 +-------------
    1 files changed, 1 insertion(+), 13 deletions(-)

    Index: 2.6.27-rc6/drivers/md/dm.c
    ===================================================================
    --- 2.6.27-rc6.orig/drivers/md/dm.c
    +++ 2.6.27-rc6/drivers/md/dm.c
    @@ -76,7 +76,6 @@ union map_info *dm_get_mapinfo(struct bi
    */
    struct dm_wq_req {
    enum {
    - DM_WQ_FLUSH_ALL,
    DM_WQ_FLUSH_DEFERRED,
    } type;
    struct work_struct work;
    @@ -1384,9 +1383,6 @@ static void dm_wq_work(struct work_struc

    down_write(&md->io_lock);
    switch (req->type) {
    - case DM_WQ_FLUSH_ALL:
    - __merge_pushback_list(md);
    - /* pass through */
    case DM_WQ_FLUSH_DEFERRED:
    __flush_deferred_io(md);
    break;
    @@ -1516,7 +1512,7 @@ int dm_suspend(struct mapped_device *md,
    if (!md->suspended_bdev) {
    DMWARN("bdget failed in dm_suspend");
    r = -ENOMEM;
    - goto flush_and_out;
    + goto out;
    }

    /*
    @@ -1567,14 +1563,6 @@ int dm_suspend(struct mapped_device *md,

    set_bit(DMF_SUSPENDED, &md->flags);

    -flush_and_out:
    - if (r && noflush)
    - /*
    - * Because there may be already I/Os in the pushback list,
    - * flush them before return.
    - */
    - dm_queue_flush(md, DM_WQ_FLUSH_ALL, NULL);
    -
    out:
    if (r && md->suspended_bdev) {
    bdput(md->suspended_bdev);

  5. [PATCH 08/13] dm: add kmem_cache for request-based dm

    This patch prepares the kmem_caches needed by request-based dm.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: Alasdair G Kergon
    ---
    drivers/md/dm.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
    1 files changed, 44 insertions(+), 1 deletion(-)

    Index: 2.6.27-rc6/drivers/md/dm.c
    ===================================================================
    --- 2.6.27-rc6.orig/drivers/md/dm.c
    +++ 2.6.27-rc6/drivers/md/dm.c
    @@ -32,6 +32,7 @@ static unsigned int _major = 0;

    static DEFINE_SPINLOCK(_minor_lock);
    /*
    + * For bio based dm.
    * One of these is allocated per bio.
    */
    struct dm_io {
    @@ -43,6 +44,7 @@ struct dm_io {
    };

    /*
    + * For bio based dm.
    * One of these is allocated per target within a bio. Hopefully
    * this will be simplified out one day.
    */
    @@ -52,6 +54,31 @@ struct dm_target_io {
    union map_info info;
    };

    +/*
    + * For request based dm.
    + * One of these is allocated per request.
    + *
    + * Since assuming "original request : cloned request = 1 : 1" and
    + * a counter for number of clones like struct dm_io.io_count isn't needed,
    + * struct dm_io and struct target_io can be merged.
    + */
    +struct dm_rq_target_io {
    + struct mapped_device *md;
    + struct dm_target *ti;
    + struct request *orig, clone;
    + int error;
    + union map_info info;
    +};
    +
    +/*
    + * For request based dm.
    + * One of these is allocated per bio.
    + */
    +struct dm_clone_bio_info {
    + struct bio *orig;
    + struct request *rq;
    +};
    +
    union map_info *dm_get_mapinfo(struct bio *bio)
    {
    if (bio && bio->bi_private)
    @@ -147,6 +174,8 @@ struct mapped_device {
    #define MIN_IOS 256
    static struct kmem_cache *_io_cache;
    static struct kmem_cache *_tio_cache;
    +static struct kmem_cache *_rq_tio_cache;
    +static struct kmem_cache *_bio_info_cache;

    static int __init local_init(void)
    {
    @@ -162,9 +191,17 @@ static int __init local_init(void)
    if (!_tio_cache)
    goto out_free_io_cache;

    + _rq_tio_cache = KMEM_CACHE(dm_rq_target_io, 0);
    + if (!_rq_tio_cache)
    + goto out_free_tio_cache;
    +
    + _bio_info_cache = KMEM_CACHE(dm_clone_bio_info, 0);
    + if (!_bio_info_cache)
    + goto out_free_rq_tio_cache;
    +
    r = dm_uevent_init();
    if (r)
    - goto out_free_tio_cache;
    + goto out_free_bio_info_cache;

    _major = major;
    r = register_blkdev(_major, _name);
    @@ -178,6 +215,10 @@ static int __init local_init(void)

    out_uevent_exit:
    dm_uevent_exit();
    +out_free_bio_info_cache:
    + kmem_cache_destroy(_bio_info_cache);
    +out_free_rq_tio_cache:
    + kmem_cache_destroy(_rq_tio_cache);
    out_free_tio_cache:
    kmem_cache_destroy(_tio_cache);
    out_free_io_cache:
    @@ -188,6 +229,8 @@ out_free_io_cache:

    static void local_exit(void)
    {
    + kmem_cache_destroy(_bio_info_cache);
    + kmem_cache_destroy(_rq_tio_cache);
    kmem_cache_destroy(_tio_cache);
    kmem_cache_destroy(_io_cache);
    unregister_blkdev(_major, _name);

  6. [PATCH 01/13] block: add request update interface

    This patch adds blk_update_request(), which updates struct request
    by completing its data part, but doesn't complete the struct
    request itself.
    Though it looks like end_that_request_first() of older kernels,
    blk_update_request() should be used only by request stacking drivers.

    Request-based dm will use it in the bio->bi_end_io callback to update
    the original request when a data part of a cloned request completes.
    The following is additional background on why request-based
    dm needs this interface.

    - Request stacking drivers can't use blk_end_request() directly from
    the lower driver's completion context (bio->bi_end_io or rq->end_io),
    because some device drivers (e.g. ide) may try to complete
    their request with the queue lock held, and that may cause a deadlock.
    See below for detailed description of possible deadlock:


    - To solve that, request-based dm offloads the completion of the
    cloned struct request to softirq context (i.e. using
    blk_complete_request() from rq->end_io). See PATCH 10.

    - Though it is possible to use the same solution from bio->bi_end_io,
    it would delay the notification of bio completion to the original
    submitter. It would also make partial completion inefficient,
    because the lower driver can't continue processing the cloned request
    and request-based dm needs to requeue and redispatch it to
    the lower driver again later.

    - So request-based dm needs blk_update_request() to perform the bio
    completion in the lower driver's completion context, which is more
    efficient.


    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: Jens Axboe
    ---
    block/blk-core.c | 59 +++++++++++++++++++++++++++++++++++++++++--------
    include/linux/blkdev.h | 2 +
    2 files changed, 52 insertions(+), 9 deletions(-)

    Index: 2.6.27-rc6/block/blk-core.c
    ===================================================================
    --- 2.6.27-rc6.orig/block/blk-core.c
    +++ 2.6.27-rc6/block/blk-core.c
    @@ -1862,6 +1862,22 @@ void end_request(struct request *req, in
    }
    EXPORT_SYMBOL(end_request);

    +static int end_that_request_data(struct request *rq, int error,
    + unsigned int nr_bytes, unsigned int bidi_bytes)
    +{
    + if (blk_fs_request(rq) || blk_pc_request(rq)) {
    + if (__end_that_request_first(rq, error, nr_bytes))
    + return 1;
    +
    + /* Bidi request must be completed as a whole */
    + if (blk_bidi_rq(rq) &&
    + __end_that_request_first(rq->next_rq, error, bidi_bytes))
    + return 1;
    + }
    +
    + return 0;
    +}
    +
    /**
    * blk_end_io - Generic end_io function to complete a request.
    * @rq: the request being processed
    @@ -1888,15 +1904,8 @@ static int blk_end_io(struct request *rq
    struct request_queue *q = rq->q;
    unsigned long flags = 0UL;

    - if (blk_fs_request(rq) || blk_pc_request(rq)) {
    - if (__end_that_request_first(rq, error, nr_bytes))
    - return 1;
    -
    - /* Bidi request must be completed as a whole */
    - if (blk_bidi_rq(rq) &&
    - __end_that_request_first(rq->next_rq, error, bidi_bytes))
    - return 1;
    - }
    + if (end_that_request_data(rq, error, nr_bytes, bidi_bytes))
    + return 1;

    /* Special feature for tricky drivers */
    if (drv_callback && drv_callback(rq))
    @@ -1981,6 +1990,38 @@ int blk_end_bidi_request(struct request
    EXPORT_SYMBOL_GPL(blk_end_bidi_request);

    /**
    + * blk_update_request - Special helper function for request stacking drivers
    + * @rq: the request being processed
    + * @error: 0 for success, < 0 for error
    + * @nr_bytes: number of bytes to complete @rq
    + *
    + * Description:
    + * Ends I/O on a number of bytes attached to @rq, but doesn't complete
    + * the request structure even if @rq doesn't have leftover.
    + * If @rq has leftover, sets it up for the next range of segments.
    + *
    + * This special helper function is only for request stacking drivers
    + * (e.g. request-based dm) so that they can handle partial completion.
    + * Actual device drivers should use blk_end_request instead.
    + */
    +void blk_update_request(struct request *rq, int error, unsigned int nr_bytes)
    +{
    + if (!end_that_request_data(rq, error, nr_bytes, 0)) {
    + /*
    + * All bios in the request have been completed.
    + * Then, members of the request are not updated.
    + * Update those members to avoid double charge of diskstat
    + * when the stacking driver calls blk_end_request()
    + * to complete the request actually.
    + */
    + rq->nr_sectors = rq->hard_nr_sectors = 0;
    + rq->current_nr_sectors = rq->hard_cur_sectors = 0;
    + rq->nr_phys_segments = rq->nr_hw_segments = 0;
    + }
    +}
    +EXPORT_SYMBOL_GPL(blk_update_request);
    +
    +/**
    * blk_end_request_callback - Special helper function for tricky drivers
    * @rq: the request being processed
    * @error: 0 for success, < 0 for error
    Index: 2.6.27-rc6/include/linux/blkdev.h
    ===================================================================
    --- 2.6.27-rc6.orig/include/linux/blkdev.h
    +++ 2.6.27-rc6/include/linux/blkdev.h
    @@ -756,6 +756,8 @@ extern int blk_end_request_callback(stru
    unsigned int nr_bytes,
    int (drv_callback)(struct request *));
    extern void blk_complete_request(struct request *);
    +extern void blk_update_request(struct request *rq, int error,
    + unsigned int nr_bytes);

    /*
    * blk_end_request() takes bytes instead of sectors as a complete size.

  7. [PATCH 03/13] mm: lld busy status exporting interface

    This patch adds an interface to check the lld's busy status
    from the block layer. (scsi patch is also included.)
    This resolves the following performance problem on request stacking
    devices.


    Some drivers, like the scsi mid layer, stop dispatching requests when
    they detect a busy state on their low-level device (host/bus/device).
    This allows other requests to stay in the I/O scheduler's queue
    for a chance of merging.

    Request stacking drivers like request-based dm should follow
    the same logic.
    However, there is no generic interface for the stacked device
    to check if the underlying device(s) are busy.
    If the request stacking driver dispatches and submits requests to
    a busy underlying device, the requests will stay in
    the underlying device's queue without a chance of merging.
    This causes a performance problem under bursty I/O load.

    With this patch, the busy state of the underlying device is exported
    via the state flag of the queue's backing_dev_info, so the request
    stacking driver can check it and stop dispatching requests if busy.

    The underlying device driver must set/clear the flag appropriately:
    ON: when the device driver can't process requests immediately.
    OFF: when the device driver can process requests immediately,
    including abnormal situations where the device driver needs
    to kill all requests.
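The ON/OFF protocol above amounts to one shared state bit that the lower driver writes and the stacking driver reads. A userspace sketch using C11 atomics in place of the kernel's set_bit()/clear_bit() is shown below; the `toy_` names and the bit position are assumptions for illustration only.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical bit position for the "lower device is busy" state. */
enum { TOY_BDI_LLD_CONGESTED = 3 };

struct toy_bdi {
    atomic_ulong state;  /* bit field, shared between drivers */
};

/* Lower driver: called when it cannot process requests immediately. */
static void toy_set_lld_congested(struct toy_bdi *bdi)
{
    atomic_fetch_or(&bdi->state, 1UL << TOY_BDI_LLD_CONGESTED);
}

/* Lower driver: called when it can process requests again. */
static void toy_clear_lld_congested(struct toy_bdi *bdi)
{
    atomic_fetch_and(&bdi->state, ~(1UL << TOY_BDI_LLD_CONGESTED));
}

/* Stacking driver: check before dispatching to the lower device. */
static int toy_lld_congested(struct toy_bdi *bdi)
{
    return (int)((atomic_load(&bdi->state) >> TOY_BDI_LLD_CONGESTED) & 1UL);
}
```

A stacking driver in this model would simply skip dispatching while toy_lld_congested() returns 1, leaving requests in its own queue for merging.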


    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: James Bottomley
    Cc: Andrew Morton
    ---
    include/linux/backing-dev.h | 8 ++++++++
    mm/backing-dev.c | 13 +++++++++++++
    2 files changed, 21 insertions(+)

    Index: 2.6.27-rc6/include/linux/backing-dev.h
    ===================================================================
    --- 2.6.27-rc6.orig/include/linux/backing-dev.h
    +++ 2.6.27-rc6/include/linux/backing-dev.h
    @@ -26,6 +26,7 @@ enum bdi_state {
    BDI_pdflush, /* A pdflush thread is working this device */
    BDI_write_congested, /* The write queue is getting full */
    BDI_read_congested, /* The read queue is getting full */
    + BDI_lld_congested, /* The device/host is busy */
    BDI_unused, /* Available bits start here */
    };

    @@ -226,8 +227,15 @@ static inline int bdi_rw_congested(struc
    (1 << BDI_write_congested));
    }

    +static inline int bdi_lld_congested(struct backing_dev_info *bdi)
    +{
    + return bdi_congested(bdi, 1 << BDI_lld_congested);
    +}
    +
    void clear_bdi_congested(struct backing_dev_info *bdi, int rw);
    void set_bdi_congested(struct backing_dev_info *bdi, int rw);
    +void clear_bdi_lld_congested(struct backing_dev_info *bdi);
    +void set_bdi_lld_congested(struct backing_dev_info *bdi);
    long congestion_wait(int rw, long timeout);


    Index: 2.6.27-rc6/mm/backing-dev.c
    ===================================================================
    --- 2.6.27-rc6.orig/mm/backing-dev.c
    +++ 2.6.27-rc6/mm/backing-dev.c
    @@ -279,6 +279,19 @@ void set_bdi_congested(struct backing_de
    }
    EXPORT_SYMBOL(set_bdi_congested);

    +void clear_bdi_lld_congested(struct backing_dev_info *bdi)
    +{
    + clear_bit(BDI_lld_congested, &bdi->state);
    + smp_mb__after_clear_bit();
    +}
    +EXPORT_SYMBOL_GPL(clear_bdi_lld_congested);
    +
    +void set_bdi_lld_congested(struct backing_dev_info *bdi)
    +{
    + set_bit(BDI_lld_congested, &bdi->state);
    +}
    +EXPORT_SYMBOL_GPL(set_bdi_lld_congested);
    +
    /**
    * congestion_wait - wait for a backing_dev to become uncongested
    * @rw: READ or WRITE

  8. [PATCH 07/13] dm: tidy local_init

    This patch tidies local_init() in preparation for another patch
    (PATCH 08), which creates some kmem_caches for request-based dm.
    No functional change.

    This patch is just a cleanup of the code and is not functionally
    related to request-based dm. It is included here because later
    patches depend on it textually.
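The tidy-up replaces repeated inline cleanup with the usual goto-unwind pattern, which makes adding further resources (as PATCH 08 does) a matter of one more label. A userspace sketch of that pattern follows; malloc()/free() stand in for kmem_cache creation and destruction, and `toy_fail_step` is a hypothetical knob used only to force failures.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

static void *toy_cache_a, *toy_cache_b;
static int toy_fail_step;  /* 0 = no failure; otherwise the step to fail */

/* Stand-in for cache creation; fails at the chosen step. */
static void *toy_cache_create(int step)
{
    return step == toy_fail_step ? NULL : malloc(16);
}

/* Stand-in for a registration call that can fail with its own error. */
static int toy_register(int step)
{
    return step == toy_fail_step ? -EBUSY : 0;
}

static int toy_init(void)
{
    int r = -ENOMEM;

    toy_cache_a = toy_cache_create(1);
    if (!toy_cache_a)
        return r;

    toy_cache_b = toy_cache_create(2);
    if (!toy_cache_b)
        goto out_free_a;

    r = toy_register(3);
    if (r)
        goto out_free_b;

    return 0;

    /* Labels unwind in reverse order of acquisition. */
out_free_b:
    free(toy_cache_b);
    toy_cache_b = NULL;
out_free_a:
    free(toy_cache_a);
    toy_cache_a = NULL;

    return r;
}

static void toy_exit(void)
{
    free(toy_cache_b);
    free(toy_cache_a);
    toy_cache_a = toy_cache_b = NULL;
}
```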

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: Alasdair G Kergon
    ---
    drivers/md/dm.c | 34 +++++++++++++++++-----------------
    1 files changed, 17 insertions(+), 17 deletions(-)

    Index: 2.6.27-rc6/drivers/md/dm.c
    ===================================================================
    --- 2.6.27-rc6.orig/drivers/md/dm.c
    +++ 2.6.27-rc6/drivers/md/dm.c
    @@ -150,40 +150,40 @@ static struct kmem_cache *_tio_cache;

    static int __init local_init(void)
    {
    - int r;
    + int r = -ENOMEM;

    /* allocate a slab for the dm_ios */
    _io_cache = KMEM_CACHE(dm_io, 0);
    if (!_io_cache)
    - return -ENOMEM;
    + return r;

    /* allocate a slab for the target ios */
    _tio_cache = KMEM_CACHE(dm_target_io, 0);
    - if (!_tio_cache) {
    - kmem_cache_destroy(_io_cache);
    - return -ENOMEM;
    - }
    + if (!_tio_cache)
    + goto out_free_io_cache;

    r = dm_uevent_init();
    - if (r) {
    - kmem_cache_destroy(_tio_cache);
    - kmem_cache_destroy(_io_cache);
    - return r;
    - }
    + if (r)
    + goto out_free_tio_cache;

    _major = major;
    r = register_blkdev(_major, _name);
    - if (r < 0) {
    - kmem_cache_destroy(_tio_cache);
    - kmem_cache_destroy(_io_cache);
    - dm_uevent_exit();
    - return r;
    - }
    + if (r < 0)
    + goto out_uevent_exit;

    if (!_major)
    _major = r;

    return 0;
    +
    +out_uevent_exit:
    + dm_uevent_exit();
    +out_free_tio_cache:
    + kmem_cache_destroy(_tio_cache);
    +out_free_io_cache:
    + kmem_cache_destroy(_io_cache);
    +
    + return r;
    }

    static void local_exit(void)

  9. [PATCH 04/13] scsi: exports busy status

    This patch changes the scsi mid-layer to export its busy status.
    The busy flag is not set when scsi can't dispatch I/Os anymore and
    needs to kill them. Otherwise, request stacking drivers may hold
    requests forever.


    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: James Bottomley
    ---
    drivers/scsi/scsi.c | 4 ++--
    drivers/scsi/scsi_lib.c | 23 ++++++++++++++++++++++-
    2 files changed, 24 insertions(+), 3 deletions(-)

    Index: 2.6.27-rc6/drivers/scsi/scsi_lib.c
    ================================================== =================
    --- 2.6.27-rc6.orig/drivers/scsi/scsi_lib.c
    +++ 2.6.27-rc6/drivers/scsi/scsi_lib.c
    @@ -459,17 +459,30 @@ static void scsi_init_cmd_errh(struct sc

    void scsi_device_unbusy(struct scsi_device *sdev)
    {
    + int host_congested;
    struct Scsi_Host *shost = sdev->host;
    unsigned long flags;

    spin_lock_irqsave(shost->host_lock, flags);
    shost->host_busy--;
    + if ((shost->can_queue > 0 && shost->host_busy >= shost->can_queue) ||
    + shost->host_blocked || shost->host_self_blocked ||
    + scsi_host_in_recovery(shost))
    + host_congested = 1;
    + else
    + host_congested = 0;
    +
    if (unlikely(scsi_host_in_recovery(shost) &&
    (shost->host_failed || shost->host_eh_scheduled)))
    scsi_eh_wakeup(shost);
    spin_unlock(shost->host_lock);
    +
    spin_lock(sdev->request_queue->queue_lock);
    sdev->device_busy--;
    + if (bdi_lld_congested(&sdev->request_queue->backing_dev_info) &&
    + !host_congested && !(sdev->device_busy >= sdev->queue_depth) &&
    + !sdev->device_blocked)
    + clear_bdi_lld_congested(&sdev->request_queue->backing_dev_info);
    spin_unlock_irqrestore(sdev->request_queue->queue_lock, flags);
    }

    @@ -1495,9 +1508,14 @@ static void scsi_request_fn(struct reque
    * accept it.
    */
    req = elv_next_request(q);
    - if (!req || !scsi_dev_queue_ready(q, sdev))
    + if (!req)
    break;

    + if (!scsi_dev_queue_ready(q, sdev)) {
    + set_bdi_lld_congested(&q->backing_dev_info);
    + break;
    + }
    +
    if (unlikely(!scsi_device_online(sdev))) {
    sdev_printk(KERN_ERR, sdev,
    "rejecting I/O to offline device\n");
    @@ -1568,6 +1586,8 @@ static void scsi_request_fn(struct reque
    rtn = scsi_dispatch_cmd(cmd);
    spin_lock_irq(q->queue_lock);
    if(rtn) {
    + set_bdi_lld_congested(&q->backing_dev_info);
    +
    /* we're refusing the command; because of
    * the way locks get dropped, we need to
    * check here if plugging is required */
    @@ -1592,6 +1612,7 @@ static void scsi_request_fn(struct reque
    * later time.
    */
    spin_lock_irq(q->queue_lock);
    + set_bdi_lld_congested(&q->backing_dev_info);
    blk_requeue_request(q, req);
    sdev->device_busy--;
    if(sdev->device_busy == 0)
    Index: 2.6.27-rc6/drivers/scsi/scsi.c
    ===================================================================
    --- 2.6.27-rc6.orig/drivers/scsi/scsi.c
    +++ 2.6.27-rc6/drivers/scsi/scsi.c
    @@ -861,8 +861,6 @@ void scsi_finish_command(struct scsi_cmn
    struct scsi_driver *drv;
    unsigned int good_bytes;

    - scsi_device_unbusy(sdev);
    -
    /*
    * Clear the flags which say that the device/host is no longer
    * capable of accepting new commands. These are set in scsi_queue.c
    @@ -874,6 +872,8 @@ void scsi_finish_command(struct scsi_cmn
    shost->host_blocked = 0;
    sdev->device_blocked = 0;

    + scsi_device_unbusy(sdev);
    +
    /*
    * If we have valid sense information, then some kind of recovery
    * must have taken place. Make a note of this.
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  10. [PATCH 11/13] dm: enable request-based dm

    This patch enables request-based dm.

    o Request-based dm and bio-based dm coexist, since some target
    drivers are better suited to bio-based dm.
    There are also other bio-based devices in the kernel
    (e.g. md, loop).
    Since a bio-based device can't receive struct request,
    there are some limitations on device stacking between
    bio-based and request-based devices.

                           type of underlying device
                           bio-based    request-based
    ----------------------------------------------
    bio-based          OK           OK
    request-based      NG           OK

    The device type is recognized by the queue flag in the kernel,
    so dm follows that.

    o The type of a dm device is decided at the first table loading time.
    Until then, mempool creations are deferred, since mempools for
    request-based dm are different from those for bio-based dm.
    Once the type of a dm device is decided, the type can't be changed.

    o Currently, request-based dm supports only tables that have a single
    target. To support multiple targets, we need to either support
    request splitting or prevent a bio/request from spanning multiple
    targets. The former needs lots of changes in the block layer, and
    the latter requires all target drivers to support the merge() function.
    Both will take time.
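
As a rough illustration of the rules above, here is a userland sketch of the
table type decision. It is not kernel code: the `model_*` names and the
`MODEL_TYPE_*` values are hypothetical stand-ins, and where the patch uses
BUG_ON() for an empty table the sketch simply rejects it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userland model of the type decision; not the kernel API. */
enum model_type {
	MODEL_TYPE_NONE,		/* table rejected / type undecided */
	MODEL_TYPE_BIO_BASED,
	MODEL_TYPE_REQUEST_BASED
};

struct model_target {
	int has_map_rq;			/* non-zero if the target provides map_rq() */
};

struct model_table {
	const struct model_target *targets;
	size_t num_targets;
	int underlying_stackable;	/* all underlying queues are request stackable */
};

/* Returns the decided type, or MODEL_TYPE_NONE if the table is rejected. */
static enum model_type model_table_type(const struct model_table *t)
{
	int bio_based = 0, request_based = 0;
	size_t i;

	assert(t != NULL);

	for (i = 0; i < t->num_targets; i++) {
		if (t->targets[i].has_map_rq)
			request_based = 1;
		else
			bio_based = 1;

		/* Bio-based and request-based targets can't be mixed. */
		if (bio_based && request_based)
			return MODEL_TYPE_NONE;
	}

	if (bio_based)
		return MODEL_TYPE_BIO_BASED;

	/* Empty table (the patch treats this case as a bug). */
	if (!request_based)
		return MODEL_TYPE_NONE;

	/* Request-based dm can't sit on non-request-stackable devices. */
	if (!t->underlying_stackable)
		return MODEL_TYPE_NONE;

	/* Only single-target tables are supported for request-based dm. */
	if (t->num_targets > 1)
		return MODEL_TYPE_NONE;

	return MODEL_TYPE_REQUEST_BASED;
}
```

In this model a bio-based table is accepted on either kind of underlying
device, which matches the "OK / OK" row of the stacking table.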


    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: Alasdair G Kergon
    ---
    drivers/md/dm-ioctl.c | 13 ++++
    drivers/md/dm-table.c | 68 +++++++++++++++++++++++
    drivers/md/dm.c | 122 ++++++++++++++++++++++++++++++++++--------
    drivers/md/dm.h | 15 +++++
    include/linux/device-mapper.h | 1
    5 files changed, 196 insertions(+), 23 deletions(-)

    Index: 2.6.27-rc6/drivers/md/dm-table.c
    ===================================================================
    --- 2.6.27-rc6.orig/drivers/md/dm-table.c
    +++ 2.6.27-rc6/drivers/md/dm-table.c
    @@ -108,6 +108,8 @@ static void combine_restrictions_low(str
    lhs->bounce_pfn = min_not_zero(lhs->bounce_pfn, rhs->bounce_pfn);

    lhs->no_cluster |= rhs->no_cluster;
    +
    + lhs->no_request_stacking |= rhs->no_request_stacking;
    }

    /*
    @@ -522,6 +524,8 @@ void dm_set_device_limits(struct dm_targ
    rs->bounce_pfn = min_not_zero(rs->bounce_pfn, q->bounce_pfn);

    rs->no_cluster |= !test_bit(QUEUE_FLAG_CLUSTER, &q->queue_flags);
    +
    + rs->no_request_stacking |= !blk_queue_stackable(q);
    }
    EXPORT_SYMBOL_GPL(dm_set_device_limits);

    @@ -731,6 +735,66 @@ int dm_table_add_target(struct dm_table
    return r;
    }

    +int dm_table_set_type(struct dm_table *t)
    +{
    + int i;
    + int bio_based = 0, request_based = 0;
    + struct dm_target *tgt;
    +
    + for (i = 0; i < t->num_targets; i++) {
    + tgt = t->targets + i;
    + if (tgt->type->map_rq)
    + request_based = 1;
    + else
    + bio_based = 1;
    +
    + if (bio_based && request_based) {
    + DMWARN("Inconsistent table: different target types"
    + " can't be mixed up");
    + return -EINVAL;
    + }
    + }
    +
    + if (bio_based) {
    + /* We must use this table as bio-based */
    + t->limits.no_request_stacking = 1;
    + return 0;
    + }
    +
    + BUG_ON(!request_based); /* No targets in this table */
    +
    + /* Non-request-stackable devices can't be used for request-based dm */
    + if (t->limits.no_request_stacking) {
    + DMWARN("table load rejected: including non-request-stackable"
    + " devices");
    + return -EINVAL;
    + }
    +
    + /*
    + * Request-based dm supports only tables that have a single target now.
    + * To support multiple targets, request splitting support is needed,
    + * and that needs lots of changes in the block-layer.
    + * (e.g. request completion process for partial completion.)
    + */
    + if (t->num_targets > 1) {
    + DMWARN("Request-based dm doesn't support multiple targets yet");
    + return -EINVAL;
    + }
    +
    + return 0;
    +}
    +
    +int dm_table_get_type(struct dm_table *t)
    +{
    + return t->limits.no_request_stacking ?
    + DM_TYPE_BIO_BASED : DM_TYPE_REQUEST_BASED;
    +}
    +
    +int dm_table_request_based(struct dm_table *t)
    +{
    + return dm_table_get_type(t) == DM_TYPE_REQUEST_BASED;
    +}
    +
    static int setup_indexes(struct dm_table *t)
    {
    int i;
    @@ -861,6 +925,10 @@ void dm_table_set_restrictions(struct dm
    else
    queue_flag_set_unlocked(QUEUE_FLAG_CLUSTER, q);

    + if (t->limits.no_request_stacking)
    + queue_flag_clear_unlocked(QUEUE_FLAG_STACKABLE, q);
    + else
    + queue_flag_set_unlocked(QUEUE_FLAG_STACKABLE, q);
    }

    unsigned int dm_table_get_num_targets(struct dm_table *t)
    Index: 2.6.27-rc6/drivers/md/dm.c
    ===================================================================
    --- 2.6.27-rc6.orig/drivers/md/dm.c
    +++ 2.6.27-rc6/drivers/md/dm.c
    @@ -160,6 +160,8 @@ struct mapped_device {

    struct bio_set *bs;

    + unsigned int mempool_type; /* Type of mempools above. */
    +
    /*
    * Event handling.
    */
    @@ -1697,10 +1699,22 @@ static struct mapped_device *alloc_dev(i
    INIT_LIST_HEAD(&md->uevent_list);
    spin_lock_init(&md->uevent_lock);

    - md->queue = blk_alloc_queue(GFP_KERNEL);
    + md->queue = blk_init_queue(dm_request_fn, NULL);
    if (!md->queue)
    goto bad_queue;

    + /*
    + * Request-based dm devices cannot be stacked on top of bio-based dm
    + * devices. The type of this dm device has not been decided yet,
    + * although we initialized the queue using blk_init_queue().
    + * The type is decided at the first table loading time.
    + * To prevent problematic device stacking, clear the queue flag
    + * for request stacking support until then.
    + *
    + * This queue is new, so no concurrency on the queue_flags.
    + */
    + queue_flag_clear_unlocked(QUEUE_FLAG_STACKABLE, md->queue);
    + md->saved_make_request_fn = md->queue->make_request_fn;
    md->queue->queuedata = md;
    md->queue->backing_dev_info.congested_fn = dm_congested;
    md->queue->backing_dev_info.congested_data = md;
    @@ -1708,18 +1722,8 @@ static struct mapped_device *alloc_dev(i
    blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY);
    md->queue->unplug_fn = dm_unplug_all;
    blk_queue_merge_bvec(md->queue, dm_merge_bvec);
    -
    - md->io_pool = mempool_create_slab_pool(MIN_IOS, _io_cache);
    - if (!md->io_pool)
    - goto bad_io_pool;
    -
    - md->tio_pool = mempool_create_slab_pool(MIN_IOS, _tio_cache);
    - if (!md->tio_pool)
    - goto bad_tio_pool;
    -
    - md->bs = bioset_create(16, 16);
    - if (!md->bs)
    - goto bad_no_bioset;
    + blk_queue_softirq_done(md->queue, dm_softirq_done);
    + blk_queue_prep_rq(md->queue, dm_prep_fn);

    md->disk = alloc_disk(1);
    if (!md->disk)
    @@ -1754,12 +1758,6 @@ static struct mapped_device *alloc_dev(i
    bad_thread:
    put_disk(md->disk);
    bad_disk:
    - bioset_free(md->bs);
    -bad_no_bioset:
    - mempool_destroy(md->tio_pool);
    -bad_tio_pool:
    - mempool_destroy(md->io_pool);
    -bad_io_pool:
    blk_cleanup_queue(md->queue);
    bad_queue:
    free_minor(minor);
    @@ -1781,9 +1779,12 @@ static void free_dev(struct mapped_devic
    bdput(md->suspended_bdev);
    }
    destroy_workqueue(md->wq);
    - mempool_destroy(md->tio_pool);
    - mempool_destroy(md->io_pool);
    - bioset_free(md->bs);
    + if (md->tio_pool)
    + mempool_destroy(md->tio_pool);
    + if (md->io_pool)
    + mempool_destroy(md->io_pool);
    + if (md->bs)
    + bioset_free(md->bs);
    del_gendisk(md->disk);
    free_minor(minor);

    @@ -1846,6 +1847,16 @@ static int __bind(struct mapped_device *
    dm_table_get(t);
    dm_table_event_callback(t, event_callback, md);

    + /*
    + * The queue hasn't been stopped yet, if the old table type wasn't
    + * for request-based during suspension. So stop it to prevent
    + * I/O mapping before resume.
    + * This must be done before setting the queue restrictions,
    + * because request-based dm may be run just after the setting.
    + */
    + if (dm_table_request_based(t) && !blk_queue_stopped(q))
    + stop_queue(q);
    +
    write_lock(&md->map_lock);
    md->map = t;
    dm_table_set_restrictions(t, q);
    @@ -1995,7 +2006,13 @@ static void __flush_deferred_io(struct m
    struct bio *c;

    while ((c = bio_list_pop(&md->deferred))) {
    - if (__split_bio(md, c))
    + /*
    + * Some bios might have been queued here during suspension
    + * before setting of request-based dm in resume
    + */
    + if (dm_request_based(md))
    + generic_make_request(c);
    + else if (__split_bio(md, c))
    bio_io_error(c);
    }

    @@ -2413,6 +2430,65 @@ int dm_noflush_suspending(struct dm_targ
    }
    EXPORT_SYMBOL_GPL(dm_noflush_suspending);

    +int dm_init_md_mempool(struct mapped_device *md, int type)
    +{
    + if (unlikely(type == DM_TYPE_NONE)) {
    + DMWARN("no type is specified, can't initialize mempool");
    + return -EINVAL;
    + }
    +
    + if (md->mempool_type == type)
    + return 0;
    +
    + if (md->map) {
    + /* The md has been using, can't change the mempool type */
    + DMWARN("can't change mempool type after a table is bound");
    + return -EINVAL;
    + }
    +
    + /* Not using the md yet, we can still change the mempool type */
    + if (md->mempool_type != DM_TYPE_NONE) {
    + mempool_destroy(md->io_pool);
    + md->io_pool = NULL;
    + mempool_destroy(md->tio_pool);
    + md->tio_pool = NULL;
    + bioset_free(md->bs);
    + md->bs = NULL;
    + md->mempool_type = DM_TYPE_NONE;
    + }
    +
    + md->io_pool = (type == DM_TYPE_BIO_BASED) ?
    + mempool_create_slab_pool(MIN_IOS, _io_cache) :
    + mempool_create_slab_pool(MIN_IOS, _bio_info_cache);
    + if (!md->io_pool)
    + return -ENOMEM;
    +
    + md->tio_pool = (type == DM_TYPE_BIO_BASED) ?
    + mempool_create_slab_pool(MIN_IOS, _tio_cache) :
    + mempool_create_slab_pool(MIN_IOS, _rq_tio_cache);
    + if (!md->tio_pool)
    + goto free_io_pool_and_out;
    +
    + md->bs = (type == DM_TYPE_BIO_BASED) ?
    + bioset_create(16, 16) : bioset_create(MIN_IOS, MIN_IOS);
    + if (!md->bs)
    + goto free_tio_pool_and_out;
    +
    + md->mempool_type = type;
    +
    + return 0;
    +
    +free_tio_pool_and_out:
    + mempool_destroy(md->tio_pool);
    + md->tio_pool = NULL;
    +
    +free_io_pool_and_out:
    + mempool_destroy(md->io_pool);
    + md->io_pool = NULL;
    +
    + return -ENOMEM;
    +}
    +
    static struct block_device_operations dm_blk_dops = {
    .open = dm_blk_open,
    .release = dm_blk_close,
    Index: 2.6.27-rc6/drivers/md/dm.h
    ===================================================================
    --- 2.6.27-rc6.orig/drivers/md/dm.h
    +++ 2.6.27-rc6/drivers/md/dm.h
    @@ -23,6 +23,13 @@
    #define DM_SUSPEND_NOFLUSH_FLAG (1 << 1)

    /*
    + * Type of table and mapped_device's mempool
    + */
    +#define DM_TYPE_NONE 0
    +#define DM_TYPE_BIO_BASED 1
    +#define DM_TYPE_REQUEST_BASED 2
    +
    +/*
    * List of devices that a metadevice uses and should open/close.
    */
    struct dm_dev {
    @@ -51,6 +58,9 @@ int dm_table_resume_targets(struct dm_ta
    int dm_table_any_congested(struct dm_table *t, int bdi_bits);
    int dm_table_any_busy_target(struct dm_table *t);
    void dm_table_unplug_all(struct dm_table *t);
    +int dm_table_set_type(struct dm_table *t);
    +int dm_table_get_type(struct dm_table *t);
    +int dm_table_request_based(struct dm_table *t);

    /*
    * To check the return value from dm_table_find_target().
    @@ -113,4 +123,9 @@ void dm_kobject_uevent(struct mapped_dev
    int dm_kcopyd_init(void);
    void dm_kcopyd_exit(void);

    +/*
    + * Mempool initializer for a mapped_device
    + */
    +int dm_init_md_mempool(struct mapped_device *md, int type);
    +
    #endif
    Index: 2.6.27-rc6/drivers/md/dm-ioctl.c
    ===================================================================
    --- 2.6.27-rc6.orig/drivers/md/dm-ioctl.c
    +++ 2.6.27-rc6/drivers/md/dm-ioctl.c
    @@ -1045,6 +1045,12 @@ static int populate_table(struct dm_tabl
    next = spec->next;
    }

    + r = dm_table_set_type(table);
    + if (r) {
    + DMWARN("unable to set table type");
    + return r;
    + }
    +
    return dm_table_complete(table);
    }

    @@ -1069,6 +1075,13 @@ static int table_load(struct dm_ioctl *p
    goto out;
    }

    + r = dm_init_md_mempool(md, dm_table_get_type(t));
    + if (r) {
    + DMWARN("unable to initialize the md mempools for this table");
    + dm_table_put(t);
    + goto out;
    + }
    +
    down_write(&_hash_lock);
    hc = dm_get_mdptr(md);
    if (!hc || hc->md != md) {
    Index: 2.6.27-rc6/include/linux/device-mapper.h
    ===================================================================
    --- 2.6.27-rc6.orig/include/linux/device-mapper.h
    +++ 2.6.27-rc6/include/linux/device-mapper.h
    @@ -140,6 +140,7 @@ struct io_restrictions {
    unsigned short max_hw_segments;
    unsigned short max_phys_segments;
    unsigned char no_cluster; /* inverted so that 0 is default */
    + unsigned char no_request_stacking;
    };

    struct dm_target {

  11. [PATCH 05/13] block: add a queue flag for request stacking support

    This patch adds a queue flag to indicate the block device can be
    used for request stacking.

    Request stacking drivers must stack their devices only on top of
    devices whose q->request_fn is functional.
    Since bio stacking drivers (e.g. md, loop) basically initialize
    their queue using blk_alloc_queue() and don't set q->request_fn,
    a check for (q->request_fn == NULL) looks sufficient for that purpose.

    However, dm becomes both types of stacking driver (bio-based and
    request-based) with this patch-set. And dm always sets q->request_fn,
    even if the dm device is bio-based, in which case q->request_fn is
    not actually functional.
    So we need something else to distinguish the type of the device.
    Adding a queue flag is a solution for that.


    The reason why dm always sets q->request_fn is to keep
    compatibility with dm user-space tools.
    Currently, all dm user-space tools are using bio-based dm without
    specifying the type of the dm device they use.
    To use request-based dm without changing such tools, the kernel
    must decide the type of the dm device automatically.
    The automatic type decision can't be done at the device creation time
    and needs to be deferred until such tools load a mapping table,
    since the actual type is decided by dm target type included in
    the mapping table.

    So a dm device has to be initialized using blk_init_queue()
    so that we can load either type of table.
    As a result, all queue members are set (e.g. q->request_fn) and
    nothing is left to distinguish whether the device is bio-based or
    request-based, even after a table is loaded and the type of the
    device is decided.

    By the way, some parts of the queue (e.g. request_list, elevator)
    are unnecessary when the dm device is used as bio-based.
    But the memory size is not so large (about 20[KB] per queue on ia64),
    so I hope the memory overhead is acceptable for bio-based dm users.
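
The point that q->request_fn alone can't tell the two types apart can be
sketched in userland C as follows. This is only a model: `struct model_queue`
and the `model_*` helpers are hypothetical, and only the bit number 11 for
the stackable flag is taken from this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userland model of a request queue; not the kernel struct. */
#define MODEL_QUEUE_FLAG_CLUSTER	0	/* illustrative bit number */
#define MODEL_QUEUE_FLAG_STACKABLE	11	/* bit number used by this patch */

struct model_queue {
	unsigned long queue_flags;
	void (*request_fn)(struct model_queue *q);
};

static void model_request_fn(struct model_queue *q)
{
	(void)q;	/* placeholder, never dispatches anything */
}

/* Models blk_init_queue(): request_fn is set and the queue is stackable. */
static void model_init_queue(struct model_queue *q)
{
	assert(q != NULL);
	q->request_fn = model_request_fn;
	q->queue_flags = (1UL << MODEL_QUEUE_FLAG_CLUSTER) |
			 (1UL << MODEL_QUEUE_FLAG_STACKABLE);
}

/* Models blk_queue_stackable(): test the flag, not request_fn. */
static int model_queue_stackable(const struct model_queue *q)
{
	return (int)((q->queue_flags >> MODEL_QUEUE_FLAG_STACKABLE) & 1UL);
}

/* Models what dm does for a bio-based table: clear the flag only. */
static void model_mark_bio_based(struct model_queue *q)
{
	q->queue_flags &= ~(1UL << MODEL_QUEUE_FLAG_STACKABLE);
}
```

After model_mark_bio_based(), request_fn is still non-NULL, so only the flag
tells a request stacking driver that it must not stack on this queue.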


    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: Jens Axboe
    ---
    block/blk-core.c | 3 ++-
    include/linux/blkdev.h | 3 +++
    2 files changed, 5 insertions(+), 1 deletion(-)

    Index: 2.6.27-rc6/block/blk-core.c
    ===================================================================
    --- 2.6.27-rc6.orig/block/blk-core.c
    +++ 2.6.27-rc6/block/blk-core.c
    @@ -569,7 +569,8 @@ blk_init_queue_node(request_fn_proc *rfn
    q->request_fn = rfn;
    q->prep_rq_fn = NULL;
    q->unplug_fn = generic_unplug_device;
    - q->queue_flags = (1 << QUEUE_FLAG_CLUSTER);
    + q->queue_flags = (1 << QUEUE_FLAG_CLUSTER |
    + 1 << QUEUE_FLAG_STACKABLE);
    q->queue_lock = lock;

    blk_queue_segment_boundary(q, 0xffffffff);
    Index: 2.6.27-rc6/include/linux/blkdev.h
    ===================================================================
    --- 2.6.27-rc6.orig/include/linux/blkdev.h
    +++ 2.6.27-rc6/include/linux/blkdev.h
    @@ -421,6 +421,7 @@ struct request_queue
    #define QUEUE_FLAG_ELVSWITCH 8 /* don't use elevator, just do FIFO */
    #define QUEUE_FLAG_BIDI 9 /* queue supports bidi requests */
    #define QUEUE_FLAG_NOMERGES 10 /* disable merge attempts */
    +#define QUEUE_FLAG_STACKABLE 11 /* supports request stacking */

    static inline int queue_is_locked(struct request_queue *q)
    {
    @@ -527,6 +528,8 @@ enum {
    #define blk_queue_stopped(q) test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
    #define blk_queue_nomerges(q) test_bit(QUEUE_FLAG_NOMERGES, &(q)->queue_flags)
    #define blk_queue_flushing(q) ((q)->ordseq)
    +#define blk_queue_stackable(q) \
    + test_bit(QUEUE_FLAG_STACKABLE, &(q)->queue_flags)

    #define blk_fs_request(rq) ((rq)->cmd_type == REQ_TYPE_FS)
    #define blk_pc_request(rq) ((rq)->cmd_type == REQ_TYPE_BLOCK_PC)

  12. [PATCH 13/13] dm-mpath: convert to request-based

    This patch converts dm-multipath target to request-based from bio-based.

    Basically, the patch just converts the I/O unit from struct bio
    to struct request.
    In the course of the conversion, it also changes the I/O queueing
    mechanism. The change in the I/O queueing is described in detail
    below.

    I/O queueing mechanism change
    -----------------------------
    In I/O submission, map_io(), there is no mechanism change from
    bio-based, since the clone request is ready for retry as it is.
    However, in I/O completion, do_end_io(), there is a mechanism change
    from bio-based, since the clone request is not ready for retry.

    In do_end_io() of bio-based, the clone bio has all needed memory
    for resubmission. So the target driver can queue it and resubmit
    it later without memory allocations.
    The mechanism has almost no overhead.

    On the other hand, in do_end_io() of request-based, the clone request
    doesn't have clone bios, so the target driver can't resubmit it
    as it is. To resubmit the clone request, memory allocation for
    clone bios is needed, which incurs some overhead.
    To avoid that overhead just for queueing, the target driver doesn't
    queue the clone request inside itself.
    Instead, the target driver asks dm core to queue and remap
    the original request of the clone request, since the only overhead
    of queueing is then freeing the memory of the clone request.

    As a result, the target driver doesn't need to record/restore
    the information of the original request for resubmitting
    the clone request. So dm_bio_details in dm_mpath_io is removed.
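
The resulting end-I/O decision can be sketched in userland C roughly as
below. This is a simplified model, not the patch itself: `struct model_mpath`,
`model_do_end_io()` and the value of `MODEL_ENDIO_REQUEUE` are hypothetical,
locking and fail_path() are left out, and `must_push_back` stands in for the
result of __must_push_back().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative value only; it stands in for DM_ENDIO_REQUEUE. */
#define MODEL_ENDIO_REQUEUE	2

/* Hypothetical snapshot of the multipath state needed for the decision. */
struct model_mpath {
	unsigned nr_valid_paths;
	int queue_if_no_path;
	int must_push_back;	/* stands in for __must_push_back(m) */
};

/*
 * Returns 0 when the I/O completed, a negative errno when it must be
 * failed, or MODEL_ENDIO_REQUEUE when dm core should requeue the
 * original request. The clone is never queued inside the target.
 */
static int model_do_end_io(const struct model_mpath *m, int error,
			   int clone_errors)
{
	assert(m != NULL);

	if (!error && !clone_errors)
		return 0;		/* I/O complete */

	if (error == -EOPNOTSUPP)
		return error;		/* retrying on another path won't help */

	/* The real code fails the path that was used here. */

	if (!m->nr_valid_paths && !m->queue_if_no_path && !m->must_push_back)
		return -EIO;		/* no path left and no queueing wanted */

	return MODEL_ENDIO_REQUEUE;
}
```

Note that the requeue result is the default here: the target only overrides it
when the I/O is done or has to be failed.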


    multipath_busy()
    ---------------------
    The target driver returns "busy" only when both of the following hold:
    o The target driver will map I/Os if the map() function is called
    and
    o The mapped I/Os will wait on the underlying devices' queues because
    those devices are congested, if the map() function is called now.

    In other cases, the target driver doesn't return "busy".
    Otherwise, dm core would keep the I/Os and the target driver couldn't
    do what it wants.
    (e.g. the target driver can't map I/Os now, so it wants to kill I/Os.)
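
A minimal userland sketch of this rule, assuming the priority group that will
be used next is already known (or NULL if it is not), could look like the
following. `struct model_path`, `struct model_pg` and `model_pg_busy()` are
hypothetical names, and the `congested` field stands in for the congestion
check on the underlying queue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userland model of a path and a priority group. */
struct model_path {
	int is_active;
	int congested;	/* stands in for the underlying queue's congestion */
};

struct model_pg {
	const struct model_path *paths;
	size_t nr_paths;
};

/*
 * Returns 1 ("busy") only if the group has at least one active path and
 * every active path is congested. An unknown group (NULL) or a group
 * without active paths is never reported as busy, so that mapping is
 * still attempted and the target can queue or kill the I/O itself.
 */
static int model_pg_busy(const struct model_pg *pg)
{
	int has_active = 0;
	size_t i;

	if (!pg)
		return 0;	/* next group unknown: just try mapping */

	assert(pg->paths != NULL || pg->nr_paths == 0);

	for (i = 0; i < pg->nr_paths; i++) {
		if (!pg->paths[i].is_active)
			continue;

		has_active = 1;
		if (!pg->paths[i].congested)
			return 0;	/* the selector can pick this path */
	}

	return has_active;
}
```

Inactive paths are ignored entirely, so an uncongested but failed path does
not make the group look usable.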


    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: Alasdair G Kergon
    ---
    drivers/md/dm-mpath.c | 195 +++++++++++++++++++++++++++++++++-----------------
    1 files changed, 130 insertions(+), 65 deletions(-)

    Index: 2.6.27-rc6/drivers/md/dm-mpath.c
    ===================================================================
    --- 2.6.27-rc6.orig/drivers/md/dm-mpath.c
    +++ 2.6.27-rc6/drivers/md/dm-mpath.c
    @@ -7,8 +7,6 @@

    #include "dm.h"
    #include "dm-path-selector.h"
    -#include "dm-bio-list.h"
    -#include "dm-bio-record.h"
    #include "dm-uevent.h"

    #include
    @@ -82,7 +80,7 @@ struct multipath {
    unsigned pg_init_count; /* Number of times pg_init called */

    struct work_struct process_queued_ios;
    - struct bio_list queued_ios;
    + struct list_head queued_ios;
    unsigned queue_size;

    struct work_struct trigger_event;
    @@ -99,7 +97,6 @@ struct multipath {
    */
    struct dm_mpath_io {
    struct pgpath *pgpath;
    - struct dm_bio_details details;
    };

    typedef int (*action_fn) (struct pgpath *pgpath);
    @@ -180,6 +177,7 @@ static struct multipath *alloc_multipath
    m = kzalloc(sizeof(*m), GFP_KERNEL);
    if (m) {
    INIT_LIST_HEAD(&m->priority_groups);
    + INIT_LIST_HEAD(&m->queued_ios);
    spin_lock_init(&m->lock);
    m->queue_io = 1;
    INIT_WORK(&m->process_queued_ios, process_queued_ios);
    @@ -304,12 +302,13 @@ static int __must_push_back(struct multi
    dm_noflush_suspending(m->ti));
    }

    -static int map_io(struct multipath *m, struct bio *bio,
    +static int map_io(struct multipath *m, struct request *clone,
    struct dm_mpath_io *mpio, unsigned was_queued)
    {
    int r = DM_MAPIO_REMAPPED;
    unsigned long flags;
    struct pgpath *pgpath;
    + struct block_device *bdev;

    spin_lock_irqsave(&m->lock, flags);

    @@ -326,16 +325,18 @@ static int map_io(struct multipath *m, s
    if ((pgpath && m->queue_io) ||
    (!pgpath && m->queue_if_no_path)) {
    /* Queue for the daemon to resubmit */
    - bio_list_add(&m->queued_ios, bio);
    + list_add_tail(&clone->queuelist, &m->queued_ios);
    m->queue_size++;
    if ((m->pg_init_required && !m->pg_init_in_progress) ||
    !m->queue_io)
    queue_work(kmultipathd, &m->process_queued_ios);
    pgpath = NULL;
    r = DM_MAPIO_SUBMITTED;
    - } else if (pgpath)
    - bio->bi_bdev = pgpath->path.dev->bdev;
    - else if (__must_push_back(m))
    + } else if (pgpath) {
    + bdev = pgpath->path.dev->bdev;
    + clone->q = bdev_get_queue(bdev);
    + clone->rq_disk = bdev->bd_disk;
    + } else if (__must_push_back(m))
    r = DM_MAPIO_REQUEUE;
    else
    r = -EIO; /* Failed */
    @@ -378,30 +379,31 @@ static void dispatch_queued_ios(struct m
    {
    int r;
    unsigned long flags;
    - struct bio *bio = NULL, *next;
    struct dm_mpath_io *mpio;
    union map_info *info;
    + struct request *clone, *n;
    + LIST_HEAD(cl);

    spin_lock_irqsave(&m->lock, flags);
    - bio = bio_list_get(&m->queued_ios);
    + list_splice_init(&m->queued_ios, &cl);
    spin_unlock_irqrestore(&m->lock, flags);

    - while (bio) {
    - next = bio->bi_next;
    - bio->bi_next = NULL;
    + list_for_each_entry_safe(clone, n, &cl, queuelist) {
    + list_del_init(&clone->queuelist);

    - info = dm_get_mapinfo(bio);
    + info = dm_get_rq_mapinfo(clone);
    mpio = info->ptr;

    - r = map_io(m, bio, mpio, 1);
    - if (r < 0)
    - bio_endio(bio, r);
    - else if (r == DM_MAPIO_REMAPPED)
    - generic_make_request(bio);
    - else if (r == DM_MAPIO_REQUEUE)
    - bio_endio(bio, -EIO);
    -
    - bio = next;
    + r = map_io(m, clone, mpio, 1);
    + if (r < 0) {
    + mempool_free(mpio, m->mpio_pool);
    + dm_kill_request(clone, r);
    + } else if (r == DM_MAPIO_REMAPPED)
    + dm_dispatch_request(clone);
    + else if (r == DM_MAPIO_REQUEUE) {
    + mempool_free(mpio, m->mpio_pool);
    + dm_requeue_request(clone);
    + }
    }
    }

    @@ -817,21 +819,24 @@ static void multipath_dtr(struct dm_targ
    }

    /*
    - * Map bios, recording original fields for later in case we have to resubmit
    + * Map cloned requests
    */
    -static int multipath_map(struct dm_target *ti, struct bio *bio,
    +static int multipath_map(struct dm_target *ti, struct request *clone,
    union map_info *map_context)
    {
    int r;
    struct dm_mpath_io *mpio;
    struct multipath *m = (struct multipath *) ti->private;

    - mpio = mempool_alloc(m->mpio_pool, GFP_NOIO);
    - dm_bio_record(&mpio->details, bio);
    + mpio = mempool_alloc(m->mpio_pool, GFP_ATOMIC);
    + if (!mpio)
    + /* ENOMEM, requeue */
    + return DM_MAPIO_REQUEUE;
    + memset(mpio, 0, sizeof(*mpio));

    map_context->ptr = mpio;
    - bio->bi_rw |= (1 << BIO_RW_FAILFAST);
    - r = map_io(m, bio, mpio, 0);
    + clone->cmd_flags |= REQ_FAILFAST;
    + r = map_io(m, clone, mpio, 0);
    if (r < 0 || r == DM_MAPIO_REQUEUE)
    mempool_free(mpio, m->mpio_pool);

    @@ -1105,53 +1110,41 @@ static void activate_path(struct work_st
    /*
    * end_io handling
    */
    -static int do_end_io(struct multipath *m, struct bio *bio,
    +static int do_end_io(struct multipath *m, struct request *clone,
    int error, struct dm_mpath_io *mpio)
    {
    + /*
    + * We don't queue any clone request inside the multipath target
    + * during end I/O handling, since those clone requests don't have
    + * bio clones. If we queue them inside the multipath target,
    + * we need to make bio clones, that requires memory allocation.
    + * (See drivers/md/dm.c:end_clone_bio() about why the clone requests
    + * don't have bio clones.)
    + * Instead of queueing the clone request here, we queue the original
    + * request into dm core, which will remake a clone request and
    + * clone bios for it and resubmit it later.
    + */
    + int r = DM_ENDIO_REQUEUE;
    unsigned long flags;

    - if (!error)
    + if (!error && !clone->errors)
    return 0; /* I/O complete */

    - if ((error == -EWOULDBLOCK) && bio_rw_ahead(bio))
    - return error;
    -
    if (error == -EOPNOTSUPP)
    return error;

    - spin_lock_irqsave(&m->lock, flags);
    - if (!m->nr_valid_paths) {
    - if (__must_push_back(m)) {
    - spin_unlock_irqrestore(&m->lock, flags);
    - return DM_ENDIO_REQUEUE;
    - } else if (!m->queue_if_no_path) {
    - spin_unlock_irqrestore(&m->lock, flags);
    - return -EIO;
    - } else {
    - spin_unlock_irqrestore(&m->lock, flags);
    - goto requeue;
    - }
    - }
    - spin_unlock_irqrestore(&m->lock, flags);
    -
    if (mpio->pgpath)
    fail_path(mpio->pgpath);

    - requeue:
    - dm_bio_restore(&mpio->details, bio);
    -
    - /* queue for the daemon to resubmit or fail */
    spin_lock_irqsave(&m->lock, flags);
    - bio_list_add(&m->queued_ios, bio);
    - m->queue_size++;
    - if (!m->queue_io)
    - queue_work(kmultipathd, &m->process_queued_ios);
    + if (!m->nr_valid_paths && !m->queue_if_no_path && !__must_push_back(m))
    + r = -EIO;
    spin_unlock_irqrestore(&m->lock, flags);

    - return DM_ENDIO_INCOMPLETE; /* io not complete */
    + return r;
    }

    -static int multipath_end_io(struct dm_target *ti, struct bio *bio,
    +static int multipath_end_io(struct dm_target *ti, struct request *clone,
    int error, union map_info *map_context)
    {
    struct multipath *m = ti->private;
    @@ -1160,14 +1153,13 @@ static int multipath_end_io(struct dm_ta
    struct path_selector *ps;
    int r;

    - r = do_end_io(m, bio, error, mpio);
    + r = do_end_io(m, clone, error, mpio);
    if (pgpath) {
    ps = &pgpath->pg->ps;
    if (ps->type->end_io)
    ps->type->end_io(ps, &pgpath->path);
    }
    - if (r != DM_ENDIO_INCOMPLETE)
    - mempool_free(mpio, m->mpio_pool);
    + mempool_free(mpio, m->mpio_pool);

    return r;
    }
    @@ -1403,6 +1395,78 @@ static int multipath_ioctl(struct dm_tar
    bdev->bd_disk, cmd, arg);
    }

    +static int __pgpath_congested(struct pgpath *pgpath)
    +{
    + struct request_queue *q = bdev_get_queue(pgpath->path.dev->bdev);
    +
    + if (dm_underlying_device_congested(q))
    + return 1;
    +
    + return 0;
    +}
    +
    +/*
    + * We return "busy", only when we can map I/Os but underlying devices
    + * are congested (so even if we map I/Os now, the I/Os will wait on
    + * the underlying queue).
    + * In other words, if we want to kill I/Os or queue them inside us
    + * due to map unavailability, we don't return "busy". Otherwise,
    + * dm core won't give us the I/Os and we can't do what we want.
    + */
    +static int multipath_busy(struct dm_target *ti)
    +{
    + int busy = 0, has_active = 0;
    + struct multipath *m = (struct multipath *) ti->private;
    + struct priority_group *pg;
    + struct pgpath *pgpath;
    + unsigned long flags;
    +
    + spin_lock_irqsave(&m->lock, flags);
    +
    + /* Guess which priority_group will be used at next mapping time */
    + if (unlikely(!m->current_pgpath && m->next_pg))
    + pg = m->next_pg;
    + else if (likely(m->current_pg))
    + pg = m->current_pg;
    + else
    + /*
    + * We don't know which pg will be used at next mapping time.
    + * We don't call __choose_pgpath() here to avoid to trigger
    + * pg_init just by congestion checking.
    + * So we don't know whether underlying devices we will be using
    + * at next mapping time are congested or not. Just try mapping.
    + */
    + goto out;
    +
    + /*
    + * If there is one uncongested active path at least, the path selector
    + * will be able to select it. So we consider such a pg as uncongested.
    + */
    + busy = 1;
    + list_for_each_entry(pgpath, &pg->pgpaths, list)
    + if (pgpath->is_active) {
    + has_active = 1;
    +
    + if (!__pgpath_congested(pgpath)) {
    + busy = 0;
    + break;
    + }
    + }
    +
    + if (!has_active)
    + /*
    + * No active path in this pg, so this pg won't be used and
    + * the current_pg will be changed at next mapping time.
    + * We need to try mapping to determine it.
    + */
    + busy = 0;
    +
    +out:
    + spin_unlock_irqrestore(&m->lock, flags);
    +
    + return busy;
    +}
    +
    /*-----------------------------------------------------------------
    * Module setup
    *---------------------------------------------------------------*/
    @@ -1412,13 +1476,14 @@ static struct target_type multipath_targ
    .module = THIS_MODULE,
    .ctr = multipath_ctr,
    .dtr = multipath_dtr,
    - .map = multipath_map,
    - .end_io = multipath_end_io,
    + .map_rq = multipath_map,
    + .rq_end_io = multipath_end_io,
    .presuspend = multipath_presuspend,
    .resume = multipath_resume,
    .status = multipath_status,
    .message = multipath_message,
    .ioctl = multipath_ioctl,
    + .busy = multipath_busy,
    };

    static int __init dm_multipath_init(void)

  13. [PATCH 10/13] dm: add core functions for request-based dm

    This patch adds core functions for request-based dm.

    When a struct mapped_device (md) is initialized as request-based,
    md->queue has an I/O scheduler and the following functions are set:
        make_request_fn: __make_request() (existing block layer function)
        request_fn:      dm_request_fn()  (newly added function)
    The actual initialization is done in another patch (PATCH 11).

    Below is a brief summary of how request-based dm behaves, including:
    - making request from bio
    - cloning, mapping and dispatching request
    - completing request and bio
    - suspending md
    - resuming md


    bio to request
    ==============
    md->queue->make_request_fn() (__make_request()) is called for a bio
    submitted to the md.
    Then, the bio is kept in the queue as a new request or merged into
    another request in the queue if possible.


    Cloning and Mapping
    ===================
    Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
    when requests are dispatched after they are sorted by the I/O scheduler.

    dm_request_fn() checks the busy state of the underlying devices using
    the target's busy() function. If they are busy, it stops dispatching
    and keeps the requests on the dm device's queue.
    This improves I/O merging, since a request can no longer be merged
    once it has been dispatched to an underlying device.

    The actual cloning and mapping are done in dm_prep_fn() and map_request(),
    which are called from dm_request_fn().
    dm_prep_fn() clones not only the request but also the bios of the request,
    so that dm can hold back bio completion in error cases and prevent
    the bio submitter from seeing the error.
    (See the "Completion" section below for details.)

    After the cloning, the clone is mapped by the target's map_rq() function
    and inserted into the underlying device's queue using __elv_add_request().
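
The dispatch behaviour described above can be sketched in user space as follows. This is only an illustration under stated assumptions: the sketch_* types and functions are hypothetical stand-ins, not the kernel structures or the patch's code, and "busy" is modelled as a simple in-flight depth limit on one underlying device.

```c
#include <assert.h>

/* Hypothetical stand-in for a dm target and its underlying device. */
struct sketch_target {
	int inflight;      /* requests currently dispatched to the device */
	int max_inflight;  /* the device counts as busy at this depth */
};

/* Hypothetical stand-in for the dm device's own request queue. */
struct sketch_queue {
	int pending;       /* requests still waiting on the dm queue */
	int plugged;       /* set when dispatch stopped with work remaining */
};

/* Analogue of the target's busy() hook. */
int sketch_busy(const struct sketch_target *ti)
{
	return ti->inflight >= ti->max_inflight;
}

/*
 * Analogue of the dispatch loop: hand requests to the target until the
 * queue is empty or the target reports busy. Requests left behind stay
 * on the dm queue, where they can still be merged.
 * Returns the number of requests dispatched in this run.
 */
int sketch_request_fn(struct sketch_queue *q, struct sketch_target *ti)
{
	int dispatched = 0;

	assert(q->pending >= 0);
	q->plugged = 0;
	while (q->pending > 0) {
		if (sketch_busy(ti)) {
			q->plugged = 1;   /* retry later */
			break;
		}
		q->pending--;
		ti->inflight++;
		dispatched++;
	}
	return dispatched;
}
```

With five queued requests and a depth limit of two, one run dispatches two requests and leaves three on the dm queue until the device drains.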


    Completion
    ==========
    Request completion can be hooked by rq->end_io(), but by then all bios
    in the request will have been completed, even in error cases, and the
    bio submitter will have seen the error.
    To prevent bio completion in error cases, request-based dm clones
    both the bios and the request and hooks both bio->bi_end_io() and
    rq->end_io():
        bio->bi_end_io(): end_clone_bio()
        rq->end_io():     end_clone_request()

    Summary of the request completion flow is below:
    blk_end_request() for a clone request
      => __end_that_request_first()
          => bio->bi_end_io() == end_clone_bio() for each clone bio
              => Free the clone bio
              => Success: Complete the original bio (blk_update_request())
                 Error:   Don't complete the original bio
      => end_that_request_last()
          => rq->end_io() == end_clone_request()
              => blk_complete_request()
                  => dm_softirq_done()
                      => Free the clone request
                      => Success: Complete the original request (blk_end_request())
                         Error:   Requeue the original request

    In successful cases, end_clone_bio() completes the original request
    by the size of the original bio.
    Even if that completes all bios in the original request, the original
    request must not be completed yet, in order to keep the ordering of
    request completion across the stack.
    So end_clone_bio() uses blk_update_request() instead of
    blk_end_request().
    In error cases, end_clone_bio() doesn't complete the original bio.
    It just frees the cloned bio and hands the error handling over to
    end_clone_request().

    end_clone_request(), which is called with the queue lock held, completes
    the clone request and the original request in a softirq context
    (dm_softirq_done()), which doesn't hold the queue lock. This avoids a
    deadlock when another request is submitted during the completion:
    - The submitted request may be mapped to the same device
    - Request submission requires the queue lock, but the completing
      context already holds it and the submitter doesn't know that

    The clone request has no clone bios left when dm_softirq_done() is called.
    So target drivers can't resubmit it, even in error cases.
    Instead, they can ask dm core to requeue and remap the original
    request in those cases.
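
The two-level completion described above can be modelled in user space roughly as follows. This is a sketch under assumptions, not the patch's code: the sketch_* names are hypothetical, the original request is reduced to a byte counter, and the target is assumed to always ask for a requeue on error (a real target may instead choose to fail the I/O).

```c
#include <assert.h>

enum sketch_result { SKETCH_COMPLETED, SKETCH_REQUEUED };

/* Hypothetical stand-in for the original request seen by the submitter. */
struct sketch_orig {
	unsigned int bytes_left;   /* bytes not yet reported upward as done */
};

/* Hypothetical stand-in for the per-request tracking structure. */
struct sketch_tio {
	struct sketch_orig *orig;
	int error;                 /* first error seen on any clone bio */
};

/*
 * Analogue of the clone bio completion hook: on success, advance the
 * original request by the size of the bio; on error, only record the
 * error so the submitter does not see it yet.
 */
void sketch_end_clone_bio(struct sketch_tio *tio, unsigned int nr_bytes,
			  int error)
{
	if (tio->error)
		return;            /* an earlier bio already failed */
	if (error) {
		tio->error = error;
		return;
	}
	assert(nr_bytes <= tio->orig->bytes_left);
	tio->orig->bytes_left -= nr_bytes;
}

/*
 * Analogue of the clone request completion: the decision about the
 * original request is taken once, after all clone bios are done.
 */
enum sketch_result sketch_end_clone_request(const struct sketch_tio *tio)
{
	return tio->error ? SKETCH_REQUEUED : SKETCH_COMPLETED;
}
```

In this model a request whose second bio fails keeps the bytes of that bio unreported, so the original request can be requeued and remapped without the submitter ever observing the error.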


    suspend
    =======
    Request-based dm suspends the md by stopping md->queue.
    For noflush suspend, it just stops md->queue.

    For flush suspend, it inserts a marker request at the tail of md->queue
    and dispatches all requests in md->queue until the marker comes to
    the front of md->queue. Then it stops dispatching requests and waits
    for all dispatched requests to complete.
    After that, it completes the marker request, stops md->queue and
    wakes up the waiter on the suspend queue, md->wait.
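
The flush suspend sequence above can be illustrated with a small user-space model. Everything here is an assumption made for illustration: the sketch_* names are hypothetical, the queue is a fixed array of integers, and the marker is the value -1 rather than a real request.

```c
#include <assert.h>

#define SKETCH_MARKER (-1)   /* hypothetical flush suspend marker */
#define SKETCH_QLEN   8

/* Hypothetical stand-in for the mapped device and its queue. */
struct sketch_md {
	int queue[SKETCH_QLEN];
	int head;        /* index of the next request to dispatch */
	int tail;        /* index of the next free slot */
	int in_flight;   /* dispatched but not yet completed */
	int stopped;     /* set once the queue has been stopped */
};

/* Append a request (or the marker) at the tail of the queue. */
void sketch_insert(struct sketch_md *md, int rq)
{
	assert(md->tail < SKETCH_QLEN);
	md->queue[md->tail++] = rq;
}

/*
 * Dispatch requests until the marker reaches the front. The marker is
 * deferred while anything is in flight; once the device is quiet, the
 * marker is consumed and the queue is stopped.
 * Returns the number of requests dispatched in this run.
 */
int sketch_run_queue(struct sketch_md *md)
{
	int dispatched = 0;

	while (!md->stopped && md->head < md->tail) {
		if (md->queue[md->head] == SKETCH_MARKER) {
			if (md->in_flight)
				break;        /* wait for completions */
			md->head++;           /* complete the marker */
			md->stopped = 1;      /* suspend is now effective */
			break;
		}
		md->head++;
		md->in_flight++;
		dispatched++;
	}
	return dispatched;
}

/* Completion of one dispatched request. */
void sketch_complete_one(struct sketch_md *md)
{
	assert(md->in_flight > 0);
	md->in_flight--;
}
```

Requests inserted after the marker stay queued across the suspend, which matches the idea that deferred I/O is simply kept in md->queue until resume.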


    resume
    ======
    Starts md->queue.


    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: Alasdair G Kergon
    ---
    drivers/md/dm-table.c | 14
    drivers/md/dm.c | 709 +++++++++++++++++++++++++++++++++++++++++++++++++-
    drivers/md/dm.h | 10
    3 files changed, 726 insertions(+), 7 deletions(-)

    Index: 2.6.27-rc6/drivers/md/dm.c
    ===================================================================
    --- 2.6.27-rc6.orig/drivers/md/dm.c
    +++ 2.6.27-rc6/drivers/md/dm.c
    @@ -86,6 +86,14 @@ union map_info *dm_get_mapinfo(struct bi
    return NULL;
    }

    +union map_info *dm_get_rq_mapinfo(struct request *rq)
    +{
    + if (rq && rq->end_io_data)
    + return &((struct dm_rq_target_io *)rq->end_io_data)->info;
    + return NULL;
    +}
    +EXPORT_SYMBOL_GPL(dm_get_rq_mapinfo);
    +
    #define MINOR_ALLOCED ((void *)-1)

    /*
    @@ -169,6 +177,12 @@ struct mapped_device {

    /* forced geometry settings */
    struct hd_geometry geometry;
    +
    + /* marker of flush suspend for request-based dm */
    + struct request suspend_rq;
    +
    + /* For saving the address of __make_request for request based dm */
    + make_request_fn *saved_make_request_fn;
    };

    #define MIN_IOS 256
    @@ -416,6 +430,28 @@ static void free_tio(struct mapped_devic
    mempool_free(tio, md->tio_pool);
    }

    +static inline struct dm_rq_target_io *alloc_rq_tio(struct mapped_device *md)
    +{
    + return mempool_alloc(md->tio_pool, GFP_ATOMIC);
    +}
    +
    +static inline void free_rq_tio(struct mapped_device *md,
    + struct dm_rq_target_io *tio)
    +{
    + mempool_free(tio, md->tio_pool);
    +}
    +
    +static inline struct dm_clone_bio_info *alloc_bio_info(struct mapped_device *md)
    +{
    + return mempool_alloc(md->io_pool, GFP_ATOMIC);
    +}
    +
    +static inline void free_bio_info(struct mapped_device *md,
    + struct dm_clone_bio_info *info)
    +{
    + mempool_free(info, md->io_pool);
    +}
    +
    static void start_io_acct(struct dm_io *io)
    {
    struct mapped_device *md = io->md;
    @@ -604,6 +640,266 @@ static void clone_endio(struct bio *bio,
    free_tio(md, tio);
    }

    +/*
    + * Partial completion handling for request-based dm
    + */
    +static void end_clone_bio(struct bio *clone, int error)
    +{
    + struct dm_clone_bio_info *info = clone->bi_private;
    + struct dm_rq_target_io *tio = info->rq->end_io_data;
    + struct bio *bio = info->orig;
    + unsigned int nr_bytes = info->orig->bi_size;
    +
    + free_bio_info(tio->md, info);
    + clone->bi_private = tio->md->bs;
    + bio_put(clone);
    +
    + if (tio->error) {
    + /*
    + * An error has already been detected on the request.
    + * Once error occurred, just let clone->end_io() handle
    + * the remainder.
    + */
    + return;
    + } else if (error) {
    + /*
    + * Don't notice the error to the upper layer yet.
    + * The error handling decision is made by the target driver,
    + * when the request is completed.
    + */
    + tio->error = error;
    + return;
    + }
    +
    + /*
    + * I/O for the bio successfully completed.
    + * Notice the data completion to the upper layer.
    + */
    +
    + /*
    + * bios are processed from the head of the list.
    + * So the completing bio should always be rq->bio.
    + * If it's not, something wrong is happening.
    + */
    + if (tio->orig->bio != bio)
    + DMERR("bio completion is going in the middle of the request");
    +
    + /*
    + * Update the original request.
    + * Do not use blk_end_request() here, because it may complete
    + * the original request before the clone, and break the ordering.
    + */
    + blk_update_request(tio->orig, 0, nr_bytes);
    +}
    +
    +static void free_bio_clone(struct request *clone)
    +{
    + struct dm_rq_target_io *tio = clone->end_io_data;
    + struct mapped_device *md = tio->md;
    + struct bio *bio;
    + struct dm_clone_bio_info *info;
    +
    + while ((bio = clone->bio) != NULL) {
    + clone->bio = bio->bi_next;
    +
    + info = bio->bi_private;
    + free_bio_info(md, info);
    +
    + bio->bi_private = md->bs;
    + bio_put(bio);
    + }
    +}
    +
    +static void dec_rq_pending(struct dm_rq_target_io *tio)
    +{
    + if (!atomic_dec_return(&tio->md->pending))
    + /* nudge anyone waiting on suspend queue */
    + wake_up(&tio->md->wait);
    +}
    +
    +static void dm_unprep_request(struct request *rq)
    +{
    + struct request *clone = rq->special;
    + struct dm_rq_target_io *tio = clone->end_io_data;
    +
    + rq->special = NULL;
    + rq->cmd_flags &= ~REQ_DONTPREP;
    +
    + free_bio_clone(clone);
    + dec_rq_pending(tio);
    + free_rq_tio(tio->md, tio);
    +}
    +
    +/*
    + * Requeue the original request of a clone.
    + */
    +void dm_requeue_request(struct request *clone)
    +{
    + struct dm_rq_target_io *tio = clone->end_io_data;
    + struct request *rq = tio->orig;
    + struct request_queue *q = rq->q;
    + unsigned long flags;
    +
    + dm_unprep_request(rq);
    +
    + spin_lock_irqsave(q->queue_lock, flags);
    + if (elv_queue_empty(q))
    + blk_plug_device(q);
    + blk_requeue_request(q, rq);
    + spin_unlock_irqrestore(q->queue_lock, flags);
    +}
    +EXPORT_SYMBOL_GPL(dm_requeue_request);
    +
    +static inline void __stop_queue(struct request_queue *q)
    +{
    + blk_stop_queue(q);
    +}
    +
    +static void stop_queue(struct request_queue *q)
    +{
    + unsigned long flags;
    +
    + spin_lock_irqsave(q->queue_lock, flags);
    + __stop_queue(q);
    + spin_unlock_irqrestore(q->queue_lock, flags);
    +}
    +
    +static inline void __start_queue(struct request_queue *q)
    +{
    + if (blk_queue_stopped(q))
    + blk_start_queue(q);
    +}
    +
    +static void start_queue(struct request_queue *q)
    +{
    + unsigned long flags;
    +
    + spin_lock_irqsave(q->queue_lock, flags);
    + __start_queue(q);
    + spin_unlock_irqrestore(q->queue_lock, flags);
    +}
    +
    +/*
    + * Complete the clone and the original request
    + */
    +static void dm_end_request(struct request *clone, int error)
    +{
    + struct dm_rq_target_io *tio = clone->end_io_data;
    + struct request *rq = tio->orig;
    + struct request_queue *q = rq->q;
    + unsigned int nr_bytes = blk_rq_bytes(rq);
    +
    + if (blk_pc_request(rq)) {
    + rq->errors = clone->errors;
    + rq->data_len = clone->data_len;
    +
    + if (rq->sense)
    + /*
    + * We are using the sense buffer of the original
    + * request.
    + * So setting the length of the sense data is enough.
    + */
    + rq->sense_len = clone->sense_len;
    + }
    +
    + free_bio_clone(clone);
    + dec_rq_pending(tio);
    + free_rq_tio(tio->md, tio);
    +
    + if (unlikely(blk_end_request(rq, error, nr_bytes)))
    + BUG();
    +
    + blk_run_queue(q);
    +}
    +
    +/*
    + * Request completion handler for request-based dm
    + */
    +static void dm_softirq_done(struct request *rq)
    +{
    + struct request *clone = rq->completion_data;
    + struct dm_rq_target_io *tio = clone->end_io_data;
    + dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;
    + int error = tio->error;
    + int r;
    +
    + if (rq->cmd_flags & REQ_FAILED)
    + goto end_request;
    +
    + if (rq_end_io) {
    + r = rq_end_io(tio->ti, clone, error, &tio->info);
    + if (r <= 0)
    + /* The target wants to complete the I/O */
    + error = r;
    + else if (r == DM_ENDIO_INCOMPLETE)
    + /* The target will handle the I/O */
    + return;
    + else if (r == DM_ENDIO_REQUEUE) {
    + /*
    + * The target wants to requeue the I/O.
    + * Don't invoke blk_run_queue() so that the requeued
    + * request won't be dispatched again soon.
    + */
    + dm_requeue_request(clone);
    + return;
    + } else {
    + DMWARN("unimplemented target endio return value: %d",
    + r);
    + BUG();
    + }
    + }
    +
    +end_request:
    + dm_end_request(clone, error);
    +}
    +
    +/*
    + * Called with the queue lock held
    + */
    +static void end_clone_request(struct request *clone, int error)
    +{
    + struct dm_rq_target_io *tio = clone->end_io_data;
    + struct request *rq = tio->orig;
    +
    + /*
    + * For just cleaning up the information of the queue in which
    + * the clone was dispatched.
    + * The clone is *NOT* freed actually here because it is alloced from
    + * dm own mempool and REQ_ALLOCED isn't set in clone->cmd_flags.
    + */
    + __blk_put_request(clone->q, clone);
    +
    + /*
    + * Actual request completion is done in a softirq context which doesn't
    + * hold the queue lock. Otherwise, deadlock could occur because:
    + * - another request may be submitted by the upper level driver
    + * of the stacking during the completion
    + * - the submission which requires queue lock may be done
    + * against this queue
    + */
    + tio->error = error;
    + rq->completion_data = clone;
    + blk_complete_request(rq);
    +}
    +
    +/*
    + * Complete the original request of a clone with an error status.
    + * Target's rq_end_io() function isn't called.
    + * This may be used by target's map_rq() function when the mapping fails.
    + */
    +void dm_kill_request(struct request *clone, int error)
    +{
    + struct dm_rq_target_io *tio = clone->end_io_data;
    + struct request *rq = tio->orig;
    +
    + tio->error = error;
    + /* Avoid printing "I/O error" message, since we didn't actually do I/O */
    + rq->cmd_flags |= (REQ_FAILED | REQ_QUIET);
    + rq->completion_data = clone;
    + blk_complete_request(rq);
    +}
    +EXPORT_SYMBOL_GPL(dm_kill_request);
    +
    static sector_t max_io_len(struct mapped_device *md,
    sector_t sector, struct dm_target *ti)
    {
    @@ -918,7 +1214,7 @@ static int dm_merge_bvec(struct request_
    * The request function that just remaps the bio built up by
    * dm_merge_bvec.
    */
    -static int dm_request(struct request_queue *q, struct bio *bio)
    +static int _dm_request(struct request_queue *q, struct bio *bio)
    {
    int r = -EIO;
    int rw = bio_data_dir(bio);
    @@ -968,18 +1264,303 @@ out_req:
    return 0;
    }

    +static int dm_make_request(struct request_queue *q, struct bio *bio)
    +{
    + struct mapped_device *md = (struct mapped_device *)q->queuedata;
    +
    + if (unlikely(bio_barrier(bio))) {
    + bio_endio(bio, -EOPNOTSUPP);
    + return 0;
    + }
    +
    + if (unlikely(!md->map)) {
    + bio_endio(bio, -EIO);
    + return 0;
    + }
    +
    + return md->saved_make_request_fn(q, bio); /* call __make_request() */
    +}
    +
    +static inline int dm_request_based(struct mapped_device *md)
    +{
    + return blk_queue_stackable(md->queue);
    +}
    +
    +static int dm_request(struct request_queue *q, struct bio *bio)
    +{
    + struct mapped_device *md = q->queuedata;
    +
    + if (dm_request_based(md))
    + return dm_make_request(q, bio);
    +
    + return _dm_request(q, bio);
    +}
    +
    +void dm_dispatch_request(struct request *rq)
    +{
    + int r;
    +
    + rq->start_time = jiffies;
    + r = blk_submit_request(rq->q, rq);
    + if (r)
    + dm_kill_request(rq, r);
    +}
    +EXPORT_SYMBOL_GPL(dm_dispatch_request);
    +
    +static void copy_request_info(struct request *clone, struct request *rq)
    +{
    + clone->cmd_flags = (rq_data_dir(rq) | REQ_NOMERGE);
    + clone->cmd_type = rq->cmd_type;
    + clone->sector = rq->sector;
    + clone->hard_sector = rq->hard_sector;
    + clone->nr_sectors = rq->nr_sectors;
    + clone->hard_nr_sectors = rq->hard_nr_sectors;
    + clone->current_nr_sectors = rq->current_nr_sectors;
    + clone->hard_cur_sectors = rq->hard_cur_sectors;
    + clone->nr_phys_segments = rq->nr_phys_segments;
    + clone->nr_hw_segments = rq->nr_hw_segments;
    + clone->ioprio = rq->ioprio;
    + clone->buffer = rq->buffer;
    + clone->cmd_len = rq->cmd_len;
    + if (rq->cmd_len)
    + clone->cmd = rq->cmd;
    + clone->data_len = rq->data_len;
    + clone->extra_len = rq->extra_len;
    + clone->sense_len = rq->sense_len;
    + clone->data = rq->data;
    + clone->sense = rq->sense;
    +}
    +
    +static int clone_request_bios(struct request *clone, struct request *rq,
    + struct mapped_device *md)
    +{
    + struct bio *bio, *clone_bio;
    + struct dm_clone_bio_info *info;
    +
    + for (bio = rq->bio; bio; bio = bio->bi_next) {
    + info = alloc_bio_info(md);
    + if (!info)
    + goto free_and_out;
    +
    + clone_bio = bio_alloc_bioset(GFP_ATOMIC, bio->bi_max_vecs,
    + md->bs);
    + if (!clone_bio) {
    + free_bio_info(md, info);
    + goto free_and_out;
    + }
    +
    + __bio_clone(clone_bio, bio);
    + clone_bio->bi_destructor = dm_bio_destructor;
    + clone_bio->bi_end_io = end_clone_bio;
    + info->rq = clone;
    + info->orig = bio;
    + clone_bio->bi_private = info;
    +
    + if (clone->bio) {
    + clone->biotail->bi_next = clone_bio;
    + clone->biotail = clone_bio;
    + } else
    + clone->bio = clone->biotail = clone_bio;
    + }
    +
    + return 0;
    +
    +free_and_out:
    + free_bio_clone(clone);
    +
    + return -ENOMEM;
    +}
    +
    +static int setup_clone(struct request *clone, struct request *rq,
    + struct dm_rq_target_io *tio)
    +{
    + int r;
    +
    + blk_rq_init(NULL, clone);
    +
    + r = clone_request_bios(clone, rq, tio->md);
    + if (r)
    + return r;
    +
    + copy_request_info(clone, rq);
    + clone->start_time = jiffies;
    + clone->end_io = end_clone_request;
    + clone->end_io_data = tio;
    +
    + return 0;
    +}
    +
    +static inline int dm_flush_suspending(struct mapped_device *md)
    +{
    + return !md->suspend_rq.data;
    +}
    +
    +/*
    + * Called with the queue lock held.
    + */
    +static int dm_prep_fn(struct request_queue *q, struct request *rq)
    +{
    + struct mapped_device *md = (struct mapped_device *)q->queuedata;
    + struct dm_rq_target_io *tio;
    + struct request *clone;
    +
    + if (unlikely(rq == &md->suspend_rq)) { /* Flush suspend marker */
    + if (dm_flush_suspending(md)) {
    + if (q->in_flight)
    + return BLKPREP_DEFER;
    + else {
    + /* This device should be quiet now */
    + __stop_queue(q);
    + smp_mb();
    + BUG_ON(atomic_read(&md->pending));
    + wake_up(&md->wait);
    + return BLKPREP_KILL;
    + }
    + } else
    + /*
    + * The suspend process was interrupted.
    + * So no need to suspend now.
    + */
    + return BLKPREP_KILL;
    + }
    +
    + if (unlikely(rq->special)) {
    + DMWARN("Already has something in rq->special.");
    + return BLKPREP_KILL;
    + }
    +
    + if (unlikely(!dm_request_based(md))) {
    + DMWARN("Request was queued into bio-based device");
    + return BLKPREP_KILL;
    + }
    +
    + tio = alloc_rq_tio(md); /* Only one for each original request */
    + if (!tio)
    + /* -ENOMEM */
    + return BLKPREP_DEFER;
    +
    + tio->md = md;
    + tio->ti = NULL;
    + tio->orig = rq;
    + tio->error = 0;
    + memset(&tio->info, 0, sizeof(tio->info));
    +
    + clone = &tio->clone;
    + if (setup_clone(clone, rq, tio)) {
    + /* -ENOMEM */
    + free_rq_tio(md, tio);
    + return BLKPREP_DEFER;
    + }
    +
    + rq->special = clone;
    + rq->cmd_flags |= REQ_DONTPREP;
    +
    + return BLKPREP_OK;
    +}
    +
    +static void map_request(struct dm_target *ti, struct request *rq,
    + struct mapped_device *md)
    +{
    + int r;
    + struct request *clone = rq->special;
    + struct dm_rq_target_io *tio = clone->end_io_data;
    +
    + tio->ti = ti;
    + atomic_inc(&md->pending);
    + r = ti->type->map_rq(ti, clone, &tio->info);
    + switch (r) {
    + case DM_MAPIO_SUBMITTED:
    + /* The target has taken the I/O to submit by itself later */
    + break;
    + case DM_MAPIO_REMAPPED:
    + /* The target has remapped the I/O so dispatch it */
    + dm_dispatch_request(clone);
    + break;
    + case DM_MAPIO_REQUEUE:
    + /* The target wants to requeue the I/O */
    + dm_requeue_request(clone);
    + break;
    + default:
    + if (r > 0) {
    + DMWARN("unimplemented target map return value: %d", r);
    + BUG();
    + }
    +
    + /* The target wants to complete the I/O */
    + dm_kill_request(clone, r);
    + break;
    + }
    +}
    +
    +/*
    + * q->request_fn for request-based dm.
    + * Called with the queue lock held.
    + */
    +static void dm_request_fn(struct request_queue *q)
    +{
    + struct mapped_device *md = (struct mapped_device *)q->queuedata;
    + struct dm_table *map = dm_get_table(md);
    + struct dm_target *ti;
    + struct request *rq;
    +
    + /*
    + * The check for blk_queue_stopped() is needed here, because:
    + * - device suspend uses blk_stop_queue() and expects that
    + * no I/O will be dispatched any more after the queue stop
    + * - generic_unplug_device() doesn't call q->request_fn()
    + * when the queue is stopped, so no problem
    + * - but underlying device drivers may call q->request_fn()
    + * without the check through blk_run_queue()
    + */
    + while (!blk_queue_plugged(q) && !blk_queue_stopped(q)) {
    + rq = elv_next_request(q);
    + if (!rq)
    + goto plug_and_out;
    +
    + ti = dm_table_find_target(map, rq->sector);
    + if (ti->type->busy && ti->type->busy(ti))
    + goto plug_and_out;
    +
    + blkdev_dequeue_request(rq);
    + spin_unlock(q->queue_lock);
    + map_request(ti, rq, md);
    + spin_lock_irq(q->queue_lock);
    + }
    +
    + goto out;
    +
    +plug_and_out:
    + if (!elv_queue_empty(q))
    + /* Some requests still remain, retry later */
    + blk_plug_device(q);
    +
    +out:
    + dm_table_put(map);
    +
    + return;
    +}
    +
    +int dm_underlying_device_congested(struct request_queue *q)
    +{
    + return bdi_lld_congested(&q->backing_dev_info);
    +}
    +EXPORT_SYMBOL_GPL(dm_underlying_device_congested);
    +
    static void dm_unplug_all(struct request_queue *q)
    {
    struct mapped_device *md = q->queuedata;
    struct dm_table *map = dm_get_table(md);

    if (map) {
    + if (dm_request_based(md))
    + generic_unplug_device(q);
    +
    dm_table_unplug_all(map);
    dm_table_put(map);
    }
    }

    -static int dm_any_congested(void *congested_data, int bdi_bits)
    +static int dm_congested(void *congested_data, int bdi_bits)
    {
    int r;
    struct mapped_device *md = (struct mapped_device *) congested_data;
    @@ -987,7 +1568,16 @@ static int dm_any_congested(void *conges

    if (!map || test_bit(DMF_BLOCK_IO, &md->flags))
    r = bdi_bits;
    - else
    + else if (dm_request_based(md)) {
    + if (bdi_bits & (1 << BDI_lld_congested))
    + r = dm_table_any_busy_target(map);
    + else
    + /*
    + * Request-based dm cares about only own queue for
    + * the query about congestion status of request_queue
    + */
    + r = md->queue->backing_dev_info.state & bdi_bits;
    + } else
    r = dm_table_any_congested(map, bdi_bits);

    dm_table_put(map);
    @@ -1112,7 +1702,7 @@ static struct mapped_device *alloc_dev(i
    goto bad_queue;

    md->queue->queuedata = md;
    - md->queue->backing_dev_info.congested_fn = dm_any_congested;
    + md->queue->backing_dev_info.congested_fn = dm_congested;
    md->queue->backing_dev_info.congested_data = md;
    blk_queue_make_request(md->queue, dm_request);
    blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY);
    @@ -1378,7 +1968,11 @@ static int dm_wait_for_completion(struct
    set_current_state(TASK_INTERRUPTIBLE);

    smp_mb();
    - if (!atomic_read(&md->pending))
    + if (dm_request_based(md)) {
    + if (!atomic_read(&md->pending) &&
    + blk_queue_stopped(md->queue))
    + break;
    + } else if (!atomic_read(&md->pending))
    break;

    if (signal_pending(current)) {
    @@ -1480,6 +2074,88 @@ out:
    return r;
    }

    +static inline void dm_invalidate_flush_suspend(struct mapped_device *md)
    +{
    + md->suspend_rq.data = (void *)0x1;
    +}
    +
    +static void dm_abort_suspend(struct mapped_device *md, int noflush)
    +{
    + struct request_queue *q = md->queue;
    + unsigned long flags;
    +
    + /*
    + * For flush suspend, invalidation and queue restart must be protected
    + * by a single queue lock to prevent a race with dm_prep_fn().
    + */
    + spin_lock_irqsave(q->queue_lock, flags);
    + if (!noflush)
    + dm_invalidate_flush_suspend(md);
    + __start_queue(q);
    + spin_unlock_irqrestore(q->queue_lock, flags);
    +}
    +
    +/*
    + * Additional suspend work for request-based dm.
    + *
    + * In request-based dm, stopping request_queue prevents mapping.
    + * Even after stopping the request_queue, submitted requests from upper-layer
    + * can be inserted to the request_queue. So original (unmapped) requests are
    + * kept in the request_queue during suspension.
    + */
    +static void dm_start_suspend(struct mapped_device *md, int noflush)
    +{
    + struct request *rq = &md->suspend_rq;
    + struct request_queue *q = md->queue;
    + unsigned long flags;
    +
    + if (noflush) {
    + stop_queue(q);
    + return;
    + }
    +
    + /*
    + * For flush suspend, we need a marker to indicate the border line
    + * between flush needed I/Os and deferred I/Os, since all I/Os are
    + * queued in the request_queue during suspension.
    + *
    + * This marker must be inserted after setting DMF_BLOCK_IO,
    + * because dm_prep_fn() considers no DMF_BLOCK_IO to be
    + * a suspend interruption.
    + */
    + spin_lock_irqsave(q->queue_lock, flags);
    + if (unlikely(rq->ref_count)) {
    + /*
    + * This can happen when the previous suspend was interrupted,
    + * the inserted suspend_rq for the previous suspend has still
    + * been in the queue and this suspend has been invoked.
    + *
    + * We could re-insert the suspend_rq by deleting it from
    + * the queue forcibly using list_del_init(&rq->queuelist).
    + * But it would break the block-layer easily.
    + * So we don't re-insert the suspend_rq again in such a case.
    + * The suspend_rq should be already invalidated during
    + * the previous suspend interruption, so just wait for it
    + * to be completed.
    + *
    + * This suspend will never complete, so warn the user to
    + * interrupt this suspend and retry later.
    + */
    + BUG_ON(!rq->data);
    + spin_unlock_irqrestore(q->queue_lock, flags);
    +
    + DMWARN("Invalidating the previous suspend is still in"
    + " progress. This suspend will be never done."
    + " Please interrupt this suspend and retry later.");
    + return;
    + }
    + spin_unlock_irqrestore(q->queue_lock, flags);
    +
    + /* Now no user of the suspend_rq */
    + blk_rq_init(q, rq);
    + blk_insert_request(q, rq, 0, NULL);
    +}
    +
    /*
    * Functions to lock and unlock any filesystem running on the
    * device.
    @@ -1578,6 +2254,9 @@ int dm_suspend(struct mapped_device *md,
    add_wait_queue(&md->wait, &wait);
    up_write(&md->io_lock);

    + if (dm_request_based(md))
    + dm_start_suspend(md, noflush);
    +
    /* unplug */
    if (map)
    dm_table_unplug_all(map);
    @@ -1590,14 +2269,22 @@ int dm_suspend(struct mapped_device *md,
    down_write(&md->io_lock);
    remove_wait_queue(&md->wait, &wait);

    - if (noflush)
    - __merge_pushback_list(md);
    + if (noflush) {
    + if (dm_request_based(md))
    + /* All requeued requests are already in md->queue */
    + clear_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
    + else
    + __merge_pushback_list(md);
    + }
    up_write(&md->io_lock);

    /* were we interrupted ? */
    if (r < 0) {
    dm_queue_flush(md, DM_WQ_FLUSH_DEFERRED, NULL);

    + if (dm_request_based(md))
    + dm_abort_suspend(md, noflush);
    +
    unlock_fs(md);
    goto out; /* pushback list is already flushed, so skip flush */
    }
    @@ -1638,6 +2325,14 @@ int dm_resume(struct mapped_device *md)

    dm_queue_flush(md, DM_WQ_FLUSH_DEFERRED, NULL);

    + /*
    + * Flushing deferred I/Os must be done after targets are resumed
    + * so that mapping of targets can work correctly.
    + * Request-based dm is queueing the deferred I/Os in its request_queue.
    + */
    + if (dm_request_based(md))
    + start_queue(md->queue);
    +
    unlock_fs(md);

    if (md->suspended_bdev) {
    Index: 2.6.27-rc6/drivers/md/dm.h
    ===================================================================
    --- 2.6.27-rc6.orig/drivers/md/dm.h
    +++ 2.6.27-rc6/drivers/md/dm.h
    @@ -49,6 +49,7 @@ void dm_table_presuspend_targets(struct
    void dm_table_postsuspend_targets(struct dm_table *t);
    int dm_table_resume_targets(struct dm_table *t);
    int dm_table_any_congested(struct dm_table *t, int bdi_bits);
    +int dm_table_any_busy_target(struct dm_table *t);
    void dm_table_unplug_all(struct dm_table *t);

    /*
    @@ -67,6 +68,14 @@ int dm_target_iterate(void (*iter_func)(
    void *param), void *param);

    /*-----------------------------------------------------------------
    + * Helper for block layer and dm core operations
    + *---------------------------------------------------------------*/
    +void dm_dispatch_request(struct request *rq);
    +void dm_requeue_request(struct request *rq);
    +void dm_kill_request(struct request *rq, int error);
    +int dm_underlying_device_congested(struct request_queue *q);
    +
    +/*-----------------------------------------------------------------
    * Useful inlines.
    *---------------------------------------------------------------*/
    static inline int array_too_big(unsigned long fixed, unsigned long obj,
    @@ -95,6 +104,7 @@ void dm_stripe_exit(void);

    void *dm_vcalloc(unsigned long nmemb, unsigned long elem_size);
    union map_info *dm_get_mapinfo(struct bio *bio);
    +union map_info *dm_get_rq_mapinfo(struct request *rq);
    int dm_open_count(struct mapped_device *md);
    int dm_lock_for_deletion(struct mapped_device *md);

    Index: 2.6.27-rc6/drivers/md/dm-table.c
    ===================================================================
    --- 2.6.27-rc6.orig/drivers/md/dm-table.c
    +++ 2.6.27-rc6/drivers/md/dm-table.c
    @@ -949,6 +949,20 @@ int dm_table_any_congested(struct dm_tab
    return r;
    }

    +int dm_table_any_busy_target(struct dm_table *t)
    +{
    + int i;
    + struct dm_target *ti;
    +
    + for (i = 0; i < t->num_targets; i++) {
    + ti = t->targets + i;
    + if (ti->type->busy && ti->type->busy(ti))
    + return 1;
    + }
    +
    + return 0;
    +}
    +
    void dm_table_unplug_all(struct dm_table *t)
    {
    struct dm_dev *dd;
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/
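
    The busy check added to dm-table.c above can be illustrated outside the kernel. What follows is only a minimal userspace sketch: the struct and function names (target_sketch, table_sketch, mpath_like_busy and so on) are hypothetical stand-ins, not the dm API, and the "busy" rule (in-flight count reaching a limit) is an assumption made for illustration. It shows the same pattern as the patch: the busy hook is optional per target type, and the table is busy as soon as one target reports busy.

```c
#include <assert.h>
#include <stddef.h>

struct target_sketch;

/* Hypothetical stand-in for a target type with an optional busy hook. */
struct target_type_sketch {
	int (*busy)(struct target_sketch *ti);	/* may be NULL */
};

/* Hypothetical stand-in for a target; the counters are illustrative only. */
struct target_sketch {
	const struct target_type_sketch *type;
	int inflight;	/* requests currently outstanding */
	int limit;	/* assumed threshold at which the target is busy */
};

struct table_sketch {
	unsigned int num_targets;
	struct target_sketch *targets;
};

/* Example busy hook: busy once the in-flight count reaches the limit. */
int mpath_like_busy(struct target_sketch *ti)
{
	return ti->inflight >= ti->limit;
}

/* Returns 1 if any target that implements busy() reports busy, else 0. */
int table_any_busy_target(struct table_sketch *t)
{
	unsigned int i;

	assert(t != NULL);
	for (i = 0; i < t->num_targets; i++) {
		struct target_sketch *ti = t->targets + i;

		/* Targets without a busy hook are never treated as busy. */
		if (ti->type->busy && ti->type->busy(ti))
			return 1;
	}
	return 0;
}
```

    A caller would build a table of targets and poll table_any_busy_target() before dispatching; targets whose type leaves busy as NULL are skipped, which matches the NULL check in the patch.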

  14. [PATCH 02/13] block: add request submission interface

    This patch adds blk_submit_request(), a generic request submission
    interface for request stacking drivers.
    Request-based dm will use it to submit its clones to underlying
    devices.

    blk_rq_check_limits() is also added, because the lower queue may
    have stricter limits than the upper queue when multiple drivers
    are stacked at the request level.
    Besides its internal use in blk_submit_request(), the function
    will be used by request-based dm when the queue limits are
    modified (e.g. by replacing dm's table).


    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: Jens Axboe
    ---
    block/blk-core.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++
    include/linux/blkdev.h | 2 +
    2 files changed, 83 insertions(+)

    Index: 2.6.27-rc6/block/blk-core.c
    ===================================================================
    --- 2.6.27-rc6.orig/block/blk-core.c
    +++ 2.6.27-rc6/block/blk-core.c
    @@ -1517,6 +1517,87 @@ void submit_bio(int rw, struct bio *bio)
    EXPORT_SYMBOL(submit_bio);

    /**
    + * blk_rq_check_limits - Helper function to check a request for the queue limit
    + * @q: the queue
    + * @rq: the request being checked
    + *
    + * Description:
    + * @rq may have been made based on weaker limitations of upper-level queues
    + * in request stacking drivers, and it may violate the limitation of @q.
    + * Since the block layer and the underlying device driver trust @rq
    + * after it is inserted to @q, it should be checked against @q before
    + * the insertion using this generic function.
    + *
    + * This function should also be useful for request stacking drivers
    + * in the cases below, so export this function.
    + * Request stacking drivers like request-based dm may change the queue
    + * limits while requests are in the queue (e.g. dm's table swapping).
    + * Such request stacking drivers should check those requests against
    + * the new queue limits again when they dispatch those requests,
    + * even though the same checks were already done against the old
    + * queue limits when the requests were submitted.
    + */
    +int blk_rq_check_limits(struct request_queue *q, struct request *rq)
    +{
    +	if (rq->nr_sectors > q->max_sectors ||
    +	    rq->data_len > q->max_hw_sectors << 9) {
    +		printk(KERN_ERR "%s: over max size limit.\n", __func__);
    +		return -EIO;
    +	}
    +
    +	/*
    +	 * queue's settings related to segment counting like q->bounce_pfn
    +	 * may differ from that of other stacking queues.
    +	 * Recalculate it to check the request correctly on this queue's
    +	 * limitation.
    +	 */
    +	blk_recalc_rq_segments(rq);
    +	if (rq->nr_phys_segments > q->max_phys_segments ||
    +	    rq->nr_hw_segments > q->max_hw_segments) {
    +		printk(KERN_ERR "%s: over max segments limit.\n", __func__);
    +		return -EIO;
    +	}
    +
    +	return 0;
    +}
    +EXPORT_SYMBOL_GPL(blk_rq_check_limits);
    +
    +/**
    + * blk_submit_request - Helper for stacking drivers to submit a request
    + * @q: the queue to submit the request
    + * @rq: the request being queued
    + */
    +int blk_submit_request(struct request_queue *q, struct request *rq)
    +{
    +	unsigned long flags;
    +
    +	if (blk_rq_check_limits(q, rq))
    +		return -EIO;
    +
    +#ifdef CONFIG_FAIL_MAKE_REQUEST
    +	if (rq->rq_disk && rq->rq_disk->flags & GENHD_FL_FAIL &&
    +	    should_fail(&fail_make_request, blk_rq_bytes(rq)))
    +		return -EIO;
    +#endif
    +
    +	spin_lock_irqsave(q->queue_lock, flags);
    +
    +	/*
    +	 * Submitting request must be dequeued before calling this function
    +	 * because it will be linked to another request_queue
    +	 */
    +	BUG_ON(blk_queued_rq(rq));
    +
    +	drive_stat_acct(rq, 1);
    +	__elv_add_request(q, rq, ELEVATOR_INSERT_BACK, 0);
    +
    +	spin_unlock_irqrestore(q->queue_lock, flags);
    +
    +	return 0;
    +}
    +EXPORT_SYMBOL_GPL(blk_submit_request);
    +
    +/**
    * __end_that_request_first - end I/O on a request
    * @req: the request being processed
    * @error: 0 for success, < 0 for error
    Index: 2.6.27-rc6/include/linux/blkdev.h
    ===================================================================
    --- 2.6.27-rc6.orig/include/linux/blkdev.h
    +++ 2.6.27-rc6/include/linux/blkdev.h
    @@ -664,6 +664,8 @@ extern void __blk_put_request(struct req
    extern struct request *blk_get_request(struct request_queue *, int, gfp_t);
    extern void blk_insert_request(struct request_queue *, struct request *, int, void *);
    extern void blk_requeue_request(struct request_queue *, struct request *);
    +extern int blk_rq_check_limits(struct request_queue *q, struct request *rq);
    +extern int blk_submit_request(struct request_queue *q, struct request *rq);
    extern void blk_plug_device(struct request_queue *);
    extern void blk_plug_device_unlocked(struct request_queue *);
    extern int blk_remove_plug(struct request_queue *);
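
    The reason for re-checking limits on the lower queue can be shown with a small userspace model. This is only a sketch under stated assumptions: queue_limits_sketch and request_sketch are hypothetical, simplified stand-ins for the fields the patch reads, and the segment recalculation done by blk_recalc_rq_segments() is left out, so the segment counts are taken as given. The point it illustrates is that a request built against a permissive upper queue can exceed the limits of the queue it is finally submitted to.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the limit fields of a queue. */
struct queue_limits_sketch {
	unsigned int max_sectors;
	unsigned int max_hw_sectors;
	unsigned short max_phys_segments;
	unsigned short max_hw_segments;
};

/* Hypothetical, simplified stand-in for the size fields of a request. */
struct request_sketch {
	unsigned int nr_sectors;	/* size in 512-byte sectors */
	unsigned int data_len;		/* size in bytes */
	unsigned short nr_phys_segments;
	unsigned short nr_hw_segments;
};

/* Returns 0 if rq fits within q's limits, -EIO otherwise. */
int rq_check_limits(const struct queue_limits_sketch *q,
		    const struct request_sketch *rq)
{
	assert(q != NULL && rq != NULL);

	/* Size check: sectors against max_sectors, bytes against max_hw_sectors. */
	if (rq->nr_sectors > q->max_sectors ||
	    rq->data_len > (unsigned long long)q->max_hw_sectors << 9)
		return -EIO;

	/* Segment check; a real queue would recount the segments first. */
	if (rq->nr_phys_segments > q->max_phys_segments ||
	    rq->nr_hw_segments > q->max_hw_segments)
		return -EIO;

	return 0;
}
```

    With an upper queue allowing 1024 sectors and a lower queue allowing 256, a 512-sector request passes the first check and fails the second, which is the situation the patch description has in mind for stacked request-based drivers.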

  15. Re: [PATCH 02/13] block: add request submission interface

    Kiyoshi Ueda wrote:
    > This patch adds blk_submit_request(), a generic request submission
    > interface for request stacking drivers.
    > Request-based dm will use it to submit their clones to underlying
    > devices.
    >
    > blk_rq_check_limits() is also added because it is possible that
    > the lower queue has stronger limitations than the upper queue
    > if multiple drivers are stacking at request-level.
    > Not only for blk_submit_request()'s internal use, the function
    > will be used by request-based dm when the queue limitation is
    > modified (e.g. by replacing dm's table).
    >
    >
    > Signed-off-by: Kiyoshi Ueda
    > Signed-off-by: Jun'ichi Nomura
    > Cc: Jens Axboe
    > ---
    > block/blk-core.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++
    > include/linux/blkdev.h | 2 +
    > 2 files changed, 83 insertions(+)
    >
    > Index: 2.6.27-rc6/block/blk-core.c
    > ===================================================================
    > --- 2.6.27-rc6.orig/block/blk-core.c
    > +++ 2.6.27-rc6/block/blk-core.c
    > @@ -1517,6 +1517,87 @@ void submit_bio(int rw, struct bio *bio)
    > EXPORT_SYMBOL(submit_bio);
    >
    > /**
    > + * blk_rq_check_limits - Helper function to check a request for the queue limit
    > + * @q: the queue
    > + * @rq: the request being checked
    > + *
    > + * Description:
    > + * @rq may have been made based on weaker limitations of upper-level queues
    > + * in request stacking drivers, and it may violate the limitation of @q.
    > + * Since the block layer and the underlying device driver trust @rq
    > + * after it is inserted to @q, it should be checked against @q before
    > + * the insertion using this generic function.
    > + *
    > + * This function should also be useful for request stacking drivers
    > + * in some cases below, so export this fuction.
    > + * Request stacking drivers like request-based dm may change the queue
    > + * limits while requests are in the queue (e.g. dm's table swapping).
    > + * Such request stacking drivers should check those requests agaist
    > + * the new queue limits again when they dispatch those requests,
    > + * although such checkings are also done against the old queue limits
    > + * when submitting requests.
    > + */
    > +int blk_rq_check_limits(struct request_queue *q, struct request *rq)
    > +{
    > + if (rq->nr_sectors > q->max_sectors ||
    > + rq->data_len > q->max_hw_sectors << 9) {
    > + printk(KERN_ERR "%s: over max size limit.\n", __func__);
    > + return -EIO;
    > + }
    > +
    > + /*
    > + * queue's settings related to segment counting like q->bounce_pfn
    > + * may differ from that of other stacking queues.
    > + * Recalculate it to check the request correctly on this queue's
    > + * limitation.
    > + */
    > + blk_recalc_rq_segments(rq);
    > + if (rq->nr_phys_segments > q->max_phys_segments ||
    > + rq->nr_hw_segments > q->max_hw_segments) {
    > + printk(KERN_ERR "%s: over max segments limit.\n", __func__);
    > + return -EIO;
    > + }
    > +
    > + return 0;
    > +}
    > +EXPORT_SYMBOL_GPL(blk_rq_check_limits);
    > +
    > +/**
    > + * blk_submit_request - Helper for stacking drivers to submit a request
    > + * @q: the queue to submit the request
    > + * @rq: the request being queued
    > + */
    > +int blk_submit_request(struct request_queue *q, struct request *rq)
    > +{
    > + unsigned long flags;
    > +
    > + if (blk_rq_check_limits(q, rq))
    > + return -EIO;
    > +
    > +#ifdef CONFIG_FAIL_MAKE_REQUEST
    > + if (rq->rq_disk && rq->rq_disk->flags & GENHD_FL_FAIL &&
    > + should_fail(&fail_make_request, blk_rq_bytes(rq)))
    > + return -EIO;
    > +#endif
    > +
    > + spin_lock_irqsave(q->queue_lock, flags);
    > +
    > + /*
    > + * Submitting request must be dequeued before calling this function
    > + * because it will be linked to another request_queue
    > + */
    > + BUG_ON(blk_queued_rq(rq));
    > +
    > + drive_stat_acct(rq, 1);
    > + __elv_add_request(q, rq, ELEVATOR_INSERT_BACK, 0);
    > +
    > + spin_unlock_irqrestore(q->queue_lock, flags);
    > +
    > + return 0;
    > +}
    > +EXPORT_SYMBOL_GPL(blk_submit_request);
    > +


    This looks awfully similar to blk_execute_rq_nowait() with an added
    blk_rq_check_limits(), minus the __generic_unplug_device() and
    q->request_fn(q) calls. Perhaps the common code could be refactored
    out? Also, isn't blk-exec.c a better file for this function?

    > +/**
    > * __end_that_request_first - end I/O on a request
    > * @req: the request being processed
    > * @error: 0 for success, < 0 for error
    > Index: 2.6.27-rc6/include/linux/blkdev.h
    > ===================================================================
    > --- 2.6.27-rc6.orig/include/linux/blkdev.h
    > +++ 2.6.27-rc6/include/linux/blkdev.h
    > @@ -664,6 +664,8 @@ extern void __blk_put_request(struct req
    > extern struct request *blk_get_request(struct request_queue *, int, gfp_t);
    > extern void blk_insert_request(struct request_queue *, struct request *, int, void *);
    > extern void blk_requeue_request(struct request_queue *, struct request *);
    > +extern int blk_rq_check_limits(struct request_queue *q, struct request *rq);
    > +extern int blk_submit_request(struct request_queue *q, struct request *rq);
    > extern void blk_plug_device(struct request_queue *);
    > extern void blk_plug_device_unlocked(struct request_queue *);
    > extern int blk_remove_plug(struct request_queue *);

  16. Re: [PATCH 00/13] request-based dm-multipath

    On Fri, Sep 12 2008, Kiyoshi Ueda wrote:
    > Hi Jens, James and Alasdair,
    >
    > This is a new version of request-based dm-multipath patches.
    > The patches are created on top of 2.6.27-rc6 + Alasdair's dm patches
    > for linux-next below:
    > dm-mpath-use-more-error-codes.patch
    > dm-mpath-remove-is_active-from-struct-dm_path.patch


    You have to base the block patches off the for-2.6.28 branch of the
    block git repo, otherwise I cannot merge the block bits.

    --
    Jens Axboe


  17. Re: [PATCH 02/13] block: add request submission interface

    Hi Boaz, Jens,

    On Sun, 14 Sep 2008 16:10:58 +0300, Boaz Harrosh wrote:
    >
    > This looks awfully similar to blk_execute_rq_nowait() with an added
    > blk_rq_check_limits(), minus the __generic_unplug_device() and
    > q->request_fn(q) calls. Perhaps the common code could be refactored
    > out?


    They might look similar, but they don't actually have much in common.
    I could refactor them as in the attached patch, but I'm not sure
    that it is the right approach or that it is cleaner than the current code.
    (E.g. blk_execute_rq_nowait() can't be called with irqs disabled,
    but blk_insert_request() and my blk_submit_request() can be.)

    So I'd leave them as they are unless you or others strongly prefer
    the attached patch...
    In any case, I would like to keep the refactoring as a separate patch,
    since it's not so straightforward.
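
    The irqs-disabled point above comes down to the difference between the _irq and _irqsave lock variants: spin_unlock_irq() always re-enables interrupts, while spin_unlock_irqrestore() puts back whatever state the caller had. The following is only a userspace simulation of that behaviour, not kernel code; sim_irqs_enabled, struct sim_lock and the sim_* functions are hypothetical names invented for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated CPU interrupt-enable flag (hypothetical, userspace only). */
bool sim_irqs_enabled = true;

struct sim_lock {
	bool held;
};

/* Models spin_lock_irq(): disables interrupts, forgets the old state. */
void sim_lock_irq(struct sim_lock *l)
{
	assert(!l->held);
	sim_irqs_enabled = false;
	l->held = true;
}

/* Models spin_unlock_irq(): unconditionally re-enables interrupts. */
void sim_unlock_irq(struct sim_lock *l)
{
	assert(l->held);
	l->held = false;
	sim_irqs_enabled = true;
}

/* Models spin_lock_irqsave(): returns the previous interrupt state. */
unsigned long sim_lock_irqsave(struct sim_lock *l)
{
	unsigned long flags = sim_irqs_enabled ? 1UL : 0UL;

	assert(!l->held);
	sim_irqs_enabled = false;
	l->held = true;
	return flags;
}

/* Models spin_unlock_irqrestore(): restores the saved interrupt state. */
void sim_unlock_irqrestore(struct sim_lock *l, unsigned long flags)
{
	assert(l->held);
	l->held = false;
	sim_irqs_enabled = (flags != 0);
}
```

    If a caller enters with interrupts already disabled, the irqsave pair leaves them disabled on return, whereas the irq pair returns with them enabled. That is why a shared helper built on spin_lock_irqsave() would be safe for all three callers, but it would also change the locking used by blk_execute_rq_nowait().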


    > Also, isn't blk-exec.c a better file for this function?


    blk_insert_request() is in blk-core.c and is similar to
    blk_submit_request(), so I added the new function to blk-core.c.
    But maybe both should be in blk-exec.c.
    I have no objection to that; I'd like to hear Jens' opinion.

    Thanks,
    Kiyoshi Ueda

    ---
    block/blk-core.c | 20 +++----------------
    block/blk-exec.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++-------
    2 files changed, 54 insertions(+), 23 deletions(-)

    Index: linux-2.6-block/block/blk-core.c
    ===================================================================
    --- linux-2.6-block.orig/block/blk-core.c
    +++ linux-2.6-block/block/blk-core.c
    @@ -881,7 +881,7 @@ EXPORT_SYMBOL(blk_get_request);
    */
    void blk_start_queueing(struct request_queue *q)
    {
    - if (!blk_queue_plugged(q))
    + if (!blk_queue_plugged(q) && !blk_queue_stopped(q))
    q->request_fn(q);
    else
    __generic_unplug_device(q);
    @@ -930,11 +930,10 @@ EXPORT_SYMBOL(blk_requeue_request);
    * of the queue for things like a QUEUE_FULL message from a device, or a
    * host that is unable to accept a particular command.
    */
    -void blk_insert_request(struct request_queue *q, struct request *rq,
    - int at_head, void *data)
    +void blk_insert_special_request(struct request_queue *q, struct request *rq,
    + int at_head, void *data)
    {
    int where = at_head ? ELEVATOR_INSERT_FRONT : ELEVATOR_INSERT_BACK;
    - unsigned long flags;

    /*
    * tell I/O scheduler that this isn't a regular read/write (ie it
    @@ -946,18 +945,7 @@ void blk_insert_request(struct request_q

    rq->special = data;

    - spin_lock_irqsave(q->queue_lock, flags);
    -
    - /*
    - * If command is tagged, release the tag
    - */
    - if (blk_rq_tagged(rq))
    - blk_queue_end_tag(q, rq);
    -
    - drive_stat_acct(rq, 1);
    - __elv_add_request(q, rq, where, 0);
    - blk_start_queueing(q);
    - spin_unlock_irqrestore(q->queue_lock, flags);
    + blk_insert_request(q, rq, where, 1);
    }
    EXPORT_SYMBOL(blk_insert_request);

    Index: linux-2.6-block/block/blk-exec.c
    ===================================================================
    --- linux-2.6-block.orig/block/blk-exec.c
    +++ linux-2.6-block/block/blk-exec.c
    @@ -33,6 +33,46 @@ static void blk_end_sync_rq(struct reque
    }

    /**
    + * blk_insert_request - Helper function for inserting a request
    + * @q: request queue where request should be inserted
    + * @rq: request to be inserted
    + * @where: where to insert the request
    + * @run_queue: run the queue or not
    + */
    +static void blk_insert_request(struct request_queue *q, struct request *rq,
    +			       int where, int run_queue)
    +{
    +	unsigned long flags;
    +
    +	spin_lock_irqsave(q->queue_lock, flags);
    +
    +	/*
    +	 * Submitting request must be dequeued before calling this function
    +	 * because it will be linked to another request_queue
    +	 */
    +	BUG_ON(blk_queued_rq(rq));
    +
    +	/*
    +	 * If command is tagged, release the tag
    +	 */
    +	if (blk_rq_tagged(rq))
    +		blk_queue_end_tag(q, rq);
    +
    +	drive_stat_acct(rq, 1);
    +	__elv_add_request(q, rq, where, 0);
    +
    +	if (run_queue) {
    +		blk_start_queueing(q);
    +
    +		/* the queue is stopped so it won't be plugged+unplugged */
    +		if (blk_pm_resume_request(rq))
    +			q->request_fn(q);
    +	}
    +
    +	spin_unlock_irqrestore(q->queue_lock, flags);
    +}
    +
    +/**
    * blk_execute_rq_nowait - insert a request into queue for execution
    * @q: queue to insert the request in
    * @bd_disk: matching gendisk
    @@ -54,13 +94,7 @@ void blk_execute_rq_nowait(struct reques
    rq->cmd_flags |= REQ_NOMERGE;
    rq->end_io = done;
    WARN_ON(irqs_disabled());
    - spin_lock_irq(q->queue_lock);
    - __elv_add_request(q, rq, where, 1);
    - __generic_unplug_device(q);
    - /* the queue is stopped so it won't be plugged+unplugged */
    - if (blk_pm_resume_request(rq))
    - q->request_fn(q);
    - spin_unlock_irq(q->queue_lock);
    + blk_insert_request(q, rq, where, 1);
    }
    EXPORT_SYMBOL_GPL(blk_execute_rq_nowait);

    @@ -104,3 +138,12 @@ int blk_execute_rq(struct request_queue
    return err;
    }
    EXPORT_SYMBOL(blk_execute_rq);
    +
    +int blk_insert_clone_request(struct request_queue *q, struct request *rq)
    +{
    +	if (blk_rq_check_limits(q, rq))
    +		return -EIO;
    +	blk_insert_request(q, rq, ELEVATOR_INSERT_BACK, 0);
    +	return 0;
    +}
    +EXPORT_SYMBOL_GPL(blk_insert_clone_request);
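
    The shape of the common helper in the patch above (one insert routine parameterised by position and by whether to run the queue) can be sketched in userspace as follows. Everything here is a hypothetical model: rq_sketch, queue_sketch and queue_insert are invented names, the queue is a plain singly linked list, and "running the queue" is reduced to a counter. It is meant only to show how the three callers differ in the two parameters they pass.

```c
#include <assert.h>
#include <stddef.h>

enum insert_where { INSERT_FRONT, INSERT_BACK };

/* Hypothetical request: an id plus list linkage. */
struct rq_sketch {
	int id;
	int queued;		/* nonzero while linked into a queue */
	struct rq_sketch *next;
};

/* Hypothetical queue: head/tail pointers and a count of queue runs. */
struct queue_sketch {
	struct rq_sketch *head;
	struct rq_sketch *tail;
	int runs;
};

/* Insert rq at the front or back of q; optionally "run" the queue. */
void queue_insert(struct queue_sketch *q, struct rq_sketch *rq,
		  enum insert_where where, int run_queue)
{
	/* Mirrors the BUG_ON in the patch: rq must not already be queued. */
	assert(!rq->queued);
	rq->queued = 1;
	rq->next = NULL;

	if (!q->head) {
		q->head = q->tail = rq;
	} else if (where == INSERT_FRONT) {
		rq->next = q->head;
		q->head = rq;
	} else {
		q->tail->next = rq;
		q->tail = rq;
	}

	if (run_queue)
		q->runs++;	/* stand-in for starting request processing */
}
```

    In this model a clone submission would call queue_insert(q, rq, INSERT_BACK, 0), while the special-request and execute paths would pass run_queue as 1 and choose the position themselves.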

  18. Re: [PATCH 02/13] block: add request submission interface

    On Tue, Sep 16 2008, Kiyoshi Ueda wrote:
    > Hi Boaz, Jens,
    >
    > On Sun, 14 Sep 2008 16:10:58 +0300, Boaz Harrosh wrote:
    > >
    > > This looks awfully similar to blk_execute_rq_nowait() with an added
    > > blk_rq_check_limits(), minus the __generic_unplug_device() and
    > > q->request_fn(q) calls. Perhaps the common code could be refactored
    > > out?

    >
    > They might look similar, but they don't actually have much in common.
    > I could refactor them as in the attached patch, but I'm not sure
    > that it is the right approach or that it is cleaner than the current code.
    > (E.g. blk_execute_rq_nowait() can't be called with irqs disabled,
    > but blk_insert_request() and my blk_submit_request() can be.)
    >
    > So I'd leave them as they are unless you or others strongly prefer
    > the attached patch...
    > In any case, I would like to keep the refactoring as a separate patch,
    > since it's not so straightforward.


    If it wasn't for the _irq vs _irqsave, I would apply it. But I agree,
    your current approach is fine.

    --
    Jens Axboe


  19. Re: [PATCH 02/13] block: add request submission interface

    Hi Jens,

    On Tue, 16 Sep 2008 19:02:20 +0200, Jens Axboe wrote:
    > On Tue, Sep 16 2008, Kiyoshi Ueda wrote:
    > > Hi Boaz, Jens,
    > >
    > > On Sun, 14 Sep 2008 16:10:58 +0300, Boaz Harrosh wrote:
    > > > Kiyoshi Ueda wrote:
    > > > > This patch adds blk_submit_request(), a generic request submission
    > > > > interface for request stacking drivers.
    > > > > Request-based dm will use it to submit their clones to underlying
    > > > > devices.
    > > > >
    > > > > blk_rq_check_limits() is also added because it is possible that
    > > > > the lower queue has stronger limitations than the upper queue
    > > > > if multiple drivers are stacking at request-level.
    > > > > Not only for blk_submit_request()'s internal use, the function
    > > > > will be used by request-based dm when the queue limitation is
    > > > > modified (e.g. by replacing dm's table).
    > > > >
    > > > >
    > > > > Signed-off-by: Kiyoshi Ueda
    > > > > Signed-off-by: Jun'ichi Nomura
    > > > > Cc: Jens Axboe
    > > > > ---
    > > > > block/blk-core.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++
    > > > > include/linux/blkdev.h | 2 +
    > > > > 2 files changed, 83 insertions(+)
    > > > >
    > > > > Index: 2.6.27-rc6/block/blk-core.c
    > > > > ===================================================================
    > > > > --- 2.6.27-rc6.orig/block/blk-core.c
    > > > > +++ 2.6.27-rc6/block/blk-core.c
    > > > > @@ -1517,6 +1517,87 @@ void submit_bio(int rw, struct bio *bio)
    > > > > EXPORT_SYMBOL(submit_bio);
    > > > >
    > > > > /**
    > > > > + * blk_rq_check_limits - Helper function to check a request for the queue limit
    > > > > + * @q: the queue
    > > > > + * @rq: the request being checked
    > > > > + *
    > > > > + * Description:
    > > > > + * @rq may have been made based on weaker limitations of upper-level queues
    > > > > + * in request stacking drivers, and it may violate the limitation of @q.
    > > > > + * Since the block layer and the underlying device driver trust @rq
    > > > > + * after it is inserted to @q, it should be checked against @q before
    > > > > + * the insertion using this generic function.
    > > > > + *
    > > > > + * This function should also be useful for request stacking drivers
    > > > > + * in some cases below, so export this fuction.
    > > > > + * Request stacking drivers like request-based dm may change the queue
    > > > > + * limits while requests are in the queue (e.g. dm's table swapping).
    > > > > + * Such request stacking drivers should check those requests agaist
    > > > > + * the new queue limits again when they dispatch those requests,
    > > > > + * although such checkings are also done against the old queue limits
    > > > > + * when submitting requests.
    > > > > + */
    > > > > +int blk_rq_check_limits(struct request_queue *q, struct request *rq)
    > > > > +{
    > > > > +	if (rq->nr_sectors > q->max_sectors ||
    > > > > +	    rq->data_len > q->max_hw_sectors << 9) {
    > > > > +		printk(KERN_ERR "%s: over max size limit.\n", __func__);
    > > > > +		return -EIO;
    > > > > +	}
    > > > > +
    > > > > +	/*
    > > > > +	 * queue's settings related to segment counting like q->bounce_pfn
    > > > > +	 * may differ from that of other stacking queues.
    > > > > +	 * Recalculate it to check the request correctly on this queue's
    > > > > +	 * limitation.
    > > > > +	 */
    > > > > +	blk_recalc_rq_segments(rq);
    > > > > +	if (rq->nr_phys_segments > q->max_phys_segments ||
    > > > > +	    rq->nr_hw_segments > q->max_hw_segments) {
    > > > > +		printk(KERN_ERR "%s: over max segments limit.\n", __func__);
    > > > > +		return -EIO;
    > > > > +	}
    > > > > +
    > > > > +	return 0;
    > > > > +}
    > > > > +EXPORT_SYMBOL_GPL(blk_rq_check_limits);
    > > > > +
    > > > > +/**
    > > > > + * blk_submit_request - Helper for stacking drivers to submit a request
    > > > > + * @q: the queue to submit the request
    > > > > + * @rq: the request being queued
    > > > > + */
    > > > > +int blk_submit_request(struct request_queue *q, struct request *rq)
    > > > > +{
    > > > > +	unsigned long flags;
    > > > > +
    > > > > +	if (blk_rq_check_limits(q, rq))
    > > > > +		return -EIO;
    > > > > +
    > > > > +#ifdef CONFIG_FAIL_MAKE_REQUEST
    > > > > +	if (rq->rq_disk && rq->rq_disk->flags & GENHD_FL_FAIL &&
    > > > > +	    should_fail(&fail_make_request, blk_rq_bytes(rq)))
    > > > > +		return -EIO;
    > > > > +#endif
    > > > > +
    > > > > +	spin_lock_irqsave(q->queue_lock, flags);
    > > > > +
    > > > > +	/*
    > > > > +	 * Submitting request must be dequeued before calling this function
    > > > > +	 * because it will be linked to another request_queue
    > > > > +	 */
    > > > > +	BUG_ON(blk_queued_rq(rq));
    > > > > +
    > > > > +	drive_stat_acct(rq, 1);
    > > > > +	__elv_add_request(q, rq, ELEVATOR_INSERT_BACK, 0);
    > > > > +
    > > > > +	spin_unlock_irqrestore(q->queue_lock, flags);
    > > > > +
    > > > > +	return 0;
    > > > > +}
    > > > > +EXPORT_SYMBOL_GPL(blk_submit_request);
    > > > > +
    > > >
    > > > This looks awfully similar to blk_execute_rq_nowait(), with an added
    > > > blk_rq_check_limits() and minus the __generic_unplug_device() and
    > > > q->request_fn(q) calls. Perhaps the common code could be refactored
    > > > out?
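
For readers following the thread, the limit check that blk_submit_request() performs before inserting the request can be modelled in user space roughly as below. This is only a sketch under stated assumptions: struct queue_limits_model, struct request_model and check_limits_model() are hypothetical stand-ins rather than kernel API, and the segment recalculation done by blk_recalc_rq_segments() in the patch is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for the request_queue fields used by the check. */
struct queue_limits_model {
	unsigned int max_sectors;
	unsigned int max_hw_sectors;
	unsigned short max_phys_segments;
	unsigned short max_hw_segments;
};

/* Hypothetical stand-in for the struct request fields used by the check. */
struct request_model {
	unsigned long nr_sectors;	/* size in 512-byte sectors */
	unsigned int data_len;		/* size in bytes */
	unsigned short nr_phys_segments;
	unsigned short nr_hw_segments;
};

/*
 * Same shape as blk_rq_check_limits() in the patch: return 0 if rq fits
 * within the limits of q, -EIO otherwise.  The widening cast is an
 * addition of this model to avoid overflow in the shift.
 */
int check_limits_model(const struct queue_limits_model *q,
		       const struct request_model *rq)
{
	assert(q && rq);

	if (rq->nr_sectors > q->max_sectors ||
	    rq->data_len > (unsigned long long)q->max_hw_sectors << 9) {
		fprintf(stderr, "%s: over max size limit.\n", __func__);
		return -EIO;
	}

	if (rq->nr_phys_segments > q->max_phys_segments ||
	    rq->nr_hw_segments > q->max_hw_segments) {
		fprintf(stderr, "%s: over max segments limit.\n", __func__);
		return -EIO;
	}

	return 0;
}
```

A request built against a larger upper-level queue (say 256 sectors) is rejected when the lower queue allows only 128, which is the stacking case the kernel-doc comment in the patch describes.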

    > >
    > > They might look similar but don't actually have much in common.
    > > I could refactor them like the attached patch, but I'm not sure
    > > it is the right approach or that it is cleaner than the current code.
    > > (e.g. blk_execute_rq_nowait() can't be called with irqs-disabled,
    > > but blk_insert_request() and my blk_submit_request() can be called
    > > with irqs-disabled.)
    > >
    > > So I'd leave them as they are unless you or others strongly prefer
    > > the attached patch...
    > > Anyway, I would like to leave the refactoring as a separate patch,
    > > since it's not so straightforward.

    >
    > If it wasn't for the _irq vs _irqsave, I would apply it. But I agree,
    > your current approach is fine.
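
The _irq versus _irqsave point can be illustrated with a small user-space model. This is a sketch, not kernel code: a single boolean stands in for the CPU's interrupt-enable state, no real locking is done, and every model_* name is hypothetical. It shows why an _irq-style critical section is unsafe for a caller that already has interrupts disabled, while an _irqsave-style one is not.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one flag for the local interrupt-enable state. */
static bool irqs_enabled = true;

/* Models spin_lock_irq()/spin_unlock_irq(): unlock always re-enables. */
void model_lock_irq(void)
{
	irqs_enabled = false;
}

void model_unlock_irq(void)
{
	assert(!irqs_enabled);	/* must be inside the critical section */
	irqs_enabled = true;
}

/* Models spin_lock_irqsave()/spin_unlock_irqrestore(): state is saved. */
void model_lock_irqsave(bool *flags)
{
	*flags = irqs_enabled;
	irqs_enabled = false;
}

void model_unlock_irqrestore(bool flags)
{
	assert(!irqs_enabled);	/* must be inside the critical section */
	irqs_enabled = flags;
}

/* Caller enters with interrupts off, then runs an _irq-style section. */
bool state_after_irq_section(void)
{
	irqs_enabled = false;		/* caller disabled interrupts */
	model_lock_irq();
	model_unlock_irq();
	return irqs_enabled;		/* interrupts were switched back on */
}

/* Same caller, but with an _irqsave-style section. */
bool state_after_irqsave_section(void)
{
	bool flags;

	irqs_enabled = false;		/* caller disabled interrupts */
	model_lock_irqsave(&flags);
	model_unlock_irqrestore(flags);
	return irqs_enabled;		/* caller's disabled state is kept */
}
```

In this model state_after_irq_section() leaves interrupts enabled behind the caller's back, while state_after_irqsave_section() leaves them disabled as the caller expects.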


    OK, I'll rebase my patches onto the for-2.6.28 branch of your block git
    tree and repost the block bits. I may need some time to confirm whether
    the differences between 2.6.27-rc6 and the block git tree affect my patches.
    (Hopefully, I can repost this week.)

    Thanks,
    Kiyoshi Ueda

  20. Re: [dm-devel] [PATCH 10/13] dm: add core functions for request-based dm

    On Fri, Sep 12, 2008 at 8:16 PM, Kiyoshi Ueda wrote:


    > +static int dm_make_request(struct request_queue *q, struct bio *bio)
    > +{
    > +	struct mapped_device *md = (struct mapped_device *)q->queuedata;
    > +
    > +	if (unlikely(bio_barrier(bio))) {
    > +		bio_endio(bio, -EOPNOTSUPP);
    > +		return 0;
    > +	}
    > +



    Why not add barrier support from the beginning, so that targets
    can be developed with barriers in mind? Or could we at least have the
    target return the error, instead of the core?

    Thanks
    Nikanth Karthikesan
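
The two placements of the barrier check discussed above can be contrasted with a small user-space model. This is a sketch under assumptions: struct bio_model, struct target_model and all function names are hypothetical, and the model returns an errno directly where the quoted hunk completes the bio with bio_endio(bio, -EOPNOTSUPP) and returns 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio: only the barrier bit is modelled. */
struct bio_model {
	bool barrier;
};

/* Hypothetical target: map() returns 0 on success or a negative errno. */
struct target_model {
	int (*map)(const struct bio_model *bio);
};

/* Behaviour of the quoted hunk: the core rejects every barrier itself. */
int core_rejects_barrier(const struct target_model *t,
			 const struct bio_model *bio)
{
	assert(t != NULL && t->map != NULL && bio != NULL);
	if (bio->barrier)
		return -EOPNOTSUPP;
	return t->map(bio);
}

/* Alternative raised in the reply: the core passes it on, target decides. */
int core_delegates_barrier(const struct target_model *t,
			   const struct bio_model *bio)
{
	assert(t != NULL && t->map != NULL && bio != NULL);
	return t->map(bio);
}

/* A target that cannot handle barriers reports the error itself. */
int map_no_barrier(const struct bio_model *bio)
{
	return bio->barrier ? -EOPNOTSUPP : 0;
}

/* A target that is assumed to handle barriers accepts everything. */
int map_with_barrier(const struct bio_model *bio)
{
	(void)bio;
	return 0;
}
```

With the check in the core, even a barrier-capable target never sees a barrier; with delegation, the same target accepts it while a target without support still returns -EOPNOTSUPP.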
