Re: [rfc] direct IO submission and completion scalability issues - Kernel


Thread: Re: [rfc] direct IO submission and completion scalability issues

  1. Re: [rfc] direct IO submission and completion scalability issues

    On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
    >
    > The second experiment we did was migrating the IO submission to the
    > IO completion cpu. Instead of submitting the IO on the same cpu where
    > the request arrived, in this experiment the IO submission gets migrated
    > to the cpu that is processing IO completions (interrupt). This will
    > minimize the access to remote cachelines (that happens in the timer,
    > slab, and scsi layers). The IO submission request is forwarded to the
    > kblockd thread on the cpu receiving the interrupts. As part of this, we
    > also made the kblockd thread on each cpu the highest priority thread,
    > so that IO gets submitted as soon as possible on the interrupt cpu
    > without any delay. On an x86_64 SMP platform with 16 cores, this
    > resulted in a 2% performance improvement, and 3.3% on a two-node ia64
    > platform.
    >
    > A quick and dirty prototype patch (not meant for inclusion) for this IO
    > migration experiment is appended to this e-mail.
    >
    > Observation #1 mentioned above is also applicable to this experiment.
    > CPUs processing interrupts will now have to handle the IO
    > submission/processing load as well.
    >
    > Observation #2: This introduces some migration overhead during IO
    > submission. With the current prototype, every incoming IO request
    > results in an IPI and a context switch (to the kblockd thread) on the
    > interrupt processing cpu. This issue needs to be addressed, and the
    > main challenge is an efficient mechanism for doing this IO migration
    > (how much batching to do, and when to send the migrate request?), so
    > that we don't delay the IO much and, at the same time, don't cause much
    > overhead during migration.


    Hi guys,

    Just had another way we might do this. Migrate the completions out to
    the submitting CPUs rather than migrate submission into the completing
    CPU.

    I've got a basic patch that passes some stress testing. It seems fairly
    simple to do at the block layer, and the bulk of the patch involves
    introducing a scalable smp_call_function for it.

    Now it could be optimised more by looking at batching up IPIs or
    optimising the call function path or even migrating the completion event
    at a different level...

    However, this is a first cut. It actually seems like it might be taking
    slightly more CPU to process block IO (~0.2%)... however, this is on my
    dual core system that shares an llc, which means that there are very few
    cache benefits to the migration, but non-zero overhead. So on multisocket
    systems hopefully it might get to positive territory.

    ---

    Index: linux-2.6/arch/x86/kernel/smp_64.c
    ===================================================================
    --- linux-2.6.orig/arch/x86/kernel/smp_64.c
    +++ linux-2.6/arch/x86/kernel/smp_64.c
    @@ -321,6 +321,99 @@ void unlock_ipi_call_lock(void)
    spin_unlock_irq(&call_lock);
    }

    +struct call_single_data {
    + struct list_head list;
    + void (*func) (void *info);
    + void *info;
    + int wait;
    +};
    +
    +struct call_single_queue {
    + spinlock_t lock;
    + struct list_head list;
    +};
    +static DEFINE_PER_CPU(struct call_single_queue, call_single_queue);
    +
    +int __cpuinit init_smp_call(void)
    +{
    + int i;
    +
    + for_each_cpu_mask(i, cpu_possible_map) {
    + spin_lock_init(&per_cpu(call_single_queue, i).lock);
    + INIT_LIST_HEAD(&per_cpu(call_single_queue, i).list);
    + }
    + return 0;
    +}
    +core_initcall(init_smp_call);
    +
    +/*
    + * This function sends a 'generic call function' IPI to a single
    + * target cpu.
    + */
    +int smp_call_function_fast(int cpu, void (*func)(void *), void *info,
    + int wait)
    +{
    + struct call_single_data *data;
    + struct call_single_queue *dst = &per_cpu(call_single_queue, cpu);
    + cpumask_t mask = cpumask_of_cpu(cpu);
    + int ipi;
    +
    + data = kmalloc(sizeof(struct call_single_data), GFP_ATOMIC);
    + data->func = func;
    + data->info = info;
    + data->wait = wait;
    +
    + spin_lock_irq(&dst->lock);
    + ipi = list_empty(&dst->list);
    + list_add_tail(&data->list, &dst->list);
    + spin_unlock_irq(&dst->lock);
    +
    + if (ipi)
    + send_IPI_mask(mask, CALL_FUNCTION_SINGLE_VECTOR);
    +
    + if (wait) {
    + /* Wait for response */
    + while (data->wait)
    + cpu_relax();
    + kfree(data);
    + }
    +
    + return 0;
    +}
    +
    +asmlinkage void smp_call_function_fast_interrupt(void)
    +{
    + struct call_single_queue *q;
    + unsigned long flags;
    + LIST_HEAD(list);
    +
    + ack_APIC_irq();
    +
    + q = &__get_cpu_var(call_single_queue);
    + spin_lock_irqsave(&q->lock, flags);
    + list_replace_init(&q->list, &list);
    + spin_unlock_irqrestore(&q->lock, flags);
    +
    + exit_idle();
    + irq_enter();
    + while (!list_empty(&list)) {
    + struct call_single_data *data;
    +
    + data = list_entry(list.next, struct call_single_data, list);
    + list_del(&data->list);
    +
    + data->func(data->info);
    + if (data->wait) {
    + smp_mb();
    + data->wait = 0;
    + } else {
    + kfree(data);
    + }
    + }
    + add_pda(irq_call_count, 1);
    + irq_exit();
    +}
    +
    /*
    * this function sends a 'generic call function' IPI to all other CPU
    * of the system defined in the mask.
    Index: linux-2.6/block/blk-core.c
    ===================================================================
    --- linux-2.6.orig/block/blk-core.c
    +++ linux-2.6/block/blk-core.c
    @@ -1604,6 +1604,13 @@ static int __end_that_request_first(stru
    return 1;
    }

    +static void blk_done_softirq_other(void *data)
    +{
    + struct request *rq = data;
    +
    + blk_complete_request(rq);
    +}
    +
    /*
    * splice the completion data to a local structure and hand off to
    * process_completion_queue() to complete the requests
    @@ -1622,7 +1629,15 @@ static void blk_done_softirq(struct soft

    rq = list_entry(local_list.next, struct request, donelist);
    list_del_init(&rq->donelist);
    - rq->q->softirq_done_fn(rq);
    + if (rq->submission_cpu != smp_processor_id()) {
    + /*
    + * Could batch up IPIs here, but we should measure how
    + * often blk_done_softirq gets a large batch...
    + */
    + smp_call_function_fast(rq->submission_cpu,
    + blk_done_softirq_other, rq, 0);
    + } else
    + rq->q->softirq_done_fn(rq);
    }
    }

    Index: linux-2.6/include/asm-x86/hw_irq_64.h
    ===================================================================
    --- linux-2.6.orig/include/asm-x86/hw_irq_64.h
    +++ linux-2.6/include/asm-x86/hw_irq_64.h
    @@ -68,8 +68,7 @@
    #define ERROR_APIC_VECTOR 0xfe
    #define RESCHEDULE_VECTOR 0xfd
    #define CALL_FUNCTION_VECTOR 0xfc
    -/* fb free - please don't readd KDB here because it's useless
    - (hint - think what a NMI bit does to a vector) */
    +#define CALL_FUNCTION_SINGLE_VECTOR 0xfb
    #define THERMAL_APIC_VECTOR 0xfa
    #define THRESHOLD_APIC_VECTOR 0xf9
    /* f8 free */
    @@ -102,6 +101,7 @@ void spurious_interrupt(void);
    void error_interrupt(void);
    void reschedule_interrupt(void);
    void call_function_interrupt(void);
    +void call_function_fast_interrupt(void);
    void irq_move_cleanup_interrupt(void);
    void invalidate_interrupt0(void);
    void invalidate_interrupt1(void);
    Index: linux-2.6/include/linux/smp.h
    ===================================================================
    --- linux-2.6.orig/include/linux/smp.h
    +++ linux-2.6/include/linux/smp.h
    @@ -53,6 +53,7 @@ extern void smp_cpus_done(unsigned int m
    * Call a function on all other processors
    */
    int smp_call_function(void(*func)(void *info), void *info, int retry, int wait);
    +int smp_call_function_fast(int cpuid, void(*func)(void *info), void *info, int wait);

    int smp_call_function_single(int cpuid, void (*func) (void *info), void *info,
    int retry, int wait);
    @@ -92,6 +93,11 @@ static inline int up_smp_call_function(v
    }
    #define smp_call_function(func, info, retry, wait) \
    (up_smp_call_function(func, info))
    +static inline int smp_call_function_fast(int cpuid, void(*func)(void *info), void *info, int wait)
    +{
    + return 0;
    +}
    +
    #define on_each_cpu(func,info,retry,wait) \
    ({ \
    local_irq_disable(); \
    Index: linux-2.6/block/elevator.c
    ===================================================================
    --- linux-2.6.orig/block/elevator.c
    +++ linux-2.6/block/elevator.c
    @@ -648,6 +648,8 @@ void elv_insert(struct request_queue *q,
    void __elv_add_request(struct request_queue *q, struct request *rq, int where,
    int plug)
    {
    + rq->submission_cpu = smp_processor_id();
    +
    if (q->ordcolor)
    rq->cmd_flags |= REQ_ORDERED_COLOR;

    Index: linux-2.6/include/linux/blkdev.h
    ===================================================================
    --- linux-2.6.orig/include/linux/blkdev.h
    +++ linux-2.6/include/linux/blkdev.h
    @@ -208,6 +208,8 @@ struct request {

    int ref_count;

    + int submission_cpu;
    +
    /*
    * when request is used as a packet command carrier
    */
    Index: linux-2.6/arch/x86/kernel/entry_64.S
    ===================================================================
    --- linux-2.6.orig/arch/x86/kernel/entry_64.S
    +++ linux-2.6/arch/x86/kernel/entry_64.S
    @@ -696,6 +696,9 @@ END(invalidate_interrupt\num)
    ENTRY(call_function_interrupt)
    apicinterrupt CALL_FUNCTION_VECTOR,smp_call_function_interrupt
    END(call_function_interrupt)
    +ENTRY(call_function_fast_interrupt)
    + apicinterrupt CALL_FUNCTION_SINGLE_VECTOR,smp_call_function_fast_interrupt
    +END(call_function_fast_interrupt)
    ENTRY(irq_move_cleanup_interrupt)
    apicinterrupt IRQ_MOVE_CLEANUP_VECTOR,smp_irq_move_cleanup_interrupt
    END(irq_move_cleanup_interrupt)
    Index: linux-2.6/arch/x86/kernel/i8259_64.c
    ===================================================================
    --- linux-2.6.orig/arch/x86/kernel/i8259_64.c
    +++ linux-2.6/arch/x86/kernel/i8259_64.c
    @@ -493,6 +493,7 @@ void __init native_init_IRQ(void)

    /* IPI for generic function call */
    set_intr_gate(CALL_FUNCTION_VECTOR, call_function_interrupt);
    + set_intr_gate(CALL_FUNCTION_SINGLE_VECTOR, call_function_fast_interrupt);

    /* Low priority IPI to cleanup after moving an irq */
    set_intr_gate(IRQ_MOVE_CLEANUP_VECTOR, irq_move_cleanup_interrupt);
    Index: linux-2.6/include/asm-x86/mach-default/entry_arch.h
    ===================================================================
    --- linux-2.6.orig/include/asm-x86/mach-default/entry_arch.h
    +++ linux-2.6/include/asm-x86/mach-default/entry_arch.h
    @@ -13,6 +13,7 @@
    BUILD_INTERRUPT(reschedule_interrupt,RESCHEDULE_VECTOR)
    BUILD_INTERRUPT(invalidate_interrupt,INVALIDATE_TLB_VECTOR)
    BUILD_INTERRUPT(call_function_interrupt,CALL_FUNCTION_VECTOR)
    +BUILD_INTERRUPT(call_function_fast_interrupt,CALL_FUNCTION_SINGLE_VECTOR)
    #endif

    /*
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: [rfc] direct IO submission and completion scalability issues

    Hi Nick,

    On Feb 3, 2008 11:52 AM, Nick Piggin wrote:
    > +asmlinkage void smp_call_function_fast_interrupt(void)
    > +{


    [snip]

    > + while (!list_empty(&list)) {
    > + struct call_single_data *data;
    > +
    > + data = list_entry(list.next, struct call_single_data, list);
    > + list_del(&data->list);
    > +
    > + data->func(data->info);
    > + if (data->wait) {
    > + smp_mb();
    > + data->wait = 0;


    Why do we need smp_mb() here (maybe add a comment to keep
    Andrew/checkpatch happy)?

    Pekka

  3. Re: [rfc] direct IO submission and completion scalability issues

    On Sun, Feb 03, 2008 at 12:53:02PM +0200, Pekka Enberg wrote:
    > Hi Nick,
    >
    > On Feb 3, 2008 11:52 AM, Nick Piggin wrote:
    > > +asmlinkage void smp_call_function_fast_interrupt(void)
    > > +{

    >
    > [snip]
    >
    > > + while (!list_empty(&list)) {
    > > + struct call_single_data *data;
    > > +
    > > + data = list_entry(list.next, struct call_single_data, list);
    > > + list_del(&data->list);
    > > +
    > > + data->func(data->info);
    > > + if (data->wait) {
    > > + smp_mb();
    > > + data->wait = 0;

    >
    > Why do we need smp_mb() here (maybe add a comment to keep
    > Andrew/checkpatch happy)?


    Yeah, definitely... it's just a really basic RFC, but I should get
    into the habit of just doing it anyway.
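    For the record, the barrier is needed so that everything data->func()
    wrote is visible to the spinning submitter before it observes wait == 0
    and proceeds (or frees the data). A userspace analogue of the same
    publish-then-clear-flag pattern, sketched with C11 atomics (the names
    here are illustrative, not from the patch):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int result;               /* stands in for the work done by data->func() */
static atomic_int wait_flag = 1; /* stands in for data->wait */

/* Completion side: do the work, then clear the flag. The release store
 * orders the write to `result` before the flag update, playing the role
 * of the smp_mb() before `data->wait = 0` in the patch. */
static void *completer(void *unused)
{
    (void)unused;
    result = 42;
    atomic_store_explicit(&wait_flag, 0, memory_order_release);
    return NULL;
}

/* Submission side: spin until the flag clears, as the patch does with
 * cpu_relax(); the acquire load pairs with the release store above. */
static int wait_for_completion(void)
{
    while (atomic_load_explicit(&wait_flag, memory_order_acquire))
        ;
    return result;
}
```

    Release/acquire is the minimal ordering the pattern needs; the patch's
    full smp_mb() is strictly stronger.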

    Thanks,
    Nick

  4. Re: [rfc] direct IO submission and completion scalability issues

    On Sun, Feb 03, 2008 at 10:52:52AM +0100, Nick Piggin wrote:
    > On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
    > >
    > > The second experiment we did was migrating the IO submission to the
    > > IO completion cpu. Instead of submitting the IO on the same cpu where
    > > the request arrived, in this experiment the IO submission gets migrated
    > > to the cpu that is processing IO completions (interrupt). This will
    > > minimize the access to remote cachelines (that happens in the timer,
    > > slab, and scsi layers). The IO submission request is forwarded to the
    > > kblockd thread on the cpu receiving the interrupts. As part of this, we
    > > also made the kblockd thread on each cpu the highest priority thread,
    > > so that IO gets submitted as soon as possible on the interrupt cpu
    > > without any delay. On an x86_64 SMP platform with 16 cores, this
    > > resulted in a 2% performance improvement, and 3.3% on a two-node ia64
    > > platform.
    > >
    > > A quick and dirty prototype patch (not meant for inclusion) for this IO
    > > migration experiment is appended to this e-mail.
    > >
    > > Observation #1 mentioned above is also applicable to this experiment.
    > > CPUs processing interrupts will now have to handle the IO
    > > submission/processing load as well.
    > >
    > > Observation #2: This introduces some migration overhead during IO
    > > submission. With the current prototype, every incoming IO request
    > > results in an IPI and a context switch (to the kblockd thread) on the
    > > interrupt processing cpu. This issue needs to be addressed, and the
    > > main challenge is an efficient mechanism for doing this IO migration
    > > (how much batching to do, and when to send the migrate request?), so
    > > that we don't delay the IO much and, at the same time, don't cause much
    > > overhead during migration.

    >
    > Hi guys,
    >
    > Just had another way we might do this. Migrate the completions out to
    > the submitting CPUs rather than migrate submission into the completing
    > CPU.


    Hi Nick,

    When Matthew was describing this work at an LCA presentation (not
    sure whether you were at that presentation or not), Zach came up
    with the idea that allowing the submitting application to control
    the CPU on which the IO completion processing occurs would be a
    good approach to try. That is, we submit a "completion cookie" with
    the bio that indicates where we want completion to run, rather than
    dictating that completion runs on the submission CPU.

    The reasoning is that only the higher level context really knows
    what is optimal, and that changes from application to application.
    The "complete on the submission CPU" policy _may_ be more optimal
    for database workloads, but it is definitely suboptimal for XFS and
    transaction I/O completion handling because it simply drags a bunch
    of global filesystem state around between all the CPUs running
    completions. In that case, we really only want a single CPU to be
    handling the completions.....

    (Zach - please correct me if I've missed anything)

    Looking at your patch - if you turn it around so that the
    "submission CPU" field can be specified as the "completion cpu" then
    I think the patch will expose the policy knobs needed to do the
    above. Add the bio -> rq linkage to enable filesystems and DIO to
    control the completion CPU field and we're almost done....
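    One way to make the suggestion concrete: carry a small policy cookie
    with each request and let the block layer pick the completion CPU from
    it. The sketch below is hypothetical userspace C (none of these names
    exist in the kernel); it models only the policy decision, not the
    actual dispatch:

```c
#include <assert.h>

/* Hypothetical completion-policy cookie, as an alternative to a bare
 * submission_cpu field in struct request. */
enum blk_completion_policy {
    COMPLETE_ON_INTERRUPT_CPU,  /* current behaviour: run where the IRQ lands */
    COMPLETE_ON_SUBMISSION_CPU, /* Nick's patch: ship back to the submitter */
    COMPLETE_ON_GIVEN_CPU,      /* XFS-style: one chosen CPU owns completions */
};

struct completion_cookie {
    enum blk_completion_policy policy;
    int cpu;                    /* used only by COMPLETE_ON_GIVEN_CPU */
};

/* Pick the CPU a completion should run on, given where the request was
 * submitted and where the completion interrupt arrived. */
static int completion_target_cpu(const struct completion_cookie *c,
                                 int submission_cpu, int irq_cpu)
{
    switch (c->policy) {
    case COMPLETE_ON_SUBMISSION_CPU:
        return submission_cpu;
    case COMPLETE_ON_GIVEN_CPU:
        return c->cpu;
    default:
        return irq_cpu;
    }
}
```

    A filesystem would set the cookie at submit_bio() time; direct I/O could
    default to the submission CPU, while XFS could pin transaction
    completions to a single CPU.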

    Cheers,

    Dave.
    --
    Dave Chinner
    Principal Engineer
    SGI Australian Software Group

  5. Re: [rfc] direct IO submission and completion scalability issues

    David Chinner wrote:
    > Hi Nick,
    >
    > When Matthew was describing this work at an LCA presentation (not
    > sure whether you were at that presentation or not), Zach came up
    > with the idea that allowing the submitting application to control
    > the CPU on which the IO completion processing occurs would be a
    > good approach to try. That is, we submit a "completion cookie" with
    > the bio that indicates where we want completion to run, rather than
    > dictating that completion runs on the submission CPU.
    >
    > The reasoning is that only the higher level context really knows
    > what is optimal, and that changes from application to application.


    well.. kinda. One of the really hard parts of the submit/completion stuff
    is that the slab/slob/slub/slib allocator ends up basically "cycling"
    memory through the system; there's a sink of free memory on all the
    submission cpus and a source of free memory on the completion cpu. I
    don't think applications are capable of working out what is best in
    this scenario..



  6. Re: [rfc] direct IO submission and completion scalability issues

    On Sun, Feb 03, 2008 at 08:14:45PM -0800, Arjan van de Ven wrote:
    > David Chinner wrote:
    > >Hi Nick,
    > >
    > >When Matthew was describing this work at an LCA presentation (not
    > >sure whether you were at that presentation or not), Zach came up
    > >with the idea that allowing the submitting application to control
    > >the CPU on which the IO completion processing occurs would be a good
    > >approach to try. That is, we submit a "completion cookie" with the
    > >bio that indicates where we want completion to run, rather than
    > >dictating that completion runs on the submission CPU.
    > >
    > >The reasoning is that only the higher level context really knows
    > >what is optimal, and that changes from application to application.

    >
    > well.. kinda. One of the really hard parts of the submit/completion
    > stuff is that the slab/slob/slub/slib allocator ends up basically
    > "cycling" memory through the system; there's a sink of free memory on
    > all the submission cpus and a source of free memory on the completion
    > cpu. I don't think applications are capable of working out what is
    > best in this scenario..


    Applications as in "anything that calls submit_bio()", i.e. direct I/O,
    filesystems, etc. In other words, not userspace but in-kernel applications.

    In XFS, simultaneous IO completion on multiple CPUs can contribute greatly to
    contention of global structures in XFS. By controlling where completions are
    delivered, we can greatly reduce this contention, especially on large,
    multipathed devices that deliver interrupts to multiple CPUs that may be far
    distant from each other. We have all the state and intelligence necessary
    to control this sort of policy decision effectively.....

    Cheers,

    Dave.
    --
    Dave Chinner
    Principal Engineer
    SGI Australian Software Group

  7. Re: [rfc] direct IO submission and completion scalability issues

    On Sun, Feb 03 2008, Nick Piggin wrote:
    > On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
    > >
    > > The second experiment we did was migrating the IO submission to the
    > > IO completion cpu. Instead of submitting the IO on the same cpu where
    > > the request arrived, in this experiment the IO submission gets migrated
    > > to the cpu that is processing IO completions (interrupt). This will
    > > minimize the access to remote cachelines (that happens in the timer,
    > > slab, and scsi layers). The IO submission request is forwarded to the
    > > kblockd thread on the cpu receiving the interrupts. As part of this, we
    > > also made the kblockd thread on each cpu the highest priority thread,
    > > so that IO gets submitted as soon as possible on the interrupt cpu
    > > without any delay. On an x86_64 SMP platform with 16 cores, this
    > > resulted in a 2% performance improvement, and 3.3% on a two-node ia64
    > > platform.
    > >
    > > A quick and dirty prototype patch (not meant for inclusion) for this IO
    > > migration experiment is appended to this e-mail.
    > >
    > > Observation #1 mentioned above is also applicable to this experiment.
    > > CPUs processing interrupts will now have to handle the IO
    > > submission/processing load as well.
    > >
    > > Observation #2: This introduces some migration overhead during IO
    > > submission. With the current prototype, every incoming IO request
    > > results in an IPI and a context switch (to the kblockd thread) on the
    > > interrupt processing cpu. This issue needs to be addressed, and the
    > > main challenge is an efficient mechanism for doing this IO migration
    > > (how much batching to do, and when to send the migrate request?), so
    > > that we don't delay the IO much and, at the same time, don't cause much
    > > overhead during migration.

    >
    > Hi guys,
    >
    > Just had another way we might do this. Migrate the completions out to
    > the submitting CPUs rather than migrate submission into the completing
    > CPU.
    >
    > I've got a basic patch that passes some stress testing. It seems fairly
    > simple to do at the block layer, and the bulk of the patch involves
    > introducing a scalable smp_call_function for it.
    >
    > Now it could be optimised more by looking at batching up IPIs or
    > optimising the call function path or even migrating the completion event
    > at a different level...
    >
    > However, this is a first cut. It actually seems like it might be taking
    > slightly more CPU to process block IO (~0.2%)... however, this is on my
    > dual core system that shares an llc, which means that there are very few
    > cache benefits to the migration, but non-zero overhead. So on multisocket
    > systems hopefully it might get to positive territory.


    That's pretty funny, I did pretty much the exact same thing last week!
    The primary difference between yours and mine is that I used a more
    private interface to signal a softirq raise on another CPU, instead of
    allocating call data and exposing a generic interface. That put the
    locking in blk-core instead, turning blk_cpu_done into a structure with
    a lock and list_head instead of just being a list head, and intercepted
    at blk_complete_request() time instead of waiting for an already raised
    softirq on that CPU.

    Didn't get around to any performance testing yet, though. Will try and
    clean it up a bit and do that.
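    The structural change Jens describes can be sketched in userspace like
    this, with a pthread mutex standing in for the spinlock; all names are
    illustrative, not the actual patch:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* blk_cpu_done goes from a bare per-cpu list head to a lock + list, so
 * that a *remote* (completing) CPU can safely queue work onto it. */
struct list_node { struct list_node *next; };

struct cpu_done_queue {
    pthread_mutex_t lock;   /* unnecessary while only the local CPU touched it */
    struct list_node *head; /* stands in for the struct list_head */
};

/* Called from the completing CPU: hand a request to the submitting CPU's
 * done queue, roughly what intercepting at blk_complete_request() does. */
static void queue_remote_completion(struct cpu_done_queue *q,
                                    struct list_node *rq)
{
    pthread_mutex_lock(&q->lock);
    rq->next = q->head;
    q->head = rq;
    pthread_mutex_unlock(&q->lock);
}

/* Called from the target CPU's softirq: splice everything off under the
 * lock, then process the detached list without holding it. */
static struct list_node *splice_completions(struct cpu_done_queue *q)
{
    pthread_mutex_lock(&q->lock);
    struct list_node *list = q->head;
    q->head = NULL;
    pthread_mutex_unlock(&q->lock);
    return list;
}
```

    The cost Nick mentions below is visible here: the lock is taken even
    when no migration happens, which is exactly the overhead he wanted to
    avoid in the non-migration case.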

    --
    Jens Axboe


  8. Re: [rfc] direct IO submission and completion scalability issues

    On Mon, Feb 04, 2008 at 03:40:20PM +1100, David Chinner wrote:
    > On Sun, Feb 03, 2008 at 08:14:45PM -0800, Arjan van de Ven wrote:
    > > David Chinner wrote:
    > > >Hi Nick,
    > > >
    > > >When Matthew was describing this work at an LCA presentation (not
    > > >sure whether you were at that presentation or not), Zach came up
    > > >with the idea that allowing the submitting application to control
    > > >the CPU on which the IO completion processing occurs would be a good
    > > >approach to try. That is, we submit a "completion cookie" with the
    > > >bio that indicates where we want completion to run, rather than
    > > >dictating that completion runs on the submission CPU.
    > > >
    > > >The reasoning is that only the higher level context really knows
    > > >what is optimal, and that changes from application to application.

    > >
    > > well.. kinda. One of the really hard parts of the submit/completion
    > > stuff is that the slab/slob/slub/slib allocator ends up basically
    > > "cycling" memory through the system; there's a sink of free memory on
    > > all the submission cpus and a source of free memory on the completion
    > > cpu. I don't think applications are capable of working out what is
    > > best in this scenario..

    >
    > Applications as in "anything that calls submit_bio()", i.e. direct I/O,
    > filesystems, etc. In other words, not userspace but in-kernel applications.
    >
    > In XFS, simultaneous IO completion on multiple CPUs can contribute greatly to
    > contention of global structures in XFS. By controlling where completions are
    > delivered, we can greatly reduce this contention, especially on large,
    > multipathed devices that deliver interrupts to multiple CPUs that may be far
    > distant from each other. We have all the state and intelligence necessary
    > to control this sort of policy decision effectively.....


    Hi Dave,

    Thanks for taking a look at the patch... yes it would be easy to turn
    this bit of state into a more flexible cookie (eg. complete on submitter;
    complete on interrupt; complete on CPUx/nodex etc.). Maybe we'll need
    something that complex... I'm not sure, it would probably need more
    fine tuning. That said, I just wanted to get this approach out there
    early for rfc.

    I guess both you and Arjan have points. For a _lot_ of things, completing
    on the same CPU as the submitter (whether that means migrating submission
    as in the original patch in the thread, or migrating completion like I do)
    looks like the right policy.

    You get better behaviour in the slab and page allocators, and better
    locality and cache hotness of memory. For example, I guess in a
    filesystem / pagecache heavy workload, you have to touch each struct
    page, buffer head, fs private state, and also often have to wake the
    thread for completion. Much of this data has just been touched at
    submit time, so doing this on the same CPU is nice...

    I'm surprised that the xfs global state bouncing would outweigh the
    bouncing of all the per-page/block/bio/request/etc data that gets touched
    during completion. We'll see.


  9. Re: [rfc] direct IO submission and completion scalability issues

    > + q = &__get_cpu_var(call_single_queue);
    > + spin_lock_irqsave(&q->lock, flags);
    > + list_replace_init(&q->list, &list);
    > + spin_unlock_irqrestore(&q->lock, flags);


    I think you could do that lockless if you use a similar data structure
    as netchannels (essentially a fixed size single buffer queue with atomic
    exchange of the first/last pointers) and not using a list. That would avoid
    at least one bounce for the lock and likely another one for the list
    manipulation.
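    Andi's suggestion is a fixed-size buffer with atomic exchange of the
    first/last pointers; the same lock avoidance can be sketched with a
    linked list and C11 atomics, where producers push with a CAS loop and
    the consumer detaches the whole pending list with one atomic exchange
    in place of the lock-protected list_replace_init(). A hypothetical
    userspace sketch, not the netchannel implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node { struct node *next; int val; };

/* Producer side: lock-free push of one node onto a shared list. */
static void push(struct node *_Atomic *head, struct node *n)
{
    struct node *old = atomic_load(head);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(head, &old, n));
}

/* Consumer side: a single atomic exchange detaches every queued node,
 * the lockless analogue of list_replace_init() under q->lock. Nodes come
 * back in LIFO order, so a real queue would reverse the list before
 * running the callbacks. */
static struct node *pop_all(struct node *_Atomic *head)
{
    return atomic_exchange(head, NULL);
}
```

    This removes the lock-word cache-line bounce; the remaining bounce is
    the head pointer itself, which any cross-CPU queue has to pay.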

    Also the right way would be to not add a second mechanism for this,
    but fix the standard smp_call_function_single() to support it.

    -Andi

  10. Re: [rfc] direct IO submission and completion scalability issues

    On Mon, Feb 04 2008, Nick Piggin wrote:
    > On Mon, Feb 04, 2008 at 11:12:44AM +0100, Jens Axboe wrote:
    > > On Sun, Feb 03 2008, Nick Piggin wrote:
    > > > On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
    > > >
    > > > Hi guys,
    > > >
    > > > Just had another way we might do this. Migrate the completions out to
    > > > the submitting CPUs rather than migrate submission into the completing
    > > > CPU.
    > > >
    > > > I've got a basic patch that passes some stress testing. It seems fairly
    > > > simple to do at the block layer, and the bulk of the patch involves
    > > > introducing a scalable smp_call_function for it.
    > > >
    > > > Now it could be optimised more by looking at batching up IPIs or
    > > > optimising the call function path or even migrating the completion event
    > > > at a different level...
    > > >
    > > > However, this is a first cut. It actually seems like it might be taking
    > > > slightly more CPU to process block IO (~0.2%)... however, this is on my
    > > > dual core system that shares an llc, which means that there are very few
    > > > cache benefits to the migration, but non-zero overhead. So on multisocket
    > > > systems hopefully it might get to positive territory.

    > >
    > > That's pretty funny, I did pretty much the exact same thing last week!

    >
    > Oh nice
    >
    >
    > > The primary difference between yours and mine is that I used a more
    > > private interface to signal a softirq raise on another CPU, instead of
    > > allocating call data and exposing a generic interface. That put the
    > > locking in blk-core instead, turning blk_cpu_done into a structure with
    > > a lock and list_head instead of just being a list head, and intercepted
    > > at blk_complete_request() time instead of waiting for an already raised
    > > softirq on that CPU.

    >
    > Yeah I was looking at that... didn't really want to add the spinlock
    > overhead to the non-migration case. Anyway, I guess those sorts of
    > fine implementation details are going to have to be sorted out with
    > results.


    As Andi mentions, we can look into making that lockless. For the initial
    implementation I didn't really care, just wanted something to play with
    that would nicely allow me to control both the submit and complete side
    of the affinity issue.

    --
    Jens Axboe

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  11. Re: [rfc] direct IO submission and completion scalability issues

    On Mon, Feb 04, 2008 at 11:12:44AM +0100, Jens Axboe wrote:
    > On Sun, Feb 03 2008, Nick Piggin wrote:
    > > On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
    > >
    > > Hi guys,
    > >
    > > Just had another way we might do this. Migrate the completions out to
    > > the submitting CPUs rather than migrate submission into the completing
    > > CPU.
    > >
    > > I've got a basic patch that passes some stress testing. It seems fairly
    > > simple to do at the block layer, and the bulk of the patch involves
    > > introducing a scalable smp_call_function for it.
    > >
    > > Now it could be optimised more by looking at batching up IPIs or
    > > optimising the call function path or even migrating the completion event
    > > at a different level...
    > >
    > > However, this is a first cut. It actually seems like it might be taking
    > > slightly more CPU to process block IO (~0.2%)... however, this is on my
    > > dual core system that shares an llc, which means that there are very few
    > > cache benefits to the migration, but non-zero overhead. So on multisocket
    > > systems hopefully it might get to positive territory.

    >
    > That's pretty funny, I did pretty much the exact same thing last week!


    Oh nice


    > The primary difference between yours and mine is that I used a more
    > private interface to signal a softirq raise on another CPU, instead of
    > allocating call data and exposing a generic interface. That put the
    > locking in blk-core instead, turning blk_cpu_done into a structure with
    > a lock and list_head instead of just being a list head, and intercepted
    > at blk_complete_request() time instead of waiting for an already raised
    > softirq on that CPU.


    Yeah I was looking at that... didn't really want to add the spinlock
    overhead to the non-migration case. Anyway, I guess those sorts of
    fine implementation details are going to have to be sorted out with
    results.

  12. Re: [rfc] direct IO submission and completion scalability issues

    [ ugh, still jet lagged. ]

    > Hi Nick,
    >
    > When Matthew was describing this work at an LCA presentation (not
    > sure whether you were at that presentation or not), Zach came up
    > with the idea that allowing the submitting application to control the
    > CPU on which the io completion processing occurs would be a good
    > approach to try. That is, we submit a "completion cookie" with the
    > bio that indicates where we want completion to run, rather than
    > dictating that completion runs on the submission CPU.
    >
    > The reasoning is that only the higher level context really knows
    > what is optimal, and that changes from application to application.
    > The "complete on the submission CPU" policy _may_ be more optimal
    > for database workloads, but it is definitely suboptimal for XFS and
    > transaction I/O completion handling because it simply drags a bunch
    > of global filesystem state around between all the CPUs running
    > completions. In that case, we really only want a single CPU to be
    > handling the completions.....
    >
    > (Zach - please correct me if I've missed anything)


    Yeah, I think Nick's patch (and Jens' approach, presumably) is just the
    sort of thing we were hoping for when discussing this during Matthew's talk.

    I was imagining the patch a little bit differently (per-cpu tasks, do a
    wake_up from the driver instead of cpu nr testing up in blk, work
    queues, whatever), but we know how to iron out these kinds of details.

    > Looking at your patch - if you turn it around so that the
    > "submission CPU" field can be specified as the "completion cpu" then
    > I think the patch will expose the policy knobs needed to do the
    > above.


    Yeah, that seems pretty straightforward.

    We might need some logic for noticing that the desired cpu has been
    hot-plugged away while the IO was in flight, it occurs to me.

    - z

  13. Re: [rfc] direct IO submission and completion scalability issues

    On Mon, Feb 04 2008, Zach Brown wrote:
    > [ ugh, still jet lagged. ]
    >
    > > Hi Nick,
    > >
    > > When Matthew was describing this work at an LCA presentation (not
    > > sure whether you were at that presentation or not), Zach came up
    > > with the idea that allowing the submitting application to control the
    > > CPU on which the io completion processing occurs would be a good
    > > approach to try. That is, we submit a "completion cookie" with the
    > > bio that indicates where we want completion to run, rather than
    > > dictating that completion runs on the submission CPU.
    > >
    > > The reasoning is that only the higher level context really knows
    > > what is optimal, and that changes from application to application.
    > > The "complete on the submission CPU" policy _may_ be more optimal
    > > for database workloads, but it is definitely suboptimal for XFS and
    > > transaction I/O completion handling because it simply drags a bunch
    > > of global filesystem state around between all the CPUs running
    > > completions. In that case, we really only want a single CPU to be
    > > handling the completions.....
    > >
    > > (Zach - please correct me if I've missed anything)

    >
    > Yeah, I think Nick's patch (and Jens' approach, presumably) is just the
    > sort of thing we were hoping for when discussing this during Matthew's talk.
    >
    > I was imagining the patch a little bit differently (per-cpu tasks, do a
    > wake_up from the driver instead of cpu nr testing up in blk, work
    > queues, whatever), but we know how to iron out these kinds of details.


    per-cpu tasks/wq's might be better; it's a little awkward to jump
    through hoops.

    > > Looking at your patch - if you turn it around so that the
    > > "submission CPU" field can be specified as the "completion cpu" then
    > > I think the patch will expose the policy knobs needed to do the
    > > above.

    >
    > Yeah, that seems pretty straightforward.
    >
    > We might need some logic for noticing that the desired cpu has been
    > hot-plugged away while the IO was in flight, it occurs to me.


    the softirq completion stuff already handles cpus going away, at least
    with my patch that stuff works fine (with a dead flag added).

    --
    Jens Axboe


  14. Re: [rfc] direct IO submission and completion scalability issues

    On Sun, Feb 03, 2008 at 10:52:52AM +0100, Nick Piggin wrote:
    > Hi guys,
    >
    > Just had another way we might do this. Migrate the completions out to
    > the submitting CPUs rather than migrate submission into the completing
    > CPU.


    Hi Nick, this was the first experiment I tried on a quad-core,
    four-package SMP platform, and it didn't show much improvement in my
    prototype (my prototype was migrating the softirq to the kblockd
    context of the submitting CPU).

    In the OLTP workload, quite a bit of activity happens below the block
    layer, and by the time we come to the softirq, some damage is already
    done in the slab, scsi cmds, timers etc. Last year's OLS paper
    (http://ols.108.redhat.com/2007/Repri...gh-Reprint.pdf)
    shows the different cache lines that are contended in the kernel for
    the OLTP workload.

    Softirq migration should at least reduce the cacheline contention that
    happens in the sched and AIO layers. I didn't spend much time on why my
    softirq migration patch didn't help much (as I was after the bigger bird
    of migrating IO submission to the completion CPU at that time). If this
    solution has fewer side-effects and is easily acceptable, then we can
    analyze the softirq migration patch further and find out its potential.

    While there is some potential in the softirq migration, the full
    potential can only be exploited by doing the IO submission and
    completion on the same CPU.

    thanks,
    suresh

  15. Re: [rfc] direct IO submission and completion scalability issues

    Jens Axboe wrote:
    >> I was imagining the patch a little bit differently (per-cpu tasks, do a
    >> wake_up from the driver instead of cpu nr testing up in blk, work
    >> queues, whatever), but we know how to iron out these kinds of details.

    >
    > per-cpu tasks/wq's might be better, it's a little awkward to jump
    > through hoops
    >


    one caveat btw: when multiqueue storage hw becomes available for Linux,
    we need to figure out how to deal with the preference thing, since there
    honoring a "non-logical" preference would be quite expensive (it means
    you can't make the local submit queues lockless etc). So before we go
    down the road of having widespread APIs for this stuff, we need to make
    sure we're not going to do something that's going to be really stupid
    6 to 18 months down the road.

  16. Re: [rfc] direct IO submission and completion scalability issues

    On Mon, 2008-02-04 at 05:33 -0500, Jens Axboe wrote:
    > As Andi mentions, we can look into making that lockless. For the initial
    > implementation I didn't really care, just wanted something to play with
    > that would nicely allow me to control both the submit and complete side
    > of the affinity issue.


    Sorry, late to the party ... it went to my steeleye address, not my
    current one.

    Could you try re-running the tests with a low queue depth (say around 8)
    and the card interrupt bound to a single CPU?

    The reason for asking you to do this is that it should emulate almost
    precisely what you're looking for: the submit path will be picked up in
    the SCSI softirq where the queue gets run, so you should find that all
    submits and returns happen on a single CPU, and everything gets cache
    hot there.

    James

    p.s. if everyone could also update my email address to the
    hansenpartnership one, the people at steeleye who monitor my old email
    account would be grateful.


  17. Re: [rfc] direct IO submission and completion scalability issues

    On Mon, Feb 04, 2008 at 11:09:59AM +0100, Nick Piggin wrote:
    > You get better behaviour in the slab and page allocators and locality
    > and cache hotness of memory. For example, I guess in a filesystem /
    > pagecache heavy workload, you have to touch each struct page, buffer head,
    > fs private state, and also often have to wake the thread for completion.
    > Much of this data has just been touched at submit time, so doing this on
    > the same CPU is nice...


    [....]

    > I'm surprised that the xfs global state bouncing would outweigh the
    > bouncing of all the per-page/block/bio/request/etc data that gets touched
    > during completion. We'll see.


    per-page/block/bio/request/etc is local to a single I/O. The only
    penalty is a cacheline bounce for each of the structures from one
    CPU to another. That is, there is no global state modified by these
    completions.

    The real issue is metadata. The transaction log I/O completion
    funnels through a state machine protected by a single lock, which
    means completions on different CPUs pull that lock to all
    completion CPUs. Given that the same lock is used during transaction
    completion for other state transitions (in task context, not intr),
    the more CPUs that touch it at once, the worse the problem gets.

    Then there's metadata I/O completion, which funnels through a larger
    set of global locks in the transaction subsystem (e.g. the active
    item list lock, the log reservation locks, the log state lock, etc)
    which once again means the more CPUs we have delivering I/O
    completions, the worse the problem gets.

    Cheers,

    Dave.
    --
    Dave Chinner
    Principal Engineer
    SGI Australian Software Group

  18. Re: [rfc] direct IO submission and completion scalability issues

    On Mon, Feb 04 2008, Arjan van de Ven wrote:
    > Jens Axboe wrote:
    > >>I was imagining the patch a little bit differently (per-cpu tasks, do a
    > >>wake_up from the driver instead of cpu nr testing up in blk, work
    > >>queues, whatever), but we know how to iron out these kinds of details.

    > >
    > >per-cpu tasks/wq's might be better, it's a little awkward to jump
    > >through hoops
    > >

    >
    > one caveat btw; when the multiqueue storage hw becomes available for Linux,
    > we need to figure out how to deal with the preference thing; since there
    > honoring a "non-logical" preference would be quite expensive (it means


    non-local?

    > you can't make the local submit queues lockless etc etc), so before we
    > go down the road of having widespread APIs for this stuff.. we need to
    > make sure we're not going to do something that's going to be really
    > stupid 6 to 18 months down the road.


    As far as I'm concerned, so far this is just playing around with
    affinity (and to some extent taking it too far, on purpose). For
    instance, my current patch can move submissions and completions
    independently, with a set mask or by 'binding' a request to a CPU. Most
    of that doesn't make sense. 'complete on the same CPU, if possible'
    makes sense and would fit fine with multi-queue hw.

    Moving submissions at the block layer to a defined set of CPUs is a bit
    silly imho; it's pretty costly, and it's a lot saner to simply bind the
    submitters instead. So if you can set irq affinity, then just make the
    submitters follow that.

    --
    Jens Axboe


  19. Re: [rfc] direct IO submission and completion scalability issues

    On Tue, Feb 05, 2008 at 11:14:19AM +1100, David Chinner wrote:
    > On Mon, Feb 04, 2008 at 11:09:59AM +0100, Nick Piggin wrote:
    > > You get better behaviour in the slab and page allocators and locality
    > > and cache hotness of memory. For example, I guess in a filesystem /
    > > pagecache heavy workload, you have to touch each struct page, buffer head,
    > > fs private state, and also often have to wake the thread for completion.
    > > Much of this data has just been touched at submit time, so doing this on
    > > the same CPU is nice...

    >
    > [....]
    >
    > > I'm surprised that the xfs global state bouncing would outweigh the
    > > bouncing of all the per-page/block/bio/request/etc data that gets touched
    > > during completion. We'll see.

    >
    > per-page/block/bio/request/etc is local to a single I/O. The only
    > penalty is a cacheline bounce for each of the structures from one
    > CPU to another. That is, there is no global state modified by these
    > completions.


    Yeah, but it is going from _all_ submitting CPUs to the one completing
    CPU. So you could bottleneck the interconnect at the completing CPU
    just as much as if you had cachelines being pulled the other way (ie.
    many CPUs trying to pull in a global cacheline).


    > The real issue is metadata. The transaction log I/O completion
    > funnels through a state machine protected by a single lock, which
    > means completions on different CPUs pull that lock to all
    > completion CPUs. Given that the same lock is used during transaction
    > completion for other state transitions (in task context, not intr),
    > the more CPUs that touch it at once, the worse the problem gets.


    OK, once you add locking (and not simply cacheline contention), then
    the problem gets harder I agree. But I think that if the submitting
    side takes the same locks as log completion (eg. maybe for starting a
    new transaction), then it is not going to be a clear win either way,
    and you'd have to measure it in the end.

