[RFC patch 00/18] Trace Clock v2 - Kernel

This is a discussion on [RFC patch 00/18] Trace Clock v2 - Kernel.

Thread: [RFC patch 00/18] Trace Clock v2

  1. Re: [PATCH v2] clarify usage expectations for cnt32_to_63()

    On Mon, 10 Nov 2008 18:15:32 -0500 (EST)
    Nicolas Pitre wrote:

    > On Mon, 10 Nov 2008, Andrew Morton wrote:
    >
    > > On Mon, 10 Nov 2008 16:34:54 -0500 (EST)
    > > Nicolas Pitre wrote:
    > >
    > > > > It is far better to make the management of the state explicit and under
    > > > > the control of the caller. Get the caller to allocate the state and
    > > > > pass its address into this function. Simple, clear, explicit and
    > > > > robust.
    > > >
    > > > Sigh... What about this compromise then?
    > > >
    > > > diff --git a/include/linux/cnt32_to_63.h b/include/linux/cnt32_to_63.h
    > > > index 7605fdd..74ce767 100644
    > > > --- a/include/linux/cnt32_to_63.h
    > > > +++ b/include/linux/cnt32_to_63.h
    > > > @@ -32,8 +32,9 @@ union cnt32_to_63 {
    > > >
    > > >
    > > > /**
    > > > - * cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
    > > > + * __cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
    > > > * @cnt_lo: The low part of the counter
    > > > + * @cnt_hi_p: Pointer to storage for the extended part of the counter
    > > > *
    > > > * Many hardware clock counters are only 32 bits wide and therefore have
    > > > * a relatively short period making wrap-arounds rather frequent. This
    > > > @@ -75,16 +76,31 @@ union cnt32_to_63 {
    > > > * clear-bit instruction. Otherwise caller must remember to clear the top
    > > > * bit explicitly.
    > > > */
    > > > -#define cnt32_to_63(cnt_lo) \
    > > > +#define __cnt32_to_63(cnt_lo, cnt_hi_p) \
    > > > ({ \
    > > > - static u32 __m_cnt_hi; \
    > > > union cnt32_to_63 __x; \
    > > > - __x.hi = __m_cnt_hi; \
    > > > + __x.hi = *(cnt_hi_p); \
    > > > smp_rmb(); \
    > > > __x.lo = (cnt_lo); \
    > > > if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
    > > > - __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
    > > > + *(cnt_hi_p) = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
    > > > __x.val; \
    > > > })

    > >
    > > This references its second argument twice, which can cause correctness
    > > or efficiency problems.
    > >
    > > There is no reason that this had to be implemented in cpp.
    > > Implementing it in C will fix the above problem.

    >
    > No, it won't, for correctness and efficiency reasons.
    >
    > And I've explained why already.


    I'd be very surprised if you've really found a case where a macro is
    faster than an inlined function. I don't think that has happened
    before.
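
    The "references its second argument twice" objection is the classic cpp
    double-evaluation hazard. A minimal, hypothetical sketch (the caller and
    the names are invented for illustration; they are not from the patch):

    /* cnt_hi_p occurs twice in the proposed macro body: once read
     * (__x.hi = *(cnt_hi_p)) and once conditionally written
     * (*(cnt_hi_p) = ...), so a side effect in that argument runs an
     * unpredictable number of times. */
    u32 hi_state[2];
    u32 *p = hi_state;

    u64 v = __cnt32_to_63(OSCR, p++);
    /* p may end up advanced once or twice, depending on whether the
     * wrap branch was taken */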


  2. Re: [PATCH v2] clarify usage expectations for cnt32_to_63()


    On Mon, 10 Nov 2008, Andrew Morton wrote:
    > On Mon, 10 Nov 2008 18:15:32 -0500 (EST)
    > Nicolas Pitre wrote:
    > > >
    > > > This references its second argument twice, which can cause correctness
    > > > or efficiency problems.
    > > >
    > > > There is no reason that this had to be implemented in cpp.
    > > > Implementing it in C will fix the above problem.

    > >
    > > No, it won't, for correctness and efficiency reasons.
    > >
    > > And I've explained why already.

    >
    > I'd be very surprised if you've really found a case where a macro is
    > faster than an inlined function. I don't think that has happened
    > before.


    But that's the way my Grandpa did it. With macros!

    -- Steve


  3. Re: [PATCH v2] clarify usage expectations for cnt32_to_63()

    On Mon, 10 Nov 2008, Andrew Morton wrote:

    > On Mon, 10 Nov 2008 18:15:32 -0500 (EST)
    > Nicolas Pitre wrote:
    >
    > > On Mon, 10 Nov 2008, Andrew Morton wrote:
    > >
    > > > This references its second argument twice, which can cause correctness
    > > > or efficiency problems.
    > > >
    > > > There is no reason that this had to be implemented in cpp.
    > > > Implementing it in C will fix the above problem.

    > >
    > > No, it won't, for correctness and efficiency reasons.
    > >
    > > And I've explained why already.

    >
    > I'd be very surprised if you've really found a case where a macro is
    > faster than an inlined function. I don't think that has happened
    > before.


    That has nothing to do with "a macro is faster" at all. It's all
    about the order used to evaluate the provided arguments. And the first one
    might be anything like a memory value, an IO operation, an expression,
    etc. An inline function would work correctly only with pointers, and
    would therefore totally break apart on x86 for example.


    Nicolas
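
    A minimal sketch of the ordering problem Nicolas describes, using a
    hypothetical one-argument inline variant (not a proposal from the
    thread, only an illustration of the hazard):

    static inline u64 cnt32_to_63_fn(u32 cnt_lo)
    {
            static u32 __m_cnt_hi;  /* also duplicated per translation
                                       unit, since the function is static */
            union cnt32_to_63 __x;

            __x.hi = __m_cnt_hi;    /* hi is read only now, AFTER the call
                                       site has already evaluated cnt_lo */
            __x.lo = cnt_lo;
            if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
                    __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31);
            return __x.val;
    }

    /* here OSCR is sampled before __m_cnt_hi, reversing the hi-then-lo
     * order the algorithm depends on: */
    unsigned long long v = cnt32_to_63_fn(OSCR);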

  4. [PATCH] convert cnt32_to_63 to inline

    * Nicolas Pitre (nico@cam.org) wrote:
    > On Mon, 10 Nov 2008, Andrew Morton wrote:
    >
    > > On Mon, 10 Nov 2008 18:15:32 -0500 (EST)
    > > Nicolas Pitre wrote:
    > >
    > > > On Mon, 10 Nov 2008, Andrew Morton wrote:
    > > >
    > > > > This references its second argument twice, which can cause correctness
    > > > > or efficiency problems.
    > > > >
    > > > > There is no reason that this had to be implemented in cpp.
    > > > > Implementing it in C will fix the above problem.
    > > >
    > > > No, it won't, for correctness and efficiency reasons.
    > > >
    > > > And I've explained why already.

    > >
    > > I'd be very surprised if you've really found a case where a macro is
    > > faster than an inlined function. I don't think that has happened
    > > before.

    >
    > That has nothing to do with "a macro is faster" at all. It's all
    > about the order used to evaluate the provided arguments. And the first one
    > might be anything like a memory value, an IO operation, an expression,
    > etc. An inline function would work correctly only with pointers, and
    > would therefore totally break apart on x86 for example.
    >
    >
    > Nicolas


    Let's see what it looks like once implemented. Only compile-tested. Assumes
    pxa, sa110 and mn10300 are all UP-only. Correct smp_rmb() barriers are used
    for ARM Versatile.

    Turn cnt32_to_63 into an inline function.
    Change all callers to new API.
    Document barrier usage.

    Signed-off-by: Mathieu Desnoyers
    CC: Nicolas Pitre
    CC: Andrew Morton
    CC: torvalds@linux-foundation.org
    CC: rmk+lkml@arm.linux.org.uk
    CC: dhowells@redhat.com
    CC: paulus@samba.org
    CC: a.p.zijlstra@chello.nl
    CC: mingo@elte.hu
    CC: benh@kernel.crashing.org
    CC: rostedt@goodmis.org
    CC: tglx@linutronix.de
    CC: davem@davemloft.net
    CC: ralf@linux-mips.org
    ---
    arch/arm/mach-pxa/time.c | 14 ++++++++++++-
    arch/arm/mach-sa1100/generic.c | 15 +++++++++++++-
    arch/arm/mach-versatile/core.c | 12 ++++++++++-
    arch/mn10300/kernel/time.c | 19 +++++++++++++-----
    include/linux/cnt32_to_63.h | 42 +++++++++++++++++++++++++++++------------
    5 files changed, 82 insertions(+), 20 deletions(-)

    Index: linux.trees.git/arch/arm/mach-pxa/time.c
    ===================================================================
    --- linux.trees.git.orig/arch/arm/mach-pxa/time.c 2008-11-11 12:20:42.000000000 -0500
    +++ linux.trees.git/arch/arm/mach-pxa/time.c 2008-11-11 13:05:01.000000000 -0500
    @@ -37,6 +37,10 @@
    #define OSCR2NS_SCALE_FACTOR 10

    static unsigned long oscr2ns_scale;
    +static u32 sched_clock_cnt_hi; /*
    + * Shared cnt_hi OK with cycle counter only
    + * for UP systems.
    + */

    static void __init set_oscr2ns_scale(unsigned long oscr_rate)
    {
    @@ -54,7 +58,15 @@ static void __init set_oscr2ns_scale(uns

    unsigned long long sched_clock(void)
    {
    - unsigned long long v = cnt32_to_63(OSCR);
    + u32 cnt_lo, cnt_hi;
    + unsigned long long v;
    +
    + preempt_disable_notrace();
    + cnt_hi = sched_clock_cnt_hi;
    + barrier(); /* read cnt_hi before cnt_lo */
    + cnt_lo = OSCR;
    + v = cnt32_to_63(cnt_hi, cnt_lo, &sched_clock_cnt_hi);
    + preempt_enable_notrace();
    return (v * oscr2ns_scale) >> OSCR2NS_SCALE_FACTOR;
    }

    Index: linux.trees.git/include/linux/cnt32_to_63.h
    ===================================================================
    --- linux.trees.git.orig/include/linux/cnt32_to_63.h 2008-11-11 12:20:17.000000000 -0500
    +++ linux.trees.git/include/linux/cnt32_to_63.h 2008-11-11 13:10:44.000000000 -0500
    @@ -32,7 +32,9 @@ union cnt32_to_63 {

    /**
    * cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
    - * @cnt_lo: The low part of the counter
    + * @cnt_hi: The high part of the counter (read first)
    + * @cnt_lo: The low part of the counter (read after cnt_hi)
    + * @cnt_hi_ptr: Pointer to high part of the counter
    *
    * Many hardware clock counters are only 32 bits wide and therefore have
    * a relatively short period making wrap-arounds rather frequent. This
    @@ -57,7 +59,10 @@ union cnt32_to_63 {
    * code must be executed at least once per each half period of the 32-bit
    * counter to properly update the state bit in memory. This is usually not a
    * problem in practice, but if it is then a kernel timer could be scheduled
    - * to manage for this code to be executed often enough.
    + * to manage for this code to be executed often enough. If a per-cpu cnt_hi is
    + * used, the value must be updated at least once per 32-bit half-period on each
    + * CPU. If cnt_hi is shared between CPUs, it suffices to update it once per
    + * 32-bit half-period on any CPU.
    *
    * Note that the top bit (bit 63) in the returned value should be considered
    * as garbage. It is not cleared here because callers are likely to use a
    @@ -65,16 +70,29 @@ union cnt32_to_63 {
    * implicitly by making the multiplier even, therefore saving on a runtime
    * clear-bit instruction. Otherwise caller must remember to clear the top
    * bit explicitly.
    + *
    + * Preemption must be disabled when reading the cnt_hi and cnt_lo values and
    + * calling this function.
    + *
    + * The cnt_hi parameter _must_ be read before cnt_lo. This implies using the
    + * proper barriers:
    + * - smp_rmb() if cnt_lo is read from mmio and the cnt_hi variable is shared
    + * across CPUs.
    + * - use a per-cpu variable for cnt_hi if cnt_lo is read from per-cpu cycle
    + * counters, or to allow reading the counters with only a barrier().
    */
    -#define cnt32_to_63(cnt_lo) \
    -({ \
    - static volatile u32 __m_cnt_hi; \
    - union cnt32_to_63 __x; \
    - __x.hi = __m_cnt_hi; \
    - __x.lo = (cnt_lo); \
    - if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
    - __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
    - __x.val; \
    -})
    +static inline u64 cnt32_to_63(u32 cnt_hi, u32 cnt_lo, u32 *cnt_hi_ptr)
    +{
    + union cnt32_to_63 __x = {
    + {
    + .hi = cnt_hi,
    + .lo = cnt_lo,
    + },
    + };
    +
    + if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
    + *cnt_hi_ptr = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31);
    + return __x.val; /* Remember to clear the top bit in the caller */
    +}

    #endif
    Index: linux.trees.git/arch/arm/mach-sa1100/generic.c
    ===================================================================
    --- linux.trees.git.orig/arch/arm/mach-sa1100/generic.c 2008-11-11 12:20:42.000000000 -0500
    +++ linux.trees.git/arch/arm/mach-sa1100/generic.c 2008-11-11 13:05:10.000000000 -0500
    @@ -34,6 +34,11 @@
    unsigned int reset_status;
    EXPORT_SYMBOL(reset_status);

    +static u32 sched_clock_cnt_hi; /*
    + * Shared cnt_hi OK with cycle counter only
    + * for UP systems.
    + */
    +
    #define NR_FREQS 16

    /*
    @@ -133,7 +138,15 @@ EXPORT_SYMBOL(cpufreq_get);
    */
    unsigned long long sched_clock(void)
    {
    - unsigned long long v = cnt32_to_63(OSCR);
    + u32 cnt_lo, cnt_hi;
    + unsigned long long v;
    +
    + preempt_disable_notrace();
    + cnt_hi = sched_clock_cnt_hi;
    + barrier(); /* read cnt_hi before cnt_lo */
    + cnt_lo = OSCR;
    + v = cnt32_to_63(cnt_hi, cnt_lo, &sched_clock_cnt_hi);
    + preempt_enable_notrace();

    /* the <<1 gets rid of the cnt_32_to_63 top bit saving on a bic insn */
    v *= 78125<<1;
    Index: linux.trees.git/arch/arm/mach-versatile/core.c
    ===================================================================
    --- linux.trees.git.orig/arch/arm/mach-versatile/core.c 2008-11-11 12:20:42.000000000 -0500
    +++ linux.trees.git/arch/arm/mach-versatile/core.c 2008-11-11 12:57:55.000000000 -0500
    @@ -60,6 +60,8 @@
    #define VA_VIC_BASE __io_address(VERSATILE_VIC_BASE)
    #define VA_SIC_BASE __io_address(VERSATILE_SIC_BASE)

    +static u32 sched_clock_cnt_hi;
    +
    static void sic_mask_irq(unsigned int irq)
    {
    irq -= IRQ_SIC_START;
    @@ -238,7 +240,15 @@ void __init versatile_map_io(void)
    */
    unsigned long long sched_clock(void)
    {
    - unsigned long long v = cnt32_to_63(readl(VERSATILE_REFCOUNTER));
    + u32 cnt_lo, cnt_hi;
    + unsigned long long v;
    +
    + preempt_disable_notrace();
    + cnt_hi = sched_clock_cnt_hi;
    + smp_rmb();
    + cnt_lo = readl(VERSATILE_REFCOUNTER);
    + v = cnt32_to_63(cnt_hi, cnt_lo, &sched_clock_cnt_hi);
    + preempt_enable_notrace();

    /* the <<1 gets rid of the cnt_32_to_63 top bit saving on a bic insn */
    v *= 125<<1;
    Index: linux.trees.git/arch/mn10300/kernel/time.c
    ===================================================================
    --- linux.trees.git.orig/arch/mn10300/kernel/time.c 2008-11-11 12:41:42.000000000 -0500
    +++ linux.trees.git/arch/mn10300/kernel/time.c 2008-11-11 13:04:42.000000000 -0500
    @@ -29,6 +29,11 @@ unsigned long mn10300_iobclk; /* system
    unsigned long mn10300_tsc_per_HZ; /* number of ioclks per jiffy */
    #endif /* CONFIG_MN10300_RTC */

    +static u32 sched_clock_cnt_hi; /*
    + * shared cnt_hi OK with cycle counter only
    + * for UP systems.
    + */
    +
    static unsigned long mn10300_last_tsc; /* time-stamp counter at last time
    * interrupt occurred */

    @@ -52,18 +57,22 @@ unsigned long long sched_clock(void)
    unsigned long long ll;
    unsigned l[2];
    } tsc64, result;
    - unsigned long tsc, tmp;
    + unsigned long tmp;
    unsigned product[3]; /* 96-bit intermediate value */
    + u32 cnt_lo, cnt_hi;

    - /* read the TSC value
    - */
    - tsc = 0 - get_cycles(); /* get_cycles() counts down */
    + preempt_disable_notrace();
    + cnt_hi = sched_clock_cnt_hi;
    + barrier(); /* read cnt_hi before cnt_lo */
    + cnt_lo = 0 - get_cycles(); /* get_cycles() counts down */

    /* expand to 64-bits.
    * - sched_clock() must be called once a minute or better or the
    * following will go horribly wrong - see cnt32_to_63()
    */
    - tsc64.ll = cnt32_to_63(tsc) & 0x7fffffffffffffffULL;
    + tsc64.ll = cnt32_to_63(cnt_hi, cnt_lo, &sched_clock_cnt_hi);
    + tsc64.ll &= 0x7fffffffffffffffULL;
    + preempt_enable_notrace();

    /* scale the 64-bit TSC value to a nanosecond value via a 96-bit
    * intermediate


    --
    Mathieu Desnoyers
    OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
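
    For readers following the hunks above: both the old macro and the new
    inline function rely on the cnt32_to_63 union declared near the top of
    the same header, outside the diff context. Roughly, as it appeared in
    include/linux/cnt32_to_63.h at the time:

    #include <linux/types.h>
    #include <asm/byteorder.h>

    /* overlays the 64-bit value on endian-ordered 32-bit halves, so that
     * gcc can generate good code for the hi/lo accesses */
    union cnt32_to_63 {
            struct {
    #if defined(__LITTLE_ENDIAN)
                    u32 lo, hi;
    #elif defined(__BIG_ENDIAN)
                    u32 hi, lo;
    #endif
            };
            u64 val;
    };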

  5. Re: [PATCH] convert cnt32_to_63 to inline

    On Tue, Nov 11, 2008 at 01:28:00PM -0500, Mathieu Desnoyers wrote:
    > Let's see what it looks like once implemented. Only compile-tested. Assumes
    > pxa, sa110 and mn10300 are all UP-only. Correct smp_rmb() barriers are used
    > for ARM Versatile.


    Versatile is also UP only.

    The following are results from PXA built with gcc 3.4.3:

    1. two additional registers used in sched_clock()
    2. 8 additional bytes of code (which are needless if gcc was more intelligent)

    both of these I put down to inefficiencies in gcc's register allocation.

    3. worse instruction scheduling - two inter-dependent loads next to each
    other causing a pipeline stall

    Actual reading of variables/hardware is unaffected by this patch.

    Old code:

    c: e59f3050 ldr r3, [pc, #80] ; load address of oscr2ns_scale
    10: e59fc050 ldr ip, [pc, #80] ; load address of __m_cnt_hi
    14: e5932000 ldr r2, [r3] ; read oscr2ns_scale
    18: e59f304c ldr r3, [pc, #76] ; load address of OSCR
    1c: e59c1000 ldr r1, [ip] ; read __m_cnt_hi
    20: e1a07002 mov r7, r2
    24: e3a08000 mov r8, #0 ; 0x0
    28: e5933000 ldr r3, [r3] ; read OSCR register
    ....
    58: e1820b04 orr r0, r2, r4, lsl #22
    5c: e1a01524 lsr r1, r4, #10
    60: e89da9f0 ldm sp, {r4, r5, r6, r7, r8, fp, sp, pc}


    New code:

    c: e59f0058 ldr r0, [pc, #88] ; load address of oscr2ns_scale
    10: e5901000 ldr r1, [r0] ; read oscr2ns_scale <= pipeline stall
    14: e59f3054 ldr r3, [pc, #84] ; load address of __m_cnt_hi
    18: e1a08001 mov r8, r1
    1c: e5932000 ldr r2, [r3] ; read __m_cnt_hi
    20: e59f304c ldr r3, [pc, #76] ; load address of OSCR
    24: e1a09002 mov r9, r2
    28: e3a0a000 mov sl, #0 ; 0x0
    2c: e5933000 ldr r3, [r3] ; read OSCR
    ....
    58: e1825b04 orr r5, r2, r4, lsl #22
    5c: e1a06524 lsr r6, r4, #10
    60: e1a01006 mov r1, r6
    64: e1a00005 mov r0, r5
    68: e89daff0 ldm sp, {r4, r5, r6, r7, r8, r9, sl, fp, sp, pc}

    Versatile:

    1. 12 additional bytes of code
    2. same number of registers
    3. worse instruction scheduling causing pipeline stall

    Actual reading of variables/hardware is unaffected by this patch.

    So, we have two platforms where this patch makes things visibly worse
    with no material benefit.

    --
    Russell King
    Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
    maintainer of:

  6. Re: [PATCH] convert cnt32_to_63 to inline

    * Russell King (rmk+lkml@arm.linux.org.uk) wrote:
    > On Tue, Nov 11, 2008 at 01:28:00PM -0500, Mathieu Desnoyers wrote:
    > > Let's see what it looks like once implemented. Only compile-tested. Assumes
    > > pxa, sa110 and mn10300 are all UP-only. Correct smp_rmb() barriers are used
    > > for ARM Versatile.

    >
    > Versatile is also UP only.
    >
    > The following are results from PXA built with gcc 3.4.3:
    >
    > 1. two additional registers used in sched_clock()
    > 2. 8 additional bytes of code (which are needless if gcc was more intelligent)
    >
    > both of these I put down to inefficiencies in gcc's register allocation.
    >
    > 3. worse instruction scheduling - two inter-dependent loads next to each
    > other causing a pipeline stall
    >
    > Actual reading of variables/hardware is unaffected by this patch.
    >
    > Old code:
    >
    > c: e59f3050 ldr r3, [pc, #80] ; load address of oscr2ns_scale
    > 10: e59fc050 ldr ip, [pc, #80] ; load address of __m_cnt_hi
    > 14: e5932000 ldr r2, [r3] ; read oscr2ns_scale
    > 18: e59f304c ldr r3, [pc, #76] ; load address of OSCR
    > 1c: e59c1000 ldr r1, [ip] ; read __m_cnt_hi
    > 20: e1a07002 mov r7, r2
    > 24: e3a08000 mov r8, #0 ; 0x0
    > 28: e5933000 ldr r3, [r3] ; read OSCR register
    > ...
    > 58: e1820b04 orr r0, r2, r4, lsl #22
    > 5c: e1a01524 lsr r1, r4, #10
    > 60: e89da9f0 ldm sp, {r4, r5, r6, r7, r8, fp, sp, pc}
    >
    >
    > New code:
    >
    > c: e59f0058 ldr r0, [pc, #88] ; load address of oscr2ns_scale
    > 10: e5901000 ldr r1, [r0] ; read oscr2ns_scale <= pipeline stall
    > 14: e59f3054 ldr r3, [pc, #84] ; load address of __m_cnt_hi
    > 18: e1a08001 mov r8, r1
    > 1c: e5932000 ldr r2, [r3] ; read __m_cnt_hi
    > 20: e59f304c ldr r3, [pc, #76] ; load address of OSCR
    > 24: e1a09002 mov r9, r2
    > 28: e3a0a000 mov sl, #0 ; 0x0
    > 2c: e5933000 ldr r3, [r3] ; read OSCR
    > ...
    > 58: e1825b04 orr r5, r2, r4, lsl #22
    > 5c: e1a06524 lsr r6, r4, #10
    > 60: e1a01006 mov r1, r6
    > 64: e1a00005 mov r0, r5
    > 68: e89daff0 ldm sp, {r4, r5, r6, r7, r8, r9, sl, fp, sp, pc}
    >
    > Versatile:
    >
    > 1. 12 additional bytes of code
    > 2. same number of registers
    > 3. worse instruction scheduling causing pipeline stall
    >
    > Actual reading of variables/hardware is unaffected by this patch.
    >
    > So, we have two platforms where this patch makes things visibly worse
    > with no material benefit.
    >


    I think the added barrier() calls are causing these pipeline stalls. They
    don't allow the compiler to read variables such as oscr2ns_scale before
    the barrier, because gcc cannot assume they won't be modified. However, to
    ensure that the OSCR read is done after the __m_cnt_hi read, this barrier
    seems required to be safe against gcc optimizations.

    Have you compared my patch to Nicolas' patch, which adds an smp_rmb() in
    the macro, or to a vanilla tree?

    I wonder if reading those values sooner would help gcc?

    Mathieu


    --
    Mathieu Desnoyers
    OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
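
    For context on where the stalls come from: barrier() is a pure compiler
    barrier. A minimal sketch of its definition (as in
    include/linux/compiler-gcc.h) and of its side effect:

    #define barrier() __asm__ __volatile__("" : : : "memory")

    /* The "memory" clobber tells gcc that any memory location may have
     * changed: values cached in registers must be re-loaded afterwards,
     * and no load or store may be moved across the statement. That
     * constraint also covers unrelated data - oscr2ns_scale and the
     * literal-pool address loads included - which is what degrades the
     * instruction scheduling. */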

  7. Re: [PATCH] convert cnt32_to_63 to inline

    On Tue, 11 Nov 2008, Mathieu Desnoyers wrote:

    > * Nicolas Pitre (nico@cam.org) wrote:
    > > That has nothing to do with "a macro is faster" at all. It's all
    > > about the order used to evaluate the provided arguments. And the first one
    > > might be anything like a memory value, an IO operation, an expression,
    > > etc. An inline function would work correctly only with pointers, and
    > > would therefore totally break apart on x86 for example.
    > >
    > >
    > > Nicolas

    >
    > Let's see what it looks like once implemented. Only compile-tested. Assumes
    > pxa, sa110 and mn10300 are all UP-only. Correct smp_rmb() barriers are used
    > for ARM Versatile.
    >
    > Turn cnt32_to_63 into an inline function.
    > Change all callers to new API.
    > Document barrier usage.


    Look, I'm not interested at all in this mess.

    The _WHOLE_ point of the cnt32_to_63 macro was to abstract and
    encapsulate the subtlety of the algorithm. It initially started as an
    open-coded implementation in PXA's sched_clock(). Then I was asked to
    make it a macro that could be reused by other ARM platforms. Then David
    wanted to reuse it on platforms other than ARM.

    Now you are simply destroying all the value of having that macro in the
    first place. The argument is that the macro could be misused because it
    has a static variable inside it, etc. etc. The solution: spread the
    subtlety all around instead of keeping it in the macro, and risk having
    it go wrong or be broken by future changes surrounding it. And it _will_
    happen, because the increased exposure makes the whole idea even more
    fragile compared to having it concentrated in only one spot. This is
    total nonsense and I can't believe you truly think your patch makes
    things better...

    You're even disabling preemption where it is really unneeded, making the
    whole thing cost about double what it did initially. Look at the generated
    assembly and count the cycles if you don't believe me!

    No thank you. If this trend continues I'm going to make it back private
    to ARM again so you could pessimize your own code as much as you want.

    NACK.


    Nicolas
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  8. Re: [PATCH] convert cnt32_to_63 to inline

    On Tue, Nov 11, 2008 at 04:00:46PM -0500, Nicolas Pitre wrote:
    > No thank you. If this trend continues I'm going to make it back private
    > to ARM again so you could pessimize your own code as much as you want.


    As I've already stated several days ago, I think that's the right
    course of action. Given all the concerns raised, it's clearly not
    something that should have been allowed to become generic.

    So, let's just close this discussion off by taking that course of
    action.

    What's required is (in order):
    1. a local copy for MN10300 needs to be created and MN10300 converted to use it
    2. these two commits then need to be reverted:

    bc173c5789e1fc6065fd378edc815914b40ee86b
    b4f151ff899362fec952c45d166252c9912c041f

    Then our usage is again limited to sched_clock(), which is well understood
    and known to be problem-free.

    --
    Russell King
    Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
    maintainer of:

  9. Re: [PATCH] convert cnt32_to_63 to inline

    On Tue, Nov 11, 2008 at 03:11:30PM -0500, Mathieu Desnoyers wrote:
    > I think the added barrier() calls are causing these pipeline stalls. They
    > don't allow the compiler to read variables such as oscr2ns_scale before
    > the barrier, because gcc cannot assume they won't be modified. However, to
    > ensure that the OSCR read is done after the __m_cnt_hi read, this barrier
    > seems required to be safe against gcc optimizations.
    >
    > Have you compared my patch to Nicolas' patch, which adds an smp_rmb() in
    > the macro, or to a vanilla tree?


    Nicolas' patch compared to unmodified - there are fewer side effects,
    which come down to two pipeline stalls, whereas we had none with
    the unmodified code.

    One pipeline stall for loading the address of __m_cnt_hi and reading
    its value, followed by the same thing for oscr2ns_scale.

    I think this is showing the problem of compiler barriers - they are
    indiscriminate. They are total and complete barriers - not only do
    they act on the data, but also on the compiler's ability to emit code
    for generating the addresses of the data to be loaded.

    Clearly, none of the addresses of OSCR, __m_cnt_hi or oscr2ns_scale is
    ever going to change at run time - they are all stored in the
    literal pool, but by putting compiler barriers in, the compiler is
    being prevented from reading from the literal pool at the most
    appropriate point.

    So, I've tried this:

    unsigned long long sched_clock(void)
    {
    + unsigned long *oscr2ns_ptr = &oscr2ns_scale;
    unsigned long long v = cnt32_to_63(OSCR);
    - return (v * oscr2ns_scale) >> OSCR2NS_SCALE_FACTOR;
    + return (v * *oscr2ns_ptr) >> OSCR2NS_SCALE_FACTOR;
    }

    to try to explicitly code the loads. This unfortunately results in
    three pipeline stalls. I also tried swapping the two lines starting
    'unsigned long', without any improvement over not having those extra
    hacks to work around the barrier.

    So, let's summarise this:

    1. the existing code works, is correct on ARM, and is efficient.
    2. throwing barriers into the function makes it less efficient.
    3. re-engineering the code appears to make things worse.

    --
    Russell King
    Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
    maintainer of:

  10. Re: [PATCH] convert cnt32_to_63 to inline

    On Tue, 2008-11-11 at 22:31 +0000, David Howells wrote:
    > Mathieu Desnoyers wrote:
    >
    > > @@ -52,18 +57,22 @@ unsigned long long sched_clock(void)
    > > ...
    > > + preempt_disable_notrace();

    >
    > Please, no! sched_clock() is called with preemption or interrupts disabled
    > everywhere except from some debugging code (lock tracing IIRC). If you need
    > to insert this preemption disablement somewhere, please insert it there. At
    > least then sched_clock() will be called consistently.


    Agreed. You could do a WARN_ON(!in_atomic()); in sched_clock() depending
    on DEBUG_PREEMPT or something to ensure this.


  11. Re: [PATCH] convert cnt32_to_63 to inline

    Mathieu Desnoyers wrote:

    > @@ -52,18 +57,22 @@ unsigned long long sched_clock(void)
    > ...
    > + preempt_disable_notrace();


    Please, no! sched_clock() is called with preemption or interrupts disabled
    everywhere except from some debugging code (lock tracing IIRC). If you need
    to insert this preemption disablement somewhere, please insert it there. At
    least then sched_clock() will be called consistently.

    David

  12. Re: [PATCH] convert cnt32_to_63 to inline


    On Tue, 11 Nov 2008, Peter Zijlstra wrote:

    > On Tue, 2008-11-11 at 22:31 +0000, David Howells wrote:
    > > Mathieu Desnoyers wrote:
    > >
    > > > @@ -52,18 +57,22 @@ unsigned long long sched_clock(void)
    > > > ...
    > > > + preempt_disable_notrace();

    > >
    > > Please, no! sched_clock() is called with preemption or interrupts disabled
    > > everywhere except from some debugging code (lock tracing IIRC). If you need
    > > to insert this preemption disablement somewhere, please insert it there. At
    > > least then sched_clock() will be called consistently.

    >
    > Agreed. You could do a WARN_ON(!in_atomic()); in sched_clock() depending
    > on DEBUG_PREEMPT or something to ensure this.


    It would also be nice if this requirement (calling sched_clock() with
    preemption disabled) were documented somewhere more obvious.

    Doing as Peter suggested, adding a WARN_ON and documenting that this must
    be called with preemption disabled, would be nice.

    -- Steve
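
    A minimal sketch of the check being discussed; the helper name and the
    CONFIG_DEBUG_PREEMPT gating are illustrative assumptions, not from a
    posted patch:

    #include <linux/hardirq.h>  /* in_atomic() */
    #include <linux/kernel.h>   /* WARN_ON() */

    /* hypothetical guard to place at the top of sched_clock() */
    static inline void sched_clock_assert_atomic(void)
    {
    #ifdef CONFIG_DEBUG_PREEMPT
            WARN_ON(!in_atomic());  /* catches preemptible callers */
    #endif
    }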

