[RFC patch 00/18] Trace Clock v2 - Kernel
This is a discussion on [RFC patch 00/18] Trace Clock v2 - Kernel ; On Mon, 10 Nov 2008 18:15:32 -0500 (EST)
Nicolas Pitre wrote:
> On Mon, 10 Nov 2008, Andrew Morton wrote:
>
> > On Mon, 10 Nov 2008 16:34:54 -0500 (EST)
> > Nicolas Pitre wrote:
> >
> > ...
-
Re: [PATCH v2] clarify usage expectations for cnt32_to_63()
On Mon, 10 Nov 2008 18:15:32 -0500 (EST)
Nicolas Pitre wrote:
> On Mon, 10 Nov 2008, Andrew Morton wrote:
>
> > On Mon, 10 Nov 2008 16:34:54 -0500 (EST)
> > Nicolas Pitre wrote:
> >
> > > > It is far better to make the management of the state explicit and at
> > > > the control of the caller. Get the caller to allocate the state and
> > > > pass its address into this function. Simple, clear, explicit and
> > > > robust.
> > >
> > > Sigh... What about this compromize then?
> > >
> > > diff --git a/include/linux/cnt32_to_63.h b/include/linux/cnt32_to_63.h
> > > index 7605fdd..74ce767 100644
> > > --- a/include/linux/cnt32_to_63.h
> > > +++ b/include/linux/cnt32_to_63.h
> > > @@ -32,8 +32,9 @@ union cnt32_to_63 {
> > >
> > >
> > > /**
> > > - * cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
> > > + * __cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
> > > * @cnt_lo: The low part of the counter
> > > + * @cnt_hi_p: Pointer to storage for the extended part of the counter
> > > *
> > > * Many hardware clock counters are only 32 bits wide and therefore have
> > > * a relatively short period making wrap-arounds rather frequent. This
> > > @@ -75,16 +76,31 @@ union cnt32_to_63 {
> > > * clear-bit instruction. Otherwise caller must remember to clear the top
> > > * bit explicitly.
> > > */
> > > -#define cnt32_to_63(cnt_lo) \
> > > +#define __cnt32_to_63(cnt_lo, cnt_hi_p) \
> > > ({ \
> > > - static u32 __m_cnt_hi; \
> > > union cnt32_to_63 __x; \
> > > - __x.hi = __m_cnt_hi; \
> > > + __x.hi = *(cnt_hi_p); \
> > > smp_rmb(); \
> > > __x.lo = (cnt_lo); \
> > > if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
> > > - __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
> > > + *(cnt_hi_p) = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
> > > __x.val; \
> > > })
> >
> > This references its second argument twice, which can cause correctness
> > or efficiency problems.
> >
> > There is no reason that this had to be implemented in cpp.
> > Implementing it in C will fix the above problem.
>
> No, it won't, for correctness and efficiency reasons.
>
> And I've explained why already.
I'd be very surprised if you've really found a case where a macro is
faster than an inlined function. I don't think that has happened
before.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
Re: [PATCH v2] clarify usage expectations for cnt32_to_63()
On Mon, 10 Nov 2008, Andrew Morton wrote:
> On Mon, 10 Nov 2008 18:15:32 -0500 (EST)
> Nicolas Pitre wrote:
> > >
> > > This references its second argument twice, which can cause correctness
> > > or efficiency problems.
> > >
> > > There is no reason that this had to be implemented in cpp.
> > > Implementing it in C will fix the above problem.
> >
> > No, it won't, for correctness and efficiency reasons.
> >
> > And I've explained why already.
>
> I'd be very surprised if you've really found a case where a macro is
> faster than an inlined function. I don't think that has happened
> before.
But that's the way my Grandpa did it. With macros!
-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
Re: [PATCH v2] clarify usage expectations for cnt32_to_63()
On Mon, 10 Nov 2008, Andrew Morton wrote:
> On Mon, 10 Nov 2008 18:15:32 -0500 (EST)
> Nicolas Pitre wrote:
>
> > On Mon, 10 Nov 2008, Andrew Morton wrote:
> >
> > > This references its second argument twice, which can cause correctness
> > > or efficiency problems.
> > >
> > > There is no reason that this had to be implemented in cpp.
> > > Implementing it in C will fix the above problem.
> >
> > No, it won't, for correctness and efficiency reasons.
> >
> > And I've explained why already.
>
> I'd be very surprised if you've really found a case where a macro is
> faster than an inlined function. I don't think that has happened
> before.
That hasn't anything to do with "a macro is faster" at all. It's all
about the order used to evaluate provided arguments. And the first one
might be anything like a memory value, an IO operation, an expression,
etc. An inline function would work correctly with pointers only and
therefore totally break apart on x86 for example.
Nicolas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
[PATCH] convert cnt32_to_63 to inline
* Nicolas Pitre (nico@cam.org) wrote:
> On Mon, 10 Nov 2008, Andrew Morton wrote:
>
> > On Mon, 10 Nov 2008 18:15:32 -0500 (EST)
> > Nicolas Pitre wrote:
> >
> > > On Mon, 10 Nov 2008, Andrew Morton wrote:
> > >
> > > > This references its second argument twice, which can cause correctness
> > > > or efficiency problems.
> > > >
> > > > There is no reason that this had to be implemented in cpp.
> > > > Implementing it in C will fix the above problem.
> > >
> > > No, it won't, for correctness and efficiency reasons.
> > >
> > > And I've explained why already.
> >
> > I'd be very surprised if you've really found a case where a macro is
> > faster than an inlined function. I don't think that has happened
> > before.
>
> That hasn't anything to do with "a macro is faster" at all. It's all
> about the order used to evaluate provided arguments. And the first one
> might be anything like a memory value, an IO operation, an expression,
> etc. An inline function would work correctly with pointers only and
> therefore totally break apart on x86 for example.
>
>
> Nicolas
Let's see what it gives once implemented. Only compile-tested. Assumes
pxa, sa110 and mn10300 are all UP-only. Correct smp_rmb() are used for
arm versatile.
Turn cnt32_to_63 into an inline function.
Change all callers to new API.
Document barrier usage.
Signed-off-by: Mathieu Desnoyers
CC: Nicolas Pitre
CC: Andrew Morton
CC: torvalds@linux-foundation.org
CC: rmk+lkml@arm.linux.org.uk
CC: dhowells@redhat.com
CC: paulus@samba.org
CC: a.p.zijlstra@chello.nl
CC: mingo@elte.hu
CC: benh@kernel.crashing.org
CC: rostedt@goodmis.org
CC: tglx@linutronix.de
CC: davem@davemloft.net
CC: ralf@linux-mips.org
---
arch/arm/mach-pxa/time.c | 14 ++++++++++++-
arch/arm/mach-sa1100/generic.c | 15 +++++++++++++-
arch/arm/mach-versatile/core.c | 12 ++++++++++-
arch/mn10300/kernel/time.c | 19 +++++++++++++-----
include/linux/cnt32_to_63.h | 42 +++++++++++++++++++++++++++++------------
5 files changed, 82 insertions(+), 20 deletions(-)
Index: linux.trees.git/arch/arm/mach-pxa/time.c
================================================== =================
--- linux.trees.git.orig/arch/arm/mach-pxa/time.c 2008-11-11 12:20:42.000000000 -0500
+++ linux.trees.git/arch/arm/mach-pxa/time.c 2008-11-11 13:05:01.000000000 -0500
@@ -37,6 +37,10 @@
#define OSCR2NS_SCALE_FACTOR 10
static unsigned long oscr2ns_scale;
+static u32 sched_clock_cnt_hi; /*
+ * Shared cnt_hi OK with cycle counter only
+ * for UP systems.
+ */
static void __init set_oscr2ns_scale(unsigned long oscr_rate)
{
@@ -54,7 +58,15 @@ static void __init set_oscr2ns_scale(uns
unsigned long long sched_clock(void)
{
- unsigned long long v = cnt32_to_63(OSCR);
+ u32 cnt_lo, cnt_hi;
+ unsigned long long v;
+
+ preempt_disable_notrace();
+ cnt_hi = sched_clock_cnt_hi;
+ barrier(); /* read cnt_hi before cnt_lo */
+ cnt_lo = OSCR;
+ v = cnt32_to_63(cnt_hi, cnt_lo, &sched_clock_cnt_hi);
+ preempt_enable_notrace();
return (v * oscr2ns_scale) >> OSCR2NS_SCALE_FACTOR;
}
Index: linux.trees.git/include/linux/cnt32_to_63.h
================================================== =================
--- linux.trees.git.orig/include/linux/cnt32_to_63.h 2008-11-11 12:20:17.000000000 -0500
+++ linux.trees.git/include/linux/cnt32_to_63.h 2008-11-11 13:10:44.000000000 -0500
@@ -32,7 +32,9 @@ union cnt32_to_63 {
/**
* cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
- * @cnt_lo: The low part of the counter
+ * @cnt_hi: The high part of the counter (read first)
+ * @cnt_lo: The low part of the counter (read after cnt_hi)
+ * @cnt_hi_ptr: Pointer to high part of the counter
*
* Many hardware clock counters are only 32 bits wide and therefore have
* a relatively short period making wrap-arounds rather frequent. This
@@ -57,7 +59,10 @@ union cnt32_to_63 {
* code must be executed at least once per each half period of the 32-bit
* counter to properly update the state bit in memory. This is usually not a
* problem in practice, but if it is then a kernel timer could be scheduled
- * to manage for this code to be executed often enough.
+ * to manage for this code to be executed often enough. If a per-cpu cnt_hi is
+ * used, the value must be updated at least once per 32-bits half-period on each
+ * CPU. If cnt_hi is shared between CPUs, it suffice to update it once per
+ * 32-bits half-period on any CPU.
*
* Note that the top bit (bit 63) in the returned value should be considered
* as garbage. It is not cleared here because callers are likely to use a
@@ -65,16 +70,29 @@ union cnt32_to_63 {
* implicitly by making the multiplier even, therefore saving on a runtime
* clear-bit instruction. Otherwise caller must remember to clear the top
* bit explicitly.
+ *
+ * Preemption must be disabled when reading the cnt_hi and cnt_lo values and
+ * calling this function.
+ *
+ * The cnt_hi parameter _must_ be read before cnt_lo. This implies using the
+ * proper barriers :
+ * - smp_rmb() if cnt_lo is read from mmio and the cnt_hi variable is shared
+ * across CPUs.
+ * - use a per-cpu variable for cnt_high if cnt_lo is read from per-cpu cycles
+ * counters or to read the counters with only a barrier().
*/
-#define cnt32_to_63(cnt_lo) \
-({ \
- static volatile u32 __m_cnt_hi; \
- union cnt32_to_63 __x; \
- __x.hi = __m_cnt_hi; \
- __x.lo = (cnt_lo); \
- if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
- __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
- __x.val; \
-})
+static inline u64 cnt32_to_63(u32 cnt_hi, u32 cnt_lo, u32 *cnt_hi_ptr)
+{
+ union cnt32_to_63 __x = {
+ {
+ .hi = cnt_hi,
+ .lo = cnt_lo,
+ },
+ };
+
+ if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
+ *cnt_hi_ptr = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31);
+ return __x.val; /* Remember to clear the top bit in the caller */
+}
#endif
Index: linux.trees.git/arch/arm/mach-sa1100/generic.c
================================================== =================
--- linux.trees.git.orig/arch/arm/mach-sa1100/generic.c 2008-11-11 12:20:42.000000000 -0500
+++ linux.trees.git/arch/arm/mach-sa1100/generic.c 2008-11-11 13:05:10.000000000 -0500
@@ -34,6 +34,11 @@
unsigned int reset_status;
EXPORT_SYMBOL(reset_status);
+static u32 sched_clock_cnt_hi; /*
+ * Shared cnt_hi OK with cycle counter only
+ * for UP systems.
+ */
+
#define NR_FREQS 16
/*
@@ -133,7 +138,15 @@ EXPORT_SYMBOL(cpufreq_get);
*/
unsigned long long sched_clock(void)
{
- unsigned long long v = cnt32_to_63(OSCR);
+ u32 cnt_lo, cnt_hi;
+ unsigned long long v;
+
+ preempt_disable_notrace();
+ cnt_hi = sched_clock_cnt_hi;
+ barrier(); /* read cnt_hi before cnt_lo */
+ cnt_lo = OSCR;
+ v = cnt32_to_63(cnt_hi, cnt_lo, &sched_clock_cnt_hi);
+ preempt_enable_notrace();
/* the <<1 gets rid of the cnt_32_to_63 top bit saving on a bic insn */
v *= 78125<<1;
Index: linux.trees.git/arch/arm/mach-versatile/core.c
================================================== =================
--- linux.trees.git.orig/arch/arm/mach-versatile/core.c 2008-11-11 12:20:42.000000000 -0500
+++ linux.trees.git/arch/arm/mach-versatile/core.c 2008-11-11 12:57:55.000000000 -0500
@@ -60,6 +60,8 @@
#define VA_VIC_BASE __io_address(VERSATILE_VIC_BASE)
#define VA_SIC_BASE __io_address(VERSATILE_SIC_BASE)
+static u32 sched_clock_cnt_hi;
+
static void sic_mask_irq(unsigned int irq)
{
irq -= IRQ_SIC_START;
@@ -238,7 +240,15 @@ void __init versatile_map_io(void)
*/
unsigned long long sched_clock(void)
{
- unsigned long long v = cnt32_to_63(readl(VERSATILE_REFCOUNTER));
+ u32 cnt_lo, cnt_hi;
+ unsigned long long v;
+
+ preempt_disable_notrace();
+ cnt_hi = sched_clock_cnt_hi;
+ smp_rmb();
+ cnt_lo = readl(VERSATILE_REFCOUNTER);
+ v = cnt32_to_63(cnt_hi, cnt_lo, &sched_clock_cnt_hi);
+ preempt_enable_notrace();
/* the <<1 gets rid of the cnt_32_to_63 top bit saving on a bic insn */
v *= 125<<1;
Index: linux.trees.git/arch/mn10300/kernel/time.c
================================================== =================
--- linux.trees.git.orig/arch/mn10300/kernel/time.c 2008-11-11 12:41:42.000000000 -0500
+++ linux.trees.git/arch/mn10300/kernel/time.c 2008-11-11 13:04:42.000000000 -0500
@@ -29,6 +29,11 @@ unsigned long mn10300_iobclk; /* system
unsigned long mn10300_tsc_per_HZ; /* number of ioclks per jiffy */
#endif /* CONFIG_MN10300_RTC */
+static u32 sched_clock_cnt_hi; /*
+ * shared cnt_hi OK with cycle counter only
+ * for UP systems.
+ */
+
static unsigned long mn10300_last_tsc; /* time-stamp counter at last time
* interrupt occurred */
@@ -52,18 +57,22 @@ unsigned long long sched_clock(void)
unsigned long long ll;
unsigned l[2];
} tsc64, result;
- unsigned long tsc, tmp;
+ unsigned long tmp;
unsigned product[3]; /* 96-bit intermediate value */
+ u32 cnt_lo, cnt_hi;
- /* read the TSC value
- */
- tsc = 0 - get_cycles(); /* get_cycles() counts down */
+ preempt_disable_notrace();
+ cnt_hi = sched_clock_cnt_hi;
+ barrier(); /* read cnt_hi before cnt_lo */
+ cnt_lo = 0 - get_cycles(); /* get_cycles() counts down */
/* expand to 64-bits.
* - sched_clock() must be called once a minute or better or the
* following will go horribly wrong - see cnt32_to_63()
*/
- tsc64.ll = cnt32_to_63(tsc) & 0x7fffffffffffffffULL;
+ tsc64.ll = cnt32_to_63(cnt_hi, cnt_lo, &sched_clock_cnt_hi);
+ tsc64.ll &= 0x7fffffffffffffffULL;
+ preempt_enable_notrace();
/* scale the 64-bit TSC value to a nanosecond value via a 96-bit
* intermediate
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
Re: [PATCH] convert cnt32_to_63 to inline
On Tue, Nov 11, 2008 at 01:28:00PM -0500, Mathieu Desnoyers wrote:
> Let's see what it gives once implemented. Only compile-tested. Assumes
> pxa, sa110 and mn10300 are all UP-only. Correct smp_rmb() are used for
> arm versatile.
Versatile is also UP only.
The following are results from PXA built with gcc 3.4.3:
1. two additional registers used in sched_clock()
2. 8 additional bytes of code (which are needless if gcc was more inteligent)
both of these I put down to inefficiencies in gcc's register allocation.
3. worse instruction scheduling - two inter-dependent loads next to each
other causing a pipeline stall
Actual reading of variables/hardware is unaffected by this patch.
Old code:
c: e59f3050 ldr r3, [pc, #80] ; load address of oscr2ns_scale
10: e59fc050 ldr ip, [pc, #80] ; load address of __m_cnt_hi
14: e5932000 ldr r2, [r3] ; read oscr2ns_scale
18: e59f304c ldr r3, [pc, #76] ; load address of OSCR
1c: e59c1000 ldr r1, [ip] ; read __m_cnt_hi
20: e1a07002 mov r7, r2
24: e3a08000 mov r8, #0 ; 0x0
28: e5933000 ldr r3, [r3] ; read OSCR register
....
58: e1820b04 orr r0, r2, r4, lsl #22
5c: e1a01524 lsr r1, r4, #10
60: e89da9f0 ldm sp, {r4, r5, r6, r7, r8, fp, sp, pc}
New code:
c: e59f0058 ldr r0, [pc, #88] ; load address of oscr2ns_scale
10: e5901000 ldr r1, [r0] ; read oscr2ns_scale <= pipeline stall
14: e59f3054 ldr r3, [pc, #84] ; load address of __m_cnt_hi
18: e1a08001 mov r8, r1
1c: e5932000 ldr r2, [r3] ; read __m_cnt_hi
20: e59f304c ldr r3, [pc, #76] ; load address of OSCR
24: e1a09002 mov r9, r2
28: e3a0a000 mov sl, #0 ; 0x0
2c: e5933000 ldr r3, [r3] ; read OSCR
....
58: e1825b04 orr r5, r2, r4, lsl #22
5c: e1a06524 lsr r6, r4, #10
60: e1a01006 mov r1, r6
64: e1a00005 mov r0, r5
68: e89daff0 ldm sp, {r4, r5, r6, r7, r8, r9, sl, fp, sp, pc}
Versatile:
1. 12 additional bytes of code
2. same number of registers
3. worse instruction scheduling causing pipeline stall
Actual reading of variables/hardware is unaffected by this patch.
So, we have two platforms where this patch makes things visibly worse
with no material benefit.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
Re: [PATCH] convert cnt32_to_63 to inline
* Russell King (rmk+lkml@arm.linux.org.uk) wrote:
> On Tue, Nov 11, 2008 at 01:28:00PM -0500, Mathieu Desnoyers wrote:
> > Let's see what it gives once implemented. Only compile-tested. Assumes
> > pxa, sa110 and mn10300 are all UP-only. Correct smp_rmb() are used for
> > arm versatile.
>
> Versatile is also UP only.
>
> The following are results from PXA built with gcc 3.4.3:
>
> 1. two additional registers used in sched_clock()
> 2. 8 additional bytes of code (which are needless if gcc was more inteligent)
>
> both of these I put down to inefficiencies in gcc's register allocation.
>
> 3. worse instruction scheduling - two inter-dependent loads next to each
> other causing a pipeline stall
>
> Actual reading of variables/hardware is unaffected by this patch.
>
> Old code:
>
> c: e59f3050 ldr r3, [pc, #80] ; load address of oscr2ns_scale
> 10: e59fc050 ldr ip, [pc, #80] ; load address of __m_cnt_hi
> 14: e5932000 ldr r2, [r3] ; read oscr2ns_scale
> 18: e59f304c ldr r3, [pc, #76] ; load address of OSCR
> 1c: e59c1000 ldr r1, [ip] ; read __m_cnt_hi
> 20: e1a07002 mov r7, r2
> 24: e3a08000 mov r8, #0 ; 0x0
> 28: e5933000 ldr r3, [r3] ; read OSCR register
> ...
> 58: e1820b04 orr r0, r2, r4, lsl #22
> 5c: e1a01524 lsr r1, r4, #10
> 60: e89da9f0 ldm sp, {r4, r5, r6, r7, r8, fp, sp, pc}
>
>
> New code:
>
> c: e59f0058 ldr r0, [pc, #88] ; load address of oscr2ns_scale
> 10: e5901000 ldr r1, [r0] ; read oscr2ns_scale <= pipeline stall
> 14: e59f3054 ldr r3, [pc, #84] ; load address of __m_cnt_hi
> 18: e1a08001 mov r8, r1
> 1c: e5932000 ldr r2, [r3] ; read __m_cnt_hi
> 20: e59f304c ldr r3, [pc, #76] ; load address of OSCR
> 24: e1a09002 mov r9, r2
> 28: e3a0a000 mov sl, #0 ; 0x0
> 2c: e5933000 ldr r3, [r3] ; read OSCR
> ...
> 58: e1825b04 orr r5, r2, r4, lsl #22
> 5c: e1a06524 lsr r6, r4, #10
> 60: e1a01006 mov r1, r6
> 64: e1a00005 mov r0, r5
> 68: e89daff0 ldm sp, {r4, r5, r6, r7, r8, r9, sl, fp, sp, pc}
>
> Versatile:
>
> 1. 12 additional bytes of code
> 2. same number of registers
> 3. worse instruction scheduling causing pipeline stall
>
> Actual reading of variables/hardware is unaffected by this patch.
>
> So, we have two platforms where this patch makes things visibly worse
> with no material benefit.
>
I think the added barrier() are causing these pipeline stalls. They
don't allow the compiler to read variables such as oscr2ns_scale before
the barrier because gcc cannot assume it won't be modified. However, to
insure that OSCR read is done after __m_cnt_hi read, this barrier seems
required to be safe against gcc optimizations.
Have you compared my patch to Nicolas'patch, which adds a smp_rmb() in
the macro or to a vanilla tree ?
I wonder if reading those values sooner would help gcc ?
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
Re: [PATCH] convert cnt32_to_63 to inline
On Tue, 11 Nov 2008, Mathieu Desnoyers wrote:
> * Nicolas Pitre (nico@cam.org) wrote:
> > That hasn't anything to do with "a macro is faster" at all. It's all
> > about the order used to evaluate provided arguments. And the first one
> > might be anything like a memory value, an IO operation, an expression,
> > etc. An inline function would work correctly with pointers only and
> > therefore totally break apart on x86 for example.
> >
> >
> > Nicolas
>
> Let's see what it gives once implemented. Only compile-tested. Assumes
> pxa, sa110 and mn10300 are all UP-only. Correct smp_rmb() are used for
> arm versatile.
>
> Turn cnt32_to_63 into an inline function.
> Change all callers to new API.
> Document barrier usage.
Look, I'm not interested at all into this mess.
The _WHOLE_ point of the cnt32_to_63 macro was to abstract and
encapsulate the subtlety of the algorithm. It initially started as an
open coded implementation in PXA's sched_clock(). Then I was asked to
make it a macro that can be reused for other ARM platforms. Then David
wanted to reuse it on other platforms than ARM.
Now you are simply destroying all the value of having that macro in the
first place. The argument is that the macro could be misused because it
has a static variable inside it, etc. etc. The solution: spread the
subtlety all around instead of keeping it into the macro and risk having
it wrong or broken due to future changes surrounding it in the future.
And it _will_ happen due to the increased exposure making the whole idea
even more fragile compared to having it concentrated in only one spot.
This is a total non sense and I can't believe you truly think your patch
makes things better...
You're even disabling preemption where it is really unneeded making the
whole thing about the double its initial cost. Look at the generated
assembly and count the cycles if you don't believe me!
No thank you. If this trend continues I'm going to make it back private
to ARM again so you could pessimize your own code as much as you want.
NACK.
Nicolas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
Re: [PATCH] convert cnt32_to_63 to inline
On Tue, Nov 11, 2008 at 04:00:46PM -0500, Nicolas Pitre wrote:
> No thank you. If this trend continues I'm going to make it back private
> to ARM again so you could pessimize your own code as much as you want.
As I've already stated several days ago, I think that's the right
course of action. Given all the concerns raised, it's clearly not
something that should have been allowed to become generic.
So, let's just close this discussion off by taking that course of
action.
What's required is (in order):
1. a local copy for NM10300 needs to be created and it converted to that
2. these two commits then need to be reverted:
bc173c5789e1fc6065fd378edc815914b40ee86b
b4f151ff899362fec952c45d166252c9912c041f
Then our usage is again limited to sched_clock() which is well understood
and known to be problem free.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
Re: [PATCH] convert cnt32_to_63 to inline
On Tue, Nov 11, 2008 at 03:11:30PM -0500, Mathieu Desnoyers wrote:
> I think the added barrier() are causing these pipeline stalls. They
> don't allow the compiler to read variables such as oscr2ns_scale before
> the barrier because gcc cannot assume it won't be modified. However, to
> insure that OSCR read is done after __m_cnt_hi read, this barrier seems
> required to be safe against gcc optimizations.
>
> Have you compared my patch to Nicolas'patch, which adds a smp_rmb() in
> the macro or to a vanilla tree ?
Nicolas' patch compared to unmodified - there's less side effects,
which come down to two pipeline stalls whereas we had none with
the unmodified code.
One pipeline stall for loading the address of __m_cnt_hi and reading
its value, followed by the same thing for oscr2ns_scale.
I think this is showing the problem of compiler barriers - they are
indescriminate. They are total and complete barriers - not only do
they act on the data but also the compilers ability to emit code for
generating the addresses of the data to be loaded.
Clearly, the address of OSCR, __m_cnt_hi nor oscr2ns_scale is ever
going to change at run time - their addresses are all stored in the
literal pool, but by putting compiler barriers in, the compiler is
being prevented from reading from the literal pool at the most
appropriate point.
So, I've tried this:
unsigned long long sched_clock(void)
{
+ unsigned long *oscr2ns_ptr = &oscr2ns_scale;
unsigned long long v = cnt32_to_63(OSCR);
- return (v * oscr2ns_scale) >> OSCR2NS_SCALE_FACTOR;
+ return (v * *oscr2ns_ptr) >> OSCR2NS_SCALE_FACTOR;
}
to try to explicitly code the loads. This unfortunately results in
three pipeline stalls. Also tried swapping the two lines starting
'unsigned long' without any improvement on not having those extra hacks
to work around the barrier.
So, let's summarise this:
1. the existing code works, is correct on ARM, and is efficient.
2. throwing barriers into the function makes it less efficient.
3. re-engineering the code appears to make things worse.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
Re: [PATCH] convert cnt32_to_63 to inline
On Tue, 2008-11-11 at 22:31 +0000, David Howells wrote:
> Mathieu Desnoyers wrote:
>
> > @@ -52,18 +57,22 @@ unsigned long long sched_clock(void)
> > ...
> > + preempt_disable_notrace();
>
> Please, no! sched_clock() is called with preemption or interrupts disabled
> everywhere except from some debugging code (lock tracing IIRC). If you need
> to insert this preemption disablement somewhere, please insert it there. At
> least then sched_clock() will be called consistently.
Agreed. You could do a WARN_ON(!in_atomic); in sched_clock() depending
on DEBUG_PREEMPT or something to ensure this.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
Re: [PATCH] convert cnt32_to_63 to inline
Mathieu Desnoyers wrote:
> @@ -52,18 +57,22 @@ unsigned long long sched_clock(void)
> ...
> + preempt_disable_notrace();
Please, no! sched_clock() is called with preemption or interrupts disabled
everywhere except from some debugging code (lock tracing IIRC). If you need
to insert this preemption disablement somewhere, please insert it there. At
least then sched_clock() will be called consistently.
David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
Re: [PATCH] convert cnt32_to_63 to inline
On Tue, 11 Nov 2008, Peter Zijlstra wrote:
> On Tue, 2008-11-11 at 22:31 +0000, David Howells wrote:
> > Mathieu Desnoyers wrote:
> >
> > > @@ -52,18 +57,22 @@ unsigned long long sched_clock(void)
> > > ...
> > > + preempt_disable_notrace();
> >
> > Please, no! sched_clock() is called with preemption or interrupts disabled
> > everywhere except from some debugging code (lock tracing IIRC). If you need
> > to insert this preemption disablement somewhere, please insert it there. At
> > least then sched_clock() will be called consistently.
>
> Agreed. You could do a WARN_ON(!in_atomic); in sched_clock() depending
> on DEBUG_PREEMPT or something to ensure this.
It would also be nice if this requirement (calling sched_clock with
preemption disabled) was documented somewhere more obvious.
Doing as Peter suggested, adding a WARN_ON and documenting that this must
be called with preemption disabled, would be nice.
-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/