On 28 Nov, To: [email]kmarx@vicor.com[/email] wrote:[color=blue]

> On 24 Nov, Ken Marx wrote:[color=green]

>>

>>

>> Don Lewis wrote:[/color]

>[color=green][color=darkred]

>>> Index: sys/kern/vfs_bio.c

>>> ===================================================================

>>> RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v

>>> retrieving revision 1.242.2.21

>>> diff -u -r1.242.2.21 vfs_bio.c

>>> --- sys/kern/vfs_bio.c 9 Aug 2003 16:21:19 -0000 1.242.2.21

>>> +++ sys/kern/vfs_bio.c 18 Nov 2003 02:10:55 -0000

>>> @@ -140,6 +140,7 @@

>>> &bufreusecnt, 0, "");

>>>

>>> static int bufhashmask;

>>> +static int bufhashshift;

>>> static LIST_HEAD(bufhashhdr, buf) *bufhashtbl, invalhash;

>>> struct bqueues bufqueues[BUFFER_QUEUES] = { { 0 } };

>>> char *buf_wmesg = BUF_WMESG;

>>> @@ -160,7 +161,20 @@

>>> struct bufhashhdr *

>>> bufhash(struct vnode *vnp, daddr_t bn)

>>> {

>>> - return(&bufhashtbl[(((uintptr_t)(vnp) >> 7) + (int)bn) & bufhashmask]);

>>> + u_int64_t hashkey64;

>>> + int hashkey;

>>> +

>>> + /*

>>> + * Fibonacci hash, see Knuth's

>>> + * _Art of Computer Programming, Volume 3 / Sorting and Searching_

>>> + *

>>> + * We reduce the argument to 32 bits before doing the hash to

>>> + * avoid the need for a slow 64x64 multiply on 32 bit platforms.

>>> + */

>>> + hashkey64 = (u_int64_t)(uintptr_t)vnp + (u_int64_t)bn;

>>> + hashkey = (((u_int32_t)(hashkey64 + (hashkey64 >> 32)) * 2654435769u) >>

>>> + bufhashshift) & bufhashmask;

>>> + return(&bufhashtbl[hashkey]);

>>> }

>>>

>>> /*

>>> @@ -319,8 +333,9 @@

>>> bufhashinit(caddr_t vaddr)

>>> {

>>> /* first, make a null hash table */

>>> + bufhashshift = 29;

>>> for (bufhashmask = 8; bufhashmask < nbuf / 4; bufhashmask <<= 1)

>>> - ;

>>> + bufhashshift--;

>>> bufhashtbl = (void *)vaddr;

>>> vaddr = vaddr + sizeof(*bufhashtbl) * bufhashmask;

>>> --bufhashmask;

>>>

>>>[/color]

>>

>> Well, I'm mildly beflummoxed - I tried to compare hashtable preformance

>> between all three known versions of the hashing - legacy power of 2,

>> the Vicor ^= hash, and Don's fibonacci hash.

>>

>> Running with

>>

>> minifree = max( 1, avgifree / 4 );

>> minbfree = max( 1, avgbfree );

>>

>> all perform about the same, with no performance problems all

>> the way up to 100% disk capacity (didn't test into reserved space).

>>

>> Looking at instrumentation to show freq and avg depth of the

>> hash buckets, everything seems very calm (mainly because

>> we're not hitting the linear searching very often, I'd presume).

>>

>> I can't explain why I seemlingly got performance problems

>> with similar (identical) minbfree code previously.

>>

>> So, out of spite, I went back to

>>

>> minbfree = max( 1, avgbfree/4 );

>>

>> This does hit the hashtable harder for the legacy version

>> and not so much for either new flavor. Here are a few

>> samplings of calling my dump routine from the debugger.

>> "avgdepth" really means 'search depth' since we use

>> the depth reached after finding a bp in gbincore.

>>

>> The line below such as,

>>

>> 0: avgdepth[1] cnt=801

>>

>> means that 801 of the hashtable buckets had an avg search

>> depth of 1 at the time the debug routine was called.

>> The 'N:' prefix means the N-th unique non-zero such value.

>> So large cnt's for small []'d depth values means an efficient hash.

>>

>> I've edited out the details as much as possible.

>>

>> LEGACY:

>> --------

>> Nov 24 13:34:54 oos0b /kernel: bh[442/0x1ba]: freq=2706110, avgdepth = 154

>> ...

>> Nov 24 13:34:54 oos0b /kernel: 0: avgdepth[1] cnt=1015

>> Nov 24 13:34:54 oos0b /kernel: 1: avgdepth[2] cnt=7

>> Nov 24 13:34:54 oos0b /kernel: 2: avgdepth[154] cnt=1 <- !!

>> Nov 24 13:34:54 oos0b /kernel: 3: avgdepth[3] cnt=1

>> -----------

>>

>> Nov 24 13:36:49 oos0b /kernel: bh[442/0x1ba]: freq=3416953, avgdepth = 141

>> ...

>> Nov 24 13:36:49 oos0b /kernel: 0: avgdepth[1] cnt=1017

>> Nov 24 13:36:49 oos0b /kernel: 1: avgdepth[141] cnt=1

>> Nov 24 13:36:49 oos0b /kernel: 2: avgdepth[2] cnt=6

>>

>> VICOR x-or hashtable:

>> ---------------------

>> Nov 24 13:07:24 oos0b /kernel: 0: avgdepth[1] cnt=762

>> Nov 24 13:07:24 oos0b /kernel: 1: avgdepth[2] cnt=259

>> Nov 24 13:07:24 oos0b /kernel: 2: avgdepth[3] cnt=3

>> -----------

>>

>> Nov 24 13:08:07 oos0b /kernel: 0: avgdepth[1] cnt=744

>> Nov 24 13:08:07 oos0b /kernel: 1: avgdepth[2] cnt=275

>> Nov 24 13:08:07 oos0b /kernel: 2: avgdepth[3] cnt=5

>>

>> FIBONACCI:

>> ----------

>> Nov 24 11:56:50 oos0b /kernel: 0: avgdepth[1] cnt=811

>> Nov 24 11:56:50 oos0b /kernel: 1: avgdepth[3] cnt=88

>> Nov 24 11:56:50 oos0b /kernel: 2: avgdepth[2] cnt=124

>> Nov 24 11:56:50 oos0b /kernel: 3: avgdepth[0] cnt=1

>> -----------

>>

>> Nov 24 11:57:48 oos0b /kernel: 0: avgdepth[1] cnt=801

>> Nov 24 11:57:48 oos0b /kernel: 1: avgdepth[3] cnt=93

>> Nov 24 11:57:48 oos0b /kernel: 2: avgdepth[2] cnt=130

>>

>> So, while this is far from analytically eshaustive,

>> it almost appears the fibonacci hash has more entries

>> of depth 3, while the Vicor one has more at depth 2.

>>

>> I'm happy to run more tests if you have ideas. I'm also fine

>> to cut bait and go with whatever you decide. It *seems* like

>> putting the fibonacci hash is prudent since the current hash

>> has been observed to be expensive. I had trouble proving this

>> unequivocally though. So, perhaps Don's minbfree fix is sufficient

>> after all. I'm tempted at this point to go with the 100% flavor.[/color]

>

> I think we're running into one of the weaknesses in the Fibonacci hash.

> There are a large number of hash entries for the cylinder group blocks,

> which are located at offsets which are multiples of 89 * 2^10 in your

> example, or something on the order of 2^16. The effect of this is for

> the cylinder group number to be hashed using the least significant bits

> of the hash multiplier, which don't work as well for distributing the

> hash values. I tried some of Knuth's suggestions, and got better

> results with the hash multiplier 0x9E376DB1u. The most significant 16

> bits of the multplier are the same as the original constant, and the

> least significant bits act as a fraction in the desirable range of 1/3

> to 3/7. Please give this new hash multiplier a try.[/color]

I went ahead and spun a new version of my patch with the new multiplier,

one other tweak to the formula, and updated comments.

Index: sys/kern/vfs_bio.c

===================================================================

RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v

retrieving revision 1.242.2.21

diff -u -r1.242.2.21 vfs_bio.c

--- sys/kern/vfs_bio.c 9 Aug 2003 16:21:19 -0000 1.242.2.21

+++ sys/kern/vfs_bio.c 28 Nov 2003 20:02:06 -0000

@@ -140,6 +140,7 @@

&bufreusecnt, 0, "");

static int bufhashmask;

+static int bufhashshift;

static LIST_HEAD(bufhashhdr, buf) *bufhashtbl, invalhash;

struct bqueues bufqueues[BUFFER_QUEUES] = { { 0 } };

char *buf_wmesg = BUF_WMESG;

@@ -160,7 +161,40 @@

struct bufhashhdr *

bufhash(struct vnode *vnp, daddr_t bn)

{

- return(&bufhashtbl[(((uintptr_t)(vnp) >> 7) + (int)bn) & bufhashmask]);

+ u_int64_t hashkey64;

+ int hashkey;

+

+ /*

+ * A variation on the Fibonacci hash that Knuth credits to

+ * R. W. Floyd, see Knuth's _Art of Computer Programming,

+ * Volume 3 / Sorting and Searching_

+ *

+ * We reduce the argument to 32 bits before doing the hash to

+ * avoid the need for a slow 64x64 multiply on 32 bit platforms.

+ *

+ * sizeof(struct vnode) is 168 on i386, so toss some of the lower

+ * bits of the vnode address to reduce the key range, which

+ * improves the distribution of keys across buckets.

+ *

+ * The file system cylinder group blocks are very heavily

+ * used. They are located at invervals of fbg, which is

+ * on the order of 89 to 94 * 2^10, depending on other

+ * filesystem parameters, for a 16k block size. Smaller block

+ * sizes will reduce fpg approximately proportionally. This

+ * will cause the cylinder group index to be hashed using the

+ * lower bits of the hash multiplier, which will not distribute

+ * the keys as uniformly in a classic Fibonacci hash where a

+ * relatively small number of the upper bits of the result

+ * are used. Using 2^16 as a close-enough approximation to

+ * fpg, split the hash multiplier in half, with the upper 16

+ * bits being the inverse of the golden ratio, and the lower

+ * 16 bits being a fraction between 1/3 and 3/7 (closer to

+ * 3/7 in this case), that gives good experimental results.

+ */

+ hashkey64 = ((u_int64_t)(uintptr_t)vnp >> 3) + (u_int64_t)bn;

+ hashkey = (((u_int32_t)(hashkey64 + (hashkey64 >> 32)) * 0x9E376DB1u) >>

+ bufhashshift) & bufhashmask;

+ return(&bufhashtbl[hashkey]);

}

/*

@@ -319,8 +353,9 @@

bufhashinit(caddr_t vaddr)

{

/* first, make a null hash table */

+ bufhashshift = 29;

for (bufhashmask = 8; bufhashmask < nbuf / 4; bufhashmask <<= 1)

- ;

+ bufhashshift--;

bufhashtbl = (void *)vaddr;

vaddr = vaddr + sizeof(*bufhashtbl) * bufhashmask;

--bufhashmask;

_______________________________________________

[email]freebsd-fs@freebsd.org[/email] mailing list

[url]http://lists.freebsd.org/mailman/listinfo/freebsd-fs[/url]

To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"