Thread: Use simplehash.h instead of dynahash in SMgr

Use simplehash.h instead of dynahash in SMgr

From: David Rowley
Hackers,

Last year, when working on making compactify_tuples() go faster for
19c60ad69, I did quite a bit of benchmarking of the recovery process.
The next thing that was slow after compactify_tuples() was the hash
lookups done in smgropen().

Currently, we use dynahash hash tables to store the SMgrRelation so we
can perform fast lookups by RelFileNodeBackend. However, I had in mind
that a simplehash table might perform better. So I tried it...

The attached converts the hash table lookups done in smgr.c to use
simplehash instead of dynahash.

This does require a few changes in simplehash.h to make it work.  The
reason is that RelationData.rd_smgr points directly into the hash table
entries.  That works fine for dynahash, as that implementation never
reallocates or moves existing items around in the table.  simplehash,
however, moves entries around all the time, so we can't point directly at
the hash entries and expect those pointers to remain valid after adding or
removing anything else from the table.

To work around that, I've made an additional type to serve as the hash
entry: it holds a pointer to the SMgrRelationData along with the hash
status and hash value. It's just 16 bytes (or 12 on 32-bit machines).  I
opted to keep the hash key in the SMgrRelationData rather than duplicating
it, as that keeps the SMgrEntry struct nice and small.  We only need to
dereference the SMgrRelation pointer when we find an entry with the same
hash value.  The chances are quite good that an entry with the same hash
value is the one we want, so the additional dereference to compare the key
won't happen very often.
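
In code form, the entry type looks roughly like this (a sketch; the field
names here are illustrative rather than exactly those in the patch):

typedef struct SMgrEntry
{
    int         status;     /* simplehash entry status */
    uint32      hash;       /* hash value of the entry's key */
    SMgrRelation data;      /* palloc'd SMgrRelationData; also holds the
                             * hash key (smgr_rnode) */
} SMgrEntry;

With a 4-byte status, a 4-byte hash and an 8-byte pointer, that's the 16
bytes (or 12 with a 4-byte pointer on 32-bit machines) mentioned above.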

I did experiment with putting the hash key in SMgrEntry and found it
to be quite a bit slower.  I also tried hash_bytes(), but found that
building a hash function around murmurhash32 was quite a bit faster.
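
For illustration, that function has roughly this shape (a sketch only; the
exact field order and combining steps in the patch may differ.
murmurhash32() is the 32-bit murmur finalizer from common/hashfn.h):

static inline uint32
relfilenodebackend_hash(RelFileNodeBackend *rnode)
{
    uint32      hashkey;

    hashkey = murmurhash32((uint32) rnode->node.spcNode);

    /* rotate hashkey left 1 bit at each step */
    hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
    hashkey ^= murmurhash32((uint32) rnode->node.dbNode);

    hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
    hashkey ^= murmurhash32((uint32) rnode->node.relNode);

    hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
    hashkey ^= murmurhash32((uint32) rnode->backend);

    return hashkey;
}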

Benchmarking
===========

I did some of that.  It made my test case about 10% faster.

The test case was basically inserting 100 million rows one at a time
into a hash partitioned table with 1000 partitions, 2 int columns and a
primary key on one of those columns. It was about 12GB of WAL. I used a
hash partitioned table in the hope of creating a fairly random-looking
SMgr hash table access pattern; hopefully something similar to what
might happen in the real world.

Over 10 runs of recovery, master took an average of 124.89 seconds.
The patched version took 113.59 seconds. About 10% faster.

I bumped shared_buffers up to 10GB, max_wal_size to 20GB and
checkpoint_timeout to 60 mins.

To make the benchmark easier to repeat, I applied the attached
recovery_panic.patch.txt.  This just PANICs at the end of recovery so
that the database shuts down before performing the end-of-recovery
checkpoint.  Just start the database up again to do another run.

I did 10 runs.  The end of recovery log message reported:

master (aa271209f)
CPU: user: 117.89 s, system: 5.70 s, elapsed: 123.65 s
CPU: user: 117.81 s, system: 5.74 s, elapsed: 123.62 s
CPU: user: 119.39 s, system: 5.75 s, elapsed: 125.20 s
CPU: user: 117.98 s, system: 4.39 s, elapsed: 122.41 s
CPU: user: 117.92 s, system: 4.79 s, elapsed: 122.76 s
CPU: user: 119.84 s, system: 4.75 s, elapsed: 124.64 s
CPU: user: 120.60 s, system: 5.82 s, elapsed: 126.49 s
CPU: user: 118.74 s, system: 5.71 s, elapsed: 124.51 s
CPU: user: 124.29 s, system: 6.79 s, elapsed: 131.14 s
CPU: user: 118.73 s, system: 5.67 s, elapsed: 124.47 s

master + v1 patch
CPU: user: 106.90 s, system: 4.45 s, elapsed: 111.39 s
CPU: user: 107.31 s, system: 5.98 s, elapsed: 113.35 s
CPU: user: 107.14 s, system: 5.58 s, elapsed: 112.77 s
CPU: user: 105.79 s, system: 5.64 s, elapsed: 111.48 s
CPU: user: 105.78 s, system: 5.80 s, elapsed: 111.63 s
CPU: user: 113.18 s, system: 6.21 s, elapsed: 119.45 s
CPU: user: 107.74 s, system: 4.57 s, elapsed: 112.36 s
CPU: user: 107.42 s, system: 4.62 s, elapsed: 112.09 s
CPU: user: 106.54 s, system: 4.65 s, elapsed: 111.24 s
CPU: user: 113.24 s, system: 6.86 s, elapsed: 120.16 s

I wrote this patch a few days ago. I'm only posting it now as I know a
couple of other people have expressed an interest in working on this.
I didn't really want any duplicate efforts, so thought I'd better post
it now before someone else goes and writes a similar patch.

I'll park this here and have another look at it when the PG15 branch opens.

David


Re: Use simplehash.h instead of dynahash in SMgr

From: Yura Sokolov
David Rowley wrote on 2021-04-24 18:58:

Hi, David

That is quite an interesting result. Simplehash, being open addressing
with linear probing, is friendly to the CPU cache. I'd recommend defining
SH_FILLFACTOR with a value lower than the default (0.9). I believe 0.75
is most suitable for this kind of hash table.

> +    /* rotate hashkey left 1 bit at each step */
> +    hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
> +    hashkey ^= murmurhash32((uint32) rnode->node.dbNode);

Why do you use such a strange rotation expression? I know compilers are
able to translate `h = (h << 1) | (h >> 31)` into a single rotate
instruction. Do they recognize the construction in your code as well?

Your construction looks more like a multiply-modulo operation in a 32-bit
Galois field. It is a widely used operation in cryptography, but it is
performed modulo some primitive polynomial, and 0x100000001 is not such a
polynomial. 0x1000000c5 is, so it should be:

     hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 0xc5 : 0);
or
     hashkey = (hashkey << 1) | ((uint32)((int32)hashkey >> 31) & 0xc5);

But why not just use hash_combine(uint32 a, uint32 b) instead (defined
in hashfn.h)? Yes, it could be a bit slower, but is that critical?
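
For reference, hash_combine() in hashfn.h is the usual boost-style
combiner; from memory it is roughly:

static inline uint32
hash_combine(uint32 a, uint32 b)
{
    a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
    return a;
}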

> - *    smgrclose() -- Close and delete an SMgrRelation object.
> + *    smgrclose() -- Close and delete an SMgrRelation object but don't
> + *    remove from the SMgrRelationHash table.

I believe `smgrclose_internal()` should be in this comment.

Still, I don't believe it is worth separating smgrclose_internal from
smgrclose. Is there a measurable performance improvement from this
change? Even if there is, it will be smaller with SH_FILLFACTOR 0.75.

Likewise, I don't support modifying simplehash.h for SH_ENTRY_INITIALIZER,
SH_ENTRY_CLEANUP and SH_TRUNCATE. The initialization could comfortably
live in smgropen() and the cleanup in smgrclose(), and then SH_TRUNCATE
doesn't mean much.

Summary:

regards,
Yura Sokolov

Re: Use simplehash.h instead of dynahash in SMgr

From: David Rowley
Thanks for having a look at this.

 "On Sun, 25 Apr 2021 at 10:27, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
>
> It is quite interesting result. Simplehash being open-addressing with
> linear probing is friendly for cpu cache. I'd recommend to define
> SH_FILLFACTOR with value lower than default (0.9). I believe 0.75 is
> suitable most for such kind of hash table.

You might be right there, although, with the particular benchmark I'm
using the size of the table does not change as a result of that. I'd
need to experiment with varying numbers of relations to see if
dropping the fillfactor helps or hinders performance.

FWIW, the hash stats at the end of recovery are:

LOG:  redo done at 3/C6E34F0 system usage: CPU: user: 107.00 s,
system: 5.61 s, elapsed: 112.67 s
LOG:  size: 4096, members: 2032, filled: 0.496094, total chain: 997,
max chain: 5, avg chain: 0.490650, total_collisions: 422,
max_collisions: 3, avg_collisions: 0.207677

Perhaps if I try using a number of relations somewhere between 2048 *
0.75 and 2048 * 0.9 then I might see some gains.  Because I have 2032,
the hash table grew to 4096 buckets.

I did a quick test dropping the fillfactor down to 0.4.  The aim there
was just to see if having 8192 buckets in this test would make it
faster or slower:

LOG:  redo done at 3/C6E34F0 system usage: CPU: user: 109.61 s,
system: 4.28 s, elapsed: 113.93 s
LOG:  size: 8192, members: 2032, filled: 0.248047, total chain: 303,
max chain: 2, avg chain: 0.149114, total_collisions: 209,
max_collisions: 2, avg_collisions: 0.102854

It was slightly slower.  I guess since the SMgrEntry is just 16 bytes
wide, 4 of them will sit on each 64-byte cache line, which means there is
a 75% chance that the next bucket over is on the same cache line.
Since the average chain length is just 0.49, we'll mostly only need to
look at a single cache line to find the entry with the correct hash key.

> > +     /* rotate hashkey left 1 bit at each step */
> > +     hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
> > +     hashkey ^= murmurhash32((uint32) rnode->node.dbNode);
>
> Why do you use so strange rotation expression? I know compillers are
> able
> to translage `h = (h << 1) | (h >> 31)` to single rotate instruction.
> Do they recognize construction in your code as well?

Not sure about all compilers, I only checked the earliest version of
clang and gcc at godbolt.org and they both use a single "rol"
instruction. https://godbolt.org/z/1GqdE6T3q

> Your construction looks more like "multiplate-modulo" operation in 32bit
> Galois field . It is widely used operation in cryptographic, but it is
> used modulo some primitive polynomial, and 0x100000001 is not such
> polynomial. 0x1000000c5 is, therefore it should be:
>
>      hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 0xc5 : 0);
> or
>      hashkey = (hashkey << 1) | ((uint32)((int32)hashkey >> 31) & 0xc5);

That does not really make sense to me.  If you're shifting a 32-bit
variable right 31 places then why would you AND with 0xc5? The only
possible result is 1 or 0 depending on whether the most significant bit
is on or off. I see gcc and clang are unable to optimise that into a
"rol" instruction. If I swap the "0xc5" for "1", then they're able to
optimise the expression.

> But why don't just use hash_combine(uint32 a, uint32 b) instead (defined
> in hashfn.h)? Yep, it could be a bit slower, but is it critical?

I had that function in the corner of my eye when writing this, but
TBH, the hash function performance was just too big a factor to slow
it down any further by using the more expensive hash_combine()
function. I saw pretty good performance gains from writing my own hash
function rather than using hash_bytes(). I didn't want to detract from
that by using hash_combine().  Rotating the bits left 1 slot seems
good enough for hash join and hash aggregate, so I don't have any
reason to believe it's a bad way to combine the hash values.  Do you?

If you grep the source for "hashkey = (hashkey << 1) | ((hashkey &
0x80000000) ? 1 : 0);", then you'll see where else we do the same
rotate left trick.

> > - *   smgrclose() -- Close and delete an SMgrRelation object.
> > + *   smgrclose() -- Close and delete an SMgrRelation object but don't
> > + *   remove from the SMgrRelationHash table.
>
> I believe `smgrclose_internal()` should be in this comment.

Oops. Yeah, that's a mistake.

> Still I don't believe it worth to separate smgrclose_internal from
> smgrclose. Is there measurable performance improvement from this
> change? Even if there is, it will be lesser with SH_FILLFACTOR 0.75 .

The reason I did that is that smgrcloseall() loops over the entire hash
table and removes each entry one by one.  The problem is that if I do a
smgrtable_delete or smgrtable_delete_item in that loop then I'd need to
restart the loop each time.  Be aware that a simplehash delete can move
entries earlier in the table, so it might cause us to miss entries during
the loop.  Restarting the loop each iteration is not going to be very
efficient, so instead, I opted to make a version of smgrclose() that does
not remove from the table so that I can just wipe out all table entries
at the end of the loop.  I called that smgrclose_internal().  Maybe
there's a better name, but I don't really see any realistic way of not
having some version that skips the hash table delete.  I was hoping the
5-line comment I added to smgrcloseall() would explain the reason for the
code being written that way.

An additional small benefit is that smgrclosenode() can get away with
a single hashtable lookup rather than having to lookup the entry again
with smgrtable_delete().  Using smgrtable_delete_item() deletes by
bucket rather than key value which should be a good bit faster in many
cases.  I think the SH_ENTRY_CLEANUP macro is quite useful here as I
don't need to worry about NULLing out the smgr_owner in yet another
location where I do a hash delete.

> As well I don't support modification simplehash.h for
> SH_ENTRY_INITIALIZER,
> SH_ENTRY_CLEANUP and SH_TRUNCATE. The initialization could comfortably
> live in smgropen and the cleanup in smgrclose. And then SH_TRUNCATE
> doesn't mean much.

Can you share what you've got in mind here?

The problem I'm solving with SH_ENTRY_INITIALIZER is the fact that in
SH_INSERT_HASH_INTERNAL(), when we add a new item, we do entry->SH_KEY
= key; to set the new entry's key.  Since I have SH_KEY defined as:

#define SH_KEY data->smgr_rnode

then I need some way to allocate the memory for ->data before the key
is set. Doing that in smgropen() is too late. We've already crashed by
then for referencing uninitialised memory.
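
Roughly speaking, the initializer just allocates the entry's data so that
the key assignment has somewhere to go, something like (a simplified
sketch, not the exact code from the patch):

/* allocate the SMgrRelationData before simplehash assigns the key into it */
#define SH_ENTRY_INITIALIZER(a) \
    ((a)->data = MemoryContextAllocZero(TopMemoryContext, \
                                        sizeof(SMgrRelationData)))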

I did try putting the key in SMgrEntry but found the performance to be
quite a bit worse than keeping the SMgrEntry down to 16 bytes.  That
makes sense to me as we only need to compare the key when we find an
entry with the same hash value as the one we're looking for. There's a
pretty high chance of that being the entry we want. If I got my hash
function right then the odds are about 1 in 4 billion of it not being
the one we want.   The only additional price we pay when we get two
entries with the same hash value is an additional pointer dereference
and a key comparison.

David



Re: Use simplehash.h instead of dynahash in SMgr

From: Yura Sokolov
David Rowley wrote on 2021-04-25 05:23:
> Thanks for having a look at this.
> 
>  "On Sun, 25 Apr 2021 at 10:27, Yura Sokolov <y.sokolov@postgrespro.ru> 
> wrote:
>> 
>> It is quite interesting result. Simplehash being open-addressing with
>> linear probing is friendly for cpu cache. I'd recommend to define
>> SH_FILLFACTOR with value lower than default (0.9). I believe 0.75 is
>> suitable most for such kind of hash table.
> 
> You might be right there, although, with the particular benchmark I'm
> using the size of the table does not change as a result of that. I'd
> need to experiment with varying numbers of relations to see if
> dropping the fillfactor helps or hinders performance.
> 
> FWIW, the hash stats at the end of recovery are:
> 
> LOG:  redo done at 3/C6E34F0 system usage: CPU: user: 107.00 s,
> system: 5.61 s, elapsed: 112.67 s
> LOG:  size: 4096, members: 2032, filled: 0.496094, total chain: 997,
> max chain: 5, avg chain: 0.490650, total_collisions: 422,
> max_collisions: 3, avg_collisions: 0.207677
> 
> Perhaps if try using a number of relations somewhere between 2048 *
> 0.75 and 2048 * 0.9 then I might see some gains.  Because I have 2032,
> the hash table grew up to 4096 buckets.
> 
> I did a quick test dropping the fillfactor down to 0.4.  The aim there
> was just to see if having 8192 buckets in this test would make it
> faster or slower
> 
> LOG:  redo done at 3/C6E34F0 system usage: CPU: user: 109.61 s,
> system: 4.28 s, elapsed: 113.93 s
> LOG:  size: 8192, members: 2032, filled: 0.248047, total chain: 303,
> max chain: 2, avg chain: 0.149114, total_collisions: 209,
> max_collisions: 2, avg_collisions: 0.102854
> 
> it was slightly slower.

Certainly. That is because in the unmodified case you've got a fill
factor of 0.49, because the table has just grown. Somewhere below about
0.6 there is no gain from a lower fill factor. But if you test it closer
to the upper bound, you will notice a difference. Try testing with 3600
nodes, for example, if going down to 1800 nodes is not possible.

>> > +     /* rotate hashkey left 1 bit at each step */
>> > +     hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
>> > +     hashkey ^= murmurhash32((uint32) rnode->node.dbNode);
>> 
>> Why do you use so strange rotation expression? I know compillers are
>> able
>> to translage `h = (h << 1) | (h >> 31)` to single rotate instruction.
>> Do they recognize construction in your code as well?
> 
> Not sure about all compilers, I only checked the earliest version of
> clang and gcc at godbolt.org and they both use a single "rol"
> instruction. https://godbolt.org/z/1GqdE6T3q

Yep, it looks like all compilers recognize such a construction, with the
single exception of the old icc compiler (both 13.0.1 and 16.0.3):
https://godbolt.org/z/qsrjY5Eof
And all compilers recognize `(h << 1) | (h >> 31)` well.

>> Your construction looks more like "multiplate-modulo" operation in 
>> 32bit
>> Galois field . It is widely used operation in cryptographic, but it is
>> used modulo some primitive polynomial, and 0x100000001 is not such
>> polynomial. 0x1000000c5 is, therefore it should be:
>> 
>>      hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 0xc5 : 0);
>> or
>>      hashkey = (hashkey << 1) | ((uint32)((int32)hashkey >> 31) & 
>> 0xc5);
> 
> That does not really make sense to me.  If you're shifting a 32-bit
> variable left 31 places then why would you AND with 0xc5? The only
> possible result is 1 or 0 depending on if the most significant bit is
> on or off.

That is why there is a cast to signed int before shifting:
`(int32)hashkey >> 31`. The shift is then signed, i.e. arithmetic, and
the result is 0 or 0xffffffff.

>> But why don't just use hash_combine(uint32 a, uint32 b) instead 
>> (defined
>> in hashfn.h)? Yep, it could be a bit slower, but is it critical?
> 
> I had that function in the corner of my eye when writing this, but
> TBH, the hash function performance was just too big a factor to slow
> it down any further by using the more expensive hash_combine()
> function. I saw pretty good performance gains from writing my own hash
> function rather than using hash_bytes(). I didn't want to detract from
> that by using hash_combine().  Rotating the bits left 1 slot seems
> good enough for hash join and hash aggregate, so I don't have any
> reason to believe it's a bad way to combine the hash values.  Do you?

Well, thinking about it a bit more, these hash values could be combined
using just addition: `hash(a) + hash(b) + hash(c)`.

I was thinking more about consistency in the codebase. But it looks like
both ways (`hash_combine(a,b)` and `rotl(a,1)^b`) are used in the code:
- hash_combine is used once (three lines) in hashTupleDesc in tupledesc.c
- rotl+xor is used six times:
-- three times (three lines) in execGrouping.c with a construction like
   yours
-- three times in jsonb_util.c, multirangetypes.c and rangetypes.c with
   `(h << 1) | (h >> 31)`.
Therefore I withdraw my recommendation here.

Looks like there is an opportunity for a micro-patch to unify hash
combining :-)

> 
> If you grep the source for "hashkey = (hashkey << 1) | ((hashkey &
> 0x80000000) ? 1 : 0);", then you'll see where else we do the same
> rotate left trick.
> 
>> > - *   smgrclose() -- Close and delete an SMgrRelation object.
>> > + *   smgrclose() -- Close and delete an SMgrRelation object but don't
>> > + *   remove from the SMgrRelationHash table.
>> 
>> I believe `smgrclose_internal()` should be in this comment.
> 
> Oops. Yeah, that's a mistake.
> 
>> Still I don't believe it worth to separate smgrclose_internal from
>> smgrclose. Is there measurable performance improvement from this
>> change? Even if there is, it will be lesser with SH_FILLFACTOR 0.75 .
> 
> The reason I did that is due to the fact that smgrcloseall() loops
> over the entire hash table and removes each entry one by one.  The
> problem is that if I do a smgrtable_delete or smgrtable_delete_item in
> that loop then I'd need to restart the loop each time.  Be aware that
> a simplehash delete can move entries earlier in the table, so it might
> cause us to miss entries during the loop.  Restarting the loop each
> iteration is not going to be very efficient, so instead, I opted to
> make a version of smgrclose() that does not remove from the table so
> that I can just wipe out all table entries at the end of the loop.  I
> called that smgrclose_internal().

If you read the comments in SH_START_ITERATE, you'll see:

   * Search for the first empty element. As deletions during iterations are
   * supported, we want to start/end at an element that cannot be affected
   * by elements being shifted.

   * Iterate backwards, that allows the current element to be deleted, even
   * if there are backward shifts

Therefore, it is safe to delete during iteration, and it neither leads to
nor requires a loop restart.
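
So smgrcloseall() could just delete as it iterates, roughly like this (a
sketch; I'm assuming the iterator names simplehash generates for your
SH_PREFIX and the entry's "data" field):

smgrtable_iterator iter;
SMgrEntry  *entry;

smgrtable_start_iterate(SMgrRelationHash, &iter);
while ((entry = smgrtable_iterate(SMgrRelationHash, &iter)) != NULL)
    smgrclose(entry->data);     /* also removes the entry from the table */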

> 
> An additional small benefit is that smgrclosenode() can get away with
> a single hashtable lookup rather than having to lookup the entry again
> with smgrtable_delete().  Using smgrtable_delete_item() deletes by
> bucket rather than key value which should be a good bit faster in many
> cases.  I think the SH_ENTRY_CLEANUP macro is quite useful here as I
> don't need to worry about NULLing out the smgr_owner in yet another
> location where I do a hash delete.

I doubt it makes sense, since smgrclosenode is called only from
LocalExecuteInvalidationMessage, i.e. when another backend drops some
relation. There is no useful performance gain from it.

> 
>> As well I don't support modification simplehash.h for
>> SH_ENTRY_INITIALIZER,
>> SH_ENTRY_CLEANUP and SH_TRUNCATE. The initialization could comfortably
>> live in smgropen and the cleanup in smgrclose. And then SH_TRUNCATE
>> doesn't mean much.
> 
> Can you share what you've got in mind here?
> 
> The problem I'm solving with SH_ENTRY_INITIALIZER is the fact that in
> SH_INSERT_HASH_INTERNAL(), when we add a new item, we do entry->SH_KEY
> = key; to set the new entries key.  Since I have SH_KEY defined as:
>
> #define SH_KEY data->smgr_rnode
> 
> then I need some way to allocate the memory for ->data before the key
> is set. Doing that in smrgopen() is too late. We've already crashed by
> then for referencing uninitialised memory.

Oh, now I see.
I could suggest a workaround:
- use entry->hash as the whole key value and manually resolve hash
  collisions with chaining.
But it looks ugly: using a hash table and still manually resolving
collisions.

Therefore perhaps SH_ENTRY_INITIALIZER makes sense.

But SH_ENTRY_CLEANUP is abused in the patch: it is not symmetric with
SH_ENTRY_INITIALIZER, which smells bad. `smgr_owner` is better cleaned
the way it is cleaned now in smgrclose, because that is less obscure,
and SH_ENTRY_CLEANUP should then be just `pfree(a->data)`.

And there is still no reason to have SH_TRUNCATE.

> I did try putting the key in SMgrEntry but found the performance to be
> quite a bit worse than keeping the SMgrEntry down to 16 bytes.  That
> makes sense to me as we only need to compare the key when we find an
> entry with the same hash value as the one we're looking for. There's a
> pretty high chance of that being the entry we want. If I got my hash
> function right then the odds are about 1 in 4 billion of it not being
> the one we want.   The only additional price we pay when we get two
> entries with the same hash value is an additional pointer dereference
> and a key comparison.

That makes sense: the whole benefit of simplehash is cache locality, and
that is gained with a smaller entry.

regards,
Yura Sokolov



Re: Use simplehash.h instead of dynahash in SMgr

From: David Rowley
On Sun, 25 Apr 2021 at 18:48, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
> If you read comments in SH_START_ITERATE, you'll see:
>
>    * Search for the first empty element. As deletions during iterations
> are
>    * supported, we want to start/end at an element that cannot be
> affected
>    * by elements being shifted.
>
>    * Iterate backwards, that allows the current element to be deleted,
> even
>    * if there are backward shifts
>
> Therefore, it is safe to delete during iteration, and it doesn't lead
> nor
> require loop restart.

I had only skimmed that with a pre-loaded assumption that it wouldn't be
safe. I didn't do a very good job of reading it, as I failed to notice
that the lack of guarantees is about deleting items other than the
current one. I hadn't considered the option of finding a free bucket and
then looping backwards to avoid missing entries that are moved up during
a delete.

With that, I changed the patch to get rid of SH_TRUNCATE and got rid of
smgrclose_internal, which skipped the hash delete.  The code is now much
more similar to how it was before.

Regarding the hashing stuff: I added a new function to pg_bitutils.h to
rotate left, and I'm using that instead of the expression that was taken
from nodeHash.c.
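
The new helper is just the mirror of the existing pg_rotate_right32(); a
sketch of it (valid for 0 < n < 32):

static inline uint32
pg_rotate_left32(uint32 word, int n)
{
    return (word << n) | (word >> (32 - n));
}

The combining step then becomes pg_rotate_left32(hashkey, 1) followed by
XORing in the next field's murmurhash32() value.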

For the hash function, I've done some further benchmarking with:

1) The attached v2 patch
2) The attached v2 plus use_hash_combine.patch.txt, which uses
hash_combine() instead of pg_rotate_left32()ing the hashkey each time.
3) The attached v2 with use_hash_bytes.patch.txt applied.
4) Master

I've also included the hash stats from each version of the hash function.

I hope the numbers help indicate the reason I picked the hash function
that I did.

1) v2 patch.
CPU: user: 108.23 s, system: 6.97 s, elapsed: 115.63 s
CPU: user: 114.78 s, system: 6.88 s, elapsed: 121.71 s
CPU: user: 107.53 s, system: 5.70 s, elapsed: 113.28 s
CPU: user: 108.43 s, system: 5.73 s, elapsed: 114.22 s
CPU: user: 106.18 s, system: 5.73 s, elapsed: 111.96 s
CPU: user: 108.04 s, system: 5.29 s, elapsed: 113.39 s
CPU: user: 107.64 s, system: 5.64 s, elapsed: 113.34 s
CPU: user: 106.64 s, system: 5.58 s, elapsed: 112.27 s
CPU: user: 107.91 s, system: 5.40 s, elapsed: 113.36 s
CPU: user: 115.35 s, system: 6.60 s, elapsed: 122.01 s

Median = 113.375 s

LOG:  size: 4096, members: 2032, filled: 0.496094, total chain: 997,
max chain: 5, avg chain: 0.490650, total_collisions: 422,
max_collisions: 3, avg_collisions: 0.207677

2) v2 patch + use_hash_combine.patch.txt
CPU: user: 113.22 s, system: 5.52 s, elapsed: 118.80 s
CPU: user: 116.63 s, system: 5.87 s, elapsed: 122.56 s
CPU: user: 115.33 s, system: 5.73 s, elapsed: 121.12 s
CPU: user: 113.11 s, system: 5.61 s, elapsed: 118.78 s
CPU: user: 112.56 s, system: 5.51 s, elapsed: 118.13 s
CPU: user: 114.55 s, system: 5.80 s, elapsed: 120.40 s
CPU: user: 121.79 s, system: 6.45 s, elapsed: 128.29 s
CPU: user: 113.98 s, system: 4.50 s, elapsed: 118.52 s
CPU: user: 113.24 s, system: 5.63 s, elapsed: 118.93 s
CPU: user: 114.11 s, system: 5.60 s, elapsed: 119.78 s

Median = 119.355 s

LOG:  size: 4096, members: 2032, filled: 0.496094, total chain: 971,
max chain: 6, avg chain: 0.477854, total_collisions: 433,
max_collisions: 4, avg_collisions: 0.213091

3) v2 patch + use_hash_bytes.patch.txt
CPU: user: 120.87 s, system: 6.69 s, elapsed: 127.62 s
CPU: user: 112.40 s, system: 4.68 s, elapsed: 117.14 s
CPU: user: 113.19 s, system: 5.44 s, elapsed: 118.69 s
CPU: user: 112.15 s, system: 4.73 s, elapsed: 116.93 s
CPU: user: 111.10 s, system: 5.59 s, elapsed: 116.74 s
CPU: user: 112.03 s, system: 5.74 s, elapsed: 117.82 s
CPU: user: 113.69 s, system: 4.33 s, elapsed: 118.07 s
CPU: user: 113.30 s, system: 4.19 s, elapsed: 117.55 s
CPU: user: 112.77 s, system: 5.57 s, elapsed: 118.39 s
CPU: user: 112.25 s, system: 4.59 s, elapsed: 116.88 s

Median = 117.685 s

LOG:  size: 4096, members: 2032, filled: 0.496094, total chain: 900,
max chain: 4, avg chain: 0.442913, total_collisions: 415,
max_collisions: 4, avg_collisions: 0.204232

4) master
CPU: user: 117.89 s, system: 5.7 s, elapsed: 123.65 s
CPU: user: 117.81 s, system: 5.74 s, elapsed: 123.62 s
CPU: user: 119.39 s, system: 5.75 s, elapsed: 125.2 s
CPU: user: 117.98 s, system: 4.39 s, elapsed: 122.41 s
CPU: user: 117.92 s, system: 4.79 s, elapsed: 122.76 s
CPU: user: 119.84 s, system: 4.75 s, elapsed: 124.64 s
CPU: user: 120.6 s, system: 5.82 s, elapsed: 126.49 s
CPU: user: 118.74 s, system: 5.71 s, elapsed: 124.51 s
CPU: user: 124.29 s, system: 6.79 s, elapsed: 131.14 s
CPU: user: 118.73 s, system: 5.67 s, elapsed: 124.47 s

Median = 124.49 s

You can see that the bare v2 patch is quite a bit faster than any of
the alternatives.  We'd be better off with hash_bytes() than with
hash_combine(), both for performance and for the seemingly better job
the hash function does at reducing hash collisions.

David


Re: Use simplehash.h instead of dynahash in SMgr

From: Yura Sokolov
David Rowley wrote on 2021-04-25 16:36:
> On Sun, 25 Apr 2021 at 18:48, Yura Sokolov <y.sokolov@postgrespro.ru> 
> wrote:
>> If you read comments in SH_START_ITERATE, you'll see:
>> 
>>    * Search for the first empty element. As deletions during 
>> iterations
>> are
>>    * supported, we want to start/end at an element that cannot be
>> affected
>>    * by elements being shifted.
>> 
>>    * Iterate backwards, that allows the current element to be deleted,
>> even
>>    * if there are backward shifts
>> 
>> Therefore, it is safe to delete during iteration, and it doesn't lead
>> nor
>> require loop restart.
> 
> I had only skimmed that with a pre-loaded assumption that it wouldn't
> be safe. I didn't do a very good job of reading it as I failed to
> notice the lack of guarantees were about deleting items other than the
> current one. I didn't consider the option of finding a free bucket
> then looping backwards to avoid missing entries that are moved up
> during a delete.
> 
> With that, I changed the patch to get rid of the SH_TRUNCATE and got
> rid of the smgrclose_internal which skips the hash delete.  The code
> is much more similar to how it was now.
> 
> In regards to the hashing stuff.  I added a new function to
> pg_bitutils.h to rotate left and I'm using that instead of the other
> expression that was taken from nodeHash.c
> 
> For the hash function, I've done some further benchmarking with:
> 
> 1) The attached v2 patch
> 2) The attached + plus use_hash_combine.patch.txt which uses
> hash_combine() instead of pg_rotate_left32()ing the hashkey each time.
> 3) The attached v2 with use_hash_bytes.patch.txt applied.
> 4) Master
> 
> I've also included the hash stats from each version of the hash 
> function.
> 
> I hope the numbers help indicate the reason I picked the hash function
> that I did.
> 
> 1) v2 patch.
> Median = 113.375 s
> 
> LOG:  size: 4096, members: 2032, filled: 0.496094, total chain: 997,
> max chain: 5, avg chain: 0.490650, total_collisions: 422,
> max_collisions: 3, avg_collisions: 0.207677
> 
> 2) v2 patch + use_hash_combine.patch.txt
> Median = 119.355 s
> 
> LOG:  size: 4096, members: 2032, filled: 0.496094, total chain: 971,
> max chain: 6, avg chain: 0.477854, total_collisions: 433,
> max_collisions: 4, avg_collisions: 0.213091
> 
> 3) v2 patch + use_hash_bytes.patch.txt
> Median = 117.685 s
> 
> LOG:  size: 4096, members: 2032, filled: 0.496094, total chain: 900,
> max chain: 4, avg chain: 0.442913, total_collisions: 415,
> max_collisions: 4, avg_collisions: 0.204232
> 
> 4) master
> Median = 124.49 s
> 
> You can see that the bare v2 patch is quite a bit faster than any of
> the alternatives.  We'd be better of with hash_bytes than using
> hash_combine() both for performance and for the seemingly better job
> the hash function does at reducing the hash collisions.
> 
> David

Looks much better! Simpler is almost always better.

Minor remarks:

A comment for SH_ENTRY_INITIALIZER is needed. Maybe something like:
- SH_ENTRY_INITIALIZER(a) - if defined, this macro is called for new
   entries before the key or hash is stored in them. For example, it can
   be used to make any necessary memory allocations.

`pg_rotate_left32(x, 1) == pg_rotate_right32(x, 31)`, therefore there's
no need to add `pg_rotate_left32` at all. Moreover, for hash combining
there's not much difference between `pg_rotate_left32(x, 1)` and
`pg_rotate_right32(x, 1)`. (To be honest, there is a bit of difference
due to the murmur construction, but it should not be very big.)

If your test is so sensitive to hash function speed, then I'd suggest
trying something even simpler:

static inline uint32
relfilenodebackend_hash(RelFileNodeBackend *rnode)
{
    uint32      h = 0;

#define step(x) h ^= (uint32) (x) * 0x85ebca6b; h = pg_rotate_right32(h, 11); h *= 9;
    step(rnode->node.relNode);
    step(rnode->node.spcNode);  /* spcNode could be different for the same
                                 * relNode only during table movement.
                                 * Does it pay to hash it? */
    step(rnode->node.dbNode);
    step(rnode->backend);       /* does it matter to hash backend?  It equals
                                 * InvalidBackendId for non-temporary
                                 * relations, and temporary relations in the
                                 * same database never have the same relNode
                                 * (do they?). */
    return murmurhash32(h);
}

I'd like to see the benchmark code. It is quite interesting that this
place became measurable at all.

regards,
Yura Sokolov.

Re: Use simplehash.h instead of dynahash in SMgr

From: David Rowley
On Mon, 26 Apr 2021 at 05:03, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
> If your test is so sensitive to hash function speed, then I'd suggest
> trying something even simpler:
>
> static inline uint32
> relfilenodebackend_hash(RelFileNodeBackend *rnode)
> {
>     uint32      h = 0;
>
> #define step(x) h ^= (uint32) (x) * 0x85ebca6b; h = pg_rotate_right32(h, 11); h *= 9;
>     step(rnode->node.relNode);
>     step(rnode->node.spcNode);  /* spcNode could be different for the same
>                                  * relNode only during table movement.
>                                  * Does it pay to hash it? */
>     step(rnode->node.dbNode);
>     step(rnode->backend);       /* does it matter to hash backend?  It equals
>                                  * InvalidBackendId for non-temporary
>                                  * relations, and temporary relations in the
>                                  * same database never have the same relNode
>                                  * (do they?). */
>     return murmurhash32(h);
> }

I tried that and it got a median result of 113.795 seconds over 14
runs with this recovery benchmark test.

LOG:  size: 4096, members: 2032, filled: 0.496094, total chain: 1014,
max chain: 6, avg chain: 0.499016, total_collisions: 428,
max_collisions: 3, avg_collisions: 0.210630

I also tried the following hash function just to see how much
performance might be left from speeding it up:

static inline uint32
relfilenodebackend_hash(RelFileNodeBackend *rnode)
{
    uint32      h;

    h = pg_rotate_right32((uint32) rnode->node.relNode, 16) ^
        ((uint32) rnode->node.dbNode);
    return murmurhash32(h);
}

I got a median of 112.685 seconds over 14 runs with:

LOG:  size: 4096, members: 2032, filled: 0.496094, total chain: 1044,
max chain: 7, avg chain: 0.513780, total_collisions: 438,
max_collisions: 3, avg_collisions: 0.215551

So it looks like there might not be too much left, given that v2 was
113.375 seconds (median over 10 runs).

> I'd like to see benchmark code. It quite interesting this place became
> measurable at all.

Sure.

$ cat recoverybench_insert_hash.sh
#!/bin/bash

pg_ctl stop -D pgdata -m smart
pg_ctl start -D pgdata -l pg.log -w
psql -f setup1.sql postgres > /dev/null
psql -c "create table log_wal (lsn pg_lsn not null);" postgres > /dev/null
psql -c "insert into log_wal values(pg_current_wal_lsn());" postgres > /dev/null
psql -c "insert into hp select x,0 from generate_series(1,100000000)
x;" postgres > /dev/null
psql -c "insert into log_wal values(pg_current_wal_lsn());" postgres > /dev/null
psql -c "select 'Used ' ||
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), lsn)) || ' of
WAL' from log_wal limit 1;" postgres
pg_ctl stop -D pgdata -m immediate -w
echo Starting Postgres...
pg_ctl start -D pgdata -l pg.log

$ cat setup1.sql
drop table if exists hp;
create table hp (a int primary key, b int not null) partition by hash(a);
select 'create table hp'||x|| ' partition of hp for values with
(modulus 1000, remainder '||x||');' from generate_Series(0,999) x;
\gexec

config:
shared_buffers = 10GB
checkpoint_timeout = 60min
max_wal_size = 20GB
min_wal_size = 20GB

For subsequent runs, if you apply the patch that does the PANIC at the
end of recovery, you'll just need to start the database up again to
perform recovery again. You can then just tail -f on your postgres
logs to watch for the "redo done" message which will show you the time
spent doing recovery.

David.



Re: Use simplehash.h instead of dynahash in SMgr

From: Andres Freund
Hi,

On 2021-04-25 01:27:24 +0300, Yura Sokolov wrote:
> It is quite interesting result. Simplehash being open-addressing with
> linear probing is friendly for cpu cache. I'd recommend to define
> SH_FILLFACTOR with value lower than default (0.9). I believe 0.75 is
> suitable most for such kind of hash table.

It's not a "plain" linear probing hash table (although it is on the lookup
side). During insertions collisions are reordered so that the average distance
from the "optimal" position is ~even ("robin hood hashing"). That allows a
higher load factor than a plain linear probed hash table (for which IIRC
there's data to show 0.75 to be a good default load factor).
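
To illustrate (this is just the textbook scheme, not simplehash.h's actual
code), the core of a robin hood insert looks something like:

#include <stdbool.h>
#include <stdint.h>

typedef struct RHEntry
{
    bool        in_use;
    uint32_t    hash;
    /* ... payload ... */
} RHEntry;

/* table size is a power of two; mask == size - 1 */
static void
rh_insert(RHEntry *table, uint32_t mask, RHEntry new_entry)
{
    uint32_t    pos = new_entry.hash & mask;
    uint32_t    dist = 0;   /* how far new_entry is from its optimal slot */

    new_entry.in_use = true;

    for (;;)
    {
        RHEntry    *cur = &table[pos];
        uint32_t    cur_dist;

        if (!cur->in_use)
        {
            *cur = new_entry;   /* empty slot found */
            return;
        }

        /* distance of the resident entry from its own optimal slot */
        cur_dist = (pos - (cur->hash & mask)) & mask;

        if (cur_dist < dist)
        {
            /* resident is "richer": it gives up its slot and moves on */
            RHEntry     tmp = *cur;

            *cur = new_entry;
            new_entry = tmp;
            dist = cur_dist;
        }

        pos = (pos + 1) & mask;
        dist++;
    }
}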

There of course may still be a benefit in lowering the load factor, but I'd
not start there.

David's tests aren't really suited to benchmarking the load factor, but to me
the stats he showed didn't highlight a need to lower the load factor. Lowering
the fill factor does influence the cache hit ratio...

Greetings,

Andres Freund



Re: Use simplehash.h instead of dynahash in SMgr

From: Yura Sokolov
Andres Freund wrote on 2021-04-26 21:46:
> Hi,
> 
> On 2021-04-25 01:27:24 +0300, Yura Sokolov wrote:
>> It is quite interesting result. Simplehash being open-addressing with
>> linear probing is friendly for cpu cache. I'd recommend to define
>> SH_FILLFACTOR with value lower than default (0.9). I believe 0.75 is
>> suitable most for such kind of hash table.
> 
> It's not a "plain" linear probing hash table (although it is on the 
> lookup
> side). During insertions collisions are reordered so that the average 
> distance
> from the "optimal" position is ~even ("robin hood hashing"). That 
> allows a
> higher load factor than a plain linear probed hash table (for which 
> IIRC
> there's data to show 0.75 to be a good default load factor).

Even for Robin Hood hashing, a 0.9 fill factor is too high. It leads to
too much movement on insertion/deletion and a longer average collision
chain.

Note that Robin Hood doesn't optimize the average case. Indeed, Robin
Hood usually has a worse (longer) average collision chain than simple
linear probing. Robin Hood hashing optimizes the worst case, i.e. it
guarantees there is no unnecessarily long collision chain.

See the discussion of the Rust hash table fill factor back when it was
Robin Hood:
https://github.com/rust-lang/rust/issues/38003

> 
> There of course may still be a benefit in lowering the load factor, but 
> I'd
> not start there.
> 
> David's test aren't really suited to benchmarking the load factor, but 
> to me
> the stats he showed didn't highlight a need to lower the load factor. 
> Lowering
> the fill factor does influence the cache hit ratio...
> 
> Greetings,
> 
> Andres Freund

regards,
Yura.



Re: Use simplehash.h instead of dynahash in SMgr

From: Andres Freund
Hi,

On 2021-04-26 22:44:13 +0300, Yura Sokolov wrote:
> Even for Robin Hood hashing 0.9 fill factor is too high. It leads to too
> much movements on insertion/deletion and longer average collision chain.

That's true for modification heavy cases - but most hash tables in PG,
including the smgr one, are quite read heavy. For workloads where
there's a lot of smgr activity, the other overheads in relation
creation/drop handling are magnitudes more expensive than the collision
handling.

Note that simplehash.h also grows when the distance gets too big and
when there are too many elements to move, not just based on the fill
factor.


I kinda wish we had a chained hashtable implementation with the same
interface as simplehash. It's very use-case dependent which approach is
better, and right now we might be forcing some users to choose linear
probing because simplehash.h is still faster than dynahash, even though
chaining would actually be more appropriate.


> Note that Robin Hood doesn't optimize average case.

Right.


> See discussion on Rust hash table fill factor when it were Robin Hood:
> https://github.com/rust-lang/rust/issues/38003

The first sentence actually confirms my point above, about it being a
question of read vs write heavy.

Greetings,

Andres Freund



Re: Use simplehash.h instead of dynahash in SMgr

From: Yura Sokolov
David Rowley wrote on 2021-04-26 09:43:
> I tried that and it got a median result of 113.795 seconds over 14
> runs with this recovery benchmark test.
> 
> LOG:  size: 4096, members: 2032, filled: 0.496094, total chain: 1014,
> max chain: 6, avg chain: 0.499016, total_collisions: 428,
> max_collisions: 3, avg_collisions: 0.210630
> 
> I also tried the following hash function just to see how much
> performance might be left from speeding it up:
> 
> static inline uint32
> relfilenodebackend_hash(RelFileNodeBackend *rnode)
> {
>     uint32      h;
>
>     h = pg_rotate_right32((uint32) rnode->node.relNode, 16) ^
>         ((uint32) rnode->node.dbNode);
>     return murmurhash32(h);
> }
> 
> I got a median of 112.685 seconds over 14 runs with:
> 
> LOG:  size: 4096, members: 2032, filled: 0.496094, total chain: 1044,
> max chain: 7, avg chain: 0.513780, total_collisions: 438,
> max_collisions: 3, avg_collisions: 0.215551

The best result is with just:

    return (uint32) rnode->node.relNode;

i.e. relNode could be used without any mixing at all.
relNode is unique inside a single database, and almost unique across the
whole cluster, since it is an Oid.

>> I'd like to see benchmark code. It quite interesting this place became
>> measurable at all.
> 
> Sure.
> 
> $ cat recoverybench_insert_hash.sh
> ....
> 
> David.

So, I've repeated the benchmark with different numbers of partitions (I
tried to reach a higher fill factor) and a smaller amount of inserted
data (since I don't want to stress my SSD).

partitions/ | dynahash | dynahash +  | simplehash | simplehash + |
fillfactor  |          | simple func |            | simple func  |
------------+----------+-------------+------------+--------------+
 3500/0.43  |  3.73s   |  3.54s      |  3.58s     |  3.34s       |
 3200/0.78  |  3.64s   |  3.46s      |  3.47s     |  3.25s       |
 1500/0.74  |  3.18s   |  2.97s      |  3.03s     |  2.79s       |

The fill factor shown is the effective fill factor in simplehash with
that number of partitions.  I wasn't able to measure with a fill factor
close to 0.9, since it looks like simplehash tends to grow much earlier
due to SH_GROW_MAX_MOVE.

The simple function is a hash function that returns only
rnode->node.relNode.  I tested it with both simplehash and dynahash.
For dynahash, a custom match function was also made; a sketch follows
below.
Conclusion:
- the trivial hash function gives better results for both simplehash and
  dynahash,
- simplehash improves performance for both the complex and the trivial
  hash function,
- simplehash + the trivial function performs best.

I'd like to hear other people's comments on the trivial hash function.
But since the generation of relation Oids is not subject to human
intervention, I'd recommend sticking with the trivial one.

Since the patch is simple, harmless and gives a measurable improvement,
I think it is ready for the commitfest.

regards,
Yura Sokolov.
Postgres Professional https://www.postgrespro.com

PS. David, please send the patch once again, since my mail client
reattached the files in previous messages and the commitfest robot could
think I'm the author.



Re: Use simplehash.h instead of dynahash in SMgr

From: David Rowley
On Thu, 29 Apr 2021 at 00:28, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
> The best result is with just:
>
>     return (uint32)rnode->node.relNode;
>
> ie, relNode could be taken without mixing at all.
> relNode is unique inside single database, and almost unique among whole
> cluster
> since it is Oid.

I admit to having tried that too just to almost eliminate the cost of
hashing.  I just didn't consider it something we'd actually do.

The system catalogues are quite likely to all have the same
relfilenode in all databases, so for workloads that have a large
number of databases, looking up WAL records that touch the catalogues
is going to be pretty terrible.

The simplified hash function I wrote with just the relNode and dbNode
would be the least I'd be willing to entertain.  However, I just
wouldn't be surprised if there was a good reason for that being bad
too.


> So, I've repeated benchmark with different number of partitons (I tried
> to catch higher fillfactor) and less amount of inserted data (since I
> don't
> want to stress my SSD).
>
> partitions/ | dynahash | dynahash +  | simplehash | simplehash + |
> fillfactor  |          | simple func |            | simple func  |
> ------------+----------+-------------+------------+--------------+
>  3500/0.43  |  3.73s   |  3.54s      |  3.58s     |  3.34s       |
>  3200/0.78  |  3.64s   |  3.46s      |  3.47s     |  3.25s       |
>  1500/0.74  |  3.18s   |  2.97s      |  3.03s     |  2.79s       |
>
> Fillfactor is effective fillfactor in simplehash with than number of
> partitions.
> I wasn't able to measure with fillfactor close to 0.9 since looks like
> simplehash tends to grow much earlier due to SH_GROW_MAX_MOVE.

Thanks for testing that.

I also ran some tests last night to test the 0.75 vs 0.9 fillfactor to
see if it made a difference.  The test was similar to last time, but I
trimmed the number of rows inserted from 100 million down to 25
million.  Last time I tested with 1000 partitions, this time with each
of: 100 200 300 400 500 600 700 800 900 1000 partitions.  There didn't
seem to be any point of testing lower than that as the minimum hash
table size is 512.

The averages over 10 runs were:

nparts  ff75    ff90
100     21.898  22.226
200     23.105  25.493
300     25.274  24.251
400     25.139  25.611
500     25.738  25.454
600     26.656  26.82
700     27.577  27.102
800     27.608  27.546
900     27.284  28.186
1000    29      28.153

Or to summarise a bit, we could just look at the sum of all the
results per fillfactor:

sum ff75 2592.79
sum ff90 2608.42 100.6%

fillfactor 75 did come out slightly faster, but only just. It seems
close enough that it might be better just to keep the 90 to save a
little memory and improve caching elsewhere.  Also, from below, you
can see that for 300, 500, 600, 700, 1000 tables tests, the hash
tables ended up the same size, yet there's a bit of variability in the
timing result.  So the 0.6% gain might just be noise.

I don't think it's worth making the fillfactor 75.

drowley@amd3990x:~/recoverylogs$ grep -rH -m 1 "collisions"
ff75_tb100.log:LOG:  size: 1024, members: 231, filled: 0.225586, total
chain: 33, max chain: 2, avg chain: 0.142857, total_collisions: 20,
max_collisions: 2, avg_collisions: 0.086580
ff90_tb100.log:LOG:  size: 512, members: 231, filled: 0.451172, total
chain: 66, max chain: 2, avg chain: 0.285714, total_collisions: 36,
max_collisions: 2, avg_collisions: 0.155844

ff75_tb200.log:LOG:  size: 1024, members: 431, filled: 0.420898, total
chain: 160, max chain: 4, avg chain: 0.371230, total_collisions: 81,
max_collisions: 3, avg_collisions: 0.187935
ff90_tb200.log:LOG:  size: 512, members: 431, filled: 0.841797, total
chain: 942, max chain: 9, avg chain: 2.185615, total_collisions: 134,
max_collisions: 3, avg_collisions: 0.310905

ff90_tb300.log:LOG:  size: 1024, members: 631, filled: 0.616211, total
chain: 568, max chain: 9, avg chain: 0.900158, total_collisions: 158,
max_collisions: 4, avg_collisions: 0.250396
ff75_tb300.log:LOG:  size: 1024, members: 631, filled: 0.616211, total
chain: 568, max chain: 9, avg chain: 0.900158, total_collisions: 158,
max_collisions: 4, avg_collisions: 0.250396

ff75_tb400.log:LOG:  size: 2048, members: 831, filled: 0.405762, total
chain: 341, max chain: 4, avg chain: 0.410349, total_collisions: 162,
max_collisions: 3, avg_collisions: 0.194946
ff90_tb400.log:LOG:  size: 1024, members: 831, filled: 0.811523, total
chain: 1747, max chain: 15, avg chain: 2.102286, total_collisions:
269, max_collisions: 3, avg_collisions: 0.323706

ff75_tb500.log:LOG:  size: 2048, members: 1031, filled: 0.503418,
total chain: 568, max chain: 5, avg chain: 0.550921, total_collisions:
219, max_collisions: 4, avg_collisions: 0.212415
ff90_tb500.log:LOG:  size: 2048, members: 1031, filled: 0.503418,
total chain: 568, max chain: 5, avg chain: 0.550921, total_collisions:
219, max_collisions: 4, avg_collisions: 0.212415

ff75_tb600.log:LOG:  size: 2048, members: 1231, filled: 0.601074,
total chain: 928, max chain: 7, avg chain: 0.753859, total_collisions:
298, max_collisions: 4, avg_collisions: 0.242080
ff90_tb600.log:LOG:  size: 2048, members: 1231, filled: 0.601074,
total chain: 928, max chain: 7, avg chain: 0.753859, total_collisions:
298, max_collisions: 4, avg_collisions: 0.242080

ff75_tb700.log:LOG:  size: 2048, members: 1431, filled: 0.698730,
total chain: 1589, max chain: 9, avg chain: 1.110412,
total_collisions: 391, max_collisions: 4, avg_collisions: 0.273235
ff90_tb700.log:LOG:  size: 2048, members: 1431, filled: 0.698730,
total chain: 1589, max chain: 9, avg chain: 1.110412,
total_collisions: 391, max_collisions: 4, avg_collisions: 0.273235

ff75_tb800.log:LOG:  size: 4096, members: 1631, filled: 0.398193,
total chain: 628, max chain: 6, avg chain: 0.385040, total_collisions:
296, max_collisions: 3, avg_collisions: 0.181484
ff90_tb800.log:LOG:  size: 2048, members: 1631, filled: 0.796387,
total chain: 2903, max chain: 12, avg chain: 1.779890,
total_collisions: 515, max_collisions: 3, avg_collisions: 0.315757

ff75_tb900.log:LOG:  size: 4096, members: 1831, filled: 0.447021,
total chain: 731, max chain: 5, avg chain: 0.399235, total_collisions:
344, max_collisions: 3, avg_collisions: 0.187875
ff90_tb900.log:LOG:  size: 2048, members: 1831, filled: 0.894043,
total chain: 6364, max chain: 14, avg chain: 3.475696,
total_collisions: 618, max_collisions: 4, avg_collisions: 0.337520

ff75_tb1000.log:LOG:  size: 4096, members: 2031, filled: 0.495850,
total chain: 1024, max chain: 6, avg chain: 0.504185,
total_collisions: 416, max_collisions: 3, avg_collisions: 0.204825
ff90_tb1000.log:LOG:  size: 4096, members: 2031, filled: 0.495850,
total chain: 1024, max chain: 6, avg chain: 0.504185,
total_collisions: 416, max_collisions: 3, avg_collisions: 0.204825


Another line of thought for making it go faster would be to do
something like get rid of the hash status field from SMgrEntry.  That
could be either coded into a single bit we'd borrow from the hash
value, or it could be coded into the least significant bit of the data
field.  A pointer to palloc'd memory should always be MAXALIGNed,
which means at least the lower two bits are always zero.  We'd just
need to make sure and do something like "data & ~((uintptr_t) 3)" to
trim off the hash status bits before dereferencing the pointer.  That
would make the SMgrEntry 12 bytes on a 64-bit machine.  However, it
would also mean that some entries would span 2 cache lines, which
might affect performance a bit.
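
Something like this is what I have in mind (a hypothetical sketch; none
of this is in the patch):

#define SMGR_ENTRY_STATUS_BITS  ((uintptr_t) 3)    /* low pointer bits */

static inline SMgrRelation
smgr_entry_get_data(const SMgrEntry *entry)
{
    /* strip the status bits before dereferencing */
    return (SMgrRelation) ((uintptr_t) entry->data & ~SMGR_ENTRY_STATUS_BITS);
}

static inline bool
smgr_entry_in_use(const SMgrEntry *entry)
{
    return ((uintptr_t) entry->data & SMGR_ENTRY_STATUS_BITS) != 0;
}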

> PS. David, please send patch once again since my mail client reattached
> files in
> previous messages, and commit fest robot could think I'm author.

Authors are listed manually in the CF app. The app will pick up .patch
files from the latest email in the thread, and the CF bot will test
those. So it does pay to be pretty careful when attaching patches to
threads that are in the CF app.  That's the reason I added the .txt
extension to the recovery panic patch; the CF bot machines would have
complained about it.



Re: Use simplehash.h instead of dynahash in SMgr

From: Yura Sokolov
David Rowley wrote on 2021-04-29 02:51:
> On Thu, 29 Apr 2021 at 00:28, Yura Sokolov <y.sokolov@postgrespro.ru> 
> wrote:
>> The best result is with just:
>> 
>>     return (uint32)rnode->node.relNode;
>> 
>> ie, relNode could be taken without mixing at all.
>> relNode is unique inside single database, and almost unique among 
>> whole
>> cluster
>> since it is Oid.
> 
> I admit to having tried that too just to almost eliminate the cost of
> hashing.  I just didn't consider it something we'd actually do.
> 
> The system catalogues are quite likely to all have the same
> relfilenode in all databases, so for workloads that have a large
> number of databases, looking up WAL records that touch the catalogues
> is going to be pretty terrible.

The only workload that could touch system catalogues in different
databases is recovery (and autovacuum?). Client backends can't
be connected to more than one database.

But nevertheless, you're right. I oversimplified it intentionally.
Originally I wrote:

    hashcode = (uint32)rnode->node.dbNode * 0xc2b2ae35;
    hashcode ^= (uint32)rnode->node.relNode;
    return hashcode;

But before sending the mail I cut the dbNode part.
Still, the main point stands: relNode could be put unmixed into the final
value. That way fewer collisions happen.

> 
> The simplified hash function I wrote with just the relNode and dbNode
> would be the least I'd be willing to entertain.  However, I just
> wouldn't be surprised if there was a good reason for that being bad
> too.
> 
> 
>> So, I've repeated benchmark with different number of partitons (I 
>> tried
>> to catch higher fillfactor) and less amount of inserted data (since I
>> don't
>> want to stress my SSD).
>> 
>> partitions/ | dynahash | dynahash +  | simplehash | simplehash + |
>> fillfactor  |          | simple func |           | simple func  |
>> ------------+----------+-------------+------------+--------------+
>>   3500/0.43  |   3.73s  |   3.54s     |   3.58s    |   3.34s      |
>>   3200/0.78  |   3.64s  |   3.46s     |   3.47s    |   3.25s      |
>>   1500/0.74  |   3.18s  |   2.97s     |   3.03s    |   2.79s      |
>> 
>> Fillfactor is effective fillfactor in simplehash with than number of
>> partitions.
>> I wasn't able to measure with fillfactor close to 0.9 since looks like
>> simplehash tends to grow much earlier due to SH_GROW_MAX_MOVE.
> 
> Thanks for testing that.
> 
> I also ran some tests last night to test the 0.75 vs 0.9 fillfactor to
> see if it made a difference.  The test was similar to last time, but I
> trimmed the number of rows inserted from 100 million down to 25
> million.  Last time I tested with 1000 partitions, this time with each
> of: 100 200 300 400 500 600 700 800 900 1000 partitions.  There didn't
> seem to be any point of testing lower than that as the minimum hash
> table size is 512.
> 
> The averages over 10 runs were:
> 
> nparts  ff75       ff90
> 100      21.898  22.226
> 200      23.105  25.493
> 300      25.274  24.251
> 400      25.139  25.611
> 500      25.738  25.454
> 600      26.656  26.82
> 700      27.577  27.102
> 800      27.608  27.546
> 900      27.284  28.186
> 1000    29         28.153
> 
> Or to summarise a bit, we could just look at the sum of all the
> results per fillfactor:
> 
> sum ff75 2592.79
> sum ff90 2608.42 100.6%
> 
> fillfactor 75 did come out slightly faster, but only just. It seems
> close enough that it might be better just to keep the 90 to save a
> little memory and improve caching elsewhere.  Also, from below, you
> can see that for 300, 500, 600, 700, 1000 tables tests, the hash
> tables ended up the same size, yet there's a bit of variability in the
> timing result.  So the 0.6% gain might just be noise.
> 
> I don't think it's worth making the fillfactor 75.

To be clear: I didn't change SH_FILLFACTOR. It was equal to 0.9.
I just wasn't able to catch a table with a fill factor of more than 0.78.
Looks like you've got it with 900 partitions :-)

> 
> Another line of thought for making it go faster would be to do
> something like get rid of the hash status field from SMgrEntry.  That
> could be either coded into a single bit we'd borrow from the hash
> value, or it could be coded into the least significant bit of the data
> field.  A pointer to palloc'd memory should always be MAXALIGNed,
> which means at least the lower two bits are always zero.  We'd just
> need to make sure and do something like "data & ~((uintptr_t) 3)" to
> trim off the hash status bits before dereferencing the pointer.  That
> would make the SMgrEntry 12 bytes on a 64-bit machine.  However, it
> would also mean that some entries would span 2 cache lines, which
> might affect performance a bit.

Then the data pointer itself will be unaligned to 8 bytes. While x86 is
mostly indifferent to that, I doubt this memory economy will pay
off.

regards,
Yura Sokolov.



Re: Use simplehash.h instead of dynahash in SMgr

From
David Rowley
Date:
On Thu, 29 Apr 2021 at 12:30, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
>
> David Rowley wrote on 2021-04-29 02:51:
> > Another line of thought for making it go faster would be to do
> > something like get rid of the hash status field from SMgrEntry.  That
> > could be either coded into a single bit we'd borrow from the hash
> > value, or it could be coded into the least significant bit of the data
> > field.  A pointer to palloc'd memory should always be MAXALIGNed,
> > which means at least the lower two bits are always zero.  We'd just
> > need to make sure and do something like "data & ~((uintptr_t) 3)" to
> > trim off the hash status bits before dereferencing the pointer.  That
> > would make the SMgrEntry 12 bytes on a 64-bit machine.  However, it
> > would also mean that some entries would span 2 cache lines, which
> > might affect performance a bit.
>
> Then data pointer will be itself unaligned to 8 bytes. While x86 is
> mostly indifferent to this, doubtfully this memory economy will pay
> off.

Actually, I didn't think very hard about that. The struct would still
be 16 bytes and just have padding so the data pointer was aligned to 8
bytes (assuming a 64-bit machine).
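
Something along these lines, in other words (just an illustrative layout;
the field names are invented):

    typedef struct SMgrEntrySketch
    {
        uint32      status_hash;   /* status bit borrowed from the hash value */
        /* 4 bytes of alignment padding here on a 64-bit machine */
        SMgrRelation data;         /* stays 8-byte aligned; 16 bytes in total */
    } SMgrEntrySketch;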

David



Re: Use simplehash.h instead of dynahash in SMgr

From
David Rowley
Date:
I've attached an updated patch.  I had forgotten to call SH_ENTRY_CLEANUP,
when it's defined, during SH_RESET.

I also tidied up a couple of comments and changed the code to use
pg_rotate_right32(.., 31) instead of adding a new pg_rotate_left32
function and calling that to shift left by 1 bit.
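
For anyone following along, rotating a 32-bit value left by one bit is the
same as rotating it right by 31, so the existing pg_rotate_right32() from
pg_bitutils.h can be reused (a sketch of the equivalence only, not the
patch's actual hash code):

    #include "port/pg_bitutils.h"

    /* (x << 1) | (x >> 31), expressed with the existing helper */
    static inline uint32
    rotate_left_1(uint32 x)
    {
        return pg_rotate_right32(x, 31);
    }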

David

Attachment

Re: Use simplehash.h instead of dynahash in SMgr

From
Alvaro Herrera
Date:
Hi David,

You're probably aware of this, but just to make it explicit: Jakub
Wartak was testing performance of recovery, and one of the bottlenecks
he found in some of his cases was dynahash as used by SMgr.  It seems
quite possible that this work would benefit some of his test workloads.
He last posted about it here:

https://postgr.es/m/VI1PR0701MB69608CBCE44D80857E59572EF6CA0@VI1PR0701MB6960.eurprd07.prod.outlook.com

though the fraction of dynahash-from-SMgr is not as high there as in
some of the other reports I saw.

-- 
Álvaro Herrera       Valdivia, Chile



RE: Use simplehash.h instead of dynahash in SMgr

From
Jakub Wartak
Date:
Hi David, Alvaro, -hackers

> Hi David,
>
> You're probably aware of this, but just to make it explicit: Jakub Wartak was
> testing performance of recovery, and one of the bottlenecks he found in
> some of his cases was dynahash as used by SMgr.  It seems quite possible
> that this work would benefit some of his test workloads.

I might be a little bit out of the loop, but as Alvaro stated - Thomas did plenty of excellent work related to recovery
performance in that thread. In my humble opinion, and if I'm not mistaken (I'm speculating here), it might be *not* how
the Smgr hash works, but how often it is being exercised, and that would also explain the relatively lower than
expected(?) gains here. There are at least two very important emails from him that I'm aware of that touch the topic of
reordering/compacting/batching calls to Smgr:
https://www.postgresql.org/message-id/CA%2BhUKG%2B2Vw3UAVNJSfz5_zhRcHUWEBDrpB7pyQ85Yroep0AKbw%40mail.gmail.com
https://www.postgresql.org/message-id/flat/CA%2BhUKGK4StQ%3DeXGZ-5hTdYCmSfJ37yzLp9yW9U5uH6526H%2BUeg%40mail.gmail.com

Another potential option that we've discussed is that the redo generation itself is likely a brake on efficient
recovery performance today (e.g. INSERT-SELECT on a table with indexes generates interleaved WAL records that often
touch a limited set of blocks, which usually puts Smgr into the spotlight).

-Jakub Wartak.



Re: Use simplehash.h instead of dynahash in SMgr

From
David Rowley
Date:
Hi Jakub,

On Wed, 5 May 2021 at 20:16, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
> I might be a little bit out of the loop, but as Alvaro stated - Thomas did plenty of excellent job related to
> recovery performance in that thread. In my humble opinion and if I'm not mistaken (I'm speculating here) it might be
> *not* how Smgr hash works, but how often it is being exercised and that would also explain relatively lower than
> expected(?) gains here. There are at least two very important emails from him that I'm aware that are touching the
> topic of reordering/compacting/batching calls to Smgr:
> https://www.postgresql.org/message-id/CA%2BhUKG%2B2Vw3UAVNJSfz5_zhRcHUWEBDrpB7pyQ85Yroep0AKbw%40mail.gmail.com
> https://www.postgresql.org/message-id/flat/CA%2BhUKGK4StQ%3DeXGZ-5hTdYCmSfJ37yzLp9yW9U5uH6526H%2BUeg%40mail.gmail.com

I'm not much of an expert here and I didn't follow the recovery
prefetching stuff closely. So, with that in mind, I think there are
lots that could be done along the lines of what Thomas is mentioning.
Batching WAL records up by filenode then replaying each filenode one
by one when our batching buffer is full.  There could be some sort of
parallel options there too, where workers replay a filenode each.
However, that wouldn't really work for recovery on a hot-standby
though.  We'd need to ensure we replay the commit record for each
transaction last. I think you'd have to batch by filenode and
transaction in that case.  Each batch might be pretty small on a
typical OLTP workload, so it might not help much there, or it might
hinder.

But having said that, I don't think any of those possibilities should
stop us speeding up smgropen().

> Another potential option that we've discussed is that the redo generation itself is likely a brake of efficient
> recovery performance today (e.g. INSERT-SELECT on table with indexes, generates interleaved WAL records that touch
> often limited set of blocks that usually put Smgr into spotlight).

I'm not quite sure if I understand what you mean here.  Is this
queuing up WAL records during transactions and flushing them out to
WAL every so often after rearranging them into an order that's more
optimal for replay?

David



RE: Use simplehash.h instead of dynahash in SMgr

From
Jakub Wartak
Date:
Hey David,

> I think you'd have to batch by filenode and transaction in that case. Each batch might be pretty small on a typical
> OLTP workload, so it might not help much there, or it might hinder.
 

True, it is very workload dependent (I was chasing mainly multi-value INSERTs and INSERT-SELECT that often hit the
same $block), certainly not OLTP. I would even say that INSERT-as-SELECT would be more suited for DWH-like processing.
 

> But having said that, I don't think any of those possibilities should stop us speeding up smgropen().

Of course! I've tried a couple of much smaller ideas, but without any big gains. I was able to squeeze out something
like 300-400k function calls per second (WAL records/s); that was the point, I think, where smgropen() got abused.
 

> > Another potential option that we've discussed is that the redo generation
> > itself is likely a brake of efficient recovery performance today (e.g. INSERT-
> > SELECT on table with indexes, generates interleaved WAL records that touch
> > often limited set of blocks that usually put Smgr into spotlight).
> 
> I'm not quite sure if I understand what you mean here.  Is this queuing up
> WAL records up during transactions and flush them out to WAL every so
> often after rearranging them into an order that's more optimal for replay?

Why not both? 😉 We were very concentrated on the standby side, but on the primary side one could also change how WAL
records are generated:
 

1) Minimisation of records touching the same repeated $block, e.g. the Heap2 table_multi_insert() API already does
this, and it matters for generating a more optimal stream for replay:
 

postgres@test=# create table t (id bigint primary key);
postgres@test=# insert into t select generate_series(1, 10);

results in many calls due to interleaving heap with btree records for the same block from the Smgr perspective (this
is especially visible on highly indexed tables) =>
 
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243284, lsn: 4/E7000108, prev 4/E70000A0, desc: INSERT_LEAF off 1, blkref #0: rel 1663/16384/32794 blk 1
rmgr: Heap        len (rec/tot):     63/    63, tx:   17243284, lsn: 4/E7000148, prev 4/E7000108, desc: INSERT off 2 flags 0x00, blkref #0: rel 1663/16384/32791 blk 0
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243284, lsn: 4/E7000188, prev 4/E7000148, desc: INSERT_LEAF off 2, blkref #0: rel 1663/16384/32794 blk 1
rmgr: Heap        len (rec/tot):     63/    63, tx:   17243284, lsn: 4/E70001C8, prev 4/E7000188, desc: INSERT off 3 flags 0x00, blkref #0: rel 1663/16384/32791 blk 0
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243284, lsn: 4/E7000208, prev 4/E70001C8, desc: INSERT_LEAF off 3, blkref #0: rel 1663/16384/32794 blk 1
rmgr: Heap        len (rec/tot):     63/    63, tx:   17243284, lsn: 4/E7000248, prev 4/E7000208, desc: INSERT off 4 flags 0x00, blkref #0: rel 1663/16384/32791 blk 0
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243284, lsn: 4/E7000288, prev 4/E7000248, desc: INSERT_LEAF off 4, blkref #0: rel 1663/16384/32794 blk 1
rmgr: Heap        len (rec/tot):     63/    63, tx:   17243284, lsn: 4/E70002C8, prev 4/E7000288, desc: INSERT off 5 flags 0x00, blkref #0: rel 1663/16384/32791 blk 0
[..]
Similar stuff happens for UPDATE. It basically prevents the recent-buffer optimization that avoids repeated calls to
smgropen().

And here's a sample of the already existing table_multi_insert v2 API (Heap2), with the obvious elimination of
unnecessary individual calls to smgropen() via one big MULTI_INSERT instead (for CTAS/COPY/REFRESH MV):
 
postgres@test=# create table t (id bigint primary key);
postgres@test=# copy (select generate_series (1, 10)) to '/tmp/t';
postgres@test=# copy t from '/tmp/t';
=>
rmgr: Heap2       len (rec/tot):    210/   210, tx:   17243290, lsn: 4/E9000028, prev 4/E8004410, desc: MULTI_INSERT+INIT 10 tuples flags 0x02, blkref #0: rel 1663/16384/32801 blk 0
rmgr: Btree       len (rec/tot):    102/   102, tx:   17243290, lsn: 4/E9000100, prev 4/E9000028, desc: NEWROOT lev 0, blkref #0: rel 1663/16384/32804 blk 1, blkref #2: rel 1663/16384/32804 blk 0
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243290, lsn: 4/E9000168, prev 4/E9000100, desc: INSERT_LEAF off 1, blkref #0: rel 1663/16384/32804 blk 1
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243290, lsn: 4/E90001A8, prev 4/E9000168, desc: INSERT_LEAF off 2, blkref #0: rel 1663/16384/32804 blk 1
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243290, lsn: 4/E90001E8, prev 4/E90001A8, desc: INSERT_LEAF off 3, blkref #0: rel 1663/16384/32804 blk 1
[..]
Here the Btree part is very localized (at least when concurrent sessions are not generating WAL) and it enables
Thomas's recent-buffer optimization to kick in.
 

DELETE is much simpler (thanks to not churning out those Btree records) and, also thanks to Thomas's recent-buffer,
should theoretically put much less stress on smgropen() already:
 
rmgr: Heap        len (rec/tot):     54/    54, tx:   17243296, lsn: 4/ED000028, prev 4/EC002800, desc: DELETE off 1 flags 0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/32808 blk 0
rmgr: Heap        len (rec/tot):     54/    54, tx:   17243296, lsn: 4/ED000060, prev 4/ED000028, desc: DELETE off 2 flags 0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/32808 blk 0
rmgr: Heap        len (rec/tot):     54/    54, tx:   17243296, lsn: 4/ED000098, prev 4/ED000060, desc: DELETE off 3 flags 0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/32808 blk 0
rmgr: Heap        len (rec/tot):     54/    54, tx:   17243296, lsn: 4/ED0000D0, prev 4/ED000098, desc: DELETE off 4 flags 0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/32808 blk 0
[..]

2) So what's missing - I may be wrong on this one - is something like an "index_multi_inserts" Btree2 API to avoid
repeatedly overwhelming smgropen() on the recovery side for the same index's $buffer. Not sure it is worth the effort
though, especially since recent-buffer fixes that:
 
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243290, lsn: 4/E9000168, prev 4/E9000100, desc: INSERT_LEAF off 1, blkref #0: rel 1663/16384/32804 blk 1
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243290, lsn: 4/E90001A8, prev 4/E9000168, desc: INSERT_LEAF off 2, blkref #0: rel 1663/16384/32804 blk 1
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243290, lsn: 4/E90001E8, prev 4/E90001A8, desc: INSERT_LEAF off 3, blkref #0: rel 1663/16384/32804 blk 1
 
right?

3) Concurrent DML sessions mixing WAL records: buffering on the backend's side of things (a private "thread" of WAL -
in private memory - that would simply be "copied" into the logwriter's main WAL buffer on commit or when the buffer is
full). It would seem like a very interesting idea to limit the interleaving of concurrent sessions' WAL records and to
exploit the recent-buffer enhancement to avoid repeating the same calls to Smgr, wouldn't it? (I'm just mentioning it
as I saw you were benchmarking it here and called out this idea.)
 

I could be wrong with many of those simplifications though; in any case, please consult with Thomas as he knows this
much better and is a much more trusted source than me 😉
 

-J.


Re: Use simplehash.h instead of dynahash in SMgr

From
Yura Sokolov
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            not tested

I believe it is ready for committer.

The new status of this patch is: Ready for Committer

Re: Use simplehash.h instead of dynahash in SMgr

From
David Rowley
Date:
I'd been thinking of this patch again.  When testing with simplehash,
I found that the width of the hash bucket type was fairly critical for
getting good performance from simplehash.h.  With simplehash.h I
didn't manage to narrow this any more than 16 bytes. I needed to store
the 32-bit hash value and a pointer to the data. On a 64-bit machine,
with padding, that's 16-bytes.  I've been thinking about a way to
narrow this down further to just 8 bytes and also solve the stable
pointer problem at the same time...

I've come up with a new hash table implementation that I've called
generichash.   It works similarly to simplehash in regards to the
linear probing, only instead of storing the data in the hash bucket,
we just store a uint32 index that indexes off into an array.  To keep
the pointers in that array stable, we cannot resize the array as the
table grows.  Instead, I just allocate another array of the same size.
Since these arrays are always sized as powers of 2, it's very fast to
index into them using the uint32 index that's stored in the bucket.
Unused buckets just store the special index of 0xFFFFFFFF.
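
To make that concrete, here's a rough sketch of the bucket layout and the
index-to-element translation (all names and the segment size here are
hypothetical, not the actual identifiers from the patch):

    #include "postgres.h"
    #include "storage/smgr.h"

    #define GH_UNUSED_BUCKET        0xFFFFFFFFu  /* marks an empty bucket */
    #define GH_ITEMS_PER_SEGMENT    256          /* fixed power-of-2 size */

    typedef struct GHBucket
    {
        uint32      hashvalue;  /* hash of the element's key */
        uint32      index;      /* element index, or GH_UNUSED_BUCKET */
    } GHBucket;                 /* 8 bytes per bucket */

    typedef struct GHTable
    {
        GHBucket   *buckets;            /* linear-probed bucket array */
        SMgrRelationData **segments;    /* element arrays; never reallocated */
        uint32      nsegments;
    } GHTable;

    /*
     * Translate a bucket's index into a stable element pointer.  Since the
     * segment size is a power of 2, the division and modulo compile down to
     * a shift and a mask.
     */
    static inline SMgrRelationData *
    gh_index_to_element(GHTable *tb, uint32 index)
    {
        uint32      segidx = index / GH_ITEMS_PER_SEGMENT;
        uint32      itemidx = index % GH_ITEMS_PER_SEGMENT;

        return &tb->segments[segidx][itemidx];
    }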

I've also proposed to use this hash table implementation over in [1]
to speed up LockReleaseAll(). The 0001 patch here is just the same as
the patch from [1].

The 0002 patch includes using a generichash hash table for SMgr.

The performance using generichash.h is about the same as the
simplehash.h version of the patch, although the test was not done on
the same version of master.

Master (97b713418)
drowley@amd3990x:~$ tail -f pg.log | grep "redo done"
CPU: user: 124.85 s, system: 6.83 s, elapsed: 131.74 s
CPU: user: 115.01 s, system: 4.76 s, elapsed: 119.83 s
CPU: user: 122.13 s, system: 6.41 s, elapsed: 128.60 s
CPU: user: 113.85 s, system: 6.11 s, elapsed: 120.02 s
CPU: user: 121.40 s, system: 6.28 s, elapsed: 127.74 s
CPU: user: 113.71 s, system: 5.80 s, elapsed: 119.57 s
CPU: user: 113.96 s, system: 5.90 s, elapsed: 119.92 s
CPU: user: 122.74 s, system: 6.21 s, elapsed: 129.01 s
CPU: user: 122.00 s, system: 6.38 s, elapsed: 128.44 s
CPU: user: 113.06 s, system: 6.14 s, elapsed: 119.25 s
CPU: user: 114.42 s, system: 4.35 s, elapsed: 118.82 s

Median: 120.02 s

master + v1 + v2

drowley@amd3990x:~$ tail -n 0 -f pg.log | grep "redo done"
CPU: user: 107.75 s, system: 4.61 s, elapsed: 112.41 s
CPU: user: 108.07 s, system: 4.49 s, elapsed: 112.61 s
CPU: user: 106.89 s, system: 5.55 s, elapsed: 112.49 s
CPU: user: 107.42 s, system: 5.64 s, elapsed: 113.12 s
CPU: user: 106.85 s, system: 4.42 s, elapsed: 111.31 s
CPU: user: 107.36 s, system: 4.76 s, elapsed: 112.16 s
CPU: user: 107.20 s, system: 4.47 s, elapsed: 111.72 s
CPU: user: 106.94 s, system: 5.89 s, elapsed: 112.88 s
CPU: user: 115.32 s, system: 6.12 s, elapsed: 121.49 s
CPU: user: 108.02 s, system: 4.48 s, elapsed: 112.54 s
CPU: user: 106.93 s, system: 4.54 s, elapsed: 111.51 s

Median: 112.49 s

So about a 6.69% speedup

David

[1] https://www.postgresql.org/message-id/CAApHDvoKqWRxw5nnUPZ8+mAJKHPOPxYGoY1gQdh0WeS4+biVhg@mail.gmail.com

Attachment

Re: Use simplehash.h instead of dynahash in SMgr

From
Robert Haas
Date:
On Mon, Jun 21, 2021 at 10:15 AM David Rowley <dgrowleyml@gmail.com> wrote:
> I've come up with a new hash table implementation that I've called
> generichash.

At the risk of kibitzing the least-important detail of this proposal,
I'm not very happy with the names of our hash implementations.
simplehash is not especially simple, and dynahash is not particularly
dynamic, especially now that the main place we use it is for
shared-memory hash tables that can't be resized. Likewise, generichash
doesn't really give any kind of clue about how this hash table is
different from any of the others. I don't know how possible it is to
do better here; naming things is one of the two hard problems in
computer science. In a perfect world, though, our hash table
implementations would be named in such a way that somebody might be
able to look at the names and guess on that basis which one is
best-suited to a given task.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Use simplehash.h instead of dynahash in SMgr

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jun 21, 2021 at 10:15 AM David Rowley <dgrowleyml@gmail.com> wrote:
>> I've come up with a new hash table implementation that I've called
>> generichash.

> At the risk of kibitzing the least-important detail of this proposal,
> I'm not very happy with the names of our hash implementations.

I kind of wonder if we really need four different hash table
implementations (this being the third "generic" one, plus hash join
has its own, and I may have forgotten others).  Should we instead
think about revising simplehash to gain the benefits of this patch?

            regards, tom lane



Re: Use simplehash.h instead of dynahash in SMgr

From
David Rowley
Date:
On Tue, 22 Jun 2021 at 02:53, Robert Haas <robertmhaas@gmail.com> wrote:
> At the risk of kibitzing the least-important detail of this proposal,
> I'm not very happy with the names of our hash implementations.
> simplehash is not especially simple, and dynahash is not particularly
> dynamic, especially now that the main place we use it is for
> shared-memory hash tables that can't be resized. Likewise, generichash
> doesn't really give any kind of clue about how this hash table is
> different from any of the others. I don't know how possible it is to
> do better here; naming things is one of the two hard problems in
> computer science. In a perfect world, though, our hash table
> implementations would be named in such a way that somebody might be
> able to look at the names and guess on that basis which one is
> best-suited to a given task.

I'm certainly open to better names.  I did almost call it stablehash,
in regards to the pointers to elements not moving around like they do
with simplehash.

I think more generally, hash table implementations are complex enough
that it's pretty much impossible to give them a short enough
meaningful name.  Most papers just end up assigning a name to some
technique. e.g Robinhood, Cuckoo etc.

Both simplehash and generichash use a variant of Robinhood hashing.
simplehash uses open addressing and generichash does not.  Instead of
Andres naming it simplehash, if he'd instead called it
"robinhoodhash", then someone might come along and complain that his
implementation is broken because it does not implement tombstoning.
Maybe Andres thought he'd avoid that by not claiming that it's an
implementation of a Robinhood hash table.  That seems pretty wise to
me. Naming it simplehash was a pretty simple way of avoiding that
problem.

Anyway, I'm open to better names, but I don't think the name should
drive the implementation.  If the implementation does not fit the name
perfectly, then the name should change rather than the implementation.

Personally, I think we should call it RowleyHash, but I think others
might object. ;-)

David



Re: Use simplehash.h instead of dynahash in SMgr

From
David Rowley
Date:
On Tue, 22 Jun 2021 at 03:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I kind of wonder if we really need four different hash table
> implementations (this being the third "generic" one, plus hash join
> has its own, and I may have forgotten others).  Should we instead
> think about revising simplehash to gain the benefits of this patch?

hmm, yeah. I definitely agree with trying to have as much reusable
code as we can when we can. It certainly reduces maintenance and bugs
tend to be found more quickly too. It's a very worthy cause.

I did happen to think of this when I was copying swathes of code out
of simplehash.h.  However, I decided that the two implementations are
sufficiently different that if I tried to merge them both into one .h
file, we'd have some unreadable and unmaintainable mess.  I just don't
think their DNA is compatible enough for the two to be mated
successfully.  For example, simplehash uses open addressing and
generichash does not.  This means that things like iterating over the
table works completely differently.  Lookups in generichash need to
perform an extra step to fetch the actual data from the segment
arrays.  I think it would certainly be possible to merge the two, but
I just don't think it would be easy code to work on if we did that.

The good thing is that that the API between the two is very similar
and it's quite easy to swap one for the other.  I did make changes
around memory allocation due to me being too cheap to zero memory when
I didn't need to and simplehash not having any means of allocating
memory without zeroing it.

I also think that there's just no one-size-fits-all hash table type.
simplehash will not perform well when the size of the stored element
is very large. There's simply too much memcpying to move data around
during insert/delete.  simplehash will also have terrible iteration
performance in sparsely populated tables.  However, simplehash will be
pretty much unbeatable for lookups where the element type is very
small, e.g single Datum, or an int.  The CPU cache efficiency there
will be pretty much unbeatable.

I tried to document the advantages of each in the file header
comments.  I should probably also add something to simplehash.h's
comments to mention generichash.h

David



Re: Use simplehash.h instead of dynahash in SMgr

From
Thomas Munro
Date:
On Tue, Jun 22, 2021 at 1:55 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Tue, 22 Jun 2021 at 03:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I kind of wonder if we really need four different hash table
> > implementations (this being the third "generic" one, plus hash join
> > has its own, and I may have forgotten others).  Should we instead
> > think about revising simplehash to gain the benefits of this patch?
>
> hmm, yeah. I definitely agree with trying to have as much reusable
> code as we can when we can. It certainly reduces maintenance and bugs
> tend to be found more quickly too. It's a very worthy cause.

It is indeed really hard to decide when you have a new thing, and when
you need a new way to parameterise the existing generic thing.  I
wondered about this how-many-hash-tables-does-it-take question a lot
when writing dshash.c (a chaining hash table that can live in weird
dsa.c memory, backed by DSM segments created on the fly that may be
mapped at different addresses in each backend, and has dynahash-style
partition locking), and this was around the time Andres was talking
about simplehash.  In retrospect, I'd probably kick out the built-in
locking and partitions and instead let callers create their own
partitioning scheme on top from N tables, and that'd remove one quirk,
leaving only the freaky pointers and allocator.  I recall from a
previous life that Boost's unordered_map template is smart enough to
support running in shared memory mapped at different addresses just
through parameterisation that controls the way it deals with internal
pointers (boost::unordered_map<..., ShmemAllocator>), which seemed
pretty clever to me, and it might be achievable to do the same with a
generic hash table for us that could take over dshash's specialness.

One idea I had at the time is that the right number of hash table
implementations in our tree is two: one for chaining (like dynahash)
and one for open addressing/probing (like simplehash), and that
everything else should be hoisted out (locking, partitioning) or made
into template parameters through the generic programming technique
that simplehash.h has demonstrated (allocators, magic pointer type for
internal pointers, plus of course the inlinable ops).  But that was
before we'd really fully adopted the idea of this style of template
code.  (I also assumed the weird memory stuff would be temporary and
we'd move to threads, but that's another topic for another thread.)
It seems like you'd disagree with this, and you'd say the right number
is three.  But it's also possible to argue for one...

A more superficial comment: I don't like calling hash tables "hash".
I blame perl.



Re: Use simplehash.h instead of dynahash in SMgr

From
David Rowley
Date:
On Tue, 22 Jun 2021 at 14:49, Thomas Munro <thomas.munro@gmail.com> wrote:
> One idea I had at the time is that the right number of hash table
> implementations in our tree is two: one for chaining (like dynahash)
> and one for open addressing/probing (like simplehash), and that
> everything else should be hoisted out (locking, partitioning) or made
> into template parameters through the generic programming technique
> that simplehash.h has demonstrated (allocators, magic pointer type for
> internal pointers, plus of course the inlinable ops).  But that was
> before we'd really fully adopted the idea of this style of template
> code.  (I also assumed the weird memory stuff would be temporary and
> we'd move to threads, but that's another topic for another thread.)
> It seems like you'd disagree with this, and you'd say the right number
> is three.  But it's also possible to argue for one...

I guess we could also ask ourselves how many join algorithms we need.
We have 3.something. None of which is perfect for every job. That's
why we have multiple. I wonder why this is different.

Just for anyone who missed it, the reason I wrote generichash and
didn't just use simplehash is that it's not possible to point any
other pointers to a simplehash table because these get shuffled around
during insert/delete. For the locallock stuff over on [1] we need the
LOCALLOCK object to be stable as we point to these from the resource
manager. Likewise here for SMgr, we point to SMgrRelationData objects
from RelationData. We can't have the hash table implementation swap
these out from under us.

Additionally, I coded generichash to fix the very slow hash seq scan
problem that we have in LockReleaseAll() when a transaction has run in
the backend that took lots of locks and caused the locallock hash
table to bloat.  Later when we run transactions that just grab a few
locks it takes us a relatively long time to do LockReleaseAll()
because we have to skip all those empty hash table buckets in the
bloated table. (See iterate_sparse_table.png and
iterate_very_sparse_table.png)

I just finished writing a benchmark suite for comparing simplehash to
generichash. I did this as a standalone C program. See the attached
hashbench.tar.gz. You can run the tests with just ./test.sh. Just be
careful if compiling manually as test.sh passes -DHAVE__BUILTIN_CTZ
-DHAVE_LONG_INT_64 which have quite a big effect on the performance of
generichash due to it using pg_rightmost_one_pos64() when searching
the bitmaps for used items.

I've attached graphs showing the results I got from running test.sh on
my AMD 3990x machine. Because the size of the struct being hashed
matters a lot to the performance of simplehash, I ran tests with 8,
16, 32, 64, 128, 256-byte structs.  This matters because simplehash
does memcpy() on this when moving stuff around during insert/delete.
The size of the "payload" matters a bit less to generichash.

You can see that the lookup performance of generichash is very similar
to simplehash's. The insert/delete test shows that generichash is very
slightly slower from 8-128 bytes but wins when simplehash has to
tackle 256 bytes of data.

The seq scan tests show that simplehash is better when the table is
full of items, but it's terrible when the bucket array is only sparsely
populated.  I needed generichash to be fast at this for
LockReleaseAll(). I might be able to speed up generichash iteration
when the table is full a bit more by checking if the segment is full
and skipping to the next item rather than consulting the bitmap. That
will slow down the sparse case a bit though.  Not sure if it's worth
it.

Anyway, what I hope to show here is that there is no one-size-fits-all
hash table.

David

[1] https://www.postgresql.org/message-id/CAApHDvoKqWRxw5nnUPZ8+mAJKHPOPxYGoY1gQdh0WeS4+biVhg@mail.gmail.com

Attachment

Re: Use simplehash.h instead of dynahash in SMgr

From
Thomas Munro
Date:
On Tue, Jun 22, 2021 at 6:51 PM David Rowley <dgrowleyml@gmail.com> wrote:
> I guess we could also ask ourselves how many join algorithms we need.

David and I discussed this a bit off-list, and I just wanted to share
how I understand the idea so far in case it helps someone else.  There
are essentially three subcomponents working together:

1.  A data structure similar in some ways to a C++ std::deque<T>,
which gives O(1) access to elements by index, is densely packed to
enable cache-friendly scanning of all elements, has stable addresses
(as long as you only add new elements at the end or overwrite existing
slots), and is internally backed by an array of pointers to a set of
chunks.

2.  A bitmapset that tracks unused elements in 1, making it easy to
find the lowest-index hole when looking for a place to put a new one
by linear search for a 1 bit, so that we tend towards maximum density
despite having random frees from time to time (seems good, the same
idea is used in  kernels to allocate the lowest unused file descriptor
number).

3. A hash table that has as elements indexes into 1.  It somehow hides
the difference between keys (what callers look things up with) and
keys reachable by following an index into 1 (where elements' keys
live).

One thought is that you could do 1 as a separate component as the
"primary" data structure, and use a plain old simplehash for 3 as a
kind of index into it, but use pointers (rather than indexes) to
objects in 1 as elements.  I don't know if it's better.



Re: Use simplehash.h instead of dynahash in SMgr

From
David Rowley
Date:
On Wed, 23 Jun 2021 at 12:17, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> David and I discussed this a bit off-list, and I just wanted to share
> how I understand the idea so far in case it helps someone else.  There
> are essentially three subcomponents working together:

Thanks for taking an interest in this. I started looking at your idea
and I've now changed my mind from just not liking it to thinking that
the whole idea is just completely horrible :-(

It gets really messy with all the nested pre-processor stuff around
fetching the element from the segmented array inside simplehash.  One
problem is that simplehash needs the address of the segments despite
simplehash not knowing anything about segments. I've tried to make
that work by passing in the generic hash struct as simplehash's
private_data. This ends up with deeply nested macros all defined in
different files. I pity the future person debugging that.

There is also a problem of how to reference simplehash functions
inside the generichash code.  It's not possible to do things like
SH_CREATE() because all those macros are undefined at the end of
simplehash.h. It's no good to hardcode the names either as GH_PREFIX
must be used, else it wouldn't be possible to use more than one
differently defined hash table per .c file.  Fixing this means either
modifying simplehash.h to not undefine all the name macros at the end
maybe with SH_NOUNDEF or creating another set of macros to build the
names for the simplehash functions inside the generic hash code. I
don't like either of those ideas.

There are also a bunch of changes / API breakages that need to be done
to make this work with simplehash.h.

1) Since I really need 8-byte buckets in the hash table to make this
as fast as possible, I want to use the array index for the hash status
and that means changing the simplehash API to allow that to work.
This requires something like SH_IS_BUCKET_INUSE, SH_SET_BUCKET_INUSE,
SH_SET_BUCKET_EMPTY.
2) I need to add a new memory allocation function to not zero the
memory.  At the moment all hash buckets are emptied when creating a
table by zeroing the bucket memory. If someone defines
SH_SET_BUCKET_EMPTY to do something that says 0 memory is not empty,
then that won't work. So I need to allocate the bucket memory then
call SH_SET_BUCKET_EMPTY on each bucket.
3) I'll need to replace SH_KEY with something more complex.  Since the
simplehash bucket will just have a uint32 hashvalue and uint32 index,
the hash key is not stored in the bucket, it's stored over in the
segment. I'll need to replace SH_KEY with SH_GETKEY and SH_SETKEY.
These will need to consult the simplehash's private_data so that the
element can be found in the segmented array.

Also, simplehash internally manages when the hash table needs to grow.
I'll need to perform separate checks to see if the segmented array
also must grow.  It's a bit annoying to double up those checks as
they're in a very hot path, being done every time someone inserts
into the table.

> 2.  A bitmapset that tracks unused elements in 1, making it easy to
> find the lowest-index hole when looking for a place to put a new one
> by linear search for a 1 bit, so that we tend towards maximum density
> despite having random frees from time to time (seems good, the same
> idea is used in  kernels to allocate the lowest unused file descriptor
> number).

I didn't use Bitmapsets. I wanted the bitmaps to be allocated in the
same chunk of memory as the segments of the array.  Also, because
bitmapset's nwords is variable, the compiler can't really do any loop
unrolling.  Since in my implementation the number of bitmap words is
known at compile time, the compiler has the flexibility to do loop
unrolling.  The bitmap manipulation is one of the biggest overheads in
generichash.h. I'd prefer to keep that as fast as possible.
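
As a rough illustration of why the compile-time-known word count matters
(hypothetical names and sizes; the patch's bitmap handling is more involved
than this), a free-slot search over a fixed number of words looks like the
following, and the fixed trip count lets the compiler unroll the loop:

    #include "postgres.h"
    #include "port/pg_bitutils.h"

    #define GH_ITEMS_PER_SEGMENT    256
    #define GH_BITMAP_WORDS         (GH_ITEMS_PER_SEGMENT / 64)

    /*
     * Return the index of the first unused item in a segment, or -1 if the
     * segment is full.  A 0 bit marks a free item, so invert each word and
     * look for its rightmost 1 bit.
     */
    static inline int
    gh_first_free_item(const uint64 used_items[GH_BITMAP_WORDS])
    {
        for (int w = 0; w < GH_BITMAP_WORDS; w++)
        {
            uint64      freebits = ~used_items[w];

            if (freebits != 0)
                return w * 64 + pg_rightmost_one_pos64(freebits);
        }
        return -1;
    }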

> 3. A hash table that has as elements indexes into 1.  It somehow hides
> the difference between keys (what callers look things up with) and
> keys reachable by following an index into 1 (where elements' keys
> live).

I think that can be done, but it would require looking up the
segmented array twice instead of once.  The first time would be when
we compare the keys after seeing the hash values match. The final time
would be in the calling code to translate the index to the pointer.
Hopefully the compiler would be able to optimize that to a single
lookup.

> One thought is that you could do 1 as a separate component as the
> "primary" data structure, and use a plain old simplehash for 3 as a
> kind of index into it, but use pointers (rather than indexes) to
> objects in 1 as elements.  I don't know if it's better.

Using pointers would double the bucket width on a 64-bit machine. I
don't want to do that. Also, determining the segment from the pointer
would require looping over each segment to check if the pointer
belongs there. With the index, we can determine the segment directly
by bit-shifting the index.

So, with all that, I really don't think it's a great idea to try and
have this use the simplehash.h code. I plan to pursue the idea I
proposed of having separate hash table code that is written properly
to have stable pointers into the data rather than trying to contort
simplehash's code into working that way.

David



Re: Use simplehash.h instead of dynahash in SMgr

From
Thomas Munro
Date:
On Wed, Jun 30, 2021 at 11:14 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Wed, 23 Jun 2021 at 12:17, Thomas Munro <thomas.munro@gmail.com> wrote:
> Thanks for taking an interest in this. I started looking at your idea
> and I've now changed my mind from just not liking it to thinking that
> the whole idea is just completely horrible :-(

Hah.

I accept that trying to make a thing that "wraps" these data
structures and provides a simple interface is probably really quite
horrible with preprocessor voodoo.

I was mainly questioning how bad it would be if we had a generic
segmented array component (seems like a great idea, which I'm sure
would find other uses, I recall wanting to write that myself before),
and then combined that with the presence map idea to make a dense
object pool (ditto), but then, in each place where we need something
like this, just used a plain old hash table to point directly to
objects in it whenever we needed that, open coding the logic to keep
it in sync (I mean, just the way that people usually use hash tables).
That way, the object pool can give you very fast scans over all
objects in cache friendly order (no linked lists), and the hash table
doesn't know/care about its existence.  In other words, small reusable
components that each do one thing well and are not coupled together.

I think I understand now that you really, really want small index
numbers and not 64 bit pointers in the hash table.  Hmm.

> It gets really messy with all the nested pre-processor stuff around
> fetching the element from the segmented array inside simplehash.  One
> problem is that simplehash needs the address of the segments despite
> simplehash not knowing anything about segments. I've tried to make
> that work by passing in the generic hash struct as simplehash's
> private_data. This ends up with deeply nested macros all defined in
> different files. I pitty the future person debugging that.

Yeah, that sounds terrible.

> There are also a bunch of changes / API breakages that need to be done
> to make this work with simplehash.h.
>
> 1) Since I really need 8-byte buckets in the hash table to make this
> as fast as possible, I want to use the array index for the hash status
> and that means changing the simplehash API to allow that to work.
> This requires something like SH_IS_BUCKET_INUSE, SH_SET_BUCKET_INUSE,
> SH_SET_BUCKET_EMPTY.

+1 for doing customisable "is in use" checks one day anyway, as a
separate project.  Not sure if any current users could shrink their
structs in practice because, at a glance, the same amount of space
might be used by padding anyway, but when a case like that shows up...

> > 2.  A bitmapset that tracks unused elements in 1, making it easy to
> > find the lowest-index hole when looking for a place to put a new one
> > by linear search for a 1 bit, so that we tend towards maximum density
> > despite having random frees from time to time (seems good, the same
> > idea is used in  kernels to allocate the lowest unused file descriptor
> > number).
>
> I didn't use Bitmapsets. I wanted the bitmaps to be allocated in the
> same chunk of memory as the segments of the array.  Also, because
> bitmapset's nwords is variable, then they can't really do any loop
> unrolling.  Since in my implementation the number of bitmap words are
> known at compile-time, the compiler has the flexibility to do loop
> unrolling.  The bitmap manipulation is one of the biggest overheads in
> generichash.h. I'd prefer to keep that as fast as possible.

I think my hands autocompleted "bitmapset", I really meant to write
just "bitmap" as I didn't think you were using the actual thing called
bitmapset, but point taken.

> So, with all that. I really don't think it's a great idea to try and
> have this use simplehash.h code. I plan to pursue the idea I proposed
> with having seperate hash table code that is coded properly to have
> stable pointers into the data rather than trying to contort
> simplehash's code into working that way.

Fair enough.

It's not that I don't believe it's a good idea to be able to perform
cache-friendly iteration over densely packed objects... that part
sounds great... it's just that it's not obvious to me that it should
be a *hashtable's* job to provide that access path.  Perhaps I lack
imagination and we'll have to agree to differ.



Re: Use simplehash.h instead of dynahash in SMgr

From
David Rowley
Date:
On Thu, 1 Jul 2021 at 13:00, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Jun 30, 2021 at 11:14 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > 1) Since I really need 8-byte buckets in the hash table to make this
> > as fast as possible, I want to use the array index for the hash status
> > and that means changing the simplehash API to allow that to work.
> > This requires something like SH_IS_BUCKET_INUSE, SH_SET_BUCKET_INUSE,
> > SH_SET_BUCKET_EMPTY.
>
> +1 for doing customisable "is in use" checks on day anyway, as a
> separate project.  Not sure if any current users could shrink their
> structs in practice because, at a glance, the same amount of space
> might be used by padding anyway, but when a case like that shows up...

Yeah, I did look at that when messing with simplehash when working on
Result Cache a few months ago. I found all current usages have at
least a free byte, so I wasn't motivated to allow custom statuses to
be defined.

There's probably a small tidy up to do in simplehash maybe along with
that patch.  If you look at SH_GROW, for example, you'll see various
formations of:

if (oldentry->status != SH_STATUS_IN_USE)
if (oldentry->status == SH_STATUS_IN_USE)
if (newentry->status == SH_STATUS_EMPTY)

I'm not all that sure why there's a need to distinguish !=
SH_STATUS_IN_USE from == SH_STATUS_EMPTY. I can only imagine that
Andres was messing around with tombstoning and at one point had a 3rd
status in a development version. There are some minor inefficiencies
as a result of this, e.g in SH_DELETE, the code does:

if (entry->status == SH_STATUS_EMPTY)
    return false;

if (entry->status == SH_STATUS_IN_USE &&
    SH_COMPARE_KEYS(tb, hash, key, entry))

That SH_STATUS_IN_USE check is always true.

David



Re: Use simplehash.h instead of dynahash in SMgr

From
Jaime Casanova
Date:
On Tue, Jun 22, 2021 at 02:15:26AM +1200, David Rowley wrote:
[...]
> 
> I've come up with a new hash table implementation that I've called
> generichash.   It works similarly to simplehash in regards to the

Hi David,

Are you planning to work on this in this CF?
This is marked as "Ready for committer" but it doesn't apply anymore.

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL



Re: Use simplehash.h instead of dynahash in SMgr

From
David Rowley
Date:
On Fri, 24 Sept 2021 at 20:26, Jaime Casanova
<jcasanov@systemguards.com.ec> wrote:
> Are you planning to work on this in this CF?
> This is marked as "Ready for committer" but it doesn't apply anymore.

I've attached an updated patch. Since this patch is pretty different
from the one that was marked as ready for committer, I'll move this to
needs review.

However, I'm a bit disinclined to go ahead with this patch at all.
Thomas made it quite clear he's not for the patch, and on discussing
the patch with Andres, it turned out he does not like the idea either.
Andres' argument was along the lines of bitmaps being slow.  The hash
table uses bitmaps to record which items in each segment are in use. I
don't really agree with him about that, so we'd likely need some more
comments to help reach a consensus about whether we want this or not.

Maybe Andres has more comments, so I've included him here.

David

Attachment

Re: Use simplehash.h instead of dynahash in SMgr

From
Jaime Casanova
Date:
On Mon, Sep 27, 2021 at 04:30:25PM +1300, David Rowley wrote:
> On Fri, 24 Sept 2021 at 20:26, Jaime Casanova
> <jcasanov@systemguards.com.ec> wrote:
> > Are you planning to work on this in this CF?
> > This is marked as "Ready for committer" but it doesn't apply anymore.
> 
> I've attached an updated patch. Since this patch is pretty different
> from the one that was marked as ready for committer, I'll move this to
> needs review.
> 
> However, I'm a bit disinclined to go ahead with this patch at all.
> Thomas made it quite clear it's not for the patch, and on discussing
> the patch with Andres, it turned out he does not like the idea either.
> Andres' argument was along the lines of bitmaps being slow.  The hash
> table uses bitmaps to record which items in each segment are in use. I
> don't really agree with him about that, so we'd likely need some more
> comments to help reach a consensus about if we want this or not.
> 
> Maybe Andres has more comments, so I've included him here.
> 

Hi David,

Thanks for the updated patch.

Based on your comments I will mark this patch as withdrawn at midday of 
my monday unless someone objects to that.

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL



Re: Use simplehash.h instead of dynahash in SMgr

From
David Rowley
Date:
On Mon, 4 Oct 2021 at 20:37, Jaime Casanova
<jcasanov@systemguards.com.ec> wrote:
> Based on your comments I will mark this patch as withdrawn at midday of
> my monday unless someone objects to that.

I really think we need a hash table implementation that's faster than
dynahash and supports stable pointers to elements (simplehash does not
have stable pointers). I think withdrawing this won't help us move
towards getting that.

Thomas voiced his concerns here about having an extra hash table
implementation and then also concerns that I've coded the hash table
code to be fast to iterate over the hashed items.  To be honest, I
think both Andres and Thomas must be misunderstanding the bitmap part.
I get the impression that they both think the bitmap is solely there
to make iterations faster, but in reality it's primarily there as a
compact freelist and can also be used to make iterations over sparsely
populated tables fast. For the freelist we look for 0-bits, and we
look for 1-bits during iteration.

I think I'd much rather talk about the concerns here than just
withdraw this. Even if what I have today just serves as something to
aid discussion.

It would also be good to get the points Andres raised with me off-list
on this thread.  I think his primary concern was that bitmaps are
slow, but I don't really think maintaining full pointers into freed
items is going to improve the performance of this.

David



Re: Use simplehash.h instead of dynahash in SMgr

From
Yura Sokolov
Date:
Good day, David and all.

On Tue, 05/10/2021 at 11:07 +1300, David Rowley wrote:
> On Mon, 4 Oct 2021 at 20:37, Jaime Casanova
> <jcasanov@systemguards.com.ec> wrote:
> > Based on your comments I will mark this patch as withdrawn at midday
> > of
> > my monday unless someone objects to that.
> 
> I really think we need a hash table implementation that's faster than
> dynahash and supports stable pointers to elements (simplehash does not
> have stable pointers). I think withdrawing this won't help us move
> towards getting that.

I agree with you. I believe densehash could replace both dynahash and
simplehash. Shared-memory usages of dynahash should be reworked into
some other, less dynamic hash structure. So there would be densehash
for local hashes and statichash for static shared memory.

densehash's slight slowness compared to simplehash in some operations
isn't worth keeping simplehash around beside densehash.

> Thomas voiced his concerns here about having an extra hash table
> implementation and then also concerns that I've coded the hash table
> code to be fast to iterate over the hashed items.  To be honest, I
> think both Andres and Thomas must be misunderstanding the bitmap part.
> I get the impression that they both think the bitmap is solely there
> to make interations faster, but in reality it's primarily there as a
> compact freelist and can also be used to make iterations over sparsely
> populated tables fast. For the freelist we look for 0-bits, and we
> look for 1-bits during iteration.

I think this part is overengineered. More below.

> I think I'd much rather talk about the concerns here than just
> withdraw this. Even if what I have today just serves as something to
> aid discussion.
> 
> It would also be good to get the points Andres raised with me off-list
> on this thread.  I think his primary concern was that bitmaps are
> slow, but I don't really think maintaining full pointers into freed
> items is going to improve the performance of this.
> 
> David

First, on the "quirks" in the patch I was able to see:

DH_NEXT_ZEROBIT:

   DH_BITMAP_WORD mask = (~(DH_BITMAP_WORD) 0) << DH_BITNUM(prevbit);
   DH_BITMAP_WORD word = ~(words[wordnum] & mask); /* flip bits */

really should be

   DH_BITMAP_WORD mask = (~(DH_BITMAP_WORD) 0) << DH_BITNUM(prevbit);
   DH_BITMAP_WORD word = (~words[wordnum]) & mask; /* flip bits */

But it does no harm because DH_NEXT_ZEROBIT is always called with
`prevbit = -1`, which is incremented to `0`. Therefore `mask` is always
`0xffff...ff`.

DH_INDEX_TO_ELEMENT

   /* ensure this segment is marked as used */
should be
   /* ensure this item is marked as used in the segment */

DH_GET_NEXT_UNUSED_ENTRY

   /* find the first segment with an unused item */
   while (seg != NULL && seg->nitems == DH_ITEMS_PER_SEGMENT)
       seg = tb->segments[++segidx];

There is no protection for `++segidx <= tb->nsegments`. I understand it
cannot happen because `grow_threshold` is always less than
`nsegment * DH_ITEMS_PER_SEGMENT`, but at least a comment should be
left about why it is legal to omit the check.

Now architecture notes:

I don't believe there is a need for a configurable DH_ITEMS_PER_SEGMENT.
I don't even believe it should be anything other than 16 (or 8). Then a
segment needs only one `used_items` word, which simplifies the code a lot.
There is not much difference in overhead between 1/16 and 1/256.

And then I believe a segment doesn't need both `nitems` and `used_items`.
The condition "segment is full" becomes simply `used_items == 0xffff`.

Next, I think it is better to make a real free list instead of looping
to search for one, i.e. add a `uint32 DH_SEGMENT->next` field and
maintain a list starting from `first_free_segment`.
If the concern is "allocate from lower-numbered segments first", then a
min-heap could be created. It is possible to create a very efficient
non-balanced "binary heap" with just two fields (`uint32 left, right`).
An algorithmic PoC in Ruby is attached.

There is also an allocation concern: AllocSet tends to allocate in
power-of-2 sizes. Using power-of-2 segments with a header
(nitems/used_items) will certainly lead to 2x wasted space on every
segment if the element size is also a power of 2, and a bit less for
other element sizes.
There could be two workarounds:
- make the segment slightly less capable (15 elements instead of 16)
- move the header from the segment itself to the `DH_TYPE->segments` array.
I think the second option is preferable:
- `DH_TYPE->segments[x]` is inevitably accessed on every operation,
  so why not store some info there?
- if nitems/used_items live in `DH_TYPE->segments[x]`, then
  hashtable iteration doesn't need a bitmap at all - there is no need
  for the `DH_TYPE->used_segments` bitmap. The absence of this bitmap will
  reduce the overhead of the usual operations (insert/delete) as well.

I hope this was useful.

regards

Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Attachment

Re: Use simplehash.h instead of dynahash in SMgr

From
Michael Paquier
Date:
On Tue, Oct 05, 2021 at 11:07:48AM +1300, David Rowley wrote:
> I think I'd much rather talk about the concerns here than just
> withdraw this. Even if what I have today just serves as something to
> aid discussion.

Hmm.  This last update was two months ago, and the patch does not
apply anymore.  I am marking it as RwF for now.
--
Michael

Attachment

Re: Use simplehash.h instead of dynahash in SMgr

From
Andres Freund
Date:
Hi,

On 2021-04-25 03:58:38 +1200, David Rowley wrote:
> Currently, we use dynahash hash tables to store the SMgrRelation so we
> can perform fast lookups by RelFileNodeBackend. However, I had in mind
> that a simplehash table might perform better. So I tried it...

> The test case was basically inserting 100 million rows one at a time
> into a hash partitioned table with 1000 partitions and 2 int columns
> and a primary key on one of those columns. It was about 12GB of WAL. I
> used a hash partitioned table in the hope to create a fairly
> random-looking SMgr hash table access pattern.  Hopefully something
> similar to what might happen in the real world.

A potentially stupid question: Do we actually need to do smgr lookups in this
path? Afaict nearly all of the buffer lookups here will end up as cache hits in
shared buffers, correct?

Afaict we'll do two smgropens in a lot of paths:
1) XLogReadBufferExtended() does smgropen so it can do smgrnblocks()
2) ReadBufferWithoutRelcache() does an smgropen()

It's pretty sad that we constantly do two smgropen()s to start with. But in
the cache hit path we don't actually need an smgropen in either case afaict.

ReadBufferWithoutRelcache() does an smgropen, because that's
ReadBuffer_common()'s API. Which in turn has that API because it wants to use
RelationGetSmgr() when coming from ReadBufferExtended().  It doesn't seem
awful to allow smgr to be NULL and to pass in the rlocator in addition.

XLogReadBufferExtended() does an smgropen() so it can do smgrcreate() and
smgrnblocks(). But neither is needed in the cache hit case, I think. We could
do a "read-only" lookup in s_b, and only do the current logic in case that
fails?

Greetings,

Andres Freund