Thread: hash_search_with_hash_value is high in "perf top" on a replica
Hi.
Debugging some replication lag on a replica when the master node experiences heavy writes.
PG "startup recovering" eats up a lot of CPU (like 65 %user and 30 %sys), which is a little surprising (what is it doing with all those CPU cycles? it looked like WAL replay should be more IO bound than CPU bound?).
Running "perf top -p <pid>", it shows this:
Samples: 1M of event 'cycles:P', 4000 Hz, Event count (approx.): 18178814660 lost: 0/0 drop: 0/0
  Overhead  Shared Object  Symbol
    16.63%  postgres       [.] hash_search_with_hash_value
     5.38%  postgres       [.] __aarch64_ldset4_sync
     4.42%  postgres       [.] __aarch64_cas4_acq_rel
     3.42%  postgres       [.] XLogReadBufferExtended
     2.35%  libc.so.6      [.] 0x0000000000097f4c
     2.04%  postgres       [.] pg_comp_crc32c_armv8
     1.77%  [kernel]       [k] mutex_lock
     1.63%  postgres       [.] XLogPrefetcherReadRecord
     1.56%  postgres       [.] DecodeXLogRecord
     1.55%  postgres       [.] LWLockAcquire
This is more surprising: hash_search_with_hash_value is a hash table lookup function, so is it expected to be #1 in the profiler? (It's called mostly from PrefetchSharedBuffer*.)
Configuration:
- PG17
- ZFS, compression is off, empty database (with compression on, most of "startup recovering" CPU was spent in ZFS guts doing compression/decompression according to the profiler)
- full_page_writes=off, recovery_prefetch=on (ZFS supports it, I tested with a small C program), wal_decode_buffer_size=2048kB (it doesn't seem to affect anything though).
- shared_buffers = 25% of RAM
- testing with a giant COPY command basically
My main question is about hash_search_with_hash_value CPU usage. With both recovery_prefetch=on and off, it is the topmost function in "perf top". I see no IO bottleneck on the machine; it's only "startup recovering" maxing out one CPU core.
Maybe it's a red herring though, but it looks pretty suspicious.
31.01.2025 17:25, Álvaro Herrera writes:
> On 2025-Jan-31, Dmitry Koterov wrote:
>
>> PG "startup recovering" eats up a lot of CPU (like 65 %user and 30 %sys),
>> which is a little surprising (what is it doing with all those CPU cycles?
>> it looked like WAL replay should be more IO bound than CPU bound?).
>>
>> Running "perf top -p <pid>", it shows this:
>>
>> Samples: 1M of event 'cycles:P', 4000 Hz, Event count (approx.):
>> 18178814660 lost: 0/0 drop: 0/0
>> Overhead Shared Object Symbol
>> 16.63% postgres [.] hash_search_with_hash_value
>
> Yeah, I noticed that this function was showing high in some profiles a
> couple of days ago as well. Looking now at hashfn.c (hash_bytes_uint32
> there is the function involved in the buffer mapping hash table), the
> comments state that we updated to Bob Jenkins' code from 2006, but
> there's a version on his website that (if I read correctly) is twice as
> fast as what we're using now:
> http://burtleburtle.net/bob/hash/spooky.html
>
> Apparently this code in our repo is mostly unchanged since commit
> 1f559b7d3aa4, in 2007.
>
> He mentions that on Intel chips, Google's CityHash is faster; but they
> in turn claim that the difference is small on Intel chips and that
> Jenkins' hash is better on AMD chips.
> https://github.com/google/cityhash
>
> Anyway if you wanted to try your luck at improving things, here's your
> chance.

hash_search_with_hash_value uses an already calculated hash value, so its
performance doesn't depend on the performance of the hash function
(hash_bytes_uint32 or any other). I believe it is memory bound, since
dynahash does at least three indirection jumps before it even reaches the
first node of the bucket's linked list:

    segp = HTAB->dir[segment_num];
    *bucketptr = segp[segment_ndx];

and then iterates through that linked list, with each iteration being (for a
large hash table) both a cache miss and a TLB miss (without huge pages).
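(For readers unfamiliar with dynahash's layout, here is a standalone,
simplified sketch of that lookup chain, paraphrased from the logic in
src/backend/utils/hash/dynahash.c; the types and names below are abbreviated
for illustration and are not the actual PostgreSQL definitions.)

/*
 * Simplified, standalone sketch of dynahash's bucket lookup -- not the
 * actual source.  The point is the chain of dependent loads:
 * directory -> segment -> bucket head -> chain entries, each typically on a
 * separate cache line.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct HashElement
{
    struct HashElement *link;       /* next entry in the same bucket */
    uint32_t            hashvalue;  /* cached hash of the key */
    /* key and payload follow in the real thing */
} HashElement;

typedef struct MiniHash
{
    HashElement ***dir;             /* directory of segments */
    uint32_t       bucket_mask;     /* nbuckets - 1 (power of two assumed) */
    uint32_t       ssize;           /* bucket slots per segment (power of two) */
    int            sshift;          /* log2(ssize) */
} MiniHash;

static HashElement *
mini_hash_search_with_hash_value(MiniHash *htab, uint32_t hashvalue,
                                 bool (*match)(HashElement *, const void *),
                                 const void *key)
{
    uint32_t      bucket      = hashvalue & htab->bucket_mask;
    uint32_t      segment_num = bucket >> htab->sshift;
    uint32_t      segment_ndx = bucket & (htab->ssize - 1);
    HashElement **segp;
    HashElement  *cur;

    segp = htab->dir[segment_num];  /* 1st dependent load: directory entry */
    cur  = segp[segment_ndx];       /* 2nd dependent load: bucket head */

    /* 3rd load onward: chase the chain, one likely cache miss per hop */
    for (; cur != NULL; cur = cur->link)
    {
        if (cur->hashvalue == hashvalue && match(cur, key))
            return cur;
    }
    return NULL;
}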
On Fri, Jan 31, 2025, 15:43 Andres Freund <andres@anarazel.de> wrote:
> > Maybe it's a red herring though, but it looks pretty suspicious.
>
> It's unfortunately not too surprising - our buffer mapping table is a pretty
> big bottleneck. Both because a hash table is just not a good fit for the
> buffer mapping table due to the lack of locality and because dynahash is a
> really poor hash table implementation.
I measured similar things when looking at apply throughput recently. For in-cache workloads, buffer lookup and locking were about half of the load.
One other direction is to extract more memory concurrency. The prefetcher could batch multiple lookups together so CPU OoO execution has a chance to fire off multiple memory accesses at the same time.
The other direction is to split off WAL decoding, buffer lookup and maybe even pinning to a separate process from the main redo loop.
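(A minimal, standalone illustration of the batching idea above; the table
layout and helper names are hypothetical stand-ins, not PostgreSQL code. The
first pass touches only the bucket heads and issues prefetches, so the chain
walks in the second pass can have several cache misses in flight at once.)

/*
 * Hypothetical sketch of batched hash lookups.  Phase 1 reads the bucket
 * heads and issues prefetches; phase 2 walks the chains.  Because several
 * bucket-head loads are requested before any chain is walked, the
 * out-of-order core gets more memory-level parallelism than one serialized
 * lookup at a time.
 */
#include <stddef.h>
#include <stdint.h>

#define BATCH_MAX 16

typedef struct Entry
{
    struct Entry *next;     /* bucket chain */
    uint32_t      hash;
    uint64_t      key;      /* stand-in for a buffer tag */
    int           value;    /* stand-in for a buffer id */
} Entry;

typedef struct Table
{
    Entry   **buckets;
    uint32_t  mask;         /* nbuckets - 1, nbuckets a power of two */
} Table;

/* Resolve up to BATCH_MAX lookups at once; results[i] = value, or -1. */
static void
lookup_batch(const Table *t, const uint64_t *keys, const uint32_t *hashes,
             int n, int *results)
{
    Entry *heads[BATCH_MAX];

    for (int i = 0; i < n; i++)
    {
        heads[i] = t->buckets[hashes[i] & t->mask];
        __builtin_prefetch(heads[i]);   /* GCC/Clang builtin: start the miss early */
    }

    for (int i = 0; i < n; i++)
    {
        results[i] = -1;
        for (const Entry *e = heads[i]; e != NULL; e = e->next)
        {
            if (e->hash == hashes[i] && e->key == keys[i])
            {
                results[i] = e->value;
                break;
            }
        }
    }
}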
--
Ants Aasma
Hi,

On 2025-02-01 15:43:41 +0100, Ants Aasma wrote:
> On Fri, Jan 31, 2025, 15:43 Andres Freund <andres@anarazel.de> wrote:
> >
> > > Maybe it's a red herring though, but it looks pretty suspicious.
> >
> > It's unfortunately not too surprising - our buffer mapping table is a pretty
> > big bottleneck. Both because a hash table is just not a good fit for the
> > buffer mapping table due to the lack of locality and because dynahash is a
> > really poor hash table implementation.
>
> I measured similar things when looking at apply throughput recently. For
> in-cache workloads, buffer lookup and locking were about half of the load.
>
> One other direction is to extract more memory concurrency. The prefetcher
> could batch multiple lookups together so CPU OoO execution has a chance to
> fire off multiple memory accesses at the same time.

I think at the moment we have a *hilariously* cache-inefficient buffer lookup,
that's the first thing to address. A hash table for buffer mapping lookups imo
is a bad idea, due to losing all locality in a workload that exhibits a *lot*
of locality. But furthermore, dynahash.c is very far from a cache efficient
hashtable implementation.

The other aspect is that in many workloads we'll look up a small set of
buffers over and over, which a) wastes cycles b) wastes cache space for stuff
that could be elided much more efficiently.

We also do a lot of hash lookups for smgr, because we don't have any
cross-record caching infrastructure for that.

> The other direction is to split off WAL decoding, buffer lookup and maybe
> even pinning to a separate process from the main redo loop.

Maybe, but I think we're rather far away from those things being the most
productive thing to tackle.

Greetings,

Andres Freund
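(As an illustration of the "small set of buffers looked up over and over"
point, here is a hypothetical, standalone sketch of a tiny per-backend cache
consulted before the shared mapping table. The names are invented, and the
hard part, invalidating or re-validating entries when buffers are evicted, is
deliberately glossed over.)

/*
 * Hypothetical sketch (not PostgreSQL code): a small direct-mapped cache of
 * recent (block tag -> buffer id) results checked before the shared buffer
 * mapping hash table.  A real version would need to validate hits against
 * the buffer header or invalidate entries on eviction.
 */
#include <stdbool.h>
#include <stdint.h>

#define LOOKUP_CACHE_SIZE 64    /* must be a power of two */

typedef struct CachedMapping
{
    uint64_t tag;               /* stand-in for (relation, fork, block) */
    int      buffer_id;
    bool     valid;             /* zero-initialized slots start invalid */
} CachedMapping;

static CachedMapping lookup_cache[LOOKUP_CACHE_SIZE];

/* Stand-in for the real buffer mapping hash table lookup. */
static int
shared_mapping_lookup(uint64_t tag)
{
    (void) tag;
    return -1;                  /* "not in shared buffers" in this stub */
}

static int
buffer_lookup_cached(uint64_t tag)
{
    CachedMapping *slot = &lookup_cache[tag & (LOOKUP_CACHE_SIZE - 1)];
    int            buf;

    /* Fast path: recently looked-up block, no hash table traffic at all. */
    if (slot->valid && slot->tag == tag)
        return slot->buffer_id;

    /* Miss: fall back to the shared hash table and remember the answer. */
    buf = shared_mapping_lookup(tag);
    if (buf >= 0)
    {
        slot->tag = tag;
        slot->buffer_id = buf;
        slot->valid = true;
    }
    return buf;
}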
On Sat, 1 Feb 2025 at 16:55, Andres Freund <andres@anarazel.de> wrote:
> [..]
> I think at the moment we have a *hilariously* cache-inefficient buffer lookup,
> that's the first thing to address. A hash table for buffer mapping lookups imo
> is a bad idea, due to losing all locality in a workload that exhibits a *lot*
> of locality. But furthermore, dynahash.c is very far from a cache efficient
> hashtable implementation.

In case you might be interested, I've sent a new patch with a new approach to
reducing the buffer lookup table's memory in [0], which attempts to create a
more cache-efficient hash table implementation.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] https://www.postgresql.org/message-id/CAEze2WiRo4Zu71jwxYmqjq6XK814Avf2-kytaL6n%3DBreZR2ZbA%40mail.gmail.com
On Sun, Feb 2, 2025 at 3:44 AM Ants Aasma <ants.aasma@cybertec.at> wrote:
> The other direction is to split off WAL decoding, buffer lookup and maybe
> even pinning to a separate process from the main redo loop.

Hi Ants,

FWIW I have a patch set that changes xlogprefetcher.c to use read_stream.c,
which I hope to propose for 19. It pins ahead of time, which means that it
already consolidates some pin/unpin/repin sequences when it looks ahead, a
little bit like what you're suggesting, just not in a separate process.
Here[1] is a preview, but it needs some major rebasing and has some
issues... it's on my list...

I mention this because, although the primary goal of the
recovery-streamification project is recovery I/O performance, especially
anticipating direct I/O (especially without FPW), there are many more
opportunities along those lines for cached blocks. I've written about some
of them before and posted toy sketch code, but I didn't let myself get too
distracted as the first thing seems to be real streamification (replaying
LsnReadQueue).

With a bit more work it could avoid lots of buffer mapping table lookups
too: I posted a few ideas and toy patches, one based on already held pins
which begins to pay off once you hold many pins in a window, and one more
generally caching recently accessed buffers in smgr relations. That'd be
independent of whether our buffer mapping table also sucks and needs a
rewrite, it bypasses it completely in many interesting cases, and would
need to be so cheap as to be basically free when it doesn't help.

An assumption I just made, in remembering all that: OP didn't mention it,
but I guess that this COPY replay is probably repeatedly hammering the same
pages from separate records here, because otherwise the multi-insert stuff
would already avoid a lot of mapping table lookups?

There are many more automatic batching opportunities, like holding content
locks across records (I posted a toy patch to measure the potential speedup
for that once, too lazy to search for it...), and then to really benefit
from all of these things in more general real workloads I think you need to
be able to do a little bit of re-ordering because eg heap/index accesses
are so often interleaved.

IOW there seems to be plenty of lower-hanging fruit in the "mechanical
sympathy" category, before you need another thread. No argument against
also making the hashing faster, the mapping table better, and eventually
parallelising some bits when it's not just compensating for inefficient
code that throws away all the locality. I just wanted to mention that
related work in the pipeline :-)

[1] https://github.com/macdice/postgres/commit/52533b652e544464add6f174019161889a37ed93
On Wed, Feb 5, 2025 at 10:22 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> (replaying LsnReadQueue)

s/replaying/replacing/
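(For readers following along: a rough, hypothetical sketch of the "caching
recently accessed buffers in smgr relations" idea Thomas mentions above. This
is not his patch; it only leans on the existing ReadRecentBuffer() API, which
pins a remembered buffer and checks that it still holds the expected block
without consulting the buffer mapping table. The RecentBlock struct and the
assumption of a permanent relation are inventions for illustration.)

/*
 * Hypothetical sketch, not an actual PostgreSQL patch: remember the last
 * buffer used for a given smgr relation and try to revalidate it with
 * ReadRecentBuffer() before falling back to a normal lookup through the
 * buffer mapping table.  The RecentBlock cache is invented for illustration;
 * the caller is responsible for releasing the returned (pinned) buffer.
 */
#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/smgr.h"

typedef struct RecentBlock
{
    BlockNumber blkno;
    Buffer      buf;            /* InvalidBuffer if nothing remembered */
} RecentBlock;

static Buffer
read_block_with_recent_cache(SMgrRelation reln, ForkNumber forknum,
                             BlockNumber blkno, RecentBlock *cache)
{
    Buffer      buf;

    /* Fast path: same block as last time and the buffer still holds it. */
    if (cache->buf != InvalidBuffer && cache->blkno == blkno &&
        ReadRecentBuffer(reln->smgr_rlocator.locator, forknum, blkno,
                         cache->buf))
        return cache->buf;      /* ReadRecentBuffer already pinned it */

    /* Slow path: regular lookup (goes through the buffer mapping table). */
    buf = ReadBufferWithoutRelcache(reln->smgr_rlocator.locator, forknum,
                                    blkno, RBM_NORMAL, NULL,
                                    true /* assumes a permanent relation */);

    cache->blkno = blkno;
    cache->buf = buf;
    return buf;
}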
Hi Thomas!

On Tue, Feb 4, 2025 at 10:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sun, Feb 2, 2025 at 3:44 AM Ants Aasma <ants.aasma@cybertec.at> wrote:
> > The other direction is to split off WAL decoding, buffer lookup and maybe
> > even pinning to a separate process from the main redo loop.
>
> Hi Ants,
> [..]
> An assumption I just made, in remembering all that: OP didn't mention it,
> but I guess that this COPY replay is probably repeatedly hammering the same
> pages from separate records here, because otherwise the multi-insert stuff
> would already avoid a lot of mapping table lookups?

A basic COPY (he mentioned a basic CREATE SUBSCRIPTION table copy, so I
assume it's fresh) ends up emitting stuff like this (note that "blk" keeps
increasing):

rmgr: Heap2 len (rec/tot): 3666/ 3666, tx: 759, lsn: 0/040069A0, prev 0/04004950, desc: MULTI_INSERT+INIT ntuples: 226, flags: 0x00, blkref #0: rel 1663/5/16393 blk 5
rmgr: Heap2 len (rec/tot): 3666/ 3666, tx: 759, lsn: 0/040077F8, prev 0/040069A0, desc: MULTI_INSERT+INIT ntuples: 226, flags: 0x00, blkref #0: rel 1663/5/16393 blk 6
rmgr: Heap2 len (rec/tot): 3666/ 3666, tx: 759, lsn: 0/04008668, prev 0/040077F8, desc: MULTI_INSERT+INIT ntuples: 226, flags: 0x00, blkref #0: rel 1663/5/16393 blk 7
rmgr: Heap2 len (rec/tot): 3122/ 3122, tx: 759, lsn: 0/040094C0, prev 0/04008668, desc: MULTI_INSERT+INIT ntuples: 192, flags: 0x02, blkref #0: rel 1663/5/16393 blk 8

Now, if the table has a PK, we end up doing a *massive* number of INSERT_LEAF
records due to the lack of batched btree emission like in the heap2 case:

rmgr: Heap2 len (rec/tot): 3666/ 3666, tx: 763, lsn: 0/05000028, prev 0/041C8198, desc: MULTI_INSERT+INIT ntuples: 226, flags: 0x00, blkref #0: rel 1663/5/16396 blk 0
rmgr: Heap2 len (rec/tot): 3666/ 3666, tx: 763, lsn: 0/05000E80, prev 0/05000028, desc: MULTI_INSERT+INIT ntuples: 226, flags: 0x00, blkref #0: rel 1663/5/16396 blk 1
[..]
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/05004050, prev 0/05003FD8, desc: INSERT_LEAF off: 1, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/05004090, prev 0/05004050, desc: INSERT_LEAF off: 2, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/050040D0, prev 0/05004090, desc: INSERT_LEAF off: 3, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/05004110, prev 0/050040D0, desc: INSERT_LEAF off: 4, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/05004150, prev 0/05004110, desc: INSERT_LEAF off: 5, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/05004190, prev 0/05004150, desc: INSERT_LEAF off: 6, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/050041D0, prev 0/05004190, desc: INSERT_LEAF off: 7, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/05004210, prev 0/050041D0, desc: INSERT_LEAF off: 8, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/05004250, prev 0/05004210, desc: INSERT_LEAF off: 9, blkref #0: rel 1663/5/16398 blk 1
[..]

I don't know exactly which situation Dmitry has hit, though the second
scenario would be much easier to optimize. This of course reminds me of your
earlier work on the recent_buffer optimization too, and Andres mentioned some
form of LRU cache to just protect the hash table for buffer mapping lookups.
Also, I was under the impression that the work by Bharath [0] could help here
too, to lower the number of WAL records emitted.

@Dmitry: so if you're reading this, it has been a known problem for a while (a
max performance ceiling, see [1]), but there are a myriad of possible
long-term solutions in the code. Outside of hacking on PG itself, you could
probably only split the big table on the publisher into many smaller
partitions (but not too many), and then have COPY run on those smaller ones
with some delays so that replication lag doesn't grow too much. The worst
alternative is trying to get the fastest possible cores (it's a
single-threaded bottleneck). As Andres noted, you could probably try to
recompile with a better -march flag on that ARM and see how much that helps.

-J.

[0] - https://www.postgresql.org/message-id/flat/CALj2ACVi9eTRYR=gdca5wxtj3Kk_9q9qVccxsS1hngTGOCjPwQ@mail.gmail.com
[1] - https://www.postgresql.org/message-id/VI1PR0701MB69608CBCE44D80857E59572EF6CA0%40VI1PR0701MB6960.eurprd07.prod.outlook.com