Thread: hash_search_with_hash_value is high in "perf top" on a replica
Hi.
Debugging some replication lag on a replica when the master node experiences heavy writes.
PG "startup recovering" eats up a lot of CPU (like 65 %user and 30 %sys), which is a little surprising (what is it doing with all those CPU cycles? it looked like WAL replay should be more IO bound than CPU bound?).
Running "perf top -p <pid>", it shows this:
Samples: 1M of event 'cycles:P', 4000 Hz, Event count (approx.): 18178814660 lost: 0/0 drop: 0/0
  Overhead  Shared Object  Symbol
    16.63%  postgres       [.] hash_search_with_hash_value
     5.38%  postgres       [.] __aarch64_ldset4_sync
     4.42%  postgres       [.] __aarch64_cas4_acq_rel
     3.42%  postgres       [.] XLogReadBufferExtended
     2.35%  libc.so.6      [.] 0x0000000000097f4c
     2.04%  postgres       [.] pg_comp_crc32c_armv8
     1.77%  [kernel]       [k] mutex_lock
     1.63%  postgres       [.] XLogPrefetcherReadRecord
     1.56%  postgres       [.] DecodeXLogRecord
     1.55%  postgres       [.] LWLockAcquire
This is more surprising: hash_search_with_hash_value is a hash table lookup function, so is it expected to be #1 in the profiler? (It's called mostly from PrefetchSharedBuffer*.)
Configuration:
- PG17
- ZFS, compression is off, empty database (with compression on, most of "startup recovering" CPU was spent in ZFS guts doing compression/decompression according to the profiler)
- full_page_writes=off, recovery_prefetch=on (ZFS supports it, I tested with a small C program), wal_decode_buffer_size=2048kB (it doesn't seem to affect anything though).
- shared_buffers = 25% of RAM
- testing with a giant COPY command basically
My main question is about hash_search_with_hash_value CPU usage. With both recovery_prefetch=on and off, it is the topmost function in "perf top". I see no IO bottleneck on the machine; it's only "startup recovering" maxing out one CPU core.
Maybe it's a red herring though, but it looks pretty suspicious.
31.01.2025 17:25, Álvaro Herrera writes:
> On 2025-Jan-31, Dmitry Koterov wrote:
>
>> PG "startup recovering" eats up a lot of CPU (like 65 %user and 30 %sys),
>> which is a little surprising (what is it doing with all those CPU cycles?
>> it looked like WAL replay should be more IO bound than CPU bound?).
>>
>> Running "perf top -p <pid>", it shows this:
>>
>> Samples: 1M of event 'cycles:P', 4000 Hz, Event count (approx.):
>> 18178814660 lost: 0/0 drop: 0/0
>> Overhead Shared Object Symbol
>> 16.63% postgres [.] hash_search_with_hash_value
>
> Yeah, I noticed that this function was showing high in some profiles a
> couple of days ago as well. Looking now at hashfn.c (hash_bytes_uint32
> there is the function involved in the buffer mapping hash table), the
> comments state that we updated to Bob Jenkins' code from 2006, but
> there's a version on his website that (if I read correctly) is twice as
> fast as what we're using now:
> http://burtleburtle.net/bob/hash/spooky.html
>
> Apparently this code in our repo is mostly unchanged since commit
> 1f559b7d3aa4, in 2007.
>
> He mentions that on Intel chips, Google's CityHash is faster; but they
> in turn claim that the difference is small on Intel chips and that
> Jenkins' hash is better on AMD chips.
> https://github.com/google/cityhash
>
> Anyway if you wanted to try your luck at improving things, here's your
> chance.

hash_search_with_hash_value uses an already calculated hash value, so its
performance doesn't depend on the performance of the hash function
(hash_bytes_uint32 or any other). I believe it is memory bound, since
dynahash does at least three indirection jumps before it even reaches the
first node of the bucket's linked list:

    segp = HTAB->dir[segment_num];
    *bucketptr = segp[segment_ndx];

and then iterates through that linked list, with each iteration being (for a
large hash table) both a cache miss and a TLB miss (without huge pages).
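(For readers unfamiliar with dynahash's layout, here is a standalone,
simplified sketch of that lookup chain, paraphrased from the logic in
src/backend/utils/hash/dynahash.c; the types and names below are abbreviated
for illustration and are not the actual PostgreSQL definitions.)

/*
 * Simplified, standalone sketch of dynahash's bucket lookup -- not the
 * actual source.  The point is the chain of dependent loads:
 * directory -> segment -> bucket head -> chain entries, each typically on a
 * separate cache line.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct HashElement
{
    struct HashElement *link;       /* next entry in the same bucket */
    uint32_t            hashvalue;  /* cached hash of the key */
    /* key and payload follow in the real thing */
} HashElement;

typedef struct MiniHash
{
    HashElement ***dir;             /* directory of segments */
    uint32_t       bucket_mask;     /* nbuckets - 1 (power of two assumed) */
    uint32_t       ssize;           /* bucket slots per segment (power of two) */
    int            sshift;          /* log2(ssize) */
} MiniHash;

static HashElement *
mini_hash_search_with_hash_value(MiniHash *htab, uint32_t hashvalue,
                                 bool (*match)(HashElement *, const void *),
                                 const void *key)
{
    uint32_t      bucket      = hashvalue & htab->bucket_mask;
    uint32_t      segment_num = bucket >> htab->sshift;
    uint32_t      segment_ndx = bucket & (htab->ssize - 1);
    HashElement **segp;
    HashElement  *cur;

    segp = htab->dir[segment_num];  /* 1st dependent load: directory entry */
    cur  = segp[segment_ndx];       /* 2nd dependent load: bucket head */

    /* 3rd load onward: chase the chain, one likely cache miss per hop */
    for (; cur != NULL; cur = cur->link)
    {
        if (cur->hashvalue == hashvalue && match(cur, key))
            return cur;
    }
    return NULL;
}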
On Fri, Jan 31, 2025, 15:43 Andres Freund <andres@anarazel.de> wrote:
> > Maybe it's a red herring though, but it looks pretty suspicious.
>
> It's unfortunately not too surprising - our buffer mapping table is a pretty
> big bottleneck. Both because a hash table is just not a good fit for the
> buffer mapping table due to the lack of locality and because dynahash is a
> really poor hash table implementation.
I measured similar things when looking at apply throughput recently. For in-cache workloads, buffer lookup and locking were about half of the load.
One other direction is to extract more memory concurrency. The prefetcher could batch multiple lookups together so CPU OoO execution has a chance to fire off multiple memory accesses at the same time.
The other direction is to split off WAL decoding, buffer lookup and maybe even pinning to a separate process from the main redo loop.
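(A minimal, standalone illustration of the batching idea above; the table
layout and helper names are hypothetical stand-ins, not PostgreSQL code. The
first pass touches only the bucket heads and issues prefetches, so the chain
walks in the second pass can have several cache misses in flight at once.)

/*
 * Hypothetical sketch of batched hash lookups.  Phase 1 reads the bucket
 * heads and issues prefetches; phase 2 walks the chains.  Because several
 * bucket-head loads are requested before any chain is walked, the
 * out-of-order core gets more memory-level parallelism than one serialized
 * lookup at a time.
 */
#include <stddef.h>
#include <stdint.h>

#define BATCH_MAX 16

typedef struct Entry
{
    struct Entry *next;     /* bucket chain */
    uint32_t      hash;
    uint64_t      key;      /* stand-in for a buffer tag */
    int           value;    /* stand-in for a buffer id */
} Entry;

typedef struct Table
{
    Entry   **buckets;
    uint32_t  mask;         /* nbuckets - 1, nbuckets a power of two */
} Table;

/* Resolve up to BATCH_MAX lookups at once; results[i] = value, or -1. */
static void
lookup_batch(const Table *t, const uint64_t *keys, const uint32_t *hashes,
             int n, int *results)
{
    Entry *heads[BATCH_MAX];

    for (int i = 0; i < n; i++)
    {
        heads[i] = t->buckets[hashes[i] & t->mask];
        __builtin_prefetch(heads[i]);   /* GCC/Clang builtin: start the miss early */
    }

    for (int i = 0; i < n; i++)
    {
        results[i] = -1;
        for (const Entry *e = heads[i]; e != NULL; e = e->next)
        {
            if (e->hash == hashes[i] && e->key == keys[i])
            {
                results[i] = e->value;
                break;
            }
        }
    }
}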
--
Ants Aasma
Hi,

On 2025-02-01 15:43:41 +0100, Ants Aasma wrote:
> On Fri, Jan 31, 2025, 15:43 Andres Freund <andres@anarazel.de> wrote:
> >
> > > Maybe it's a red herring though, but it looks pretty suspicious.
> >
> > It's unfortunately not too surprising - our buffer mapping table is a pretty
> > big bottleneck. Both because a hash table is just not a good fit for the
> > buffer mapping table due to the lack of locality and because dynahash is a
> > really poor hash table implementation.
>
> I measured similar things when looking at apply throughput recently. For
> in-cache workloads, buffer lookup and locking were about half of the load.
>
> One other direction is to extract more memory concurrency. The prefetcher
> could batch multiple lookups together so CPU OoO execution has a chance to
> fire off multiple memory accesses at the same time.

I think at the moment we have a *hilariously* cache-inefficient buffer lookup,
that's the first thing to address. A hash table for buffer mapping lookups imo
is a bad idea, due to losing all locality in a workload that exhibits a *lot*
of locality. But furthermore, dynahash.c is very far from a cache efficient
hashtable implementation.

The other aspect is that in many workloads we'll look up a small set of
buffers over and over, which a) wastes cycles b) wastes cache space for stuff
that could be elided much more efficiently.

We also do a lot of hash lookups for smgr, because we don't have any
cross-record caching infrastructure for that.

> The other direction is to split off WAL decoding, buffer lookup and maybe
> even pinning to a separate process from the main redo loop.

Maybe, but I think we're rather far away from those things being the most
productive thing to tackle.

Greetings,

Andres Freund
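(As an illustration of the "small set of buffers looked up over and over"
point, here is a hypothetical, standalone sketch of a tiny per-backend cache
consulted before the shared mapping table. The names are invented, and the
hard part, invalidating or re-validating entries when buffers are evicted, is
deliberately glossed over.)

/*
 * Hypothetical sketch (not PostgreSQL code): a small direct-mapped cache of
 * recent (block tag -> buffer id) results checked before the shared buffer
 * mapping hash table.  A real version would need to validate hits against
 * the buffer header or invalidate entries on eviction.
 */
#include <stdbool.h>
#include <stdint.h>

#define LOOKUP_CACHE_SIZE 64    /* must be a power of two */

typedef struct CachedMapping
{
    uint64_t tag;               /* stand-in for (relation, fork, block) */
    int      buffer_id;
    bool     valid;             /* zero-initialized slots start invalid */
} CachedMapping;

static CachedMapping lookup_cache[LOOKUP_CACHE_SIZE];

/* Stand-in for the real buffer mapping hash table lookup. */
static int
shared_mapping_lookup(uint64_t tag)
{
    (void) tag;
    return -1;                  /* "not in shared buffers" in this stub */
}

static int
buffer_lookup_cached(uint64_t tag)
{
    CachedMapping *slot = &lookup_cache[tag & (LOOKUP_CACHE_SIZE - 1)];
    int            buf;

    /* Fast path: recently looked-up block, no hash table traffic at all. */
    if (slot->valid && slot->tag == tag)
        return slot->buffer_id;

    /* Miss: fall back to the shared hash table and remember the answer. */
    buf = shared_mapping_lookup(tag);
    if (buf >= 0)
    {
        slot->tag = tag;
        slot->buffer_id = buf;
        slot->valid = true;
    }
    return buf;
}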
On Sat, 1 Feb 2025 at 16:55, Andres Freund <andres@anarazel.de> wrote:
> [..]
> I think at the moment we have a *hilariously* cache-inefficient buffer lookup,
> that's the first thing to address. A hash table for buffer mapping lookups imo
> is a bad idea, due to losing all locality in a workload that exhibits a *lot*
> of locality. But furthermore, dynahash.c is very far from a cache efficient
> hashtable implementation.

In case you might be interested, I've sent a new patch with a new approach to
reducing the buffer lookup table's memory in [0], which attempts to create a
more cache-efficient hash table implementation.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] https://www.postgresql.org/message-id/CAEze2WiRo4Zu71jwxYmqjq6XK814Avf2-kytaL6n%3DBreZR2ZbA%40mail.gmail.com
On Sun, Feb 2, 2025 at 3:44 AM Ants Aasma <ants.aasma@cybertec.at> wrote:
> The other direction is to split off WAL decoding, buffer lookup and maybe
> even pinning to a separate process from the main redo loop.

Hi Ants,

FWIW I have a patch set that changes xlogprefetcher.c to use read_stream.c,
which I hope to propose for 19. It pins ahead of time, which means that it
already consolidates some pin/unpin/repin sequences when it looks ahead, a
little bit like what you're suggesting, just not in a separate process.
Here[1] is a preview, but it needs some major rebasing and has some
issues... it's on my list...

I mention this because, although the primary goal of the
recovery-streamification project is recovery I/O performance, especially
anticipating direct I/O (especially without FPW), there are many more
opportunities along those lines for cached blocks. I've written about some
of them before and posted toy sketch code, but I didn't let myself get too
distracted as the first thing seems to be real streamification (replaying
LsnReadQueue).

With a bit more work it could avoid lots of buffer mapping table lookups
too: I posted a few ideas and toy patches, one based on already held pins
which begins to pay off once you hold many pins in a window, and one more
generally caching recently accessed buffers in smgr relations. That'd be
independent of whether our buffer mapping table also sucks and needs a
rewrite, it bypasses it completely in many interesting cases, and would
need to be so cheap as to be basically free when it doesn't help.

An assumption I just made, in remembering all that: OP didn't mention it,
but I guess that this COPY replay is probably repeatedly hammering the same
pages from separate records here, because otherwise the multi-insert stuff
would already avoid a lot of mapping table lookups?

There are many more automatic batching opportunities, like holding content
locks across records (I posted a toy patch to measure the potential speedup
for that once, too lazy to search for it...), and then to really benefit
from all of these things in more general real workloads I think you need to
be able to do a little bit of re-ordering because eg heap/index accesses
are so often interleaved.

IOW there seems to be plenty of lower-hanging fruit in the "mechanical
sympathy" category, before you need another thread. No argument against
also making the hashing faster, the mapping table better, and eventually
parallelising some bits when it's not just compensating for inefficient
code that throws away all the locality. I just wanted to mention that
related work in the pipeline :-)

[1] https://github.com/macdice/postgres/commit/52533b652e544464add6f174019161889a37ed93
On Wed, Feb 5, 2025 at 10:22 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> (replaying LsnReadQueue)

s/replaying/replacing/
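(For readers following along: a rough, hypothetical sketch of the "caching
recently accessed buffers in smgr relations" idea Thomas mentions above. This
is not his patch; it only leans on the existing ReadRecentBuffer() API, which
pins a remembered buffer and checks that it still holds the expected block
without consulting the buffer mapping table. The RecentBlock struct and the
assumption of a permanent relation are inventions for illustration.)

/*
 * Hypothetical sketch, not an actual PostgreSQL patch: remember the last
 * buffer used for a given smgr relation and try to revalidate it with
 * ReadRecentBuffer() before falling back to a normal lookup through the
 * buffer mapping table.  The RecentBlock cache is invented for illustration;
 * the caller is responsible for releasing the returned (pinned) buffer.
 */
#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/smgr.h"

typedef struct RecentBlock
{
    BlockNumber blkno;
    Buffer      buf;            /* InvalidBuffer if nothing remembered */
} RecentBlock;

static Buffer
read_block_with_recent_cache(SMgrRelation reln, ForkNumber forknum,
                             BlockNumber blkno, RecentBlock *cache)
{
    Buffer      buf;

    /* Fast path: same block as last time and the buffer still holds it. */
    if (cache->buf != InvalidBuffer && cache->blkno == blkno &&
        ReadRecentBuffer(reln->smgr_rlocator.locator, forknum, blkno,
                         cache->buf))
        return cache->buf;      /* ReadRecentBuffer already pinned it */

    /* Slow path: regular lookup (goes through the buffer mapping table). */
    buf = ReadBufferWithoutRelcache(reln->smgr_rlocator.locator, forknum,
                                    blkno, RBM_NORMAL, NULL,
                                    true /* assumes a permanent relation */);

    cache->blkno = blkno;
    cache->buf = buf;
    return buf;
}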
Hi Thomas!

On Tue, Feb 4, 2025 at 10:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sun, Feb 2, 2025 at 3:44 AM Ants Aasma <ants.aasma@cybertec.at> wrote:
> > The other direction is to split off WAL decoding, buffer lookup and maybe
> > even pinning to a separate process from the main redo loop.
>
> Hi Ants,
> [..]
> An assumption I just made, in remembering all that: OP didn't mention it,
> but I guess that this COPY replay is probably repeatedly hammering the same
> pages from separate records here, because otherwise the multi-insert stuff
> would already avoid a lot of mapping table lookups?

A basic COPY (he mentioned a basic CREATE SUBSCRIPTION table copy, so I
assume it's fresh) ends up emitting stuff like this (note that "blk" keeps
increasing):

rmgr: Heap2 len (rec/tot): 3666/ 3666, tx: 759, lsn: 0/040069A0, prev 0/04004950, desc: MULTI_INSERT+INIT ntuples: 226, flags: 0x00, blkref #0: rel 1663/5/16393 blk 5
rmgr: Heap2 len (rec/tot): 3666/ 3666, tx: 759, lsn: 0/040077F8, prev 0/040069A0, desc: MULTI_INSERT+INIT ntuples: 226, flags: 0x00, blkref #0: rel 1663/5/16393 blk 6
rmgr: Heap2 len (rec/tot): 3666/ 3666, tx: 759, lsn: 0/04008668, prev 0/040077F8, desc: MULTI_INSERT+INIT ntuples: 226, flags: 0x00, blkref #0: rel 1663/5/16393 blk 7
rmgr: Heap2 len (rec/tot): 3122/ 3122, tx: 759, lsn: 0/040094C0, prev 0/04008668, desc: MULTI_INSERT+INIT ntuples: 192, flags: 0x02, blkref #0: rel 1663/5/16393 blk 8

Now, if the table has a PK, we end up doing a *massive* number of INSERT_LEAF
records due to the lack of batched btree emission like in the heap2 case:

rmgr: Heap2 len (rec/tot): 3666/ 3666, tx: 763, lsn: 0/05000028, prev 0/041C8198, desc: MULTI_INSERT+INIT ntuples: 226, flags: 0x00, blkref #0: rel 1663/5/16396 blk 0
rmgr: Heap2 len (rec/tot): 3666/ 3666, tx: 763, lsn: 0/05000E80, prev 0/05000028, desc: MULTI_INSERT+INIT ntuples: 226, flags: 0x00, blkref #0: rel 1663/5/16396 blk 1
[..]
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/05004050, prev 0/05003FD8, desc: INSERT_LEAF off: 1, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/05004090, prev 0/05004050, desc: INSERT_LEAF off: 2, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/050040D0, prev 0/05004090, desc: INSERT_LEAF off: 3, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/05004110, prev 0/050040D0, desc: INSERT_LEAF off: 4, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/05004150, prev 0/05004110, desc: INSERT_LEAF off: 5, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/05004190, prev 0/05004150, desc: INSERT_LEAF off: 6, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/050041D0, prev 0/05004190, desc: INSERT_LEAF off: 7, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/05004210, prev 0/050041D0, desc: INSERT_LEAF off: 8, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/05004250, prev 0/05004210, desc: INSERT_LEAF off: 9, blkref #0: rel 1663/5/16398 blk 1
[..]

I don't know exactly which situation Dmitry has hit, though the second
scenario would be much easier to optimize. This of course reminds me of your
earlier work on the recent_buffer optimization too, and Andres mentioned some
form of LRU cache to just protect the hash table for buffer mapping lookups.
Also, I was under the impression that the work by Bharath [0] could help here
too, to lower the number of WAL records emitted.

@Dmitry: so if you're reading this, it has been a known problem for a while (a
max performance ceiling, see [1]), but there are a myriad of possible
long-term solutions in the code. Outside of hacking on PG itself, you could
probably only split the big table on the publisher into many smaller
partitions (but not too many), and then have COPY run on those smaller ones
with some delays so that replication lag doesn't grow too much. The worst
alternative is trying to get the fastest possible cores (it's a
single-threaded bottleneck). As Andres noted, you could probably try to
recompile with a better -march flag on that ARM and see how much that helps.

-J.

[0] - https://www.postgresql.org/message-id/flat/CALj2ACVi9eTRYR=gdca5wxtj3Kk_9q9qVccxsS1hngTGOCjPwQ@mail.gmail.com
[1] - https://www.postgresql.org/message-id/VI1PR0701MB69608CBCE44D80857E59572EF6CA0%40VI1PR0701MB6960.eurprd07.prod.outlook.com