Thread: hash_search_with_hash_value is high in "perf top" on a replica
Hi.
Debugging some replication lag on a replica when the master node experiences heavy writes.
PG "startup recovering" eats up a lot of CPU (like 65 %user and 30 %sys), which is a little surprising (what is it doing with all those CPU cycles? it looked like WAL replay should be more IO bound than CPU bound?).
Running "perf top -p <pid>", it shows this:
Samples: 1M of event 'cycles:P', 4000 Hz, Event count (approx.): 18178814660 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
16.63% postgres [.] hash_search_with_hash_value
5.38% postgres [.] __aarch64_ldset4_sync
4.42% postgres [.] __aarch64_cas4_acq_rel
3.42% postgres [.] XLogReadBufferExtended
2.35% libc.so.6 [.] 0x0000000000097f4c
2.04% postgres [.] pg_comp_crc32c_armv8
1.77% [kernel] [k] mutex_lock
1.63% postgres [.] XLogPrefetcherReadRecord
1.56% postgres [.] DecodeXLogRecord
1.55% postgres [.] LWLockAcquire
Overhead Shared Object Symbol
16.63% postgres [.] hash_search_with_hash_value
5.38% postgres [.] __aarch64_ldset4_sync
4.42% postgres [.] __aarch64_cas4_acq_rel
3.42% postgres [.] XLogReadBufferExtended
2.35% libc.so.6 [.] 0x0000000000097f4c
2.04% postgres [.] pg_comp_crc32c_armv8
1.77% [kernel] [k] mutex_lock
1.63% postgres [.] XLogPrefetcherReadRecord
1.56% postgres [.] DecodeXLogRecord
1.55% postgres [.] LWLockAcquire
This is more surprising: hash_search_with_hash_value is a hash table lookup function, is it expected that it is #1 in the profiler? (Called mostly from PrefetchSharedBuffer*.)
Configuration:
Configuration:
- PG17
- ZFS, compression is off, empty database (with compression on, most of "startup recovering" CPU was spent in ZFS guts doing compression/decompression according to the profiler)
- ZFS, compression is off, empty database (with compression on, most of "startup recovering" CPU was spent in ZFS guts doing compression/decompression according to the profiler)
- full_page_writes=off, recovery_prefetch=on (ZFS supports it, I tested with a small C program), wal_decode_buffer_size=2048kB (it doesn't seem to affect anything though).
- shared_buffers = 25% of RAM
- testing with a giant COPY command basically
My main question is about hash_search_with_hash_value CPU usage. With both recovery_prefetch=on and off, it is the topmost function in "perf top". I see no IO bottleneck on the machine, it's only "startup recovering" maxing out one CPU core.
Maybe it's a red herring though, but it looks pretty suspicious.
- shared_buffers = 25% of RAM
- testing with a giant COPY command basically
My main question is about hash_search_with_hash_value CPU usage. With both recovery_prefetch=on and off, it is the topmost function in "perf top". I see no IO bottleneck on the machine, it's only "startup recovering" maxing out one CPU core.
Maybe it's a red herring though, but it looks pretty suspicious.
31.01.2025 17:25, Álvaro Herrera пишет: > On 2025-Jan-31, Dmitry Koterov wrote: > >> PG "startup recovering" eats up a lot of CPU (like 65 %user and 30 %sys), >> which is a little surprising (what is it doing with all those CPU cycles? >> it looked like WAL replay should be more IO bound than CPU bound?). >> >> Running "perf top -p <pid>", it shows this: >> >> Samples: 1M of event 'cycles:P', 4000 Hz, Event count (approx.): >> 18178814660 lost: 0/0 drop: 0/0 >> Overhead Shared Object Symbol >> 16.63% postgres [.] hash_search_with_hash_value > > Yeah, I noticed that this function was showing high in some profiles a > couple of days ago as well. Looking now at hashfn.c (hash_bytes_uint32 > there is the function involved in the buffer mapping hash table), the > comments state that we're updated to Bob Jenkins code from 2006, but > there's a version in his website that (if I read correctly) is twice as > fast as what we're using now: > http://burtleburtle.net/bob/hash/spooky.html > > Apparently this code in our repo is mostly unchanged since commit > 1f559b7d3aa4, in 2007. > > He mentions that on Intel chips, Google's CityHash is faster; but they > in turn claim that the difference is small on Intel chips and that > Jenkins' hash is better on AMD chips. > https://github.com/google/cityhash > > Anyway if you wanted to try your luck at improving things, here's your > chance. > `hash_search_with_hash_value` uses already calculated hash value, so its performance doesn't depend on performance of hash function (hash_bytes_uint32 or any other). I believe, it is memory bound, since dynahash does at least three indirection jumps before it event reaches first node in linked list segp = HTAB->dir[segment_num]; *bucketptr = segp[segment_ndx]; And then iterates through linked list with each iteration is being (for large hash table) both cache and TLB (without huge pages) miss.