> Maybe it's a red herring though, but it looks pretty suspicious.
It's unfortunately not too surprising - our buffer mapping table is a pretty
big bottleneck. Both because a hash table is just not a good fit for the
buffer mapping table due to the lack of locality and because dynahash is
a really poor hash table implementation.
I measured similar things when looking at apply throughput recently. For in-cache workloads, buffer lookup and locking were about half of the load.
One other direction is to extract more memory concurrency. The prefetcher could batch multiple lookups together so that the CPU's out-of-order execution has a chance to fire off multiple memory accesses at the same time.
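
To make the batching idea concrete, here is a rough sketch on a toy open-addressing table. It is not dynahash or the real buffer mapping code: Table, Slot, hash_key, table_insert and table_lookup_batch are all made-up names, the key is a plain integer standing in for a buffer tag, and __builtin_prefetch assumes GCC or Clang. The first pass hashes every key in the batch and issues a prefetch for its bucket; the second pass does the actual probing, by which time the cache misses have hopefully overlapped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 1024   /* hypothetical size, must be a power of two */
#define BATCH_MAX  16     /* hypothetical number of lookups kept in flight */

typedef struct {
    uint32_t key;         /* stand-in for a buffer tag */
    int32_t  value;       /* stand-in for a buffer id; -1 marks an empty slot */
} Slot;

typedef struct {
    Slot   slots[TABLE_SIZE];
    size_t used;
} Table;

/* Simple integer mixer, only here to spread the keys around. */
static uint32_t hash_key(uint32_t k)
{
    k ^= k >> 16; k *= 0x85ebca6bu;
    k ^= k >> 13; k *= 0xc2b2ae35u;
    k ^= k >> 16;
    return k;
}

static void table_init(Table *t)
{
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        t->slots[i].key = 0;
        t->slots[i].value = -1;
    }
    t->used = 0;
}

/* Linear probing insert; always leaves at least one empty slot. */
static void table_insert(Table *t, uint32_t key, int32_t value)
{
    assert(value >= 0 && t->used < TABLE_SIZE - 1);
    size_t i = hash_key(key) & (TABLE_SIZE - 1);
    while (t->slots[i].value != -1 && t->slots[i].key != key)
        i = (i + 1) & (TABLE_SIZE - 1);
    if (t->slots[i].value == -1)
        t->used++;
    t->slots[i].key = key;
    t->slots[i].value = value;
}

/* Look up n keys; out[i] receives the value for keys[i], or -1 if absent. */
static void table_lookup_batch(const Table *t, const uint32_t *keys,
                               int32_t *out, size_t n)
{
    size_t pos[BATCH_MAX];

    for (size_t base = 0; base < n; base += BATCH_MAX) {
        size_t m = (n - base < BATCH_MAX) ? n - base : BATCH_MAX;

        /* Pass 1: compute bucket positions and start the memory loads. */
        for (size_t j = 0; j < m; j++) {
            pos[j] = hash_key(keys[base + j]) & (TABLE_SIZE - 1);
            __builtin_prefetch(&t->slots[pos[j]], 0, 1);
        }

        /* Pass 2: probe; the first slot of each chain was prefetched. */
        for (size_t j = 0; j < m; j++) {
            size_t i = pos[j];
            while (t->slots[i].value != -1 && t->slots[i].key != keys[base + j])
                i = (i + 1) & (TABLE_SIZE - 1);
            out[base + j] = t->slots[i].value;
        }
    }
}
```

A prefetcher could call something like table_lookup_batch() with the block references decoded from a group of WAL records. Whether that pays off with a chained table like dynahash, where each lookup is a pointer chase, would need measuring.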
The other direction is to split off WAL decoding, buffer lookup and maybe even pinning from the main redo loop into a separate process.
--
Ants Aasma