On Wed, Sep 8, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:
Looking at this profile made me wonder if this was a build without optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls should be inlined. And while perf can reconstruct inlined functions when using --call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)" for me.
Yeah, for profiling I generally build without optimizations so that I can see all the functions in the stack. So the profile results are from a build without optimizations, but the performance results are from an optimized build.
Is this with or without the patch? I mean, can we see a comparison showing whether the patch improved anything in your environment?
Looking at a profile I see the biggest bottleneck in the leader (which is the bottleneck as soon as the worker count is increased) to be reading the length word of the message. I do see shm_mq_receive_bytes() in the profile, but the costly part there is the "read % (uint64) ringsize" - divisions are slow. We could just compute a mask instead of the size.
Yeah, that could be done. I can also test how much we gain with this change.
We also should probably split the read-mostly data in shm_mq (ring_size, detached, ring_offset, receiver, sender) into a separate cacheline from the read/write data. Or perhaps copy more info into the handle, particularly the ringsize (or mask).
Good suggestion, I will do some experiments around this.