Hello,
On 11/12/24 10:34, Andres Freund wrote:
> I have working code - pretty ugly at this state, but mostly needs a fair bit
> of elbow grease not divine inspiration... It's not a trivial change, but
> entirely doable.
>
> The short summary of how it works is that it uses a single 64bit atomic that
> is internally subdivided into a ringbuffer position in N high bits and an
> offset from a base LSN in the remaining bits. The insertion sequence is
>
> ...
>
> This leaves you with a single xadd to contended cacheline as the contention
> point (scales far better than cmpxchg and far far better than
> cmpxchg16b). There's a bit of contention for the ringbuffer[].oldpos being set
> and read, but it's only by two backends, not all of them.
That sounds rather promising.
Would it be reasonable to have both implementations available, at least
at compile time if not at runtime? Is it possible that we will need to
do that anyway for some time, or are those atomic operations available
on all supported CPU architectures?
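
Just to check that I am reading the scheme right, here is a rough
standalone C11 sketch of the packing and the single fetch-add
reservation as I understand it. The names, the bit split, and the lack
of any base-LSN rebasing are all made up for illustration; this is not
your patch, and it deliberately uses <stdatomic.h> rather than our
pg_atomic_* wrappers:

#include <inttypes.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* N high bits: ringbuffer position; low bits: offset from a base LSN */
#define RING_POS_BITS    10
#define LSN_OFFSET_BITS  (64 - RING_POS_BITS)
#define LSN_OFFSET_MASK  ((UINT64_C(1) << LSN_OFFSET_BITS) - 1)
#define RING_SIZE        (1u << RING_POS_BITS)

static _Atomic uint64_t insert_state;   /* the single contended 64bit word */
static uint64_t base_lsn;               /* LSN that offset 0 maps to */

/*
 * Reserve 'size' bytes of WAL.  One fetch_add advances the ring position
 * by one (high bits) and the LSN offset by 'size' (low bits) in a single
 * atomic RMW - the "single xadd" contention point.  Overflow of the
 * offset field / rebasing of base_lsn is not handled in this toy.
 */
static void
reserve_insert(uint64_t size, uint64_t *start_lsn, uint32_t *ring_slot)
{
    uint64_t delta = (UINT64_C(1) << LSN_OFFSET_BITS) | size;
    uint64_t old = atomic_fetch_add(&insert_state, delta);

    *start_lsn = base_lsn + (old & LSN_OFFSET_MASK);
    *ring_slot = (uint32_t) (old >> LSN_OFFSET_BITS) % RING_SIZE;
}

int
main(void)
{
    uint64_t lsn;
    uint32_t slot;

    base_lsn = UINT64_C(0x10000000);
    reserve_insert(100, &lsn, &slot);
    printf("insert at lsn %" PRIu64 ", ring slot %u\n", lsn, slot);
    return 0;
}

If that is roughly right, then a 64bit atomic fetch-add is the only
primitive the hot path needs, which is what my question above is
getting at.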
>
> The nice part is this scheme leaves you with a ringbuffer that's ordered by
> the insertion-lsn. Which allows to make WaitXLogInsertionsToFinish() far more
> efficient and to get rid of NUM_XLOGINSERT_LOCKS (by removing WAL insertion
> locks). Right now NUM_XLOGINSERT_LOCKS is a major scalability limit - but at
> the same time increasing it makes the contention on the spinlock *much* worse,
> leading to slowdowns in other workloads.
Yeah, that is a complex wart. I believe it was the answer to the NUMA
overload that Kevin Grittner and I discovered many years ago, where on
a 4-socket machine the cacheline stealing would get so bad that whoever
was holding the lock could not release it.
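
If I am reading the LSN-ordered ringbuffer part above correctly, the
gain for WaitXLogInsertionsToFinish() would be roughly the following.
This is only a toy, single-threaded illustration with made-up names and
slot layout, not your patch:

#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 1024

typedef struct RingSlot
{
    uint64_t start_lsn;     /* start LSN reserved for this insertion */
    bool     in_progress;   /* still copying into the WAL buffers? */
} RingSlot;

static RingSlot ring[RING_SIZE];

/*
 * Return the LSN up to which all insertions are known to be complete,
 * walking slots in reservation order from oldest_pos up to (but not
 * including) next_pos.  Because the slots are ordered by insertion LSN,
 * the scan can stop at the first unfinished slot instead of having to
 * check every in-progress insertion.
 */
static uint64_t
completed_upto(uint64_t oldest_pos, uint64_t next_pos, uint64_t reserved_upto)
{
    for (uint64_t pos = oldest_pos; pos < next_pos; pos++)
    {
        RingSlot *slot = &ring[pos % RING_SIZE];

        if (slot->in_progress)
            return slot->start_lsn; /* nothing past here is guaranteed done */
    }
    return reserved_upto;           /* every reserved insertion has finished */
}

That would indeed be a lot nicer than the current loop over all
NUM_XLOGINSERT_LOCKS insertion locks.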
In any case, thanks for the input. Looks like in the long run we need to
come up with a different way to solve the inversion problem.
Best Regards,
Jan