Re: Commit Timestamp and LSN Inversion issue - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Commit Timestamp and LSN Inversion issue
Date:
Msg-id: wpitjnzyno7pmhnaudsehw5v6lw644cz52wx4qqd2wvfazlxgq@gsogjmyhn3xw
In response to: Re: Commit Timestamp and LSN Inversion issue (Jan Wieck <jan@wi3ck.info>)
Responses: Re: Commit Timestamp and LSN Inversion issue
List: pgsql-hackers
Hi,

On 2024-11-12 10:12:40 -0500, Jan Wieck wrote:
> On 11/12/24 08:55, Andres Freund wrote:
> > Hi,
> >
> > On 2024-11-12 08:51:49 -0500, Jan Wieck wrote:
> > > On 11/11/24 23:21, Amit Kapila wrote:
> > > > As the inversion issue can mainly hamper logical replication-based
> > > > solutions we can do any of this additional work under spinlock only
> > > > when the current record is a commit record (which the currently
> > > > proposed patch is already doing) and "wal_level = logical" and also
> > > > can have another option at the subscription level to enable this new
> > > > code path. I am not sure what is best but just sharing the ideas here.
> > >
> > > It can indeed be reduced to one extra *unlikely* if test only for commit
> > > records and only when WAL level is "logical". Without those two being true
> > > there would be zero impact on ReserveXLogInsertLocation().
> >
> > No, the impact would be far larger, because it would make it impractical to
> > remove this bottleneck by using atomic instructions in
> > ReserveXLogInsertLocation(). It doesn't make sense to make it harder to
> > resolve one of the core central bottlenecks in postgres for a niche feature
> > like this. Sorry, but this is just not a viable path.
>
> The current section of code covered by the spinlock performs the following
> operations:
>
> - read CurrBytePos (shmem)
> - read PrevBytePos (shmem)
> - store CurrBytePos in PrevBytePos
> - increment CurrBytePos by size
>
> In addition there is ReserveXLogSwitch() that is far more complex and is
> guarded by the same spinlock because both functions manipulate the same
> shared memory variables. With that in mind, how viable is the proposal to
> replace these code sections in both functions with atomic operations?

I have working code - pretty ugly in its current state, but it mostly needs a
fair bit of elbow grease, not divine inspiration... It's not a trivial change,
but it is entirely doable.

The short summary of how it works is that it uses a single 64bit atomic that
is internally subdivided into a ringbuffer position in the N high bits and an
offset from a base LSN in the remaining bits.

The insertion sequence is:

1) xadd of [increment ringbuffer position by 1] + [size of record] to the
   insertion position state variable. This assigns a ringbuffer position and
   an insertion LSN (by adding the position offset to the base pointer) to
   the inserting backend.

2) The inserting backend sets ringbuffer[pos].oldpos = oldpos.

3) The inserting backend gets the prior record's pos from
   ringbuffer[pos - 1].oldpos; this might need to wait for the prior inserter
   to set ringbuffer[pos - 1].oldpos.

   Note that this does *not* have to happen at the point where we currently
   get the prev pointer. Instead we can start copying the record into place
   and only then acquire the prev pointer and update the record with it. That
   makes it extremely rare to have to wait for the prev pointer to be set.

This leaves you with a single xadd to a contended cacheline as the contention
point (which scales far better than cmpxchg and far, far better than
cmpxchg16b). There is a bit of contention on ringbuffer[].oldpos being set
and read, but that only ever involves two backends, not all of them.

The nice part is that this scheme leaves you with a ringbuffer that is
ordered by insertion LSN, which allows making WaitXLogInsertionsToFinish()
far more efficient and getting rid of NUM_XLOGINSERT_LOCKS (by removing WAL
insertion locks).
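To make the scheme above a bit more concrete, here is a minimal sketch of the
reservation path, written with plain C11 atomics rather than PostgreSQL's
pg_atomic_* wrappers. Every name in it (RING_BITS, RingSlot, ReserveInsertPos,
...) is made up for illustration and is not taken from the work-in-progress
patch; it also hard-codes 0 as a "not yet published" sentinel and ignores
base-pointer advancement and slot-reuse protection entirely:

/*
 * Minimal sketch of the packed reservation counter described above, using
 * plain C11 atomics instead of PostgreSQL's pg_atomic_* wrappers.  All names
 * are illustrative only; none of them come from the actual patch.
 */
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_BITS     10                    /* high bits: ringbuffer position */
#define RING_SIZE     (1u << RING_BITS)
#define OFFSET_BITS   (64 - RING_BITS)      /* low bits: offset from base LSN */
#define OFFSET_MASK   ((UINT64_C(1) << OFFSET_BITS) - 1)

typedef struct RingSlot
{
    /*
     * Start position of the record inserted at this ring position; 0 means
     * "not yet published".  (A real implementation needs a better sentinel
     * and must prevent a reader from seeing a stale value from the previous
     * trip around the ring - the "reused too eagerly" problem.)
     */
    _Atomic uint64_t oldpos;
} RingSlot;

static _Atomic uint64_t insert_state;  /* [ring position | offset from base] */
static uint64_t         base_lsn;      /* advanced only with insertions quiesced */
static RingSlot         ring[RING_SIZE];

/*
 * Reserve 'size' bytes of WAL.  Returns the record's start LSN and, via the
 * out parameters, the prev-record pointer and the ring slot we occupy.
 */
static uint64_t
ReserveInsertPos(uint64_t size, uint64_t *prev_lsn, uint32_t *ring_pos)
{
    /* step 1: one xadd bumps the ring position by 1 and the offset by 'size' */
    uint64_t delta = (UINT64_C(1) << OFFSET_BITS) + size;
    uint64_t old = atomic_fetch_add(&insert_state, delta);

    uint32_t pos = (uint32_t) (old >> OFFSET_BITS) & (RING_SIZE - 1);
    uint64_t startpos = base_lsn + (old & OFFSET_MASK);

    /*
     * step 2: publish our own start position so that the next inserter can
     * read it as its prev pointer.
     */
    atomic_store(&ring[pos].oldpos, startpos);

    /*
     * step 3: fetch the prior record's start position, waiting if that
     * inserter has not published it yet.  As described above, this could be
     * deferred until after the record has been copied into place.
     */
    uint32_t prevslot = (pos - 1) & (RING_SIZE - 1);
    uint64_t prev;

    while ((prev = atomic_load(&ring[prevslot].oldpos)) == 0)
        sched_yield();

    *prev_lsn = prev;
    *ring_pos = pos;
    return startpos;
}

The point of the sketch is that the only operation every inserter performs on
the shared cacheline is the single atomic_fetch_add; publishing and consuming
a slot's oldpos only ever involves two neighbouring backends.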
Right now NUM_XLOGINSERT_LOCKS is a major scalability limit - but at the same
time increasing it makes the contention on the spinlock *much* worse, leading
to slowdowns in other workloads.

The above description leaves out a bunch of complexity, of course - we e.g.
need to make sure that ringbuffer positions can't be reused too eagerly, and
we need to update the base pointer (a rough sketch of that part follows
below).

Greetings,

Andres Freund
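The base-pointer maintenance mentioned above could, in the simplest case, look
roughly like the following. This builds on the made-up definitions from the
earlier sketch (insert_state, base_lsn, OFFSET_MASK), is not from the actual
patch, and assumes all insertions are quiesced while rebasing - doing this
concurrently is exactly the complexity the mail leaves out.

/*
 * Rough illustration of folding the accumulated offset back into the base
 * LSN before the offset field can overflow into the ring-position bits.
 * Assumption: insertions are quiesced while this runs (e.g. via a lock that
 * every inserter holds in shared mode).
 */
static void
RebaseInsertPos(void)
{
    uint64_t state  = atomic_load(&insert_state);
    uint64_t offset = state & OFFSET_MASK;

    /* move the accumulated bytes into base_lsn, keep the ring-position bits */
    base_lsn += offset;
    atomic_store(&insert_state, state & ~OFFSET_MASK);
}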