Re: Commit Timestamp and LSN Inversion issue - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Commit Timestamp and LSN Inversion issue
Date
Msg-id wpitjnzyno7pmhnaudsehw5v6lw644cz52wx4qqd2wvfazlxgq@gsogjmyhn3xw
Whole thread Raw
In response to Re: Commit Timestamp and LSN Inversion issue  (Jan Wieck <jan@wi3ck.info>)
Responses Re: Commit Timestamp and LSN Inversion issue
List pgsql-hackers
Hi,

On 2024-11-12 10:12:40 -0500, Jan Wieck wrote:
> On 11/12/24 08:55, Andres Freund wrote:
> > Hi,
> >
> > On 2024-11-12 08:51:49 -0500, Jan Wieck wrote:
> > > On 11/11/24 23:21, Amit Kapila wrote:
> > > > As the inversion issue can mainly hamper logical replication-based
> > > > solutions we can do any of this additional work under spinlock only
> > > > when the current record is a commit record (which the currently
> > > > proposed patch is already doing) and "wal_level = logical" and also
> > > > can have another option at the subscription level to enable this new
> > > > code path. I am not sure what is best but just sharing the ideas here.
> > >
> > > It can indeed be reduced to one extra *unlikely* if test only for commit
> > > records and only when WAL level is "logical". Without those two being true
> > > there would be zero impact on ReserveXLogInsertLocation().
> >
> > No, the impact would be far larger, because it would make it impractical to
> > remove this bottleneck by using atomic instructions in
> > ReserveXLogInsertLocation(). It doesn't make sense to make it harder to
> > resolve one of the core central bottlenecks in postgres for a niche feature
> > like this. Sorry, but this is just not a viable path.
>
>
> The current section of code covered by the spinlock performs the following
> operations:
>
>     - read CurrBytePos (shmem)
>     - read PrevBytePos (shmem)
>     - store CurrBytePos in PrevBytePos
>     - increment CurrBytePos by size

> In addition there is ReserveXLogSwitch() that is far more complex and is
> guarded by the same spinlock because both functions manipulate the same
> shared memory variables. With that in mind, how viable is the proposal to
> replace these code sections in both functions with atomic operations?

I have working code - pretty ugly at this state, but mostly needs a fair bit
of elbow grease not divine inspiration...  It's not a trivial change, but
entirely doable.

The short summary of how it works is that it uses a single 64bit atomic that
is internally subdivided into a ringbuffer position in N high bits and an
offset from a base LSN in the remaining bits.  The insertion sequence is

1) xadd of [increment ringbuffer by 1] + [size of record] to the
   insertion position state variable.

   This assigns a ringbuffer position and an insertion LSN (by adding the
   position offset to the base pointer) to the inserting backend.

2) The inserting backend sets ringbuffer[pos].oldpos = oldpos

3) The inserting backend gets the prior record's pos from
   ringbuffer[pos - 1].oldpos, this might need to wait for the prior
   inserter to set ringbuffer[pos - 1].oldpos.

   Note that this does *not* have to happen at the point we currently get the
   prev pointer. Instead we can start copying the record into place and then
   acquire the prev pointer and update the record with it. That makes it
   extremely rare to have to wait for the prev pointer to be set.


This leaves you with a single xadd to contended cacheline as the contention
point (scales far better than cmpxchg and far far better than
cmpxchg16b). There's a bit of contention for the ringbuffer[].oldpos being set
and read, but it's only by two backends, not all of them.

The nice part is this scheme leaves you with a ringbuffer that's ordered by
the insertion-lsn. Which allows to make WaitXLogInsertionsToFinish() far more
efficient and to get rid of NUM_XLOGINSERT_LOCKS (by removing WAL insertion
locks). Right now NUM_XLOGINSERT_LOCKS is a major scalability limit - but at
the same time increasing it makes the contention on the spinlock *much* worse,
leading to slowdowns in other workloads.

The above description leaves out a bunch of complexity, of course - we
e.g. need to make sure that ringbuffer positions can't be reused too eagerly
and we need to to update the base pointer.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Michael Banck
Date:
Subject: Re: Parallel workers stats in pg_stat_database
Next
From: Guillaume Lelarge
Date:
Subject: Re: Proposals for EXPLAIN: rename ANALYZE to EXECUTE and extend VERBOSE