Re: Improving connection scalability: GetSnapshotData() - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Improving connection scalability: GetSnapshotData()
Date
Msg-id CA+hUKG+M4+p0Ciei1UZb4PJBn=utYJEnBZZb+4xBQw3D_e2A3Q@mail.gmail.com
Whole thread Raw
In response to Improving connection scalability: GetSnapshotData()  (Andres Freund <andres@anarazel.de>)
Responses Re: Improving connection scalability: GetSnapshotData()  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Sun, Mar 1, 2020 at 9:36 PM Andres Freund <andres@anarazel.de> wrote:
> conns   tps master              tps pgxact-split
>
> 1       26842.492845            26524.194821
> 10      246923.158682           249224.782661
> 50      695956.539704           709833.746374
> 100     1054727.043139          1903616.306028
> 200     964795.282957           1949200.338012
> 300     906029.377539           1927881.231478
> 400     845696.690912           1911065.369776
> 500     812295.222497           1926237.255856
> 600     888030.104213           1903047.236273
> 700     866896.532490           1886537.202142
> 800     863407.341506           1883768.592610
> 900     871386.608563           1874638.012128
> 1000    887668.277133           1876402.391502
> 1500    860051.361395           1815103.564241
> 2000    890900.098657           1775435.271018
> 3000    874184.980039           1653953.817997
> 4000    845023.080703           1582582.316043
> 5000    817100.195728           1512260.802371

This will clearly be really big news for lots of PostgreSQL users.

> One further cool recognition of the fact that GetSnapshotData()'s
> results can be made to only depend on the set of xids in progress, is
> that caching the results of GetSnapshotData() is almost trivial at that
> point: We only need to recompute snapshots when a toplevel transaction
> commits/aborts.
>
> So we can avoid rebuilding snapshots when no commt has happened since it
> was last built. Which amounts to assigning a current 'commit sequence
> number' to the snapshot, and checking that against the current number
> at the time of the next GetSnapshotData() call. Well, turns out there's
> this "LSN" thing we assign to commits (there are some small issues with
> that though).  I've experimented with that, and it considerably further
> improves the numbers above. Both with a higher peak throughput, but more
> importantly it almost entirely removes the throughput regression from
> 2000 connections onwards.
>
> I'm still working on cleaning that part of the patch up, I'll post it in
> a bit.

I looked at that part on your public pgxact-split branch.  In that
version you used "CSN" rather than something based on LSNs, which I
assume avoids complications relating to WAL locking or something like
that.  We should probably be careful to avoid confusion with the
pre-existing use of the term "commit sequence number" (CommitSeqNo,
CSN) that appears in predicate.c.  This also calls to mind the
2013-2016 work by Ants Aasma and others[1] on CSN-based snapshots,
which is obviously a much more radical change, but really means what
it says (commits).  The CSN in your patch set is used purely as a
level-change for snapshot cache invalidation IIUC, and it advances
also for aborts -- so maybe it should be called something like
completed_xact_count, using existing terminology from procarray.c.

+       if (snapshot->csn != 0 && MyProc->xidCopy == InvalidTransactionId &&
+               UINT64_ACCESS_ONCE(ShmemVariableCache->csn) == snapshot->csn)

Why is it OK to read ShmemVariableCache->csn without at least a read
barrier?  I suppose this allows a cached snapshot to be used very soon
after a transaction commits and should be visible to you, but ...
hmmmrkwjherkjhg... I guess it must be really hard to observe any
anomaly.  Let's see... maybe it's possible on a relaxed memory system
like POWER or ARM, if you use a shm flag to say "hey I just committed
a transaction", and the other guy sees the flag but can't yet see the
new CSN, so an SPI query can't see the transaction?

Another theoretical problem is the non-atomic read of a uint64 on some
32 bit platforms.

> 0007: WIP: Introduce abstraction layer for "is tuple invisible" tests.
>
>       This is the most crucial piece. Instead of code directly using
>       RecentOldestXmin, there's a new set of functions for testing
>       whether an xid is visible (InvisibleToEveryoneTestXid() et al).
>
>       Those function use new horizon boundaries computed as part of
>       GetSnapshotData(), and recompute an accurate boundary when the
>       tested xid falls inbetween.
>
>       There's a bit more infrastructure needed - we need to limit how
>       often an accurate snapshot is computed. Probably to once per
>       snapshot? Or once per transaction?
>
>
>       To avoid issues with the lower boundary getting too old and
>       presenting a wraparound danger, I made all the xids be
>       FullTransactionIds. That imo is a good thing?

+1, as long as we don't just move the wraparound danger to the places
where we convert xids to fxids!

+/*
+ * Be very careful about when to use this function. It can only safely be used
+ * when there is a guarantee that, at the time of the call, xid is within 2
+ * billion xids of rel. That e.g. can be guaranteed if the the caller assures
+ * a snapshot is held by the backend, and xid is from a table (where
+ * vacuum/freezing ensures the xid has to be within that range).
+ */
+static inline FullTransactionId
+FullXidViaRelative(FullTransactionId rel, TransactionId xid)
+{
+       uint32 rel_epoch = EpochFromFullTransactionId(rel);
+       TransactionId rel_xid = XidFromFullTransactionId(rel);
+       uint32 epoch;
+
+       /*
+        * TODO: A function to easily write an assertion ensuring that xid is
+        * between [oldestXid, nextFullXid) woudl be useful here, and in plenty
+        * other places.
+        */
+
+       if (xid > rel_xid)
+               epoch = rel_epoch - 1;
+       else
+               epoch = rel_epoch;
+
+       return FullTransactionIdFromEpochAndXid(epoch, xid);
+}

I hate it, but I don't have a better concrete suggestion right now.
Whatever I come up with amounts to the same thing on some level,
though I feel like it might be better to used an infrequently updated
oldestFxid as the lower bound in a conversion.  An upper bound would
also seem better, though requires much trickier interlocking.  What
you have means "it's near here!"... isn't that too prone to bugs that
are hidden because of the ambient fuzziness?  A lower bound seems like
it could move extremely infrequently and therefore it'd be OK for it
to be protected by both proc array and xid gen locks (ie it'd be
recomputed when nextFxid needs to move too far ahead of it, so every
~2 billion xacts).  I haven't looked at this long enough to have a
strong opinion, though.

On a more constructive note:

GetOldestXminInt() does:

        LWLockAcquire(ProcArrayLock, LW_SHARED);

+       nextfxid = ShmemVariableCache->nextFullXid;
+
...
        LWLockRelease(ProcArrayLock);
...
+       return FullXidViaRelative(nextfxid, result);

But nextFullXid is protected by XidGenLock; maybe that's OK from a
data freshness point of view (I'm not sure), but from an atomicity
point of view, you can't do that can you?

>       This patch currently breaks old_snapshot_threshold, as I've not
>       yet integrated it with the new functions. I think we can make the
>       old snapshot stuff a lot more precise with this - instead of
>       always triggering conflicts when a RecentGlobalXmin is too old, we
>       can do so only in the cases we actually remove a row. I ran out of
>       energy threading that through the heap_page_prune and
>       HeapTupleSatisfiesVacuum.

CCing Kevin as an FYI.

> 0008: Move PGXACT->xmin back to PGPROC.
>
>       Now that GetSnapshotData() doesn't access xmin anymore, we can
>       make it a normal field in PGPROC again.
>
>
> 0009: Improve GetSnapshotData() performance by avoiding indirection for xid access.
> 0010: Improve GetSnapshotData() performance by avoiding indirection for vacuumFlags
> 0011: Improve GetSnapshotData() performance by avoiding indirection for nsubxids access
>
>       These successively move the remaining PGXACT fields into separate
>       arrays in ProcGlobal, and adjust GetSnapshotData() to take
>       advantage.  Those arrays are dense in the sense that they only
>       contain data for PGPROCs that are in use (i.e. when disconnecting,
>       the array is moved around)..
>
>       I think xid, and vacuumFlags are pretty reasonable. But need
>       cleanup, obviously:
>       - The biggest cleanup would be to add a few helper functions for
>         accessing the values, rather than open coding that.
>       - Perhaps we should call the ProcGlobal ones 'cached', and name
>         the PGPROC ones as the one true source of truth?
>
>       For subxid I thought it'd be nice to have nxids and overflow be
>       only one number. But that probably was the wrong call? Now
>       TransactionIdInProgress() cannot look at at the subxids that did
>       fit in PGPROC.subxid. I'm not sure that's important, given the
>       likelihood of misses?  But I'd probably still have the subxid
>       array be one of {uint8 nsubxids; bool overflowed} instead.
>
>
>       To keep the arrays dense they copy the logic for pgprocnos. Which
>       means that ProcArrayAdd/Remove move things around. Unfortunately
>       that requires holding both ProcArrayLock and XidGenLock currently
>       (to avoid GetNewTransactionId() having to hold ProcArrayLock). But
>       that doesn't seem too bad?

In the places where you now acquire both, I guess you also need to
release both in the error path?

[1]
https://www.postgresql.org/message-id/flat/CA%2BCSw_tEpJ%3Dmd1zgxPkjH6CWDnTDft4gBi%3D%2BP9SnoC%2BWy3pKdA%40mail.gmail.com



pgsql-hackers by date:

Previous
From: Pengzhou Tang
Date:
Subject: Re: Memory-Bounded Hash Aggregation
Next
From: Paul A Jungwirth
Date:
Subject: Re: range_agg