Re: Improving connection scalability: GetSnapshotData() - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: Improving connection scalability: GetSnapshotData()
Msg-id: CA+hUKG+M4+p0Ciei1UZb4PJBn=utYJEnBZZb+4xBQw3D_e2A3Q@mail.gmail.com
In response to: Improving connection scalability: GetSnapshotData() (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Sun, Mar 1, 2020 at 9:36 PM Andres Freund <andres@anarazel.de> wrote:
> conns   tps master        tps pgxact-split
>
> 1       26842.492845      26524.194821
> 10      246923.158682     249224.782661
> 50      695956.539704     709833.746374
> 100     1054727.043139    1903616.306028
> 200     964795.282957     1949200.338012
> 300     906029.377539     1927881.231478
> 400     845696.690912     1911065.369776
> 500     812295.222497     1926237.255856
> 600     888030.104213     1903047.236273
> 700     866896.532490     1886537.202142
> 800     863407.341506     1883768.592610
> 900     871386.608563     1874638.012128
> 1000    887668.277133     1876402.391502
> 1500    860051.361395     1815103.564241
> 2000    890900.098657     1775435.271018
> 3000    874184.980039     1653953.817997
> 4000    845023.080703     1582582.316043
> 5000    817100.195728     1512260.802371

This will clearly be really big news for lots of PostgreSQL users.

> One further cool recognition of the fact that GetSnapshotData()'s
> results can be made to only depend on the set of xids in progress, is
> that caching the results of GetSnapshotData() is almost trivial at that
> point: We only need to recompute snapshots when a toplevel transaction
> commits/aborts.
>
> So we can avoid rebuilding snapshots when no commit has happened since it
> was last built. Which amounts to assigning a current 'commit sequence
> number' to the snapshot, and checking that against the current number
> at the time of the next GetSnapshotData() call. Well, turns out there's
> this "LSN" thing we assign to commits (there are some small issues with
> that though). I've experimented with that, and it considerably further
> improves the numbers above. Both with a higher peak throughput, but more
> importantly it almost entirely removes the throughput regression from
> 2000 connections onwards.
>
> I'm still working on cleaning that part of the patch up, I'll post it in
> a bit.

I looked at that part on your public pgxact-split branch.
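As an aside, the invalidation scheme quoted above can be sketched in a few
lines. This is a toy model, not the patch itself; the names
(completed_xact_count, Snapshot, get_snapshot_data) are illustrative, and
all the shared-memory locking/barrier questions discussed below are
ignored here:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the snapshot cache: snapshots depend only on the set of
 * running xids, so a built snapshot stays valid until some top-level
 * transaction completes.  A shared counter is bumped on every commit or
 * abort; the snapshot is rebuilt only when the counter it was stamped
 * with no longer matches.  (Illustrative names, single-process model.)
 */

static uint64_t completed_xact_count = 1;   /* 0 reserved as "not cached" */
static int rebuild_count = 0;               /* to observe cache behaviour */

typedef struct Snapshot
{
    uint64_t csn;    /* completed_xact_count at build time; 0 = invalid */
    int      xmin;   /* stand-in for the real snapshot contents */
} Snapshot;

/* Called whenever a top-level transaction commits or aborts. */
static void
transaction_completed(void)
{
    completed_xact_count++;
}

/* Rebuild only if some transaction completed since the snapshot was built. */
static void
get_snapshot_data(Snapshot *snap)
{
    if (snap->csn != 0 && snap->csn == completed_xact_count)
        return;                  /* cache hit: nothing changed */

    rebuild_count++;             /* cache miss: recompute */
    snap->xmin = 42;             /* placeholder for the real computation */
    snap->csn = completed_xact_count;
}
```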
In that version you used "CSN" rather than something based on LSNs, which
I assume avoids complications relating to WAL locking or something like
that. We should probably be careful to avoid confusion with the
pre-existing use of the term "commit sequence number" (CommitSeqNo, CSN)
that appears in predicate.c. This also calls to mind the 2013-2016 work
by Ants Aasma and others[1] on CSN-based snapshots, which is obviously a
much more radical change, but really means what it says (commits). The
CSN in your patch set is used purely as a level-change for snapshot cache
invalidation IIUC, and it advances also for aborts -- so maybe it should
be called something like completed_xact_count, using existing terminology
from procarray.c.

+	if (snapshot->csn != 0 && MyProc->xidCopy == InvalidTransactionId &&
+		UINT64_ACCESS_ONCE(ShmemVariableCache->csn) == snapshot->csn)

Why is it OK to read ShmemVariableCache->csn without at least a read
barrier? I suppose this allows a cached snapshot to be used very soon
after a transaction commits and should be visible to you, but ...
hmmmrkwjherkjhg... I guess it must be really hard to observe any anomaly.
Let's see... maybe it's possible on a relaxed memory system like POWER or
ARM, if you use a shm flag to say "hey I just committed a transaction",
and the other guy sees the flag but can't yet see the new CSN, so an SPI
query can't see the transaction? Another theoretical problem is the
non-atomic read of a uint64 on some 32 bit platforms.

> 0007: WIP: Introduce abstraction layer for "is tuple invisible" tests.
>
>       This is the most crucial piece. Instead of code directly using
>       RecentOldestXmin, there's a new set of functions for testing
>       whether an xid is visible (InvisibleToEveryoneTestXid() et al).
>
>       Those functions use new horizon boundaries computed as part of
>       GetSnapshotData(), and recompute an accurate boundary when the
>       tested xid falls in between.
> There's a bit more infrastructure needed - we need to limit how
> often an accurate snapshot is computed. Probably to once per
> snapshot? Or once per transaction?
>
> To avoid issues with the lower boundary getting too old and
> presenting a wraparound danger, I made all the xids be
> FullTransactionIds. That imo is a good thing?

+1, as long as we don't just move the wraparound danger to the places
where we convert xids to fxids!

+/*
+ * Be very careful about when to use this function. It can only safely be used
+ * when there is a guarantee that, at the time of the call, xid is within 2
+ * billion xids of rel. That e.g. can be guaranteed if the caller assures
+ * a snapshot is held by the backend, and xid is from a table (where
+ * vacuum/freezing ensures the xid has to be within that range).
+ */
+static inline FullTransactionId
+FullXidViaRelative(FullTransactionId rel, TransactionId xid)
+{
+	uint32		rel_epoch = EpochFromFullTransactionId(rel);
+	TransactionId rel_xid = XidFromFullTransactionId(rel);
+	uint32		epoch;
+
+	/*
+	 * TODO: A function to easily write an assertion ensuring that xid is
+	 * between [oldestXid, nextFullXid) would be useful here, and in plenty
+	 * of other places.
+	 */
+
+	if (xid > rel_xid)
+		epoch = rel_epoch - 1;
+	else
+		epoch = rel_epoch;
+
+	return FullTransactionIdFromEpochAndXid(epoch, xid);
+}

I hate it, but I don't have a better concrete suggestion right now.
Whatever I come up with amounts to the same thing on some level, though I
feel like it might be better to use an infrequently updated oldestFxid as
the lower bound in a conversion. An upper bound would also seem better,
though it requires much trickier interlocking. What you have means "it's
near here!"... isn't that too prone to bugs that are hidden because of
the ambient fuzziness?
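To make the epoch-inference rule concrete, here is a standalone model of
the conversion, with plain integers standing in for PostgreSQL's
FullTransactionId machinery (names and types here are illustrative; the
real code also excludes special xids below FirstNormalTransactionId).
It only behaves correctly under the quoted precondition that xid is
within 2^31 xids of rel:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the FullXidViaRelative() logic quoted above.  A full xid is
 * (epoch << 32) | xid.  Given a reference full xid "rel" and a bare
 * 32-bit xid known to lie no more than ~2 billion xids behind rel, infer
 * the bare xid's epoch: if it is numerically greater than rel's low 32
 * bits, it must have been assigned in the previous epoch.
 */
static uint64_t
full_xid_via_relative(uint64_t rel, uint32_t xid)
{
    uint32_t rel_epoch = (uint32_t) (rel >> 32);
    uint32_t rel_xid = (uint32_t) rel;
    uint32_t epoch;

    if (xid > rel_xid)
        epoch = rel_epoch - 1;   /* xid wrapped: it is from the prior epoch */
    else
        epoch = rel_epoch;

    return ((uint64_t) epoch << 32) | xid;
}
```

Note how silently wrong the result becomes if the precondition is
violated, which is the source of the unease expressed above: nothing in
the function itself can detect a caller that passes an xid more than 2
billion behind rel.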
A lower bound seems like it could move extremely infrequently and
therefore it'd be OK for it to be protected by both the proc array and
xid-gen locks (ie it'd be recomputed when nextFxid needs to move too far
ahead of it, so every ~2 billion xacts). I haven't looked at this long
enough to have a strong opinion, though.

On a more constructive note: GetOldestXminInt() does:

	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	nextfxid = ShmemVariableCache->nextFullXid;
+
	...
	LWLockRelease(ProcArrayLock);
	...
+	return FullXidViaRelative(nextfxid, result);

But nextFullXid is protected by XidGenLock; maybe that's OK from a data
freshness point of view (I'm not sure), but from an atomicity point of
view, you can't do that, can you?

> This patch currently breaks old_snapshot_threshold, as I've not
> yet integrated it with the new functions. I think we can make the
> old snapshot stuff a lot more precise with this - instead of
> always triggering conflicts when a RecentGlobalXmin is too old, we
> can do so only in the cases where we actually remove a row. I ran
> out of energy threading that through heap_page_prune and
> HeapTupleSatisfiesVacuum.

CCing Kevin as an FYI.

> 0008: Move PGXACT->xmin back to PGPROC.
>
>       Now that GetSnapshotData() doesn't access xmin anymore, we can
>       make it a normal field in PGPROC again.
>
> 0009: Improve GetSnapshotData() performance by avoiding indirection for xid access.
> 0010: Improve GetSnapshotData() performance by avoiding indirection for vacuumFlags
> 0011: Improve GetSnapshotData() performance by avoiding indirection for nsubxids access
>
>       These successively move the remaining PGXACT fields into separate
>       arrays in ProcGlobal, and adjust GetSnapshotData() to take
>       advantage. Those arrays are dense in the sense that they only
>       contain data for PGPROCs that are in use (i.e. when disconnecting,
>       the array is moved around).
>
>       I think xid, and vacuumFlags are pretty reasonable.
> But they need cleanup, obviously:
>       - The biggest cleanup would be to add a few helper functions for
>         accessing the values, rather than open coding that.
>       - Perhaps we should call the ProcGlobal ones 'cached', and name
>         the PGPROC ones as the one true source of truth?
>
>       For subxid I thought it'd be nice to have nxids and overflow be
>       only one number. But that probably was the wrong call? Now
>       TransactionIdInProgress() cannot look at the subxids that did
>       fit in PGPROC.subxid. I'm not sure that's important, given the
>       likelihood of misses? But I'd probably still have the subxid
>       array be one of {uint8 nsubxids; bool overflowed} instead.
>
>       To keep the arrays dense they copy the logic for pgprocnos. Which
>       means that ProcArrayAdd/Remove move things around. Unfortunately
>       that requires holding both ProcArrayLock and XidGenLock currently
>       (to avoid GetNewTransactionId() having to hold ProcArrayLock). But
>       that doesn't seem too bad?

In the places where you now acquire both, I guess you also need to
release both in the error path?

[1] https://www.postgresql.org/message-id/flat/CA%2BCSw_tEpJ%3Dmd1zgxPkjH6CWDnTDft4gBi%3D%2BP9SnoC%2BWy3pKdA%40mail.gmail.com
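P.S. For readers following along, the dense-array bookkeeping described in
the quoted 0009-0011 text (keeping only in-use PGPROCs in the first
"count" slots, moving entries on disconnect, as the pgprocnos logic does)
amounts to a swap-with-last removal. A minimal standalone sketch, with
purely illustrative names:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of a dense parallel array kept compact the way ProcGlobal's
 * per-field arrays are: on removal, the last live entry is moved into
 * the vacated slot, so the first "count" slots always hold data for
 * in-use entries only and a scan never visits dead slots.
 */

#define MAX_BACKENDS 8

static uint32_t xids[MAX_BACKENDS];    /* dense payload array */
static int      owner[MAX_BACKENDS];   /* which backend occupies each slot */
static int      slot_of[MAX_BACKENDS]; /* backend id -> slot in dense array */
static int      count = 0;             /* number of live slots */

/* Add a backend's entry at the end of the dense region. */
static void
dense_add(int backend, uint32_t xid)
{
    xids[count] = xid;
    owner[count] = backend;
    slot_of[backend] = count;
    count++;
}

/* Remove a backend's entry, compacting by moving the last entry down. */
static void
dense_remove(int backend)
{
    int slot = slot_of[backend];
    int last = count - 1;

    if (slot != last)
    {
        xids[slot] = xids[last];     /* move last entry into the hole */
        owner[slot] = owner[last];
        slot_of[owner[last]] = slot; /* fix the moved entry's back-pointer */
    }
    count--;
}
```

The move is what makes concurrent readers tricky, hence the question
above about holding both locks (and releasing both on error) while
ProcArrayAdd/Remove shuffle entries.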