Improving connection scalability: GetSnapshotData() - Mailing list pgsql-hackers
From: Andres Freund
Subject: Improving connection scalability: GetSnapshotData()
Msg-id: 20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
Responses: Re: Improving connection scalability: GetSnapshotData()
List: pgsql-hackers
Hi,

I think postgres' issues with scaling to larger numbers of connections are a serious problem in the field. While poolers can address some of that, given the issues around prepared statements, transaction state, etc, I don't think that's sufficient in many cases. It also adds latency. Nor do I think the argument that one shouldn't have more than a few dozen connections holds particularly much water. As clients have think time, and database results have to be sent/received (most clients don't use pipelining), and as many applications have many application servers with individual connection pools, it's very common to need more connections than postgres can easily deal with.

The largest reason for that is GetSnapshotData(). It scales poorly to larger connection counts. Part of that is obviously its O(connections) nature, but I always thought there had to be more to it. I've seen production workloads spending > 98% of the cpu time in GetSnapshotData().

After a lot of analysis and experimentation I figured out that the primary reason for this is PGXACT->xmin. Even the simplest transaction modifies MyPgXact->xmin several times during its lifetime (IIRC twice (snapshot & release) for exec_bind_message(), the same for exec_exec_message(), then again as part of EOXact processing). Which means that a backend doing GetSnapshotData() on a system with a number of other connections active is very likely to hit PGXACT cachelines that are owned by another cpu / set of cpus / socket. The larger the system is, the worse the consequences of this are.

This problem is most prominent (and harder to fix) for xmin, but it also exists for the other fields in PGXACT. We rarely have xid, nxids, overflow, or vacuumFlags set, yet constantly set them, leading to cross-node traffic.

The second biggest problem is that the indirection through pgprocnos that GetSnapshotData() has to go through to get each backend's xmin is very unfriendly for a pipelined CPU (i.e. everything postgres runs on). There's basically a stall at the end of every loop iteration - which is exacerbated by there being so many cache misses.

It's fairly easy to avoid unnecessarily dirtying cachelines for all the PGXACT fields except xmin, because xmin actually needs to be visible to other backends.

While it sounds almost trivial in hindsight, it took me a long while to grasp a solution to a big part of this problem: we don't actually need to look at PGXACT->xmin to compute a snapshot. The only reason GetSnapshotData() does so is that it also computes RecentGlobal[Data]Xmin. But we don't actually need those all that frequently. They're primarily used as horizons for heap_page_prune_opt() etc. For one, while pruning is really important, it doesn't happen *all* the time. More importantly, a RecentGlobalXmin from an earlier transaction is actually sufficient for most pruning requests, especially when there is a larger percentage of reading than updating transactions (very common).

By having GetSnapshotData() compute an accurate upper bound after which we are certain not to be able to prune (basically the transaction's xmin, slot horizons, etc), and a conservative lower bound below which we are definitely able to prune, we can allow some pruning actions to happen. If a pruning request (or something similar) encounters an xid between those bounds, an accurate lower bound can be computed on demand. That allows us to avoid looking at PGXACT->xmin.
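Roughly sketched, the test ends up looking something like this (purely illustrative - the type, variable and function names below are made up for this mail, not what the patches actually use; TransactionIdPrecedes() is the existing transam.h helper):

/*
 * Illustration only.  The two horizons are maintained cheaply as a side
 * effect of GetSnapshotData(), without reading any other backend's xmin.
 */
typedef struct PruneHorizons
{
    /* conservative lower bound: xids preceding this can always be pruned */
    TransactionId definitely_removable;
    /* accurate upper bound: xids at/after this can never be pruned */
    TransactionId maybe_needed;
} PruneHorizons;

static PruneHorizons horizons;

static bool
XidIsRemovable(TransactionId xid)
{
    if (TransactionIdPrecedes(xid, horizons.definitely_removable))
        return true;
    if (!TransactionIdPrecedes(xid, horizons.maybe_needed))
        return false;

    /*
     * Only xids between the two bounds need accurate information.
     * RecomputeRemovalHorizon() stands in for the expensive proc array
     * scan we want to avoid in the common case; if it declines to refresh,
     * the stale bound simply gives a conservative "not removable" answer.
     */
    RecomputeRemovalHorizon();
    return TransactionIdPrecedes(xid, horizons.definitely_removable);
}

The accurate computation is only needed for xids that fall between the two bounds, which should be rare in read-mostly workloads.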
To address the second big problem (the indirection), we can instead pack the contents of PGXACT tightly, just like we do for pgprocnos. In the attached series I introduced separate arrays for xids, vacuumFlags, nsubxids. The reason for splitting them is that they change at different rates and have different sizes. In a read-mostly workload, most backends are not going to have an xid assigned, which makes the xids array almost constant. As long as all xids are unassigned, GetSnapshotData() doesn't need to look at anything else, so it makes sense to check the xid first.

Here are some numbers for the submitted patch series. I had to cull some further improvements to make it more manageable, but I think the numbers still are quite convincing. The workload is a read-only pgbench, run as pgbench -M prepared -c $conns -j $conns -S -n for each client count. This is on a machine with 2 Intel(R) Xeon(R) Platinum 8168, but virtualized.

conns    tps master          tps pgxact-split
    1      26842.492845        26524.194821
   10     246923.158682       249224.782661
   50     695956.539704       709833.746374
  100    1054727.043139      1903616.306028
  200     964795.282957      1949200.338012
  300     906029.377539      1927881.231478
  400     845696.690912      1911065.369776
  500     812295.222497      1926237.255856
  600     888030.104213      1903047.236273
  700     866896.532490      1886537.202142
  800     863407.341506      1883768.592610
  900     871386.608563      1874638.012128
 1000     887668.277133      1876402.391502
 1500     860051.361395      1815103.564241
 2000     890900.098657      1775435.271018
 3000     874184.980039      1653953.817997
 4000     845023.080703      1582582.316043
 5000     817100.195728      1512260.802371

I think these are pretty nice results. Note that the patchset currently does not implement snapshot_too_old; the rest of the regression tests do pass.

One further cool consequence of the fact that GetSnapshotData()'s results can be made to depend only on the set of xids in progress is that caching the results of GetSnapshotData() becomes almost trivial at that point: we only need to recompute snapshots when a toplevel transaction commits/aborts. So we can avoid rebuilding a snapshot when no commit has happened since it was last built. Which amounts to assigning a current 'commit sequence number' to the snapshot, and checking that against the current number at the time of the next GetSnapshotData() call. Well, it turns out there's this "LSN" thing we assign to commits (there are some small issues with that though). I've experimented with that, and it considerably further improves the numbers above, both with a higher peak throughput and, more importantly, by almost entirely removing the throughput regression from 2000 connections onwards. I'm still working on cleaning that part of the patch up; I'll post it in a bit.

The series currently consists of:

0001-0005: Fixes and assert improvements that are independent of the patch, but are hit by the new code (see also separate thread).

0006: Move delayChkpt from PGXACT to PGPROC; it's rarely checked & frequently modified.

0007: WIP: Introduce abstraction layer for "is tuple invisible" tests.

This is the most crucial piece. Instead of code directly using RecentOldestXmin, there's a new set of functions for testing whether an xid is visible (InvisibleToEveryoneTestXid() et al). Those functions use new horizon boundaries computed as part of GetSnapshotData(), and recompute an accurate boundary when the tested xid falls in between. There's a bit more infrastructure needed - we need to limit how often an accurate snapshot is computed. Probably to once per snapshot? Or once per transaction?
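As a purely illustrative continuation of the sketch earlier in this mail, the RecomputeRemovalHorizon() placeholder could implement that limiting roughly like so (invented names again; CurrentSnapshotId() stands for whatever per-snapshot or per-transaction counter we settle on, and none of this is the patch's actual InvisibleToEveryoneTestXid() machinery):

static uint64 horizons_refreshed_for = 0;   /* counter value at the last accurate refresh */

static void
RecomputeRemovalHorizon(void)
{
    uint64 cur = CurrentSnapshotId();       /* placeholder: advances once per snapshot (or per xact) */

    if (horizons_refreshed_for == cur)
        return;                             /* already refreshed; keep the cached horizon */

    /* the expensive part: compute an accurate removal horizon */
    horizons.definitely_removable = ComputeAccurateRemovalHorizon();
    horizons_refreshed_for = cur;
}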
To avoid issues with the lower boundary getting too old and presenting a wraparound danger, I made all the xids be FullTransactionIds. That imo is a good thing?

This patch currently breaks old_snapshot_threshold, as I've not yet integrated it with the new functions. I think we can make the old snapshot stuff a lot more precise with this - instead of always triggering conflicts when a RecentGlobalXmin is too old, we can do so only in the cases where we actually remove a row. I ran out of energy threading that through heap_page_prune and HeapTupleSatisfiesVacuum.

0008: Move PGXACT->xmin back to PGPROC.

Now that GetSnapshotData() doesn't access xmin anymore, we can make it a normal field in PGPROC again.

0009: Improve GetSnapshotData() performance by avoiding indirection for xid access.

0010: Improve GetSnapshotData() performance by avoiding indirection for vacuumFlags.

0011: Improve GetSnapshotData() performance by avoiding indirection for nsubxids access.

These successively move the remaining PGXACT fields into separate arrays in ProcGlobal, and adjust GetSnapshotData() to take advantage. Those arrays are dense in the sense that they only contain data for PGPROCs that are in use (i.e. when disconnecting, the array is moved around).

I think xid and vacuumFlags are pretty reasonable, but they need cleanup, obviously:

- The biggest cleanup would be to add a few helper functions for accessing the values, rather than open coding that.
- Perhaps we should call the ProcGlobal ones 'cached', and treat the PGPROC ones as the one true source of truth?

For subxid I thought it'd be nice to have nxids and overflow be only one number. But that probably was the wrong call? Now TransactionIdInProgress() cannot look at the subxids that did fit in PGPROC.subxid. I'm not sure that's important, given the likelihood of misses? But I'd probably still have the subxid array entries be {uint8 nsubxids; bool overflowed} instead.

To keep the arrays dense they copy the logic for pgprocnos, which means that ProcArrayAdd/Remove move things around. Unfortunately that currently requires holding both ProcArrayLock and XidGenLock (to avoid GetNewTransactionId() having to hold ProcArrayLock). But that doesn't seem too bad?

0012: Remove now unused PGXACT.

There's no reason to have it anymore.

The patch series is also available at https://github.com/anarazel/postgres/tree/pgxact-split

Greetings,

Andres Freund
Attachment
- v1-0001-WIP-Ensure-snapshot-is-registered-within-ScanPgRe.patch.gz
- v1-0002-TMP-work-around-missing-snapshot-registrations.patch.gz
- v1-0003-TMP-don-t-build-snapshot_too_old-module-it-s-curr.patch.gz
- v1-0004-Improve-and-extend-asserts-for-a-snapshot-being-s.patch.gz
- v1-0005-Fix-unlikely-xid-wraparound-issue-in-heap_abort_s.patch.gz
- v1-0006-Move-delayChkpt-from-PGXACT-to-PGPROC-it-s-rarely.patch.gz
- v1-0007-WIP-Introduce-abstraction-layer-for-is-tuple-invi.patch.gz
- v1-0008-Move-PGXACT-xmin-back-to-PGPROC.patch.gz
- v1-0009-Improve-GetSnapshotData-performance-by-avoiding-i.patch.gz
- v1-0010-Improve-GetSnapshotData-performance-by-avoiding-i.patch.gz
- v1-0011-Improve-GetSnapshotData-performance-by-avoiding-i.patch.gz
- v1-0012-Remove-now-unused-PGXACT.patch.gz