Improving connection scalability: GetSnapshotData()

Hi,

I think postgres' issues with scaling to larger numbers of connections
are a serious problem in the field. While poolers can address some of
that, given the issues around prepared statements, transaction state,
etc., I don't think they are sufficient in many cases. They also add
latency.

Nor do I think the argument that one shouldn't have more than a few
dozen connections holds much water. As clients have think time, and
database results have to be sent/received (most clients don't use
pipelining), and as many applications have many application servers
with individual connection pools, it's very common to need more
connections than postgres can easily deal with.


The largest reason for that is GetSnapshotData(). It scales poorly to
larger connection counts. Part of that is obviously its O(connections)
nature, but I always thought there had to be more to it. I've seen
production workloads spending > 98% of their CPU time in
GetSnapshotData().


After a lot of analysis and experimentation I figured out that the
primary reason for this is PGXACT->xmin. Even the simplest transaction
modifies MyPgXact->xmin several times during its lifetime (IIRC twice
(snapshot & release) for exec_bind_message(), the same for
exec_execute_message(), then again as part of EOXact processing). Which
means that a backend doing GetSnapshotData() on a system with a number
of other connections active is very likely to hit PGXACT cachelines
that are owned by another CPU / set of CPUs / socket. The larger the
system is, the worse the consequences of this are.

This problem is most prominent (and harder to fix) for xmin, but it
also exists for the other fields in PGXACT. We rarely have xid, nxids,
overflow, or vacuumFlags set to anything interesting, yet we constantly
write to them, leading to cross-node traffic.

The second biggest problem is the indirection through pgprocnos that
GetSnapshotData() has to go through to get each backend's xmin, which
is very unfriendly for a pipelined CPU (i.e. every CPU postgres runs
on). There's basically a stall at the end of every loop iteration -
which is exacerbated by there being so many cache misses.
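
To make that concrete, here's a rough sketch of the loop shape on
master (simplified, not the actual code): the xid load cannot start
until the preceding pgprocno load and the dereference into the sparse
PGXACT array have completed, and the latter usually misses cache.

    /*
     * Simplified sketch of the current GetSnapshotData() loop shape;
     * arrayP and allPgXact are the existing procarray.c variables.
     */
    for (int index = 0; index < arrayP->numProcs; index++)
    {
        int     pgprocno = arrayP->pgprocnos[index];
        PGXACT *pgxact = &allPgXact[pgprocno];  /* dependent load into a sparse array */
        TransactionId xid = pgxact->xid;        /* usually a cache miss */

        /* ... check xid / xmin, copy into the snapshot ... */
    }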


It's fairly easy to avoid unnecessarily dirtying cachelines for all the
PGXACT fields except xmin, because xmin actually needs to be visible to
other backends.


While it sounds almost trivial in hindsight, it took me a long while to
grasp a solution to a big part of this problem: We don't actually need
to look at PGXACT->xmin to compute a snapshot. The only reason
GetSnapshotData() does so is that it also computes
RecentGlobal[Data]Xmin.

But we don't actually need them all that frequently. They're primarily
used as horizons for heap_page_prune_opt() etc. For one, while pruning
is really important, it doesn't happen *all* the time. More importantly,
a RecentGlobalXmin from an earlier transaction is actually sufficient
for most pruning requests, especially when there is a larger percentage
of reading than updating transactions (very common).

By having GetSnapshotData() compute an accurate upper bound after which
we are certain not to be able to prune (basically the transaction's
xmin, slot horizons, etc.), and a conservative lower bound below which
we are definitely able to prune, we can allow some pruning actions to
happen. If a pruning request (or something similar) encounters an xid
between those, an accurate lower bound can be computed.

That allows us to avoid looking at PGXACT->xmin.
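
As a rough sketch of such a test, with invented names (the actual patch
provides InvisibleToEveryoneTestXid() and friends, see 0007 below), the
logic is roughly:

    /* sketch only: boundaries computed cheaply during GetSnapshotData() */
    static FullTransactionId definitely_prunable_horizon; /* conservative lower bound */
    static FullTransactionId maybe_prunable_horizon;      /* accurate upper bound */

    /* hypothetical helper recomputing an accurate boundary on demand */
    extern FullTransactionId ComputeAccurateHorizon(void);

    static bool
    XidIsInvisibleToEveryone(FullTransactionId fxid)
    {
        if (FullTransactionIdPrecedes(fxid, definitely_prunable_horizon))
            return true;        /* certainly prunable */

        if (FullTransactionIdFollowsOrEquals(fxid, maybe_prunable_horizon))
            return false;       /* certainly not prunable */

        /* in between: compute an accurate boundary, then retest */
        definitely_prunable_horizon = ComputeAccurateHorizon();
        return FullTransactionIdPrecedes(fxid, definitely_prunable_horizon);
    }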


To address the second big problem (the indirection), we can instead pack
the contents of PGXACT tightly, just like we do for pgprocnos. In the
attached series, I introduced separate arrays for xids, vacuumFlags,
nsubxids.

The reason for splitting them is that they change at different rates
and have different sizes.  In a read-mostly workload, most backends are
not going to have an xid, making the xids array almost constant. As
long as all xids are unassigned, GetSnapshotData() doesn't need to look
at anything else, so it makes sense to check the xid first.
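
A sketch of what the resulting loop could look like (illustrative
only; the field names are invented here):

    /* dense xid array in ProcGlobal, indexed the same way as pgprocnos */
    TransactionId *xids = ProcGlobal->xids;
    int            numProcs = arrayP->numProcs;

    for (int i = 0; i < numProcs; i++)
    {
        TransactionId xid = xids[i];        /* sequential, cache-friendly access */

        /* in read-mostly workloads most entries are invalid, skip cheaply */
        if (!TransactionIdIsNormal(xid))
            continue;

        /* only now look at the per-backend subxid / vacuumFlags arrays */
        /* ... */
    }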


Here are some numbers for the submitted patch series. I had to cull
some further improvements to keep it manageable, but I think the
numbers still are quite convincing.

The workload is a read-only pgbench run, with pgbench -M prepared -c
$conns -j $conns -S -n for each client count.  This is on a machine
with two Intel(R) Xeon(R) Platinum 8168 CPUs, but virtualized.

conns   tps master        tps pgxact-split

1       26842.492845            26524.194821
10      246923.158682           249224.782661
50      695956.539704           709833.746374
100     1054727.043139          1903616.306028
200     964795.282957           1949200.338012
300     906029.377539           1927881.231478
400     845696.690912           1911065.369776
500     812295.222497           1926237.255856
600     888030.104213           1903047.236273
700     866896.532490           1886537.202142
800     863407.341506           1883768.592610
900     871386.608563           1874638.012128
1000    887668.277133           1876402.391502
1500    860051.361395           1815103.564241
2000    890900.098657           1775435.271018
3000    874184.980039           1653953.817997
4000    845023.080703           1582582.316043
5000    817100.195728           1512260.802371

I think these are pretty nice results.


Note that the patchset currently does not implement snapshot_too_old,
but the rest of the regression tests do pass.


One further cool consequence of the fact that GetSnapshotData()'s
results can be made to depend only on the set of xids in progress is
that caching the results of GetSnapshotData() becomes almost trivial at
that point: we only need to recompute snapshots when a toplevel
transaction commits/aborts.

So we can avoid rebuilding a snapshot when no commit has happened since
it was last built. That amounts to assigning a current 'commit sequence
number' to the snapshot, and checking it against the current number at
the time of the next GetSnapshotData() call. Well, it turns out there's
this "LSN" thing we assign to commits (there are some small issues with
that though).  I've experimented with that, and it considerably
improves the numbers above further. The peak throughput is higher, but
more importantly it almost entirely removes the throughput regression
from 2000 connections onwards.
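
To illustrate the idea (all names invented here, and ignoring locking
and invalidation details), such a cache could look roughly like this:

    /* hypothetical shared counter, bumped at every toplevel commit/abort */
    extern pg_atomic_uint64 *xactCompletionCounter;

    static uint64   last_completion_count = 0;
    static bool     have_cached_snapshot = false;

    Snapshot
    GetSnapshotDataCached(Snapshot snapshot)
    {
        uint64  cur = pg_atomic_read_u64(xactCompletionCounter);

        /*
         * Nothing committed/aborted since the snapshot was last built;
         * reuse its contents (assumes the caller passes the same
         * statically allocated snapshot every time).
         */
        if (have_cached_snapshot && cur == last_completion_count)
            return snapshot;

        snapshot = GetSnapshotData(snapshot);
        last_completion_count = cur;
        have_cached_snapshot = true;
        return snapshot;
    }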

I'm still working on cleaning that part of the patch up, I'll post it in
a bit.


The series currently consists of:

0001-0005: Fixes and assert improvements that are independent of the patch, but
           are hit by the new code (see also separate thread).

0006: Move delayChkpt from PGXACT to PGPROC; it's rarely checked & frequently modified

0007: WIP: Introduce abstraction layer for "is tuple invisible" tests.

      This is the most crucial piece. Instead of code directly using
      RecentGlobalXmin, there's a new set of functions for testing
      whether an xid is visible (InvisibleToEveryoneTestXid() et al).

      Those functions use new horizon boundaries computed as part of
      GetSnapshotData(), and recompute an accurate boundary when the
      tested xid falls in between.

      There's a bit more infrastructure needed - we need to limit how
      often an accurate snapshot is computed. Probably to once per
      snapshot? Or once per transaction?


      To avoid issues with the lower boundary getting too old and
      presenting a wraparound danger, I made all the xids be
      FullTransactionIds. That imo is a good thing?


      This patch currently breaks old_snapshot_threshold, as I've not
      yet integrated it with the new functions. I think we can make the
      old snapshot stuff a lot more precise with this - instead of
      always triggering conflicts when a RecentGlobalXmin is too old, we
      can do so only in the cases where we actually remove a row. I ran
      out of energy threading that through heap_page_prune and
      HeapTupleSatisfiesVacuum.


0008: Move PGXACT->xmin back to PGPROC.

      Now that GetSnapshotData() doesn't access xmin anymore, we can
      make it a normal field in PGPROC again.


0009: Improve GetSnapshotData() performance by avoiding indirection for xid access.
0010: Improve GetSnapshotData() performance by avoiding indirection for vacuumFlags
0011: Improve GetSnapshotData() performance by avoiding indirection for nsubxids access

      These successively move the remaining PGXACT fields into separate
      arrays in ProcGlobal, and adjust GetSnapshotData() to take
      advantage.  Those arrays are dense in the sense that they only
      contain data for PGPROCs that are in use (i.e. when a backend
      disconnects, the array contents are moved around).

      I think xid and vacuumFlags are pretty reasonable, but they need
      cleanup, obviously:
      - The biggest cleanup would be to add a few helper functions for
        accessing the values, rather than open-coding that.
      - Perhaps we should call the ProcGlobal ones 'cached', and treat
        the PGPROC ones as the one true source of truth?

      For subxid I thought it'd be nice to have nxids and overflow be
      only one number. But that probably was the wrong call? Now
      TransactionIdIsInProgress() cannot look at the subxids that did
      fit in PGPROC.subxid. I'm not sure that's important, given the
      likelihood of misses?  But I'd probably still have each subxid
      array entry be a {uint8 nsubxids; bool overflowed} pair instead
      (see the sketch at the end of this item).


      To keep the arrays dense, they copy the logic for pgprocnos,
      which means that ProcArrayAdd/Remove move things around.
      Unfortunately that currently requires holding both ProcArrayLock
      and XidGenLock (to avoid GetNewTransactionId() having to hold
      ProcArrayLock). But that doesn't seem too bad?
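
      For illustration, such a per-backend subxid entry could look like
      this (a hypothetical sketch, not what the series currently does):

          typedef struct
          {
              uint8   nsubxids;    /* number of subxids cached in PGPROC */
              bool    overflowed;  /* did the subxid cache overflow? */
          } PGSubxidStatus;        /* name invented for this sketch */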


0012: Remove now unused PGXACT.

      There's no reason to have it anymore.

The patch series is also available at
https://github.com/anarazel/postgres/tree/pgxact-split

Greetings,

Andres Freund
