Re: Deleting older versions in unique indexes to avoid page splits - Mailing list pgsql-hackers

From: Peter Geoghegan
Subject: Re: Deleting older versions in unique indexes to avoid page splits
Msg-id: CAH2-WzkGBct_PYzGkv4+qkOFu10WmnX3Q5Ba9FRQqt29onpSQw@mail.gmail.com
In response to: Re: Deleting older versions in unique indexes to avoid page splits (Simon Riggs <simon@2ndquadrant.com>)
List: pgsql-hackers

On Tue, Oct 27, 2020 at 2:44 AM Simon Riggs <simon@2ndquadrant.com> wrote:
> While it is important we investigate the worst cases, I don't see this
> is necessarily bad.

I looked at "perf top" a few times while yesterday's test was running. I
saw that the proposed delete mechanism was the top consumer of CPU
cycles, which made the mechanism seem very expensive. However,
that's definitely the wrong conclusion about what happens in the
general case, or even in slightly less extreme cases. It at least
needs to be put in context.

I reran exactly the same benchmark overnight, but added a 10k TPS rate
limit this time (so about a third of the TPS that's possible without a
limit). I also ran it for longer, and saw much improved latency. (More
on the latency aspect below; for now I want to talk about "perf top".)

The picture with "perf top" changed significantly with a 10k TPS rate
limit, even though the workload itself is very similar. Certainly the
new mechanism/function is still quite close to the top consumer of CPU
cycles. But it no longer uses more cycles than the familiar super hot
functions that you expect to see right at the top with pgbench (e.g.
_bt_compare(), hash_search_with_hash_value()). It's now something like
the 4th or 5th hottest function (which I think means that its cost in
cycles is more than an order of magnitude lower, though I didn't
check). Just adding this 10k TPS rate limit makes the number of CPU
cycles consumed by the new mechanism seem quite reasonable. The
benefits that the patch brings are not diminished at all compared to
the original no-rate-limit variant -- the master branch now only takes
slightly longer to completely bloat all its indexes with this 10k TPS
limit (while the patch avoids even a single page split -- no change
there).
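
(For anyone who wants to watch the same effect without setting up
profiling: a crude way to see index bloat accumulate is to compare
leaf page counts and leaf density before and after each run. The
query below is only a sketch of that kind of check -- it assumes the
pgstattuple extension is available, and it is not how the numbers in
this thread were produced.)

-- Sketch: per-index leaf pages and density for pgbench_accounts
CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT i.indexrelid::regclass AS index_name,
       s.leaf_pages,
       s.avg_leaf_density,
       pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size
FROM pg_index AS i,
     LATERAL pgstatindex(i.indexrelid::regclass) AS s
WHERE i.indrelid = 'pgbench_accounts'::regclass;

On master the leaf_pages figure keeps climbing as version churn forces
splits; with the patch it should stay put, matching the "no page
splits" behavior described above.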

Again, this is because the mechanism is a backstop. It only works as
hard as needed to avoid unnecessary page splits. When the system is
working as hard as possible to add version churn to indexes (which is
what the original/unthrottled test involved), then the mechanism also
works quite hard. In this artificial and contrived scenario, any
cycles we can save from cleaning up bloat (by micro optimizing the
code in the patch) go towards adding even more bloat instead...which
necessitates doing more cleanup. This is why optimizing the code in
the patch with this unrealistic scenario in mind is subject to sharp
diminishing returns. It's also why you can get a big benefit from the
patch when the new mechanism is barely ever used. I imagine that if I
ran the same test again but with a 1k TPS limit I would hardly see the
new mechanism in "perf top" at all...but in the end the bloat
situation would be very similar.

I think that you could model this probabilistically if you had the
inclination. Yes, the more you throttle throughput (by lowering the
pgbench rate limit further), the less chance you have of any given
leaf page splitting moment to moment (for the master branch). But in
the long run every original leaf page splits at least once anyway,
because each leaf page still only has to be unlucky once. It is still
inevitable that they'll all get split eventually (and probably not
before too long), unless and until you *drastically* throttle pgbench.
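
To spell that out a little (strictly back-of-the-envelope, and my own
framing rather than anything I measured): if a given leaf page has
some small probability p of filling up and splitting during any fixed
interval, then the chance that it still hasn't split after n intervals
is

    P(no split after n intervals) = (1 - p)^n

which heads to 0 as n grows, for any p > 0. Throttling pgbench only
lowers p per interval; it doesn't change where the curve ends up,
which is why the long term bloat picture comes out about the same.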

I believe that things like opportunistic HOT chain truncation (heap
pruning) and index tuple LP_DEAD bit setting are very effective much
of the time. The problem is that it's easy to utterly rely on them
without even realizing it, which creates hidden risk that may or may
not result in big blow ups down the line. There is nothing inherently
wrong with being lazy or opportunistic about cleaning up garbage
tuples -- I think that there are significant advantages, in fact. But
only if it isn't allowed to create systemic risk. More concretely,
bloat cannot be allowed to become concentrated in any one place -- no
individual query should have to deal with more than 2 or 3 versions
for any given logical row. If we're focussed on the *average* number
of physical versions per logical row then we may reach dramatically
wrong conclusions about what to do (which is a problem in a world
where autovacuum is supposed to do most garbage collection...unless
your app happens to look like standard pgbench!).
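
To illustrate why the average is such a poor guide here: the standard
statistics views only ever give you table-wide aggregates, along the
lines of the (hypothetical, not taken from my benchmarking scripts)
query below. Nothing in that output can tell you whether the dead
versions are spread thinly across the table or piled up behind a
handful of hot logical rows -- and that's exactly the distinction
that matters.

-- Sketch: the aggregate-only picture of version churn
SELECT relname,
       n_live_tup,
       n_dead_tup,
       n_tup_upd,
       n_tup_hot_upd,
       round(n_dead_tup::numeric / greatest(n_live_tup, 1), 3) AS dead_per_live_row
FROM pg_stat_user_tables
WHERE relname = 'pgbench_accounts';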

And now back to latency with this 10k TPS limited variant I ran last
night. After 16 hours we have performed 8 runs, each lasting 2 hours.
In the original chronological order, these runs are:

patch_1_run_16.out: "tps = 10000.095914 (including connections establishing)"
master_1_run_16.out: "tps = 10000.171945 (including connections establishing)"
patch_1_run_32.out: "tps = 10000.082975 (including connections establishing)"
master_1_run_32.out: "tps = 10000.533370 (including connections establishing)"
patch_2_run_16.out: "tps = 10000.076865 (including connections establishing)"
master_2_run_16.out: "tps = 9997.933215 (including connections establishing)"
patch_2_run_32.out: "tps = 9999.911988 (including connections establishing)"
master_2_run_32.out: "tps = 10000.864031 (including connections establishing)"

Here is what I see at the end of "patch_2_run_32.out" (i.e. at the end
of the final 2 hour run for the patch):

number of transactions actually processed: 71999872
latency average = 0.265 ms
latency stddev = 0.110 ms
rate limit schedule lag: avg 0.046 (max 30.274) ms
tps = 9999.911988 (including connections establishing)
tps = 9999.915766 (excluding connections establishing)
statement latencies in milliseconds:
         0.001  \set aid1 random_gaussian(1, 100000 * :scale, 4.0)
         0.000  \set aid2 random_gaussian(1, 100000 * :scale, 4.5)
         0.000  \set aid3 random_gaussian(1, 100000 * :scale, 4.2)
         0.000  \set bid random(1, 1 * :scale)
         0.000  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.023  BEGIN;
         0.099  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
         0.036  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
         0.034  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
         0.025  END;

Here is what I see at the end of "master_2_run_32.out" (i.e. at the
end of the final run for master):

number of transactions actually processed: 72006803
latency average = 0.266 ms
latency stddev = 2.722 ms
rate limit schedule lag: avg 0.074 (max 396.853) ms
tps = 10000.864031 (including connections establishing)
tps = 10000.868545 (excluding connections establishing)
statement latencies in milliseconds:
         0.001  \set aid1 random_gaussian(1, 100000 * :scale, 4.0)
         0.000  \set aid2 random_gaussian(1, 100000 * :scale, 4.5)
         0.000  \set aid3 random_gaussian(1, 100000 * :scale, 4.2)
         0.000  \set bid random(1, 1 * :scale)
         0.000  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.022  BEGIN;
         0.073  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
         0.036  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
         0.034  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
         0.025  END;

Notice the following:

1. The overall "latency average" for the patch is very slightly lower.
2. The overall "latency stddev" for the patch is far, far lower -- over
20x lower, in fact (0.110 ms vs. 2.722 ms).
3. The patch's latency for the UPDATE statement is still somewhat
higher, but it's not so bad. We're still visibly paying a price in
some sense, but at least we're imposing the new costs squarely on the
query that is responsible for all of our problems.
4. The latency for the SELECT statements is exactly the same for the
patch and for master. We're not imposing any new cost on "innocent" SELECT
statements that didn't create the problem, even if they didn't quite
manage to benefit here. (Without LP_DEAD setting in _bt_check_unique()
I'm sure that the SELECT latency would be significantly lower for the
patch.)

The full results (with lots of details pulled from standard system
views after each run) can be downloaded as a .tar.gz archive from:

https://drive.google.com/file/d/1Dn8rSifZqT7pOOIgyKstl-tdACWH-hqO/view?usp=sharing

(It's probably not that interesting to drill down any further, but I
make the full set of results available just in case. There are loads
of things included just because I automatically capture them when
benchmarking anything at all.)

> HOT was difficult to measure, but on a 2+ hour run on a larger table,
> the latency graph was what showed it was a winner. Short runs and
> in-memory data masked the benefits in our early analyses.

Yeah, that's what was nice about working on sorting -- almost instant
feedback. Who wants to spend at least 2 hours to test out every little
theory?  :-)

> So I suggest not looking at the totals and averages but on the peaks
> and the long term trend. Showing that in graphical form is best.

I think that you're right that a graphical representation with an
X-axis that shows how much time has passed would be very useful. I'll
try to find a way of incorporating that into my benchmarking workflow.

This is especially likely to help when modelling how cases with a long
running xact/snapshot behave. That isn't a specific goal of mine here,
but I expect that it'll help with that a lot too. For now I'm just
focussing on downsides and not upsides, for the usual reasons.

> Agreed, we can trade initial speed for long term consistency. I guess
> there are some heuristics there on that tradeoff.

Right. Another way of looking at it is this: it should be possible for
reasonably good DBAs to develop good intuitions about how the system
will hold up over time, based on past experience and common sense --
no chaos theory required. Whatever the cost of the mechanism is, at
least it's only something that gets shaved off the top minute to
minute. It seems almost impossible for the cost to cause sudden
surprises (except maybe once, after an initial upgrade to Postgres 14,
though I doubt it). Whereas it seems very likely to prevent many large
and unpleasant surprises caused by hidden, latent risk.

I believe that approximately 100% of DBAs would gladly take that
trade-off, even if the total cost in cycles was higher. It happens to
also be true that they're very likely to use far fewer cycles over
time, but that's really just a bonus.

-- 
Peter Geoghegan


