Re: Deleting older versions in unique indexes to avoid page splits - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Deleting older versions in unique indexes to avoid page splits
Date
Msg-id CAH2-WznT15uA_86JxvCf5fPPTAn-H7uS8TS3CzvToqERbuU52A@mail.gmail.com
In response to Re: Deleting older versions in unique indexes to avoid page splits  (Peter Geoghegan <pg@bowt.ie>)
On Mon, Oct 26, 2020 at 2:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Now for the not-so-good news.

> The latency numbers aren't great for the patch, either. Take the 16 client case:

Attached is v6, which more or less totally fixes the problem we saw
with this silly "lots of low cardinality indexes" benchmark.

I wasn't really expecting this to happen -- the benchmark in question
is extreme and rather unrealistic -- but I had some new insights that
made it possible (more details on the code changes towards the end of
this email). Now I don't have to convince anyone that the performance
hit for extreme cases is worth it in order to realize big benefits for
other workloads. There pretty much isn't a performance hit to speak of
now (or so it would seem). I've managed to take this small loss of
performance and turn it into a small gain. And without having to make
any compromises on the core goal of the patch ("no unnecessary page
splits caused by version churn").

With the same indexes ("score", "tenner", "fiver", etc) as before on
pgbench_accounts, and same pgbench-variant queries, we once again see
lots of index bloat with master, but no index bloat at all with v6 of
the patch (no change there). The raw latency numbers are where we see
new improvements for v6. Summary:

2020-11-02 17:35:29 -0800 - Start of initial data load for run
"patch.r1c16" (DB is also used by later runs)
2020-11-02 17:40:21 -0800 - End of initial data load for run "patch.r1c16"
2020-11-02 17:40:21 -0800 - Start of pgbench run "patch.r1c16"
2020-11-02 19:40:31 -0800 - End of pgbench run "patch.r1c16":
patch.r1c16.bench.out: "tps = 9998.129224 (including connections
establishing)" "latency average = 0.243 ms" "latency stddev = 0.088
ms"
2020-11-02 19:40:46 -0800 - Start of initial data load for run
"master.r1c16" (DB is also used by later runs)
2020-11-02 19:45:42 -0800 - End of initial data load for run "master.r1c16"
2020-11-02 19:45:42 -0800 - Start of pgbench run "master.r1c16"
2020-11-02 21:45:52 -0800 - End of pgbench run "master.r1c16":
master.r1c16.bench.out: "tps = 9998.674505 (including connections
establishing)" "latency average = 0.231 ms" "latency stddev = 0.717
ms"
2020-11-02 21:46:10 -0800 - Start of pgbench run "patch.r1c32"
2020-11-02 23:46:23 -0800 - End of pgbench run "patch.r1c32":
patch.r1c32.bench.out: "tps = 9999.968794 (including connections
establishing)" "latency average = 0.256 ms" "latency stddev = 0.104
ms"
2020-11-02 23:46:39 -0800 - Start of pgbench run "master.r1c32"
2020-11-03 01:46:54 -0800 - End of pgbench run "master.r1c32":
master.r1c32.bench.out: "tps = 10001.097045 (including connections
establishing)" "latency average = 0.250 ms" "latency stddev = 1.858
ms"
2020-11-03 01:47:32 -0800 - Start of pgbench run "patch.r2c16"
2020-11-03 03:47:45 -0800 - End of pgbench run "patch.r2c16":
patch.r2c16.bench.out: "tps = 9999.290688 (including connections
establishing)" "latency average = 0.247 ms" "latency stddev = 0.103
ms"
2020-11-03 03:48:04 -0800 - Start of pgbench run "master.r2c16"
2020-11-03 05:48:18 -0800 - End of pgbench run "master.r2c16":
master.r2c16.bench.out: "tps = 10000.424117 (including connections
establishing)" "latency average = 0.241 ms" "latency stddev = 1.587
ms"
2020-11-03 05:48:39 -0800 - Start of pgbench run "patch.r2c32"
2020-11-03 07:48:52 -0800 - End of pgbench run "patch.r2c32":
patch.r2c32.bench.out: "tps = 9999.539730 (including connections
establishing)" "latency average = 0.258 ms" "latency stddev = 0.125
ms"
2020-11-03 07:49:11 -0800 - Start of pgbench run "master.r2c32"
2020-11-03 09:49:26 -0800 - End of pgbench run "master.r2c32":
master.r2c32.bench.out: "tps = 10000.833754 (including connections
establishing)" "latency average = 0.250 ms" "latency stddev = 0.997
ms"

These are 2 hour runs, 16 and 32 clients -- same as last time (though
note the 10k TPS limit). So 4 pairs of runs (each pair of runs is a
pair of patch/master runs) making 8 runs total, lasting 16 hours total
(not including initial data loading).

Notice that the final pair of runs shows that the master branch
eventually gets to the point of having far higher latency stddev. The
stddev starts high and gets higher as time goes on. Here is the
latency breakdown for the final pair of runs:

patch.r2c32.bench.out:

scaling factor: 1000
query mode: prepared
number of clients: 32
number of threads: 8
duration: 7200 s
number of transactions actually processed: 71997119
latency average = 0.258 ms
latency stddev = 0.125 ms
rate limit schedule lag: avg 0.046 (max 39.151) ms
tps = 9999.539730 (including connections establishing)
tps = 9999.544743 (excluding connections establishing)
statement latencies in milliseconds:
         0.001  \set aid1 random_gaussian(1, 100000 * :scale, 4.0)
         0.000  \set aid2 random_gaussian(1, 100000 * :scale, 4.5)
         0.000  \set aid3 random_gaussian(1, 100000 * :scale, 4.2)
         0.000  \set bid random(1, 1 * :scale)
         0.000  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.022  BEGIN;
         0.091  UPDATE pgbench_accounts SET abalance = abalance +
:delta WHERE aid = :aid1;
         0.036  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
         0.034  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
         0.025  END;

master.r2c32.bench.out:

query mode: prepared
number of clients: 32
number of threads: 8
duration: 7200 s
number of transactions actually processed: 72006667
latency average = 0.250 ms
latency stddev = 0.997 ms
rate limit schedule lag: avg 0.053 (max 233.045) ms
tps = 10000.833754 (including connections establishing)
tps = 10000.839935 (excluding connections establishing)
statement latencies in milliseconds:
         0.001  \set aid1 random_gaussian(1, 100000 * :scale, 4.0)
         0.000  \set aid2 random_gaussian(1, 100000 * :scale, 4.5)
         0.000  \set aid3 random_gaussian(1, 100000 * :scale, 4.2)
         0.000  \set bid random(1, 1 * :scale)
         0.000  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.023  BEGIN;
         0.075  UPDATE pgbench_accounts SET abalance = abalance +
:delta WHERE aid = :aid1;
         0.037  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
         0.035  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
         0.026  END;

There is only one aspect of this that the master branch still wins on
by the end -- the latency on the UPDATE is still a little lower on
master. This happens for the obvious reason: the UPDATE doesn't clean
up after itself on the master branch (to the extent that the average
xact latency is still a tiny bit higher with the patch). But the
SELECT queries are never actually slower with the patch, even during
earlier runs -- they're just as fast as the master branch (and faster
than the master branch by the end). Only the UPDATEs can ever be made
slower, so AFAICT any new cost is only experienced by the queries that
create the problem for the system as a whole. We're imposing new costs
fairly.

Performance with the patch is consistently much more stable. We don't
get overwhelmed by FPIs in the way we see with the master branch,
which causes sudden gluts that are occasionally very severe (notice
that master.r2c16.bench.out was a big stddev outlier). Maybe the most
notable thing is the max rate limit schedule lag. In the last run we
see max 39.151 ms for the patch vs max 233.045 ms for master.

Now for a breakdown of the code enhancements behind the improved
benchmark numbers. They are:

* The most notable improvement is to the sort order of the heap block
groups within heapam.c. We now round up to the next power of two when
sorting candidate heap blocks (this is the sort used to decide which
heap blocks to visit, and in what order). The "number of promising
TIDs" field (as well as the "number of TIDs total" tiebreaker) is
rounded up so that we ignore relatively small differences. We now tend
to process related groups of contiguous pages in relatively big
batches -- though only where appropriate.
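The bucketized comparison described above can be sketched as follows. This is only an illustration of the idea, not the patch's actual code: the struct fields, helper names, and tiebreaker details are hypothetical, but the shape is the same -- round both counts up to the next power of two so that noise-level differences compare as equal, and fall back to block number so contiguous blocks sort together.

```c
#include <stdint.h>
#include <stdlib.h>
#include <assert.h>     /* for the demo assertions */

/* Hypothetical candidate heap block, as seen by the sort */
typedef struct CandidateBlock
{
    uint32_t blkno;         /* heap block number */
    uint32_t npromising;    /* "number of promising TIDs" */
    uint32_t ntotal;        /* "number of TIDs total" tiebreaker */
} CandidateBlock;

/* Round a count up to the next power of two, so that nearby values fall
 * into the same bucket and compare as equal (illustrative helper) */
static uint32_t
bucketize(uint32_t n)
{
    uint32_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/* qsort-style comparator: bucketized promising-TID count first (more is
 * better), bucketized total-TID count as tiebreaker, then ascending block
 * number so that related contiguous blocks end up adjacent in the order */
static int
candidate_cmp(const void *a, const void *b)
{
    const CandidateBlock *ca = (const CandidateBlock *) a;
    const CandidateBlock *cb = (const CandidateBlock *) b;
    uint32_t pa = bucketize(ca->npromising);
    uint32_t pb = bucketize(cb->npromising);
    uint32_t ta,
             tb;

    if (pa != pb)
        return pa > pb ? -1 : 1;
    ta = bucketize(ca->ntotal);
    tb = bucketize(cb->ntotal);
    if (ta != tb)
        return ta > tb ? -1 : 1;
    if (ca->blkno != cb->blkno)
        return ca->blkno < cb->blkno ? -1 : 1;
    return 0;
}
```

With this comparator, blocks whose counts differ by less than a power-of-two bucket are deliberately tied, so the block-number tiebreaker takes over and yields runs of contiguous blocks.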

* A closely related optimization was also added to heapam.c:
"favorable blocks". That is, we recognize groups of related heap
blocks that are contiguous. When we encounter these blocks we make
heapam.c effectively increase its per-call batch size, so that it
processes more blocks in the short term but does less absolute work in
the long run.
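A minimal sketch of the "favorable blocks" idea, under the assumption that the candidate blocks have already been sorted as above (the function name and budget parameter are invented for illustration): the per-call budget is allowed to be exceeded only while the next block is physically contiguous with the previous one, since that extra work is nearly free sequential I/O.

```c
#include <stdint.h>
#include <assert.h>     /* for the demo assertions */

/* Given candidate heap block numbers in visit order, return how many to
 * visit this call.  We stop once the budget is spent, EXCEPT that we keep
 * going while the next block is contiguous with the one just processed
 * ("favorable blocks"): more blocks in the short term, less absolute work
 * in the long run.  Hypothetical sketch, not the patch's actual code. */
static int
blocks_to_visit(const uint32_t *blkno, int nblocks, int budget)
{
    int nvisit = 0;

    for (int i = 0; i < nblocks; i++)
    {
        if (nvisit >= budget &&
            (i == 0 || blkno[i] != blkno[i - 1] + 1))
            break;              /* budget spent, and next block not contiguous */
        nvisit++;
    }
    return nvisit;
}
```

So with a budget of 2 and candidates {10, 11, 12, 20}, three blocks are visited: block 12 rides along because it is contiguous with 11, while block 20 is left for later.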

The key insight behind both of these enhancements is that physically
close heap blocks are generally also close together in time, and
generally share similar characteristics (mostly LP_DEAD items vs
mostly live items, already in shared_buffers, etc). So we're focusing
more on heap locality when the hint we get from nbtree isn't giving us
a very strong signal about what to do (we need to be very judicious
because we are still only willing to access a small number of heap
blocks at a time). While in general the best information heapam has to
go on comes from nbtree, heapam should not care about noise-level
variations in that information -- better to focus on heap locality
(IOW heapam.c needs to have a sophisticated understanding of the
limitations of the hints it receives from nbtree). As a result of
these two changes, heapam.c tends to process earlier blocks/records
first, in order, in a way that is correlated across time and across
indexes -- with more sequential I/O and more confidence in a
successful outcome when we undertake work. (Victor should note that
heapam.c no longer has any patience when it encounters even a single
heap block with no dead TIDs -- he was right about that. The new
heapam.c stuff works well enough that there is no possible upside to
"being patient", even with indexes on low cardinality data that
experience lots of version churn, a workload that my recent
benchmarking exemplifies.)

* nbtdedup.c will now give heapam.c an accurate idea about its
requirements -- it now expresses those in terms of space freed, which
heapam.c now cares about directly. This matters a lot with low
cardinality data, where freeing an entire index tuple is a lot more
valuable to nbtree than freeing just one TID in a posting list
(because the former case frees up 20 bytes at a minimum, while the
latter case usually only frees up 6 bytes).
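The arithmetic here can be made concrete with a toy accounting helper. The 20-byte and 6-byte figures are the ones quoted above; the function itself and its constants are illustrative, not the patch's actual accounting.

```c
#include <assert.h>     /* for the demo assertions */

#define INDEXTUPLE_MIN_FREED 20     /* whole index tuple, at a minimum */
#define TID_SIZE             6      /* one TID in a posting list */

/* Rough space freed when ntids_deleted of an index tuple's ntids_in_tuple
 * TIDs become deletable.  Deleting every TID frees the whole tuple (20
 * bytes minimum, plus the extra posting-list TIDs); deleting a subset of
 * a posting list only shrinks it by 6 bytes per TID.  Hypothetical
 * sketch, assuming these simplified sizes. */
static int
space_freed(int ntids_deleted, int ntids_in_tuple)
{
    if (ntids_deleted == ntids_in_tuple)
        return INDEXTUPLE_MIN_FREED + (ntids_in_tuple - 1) * TID_SIZE;
    return ntids_deleted * TID_SIZE;
}
```

This is why expressing requirements in terms of space freed changes behavior on low cardinality data: one whole-tuple deletion (>= 20 bytes) is worth more than three posting-list trims (6 bytes each), even though the latter delete more TIDs.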

I said something to Victor about nbtree's wishes not being important
here -- heapam.c is in charge. But that now seems like the wrong
mental model. After all, how can nbtree not be important if it is
entitled to call heapam.c as many times as it feels like, without
caring about the cost of thrashing (as we saw with low cardinality
data prior to v6)? With v6 of the patch I took my own advice about not
thinking of each operation as an isolated event. So now heapam.c has a
more nuanced understanding of the requirements of nbtree, and can be
either more eager or more lazy according to 1) nbtree requirements,
and 2) conditions local to heapam.c.

* Improved handling of low cardinality data in nbtdedup.c -- we now
always merge together items with low cardinality data, regardless of
how well we do with deleting TIDs. This buys more time before the next
delete attempt for the same page.

* Specialized shellsort implementations for heapam.c.

Shellsort is sometimes used as a lightweight system qsort in embedded
systems. It has many of the same useful properties as a well optimized
quicksort for smaller datasets, and also has the merit of compiling to
far fewer instructions when specialized like this. Specializing on
qsort (as in v5) results in machine code that seems rather heavyweight
given the heapam.c requirements. Instruction cache matters here
(although that's easy to miss when profiling).
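For reference, a shellsort specialized for one fixed element type looks like this. The gap sequence and element type are illustrative choices, not necessarily what heapam.c uses; the point is that, unlike a generic qsort taking a comparator through a function pointer, everything here inlines and compiles to a small amount of machine code.

```c
#include <stdint.h>
#include <assert.h>     /* for the demo assertions */

/* Shellsort specialized for uint32_t.  A Ciura-style gap sequence works
 * well for the small arrays this is intended for.  Sketch only -- the
 * patch's actual specializations may use different types and gaps. */
static void
shellsort_uint32(uint32_t *a, int n)
{
    static const int gaps[] = {57, 23, 9, 4, 1};

    for (int g = 0; g < 5; g++)
    {
        int gap = gaps[g];

        /* gapped insertion sort; gaps larger than n are harmless no-ops */
        for (int i = gap; i < n; i++)
        {
            uint32_t tmp = a[i];
            int j = i;

            for (; j >= gap && a[j - gap] > tmp; j -= gap)
                a[j] = a[j - gap];
            a[j] = tmp;
        }
    }
}
```

Because the element type and comparison are fixed at compile time, the whole sort is a handful of tight loops with no indirect calls, which is where the instruction-cache benefit mentioned above comes from.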

v6 still needs more polishing -- my focus has still been on the
algorithm itself. But I think I'm almost done with that part -- it
seems unlikely that I'll be able to make any additional significant
improvements in that area after v6. The new bucketized heap block
sorting behavior seems to be really effective, especially with low
cardinality data, and especially over time, as the heap naturally
becomes more fragmented. We're now blending together locality from
nbtree and heapam in an adaptive way.

I'm pretty sure that the performance on sympathetic cases (such as the
case Victor tested) will also be a lot better, though I didn't look
into that on v6. If Victor is in a position to run further benchmarks
on v6, that would certainly be welcome (independent validation always
helps).

I'm not aware of any remaining cases that it would be fair to describe
as being regressed by the patch -- can anybody else think of any
possible candidates?

BTW, I will probably rename the mechanism added by the patch to
"bottom-up index vacuuming", or perhaps "bottom-up index deletion" --
that seems to capture the general nature of what the patch does quite
well. Now regular index vacuuming can be thought of as a top-down
complementary mechanism that takes care of remaining diffuse spots of
garbage tuples that queries happen to miss (more or less). Also, while
it is true that there are some ways in which the patch is related to
deduplication, it doesn't seem useful to emphasize that part anymore.
Plus clarifying which kind of deduplication I'm talking about in code
comments is starting to become really annoying.

-- 
Peter Geoghegan

