Re: Deleting older versions in unique indexes to avoid page splits - Mailing list pgsql-hackers

From: Peter Geoghegan <pg@bowt.ie>
Subject: Re: Deleting older versions in unique indexes to avoid page splits
Msg-id: CAH2-WznT15uA_86JxvCf5fPPTAn-H7uS8TS3CzvToqERbuU52A@mail.gmail.com
In response to: Re: Deleting older versions in unique indexes to avoid page splits (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: Deleting older versions in unique indexes to avoid page splits
List: pgsql-hackers
On Mon, Oct 26, 2020 at 2:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Now for the not-so-good news.
> The latency numbers aren't great for the patch, either. Take the 16 client case:

Attached is v6, which more or less totally fixes the problem we saw with this silly "lots of low cardinality indexes" benchmark. I wasn't really expecting this to happen -- the benchmark in question is extreme and rather unrealistic -- but I had some new insights that made it possible (more details on the code changes towards the end of this email).

Now I don't have to convince anyone that the performance hit for extreme cases is worth it in order to realize big benefits for other workloads. There is pretty much no performance hit to speak of now (or so it would seem). I've managed to take this small loss of performance and turn it into a small gain. And without having to make any compromises on the core goal of the patch ("no unnecessary page splits caused by version churn").

With the same indexes ("score", "tenner", "fiver", etc.) as before on pgbench_accounts, and the same pgbench-variant queries, we once again see lots of index bloat with master, but no index bloat at all with v6 of the patch (no change there). The raw latency numbers are where we see new improvements for v6. Summary:

2020-11-02 17:35:29 -0800 - Start of initial data load for run "patch.r1c16" (DB is also used by later runs)
2020-11-02 17:40:21 -0800 - End of initial data load for run "patch.r1c16"
2020-11-02 17:40:21 -0800 - Start of pgbench run "patch.r1c16"
2020-11-02 19:40:31 -0800 - End of pgbench run "patch.r1c16":
patch.r1c16.bench.out: "tps = 9998.129224 (including connections establishing)" "latency average = 0.243 ms" "latency stddev = 0.088 ms"
2020-11-02 19:40:46 -0800 - Start of initial data load for run "master.r1c16" (DB is also used by later runs)
2020-11-02 19:45:42 -0800 - End of initial data load for run "master.r1c16"
2020-11-02 19:45:42 -0800 - Start of pgbench run "master.r1c16"
2020-11-02 21:45:52 -0800 - End of pgbench run "master.r1c16":
master.r1c16.bench.out: "tps = 9998.674505 (including connections establishing)" "latency average = 0.231 ms" "latency stddev = 0.717 ms"
2020-11-02 21:46:10 -0800 - Start of pgbench run "patch.r1c32"
2020-11-02 23:46:23 -0800 - End of pgbench run "patch.r1c32":
patch.r1c32.bench.out: "tps = 9999.968794 (including connections establishing)" "latency average = 0.256 ms" "latency stddev = 0.104 ms"
2020-11-02 23:46:39 -0800 - Start of pgbench run "master.r1c32"
2020-11-03 01:46:54 -0800 - End of pgbench run "master.r1c32":
master.r1c32.bench.out: "tps = 10001.097045 (including connections establishing)" "latency average = 0.250 ms" "latency stddev = 1.858 ms"
2020-11-03 01:47:32 -0800 - Start of pgbench run "patch.r2c16"
2020-11-03 03:47:45 -0800 - End of pgbench run "patch.r2c16":
patch.r2c16.bench.out: "tps = 9999.290688 (including connections establishing)" "latency average = 0.247 ms" "latency stddev = 0.103 ms"
2020-11-03 03:48:04 -0800 - Start of pgbench run "master.r2c16"
2020-11-03 05:48:18 -0800 - End of pgbench run "master.r2c16":
master.r2c16.bench.out: "tps = 10000.424117 (including connections establishing)" "latency average = 0.241 ms" "latency stddev = 1.587 ms"
2020-11-03 05:48:39 -0800 - Start of pgbench run "patch.r2c32"
2020-11-03 07:48:52 -0800 - End of pgbench run "patch.r2c32":
patch.r2c32.bench.out: "tps = 9999.539730 (including connections establishing)" "latency average = 0.258 ms" "latency stddev = 0.125 ms"
2020-11-03 07:49:11 -0800 - Start of pgbench run "master.r2c32"
"master.r2c32" 2020-11-03 09:49:26 -0800 - End of pgbench run "master.r2c32": master.r2c32.bench.out: "tps = 10000.833754 (including connections establishing)" "latency average = 0.250 ms" "latency stddev = 0.997 ms" These are 2 hour runs, 16 and 32 clients -- same as last time (though note the 10k TPS limit). So 4 pairs of runs (each pair of runs is a pair of patch/master runs) making 8 runs total, lasting 16 hours total (not including initial data loading). Notice that the final pair of runs shows that the master branch eventually gets to the point of having far higher latency stddev. The stddev starts high and gets higher as time goes on. Here is the latency breakdown for the final pair of runs: patch.r2c32.bench.out: scaling factor: 1000 query mode: prepared number of clients: 32 number of threads: 8 duration: 7200 s number of transactions actually processed: 71997119 latency average = 0.258 ms latency stddev = 0.125 ms rate limit schedule lag: avg 0.046 (max 39.151) ms tps = 9999.539730 (including connections establishing) tps = 9999.544743 (excluding connections establishing) statement latencies in milliseconds: 0.001 \set aid1 random_gaussian(1, 100000 * :scale, 4.0) 0.000 \set aid2 random_gaussian(1, 100000 * :scale, 4.5) 0.000 \set aid3 random_gaussian(1, 100000 * :scale, 4.2) 0.000 \set bid random(1, 1 * :scale) 0.000 \set tid random(1, 10 * :scale) 0.000 \set delta random(-5000, 5000) 0.022 BEGIN; 0.091 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1; 0.036 SELECT abalance FROM pgbench_accounts WHERE aid = :aid2; 0.034 SELECT abalance FROM pgbench_accounts WHERE aid = :aid3; 0.025 END; master.r2c32.bench.out: query mode: prepared number of clients: 32 number of threads: 8 duration: 7200 s number of transactions actually processed: 72006667 latency average = 0.250 ms latency stddev = 0.997 ms rate limit schedule lag: avg 0.053 (max 233.045) ms tps = 10000.833754 (including connections establishing) tps = 10000.839935 (excluding connections establishing) statement latencies in milliseconds: 0.001 \set aid1 random_gaussian(1, 100000 * :scale, 4.0) 0.000 \set aid2 random_gaussian(1, 100000 * :scale, 4.5) 0.000 \set aid3 random_gaussian(1, 100000 * :scale, 4.2) 0.000 \set bid random(1, 1 * :scale) 0.000 \set tid random(1, 10 * :scale) 0.000 \set delta random(-5000, 5000) 0.023 BEGIN; 0.075 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1; 0.037 SELECT abalance FROM pgbench_accounts WHERE aid = :aid2; 0.035 SELECT abalance FROM pgbench_accounts WHERE aid = :aid3; 0.026 END; There is only one aspect of this that the master branch still wins on by the end -- the latency on the UPDATE is still a little lower on master. This happens for the obvious reason: the UPDATE doesn't clean up after itself on the master branch (to the extent that the average xact latency is still a tiny bit higher with the patch). But the SELECT queries are never actually slower with the patch, even during earlier runs -- they're just as fast as the master branch (and faster than the master branch by the end). Only the UPDATEs can ever be made slower, so AFAICT any new cost is only experienced by the queries that create the problem for the system as a whole. We're imposing new costs fairly. Performance with the patch is consistently much more stable. We don't get overwhelmed by FPIs in the way we see with the master branch, which causes sudden gluts that are occasionally very severe (notice that master.r2c16.bench.out was a big stddev outlier). 
Maybe the most notable thing is the max rate limit schedule lag. In the last run we see max 39.151 ms for the patch vs. max 233.045 ms for master.

Now for a breakdown of the code enhancements behind the improved benchmark numbers. They are:

* The most notable improvement is to the sort order of the heap block groups within heapam.c. We now round up to the next power of two when sorting candidate heap blocks (this is the sort used to decide which heap blocks to visit, and in what order). The "number of promising TIDs" field (as well as the "number of TIDs total" tiebreaker) is rounded up so that we ignore relatively small differences. We now tend to process related groups of contiguous pages in relatively big batches -- though only where appropriate.

* A closely related optimization was also added to heapam.c: "favorable blocks". That is, we recognize groups of related heap blocks that are contiguous. When we encounter these blocks we make heapam.c effectively increase its per-call batch size, so that it processes more blocks in the short term but does less absolute work in the long run.

The key insight behind both of these enhancements is that physically close heap blocks are generally also close together in time, and generally share similar characteristics (mostly LP_DEAD items vs. mostly live items, already in shared_buffers, etc.). So we're focusing more on heap locality when the hint we get from nbtree isn't giving us a very strong signal about what to do (we need to be very judicious because we are still only willing to access a small number of heap blocks at a time). While in general the best information heapam has to go on comes from nbtree, heapam should not care about noise-level variations in that information -- better to focus on heap locality. IOW, heapam.c needs to have a sophisticated understanding of the limitations of the hints it receives from nbtree. (A sketch of both heuristics appears below.)

As a result of these two changes, heapam.c tends to process earlier blocks/records first, in order, in a way that is correlated across time and across indexes -- with more sequential I/O and more confidence in a successful outcome when we undertake work.

(Victor should note that heapam.c no longer has any patience when it encounters even a single heap block with no dead TIDs -- he was right about that. The new heapam.c stuff works well enough that there is no possible upside to "being patient", even with indexes on low cardinality data that experience lots of version churn, a workload that my recent benchmarking exemplifies.)
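To make those two heuristics concrete, here is a minimal sketch in C -- not the patch's actual code. The BlockGroup struct, its field names, and both helper functions are invented for illustration only:

/*
 * Sketch of the bucketized sort order described above (invented names;
 * not the patch's actual code).
 */
#include <stdint.h>
#include <stdlib.h>

typedef struct BlockGroup
{
    uint32_t    heapblk;        /* heap block number */
    uint32_t    npromising;     /* "promising" TIDs for this block */
    uint32_t    ntids;          /* total TIDs for this block (tiebreaker) */
} BlockGroup;

/* Round up to the next power of two, so small differences are ignored */
static uint32_t
pow2_bucket(uint32_t n)
{
    uint32_t    bucket = 1;

    while (bucket < n)
        bucket <<= 1;
    return bucket;
}

/*
 * qsort-style comparator: bucketized promising count first, bucketized
 * total TID count as tiebreaker, then heap block number.  Blocks that
 * look roughly equally promising therefore get visited in physical
 * heap order.
 */
static int
block_group_cmp(const void *a, const void *b)
{
    const BlockGroup *ga = (const BlockGroup *) a;
    const BlockGroup *gb = (const BlockGroup *) b;
    uint32_t    pa = pow2_bucket(ga->npromising);
    uint32_t    pb = pow2_bucket(gb->npromising);

    if (pa != pb)
        return pa > pb ? -1 : 1;    /* more promising sorts first */
    pa = pow2_bucket(ga->ntids);
    pb = pow2_bucket(gb->ntids);
    if (pa != pb)
        return pa > pb ? -1 : 1;
    if (ga->heapblk != gb->heapblk)
        return ga->heapblk < gb->heapblk ? -1 : 1;
    return 0;
}

/*
 * "Favorable blocks" idea: once we decide to visit groups[i], keep
 * extending the batch while the next candidate block is physically
 * contiguous with the current one.
 */
static int
favorable_batch_end(const BlockGroup *groups, int i, int ngroups)
{
    while (i + 1 < ngroups &&
           groups[i + 1].heapblk == groups[i].heapblk + 1)
        i++;
    return i;
}

Under these invented rules, a block with 7 promising TIDs and one with 5 land in the same bucket (both round up to 8), so the noise-level difference between them is ignored and physical heap order ends up deciding the visit order.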
* nbtdedup.c will now give heapam.c an accurate idea about its requirements -- it now expresses those in terms of space freed, which heapam.c now cares about directly. This matters a lot with low cardinality data, where freeing an entire index tuple is a lot more valuable to nbtree than freeing just one TID in a posting list (the former frees at least 20 bytes -- a 16-byte tuple plus its 4-byte line pointer -- while the latter usually frees only 6 bytes, the width of a single TID).

I said something to Victor about nbtree's wishes not being important here -- heapam.c is in charge. But that now seems like the wrong mental model. After all, how can nbtree not be important if it is entitled to call heapam.c as many times as it feels like, without caring about the cost of thrashing (as we saw with low cardinality data prior to v6)? With v6 of the patch I took my own advice about not thinking of each operation as an isolated event. So now heapam.c has a more nuanced understanding of the requirements of nbtree, and can be either more eager or more lazy according to 1) nbtree's requirements, and 2) conditions local to heapam.c.

* Improved handling of low cardinality data in nbtdedup.c -- we now always merge together items with low cardinality data, regardless of how well we do with deleting TIDs. This buys more time before the next delete attempt for the same page.

* Specialized shellsort implementations for heapam.c. Shellsort is sometimes used as a lightweight system qsort in embedded systems. It has many of the same useful properties as a well optimized quicksort for smaller datasets, and also has the merit of compiling to far fewer instructions when specialized like this. Specializing on qsort (as in v5) results in machine code that seems rather heavyweight given the heapam.c requirements. Instruction cache matters here (although that's easy to miss when profiling). A sketch of the idea follows.
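As an illustration only, here is what a specialized shellsort over the hypothetical BlockGroup array from the earlier sketch might look like (again invented names; the patch's real implementation differs). Because the element type and the comparator are fixed at compile time, the compiler can inline the comparisons, and the whole routine compiles to far less code than a general qsort that goes through a function pointer:

/*
 * Specialized shellsort over BlockGroup arrays (sketch only).  The gap
 * sequence is Ciura's, truncated for the small arrays involved.
 * Reuses block_group_cmp from the earlier sketch.
 */
static void
block_group_shellsort(BlockGroup *a, int n)
{
    static const int gaps[] = {57, 23, 10, 4, 1};

    for (int g = 0; g < (int) (sizeof(gaps) / sizeof(gaps[0])); g++)
    {
        int         gap = gaps[g];

        /* gapped insertion sort */
        for (int i = gap; i < n; i++)
        {
            BlockGroup  tmp = a[i];
            int         j = i;

            while (j >= gap && block_group_cmp(&a[j - gap], &tmp) > 0)
            {
                a[j] = a[j - gap];
                j -= gap;
            }
            a[j] = tmp;
        }
    }
}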
v6 still needs more polishing -- my focus has still been on the algorithm itself. But I think I'm almost done with that part -- it seems unlikely that I'll be able to make any additional significant improvements there after v6. The new bucketized heap block sorting behavior seems to be really effective, especially with low cardinality data, and especially over time, as the heap naturally becomes more fragmented. We're now blending together locality information from nbtree and heapam in an adaptive way.

I'm pretty sure that the performance on sympathetic cases (such as the case Victor tested) will also be a lot better, though I haven't looked into that with v6. If Victor is in a position to run further benchmarks on v6, that would certainly be welcome (independent validation always helps). I'm not aware of any remaining cases that it would be fair to describe as regressed by the patch -- can anybody else think of any possible candidates?

BTW, I will probably rename the mechanism added by the patch to "bottom-up index vacuuming", or perhaps "bottom-up index deletion" -- that seems to capture the general nature of what the patch does quite well. Regular index vacuuming can then be thought of as a complementary top-down mechanism that takes care of the remaining diffuse spots of garbage tuples that queries happen to miss (more or less). Also, while it is true that there are some ways in which the patch is related to deduplication, it no longer seems useful to emphasize that connection. Plus clarifying which kind of deduplication I'm talking about in code comments is starting to become really annoying.

--
Peter Geoghegan