Re: Deleting older versions in unique indexes to avoid page splits - Mailing list pgsql-hackers

From:           Peter Geoghegan
Subject:        Re: Deleting older versions in unique indexes to avoid page splits
Date:
Msg-id:         CAH2-Wzkaj6OhVSvB46t_ybosd0cH1Qaiao+0dSopGUDmEM37Hw@mail.gmail.com
In response to: Re: Deleting older versions in unique indexes to avoid page splits (Peter Geoghegan <pg@bowt.ie>)
Responses:      Re: Deleting older versions in unique indexes to avoid page splits (4 replies, all with the same subject)
List:           pgsql-hackers
On Sat, Oct 24, 2020 at 8:01 AM Peter Geoghegan <pg@bowt.ie> wrote:
> There probably are some workloads with worse latency and throughput,
> but generally only with high contention/small indexes. I'll try to
> fine tune those, but some amount of it is probably inevitable. On
> average query latency is quite a lot lower with the patch (where it is
> affected at all - the mechanism is only used with non-hot updates).

Attached is v5, which has changes that are focused on two important
high-level goals:

1. Keeping costs down generally, especially in high contention cases
where costs are most noticeable.

2. Making indexes on low cardinality columns that naturally benefit
from merge deduplication in Postgres 13 receive largely the same
benefits that you've already seen with high cardinality indexes (at
least outside of the extreme cases where it isn't sensible to try).

CPU costs (especially from sorting temp work arrays) seem to be the
big cost overall. It turns out that the costs of accessing heap pages
within the new mechanism are not really noticeable. This isn't all
that surprising, though. The heap pages are accessed in a way that
naturally exploits locality across would-be page splits in different
indexes.

To some degree the two goals that I describe conflict with each other.
If merge deduplication increases the number of logical rows that "fit
on each leaf page" (by increasing the initial number of TIDs on each
leaf page by over 3x when the index is in the pristine CREATE INDEX
state), then naturally the average amount of work required to maintain
indexes in their pristine state is increased. We cannot expect to pay
nothing to avoid "unnecessary page splits" -- we can only expect to
come out ahead over time.

The main new thing that allowed me to more or less accomplish the
second goal is granular deletion in posting list tuples. That is, v5
teaches the delete mechanism to do VACUUM-style granular TID deletion
within posting lists.
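The posting-list change can be sketched outside PostgreSQL. Below is a
toy Python model of what "VACUUM-style granular TID deletion within
posting lists" means (the names and data structures here are mine for
illustration only -- the real mechanism is C code inside nbtree):
individual dead heap TIDs are removed from a posting list tuple,
rather than the tuple only being deletable once every TID in it is
dead.

```python
# Toy model (not the patch's actual C code): a leaf page as a mapping
# from index key to its posting list of heap TIDs (block, offset).

def delete_dead_tids(posting_lists, dead_tids):
    """Remove individual dead TIDs from each posting list.

    A tuple whose posting list still has live TIDs is kept (shrunken);
    a tuple whose posting list becomes empty is dropped entirely.
    """
    dead = set(dead_tids)
    result = {}
    for key, tids in posting_lists.items():
        live = [t for t in tids if t not in dead]
        if live:
            result[key] = live  # keep the tuple with a shorter posting list
        # else: every TID was dead, so the whole tuple is deleted
    return result

page = {
    "score_key_0": [(1, 1), (1, 2), (1, 3), (1, 4)],  # 4 versions, 3 dead
    "score_key_20": [(2, 1)],
}
page = delete_dead_tids(page, [(1, 2), (1, 3), (1, 4)])
# The old versions are gone, but the tuple itself survives with one TID.
```

Before this change, a posting list tuple with even one live TID could
not help the delete mechanism reclaim space, which is why low
cardinality (deduplicated) indexes lagged behind.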
This factor turned out to matter a lot. Here is an extreme benchmark
that I ran this morning for this patch, which shows both strengths and
weaknesses: pgbench scale 1000 (with customizations to indexing that I
go into below), 16 + 32 clients, 30 minutes per run (so 2 hours total,
excluding initial data loading). Same queries as last time:

\set aid1 random_gaussian(1, 100000 * :scale, 4.0)
\set aid2 random_gaussian(1, 100000 * :scale, 4.5)
\set aid3 random_gaussian(1, 100000 * :scale, 4.2)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
END;

And just like last time, we replace the usual pgbench_accounts index
with an INCLUDE index to artificially make every UPDATE a non-HOT
UPDATE:

create unique index aid_pkey_include_abalance
on pgbench_accounts (aid) include (abalance);

Unlike last time, we have a variety of useless-to-us *low cardinality*
indexes. These are much smaller than aid_pkey_include_abalance due to
merge deduplication, but are still quite similar to it:

create index fiver on pgbench_accounts ((aid - (aid%5)));
create index tenner on pgbench_accounts ((aid - (aid%10)));
create index score on pgbench_accounts ((aid - (aid%20)));

The silly indexes this time around are designed to have the same skew
as the PK, but with low cardinality data. So, for example, "score" has
twenty distinct logical rows for each distinct value in the index.
Which is pretty extreme as far as the delete mechanism is concerned.
That's why this is more of a mixed picture compared to the earlier
benchmark. I'm trying to really put the patch through its paces, not
make it look good.

First the good news. The patch held up perfectly in one important way
-- the size of the indexes didn't change at all compared to the
original pristine size.
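As a sanity check of the cardinality arithmetic (plain Python, no
PostgreSQL involved): the expression aid - (aid % N) maps N
consecutive aid values to a single index key, which is what makes
"score" (N = 20) store twenty logical rows per distinct key.

```python
# Cardinality of the expression indexes: aid - (aid % n) collapses n
# consecutive aid values onto one key, so each index has roughly
# 100000 * scale / n distinct keys.

def index_key(aid, n):
    return aid - (aid % n)

aids = range(1, 100_001)  # 100k of the 100M aid values at scale 1000
for name, n in [("fiver", 5), ("tenner", 10), ("score", 20)]:
    keys = {index_key(a, n) for a in aids}
    print(name, len(keys))  # roughly 100000 / n distinct keys
```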
That looked like this at the start for both patch + master:

aid_pkey_include_abalance: 274,194 pages / 2142 MB
fiver:                     142,348 pages / 1112 MB
tenner:                    115,352 pages /  901 MB
score:                      94,677 pages /  740 MB

By the end, master looked like this:

aid_pkey_include_abalance: 428,759 pages (~1.56x original size)
fiver:                     220,114 pages (~1.54x original size)
tenner:                    176,983 pages (~1.53x original size)
score:                     178,165 pages (~1.88x original size)

(As I said, no change in the size of indexes with the patch -- not
even one single page split.)

Now for the not-so-good news. The TPS numbers looked like this
(results in original chronological order of the runs, which I've
interleaved):

patch_1_run_16.out:  "tps = 30452.530518 (including connections establishing)"
master_1_run_16.out: "tps = 35101.867559 (including connections establishing)"
patch_1_run_32.out:  "tps = 26000.991486 (including connections establishing)"
master_1_run_32.out: "tps = 32582.129545 (including connections establishing)"

The latency numbers aren't great for the patch, either.
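For reference, the "~1.5x" figures are just the end-of-run master page
counts divided by the pristine page counts (the patch stayed at the
pristine size throughout, so its ratio is 1.0x for every index):

```python
# Growth ratios for the master branch: final page count / pristine
# page count, using the numbers quoted above.

start = {"aid_pkey_include_abalance": 274_194, "fiver": 142_348,
         "tenner": 115_352, "score": 94_677}
end_master = {"aid_pkey_include_abalance": 428_759, "fiver": 220_114,
              "tenner": 176_983, "score": 178_165}

for name in start:
    print(f"{name}: {end_master[name] / start[name]:.2f}x")
```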
Take the 16 client case:

number of transactions actually processed: 54814992
latency average = 0.525 ms
latency stddev = 0.326 ms
tps = 30452.530518 (including connections establishing)
tps = 30452.612796 (excluding connections establishing)
statement latencies in milliseconds:
     0.001  \set aid1 random_gaussian(1, 100000 * :scale, 4.0)
     0.000  \set aid2 random_gaussian(1, 100000 * :scale, 4.5)
     0.000  \set aid3 random_gaussian(1, 100000 * :scale, 4.2)
     0.000  \set bid random(1, 1 * :scale)
     0.000  \set tid random(1, 10 * :scale)
     0.000  \set delta random(-5000, 5000)
     0.046  BEGIN;
     0.159  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
     0.153  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
     0.091  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
     0.075  END;

vs master's 16 client case:

number of transactions actually processed: 63183870
latency average = 0.455 ms
latency stddev = 0.307 ms
tps = 35101.867559 (including connections establishing)
tps = 35101.914573 (excluding connections establishing)
statement latencies in milliseconds:
     0.001  \set aid1 random_gaussian(1, 100000 * :scale, 4.0)
     0.000  \set aid2 random_gaussian(1, 100000 * :scale, 4.5)
     0.000  \set aid3 random_gaussian(1, 100000 * :scale, 4.2)
     0.000  \set bid random(1, 1 * :scale)
     0.000  \set tid random(1, 10 * :scale)
     0.000  \set delta random(-5000, 5000)
     0.049  BEGIN;
     0.117  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
     0.120  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
     0.091  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
     0.077  END;

Earlier variations of this same workload that were run with a 10k TPS
rate limit had SELECT statements that had somewhat lower latency with
the patch, while the UPDATE statements were slower by roughly the same
amount as you see here. Here we see that at least the aid3 SELECT is
just as fast with the patch.
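Putting the throughput gap in relative terms (this is just simple
arithmetic on the tps figures quoted above, nothing more):

```python
# Patch throughput as a fraction of master throughput, per client count.
runs = {
    16: (30452.530518, 35101.867559),  # (patch tps, master tps)
    32: (26000.991486, 32582.129545),
}

for clients, (patch, master) in runs.items():
    print(f"{clients} clients: patch at {patch / master:.1%} of master tps")
```

So the regression here is roughly 13% at 16 clients and roughly 20% at
32 clients, on this deliberately adversarial workload.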
(Don't have a proper example of this rate-limited phenomenon close at
hand now, since I saw it happen back when the patch was less well
optimized.)

This benchmark is designed to be extreme, and to really stress the
patch. For one thing, we have absolutely no index scans on all but one
of the four indexes, so controlling the bloat there isn't going to
give you the main expected benefit -- which is that index scans don't
even have to visit heap pages from old versions, since those versions
aren't in the index for very long (unless there aren't very many
versions for the logical rows in question, which isn't really a
problem for us). For another, the new mechanism is constantly needed,
which just isn't very realistic.

It seems as if the cost is mostly paid by non-HOT updaters, which
seems exactly right to me.

I probably could have cheated by making the aid_pkey_include_abalance
index a non-unique index, denying the master branch the benefit of
LP_DEAD setting within _bt_check_unique(). Or I could have cheated
(perhaps I should just say "gone for something a bit more
sympathetic") by not having skew on all the indexes (say by hashing on
aid in the indexes that use merge deduplication). I also think that
the TPS gap would have been smaller if I'd spent more time on each
run, but I didn't have time for that today. In the past I've seen it
take a couple of hours or more for the advantages of the patch to come
through (it takes that long for reasons that should be obvious).

Even the overhead we see here is pretty tolerable IMV. I believe that
it will be far more common for the new mechanism to hardly get used at
all, and yet have a pretty outsized effect on index bloat. To give you
a simple example of how that can happen, consider that if this did
happen in a real workload it would probably be caused by a surge in
demand -- now we don't have to live with the bloat consequences of an
isolated event forever (or until the next REINDEX).
I can make more sophisticated arguments than that one, but it doesn't
seem useful right now, so I'll refrain. The patch adds a backstop. It
seems to me that that's really what we need here: predictability over
time and under a large variety of different conditions. Real workloads
constantly fluctuate.

Even if people end up not buying my argument that it's worth it for
workloads like this, there are various options. And, I bet I can
further improve the high contention cases without losing the valuable
part -- there are a number of ways in which I can get the CPU costs
down further that haven't been fully explored (yes, it really does
seem to be CPU costs, especially due to TID sorting).

Again, this patch is all about extreme pathological workloads, system
stability, and system efficiency over time -- it is not simply about
increasing system throughput. There are some aspects of this design
(that come up with extreme workloads) that may in the end come down to
value judgments. I'm not going to tell somebody that they're wrong for
prioritizing different things (within reason, at least). In my opinion
almost all of the problems we have with VACUUM are ultimately
stability problems, not performance problems per se. And, I suspect
that we do very well with stupid benchmarks like this compared to
other DB systems precisely because we currently allow non-HOT updaters
to "live beyond their means" (which could in theory be great if you
frame it a certain way that seems pretty absurd to me). This suggests
we can "afford" to go a bit slower here, as far as the competitive
pressures determine what we should do (notice that this is distinct
from my favorite argument, which is that we cannot afford to *not* go
a bit slower in certain extreme cases).

I welcome debate about this.

--
Peter Geoghegan