Re: Deleting older versions in unique indexes to avoid page splits - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: Deleting older versions in unique indexes to avoid page splits
Date:
Msg-id: CAH2-WzkGBct_PYzGkv4+qkOFu10WmnX3Q5Ba9FRQqt29onpSQw@mail.gmail.com
In response to: Re: Deleting older versions in unique indexes to avoid page splits (Simon Riggs <simon@2ndquadrant.com>)
List: pgsql-hackers
On Tue, Oct 27, 2020 at 2:44 AM Simon Riggs <simon@2ndquadrant.com> wrote:
> While it is important we investigate the worst cases, I don't see this
> is necessarily bad.

I looked at "perf top" a few times when the test from yesterday ran. I saw that the proposed delete mechanism was the top consumer of CPU cycles. It seemed as if the mechanism was very expensive. However, that's definitely the wrong conclusion about what happens in the general case, or even in slightly less extreme cases. It at least needs to be put in context.

I reran exactly the same benchmark overnight, but added a 10k TPS rate limit this time (so about a third of the TPS that's possible without a limit). I also ran it for longer, and saw much improved latency. (More on the latency aspect below; for now I want to talk about "perf top".)

The picture with "perf top" changed significantly with a 10k TPS rate limit, even though the workload itself is very similar. Certainly the new mechanism/function is still quite close to the top consumer of CPU cycles. But it no longer uses more cycles than the familiar super hot functions that you expect to see right at the top with pgbench (e.g. _bt_compare(), hash_search_with_hash_value()). It's now something like the 4th or 5th hottest function (I think that means that the cost in cycles is more than an order of magnitude lower, but I didn't check).

Just adding this 10k TPS rate limit makes the number of CPU cycles consumed by the new mechanism seem quite reasonable. The benefits that the patch brings are not diminished at all compared to the original no-rate-limit variant -- the master branch now only takes slightly longer to completely bloat all its indexes with this 10k TPS limit (while the patch avoids even a single page split -- no change there).

Again, this is because the mechanism is a backstop. It only works as hard as needed to avoid unnecessary page splits. When the system is working as hard as possible to add version churn to indexes (which is what the original/unthrottled test involved), the mechanism also works quite hard. In this artificial and contrived scenario, any cycles we can save from cleaning up bloat (by micro-optimizing the code in the patch) go towards adding even more bloat instead... which necessitates doing more cleanup. This is why optimizing the code in the patch with this unrealistic scenario in mind is subject to sharp diminishing returns. It's also why you can get a big benefit from the patch even when the new mechanism is barely ever used. I imagine that if I ran the same test again but with a 1k TPS limit I would hardly see the new mechanism in "perf top" at all... but in the end the bloat situation would be very similar.

I think that you could model this probabilistically if you had the inclination. Yes, the more you throttle throughput (by lowering the pgbench rate limit further), the less chance you have of any given leaf page splitting moment to moment (for the master branch). But in the long run every original leaf page splits at least once anyway, because each leaf page only has to be unlucky once. It is still inevitable that they'll all get split eventually (and probably not before too long), unless and until you *drastically* throttle pgbench.
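To make that slightly more concrete (purely a back-of-the-envelope sketch, with p an assumed, fixed per-update split probability -- not anything measured here): suppose that on master each version-churn update landing on a given leaf page independently has some small probability p of being the one that finally forces a split. After n such updates the page has survived only if it was lucky every single time:

\[
\Pr[\text{page not yet split after } n \text{ updates}] \;=\; (1 - p)^n \;\approx\; e^{-np} \;\to\; 0 \quad \text{as } n \to \infty
\]

Throttling pgbench only slows the rate at which n grows per wall-clock second; it doesn't change the limit. That's the "only has to be unlucky once" point expressed as a formula.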
I believe that things like opportunistic HOT chain truncation (heap pruning) and index tuple LP_DEAD bit setting are very effective much of the time. The problem is that it's easy to utterly rely on them without even realizing it, which creates hidden risk that may or may not result in big blow-ups down the line. There is nothing inherently wrong with being lazy or opportunistic about cleaning up garbage tuples -- I think that there are significant advantages, in fact. But only if it isn't allowed to create systemic risk. More concretely, bloat cannot be allowed to become concentrated in any one place -- no individual query should have to deal with more than 2 or 3 versions for any given logical row. If we're focussed on the *average* number of physical versions per logical row then we may reach dramatically wrong conclusions about what to do (which is a problem in a world where autovacuum is supposed to do most garbage collection... unless your app happens to look like standard pgbench!).

And now back to latency with this 10k TPS limited variant I ran last night. After 16 hours we have performed 8 runs, each lasting 2 hours. In the original chronological order, these runs are:

patch_1_run_16.out:  "tps = 10000.095914 (including connections establishing)"
master_1_run_16.out: "tps = 10000.171945 (including connections establishing)"
patch_1_run_32.out:  "tps = 10000.082975 (including connections establishing)"
master_1_run_32.out: "tps = 10000.533370 (including connections establishing)"
patch_2_run_16.out:  "tps = 10000.076865 (including connections establishing)"
master_2_run_16.out: "tps = 9997.933215 (including connections establishing)"
patch_2_run_32.out:  "tps = 9999.911988 (including connections establishing)"
master_2_run_32.out: "tps = 10000.864031 (including connections establishing)"

Here is what I see at the end of "patch_2_run_32.out" (i.e. at the end of the final 2 hour run for the patch):

number of transactions actually processed: 71999872
latency average = 0.265 ms
latency stddev = 0.110 ms
rate limit schedule lag: avg 0.046 (max 30.274) ms
tps = 9999.911988 (including connections establishing)
tps = 9999.915766 (excluding connections establishing)
statement latencies in milliseconds:
    0.001  \set aid1 random_gaussian(1, 100000 * :scale, 4.0)
    0.000  \set aid2 random_gaussian(1, 100000 * :scale, 4.5)
    0.000  \set aid3 random_gaussian(1, 100000 * :scale, 4.2)
    0.000  \set bid random(1, 1 * :scale)
    0.000  \set tid random(1, 10 * :scale)
    0.000  \set delta random(-5000, 5000)
    0.023  BEGIN;
    0.099  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
    0.036  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
    0.034  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
    0.025  END;

Here is what I see at the end of "master_2_run_32.out" (i.e. at the end of the final run for master):

number of transactions actually processed: 72006803
latency average = 0.266 ms
latency stddev = 2.722 ms
rate limit schedule lag: avg 0.074 (max 396.853) ms
tps = 10000.864031 (including connections establishing)
tps = 10000.868545 (excluding connections establishing)
statement latencies in milliseconds:
    0.001  \set aid1 random_gaussian(1, 100000 * :scale, 4.0)
    0.000  \set aid2 random_gaussian(1, 100000 * :scale, 4.5)
    0.000  \set aid3 random_gaussian(1, 100000 * :scale, 4.2)
    0.000  \set bid random(1, 1 * :scale)
    0.000  \set tid random(1, 10 * :scale)
    0.000  \set delta random(-5000, 5000)
    0.022  BEGIN;
    0.073  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
    0.036  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
    0.034  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
    0.025  END;
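(As an aside, for anyone who wants to reproduce this: the per-statement list above is essentially the custom pgbench script itself, so the test script must look roughly like the sketch below. The invocation line is only a sketch of the setup described in this thread -- -R 10000 and -T 7200 match the 10k TPS limit and the 2 hour runs, the 16/32 client counts are inferred from the run file names, while the script file name and the thread count are arbitrary choices for the example.)

  $ cat churn.sql
  \set aid1 random_gaussian(1, 100000 * :scale, 4.0)
  \set aid2 random_gaussian(1, 100000 * :scale, 4.5)
  \set aid3 random_gaussian(1, 100000 * :scale, 4.2)
  \set bid random(1, 1 * :scale)
  \set tid random(1, 10 * :scale)
  \set delta random(-5000, 5000)
  BEGIN;
  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
  END;

  $ pgbench -f churn.sql -c 32 -j 32 -R 10000 -T 7200 -r
    # -R 10000: rate limit of 10k TPS; -T 7200: one 2 hour run;
    # -r: report the per-statement latencies shown above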
Notice the following:

1. The overall "latency average" for the patch is very slightly lower.

2. The overall "latency stddev" for the patch is far, far lower -- over 20x lower, in fact.

3. The patch's latency for the UPDATE statement is still somewhat higher, but it's not so bad. We're still visibly paying a price in some sense, but at least we're imposing the new costs squarely on the query that is responsible for all of our problems.

4. The patch's latency for the SELECT statements is exactly the same as on master. We're not imposing any new cost on "innocent" SELECT statements that didn't create the problem, even if they didn't quite manage to benefit here. (Without LP_DEAD setting in _bt_check_unique() I'm sure that the SELECT latency would be significantly lower for the patch.)

The full results (with lots of details pulled from standard system views after each run) can be downloaded as a .tar.gz archive from:

https://drive.google.com/file/d/1Dn8rSifZqT7pOOIgyKstl-tdACWH-hqO/view?usp=sharing

(It's probably not that interesting to drill down any further, but I make the full set of results available just in case. There are loads of things included simply because I capture them automatically when benchmarking anything at all.)

> HOT was difficult to measure, but on a 2+ hour run on a larger table,
> the latency graph was what showed it was a winner. Short runs and
> in-memory data masked the benefits in our early analyses.

Yeah, that's what was nice about working on sorting -- almost instant feedback. Who wants to spend at least 2 hours to test out every little theory? :-)

> So I suggest not looking at the totals and averages but on the peaks
> and the long term trend. Showing that in graphical form is best.

I think that you're right that a graphical representation with an X-axis that shows how much time has passed would be very useful. I'll try to find a way of incorporating that into my benchmarking workflow. This is especially likely to help when modelling how cases with a long-running xact/snapshot behave. That isn't a specific goal of mine here, but I expect that it'll help with that a lot too. For now I'm just focussing on downsides and not upsides, for the usual reasons.

> Agreed, we can trade initial speed for long term consistency. I guess
> there are some heuristics there on that tradeoff.

Right. Another way of looking at it is this: it should be possible for reasonably good DBAs to develop good intuitions about how the system will hold up over time, based on past experience and common sense -- no chaos theory required.

Whatever the cost of the mechanism is, at least it's only something that gets shaved off the top minute to minute. It seems almost impossible for the cost to cause sudden surprises (except maybe once, after an initial upgrade to Postgres 14, though I doubt it). Whereas it seems very likely to prevent many large and unpleasant surprises caused by hidden, latent risk. I believe that approximately 100% of DBAs would gladly take that trade-off, even if the total cost in cycles was higher. It happens to also be true that they're very likely to use far fewer cycles over time, but that's really just a bonus.

--
Peter Geoghegan