Re: Problems with the FSM, heap fillfactor, and temporal locality - Mailing list pgsql-hackers
From | Peter Geoghegan
Subject | Re: Problems with the FSM, heap fillfactor, and temporal locality
Date |
Msg-id | CAH2-WzmqdrjedPLXqSZXq4oSrcbCyZFHsHnQY72o5YPsHjiMvQ@mail.gmail.com
In response to | Re: Problems with the FSM, heap fillfactor, and temporal locality (Stephen Frost <sfrost@snowman.net>)
List | pgsql-hackers
On Tue, Aug 25, 2020 at 6:21 AM Stephen Frost <sfrost@snowman.net> wrote:
> This all definitely sounds quite interesting and the idea to look at the
> XID to see if we're in the same transaction and therefore likely
> inserting a related tuple certainly makes some sense. While I get that
> it might not specifically work with TPC-C, I'm wondering about if we
> could figure out how to make a multi-tuple INSERT use
> heap/table_multi_insert (which seems to only be used by COPY currently,
> and internally thanks to the recent work to use it for some catalog
> tables) and then consider the size of the entire set of tuples being
> INSERT'd when working to find a page, or deciding if we should extend
> the relation.

There are probably quite a variety of ways in which we can capture
locality, and I'm sure that I'm only scratching the surface right now. I
agree that table_multi_insert could definitely be one of them.

John said something about concentrating garbage in certain pages
up-thread. I also wonder if there is some visibility + freeze map angle
on this. What I see with the benchmark is that the order_line table (the
largest table by quite some margin, and one that grows indefinitely) does
not make much use of the visibility map during VACUUM -- even though it's
the kind of table + workload that you'd hope and expect would make good
use of it if you were looking at it in a real world situation. Each tuple
is only inserted once and later updated once, so we really ought to be
able to do better here. The logs show that VACUUM/autovacuum dirties lots
of pages, probably due to fragmentation from free space management
(though there are undoubtedly other factors).

The best "locality orientated" reference guide to TPC-C that I've been
able to find is "A Modeling Study of the TPC-C Benchmark", which was
published in 1993 by NASA (shortly after the introduction of TPC-C). You
can get it from:

https://apps.dtic.mil/dtic/tr/fulltext/u2/a264793.pdf

(Unfortunately this reproduction is a low quality photocopy -- ACM
members can get a clear copy.)

If you think about the TPC-C workload at a high level, and Postgres
internals stuff at a low level, and then run the benchmark, you'll find
various ways in which we don't live up to our potential. The big picture
is that the "physical database" is not close enough to the "logical
database", especially over time and after a lot of churn. This causes
problems all over the place that look like nothing in particular in
profiles.

It's not that TPC-C is unsympathetic to Postgres in any of the usual ways
-- there are very few UPDATEs that affect indexed columns, so those
aren't really a factor at all. There are also no transactions that run
longer than 2 seconds (anything more than ~50ms per execution is
exceptional, in fact).

We already do a significant amount of the necessary garbage collection
opportunistically (by pruning) -- probably the vast majority, in fact. In
particular, HOT pruning works well, since fillfactor has been tuned. It
just doesn't work as well as you'd hope, in that it cannot stem the tide
of fragmentation. And not just because of heapam's use of the FSM. If we
implemented a simple differential heap tuple compression scheme within
HOT chains (though not across HOT chains/unrelated tuples), then we'd
probably do much better -- we could keep the same logical tuple on the
same heap page much more often, maybe always.
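To be concrete about what "differential" means here, below is a toy,
self-contained sketch of the idea. This is deliberately not heapam code --
every type and function name in it is invented for illustration. Each
successor version in a chain stores only the columns that changed relative
to its parent, plus a small bitmap recording which columns those are;
reconstructing a version just means applying the stored changes on top of
the parent:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define TOY_MAX_COLS 32

    typedef struct ToyTuple
    {
        int         natts;                  /* number of columns */
        const char *values[TOY_MAX_COLS];   /* column values, strings for simplicity */
    } ToyTuple;

    typedef struct ToyDelta
    {
        uint32_t    changed_mask;           /* bit i set => column i differs from parent */
        const char *values[TOY_MAX_COLS];   /* only the changed columns are stored */
    } ToyDelta;

    /* Record only the columns that changed between two successive versions. */
    static ToyDelta
    toy_make_delta(const ToyTuple *prev, const ToyTuple *next)
    {
        ToyDelta    delta = {0, {NULL}};
        int         i;

        for (i = 0; i < prev->natts; i++)
        {
            if (strcmp(prev->values[i], next->values[i]) != 0)
            {
                delta.changed_mask |= (1u << i);
                delta.values[i] = next->values[i];
            }
        }
        return delta;
    }

    /* Reconstruct a successor version from its parent plus the stored delta. */
    static ToyTuple
    toy_apply_delta(const ToyTuple *prev, const ToyDelta *delta)
    {
        ToyTuple    next = *prev;
        int         i;

        for (i = 0; i < prev->natts; i++)
        {
            if (delta->changed_mask & (1u << i))
                next.values[i] = delta->values[i];
        }
        return next;
    }

    int
    main(void)
    {
        /* A stock-like row: the wide columns never change ... */
        ToyTuple    v1 = {4, {"ITEM-1", "a long, rarely-updated description",
                              "100", "0"}};
        ToyTuple    v2 = v1;
        ToyDelta    d;
        ToyTuple    rebuilt;

        /* ... while the UPDATE only touches two narrow columns. */
        v2.values[2] = "95";
        v2.values[3] = "5";

        d = toy_make_delta(&v1, &v2);       /* pays only for the two changed values */
        rebuilt = toy_apply_delta(&v1, &d);

        return strcmp(rebuilt.values[3], "5") == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
    }

The real thing would have to work in terms of the on-disk tuple format
rather than C strings, of course, but the space accounting comes out the
same way: a successor version only pays for what actually changed.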
For example, "stock" table is a major source of FPIs, and I bet that this is greatly amplified by our failure to keep versions of the same frequently updated tuple together. We can only fit ~25 stock tuples on each heap page (with heap fill factor at 95, the BenchmarkSQL default), so individual tuples are ~320 bytes (including tuple header). If we found a way to store the changed columns for successor tuples within a HOT chain, then we would do much better -- without changing the HOT invariant (or anything else that's truly scary). If our scheme worked by abusing the representation that we use for NULL values in the successor/HOT tuples (which is not terribly space efficient), then we could still store about 6 more versions of each stock tuple on the same page -- the actual changed columns are typically much much smaller than the unchanged columns. Our 23/24 byte tuple header is usually small potatoes compared to storing unchanged values several times. As I said, the HOT optimization (and opportunistic pruning) already work well with this benchmark. But I think that they'd work a lot better if we could just temporarily absorb a few extra versions on the heap page, so we have enough breathing room to prune before the page "truly fills to capacity". It could help in roughly the same way that deduplication now helps in nbtree indexes with "version churn". I'm also reminded of the nbtree optimization I prototyped recently, which more or less prevented all unique index bloat provided there is no long running transaction: https://postgr.es/m/CAH2-Wzm+maE3apHB8NOtmM=p-DO65j2V5GzAWCOEEuy3JZgb2g@mail.gmail.com (Yes, "preventing all unique index bloat provided there is no long running transaction" is no exaggeration -- it really prevents all bloat related nbtree page splits, even with hundreds of clients, skew, etc.) It seems pretty obvious to me that buying another (say) 2 seconds to let opportunistic pruning run "before the real damage is done" can be extremely valuable -- we only need to be able to delay a page split (which is similar to the case where we cannot continue to store heap tuples on the same heap page indefinitely) for a couple of seconds at a time. We only need to "stay one step ahead" of the need to split the page (or to migrate a logical heap tuple to a new heap page when it comes to the heap) at any given time -- that alone will totally arrest the problem. This is a very important point -- the same set of principles that helped in nbtree can also be effectively applied to heap pages that are subject to version churn. (Assuming no long running xacts.) > Getting a 5% improvement is pretty exciting, very cool and seems worth > spending effort on. I'm already at 5% - 7% now. I bet the differential compression of tuples on a HOT chain could buy a lot more than that. The biggest emphasis should be placed on stable performance over time, and total I/O over time -- that's where we're weakest right now. -- Peter Geoghegan