Thread: Problems with the FSM, heap fillfactor, and temporal locality
I'm concerned about how the FSM gives out pages to heapam. Disabling the FSM entirely helps TPC-C/BenchmarkSQL, which uses non-default heap fillfactors for most tables [1]. Testing has shown that this actually increases throughput for the benchmark (as measured in TPM) by 5% - 9%, even though my approach is totally naive and stupid. My approach makes one or two small tables much bigger, but that doesn't have that much downside for the workload in practice.

My approach helps by accidentally improving temporal locality -- related records are more consistently located on the same block, which in practice reduces the number of pages dirtied and the number of FPIs generated. TPC-C has a tendency to insert a set of related tuples together (e.g., order lines from an order), while later updating all of those records together.

Interestingly, the problems start before we even begin the benchmark proper, and can be observed directly using simple ad-hoc queries (I developed some simple SQL queries involving ctid for this). BenchmarkSQL's initial bulk loading is performed by concurrent workers that insert related groups of tuples into tables, so that we start out with a realistic amount of old orders to refer back to, etc. I can clearly observe that various natural groupings (e.g., grouping order lines by order number, grouping customers by district + warehouse) actually get initially inserted in a way that leaves tuples in a grouping spread around an excessive number of heap blocks.

For example, while most order lines do fit on one block, there is a significant minority of orderlines that span two or more blocks for the master branch case. Whereas with the FSM artificially disabled, the heap relation looks more "pristine" in that related tuples are located on the same blocks (or at least on neighboring blocks). It's possible that one orderline will span two neighboring blocks here, but it will never span three or more blocks. Each order has 5 - 15 order lines, and so I was surprised to see that a small minority of order line tuples end up occupying as many as 5 or 7 heap pages on the master branch (i.e. with the current FSM intact during bulk loading).

The underlying cause of this "bulk inserts are surprisingly indifferent to locality" issue seems to be that heapam likes to remember small amounts of space from earlier pages when the backend couldn't fit one last tuple on an earlier target page (before allocating a new page that became the new relcache target page in turn). This is penny wise and pound foolish, because we're eagerly using a little bit more space in a case where we are applying a heap fill factor anyway.

I think that we also have a number of related problems. It seems to me that we don't make enough effort to store related heap tuples together -- both during bulk inserts like this, but also during subsequent updates that cannot fit successor tuples on the same heap page. The current design of the FSM seems to assume that it doesn't matter where the free space comes from, as long as you get it from somewhere and as long as fill factor isn't violated -- it cares about the letter of the fill factor law without caring about its spirit or intent.

If the FSM tried to get free space from a close-by block, then we might at least see related updates that cannot fit a successor tuple on the same block behave in a coordinated fashion. We might at least have both updates relocate the successor tuple to the same mostly-empty block -- they both notice the same nearby free block, so both sets of successor tuples end up going on the same mostly-empty block. The two updating backends don't actually coordinate, of course -- this just happens as a consequence of looking for nearby free space.

The FSM should probably be taught to treat pages as free space candidates (candidates to give out free space from) based on more sophisticated, locality-aware criteria. The FSM should care about the *rate of change* for a block over time. Suppose you have heap fill factor set to 90. Once a heap block reaches fillfactor% full, it ought to not be used to insert new tuples unless and until the used space on the block shrinks *significantly* -- the free space is now supposed to be reserved. It should not be enough for the space used on the page to shrink by just 1% (to 89%). Maybe it should have to reach as low as 50% or less before we flip it back to "free to take space from for new unrelated tuples". The idea is that fill factor space is truly reserved for updates -- that should be "sticky" for all of this to work well.

What's the point in having the concept of a heap fill factor at all if we don't make any real effort to enforce that the extra free space left behind is used as intended, for updates of tuples located on the same heap page?

Thoughts?

[1] https://github.com/petergeoghegan/benchmarksql/commit/3ef4fe71077b40f56b91286d4b874a15835c241e

--
Peter Geoghegan
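To make the "sticky" reservation idea above concrete, here is a minimal standalone sketch of the hysteresis rule (the 90/50 thresholds and all names are hypothetical illustrations, not actual FSM code):

#include <stdbool.h>
#include <stdio.h>

#define FILLFACTOR   90   /* assumed heap fillfactor (%) */
#define REUSE_BELOW  50   /* used space must fall to this (%) before the page is advertised again */

typedef struct
{
    int  used_pct;   /* space currently used on the page, as a percentage */
    bool reserved;   /* has the page ever filled up to fillfactor? */
} PageState;

/*
 * Decide whether the FSM should advertise this page's free space to new,
 * unrelated inserts.  The rule is hysteresis: once a page reaches
 * fillfactor, its remaining free space stays reserved for same-page
 * updates until a significant amount of space has been freed
 * (REUSE_BELOW), not merely 1%.
 */
static bool
page_usable_for_new_tuples(PageState *p)
{
    if (p->used_pct >= FILLFACTOR)
        p->reserved = true;
    else if (p->reserved && p->used_pct <= REUSE_BELOW)
        p->reserved = false;    /* enough space came back; advertise again */

    return !p->reserved && p->used_pct < FILLFACTOR;
}

int
main(void)
{
    PageState p = {0, false};
    int used[] = {40, 88, 91, 89, 70, 49};

    for (int i = 0; i < 6; i++)
    {
        p.used_pct = used[i];
        printf("used=%d%% -> usable for new tuples: %s\n",
               used[i], page_usable_for_new_tuples(&p) ? "yes" : "no");
    }
    return 0;
}

Note how the page that shrinks from 91% to 89% used stays reserved, while the one that drops to 49% becomes usable again.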
On Fri, Aug 21, 2020 at 2:48 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> I'm concerned about how the FSM gives out pages to heapam. Disabling
> the FSM entirely helps TPC-C/BenchmarkSQL, which uses non-default heap
> fillfactors for most tables [1].

Hi Peter,

Interesting stuff. Is lower-than-default fillfactor important for the behavior you see?

> located on the same blocks (or at least on neighboring blocks). It's
> possible that one orderline will span two neighboring blocks here, but
> it will never span three or more blocks. Each order has 5 - 15 order
> lines, and so I was surprised to see that a small minority of order
> line tuples end up occupying as many as 5 or 7 heap pages on the
> master branch (i.e. with the current FSM intact during bulk loading).

Hmm. You focus on FSM, but I'm thinking also multiple VM bits potentially got cleared (maybe not this benchmark but other scenarios).

> If the FSM tried to get free space from a close-by block, then we
> might at least see related updates that cannot fit a successor tuple
> on the same block behave in a coordinated fashion. We might at least
> have both updates relocate the successor tuple to the same
> mostly-empty block -- they both notice the same nearby free block, so
> both sets of successor tuples end up going on the same mostly-empty
> block. The two updating backends don't actually coordinate, of course
> -- this just happens as a consequence of looking for nearby free
> space.

I'm not sure I follow the "close-by" criterion. It doesn't seem immediately relevant to what I understand as the main problem you found, of free space. In other words, if we currently write to 5 blocks, but with smarter FSM logic we can find a single block, it seems that should be preferred over close-by blocks each with less space? Or are you envisioning scoring by both free space and distance from the original block?

> supposed to be reserved. It should not be enough for the space used on
> the page to shrink by just 1% (to 89%). Maybe it should have to reach
> as low as 50% or less before we flip it back to "free to take space
> from for new unrelated tuples". The idea is that fill factor space is
> truly reserved for updates -- that should be "sticky" for all of this
> to work well.

Makes sense. If we go this route, I wonder if we should keep the current behavior and use any free space if the fillfactor is 100%, since that's in line with the intention. Also, the 50% (or whatever) figure could be scaled depending on fillfactor. As in, check if freespace > (100% - fillfactor * 0.6), or something.

I'm not sure how to distinguish blocks that have never reached fillfactor vs. ones that did but haven't gained back enough free space to be considered again. Naively, we could start by assuming the last block can always be filled up to fillfactor, but earlier blocks must use the stricter rule. That's easy since we already try the last block anyway before extending the relation.

The flip side of this is: Why should vacuum be in a hurry to dirty a page, emit WAL, and update the FSM if it only removes one dead tuple? This presentation [1] (pages 35-43) from Masahiko Sawada had the idea of a "garbage map", which keeps track of which parts of the heap have the most dead space, and focuses I/O efforts on those blocks first.

It may or may not be worth the extra complexity by itself, but it seems it would work synergistically with your proposal: Updating related tuples would concentrate dead tuples on fewer pages, and vacuums would more quickly free up space that can actually be used to store multiple new tuples.

[1] https://www.slideshare.net/masahikosawada98/vacuum-more-efficient-than-ever-99916671

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
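As a tiny standalone sketch of the fillfactor-scaled threshold floated above (the formula and the 0.6 factor are only the example from this message, not a settled rule):

#include <stdbool.h>
#include <stdio.h>

/*
 * Scale the "become reusable again" threshold with fillfactor, following
 * the freespace > (100% - fillfactor * 0.6) example above.  With
 * fillfactor 90 a page must get back ~46% free space before the FSM
 * advertises it again; with fillfactor 100, ~40%.
 */
static bool
page_reusable(int fillfactor, int freespace_pct)
{
    double threshold = 100.0 - fillfactor * 0.6;

    return freespace_pct > threshold;
}

int
main(void)
{
    int fillfactors[] = {70, 90, 100};

    for (int i = 0; i < 3; i++)
        printf("fillfactor=%d: reusable once free space exceeds %.0f%%\n",
               fillfactors[i], 100.0 - fillfactors[i] * 0.6);

    printf("fillfactor=90, 30%% free -> %s\n", page_reusable(90, 30) ? "reusable" : "still reserved");
    printf("fillfactor=90, 50%% free -> %s\n", page_reusable(90, 50) ? "reusable" : "still reserved");
    return 0;
}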
Hi John,

On Fri, Aug 21, 2020 at 3:10 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> Interesting stuff. Is lower-than-default fillfactor important for the
> behavior you see?

It's hard to say. It's definitely not important as far as the initial bulk loading behavior is concerned (the behavior where related tuples get inserted onto multiple disparate blocks all too often). That will happen with any fill factor, including the default/100. I'm concerned about heap fill factor in particular because I suspect that it doesn't really work sensibly.

To give you some concrete idea of the benefits, I present pg_stat_database output from the master branch after 6 hours of BenchmarkSQL with a rate limit:

-[ RECORD 1 ]---------+------------------------------
datid                 | 13,619
datname               | postgres
numbackends           | 3
xact_commit           | 45,902,031
xact_rollback         | 200,131
blks_read             | 662,677,079
blks_hit              | 24,790,989,538
tup_returned          | 30,054,930,608
tup_fetched           | 13,722,542,025
tup_inserted          | 859,165,629
tup_updated           | 520,611,959
tup_deleted           | 20,074,897
conflicts             | 0
temp_files            | 88
temp_bytes            | 18,849,890,304
deadlocks             | 0
checksum_failures     |
checksum_last_failure |
blk_read_time         | 124,233,831.492
blk_write_time        | 8,588,876.871
stats_reset           | 2020-08-20 13:51:08.351036-07

Here is equivalent output for my simple patch that naively disables the FSM:

-[ RECORD 1 ]---------+------------------------------
datid                 | 13,619
datname               | postgres
numbackends           | 3
xact_commit           | 46,369,235
xact_rollback         | 201,919
blks_read             | 658,267,665
blks_hit              | 19,980,524,578
tup_returned          | 30,804,856,896
tup_fetched           | 11,839,865,936
tup_inserted          | 861,798,911
tup_updated           | 525,895,435
tup_deleted           | 20,277,618
conflicts             | 0
temp_files            | 88
temp_bytes            | 18,849,439,744
deadlocks             | 0
checksum_failures     |
checksum_last_failure |
blk_read_time         | 117,167,612.616
blk_write_time        | 7,873,922.175
stats_reset           | 2020-08-20 13:50:51.72056-07

Note that there is a ~20% reduction in blks_hit here, even though the patch does ~1% more transactions (the rate limiting doesn't work perfectly). There is also a ~5.5% reduction in aggregate blk_read_time, and a ~9% reduction in blk_write_time. I'd say that that's pretty significant. I also see small but significant improvements in transaction latency, particularly with 90th percentile latency.

> Hmm. You focus on FSM, but I'm thinking also multiple VM bits
> potentially got cleared (maybe not this benchmark but other
> scenarios).

I'm focussed on how heapam interacts with the FSM, and its effects on locality. It's definitely true that this could go on to affect how the visibility map gets set -- we could set fewer bits unnecessarily. And it probably has a number of other undesirable consequences that are hard to quantify. Clearly there are many reasons why making the physical database layout closer to that of the logical database is a good thing. I probably have missed a few.

> I'm not sure I follow the "close-by" criterion. It doesn't seem
> immediately relevant to what I understand as the main problem you
> found, of free space. In other words, if we currently write to 5
> blocks, but with smarter FSM logic we can find a single block, it
> seems that should be preferred over close-by blocks each with less
> space? Or are you envisioning scoring by both free space and distance
> from the original block?

I am thinking of doing both at the same time. Think of a table with highly skewed access. There is a relatively small number of hot spots -- parts of the primary key's key space that are consistently affected by many skewed updates (with strong heap correlation to primary key order). We simply cannot ever hope to avoid migrating heavily updated rows to new pages -- too much contention in the hot spots for that. But, by 1) not considering pages free-space-reclaimable until the free space reaches ~50%, and 2) preferring close-by free pages, we avoid (or at least delay) destroying locality of access. There is a much better chance that rows from logically related hot spots will all migrate to the same few blocks again and again, with whole blocks becoming free in a relatively predictable, phased fashion. (I'm speculating here, but my guess is that this combination will help real world workloads by quite a bit.)

Preferring close-by blocks in the FSM is something that there was discussion of when the current FSM implementation went in back in 8.4. I am almost certain that just doing that will not help. If it helps at all then it will help by preserving locality as tuple turnover takes place over time, and I think that you need to be clever about "reuse granularity" in order for that to happen. We're optimizing the entire "lifecycle" of logically related tuples whose relatedness is embodied by their initial physical position following an initial insert (before many subsequent updates take place that risk destroying locality).

> Makes sense. If we go this route, I wonder if we should keep the
> current behavior and use any free space if the fillfactor is 100%,
> since that's in line with the intention. Also, the 50% (or whatever)
> figure could be scaled depending on fillfactor. As in, check if
> freespace > (100% - fillfactor * 0.6), or something.

Right. Or it could be another reloption.

> I'm not sure how to distinguish blocks that have never reached
> fillfactor vs. ones that did but haven't gained back enough free space
> to be considered again. Naively, we could start by assuming the last
> block can always be filled up to fillfactor, but earlier blocks must
> use the stricter rule. That's easy since we already try the last block
> anyway before extending the relation.

I was thinking of explicitly marking blocks as "freeable", meaning that the FSM will advertise their free space. This isn't self-evident from the amount of free space on the page alone, since we need to distinguish between at least two cases: the case where a page has yet to apply fill factor for the first time (which may still be close to fillfactor% full) versus the case where the page did reach fillfactor, but then had a small amount of space freed. I think that the FSM ought to give out space in the former case, but not in the latter case. Even though an identical amount of free space might be present in either case.

> The flip side of this is: Why should vacuum be in a hurry to dirty a
> page, emit WAL, and update the FSM if it only removes one dead tuple?
> This presentation [1] (pages 35-43) from Masahiko Sawada had the idea
> of a "garbage map", which keeps track of which parts of the heap have
> the most dead space, and focuses I/O efforts on those blocks first. It
> may or may not be worth the extra complexity by itself, but it seems
> it would work synergistically with your proposal: Updating related
> tuples would concentrate dead tuples on fewer pages, and vacuums would
> more quickly free up space that can actually be used to store multiple
> new tuples.

I agree that that seems kind of related. I'm trying to concentrate garbage from skewed updates in fewer blocks. (Same with the skewed successor tuples.)

--
Peter Geoghegan
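To make "doing both at the same time" concrete, here is a minimal standalone sketch of scoring FSM candidates by free space and by distance from the originating block; the weighting and all names are made up for illustration:

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

typedef unsigned int BlockNumber;

typedef struct
{
    BlockNumber blkno;
    int         freespace;   /* bytes advertised for this page */
} Candidate;

/*
 * Hypothetical scoring function: prefer pages with enough free space for
 * the new tuple, but penalize pages that are far from the page the old
 * tuple lives on.  The 8-bytes-per-block distance penalty is an arbitrary
 * illustration, not a tuned value.
 */
static long
score_candidate(const Candidate *c, BlockNumber orig_blkno, int needed)
{
    long distance;

    if (c->freespace < needed)
        return LONG_MIN;            /* cannot use this page at all */

    distance = labs((long) c->blkno - (long) orig_blkno);
    return (long) c->freespace - distance * 8;
}

int
main(void)
{
    Candidate   cands[] = {
        {10, 900},      /* lots of space, but far away */
        {1005, 400},    /* close by, enough space */
        {1000, 120},    /* the original page: too little space left */
    };
    BlockNumber orig = 1000;
    int         needed = 300;
    long        best = LONG_MIN;
    int         best_i = -1;

    for (int i = 0; i < 3; i++)
    {
        long s = score_candidate(&cands[i], orig, needed);

        printf("block %u: score %ld\n", cands[i].blkno, s);
        if (s > best)
        {
            best = s;
            best_i = i;
        }
    }
    if (best_i >= 0)
        printf("chosen block: %u\n", cands[best_i].blkno);
    else
        printf("no usable candidate; extend the relation\n");
    return 0;
}

The nearby block 1005 wins over the distant block 10 even though the latter has more raw free space, which is the behavior the combined criteria are meant to produce.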
On Fri, Aug 21, 2020 at 8:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Note that there is a ~20% reduction in blks_hit here, even though the
> patch does ~1% more transactions (the rate limiting doesn't work
> perfectly). There is also a ~5.5% reduction in aggregate
> blk_read_time, and a ~9% reduction in blk_write_time. I'd say that
> that's pretty significant.

Indeed.

> Preferring close-by blocks in the FSM is something that there was
> discussion of when the current FSM implementation went in back in 8.4.

Right, I found a pretty long one here:

https://www.postgresql.org/message-id/flat/1253201179.9666.174.camel%40ebony.2ndQuadrant

> > I'm not sure how to distinguish blocks that have never reached
> > fillfactor vs. ones that did but haven't gained back enough free space
> > to be considered again. Naively, we could start by assuming the last
> > block can always be filled up to fillfactor, but earlier blocks must
> > use the stricter rule. That's easy since we already try the last block
> > anyway before extending the relation.
>
> I was thinking of explicitly marking blocks as "freeable", meaning
> that the FSM will advertise their free space. This isn't self-evident
> from the amount of free space on the page alone, since we need to
> distinguish between at least two cases: the case where a page has yet
> to apply fill factor for the first time (which may still be close to
> fillfactor% full) versus the case where the page did reach fillfactor,
> but then had a small amount of space freed. I think that the FSM ought
> to give out space in the former case, but not in the latter case. Even
> though an identical amount of free space might be present in either
> case.

I imagine you're suggesting to make this change in the FSM data? I'm thinking we could change the category byte to a signed integer, and reduce FSM_CATEGORIES to 128. (That gives us 64-byte granularity, which doesn't sound bad, especially if we're considering ignoring free space for inserts until we get a couple kilobytes back.) The absolute value represents the actual space. A negative number would always compare as less, so use the negative range to mark the page as not usable.

A fresh page will have positive-numbered categories until fill_factor is reached, at which point we flip the sign to negative. When the used space gets below "min_fill_factor", flip the sign to mark it usable. Updates would have to be taught to read the absolute value. Managing the math might be a little tricky, but maybe that could be contained.

One possible disadvantage is that negative values would bubble up to higher levels in the opposite way that positive ones do, but maybe we don't care since the pages are marked unusable anyway. All we care about is that all negative numbers give the correct binary comparison when we search for available space.

We could also reserve the high bit as a flag (1 = usable), and use the lower bits for the value, but I'm not sure that's better. We could also preserve the 255 categories as is, but add a second byte for the flag. If we could imagine other uses for a new byte, this might be good, but it would make the FSM much bigger, which doesn't sound attractive at all.

Any change of the FSM file would require pg_upgrade to rewrite the FSM, but it still doesn't seem like a lot of code.

Other ideas?

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
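A standalone sketch of the signed-category encoding described above (FSM_CATEGORIES reduced to 128, negative values meaning "not currently usable"); the helper names and the min_fill_factor value are hypothetical:

#include <stdint.h>
#include <stdio.h>

#define BLCKSZ          8192
#define FSM_CATEGORIES  128          /* 8192 / 128 = 64-byte granularity */
#define MIN_FILL_FACTOR 50           /* page becomes usable again once used space falls below this (%) */

/*
 * Category byte reinterpreted as signed: the absolute value encodes the
 * free space (in 64-byte slots), the sign encodes whether the page may be
 * handed out for new, unrelated tuples.  A real page never has the full
 * BLCKSZ free, so the magnitude always fits in 7 bits.
 */
static int8_t
fsm_set_category(int freespace, int fillfactor, int8_t oldcat)
{
    int8_t cat = (int8_t) (freespace / (BLCKSZ / FSM_CATEGORIES));
    int    used_pct = 100 - (freespace * 100) / BLCKSZ;

    if (used_pct >= fillfactor)
        return (int8_t) -cat;        /* page filled up: stop advertising it */
    if (oldcat < 0 && used_pct >= MIN_FILL_FACTOR)
        return (int8_t) -cat;        /* still reserved: not enough space freed yet */
    return cat;                      /* usable: advertise the free space */
}

static int
fsm_get_avail(int8_t cat)
{
    return cat > 0 ? cat * (BLCKSZ / FSM_CATEGORIES) : 0;
}

int
main(void)
{
    int8_t cat = 0;
    int    freespace[] = {6000, 900, 1200, 4200};   /* page life: fresh, filled, small prune, big prune */

    for (int i = 0; i < 4; i++)
    {
        cat = fsm_set_category(freespace[i], 90, cat);
        printf("free=%d bytes -> category %d, advertised %d bytes\n",
               freespace[i], cat, fsm_get_avail(cat));
    }
    return 0;
}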
On Mon, Aug 24, 2020 at 6:38 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> On Fri, Aug 21, 2020 at 8:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Note that there is a ~20% reduction in blks_hit here, even though the
> > patch does ~1% more transactions (the rate limiting doesn't work
> > perfectly). There is also a ~5.5% reduction in aggregate
> > blk_read_time, and a ~9% reduction in blk_write_time. I'd say that
> > that's pretty significant.
>
> Indeed.

Most of this seems to come from the new_orders table, which has heap pages that are continually inserted into and then deleted a little later on. new_orders is a bit like a queue that never gets too big. It is probably the component of TPC-C where we have the most room for improvement, fragmentation-wise. OTOH, despite all the churn the high watermark size of the new_orders table isn't all that high -- maybe ~50MB with a 1TB database. So it's not like we'll save very many shared_buffers misses there.

> > Preferring close-by blocks in the FSM is something that there was
> > discussion of when the current FSM implementation went in back in 8.4.
>
> Right, I found a pretty long one here:
>
> https://www.postgresql.org/message-id/flat/1253201179.9666.174.camel%40ebony.2ndQuadrant

Thanks for finding that.

> I imagine you're suggesting to make this change in the FSM data? I'm
> thinking we could change the category byte to a signed integer, and
> reduce FSM_CATEGORIES to 128.

Yeah, something like that. I don't think that we need very many distinct FSM_CATEGORIES. Even 128 seems like way more than could ever be needed.

> Any change of the FSM file would require pg_upgrade to rewrite the
> FSM, but it still doesn't seem like a lot of code.

I think that the sloppy approach to locking for the fsmpage->fp_next_slot field in functions like fsm_search_avail() (i.e. not using real atomic ops, even though we could) is one source of problems here. That might end up necessitating fixing the on-disk format, just to get the FSM to behave sensibly -- assuming that the value won't be too stale in practice is extremely dubious.

This fp_next_slot business interacts poorly with the "extend a relation by multiple blocks" logic added by commit 719c84c1be5 -- concurrently inserting backends are liable to get the same heap block from the FSM, causing "collisions". That almost seems like a bug IMV. We really shouldn't be giving out the same block twice, but that's what my own custom instrumentation shows happens here. With atomic ops, it isn't a big deal to restart using a compare-and-swap at the end (when we set/reset fp_next_slot for other backends).

> Other ideas?

I've been experimenting with changing the way that we enforce heap fill factor with calls to heap_insert() (and even heap_update()) that happen to occur at a "natural temporal boundary". This works by remembering an XID alongside the target block in the relcache when the target block is set. When we have an existing target block whose XID does not match our backend's current XID (i.e. it's an old XID for the backend), then that means we're at one of these boundaries. We require that the page has a little more free space before we'll insert on it when at a boundary. If we barely have enough space to insert the incoming heap tuple, and it's the first of a few tuples the transaction will ultimately insert, then we should start early on a new page instead of using the last little bit of space (note that the "last little bit" of space does not include the space left behind by fill factor).

The overall effect is that groups of related tuples are much less likely to span a heap page boundary unless and until we have lots of updates -- though maybe not even then. I think that it's very common for transactions to insert a group of 2 - 15 logically related tuples into a table at a time.

Roughly speaking, you can think of this as the heapam equivalent of the nbtree page split choice logic added by commit fab25024. We ought to go to at least a little bit of effort to minimize the number of distinct XIDs that are present on each heap page (in the tuple headers). We can probably figure out heuristics that result in respecting heap fill factor on average, while giving inserts (and even non-HOT updates) a little wiggle room when it comes to heap page boundaries.

By applying both of these techniques together (the locality/page split thing and real atomic ops for fp_next_slot), the prototype patch I'm working on mostly restores the system's current ability to reuse space (as measured by the final size of relations when everything is done), while maintaining most of the performance benefits of not using the FSM at all. The code is still pretty rough, though.

I haven't decided how far to pursue this. It's not as if there are that many ways left to make TPC-C go 5%+ faster; it's very write-heavy, and stresses many different parts of the system all at once. I'm sure that anything like my current prototype patch will be controversial, though. Maybe it will be acceptable if we only change the behavior for people that explicitly set heap fillfactor.

--
Peter Geoghegan
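A minimal standalone sketch of the "natural temporal boundary" heuristic described above (remember an XID next to the relcache target block, and demand extra headroom when the current XID differs); all names and the headroom value are hypothetical, this is not the actual prototype:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
typedef uint32_t BlockNumber;

/* Hypothetical stand-in for the relcache's cached insertion target block */
typedef struct
{
    BlockNumber   targetBlock;
    TransactionId targetBlockXid;   /* XID that last inserted on the target block */
} RelationTargetInfo;

#define BOUNDARY_EXTRA_SPACE 512    /* extra free space demanded at a boundary (made-up value) */

/*
 * Decide whether to keep using the cached target block for a new insert.
 * If the inserting transaction differs from the one that last used the
 * block, we are at a "natural temporal boundary": require a little more
 * free space, so a new group of related tuples starts on a fresh page
 * instead of squeezing its first tuple into the last bit of space.
 */
static bool
use_target_block(RelationTargetInfo *rel, TransactionId curxid,
                 int pagefree, int tuplen)
{
    int needed = tuplen;

    if (rel->targetBlockXid != curxid)
        needed += BOUNDARY_EXTRA_SPACE;     /* at a boundary: want headroom, not just a fit */

    if (pagefree < needed)
        return false;                       /* caller should pick or extend to a new target page */

    rel->targetBlockXid = curxid;           /* same page keeps absorbing this transaction */
    return true;
}

int
main(void)
{
    RelationTargetInfo rel = {.targetBlock = 42, .targetBlockXid = 1000};

    /* Same transaction: a 180-byte tuple fits into 200 bytes of free space */
    printf("same xid, tight fit:     %s\n",
           use_target_block(&rel, 1000, 200, 180) ? "use page 42" : "start a new page");

    /* New transaction arrives: the same tight fit is now rejected */
    printf("new xid, tight fit:      %s\n",
           use_target_block(&rel, 1001, 200, 180) ? "use page 42" : "start a new page");

    /* New transaction, but the page has plenty of room */
    printf("new xid, plenty of room: %s\n",
           use_target_block(&rel, 1001, 2000, 180) ? "use page 42" : "start a new page");
    return 0;
}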
Greetings,

* Peter Geoghegan (pg@bowt.ie) wrote:
> On Mon, Aug 24, 2020 at 6:38 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> > Other ideas?
>
> I've been experimenting with changing the way that we enforce heap
> fill factor with calls to heap_insert() (and even heap_update()) that
> happen to occur at a "natural temporal boundary". This works by
> remembering an XID alongside the target block in the relcache when the
> target block is set. When we have an existing target block whose XID
> does not match our backend's current XID (i.e. it's an old XID for the
> backend), then that means we're at one of these boundaries. We require
> that the page has a little more free space before we'll insert on it
> when at a boundary. If we barely have enough space to insert the
> incoming heap tuple, and it's the first of a few tuples the
> transaction will ultimately insert, then we should start early on a
> new page instead of using the last little bit of space (note that the
> "last little bit" of space does not include the space left behind by
> fill factor). The overall effect is that groups of related tuples are
> much less likely to span a heap page boundary unless and until we have
> lots of updates -- though maybe not even then. I think that it's very
> common for transactions to insert a group of 2 - 15 logically related
> tuples into a table at a time.

This all definitely sounds quite interesting and the idea to look at the XID to see if we're in the same transaction and therefore likely inserting a related tuple certainly makes some sense. While I get that it might not specifically work with TPC-C, I'm wondering about if we could figure out how to make a multi-tuple INSERT use heap/table_multi_insert (which seems to only be used by COPY currently, and internally thanks to the recent work to use it for some catalog tables) and then consider the size of the entire set of tuples being INSERT'd when working to find a page, or deciding if we should extend the relation.

> Roughly speaking, you can think of this as the heapam equivalent of
> the nbtree page split choice logic added by commit fab25024. We ought
> to go to at least a little bit of effort to minimize the number of
> distinct XIDs that are present on each heap page (in the tuple
> headers). We can probably figure out heuristics that result in
> respecting heap fill factor on average, while giving inserts (and even
> non-HOT updates) a little wiggle room when it comes to heap page
> boundaries.

Agreed.

> By applying both of these techniques together (locality/page split
> thing and real atomic ops for fp_next_slot) the prototype patch I'm
> working on mostly restores the system's current ability to reuse space
> (as measured by the final size of relations when everything is done),
> while maintaining most of the performance benefits of not using the
> FSM at all. The code is still pretty rough, though.
>
> I haven't decided how far to pursue this. It's not as if there are
> that many ways left to make TPC-C go 5%+ faster; it's very
> write-heavy, and stresses many different parts of the system all at
> once. I'm sure that anything like my current prototype patch will be
> controversial, though. Maybe it will be acceptable if we only change
> the behavior for people that explicitly set heap fillfactor.

Getting a 5% improvement is pretty exciting, very cool and seems worth spending effort on.

Thanks,

Stephen
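As a rough sketch of what "consider the size of the entire set of tuples" could mean for a multi-insert (everything below is a hypothetical illustration, not the existing table_multi_insert behavior):

#include <stdbool.h>
#include <stdio.h>

#define BLCKSZ            8192
#define PAGE_OVERHEAD     200      /* rough allowance for page header + line pointers */
#define USABLE_PER_PAGE   (BLCKSZ - PAGE_OVERHEAD)

/*
 * Given the total size of a batch of related tuples, decide whether a
 * single candidate page can absorb the whole batch, and if not, how many
 * fresh pages to extend the relation by so the batch stays contiguous.
 */
static int
pages_needed_for_batch(int batch_bytes)
{
    return (batch_bytes + USABLE_PER_PAGE - 1) / USABLE_PER_PAGE;
}

static bool
batch_fits_on_page(int batch_bytes, int pagefree)
{
    return batch_bytes <= pagefree;
}

int
main(void)
{
    int batch_bytes = 10 * 320;     /* e.g. ten ~320-byte related tuples */
    int candidate_free = 2500;

    if (batch_fits_on_page(batch_bytes, candidate_free))
        printf("whole batch fits on the candidate page\n");
    else
        printf("extend the relation by %d page(s) and keep the batch together\n",
               pages_needed_for_batch(batch_bytes));
    return 0;
}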
On Tue, Aug 25, 2020 at 6:21 AM Stephen Frost <sfrost@snowman.net> wrote:
> This all definitely sounds quite interesting and the idea to look at the
> XID to see if we're in the same transaction and therefore likely
> inserting a related tuple certainly makes some sense. While I get that
> it might not specifically work with TPC-C, I'm wondering about if we
> could figure out how to make a multi-tuple INSERT use
> heap/table_multi_insert (which seems to only be used by COPY currently,
> and internally thanks to the recent work to use it for some catalog
> tables) and then consider the size of the entire set of tuples being
> INSERT'd when working to find a page, or deciding if we should extend
> the relation.

There are probably quite a variety of ways in which we can capture locality, and I'm sure that I'm only scratching the surface right now. I agree that table_multi_insert could definitely be one of them. John said something about concentrating garbage in certain pages up-thread. I also wonder if there is some visibility + freeze map angle on this.

What I see with the benchmark is that the order_line table (the largest table by quite some margin, and one that grows indefinitely) does not make much use of the visibility map during VACUUM -- even though it's the kind of table + workload that you'd hope and expect would make good use of it if you were looking at it in a real world situation. Each tuple is only inserted once and later updated once, so we really ought to do better there. The logs show that VACUUM/autovacuum dirties lots of pages, probably due to fragmentation from free space management (though there are undoubtedly other factors).

The best "locality orientated" reference guide to TPC-C that I've been able to find is "A Modeling Study of the TPC-C Benchmark", which was published in 1993 by NASA (shortly after the introduction of TPC-C). You can get it from:

https://apps.dtic.mil/dtic/tr/fulltext/u2/a264793.pdf

(Unfortunately this reproduction is a low quality photocopy -- ACM members can get a clear copy.)

If you think about the TPC-C workload at a high level, and Postgres internals stuff at a low level, and then run the benchmark, you'll find various ways in which we don't live up to our potential. The big picture is that the "physical database" is not close enough to the "logical database", especially over time and after a lot of churn. This causes problems all over the place, that look like nothing in particular in profiles. It's not that TPC-C is unsympathetic to Postgres in any of the usual ways -- there are very few UPDATEs that affect indexed columns, so those are not really a factor at all. There are also no transactions that run longer than 2 seconds (any more than ~50ms per execution is exceptional, in fact).

We already do a significant amount of the necessary garbage collection opportunistically (by pruning) -- probably the vast majority, in fact. In particular, HOT pruning works well, since fill factor has been tuned. It just doesn't work as well as you'd hope, in that it cannot stem the tide of fragmentation. And not just because of heapam's use of the FSM.

If we implemented a simple differential heap tuple compression scheme within HOT chains (though not across HOT chains/unrelated tuples), then we'd probably do much better -- we could keep the same logical tuple on the same heap page much more often, maybe always. For example, the "stock" table is a major source of FPIs, and I bet that this is greatly amplified by our failure to keep versions of the same frequently updated tuple together. We can only fit ~25 stock tuples on each heap page (with heap fill factor at 95, the BenchmarkSQL default), so individual tuples are ~320 bytes (including tuple header). If we found a way to store the changed columns for successor tuples within a HOT chain, then we would do much better -- without changing the HOT invariant (or anything else that's truly scary). If our scheme worked by abusing the representation that we use for NULL values in the successor/HOT tuples (which is not terribly space efficient), then we could still store about 6 more versions of each stock tuple on the same page -- the actual changed columns are typically much much smaller than the unchanged columns. Our 23/24 byte tuple header is usually small potatoes compared to storing unchanged values several times.

As I said, the HOT optimization (and opportunistic pruning) already work well with this benchmark. But I think that they'd work a lot better if we could just temporarily absorb a few extra versions on the heap page, so we have enough breathing room to prune before the page "truly fills to capacity". It could help in roughly the same way that deduplication now helps in nbtree indexes with "version churn".

I'm also reminded of the nbtree optimization I prototyped recently, which more or less prevented all unique index bloat provided there is no long running transaction:

https://postgr.es/m/CAH2-Wzm+maE3apHB8NOtmM=p-DO65j2V5GzAWCOEEuy3JZgb2g@mail.gmail.com

(Yes, "preventing all unique index bloat provided there is no long running transaction" is no exaggeration -- it really prevents all bloat related nbtree page splits, even with hundreds of clients, skew, etc.)

It seems pretty obvious to me that buying another (say) 2 seconds to let opportunistic pruning run "before the real damage is done" can be extremely valuable -- we only need to be able to delay a page split (which is similar to the case where we cannot continue to store heap tuples on the same heap page indefinitely) for a couple of seconds at a time. We only need to "stay one step ahead" of the need to split the page (or to migrate a logical heap tuple to a new heap page when it comes to the heap) at any given time -- that alone will totally arrest the problem. This is a very important point -- the same set of principles that helped in nbtree can also be effectively applied to heap pages that are subject to version churn. (Assuming no long running xacts.)

> Getting a 5% improvement is pretty exciting, very cool and seems worth
> spending effort on.

I'm already at 5% - 7% now. I bet the differential compression of tuples on a HOT chain could buy a lot more than that. The biggest emphasis should be placed on stable performance over time, and total I/O over time -- that's where we're weakest right now.

--
Peter Geoghegan
On Tue, Aug 25, 2020 at 5:17 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> I think that the sloppy approach to locking for the
> fsmpage->fp_next_slot field in functions like fsm_search_avail() (i.e.
> not using real atomic ops, even though we could) is one source of
> problems here. That might end up necessitating fixing the on-disk
> format, just to get the FSM to behave sensibly -- assuming that the
> value won't be too stale in practice is extremely dubious.
>
> This fp_next_slot business interacts poorly with the "extend a
> relation by multiple blocks" logic added by commit 719c84c1be5 --
> concurrently inserting backends are liable to get the same heap block
> from the FSM, causing "collisions". That almost seems like a bug IMV.
> We really shouldn't be giving out the same block twice, but that's what
> my own custom instrumentation shows happens here. With atomic ops, it
> isn't a big deal to restart using a compare-and-swap at the end (when
> we set/reset fp_next_slot for other backends).

The fact that that logic extends by 20 * numwaiters to get optimal performance is a red flag that resources aren't being allocated efficiently. I have an idea to ignore fp_next_slot entirely if we have extended by multiple blocks: The backend that does the extension stores in the FSM root page 1) the number of blocks added and 2) the end-most block number. Any request for space will look for a valid value here first before doing the usual search. If there is one, then the block to try is based on a hash of the xid. Something like:

candidate-block = prev-end-of-relation + 1 + (xid % (num-new-blocks))

To guard against collisions, we could then peek in the FSM at that slot and, if it's not completely empty, search the FSM using a "look-nearby" API and increment a counter every time we collide. When the counter gets to some-value, clear the special area in the root page so that future backends use the usual search.

I think this would work well with your idea to be more picky if the xid stored with the relcache target block doesn't match the current one. Also, num-new-blocks above could be scaled down from the actual number of blocks added, just to make sure writes aren't happening all over the place.

There might be holes in this idea, but it may be worth trying to be better in this area without adding stricter locking.

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
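A standalone sketch of the xid-hash block assignment sketched above (the collision limit and the fallback behavior are placeholders):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
typedef uint32_t BlockNumber;

#define InvalidBlockNumber  ((BlockNumber) 0xFFFFFFFF)
#define MAX_COLLISIONS      8            /* give up on the fresh range after this many collisions */

/* Hypothetical extra state stored in the FSM root page after a bulk extension */
typedef struct
{
    BlockNumber prev_end;        /* last block before the extension */
    BlockNumber num_new_blocks;  /* how many blocks the extension added */
    int         collisions;      /* how often the hashed slot was already in use */
} BulkExtensionHint;

/*
 * Spread concurrent inserters across the freshly added blocks by hashing
 * their xid, instead of having them all chase the same fp_next_slot.
 * Returns InvalidBlockNumber once the hint has been retired, telling the
 * caller to fall back to the conventional FSM search.
 */
static BlockNumber
candidate_block_for_xid(BulkExtensionHint *hint, TransactionId xid, bool slot_was_occupied)
{
    if (hint->num_new_blocks == 0)
        return InvalidBlockNumber;          /* hint cleared: use the usual FSM search */

    if (slot_was_occupied && ++hint->collisions >= MAX_COLLISIONS)
    {
        hint->num_new_blocks = 0;           /* too many collisions: retire the hint */
        return InvalidBlockNumber;
    }

    return hint->prev_end + 1 + (xid % hint->num_new_blocks);
}

int
main(void)
{
    BulkExtensionHint hint = {.prev_end = 999, .num_new_blocks = 64, .collisions = 0};

    /* Three concurrent backends land on (usually) different new blocks */
    for (TransactionId xid = 5000; xid < 5003; xid++)
        printf("xid %u -> candidate block %u\n",
               xid, candidate_block_for_xid(&hint, xid, false));
    return 0;
}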
On Wed, Aug 26, 2020 at 1:46 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> The fact that that logic extends by 20 * numwaiters to get optimal
> performance is a red flag that resources aren't being allocated
> efficiently.

I agree that that's pretty suspicious.

> I have an idea to ignore fp_next_slot entirely if we have
> extended by multiple blocks: The backend that does the extension
> stores in the FSM root page 1) the number of blocks added and 2) the
> end-most block number. Any request for space will look for a valid
> value here first before doing the usual search. If there is one, then
> the block to try is based on a hash of the xid. Something like:
>
> candidate-block = prev-end-of-relation + 1 + (xid % (num-new-blocks))

I was thinking of doing something in shared memory, and not using the FSM here at all. If we're really giving 20 pages out to each backend, we will probably benefit from explicitly assigning contiguous ranges of pages to each backend, and making some effort to respect that they own the blocks in some general sense. Hopefully without also losing access to the free space in corner cases (e.g. one of the backends has an error shortly after receiving its contiguous range of blocks).

> To guard against collisions, we could then peek in the FSM at that slot
> and, if it's not completely empty, search the FSM using a "look-nearby"
> API and increment a counter every time we collide. When the counter
> gets to some-value, clear the special area in the root page so that
> future backends use the usual search.

The backends already use a look-nearby API, sort of -- RecordAndGetPageWithFreeSpace() already behaves that way. I'm not sure exactly how well it works in practice, but it definitely works to some degree.

> Also num-new-blocks above could be scaled down from the actual number
> of blocks added, just to make sure writes aren't happening all over
> the place.

Or scaled up, perhaps.

--
Peter Geoghegan
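A minimal single-process sketch of the per-backend contiguous range idea (no real shared memory or locking here, and all sizes are made up):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;

#define RANGE_SIZE   8    /* blocks handed to each backend at a time (made-up) */
#define MAX_RANGES   16

/* Hypothetical shared structure tracking unclaimed ranges of newly added blocks */
typedef struct
{
    BlockNumber start[MAX_RANGES];
    BlockNumber end[MAX_RANGES];     /* exclusive */
    int         nranges;
} FreeRangePool;

typedef struct
{
    BlockNumber next;
    BlockNumber end;                 /* exclusive */
} BackendRange;

/* Carve a bulk extension into per-backend ranges and add them to the pool */
static void
pool_add_extension(FreeRangePool *pool, BlockNumber first_new, int nblocks)
{
    for (int off = 0; off < nblocks && pool->nranges < MAX_RANGES; off += RANGE_SIZE)
    {
        pool->start[pool->nranges] = first_new + off;
        pool->end[pool->nranges] = first_new + (off + RANGE_SIZE < nblocks ? off + RANGE_SIZE : nblocks);
        pool->nranges++;
    }
}

/* A backend claims one contiguous range; returns 0 on success */
static int
backend_claim_range(FreeRangePool *pool, BackendRange *mine)
{
    if (pool->nranges == 0)
        return -1;                   /* nothing left: fall back to FSM or relation extension */
    pool->nranges--;
    mine->next = pool->start[pool->nranges];
    mine->end = pool->end[pool->nranges];
    return 0;
}

/* On error or backend exit, put the unused tail of the range back */
static void
backend_release_range(FreeRangePool *pool, BackendRange *mine)
{
    if (mine->next < mine->end && pool->nranges < MAX_RANGES)
    {
        pool->start[pool->nranges] = mine->next;
        pool->end[pool->nranges] = mine->end;
        pool->nranges++;
    }
}

int
main(void)
{
    FreeRangePool pool = {.nranges = 0};
    BackendRange r;

    pool_add_extension(&pool, 1000, 32);         /* relation extended by 32 blocks */
    backend_claim_range(&pool, &r);
    printf("backend got blocks %u..%u\n", r.next, r.end - 1);

    r.next += 3;                                  /* backend used 3 blocks, then errored out */
    backend_release_range(&pool, &r);
    printf("pool now holds %d ranges\n", pool.nranges);
    return 0;
}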
On Wed, Sep 2, 2020 at 1:57 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Aug 26, 2020 at 1:46 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> > The fact that that logic extends by 20 * numwaiters to get optimal
> > performance is a red flag that resources aren't being allocated
> > efficiently.
>
> I agree that that's pretty suspicious.

Per Simon off-list some time ago (if I understood him correctly), counting the lock waiters makes the lock contention worse. I haven't tried to measure this, but I just had an idea instead to keep track of how many times the table has previously been extended by multiple blocks, and extend by a number calculated from that. Something like (pow2(2 + num-times-ext-mult-blocks)), with some ceiling perhaps much smaller than 512. Maybe a bit off topic, but the general problem you brought up has many moving parts, as you've mentioned.

> > I have an idea to ignore fp_next_slot entirely if we have
> > extended by multiple blocks: The backend that does the extension
> > stores in the FSM root page 1) the number of blocks added and 2) the
> > end-most block number. Any request for space will look for a valid
> > value here first before doing the usual search. If there is one, then
> > the block to try is based on a hash of the xid. Something like:
> >
> > candidate-block = prev-end-of-relation + 1 + (xid % (num-new-blocks))
>
> I was thinking of doing something in shared memory, and not using the
> FSM here at all. If we're really giving 20 pages out to each backend,
> we will probably benefit from explicitly assigning contiguous ranges
> of pages to each backend, and making some effort to respect that they
> own the blocks in some general sense.

That would give us flexibility and precise control. I suspect it would also have more cognitive and maintenance cost, by having more than one source of info.

> Hopefully without also losing
> access to the free space in corner cases (e.g. one of the backends
> has an error shortly after receiving its contiguous range of blocks).

Right, you'd need some way of resetting or retiring the shared memory info when it is no longer useful. That was my thinking with the collision counter -- go back to using the FSM when we no longer have a reasonable chance of getting a fresh block.

> > Also num-new-blocks above could be scaled down from the actual number
> > of blocks added, just to make sure writes aren't happening all over
> > the place.
>
> Or scaled up, perhaps.

I don't think I explained this part well, so here's a concrete example. Let's say 128 blocks were added at once. Then xid % 128 would give a number to be added to the previous last block in the relation, so new target block allocations could be anywhere in this 128. By "scale down", I mean compute (say) xid % 32. That would limit how far apart the writes land (and so how much spatial locality is lost) for those backends that were waiting on extension. Doing the opposite, like xid % 256, would give you blocks past the end of the relation.

Further thinking out loud: after detecting enough collisions in the first 32, we could iterate through the other ranges and finally revert to conventional FSM search.

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
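A small standalone sketch of the history-based extension sizing suggested above (the cap and the exact formula are placeholders):

#include <stdio.h>

#define EXTEND_BLOCKS_CEILING 256   /* some ceiling well below the current 512-block maximum */

/*
 * Instead of extending by 20 * number-of-lock-waiters, grow the extension
 * size based on how many times this relation has already needed a
 * multi-block extension: pow2(2 + n), capped.
 */
static int
blocks_to_extend(int times_extended_multi)
{
    int nblocks = 1 << (2 + times_extended_multi);   /* 4, 8, 16, 32, ... */

    if (nblocks > EXTEND_BLOCKS_CEILING)
        nblocks = EXTEND_BLOCKS_CEILING;
    return nblocks;
}

int
main(void)
{
    for (int n = 0; n <= 8; n++)
        printf("multi-block extensions so far: %d -> extend by %d blocks\n",
               n, blocks_to_extend(n));
    return 0;
}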