Re: Problems with the FSM, heap fillfactor, and temporal locality - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Problems with the FSM, heap fillfactor, and temporal locality |
Date | |
Msg-id | CAH2-Wz=Kk6RGn_XaV4JYgCyT=hcascsA835TYksF=6LRVRve0A@mail.gmail.com |
In response to | Re: Problems with the FSM, heap fillfactor, and temporal locality (John Naylor <john.naylor@2ndquadrant.com>) |
Responses | Re: Problems with the FSM, heap fillfactor, and temporal locality |
List | pgsql-hackers |

Hi John,

On Fri, Aug 21, 2020 at 3:10 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> Interesting stuff. Is lower-than-default fillfactor important for the
> behavior you see?

It's hard to say. It's definitely not important as far as the initial bulk loading behavior is concerned (the behavior where related tuples get inserted onto multiple disparate blocks all too often). That will happen with any fill factor, including the default of 100. I'm concerned about heap fill factor in particular because I suspect that it doesn't really work sensibly.

To give you some concrete idea of the benefits, I present pg_stat_database output from the master branch after 6 hours of BenchmarkSQL with a rate limit:

-[ RECORD 1 ]---------+------------------------------
datid                 | 13,619
datname               | postgres
numbackends           | 3
xact_commit           | 45,902,031
xact_rollback         | 200,131
blks_read             | 662,677,079
blks_hit              | 24,790,989,538
tup_returned          | 30,054,930,608
tup_fetched           | 13,722,542,025
tup_inserted          | 859,165,629
tup_updated           | 520,611,959
tup_deleted           | 20,074,897
conflicts             | 0
temp_files            | 88
temp_bytes            | 18,849,890,304
deadlocks             | 0
checksum_failures     |
checksum_last_failure |
blk_read_time         | 124,233,831.492
blk_write_time        | 8,588,876.871
stats_reset           | 2020-08-20 13:51:08.351036-07

Here is equivalent output for my simple patch that naively disables the FSM:

-[ RECORD 1 ]---------+------------------------------
datid                 | 13,619
datname               | postgres
numbackends           | 3
xact_commit           | 46,369,235
xact_rollback         | 201,919
blks_read             | 658,267,665
blks_hit              | 19,980,524,578
tup_returned          | 30,804,856,896
tup_fetched           | 11,839,865,936
tup_inserted          | 861,798,911
tup_updated           | 525,895,435
tup_deleted           | 20,277,618
conflicts             | 0
temp_files            | 88
temp_bytes            | 18,849,439,744
deadlocks             | 0
checksum_failures     |
checksum_last_failure |
blk_read_time         | 117,167,612.616
blk_write_time        | 7,873,922.175
stats_reset           | 2020-08-20 13:50:51.72056-07

Note that there is a ~20% reduction in blks_hit here, even though the patch does ~1% more transactions (the rate limiting doesn't work perfectly). There is also a ~5.5% reduction in aggregate blk_read_time, and a ~9% reduction in blk_write_time. I'd say that that's pretty significant. I also see small but significant improvements in transaction latency, particularly with 90th percentile latency.

> Hmm. You focus on FSM, but I'm thinking also multiple VM bits
> potentially got cleared (maybe not this benchmark but other
> scenarios).

I'm focussed on how heapam interacts with the FSM, and its effects on locality. It's definitely true that this could go on to affect how the visibility map gets set -- we could set fewer bits unnecessarily. And it probably has a number of other undesirable consequences that are hard to quantify. Clearly there are many reasons why making the physical database layout closer to that of the logical database is a good thing. I probably have missed a few.

> I'm not sure I follow the "close-by" criterion. It doesn't seem
> immediately relevant to what I understand as the main problem you
> found, of free space. In other words, if we currently write to 5
> blocks, but with smarter FSM logic we can find a single block, it
> seems that should be preferred over close-by blocks each with less
> space? Or are you envisioning scoring by both free space and distance
> from the original block?

I am thinking of doing both at the same time. Think of a table with highly skewed access. There is a relatively small number of hot spots -- parts of the primary key's key space that are consistently affected by many skewed updates (with strong heap correlation to primary key order). We simply cannot ever hope to avoid migrating heavily updated rows to new pages -- there is too much contention in the hot spots for that. But, by 1) not considering pages free-space-reclaimable until their free space reaches ~50%, and 2) preferring close-by free pages, we avoid (or at least delay) destroying locality of access. There is a much better chance that rows from logically related hot spots will all migrate to the same few blocks again and again, with whole blocks becoming free in a relatively predictable, phased fashion. (I'm speculating here, but my guess is that this combination will help real world workloads by quite a bit.)

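To make the "close-by" half of that concrete, here is a minimal sketch of what heuristic 2) could look like on the lookup side. It is illustration only, under my own assumptions, and not part of any patch: GetNearbyPageWithFreeSpace() and NEARBY_WINDOW are invented names; only GetRecordedFreeSpace() and GetPageWithFreeSpace() are existing freespace.c entry points.

```c
/*
 * Sketch only -- not an actual patch.  Probe a small window of blocks
 * around the block that the row currently lives on before falling back
 * to the regular FSM search, so that rows leaving a hot spot tend to
 * land together.  GetNearbyPageWithFreeSpace() and NEARBY_WINDOW are
 * invented for illustration.
 */
#include "postgres.h"
#include "storage/block.h"
#include "storage/freespace.h"
#include "utils/rel.h"

#define NEARBY_WINDOW	16		/* assumption: blocks to probe on either side */

static BlockNumber
GetNearbyPageWithFreeSpace(Relation rel, BlockNumber origin, Size needed)
{
	for (int delta = 1; delta <= NEARBY_WINDOW; delta++)
	{
		/* Look just past the origin block first, then just before it. */
		BlockNumber after = origin + delta;
		BlockNumber before = (origin >= (BlockNumber) delta) ?
			origin - delta : InvalidBlockNumber;

		if (GetRecordedFreeSpace(rel, after) >= needed)
			return after;
		if (before != InvalidBlockNumber &&
			GetRecordedFreeSpace(rel, before) >= needed)
			return before;
	}

	/* Nothing suitable nearby: fall back to the usual global FSM search. */
	return GetPageWithFreeSpace(rel, needed);
}
```

In real code the origin would presumably be the old tuple's block inside heap_update(), and a decay-with-distance score might beat a hard window; the only point is that distance from the original block becomes an input alongside raw free space.
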
Preferring close-by blocks in the FSM is something that there was discussion of when the current FSM implementation went in back in 8.4. I am almost certain that just doing that will not help. If it helps at all then it will help by preserving locality as tuple turnover takes place over time, and I think that you need to be clever about "reuse granularity" in order for that to happen. We're optimizing the entire "lifecycle" of logically related tuples whose relatedness is embodied by their initial physical position following an initial insert (before many subsequent updates take place that risk destroying locality).

> Makes sense. If we go this route, I wonder if we should keep the
> current behavior and use any free space if the fillfactor is 100%,
> since that's in line with the intention. Also, the 50% (or whatever)
> figure could be scaled depending on fillfactor. As in, check if
> freespace > (100% - fillfactor * 0.6), or something.

Right. Or it could be another reloption.

> I'm not sure how to distinguish blocks that have never reached
> fillfactor vs. ones that did but haven't gained back enough free space
> to be considered again. Naively, we could start by assuming the last
> block can always be filled up to fillfactor, but earlier blocks must
> use the stricter rule. That's easy since we already try the last block
> anyway before extending the relation.

I was thinking of explicitly marking blocks as "freeable", meaning that the FSM will advertise their free space. This isn't self-evident from the amount of free space on the page alone, since we need to distinguish between at least two cases: the case where a page has yet to apply fill factor for the first time (which may still be close to fillfactor% full) versus the case where the page did reach fillfactor, but then had a small amount of space freed. I think that the FSM ought to give out space in the former case, but not in the latter case, even though an identical amount of free space might be present in either case.

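Here is an equally rough sketch of the recording side, just to make the "freeable" idea concrete. Again, this is illustration under my own assumptions, not the patch: PD_PAGE_FREEABLE is an invented pd_flags bit, reached_fillfactor stands in for however we end up remembering that the page once filled to fillfactor, and the 50% figure is just the placeholder from earlier in the thread.

```c
/*
 * Sketch only -- not an actual patch.  A page that has filled to fillfactor
 * once is withheld from the FSM until roughly half of it has opened back up,
 * so that a trickle of freed space doesn't attract a few unrelated tuples.
 * PD_PAGE_FREEABLE and reached_fillfactor are invented for illustration.
 */
#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/freespace.h"
#include "utils/rel.h"

#define PD_PAGE_FREEABLE	0x0008	/* invented flag: FSM may advertise page */

static void
heap_consider_fsm(Relation rel, Buffer buf, bool reached_fillfactor)
{
	Page		page = BufferGetPage(buf);
	Size		freespace = PageGetHeapFreeSpace(page);
	BlockNumber blkno = BufferGetBlockNumber(buf);

	if (!reached_fillfactor)
	{
		/* Page is still filling up for the first time: advertise as today. */
		RecordPageWithFreeSpace(rel, blkno, freespace);
		return;
	}

	/*
	 * The page reached fillfactor at some point.  Only mark it freeable --
	 * and hence visible to the FSM -- once about half of it is free again.
	 * (A real version would set the flag under the buffer lock and mark the
	 * buffer dirty; omitted here.)
	 */
	if (freespace >= BLCKSZ / 2)
	{
		((PageHeader) page)->pd_flags |= PD_PAGE_FREEABLE;
		RecordPageWithFreeSpace(rel, blkno, freespace);
	}
	/* else: keep the space in reserve for the rows already on the page */
}
```

Whether that state lives in pd_flags, in the FSM itself, or somewhere else is an open question; the point is only that "how much space is free" and "should the FSM hand this space out" become separate pieces of information.
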
> The flip side of this is: Why should vacuum be in a hurry to dirty a
> page, emit WAL, and update the FSM if it only removes one dead tuple?
> This presentation [1] (pages 35-43) from Masahiko Sawada had the idea
> of a "garbage map", which keeps track of which parts of the heap have
> the most dead space, and focus I/O efforts on those blocks first. It
> may or may not be worth the extra complexity by itself, but it seems
> it would work synergistically with your proposal: Updating related
> tuples would concentrate dead tuples on fewer pages, and vacuums would
> more quickly free up space that can actually be used to store multiple
> new tuples.

I agree that that seems kind of related. I'm trying to concentrate garbage from skewed updates in fewer blocks. (Same with the skewed successor tuples.)

--
Peter Geoghegan