Thread: Combine Prune and Freeze records emitted by vacuum
Hi,

Attached is a patch set which combines the freeze and prune records for vacuum -- eliminating the overhead of a separate freeze WAL record for every block frozen by vacuum. The contents of the freeze record are added to the PRUNE record. In cases where vacuum does freeze and prune, combining the WAL records can reduce WAL bytes substantially and, as a consequence, reduce WAL sync and write time. For example:

psql -c "CREATE TABLE foo(id INT, a INT, b INT, c INT, d INT, e INT, f INT, g INT, h TEXT) WITH (autovacuum_enabled=false);"

for x in $(seq 1 16);
do
  psql -c "INSERT INTO foo SELECT i, i, i, i, i, i, i, i, repeat('b', 30) FROM generate_series(1,2000000)i;"
done

psql -c "UPDATE foo SET a = 2 WHERE id % 7 = 0;"
psql -c "VACUUM (FREEZE) foo;"

Generates 30% fewer WAL records and 12% fewer WAL bytes -- which, depending on what else is happening on the system, can lead to vacuum spending substantially less time on WAL writing and syncing (often 15% less time on WAL writes and 10% less time on syncing WAL in my testing).

Though heap_page_prune() is also used by on-access pruning, on-access pruning does not pass in the parameter used for freezing, so it should incur limited additional overhead. The primary additional overhead would be checking tuples' xmins against the GlobalVisState to determine if the page would be all_visible and to identify the visibility cutoff xid. This is required to determine whether or not to opportunistically freeze. We could condition this on the caller being vacuum if needed. Though, in the future, we may want to consider opportunistic/eager freezing on access. This could allow us to, for example, freeze bulk-loaded read-only data before it goes cold and avoid expensive wraparound vacuuming.

There are other steps that we can take to decrease vacuum WAL volume even further. Many of those are natural follow-ons to combining the prune and freeze record.
For example, I intend to propose combining the visibility map update record into the Prune/Freeze and Vacuum records -- eliminating an extra visibility map update record. This would mean a single WAL record emitted per block for vacuum's first pass. On master, for my example above, of the roughly 1 million WAL records emitted by vacuum, about 1/3 of them are prune records, 1/3 are freeze records, and 1/3 are visibility map update records. So we will achieve another substantial reduction in the number of WAL records and bytes of WAL record overhead by eliminating a separate record for updating the VM.

The attached patch set is broken up into many separate commits for ease of review. Each patch does a single thing which can be explained plainly in the commit message. Every commit passes tests and works on its own.

0001 - 0003 cover checking tuples' xmins against the GlobalVisState in heap_page_prune().
0004 - 0007 execute freezing in heap_page_prune() (for vacuum only).
0008 translates the eager/opportunistic freeze heuristic into one that will work without relying on having a separate prune record. Elsewhere in [1] we are discussing how to improve this heuristic.
0009 - 0012 merge the freeze record into the prune record.
0013 - 0015 remove the loop through the page in lazy_scan_prune() by doing the accounting it did there in heap_page_prune() instead. A nice bonus of this patch set is that we can eliminate one of vacuum's loops through the page.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_ZTDm1d9M%2BENf6oXhW9nRT3J76vOL1ianiCW4%2B4M6hMoA%40mail.gmail.com
Attachment
- v1-0004-Add-reference-to-VacuumCutoffs-in-HeapPageFreeze.patch
- v1-0001-lazy_scan_prune-tests-tuple-vis-with-GlobalVisTes.patch
- v1-0002-Pass-heap_prune_chain-PruneResult-output-paramete.patch
- v1-0003-heap_page_prune-sets-all_visible-and-frz_conflict.patch
- v1-0005-Prepare-freeze-tuples-in-heap_page_prune.patch
- v1-0006-lazy_scan_prune-reorder-freeze-execution-logic.patch
- v1-0009-Separate-tuple-pre-freeze-checks-and-invoke-earli.patch
- v1-0010-Inline-heap_freeze_execute_prepared.patch
- v1-0007-Execute-freezing-in-heap_page_prune.patch
- v1-0008-Make-opp-freeze-heuristic-compatible-with-prune-f.patch
- v1-0013-Set-hastup-in-heap_page_prune.patch
- v1-0014-Count-tuples-for-vacuum-logging-in-heap_page_prun.patch
- v1-0011-Exit-heap_page_prune-early-if-no-prune.patch
- v1-0012-Merge-prune-and-freeze-records.patch
- v1-0015-Save-dead-tuple-offsets-during-heap_page_prune.patch
On 25/01/2024 00:49, Melanie Plageman wrote:
> Generates 30% fewer WAL records and 12% fewer WAL bytes -- which,
> depending on what else is happening on the system, can lead to vacuum
> spending substantially less time on WAL writing and syncing (often 15%
> less time on WAL writes and 10% less time on syncing WAL in my
> testing).

Nice!

> The attached patch set is broken up into many separate commits for
> ease of review. Each patch does a single thing which can be explained
> plainly in the commit message. Every commit passes tests and works on
> its own.

About this very first change:

> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -1526,8 +1526,7 @@ lazy_scan_prune(LVRelState *vacrel,
>                  * that everyone sees it as committed?
>                  */
>                 xmin = HeapTupleHeaderGetXmin(htup);
> -               if (!TransactionIdPrecedes(xmin,
> -                                          vacrel->cutoffs.OldestXmin))
> +               if (!GlobalVisTestIsRemovableXid(vacrel->vistest, xmin))
>                 {
>                     all_visible = false;
>                     break;

Does GlobalVisTestIsRemovableXid() handle FrozenTransactionId correctly?

I read through all the patches in order, and aside from the above they all look correct to me. Some comments on the set as a whole:

I don't think we need XLOG_HEAP2_FREEZE_PAGE as a separate record type anymore. By removing that, you also get rid of the freeze-only codepath near the end of heap_page_prune(), and the heap_freeze_execute_prepared() function.

The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for the case that there's no pruning, just freezing. The record format (xl_heap_prune) looks pretty complex as it is; I think it could be made both more compact and more clear with some refactoring.

FreezeMultiXactId still takes a separate 'cutoffs' arg, but it could use pagefrz->cutoffs now.

HeapPageFreeze has two "trackers", for the "freeze" and "no freeze" cases.
heap_page_prune() needs to track both, until it decides whether to freeze or not. But it doesn't make much sense that the caller (lazy_scan_prune()) has to initialize both, and has to choose which of the values to use after the call depending on whether heap_page_prune() froze or not. The two trackers should be just heap_page_prune()'s internal business.

HeapPageFreeze is a bit confusing in general, as it's both an input and an output to heap_page_prune(). Not sure what exactly to do there, but I feel that we should make heap_page_prune()'s interface more clear. Perhaps move the output fields to PruneResult.

Let's rename heap_page_prune() to heap_page_prune_and_freeze(), as freezing is now an integral part of the function. And mention it in the function comment, too.

In heap_prune_chain:

> * Tuple visibility information is provided in presult->htsv.

Not this patch's fault directly, but it's not immediately clear what "is provided" means here. Does the caller provide it, or does the function set it, i.e., is it an input or output argument? Looking at the code, it's an input, but now it looks a bit weird that an input argument is called 'presult'.

-- Heikki Linnakangas Neon (https://neon.tech)
Thanks so much for the review! On Wed, Mar 6, 2024 at 7:59 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 25/01/2024 00:49, Melanie Plageman wrote: > > > The attached patch set is broken up into many separate commits for > > ease of review. Each patch does a single thing which can be explained > > plainly in the commit message. Every commit passes tests and works on > > its own. > > About this very first change: > > > --- a/src/backend/access/heap/vacuumlazy.c > > +++ b/src/backend/access/heap/vacuumlazy.c > > @@ -1526,8 +1526,7 @@ lazy_scan_prune(LVRelState *vacrel, > > * that everyone sees it as committed? > > */ > > xmin = HeapTupleHeaderGetXmin(htup); > > - if (!TransactionIdPrecedes(xmin, > > - vacrel->cutoffs.OldestXmin)) > > + if (!GlobalVisTestIsRemovableXid(vacrel->vistest, xmin)) > > { > > all_visible = false; > > break; > > Does GlobalVisTestIsRemovableXid() handle FrozenTransactionId correctly? Okay, so I thought a lot about this, and I don't understand how GlobalVisTestIsRemovableXid() would not handle FrozenTransactionId correctly. vacrel->cutoffs.OldestXmin is computed initially from GetOldestNonRemovableTransactionId() which uses ComputeXidHorizons(). GlobalVisState is updated by ComputeXidHorizons() (when it is updated). I do see that the comment above GlobalVisTestIsRemovableXid() says * It is crucial that this only gets called for xids from a source that * protects against xid wraparounds (e.g. from a table and thus protected by * relfrozenxid). and then in * Convert 32 bit argument to FullTransactionId. We can do so safely * because we know the xid has to, at the very least, be between * [oldestXid, nextXid), i.e. within 2 billion of xid. I'm not sure what oldestXid is here. It's true that I don't see GlobalVisTestIsRemovableXid() being called anywhere else with an xmin as an input. I think that hints that it is not safe with FrozenTransactionId. But I don't see what could go wrong. 
Maybe it has something to do with converting it to a FullTransactionId? FullTransactionIdFromU64(U64FromFullTransactionId(rel) + (int32) (xid - rel_xid)); Sorry, I couldn't quite figure it out :( > I read through all the patches in order, and aside from the above they > all look correct to me. Some comments on the set as whole: > > I don't think we need XLOG_HEAP2_FREEZE_PAGE as a separate record type > anymore. By removing that, you also get rid of the freeze-only codepath > near the end of heap_page_prune(), and the > heap_freeze_execute_prepared() function. That makes sense to me. > The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than > XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for > the case that there's no pruning, just freezing. The record format > (xl_heap_prune) looks pretty complex as it is, I think it could be made > both more compact and more clear with some refactoring. I'm happy to change up xl_heap_prune format. In its current form, according to pahole, it has no holes and just 3 bytes of padding at the end. One way we could make it smaller is by moving the isCatalogRel member to directly after snapshotConflictHorizon and then adding a flags field and defining flags to indicate whether or not other members exist at all. We could set bits for HAS_FREEZE_PLANS, HAS_REDIRECTED, HAS_UNUSED, HAS_DEAD. Then I would remove the non-optional uint16 nredirected, nunused, nplans, ndead and just put the number of redirected/unused/etc at the beginning of the arrays containing the offsets. Then I could write various macros for accessing them. That would make it smaller, but it definitely wouldn't make it less complex (IMO). > FreezeMultiXactId still takes a separate 'cutoffs' arg, but it could use > pagefrz->cutoffs now. Yep, I forgot to add a commit to do this. Thanks! > HeapPageFreeze has two "trackers", for the "freeze" and "no freeze" > cases. heap_page_prune() needs to track both, until it decides whether > to freeze or not. 
But it doesn't make much sense that the caller > (lazy_scan_prune()) has to initialize both, and has to choose which of > the values to use after the call depending on whether heap_page_prune() > froze or not. The two trackers should be just heap_page_prune()'s > internal business. > > HeapPageFreeze is a bit confusing in general, as it's both an input and > an output to heap_page_prune(). Not sure what exactly to do there, but I > feel that we should make heap_page_prune()'s interface more clear. > Perhaps move the output fields to PruneResult. Great point. Perhaps I just add NewRelfrozenXid and NewRelminMxid to PruneResult (and call it PruneFreezeResult) and then make VacuumCutoffs an optional argument to heap_page_prune() (used by vacuum and not on-access pruning). Then I eliminate HeapPageFreeze as a parameter altogether. > Let's rename heap_page_prune() to heap_page_prune_and_freeze(), as > freezing is now an integral part of the function. And mention it in the > function comment, too. Agreed. Will do in the next version. I want to get some consensus on what to do with xl_heap_prune before going on my rebase journey with these 15 patches. > In heap_prune_chain: > > > * Tuple visibility information is provided in presult->htsv. > > Not this patch's fault directly, but it's not immediate clear what "is > provided" means here. Does the caller provide it, or does the function > set it, ie. is it an input or output argument? Looking at the code, it's > an input, but now it looks a bit weird that an input argument is called > 'presult'. So, htsv is a member of PruneResult on master because heap_page_prune() populates PruneResult->htsv for use in lazy_scan_prune(). heap_prune_chain() doesn't have access to PruneResult on master. Once I move PruneResult to being populated both by heap_page_prune() and heap_prune_chain(), it gets more confusing. 
htsv is always populated in heap_page_prune(), but it is not until later patches in the set that I stop accessing it in lazy_scan_prune(). Once I do so, I move htsv from PruneResult into PruneState -- which fixes the heap_prune_chain() confusion. So, only intermediate commits in the set have PruneResult->htsv used in heap_prune_chain(). The end state is that heap_prune_chain() accesses PruneState->htsv. However, I can document how it is used more clearly in the function comment in the intermediate commits. Or, I can simply leave htsv as a separate input argument to heap_prune_chain() in the intermediate commits. - Melanie
On 09/03/2024 22:41, Melanie Plageman wrote: > On Wed, Mar 6, 2024 at 7:59 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> Does GlobalVisTestIsRemovableXid() handle FrozenTransactionId correctly? > > Okay, so I thought a lot about this, and I don't understand how > GlobalVisTestIsRemovableXid() would not handle FrozenTransactionId > correctly. > > vacrel->cutoffs.OldestXmin is computed initially from > GetOldestNonRemovableTransactionId() which uses ComputeXidHorizons(). > GlobalVisState is updated by ComputeXidHorizons() (when it is > updated). > > I do see that the comment above GlobalVisTestIsRemovableXid() says > > * It is crucial that this only gets called for xids from a source that > * protects against xid wraparounds (e.g. from a table and thus protected by > * relfrozenxid). > > and then in > > * Convert 32 bit argument to FullTransactionId. We can do so safely > * because we know the xid has to, at the very least, be between > * [oldestXid, nextXid), i.e. within 2 billion of xid. > > I'm not sure what oldestXid is here. > It's true that I don't see GlobalVisTestIsRemovableXid() being called > anywhere else with an xmin as an input. I think that hints that it is > not safe with FrozenTransactionId. But I don't see what could go > wrong. > > Maybe it has something to do with converting it to a FullTransactionId? > > FullTransactionIdFromU64(U64FromFullTransactionId(rel) + (int32) > (xid - rel_xid)); > > Sorry, I couldn't quite figure it out :(

I just tested it. No, GlobalVisTestIsRemovableXid does not work for FrozenTransactionId. I just tested it with state->definitely_needed == {0, 4000000000} and xid == FrozenTransactionId, and it incorrectly returned 'false'. It treats FrozenTransactionId as if it was a regular xid '2'.

>> The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than >> XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for >> the case that there's no pruning, just freezing.
The record format >> (xl_heap_prune) looks pretty complex as it is, I think it could be made >> both more compact and more clear with some refactoring. > > I'm happy to change up xl_heap_prune format. In its current form, > according to pahole, it has no holes and just 3 bytes of padding at > the end. > > One way we could make it smaller is by moving the isCatalogRel member > to directly after snapshotConflictHorizon and then adding a flags > field and defining flags to indicate whether or not other members > exist at all. We could set bits for HAS_FREEZE_PLANS, HAS_REDIRECTED, > HAS_UNUSED, HAS_DEAD. Then I would remove the non-optional uint16 > nredirected, nunused, nplans, ndead and just put the number of > redirected/unused/etc at the beginning of the arrays containing the > offsets. Sounds good. > Then I could write various macros for accessing them. That > would make it smaller, but it definitely wouldn't make it less complex > (IMO). I don't know, it might turn out not that complex. If you define the formats of each of those "sub-record types" as explicit structs, per attached sketch, you won't need so many macros. Some care is still needed with alignment though. -- Heikki Linnakangas Neon (https://neon.tech)
Attachment
On Mon, Mar 11, 2024 at 6:38 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 09/03/2024 22:41, Melanie Plageman wrote: > > On Wed, Mar 6, 2024 at 7:59 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > >> Does GlobalVisTestIsRemovableXid() handle FrozenTransactionId correctly? > > > > Okay, so I thought a lot about this, and I don't understand how > > GlobalVisTestIsRemovableXid() would not handle FrozenTransactionId > > correctly. > > > > vacrel->cutoffs.OldestXmin is computed initially from > > GetOldestNonRemovableTransactionId() which uses ComputeXidHorizons(). > > GlobalVisState is updated by ComputeXidHorizons() (when it is > > updated). > > > > I do see that the comment above GlobalVisTestIsRemovableXid() says > > > > * It is crucial that this only gets called for xids from a source that > > * protects against xid wraparounds (e.g. from a table and thus protected by > > * relfrozenxid). > > > > and then in > > > > * Convert 32 bit argument to FullTransactionId. We can do so safely > > * because we know the xid has to, at the very least, be between > > * [oldestXid, nextXid), i.e. within 2 billion of xid. > > > > I'm not sure what oldestXid is here. > > It's true that I don't see GlobalVisTestIsRemovableXid() being called > > anywhere else with an xmin as an input. I think that hints that it is > > not safe with FrozenTransactionId. But I don't see what could go > > wrong. > > > > Maybe it has something to do with converting it to a FullTransactionId? > > > > FullTransactionIdFromU64(U64FromFullTransactionId(rel) + (int32) > > (xid - rel_xid)); > > > > Sorry, I couldn't quite figure it out :( > > I just tested it. No, GlobalVisTestIsRemovableXid does not work for > FrozenTransactionId. I just tested it with state->definitely_needed == > {0, 4000000000} and xid == FrozenTransactionid, and it incorrectly > returned 'false'. It treats FrozenTransactionId as if was a regular xid '2'. I see. 
Looking at the original code:

if (!TransactionIdPrecedes(xmin, vacrel->cutoffs.OldestXmin))

And the source code for TransactionIdPrecedes:

if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
    return (id1 < id2);

diff = (int32) (id1 - id2);
return (diff < 0);

In your example, it seems like you mean GlobalVisState->maybe_needed is 0 and GlobalVisState->definitely_needed = 4000000000. In this example, if vacrel->cutoffs.OldestXmin was 0, we would get a wrong answer also.

But, I do see that the comparison done by TransactionIdPrecedes() is quite different from that done by FullTransactionIdPrecedes() because of the modulo 2**32 arithmetic.

Solving the handling of FrozenTransactionId specifically by GlobalVisTestIsRemovableXid() seems like it would be done easily in our case by wrapping it in a function which checks if TransactionIdIsNormal() and returns true if it is not normal. But, I'm not sure if I am missing the larger problem.

> >> The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than > >> XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for > >> the case that there's no pruning, just freezing. The record format > >> (xl_heap_prune) looks pretty complex as it is, I think it could be made > >> both more compact and more clear with some refactoring. > > > > I'm happy to change up xl_heap_prune format. In its current form, > > according to pahole, it has no holes and just 3 bytes of padding at > > the end. > > > > One way we could make it smaller is by moving the isCatalogRel member > > to directly after snapshotConflictHorizon and then adding a flags > > field and defining flags to indicate whether or not other members > > exist at all. We could set bits for HAS_FREEZE_PLANS, HAS_REDIRECTED, > > HAS_UNUSED, HAS_DEAD. Then I would remove the non-optional uint16 > > nredirected, nunused, nplans, ndead and just put the number of > > redirected/unused/etc at the beginning of the arrays containing the > > offsets. > > Sounds good.
> > > Then I could write various macros for accessing them. That > > would make it smaller, but it definitely wouldn't make it less complex > > (IMO). > > I don't know, it might turn out not that complex. If you define the > formats of each of those "sub-record types" as explicit structs, per > attached sketch, you won't need so many macros. Some care is still > needed with alignment though.

In the attached v2, I've done as you suggested and made all members except flags and snapshotConflictHorizon optional in the xl_heap_prune struct and obsoleted the xl_heap_freeze struct. I've kept the actual xl_heap_freeze_page struct and heap_xlog_freeze_page() function so that we can replay previously made XLOG_HEAP2_FREEZE_PAGE records.

Because we may set line pointers unused during vacuum's first pass, I couldn't use the presence of nowunused without redirected or dead items to indicate that this was a record emitted by vacuum's second pass. As such, I haven't obsoleted the xl_heap_vacuum struct. I was thinking I could add a flag that indicates the record was emitted by vacuum's second pass? We would want to distinguish this so that we could set the items unused without calling heap_page_prune_execute() -- because that calls PageRepairFragmentation() which requires a full cleanup lock.

I introduced a few sub-record types similar to what you suggested -- they help a bit with alignment, so I think they are worth keeping. There are comments around them, but perhaps a larger diagram of the layout of the new XLOG_HEAP2_PRUNE record would be helpful.

There is a bit of duplicated code between heap_xlog_prune() and heap2_desc() since they both need to deserialize the record. Before, the code to do this was small and it didn't matter, but it might be worth refactoring it that way now.

Note that I've made all of the changes related to obsoleting the XLOG_HEAP2_FREEZE_PAGE record in separate commits on top of the rest of the set for ease of review.
However, I've rebased the other review feedback into the relevant commits.

On Wed, Mar 6, 2024 at 7:59 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > I don't think we need XLOG_HEAP2_FREEZE_PAGE as a separate record type > anymore. By removing that, you also get rid of the freeze-only codepath > near the end of heap_page_prune(), and the > heap_freeze_execute_prepared() function. > > The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than > XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for > the case that there's no pruning, just freezing. The record format > (xl_heap_prune) looks pretty complex as it is, I think it could be made > both more compact and more clear with some refactoring.

On the point of removing the freeze-only code path from heap_page_prune() (now heap_page_prune_and_freeze()): while doing this, I realized that heap_pre_freeze_checks() was not being called in the case that we decide to freeze because we emitted an FPI while setting the hint bit. I've fixed that; however, I've done so by moving heap_pre_freeze_checks() into the critical section. I think that is not okay? I could move it earlier and not call it when the hint bit FPI leads us to freeze tuples. But, I think that would lead to us doing a lot less validation of tuples being frozen when checksums are enabled. Or, I could make two critical sections?

> FreezeMultiXactId still takes a separate 'cutoffs' arg, but it could use > pagefrz->cutoffs now.

Fixed this.

> HeapPageFreeze has two "trackers", for the "freeze" and "no freeze" > cases. heap_page_prune() needs to track both, until it decides whether > to freeze or not. But it doesn't make much sense that the caller > (lazy_scan_prune()) has to initialize both, and has to choose which of > the values to use after the call depending on whether heap_page_prune() > froze or not. The two trackers should be just heap_page_prune()'s > internal business.
I've added new_relminmxid and new_relfrozenxid to PruneFreezeResult and set them appropriately in heap_page_prune_and_freeze(). It's a bit sad because, if it wasn't for vacrel->skippedallvis, vacrel->NewRelfrozenXid and vacrel->NewRelminMxid would be vacrel->cutoffs.OldestXmin and vacrel->cutoffs.OldestMxact respectively, and we could avoid having lazy_scan_prune() initialize the HeapPageFreeze altogether and just pass VacuumCutoffs (and heap_page_prune_opt() could pass NULL) to heap_page_prune_and_freeze(). I think it is probably worse to add both of them as additional optional arguments, so I've just left lazy_scan_prune() with the job of initializing them.

In heap_page_prune_and_freeze(), I initialize new_relminmxid and new_relfrozenxid to InvalidMultiXactId and InvalidTransactionId respectively because on-access pruning doesn't have a value to set them to. But I wasn't sure if this was okay -- since I don't see any validation with TransactionIdIsValid() in vac_update_relstats(). It will work now -- just worried about future issues. I could add an assert there?

> HeapPageFreeze is a bit confusing in general, as it's both an input and > an output to heap_page_prune(). Not sure what exactly to do there, but I > feel that we should make heap_page_prune()'s interface more clear. > Perhaps move the output fields to PruneResult.

HeapPageFrz is now only an input argument to heap_page_prune_and_freeze() as of the commit in which heap_page_prune_and_freeze() becomes responsible for executing freezing. It is still an in/out param in earlier commits because we still need info from it to execute freezing in lazy_scan_prune().

> Let's rename heap_page_prune() to heap_page_prune_and_freeze(), as > freezing is now an integral part of the function. And mention it in the > function comment, too.

I've done this.

> In heap_prune_chain: > > > * Tuple visibility information is provided in presult->htsv.
> > Not this patch's fault directly, but it's not immediate clear what "is > provided" means here. Does the caller provide it, or does the function > set it, ie. is it an input or output argument? Looking at the code, it's > an input, but now it looks a bit weird that an input argument is called > 'presult'. I haven't updated the comments about this in the intermediate commits since it ends up in the PruneState as an input. - Melanie
Attachment
- v2-0017-Streamline-XLOG_HEAP2_PRUNE-record.patch
- v2-0001-lazy_scan_prune-tests-tuple-vis-with-GlobalVisTes.patch
- v2-0002-Pass-heap_prune_chain-PruneResult-output-paramete.patch
- v2-0003-heap_page_prune-sets-all_visible-and-frz_conflict.patch
- v2-0004-Add-reference-to-VacuumCutoffs-in-HeapPageFreeze.patch
- v2-0005-Prepare-freeze-tuples-in-heap_page_prune.patch
- v2-0006-lazy_scan_prune-reorder-freeze-execution-logic.patch
- v2-0007-Execute-freezing-in-heap_page_prune.patch
- v2-0008-Make-opp-freeze-heuristic-compatible-with-prune-f.patch
- v2-0009-Separate-tuple-pre-freeze-checks-and-invoke-earli.patch
- v2-0010-Inline-heap_freeze_execute_prepared.patch
- v2-0011-Exit-heap_page_prune-early-if-no-prune.patch
- v2-0012-Merge-prune-and-freeze-records.patch
- v2-0013-Set-hastup-in-heap_page_prune.patch
- v2-0014-Count-tuples-for-vacuum-logging-in-heap_page_prun.patch
- v2-0015-Save-dead-tuple-offsets-during-heap_page_prune.patch
- v2-0016-Obsolete-XLOG_HEAP2_FREEZE_PAGE.patch
On Wed, Mar 13, 2024 at 07:25:56PM -0400, Melanie Plageman wrote: > On Mon, Mar 11, 2024 at 6:38 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > > > On 09/03/2024 22:41, Melanie Plageman wrote: > > > On Wed, Mar 6, 2024 at 7:59 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > >> Does GlobalVisTestIsRemovableXid() handle FrozenTransactionId correctly? > > > > > > Okay, so I thought a lot about this, and I don't understand how > > > GlobalVisTestIsRemovableXid() would not handle FrozenTransactionId > > > correctly. > > > > > > vacrel->cutoffs.OldestXmin is computed initially from > > > GetOldestNonRemovableTransactionId() which uses ComputeXidHorizons(). > > > GlobalVisState is updated by ComputeXidHorizons() (when it is > > > updated). > > > > > > I do see that the comment above GlobalVisTestIsRemovableXid() says > > > > > > * It is crucial that this only gets called for xids from a source that > > > * protects against xid wraparounds (e.g. from a table and thus protected by > > > * relfrozenxid). > > > > > > and then in > > > > > > * Convert 32 bit argument to FullTransactionId. We can do so safely > > > * because we know the xid has to, at the very least, be between > > > * [oldestXid, nextXid), i.e. within 2 billion of xid. > > > > > > I'm not sure what oldestXid is here. > > > It's true that I don't see GlobalVisTestIsRemovableXid() being called > > > anywhere else with an xmin as an input. I think that hints that it is > > > not safe with FrozenTransactionId. But I don't see what could go > > > wrong. > > > > > > Maybe it has something to do with converting it to a FullTransactionId? > > > > > > FullTransactionIdFromU64(U64FromFullTransactionId(rel) + (int32) > > > (xid - rel_xid)); > > > > > > Sorry, I couldn't quite figure it out :( > > > > I just tested it. No, GlobalVisTestIsRemovableXid does not work for > > FrozenTransactionId. 
I just tested it with state->definitely_needed == > > {0, 4000000000} and xid == FrozenTransactionid, and it incorrectly > > returned 'false'. It treats FrozenTransactionId as if was a regular xid '2'. > > I see. Looking at the original code: > if (!TransactionIdPrecedes(xmin, > vacrel->cutoffs.OldestXmin)) > > And the source code for TransactionIdPrecedes: > > if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2)) > return (id1 < id2); > > diff = (int32) (id1 - id2); > return (diff < 0); > > In your example, It seems like you mean GlobalVisState->maybe_needed is > 0 and GlobalVisState->definitely_needed = 4000000000. In this example, > if vacrel->cutoffs.OldestXmin was 0, we would get a wrong answer also. > > But, I do see that the comparison done by TransactionIdPrecedes() is is > quite different than that done by FullTransactionIdPrecedes() because of > the modulo 2**32 arithmetic. > > Solving the handling of FrozenTransactionId specifically by > GlobalVisTestIsRemovableXid() seems like it would be done easily in our > case by wrapping it in a function which checks if > TransactionIdIsNormal() and returns true if it is not normal. But, I'm > not sure if I am missing the larger problem. I've made such a wrapper in attached v3. > > >> The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than > > >> XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for > > >> the case that there's no pruning, just freezing. The record format > > >> (xl_heap_prune) looks pretty complex as it is, I think it could be made > > >> both more compact and more clear with some refactoring. > > > > > > I'm happy to change up xl_heap_prune format. In its current form, > > > according to pahole, it has no holes and just 3 bytes of padding at > > > the end. 
> > > > > > One way we could make it smaller is by moving the isCatalogRel member > > > to directly after snapshotConflictHorizon and then adding a flags > > > field and defining flags to indicate whether or not other members > > > exist at all. We could set bits for HAS_FREEZE_PLANS, HAS_REDIRECTED, > > > HAS_UNUSED, HAS_DEAD. Then I would remove the non-optional uint16 > > > nredirected, nunused, nplans, ndead and just put the number of > > > redirected/unused/etc at the beginning of the arrays containing the > > > offsets. > > > > Sounds good. > > > > > Then I could write various macros for accessing them. That > > > would make it smaller, but it definitely wouldn't make it less complex > > > (IMO). > > > > I don't know, it might turn out not that complex. If you define the > > formats of each of those "sub-record types" as explicit structs, per > > attached sketch, you won't need so many macros. Some care is still > > needed with alignment though. > > In the attached v2, I've done as you suggested and made all members > except flags and snapshotConflictHorizon optional in the xl_heap_prune > struct and obsoleted the xl_heap_freeze struct. I've kept the actual > xl_heap_freeze_page struct and heap_xlog_freeze_page() function so that > we can replay previously made XLOG_HEAP2_FREEZE_PAGE records. > > Because we may set line pointers unused during vacuum's first pass, I > couldn't use the presence of nowunused without redirected or dead items > to indicate that this was a record emitted by vacuum's second pass. As > such, I haven't obsoleted the xl_heap_vacuum struct. I was thinking I > could add a flag that indicates the record was emitted by vacuum's > second pass? We would want to distinguish this so that we could set the > items unused without calling heap_page_prune_execute() -- because that > calls PageRepairFragmentation() which requires a full cleanup lock. 
Okay, so I was going to start using xl_heap_prune for vacuum here too, but I realized it would be bigger because of the snapshotConflictHorizon. Do you think there is a non-terrible way to make the snapshotConflictHorizon optional? Like with a flag? > I introduced a few sub-record types similar to what you suggested -- > they help a bit with alignment, so I think they are worth keeping. There > are comments around them, but perhaps a larger diagram of the layout of > the new XLOG_HEAP2_PRUNE record would be helpful. I started doing this, but I can't find a way of laying out the diagram that pgindent doesn't destroy. I thought I remembered other diagrams in the source code showing the layout of something (something with pages somewhere?) that don't get messed up by pgindent, but I couldn't find them. > There is a bit of duplicated code between heap_xlog_prune() and > heap2_desc() since they both need to deserialize the record. Before, the > code to do this was small and it didn't matter, but it might be worth > refactoring it that way now. I have added a helper function to do the deserialization, heap_xlog_deserialize_prune_and_freeze(). But I didn't start using it in heap2_desc() because of the way the pg_waldump build file works. Do you think the helper belongs in any of waldump's existing sources?

pg_waldump_sources = files(
  'compat.c',
  'pg_waldump.c',
  'rmgrdesc.c',
)

pg_waldump_sources += rmgr_desc_sources
pg_waldump_sources += xlogreader_sources
pg_waldump_sources += files('../../backend/access/transam/xlogstats.c')

Otherwise, I assume I am supposed to avoid adding some big new include to waldump. 
> > > > The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than > > XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for > > the case that there's no pruning, just freezing. The record format > > (xl_heap_prune) looks pretty complex as it is, I think it could be made > > both more compact and more clear with some refactoring. > > On the point of removing the freeze-only code path from > heap_page_prune() (now heap_page_prune_and_freeze()): while doing this, > I realized that heap_pre_freeze_checks() was not being called in the > case that we decide to freeze because we emitted an FPI while setting > the hint bit. I've fixed that, however, I've done so by moving > heap_pre_freeze_checks() into the critical section. I think that is not > okay? I could move it earlier and not call it when the hint bit FPI > leads us to freeze tuples. But, I think that would lead to us doing a > lot less validation of tuples being frozen when checksums are enabled. > Or, I could make two critical sections? I found another approach and just do the pre-freeze checks whenever we are considering freezing for any criterion other than the FPI one. > > HeapPageFreeze has two "trackers", for the "freeze" and "no freeze" > > cases. heap_page_prune() needs to track both, until it decides whether > > to freeze or not. But it doesn't make much sense that the caller > > (lazy_scan_prune()) has to initialize both, and has to choose which of > > the values to use after the call depending on whether heap_page_prune() > > froze or not. The two trackers should be just heap_page_prune()'s > > internal business. > > I've added new_relminmxid and new_relfrozenxid to PruneFreezeResult and > set them appropriately in heap_page_prune_and_freeze(). 
> > It's a bit sad because if it wasn't for vacrel->skippedallvis, > vacrel->NewRelfrozenXid and vacrel->NewRelminMxid would be > vacrel->cutoffs.OldestXmin and vacrel->cutoffs.OldestMxact respectively > and we could avoid having lazy_scan_prune() initializing the > HeapPageFreeze altogether and just pass VacuumCutoffs (and > heap_page_prune_opt() could pass NULL) to heap_page_prune_and_freeze(). > > I think it is probably worse to add both of them as additional optional > arguments, so I've just left lazy_scan_prune() with the job of > initializing them. > > In heap_page_prune_and_freeze(), I initialize new_relminmxid and > new_relfrozenxid to InvalidMultiXactId and InvalidTransactionId > respectively because on-access pruning doesn't have a value to set them > to. But I wasn't sure if this was okay -- since I don't see a > TransactionIdIsValid() check in vac_update_relstats(). It will work now > -- just worried about future issues. I could add an assert there? I looked more closely at vac_update_relstats() and it won't update relfrozenxid unless the proposed value is smaller than the existing one. And for MultiXactIds, it checks if it is valid. So, this is not an issue. I've also optimized the member ordering of PruneFreezeResult. pahole identified some avoidable holes. I went through each commit and optimized the PruneResult data structure as members are being added and removed. - Melanie
Attachment
- v3-0001-lazy_scan_prune-tests-tuple-vis-with-GlobalVisTes.patch
- v3-0002-Pass-heap_prune_chain-PruneResult-output-paramete.patch
- v3-0003-heap_page_prune-sets-all_visible-and-frz_conflict.patch
- v3-0004-Add-reference-to-VacuumCutoffs-in-HeapPageFreeze.patch
- v3-0005-Prepare-freeze-tuples-in-heap_page_prune.patch
- v3-0006-lazy_scan_prune-reorder-freeze-execution-logic.patch
- v3-0007-Execute-freezing-in-heap_page_prune.patch
- v3-0008-Make-opp-freeze-heuristic-compatible-with-prune-f.patch
- v3-0009-Separate-tuple-pre-freeze-checks-and-invoke-earli.patch
- v3-0010-Inline-heap_freeze_execute_prepared.patch
- v3-0011-Exit-heap_page_prune-early-if-no-prune.patch
- v3-0012-Merge-prune-and-freeze-records.patch
- v3-0013-Set-hastup-in-heap_page_prune.patch
- v3-0014-Count-tuples-for-vacuum-logging-in-heap_page_prun.patch
- v3-0015-Save-dead-tuple-offsets-during-heap_page_prune.patch
- v3-0016-Obsolete-XLOG_HEAP2_FREEZE_PAGE.patch
- v3-0017-Streamline-XLOG_HEAP2_PRUNE-record.patch
On 15/03/2024 02:56, Melanie Plageman wrote: > Okay, so I was going to start using xl_heap_prune for vacuum here too, > but I realized it would be bigger because of the > snapshotConflictHorizon. Do you think there is a non-terrible way to > make the snapshotConflictHorizon optional? Like with a flag? Yeah, another flag would do the trick. >> I introduced a few sub-record types similar to what you suggested -- >> they help a bit with alignment, so I think they are worth keeping. There >> are comments around them, but perhaps a larger diagram of the layout of >> the new XLOG_HEAP2_PRUNE record would be helpful. > > I started doing this, but I can't find a way of laying out the diagram > that pgindent doesn't destroy. I thought I remembered other diagrams in > the source code showing the layout of something (something with pages > somewhere?) that don't get messed up by pgindent, but I couldn't find > them. See src/tools/pgindent/README, section "Cleaning up in case of failure or ugly output":

/*----------
 * Text here will not be touched by pgindent.
 */

>> There is a bit of duplicated code between heap_xlog_prune() and >> heap2_desc() since they both need to deserialize the record. Before the >> code to do this was small and it didn't matter, but it might be worth >> refactoring it that way now. > > I have added a helper function to do the deserialization, > heap_xlog_deserialize_prune_and_freeze(). But I didn't start using it in > heap2_desc() because of the way the pg_waldump build file works. Do you > think the helper belongs in any of waldump's existing sources? > > pg_waldump_sources = files( > 'compat.c', > 'pg_waldump.c', > 'rmgrdesc.c', > ) > > pg_waldump_sources += rmgr_desc_sources > pg_waldump_sources += xlogreader_sources > pg_waldump_sources += files('../../backend/access/transam/xlogstats.c') > > Otherwise, I assume I am supposed to avoid adding some big new include to > waldump. 
Didn't look closely at that, but there's some precedent with commit/prepare/abort records. See ParseCommitRecord, xl_xact_commit, xl_parsed_commit et al. Note that we don't provide WAL compatibility across major versions. You can fully remove the old xl_heap_freeze_page format. (We should bump XLOG_PAGE_MAGIC when this is committed though) >> On the point of removing the freeze-only code path from >> heap_page_prune() (now heap_page_prune_and_freeze()): while doing this, >> I realized that heap_pre_freeze_checks() was not being called in the >> case that we decide to freeze because we emitted an FPI while setting >> the hint bit. I've fixed that, however, I've done so by moving >> heap_pre_freeze_checks() into the critical section. I think that is not >> okay? I could move it earlier and not do call it when the hint bit FPI >> leads us to freeze tuples. But, I think that would lead to us doing a >> lot less validation of tuples being frozen when checksums are enabled. >> Or, I could make two critical sections? > > I found another approach and just do the pre-freeze checks if we are > considering freezing except for the FPI criteria. Hmm, I think you can make this simpler if you use XLogCheckBufferNeedsBackup() to check if the hint-bit update will generate a full-page-image before entering the critical section. Like you did to check if pruning will generate a full-page-image. I included that change in the attached patches. I don't fully understand this: > /* > * If we will freeze tuples on the page or they were all already frozen > * on the page, if we will set the page all-frozen in the visibility map, > * we can advance relfrozenxid and relminmxid to the values in > * pagefrz->FreezePageRelfrozenXid and pagefrz->FreezePageRelminMxid. 
> */ > if (presult->all_frozen || presult->nfrozen > 0) > { > presult->new_relfrozenxid = pagefrz->FreezePageRelfrozenXid; > presult->new_relminmxid = pagefrz->FreezePageRelminMxid; > } > else > { > presult->new_relfrozenxid = pagefrz->NoFreezePageRelfrozenXid; > presult->new_relminmxid = pagefrz->NoFreezePageRelminMxid; > } Firstly, the comment is outdated, because we have already done the freezing at this point. But more importantly, I don't understand the logic here. Could we check just presult->nfrozen > 0 here and get the same result? >> I think it is probably worse to add both of them as additional optional >> arguments, so I've just left lazy_scan_prune() with the job of >> initializing them. Ok. Here are some patches on top of your patches for some further refactorings. Some notable changes in heap_page_prune_and_freeze(): - I moved the heap_prepare_freeze_tuple() work from the 2nd loop to the 1st one. It seems more clear and efficient that way. - I extracted the code to generate the WAL record to a separate function. - Refactored the "will setting hint bit generate FPI" check as discussed above These patches are in a very rough WIP state, but I wanted to share early. I haven't done much testing, and I'm not wedded to these changes, but I think they make it more readable. Please include / squash in the patch set if you agree with them. Please also take a look at the comments I marked with HEIKKI or FIXME, in the patches and commit messages. I'll wait for a new version from you before reviewing more. -- Heikki Linnakangas Neon (https://neon.tech)
Attachment
On Mon, Mar 18, 2024 at 01:15:21AM +0200, Heikki Linnakangas wrote: > On 15/03/2024 02:56, Melanie Plageman wrote: > > Okay, so I was going to start using xl_heap_prune for vacuum here too, > > but I realized it would be bigger because of the > > snapshotConflictHorizon. Do you think there is a non-terrible way to > > make the snapshotConflictHorizon optional? Like with a flag? > > Yeah, another flag would do the trick. Okay, I've done this in attached v4 (including removing XLOG_HEAP2_VACUUM). I had to put the snapshot conflict horizon in the "main chunk" of data available at replay regardless of whether or not the record ended up including an FPI. I made it its own sub-record (xlhp_conflict_horizon) less to help with alignment (though we can use all the help we can get there) and more to keep it from getting lost. When you look at heapam_xlog.h, you can see what a XLOG_HEAP2_PRUNE record will contain starting with the xl_heap_prune struct and then all the sub-record types. > > > I introduced a few sub-record types similar to what you suggested -- > > > they help a bit with alignment, so I think they are worth keeping. There > > > are comments around them, but perhaps a larger diagram of the layout of > > > the new XLOG_HEAP2_PRUNE record would be helpful. > > > > I started doing this, but I can't find a way of laying out the diagram > > that pgindent doesn't destroy. I thought I remembered other diagrams in > > the source code showing the layout of something (something with pages > > somewhere?) that don't get messed up by pgindent, but I couldn't find > > them. > > See src/tools/pgindent/README, section "Cleaning up in case of failure or > ugly output": > > /*---------- > * Text here will not be touched by pgindent. > */ > Cool. This version doesn't include the spiffy drawing I promised yet. > Note that we don't provide WAL compatibility across major versions. You can > fully remove the old xl_heap_freeze_page format. 
(We should bump > XLOG_PAGE_MAGIC when this is committed though) I've removed the xl_heap_freeze (and xl_heap_prune). I didn't bump XLOG_PAGE_MAGIC. > > > On the point of removing the freeze-only code path from > > > heap_page_prune() (now heap_page_prune_and_freeze()): while doing this, > > > I realized that heap_pre_freeze_checks() was not being called in the > > > case that we decide to freeze because we emitted an FPI while setting > > > the hint bit. I've fixed that, however, I've done so by moving > > > heap_pre_freeze_checks() into the critical section. I think that is not > > > okay? I could move it earlier and not do call it when the hint bit FPI > > > leads us to freeze tuples. But, I think that would lead to us doing a > > > lot less validation of tuples being frozen when checksums are enabled. > > > Or, I could make two critical sections? > > > > I found another approach and just do the pre-freeze checks if we are > > considering freezing except for the FPI criteria. > > Hmm, I think you can make this simpler if you use > XLogCheckBufferNeedsBackup() to check if the hint-bit update will generate a > full-page-image before entering the critical section. Like you did to check > if pruning will generate a full-page-image. I included that change in the > attached patches. I used your proposed structure. You had XLogCheckBufferNeedsBackup() twice in your proposed version a few lines apart. I don't think there is any point in checking it twice. If we are going to rely on XLogCheckBufferNeedsBackup() to tell us whether or not setting the hint bit is *likely* to emit an FPI, then we might as well just call XLogCheckBufferNeedsBackup() once. 
> I don't fully understand this: > > > /* > > * If we will freeze tuples on the page or they were all already frozen > > * on the page, if we will set the page all-frozen in the visibility map, > > * we can advance relfrozenxid and relminmxid to the values in > > * pagefrz->FreezePageRelfrozenXid and pagefrz->FreezePageRelminMxid. > > */ > > if (presult->all_frozen || presult->nfrozen > 0) > > { > > presult->new_relfrozenxid = pagefrz->FreezePageRelfrozenXid; > > presult->new_relminmxid = pagefrz->FreezePageRelminMxid; > > } > > else > > { > > presult->new_relfrozenxid = pagefrz->NoFreezePageRelfrozenXid; > > presult->new_relminmxid = pagefrz->NoFreezePageRelminMxid; > > } > > Firstly, the comment is outdated, because we have already done the freezing > at this point. But more importantly, I don't understand the logic here. > Could we check just presult->nfrozen > 0 here and get the same result? I've updated the comment. The reason I had if (presult->all_frozen || presult->nfrozen > 0) was because of this comment in heapam.h in the HeapPageFreeze struct * "Freeze" NewRelfrozenXid/NewRelminMxid trackers. * * Trackers used when heap_freeze_execute_prepared freezes, or when there * are zero freeze plans for a page. It is always valid for vacuumlazy.c * to freeze any page, by definition. This even includes pages that have * no tuples with storage to consider in the first place. That way the * 'totally_frozen' results from heap_prepare_freeze_tuple can always be * used in the same way, even when no freeze plans need to be executed to * "freeze the page". Only the "freeze" path needs to consider the need * to set pages all-frozen in the visibility map under this scheme. * * When we freeze a page, we generally freeze all XIDs < OldestXmin, only * leaving behind XIDs that are ineligible for freezing, if any. 
And so * you might wonder why these trackers are necessary at all; why should * _any_ page that VACUUM freezes _ever_ be left with XIDs/MXIDs that * ratchet back the top-level NewRelfrozenXid/NewRelminMxid trackers? * * It is useful to use a definition of "freeze the page" that does not * overspecify how MultiXacts are affected. heap_prepare_freeze_tuple * generally prefers to remove Multis eagerly, but lazy processing is used * in cases where laziness allows VACUUM to avoid allocating a new Multi. * The "freeze the page" trackers enable this flexibility. */ So, I don't really know if it is right to just check presult->nfrozen > 0 when updating relminmxid. I have changed it to the way you suggested. But we can change it back. > Here are some patches on top of your patches for some further refactorings. > Some notable changes in heap_page_prune_and_freeze(): > > - I moved the heap_prepare_freeze_tuple() work from the 2nd loop to the 1st > one. It seems more clear and efficient that way. cool. I kept this. > - I extracted the code to generate the WAL record to a separate function. cool. kept this. > These patches are in a very rough WIP state, but I wanted to share early. I > haven't done much testing, and I'm not wedded to these changes, but I think > they make it more readable. Please include / squash in the patch set if you > agree with them. I've squashed the changes into and across my nineteen patches :) I cleaned up your suggestions a bit and made a few stylistic choices. In this version, I also broke up the last couple commits which streamlined the WAL record and eliminated XLOG_HEAP2_FREEZE/VACUUM and redistributed those changes in a way that I thought made sense. Now, the progression is that in one commit we merge the prune and freeze record, eliminating the XLOG_HEAP2_FREEZE record. Then, in another commit, we eliminate the XLOG_HEAP2_VACUUM record. 
Then a later commit streamlines the new mega xl_heap_prune struct into the variable size structure based on which modifications it includes. > From 622620a7875ae8c1626e9cd118156e0c734d44ed Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Sun, 17 Mar 2024 22:52:28 +0200 > Subject: [PATCH v3heikki 1/4] Inline heap_frz_conflict_horizon() to the > caller. > > FIXME: This frz_conflict_horizon business looks fishy to me. We have: > - local frz_conflict_horizon variable, > - presult->frz_conflict_horizon, and > - prstate.snapshotConflictHorizon Yea, this is a mistake I made when I was rebasing some changes in. The local variable is gone now. > should we really have all three, and what are the differences? We do need both the prstate.snapshotConflictHorizon and the presult->frz_conflict_horizon because the youngest freezable xmin will often be different than the oldest removable xmax, so we have to track both and take the most conservative one if we are both freezing and pruning. The third (local variable) one was an oops. > From 0219842487931f899abcf183c863c43270c098f0 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Sun, 17 Mar 2024 23:05:31 +0200 > Subject: [PATCH v3heikki 2/4] Misc cleanup > --- > src/backend/access/heap/pruneheap.c | 16 +++++++--------- > src/backend/access/heap/vacuumlazy.c | 7 ++++++- > 2 files changed, 13 insertions(+), 10 deletions(-) > diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c > index 8d3723faf3a..d3df7029667 100644 > --- a/src/backend/access/heap/vacuumlazy.c > +++ b/src/backend/access/heap/vacuumlazy.c > @@ -433,7 +433,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params, > * heap_page_prune_and_freeze(). We expect vistest will always make > * heap_page_prune_and_freeze() remove any deleted tuple whose xmax is < > * OldestXmin. 
lazy_scan_prune must never become confused about whether a > - * tuple should be frozen or removed. (In the future we might want to > + * tuple should be frozen or removed. > + * HEIKKI: Is such confusion possible anymore? > + * (In the future we might want to > * teach lazy_scan_prune to recompute vistest from time to time, to > * increase the number of dead tuples it can prune away.) TBH, I don't really know what this comment is even saying. But lazy_scan_prune() doesn't do any freezing anymore, so I removed this sentence. > */ > @@ -1387,6 +1389,9 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno, > * so we could deal with tuples that were DEAD to VACUUM, but nevertheless were > * left with storage after pruning. > * > + * HEIKKI: does the above paragraph still make sense? We don't call > + * HeapTupleSatisfiesVacuum() here anymore > + * Yea, this whole comment definitely doesn't belong here anymore. Even though we are calling HeapTupleSatisfiesVacuum() (from inside heap_prune_satisfies_vacuum()) inside heap_page_prune_and_freeze(), the comment really doesn't fit anywhere in there either. The comment is describing a situation that is no longer possible. So keeping a description of a now-impossible situation in a part of the code where it never could have happened does not make sense to me. I've removed the comment. 
Without the > * second call to HeapTupleSatisfiesVacuum(), there is no new HTSV_Result and > -- > 2.39.2 > > From d72cebf13f9866112309883f72a382fc2cb57e17 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Sun, 17 Mar 2024 23:08:42 +0200 > Subject: [PATCH v3heikki 3/4] Move work to the first loop > src/backend/access/heap/pruneheap.c | 141 ++++++++++------------------ > 1 file changed, 52 insertions(+), 89 deletions(-) > > diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c > index b3573bb628b..3541628799a 100644 > --- a/src/backend/access/heap/pruneheap.c > +++ b/src/backend/access/heap/pruneheap.c > @@ -78,9 +78,6 @@ static int heap_prune_chain(Buffer buffer, > PruneState *prstate, PruneFreezeResult *presult); > > static inline HTSV_Result htsv_get_valid_status(int status); > -static void prune_prepare_freeze_tuple(Page page, OffsetNumber offnum, PruneState *prstate, > - HeapPageFreeze *pagefrz, HeapTupleFreeze *frozen, > - PruneFreezeResult *presult); > static void heap_prune_record_prunable(PruneState *prstate, TransactionId xid); > static void heap_prune_record_redirect(PruneState *prstate, > OffsetNumber offnum, OffsetNumber rdoffnum, > @@ -322,6 +319,16 @@ heap_page_prune_and_freeze(Relation relation, Buffer buffer, > /* for recovery conflicts */ > presult->frz_conflict_horizon = InvalidTransactionId; > > + /* > + * We will update the VM after pruning, collecting LP_DEAD items, and > + * freezing tuples. Keep track of whether or not the page is all_visible > + * and all_frozen and use this information to update the VM. all_visible > + * implies lpdead_items == 0, but don't trust all_frozen result unless > + * all_visible is also set to true. > + */ > + /* HEIKKI: the caller updates the VM? not us */ I've updated this comment. Other questions and notes on v4: xl_heap_prune->flags is a uint8, but we are already using 7 of the bits. Should we make it a uint16? 
Eventually, I would like to avoid emitting a separate XLOG_HEAP2_VISIBLE record for vacuum's first and second passes and just include the VM update flags in the xl_heap_prune record. xl_heap_visible->flags is a uint8. If we made xl_heap_prune->flags uint16, we could probably combine them (though maybe we want other bits available). Also vacuum's second pass doesn't set a snapshotConflictHorizon, so if we combined xl_heap_visible and xl_heap_prune for vacuum we would end up saving even more space (since vacuum sets xl_heap_visible->snapshotConflictHorizon to InvalidTransactionId). A note on sub-record naming: I kept xl_heap_freeze_plan's name but prefixed the other sub-records with xlhp. Do you think it is worth renaming it (to xlhp_freeze_plan)? Also, should I change xlhp_freeze to xlhp_freeze_page? - Melanie
Attachment
- v4-0001-Reorganize-heap_page_prune-function-comment.patch
- v4-0002-Remove-unused-PruneState-member-rel.patch
- v4-0003-lazy_scan_prune-tests-tuple-vis-with-GlobalVisTes.patch
- v4-0004-Pass-heap_prune_chain-PruneResult-output-paramete.patch
- v4-0005-heap_page_prune-sets-all_visible-and-frz_conflict.patch
- v4-0006-Add-reference-to-VacuumCutoffs-in-HeapPageFreeze.patch
- v4-0007-Prepare-freeze-tuples-in-heap_page_prune.patch
- v4-0008-lazy_scan_prune-reorder-freeze-execution-logic.patch
- v4-0009-Execute-freezing-in-heap_page_prune.patch
- v4-0010-Make-opp-freeze-heuristic-compatible-with-prune-f.patch
- v4-0011-Separate-tuple-pre-freeze-checks-and-invoke-earli.patch
- v4-0012-Remove-heap_freeze_execute_prepared.patch
- v4-0013-Merge-prune-and-freeze-records.patch
- v4-0014-Vacuum-second-pass-emits-XLOG_HEAP2_PRUNE-record.patch
- v4-0015-Set-hastup-in-heap_page_prune.patch
- v4-0016-Count-tuples-for-vacuum-logging-in-heap_page_prun.patch
- v4-0017-Save-dead-tuple-offsets-during-heap_page_prune.patch
- v4-0018-Initialize-xl_heap_prune-deserialization-variable.patch
- v4-0019-Streamline-XLOG_HEAP2_PRUNE-record-and-use-for-fr.patch
On 20/03/2024 03:36, Melanie Plageman wrote: > On Mon, Mar 18, 2024 at 01:15:21AM +0200, Heikki Linnakangas wrote: >> On 15/03/2024 02:56, Melanie Plageman wrote: >>> Okay, so I was going to start using xl_heap_prune for vacuum here too, >>> but I realized it would be bigger because of the >>> snapshotConflictHorizon. Do you think there is a non-terrible way to >>> make the snapshotConflictHorizon optional? Like with a flag? >> >> Yeah, another flag would do the trick. > > Okay, I've done this in attached v4 (including removing > XLOG_HEAP2_VACUUM). I had to put the snapshot conflict horizon in the > "main chunk" of data available at replay regardless of whether or not > the record ended up including an FPI. > > I made it its own sub-record (xlhp_conflict_horizon) less to help with > alignment (though we can use all the help we can get there) and more to > keep it from getting lost. When you look at heapam_xlog.h, you can see > what a XLOG_HEAP2_PRUNE record will contain starting with the > xl_heap_prune struct and then all the sub-record types. Ok, now that I look at this, I wonder if we're being overly cautious about the WAL size. We probably could just always include the snapshot field, and set it to InvalidTransactionId and waste 4 bytes when it's not needed. For the sake of simplicity. I don't feel strongly either way though, the flag is pretty simple too. > xl_heap_prune->flags is a uint8, but we are already using 7 of the bits. > Should we make it a uint16? It doesn't matter much either way. We can also make it larger when we need more bits, there's no need to make room for them beforehand. > Eventually, I would like to avoid emitting a separate XLOG_HEAP2_VISIBLE > record for vacuum's first and second passes and just include the VM > update flags in the xl_heap_prune record. xl_heap_visible->flags is a > uint8. If we made xl_heap_prune->flags uint16, we could probably combine > them (though maybe we want other bits available). 
Also vacuum's second > pass doesn't set a snapshotConflictHorizon, so if we combined > xl_heap_visible and xl_heap_prune for vacuum we would end up saving even > more space (since vacuum sets xl_heap_visible->snapshotConflictHorizon > to InvalidTransactionId). Makes sense. > A note on sub-record naming: I kept xl_heap_freeze_plan's name but > prefixed the other sub-records with xlhp. Do you think it is worth > renaming it (to xlhp_freeze_plan)? Yeah, perhaps. > Also, should I change xlhp_freeze to xlhp_freeze_page? I renamed it to xlhp_freeze_plans, for some consistency with xlhp_prune_items. I realized that the WAL record format changes are pretty independent from the rest of the patches. They could be applied before the rest. Without the rest of the changes, we'll still write two WAL records per page in vacuum, one to prune and another one to freeze, but it's another meaningful incremental step. So I reshuffled the patches, so that the WAL format is changed first, before the rest of the changes. 0001-0008: These are the WAL format changes. There's some comment cleanup needed, but as far as the code goes, I think these are pretty much ready to be squashed & committed. 0009-: The rest of the v4 patches, rebased over the WAL format changes. I also added a few small commits for little cleanups that caught my eye, let me know if you disagree with those. -- Heikki Linnakangas Neon (https://neon.tech)
Attachment
- v5-0001-Merge-prune-freeze-and-vacuum-WAL-record-formats.patch
- v5-0002-Keep-the-original-numbers-for-existing-WAL-record.patch
- v5-0003-Rename-record-to-XLOG_HEAP2_PRUNE_FREEZE.patch
- v5-0004-nplans-is-a-pointer.patch
- v5-0005-Remind-myself-to-bump-XLOG_PAGE_MAGIC-when-this-i.patch
- v5-0006-Fix-logging-snapshot-conflict-horizon.patch
- v5-0007-Add-comment-to-log_heap_prune_and_freeze.patch
- v5-0008-minor-refactoring-in-log_heap_prune_and_freeze.patch
- v5-0009-lazy_scan_prune-tests-tuple-vis-with-GlobalVisTes.patch
- v5-0010-Pass-heap_prune_chain-PruneResult-output-paramete.patch
- v5-0011-heap_page_prune-sets-all_visible-and-frz_conflict.patch
- v5-0012-Add-reference-to-VacuumCutoffs-in-HeapPageFreeze.patch
- v5-0013-still-use-a-local-cutoffs-variable.patch
- v5-0014-Prepare-freeze-tuples-in-heap_page_prune.patch
- v5-0015-lazy_scan_prune-reorder-freeze-execution-logic.patch
- v5-0016-Execute-freezing-in-heap_page_prune.patch
- v5-0017-Make-opp-freeze-heuristic-compatible-with-prune-f.patch
- v5-0018-Separate-tuple-pre-freeze-checks-and-invoke-earli.patch
- v5-0019-Remove-heap_freeze_execute_prepared.patch
- v5-0020-Merge-prune-and-freeze-records.patch
- v5-0021-move-whole_page_freezable-to-tighter-scope.patch
- v5-0022-make-all_visible_except_removable-local.patch
- v5-0023-Set-hastup-in-heap_page_prune.patch
- v5-0024-Count-tuples-for-vacuum-logging-in-heap_page_prun.patch
- v5-0025-Save-dead-tuple-offsets-during-heap_page_prune.patch
- v5-0026-reorder-PruneFreezeResult-fields.patch
On Wed, Mar 20, 2024 at 03:15:49PM +0200, Heikki Linnakangas wrote: > On 20/03/2024 03:36, Melanie Plageman wrote: > > On Mon, Mar 18, 2024 at 01:15:21AM +0200, Heikki Linnakangas wrote: > > > On 15/03/2024 02:56, Melanie Plageman wrote: > > > > Okay, so I was going to start using xl_heap_prune for vacuum here too, > > > > but I realized it would be bigger because of the > > > > snapshotConflictHorizon. Do you think there is a non-terrible way to > > > > make the snapshotConflictHorizon optional? Like with a flag? > > > > > > Yeah, another flag would do the trick. > > > > Okay, I've done this in attached v4 (including removing > > XLOG_HEAP2_VACUUM). I had to put the snapshot conflict horizon in the > > "main chunk" of data available at replay regardless of whether or not > > the record ended up including an FPI. > > > > I made it its own sub-record (xlhp_conflict_horizon) less to help with > > alignment (though we can use all the help we can get there) and more to > > keep it from getting lost. When you look at heapam_xlog.h, you can see > > what a XLOG_HEAP2_PRUNE record will contain starting with the > > xl_heap_prune struct and then all the sub-record types. > > Ok, now that I look at this, I wonder if we're being overly cautious about > the WAL size. We probably could just always include the snapshot field, and > set it to InvalidTransactionId and waste 4 bytes when it's not needed. For > the sake of simplicity. I don't feel strongly either way though, the flag is > pretty simple too. That will mean that all vacuum records are at least 3 bytes bigger than before -- which makes it somewhat less defensible to get rid of xl_heap_vacuum. That being said, I ended up doing an unaligned access when I packed it and made it optional, so maybe it is less user-friendly. But I also think that making it optional is more clear for vacuum which will never use it. > I realized that the WAL record format changes are pretty independent from > the rest of the patches. 
They could be applied before the rest. Without the > rest of the changes, we'll still write two WAL records per page in vacuum, > one to prune and another one to freeze, but it's another meaningful > incremental step. So I reshuffled the patches, so that the WAL format is > changed first, before the rest of the changes. Ah, great idea! That eliminates the issue of preliminary commits having larger WAL records that then get streamlined. > 0001-0008: These are the WAL format changes. There's some comment cleanup > needed, but as far as the code goes, I think these are pretty much ready to > be squashed & committed. My review in this email is *only* for 0001-0008. I have not looked at the rest yet. > From 06d5ff5349a8aa95cbfd06a8043fe503b7b1bf7b Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Wed, 20 Mar 2024 14:50:14 +0200 > Subject: [PATCH v5 01/26] Merge prune, freeze and vacuum WAL record formats > > The new combined WAL record is now used for pruning, freezing and 2nd > pass of vacuum. This is in preparation of changing vacuuming to write > a combined prune+freeze record per page, instead of separate two > records. The new WAL record format now supports that, but the code > still always writes separate records for pruning and freezing. Attached patch changes-for-0001.patch has a bunch of updated comments -- especially for heapam_xlog.h (including my promised diagram). And a few suggestions (mostly changes that I should have made before). > > XXX I tried to lift-and-shift the code from v4 patch set as unchanged > as possible, for easier review, but some noteworthy changes: In the final commit message, I think it is worth calling out that all of those freeze logging functions heap_log_freeze_eq/cmp/etc are lifted and shifted from one file to another. When I am reviewing a big diff, it is always helpful to know where I need to review line-by-line. 
> > - Instead of passing PruneState and PageFreezeResult to > log_heap_prune_and_freeze(), pass the arrays of frozen, redirected > et al offsets directly. That way, it can be called from other places. good idea. > - moved heap_xlog_deserialize_prune_and_freeze() from xactdesc.c to > heapdesc.c. (Because that's clearly where it belongs) :) > From cd6cdaebb362b014733e99ecd868896caf0fb3aa Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Wed, 20 Mar 2024 13:45:01 +0200 > Subject: [PATCH v5 02/26] Keep the original numbers for existing WAL records > > Doesn't matter much because the WAL format is not compatible across > major versions anyway. But still seems nice to keep the identifiers > unchanged when we can. (There's some precedence for this if you search > the git history for "is free, was"). sounds good. > From d3207bb557aa1d2868a50d357a06318a6c0cb5cd Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Wed, 20 Mar 2024 13:48:29 +0200 > Subject: [PATCH v5 03/26] Rename record to XLOG_HEAP2_PRUNE_FREEZE > > To clarify that it also freezes now, and to make it clear that it's > significantly different from the old XLOG_HEAP2_PRUNE format. +1 > From 5d6fc2ffbdd839e0b69242af16446a46cf6a2dc7 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Wed, 20 Mar 2024 13:49:59 +0200 > Subject: [PATCH v5 04/26] 'nplans' is a pointer > > I'm surprised the compiler didn't warn about this oops. while looking at this, I noticed that the asserts I added that nredirected > 0, ndead > 0, and nunused > 0 have the same problem. 
> --- > src/backend/access/rmgrdesc/heapdesc.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c > index 8b94c869faf..9ef8a745982 100644 > --- a/src/backend/access/rmgrdesc/heapdesc.c > +++ b/src/backend/access/rmgrdesc/heapdesc.c > @@ -155,8 +155,7 @@ heap_xlog_deserialize_prune_and_freeze(char *cursor, uint8 flags, > cursor += sizeof(OffsetNumber) * *nunused; > } > > - if (nplans > 0) > From 59f3f80f82ed7a63d86c991d0cb025e4cde2caec Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Wed, 20 Mar 2024 13:36:41 +0200 > Subject: [PATCH v5 06/26] Fix logging snapshot conflict horizon. > > - it was accessed without proper alignment, which won't work on > architectures that are strict about alignment. Use memcpy. wow, oops. thanks for fixing this! > - in heap_xlog_prune_freeze, the code tried to access the xid with > "(xlhp_conflict_horizon *) (xlrec + SizeOfHeapPrune);" But 'xlrec' > was "xl_heap_prune *" rather than "char *". That happened to work, > because sizeof(xl_heap_prune) == 1, but make it more robust by > adding a cast to char *. good catch. > - remove xlhp_conflict_horizon and store a TransactionId directly. A > separate struct would make sense if we needed to store anything else > there, but for now it just seems like more code. makes sense. I just want to make sure heapam_xlog.h makes it extra clear that this is happening. I see your comment addition. If you combine it with my comment additions in the attached patch for 0001, hopefully that makes it clear enough. > From 8af186ee9dd8c7dc20f37a69b34cab7b95faa43b Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Wed, 20 Mar 2024 14:03:06 +0200 > Subject: [PATCH v5 07/26] Add comment to log_heap_prune_and_freeze(). > > XXX: This should be rewritten, but I tried to at least list some > important points. 
Are you thinking that it needs to mention more things or that the things it mentions need more detail? > From b26e36ba8614d907a6e15810ed4f684f8f628dd2 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Wed, 20 Mar 2024 14:53:31 +0200 > Subject: [PATCH v5 08/26] minor refactoring in log_heap_prune_and_freeze() > > Mostly to make local variables more tightly-scoped. So, I don't think you can move those sub-records into the tighter scope. If you run tests with this applied, you'll see it crashes and a lot of the data in the record is wrong. If you move the sub-record declarations out to a wider scope, the tests pass. The WAL record data isn't actually copied into the buffer until recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_FREEZE); after registering everything. So all of those sub-records you made are out of scope by the time it tries to copy from them. I spent the better part of a day last week trying to figure out what was happening after I did the exact same thing. I must say that I found the xloginsert API incredibly unhelpful on this point. I would like to review the rest of the suggested changes in this patch after we fix the issue I just mentioned. - Melanie
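[Editor's note: the scoping pitfall described above can be sketched outside of PostgreSQL. The names toy_register/toy_insert below are illustrative stand-ins, not the real XLogRegisterData/XLogInsert API; the point they share with the real API is that registration only remembers a pointer, and the bytes are read later, at insert time.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy registry: registration stores only a pointer; nothing is copied yet. */
#define MAX_CHUNKS 8
static struct { const void *data; size_t len; } chunks[MAX_CHUNKS];
static int nchunks = 0;

static void toy_register(const void *data, size_t len)
{
    chunks[nchunks].data = data;
    chunks[nchunks].len = len;
    nchunks++;
}

/* Only here are the registered bytes actually read and copied.  Every
 * registered buffer must still be alive and in scope at this point --
 * a sub-record declared in an inner block would already be gone, which
 * is exactly the crash described above. */
static size_t toy_insert(char *out)
{
    size_t off = 0;

    for (int i = 0; i < nchunks; i++)
    {
        memcpy(out + off, chunks[i].data, chunks[i].len);
        off += chunks[i].len;
    }
    nchunks = 0;
    return off;
}

/* Filling in a struct *after* registering it still works, because the
 * copy is deferred until toy_insert(). */
size_t build_record(char *out)
{
    int header;                     /* scope survives until toy_insert() */

    toy_register(&header, sizeof(header));
    header = 42;                    /* late fill-in: still picked up */
    return toy_insert(out);
}
```

The same deferred-copy behavior is why moving the XLogRegisterData() call after the struct is filled in, as discussed above, changes readability but not correctness, while shrinking the struct's scope does break it.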
Attachment
On Wed, Mar 20, 2024 at 9:15 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > I made it its own sub-record (xlhp_conflict_horizon) less to help with > > alignment (though we can use all the help we can get there) and more to > > keep it from getting lost. When you look at heapam_xlog.h, you can see > > what a XLOG_HEAP2_PRUNE record will contain starting with the > > xl_heap_prune struct and then all the sub-record types. > > Ok, now that I look at this, I wonder if we're being overly cautious > about the WAL size. We probably could just always include the snapshot > field, and set it to InvalidTransactionId and waste 4 bytes when it's > not needed. For the sake of simplicity. I don't feel strongly either way > though, the flag is pretty simple too. What about the issue of cleanup locks, which aren't needed and aren't taken with the current heapam VACUUM record type? Will you preserve that aspect of the existing design? -- Peter Geoghegan
On Wed, Mar 20, 2024 at 4:04 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Mar 20, 2024 at 9:15 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > > I made it its own sub-record (xlhp_conflict_horizon) less to help with > > > alignment (though we can use all the help we can get there) and more to > > > keep it from getting lost. When you look at heapam_xlog.h, you can see > > > what a XLOG_HEAP2_PRUNE record will contain starting with the > > > xl_heap_prune struct and then all the sub-record types. > > > > Ok, now that I look at this, I wonder if we're being overly cautious > > about the WAL size. We probably could just always include the snapshot > > field, and set it to InvalidTransactionId and waste 4 bytes when it's > > not needed. For the sake of simplicity. I don't feel strongly either way > > though, the flag is pretty simple too. > > What about the issue of cleanup locks, which aren't needed and aren't > taken with the current heapam VACUUM record type? Will you preserve > that aspect of the existing design? Yep, we have a flag to indicate whether or not a cleanup lock is needed. - Melanie
On Wed, Mar 20, 2024 at 4:06 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > What about the issue of cleanup locks, which aren't needed and aren't > > taken with the current heapam VACUUM record type? Will you preserve > > that aspect of the existing design? > > Yep, we have a flag to indicate whether or not a cleanup lock is needed. Thanks for confirming. I realize that this is fairly obvious, but thought it better to ask. -- Peter Geoghegan
On Wed, Mar 20, 2024 at 03:15:49PM +0200, Heikki Linnakangas wrote: > > 0009-: The rest of the v4 patches, rebased over the WAL format changes. I > also added a few small commits for little cleanups that caught my eye, let > me know if you disagree with those. This review is just of the patches containing changes you made in 0009-0026. > From d36138b5bf0a93557273b5e47f8cd5ea089057c7 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Wed, 20 Mar 2024 11:47:42 +0200 > Subject: [PATCH v5 13/26] still use a local 'cutoffs' variable > > Given how often 'cutoffs' is used in the function, I think it still > makes sense to have a local variable for it, just to keep the source > lines shorter. Works for me. > From 913617ed98cfddd678a6f620db7dee68d1d61c89 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Wed, 20 Mar 2024 10:51:13 +0200 > Subject: [PATCH v5 21/26] move whole_page_freezable to tighter scope Works for me. > From e2b50f9b64f7e4255f4f764e2a348e1b446573dc Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Wed, 20 Mar 2024 11:43:31 +0200 > Subject: [PATCH v5 22/26] make 'all_visible_except_removable' local > > The caller doesn't need it, so it doesn't belong in PruneFreezeResult Makes sense to me. 
> From e993e0d98cd0ef1ecbd229f6ddbe23d59a427e3a Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Wed, 20 Mar 2024 11:40:34 +0200 > Subject: [PATCH v5 26/26] reorder PruneFreezeResult fields > > --- > src/include/access/heapam.h | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h > index ee0eca8ae02..b2015f5a1ac 100644 > --- a/src/include/access/heapam.h > +++ b/src/include/access/heapam.h > @@ -202,14 +202,17 @@ typedef struct PruneFreezeResult > int recently_dead_tuples; > int ndeleted; /* Number of tuples deleted from the page */ > int nnewlpdead; /* Number of newly LP_DEAD items */ > + int nfrozen; Let's add a comment after int nfrozen like /* Number of newly frozen tuples */ > + > bool all_visible; /* Whether or not the page is all visible */ > bool hastup; /* Does page make rel truncation unsafe */ > > + /* The following fields are only set if freezing */ So, all_frozen will be set correctly if we are even considering freezing (if pagefrz is passed). all_frozen will be true even if we didn't freeze anything if the page is all-frozen and can be set as such in the VM. And it will be false if the page is not all-frozen. So, maybe we say "considering freezing". But, I'm glad you thought to call out which of these fields will make sense to the caller. Also, maybe we should just name the members to which this applies. It is a bit confusing that I can't tell if the comment also refers to the other members following it (lpdead_items and deadoffsets) -- which it doesn't. > + > /* Whether or not the page can be set all frozen in the VM */ > bool all_frozen; > > /* Number of newly frozen tuples */ > - int nfrozen; > TransactionId frz_conflict_horizon; /* Newest xmin on the page */ > > /* New value of relfrozenxid found by heap_page_prune_and_freeze() */ > -- > 2.39.2 > - Melanie
On 20/03/2024 23:03, Melanie Plageman wrote: > On Wed, Mar 20, 2024 at 03:15:49PM +0200, Heikki Linnakangas wrote: >> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h >> index ee0eca8ae02..b2015f5a1ac 100644 >> --- a/src/include/access/heapam.h >> +++ b/src/include/access/heapam.h >> @@ -202,14 +202,17 @@ typedef struct PruneFreezeResult >> int recently_dead_tuples; >> int ndeleted; /* Number of tuples deleted from the page */ >> int nnewlpdead; /* Number of newly LP_DEAD items */ >> + int nfrozen; > > Let's add a comment after int nfrozen like > /* Number of newly frozen tuples */ > >> + >> bool all_visible; /* Whether or not the page is all visible */ >> bool hastup; /* Does page make rel truncation unsafe */ >> >> + /* The following fields are only set if freezing */ > > So, all_frozen will be set correctly if we are even considering freezing > (if pagefrz is passed). all_frozen will be true even if we didn't freeze > anything if the page is all-frozen and can be set as such in the VM. And > it will be false if the page is not all-frozen. So, maybe we say > "considering freezing". > > But, I'm glad you thought to call out which of these fields will make > sense to the caller. > > Also, maybe we should just name the members to which this applies. It is > a bit confusing that I can't tell if the comment also refers to the > other members following it (lpdead_items and deadoffsets) -- which it > doesn't. Right, sorry, I spotted the general issue that it's not clear which fields are valid when. I added that comment to remind about that, but I then forgot about it. In heap_page_prune_and_freeze(), we now do some extra work on each live tuple, to set the all_visible_except_removable correctly. And also to update live_tuples, recently_dead_tuples and hastup. When we're not freezing, that's a waste of cycles, the caller doesn't care. I hope it's enough that it doesn't matter, but is it? 
The first commit (after the WAL format changes) changes the all-visible check to use GlobalVisTestIsRemovableXid. The commit message says that it's because we don't have 'cutoffs' available, but we only care about that when we're freezing, and when we're freezing, we actually do have 'cutoffs' in HeapPageFreeze. Using GlobalVisTestIsRemovableXid seems sensible anyway, because that's what we use in heap_prune_satisfies_vacuum() too, but just wanted to point that out. The 'frz_conflict_horizon' stuff is still fuzzy to me. (Not necessarily these patches's fault). This at least is wrong, because Max(a, b) doesn't handle XID wraparound correctly: > if (do_freeze) > conflict_xid = Max(prstate.snapshotConflictHorizon, > presult->frz_conflict_horizon); > else > conflict_xid = prstate.snapshotConflictHorizon; Then there's this in lazy_scan_prune(): > /* Using same cutoff when setting VM is now unnecessary */ > if (presult.all_frozen) > presult.frz_conflict_horizon = InvalidTransactionId; This does the right thing in the end, but if all the tuples are frozen shouldn't frz_conflict_horizon already be InvalidTransactionId? The comment says it's "newest xmin on the page", and if everything was frozen, all xmins are FrozenTransactionId. In other words, that should be moved to heap_page_prune_and_freeze() so that it doesn't lie to its caller. Also, frz_conflict_horizon is only set correctly if 'all_frozen==true', would be good to mention that in the comments too. -- Heikki Linnakangas Neon (https://neon.tech)
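[Editor's note: the Max(a, b) problem flagged above comes from XIDs being modulo-2^32 counters. A minimal sketch of a wraparound-aware comparison in the style of PostgreSQL's TransactionIdFollows() — toy_xid_follows is an illustrative name, and the sketch omits the real function's special-casing of permanent XIDs:]

```c
#include <stdint.h>

typedef uint32_t ToyTransactionId;

/* a logically follows b if the unsigned difference, reinterpreted as
 * signed, is positive.  This is only meaningful while the two XIDs are
 * within 2^31 of each other, which the XID horizon machinery maintains. */
static int toy_xid_follows(ToyTransactionId a, ToyTransactionId b)
{
    int32_t diff = (int32_t) (a - b);

    return diff > 0;
}
```

With xids 0xFFFFFFF0 and 0x10 on either side of the wraparound point, a plain Max() picks 0xFFFFFFF0, the logically *older* xid, while the signed-difference comparison correctly treats 0x10 as newer.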
On 20/03/2024 21:17, Melanie Plageman wrote: > Attached patch changes-for-0001.patch has a bunch of updated comments -- > especially for heapam_xlog.h (including my promised diagram). And a few > suggestions (mostly changes that I should have made before). New version of these WAL format changes attached. Squashed to one patch now. > + // TODO: should we avoid this if we only froze? heap_xlog_freeze() doesn't > + // do it Ah yes, that makes sense, did that. > In the final commit message, I think it is worth calling out that all of > those freeze logging functions heap_log_freeze_eq/cmp/etc are lifted and > shifted from one file to another. When I am reviewing a big diff, it is > always helpful to know where I need to review line-by-line. Done. >> From 5d6fc2ffbdd839e0b69242af16446a46cf6a2dc7 Mon Sep 17 00:00:00 2001 >> From: Heikki Linnakangas <heikki.linnakangas@iki.fi> >> Date: Wed, 20 Mar 2024 13:49:59 +0200 >> Subject: [PATCH v5 04/26] 'nplans' is a pointer >> >> I'm surprised the compiler didn't warn about this > > oops. while looking at this, I noticed that the asserts I added that > nredirected > 0, ndead > 0, and nunused > 0 have the same problem. Good catch, fixed. >> - remove xlhp_conflict_horizon and store a TransactionId directly. A >> separate struct would make sense if we needed to store anything else >> there, but for now it just seems like more code. > > makes sense. I just want to make sure heapam_xlog.h makes it extra clear > that this is happening. I see your comment addition. If you combine it > with my comment additions in the attached patch for 0001, hopefully that > makes it clear enough. Thanks. I spent more time on the comments throughout the patch. And one notable code change: I replaced the XLHP_LP_TRUNCATE_ONLY flag with XLHP_CLEANUP_LOCK. XLHP_CLEANUP_LOCK directly indicates if you need a cleanup lock to replay the record. 
It must always be set when XLHP_HAS_REDIRECTIONS or XLHP_HAS_DEAD_ITEMS is set, because replaying those always needs a cleanup lock. That felt easier to document and understand than XLHP_LP_TRUNCATE_ONLY. >> From 8af186ee9dd8c7dc20f37a69b34cab7b95faa43b Mon Sep 17 00:00:00 2001 >> From: Heikki Linnakangas <heikki.linnakangas@iki.fi> >> Date: Wed, 20 Mar 2024 14:03:06 +0200 >> Subject: [PATCH v5 07/26] Add comment to log_heap_prune_and_freeze(). >> >> XXX: This should be rewritten, but I tried to at least list some >> important points. > > Are you thinking that it needs to mention more things or that the things > it mentions need more detail? I previously just quickly jotted down things that seemed worth mentioning in the comment. It was not so bad actually, but I reworded it a little. >> From b26e36ba8614d907a6e15810ed4f684f8f628dd2 Mon Sep 17 00:00:00 2001 >> From: Heikki Linnakangas <heikki.linnakangas@iki.fi> >> Date: Wed, 20 Mar 2024 14:53:31 +0200 >> Subject: [PATCH v5 08/26] minor refactoring in log_heap_prune_and_freeze() >> >> Mostly to make local variables more tightly-scoped. > > So, I don't think you can move those sub-records into the tighter scope. > If you run tests with this applied, you'll see it crashes and a lot of > the data in the record is wrong. If you move the sub-record declarations > out to a wider scope, the tests pass. > > The WAL record data isn't actually copied into the buffer until > > recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_FREEZE); > > after registering everything. > So all of those sub-records you made are out of scope the time it tries > to copy from them. > > I spent the better part of a day last week trying to figure out what was > happening after I did the exact same thing. I must say that I found the > xloginsert API incredibly unhelpful on this point. Oops. 
I had that in mind and that was actually why I moved the XLogRegisterData() call to the end of the function, because I found it confusing to register the struct before filling it in completely, even though it works perfectly fine. But then I missed it anyway when I moved the local variables. I added a brief comment on that. > I would like to review the rest of the suggested changes in this patch > after we fix the issue I just mentioned. Thanks, review is appreciated. I feel this is ready now, so barring any big new issues, I plan to commit this early next week. There is another patch in the commitfest that touches this area: "Recording whether Heap2/PRUNE records are from VACUUM or from opportunistic pruning" [1]. That actually goes in the opposite direction than this patch. That patch wants to add more information, to show whether a record was emitted by VACUUM or on-access pruning, while this patch makes the freezing and VACUUM's 2nd phase records also look the same. We could easily add more flags to xl_heap_prune to distinguish them. Or assign different xl_info code for them, like that other patch proposed. But I don't think that needs to block this patch, that can be added as a separate patch. [1] https://www.postgresql.org/message-id/CAH2-Wzmsevhox+HJpFmQgCxWWDgNzP0R9F+VBnpOK6TgxMPrRw@mail.gmail.com -- Heikki Linnakangas Neon (https://neon.tech)
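[Editor's note: two mechanics recur in the emails above — gating an optional field on a flag bit, and reading it with memcpy because it sits unaligned after a 1-byte header. A toy sketch of that record shape; all names (toy_record, TOY_HAS_CONFLICT_HORIZON) are illustrative, not the real xl_heap_prune layout:]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_HAS_CONFLICT_HORIZON 0x01

typedef struct { uint8_t flags; } toy_record;   /* sizeof == 1 */

/* Append the 4-byte horizon only when needed, and advertise its
 * presence with a flag bit -- saving the 4 bytes for records (like
 * vacuum's 2nd pass) that never use it. */
size_t toy_serialize(char *out, int has_horizon, uint32_t horizon)
{
    toy_record rec;
    char *p = out;

    rec.flags = has_horizon ? TOY_HAS_CONFLICT_HORIZON : 0;
    memcpy(p, &rec, sizeof(rec));
    p += sizeof(rec);
    if (has_horizon)
    {
        /* p is now at offset 1, so *(uint32_t *) p would be an
         * unaligned access; memcpy is the portable way to write it. */
        memcpy(p, &horizon, sizeof(horizon));
        p += sizeof(horizon);
    }
    return (size_t) (p - out);
}

/* Returns 1 and fills *horizon if the flag says the field is present. */
int toy_deserialize(const char *in, uint32_t *horizon)
{
    toy_record rec;
    const char *p = in;

    memcpy(&rec, p, sizeof(rec));
    p += sizeof(rec);
    if (rec.flags & TOY_HAS_CONFLICT_HORIZON)
    {
        memcpy(horizon, p, sizeof(*horizon));   /* unaligned read */
        return 1;
    }
    return 0;
}
```

This is also why the alignment bug fixed in 0006 only "happened to work" before: with a 1-byte header struct, a cast through the wrong pointer type still landed on the right byte, but a direct aligned load of the xid could not.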
Attachment
On Thu, Mar 21, 2024 at 9:28 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > In heap_page_prune_and_freeze(), we now do some extra work on each live > tuple, to set the all_visible_except_removable correctly. And also to > update live_tuples, recently_dead_tuples and hastup. When we're not > freezing, that's a waste of cycles, the caller doesn't care. I hope it's > enough that it doesn't matter, but is it? Last year on an early version of the patch set I did some pgbench tpcb-like benchmarks -- since there is a lot of on-access pruning in that workload -- and I don't remember it being a showstopper. The code has changed a fair bit since then. However, I think it might be safer to pass a flag "by_vacuum" to heap_page_prune_and_freeze() and skip the rest of the loop after heap_prune_satisfies_vacuum() when on-access pruning invokes it. I had avoided that because it felt ugly and error-prone, however it addresses a few of your other points as well. For example, I think we should set a bit in the prune/freeze WAL record's flags to indicate if pruning was done by vacuum or on access (mentioned in another of your recent emails). > The first commit (after the WAL format changes) changes the all-visible > check to use GlobalVisTestIsRemovableXid. The commit message says that > it's because we don't have 'cutoffs' available, but we only care about > that when we're freezing, and when we're freezing, we actually do have > 'cutoffs' in HeapPageFreeze. Using GlobalVisTestIsRemovableXid seems > sensible anyway, because that's what we use in > heap_prune_satisfies_vacuum() too, but just wanted to point that out. Yes, this is true. If we skip this part of the loop when on-access pruning invokes it, we can go back to using the OldestXmin. I have done that as well as some other changes in the attached patch, conflict_horizon_updates.diff. Note that this patch may not apply to your latest patch as I wrote it on top of an older version.
Switching back to using OldestXmin for page visibility determination makes this patch set more similar to master as well. We could keep the alternative check (with GlobalVisState) to maintain the illusion that callers passing by_vacuum as True can pass NULL for pagefrz, but I was worried about the extra overhead. It would be nice to pick a single reasonable visibility horizon (the oldest running xid we compare things to) at the beginning of heap_page_prune_and_freeze() and use it for determining if we can remove tuples, if we can freeze tuples, and if the page is all visible. It makes it confusing that we use OldestXmin for freezing and setting the page visibility in the VM and GlobalVisState for determining prunability. I think using GlobalVisState for freezing has been discussed before and ruled out for a variety of reasons, and I definitely don't suggest making that change in this patch set. > The 'frz_conflict_horizon' stuff is still fuzzy to me. (Not necessarily > these patches's fault). This at least is wrong, because Max(a, b) > doesn't handle XID wraparound correctly: > > > if (do_freeze) > > conflict_xid = Max(prstate.snapshotConflictHorizon, > > presult->frz_conflict_horizon); > > else > > conflict_xid = prstate.snapshotConflictHorizon; > > Then there's this in lazy_scan_prune(): > > > /* Using same cutoff when setting VM is now unnecessary */ > > if (presult.all_frozen) > > presult.frz_conflict_horizon = InvalidTransactionId; > This does the right thing in the end, but if all the tuples are frozen > shouldn't frz_conflict_horizon already be InvalidTransactionId? The > comment says it's "newest xmin on the page", and if everything was > frozen, all xmins are FrozenTransactionId. In other words, that should > be moved to heap_page_prune_and_freeze() so that it doesn't lie to its > caller. Also, frz_conflict_horizon is only set correctly if > 'all_frozen==true', would be good to mention that in the comments too. Yes, this is a good point. 
I've spent some time swapping all of this back into my head. I think we should change the names of all these conflict horizon variables and introduce some local variables again. In the attached patch, I've updated the name of the variable in PruneFreezeResult to vm_conflict_horizon, as it is only used for emitting a VM update record. Now, I don't set it until the end of heap_page_prune_and_freeze(). It is only updated from InvalidTransactionId if the page is not all frozen. As you say, if the page is all frozen, there can be no conflict. I've also changed PruneState->snapshotConflictHorizon to PruneState->latest_xid_removed. And I introduced the local variables visibility_cutoff_xid and frz_conflict_horizon. I think it is important we distinguish between the latest xid pruned, the latest xmin of tuples frozen, and the latest xid of all live tuples on the page. Though we end up using visibility_cutoff_xid as the freeze conflict horizon if the page is all frozen, I think it is more clear to have the three variables and name them what they are. Then, we calculate the correct one for the combined WAL record before emitting it. I've done that in the attached. I've tried to reduce the scope of the variables as much as possible to keep it as clear as I could. I think I've also fixed the issue with using Max() to compare TransactionIds by using TransactionIdFollows() instead. Note that I still don't think we have a resolution on what to correctly update new_relfrozenxid and new_relminmxid to at the end when presult->nfrozen == 0 and presult->all_frozen is true. if (presult->nfrozen > 0) { presult->new_relfrozenxid = pagefrz->FreezePageRelfrozenXid; presult->new_relminmxid = pagefrz->FreezePageRelminMxid; } else { presult->new_relfrozenxid = pagefrz->NoFreezePageRelfrozenXid; presult->new_relminmxid = pagefrz->NoFreezePageRelminMxid; } - Melanie
Attachment
On Sat, Mar 23, 2024 at 01:09:30AM +0200, Heikki Linnakangas wrote: > On 20/03/2024 21:17, Melanie Plageman wrote: > > Attached patch changes-for-0001.patch has a bunch of updated comments -- > > especially for heapam_xlog.h (including my promised diagram). And a few > > suggestions (mostly changes that I should have made before). > > New version of these WAL format changes attached. Squashed to one patch now. > I spent more time on the comments throughout the patch. And one > notable code change: I replaced the XLHP_LP_TRUNCATE_ONLY flag with > XLHP_CLEANUP_LOCK. XLHP_CLEANUP_LOCK directly indicates if you need a > cleanup lock to replay the record. It must always be set when > XLHP_HAS_REDIRECTIONS or XLHP_HAS_DEAD_ITEMS is set, because replaying those > always needs a cleanup lock. That felt easier to document and understand > than XLHP_LP_TRUNCATE_ONLY. Makes sense to me. > > > From b26e36ba8614d907a6e15810ed4f684f8f628dd2 Mon Sep 17 00:00:00 2001 > > > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > > > Date: Wed, 20 Mar 2024 14:53:31 +0200 > > > Subject: [PATCH v5 08/26] minor refactoring in log_heap_prune_and_freeze() > > > > > > Mostly to make local variables more tightly-scoped. > > > > So, I don't think you can move those sub-records into the tighter scope. > > If you run tests with this applied, you'll see it crashes and a lot of > > the data in the record is wrong. If you move the sub-record declarations > > out to a wider scope, the tests pass. > > > > The WAL record data isn't actually copied into the buffer until > > > > recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_FREEZE); > > > > after registering everything. > > So all of those sub-records you made are out of scope the time it tries > > to copy from them. > > > > I spent the better part of a day last week trying to figure out what was > > happening after I did the exact same thing. I must say that I found the > > xloginsert API incredibly unhelpful on this point. > > Oops. 
I had that in mind and that was actually why I moved the > XLogRegisterData() call to the end of the function, because I found it > confusing to register the struct before filling it in completely, even > though it works perfectly fine. But then I missed it anyway when I moved the > local variables. I added a brief comment on that. Comment was a good idea. > There is another patch in the commitfest that touches this area: "Recording > whether Heap2/PRUNE records are from VACUUM or from opportunistic pruning" > [1]. That actually goes in the opposite direction than this patch. That > patch wants to add more information, to show whether a record was emitted by > VACUUM or on-access pruning, while this patch makes the freezing and > VACUUM's 2nd phase records also look the same. We could easily add more > flags to xl_heap_prune to distinguish them. Or assign different xl_info code > for them, like that other patch proposed. But I don't think that needs to > block this patch, that can be added as a separate patch. > > [1] https://www.postgresql.org/message-id/CAH2-Wzmsevhox+HJpFmQgCxWWDgNzP0R9F+VBnpOK6TgxMPrRw@mail.gmail.com I think it would be very helpful to distinguish amongst vacuum pass 1, 2, and on-access pruning. I often want to know what did most of the pruning -- and I could see also wanting to know if the first or second vacuum pass was responsible for removing the tuples. I agree it could be done separately, but it would be very helpful to have as soon as possible now that the record type will be the same for all three. > From 042185d3de14dcb7088bbe50e9c64e365ac42c2a Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Fri, 22 Mar 2024 23:10:22 +0200 > Subject: [PATCH v6] Merge prune, freeze and vacuum WAL record formats > > /* > - * Handles XLOG_HEAP2_PRUNE record type. > - * > - * Acquires a full cleanup lock. > + * Replay XLOG_HEAP2_PRUNE_FREEZE record. 
> */ > static void > -heap_xlog_prune(XLogReaderState *record) > +heap_xlog_prune_freeze(XLogReaderState *record) > { > XLogRecPtr lsn = record->EndRecPtr; > - xl_heap_prune *xlrec = (xl_heap_prune *) XLogRecGetData(record); > + char *ptr; > + xl_heap_prune *xlrec; > Buffer buffer; > RelFileLocator rlocator; > BlockNumber blkno; > XLogRedoAction action; > > XLogRecGetBlockTag(record, 0, &rlocator, NULL, &blkno); > + ptr = XLogRecGetData(record); I don't love having two different pointers that alias each other and we don't know which one is for what. Perhaps we could memcpy xlrec like in my attached diff (log_updates.diff). It also might perform slightly better than accessing flags through a xl_heap_prune * -- since it wouldn't be doing pointer dereferencing. > + xlrec = (xl_heap_prune *) ptr; > + ptr += SizeOfHeapPrune; > > /* > - * We're about to remove tuples. In Hot Standby mode, ensure that there's > - * no queries running for which the removed tuples are still visible. > + * We will take an ordinary exclusive lock or a cleanup lock depending on > + * whether the XLHP_CLEANUP_LOCK flag is set. With an ordinary exclusive > + * lock, we better not be doing anything that requires moving existing > + * tuple data. > */ > - if (InHotStandby) > - ResolveRecoveryConflictWithSnapshot(xlrec->snapshotConflictHorizon, > - xlrec->isCatalogRel, > + Assert((xlrec->flags & XLHP_CLEANUP_LOCK) != 0 || > + (xlrec->flags & (XLHP_HAS_REDIRECTIONS | XLHP_HAS_DEAD_ITEMS)) == 0); > + > + /* > + * We are about to remove and/or freeze tuples. In Hot Standby mode, > + * ensure that there are no queries running for which the removed tuples > + * are still visible or which still consider the frozen xids as running. > + * The conflict horizon XID comes after xl_heap_prune. > + */ > + if (InHotStandby && (xlrec->flags & XLHP_HAS_CONFLICT_HORIZON) != 0) > + { My attached patch has a TODO here for the comment. 
It sticks out that the serialization and deserialization conditions are different for the snapshot conflict horizon. We don't deserialize if InHotStandby is false. That's still correct now because we don't write anything else to the main data chunk afterward. But, if someone were to add another member after snapshot_conflict_horizon, they would want to know to deserialize snapshot_conflict_horizon first even if InHotStandby is false. > + TransactionId snapshot_conflict_horizon; > + I would throw a comment in about the memcpy being required because the snapshot_conflict_horizon is in the buffer unaligned. > + memcpy(&snapshot_conflict_horizon, ptr, sizeof(TransactionId)); > + ResolveRecoveryConflictWithSnapshot(snapshot_conflict_horizon, > + (xlrec->flags & XLHP_IS_CATALOG_REL) != 0, > rlocator); > + } > > /* > + > +/* > + * Given a MAXALIGNed buffer returned by XLogRecGetBlockData() and pointed to > + * by cursor and any xl_heap_prune flags, deserialize the arrays of > + * OffsetNumbers contained in an XLOG_HEAP2_PRUNE_FREEZE record. > + * > + * This is in heapdesc.c so it can be shared between heap2_redo and heap2_desc > + * code, the latter of which is used in frontend (pg_waldump) code. > + */ > +void > +heap_xlog_deserialize_prune_and_freeze(char *cursor, uint8 flags, > + int *nredirected, OffsetNumber **redirected, > + int *ndead, OffsetNumber **nowdead, > + int *nunused, OffsetNumber **nowunused, > + int *nplans, xlhp_freeze_plan **plans, > + OffsetNumber **frz_offsets) > +{ > + if (flags & XLHP_HAS_FREEZE_PLANS) > + { > + xlhp_freeze_plans *freeze_plans = (xlhp_freeze_plans *) cursor; > + > + *nplans = freeze_plans->nplans; > + Assert(*nplans > 0); > + *plans = freeze_plans->plans; > + > + cursor += offsetof(xlhp_freeze_plans, plans); > + cursor += sizeof(xlhp_freeze_plan) * *nplans; > + } I noticed you decided to set these in the else statements. Is that to emphasize that it is important to correctness that they be properly initialized? 
> + else > + { > + *nplans = 0; > + *plans = NULL; > + } > + Thanks again for all your work on this set! - Melanie
Attachment
On 24/03/2024 21:55, Melanie Plageman wrote: > On Sat, Mar 23, 2024 at 01:09:30AM +0200, Heikki Linnakangas wrote: >> On 20/03/2024 21:17, Melanie Plageman wrote: >> There is another patch in the commitfest that touches this area: "Recording >> whether Heap2/PRUNE records are from VACUUM or from opportunistic pruning" >> [1]. That actually goes in the opposite direction than this patch. That >> patch wants to add more information, to show whether a record was emitted by >> VACUUM or on-access pruning, while this patch makes the freezing and >> VACUUM's 2nd phase records also look the same. We could easily add more >> flags to xl_heap_prune to distinguish them. Or assign different xl_info code >> for them, like that other patch proposed. But I don't think that needs to >> block this patch, that can be added as a separate patch. >> >> [1] https://www.postgresql.org/message-id/CAH2-Wzmsevhox+HJpFmQgCxWWDgNzP0R9F+VBnpOK6TgxMPrRw@mail.gmail.com > > I think it would be very helpful to distinguish amongst vacuum pass 1, > 2, and on-access pruning. I often want to know what did most of the > pruning -- and I could see also wanting to know if the first or second > vacuum pass was responsible for removing the tuples. I agree it could be > done separately, but it would be very helpful to have as soon as > possible now that the record type will be the same for all three. Ok, I used separate 'info' codes for records generated on on-access pruning and vacuum's 1st and 2nd pass. Similar to Peter's patch on that other thread, except that I didn't reserve the whole high bit for this, but used three different 'info' codes. Freezing uses the same XLOG_HEAP2_PRUNE_VACUUM_SCAN as the pruning in vacuum's 1st pass. You can distinguish them based on whether the record has nfrozen > 0, and with the rest of the patches, they will be merged anyway. 
>> From 042185d3de14dcb7088bbe50e9c64e365ac42c2a Mon Sep 17 00:00:00 2001 >> From: Heikki Linnakangas <heikki.linnakangas@iki.fi> >> Date: Fri, 22 Mar 2024 23:10:22 +0200 >> Subject: [PATCH v6] Merge prune, freeze and vacuum WAL record formats >> >> /* >> - * Handles XLOG_HEAP2_PRUNE record type. >> - * >> - * Acquires a full cleanup lock. >> + * Replay XLOG_HEAP2_PRUNE_FREEZE record. >> */ >> static void >> -heap_xlog_prune(XLogReaderState *record) >> +heap_xlog_prune_freeze(XLogReaderState *record) >> { >> XLogRecPtr lsn = record->EndRecPtr; >> - xl_heap_prune *xlrec = (xl_heap_prune *) XLogRecGetData(record); >> + char *ptr; >> + xl_heap_prune *xlrec; >> Buffer buffer; >> RelFileLocator rlocator; >> BlockNumber blkno; >> XLogRedoAction action; >> >> XLogRecGetBlockTag(record, 0, &rlocator, NULL, &blkno); >> + ptr = XLogRecGetData(record); > > I don't love having two different pointers that alias each other and we > don't know which one is for what. Perhaps we could memcpy xlrec like in > my attached diff (log_updates.diff). It also might perform slightly > better than accessing flags through a xl_heap_prune > * -- since it wouldn't be doing pointer dereferencing. Ok. >> /* >> - * We're about to remove tuples. In Hot Standby mode, ensure that there's >> - * no queries running for which the removed tuples are still visible. >> + * We will take an ordinary exclusive lock or a cleanup lock depending on >> + * whether the XLHP_CLEANUP_LOCK flag is set. With an ordinary exclusive >> + * lock, we better not be doing anything that requires moving existing >> + * tuple data. >> */ >> - if (InHotStandby) >> - ResolveRecoveryConflictWithSnapshot(xlrec->snapshotConflictHorizon, >> - xlrec->isCatalogRel, >> + Assert((xlrec->flags & XLHP_CLEANUP_LOCK) != 0 || >> + (xlrec->flags & (XLHP_HAS_REDIRECTIONS | XLHP_HAS_DEAD_ITEMS)) == 0); >> + >> + /* >> + * We are about to remove and/or freeze tuples. 
In Hot Standby mode, >> + * ensure that there are no queries running for which the removed tuples >> + * are still visible or which still consider the frozen xids as running. >> + * The conflict horizon XID comes after xl_heap_prune. >> + */ >> + if (InHotStandby && (xlrec->flags & XLHP_HAS_CONFLICT_HORIZON) != 0) >> + { > > My attached patch has a TODO here for the comment. It sticks out that > the serialization and deserialization conditions are different for the > snapshot conflict horizon. We don't deserialize if InHotStandby is > false. That's still correct now because we don't write anything else to > the main data chunk afterward. But, if someone were to add another > member after snapshot_conflict_horizon, they would want to know to > deserialize snapshot_conflict_horizon first even if InHotStandby is > false. Good point. Fixed so that 'snapshot_conflict_horizon' is deserialized even if !InHotStandby. A memcpy is cheap, and is probably optimized away by the compiler anyway. >> + TransactionId snapshot_conflict_horizon; >> + > > I would throw a comment in about the memcpy being required because the > snapshot_conflict_horizon is in the buffer unaligned. Added. >> +/* >> + * Given a MAXALIGNed buffer returned by XLogRecGetBlockData() and pointed to >> + * by cursor and any xl_heap_prune flags, deserialize the arrays of >> + * OffsetNumbers contained in an XLOG_HEAP2_PRUNE_FREEZE record. >> + * >> + * This is in heapdesc.c so it can be shared between heap2_redo and heap2_desc >> + * code, the latter of which is used in frontend (pg_waldump) code. 
>> + */ >> +void >> +heap_xlog_deserialize_prune_and_freeze(char *cursor, uint8 flags, >> + int *nredirected, OffsetNumber **redirected, >> + int *ndead, OffsetNumber **nowdead, >> + int *nunused, OffsetNumber **nowunused, >> + int *nplans, xlhp_freeze_plan **plans, >> + OffsetNumber **frz_offsets) >> +{ >> + if (flags & XLHP_HAS_FREEZE_PLANS) >> + { >> + xlhp_freeze_plans *freeze_plans = (xlhp_freeze_plans *) cursor; >> + >> + *nplans = freeze_plans->nplans; >> + Assert(*nplans > 0); >> + *plans = freeze_plans->plans; >> + >> + cursor += offsetof(xlhp_freeze_plans, plans); >> + cursor += sizeof(xlhp_freeze_plan) * *nplans; >> + } > > I noticed you decided to set these in the else statements. Is that to > emphasize that it is important to correctness that they be properly > initialized? The point was to always initialize *nplans et al in the function. You required the caller to initialize them to zero, but that seems error-prone. I made one more last minute change: I changed the order of the array arguments in heap_xlog_deserialize_prune_and_freeze() to match the order in log_heap_prune_and_freeze(). Committed with the above little changes. Thank you! Now, back to the rest of the patches :-). -- Heikki Linnakangas Neon (https://neon.tech)
On 24/03/2024 18:32, Melanie Plageman wrote:
> On Thu, Mar 21, 2024 at 9:28 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> In heap_page_prune_and_freeze(), we now do some extra work on each live
>> tuple, to set the all_visible_except_removable correctly. And also to
>> update live_tuples, recently_dead_tuples and hastup. When we're not
>> freezing, that's a waste of cycles, the caller doesn't care. I hope it's
>> enough that it doesn't matter, but is it?
>
> Last year on an early version of the patch set I did some pgbench
> tpcb-like benchmarks -- since there is a lot of on-access pruning in
> that workload -- and I don't remember it being a showstopper. The code
> has changed a fair bit since then. However, I think it might be safer
> to pass a flag "by_vacuum" to heap_page_prune_and_freeze() and skip
> the rest of the loop after heap_prune_satisfies_vacuum() when
> on-access pruning invokes it. I had avoided that because it felt ugly
> and error-prone; however, it addresses a few others of your points as
> well.

Ok. I'm not a fan of the name 'by_vacuum' though. It'd be nice if the argument described what it does, rather than who it's for. For example, 'need_all_visible'. If set to true, the function determines 'all_visible', otherwise it does not.

I started to look closer at the loops in heap_prune_chain() and how they update all the various flags and counters. There's a lot going on there. We have:

- live_tuples counter
- recently_dead_tuples counter
- all_visible[_except_removable]
- all_frozen
- visibility_cutoff_xid
- hastup
- prstate.frozen array
- nnewlpdead
- deadoffsets array

And that doesn't even include all the local variables and the final dead/redirected arrays. Some of those are set in the first loop that initializes 'htsv' for each tuple on the page. Others are updated in heap_prune_chain(). Some are updated in both. It's hard to follow which are set where.
I think recently_dead_tuples is updated incorrectly, for tuples that are part of a completely dead HOT chain. For example, imagine a hot chain with two tuples: RECENTLY_DEAD -> DEAD. heap_prune_chain() would follow the chain, see the DEAD tuple at the end of the chain, and mark both tuples for pruning. However, we already updated 'recently_dead_tuples' in the first loop, which is wrong if we remove the tuple. Maybe that's the only bug like this, but I'm a little scared. Is there something we could do to make this simpler? Maybe move all the new work that we added to the first loop, into heap_prune_chain() ? Maybe introduce a few more helper heap_prune_record_*() functions, to update the flags and counters also for live and insert/delete-in-progress tuples and for dead line pointers? Something like heap_prune_record_live() and heap_prune_record_lp_dead(). >> The 'frz_conflict_horizon' stuff is still fuzzy to me. (Not necessarily >> these patches's fault). This at least is wrong, because Max(a, b) >> doesn't handle XID wraparound correctly: >> >>> if (do_freeze) >>> conflict_xid = Max(prstate.snapshotConflictHorizon, >>> presult->frz_conflict_horizon); >>> else >>> conflict_xid = prstate.snapshotConflictHorizon; >> >> Then there's this in lazy_scan_prune(): >> >>> /* Using same cutoff when setting VM is now unnecessary */ >>> if (presult.all_frozen) >>> presult.frz_conflict_horizon = InvalidTransactionId; >> This does the right thing in the end, but if all the tuples are frozen >> shouldn't frz_conflict_horizon already be InvalidTransactionId? The >> comment says it's "newest xmin on the page", and if everything was >> frozen, all xmins are FrozenTransactionId. In other words, that should >> be moved to heap_page_prune_and_freeze() so that it doesn't lie to its >> caller. Also, frz_conflict_horizon is only set correctly if >> 'all_frozen==true', would be good to mention that in the comments too. > > Yes, this is a good point. 
> I've spent some time swapping all of this back into my head. I think
> we should change the names of all these conflict horizon variables and
> introduce some local variables again. In the attached patch, I've
> updated the name of the variable in PruneFreezeResult to
> vm_conflict_horizon, as it is only used for emitting a VM update
> record. Now, I don't set it until the end of
> heap_page_prune_and_freeze(). It is only updated from
> InvalidTransactionId if the page is not all frozen. As you say, if the
> page is all frozen, there can be no conflict.

Makes sense.

> I've also changed PruneState->snapshotConflictHorizon to
> PruneState->latest_xid_removed.
>
> And I introduced the local variables visibility_cutoff_xid and
> frz_conflict_horizon. I think it is important we distinguish between
> the latest xid pruned, the latest xmin of tuples frozen, and the
> latest xid of all live tuples on the page.
>
> Though we end up using visibility_cutoff_xid as the freeze conflict
> horizon if the page is all frozen, I think it is more clear to have
> the three variables and name them what they are.

Agreed.

> Note that I still don't think we have a resolution on what to
> correctly update new_relfrozenxid and new_relminmxid to at the end
> when presult->nfrozen == 0 and presult->all_frozen is true.
>
> if (presult->nfrozen > 0)
> {
>     presult->new_relfrozenxid = pagefrz->FreezePageRelfrozenXid;
>     presult->new_relminmxid = pagefrz->FreezePageRelminMxid;
> }
> else
> {
>     presult->new_relfrozenxid = pagefrz->NoFreezePageRelfrozenXid;
>     presult->new_relminmxid = pagefrz->NoFreezePageRelminMxid;
> }

One approach is to take them out of the PruneFreezeResult struct again, and pass them as pointers:

void
heap_page_prune_and_freeze(Relation relation, Buffer buffer,
                           ...
                           TransactionId *new_relfrozenxid,
                           MultiXactId *new_relminmxid,
                           ...
)

That would be natural for the caller too, as it wouldn't need to set up the old values to HeapPageFreeze before each call, nor copy the new values to 'vacrel' after the call. I'm thinking that we'd move the responsibility of setting up HeapPageFreeze to heap_page_prune_and_freeze(), instead of having the caller set it up. So the caller would look something like this:

heap_page_prune_and_freeze(rel, buf, vacrel->vistest,
                           &vacrel->cutoffs,
                           &vacrel->NewRelfrozenXid, &vacrel->NewRelminMxid,
                           &presult,
                           PRUNE_VACUUM_SCAN, flags,
                           &vacrel->offnum);

In this setup, heap_page_prune_and_freeze() would update *new_relfrozenxid and *new_relminmxid when it has a new value for them, and leave them unchanged otherwise.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
Thanks for committing the new WAL format! On Mon, Mar 25, 2024 at 3:33 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 24/03/2024 18:32, Melanie Plageman wrote: > > On Thu, Mar 21, 2024 at 9:28 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > >> > >> In heap_page_prune_and_freeze(), we now do some extra work on each live > >> tuple, to set the all_visible_except_removable correctly. And also to > >> update live_tuples, recently_dead_tuples and hastup. When we're not > >> freezing, that's a waste of cycles, the caller doesn't care. I hope it's > >> enough that it doesn't matter, but is it? > > > > Last year on an early version of the patch set I did some pgbench > > tpcb-like benchmarks -- since there is a lot of on-access pruning in > > that workload -- and I don't remember it being a showstopper. The code > > has changed a fair bit since then. However, I think it might be safer > > to pass a flag "by_vacuum" to heap_page_prune_and_freeze() and skip > > the rest of the loop after heap_prune_satisifies_vacuum() when > > on-access pruning invokes it. I had avoided that because it felt ugly > > and error-prone, however it addresses a few other of your points as > > well. > > Ok. I'm not a fan of the name 'by_vacuum' though. It'd be nice if the > argument described what it does, rather than who it's for. For example, > 'need_all_visible'. If set to true, the function determines > 'all_visible', otherwise it does not. I like that way of putting it -- describing what it does instead of who it is for. However, we now have PruneReason as an argument to heap_page_prune(), which would be usable for this purpose (for skipping the rest of the first loop). It is not descriptive of how we would use it in this scenario, though. > I started to look closer at the loops in heap_prune_chain() and how they > update all the various flags and counters. There's a lot going on there. 
> We have:
>
> - live_tuples counter
> - recently_dead_tuples counter
> - all_visible[_except_removable]
> - all_frozen
> - visibility_cutoff_xid
> - hastup
> - prstate.frozen array
> - nnewlpdead
> - deadoffsets array
>
> And that doesn't even include all the local variables and the final
> dead/redirected arrays.

Yes, there are a lot of things happening. In an early version, I had hoped for the first loop to be just getting the visibility information and then to do most of the other stuff as we went in heap_prune_chain() as you mention below. I couldn't quite get a version of that working that looked nice. I agree that the whole thing feels a bit brittle and error-prone. It's hard to be objective after fiddling with something over the course of a year. I'm trying to take a step back now and rethink it.

> Some of those are set in the first loop that initializes 'htsv' for each
> tuple on the page. Others are updated in heap_prune_chain(). Some are
> updated in both. It's hard to follow which are set where.

Yep.

> I think recently_dead_tuples is updated incorrectly, for tuples that are
> part of a completely dead HOT chain. For example, imagine a hot chain
> with two tuples: RECENTLY_DEAD -> DEAD. heap_prune_chain() would follow
> the chain, see the DEAD tuple at the end of the chain, and mark both
> tuples for pruning. However, we already updated 'recently_dead_tuples'
> in the first loop, which is wrong if we remove the tuple.
>
> Maybe that's the only bug like this, but I'm a little scared. Is there
> something we could do to make this simpler? Maybe move all the new work
> that we added to the first loop, into heap_prune_chain() ? Maybe
> introduce a few more helper heap_prune_record_*() functions, to update
> the flags and counters also for live and insert/delete-in-progress
> tuples and for dead line pointers? Something like
> heap_prune_record_live() and heap_prune_record_lp_dead().
I had discarded previous attempts to get everything done in heap_prune_chain() because it was hard to make sure I was doing the right thing given that it visits the line pointers out of order so making sure you've considered all of them once and only once was hard. I hadn't thought of the approach you suggested with record_live() -- that might help. I will work on this tomorrow. I had hoped to get something out today, but I am still in the middle of rebasing the back 20 patches from your v5 over current master and then adding in the suggestions that I made in the various diffs on the thread. > > Note that I still don't think we have a resolution on what to > > correctly update new_relfrozenxid and new_relminmxid to at the end > > when presult->nfrozen == 0 and presult->all_frozen is true. > > > > if (presult->nfrozen > 0) > > { > > presult->new_relfrozenxid = pagefrz->FreezePageRelfrozenXid; > > presult->new_relminmxid = pagefrz->FreezePageRelminMxid; > > } > > else > > { > > presult->new_relfrozenxid = pagefrz->NoFreezePageRelfrozenXid; > > presult->new_relminmxid = pagefrz->NoFreezePageRelminMxid; > > } > > One approach is to take them out of the PageFreezeResult struct again, > and pass them as pointers: > > void > heap_page_prune_and_freeze(Relation relation, Buffer buffer, > ... > TransactionId *new_relfrozenxid, > MultiXactId *new_relminmxid, > ... > ) What about the question about whether or not we should be using FreezePageRelfrozenXid when all_frozen is true and nrfrozen == 0. I was confused about whether or not we had to do this by the comment in HeapPageFreeze. > That would be natural for the caller too, as it wouldn't need to set up > the old values to HeapPageFreeze before each call, nor copy the new > values to 'vacrel' after the call. I'm thinking that we'd move the > responsibility of setting up HeapPageFreeze to > heap_page_prune_and_freeze(), instead of having the caller set it up. 
So > the caller would look something like this: > > heap_page_prune_and_freeze(rel, buf, vacrel->vistest, > &vacrel->cutoffs, &vacrel->NewRelfrozenXid, &vacrel->NewRelminMxid, > &presult, > PRUNE_VACUUM_SCAN, flags, > &vacrel->offnum); > > In this setup, heap_page_prune_and_freeze() would update > *new_relfrozenxid and *new_relminmxid when it has a new value for them, > and leave them unchanged otherwise. I do prefer having heap_page_prune_and_freeze() own HeapPageFreeze. - Melanie
On Mon, Mar 25, 2024 at 09:33:38PM +0200, Heikki Linnakangas wrote: > On 24/03/2024 18:32, Melanie Plageman wrote: > > On Thu, Mar 21, 2024 at 9:28 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > > > > > In heap_page_prune_and_freeze(), we now do some extra work on each live > > > tuple, to set the all_visible_except_removable correctly. And also to > > > update live_tuples, recently_dead_tuples and hastup. When we're not > > > freezing, that's a waste of cycles, the caller doesn't care. I hope it's > > > enough that it doesn't matter, but is it? > > > > Last year on an early version of the patch set I did some pgbench > > tpcb-like benchmarks -- since there is a lot of on-access pruning in > > that workload -- and I don't remember it being a showstopper. The code > > has changed a fair bit since then. However, I think it might be safer > > to pass a flag "by_vacuum" to heap_page_prune_and_freeze() and skip > > the rest of the loop after heap_prune_satisifies_vacuum() when > > on-access pruning invokes it. I had avoided that because it felt ugly > > and error-prone, however it addresses a few other of your points as > > well. > > Ok. I'm not a fan of the name 'by_vacuum' though. It'd be nice if the > argument described what it does, rather than who it's for. For example, > 'need_all_visible'. If set to true, the function determines 'all_visible', > otherwise it does not. A very rough v7 is attached. The whole thing is rebased over master and then 0016 contains an attempt at the refactor we discussed in this email. Instead of just using the PruneReason to avoid doing the extra steps when on-access pruning calls heap_page_prune_and_freeze(), I've made an "actions" variable and defined different flags for it. One of them is a replacement for the existing mark_unused_now flag. I defined another one, PRUNE_DO_TRY_FREEZE, which could be used in place of checking if pagefrz is NULL. 
There is a whole group of activities that only the vacuum caller does outside of freezing -- setting hastup, counting live and recently dead tuples, determining whole page visibility and a snapshot conflict horizon for updating the VM. But I didn't want to introduce separate flags for each of them, because then I would have to check each of them before taking the action. That would be lots of extra branching and on-access pruning does none of those actions while vacuum does all of them.

> I started to look closer at the loops in heap_prune_chain() and how they
> update all the various flags and counters. There's a lot going on there.
> We have:
>
> - live_tuples counter
> - recently_dead_tuples counter
> - all_visible[_except_removable]
> - all_frozen
> - visibility_cutoff_xid
> - hastup
> - prstate.frozen array
> - nnewlpdead
> - deadoffsets array
>
> And that doesn't even include all the local variables and the final
> dead/redirected arrays.
>
> Some of those are set in the first loop that initializes 'htsv' for each
> tuple on the page. Others are updated in heap_prune_chain(). Some are
> updated in both. It's hard to follow which are set where.
>
> I think recently_dead_tuples is updated incorrectly, for tuples that are
> part of a completely dead HOT chain. For example, imagine a hot chain with
> two tuples: RECENTLY_DEAD -> DEAD. heap_prune_chain() would follow the
> chain, see the DEAD tuple at the end of the chain, and mark both tuples for
> pruning. However, we already updated 'recently_dead_tuples' in the first
> loop, which is wrong if we remove the tuple.

Ah, yes, you are so right about this bug.
Maybe introduce a few > more helper heap_prune_record_*() functions, to update the flags and > counters also for live and insert/delete-in-progress tuples and for dead > line pointers? Something like heap_prune_record_live() and > heap_prune_record_lp_dead(). I like the idea of a heap_prune_record_live_or_recently_dead() function. That's what I've attempted to implement in the attached 0016. I haven't updated and cleaned up everything (especially comments) in the refactor, but there are two major issues: 1) In heap_prune_chain(), a heap-only tuple which is not HOT updated may end up being a live tuple not part of any chain or it may end up the redirect target in a HOT chain. At the top of heap_prune_chain(), we return if (HeapTupleHeaderIsHeapOnly(htup)). We may come back to this tuple later if it is part of a chain. If we don't, we need to have called heap_prune_record_live_or_recently_dead(). However, there are other tuples that get redirected to which do not meet this criteria, so we must call heap_prune_record_live_or_recently_dead() when setting an item redirected to. If we call heap_prune_record_live_or_recently_dead() in both places, we will double-count. To fix this, I introduced an array, "counted". But that takes up extra space in the PruneState and extra cycles to memset it. I can't think of a way to make sure we count the right tuples without another array. The tuples we need to count are those not pointed to by prstate->marked + those tuples whose line pointers will be redirected to (those are marked). 2) A large number of the members of PruneFreezeResult are only initialized for the vacuum caller now. Even with a comment, this is a bit confusing. And, it seems like there should be some symmetry between the actions the caller tells heap_page_prune_and_freeze() to take and the result parameters that are filled in. 
I am concerned about adding all of the actions (setting hastup, determining whole page visibility, etc as mentioned above) because then I also have to check all the actions and that will add extra branching. And out of the two callers of heap_page_prune_and_freeze(), one will do all of the actions and one will do none of them except "main" pruning. > > Note that I still don't think we have a resolution on what to > > correctly update new_relfrozenxid and new_relminmxid to at the end > > when presult->nfrozen == 0 and presult->all_frozen is true. > > > > if (presult->nfrozen > 0) > > { > > presult->new_relfrozenxid = pagefrz->FreezePageRelfrozenXid; > > presult->new_relminmxid = pagefrz->FreezePageRelminMxid; > > } > > else > > { > > presult->new_relfrozenxid = pagefrz->NoFreezePageRelfrozenXid; > > presult->new_relminmxid = pagefrz->NoFreezePageRelminMxid; > > } > > One approach is to take them out of the PageFreezeResult struct again, and > pass them as pointers: > > void > heap_page_prune_and_freeze(Relation relation, Buffer buffer, > ... > TransactionId *new_relfrozenxid, > MultiXactId *new_relminmxid, > ... > ) > > That would be natural for the caller too, as it wouldn't need to set up the > old values to HeapPageFreeze before each call, nor copy the new values to > 'vacrel' after the call. I'm thinking that we'd move the responsibility of > setting up HeapPageFreeze to heap_page_prune_and_freeze(), instead of having > the caller set it up. So the caller would look something like this: > > heap_page_prune_and_freeze(rel, buf, vacrel->vistest, > &vacrel->cutoffs, &vacrel->NewRelfrozenXid, &vacrel->NewRelminMxid, > &presult, > PRUNE_VACUUM_SCAN, flags, > &vacrel->offnum); > > In this setup, heap_page_prune_and_freeze() would update *new_relfrozenxid > and *new_relminmxid when it has a new value for them, and leave them > unchanged otherwise. I've passed new_relfrozen_xid and new_relmin_mxid as arguments. 
But as for only updating them when there is a new value, that doesn't sound cheaper than just setting them when they are passed in with the values from [No]FreezePageRelfrozenXid, [No]FreezePageRelminMxid. Unless you are imagining a way to simplify the current [No]FreezePageRelfrozenXid, [No]FreezePageRelminMxid. - Melanie
Attachment
- v7-0001-lazy_scan_prune-tests-tuple-vis-with-GlobalVisTes.patch
- v7-0002-Pass-heap_prune_chain-PruneResult-output-paramete.patch
- v7-0003-Rename-PruneState-snapshotConflictHorizon-to-late.patch
- v7-0004-heap_page_prune-sets-all_visible-and-visibility_c.patch
- v7-0005-Add-reference-to-VacuumCutoffs-in-HeapPageFreeze.patch
- v7-0006-Prepare-freeze-tuples-in-heap_page_prune.patch
- v7-0007-lazy_scan_prune-reorder-freeze-execution-logic.patch
- v7-0008-Execute-freezing-in-heap_page_prune.patch
- v7-0009-Make-opp-freeze-heuristic-compatible-with-prune-f.patch
- v7-0010-Separate-tuple-pre-freeze-checks-and-invoke-earli.patch
- v7-0011-Remove-heap_freeze_execute_prepared.patch
- v7-0012-Merge-prune-and-freeze-records.patch
- v7-0013-Set-hastup-in-heap_page_prune.patch
- v7-0014-Count-tuples-for-vacuum-logging-in-heap_page_prun.patch
- v7-0015-Save-dead-tuple-offsets-during-heap_page_prune.patch
- v7-0016-move-live-tuple-accounting-to-heap_prune_chain.patch
On Tue, Mar 26, 2024 at 5:46 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Mar 25, 2024 at 09:33:38PM +0200, Heikki Linnakangas wrote: > > On 24/03/2024 18:32, Melanie Plageman wrote: > > > On Thu, Mar 21, 2024 at 9:28 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > > > > > > > In heap_page_prune_and_freeze(), we now do some extra work on each live > > > > tuple, to set the all_visible_except_removable correctly. And also to > > > > update live_tuples, recently_dead_tuples and hastup. When we're not > > > > freezing, that's a waste of cycles, the caller doesn't care. I hope it's > > > > enough that it doesn't matter, but is it? > > > > > > Last year on an early version of the patch set I did some pgbench > > > tpcb-like benchmarks -- since there is a lot of on-access pruning in > > > that workload -- and I don't remember it being a showstopper. The code > > > has changed a fair bit since then. However, I think it might be safer > > > to pass a flag "by_vacuum" to heap_page_prune_and_freeze() and skip > > > the rest of the loop after heap_prune_satisfies_vacuum() when > > > on-access pruning invokes it. I had avoided that because it felt ugly > > > and error-prone, however it addresses a few other of your points as > > > well. > > > > Ok. I'm not a fan of the name 'by_vacuum' though. It'd be nice if the > > argument described what it does, rather than who it's for. For example, > > 'need_all_visible'. If set to true, the function determines 'all_visible', > > otherwise it does not. > > A very rough v7 is attached. The whole thing is rebased over master and > then 0016 contains an attempt at the refactor we discussed in this > email. > > Instead of just using the PruneReason to avoid doing the extra steps > when on-access pruning calls heap_page_prune_and_freeze(), I've made an > "actions" variable and defined different flags for it. One of them is > a replacement for the existing mark_unused_now flag. 
I defined another > one, PRUNE_DO_TRY_FREEZE, which could be used in place of checking if > pagefrz is NULL. > > There is a whole group of activities that only the vacuum caller does > outside of freezing -- setting hastup, counting live and recently dead > tuples, determining whole page visibility and a snapshot conflict > horizon for updating the VM. But I didn't want to introduce separate > flags for each of them, because then I would have to check each of them > before taking the action. That would be lots of extra branching and > on-access pruning does none of those actions while vacuum does all of > them. > > > I started to look closer at the loops in heap_prune_chain() and how they > > update all the various flags and counters. There's a lot going on there. We > > have: > > > > - live_tuples counter > > - recently_dead_tuples counter > > - all_visible[_except_removable] > > - all_frozen > > - visibility_cutoff_xid > > - hastup > > - prstate.frozen array > > - nnewlpdead > > - deadoffsets array > > > > And that doesn't even include all the local variables and the final > > dead/redirected arrays. > > > > Some of those are set in the first loop that initializes 'htsv' for each > > tuple on the page. Others are updated in heap_prune_chain(). Some are > > updated in both. It's hard to follow which are set where. > > > > I think recently_dead_tuples is updated incorrectly, for tuples that are > > part of a completely dead HOT chain. For example, imagine a hot chain with > > two tuples: RECENTLY_DEAD -> DEAD. heap_prune_chain() would follow the > > chain, see the DEAD tuple at the end of the chain, and mark both tuples for > > pruning. However, we already updated 'recently_dead_tuples' in the first > > loop, which is wrong if we remove the tuple. > > Ah, yes, you are so right about this bug. > > > Maybe that's the only bug like this, but I'm a little scared. Is there > > something we could do to make this simpler? 
Maybe move all the new work that > > we added to the first loop, into heap_prune_chain() ? Maybe introduce a few > > more helper heap_prune_record_*() functions, to update the flags and > > counters also for live and insert/delete-in-progress tuples and for dead > > line pointers? Something like heap_prune_record_live() and > > heap_prune_record_lp_dead(). > > I like the idea of a heap_prune_record_live_or_recently_dead() function. > That's what I've attempted to implement in the attached 0016. I haven't > updated and cleaned up everything (especially comments) in the refactor, > but there are two major issues: > > 1) In heap_prune_chain(), a heap-only tuple which is not HOT updated may > end up being a live tuple not part of any chain or it may end up the > redirect target in a HOT chain. At the top of heap_prune_chain(), we > return if (HeapTupleHeaderIsHeapOnly(htup)). We may come back to this > tuple later if it is part of a chain. If we don't, we need to have > called heap_prune_record_live_or_recently_dead(). However, there are > other tuples that get redirected to which do not meet this criteria, so > we must call heap_prune_record_live_or_recently_dead() when setting an > item redirected to. If we call heap_prune_record_live_or_recently_dead() > in both places, we will double-count. To fix this, I introduced an > array, "counted". But that takes up extra space in the PruneState and > extra cycles to memset it. > > I can't think of a way to make sure we count the right tuples without > another array. The tuples we need to count are those not pointed to by > prstate->marked + those tuples whose line pointers will be redirected to > (those are marked). > > 2) A large number of the members of PruneFreezeResult are only > initialized for the vacuum caller now. Even with a comment, this is a > bit confusing. 
And, it seems like there should be some symmetry between > the actions the caller tells heap_page_prune_and_freeze() to take and > the result parameters that are filled in. > > I am concerned about adding all of the actions (setting hastup, > determining whole page visibility, etc as mentioned above) because then > I also have to check all the actions and that will add extra branching. > And out of the two callers of heap_page_prune_and_freeze(), one will do > all of the actions and one will do none of them except "main" pruning. This morning I worked on a version of this patchset which moved the counting of live and recently dead tuples and the calculation of the vm conflict horizon back to lazy_scan_prune() but kept the freezing and dead offset collection in heap_prune_chain(). I encountered the same problem with ensuring each tuple was considered for freezing exactly once. It also made me realize that my patch set (v7) still has the same problem in which all_visible_except_removable will be incorrectly set to false and recently dead tuples incorrectly incremented when encountering HEAPTUPLE_RECENTLY_DEAD tuples whose line pointers get set LP_DEAD during pruning. And I think I am incorrectly calling heap_prepare_freeze_tuple() on them too. I need some way to modify the control flow or accounting such that I know which HEAPTUPLE_RECENTLY_DEAD tuples will not be marked LP_DEAD. And a way to consider freezing and do live tuple accounting for these and HEAPTUPLE_LIVE tuples exactly once. - Melanie
On 27/03/2024 17:18, Melanie Plageman wrote: > I need some way to modify the control flow or accounting such that I > know which HEAPTUPLE_RECENTLY_DEAD tuples will not be marked LP_DEAD. > And a way to consider freezing and do live tuple accounting for these > and HEAPTUPLE_LIVE tuples exactly once. Just a quick update: I've been massaging this some more today, and I think I'm onto something palatable. I'll send an updated patch later today, but the key is to note that for each item on the page, there is one point where we determine the fate of the item, whether it's pruned or not. That can happen at different points in heap_page_prune(). That's also when we set marked[offnum] = true. Whenever that happens, we call one of the heap_page_prune_record_*() subroutines. We already have those subroutines for when a tuple is marked as dead or unused, but let's add similar subroutines for the case that we're leaving the tuple unchanged. If we move all the bookkeeping logic to those subroutines, we can ensure that it gets done exactly once for each tuple, and at that point we know what we are going to do to the tuple, so we can count it correctly. So heap_prune_chain() decides what to do with each tuple, and ensures that each tuple is marked only once, and the subroutines update all the variables, add the item to the correct arrays etc. depending on what we're doing with it. -- Heikki Linnakangas Neon (https://neon.tech)
On Tue, Mar 19, 2024 at 9:36 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > * "Freeze" NewRelfrozenXid/NewRelminMxid trackers. > * > * Trackers used when heap_freeze_execute_prepared freezes, or when there > * are zero freeze plans for a page. It is always valid for vacuumlazy.c > * to freeze any page, by definition. This even includes pages that have > * no tuples with storage to consider in the first place. That way the > * 'totally_frozen' results from heap_prepare_freeze_tuple can always be > * used in the same way, even when no freeze plans need to be executed to > * "freeze the page". Only the "freeze" path needs to consider the need > * to set pages all-frozen in the visibility map under this scheme. > * > * When we freeze a page, we generally freeze all XIDs < OldestXmin, only > * leaving behind XIDs that are ineligible for freezing, if any. And so > * you might wonder why these trackers are necessary at all; why should > * _any_ page that VACUUM freezes _ever_ be left with XIDs/MXIDs that > * ratchet back the top-level NewRelfrozenXid/NewRelminMxid trackers? > * > * It is useful to use a definition of "freeze the page" that does not > * overspecify how MultiXacts are affected. heap_prepare_freeze_tuple > * generally prefers to remove Multis eagerly, but lazy processing is used > * in cases where laziness allows VACUUM to avoid allocating a new Multi. > * The "freeze the page" trackers enable this flexibility. > */ > > So, I don't really know if it is right to just check presult->nfrozen > 0 when updating relminmxid. I have changed it to the way you suggested. > But we can change it back. I think that this is just about safe. I had to check, though. I see that the FRM_NOOP case (within FreezeMultiXactId/heap_prepare_freeze_tuple) will ratchet back both sets of trackers (both the freeze and no freeze variants). However, it's rather hard to see that this is true. 
The intent here was that cases where "presult->nfrozen == 0" would always take the "freeze" path. That seems more natural to me, at least, since I think of the freeze path as the default choice. By definition, lazy_scan_prune() can always take the freeze path -- even when the page has no tuples with storage. But it cannot always take the no-freeze path -- "disobeying" pagefrz.freeze_required creates the risk that relfrozenxid/relminmxid will be advanced to unsafe values at the end of the VACUUM. IMV you should stick with that approach now, even if it is currently safe to do it the other way around. -- Peter Geoghegan
On Wed, Mar 27, 2024 at 12:18 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 27/03/2024 17:18, Melanie Plageman wrote: > > I need some way to modify the control flow or accounting such that I > > know which HEAPTUPLE_RECENTLY_DEAD tuples will not be marked LP_DEAD. > > And a way to consider freezing and do live tuple accounting for these > > and HEAPTUPLE_LIVE tuples exactly once. > > Just a quick update: I've been massaging this some more today, and I > think I'm onto got something palatable. I'll send an updated patch later > today, but the key is to note that for each item on the page, there is > one point where we determine the fate of the item, whether it's pruned > or not. That can happen in different points in in heap_page_prune(). > That's also when we set marked[offnum] = true. Whenever that happens, we > all call one of the a heap_page_prune_record_*() subroutines. We already > have those subroutines for when a tuple is marked as dead or unused, but > let's add similar subroutines for the case that we're leaving the tuple > unchanged. If we move all the bookkeeping logic to those subroutines, we > can ensure that it gets done exactly once for each tuple, and at that > point we know what we are going to do to the tuple, so we can count it > correctly. So heap_prune_chain() decides what to do with each tuple, and > ensures that each tuple is marked only once, and the subroutines update > all the variables, add the item to the correct arrays etc. depending on > what we're doing with it. Yes, this would be ideal. I was doing some experimentation with pageinspect today (trying to find that single place where live tuples' fates are decided) and it seems like a heap-only tuple that is not HOT-updated will usually be the one at the end of the chain. Which seems like it would be covered by adding a record_live() type function call in the loop of heap_prune_chain():

/*
 * If the tuple is not HOT-updated, then we are at the end of this
 * HOT-update chain.
 */
if (!HeapTupleHeaderIsHotUpdated(htup))
{
    heap_prune_record_live_or_recently_dead(dp, prstate,
                                            offnum, presult);
    break;
}

but that doesn't end up producing the same results as

if (HeapTupleHeaderIsHeapOnly(htup) &&
    !HeapTupleHeaderIsHotUpdated(htup) &&
    presult->htsv[rootoffnum] == HEAPTUPLE_DEAD)
    heap_prune_record_live_or_recently_dead(dp, prstate,
                                            offnum, presult);

at the top of heap_prune_chain(). - Melanie
On Wed, Mar 27, 2024 at 2:26 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Wed, Mar 27, 2024 at 12:18 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > > > On 27/03/2024 17:18, Melanie Plageman wrote: > > > I need some way to modify the control flow or accounting such that I > > > know which HEAPTUPLE_RECENTLY_DEAD tuples will not be marked LP_DEAD. > > > And a way to consider freezing and do live tuple accounting for these > > > and HEAPTUPLE_LIVE tuples exactly once. > > > > Just a quick update: I've been massaging this some more today, and I > > think I'm onto got something palatable. I'll send an updated patch later > > today, but the key is to note that for each item on the page, there is > > one point where we determine the fate of the item, whether it's pruned > > or not. That can happen in different points in in heap_page_prune(). > > That's also when we set marked[offnum] = true. Whenever that happens, we > > all call one of the a heap_page_prune_record_*() subroutines. We already > > have those subroutines for when a tuple is marked as dead or unused, but > > let's add similar subroutines for the case that we're leaving the tuple > > unchanged. If we move all the bookkeeping logic to those subroutines, we > > can ensure that it gets done exactly once for each tuple, and at that > > point we know what we are going to do to the tuple, so we can count it > > correctly. So heap_prune_chain() decides what to do with each tuple, and > > ensures that each tuple is marked only once, and the subroutines update > > all the variables, add the item to the correct arrays etc. depending on > > what we're doing with it. > > Yes, this would be ideal. > > I was doing some experimentation with pageinspect today (trying to > find that single place where live tuples fates are decided) and it > seems like a heap-only tuple that is not HOT-updated will usually be > the one at the end of the chain. 
Which seems like it would be covered > by adding a record_live() type function call in the loop of > heap_prune_chain(): > > /* > * If the tuple is not HOT-updated, then we are at the end of this > * HOT-update chain. > */ > if (!HeapTupleHeaderIsHotUpdated(htup)) > { > heap_prune_record_live_or_recently_dead(dp, prstate, > offnum, presult); > break; > } > > but that doesn't end up producing the same results as > > if (HeapTupleHeaderIsHeapOnly(htup) > && !HeapTupleHeaderIsHotUpdated(htup) && > presult->htsv[rootoffnum] == HEAPTUPLE_DEAD) > heap_prune_record_live_or_recently_dead(dp, prstate, > offnum, presult); sorry, that should say presult->htsv[rootoffnum] != HEAPTUPLE_DEAD. The latter should be a subset of the former. But instead it seems there are cases I missed by doing only the former. - Melanie
On 27/03/2024 20:26, Melanie Plageman wrote: > On Wed, Mar 27, 2024 at 12:18 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> >> On 27/03/2024 17:18, Melanie Plageman wrote: >>> I need some way to modify the control flow or accounting such that I >>> know which HEAPTUPLE_RECENTLY_DEAD tuples will not be marked LP_DEAD. >>> And a way to consider freezing and do live tuple accounting for these >>> and HEAPTUPLE_LIVE tuples exactly once. >> >> Just a quick update: I've been massaging this some more today, and I >> think I'm onto got something palatable. I'll send an updated patch later >> today, but the key is to note that for each item on the page, there is >> one point where we determine the fate of the item, whether it's pruned >> or not. That can happen in different points in in heap_page_prune(). >> That's also when we set marked[offnum] = true. Whenever that happens, we >> all call one of the a heap_page_prune_record_*() subroutines. We already >> have those subroutines for when a tuple is marked as dead or unused, but >> let's add similar subroutines for the case that we're leaving the tuple >> unchanged. If we move all the bookkeeping logic to those subroutines, we >> can ensure that it gets done exactly once for each tuple, and at that >> point we know what we are going to do to the tuple, so we can count it >> correctly. So heap_prune_chain() decides what to do with each tuple, and >> ensures that each tuple is marked only once, and the subroutines update >> all the variables, add the item to the correct arrays etc. depending on >> what we're doing with it. > > Yes, this would be ideal. Well, that took me a lot longer than expected. My approach of "make sure you call the right heap_prune_record_*() subroutine in all cases" didn't work out quite as easily as I thought. 
Because, as you pointed out, it's difficult to know if a non-DEAD tuple that is part of a HOT chain will be visited later as part of the chain processing, or needs to be counted at the top of heap_prune_chain(). The solution I came up with is to add a third phase to pruning. At the top of heap_prune_chain(), if we see a live heap-only tuple, and we're not sure if it will be counted later as part of a HOT chain, we stash it away and revisit it later, after processing all the hot chains. That's somewhat similar to your 'counted' array, but not quite. Attached is that approach, on top of v7. It's a bit messy; I made a bunch of other changes too and didn't fully separate them out into separate patches. Sorry about that. One change with this is that live_tuples and many of the other fields are now again updated, even if the caller doesn't need them. It was hard to skip them in a way that would save any cycles, with the other refactorings. Some other notable changes are mentioned in the commit message. > I was doing some experimentation with pageinspect today (trying to > find that single place where live tuples fates are decided) and it > seems like a heap-only tuple that is not HOT-updated will usually be > the one at the end of the chain. Which seems like it would be covered > by adding a record_live() type function call in the loop of > heap_prune_chain(): > > /* > * If the tuple is not HOT-updated, then we are at the end of this > * HOT-update chain. > */ > if (!HeapTupleHeaderIsHotUpdated(htup)) > { > heap_prune_record_live_or_recently_dead(dp, prstate, > offnum, presult); > break; > } > > but that doesn't end up producing the same results as > > if (HeapTupleHeaderIsHeapOnly(htup) > && !HeapTupleHeaderIsHotUpdated(htup) && > presult->htsv[rootoffnum] == HEAPTUPLE_DEAD) > heap_prune_record_live_or_recently_dead(dp, prstate, > offnum, presult); > > at the top of heap_prune_chain(). 
Yep, this is tricky; I also spent a lot of time trying to find a good "choke point" where we could say for sure that a live tuple is processed exactly once, but fumbled just like you. -- Heikki Linnakangas Neon (https://neon.tech)
Attachment
- v8-0001-lazy_scan_prune-tests-tuple-vis-with-GlobalVisTes.patch
- v8-0002-Pass-heap_prune_chain-PruneResult-output-paramete.patch
- v8-0003-Rename-PruneState-snapshotConflictHorizon-to-late.patch
- v8-0004-heap_page_prune-sets-all_visible-and-visibility_c.patch
- v8-0005-Add-reference-to-VacuumCutoffs-in-HeapPageFreeze.patch
- v8-0006-Prepare-freeze-tuples-in-heap_page_prune.patch
- v8-0007-lazy_scan_prune-reorder-freeze-execution-logic.patch
- v8-0008-Execute-freezing-in-heap_page_prune.patch
- v8-0009-Make-opp-freeze-heuristic-compatible-with-prune-f.patch
- v8-0010-Separate-tuple-pre-freeze-checks-and-invoke-earli.patch
- v8-0011-Remove-heap_freeze_execute_prepared.patch
- v8-0012-Merge-prune-and-freeze-records.patch
- v8-0013-Set-hastup-in-heap_page_prune.patch
- v8-0014-Count-tuples-for-vacuum-logging-in-heap_page_prun.patch
- v8-0015-Save-dead-tuple-offsets-during-heap_page_prune.patch
- v8-0016-move-live-tuple-accounting-to-heap_prune_chain.patch
- v8-0017-Move-frozen-array-to-PruneState.patch
- v8-0018-Cosmetic-fixes.patch
- v8-0019-Almost-cosmetic-fixes.patch
- v8-0020-Move-frz_conflict_horizon-to-tighter-scope.patch
- v8-0021-Add-comment-about-a-pre-existing-issue.patch
- v8-0022-WIP.patch
On Thu, Mar 28, 2024 at 01:04:04AM +0200, Heikki Linnakangas wrote: > On 27/03/2024 20:26, Melanie Plageman wrote: > > On Wed, Mar 27, 2024 at 12:18 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > > > > > On 27/03/2024 17:18, Melanie Plageman wrote: > > > > I need some way to modify the control flow or accounting such that I > > > > know which HEAPTUPLE_RECENTLY_DEAD tuples will not be marked LP_DEAD. > > > > And a way to consider freezing and do live tuple accounting for these > > > > and HEAPTUPLE_LIVE tuples exactly once. > > > > > > Just a quick update: I've been massaging this some more today, and I > > > think I'm onto got something palatable. I'll send an updated patch later > > > today, but the key is to note that for each item on the page, there is > > > one point where we determine the fate of the item, whether it's pruned > > > or not. That can happen in different points in in heap_page_prune(). > > > That's also when we set marked[offnum] = true. Whenever that happens, we > > > all call one of the a heap_page_prune_record_*() subroutines. We already > > > have those subroutines for when a tuple is marked as dead or unused, but > > > let's add similar subroutines for the case that we're leaving the tuple > > > unchanged. If we move all the bookkeeping logic to those subroutines, we > > > can ensure that it gets done exactly once for each tuple, and at that > > > point we know what we are going to do to the tuple, so we can count it > > > correctly. So heap_prune_chain() decides what to do with each tuple, and > > > ensures that each tuple is marked only once, and the subroutines update > > > all the variables, add the item to the correct arrays etc. depending on > > > what we're doing with it. > > > > Yes, this would be ideal. > > Well, that took me a lot longer than expected. My approach of "make sure you > all the right heap_prune_record_*() subroutine in all cases didn't work out > quite as easily as I thought. 
Because, as you pointed out, it's difficult to > know if a non-DEAD tuple that is part of a HOT chain will be visited later > as part of the chain processing, or needs to be counted at the top of > heap_prune_chain(). > > The solution I came up with is to add a third phase to pruning. At the top > of heap_prune_chain(), if we see a live heap-only tuple, and we're not sure > if it will be counted later as part of a HOT chain, we stash it away and > revisit it later, after processing all the hot chains. That's somewhat > similar to your 'counted' array, but not quite. I like this idea (the third phase). I've just started reviewing this but I thought I would give the initial thoughts I had inline. > One change with this is that live_tuples and many of the other fields are > now again updated, even if the caller doesn't need them. It was hard to skip > them in a way that would save any cycles, with the other refactorings. I am worried we are writing checks that are going to have to come out of SELECT queries' bank accounts, but I'll do some benchmarking when we're all done with major refactoring. > From 2f38628373ccfb6e8f8fd883955056030092569d Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Thu, 28 Mar 2024 00:16:09 +0200 > Subject: [PATCH v8 21/22] Add comment about a pre-existing issue > > Not sure if we want to keep this, but I wanted to document it for > discussion at least. > --- > src/backend/access/heap/pruneheap.c | 17 +++++++++++++++++ > 1 file changed, 17 insertions(+) > > diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c > index e37ba655a7d..2b720ab6aa1 100644 > --- a/src/backend/access/heap/pruneheap.c > +++ b/src/backend/access/heap/pruneheap.c > @@ -792,6 +792,23 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, > * Note that we might first arrive at a dead heap-only tuple > * either here or while following a chain below. 
Whichever path > * gets there first will mark the tuple unused. > + * > + * FIXME: The next paragraph isn't new with these patches, but > + * just something I realized while looking at this. But maybe we should > + * add a comment like this? Or is it too much detail? I think a comment is a good idea. > + * > + * Whether we arrive at the dead HOT tuple first here or while > + * following a chain below affects whether preceding RECENTLY_DEAD > + * tuples in the chain can be removed or not. Imagine that you > + * have a chain with two tuples: RECENTLY_DEAD -> DEAD. If we > + * reach the RECENTLY_DEAD tuple first, the chain-following logic > + * will find the DEAD tuple and conclude that both tuples are in > + * fact dead and can be removed. But if we reach the DEAD tuple > + * at the end of the chain first, when we reach the RECENTLY_DEAD > + * tuple later, we will not follow the chain because the DEAD > + * TUPLE is already 'marked', and will not remove the > + * RECENTLY_DEAD tuple. This is not a correctness issue, and the > + * RECENTLY_DEAD tuple will be removed by a later VACUUM. > */ > if (!HeapTupleHeaderIsHotUpdated(htup)) Is this intentional? Like would it be correct to remove the RECENTLY_DEAD tuple during the current vacuum? > From c2ee7508456d0e76de985f9997a6840450e342a8 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Thu, 28 Mar 2024 00:45:26 +0200 > Subject: [PATCH v8 22/22] WIP > > - Got rid of all_visible_except_removable again. We're back to the > roughly the same mechanism as on 'master', where the all_visible > doesn't include LP_DEAD items, but at the end of > heap_page_prune_and_freeze() when we return to the caller, we clear > it if there were any LP_DEAD items. I considered calling the > variable 'all_visible_except_lp_dead', which would be more accurate, > but it's also very long. not longer than all_visible_except_removable. I would be happy to keep it more exact, but I'm also okay with just all_visible. 
> - I duplicated all the fields from PageFreezeResult to PruneState. Now > heap_prune_chain() and all the subroutines don't need the > PageFreezeResult argument, and you don't need to remember which > struct each field is kept in. It's all now in PruneState, and the > fields that the caller is interested in are copied to > PageFreezeResult at the end of heap_page_prune_and_freeze() yea, this makes sense to me. Makes me wonder if we shouldn't just have PruneFreezeResult->live_tuples/recently_dead_tuples/etc be pointers and then lazy_scan_prune() can pass the actual vacrel->live_tuples counter and heap_page_prune_and_freeze() can increment it itself. Maybe that's weird though. > - Replaced the 'counted' array with 'revisit' array. I thought I could > get rid of it altogether, by just being careful to call the right > heap_prune_record_*() subroutine for each tuple in heap_prune_chain(), > but with live and recently-dead tuples that are part of a HOT chain, > we might visit the tuple as part of the HOT chain or not, depending > on what it's position in the chain is. So I invented a new revisit > phase. All live heap-only tuples that we find, that haven't already > been processed as part of a hot chain, are stashed away in the > 'revisit' array. After processing all the HOT chains, the 'revisit' > tuples are re-checked, and counted if they haven't already been counted. makes sense. > - Live tuples are now also marked in the 'marked' array, when they are > counted. This gives a nice invariant: all tuples must be marked > exactly once, as part of a hot chain or otherwise. Added an > assertion for that. this is a nice thing to have. 
> --- > src/backend/access/heap/pruneheap.c | 706 +++++++++++++++++---------- > src/backend/access/heap/vacuumlazy.c | 6 +- > src/include/access/heapam.h | 37 +- > 3 files changed, 464 insertions(+), 285 deletions(-) > > diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c > index 2b720ab6aa1..a8ed11a1858 100644 > --- a/src/backend/access/heap/pruneheap.c > +++ b/src/backend/access/heap/pruneheap.c > > -static void heap_prune_record_live_or_recently_dead(Page page, PruneState *prstate, > - OffsetNumber offnum, PruneFreezeResult *presult); > static void page_verify_redirects(Page page); > > > @@ -242,6 +284,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer) > * vistest is used to distinguish whether tuples are DEAD or RECENTLY_DEAD > * (see heap_prune_satisfies_vacuum). > * What's this "cutoffs TODO"? > + * cutoffs TODO > + * > * presult contains output parameters needed by callers such as the number of > * tuples removed and the number of line pointers newly marked LP_DEAD. > * heap_page_prune_and_freeze() is responsible for initializing it. > @@ -326,70 +370,63 @@ heap_page_prune_and_freeze(Relation relation, Buffer buffer, > prstate.latest_xid_removed = InvalidTransactionId; > prstate.nredirected = prstate.ndead = prstate.nunused = prstate.nfrozen = 0; > memset(prstate.marked, 0, sizeof(prstate.marked)); > - memset(prstate.counted, 0, sizeof(prstate.counted)); > + prstate.nrevisit = 0; > > if (presult->nfrozen > 0) > @@ -728,10 +832,12 @@ htsv_get_valid_status(int status) > * DEAD, our visibility test is just too coarse to detect it. > * > * In general, pruning must never leave behind a DEAD tuple that still has > - * tuple storage. VACUUM isn't prepared to deal with that case. That's why > + * tuple storage. VACUUM isn't prepared to deal with that case (FIXME: it no longer cares, right?). 
> + * That's why > * VACUUM prunes the same heap page a second time (without dropping its lock > * in the interim) when it sees a newly DEAD tuple that we initially saw as > - * in-progress. Retrying pruning like this can only happen when an inserting > + * in-progress (FIXME: Really? Does it still do that?). Yea, I think all that is no longer true. I missed this comment back then. > + * Retrying pruning like this can only happen when an inserting > * transaction concurrently aborts. > * > * The root line pointer is redirected to the tuple immediately after the > @@ -743,15 +849,18 @@ htsv_get_valid_status(int status) > * prstate showing the changes to be made. Items to be redirected are added > * to the redirected[] array (two entries per redirection); items to be set to > * LP_DEAD state are added to nowdead[]; and items to be set to LP_UNUSED > - * state are added to nowunused[]. > - * > - * Returns the number of tuples (to be) deleted from the page. > + * state are added to nowunused[]. We perform bookkeeping of live tuples, > + * visibility etc. based on what the page will look like after the changes > + * applied. All that bookkeeping is performed in the heap_prune_record_*() > + * subroutines. The division of labor is that heap_prune_chain() decides the > + * fate of each tuple, ie. whether it's going to be removed, redirected or > + * left unchanged, and the heap_prune_record_*() subroutines update PruneState > + * based on that outcome. > */ > -static int > +static void > heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, > - PruneState *prstate, PruneFreezeResult *presult) > + PruneState *prstate) > { > - int ndeleted = 0; > Page dp = (Page) BufferGetPage(buffer); > TransactionId priorXmax = InvalidTransactionId; > ItemId rootlp; > @@ -794,8 +903,8 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, > * gets there first will mark the tuple unused. 
> *
> * FIXME: The next paragraph isn't new with these patches, but
> - * just something I realized while looking at this. But maybe we should
> - * add a comment like this? Or is it too much detail?
> + * just something I realized while looking at this. But maybe we
> + * should add a comment like this? Or is it too much detail?

I don't think it is too much detail.

> *
> * Whether we arrive at the dead HOT tuple first here or while
> * following a chain below affects whether preceding RECENTLY_DEAD
> @@ -809,43 +918,52 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
> * TUPLE is already 'marked', and will not remove the
> * RECENTLY_DEAD tuple. This is not a correctness issue, and the
> * RECENTLY_DEAD tuple will be removed by a later VACUUM.
> + *
> + * FIXME: Now that we have the 'revisit' array, we could stash
> + * these DEAD items there too, instead of processing them here
> + * immediately. That way, DEAD tuples that are still part of a
> + * chain would always get processed as part of the chain.
> + */

I really like this idea!

> if (!HeapTupleHeaderIsHotUpdated(htup))
> {
> if (prstate->htsv[rootoffnum] == HEAPTUPLE_DEAD)
> {
> - heap_prune_record_unused(prstate, rootoffnum);
> + heap_prune_record_unused(prstate, rootoffnum, true);
> HeapTupleHeaderAdvanceConflictHorizon(htup,
> &prstate->latest_xid_removed);
> - ndeleted++;
> - }

I think we could really do with some more comments with examples like this in the pruning code (that go through an example series of steps). Not least of which because you can't see RECENTLY_DEAD in pageinspect (you'd have to create some kind of status for it from the different tuples on the page). For example, I hadn't thought of this one: REDIRECT -> RECENTLY_DEAD -> DEAD -> RECENTLY_DEAD -> DEAD

> + /*----
> + * FIXME: this helped me to visualize how different chains might look like
> + * here. It's not an exhaustive list, just some examples to help with
> + * thinking.
Remove this comment from final version, or refine. > + * > + * REDIRECT -> LIVE (stop) -> ... > + * REDIRECT -> RECENTY_DEAD -> LIVE (stop) -> ... > + * REDIRECT -> RECENTY_DEAD -> RECENTLY_DEAD > + * REDIRECT -> RECENTY_DEAD -> DEAD > + * REDIRECT -> RECENTY_DEAD -> DEAD -> RECENTLY_DEAD -> DEAD > + * RECENTLY_DEAD -> ... > + */ > + > /* while not end of the chain */ > for (;;) > { > @@ -897,19 +1015,12 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, > > case HEAPTUPLE_RECENTLY_DEAD: > recent_dead = true; > - > - /* > - * This tuple may soon become DEAD. Update the hint field so > - * that the page is reconsidered for pruning in future. > - */ > - heap_prune_record_prunable(prstate, > - HeapTupleHeaderGetUpdateXid(htup)); > break; > > case HEAPTUPLE_DELETE_IN_PROGRESS: > - > - /* > - * This tuple may soon become DEAD. Update the hint field so > - * that the page is reconsidered for pruning in future. > - */ > - heap_prune_record_prunable(prstate, > - HeapTupleHeaderGetUpdateXid(htup)); > - break; > - > case HEAPTUPLE_LIVE: > case HEAPTUPLE_INSERT_IN_PROGRESS: > - > - /* > - * If we wanted to optimize for aborts, we might consider > - * marking the page prunable when we see INSERT_IN_PROGRESS. > - * But we don't. See related decisions about when to mark the > - * page prunable in heapam.c. > - */ > break; > > default: > @@ -981,7 +1069,18 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, > * RECENTLY_DEAD tuples just in case there's a DEAD one after them; > * but we can't advance past anything else. We have to make sure that > * we don't miss any DEAD tuples, since DEAD tuples that still have > - * tuple storage after pruning will confuse VACUUM. > + * tuple storage after pruning will confuse VACUUM (FIXME: not anymore > + * I think?). Meaning, it won't confuse vacuum anymore or there won't be DEAD tuples with storage after pruning anymore? 
> + * > + * FIXME: Not a new issue, but spotted it now : If there is a chain > + * like RECENTLY_DEAD -> DEAD, we will remove both tuples, but will > + * not call HeapTupleHeaderAdvanceConflictHorizon() for the > + * RECENTLY_DEAD tuple. Is that OK? I think it is. In a HOT chain, > + * we know that the later tuple committed before any earlier tuples in > + * the chain, therefore it ought to be enough to set the conflict > + * horizon based on the later tuple. If all snapshots on the standby > + * see the deleter of the last tuple as committed, they must consider > + * all the earlier ones as committed too. > */ This makes sense to me. (that if all the snapshots on the standby see the deleter of the last tuple as committed, then they consider the earlier ones deleted too). Probably wouldn't hurt to call this out here though. > @@ -1206,15 +1318,6 @@ heap_prune_record_live_or_recently_dead(Page page, PruneState *prstate, OffsetNu > * can't run inside a transaction block, which makes some cases impossible > * (e.g. in-progress insert from the same transaction). > * > - * We treat LP_DEAD items (which are the closest thing to DEAD tuples that > - * might be seen here) differently, too: we assume that they'll become > - * LP_UNUSED before VACUUM finishes. This difference is only superficial. > - * VACUUM effectively agrees with ANALYZE about DEAD items, in the end. > - * VACUUM won't remember LP_DEAD items, but only because they're not > - * supposed to be left behind when it is done. (Cases where we bypass > - * index vacuuming will violate this optimistic assumption, but the > - * overall impact of that should be negligible.) > - * > * HEAPTUPLE_LIVE tuples are naturally counted as live. This is also what > * acquire_sample_rows() does. > * > @@ -1224,10 +1327,21 @@ heap_prune_record_live_or_recently_dead(Page page, PruneState *prstate, OffsetNu > * ensure the math works out. 
The assumption that the transaction will > * commit and update the counters after we report is a bit shaky; but it > * is what acquire_sample_rows() does, so we do the same to be consistent. > + * > + * HEAPTUPLE_DEAD are handled by the other heap_prune_record_*() > + * subroutines. They don't count dead items like acquire_sample_rows() > + * does, because we assume that all dead items will become LP_UNUSED > + * before VACUUM finishes. This difference is only superficial. VACUUM > + * effectively agrees with ANALYZE about DEAD items, in the end. VACUUM > + * won't remember LP_DEAD items, but only because they're not supposed to > + * be left behind when it is done. (Cases where we bypass index vacuuming > + * will violate this optimistic assumption, but the overall impact of that > + * should be negligible.) FIXME: I don't understand that last sentence in > + * parens. It was copied from elsewhere. If we bypass index vacuuming, there will be LP_DEAD items left behind when we are done because we didn't do index vacuuming and then reaping of those dead items. All of these comments are kind of a copypasta, though. > */ > htup = (HeapTupleHeader) PageGetItem(page, PageGetItemId(page, offnum)); > > - switch (status) Okay, that's all for now. I'll do more in-depth review tomorrow. - Melanie
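As a toy illustration of the visit-order point discussed above (this is not the actual pruneheap.c code; the names and the hard-coded two-item "page" are invented for the example): a chain RECENTLY_DEAD -> DEAD loses both tuples if pruning reaches the root first, but only the DEAD tuple if pruning reaches the chain's end first, because the later visit to the root stops at the already-'marked' successor.

```c
#include <assert.h>
#include <stdbool.h>

enum htsv { LIVE, RECENTLY_DEAD, DEAD };

/*
 * Toy model of a two-member HOT chain where item 0 (RECENTLY_DEAD)
 * points to item 1 (DEAD).  Returns how many items get removed when
 * the items are visited in the given order.
 */
int
nremoved(const int order[2])
{
    enum htsv   status[2] = {RECENTLY_DEAD, DEAD};
    bool        marked[2] = {false, false};
    int         removed = 0;

    for (int i = 0; i < 2; i++)
    {
        int         item = order[i];

        if (marked[item])
            continue;           /* already handled as part of a chain */

        if (status[item] == DEAD)
        {
            /* DEAD tuple visited directly: remove just this one */
            marked[item] = true;
            removed++;
        }
        else if (!marked[item + 1])
        {
            /* chain following from the root finds the DEAD successor */
            marked[item] = marked[item + 1] = true;
            removed += 2;
        }
        else
        {
            /* successor already 'marked': record the root unchanged */
            marked[item] = true;
        }
    }
    return removed;
}
```

Both outcomes are correct, as the thread notes; in the second case the RECENTLY_DEAD tuple is simply left for a later VACUUM.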
On 28/03/2024 03:53, Melanie Plageman wrote: > On Thu, Mar 28, 2024 at 01:04:04AM +0200, Heikki Linnakangas wrote: >> One change with this is that live_tuples and many of the other fields are >> now again updated, even if the caller doesn't need them. It was hard to skip >> them in a way that would save any cycles, with the other refactorings. > > I am worried we are writing checks that are going to have to come out of > SELECT queries' bank accounts, but I'll do some benchmarking when we're > all done with major refactoring. Sounds good, thanks. >> + * >> + * Whether we arrive at the dead HOT tuple first here or while >> + * following a chain below affects whether preceding RECENTLY_DEAD >> + * tuples in the chain can be removed or not. Imagine that you >> + * have a chain with two tuples: RECENTLY_DEAD -> DEAD. If we >> + * reach the RECENTLY_DEAD tuple first, the chain-following logic >> + * will find the DEAD tuple and conclude that both tuples are in >> + * fact dead and can be removed. But if we reach the DEAD tuple >> + * at the end of the chain first, when we reach the RECENTLY_DEAD >> + * tuple later, we will not follow the chain because the DEAD >> + * TUPLE is already 'marked', and will not remove the >> + * RECENTLY_DEAD tuple. This is not a correctness issue, and the >> + * RECENTLY_DEAD tuple will be removed by a later VACUUM. >> */ >> if (!HeapTupleHeaderIsHotUpdated(htup)) > > Is this intentional? Like would it be correct to remove the > RECENTLY_DEAD tuple during the current vacuum? Yes, it would be correct. And if we happen to visit the items in different order, the RECENTLY_DEAD tuple first, it will get removed. (This is just based on me reading the code, I'm not sure if I've missed something. Would be nice to construct a test case for that and step through it with a debugger to really see what happens. 
But this is not a new issue, doesn't need to be part of this patch) >> From c2ee7508456d0e76de985f9997a6840450e342a8 Mon Sep 17 00:00:00 2001 >> From: Heikki Linnakangas <heikki.linnakangas@iki.fi> >> Date: Thu, 28 Mar 2024 00:45:26 +0200 >> Subject: [PATCH v8 22/22] WIP >> >> - Got rid of all_visible_except_removable again. We're back to the >> roughly the same mechanism as on 'master', where the all_visible >> doesn't include LP_DEAD items, but at the end of >> heap_page_prune_and_freeze() when we return to the caller, we clear >> it if there were any LP_DEAD items. I considered calling the >> variable 'all_visible_except_lp_dead', which would be more accurate, >> but it's also very long. > > not longer than all_visible_except_removable. I would be happy to keep > it more exact, but I'm also okay with just all_visible. Ok, let's make it 'all_visible_except_lp_dead' for clarity. > What's this "cutoffs TODO"? > >> + * cutoffs TODO >> + * >> * presult contains output parameters needed by callers such as the number of >> * tuples removed and the number of line pointers newly marked LP_DEAD. >> * heap_page_prune_and_freeze() is responsible for initializing it. All the other arguments are documented in the comment, except 'cutoffs'. >> @@ -728,10 +832,12 @@ htsv_get_valid_status(int status) >> * DEAD, our visibility test is just too coarse to detect it. >> * >> * In general, pruning must never leave behind a DEAD tuple that still has >> - * tuple storage. VACUUM isn't prepared to deal with that case. That's why >> + * tuple storage. VACUUM isn't prepared to deal with that case (FIXME: it no longer cares, right?). >> + * That's why >> * VACUUM prunes the same heap page a second time (without dropping its lock >> * in the interim) when it sees a newly DEAD tuple that we initially saw as >> - * in-progress. Retrying pruning like this can only happen when an inserting >> + * in-progress (FIXME: Really? Does it still do that?). > > Yea, I think all that is no longer true. 
I missed this comment back > then. Committed a patch to remove it. >> @@ -981,7 +1069,18 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, >> * RECENTLY_DEAD tuples just in case there's a DEAD one after them; >> * but we can't advance past anything else. We have to make sure that >> * we don't miss any DEAD tuples, since DEAD tuples that still have >> - * tuple storage after pruning will confuse VACUUM. >> + * tuple storage after pruning will confuse VACUUM (FIXME: not anymore >> + * I think?). > > Meaning, it won't confuse vacuum anymore or there won't be DEAD tuples > with storage after pruning anymore? I meant that it won't confuse VACUUM anymore. lazy_scan_prune() doesn't loop through the items on the page checking their visibility anymore. Hmm, one confusion remains though: In the 2nd phase of vacuum, we remove all the dead line pointers that we have now removed from the indexes. When we do that, we assume them all to be dead line pointers, without storage, rather than normal tuples that happen to be HEAPTUPLE_DEAD. So it's important that if pruning would leave behind HEAPTUPLE_DEAD tuples, they are not included in 'deadoffsets'. In any case, let's just make sure that pruning doesn't leave HEAPTUPLE_DEAD tuples. There's no reason it should. >> @@ -1224,10 +1327,21 @@ heap_prune_record_live_or_recently_dead(Page page, PruneState *prstate, OffsetNu >> * ensure the math works out. The assumption that the transaction will >> * commit and update the counters after we report is a bit shaky; but it >> * is what acquire_sample_rows() does, so we do the same to be consistent. >> + * >> + * HEAPTUPLE_DEAD are handled by the other heap_prune_record_*() >> + * subroutines. They don't count dead items like acquire_sample_rows() >> + * does, because we assume that all dead items will become LP_UNUSED >> + * before VACUUM finishes. This difference is only superficial. VACUUM >> + * effectively agrees with ANALYZE about DEAD items, in the end. 
VACUUM
>> + * won't remember LP_DEAD items, but only because they're not supposed to
>> + * be left behind when it is done. (Cases where we bypass index vacuuming
>> + * will violate this optimistic assumption, but the overall impact of that
>> + * should be negligible.) FIXME: I don't understand that last sentence in
>> + * parens. It was copied from elsewhere.
>
> If we bypass index vacuuming, there will be LP_DEAD items left behind
> when we are done because we didn't do index vacuuming and then reaping
> of those dead items. All of these comments are kind of a copypasta,
> though.

Ah, gotcha, makes sense now. I didn't remember that we sometimes bypass
index vacuuming if there are very few dead items. I thought this talked
about the case that there are no indexes, but that case is OK.

--
Heikki Linnakangas
Neon (https://neon.tech)
On Thu, Mar 28, 2024 at 4:49 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 28/03/2024 03:53, Melanie Plageman wrote: > > On Thu, Mar 28, 2024 at 01:04:04AM +0200, Heikki Linnakangas wrote: > >> One change with this is that live_tuples and many of the other fields are > >> now again updated, even if the caller doesn't need them. It was hard to skip > >> them in a way that would save any cycles, with the other refactorings. > > > > I am worried we are writing checks that are going to have to come out of > > SELECT queries' bank accounts, but I'll do some benchmarking when we're > > all done with major refactoring. > > Sounds good, thanks. Actually, after having looked at it again, there really are only a few more increments of various counters, the memory consumed by revisit, and the additional setting of offsets in marked. I think a few carefully constructed queries which do on-access pruning could test the impact of this (as opposed to a bigger benchmarking endeavor). I also wonder if there would be any actual impact of marking the various record_*() routines inline. > >> @@ -728,10 +832,12 @@ htsv_get_valid_status(int status) > >> * DEAD, our visibility test is just too coarse to detect it. > >> * > >> * In general, pruning must never leave behind a DEAD tuple that still has > >> - * tuple storage. VACUUM isn't prepared to deal with that case. That's why > >> + * tuple storage. VACUUM isn't prepared to deal with that case (FIXME: it no longer cares, right?). > >> + * That's why > >> * VACUUM prunes the same heap page a second time (without dropping its lock > >> * in the interim) when it sees a newly DEAD tuple that we initially saw as > >> - * in-progress. Retrying pruning like this can only happen when an inserting > >> + * in-progress (FIXME: Really? Does it still do that?). > > > > Yea, I think all that is no longer true. I missed this comment back > > then. > > Committed a patch to remove it. 
> > >> @@ -981,7 +1069,18 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, > >> * RECENTLY_DEAD tuples just in case there's a DEAD one after them; > >> * but we can't advance past anything else. We have to make sure that > >> * we don't miss any DEAD tuples, since DEAD tuples that still have > >> - * tuple storage after pruning will confuse VACUUM. > >> + * tuple storage after pruning will confuse VACUUM (FIXME: not anymore > >> + * I think?). > > > > Meaning, it won't confuse vacuum anymore or there won't be DEAD tuples > > with storage after pruning anymore? > > I meant that it won't confuse VACUUM anymore. lazy_scan_prune() doesn't > loop through the items on the page checking their visibility anymore. > > Hmm, one confusion remains though: In the 2nd phase of vacuum, we remove > all the dead line pointers that we have now removed from the indexes. > When we do that, we assume them all to be dead line pointers, without > storage, rather than normal tuples that happen to be HEAPTUPLE_DEAD. So > it's important that if pruning would leave behind HEAPTUPLE_DEAD tuples, > they are not included in 'deadoffsets'. It seems like master only adds items it is marking LP_DEAD to deadoffsets. And things marked LP_DEAD have lp_len set to 0. > In any case, let's just make sure that pruning doesn't leave > HEAPTUPLE_DEAD tuples. There's no reason it should. Maybe worth adding an assert to static void heap_prune_record_unchanged_lp_dead(ItemId itemid, PruneState *prstate, OffsetNumber offnum) { ... Assert(!ItemIdHasStorage(itemid)); prstate->deadoffsets[prstate->lpdead_items++] = offnum; } By the way, I wasn't quite sure about the purpose of heap_prune_record_unchanged_lp_dead(). It is called in heap_prune_chain() in a place where we didn't add things to deadoffsets before, no? /* * Likewise, a dead line pointer can't be part of the chain. (We * already eliminated the case of dead root tuple outside this * function.) 
*/ if (ItemIdIsDead(lp)) { /* * If the caller set PRUNE_DO_MARK_UNUSED_NOW, we can set dead * line pointers LP_UNUSED now. */ if (unlikely(prstate->actions & PRUNE_DO_MARK_UNUSED_NOW)) heap_prune_record_unused(prstate, offnum, false); else heap_prune_record_unchanged_lp_dead(lp, prstate, offnum); break; } > >> @@ -1224,10 +1327,21 @@ heap_prune_record_live_or_recently_dead(Page page, PruneState *prstate, OffsetNu > >> * ensure the math works out. The assumption that the transaction will > >> * commit and update the counters after we report is a bit shaky; but it > >> * is what acquire_sample_rows() does, so we do the same to be consistent. > >> + * > >> + * HEAPTUPLE_DEAD are handled by the other heap_prune_record_*() > >> + * subroutines. They don't count dead items like acquire_sample_rows() > >> + * does, because we assume that all dead items will become LP_UNUSED > >> + * before VACUUM finishes. This difference is only superficial. VACUUM > >> + * effectively agrees with ANALYZE about DEAD items, in the end. VACUUM > >> + * won't remember LP_DEAD items, but only because they're not supposed to > >> + * be left behind when it is done. (Cases where we bypass index vacuuming > >> + * will violate this optimistic assumption, but the overall impact of that > >> + * should be negligible.) FIXME: I don't understand that last sentence in > >> + * parens. It was copied from elsewhere. > > > > If we bypass index vacuuming, there will be LP_DEAD items left behind > > when we are done because we didn't do index vacuuming and then reaping > > of those dead items. All of these comments are kind of a copypasta, > > though. > > Ah, gotcha, makes sense now. I didn't remember that we sometimes by pass > index vacuuming if there are very few dead items. I thought this talked > about the case that there are no indexes, but that case is OK. These comments could use another pass. 
I had added some extra (probably redundant) content when I thought I was refactoring it a certain way and then changed my mind. Attached is a diff with some ideas I had for a bit of code simplification. Are you working on cleaning this patch up or should I pick it up? I wonder if it makes sense to move some of the changes to heap_prune_chain() with setting things in marked/revisited to the start of the patch set (to be committed first). - Melanie
Attachment
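The lp_len observation above (that items marked LP_DEAD have lp_len set to 0, which is what ItemIdHasStorage() tests) can be sketched with a simplified model. This is a hypothetical illustration, not PostgreSQL's real ItemIdData bitfields; the Toy* names are invented.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified line pointer, loosely modeled on PostgreSQL's ItemIdData */
typedef struct
{
    unsigned    lp_off;         /* offset of tuple data, 0 if none */
    unsigned    lp_flags;       /* one of the LP_* states below */
    unsigned    lp_len;         /* length of tuple data, 0 if none */
} ToyItemId;

enum { LP_UNUSED, LP_NORMAL, LP_REDIRECT, LP_DEAD };

/* mirrors the idea behind ItemIdHasStorage(): storage iff lp_len != 0 */
bool
toy_has_storage(const ToyItemId *lp)
{
    return lp->lp_len != 0;
}

/*
 * What marking an item LP_DEAD does in this model: truncate it to a
 * stub line pointer with no storage, so the second vacuum phase can
 * treat every deadoffsets entry as storage-free.
 */
void
toy_set_dead(ToyItemId *lp)
{
    lp->lp_off = 0;
    lp->lp_flags = LP_DEAD;
    lp->lp_len = 0;
}
```

Under this model, the suggested Assert(!ItemIdHasStorage(itemid)) in heap_prune_record_unchanged_lp_dead() would hold for any item that went through the LP_DEAD path.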
On Thu, Mar 28, 2024 at 11:07:10AM -0400, Melanie Plageman wrote: > These comments could use another pass. I had added some extra > (probably redundant) content when I thought I was refactoring it a > certain way and then changed my mind. > > Attached is a diff with some ideas I had for a bit of code simplification. > > Are you working on cleaning this patch up or should I pick it up? Attached v9 is rebased over master. But, more importantly, I took another pass at heap_prune_chain() and am pretty happy with what I came up with. See 0021. I simplified the traversal logic and then grouped the chain processing into three branches at the end. I find it much easier to understand what we are doing for different types of HOT chains. I got rid of revisited. We can put it back, but I was thinking: we stash all HOT tuples and then loop over them later, calling record_unchanged() on the ones that aren't marked. But, if we have a lot of HOT tuples, is this really that much better than just looping through all the offsets and calling record_unchanged() on just the ones that aren't marked? I've done that in my version. While testing this, I found that only on-access pruning needed this final loop calling record_unchanged() on items not yet marked. I know we can't skip this final loop entirely in the ON ACCESS case because it calls record_prunable(), but we could consider moving that back out into heap_prune_chain()? Or what do you think? I haven't finished updating all the comments, but I am really interested to know what you think about heap_prune_chain() now. Note that patches 0001-0020 are still the same as before. Only 0021 is the new changes I made (they are built on top of your v8 0022). Tomorrow I will start first thing figuring out how to break this down into parts that can apply on master and then rebase the rest of the patches on top of it. - Melanie
Attachment
- v9-0001-lazy_scan_prune-tests-tuple-vis-with-GlobalVisTes.patch
- v9-0002-Pass-heap_prune_chain-PruneResult-output-paramete.patch
- v9-0003-Rename-PruneState-snapshotConflictHorizon-to-late.patch
- v9-0004-heap_page_prune-sets-all_visible-and-visibility_c.patch
- v9-0005-Add-reference-to-VacuumCutoffs-in-HeapPageFreeze.patch
- v9-0006-Prepare-freeze-tuples-in-heap_page_prune.patch
- v9-0007-lazy_scan_prune-reorder-freeze-execution-logic.patch
- v9-0008-Execute-freezing-in-heap_page_prune.patch
- v9-0009-Make-opp-freeze-heuristic-compatible-with-prune-f.patch
- v9-0010-Separate-tuple-pre-freeze-checks-and-invoke-earli.patch
- v9-0011-Remove-heap_freeze_execute_prepared.patch
- v9-0012-Merge-prune-and-freeze-records.patch
- v9-0013-Set-hastup-in-heap_page_prune.patch
- v9-0014-Count-tuples-for-vacuum-logging-in-heap_page_prun.patch
- v9-0015-Save-dead-tuple-offsets-during-heap_page_prune.patch
- v9-0016-move-live-tuple-accounting-to-heap_prune_chain.patch
- v9-0017-Move-frozen-array-to-PruneState.patch
- v9-0018-Cosmetic-fixes.patch
- v9-0019-Almost-cosmetic-fixes.patch
- v9-0020-Move-frz_conflict_horizon-to-tighter-scope.patch
- v9-0021-WIP-refactor.patch
On 29/03/2024 07:04, Melanie Plageman wrote:
> On Thu, Mar 28, 2024 at 11:07:10AM -0400, Melanie Plageman wrote:
>> These comments could use another pass. I had added some extra
>> (probably redundant) content when I thought I was refactoring it a
>> certain way and then changed my mind.
>>
>> Attached is a diff with some ideas I had for a bit of code simplification.
>>
>> Are you working on cleaning this patch up or should I pick it up?
>
> Attached v9 is rebased over master. But, more importantly, I took
> another pass at heap_prune_chain() and am pretty happy with what I came
> up with. See 0021. I simplified the traversal logic and then grouped the
> chain processing into three branches at the end. I find it much easier
> to understand what we are doing for different types of HOT chains.

Ah yes, agreed, that's nicer.

The 'survivor' variable is a little confusing, especially here:

if (!survivor)
{
    int i;

    /*
     * If no DEAD tuple was found, and the root is redirected, mark it as
     * such.
     */
    if ((i = ItemIdIsRedirected(rootlp)))
        heap_prune_record_unchanged_lp_redirect(prstate, rootoffnum);

    /* the rest of tuples in the chain are normal, unchanged tuples */
    for (; i < nchain; i++)
        heap_prune_record_unchanged(dp, prstate, chainitems[i]);
}

You would think that "if(!survivor)" means that there are no live
tuples on the chain, i.e. no survivors. But in fact it's the opposite;
it means that they are all live. Maybe call it 'ndeadchain' instead,
meaning the number of dead items in the chain.

> I got rid of revisited. We can put it back, but I was thinking: we stash
> all HOT tuples and then loop over them later, calling record_unchanged()
> on the ones that aren't marked. But, if we have a lot of HOT tuples, is
> this really that much better than just looping through all the offsets
> and calling record_unchanged() on just the ones that aren't marked?
Well, it requires looping through all the offsets one more time, and unless you have a lot of HOT tuples, most items would be marked already. But maybe the overhead is negligible anyway. > I've done that in my version. While testing this, I found that only > on-access pruning needed this final loop calling record_unchanged() on > items not yet marked. I know we can't skip this final loop entirely in > the ON ACCESS case because it calls record_prunable(), but we could > consider moving that back out into heap_prune_chain()? Or what do you > think? Hmm, why is that different with on-access pruning? Here's another idea: In the first loop through the offsets, where we gather the HTSV status of each item, also collect the offsets of all HOT and non-HOT items to two separate arrays. Call heap_prune_chain() for all the non-HOT items first, and then process any remaining HOT tuples that haven't been marked yet. > I haven't finished updating all the comments, but I am really interested > to know what you think about heap_prune_chain() now. Looks much better now, thanks! -- Heikki Linnakangas Neon (https://neon.tech)
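A minimal sketch of the two-array idea above, with invented names (the real first loop in heap_page_prune_and_freeze() also collects each item's HTSV status, which is omitted here): partition the item offsets into non-HOT and HOT arrays in a single pass, so chain processing can run over the non-HOT roots first and then sweep up any HOT items left unmarked.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned short OffsetNumber;

/*
 * Split the page's item offsets (1-based, as in PostgreSQL) into two
 * arrays: heap-only tuples go to 'hot', everything else to 'nonhot'.
 * heap_only[] is indexed from 0, one entry per item on the page.
 */
void
partition_offsets(const bool *heap_only, int nitems,
                  OffsetNumber *nonhot, int *nnonhot,
                  OffsetNumber *hot, int *nhot)
{
    *nnonhot = *nhot = 0;
    for (int i = 0; i < nitems; i++)
    {
        OffsetNumber off = (OffsetNumber) (i + 1);

        if (heap_only[i])
            hot[(*nhot)++] = off;
        else
            nonhot[(*nnonhot)++] = off;
    }
}
```

The caller would then invoke the chain-processing routine for each entry of nonhot[] first, and afterwards record-unchanged any hot[] entries that no chain reached, instead of rescanning every offset on the page.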
On Fri, Mar 29, 2024 at 12:00 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 29/03/2024 07:04, Melanie Plageman wrote: > > On Thu, Mar 28, 2024 at 11:07:10AM -0400, Melanie Plageman wrote: > >> These comments could use another pass. I had added some extra > >> (probably redundant) content when I thought I was refactoring it a > >> certain way and then changed my mind. > >> > >> Attached is a diff with some ideas I had for a bit of code simplification. > >> > >> Are you working on cleaning this patch up or should I pick it up? > > > > Attached v9 is rebased over master. But, more importantly, I took > > another pass at heap_prune_chain() and am pretty happy with what I came > > up with. See 0021. I simplified the traversal logic and then grouped the > > chain processing into three branches at the end. I find it much easier > > to understand what we are doing for different types of HOT chains. > > Ah yes, agreed, that's nicer. > > The 'survivor' variable is a little confusing, especially here: > > if (!survivor) > { > int i; > > /* > * If no DEAD tuple was found, and the root is redirected, mark it as > * such. > */ > if ((i = ItemIdIsRedirected(rootlp))) > heap_prune_record_unchanged_lp_redirect(prstate, rootoffnum); > > /* the rest of tuples in the chain are normal, unchanged tuples */ > for (; i < nchain; i++) > heap_prune_record_unchanged(dp, prstate, chainitems[i]); > } > > You would think that "if(!survivor)", it means that there is no live > tuples on the chain, i.e. no survivors. But in fact it's the opposite; > it means that the are all live. Maybe call it 'ndeadchain' instead, > meaning the number of dead items in the chain. Makes sense. > > I've done that in my version. While testing this, I found that only > > on-access pruning needed this final loop calling record_unchanged() on > > items not yet marked. 
I know we can't skip this final loop entirely in > > the ON ACCESS case because it calls record_prunable(), but we could > > consider moving that back out into heap_prune_chain()? Or what do you > > think? > > Hmm, why is that different with on-access pruning? Well, it is highly possible we just don't hit cases like this with vacuum in our test suite (not that it is unreachable by vacuum). It's just very easy to get in this situation with on-access pruning. Imagine an UPDATE which caused the following chain: RECENTLY_DEAD -> DELETE_IN_PROGRESS -> INSERT_IN_PROGRESS It invokes heap_page_prune_and_freeze() (assume the page meets the criteria for on-access pruning) and eventually enters heap_prune_chain() with the first offset in this chain. The first item is LP_NORMAL and the tuple is RECENTLY_DEAD, so the survivor variable stays 0 and we record_unchanged() for that tuple and return. The next two items are LP_NORMAL and the tuples are HOT tuples, so we just return from the "fast path" at the top of heap_prune_chain(). After invoking heap_prune_chain() for all of them, the first offset is marked but the other two are not. Thus, we end up having to record_unchanged() later. This kind of thing is a super common case that we see all the time in queries in the regression test suite. > Here's another idea: In the first loop through the offsets, where we > gather the HTSV status of each item, also collect the offsets of all HOT > and non-HOT items to two separate arrays. Call heap_prune_chain() for > all the non-HOT items first, and then process any remaining HOT tuples > that haven't been marked yet. That's an interesting idea. I'll try it out and see how it works. > > I haven't finished updating all the comments, but I am really interested > > to know what you think about heap_prune_chain() now. > > Looks much better now, thanks! I am currently doing chain traversal refactoring in heap_prune_chain() on top of master as the first patch in the set. - Melanie
On Fri, Mar 29, 2024 at 12:32:21PM -0400, Melanie Plageman wrote: > On Fri, Mar 29, 2024 at 12:00 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > > > On 29/03/2024 07:04, Melanie Plageman wrote: > > > On Thu, Mar 28, 2024 at 11:07:10AM -0400, Melanie Plageman wrote: > > >> These comments could use another pass. I had added some extra > > >> (probably redundant) content when I thought I was refactoring it a > > >> certain way and then changed my mind. > > >> > > >> Attached is a diff with some ideas I had for a bit of code simplification. > > >> > > >> Are you working on cleaning this patch up or should I pick it up? > > > > > > Attached v9 is rebased over master. But, more importantly, I took > > > another pass at heap_prune_chain() and am pretty happy with what I came > > > up with. See 0021. I simplified the traversal logic and then grouped the > > > chain processing into three branches at the end. I find it much easier > > > to understand what we are doing for different types of HOT chains. > > > > Ah yes, agreed, that's nicer. > > > > The 'survivor' variable is a little confusing, especially here: > > > > if (!survivor) > > { > > int i; > > > > /* > > * If no DEAD tuple was found, and the root is redirected, mark it as > > * such. > > */ > > if ((i = ItemIdIsRedirected(rootlp))) > > heap_prune_record_unchanged_lp_redirect(prstate, rootoffnum); > > > > /* the rest of tuples in the chain are normal, unchanged tuples */ > > for (; i < nchain; i++) > > heap_prune_record_unchanged(dp, prstate, chainitems[i]); > > } > > > > You would think that "if(!survivor)", it means that there is no live > > tuples on the chain, i.e. no survivors. But in fact it's the opposite; > > it means that the are all live. Maybe call it 'ndeadchain' instead, > > meaning the number of dead items in the chain. > > Makes sense. I've done this in attached v10. 
> > Here's another idea: In the first loop through the offsets, where we > > gather the HTSV status of each item, also collect the offsets of all HOT > > and non-HOT items to two separate arrays. Call heap_prune_chain() for > > all the non-HOT items first, and then process any remaining HOT tuples > > that haven't been marked yet. > > That's an interesting idea. I'll try it out and see how it works. Attached v10 implements this method of dividing tuples into HOT and non-HOT and processing the potential HOT chains first then processing tuples not marked by calling heap_prune_chain(). I have applied the refactoring of heap_prune_chain() to master and then built the other patches on top of that. I discovered while writing this that LP_DEAD item offsets must be in order in the deadoffsets array (the one that is used to populate LVRelState->dead_items). When I changed heap_page_prune_and_freeze() to partition the offsets into HOT and non-HOT during the first loop through the item pointers array (where we get tuple visibility information), we add dead item offsets as they are encountered. So, they are no longer in order. I've added a quicksort of the deadoffsets array to satisfy vacuum. I think that we are actually successfully removing more RECENTLY_DEAD HOT tuples than in master with heap_page_prune()'s new approach, and I think it is correct; but let me know if I am missing something. The early patches in the set include some additional comment cleanup as well. 0001 is fairly polished. 0004 could use some variable renaming (this patch partitions the tuples into HOT and not HOT and then processes them). I was struggling with some of the names here (chainmembers and chaincandidates is confusing). The bulk of the combining of pruning and freezing is lumped into 0010. 
I had planned to separate 0010 into 4 separate patches: 1 to execute freezing in heap_prune_chain(), 1 for the freeze heuristic approximating what is on master, and 1 for emitting a single record containing both the pruning and freezing page modifications. I ended up not doing this because I felt like the grouping of changes in 0007-0009 is off. As long as I still execute freezing in lazy_scan_prune(), I have to share lots of state between lazy_scan_prune() and heap_page_prune(). This meant I added a lot of parameters to heap_page_prune() that later commits removed -- making the later patches noisy and not so easy to understand. I'm actually not sure what should go in what commit (either for review clarity or for the actual final version). But, I think we should probably focus on review of the code and not as much how it is split up yet. The final state of the code could definitely use more cleanup. I've been staring at it for awhile, so I could use some thoughts/ideas about what part to focus on improving. - Melanie
Attachment
- v10-0001-Refactor-heap_prune_chain.patch
- v10-0002-heap_prune_chain-rename-dp-page.patch
- v10-0003-Mark-all-line-pointers-during-pruning.patch
- v10-0004-Handle-non-chain-tuples-outside-of-heap_prune_ch.patch
- v10-0005-Invoke-heap_prune_record_prunable-during-record-.patch
- v10-0006-Introduce-PRUNE_DO_-actions.patch
- v10-0007-Prepare-freeze-tuples-in-heap_page_prune.patch
- v10-0008-Set-hastup-in-heap_page_prune.patch
- v10-0009-Save-dead-tuple-offsets-during-heap_page_prune.patch
- v10-0010-Combine-freezing-and-pruning.patch
On Sat, Mar 30, 2024 at 1:57 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > I think that we are actually successfully removing more RECENTLY_DEAD > HOT tuples than in master with heap_page_prune()'s new approach, and I > think it is correct; but let me know if I am missing something. /me blinks. Isn't zero the only correct number of RECENTLY_DEAD tuples to remove? -- Robert Haas EDB: http://www.enterprisedb.com
On Sat, Mar 30, 2024 at 8:00 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Sat, Mar 30, 2024 at 1:57 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > I think that we are actually successfully removing more RECENTLY_DEAD > > HOT tuples than in master with heap_page_prune()'s new approach, and I > > think it is correct; but let me know if I am missing something. > > /me blinks. > > Isn't zero the only correct number of RECENTLY_DEAD tuples to remove? At the top of the comment for heap_prune_chain() in master, it says * If the item is an index-referenced tuple (i.e. not a heap-only tuple), * the HOT chain is pruned by removing all DEAD tuples at the start of the HOT * chain. We also prune any RECENTLY_DEAD tuples preceding a DEAD tuple. * This is OK because a RECENTLY_DEAD tuple preceding a DEAD tuple is really * DEAD, our visibility test is just too coarse to detect it. Heikki had added a comment in one of his patches to the fast path for HOT tuples at the top of heap_prune_chain(): * Note that we might first arrive at a dead heap-only tuple * either while following a chain or here (in the fast path). Whichever path * gets there first will mark the tuple unused. * * Whether we arrive at the dead HOT tuple first here or while * following a chain above affects whether preceding RECENTLY_DEAD * tuples in the chain can be removed or not. Imagine that you * have a chain with two tuples: RECENTLY_DEAD -> DEAD. If we * reach the RECENTLY_DEAD tuple first, the chain-following logic * will find the DEAD tuple and conclude that both tuples are in * fact dead and can be removed. But if we reach the DEAD tuple * at the end of the chain first, when we reach the RECENTLY_DEAD * tuple later, we will not follow the chain because the DEAD * TUPLE is already 'marked', and will not remove the * RECENTLY_DEAD tuple. This is not a correctness issue, and the * RECENTLY_DEAD tuple will be removed by a later VACUUM. 
My patch splits the tuples into HOT and non-HOT while gathering their visibility information, first calls heap_prune_chain() on the non-HOT tuples, and then processes the still-unmarked HOT tuples in a separate loop afterward. This follows all of the chains and processes them completely, and also processes any HOT tuples which may not be reachable from a valid chain. The fast path contains a special check to ensure that line pointers for DEAD heap-only tuples that were not HOT-updated (dead orphaned tuples from aborted HOT updates) are still marked LP_UNUSED even though they are not reachable from a valid HOT chain. By doing this later, we don't preclude ourselves from following all chains. - Melanie
On 30/03/2024 07:57, Melanie Plageman wrote: > On Fri, Mar 29, 2024 at 12:32:21PM -0400, Melanie Plageman wrote: >> On Fri, Mar 29, 2024 at 12:00 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >>> Here's another idea: In the first loop through the offsets, where we >>> gather the HTSV status of each item, also collect the offsets of all HOT >>> and non-HOT items to two separate arrays. Call heap_prune_chain() for >>> all the non-HOT items first, and then process any remaining HOT tuples >>> that haven't been marked yet. >> >> That's an interesting idea. I'll try it out and see how it works. > > Attached v10 implements this method of dividing tuples into HOT and > non-HOT and processing the potential HOT chains first then processing > tuples not marked by calling heap_prune_chain(). > > I have applied the refactoring of heap_prune_chain() to master and then > built the other patches on top of that. Committed some of the changes. Continuing to review the rest. > I discovered while writing this that LP_DEAD item offsets must be in > order in the deadoffsets array (the one that is used to populate > LVRelState->dead_items). > > When I changed heap_page_prune_and_freeze() to partition the offsets > into HOT and non-HOT during the first loop through the item pointers > array (where we get tuple visibility information), we add dead item > offsets as they are encountered. So, they are no longer in order. I've > added a quicksort of the deadoffsets array to satisfy vacuum. Good catch. > I think that we are actually successfully removing more RECENTLY_DEAD > HOT tuples than in master with heap_page_prune()'s new approach, and I > think it is correct; but let me know if I am missing something. Yep. In the case of a two-item chain, RECENTLY_DEAD -> DEAD, the new code can always remove both items. On 'master', it depends on which item it happens to process first. If it processes the RECENTLY_DEAD item first, then it follows the chain and removes both. 
But if it processes the DEAD item first, the RECENTLY_DEAD item is left behind. It will be removed by the next VACUUM, so it's not a correctness issue, and probably doesn't make any practical performance difference either as it's a rare corner case, but I feel better that it's more deterministic now. > The early patches in the set include some additional comment cleanup as > well. 0001 is fairly polished. 0004 could use some variable renaming > (this patch partitions the tuples into HOT and not HOT and then > processes them). I was struggling with some of the names here > (chainmembers and chaincandidates is confusing). I didn't understand why you wanted to juggle both partitions in the same array. So I separated them into two arrays, and called them 'root_items' and 'heaponly_items'. In some micro-benchmarks, the order that the items were processed made a measurable difference. So I'm processing the items in the reverse order. That roughly matches the order the items are processed in master, as it iterates the offsets from high-to-low in the first loop, and low-to-high in the second loop. > The bulk of the combining of pruning and freezing is lumped into 0010. > > I had planned to separate 0010 into 4 separate patches: 1 to execute > freezing in heap_prune_chain(), 1 for the freeze heuristic approximating > what is on master, and 1 for emitting a single record containing both > the pruning and freezing page modifications. > > I ended up not doing this because I felt like the grouping of changes in > 0007-0009 is off. As long as I still execute freezing in > lazy_scan_prune(), I have to share lots of state between > lazy_scan_prune() and heap_page_prune(). This meant I added a lot of > parameters to heap_page_prune() that later commits removed -- making the > later patches noisy and not so easy to understand. > > I'm actually not sure what should go in what commit (either for review > clarity or for the actual final version). 
> > But, I think we should probably focus on review of the code and not as > much how it is split up yet. Yeah, that's fine, 0010 is manageable-sized now. > The final state of the code could definitely use more cleanup. I've been > staring at it for awhile, so I could use some thoughts/ideas about what > part to focus on improving. Committed some of the changes. I plan to commit at least the first of these remaining patches later today. I'm happy with it now, but I'll give it a final glance over after dinner. I'll continue to review the rest after that, but attached is what I have now. -- Heikki Linnakangas Neon (https://neon.tech)
Attachment
- v11-0001-Handle-non-chain-tuples-outside-of-heap_prune_ch.patch
- v11-0002-Invoke-heap_prune_record_prunable-during-record-.patch
- v11-0003-Introduce-PRUNE_DO_-actions.patch
- v11-0004-Prepare-freeze-tuples-in-heap_page_prune.patch
- v11-0005-Set-hastup-in-heap_page_prune.patch
- v11-0006-Save-dead-tuple-offsets-during-heap_page_prune.patch
- v11-0007-Combine-freezing-and-pruning.patch
On Mon, Apr 01, 2024 at 05:17:51PM +0300, Heikki Linnakangas wrote: > On 30/03/2024 07:57, Melanie Plageman wrote: > > On Fri, Mar 29, 2024 at 12:32:21PM -0400, Melanie Plageman wrote: > > > On Fri, Mar 29, 2024 at 12:00 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > > > Here's another idea: In the first loop through the offsets, where we > > > > gather the HTSV status of each item, also collect the offsets of all HOT > > > > and non-HOT items to two separate arrays. Call heap_prune_chain() for > > > > all the non-HOT items first, and then process any remaining HOT tuples > > > > that haven't been marked yet. > > > > > > That's an interesting idea. I'll try it out and see how it works. > > > > Attached v10 implements this method of dividing tuples into HOT and > > non-HOT and processing the potential HOT chains first then processing > > tuples not marked by calling heap_prune_chain(). > > > > I have applied the refactoring of heap_prune_chain() to master and then > > built the other patches on top of that. > > Committed some of the changes. Continuing to reviewing the rest. Thanks! > > The early patches in the set include some additional comment cleanup as > > well. 0001 is fairly polished. 0004 could use some variable renaming > > (this patch partitions the tuples into HOT and not HOT and then > > processes them). I was struggling with some of the names here > > (chainmembers and chaincandidates is confusing). > > I didn't understand why you wanted to juggle both partitions in the same > array. So I separated them into two arrays, and called them 'root_items' and > 'heaponly_items'. I thought it was worth it to save the space. And the algorithm for doing it seemed pretty straightforward. But looking at your patch, it is a lot easier to understand with two arrays (since, for example, they can each have a name). > In some micro-benchmarks, the order that the items were processed made a > measurable difference. So I'm processing the items in the reverse order. 
> That roughly matches the order the items are processed in master, as it > iterates the offsets from high-to-low in the first loop, and low-to-high in > the second loop. This makes sense. I noticed there isn't a comment about this above the loop. It might be worth it to mention it. Below is a review of only 0001. I'll look at the others shortly. > From a6ab891779876e7cc1b4fb6fddb09f52f0094646 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Mon, 1 Apr 2024 16:59:38 +0300 > Subject: [PATCH v11 1/7] Handle non-chain tuples outside of heap_prune_chain() > --- > src/backend/access/heap/pruneheap.c | 264 +++++++++++++++++----------- > 1 file changed, 166 insertions(+), 98 deletions(-) > > diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c > @@ -256,15 +270,16 @@ heap_page_prune(Relation relation, Buffer buffer, > tup.t_tableOid = RelationGetRelid(relation); > > /* > - * Determine HTSV for all tuples. > + * Determine HTSV for all tuples, and queue them up for processing as HOT > + * chain roots or as a heap-only items. Reading this comment now as a whole, I would add something like "Determining HTSV for all tuples once is required for correctness" to the start of the second paragraph. The new conjunction on the first paragraph sentence followed by the next paragraph is a bit confusing because it sounds like queuing them up for processing is required for correctness (which, I suppose it is on some level). Basically, I'm just saying that it is now less clear what is required for correctness. > * This is required for correctness to deal with cases where running HTSV > * twice could result in different results (e.g. RECENTLY_DEAD can turn to > * DEAD if another checked item causes GlobalVisTestIsRemovableFullXid() > * to update the horizon, INSERT_IN_PROGRESS can change to DEAD if the > - * inserting transaction aborts, ...). 
That in turn could cause > - * heap_prune_chain() to behave incorrectly if a tuple is reached twice, > - * once directly via a heap_prune_chain() and once following a HOT chain. > + * inserting transaction aborts, ...). VACUUM assumes that there are no > + * normal DEAD tuples left on the page after pruning, so it needs to have > + * the same understanding of what is DEAD and what is not. > * > * It's also good for performance. Most commonly tuples within a page are > * stored at decreasing offsets (while the items are stored at increasing > @@ -282,52 +297,140 @@ heap_page_prune(Relation relation, Buffer buffer, > + /* > + * Process any heap-only tuples that were not already processed as part of > + * a HOT chain. > + */ While I recognize this is a matter of style and not important, I personally prefer this for reverse looping: for (int i = prstate.nheaponly_items; i --> 0;) I do think a comment about the reverse order would be nice. I know it says something above the first loop to this effect: * Processing the items in reverse order (and thus the tuples in * increasing order) increases prefetching efficiency significantly / * decreases the number of cache misses. So perhaps we could just say "as above, process the items in reverse order" - Melanie
On Mon, Apr 01, 2024 at 05:17:51PM +0300, Heikki Linnakangas wrote: > On 30/03/2024 07:57, Melanie Plageman wrote: > > > The final state of the code could definitely use more cleanup. I've been > > staring at it for awhile, so I could use some thoughts/ideas about what > > part to focus on improving. > > Committed some of the changes. I plan to commit at least the first of these > remaining patches later today. I'm happy with it now, but I'll give it a > final glance over after dinner. > > I'll continue to review the rest after that, but attached is what I have > now. Review for 0003-0006 (I didn't have any new thoughts on 0002). I know you didn't modify them much/at all, but I noticed some things in my code that could be better. > From 17e183835a968e81daf7b74a4164b243e2de35aa Mon Sep 17 00:00:00 2001 > From: Melanie Plageman <melanieplageman@gmail.com> > Date: Fri, 29 Mar 2024 19:43:09 -0400 > Subject: [PATCH v11 3/7] Introduce PRUNE_DO_* actions > > We will eventually take additional actions in heap_page_prune() at the > discretion of the caller. For now, introduce these PRUNE_DO_* macros and > turn mark_unused_now, a paramter to heap_page_prune(), into a PRUNE_DO_ paramter -> parameter > action. > --- > src/backend/access/heap/pruneheap.c | 51 ++++++++++++++-------------- > src/backend/access/heap/vacuumlazy.c | 11 ++++-- > src/include/access/heapam.h | 13 ++++++- > 3 files changed, 46 insertions(+), 29 deletions(-) > > diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c > index fb0ad834f1b..30965c3c5a1 100644 > --- a/src/backend/access/heap/pruneheap.c > +++ b/src/backend/access/heap/pruneheap.c > @@ -29,10 +29,11 @@ > /* Working data for heap_page_prune and subroutines */ > typedef struct > { > + /* PRUNE_DO_* arguments */ > + uint8 actions; I wasn't sure if actions is a good name. What do you think? 
> + > /* tuple visibility test, initialized for the relation */ > GlobalVisState *vistest; > - /* whether or not dead items can be set LP_UNUSED during pruning */ > - bool mark_unused_now; > > TransactionId new_prune_xid; /* new prune hint value for page */ > TransactionId snapshotConflictHorizon; /* latest xid removed */ > diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h > index 32a3fbce961..35b8486c34a 100644 > --- a/src/include/access/heapam.h > +++ b/src/include/access/heapam.h > @@ -191,6 +191,17 @@ typedef struct HeapPageFreeze > > } HeapPageFreeze; > > +/* > + * Actions that can be taken during pruning and freezing. By default, we will > + * at least attempt regular pruning. > + */ > + > +/* > + * PRUNE_DO_MARK_UNUSED_NOW indicates whether or not dead items can be set > + * LP_UNUSED during pruning. > + */ > +#define PRUNE_DO_MARK_UNUSED_NOW (1 << 1) No reason for me to waste the zeroth bit here. I just realized that I did this with XLHP_IS_CATALOG_REL too. #define XLHP_IS_CATALOG_REL (1 << 1) > From 083690b946e19ab5e536a9f2689772e7b91d2a70 Mon Sep 17 00:00:00 2001 > From: Melanie Plageman <melanieplageman@gmail.com> > Date: Fri, 29 Mar 2024 21:22:14 -0400 > Subject: [PATCH v11 4/7] Prepare freeze tuples in heap_page_prune() > > diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h > index 35b8486c34a..ac129692c13 100644 > --- a/src/include/access/heapam.h > +++ b/src/include/access/heapam.h > > +/* > + * Prepare to freeze if advantageous or required and try to advance > + * relfrozenxid and relminmxid. To attempt freezing, we will need to determine > + * if the page is all frozen. So, if this action is set, we will also inform > + * the caller if the page is all-visible and/or all-frozen and calculate a I guess we don't inform the caller if the page is all-visible, so this is not quite right. > + * snapshot conflict horizon for updating the visibility map. 
> + */ > +#define PRUNE_DO_TRY_FREEZE (1 << 2) > From ef8cb2c089ad9474a6da309593029c08f71b0bb9 Mon Sep 17 00:00:00 2001 > From: Melanie Plageman <melanieplageman@gmail.com> > Date: Fri, 29 Mar 2024 21:36:37 -0400 > Subject: [PATCH v11 5/7] Set hastup in heap_page_prune > > diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c > index 8bdd6389b25..65b0ed185ff 100644 > --- a/src/backend/access/heap/pruneheap.c > +++ b/src/backend/access/heap/pruneheap.c > @@ -66,6 +66,9 @@ typedef struct > bool processed[MaxHeapTuplesPerPage + 1]; > > int ndeleted; /* Number of tuples deleted from the page */ > + > + /* Whether or not the page makes rel truncation unsafe */ > + bool hastup; > } PruneState; > > /* Local functions */ > @@ -271,6 +274,7 @@ heap_page_prune(Relation relation, Buffer buffer, > prstate.ndeleted = 0; > prstate.nroot_items = 0; > prstate.nheaponly_items = 0; > + prstate.hastup = false; > > /* > * If we will prepare to freeze tuples, consider that it might be possible > @@ -280,7 +284,7 @@ heap_page_prune(Relation relation, Buffer buffer, > presult->all_frozen = true; > else > presult->all_frozen = false; > - > + presult->hastup = prstate.hastup; > > /* > * presult->htsv is not initialized here because all ntuple spots in the > @@ -819,6 +823,8 @@ heap_prune_record_redirect(PruneState *prstate, > */ > if (was_normal) > prstate->ndeleted++; > + > + prstate->hastup = true; > } > > /* Record line pointer to be marked dead */ > @@ -901,11 +907,15 @@ static void > heap_prune_record_unchanged_lp_normal(Page page, int8 *htsv, PruneState *prstate, > PruneResult *presult, OffsetNumber offnum) > { > - HeapTupleHeader htup = (HeapTupleHeader) PageGetItem(page, PageGetItemId(page, offnum)); > + HeapTupleHeader htup; > > Assert(!prstate->processed[offnum]); > prstate->processed[offnum] = true; > > + presult->hastup = true; /* the page is not empty */ My fault, but hastup being set sometimes in PruneState and sometimes in PruneResult is 
quite unpalatable. > From dffa90d0bd8e972cfe26da96860051dc8c2a8576 Mon Sep 17 00:00:00 2001 > From: Melanie Plageman <melanieplageman@gmail.com> > Date: Sat, 30 Mar 2024 01:27:08 -0400 > Subject: [PATCH v11 6/7] Save dead tuple offsets during heap_page_prune > --- > diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c > index 212d76045ef..7f1e4db55c0 100644 > --- a/src/backend/access/heap/vacuumlazy.c > +++ b/src/backend/access/heap/vacuumlazy.c > @@ -1373,6 +1373,15 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno, > return false; > } > > +static int > +OffsetNumber_cmp(const void *a, const void *b) > +{ > + OffsetNumber na = *(const OffsetNumber *) a, > + nb = *(const OffsetNumber *) b; > + > + return na < nb ? -1 : na > nb; > +} This probably doesn't belong here. I noticed spgdoinsert.c had a static function for sorting OffsetNumbers, but I didn't see anything general purpose anywhere else. - Melanie
On 01/04/2024 19:08, Melanie Plageman wrote: > On Mon, Apr 01, 2024 at 05:17:51PM +0300, Heikki Linnakangas wrote: >> diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c >> @@ -256,15 +270,16 @@ heap_page_prune(Relation relation, Buffer buffer, >> tup.t_tableOid = RelationGetRelid(relation); >> >> /* >> - * Determine HTSV for all tuples. >> + * Determine HTSV for all tuples, and queue them up for processing as HOT >> + * chain roots or as a heap-only items. > > Reading this comment now as a whole, I would add something like > "Determining HTSV for all tuples once is required for correctness" to > the start of the second paragraph. The new conjunction on the first > paragraph sentence followed by the next paragraph is a bit confusing > because it sounds like queuing them up for processing is required for > correctness (which, I suppose it is on some level). Basically, I'm just > saying that it is now less clear what is required for correctness. Fixed. > While I recognize this is a matter of style and not important, I > personally prefer this for reverse looping: > > for (int i = prstate.nheaponly_items; i --> 0;) I don't think we use that style anywhere in the Postgres source tree currently. (And I don't like it ;-) ) > I do think a comment about the reverse order would be nice. I know it > says something above the first loop to this effect: > > * Processing the items in reverse order (and thus the tuples in > * increasing order) increases prefetching efficiency significantly / > * decreases the number of cache misses. > > So perhaps we could just say "as above, process the items in reverse > order" I'm actually not sure why it makes a difference. I would assume all the data to already be in CPU cache at this point, since the first loop already accessed it, so I think there's something else going on. But I didn't investigate it deeper. Anyway, added a comment. Committed the first of the remaining patches with those changes. 
And also this, which is worth calling out: > if (prstate.htsv[offnum] == HEAPTUPLE_DEAD) > { > ItemId itemid = PageGetItemId(page, offnum); > HeapTupleHeader htup = (HeapTupleHeader) PageGetItem(page, itemid); > > if (likely(!HeapTupleHeaderIsHotUpdated(htup))) > { > HeapTupleHeaderAdvanceConflictHorizon(htup, > &prstate.latest_xid_removed); > heap_prune_record_unused(&prstate, offnum, true); > } > else > { > /* > * This tuple should've been processed and removed as part of > * a HOT chain, so something's wrong. To preserve evidence, > * we don't dare to remove it. We cannot leave behind a DEAD > * tuple either, because that will cause VACUUM to error out. > * Throwing an error with a distinct error message seems like > * the least bad option. > */ > elog(ERROR, "dead heap-only tuple (%u, %d) is not linked to from any HOT chain", > blockno, offnum); > } > } > else > heap_prune_record_unchanged_lp_normal(page, &prstate, offnum); As you can see, I turned that into a hard error. Previously, that code was at the top of heap_prune_chain(), and it was normal to see DEAD heap-only tuples there, because they would presumably get processed later as part of a HOT chain. But now this is done after all the HOT chains have already been processed. Previously if there was a dead heap-only tuple like that on the page for some reason, it was silently not processed by heap_prune_chain() (because it was assumed that it would be processed later as part of a HOT chain), and was left behind as a HEAPTUPLE_DEAD tuple. If the pruning was done as part of VACUUM, VACUUM would fail with "ERROR: unexpected HeapTupleSatisfiesVacuum result". Or am I missing something? Now you get that above error also on on-access pruning, which is not ideal. But I don't remember hearing about corruption like that, and you'd get the error on VACUUM anyway. 
With the next patches, heap_prune_record_unchanged() will do more, and will also throw an error on a HEAPTUPLE_LIVE tuple, so even though in the first patch we could print just a WARNING and move on, it gets more awkward with the rest of the patches. (Continuing with the remaining patches..) -- Heikki Linnakangas Neon (https://neon.tech)
On Mon, Apr 1, 2024 at 1:37 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > Committed the first of the remaining patches with those changes. And > also this, which is worth calling out: > > > if (prstate.htsv[offnum] == HEAPTUPLE_DEAD) > > { > > ItemId itemid = PageGetItemId(page, offnum); > > HeapTupleHeader htup = (HeapTupleHeader) PageGetItem(page, itemid); > > > > if (likely(!HeapTupleHeaderIsHotUpdated(htup))) > > { > > HeapTupleHeaderAdvanceConflictHorizon(htup, > > &prstate.latest_xid_removed); > > heap_prune_record_unused(&prstate, offnum, true); > > } > > else > > { > > /* > > * This tuple should've been processed and removed as part of > > * a HOT chain, so something's wrong. To preserve evidence, > > * we don't dare to remove it. We cannot leave behind a DEAD > > * tuple either, because that will cause VACUUM to error out. > > * Throwing an error with a distinct error message seems like > > * the least bad option. > > */ > > elog(ERROR, "dead heap-only tuple (%u, %d) is not linked to from any HOT chain", > > blockno, offnum); > > } > > } > > else > > heap_prune_record_unchanged_lp_normal(page, &prstate, offnum); > > As you can see, I turned that into a hard error. Previously, that code > was at the top of heap_prune_chain(), and it was normal to see DEAD > heap-only tuples there, because they would presumably get processed > later as part of a HOT chain. But now this is done after all the HOT > chains have already been processed. > > Previously if there was a dead heap-only tuple like that on the page for > some reason, it was silently not processed by heap_prune_chain() > (because it was assumed that it would be processed later as part of a > HOT chain), and was left behind as a HEAPTUPLE_DEAD tuple. If the > pruning was done as part of VACUUM, VACUUM would fail with "ERROR: > unexpected HeapTupleSatisfiesVacuum result". Or am I missing something? I think you are right. 
I wasn't sure if there was some way for a HOT, DEAD tuple to be not HOT-updated, but that doesn't make much sense. > Now you get that above error also on on-access pruning, which is not > ideal. But I don't remember hearing about corruption like that, and > you'd get the error on VACUUM anyway. Yea, that makes sense. One thing I don't really understand is why vacuum has its own system for saving and restoring error information for context messages (LVSavedErrInfo and update/restore_vacuum_err_info()). I'll confess I don't know much about how error cleanup works in any sub-system. But it stuck out to me that vacuum has its own. I assume it is okay to add new error messages and they somehow will work with the existing system? - Melanie
On 01/04/2024 20:22, Melanie Plageman wrote: >> From 17e183835a968e81daf7b74a4164b243e2de35aa Mon Sep 17 00:00:00 2001 >> From: Melanie Plageman <melanieplageman@gmail.com> >> Date: Fri, 29 Mar 2024 19:43:09 -0400 >> Subject: [PATCH v11 3/7] Introduce PRUNE_DO_* actions >> >> We will eventually take additional actions in heap_page_prune() at the >> discretion of the caller. For now, introduce these PRUNE_DO_* macros and >> turn mark_unused_now, a paramter to heap_page_prune(), into a PRUNE_DO_ > > paramter -> parameter > >> action. >> --- >> src/backend/access/heap/pruneheap.c | 51 ++++++++++++++-------------- >> src/backend/access/heap/vacuumlazy.c | 11 ++++-- >> src/include/access/heapam.h | 13 ++++++- >> 3 files changed, 46 insertions(+), 29 deletions(-) >> >> diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c >> index fb0ad834f1b..30965c3c5a1 100644 >> --- a/src/backend/access/heap/pruneheap.c >> +++ b/src/backend/access/heap/pruneheap.c >> @@ -29,10 +29,11 @@ >> /* Working data for heap_page_prune and subroutines */ >> typedef struct >> { >> + /* PRUNE_DO_* arguments */ >> + uint8 actions; > > I wasn't sure if actions is a good name. What do you think? Committed this part, with the name 'options'. There's some precedent for that in heap_insert(). I decided to keep it a separate bool field here in the PruneState struct, though, and only changed it in the heap_page_prune() function signature. It didn't feel worth the code churn here, and 'prstate.mark_unused_now' is shorter than "(prstate.options & HEAP_PRUNE_PAGE_MARK_UNUSED_NOW) != 0" anyway. -- Heikki Linnakangas Neon (https://neon.tech)
On 01/04/2024 20:22, Melanie Plageman wrote: > Review for 0003-0006 (I didn't have any new thoughts on 0002). I know > you didn't modify them much/at all, but I noticed some things in my code > that could be better. Ok, here's what I have now. I made a lot of small comment changes here and there, and some minor local refactorings, but nothing major. I lost track of all the individual changes I'm afraid, so I'm afraid you'll have to just diff against the previous version if you want to see what's changed. I hope I didn't break anything. I'm pretty happy with this now. I will skim through it one more time later today or tomorrow, and commit. Please review once more if you have a chance. > This probably doesn't belong here. I noticed spgdoinsert.c had a static > function for sorting OffsetNumbers, but I didn't see anything general > purpose anywhere else. I copied the spgdoinsert.c implementation to vacuumlazy.c as is. Would be nice to have just one copy of this in some common place, but I also wasn't sure where to put it. -- Heikki Linnakangas Neon (https://neon.tech)
Attachment
On Tue, Apr 2, 2024 at 9:11 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 01/04/2024 20:22, Melanie Plageman wrote: > > Review for 0003-0006 (I didn't have any new thoughts on 0002). I know > > you didn't modify them much/at all, but I noticed some things in my code > > that could be better. > > Ok, here's what I have now. I made a lot of small comment changes here > and there, and some minor local refactorings, but nothing major. I lost > track of all the individual changes I'm afraid, so I'm afraid you'll > have to just diff against the previous version if you want to see what's > changed. I hope I didn't break anything. > > I'm pretty happy with this now. I will skim through it one more time > later today or tomorrow, and commit. Please review once more if you have > a chance. Thanks! 0001 looks good. Attached are some comment updates and such on top of 0001 and 0002. I started some performance testing of 0002 but haven't finished yet. I wanted to provide my other review first. > > This probably doesn't belong here. I noticed spgdoinsert.c had a static > > function for sorting OffsetNumbers, but I didn't see anything general > > purpose anywhere else. > > I copied the spgdoinsert.c implementation to vacuumlazy.c as is. Would > be nice to have just one copy of this in some common place, but I also > wasn't sure where to put it. I looked a bit through utils and common and didn't see anywhere obvious to put it. We could make a new file? 0003 fixes where you forgot to change the name of the qsort function, though. - Melanie
Attachment
On Tue, Apr 02, 2024 at 01:24:27PM -0400, Melanie Plageman wrote:
> On Tue, Apr 2, 2024 at 9:11 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 01/04/2024 20:22, Melanie Plageman wrote:
> > > Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
> > > you didn't modify them much/at all, but I noticed some things in my code
> > > that could be better.
> >
> > Ok, here's what I have now. I made a lot of small comment changes here
> > and there, and some minor local refactorings, but nothing major. I lost
> > track of all the individual changes I'm afraid, so I'm afraid you'll
> > have to just diff against the previous version if you want to see what's
> > changed. I hope I didn't break anything.
> >
> > I'm pretty happy with this now. I will skim through it one more time
> > later today or tomorrow, and commit. Please review once more if you have
> > a chance.
>
> Thanks!
>
> 0001 looks good. Attached are some comment updates and such on top of
> 0001 and 0002.
>
> I started some performance testing of 0002 but haven't finished yet. I
> wanted to provide my other review first.

I tried to do some performance tests of just on-access HOT pruning with
the patches in this thread applied. I'm not sure if I succeeded in being
targeted enough to have usable results.
Off-list Andres gave me some suggestions of how to improve my test case
and setup and this is what I ended up doing:

----------------------------------------
On-access pruning during a SELECT query:
----------------------------------------

# Make a table with a single not NULL column of a small datatype to fit
# as many tuples as possible on the page so each page we prune exercises
# those loops in heap_page_prune_and_freeze() and heap_prune_chain() as
# much as possible
psql -c "create table small(col smallint not null)"

# Insert data that is the same except for ~1 row per page with a
# different value
for i in $(seq 1000)
do
  psql -c "INSERT INTO small VALUES(2);" \
       -c "INSERT INTO small SELECT 1 FROM (SELECT generate_series(1,220));"
done

# COPY this data to a file
psql -c "COPY small TO '/tmp/small.data';"

# Start postgres bound to a single CPU core
# Run the following script with pgbench
#
# Make the table unlogged so we don't see the effects of WAL writes in
# results
#
# Make sure autovacuum doesn't run on the table

drop table if exists small;
create unlogged table small(col smallint not null) with (autovacuum_enabled = false);
copy small from '/tmp/small.data';
update small set col = 9 where col = 2;
select * from small where col = 0;

pgbench -n -f query.sql -t 100 -M prepared -r

# (I made sure that HOT pruning was actually happening for the SELECT
# query before running this with pgbench)

# There seemed to be no meaningful difference for this example with the
# patches:

on current master:

statement latencies in milliseconds and failures:
 12.387  0  drop table if exists small;
  1.914  0  create unlogged table small(col smallint not null) with (autovacuum_enabled = false);
100.152  0  copy small from '/tmp/small.data';
 49.480  0  update small set col = 9 where col = 2;
 46.835  0  select * from small where col = 0;

with the patches applied:

statement latencies in milliseconds and failures:
 13.281  0  drop table if exists small;
  1.952  0  create unlogged table small(col smallint not null) with (autovacuum_enabled = false);
 99.418  0  copy small from '/tmp/small.data';
 47.397  0  update small set col = 9 where col = 2;
 46.887  0  select * from small where col = 0;

--------------------------------
On-access pruning during UPDATE:
--------------------------------

# The idea is to test a query which would be calling the new
# heap_prune_record_unchanged_lp_normal() function a lot

# I made the same table but filled entirely with the same value
psql -c "create table small(col smallint not null)" \
     -c "INSERT INTO small SELECT 1 FROM (SELECT generate_series(1, 221000));"

# COPY this data to a file
psql -c "COPY small TO '/tmp/small_univalue.data';"

# Start postgres bound to a single CPU core
# Run the following script with pgbench
#
# Pick a low fillfactor so we have room for the HOT updates

drop table if exists small;
create unlogged table small(col smallint not null) with (autovacuum_enabled = false, fillfactor=25);
copy small from '/tmp/small_univalue.data';
update small set col = 3;
update small set col = 4;
update small set col = 5;
update small set col = 6;

pgbench -n -f update.sql -t 20 -M prepared -r

# There again seems to be no meaningful difference with the patches
# applied

on master:

statement latencies in milliseconds and failures:
 19.880  0  drop table if exists small;
  2.099  0  create unlogged table small(col smallint not null) with (autovacuum_enabled = false, fillfactor=25);
130.793  0  copy small from '/tmp/small_univalue.data';
377.707  0  update small set col = 3;
417.644  0  update small set col = 4;
483.974  0  update small set col = 5;
422.956  0  update small set col = 6;

with patches applied:

statement latencies in milliseconds and failures:
 19.995  0  drop table if exists small;
  2.034  0  create unlogged table small(col smallint not null) with (autovacuum_enabled = false, fillfactor=25);
124.270  0  copy small from '/tmp/small_univalue.data';
375.327  0  update small set col = 3;
419.336  0  update small set col = 4;
483.750  0  update small set col = 5;
420.451  0  update small set col = 6;

- Melanie
On 02/04/2024 16:11, Heikki Linnakangas wrote:
> On 01/04/2024 20:22, Melanie Plageman wrote:
>> Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
>> you didn't modify them much/at all, but I noticed some things in my code
>> that could be better.
>
> Ok, here's what I have now. I made a lot of small comment changes here
> and there, and some minor local refactorings, but nothing major. I lost
> track of all the individual changes I'm afraid, so I'm afraid you'll
> have to just diff against the previous version if you want to see what's
> changed. I hope I didn't break anything.
>
> I'm pretty happy with this now. I will skim through it one more time
> later today or tomorrow, and commit. Please review once more if you have
> a chance.
>
>> This probably doesn't belong here. I noticed spgdoinsert.c had a static
>> function for sorting OffsetNumbers, but I didn't see anything general
>> purpose anywhere else.
>
> I copied the spgdoinsert.c implementation to vacuumlazy.c as is. Would
> be nice to have just one copy of this in some common place, but I also
> wasn't sure where to put it.

One more version, with two small fixes:

1. I fumbled the offsetnumber-cmp function at the last minute so that it
didn't compile. Fixed that.

2. On VACUUM on an unlogged or temp table, the logic always thought that
we would be generating an FPI, causing it to always freeze when it
could. But of course, you never generate FPIs on an unlogged table.
Fixed that. (Perhaps we should indeed freeze more aggressively on an
unlogged table, but changing the heuristic is out of scope for this
patch.)

Off-list, Melanie reported that there is a small regression with the
benchmark script she posted yesterday, after all, but I'm not able to
reproduce that.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
Attachment
On Wed, Apr 3, 2024 at 8:39 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 02/04/2024 16:11, Heikki Linnakangas wrote:
> > On 01/04/2024 20:22, Melanie Plageman wrote:
> >> Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
> >> you didn't modify them much/at all, but I noticed some things in my code
> >> that could be better.
> >
> > Ok, here's what I have now. I made a lot of small comment changes here
> > and there, and some minor local refactorings, but nothing major. I lost
> > track of all the individual changes I'm afraid, so I'm afraid you'll
> > have to just diff against the previous version if you want to see what's
> > changed. I hope I didn't break anything.
> >
> > I'm pretty happy with this now. I will skim through it one more time
> > later today or tomorrow, and commit. Please review once more if you have
> > a chance.
> >
> >> This probably doesn't belong here. I noticed spgdoinsert.c had a static
> >> function for sorting OffsetNumbers, but I didn't see anything general
> >> purpose anywhere else.
> >
> > I copied the spgdoinsert.c implementation to vacuumlazy.c as is. Would
> > be nice to have just one copy of this in some common place, but I also
> > wasn't sure where to put it.
>
> One more version, with two small fixes:
>
> 1. I fumbled the offsetnumber-cmp function at the last minute so that it
> didn't compile. Fixed that.

I noticed you didn't make the comment updates I suggested in my
version 13 here [1]. A few of them are outdated references to
heap_page_prune() and one to a now deleted variable name
(all_visible_except_removable).

I applied them to your v13 and attached the diff.

> Off-list, Melanie reported that there is a small regression with the
> benchmark script she posted yesterday, after all, but I'm not able to
> reproduce that.

Actually, I think it was noise.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_aPqZkThyfr0USaHp-3cN_ruEdAHBKtNQJqXDTjWUz0rw%40mail.gmail.com
Attachment
On 03/04/2024 17:18, Melanie Plageman wrote:
> I noticed you didn't make the comment updates I suggested in my
> version 13 here [1]. A few of them are outdated references to
> heap_page_prune() and one to a now deleted variable name
> (all_visible_except_removable).
>
> I applied them to your v13 and attached the diff.

Applied those changes, and committed. Thank you!

>> Off-list, Melanie reported that there is a small regression with the
>> benchmark script she posted yesterday, after all, but I'm not able to
>> reproduce that.
>
> Actually, I think it was noise.

Ok, phew.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
On Wed, Apr 3, 2024 at 12:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 03/04/2024 17:18, Melanie Plageman wrote:
> > I noticed you didn't make the comment updates I suggested in my
> > version 13 here [1]. A few of them are outdated references to
> > heap_page_prune() and one to a now deleted variable name
> > (all_visible_except_removable).
> >
> > I applied them to your v13 and attached the diff.
>
> Applied those changes, and committed. Thank you!

Thanks! And thanks for updating the commitfest entry!

- Melanie
On Wed, Apr 3, 2024 at 1:04 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> Thanks! And thanks for updating the commitfest entry!

Nice work!

-- 
Peter Geoghegan