Thread: Combine Prune and Freeze records emitted by vacuum

Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
Hi,

Attached is a patch set which combines the freeze and prune records
for vacuum -- eliminating the overhead of a separate freeze WAL record
for every block frozen by vacuum. The contents of the freeze record
are added to the PRUNE record.

In cases where vacuum does freeze and prune, combining the WAL records
can reduce WAL bytes substantially and, as a consequence, reduce WAL
sync and write time.

For example:

psql -c "CREATE TABLE foo(id INT, a INT, b INT, c INT, d INT, e INT, f
INT, g INT, h TEXT) WITH (autovacuum_enabled=false);"

for x in $(seq 1 16);
do
  psql -c "INSERT INTO foo SELECT i, i, i, i, i, i, i, i, repeat('b',
30) FROM generate_series(1,2000000)i;"
done

psql -c "UPDATE foo SET a = 2 WHERE id % 7 = 0;"
psql -c "VACUUM (FREEZE) foo;"

Generates 30% fewer WAL records and 12% fewer WAL bytes -- which,
depending on what else is happening on the system, can lead to vacuum
spending substantially less time on WAL writing and syncing (often 15%
less time on WAL writes and 10% less time on syncing WAL in my
testing).

Though heap_page_prune() is also used by on-access pruning, on-access
pruning does not pass in the parameter used for freezing, so it should
incur limited additional overhead. The primary additional overhead
would be checking tuples' xmins against the GlobalVisState to
determine if the page would be all_visible and identify the visibility
cutoff xid. This is required to determine whether or not to
opportunistically freeze. We could condition this on the caller being
vacuum if needed.

Though, in the future, we may want to consider opportunistic/eager
freezing on access. This could allow us to, for example, freeze
bulk-loaded read-only data before it goes cold and avoid expensive
wraparound vacuuming.

There are other steps that we can take to decrease vacuum WAL volume
even further. Many of those are natural follow-ons to combining the
prune and freeze record. For example, I intend to propose combining
the visibility map update record into the Prune/Freeze and Vacuum
records -- eliminating an extra visibility map update record. This
would mean a single WAL record emitted per block for vacuum's first
pass.

On master, for my example above, of the roughly 1 million WAL records
emitted by vacuum, about 1/3 of them are prune records, 1/3 are freeze
records, and 1/3 are visibility map update records. So we will achieve
another substantial reduction in the number of WAL records and bytes
of WAL record overhead by eliminating a separate record for updating
the VM.

The attached patch set is broken up into many separate commits for
ease of review. Each patch does a single thing which can be explained
plainly in the commit message. Every commit passes tests and works on
its own.

0001 - 0003 cover checking tuples' xmins against the GlobalVisState in
heap_page_prune().

0004 - 0007 executes freezing in heap_page_prune() (for vacuum only).

0008 translates the eager/opportunistic freeze heuristic into one that
will work without relying on having a separate prune record. Elsewhere
in [1] we are discussing how to improve this heuristic.

0009 - 0012 merges the freeze record into the prune record.

0013 - 0015 removes the loop through the page in lazy_scan_prune() by
doing the accounting it did in heap_page_prune(). A nice bonus of this
patch set is that we can eliminate one of vacuum's loops through the
page.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_ZTDm1d9M%2BENf6oXhW9nRT3J76vOL1ianiCW4%2B4M6hMoA%40mail.gmail.com

Attachment

Re: Combine Prune and Freeze records emitted by vacuum

From
Heikki Linnakangas
Date:
On 25/01/2024 00:49, Melanie Plageman wrote:
> Generates 30% fewer WAL records and 12% fewer WAL bytes -- which,
> depending on what else is happening on the system, can lead to vacuum
> spending substantially less time on WAL writing and syncing (often 15%
> less time on WAL writes and 10% less time on syncing WAL in my
> testing).

Nice!

> The attached patch set is broken up into many separate commits for
> ease of review. Each patch does a single thing which can be explained
> plainly in the commit message. Every commit passes tests and works on
> its own.

About this very first change:

> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -1526,8 +1526,7 @@ lazy_scan_prune(LVRelState *vacrel,
>                       * that everyone sees it as committed?
>                       */
>                      xmin = HeapTupleHeaderGetXmin(htup);
> -                    if (!TransactionIdPrecedes(xmin,
> -                                               vacrel->cutoffs.OldestXmin))
> +                    if (!GlobalVisTestIsRemovableXid(vacrel->vistest, xmin))
>                      {
>                          all_visible = false;
>                          break;

Does GlobalVisTestIsRemovableXid() handle FrozenTransactionId correctly?

I read through all the patches in order, and aside from the above they 
all look correct to me. Some comments on the set as whole:

I don't think we need XLOG_HEAP2_FREEZE_PAGE as a separate record type 
anymore. By removing that, you also get rid of the freeze-only codepath 
near the end of heap_page_prune(), and the 
heap_freeze_execute_prepared() function.

The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than 
XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for 
the case that there's no pruning, just freezing. The record format 
(xl_heap_prune) looks pretty complex as it is, I think it could be made 
both more compact and more clear with some refactoring.

FreezeMultiXactId still takes a separate 'cutoffs' arg, but it could use 
pagefrz->cutoffs now.

HeapPageFreeze has two "trackers", for the "freeze" and "no freeze" 
cases. heap_page_prune() needs to track both, until it decides whether 
to freeze or not. But it doesn't make much sense that the caller 
(lazy_scan_prune()) has to initialize both, and has to choose which of 
the values to use after the call depending on whether heap_page_prune() 
froze or not. The two trackers should be just heap_page_prune()'s 
internal business.

HeapPageFreeze is a bit confusing in general, as it's both an input and 
an output to heap_page_prune(). Not sure what exactly to do there, but I 
feel that we should make heap_page_prune()'s interface more clear. 
Perhaps move the output fields to PruneResult.

Let's rename heap_page_prune() to heap_page_prune_and_freeze(), as 
freezing is now an integral part of the function. And mention it in the 
function comment, too.

In heap_prune_chain:

>  * Tuple visibility information is provided in presult->htsv.

Not this patch's fault directly, but it's not immediate clear what "is 
provided" means here. Does the caller provide it, or does the function 
set it, ie. is it an input or output argument? Looking at the code, it's 
an input, but now it looks a bit weird that an input argument is called 
'presult'.

-- 
Heikki Linnakangas
Neon (https://neon.tech)



Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
Thanks so much for the review!

On Wed, Mar 6, 2024 at 7:59 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 25/01/2024 00:49, Melanie Plageman wrote:
>
> > The attached patch set is broken up into many separate commits for
> > ease of review. Each patch does a single thing which can be explained
> > plainly in the commit message. Every commit passes tests and works on
> > its own.
>
> About this very first change:
>
> > --- a/src/backend/access/heap/vacuumlazy.c
> > +++ b/src/backend/access/heap/vacuumlazy.c
> > @@ -1526,8 +1526,7 @@ lazy_scan_prune(LVRelState *vacrel,
> >                                        * that everyone sees it as committed?
> >                                        */
> >                                       xmin = HeapTupleHeaderGetXmin(htup);
> > -                                     if (!TransactionIdPrecedes(xmin,
> > -
vacrel->cutoffs.OldestXmin))
> > +                                     if (!GlobalVisTestIsRemovableXid(vacrel->vistest, xmin))
> >                                       {
> >                                               all_visible = false;
> >                                               break;
>
> Does GlobalVisTestIsRemovableXid() handle FrozenTransactionId correctly?

Okay, so I thought a lot about this, and I don't understand how
GlobalVisTestIsRemovableXid() would not handle FrozenTransactionId
correctly.

vacrel->cutoffs.OldestXmin is computed initially from
GetOldestNonRemovableTransactionId() which uses ComputeXidHorizons().
GlobalVisState is updated by ComputeXidHorizons() (when it is
updated).

I do see that the comment above GlobalVisTestIsRemovableXid() says

 * It is crucial that this only gets called for xids from a source that
 * protects against xid wraparounds (e.g. from a table and thus protected by
 * relfrozenxid).

and then in

     * Convert 32 bit argument to FullTransactionId. We can do so safely
     * because we know the xid has to, at the very least, be between
     * [oldestXid, nextXid), i.e. within 2 billion of xid.

I'm not sure what oldestXid is here.
It's true that I don't see GlobalVisTestIsRemovableXid() being called
anywhere else with an xmin as an input. I think that hints that it is
not safe with FrozenTransactionId. But I don't see what could go
wrong.

Maybe it has something to do with converting it to a FullTransactionId?

   FullTransactionIdFromU64(U64FromFullTransactionId(rel)  + (int32)
(xid - rel_xid));

Sorry, I couldn't quite figure it out :(

> I read through all the patches in order, and aside from the above they
> all look correct to me. Some comments on the set as whole:
>
> I don't think we need XLOG_HEAP2_FREEZE_PAGE as a separate record type
> anymore. By removing that, you also get rid of the freeze-only codepath
> near the end of heap_page_prune(), and the
> heap_freeze_execute_prepared() function.

That makes sense to me.

> The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than
> XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for
> the case that there's no pruning, just freezing. The record format
> (xl_heap_prune) looks pretty complex as it is, I think it could be made
> both more compact and more clear with some refactoring.

I'm happy to change up xl_heap_prune format. In its current form,
according to pahole, it has no holes and just 3 bytes of padding at
the end.

One way we could make it smaller is by moving the isCatalogRel member
to directly after snapshotConflictHorizon and then adding a flags
field and defining flags to indicate whether or not other members
exist at all. We could set bits for HAS_FREEZE_PLANS, HAS_REDIRECTED,
HAS_UNUSED, HAS_DEAD. Then I would remove the non-optional uint16
nredirected, nunused, nplans, ndead and just put the number of
redirected/unused/etc at the beginning of the arrays containing the
offsets. Then I could write various macros for accessing them. That
would make it smaller, but it definitely wouldn't make it less complex
(IMO).

> FreezeMultiXactId still takes a separate 'cutoffs' arg, but it could use
> pagefrz->cutoffs now.

Yep, I forgot to add a commit to do this. Thanks!

> HeapPageFreeze has two "trackers", for the "freeze" and "no freeze"
> cases. heap_page_prune() needs to track both, until it decides whether
> to freeze or not. But it doesn't make much sense that the caller
> (lazy_scan_prune()) has to initialize both, and has to choose which of
> the values to use after the call depending on whether heap_page_prune()
> froze or not. The two trackers should be just heap_page_prune()'s
> internal business.
>
> HeapPageFreeze is a bit confusing in general, as it's both an input and
> an output to heap_page_prune(). Not sure what exactly to do there, but I
> feel that we should make heap_page_prune()'s interface more clear.
> Perhaps move the output fields to PruneResult.

Great point. Perhaps I just add NewRelfrozenXid and NewRelminMxid to
PruneResult (and call it PruneFreezeResult) and then make
VacuumCutoffs an optional argument to heap_page_prune() (used by
vacuum and not on-access pruning). Then I eliminate HeapPageFreeze as
a parameter altogether.

> Let's rename heap_page_prune() to heap_page_prune_and_freeze(), as
> freezing is now an integral part of the function. And mention it in the
> function comment, too.

Agreed. Will do in the next version. I want to get some consensus on
what to do with xl_heap_prune before going on my rebase journey with
these 15 patches.

> In heap_prune_chain:
>
> >  * Tuple visibility information is provided in presult->htsv.
>
> Not this patch's fault directly, but it's not immediate clear what "is
> provided" means here. Does the caller provide it, or does the function
> set it, ie. is it an input or output argument? Looking at the code, it's
> an input, but now it looks a bit weird that an input argument is called
> 'presult'.

So, htsv is a member of PruneResult on master because
heap_page_prune() populates PruneResult->htsv for use in
lazy_scan_prune(). heap_prune_chain() doesn't have access to
PruneResult on master. Once I move PruneResult to being populated both
by heap_page_prune() and heap_prune_chain(), it gets more confusing.
htsv is always populated in heap_page_prune(), but it is not until
later patches in the set that I stop accessing it in
lazy_scan_prune(). Once I do so, I move htsv from PruneResult into
PruneState -- which fixes the heap_prune_chain() confusion.

So, only intermediate commits in the set have PruneResult->htsv used
in heap_prune_chain(). The end state is that heap_prune_chain()
accesses PruneState->htsv. However, I can document how it is used more
clearly in the function comment in the intermediate commits. Or, I can
simply leave htsv as a separate input argument to heap_prune_chain()
in the intermediate commits.

- Melanie



Re: Combine Prune and Freeze records emitted by vacuum

From
Heikki Linnakangas
Date:
On 09/03/2024 22:41, Melanie Plageman wrote:
> On Wed, Mar 6, 2024 at 7:59 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Does GlobalVisTestIsRemovableXid() handle FrozenTransactionId correctly?
> 
> Okay, so I thought a lot about this, and I don't understand how
> GlobalVisTestIsRemovableXid() would not handle FrozenTransactionId
> correctly.
> 
> vacrel->cutoffs.OldestXmin is computed initially from
> GetOldestNonRemovableTransactionId() which uses ComputeXidHorizons().
> GlobalVisState is updated by ComputeXidHorizons() (when it is
> updated).
> 
> I do see that the comment above GlobalVisTestIsRemovableXid() says
> 
>   * It is crucial that this only gets called for xids from a source that
>   * protects against xid wraparounds (e.g. from a table and thus protected by
>   * relfrozenxid).
> 
> and then in
> 
>       * Convert 32 bit argument to FullTransactionId. We can do so safely
>       * because we know the xid has to, at the very least, be between
>       * [oldestXid, nextXid), i.e. within 2 billion of xid.
> 
> I'm not sure what oldestXid is here.
> It's true that I don't see GlobalVisTestIsRemovableXid() being called
> anywhere else with an xmin as an input. I think that hints that it is
> not safe with FrozenTransactionId. But I don't see what could go
> wrong.
> 
> Maybe it has something to do with converting it to a FullTransactionId?
> 
>     FullTransactionIdFromU64(U64FromFullTransactionId(rel)  + (int32)
> (xid - rel_xid));
> 
> Sorry, I couldn't quite figure it out :(

I just tested it. No, GlobalVisTestIsRemovableXid does not work for 
FrozenTransactionId. I just tested it with state->definitely_needed == 
{0, 4000000000} and xid == FrozenTransactionid, and it incorrectly 
returned 'false'. It treats FrozenTransactionId as if was a regular xid '2'.

>> The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than
>> XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for
>> the case that there's no pruning, just freezing. The record format
>> (xl_heap_prune) looks pretty complex as it is, I think it could be made
>> both more compact and more clear with some refactoring.
> 
> I'm happy to change up xl_heap_prune format. In its current form,
> according to pahole, it has no holes and just 3 bytes of padding at
> the end.
> 
> One way we could make it smaller is by moving the isCatalogRel member
> to directly after snapshotConflictHorizon and then adding a flags
> field and defining flags to indicate whether or not other members
> exist at all. We could set bits for HAS_FREEZE_PLANS, HAS_REDIRECTED,
> HAS_UNUSED, HAS_DEAD. Then I would remove the non-optional uint16
> nredirected, nunused, nplans, ndead and just put the number of
> redirected/unused/etc at the beginning of the arrays containing the
> offsets.

Sounds good.

> Then I could write various macros for accessing them. That
> would make it smaller, but it definitely wouldn't make it less complex
> (IMO).

I don't know, it might turn out not that complex. If you define the 
formats of each of those "sub-record types" as explicit structs, per 
attached sketch, you won't need so many macros. Some care is still 
needed with alignment though.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
Attachment

Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Mon, Mar 11, 2024 at 6:38 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 09/03/2024 22:41, Melanie Plageman wrote:
> > On Wed, Mar 6, 2024 at 7:59 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >> Does GlobalVisTestIsRemovableXid() handle FrozenTransactionId correctly?
> >
> > Okay, so I thought a lot about this, and I don't understand how
> > GlobalVisTestIsRemovableXid() would not handle FrozenTransactionId
> > correctly.
> >
> > vacrel->cutoffs.OldestXmin is computed initially from
> > GetOldestNonRemovableTransactionId() which uses ComputeXidHorizons().
> > GlobalVisState is updated by ComputeXidHorizons() (when it is
> > updated).
> >
> > I do see that the comment above GlobalVisTestIsRemovableXid() says
> >
> >   * It is crucial that this only gets called for xids from a source that
> >   * protects against xid wraparounds (e.g. from a table and thus protected by
> >   * relfrozenxid).
> >
> > and then in
> >
> >       * Convert 32 bit argument to FullTransactionId. We can do so safely
> >       * because we know the xid has to, at the very least, be between
> >       * [oldestXid, nextXid), i.e. within 2 billion of xid.
> >
> > I'm not sure what oldestXid is here.
> > It's true that I don't see GlobalVisTestIsRemovableXid() being called
> > anywhere else with an xmin as an input. I think that hints that it is
> > not safe with FrozenTransactionId. But I don't see what could go
> > wrong.
> >
> > Maybe it has something to do with converting it to a FullTransactionId?
> >
> >     FullTransactionIdFromU64(U64FromFullTransactionId(rel)  + (int32)
> > (xid - rel_xid));
> >
> > Sorry, I couldn't quite figure it out :(
>
> I just tested it. No, GlobalVisTestIsRemovableXid does not work for
> FrozenTransactionId. I just tested it with state->definitely_needed ==
> {0, 4000000000} and xid == FrozenTransactionid, and it incorrectly
> returned 'false'. It treats FrozenTransactionId as if was a regular xid '2'.

I see. Looking at the original code:
      if (!TransactionIdPrecedes(xmin,
                                 vacrel->cutoffs.OldestXmin))

And the source code for TransactionIdPrecedes:

    if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
        return (id1 < id2);

    diff = (int32) (id1 - id2);
    return (diff < 0);

In your example, It seems like you mean GlobalVisState->maybe_needed is
0 and GlobalVisState->definitely_needed = 4000000000. In this example,
if vacrel->cutoffs.OldestXmin was 0, we would get a wrong answer also.

But, I do see that the comparison done by TransactionIdPrecedes() is is
quite different than that done by FullTransactionIdPrecedes() because of
the modulo 2**32 arithmetic.

Solving the handling of FrozenTransactionId specifically by
GlobalVisTestIsRemovableXid() seems like it would be done easily in our
case by wrapping it in a function which checks if
TransactionIdIsNormal() and returns true if it is not normal. But, I'm
not sure if I am missing the larger problem.

> >> The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than
> >> XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for
> >> the case that there's no pruning, just freezing. The record format
> >> (xl_heap_prune) looks pretty complex as it is, I think it could be made
> >> both more compact and more clear with some refactoring.
> >
> > I'm happy to change up xl_heap_prune format. In its current form,
> > according to pahole, it has no holes and just 3 bytes of padding at
> > the end.
> >
> > One way we could make it smaller is by moving the isCatalogRel member
> > to directly after snapshotConflictHorizon and then adding a flags
> > field and defining flags to indicate whether or not other members
> > exist at all. We could set bits for HAS_FREEZE_PLANS, HAS_REDIRECTED,
> > HAS_UNUSED, HAS_DEAD. Then I would remove the non-optional uint16
> > nredirected, nunused, nplans, ndead and just put the number of
> > redirected/unused/etc at the beginning of the arrays containing the
> > offsets.
>
> Sounds good.
>
> > Then I could write various macros for accessing them. That
> > would make it smaller, but it definitely wouldn't make it less complex
> > (IMO).
>
> I don't know, it might turn out not that complex. If you define the
> formats of each of those "sub-record types" as explicit structs, per
> attached sketch, you won't need so many macros. Some care is still
> needed with alignment though.

In the attached v2, I've done as you suggested and made all members
except flags and snapshotConflictHorizon optional in the xl_heap_prune
struct and obsoleted the xl_heap_freeze struct. I've kept the actual
xl_heap_freeze_page struct and heap_xlog_freeze_page() function so that
we can replay previously made XLOG_HEAP2_FREEZE_PAGE records.

Because we may set line pointers unused during vacuum's first pass, I
couldn't use the presence of nowunused without redirected or dead items
to indicate that this was a record emitted by vacuum's second pass. As
such, I haven't obsoleted the xl_heap_vacuum struct. I was thinking I
could add a flag that indicates the record was emitted by vacuum's
second pass? We would want to distinguish this so that we could set the
items unused without calling heap_page_prune_execute() -- because that
calls PageRepairFragmentation() which requires a full cleanup lock.

I introduced a few sub-record types similar to what you suggested --
they help a bit with alignment, so I think they are worth keeping. There
are comments around them, but perhaps a larger diagram of the layout of
a the new XLOG_HEAP2_PRUNE record would be helpful.

There is a bit of duplicated code between heap_xlog_prune() and
heap2_desc() since they both need to deserialize the record. Before the
code to do this was small and it didn't matter, but it might be worth
refactoring it that way now.

Note that I've made all of the changes related to obsoleting the
XLOG_HEAP2_FREEZE_PAGE record in separate commits on top of the rest of
the set for ease of review. However, I've rebased the other review
feedback into the relevant commits.

On Wed, Mar 6, 2024 at 7:59 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> I don't think we need XLOG_HEAP2_FREEZE_PAGE as a separate record type
> anymore. By removing that, you also get rid of the freeze-only codepath
> near the end of heap_page_prune(), and the
> heap_freeze_execute_prepared() function.
>
> The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than
> XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for
> the case that there's no pruning, just freezing. The record format
> (xl_heap_prune) looks pretty complex as it is, I think it could be made
> both more compact and more clear with some refactoring.

On the point of removing the freeze-only code path from
heap_page_prune() (now heap_page_prune_and_freeze()): while doing this,
I realized that heap_pre_freeze_checks() was not being called in the
case that we decide to freeze because we emitted an FPI while setting
the hint bit. I've fixed that, however, I've done so by moving
heap_pre_freeze_checks() into the critical section. I think that is not
okay? I could move it earlier and not do call it when the hint bit FPI
leads us to freeze tuples. But, I think that would lead to us doing a
lot less validation of tuples being frozen when checksums are enabled.
Or, I could make two critical sections?

> FreezeMultiXactId still takes a separate 'cutoffs' arg, but it could use
> pagefrz->cutoffs now.

Fixed this.

> HeapPageFreeze has two "trackers", for the "freeze" and "no freeze"
> cases. heap_page_prune() needs to track both, until it decides whether
> to freeze or not. But it doesn't make much sense that the caller
> (lazy_scan_prune()) has to initialize both, and has to choose which of
> the values to use after the call depending on whether heap_page_prune()
> froze or not. The two trackers should be just heap_page_prune()'s
> internal business.

I've added new_relminmxid and new_relfrozenxid to PruneFreezeResult and
set them appropriately in heap_page_prune_and_freeze().

It's a bit sad because if it wasn't for vacrel->skippedallvis,
vacrel->NewRelfrozenXid and vacrel->NewRelminMxid would be
vacrel->cutoffs.OldestXmin and vacrel->cutoffs.OldestMxact respectively
and we could avoid having lazy_scan_prune() initializing the
HeapPageFreeze altogether and just pass VacuumCutoffs (and
heap_page_prune_opt() could pass NULL) to heap_page_prune_and_freeze().

I think it is probably worse to add both of them as additional optional
arguments, so I've just left lazy_scan_prune() with the job of
initializing them.

In heap_page_prune_and_freeze(), I initialize new_relminmxid and
new_relfrozenxid to InvalidMultiXactId and InvalidTransactionId
respectively because on-access pruning doesn't have a value to set them
to. But I wasn't sure if this was okay -- since I don't see validation
that TransactionIdIsValid() in vac_update_relstats(). It will work now
-- just worried about future issues. I could add an assert there?

> HeapPageFreeze is a bit confusing in general, as it's both an input and
> an output to heap_page_prune(). Not sure what exactly to do there, but I
> feel that we should make heap_page_prune()'s interface more clear.
> Perhaps move the output fields to PruneResult.

HeapPageFrz is now only an input argument to
heap_page_prune_and_freeze() as of the commit in which
heap_page_prune_and_freeze() becomes responsible for executing freezing.
It is still an in/out param to earlier commits because we still need
info from it to execute freezing in lazy_scan_prune().

> Let's rename heap_page_prune() to heap_page_prune_and_freeze(), as
> freezing is now an integral part of the function. And mention it in the
> function comment, too.

I've done this.

> In heap_prune_chain:
>
> >  * Tuple visibility information is provided in presult->htsv.
>
> Not this patch's fault directly, but it's not immediate clear what "is
> provided" means here. Does the caller provide it, or does the function
> set it, ie. is it an input or output argument? Looking at the code, it's
> an input, but now it looks a bit weird that an input argument is called
> 'presult'.

I haven't updated the comments about this in the intermediate commits
since it ends up in the PruneState as an input.

- Melanie

Attachment

Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Wed, Mar 13, 2024 at 07:25:56PM -0400, Melanie Plageman wrote:
> On Mon, Mar 11, 2024 at 6:38 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 09/03/2024 22:41, Melanie Plageman wrote:
> > > On Wed, Mar 6, 2024 at 7:59 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > >> Does GlobalVisTestIsRemovableXid() handle FrozenTransactionId correctly?
> > >
> > > Okay, so I thought a lot about this, and I don't understand how
> > > GlobalVisTestIsRemovableXid() would not handle FrozenTransactionId
> > > correctly.
> > >
> > > vacrel->cutoffs.OldestXmin is computed initially from
> > > GetOldestNonRemovableTransactionId() which uses ComputeXidHorizons().
> > > GlobalVisState is updated by ComputeXidHorizons() (when it is
> > > updated).
> > >
> > > I do see that the comment above GlobalVisTestIsRemovableXid() says
> > >
> > >   * It is crucial that this only gets called for xids from a source that
> > >   * protects against xid wraparounds (e.g. from a table and thus protected by
> > >   * relfrozenxid).
> > >
> > > and then in
> > >
> > >       * Convert 32 bit argument to FullTransactionId. We can do so safely
> > >       * because we know the xid has to, at the very least, be between
> > >       * [oldestXid, nextXid), i.e. within 2 billion of xid.
> > >
> > > I'm not sure what oldestXid is here.
> > > It's true that I don't see GlobalVisTestIsRemovableXid() being called
> > > anywhere else with an xmin as an input. I think that hints that it is
> > > not safe with FrozenTransactionId. But I don't see what could go
> > > wrong.
> > >
> > > Maybe it has something to do with converting it to a FullTransactionId?
> > >
> > >     FullTransactionIdFromU64(U64FromFullTransactionId(rel)  + (int32)
> > > (xid - rel_xid));
> > >
> > > Sorry, I couldn't quite figure it out :(
> >
> > I just tested it. No, GlobalVisTestIsRemovableXid does not work for
> > FrozenTransactionId. I just tested it with state->definitely_needed ==
> > {0, 4000000000} and xid == FrozenTransactionid, and it incorrectly
> > returned 'false'. It treats FrozenTransactionId as if was a regular xid '2'.
> 
> I see. Looking at the original code:
>       if (!TransactionIdPrecedes(xmin,
>                                  vacrel->cutoffs.OldestXmin))
> 
> And the source code for TransactionIdPrecedes:
> 
>     if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
>         return (id1 < id2);
> 
>     diff = (int32) (id1 - id2);
>     return (diff < 0);
> 
> In your example, It seems like you mean GlobalVisState->maybe_needed is
> 0 and GlobalVisState->definitely_needed = 4000000000. In this example,
> if vacrel->cutoffs.OldestXmin was 0, we would get a wrong answer also.
> 
> But, I do see that the comparison done by TransactionIdPrecedes() is is
> quite different than that done by FullTransactionIdPrecedes() because of
> the modulo 2**32 arithmetic.
> 
> Solving the handling of FrozenTransactionId specifically by
> GlobalVisTestIsRemovableXid() seems like it would be done easily in our
> case by wrapping it in a function which checks if
> TransactionIdIsNormal() and returns true if it is not normal. But, I'm
> not sure if I am missing the larger problem.

I've made such a wrapper in attached v3.

> > >> The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than
> > >> XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for
> > >> the case that there's no pruning, just freezing. The record format
> > >> (xl_heap_prune) looks pretty complex as it is, I think it could be made
> > >> both more compact and more clear with some refactoring.
> > >
> > > I'm happy to change up xl_heap_prune format. In its current form,
> > > according to pahole, it has no holes and just 3 bytes of padding at
> > > the end.
> > >
> > > One way we could make it smaller is by moving the isCatalogRel member
> > > to directly after snapshotConflictHorizon and then adding a flags
> > > field and defining flags to indicate whether or not other members
> > > exist at all. We could set bits for HAS_FREEZE_PLANS, HAS_REDIRECTED,
> > > HAS_UNUSED, HAS_DEAD. Then I would remove the non-optional uint16
> > > nredirected, nunused, nplans, ndead and just put the number of
> > > redirected/unused/etc at the beginning of the arrays containing the
> > > offsets.
> >
> > Sounds good.
> >
> > > Then I could write various macros for accessing them. That
> > > would make it smaller, but it definitely wouldn't make it less complex
> > > (IMO).
> >
> > I don't know, it might turn out not that complex. If you define the
> > formats of each of those "sub-record types" as explicit structs, per
> > attached sketch, you won't need so many macros. Some care is still
> > needed with alignment though.
> 
> In the attached v2, I've done as you suggested and made all members
> except flags and snapshotConflictHorizon optional in the xl_heap_prune
> struct and obsoleted the xl_heap_freeze struct. I've kept the actual
> xl_heap_freeze_page struct and heap_xlog_freeze_page() function so that
> we can replay previously made XLOG_HEAP2_FREEZE_PAGE records.
> 
> Because we may set line pointers unused during vacuum's first pass, I
> couldn't use the presence of nowunused without redirected or dead items
> to indicate that this was a record emitted by vacuum's second pass. As
> such, I haven't obsoleted the xl_heap_vacuum struct. I was thinking I
> could add a flag that indicates the record was emitted by vacuum's
> second pass? We would want to distinguish this so that we could set the
> items unused without calling heap_page_prune_execute() -- because that
> calls PageRepairFragmentation() which requires a full cleanup lock.

Okay, so I was going to start using xl_heap_prune for vacuum here too,
but I realized it would be bigger because of the
snapshotConflictHorizon. Do you think there is a non-terrible way to
make the snapshotConflictHorizon optional? Like with a flag?

> I introduced a few sub-record types similar to what you suggested --
> they help a bit with alignment, so I think they are worth keeping. There
> are comments around them, but perhaps a larger diagram of the layout of
> a the new XLOG_HEAP2_PRUNE record would be helpful.

I started doing this, but I can't find a way of laying out the diagram
that pgindent doesn't destroy. I thought I remember other diagrams in
the source code showing the layout of something (something with pages
somewhere?) that don't get messed up by pgindent, but I couldn't find
them.

> There is a bit of duplicated code between heap_xlog_prune() and
> heap2_desc() since they both need to deserialize the record. Before the
> code to do this was small and it didn't matter, but it might be worth
> refactoring it that way now.

I have added a helper function to do the deserialization,
heap_xlog_deserialize_prune_and_freeze(). But I didn't start using it in
heap2_desc() because of the way the pg_waldump build file works. Do you
think the helper belongs in any of waldump's existing sources?

    pg_waldump_sources = files(
        'compat.c',
        'pg_waldump.c',
        'rmgrdesc.c',
    )

    pg_waldump_sources += rmgr_desc_sources
    pg_waldump_sources += xlogreader_sources
    pg_waldump_sources += files('../../backend/access/transam/xlogstats.c')

Otherwise, I assume I am suppose to avoid adding some big new include to
waldump.

> On Wed, Mar 6, 2024 at 7:59 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > I don't think we need XLOG_HEAP2_FREEZE_PAGE as a separate record type
> > anymore. By removing that, you also get rid of the freeze-only codepath
> > near the end of heap_page_prune(), and the
> > heap_freeze_execute_prepared() function.
> >
> > The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than
> > XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for
> > the case that there's no pruning, just freezing. The record format
> > (xl_heap_prune) looks pretty complex as it is, I think it could be made
> > both more compact and more clear with some refactoring.
> 
> On the point of removing the freeze-only code path from
> heap_page_prune() (now heap_page_prune_and_freeze()): while doing this,
> I realized that heap_pre_freeze_checks() was not being called in the
> case that we decide to freeze because we emitted an FPI while setting
> the hint bit. I've fixed that, however, I've done so by moving
> heap_pre_freeze_checks() into the critical section. I think that is not
> okay? I could move it earlier and not do call it when the hint bit FPI
> leads us to freeze tuples. But, I think that would lead to us doing a
> lot less validation of tuples being frozen when checksums are enabled.
> Or, I could make two critical sections?

I found another approach and just do the pre-freeze checks if we are
considering freezing except for the FPI criteria.

> > HeapPageFreeze has two "trackers", for the "freeze" and "no freeze"
> > cases. heap_page_prune() needs to track both, until it decides whether
> > to freeze or not. But it doesn't make much sense that the caller
> > (lazy_scan_prune()) has to initialize both, and has to choose which of
> > the values to use after the call depending on whether heap_page_prune()
> > froze or not. The two trackers should be just heap_page_prune()'s
> > internal business.
> 
> I've added new_relminmxid and new_relfrozenxid to PruneFreezeResult and
> set them appropriately in heap_page_prune_and_freeze().
> 
> It's a bit sad because if it wasn't for vacrel->skippedallvis,
> vacrel->NewRelfrozenXid and vacrel->NewRelminMxid would be
> vacrel->cutoffs.OldestXmin and vacrel->cutoffs.OldestMxact respectively
> and we could avoid having lazy_scan_prune() initializing the
> HeapPageFreeze altogether and just pass VacuumCutoffs (and
> heap_page_prune_opt() could pass NULL) to heap_page_prune_and_freeze().
> 
> I think it is probably worse to add both of them as additional optional
> arguments, so I've just left lazy_scan_prune() with the job of
> initializing them.
> 
> In heap_page_prune_and_freeze(), I initialize new_relminmxid and
> new_relfrozenxid to InvalidMultiXactId and InvalidTransactionId
> respectively because on-access pruning doesn't have a value to set them
> to. But I wasn't sure if this was okay -- since I don't see validation
> that TransactionIdIsValid() in vac_update_relstats(). It will work now
> -- just worried about future issues. I could add an assert there?

I looked more closely at vac_update_relstats() and it won't update
relfrozenxid unless the proposed value is smaller than the existing one.
And for MultiXactIds, it checks if it is valid. So, this is not an
issue.

I've also optimized the member ordering of PruneFreezeResult. pahole
identified some avoidable holes. I went through each commit and
optimized the PruneResult data structure as members are being added and
removed.

- Melanie

Attachment

Re: Combine Prune and Freeze records emitted by vacuum

From
Heikki Linnakangas
Date:
On 15/03/2024 02:56, Melanie Plageman wrote:
> Okay, so I was going to start using xl_heap_prune for vacuum here too,
> but I realized it would be bigger because of the
> snapshotConflictHorizon. Do you think there is a non-terrible way to
> make the snapshotConflictHorizon optional? Like with a flag?

Yeah, another flag would do the trick.

>> I introduced a few sub-record types similar to what you suggested --
>> they help a bit with alignment, so I think they are worth keeping. There
>> are comments around them, but perhaps a larger diagram of the layout of
>> a the new XLOG_HEAP2_PRUNE record would be helpful.
> 
> I started doing this, but I can't find a way of laying out the diagram
> that pgindent doesn't destroy. I thought I remember other diagrams in
> the source code showing the layout of something (something with pages
> somewhere?) that don't get messed up by pgindent, but I couldn't find
> them.

See src/tools/pgindent/README, section "Cleaning up in case of failure 
or ugly output":

     /*----------
      * Text here will not be touched by pgindent.
      */

>> There is a bit of duplicated code between heap_xlog_prune() and
>> heap2_desc() since they both need to deserialize the record. Before the
>> code to do this was small and it didn't matter, but it might be worth
>> refactoring it that way now.
> 
> I have added a helper function to do the deserialization,
> heap_xlog_deserialize_prune_and_freeze(). But I didn't start using it in
> heap2_desc() because of the way the pg_waldump build file works. Do you
> think the helper belongs in any of waldump's existing sources?
> 
>     pg_waldump_sources = files(
>         'compat.c',
>         'pg_waldump.c',
>         'rmgrdesc.c',
>     )
> 
>     pg_waldump_sources += rmgr_desc_sources
>     pg_waldump_sources += xlogreader_sources
>     pg_waldump_sources += files('../../backend/access/transam/xlogstats.c')
> 
> Otherwise, I assume I am suppose to avoid adding some big new include to
> waldump.

Didn't look closely at that, but there's some precedent with 
commit/prepare/abort records. See ParseCommitRecord, xl_xact_commit, 
xl_parsed_commit et al.

Note that we don't provide WAL compatibility across major versions. You 
can fully remove the old xl_heap_freeze_page format. (We should bump 
XLOG_PAGE_MAGIC when this is committed though)

>> On the point of removing the freeze-only code path from
>> heap_page_prune() (now heap_page_prune_and_freeze()): while doing this,
>> I realized that heap_pre_freeze_checks() was not being called in the
>> case that we decide to freeze because we emitted an FPI while setting
>> the hint bit. I've fixed that, however, I've done so by moving
>> heap_pre_freeze_checks() into the critical section. I think that is not
>> okay? I could move it earlier and not do call it when the hint bit FPI
>> leads us to freeze tuples. But, I think that would lead to us doing a
>> lot less validation of tuples being frozen when checksums are enabled.
>> Or, I could make two critical sections?
> 
> I found another approach and just do the pre-freeze checks if we are
> considering freezing except for the FPI criteria.

Hmm, I think you can make this simpler if you use 
XLogCheckBufferNeedsBackup() to check if the hint-bit update will 
generate a full-page-image before entering the critical section. Like 
you did to check if pruning will generate a full-page-image. I included 
that change in the attached patches.

I don't fully understand this:

>     /*
>      * If we will freeze tuples on the page or they were all already frozen
>      * on the page, if we will set the page all-frozen in the visibility map,
>      * we can advance relfrozenxid and relminmxid to the values in
>      * pagefrz->FreezePageRelfrozenXid and pagefrz->FreezePageRelminMxid.
>      */
>     if (presult->all_frozen || presult->nfrozen > 0)
>     {
>         presult->new_relfrozenxid = pagefrz->FreezePageRelfrozenXid;
>         presult->new_relminmxid = pagefrz->FreezePageRelminMxid;
>     }
>     else
>     {
>         presult->new_relfrozenxid = pagefrz->NoFreezePageRelfrozenXid;
>         presult->new_relminmxid = pagefrz->NoFreezePageRelminMxid;
>     }

Firstly, the comment is outdated, because we have already done the 
freezing at this point. But more importantly, I don't understand the 
logic here. Could we check just presult->nfrozen > 0 here and get the 
same result?

>> I think it is probably worse to add both of them as additional optional
>> arguments, so I've just left lazy_scan_prune() with the job of
>> initializing them.

Ok.


Here are some patches on top of your patches for some further 
refactorings. Some notable changes in heap_page_prune_and_freeze():

- I moved the heap_prepare_freeze_tuple() work from the 2nd loop to the 
1st one. It seems more clear and efficient that way.

- I extracted the code to generate the WAL record to a separate function.

- Refactored the "will setting hint bit generate FPI" check as discussed 
above

These patches are in a very rough WIP state, but I wanted to share 
early. I haven't done much testing, and I'm not wedded to these changes, 
but I think they make it more readable. Please include / squash in the 
patch set if you agree with them.

Please also take a look at the comments I marked with HEIKKI or FIXME, 
in the patches and commit messages.

I'll wait for a new version from you before reviewing more.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Mon, Mar 18, 2024 at 01:15:21AM +0200, Heikki Linnakangas wrote:
> On 15/03/2024 02:56, Melanie Plageman wrote:
> > Okay, so I was going to start using xl_heap_prune for vacuum here too,
> > but I realized it would be bigger because of the
> > snapshotConflictHorizon. Do you think there is a non-terrible way to
> > make the snapshotConflictHorizon optional? Like with a flag?
> 
> Yeah, another flag would do the trick.

Okay, I've done this in attached v4 (including removing
XLOG_HEAP2_VACUUM). I had to put the snapshot conflict horizon in the
"main chunk" of data available at replay regardless of whether or not
the record ended up including an FPI.

I made it its own sub-record (xlhp_conflict_horizon) less to help with
alignment (though we can use all the help we can get there) and more to
keep it from getting lost. When you look at heapam_xlog.h, you can see
what a XLOG_HEAP2_PRUNE record will contain starting with the
xl_heap_prune struct and then all the sub-record types.

> > > I introduced a few sub-record types similar to what you suggested --
> > > they help a bit with alignment, so I think they are worth keeping. There
> > > are comments around them, but perhaps a larger diagram of the layout of
> > > a the new XLOG_HEAP2_PRUNE record would be helpful.
> > 
> > I started doing this, but I can't find a way of laying out the diagram
> > that pgindent doesn't destroy. I thought I remember other diagrams in
> > the source code showing the layout of something (something with pages
> > somewhere?) that don't get messed up by pgindent, but I couldn't find
> > them.
> 
> See src/tools/pgindent/README, section "Cleaning up in case of failure or
> ugly output":
> 
>     /*----------
>      * Text here will not be touched by pgindent.
>      */
> 

Cool. This version doesn't include the spiffy drawing I promised yet.

> Note that we don't provide WAL compatibility across major versions. You can
> fully remove the old xl_heap_freeze_page format. (We should bump
> XLOG_PAGE_MAGIC when this is committed though)

I've removed the xl_heap_freeze (and xl_heap_prune). I didn't bump
XLOG_PAGE_MAGIC.

> > > On the point of removing the freeze-only code path from
> > > heap_page_prune() (now heap_page_prune_and_freeze()): while doing this,
> > > I realized that heap_pre_freeze_checks() was not being called in the
> > > case that we decide to freeze because we emitted an FPI while setting
> > > the hint bit. I've fixed that, however, I've done so by moving
> > > heap_pre_freeze_checks() into the critical section. I think that is not
> > > okay? I could move it earlier and not do call it when the hint bit FPI
> > > leads us to freeze tuples. But, I think that would lead to us doing a
> > > lot less validation of tuples being frozen when checksums are enabled.
> > > Or, I could make two critical sections?
> > 
> > I found another approach and just do the pre-freeze checks if we are
> > considering freezing except for the FPI criteria.
> 
> Hmm, I think you can make this simpler if you use
> XLogCheckBufferNeedsBackup() to check if the hint-bit update will generate a
> full-page-image before entering the critical section. Like you did to check
> if pruning will generate a full-page-image. I included that change in the
> attached patches.

I used your proposed structure. You had XLogCheckBufferNeedsBackup()
twice in your proposed version a few lines apart. I don't think there is
any point in checking it twice. If we are going to rely on
XLogCheckBufferNeedsBackup() to tell us whether or not setting the hint
bit is *likely* to emit an FPI, then we might as well just call
XLogCheckBufferNeedsBackup() once.

> I don't fully understand this:
> 
> >     /*
> >      * If we will freeze tuples on the page or they were all already frozen
> >      * on the page, if we will set the page all-frozen in the visibility map,
> >      * we can advance relfrozenxid and relminmxid to the values in
> >      * pagefrz->FreezePageRelfrozenXid and pagefrz->FreezePageRelminMxid.
> >      */
> >     if (presult->all_frozen || presult->nfrozen > 0)
> >     {
> >         presult->new_relfrozenxid = pagefrz->FreezePageRelfrozenXid;
> >         presult->new_relminmxid = pagefrz->FreezePageRelminMxid;
> >     }
> >     else
> >     {
> >         presult->new_relfrozenxid = pagefrz->NoFreezePageRelfrozenXid;
> >         presult->new_relminmxid = pagefrz->NoFreezePageRelminMxid;
> >     }
> 
> Firstly, the comment is outdated, because we have already done the freezing
> at this point. But more importantly, I don't understand the logic here.
> Could we check just presult->nfrozen > 0 here and get the same result?

I've updated the comment. The reason I had
if (presult->all_frozen ||  presult->nfrozen > 0) was because of this
comment in heapam.h in the HeapPageFreeze struct

     * "Freeze" NewRelfrozenXid/NewRelminMxid trackers.
     *
     * Trackers used when heap_freeze_execute_prepared freezes, or when there
     * are zero freeze plans for a page.  It is always valid for vacuumlazy.c
     * to freeze any page, by definition.  This even includes pages that have
     * no tuples with storage to consider in the first place.  That way the
     * 'totally_frozen' results from heap_prepare_freeze_tuple can always be
     * used in the same way, even when no freeze plans need to be executed to
     * "freeze the page".  Only the "freeze" path needs to consider the need
     * to set pages all-frozen in the visibility map under this scheme.
     *
     * When we freeze a page, we generally freeze all XIDs < OldestXmin, only
     * leaving behind XIDs that are ineligible for freezing, if any.  And so
     * you might wonder why these trackers are necessary at all; why should
     * _any_ page that VACUUM freezes _ever_ be left with XIDs/MXIDs that
     * ratchet back the top-level NewRelfrozenXid/NewRelminMxid trackers?
     *
     * It is useful to use a definition of "freeze the page" that does not
     * overspecify how MultiXacts are affected.  heap_prepare_freeze_tuple
     * generally prefers to remove Multis eagerly, but lazy processing is used
     * in cases where laziness allows VACUUM to avoid allocating a new Multi.
     * The "freeze the page" trackers enable this flexibility.
     */

So, I don't really know if it is right to just check presult->nfrozen >
0 when updating relminmxid. I have changed it to the way you suggested.
But we can change it back.

> Here are some patches on top of your patches for some further refactorings.
> Some notable changes in heap_page_prune_and_freeze():
> 
> - I moved the heap_prepare_freeze_tuple() work from the 2nd loop to the 1st
> one. It seems more clear and efficient that way.

cool. I kept this.

> - I extracted the code to generate the WAL record to a separate function.

cool. kept this.

> These patches are in a very rough WIP state, but I wanted to share early. I
> haven't done much testing, and I'm not wedded to these changes, but I think
> they make it more readable. Please include / squash in the patch set if you
> agree with them.

I've squashed the changes into and across my nineteen patches :)

I cleaned up your sugestions a bit and made a few stylistic choices.

In this version, I also broke up the last couple commits which
streamlined the WAL record and eliminated XLOG_HEAP2_FREEZE/VACUUM and
redistributed those changes in a way that I thought made sense.

Now, the progression is that in one commit we merge the prune and freeze
record, eliminating the XLOG_HEAP2_FREEZE record. Then, in another
commit, we eliminate the XLOG_HEAP2_VACUUM record. Then a later commit
streamlines the new mega xl_heap_prune struct into the variable size
structure based on which modifications it includes.

> From 622620a7875ae8c1626e9cd118156e0c734d44ed Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Sun, 17 Mar 2024 22:52:28 +0200
> Subject: [PATCH v3heikki 1/4] Inline heap_frz_conflict_horizon() to the
>  caller.
> 
> FIXME: This frz_conflict_horizon business looks fishy to me. We have:
> - local frz_conflict_horizon variable,
> - presult->frz_conflict_horizon, and
> - prstate.snapshotConflictHorizon

Yea, this is a mistake I made when I was rebasing some changes in. The
local variable is gone now.


> should we really have all three, and what are the differences?

We do need both the prstate.snapshotConflictHorizon and the
presult->frz_conflict_horizon because the youngest freezable xmin will
often be different than the oldest removable xmax, so we have to track
both and take the most conservative one if we are both freezing and
pruning.

The third (local variable) one was an oops.

> From 0219842487931f899abcf183c863c43270c098f0 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Sun, 17 Mar 2024 23:05:31 +0200
> Subject: [PATCH v3heikki 2/4] Misc cleanup
> ---
>  src/backend/access/heap/pruneheap.c  | 16 +++++++---------
>  src/backend/access/heap/vacuumlazy.c |  7 ++++++-
>  2 files changed, 13 insertions(+), 10 deletions(-)
> diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
> index 8d3723faf3a..d3df7029667 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -433,7 +433,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
>       * heap_page_prune_and_freeze(). We expect vistest will always make
>       * heap_page_prune_and_freeze() remove any deleted tuple whose xmax is <
>       * OldestXmin.  lazy_scan_prune must never become confused about whether a
> -     * tuple should be frozen or removed.  (In the future we might want to
> +     * tuple should be frozen or removed.
> +     * HEIKKI: Is such confusion possible anymore?
> +     * (In the future we might want to
>       * teach lazy_scan_prune to recompute vistest from time to time, to
>       * increase the number of dead tuples it can prune away.)

TBH, I don't really know what this comment is even saying. But
lazy_scan_prune() doesn't do any freezing anymore, so I removed this
sentence.

>       */
> @@ -1387,6 +1389,9 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
>   * so we could deal with tuples that were DEAD to VACUUM, but nevertheless were
>   * left with storage after pruning.
>   *
> + * HEIKKI: does the above paragraph still make sense? We don't call
> + * HeapTupleSatisfiesVacuum() here anymore
> + *

Yea this whole comment definitely doesn't belong here anymore. Even
though we are calling HeapTupleSatisfiesVacuum() (from inside
heap_prune_satisifes_vacuum()) inside heap_page_prune_and_freeze(), the
comment really doesn't fit anywhere in there either. The comment is
describing a situation that is no longer possible. So describing a
situation that is no longer possible in a part of the code that it never
could have been possible does not make sense to me. I've removed the
comment.

>   * As of Postgres 17, we circumvent this problem altogether by reusing the
>   * result of heap_page_prune_and_freeze()'s visibility check. Without the
>   * second call to HeapTupleSatisfiesVacuum(), there is no new HTSV_Result and
> -- 
> 2.39.2
> 

> From d72cebf13f9866112309883f72a382fc2cb57e17 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Sun, 17 Mar 2024 23:08:42 +0200
> Subject: [PATCH v3heikki 3/4] Move work to the first loop
>  src/backend/access/heap/pruneheap.c | 141 ++++++++++------------------
>  1 file changed, 52 insertions(+), 89 deletions(-)
> 
> diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
> index b3573bb628b..3541628799a 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -78,9 +78,6 @@ static int    heap_prune_chain(Buffer buffer,
>                               PruneState *prstate, PruneFreezeResult *presult);
>  
>  static inline HTSV_Result htsv_get_valid_status(int status);
> -static void prune_prepare_freeze_tuple(Page page, OffsetNumber offnum, PruneState *prstate,
> -                                       HeapPageFreeze *pagefrz, HeapTupleFreeze *frozen,
> -                                       PruneFreezeResult *presult);
>  static void heap_prune_record_prunable(PruneState *prstate, TransactionId xid);
>  static void heap_prune_record_redirect(PruneState *prstate,
>                                         OffsetNumber offnum, OffsetNumber rdoffnum,
> @@ -322,6 +319,16 @@ heap_page_prune_and_freeze(Relation relation, Buffer buffer,
>      /* for recovery conflicts */
>      presult->frz_conflict_horizon = InvalidTransactionId;
>  
> +    /*
> +     * We will update the VM after pruning, collecting LP_DEAD items, and
> +     * freezing tuples. Keep track of whether or not the page is all_visible
> +     * and all_frozen and use this information to update the VM. all_visible
> +     * implies lpdead_items == 0, but don't trust all_frozen result unless
> +     * all_visible is also set to true.
> +     */
> +    /* HEIKKI: the caller updates the VM? not us */

I've updated this comment.

Other questions and notes on v4:

xl_heap_prune->flags is a uint8, but we are already using 7 of the bits.
Should we make it a unit16?

Eventually, I would like to avoid emitting a separate XLOG_HEAP2_VISIBLE
record for vacuum's first and second passes and just include the VM
update flags in the xl_heap_prune record. xl_heap_visible->flags is a
uint8. If we made xl_heap_prune->flags uint16, we could probably combine
them (though maybe we want other bits available). Also vacuum's second
pass doesn't set a snapshotConflictHorizon, so if we combined
xl_heap_visible and xl_heap_prune for vacuum we would end up saving even
more space (since vacuum sets xl_heap_visible->snapshotConflictHorizon
to InvalidXLogRecPtr).

A note on sub-record naming: I kept xl_heap_freeze_plan's name but
prefixed the other sub-records with xlhp. Do you think it is worth
renaming it (to xlhp_freeze_plan)? Also, should I change xlhp_freeze to
xlhp_freeze_page?

- Melanie

Attachment

Re: Combine Prune and Freeze records emitted by vacuum

From
Heikki Linnakangas
Date:
On 20/03/2024 03:36, Melanie Plageman wrote:
> On Mon, Mar 18, 2024 at 01:15:21AM +0200, Heikki Linnakangas wrote:
>> On 15/03/2024 02:56, Melanie Plageman wrote:
>>> Okay, so I was going to start using xl_heap_prune for vacuum here too,
>>> but I realized it would be bigger because of the
>>> snapshotConflictHorizon. Do you think there is a non-terrible way to
>>> make the snapshotConflictHorizon optional? Like with a flag?
>>
>> Yeah, another flag would do the trick.
> 
> Okay, I've done this in attached v4 (including removing
> XLOG_HEAP2_VACUUM). I had to put the snapshot conflict horizon in the
> "main chunk" of data available at replay regardless of whether or not
> the record ended up including an FPI.
> 
> I made it its own sub-record (xlhp_conflict_horizon) less to help with
> alignment (though we can use all the help we can get there) and more to
> keep it from getting lost. When you look at heapam_xlog.h, you can see
> what a XLOG_HEAP2_PRUNE record will contain starting with the
> xl_heap_prune struct and then all the sub-record types.

Ok, now that I look at this, I wonder if we're being overly cautious 
about the WAL size. We probably could just always include the snapshot 
field, and set it to InvalidTransactionId and waste 4 bytes when it's 
not needed. For the sake of simplicity. I don't feel strongly either way 
though, the flag is pretty simple too.

> xl_heap_prune->flags is a uint8, but we are already using 7 of the bits.
> Should we make it a unit16?

It doesn't matter much either way. We can also make it larger when we 
need more bits, there's no need make room for them them beforehand.

> Eventually, I would like to avoid emitting a separate XLOG_HEAP2_VISIBLE
> record for vacuum's first and second passes and just include the VM
> update flags in the xl_heap_prune record. xl_heap_visible->flags is a
> uint8. If we made xl_heap_prune->flags uint16, we could probably combine
> them (though maybe we want other bits available). Also vacuum's second
> pass doesn't set a snapshotConflictHorizon, so if we combined
> xl_heap_visible and xl_heap_prune for vacuum we would end up saving even
> more space (since vacuum sets xl_heap_visible->snapshotConflictHorizon
> to InvalidXLogRecPtr).

Makes sense.

> A note on sub-record naming: I kept xl_heap_freeze_plan's name but
> prefixed the other sub-records with xlhp. Do you think it is worth
> renaming it (to xlhp_freeze_plan)?

Yeah, perhaps.

> Also, should I change xlhp_freeze to xlhp_freeze_page?
I renamed it to xlhp_freeze_plans, for some consistency with 
xlhp_prune_items.


I realized that the WAL record format changes are pretty independent 
from the rest of the patches. They could be applied before the rest. 
Without the rest of the changes, we'll still write two WAL records per 
page in vacuum, one to prune and another one to freeze, but it's another 
meaningful incremental step. So I reshuffled the patches, so that the 
WAL format is changed first, before the rest of the changes.

0001-0008: These are the WAL format changes. There's some comment 
cleanup needed, but as far as the code goes, I think these are pretty 
much ready to be squashed & committed.

0009-: The rest of the v4 patches, rebased over the WAL format changes. 
I also added a few small commits for little cleanups that caught my eye, 
let me know if you disagree with those.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Wed, Mar 20, 2024 at 03:15:49PM +0200, Heikki Linnakangas wrote:
> On 20/03/2024 03:36, Melanie Plageman wrote:
> > On Mon, Mar 18, 2024 at 01:15:21AM +0200, Heikki Linnakangas wrote:
> > > On 15/03/2024 02:56, Melanie Plageman wrote:
> > > > Okay, so I was going to start using xl_heap_prune for vacuum here too,
> > > > but I realized it would be bigger because of the
> > > > snapshotConflictHorizon. Do you think there is a non-terrible way to
> > > > make the snapshotConflictHorizon optional? Like with a flag?
> > > 
> > > Yeah, another flag would do the trick.
> > 
> > Okay, I've done this in attached v4 (including removing
> > XLOG_HEAP2_VACUUM). I had to put the snapshot conflict horizon in the
> > "main chunk" of data available at replay regardless of whether or not
> > the record ended up including an FPI.
> > 
> > I made it its own sub-record (xlhp_conflict_horizon) less to help with
> > alignment (though we can use all the help we can get there) and more to
> > keep it from getting lost. When you look at heapam_xlog.h, you can see
> > what a XLOG_HEAP2_PRUNE record will contain starting with the
> > xl_heap_prune struct and then all the sub-record types.
> 
> Ok, now that I look at this, I wonder if we're being overly cautious about
> the WAL size. We probably could just always include the snapshot field, and
> set it to InvalidTransactionId and waste 4 bytes when it's not needed. For
> the sake of simplicity. I don't feel strongly either way though, the flag is
> pretty simple too.

That will mean that all vacuum records are at least 3 bytes bigger than
before -- which makes it somewhat less defensible to get rid of
xl_heap_vacuum.

That being said, I ended up doing an unaligned access when I
packed it and made it optional, so maybe it is less user-friendly.

But I also think that making it optional is more clear for vacuum which
will never use it.

> I realized that the WAL record format changes are pretty independent from
> the rest of the patches. They could be applied before the rest. Without the
> rest of the changes, we'll still write two WAL records per page in vacuum,
> one to prune and another one to freeze, but it's another meaningful
> incremental step. So I reshuffled the patches, so that the WAL format is
> changed first, before the rest of the changes.

Ah, great idea! That eliminates the issue of preliminary commits having
larger WAL records that then get streamlined.

> 0001-0008: These are the WAL format changes. There's some comment cleanup
> needed, but as far as the code goes, I think these are pretty much ready to
> be squashed & committed.

My review in this email is *only* for 0001-0008. I have not looked at
the rest yet.

> From 06d5ff5349a8aa95cbfd06a8043fe503b7b1bf7b Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Wed, 20 Mar 2024 14:50:14 +0200
> Subject: [PATCH v5 01/26] Merge prune, freeze and vacuum WAL record formats
> 
> The new combined WAL record is now used for pruning, freezing and 2nd
> pass of vacuum. This is in preparation of changing vacuuming to write
> a combined prune+freeze record per page, instead of separate two
> records. The new WAL record format now supports that, but the code
> still always writes separate records for pruning and freezing.

Attached patch changes-for-0001.patch has a bunch of updated comments --
especially for heapam_xlog.h (including my promised diagram). And a few
suggestions (mostly changes that I should have made before).

> 
> XXX I tried to lift-and-shift the code from v4 patch set as unchanged
> as possible, for easier review, but some noteworthy changes:

In the final commit message, I think it is worth calling out that all of
those freeze logging functions heap_log_freeze_eq/cmp/etc are lifted and
shifted from one file to another. When I am reviewing a big diff, it is
always helpful to know where I need to review line-by-line.

> 
> - Instead of passing PruneState and PageFreezeResult to
>   log_heap_prune_and_freeze(), pass the arrays of frozen, redirected
>   et al offsets directly. That way, it can be called from other places.

good idea.

> - moved heap_xlog_deserialize_prune_and_freeze() from xactdesc.c to
>   heapdesc.c. (Because that's clearly where it belongs)

:)

> From cd6cdaebb362b014733e99ecd868896caf0fb3aa Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Wed, 20 Mar 2024 13:45:01 +0200
> Subject: [PATCH v5 02/26] Keep the original numbers for existing WAL records
> 
> Doesn't matter much because the WAL format is not compatible across
> major versions anyway. But still seems nice to keep the identifiers
> unchanged when we can. (There's some precedence for this if you search
> the git history for "is free, was").

sounds good.

> From d3207bb557aa1d2868a50d357a06318a6c0cb5cd Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Wed, 20 Mar 2024 13:48:29 +0200
> Subject: [PATCH v5 03/26] Rename record to XLOG_HEAP2_PRUNE_FREEZE
> 
> To clarify that it also freezes now, and to make it clear that it's
> significantly different from the old XLOG_HEAP2_PRUNE format.

+1 

> From 5d6fc2ffbdd839e0b69242af16446a46cf6a2dc7 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Wed, 20 Mar 2024 13:49:59 +0200
> Subject: [PATCH v5 04/26] 'nplans' is a pointer
> 
> I'm surprised the compiler didn't warn about this

oops. while looking at this, I noticed that the asserts I added that
nredirected > 0, ndead > 0, and nunused > 0 have the same problem.

> ---
>  src/backend/access/rmgrdesc/heapdesc.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
> index 8b94c869faf..9ef8a745982 100644
> --- a/src/backend/access/rmgrdesc/heapdesc.c
> +++ b/src/backend/access/rmgrdesc/heapdesc.c
> @@ -155,8 +155,7 @@ heap_xlog_deserialize_prune_and_freeze(char *cursor, uint8 flags,
>          cursor += sizeof(OffsetNumber) * *nunused;
>      }
>  
> -    if (nplans > 0)

> From 59f3f80f82ed7a63d86c991d0cb025e4cde2caec Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Wed, 20 Mar 2024 13:36:41 +0200
> Subject: [PATCH v5 06/26] Fix logging snapshot conflict horizon.
> 
> - it was accessed without proper alignment, which won't work on
>   architectures that are strict about alignment. Use memcpy.

wow, oops. thanks for fixing this!

> - in heap_xlog_prune_freeze, the code tried to access the xid with
>   "(xlhp_conflict_horizon *) (xlrec + SizeOfHeapPrune);" But 'xlrec'
>   was "xl_heap_prune *" rather than "char *". That happened to work,
>   because sizeof(xl_heap_prune) == 1, but make it more robust by
>   adding a cast to char *.

good catch.

> - remove xlhp_conflict_horizon and store a TransactionId directly. A
>   separate struct would make sense if we needed to store anything else
>   there, but for now it just seems like more code.

makes sense. I just want to make sure heapam_xlog.h makes it extra clear
that this is happening. I see your comment addition. If you combine it
with my comment additions in the attached patch for 0001, hopefully that
makes it clear enough.

> From 8af186ee9dd8c7dc20f37a69b34cab7b95faa43b Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Wed, 20 Mar 2024 14:03:06 +0200
> Subject: [PATCH v5 07/26] Add comment to log_heap_prune_and_freeze().
> 
> XXX: This should be rewritten, but I tried to at least list some
> important points.

Are you thinking that it needs to mention more things or that the things
it mentions need more detail?

> From b26e36ba8614d907a6e15810ed4f684f8f628dd2 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Wed, 20 Mar 2024 14:53:31 +0200
> Subject: [PATCH v5 08/26] minor refactoring in log_heap_prune_and_freeze()
> 
> Mostly to make local variables more tightly-scoped.

So, I don't think you can move those sub-records into the tighter scope.
If you run tests with this applied, you'll see it crashes and a lot of
the data in the record is wrong. If you move the sub-record declarations
out to a wider scope, the tests pass.

The WAL record data isn't actually copied into the buffer until

    recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_FREEZE);

after registering everything.
So all of those sub-records you made are out of scope the time it tries
to copy from them.

I spent the better part of a day last week trying to figure out what was
happening after I did the exact same thing. I must say that I found the
xloginsert API incredibly unhelpful on this point.

I would like to review the rest of the suggested changes in this patch
after we fix the issue I just mentioned.

- Melanie

Attachment

Re: Combine Prune and Freeze records emitted by vacuum

From
Peter Geoghegan
Date:
On Wed, Mar 20, 2024 at 9:15 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > I made it its own sub-record (xlhp_conflict_horizon) less to help with
> > alignment (though we can use all the help we can get there) and more to
> > keep it from getting lost. When you look at heapam_xlog.h, you can see
> > what a XLOG_HEAP2_PRUNE record will contain starting with the
> > xl_heap_prune struct and then all the sub-record types.
>
> Ok, now that I look at this, I wonder if we're being overly cautious
> about the WAL size. We probably could just always include the snapshot
> field, and set it to InvalidTransactionId and waste 4 bytes when it's
> not needed. For the sake of simplicity. I don't feel strongly either way
> though, the flag is pretty simple too.

What about the issue of cleanup locks, which aren't needed and aren't
taken with the current heapam VACUUM record type? Will you preserve
that aspect of the existing design?

--
Peter Geoghegan



Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Wed, Mar 20, 2024 at 4:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Mar 20, 2024 at 9:15 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > > I made it its own sub-record (xlhp_conflict_horizon) less to help with
> > > alignment (though we can use all the help we can get there) and more to
> > > keep it from getting lost. When you look at heapam_xlog.h, you can see
> > > what a XLOG_HEAP2_PRUNE record will contain starting with the
> > > xl_heap_prune struct and then all the sub-record types.
> >
> > Ok, now that I look at this, I wonder if we're being overly cautious
> > about the WAL size. We probably could just always include the snapshot
> > field, and set it to InvalidTransactionId and waste 4 bytes when it's
> > not needed. For the sake of simplicity. I don't feel strongly either way
> > though, the flag is pretty simple too.
>
> What about the issue of cleanup locks, which aren't needed and aren't
> taken with the current heapam VACUUM record type? Will you preserve
> that aspect of the existing design?

Yep, we have a flag to indicate whether or not a cleanup lock is needed.

- Melanie



Re: Combine Prune and Freeze records emitted by vacuum

From
Peter Geoghegan
Date:
On Wed, Mar 20, 2024 at 4:06 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> > What about the issue of cleanup locks, which aren't needed and aren't
> > taken with the current heapam VACUUM record type? Will you preserve
> > that aspect of the existing design?
>
> Yep, we have a flag to indicate whether or not a cleanup lock is needed.

Thanks for confirming.

I realize that this is fairly obvious, but thought it better to ask.

--
Peter Geoghegan



Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Wed, Mar 20, 2024 at 03:15:49PM +0200, Heikki Linnakangas wrote:
>
> 0009-: The rest of the v4 patches, rebased over the WAL format changes. I
> also added a few small commits for little cleanups that caught my eye, let
> me know if you disagree with those.

This review is just of the patches containing changes you made in
0009-0026.

> From d36138b5bf0a93557273b5e47f8cd5ea089057c7 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Wed, 20 Mar 2024 11:47:42 +0200
> Subject: [PATCH v5 13/26] still use a local 'cutoffs' variable
> 
> Given how often 'cutoffs' is used in the function, I think it still
> makes sense to have a local variable for it, just to keep the source
> lines shorter.

Works for me.

> From 913617ed98cfddd678a6f620db7dee68d1d61c89 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Wed, 20 Mar 2024 10:51:13 +0200
> Subject: [PATCH v5 21/26] move whole_page_freezable to tighter scope

Works for me.

> From e2b50f9b64f7e4255f4f764e2a348e1b446573dc Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Wed, 20 Mar 2024 11:43:31 +0200
> Subject: [PATCH v5 22/26] make 'all_visible_except_removable' local
> 
> The caller doesn't need it, so it doesn't belong in PruneFreezeResult

Makes sense to me.

> From e993e0d98cd0ef1ecbd229f6ddbe23d59a427e3a Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Wed, 20 Mar 2024 11:40:34 +0200
> Subject: [PATCH v5 26/26] reorder PruneFreezeResult fields
> 
> ---
>  src/include/access/heapam.h | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> index ee0eca8ae02..b2015f5a1ac 100644
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
> @@ -202,14 +202,17 @@ typedef struct PruneFreezeResult
>      int            recently_dead_tuples;
>      int            ndeleted;        /* Number of tuples deleted from the page */
>      int            nnewlpdead;        /* Number of newly LP_DEAD items */
> +    int            nfrozen;

Let's add a comment after int nfrozen like
/* Number of newly frozen tuples */

> +
>      bool        all_visible;    /* Whether or not the page is all visible */
>      bool        hastup;            /* Does page make rel truncation unsafe */
>  
> +    /* The following fields are only set if freezing */

So, all_frozen will be set correctly if we are even considering freezing
(if pagefrz is passed). all_frozen will be true even if we didn't freeze
anything if the page is all-frozen and can be set as such in the VM. And
it will be false if the page is not all-frozen. So, maybe we say
"considering freezing".

But, I'm glad you thought to call out which of these fields will make
sense to the caller.

Also, maybe we should just name the members to which this applies. It is
a bit confusing that I can't tell if the comment also refers to the
other members following it (lpdead_items and deadoffsets) -- which it
doesn't.

> +
>      /* Whether or not the page can be set all frozen in the VM */
>      bool        all_frozen;
>  
>      /* Number of newly frozen tuples */
> -    int            nfrozen;
>      TransactionId frz_conflict_horizon; /* Newest xmin on the page */
>  
>      /* New value of relfrozenxid found by heap_page_prune_and_freeze() */
> -- 
> 2.39.2
> 

- Melanie



Re: Combine Prune and Freeze records emitted by vacuum

From
Heikki Linnakangas
Date:
On 20/03/2024 23:03, Melanie Plageman wrote:
> On Wed, Mar 20, 2024 at 03:15:49PM +0200, Heikki Linnakangas wrote:
>> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
>> index ee0eca8ae02..b2015f5a1ac 100644
>> --- a/src/include/access/heapam.h
>> +++ b/src/include/access/heapam.h
>> @@ -202,14 +202,17 @@ typedef struct PruneFreezeResult
>>       int            recently_dead_tuples;
>>       int            ndeleted;        /* Number of tuples deleted from the page */
>>       int            nnewlpdead;        /* Number of newly LP_DEAD items */
>> +    int            nfrozen;
> 
> Let's add a comment after int nfrozen like
> /* Number of newly frozen tuples */
> 
>> +
>>       bool        all_visible;    /* Whether or not the page is all visible */
>>       bool        hastup;            /* Does page make rel truncation unsafe */
>>   
>> +    /* The following fields are only set if freezing */
> 
> So, all_frozen will be set correctly if we are even considering freezing
> (if pagefrz is passed). all_frozen will be true even if we didn't freeze
> anything if the page is all-frozen and can be set as such in the VM. And
> it will be false if the page is not all-frozen. So, maybe we say
> "considering freezing".
> 
> But, I'm glad you thought to call out which of these fields will make
> sense to the caller.
> 
> Also, maybe we should just name the members to which this applies. It is
> a bit confusing that I can't tell if the comment also refers to the
> other members following it (lpdead_items and deadoffsets) -- which it
> doesn't.

Right, sorry, I spotted the general issue that it's not clear which 
fields are valid when. I added that comment to remind about that, but I 
then forgot about it.

In heap_page_prune_and_freeze(), we now do some extra work on each live 
tuple, to set the all_visible_except_removable correctly. And also to 
update live_tuples, recently_dead_tuples and hastup. When we're not 
freezing, that's a waste of cycles, the caller doesn't care. I hope it's 
enough that it doesn't matter, but is it?

The first commit (after the WAL format changes) changes the all-visible 
check to use GlobalVisTestIsRemovableXid. The commit message says that 
it's because we don't have 'cutoffs' available, but we only care about 
that when we're freezing, and when we're freezing, we actually do have 
'cutoffs' in HeapPageFreeze. Using GlobalVisTestIsRemovableXid seems 
sensible anyway, because that's what we use in 
heap_prune_satisfies_vacuum() too, but just wanted to point that out.


The 'frz_conflict_horizon' stuff is still fuzzy to me. (Not necessarily 
these patches's fault). This at least is wrong, because Max(a, b) 
doesn't handle XID wraparound correctly:

>             if (do_freeze)
>                 conflict_xid = Max(prstate.snapshotConflictHorizon,
>                                    presult->frz_conflict_horizon);
>             else
>                 conflict_xid = prstate.snapshotConflictHorizon;

Then there's this in lazy_scan_prune():

>         /* Using same cutoff when setting VM is now unnecessary */
>         if (presult.all_frozen)
>             presult.frz_conflict_horizon = InvalidTransactionId;
This does the right thing in the end, but if all the tuples are frozen 
shouldn't frz_conflict_horizon already be InvalidTransactionId? The 
comment says it's "newest xmin on the page", and if everything was 
frozen, all xmins are FrozenTransactionId. In other words, that should 
be moved to heap_page_prune_and_freeze() so that it doesn't lie to its 
caller. Also, frz_conflict_horizon is only set correctly if 
'all_frozen==true', would be good to mention that in the comments too.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Combine Prune and Freeze records emitted by vacuum

From
Heikki Linnakangas
Date:
On 20/03/2024 21:17, Melanie Plageman wrote:
> Attached patch changes-for-0001.patch has a bunch of updated comments --
> especially for heapam_xlog.h (including my promised diagram). And a few
> suggestions (mostly changes that I should have made before).

New version of these WAL format changes attached. Squashed to one patch now.

> +    // TODO: should we avoid this if we only froze? heap_xlog_freeze() doesn't
> +    // do it

Ah yes, that makes sense, did that.

> In the final commit message, I think it is worth calling out that all of
> those freeze logging functions heap_log_freeze_eq/cmp/etc are lifted and
> shifted from one file to another. When I am reviewing a big diff, it is
> always helpful to know where I need to review line-by-line.

Done.

>>  From 5d6fc2ffbdd839e0b69242af16446a46cf6a2dc7 Mon Sep 17 00:00:00 2001
>> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
>> Date: Wed, 20 Mar 2024 13:49:59 +0200
>> Subject: [PATCH v5 04/26] 'nplans' is a pointer
>>
>> I'm surprised the compiler didn't warn about this
> 
> oops. while looking at this, I noticed that the asserts I added that
> nredirected > 0, ndead > 0, and nunused > 0 have the same problem.

Good catch, fixed.

>> - remove xlhp_conflict_horizon and store a TransactionId directly. A
>>    separate struct would make sense if we needed to store anything else
>>    there, but for now it just seems like more code.
> 
> makes sense. I just want to make sure heapam_xlog.h makes it extra clear
> that this is happening. I see your comment addition. If you combine it
> with my comment additions in the attached patch for 0001, hopefully that
> makes it clear enough.

Thanks. I spent more time on the comments throughout the patch. And one 
notable code change: I replaced the XLHP_LP_TRUNCATE_ONLY flag with 
XLHP_CLEANUP_LOCK. XLHP_CLEANUP_LOCK directly indicates if you need a 
cleanup lock to replay the record. It must always be set when 
XLHP_HAS_REDIRECTIONS or XLHP_HAS_DEAD_ITEMS is set, because replaying 
those always needs a cleanup lock. That felt easier to document and 
understand than XLHP_LP_TRUNCATE_ONLY.

>>  From 8af186ee9dd8c7dc20f37a69b34cab7b95faa43b Mon Sep 17 00:00:00 2001
>> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
>> Date: Wed, 20 Mar 2024 14:03:06 +0200
>> Subject: [PATCH v5 07/26] Add comment to log_heap_prune_and_freeze().
>>
>> XXX: This should be rewritten, but I tried to at least list some
>> important points.
> 
> Are you thinking that it needs to mention more things or that the things
> it mentions need more detail?

I previously just quickly jotted down things that seemed worth 
mentioning in the comment. It was not so bad actually, but I reworded it 
a little.

>>  From b26e36ba8614d907a6e15810ed4f684f8f628dd2 Mon Sep 17 00:00:00 2001
>> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
>> Date: Wed, 20 Mar 2024 14:53:31 +0200
>> Subject: [PATCH v5 08/26] minor refactoring in log_heap_prune_and_freeze()
>>
>> Mostly to make local variables more tightly-scoped.
> 
> So, I don't think you can move those sub-records into the tighter scope.
> If you run tests with this applied, you'll see it crashes and a lot of
> the data in the record is wrong. If you move the sub-record declarations
> out to a wider scope, the tests pass.
> 
> The WAL record data isn't actually copied into the buffer until
> 
>     recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_FREEZE);
> 
> after registering everything.
> So all of those sub-records you made are out of scope the time it tries
> to copy from them.
> 
> I spent the better part of a day last week trying to figure out what was
> happening after I did the exact same thing. I must say that I found the
> xloginsert API incredibly unhelpful on this point.

Oops. I had that in mind and that was actually why I moved the 
XLogRegisterData() call to the end of the function, because I found it 
confusing to register the struct before filling it in completely, even 
though it works perfectly fine. But then I missed it anyway when I moved 
the local variables. I added a brief comment on that.

> I would like to review the rest of the suggested changes in this patch
> after we fix the issue I just mentioned.

Thanks, review is appreciated. I feel this is ready now, so barring any 
big new issues, I plan to commit this early next week.

There is another patch in the commitfest that touches this area: 
"Recording whether Heap2/PRUNE records are from VACUUM or from 
opportunistic pruning" [1]. That actually goes in the opposite direction 
than this patch. That patch wants to add more information, to show 
whether a record was emitted by VACUUM or on-access pruning, while this 
patch makes the freezing and VACUUM's 2nd phase records also look the 
same. We could easily add more flags to xl_heap_prune to distinguish 
them. Or assign different xl_info code for them, like that other patch 
proposed. But I don't think that needs to block this patch, that can be 
added as a separate patch.

[1] 
https://www.postgresql.org/message-id/CAH2-Wzmsevhox+HJpFmQgCxWWDgNzP0R9F+VBnpOK6TgxMPrRw@mail.gmail.com

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Thu, Mar 21, 2024 at 9:28 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> In heap_page_prune_and_freeze(), we now do some extra work on each live
> tuple, to set the all_visible_except_removable correctly. And also to
> update live_tuples, recently_dead_tuples and hastup. When we're not
> freezing, that's a waste of cycles, the caller doesn't care. I hope it's
> enough that it doesn't matter, but is it?

Last year on an early version of the patch set I did some pgbench
tpcb-like benchmarks -- since there is a lot of on-access pruning in
that workload -- and I don't remember it being a showstopper. The code
has changed a fair bit since then. However, I think it might be safer
to pass a flag "by_vacuum" to heap_page_prune_and_freeze() and skip
the rest of the loop after heap_prune_satisifies_vacuum() when
on-access pruning invokes it. I had avoided that because it felt ugly
and error-prone, however it addresses a few other of your points as
well.

For example, I think we should set a bit in the prune/freeze WAL
record's flags to indicate if pruning was done by vacuum or on access
(mentioned in another of your recent emails).

> The first commit (after the WAL format changes) changes the all-visible
> check to use GlobalVisTestIsRemovableXid. The commit message says that
> it's because we don't have 'cutoffs' available, but we only care about
> that when we're freezing, and when we're freezing, we actually do have
> 'cutoffs' in HeapPageFreeze. Using GlobalVisTestIsRemovableXid seems
> sensible anyway, because that's what we use in
> heap_prune_satisfies_vacuum() too, but just wanted to point that out.

Yes, this is true. If we skip this part of the loop when on-access
pruning invokes it, we can go back to using the OldestXmin. I have
done that as well as some other changes in the attached patch,
conflict_horizon_updates.diff. Note that this patch may not apply on
your latest patch as I wrote it on top of an older version. Switching
back to using OldestXmin for page visibility determination makes this
patch set more similar to master as well. We could keep the
alternative check (with GlobalVisState) to maintain the illusion that
callers passing by_vacuum as True can pass NULL for pagefrz, but I was
worried about the extra overhead.

It would be nice to pick a single reasonable visibility horizon (the
oldest running xid we compare things to) at the beginning of
heap_page_prune_and_freeze() and use it for determining if we can
remove tuples, if we can freeze tuples, and if the page is all
visible. It makes it confusing that we use OldestXmin for freezing and
setting the page visibility in the VM and GlobalVisState for
determining prunability. I think using GlobalVisState for freezing has
been discussed before and ruled out for a variety of reasons, and I
definitely don't suggest making that change in this patch set.

> The 'frz_conflict_horizon' stuff is still fuzzy to me. (Not necessarily
> these patches's fault). This at least is wrong, because Max(a, b)
> doesn't handle XID wraparound correctly:
>
> >                       if (do_freeze)
> >                               conflict_xid = Max(prstate.snapshotConflictHorizon,
> >                                                                  presult->frz_conflict_horizon);
> >                       else
> >                               conflict_xid = prstate.snapshotConflictHorizon;
>
> Then there's this in lazy_scan_prune():
>
> >               /* Using same cutoff when setting VM is now unnecessary */
> >               if (presult.all_frozen)
> >                       presult.frz_conflict_horizon = InvalidTransactionId;
> This does the right thing in the end, but if all the tuples are frozen
> shouldn't frz_conflict_horizon already be InvalidTransactionId? The
> comment says it's "newest xmin on the page", and if everything was
> frozen, all xmins are FrozenTransactionId. In other words, that should
> be moved to heap_page_prune_and_freeze() so that it doesn't lie to its
> caller. Also, frz_conflict_horizon is only set correctly if
> 'all_frozen==true', would be good to mention that in the comments too.

Yes, this is a good point. I've spent some time swapping all of this
back into my head. I think we should change the names of all these
conflict horizon variables and introduce some local variables again.
In the attached patch, I've updated the name of the variable in
PruneFreezeResult to vm_conflict_horizon, as it is only used for
emitting a VM update record. Now, I don't set it until the end of
heap_page_prune_and_freeze(). It is only updated from
InvalidTransactionId if the page is not all frozen. As you say, if the
page is all frozen, there can be no conflict.

I've also changed PruneState->snapshotConflictHorizon to
PruneState->latest_xid_removed.

And I introduced the local variables visibility_cutoff_xid and
frz_conflict_horizon. I think it is important we distinguish between
the latest xid pruned, the latest xmin of tuples frozen, and the
latest xid of all live tuples on the page.

Though we end up using visibility_cutoff_xid as the freeze conflict
horizon if the page is all frozen, I think it is more clear to have
the three variables and name them what they are. Then, we calculate
the correct one for the combined WAL record before emitting it. I've
done that in the attached. I've tried to reduce the scope of the
variables as much as possible to keep it as clear as I could.

I think I've also fixed the issue with using Max() to compare
TransactionIds by using TransactionIdFollows() instead.

Note that I still don't think we have a resolution on what to
correctly update new_relfrozenxid and new_relminmxid to at the end
when presult->nfrozen == 0 and presult->all_frozen is true.

        if (presult->nfrozen > 0)
        {
            presult->new_relfrozenxid = pagefrz->FreezePageRelfrozenXid;
            presult->new_relminmxid = pagefrz->FreezePageRelminMxid;
        }
        else
        {
            presult->new_relfrozenxid = pagefrz->NoFreezePageRelfrozenXid;
            presult->new_relminmxid = pagefrz->NoFreezePageRelminMxid;
        }

- Melanie

Attachment

Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Sat, Mar 23, 2024 at 01:09:30AM +0200, Heikki Linnakangas wrote:
> On 20/03/2024 21:17, Melanie Plageman wrote:
> > Attached patch changes-for-0001.patch has a bunch of updated comments --
> > especially for heapam_xlog.h (including my promised diagram). And a few
> > suggestions (mostly changes that I should have made before).
> 
> New version of these WAL format changes attached. Squashed to one patch now.
> I spent more time on the comments throughout the patch. And one
> notable code change: I replaced the XLHP_LP_TRUNCATE_ONLY flag with
> XLHP_CLEANUP_LOCK. XLHP_CLEANUP_LOCK directly indicates if you need a
> cleanup lock to replay the record. It must always be set when
> XLHP_HAS_REDIRECTIONS or XLHP_HAS_DEAD_ITEMS is set, because replaying those
> always needs a cleanup lock. That felt easier to document and understand
> than XLHP_LP_TRUNCATE_ONLY.

Makes sense to me.

> > >  From b26e36ba8614d907a6e15810ed4f684f8f628dd2 Mon Sep 17 00:00:00 2001
> > > From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> > > Date: Wed, 20 Mar 2024 14:53:31 +0200
> > > Subject: [PATCH v5 08/26] minor refactoring in log_heap_prune_and_freeze()
> > > 
> > > Mostly to make local variables more tightly-scoped.
> > 
> > So, I don't think you can move those sub-records into the tighter scope.
> > If you run tests with this applied, you'll see it crashes and a lot of
> > the data in the record is wrong. If you move the sub-record declarations
> > out to a wider scope, the tests pass.
> > 
> > The WAL record data isn't actually copied into the buffer until
> > 
> >     recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_FREEZE);
> > 
> > after registering everything.
> > So all of those sub-records you made are out of scope the time it tries
> > to copy from them.
> > 
> > I spent the better part of a day last week trying to figure out what was
> > happening after I did the exact same thing. I must say that I found the
> > xloginsert API incredibly unhelpful on this point.
> 
> Oops. I had that in mind and that was actually why I moved the
> XLogRegisterData() call to the end of the function, because I found it
> confusing to register the struct before filling it in completely, even
> though it works perfectly fine. But then I missed it anyway when I moved the
> local variables. I added a brief comment on that.

Comment was a good idea.

> There is another patch in the commitfest that touches this area: "Recording
> whether Heap2/PRUNE records are from VACUUM or from opportunistic pruning"
> [1]. That actually goes in the opposite direction than this patch. That
> patch wants to add more information, to show whether a record was emitted by
> VACUUM or on-access pruning, while this patch makes the freezing and
> VACUUM's 2nd phase records also look the same. We could easily add more
> flags to xl_heap_prune to distinguish them. Or assign different xl_info code
> for them, like that other patch proposed. But I don't think that needs to
> block this patch, that can be added as a separate patch.
> 
> [1] https://www.postgresql.org/message-id/CAH2-Wzmsevhox+HJpFmQgCxWWDgNzP0R9F+VBnpOK6TgxMPrRw@mail.gmail.com

I think it would be very helpful to distinguish amongst vacuum pass 1,
2, and on-access pruning. I often want to know what did most of the
pruning -- and I could see also wanting to know if the first or second
vacuum pass was responsible for removing the tuples. I agree it could be
done separately, but it would be very helpful to have as soon as
possible now that the record type will be the same for all three.

> From 042185d3de14dcb7088bbe50e9c64e365ac42c2a Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Fri, 22 Mar 2024 23:10:22 +0200
> Subject: [PATCH v6] Merge prune, freeze and vacuum WAL record formats
>  
>  /*
> - * Handles XLOG_HEAP2_PRUNE record type.
> - *
> - * Acquires a full cleanup lock.
> + * Replay XLOG_HEAP2_PRUNE_FREEZE record.
>   */
>  static void
> -heap_xlog_prune(XLogReaderState *record)
> +heap_xlog_prune_freeze(XLogReaderState *record)
>  {
>      XLogRecPtr    lsn = record->EndRecPtr;
> -    xl_heap_prune *xlrec = (xl_heap_prune *) XLogRecGetData(record);
> +    char       *ptr;
> +    xl_heap_prune *xlrec;
>      Buffer        buffer;
>      RelFileLocator rlocator;
>      BlockNumber blkno;
>      XLogRedoAction action;
>  
>      XLogRecGetBlockTag(record, 0, &rlocator, NULL, &blkno);
> +    ptr = XLogRecGetData(record);

I don't love having two different pointers that alias each other and we
don't know which one is for what. Perhaps we could memcpy xlrec like in
my attached diff (log_updates.diff). It also might perform slightly
better than accessing flags through a xl_heap_prune
* -- since it wouldn't be doing pointer dereferencing.

> +    xlrec = (xl_heap_prune *) ptr;
> +    ptr += SizeOfHeapPrune;
>  
>      /*
> -     * We're about to remove tuples. In Hot Standby mode, ensure that there's
> -     * no queries running for which the removed tuples are still visible.
> +     * We will take an ordinary exclusive lock or a cleanup lock depending on
> +     * whether the XLHP_CLEANUP_LOCK flag is set.  With an ordinary exclusive
> +     * lock, we better not be doing anything that requires moving existing
> +     * tuple data.
>       */
> -    if (InHotStandby)
> -        ResolveRecoveryConflictWithSnapshot(xlrec->snapshotConflictHorizon,
> -                                            xlrec->isCatalogRel,
> +    Assert((xlrec->flags & XLHP_CLEANUP_LOCK) != 0 ||
> +           (xlrec->flags & (XLHP_HAS_REDIRECTIONS | XLHP_HAS_DEAD_ITEMS)) == 0);
> +
> +    /*
> +     * We are about to remove and/or freeze tuples.  In Hot Standby mode,
> +     * ensure that there are no queries running for which the removed tuples
> +     * are still visible or which still consider the frozen xids as running.
> +     * The conflict horizon XID comes after xl_heap_prune.
> +     */
> +    if (InHotStandby && (xlrec->flags & XLHP_HAS_CONFLICT_HORIZON) != 0)
> +    {

My attached patch has a TODO here for the comment. It sticks out that
the serialization and deserialization conditions are different for the
snapshot conflict horizon. We don't deserialize if InHotStandby is
false. That's still correct now because we don't write anything else to
the main data chunk afterward. But, if someone were to add another
member after snapshot_conflict_horizon, they would want to know to
deserialize snapshot_conflict_horizon first even if InHotStandby is
false.

> +        TransactionId snapshot_conflict_horizon;
> +

I would throw a comment in about the memcpy being required because the
snapshot_conflict_horizon is in the buffer unaligned.

> +        memcpy(&snapshot_conflict_horizon, ptr, sizeof(TransactionId));
> +        ResolveRecoveryConflictWithSnapshot(snapshot_conflict_horizon,
> +                                            (xlrec->flags & XLHP_IS_CATALOG_REL) != 0,
>                                              rlocator);
> +    }
>  
>      /*

> +
> +/*
> + * Given a MAXALIGNed buffer returned by XLogRecGetBlockData() and pointed to
> + * by cursor and any xl_heap_prune flags, deserialize the arrays of
> + * OffsetNumbers contained in an XLOG_HEAP2_PRUNE_FREEZE record.
> + *
> + * This is in heapdesc.c so it can be shared between heap2_redo and heap2_desc
> + * code, the latter of which is used in frontend (pg_waldump) code.
> + */
> +void
> +heap_xlog_deserialize_prune_and_freeze(char *cursor, uint8 flags,
> +                                       int *nredirected, OffsetNumber **redirected,
> +                                       int *ndead, OffsetNumber **nowdead,
> +                                       int *nunused, OffsetNumber **nowunused,
> +                                       int *nplans, xlhp_freeze_plan **plans,
> +                                       OffsetNumber **frz_offsets)
> +{
> +    if (flags & XLHP_HAS_FREEZE_PLANS)
> +    {
> +        xlhp_freeze_plans *freeze_plans = (xlhp_freeze_plans *) cursor;
> +
> +        *nplans = freeze_plans->nplans;
> +        Assert(*nplans > 0);
> +        *plans = freeze_plans->plans;
> +
> +        cursor += offsetof(xlhp_freeze_plans, plans);
> +        cursor += sizeof(xlhp_freeze_plan) * *nplans;
> +    }

I noticed you decided to set these in the else statements. Is that to
emphasize that it is important to correctness that they be properly
initialized?

> +    else
> +    {
> +        *nplans = 0;
> +        *plans = NULL;
> +    }
> +

Thanks again for all your work on this set!

- Melanie

Attachment

Re: Combine Prune and Freeze records emitted by vacuum

From
Heikki Linnakangas
Date:
On 24/03/2024 21:55, Melanie Plageman wrote:
> On Sat, Mar 23, 2024 at 01:09:30AM +0200, Heikki Linnakangas wrote:
>> On 20/03/2024 21:17, Melanie Plageman wrote:
>> There is another patch in the commitfest that touches this area: "Recording
>> whether Heap2/PRUNE records are from VACUUM or from opportunistic pruning"
>> [1]. That actually goes in the opposite direction than this patch. That
>> patch wants to add more information, to show whether a record was emitted by
>> VACUUM or on-access pruning, while this patch makes the freezing and
>> VACUUM's 2nd phase records also look the same. We could easily add more
>> flags to xl_heap_prune to distinguish them. Or assign different xl_info code
>> for them, like that other patch proposed. But I don't think that needs to
>> block this patch, that can be added as a separate patch.
>>
>> [1] https://www.postgresql.org/message-id/CAH2-Wzmsevhox+HJpFmQgCxWWDgNzP0R9F+VBnpOK6TgxMPrRw@mail.gmail.com
> 
> I think it would be very helpful to distinguish amongst vacuum pass 1,
> 2, and on-access pruning. I often want to know what did most of the
> pruning -- and I could see also wanting to know if the first or second
> vacuum pass was responsible for removing the tuples. I agree it could be
> done separately, but it would be very helpful to have as soon as
> possible now that the record type will be the same for all three.

Ok, I used separate 'info' codes for records generated on on-access 
pruning and vacuum's 1st and 2nd pass. Similar to Peter's patch on that 
other thread, except that I didn't reserve the whole high bit for this, 
but used three different 'info' codes. Freezing uses the same 
XLOG_HEAP2_PRUNE_VACUUM_SCAN as the pruning in vacuum's 1st pass. You 
can distinguish them based on whether the record has nfrozen > 0, and 
with the rest of the patches, they will be merged anyway.

>>  From 042185d3de14dcb7088bbe50e9c64e365ac42c2a Mon Sep 17 00:00:00 2001
>> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
>> Date: Fri, 22 Mar 2024 23:10:22 +0200
>> Subject: [PATCH v6] Merge prune, freeze and vacuum WAL record formats
>>   
>>   /*
>> - * Handles XLOG_HEAP2_PRUNE record type.
>> - *
>> - * Acquires a full cleanup lock.
>> + * Replay XLOG_HEAP2_PRUNE_FREEZE record.
>>    */
>>   static void
>> -heap_xlog_prune(XLogReaderState *record)
>> +heap_xlog_prune_freeze(XLogReaderState *record)
>>   {
>>       XLogRecPtr    lsn = record->EndRecPtr;
>> -    xl_heap_prune *xlrec = (xl_heap_prune *) XLogRecGetData(record);
>> +    char       *ptr;
>> +    xl_heap_prune *xlrec;
>>       Buffer        buffer;
>>       RelFileLocator rlocator;
>>       BlockNumber blkno;
>>       XLogRedoAction action;
>>   
>>       XLogRecGetBlockTag(record, 0, &rlocator, NULL, &blkno);
>> +    ptr = XLogRecGetData(record);
> 
> I don't love having two different pointers that alias each other and we
> don't know which one is for what. Perhaps we could memcpy xlrec like in
> my attached diff (log_updates.diff). It also might perform slightly
> better than accessing flags through a xl_heap_prune
> * -- since it wouldn't be doing pointer dereferencing.

Ok.

>>       /*
>> -     * We're about to remove tuples. In Hot Standby mode, ensure that there's
>> -     * no queries running for which the removed tuples are still visible.
>> +     * We will take an ordinary exclusive lock or a cleanup lock depending on
>> +     * whether the XLHP_CLEANUP_LOCK flag is set.  With an ordinary exclusive
>> +     * lock, we better not be doing anything that requires moving existing
>> +     * tuple data.
>>        */
>> -    if (InHotStandby)
>> -        ResolveRecoveryConflictWithSnapshot(xlrec->snapshotConflictHorizon,
>> -                                            xlrec->isCatalogRel,
>> +    Assert((xlrec->flags & XLHP_CLEANUP_LOCK) != 0 ||
>> +           (xlrec->flags & (XLHP_HAS_REDIRECTIONS | XLHP_HAS_DEAD_ITEMS)) == 0);
>> +
>> +    /*
>> +     * We are about to remove and/or freeze tuples.  In Hot Standby mode,
>> +     * ensure that there are no queries running for which the removed tuples
>> +     * are still visible or which still consider the frozen xids as running.
>> +     * The conflict horizon XID comes after xl_heap_prune.
>> +     */
>> +    if (InHotStandby && (xlrec->flags & XLHP_HAS_CONFLICT_HORIZON) != 0)
>> +    {
> 
> My attached patch has a TODO here for the comment. It sticks out that
> the serialization and deserialization conditions are different for the
> snapshot conflict horizon. We don't deserialize if InHotStandby is
> false. That's still correct now because we don't write anything else to
> the main data chunk afterward. But, if someone were to add another
> member after snapshot_conflict_horizon, they would want to know to
> deserialize snapshot_conflict_horizon first even if InHotStandby is
> false.

Good point. Fixed so that 'snapshot_conflict_horizon' is deserialized 
even if !InHotStandby. A memcpy is cheap, and is probably optimized away 
by the compiler anyway.

>> +        TransactionId snapshot_conflict_horizon;
>> +
> 
> I would throw a comment in about the memcpy being required because the
> snapshot_conflict_horizon is in the buffer unaligned.

Added.

>> +/*
>> + * Given a MAXALIGNed buffer returned by XLogRecGetBlockData() and pointed to
>> + * by cursor and any xl_heap_prune flags, deserialize the arrays of
>> + * OffsetNumbers contained in an XLOG_HEAP2_PRUNE_FREEZE record.
>> + *
>> + * This is in heapdesc.c so it can be shared between heap2_redo and heap2_desc
>> + * code, the latter of which is used in frontend (pg_waldump) code.
>> + */
>> +void
>> +heap_xlog_deserialize_prune_and_freeze(char *cursor, uint8 flags,
>> +                                       int *nredirected, OffsetNumber **redirected,
>> +                                       int *ndead, OffsetNumber **nowdead,
>> +                                       int *nunused, OffsetNumber **nowunused,
>> +                                       int *nplans, xlhp_freeze_plan **plans,
>> +                                       OffsetNumber **frz_offsets)
>> +{
>> +    if (flags & XLHP_HAS_FREEZE_PLANS)
>> +    {
>> +        xlhp_freeze_plans *freeze_plans = (xlhp_freeze_plans *) cursor;
>> +
>> +        *nplans = freeze_plans->nplans;
>> +        Assert(*nplans > 0);
>> +        *plans = freeze_plans->plans;
>> +
>> +        cursor += offsetof(xlhp_freeze_plans, plans);
>> +        cursor += sizeof(xlhp_freeze_plan) * *nplans;
>> +    }
> 
> I noticed you decided to set these in the else statements. Is that to
> emphasize that it is important to correctness that they be properly
> initialized?

The point was to always initialize *nplans et al in the function. You 
required the caller to initialize them to zero, but that seems error-prone.

I made one more last minute change: I changed the order of the array 
arguments in heap_xlog_deserialize_prune_and_freeze() to match the order 
in log_heap_prune_and_freeze().

Committed with the above little changes. Thank you! Now, back to the 
rest of the patches :-).

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Combine Prune and Freeze records emitted by vacuum

From
Heikki Linnakangas
Date:
On 24/03/2024 18:32, Melanie Plageman wrote:
> On Thu, Mar 21, 2024 at 9:28 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> In heap_page_prune_and_freeze(), we now do some extra work on each live
>> tuple, to set the all_visible_except_removable correctly. And also to
>> update live_tuples, recently_dead_tuples and hastup. When we're not
>> freezing, that's a waste of cycles, the caller doesn't care. I hope it's
>> enough that it doesn't matter, but is it?
> 
> Last year on an early version of the patch set I did some pgbench
> tpcb-like benchmarks -- since there is a lot of on-access pruning in
> that workload -- and I don't remember it being a showstopper. The code
> has changed a fair bit since then. However, I think it might be safer
> to pass a flag "by_vacuum" to heap_page_prune_and_freeze() and skip
> the rest of the loop after heap_prune_satisifies_vacuum() when
> on-access pruning invokes it. I had avoided that because it felt ugly
> and error-prone, however it addresses a few other of your points as
> well.

Ok. I'm not a fan of the name 'by_vacuum' though. It'd be nice if the 
argument described what it does, rather than who it's for. For example, 
'need_all_visible'. If set to true, the function determines 
'all_visible', otherwise it does not.

I started to look closer at the loops in heap_prune_chain() and how they 
update all the various flags and counters. There's a lot going on there. 
We have:

- live_tuples counter
- recently_dead_tuples counter
- all_visible[_except_removable]
- all_frozen
- visibility_cutoff_xid
- hastup
- prstate.frozen array
- nnewlpdead
- deadoffsets array

And that doesn't even include all the local variables and the final 
dead/redirected arrays.

Some of those are set in the first loop that initializes 'htsv' for each 
tuple on the page. Others are updated in heap_prune_chain(). Some are 
updated in both. It's hard to follow which are set where.

I think recently_dead_tuples is updated incorrectly, for tuples that are 
part of a completely dead HOT chain. For example, imagine a hot chain 
with two tuples: RECENTLY_DEAD -> DEAD. heap_prune_chain() would follow 
the chain, see the DEAD tuple at the end of the chain, and mark both 
tuples for pruning. However, we already updated 'recently_dead_tuples' 
in the first loop, which is wrong if we remove the tuple.

Maybe that's the only bug like this, but I'm a little scared. Is there 
something we could do to make this simpler? Maybe move all the new work 
that we added to the first loop, into heap_prune_chain() ? Maybe 
introduce a few more helper heap_prune_record_*() functions, to update 
the flags and counters also for live and insert/delete-in-progress 
tuples and for dead line pointers? Something like 
heap_prune_record_live() and heap_prune_record_lp_dead().

>> The 'frz_conflict_horizon' stuff is still fuzzy to me. (Not necessarily
>> these patches's fault). This at least is wrong, because Max(a, b)
>> doesn't handle XID wraparound correctly:
>>
>>>                        if (do_freeze)
>>>                                conflict_xid = Max(prstate.snapshotConflictHorizon,
>>>                                                                   presult->frz_conflict_horizon);
>>>                        else
>>>                                conflict_xid = prstate.snapshotConflictHorizon;
>>
>> Then there's this in lazy_scan_prune():
>>
>>>                /* Using same cutoff when setting VM is now unnecessary */
>>>                if (presult.all_frozen)
>>>                        presult.frz_conflict_horizon = InvalidTransactionId;
>> This does the right thing in the end, but if all the tuples are frozen
>> shouldn't frz_conflict_horizon already be InvalidTransactionId? The
>> comment says it's "newest xmin on the page", and if everything was
>> frozen, all xmins are FrozenTransactionId. In other words, that should
>> be moved to heap_page_prune_and_freeze() so that it doesn't lie to its
>> caller. Also, frz_conflict_horizon is only set correctly if
>> 'all_frozen==true', would be good to mention that in the comments too.
> 
> Yes, this is a good point. I've spent some time swapping all of this
> back into my head. I think we should change the names of all these
> conflict horizon variables and introduce some local variables again.
> In the attached patch, I've updated the name of the variable in
> PruneFreezeResult to vm_conflict_horizon, as it is only used for
> emitting a VM update record. Now, I don't set it until the end of
> heap_page_prune_and_freeze(). It is only updated from
> InvalidTransactionId if the page is not all frozen. As you say, if the
> page is all frozen, there can be no conflict.

Makes sense.

> I've also changed PruneState->snapshotConflictHorizon to
> PruneState->latest_xid_removed.
> 
> And I introduced the local variables visibility_cutoff_xid and
> frz_conflict_horizon. I think it is important we distinguish between
> the latest xid pruned, the latest xmin of tuples frozen, and the
> latest xid of all live tuples on the page.
>
> Though we end up using visibility_cutoff_xid as the freeze conflict
> horizon if the page is all frozen, I think it is more clear to have
> the three variables and name them what they are.

Agreed.

> Note that I still don't think we have a resolution on what to
> correctly update new_relfrozenxid and new_relminmxid to at the end
> when presult->nfrozen == 0 and presult->all_frozen is true.
> 
>          if (presult->nfrozen > 0)
>          {
>              presult->new_relfrozenxid = pagefrz->FreezePageRelfrozenXid;
>              presult->new_relminmxid = pagefrz->FreezePageRelminMxid;
>          }
>          else
>          {
>              presult->new_relfrozenxid = pagefrz->NoFreezePageRelfrozenXid;
>              presult->new_relminmxid = pagefrz->NoFreezePageRelminMxid;
>          }

One approach is to take them out of the PageFreezeResult struct again, 
and pass them as pointers:

void
heap_page_prune_and_freeze(Relation relation, Buffer buffer,
    ...
    TransactionId *new_relfrozenxid,
    MultiXactId *new_relminmxid,
    ...
)

That would be natural for the caller too, as it wouldn't need to set up 
the old values to HeapPageFreeze before each call, nor copy the new 
values to 'vacrel' after the call. I'm thinking that we'd move the 
responsibility of setting up HeapPageFreeze to 
heap_page_prune_and_freeze(), instead of having the caller set it up. So 
the caller would look something like this:

    heap_page_prune_and_freeze(rel, buf, vacrel->vistest,
       &vacrel->cutoffs, &vacrel->NewRelfrozenXid, &vacrel->NewRelminMxid,
       &presult,
       PRUNE_VACUUM_SCAN, flags,
       &vacrel->offnum);

In this setup, heap_page_prune_and_freeze() would update 
*new_relfrozenxid and *new_relminmxid when it has a new value for them, 
and leave them unchanged otherwise.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
Thanks for committing the new WAL format!

On Mon, Mar 25, 2024 at 3:33 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 24/03/2024 18:32, Melanie Plageman wrote:
> > On Thu, Mar 21, 2024 at 9:28 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >>
> >> In heap_page_prune_and_freeze(), we now do some extra work on each live
> >> tuple, to set the all_visible_except_removable correctly. And also to
> >> update live_tuples, recently_dead_tuples and hastup. When we're not
> >> freezing, that's a waste of cycles, the caller doesn't care. I hope it's
> >> enough that it doesn't matter, but is it?
> >
> > Last year on an early version of the patch set I did some pgbench
> > tpcb-like benchmarks -- since there is a lot of on-access pruning in
> > that workload -- and I don't remember it being a showstopper. The code
> > has changed a fair bit since then. However, I think it might be safer
> > to pass a flag "by_vacuum" to heap_page_prune_and_freeze() and skip
> > the rest of the loop after heap_prune_satisifies_vacuum() when
> > on-access pruning invokes it. I had avoided that because it felt ugly
> > and error-prone, however it addresses a few other of your points as
> > well.
>
> Ok. I'm not a fan of the name 'by_vacuum' though. It'd be nice if the
> argument described what it does, rather than who it's for. For example,
> 'need_all_visible'. If set to true, the function determines
> 'all_visible', otherwise it does not.

I like that way of putting it -- describing what it does instead of
who it is for. However, we now have PruneReason as an argument to
heap_page_prune(), which would be usable for this purpose (for
skipping the rest of the first loop). It is not descriptive of how we
would use it in this scenario, though.

> I started to look closer at the loops in heap_prune_chain() and how they
> update all the various flags and counters. There's a lot going on there.
> We have:
>
> - live_tuples counter
> - recently_dead_tuples counter
> - all_visible[_except_removable]
> - all_frozen
> - visibility_cutoff_xid
> - hastup
> - prstate.frozen array
> - nnewlpdead
> - deadoffsets array
>
> And that doesn't even include all the local variables and the final
> dead/redirected arrays.

Yes, there are a lot of things happening. In an early version, I had
hoped for the first loop to be just getting the visibility information
and then to do most of the other stuff as we went in
heap_prune_chain() as you mention below. I couldn't quite get a
version of that working that looked nice. I agree that the whole thing
feels a bit brittle and error-prone. It's hard to be objective after
fiddling with something over the course of a year. I'm trying to take
a step back now and rethink it.

> Some of those are set in the first loop that initializes 'htsv' for each
> tuple on the page. Others are updated in heap_prune_chain(). Some are
> updated in both. It's hard to follow which are set where.

Yep.

> I think recently_dead_tuples is updated incorrectly, for tuples that are
> part of a completely dead HOT chain. For example, imagine a hot chain
> with two tuples: RECENTLY_DEAD -> DEAD. heap_prune_chain() would follow
> the chain, see the DEAD tuple at the end of the chain, and mark both
> tuples for pruning. However, we already updated 'recently_dead_tuples'
> in the first loop, which is wrong if we remove the tuple.
>
> Maybe that's the only bug like this, but I'm a little scared. Is there
> something we could do to make this simpler? Maybe move all the new work
> that we added to the first loop, into heap_prune_chain() ? Maybe
> introduce a few more helper heap_prune_record_*() functions, to update
> the flags and counters also for live and insert/delete-in-progress
> tuples and for dead line pointers? Something like
> heap_prune_record_live() and heap_prune_record_lp_dead().

I had discarded previous attempts to get everything done in
heap_prune_chain() because it was hard to make sure I was doing the
right thing given that it visits the line pointers out of order so
making sure you've considered all of them once and only once was hard.
I hadn't thought of the approach you suggested with record_live() --
that might help. I will work on this tomorrow. I had hoped to get
something out today, but I am still in the middle of rebasing the back
20 patches from your v5 over current master and then adding in the
suggestions that I made in the various diffs on the thread.

> > Note that I still don't think we have a resolution on what to
> > correctly update new_relfrozenxid and new_relminmxid to at the end
> > when presult->nfrozen == 0 and presult->all_frozen is true.
> >
> >          if (presult->nfrozen > 0)
> >          {
> >              presult->new_relfrozenxid = pagefrz->FreezePageRelfrozenXid;
> >              presult->new_relminmxid = pagefrz->FreezePageRelminMxid;
> >          }
> >          else
> >          {
> >              presult->new_relfrozenxid = pagefrz->NoFreezePageRelfrozenXid;
> >              presult->new_relminmxid = pagefrz->NoFreezePageRelminMxid;
> >          }
>
> One approach is to take them out of the PageFreezeResult struct again,
> and pass them as pointers:
>
> void
> heap_page_prune_and_freeze(Relation relation, Buffer buffer,
>         ...
>         TransactionId *new_relfrozenxid,
>         MultiXactId *new_relminmxid,
>         ...
> )

What about the question about whether or not we should be using
FreezePageRelfrozenXid when all_frozen is true and nrfrozen == 0. I
was confused about whether or not we had to do this by the comment in
HeapPageFreeze.

> That would be natural for the caller too, as it wouldn't need to set up
> the old values to HeapPageFreeze before each call, nor copy the new
> values to 'vacrel' after the call. I'm thinking that we'd move the
> responsibility of setting up HeapPageFreeze to
> heap_page_prune_and_freeze(), instead of having the caller set it up. So
> the caller would look something like this:
>
>         heap_page_prune_and_freeze(rel, buf, vacrel->vistest,
>            &vacrel->cutoffs, &vacrel->NewRelfrozenXid, &vacrel->NewRelminMxid,
>            &presult,
>            PRUNE_VACUUM_SCAN, flags,
>            &vacrel->offnum);
>
> In this setup, heap_page_prune_and_freeze() would update
> *new_relfrozenxid and *new_relminmxid when it has a new value for them,
> and leave them unchanged otherwise.

I do prefer having heap_page_prune_and_freeze() own HeapPageFreeze.

- Melanie



Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Mon, Mar 25, 2024 at 09:33:38PM +0200, Heikki Linnakangas wrote:
> On 24/03/2024 18:32, Melanie Plageman wrote:
> > On Thu, Mar 21, 2024 at 9:28 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > > 
> > > In heap_page_prune_and_freeze(), we now do some extra work on each live
> > > tuple, to set the all_visible_except_removable correctly. And also to
> > > update live_tuples, recently_dead_tuples and hastup. When we're not
> > > freezing, that's a waste of cycles, the caller doesn't care. I hope it's
> > > enough that it doesn't matter, but is it?
> > 
> > Last year on an early version of the patch set I did some pgbench
> > tpcb-like benchmarks -- since there is a lot of on-access pruning in
> > that workload -- and I don't remember it being a showstopper. The code
> > has changed a fair bit since then. However, I think it might be safer
> > to pass a flag "by_vacuum" to heap_page_prune_and_freeze() and skip
> > the rest of the loop after heap_prune_satisifies_vacuum() when
> > on-access pruning invokes it. I had avoided that because it felt ugly
> > and error-prone, however it addresses a few other of your points as
> > well.
> 
> Ok. I'm not a fan of the name 'by_vacuum' though. It'd be nice if the
> argument described what it does, rather than who it's for. For example,
> 'need_all_visible'. If set to true, the function determines 'all_visible',
> otherwise it does not.

A very rough v7 is attached. The whole thing is rebased over master and
then 0016 contains an attempt at the refactor we discussed in this
email.

Instead of just using the PruneReason to avoid doing the extra steps
when on-access pruning calls heap_page_prune_and_freeze(), I've made an
"actions" variable and defined different flags for it. One of them is
a replacement for the existing mark_unused_now flag. I defined another
one, PRUNE_DO_TRY_FREEZE, which could be used in place of checking if
pagefrz is NULL.

There is a whole group of activities that only the vacuum caller does
outside of freezing -- setting hastup, counting live and recently dead
tuples, determining whole page visibility and a snapshot conflict
horizon for updating the VM. But I didn't want to introduce separate
flags for each of them, because then I would have to check each of them
before taking the action. That would be lots of extra branching and
on-access pruning does none of those actions while vacuum does all of
them.

> I started to look closer at the loops in heap_prune_chain() and how they
> update all the various flags and counters. There's a lot going on there. We
> have:
> 
> - live_tuples counter
> - recently_dead_tuples counter
> - all_visible[_except_removable]
> - all_frozen
> - visibility_cutoff_xid
> - hastup
> - prstate.frozen array
> - nnewlpdead
> - deadoffsets array
> 
> And that doesn't even include all the local variables and the final
> dead/redirected arrays.
> 
> Some of those are set in the first loop that initializes 'htsv' for each
> tuple on the page. Others are updated in heap_prune_chain(). Some are
> updated in both. It's hard to follow which are set where.
> 
> I think recently_dead_tuples is updated incorrectly, for tuples that are
> part of a completely dead HOT chain. For example, imagine a hot chain with
> two tuples: RECENTLY_DEAD -> DEAD. heap_prune_chain() would follow the
> chain, see the DEAD tuple at the end of the chain, and mark both tuples for
> pruning. However, we already updated 'recently_dead_tuples' in the first
> loop, which is wrong if we remove the tuple.

Ah, yes, you are so right about this bug.

> Maybe that's the only bug like this, but I'm a little scared. Is there
> something we could do to make this simpler? Maybe move all the new work that
> we added to the first loop, into heap_prune_chain() ? Maybe introduce a few
> more helper heap_prune_record_*() functions, to update the flags and
> counters also for live and insert/delete-in-progress tuples and for dead
> line pointers? Something like heap_prune_record_live() and
> heap_prune_record_lp_dead().

I like the idea of a heap_prune_record_live_or_recently_dead() function.
That's what I've attempted to implement in the attached 0016. I haven't
updated and cleaned up everything (especially comments) in the refactor,
but there are two major issues:

1) In heap_prune_chain(), a heap-only tuple which is not HOT updated may
end up being a live tuple not part of any chain or it may end up the
redirect target in a HOT chain. At the top of heap_prune_chain(), we
return if (HeapTupleHeaderIsHeapOnly(htup)). We may come back to this
tuple later if it is part of a chain. If we don't, we need to have
called heap_prune_record_live_or_recently_dead(). However, there are
other tuples that get redirected to which do not meet this criteria, so
we must call heap_prune_record_live_or_recently_dead() when setting an
item redirected to. If we call heap_prune_record_live_or_recently_dead()
in both places, we will double-count. To fix this, I introduced an
array, "counted". But that takes up extra space in the PruneState and
extra cycles to memset it.

I can't think of a way to make sure we count the right tuples without
another array. The tuples we need to count are those not pointed to by
prstate->marked + those tuples whose line pointers will be redirected to
(those are marked).

2) A large number of the members of PruneFreezeResult are only
initialized for the vacuum caller now. Even with a comment, this is a
bit confusing. And, it seems like there should be some symmetry between
the actions the caller tells heap_page_prune_and_freeze() to take and
the result parameters that are filled in.

I am concerned about adding all of the actions (setting hastup,
determining whole page visibility, etc as mentioned above) because then
I also have to check all the actions and that will add extra branching.
And out of the two callers of heap_page_prune_and_freeze(), one will do
all of the actions and one will do none of them except "main" pruning.

> > Note that I still don't think we have a resolution on what to
> > correctly update new_relfrozenxid and new_relminmxid to at the end
> > when presult->nfrozen == 0 and presult->all_frozen is true.
> > 
> >          if (presult->nfrozen > 0)
> >          {
> >              presult->new_relfrozenxid = pagefrz->FreezePageRelfrozenXid;
> >              presult->new_relminmxid = pagefrz->FreezePageRelminMxid;
> >          }
> >          else
> >          {
> >              presult->new_relfrozenxid = pagefrz->NoFreezePageRelfrozenXid;
> >              presult->new_relminmxid = pagefrz->NoFreezePageRelminMxid;
> >          }
> 
> One approach is to take them out of the PageFreezeResult struct again, and
> pass them as pointers:
> 
> void
> heap_page_prune_and_freeze(Relation relation, Buffer buffer,
>     ...
>     TransactionId *new_relfrozenxid,
>     MultiXactId *new_relminmxid,
>     ...
> )
> 
> That would be natural for the caller too, as it wouldn't need to set up the
> old values to HeapPageFreeze before each call, nor copy the new values to
> 'vacrel' after the call. I'm thinking that we'd move the responsibility of
> setting up HeapPageFreeze to heap_page_prune_and_freeze(), instead of having
> the caller set it up. So the caller would look something like this:
> 
>     heap_page_prune_and_freeze(rel, buf, vacrel->vistest,
>        &vacrel->cutoffs, &vacrel->NewRelfrozenXid, &vacrel->NewRelminMxid,
>        &presult,
>        PRUNE_VACUUM_SCAN, flags,
>        &vacrel->offnum);
> 
> In this setup, heap_page_prune_and_freeze() would update *new_relfrozenxid
> and *new_relminmxid when it has a new value for them, and leave them
> unchanged otherwise.

I've passed new_relfrozen_xid and new_relmin_mxid as arguments.

But as for only updating them when there is a new value, that doesn't
sound cheaper than just setting them when they are passed in with the
values from [No]FreezePageRelfrozenXid, [No]FreezePageRelminMxid. Unless
you are imagining a way to simplify the current
[No]FreezePageRelfrozenXid, [No]FreezePageRelminMxid.

- Melanie

Attachment

Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Tue, Mar 26, 2024 at 5:46 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Mon, Mar 25, 2024 at 09:33:38PM +0200, Heikki Linnakangas wrote:
> > On 24/03/2024 18:32, Melanie Plageman wrote:
> > > On Thu, Mar 21, 2024 at 9:28 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > > >
> > > > In heap_page_prune_and_freeze(), we now do some extra work on each live
> > > > tuple, to set the all_visible_except_removable correctly. And also to
> > > > update live_tuples, recently_dead_tuples and hastup. When we're not
> > > > freezing, that's a waste of cycles, the caller doesn't care. I hope it's
> > > > enough that it doesn't matter, but is it?
> > >
> > > Last year on an early version of the patch set I did some pgbench
> > > tpcb-like benchmarks -- since there is a lot of on-access pruning in
> > > that workload -- and I don't remember it being a showstopper. The code
> > > has changed a fair bit since then. However, I think it might be safer
> > > to pass a flag "by_vacuum" to heap_page_prune_and_freeze() and skip
> > > the rest of the loop after heap_prune_satisifies_vacuum() when
> > > on-access pruning invokes it. I had avoided that because it felt ugly
> > > and error-prone, however it addresses a few other of your points as
> > > well.
> >
> > Ok. I'm not a fan of the name 'by_vacuum' though. It'd be nice if the
> > argument described what it does, rather than who it's for. For example,
> > 'need_all_visible'. If set to true, the function determines 'all_visible',
> > otherwise it does not.
>
> A very rough v7 is attached. The whole thing is rebased over master and
> then 0016 contains an attempt at the refactor we discussed in this
> email.
>
> Instead of just using the PruneReason to avoid doing the extra steps
> when on-access pruning calls heap_page_prune_and_freeze(), I've made an
> "actions" variable and defined different flags for it. One of them is
> a replacement for the existing mark_unused_now flag. I defined another
> one, PRUNE_DO_TRY_FREEZE, which could be used in place of checking if
> pagefrz is NULL.
>
> There is a whole group of activities that only the vacuum caller does
> outside of freezing -- setting hastup, counting live and recently dead
> tuples, determining whole page visibility and a snapshot conflict
> horizon for updating the VM. But I didn't want to introduce separate
> flags for each of them, because then I would have to check each of them
> before taking the action. That would be lots of extra branching and
> on-access pruning does none of those actions while vacuum does all of
> them.
>
> > I started to look closer at the loops in heap_prune_chain() and how they
> > update all the various flags and counters. There's a lot going on there. We
> > have:
> >
> > - live_tuples counter
> > - recently_dead_tuples counter
> > - all_visible[_except_removable]
> > - all_frozen
> > - visibility_cutoff_xid
> > - hastup
> > - prstate.frozen array
> > - nnewlpdead
> > - deadoffsets array
> >
> > And that doesn't even include all the local variables and the final
> > dead/redirected arrays.
> >
> > Some of those are set in the first loop that initializes 'htsv' for each
> > tuple on the page. Others are updated in heap_prune_chain(). Some are
> > updated in both. It's hard to follow which are set where.
> >
> > I think recently_dead_tuples is updated incorrectly, for tuples that are
> > part of a completely dead HOT chain. For example, imagine a hot chain with
> > two tuples: RECENTLY_DEAD -> DEAD. heap_prune_chain() would follow the
> > chain, see the DEAD tuple at the end of the chain, and mark both tuples for
> > pruning. However, we already updated 'recently_dead_tuples' in the first
> > loop, which is wrong if we remove the tuple.
>
> Ah, yes, you are so right about this bug.
>
> > Maybe that's the only bug like this, but I'm a little scared. Is there
> > something we could do to make this simpler? Maybe move all the new work that
> > we added to the first loop, into heap_prune_chain() ? Maybe introduce a few
> > more helper heap_prune_record_*() functions, to update the flags and
> > counters also for live and insert/delete-in-progress tuples and for dead
> > line pointers? Something like heap_prune_record_live() and
> > heap_prune_record_lp_dead().
>
> I like the idea of a heap_prune_record_live_or_recently_dead() function.
> That's what I've attempted to implement in the attached 0016. I haven't
> updated and cleaned up everything (especially comments) in the refactor,
> but there are two major issues:
>
> 1) In heap_prune_chain(), a heap-only tuple which is not HOT updated may
> end up being a live tuple not part of any chain or it may end up the
> redirect target in a HOT chain. At the top of heap_prune_chain(), we
> return if (HeapTupleHeaderIsHeapOnly(htup)). We may come back to this
> tuple later if it is part of a chain. If we don't, we need to have
> called heap_prune_record_live_or_recently_dead(). However, there are
> other tuples that get redirected to which do not meet this criteria, so
> we must call heap_prune_record_live_or_recently_dead() when setting an
> item redirected to. If we call heap_prune_record_live_or_recently_dead()
> in both places, we will double-count. To fix this, I introduced an
> array, "counted". But that takes up extra space in the PruneState and
> extra cycles to memset it.
>
> I can't think of a way to make sure we count the right tuples without
> another array. The tuples we need to count are those not pointed to by
> prstate->marked + those tuples whose line pointers will be redirected to
> (those are marked).
>
> 2) A large number of the members of PruneFreezeResult are only
> initialized for the vacuum caller now. Even with a comment, this is a
> bit confusing. And, it seems like there should be some symmetry between
> the actions the caller tells heap_page_prune_and_freeze() to take and
> the result parameters that are filled in.
>
> I am concerned about adding all of the actions (setting hastup,
> determining whole page visibility, etc as mentioned above) because then
> I also have to check all the actions and that will add extra branching.
> And out of the two callers of heap_page_prune_and_freeze(), one will do
> all of the actions and one will do none of them except "main" pruning.

This morning I worked on a version of this patchset which moved the
counting of live and recently dead tuples and the calculation of the
vm conflict horizon back to lazy_scan_prune() but kept the freezing
and dead offset collection in heap_prune_chain(). I encountered the
same problem with ensuring each tuple was considered for freezing
exactly once. It also made me realize that my patch set (v7) still has
the same problem in which all_visible_except_removable will be
incorrectly set to false and recently dead tuples incorrectly
incremented when encountering HEAPTUPLE_RECENTLY_DEAD tuples whose
line pointers get set LP_DEAD during pruning. And I think I am
incorrectly calling heap_prepare_freeze_tuple() on them too.

I need some way to modify the control flow or accounting such that I
know which HEAPTUPLE_RECENTLY_DEAD tuples will not be marked LP_DEAD.
And a way to consider freezing and do live tuple accounting for these
and HEAPTUPLE_LIVE tuples exactly once.

- Melanie



Re: Combine Prune and Freeze records emitted by vacuum

From
Heikki Linnakangas
Date:
On 27/03/2024 17:18, Melanie Plageman wrote:
> I need some way to modify the control flow or accounting such that I
> know which HEAPTUPLE_RECENTLY_DEAD tuples will not be marked LP_DEAD.
> And a way to consider freezing and do live tuple accounting for these
> and HEAPTUPLE_LIVE tuples exactly once.

Just a quick update: I've been massaging this some more today, and I 
think I'm onto got something palatable. I'll send an updated patch later 
today, but the key is to note that for each item on the page, there is 
one point where we determine the fate of the item, whether it's pruned 
or not. That can happen in different points in in heap_page_prune(). 
That's also when we set marked[offnum] = true. Whenever that happens, we 
all call one of the a heap_page_prune_record_*() subroutines. We already 
have those subroutines for when a tuple is marked as dead or unused, but 
let's add similar subroutines for the case that we're leaving the tuple 
unchanged. If we move all the bookkeeping logic to those subroutines, we 
can ensure that it gets done exactly once for each tuple, and at that 
point we know what we are going to do to the tuple, so we can count it 
correctly. So heap_prune_chain() decides what to do with each tuple, and 
ensures that each tuple is marked only once, and the subroutines update 
all the variables, add the item to the correct arrays etc. depending on 
what we're doing with it.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Combine Prune and Freeze records emitted by vacuum

From
Peter Geoghegan
Date:
On Tue, Mar 19, 2024 at 9:36 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>          * "Freeze" NewRelfrozenXid/NewRelminMxid trackers.
>          *
>          * Trackers used when heap_freeze_execute_prepared freezes, or when there
>          * are zero freeze plans for a page.  It is always valid for vacuumlazy.c
>          * to freeze any page, by definition.  This even includes pages that have
>          * no tuples with storage to consider in the first place.  That way the
>          * 'totally_frozen' results from heap_prepare_freeze_tuple can always be
>          * used in the same way, even when no freeze plans need to be executed to
>          * "freeze the page".  Only the "freeze" path needs to consider the need
>          * to set pages all-frozen in the visibility map under this scheme.
>          *
>          * When we freeze a page, we generally freeze all XIDs < OldestXmin, only
>          * leaving behind XIDs that are ineligible for freezing, if any.  And so
>          * you might wonder why these trackers are necessary at all; why should
>          * _any_ page that VACUUM freezes _ever_ be left with XIDs/MXIDs that
>          * ratchet back the top-level NewRelfrozenXid/NewRelminMxid trackers?
>          *
>          * It is useful to use a definition of "freeze the page" that does not
>          * overspecify how MultiXacts are affected.  heap_prepare_freeze_tuple
>          * generally prefers to remove Multis eagerly, but lazy processing is used
>          * in cases where laziness allows VACUUM to avoid allocating a new Multi.
>          * The "freeze the page" trackers enable this flexibility.
>          */
>
> So, I don't really know if it is right to just check presult->nfrozen >
> 0 when updating relminmxid. I have changed it to the way you suggested.
> But we can change it back.

I think that this is just about safe. I had to check, though. I see
that the FRM_NOOP case (within
FreezeMultiXactId/heap_prepare_freeze_tuple) will ratchet back both
sets of trackers (both the freeze and no freeze variants). However,
it's rather hard to see that this is true.

The intent here was that cases where "presult->nfrozen == 0" would
always take the "freeze" path. That seems more natural to me, at
least, since I think of the freeze path as the default choice. By
definition, lazy_scan_prune() can always take the freeze path -- even
when the page has no tuples with storage. But it cannot always take
the no-freeze path -- "disobeying" pagefrz.freeze_required creates the
risk that relfrozenxid/relminmxid will be advanced to unsafe values at
the end of the VACUUM. IMV you should stick with that approach now,
even if it is currently safe to do it the other way around.

--
Peter Geoghegan



Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Wed, Mar 27, 2024 at 12:18 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 27/03/2024 17:18, Melanie Plageman wrote:
> > I need some way to modify the control flow or accounting such that I
> > know which HEAPTUPLE_RECENTLY_DEAD tuples will not be marked LP_DEAD.
> > And a way to consider freezing and do live tuple accounting for these
> > and HEAPTUPLE_LIVE tuples exactly once.
>
> Just a quick update: I've been massaging this some more today, and I
> think I'm onto got something palatable. I'll send an updated patch later
> today, but the key is to note that for each item on the page, there is
> one point where we determine the fate of the item, whether it's pruned
> or not. That can happen in different points in in heap_page_prune().
> That's also when we set marked[offnum] = true. Whenever that happens, we
> all call one of the a heap_page_prune_record_*() subroutines. We already
> have those subroutines for when a tuple is marked as dead or unused, but
> let's add similar subroutines for the case that we're leaving the tuple
> unchanged. If we move all the bookkeeping logic to those subroutines, we
> can ensure that it gets done exactly once for each tuple, and at that
> point we know what we are going to do to the tuple, so we can count it
> correctly. So heap_prune_chain() decides what to do with each tuple, and
> ensures that each tuple is marked only once, and the subroutines update
> all the variables, add the item to the correct arrays etc. depending on
> what we're doing with it.

Yes, this would be ideal.

I was doing some experimentation with pageinspect today (trying to
find that single place where live tuples fates are decided) and it
seems like a heap-only tuple that is not HOT-updated will usually be
the one at the end of the chain. Which seems like it would be covered
by adding a record_live() type function call  in the loop of
heap_prune_chain():

        /*
         * If the tuple is not HOT-updated, then we are at the end of this
         * HOT-update chain.
         */
        if (!HeapTupleHeaderIsHotUpdated(htup))
        {
            heap_prune_record_live_or_recently_dead(dp, prstate,
offnum, presult);
            break;
        }

but that doesn't end up producing the same results as

        if (HeapTupleHeaderIsHeapOnly(htup)
            && !HeapTupleHeaderIsHotUpdated(htup) &&
presult->htsv[rootoffnum] == HEAPTUPLE_DEAD)
            heap_prune_record_live_or_recently_dead(dp, prstate,
offnum, presult);

at the top of heap_prune_chain().

- Melanie



Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Wed, Mar 27, 2024 at 2:26 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Wed, Mar 27, 2024 at 12:18 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 27/03/2024 17:18, Melanie Plageman wrote:
> > > I need some way to modify the control flow or accounting such that I
> > > know which HEAPTUPLE_RECENTLY_DEAD tuples will not be marked LP_DEAD.
> > > And a way to consider freezing and do live tuple accounting for these
> > > and HEAPTUPLE_LIVE tuples exactly once.
> >
> > Just a quick update: I've been massaging this some more today, and I
> > think I'm onto got something palatable. I'll send an updated patch later
> > today, but the key is to note that for each item on the page, there is
> > one point where we determine the fate of the item, whether it's pruned
> > or not. That can happen in different points in in heap_page_prune().
> > That's also when we set marked[offnum] = true. Whenever that happens, we
> > all call one of the a heap_page_prune_record_*() subroutines. We already
> > have those subroutines for when a tuple is marked as dead or unused, but
> > let's add similar subroutines for the case that we're leaving the tuple
> > unchanged. If we move all the bookkeeping logic to those subroutines, we
> > can ensure that it gets done exactly once for each tuple, and at that
> > point we know what we are going to do to the tuple, so we can count it
> > correctly. So heap_prune_chain() decides what to do with each tuple, and
> > ensures that each tuple is marked only once, and the subroutines update
> > all the variables, add the item to the correct arrays etc. depending on
> > what we're doing with it.
>
> Yes, this would be ideal.
>
> I was doing some experimentation with pageinspect today (trying to
> find that single place where live tuples fates are decided) and it
> seems like a heap-only tuple that is not HOT-updated will usually be
> the one at the end of the chain. Which seems like it would be covered
> by adding a record_live() type function call  in the loop of
> heap_prune_chain():
>
>         /*
>          * If the tuple is not HOT-updated, then we are at the end of this
>          * HOT-update chain.
>          */
>         if (!HeapTupleHeaderIsHotUpdated(htup))
>         {
>             heap_prune_record_live_or_recently_dead(dp, prstate,
> offnum, presult);
>             break;
>         }
>
> but that doesn't end up producing the same results as
>
>         if (HeapTupleHeaderIsHeapOnly(htup)
>             && !HeapTupleHeaderIsHotUpdated(htup) &&
> presult->htsv[rootoffnum] == HEAPTUPLE_DEAD)
>             heap_prune_record_live_or_recently_dead(dp, prstate,
> offnum, presult);

sorry, that should say presult->htsv[rootoffnum] != HEAPTUPLE_DEAD.

The latter should be a subset of the former. But instead it seems
there are cases I missed by doing only the former.

- Melanie



Re: Combine Prune and Freeze records emitted by vacuum

From
Heikki Linnakangas
Date:
On 27/03/2024 20:26, Melanie Plageman wrote:
> On Wed, Mar 27, 2024 at 12:18 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> On 27/03/2024 17:18, Melanie Plageman wrote:
>>> I need some way to modify the control flow or accounting such that I
>>> know which HEAPTUPLE_RECENTLY_DEAD tuples will not be marked LP_DEAD.
>>> And a way to consider freezing and do live tuple accounting for these
>>> and HEAPTUPLE_LIVE tuples exactly once.
>>
>> Just a quick update: I've been massaging this some more today, and I
>> think I'm onto got something palatable. I'll send an updated patch later
>> today, but the key is to note that for each item on the page, there is
>> one point where we determine the fate of the item, whether it's pruned
>> or not. That can happen in different points in in heap_page_prune().
>> That's also when we set marked[offnum] = true. Whenever that happens, we
>> all call one of the a heap_page_prune_record_*() subroutines. We already
>> have those subroutines for when a tuple is marked as dead or unused, but
>> let's add similar subroutines for the case that we're leaving the tuple
>> unchanged. If we move all the bookkeeping logic to those subroutines, we
>> can ensure that it gets done exactly once for each tuple, and at that
>> point we know what we are going to do to the tuple, so we can count it
>> correctly. So heap_prune_chain() decides what to do with each tuple, and
>> ensures that each tuple is marked only once, and the subroutines update
>> all the variables, add the item to the correct arrays etc. depending on
>> what we're doing with it.
> 
> Yes, this would be ideal.

Well, that took me a lot longer than expected. My approach of "make sure 
you all the right heap_prune_record_*() subroutine in all cases didn't 
work out quite as easily as I thought. Because, as you pointed out, it's 
difficult to know if a non-DEAD tuple that is part of a HOT chain will 
be visited later as part of the chain processing, or needs to be counted 
at the top of heap_prune_chain().

The solution I came up with is to add a third phase to pruning. At the 
top of heap_prune_chain(), if we see a live heap-only tuple, and we're 
not sure if it will be counted later as part of a HOT chain, we stash it 
away and revisit it later, after processing all the hot chains. That's 
somewhat similar to your 'counted' array, but not quite.

Attached is that approach, on top of v7. It's a bit messy, I made a 
bunch of other changes too and didn't fully separate them out to 
separate patch. Sorry about that.

One change with this is that live_tuples and many of the other fields 
are now again updated, even if the caller doesn't need them. It was hard 
to skip them in a way that would save any cycles, with the other 
refactorings.

Some other notable changes are mentioned in the commit message.

> I was doing some experimentation with pageinspect today (trying to
> find that single place where live tuples fates are decided) and it
> seems like a heap-only tuple that is not HOT-updated will usually be
> the one at the end of the chain. Which seems like it would be covered
> by adding a record_live() type function call  in the loop of
> heap_prune_chain():
> 
>          /*
>           * If the tuple is not HOT-updated, then we are at the end of this
>           * HOT-update chain.
>           */
>          if (!HeapTupleHeaderIsHotUpdated(htup))
>          {
>              heap_prune_record_live_or_recently_dead(dp, prstate,
> offnum, presult);
>              break;
>          }
> 
> but that doesn't end up producing the same results as
> 
>          if (HeapTupleHeaderIsHeapOnly(htup)
>              && !HeapTupleHeaderIsHotUpdated(htup) &&
> presult->htsv[rootoffnum] == HEAPTUPLE_DEAD)
>              heap_prune_record_live_or_recently_dead(dp, prstate,
> offnum, presult);
> 
> at the top of heap_prune_chain().

Yep, this is tricky, I also spent a lot of time trying to find a good 
"choke point" where we could say for sure that a live tuple is processed 
exactly once, but fumbled just like you.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Thu, Mar 28, 2024 at 01:04:04AM +0200, Heikki Linnakangas wrote:
> On 27/03/2024 20:26, Melanie Plageman wrote:
> > On Wed, Mar 27, 2024 at 12:18 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > > 
> > > On 27/03/2024 17:18, Melanie Plageman wrote:
> > > > I need some way to modify the control flow or accounting such that I
> > > > know which HEAPTUPLE_RECENTLY_DEAD tuples will not be marked LP_DEAD.
> > > > And a way to consider freezing and do live tuple accounting for these
> > > > and HEAPTUPLE_LIVE tuples exactly once.
> > > 
> > > Just a quick update: I've been massaging this some more today, and I
> > > think I'm onto got something palatable. I'll send an updated patch later
> > > today, but the key is to note that for each item on the page, there is
> > > one point where we determine the fate of the item, whether it's pruned
> > > or not. That can happen in different points in in heap_page_prune().
> > > That's also when we set marked[offnum] = true. Whenever that happens, we
> > > all call one of the a heap_page_prune_record_*() subroutines. We already
> > > have those subroutines for when a tuple is marked as dead or unused, but
> > > let's add similar subroutines for the case that we're leaving the tuple
> > > unchanged. If we move all the bookkeeping logic to those subroutines, we
> > > can ensure that it gets done exactly once for each tuple, and at that
> > > point we know what we are going to do to the tuple, so we can count it
> > > correctly. So heap_prune_chain() decides what to do with each tuple, and
> > > ensures that each tuple is marked only once, and the subroutines update
> > > all the variables, add the item to the correct arrays etc. depending on
> > > what we're doing with it.
> > 
> > Yes, this would be ideal.
> 
> Well, that took me a lot longer than expected. My approach of "make sure you
> all the right heap_prune_record_*() subroutine in all cases didn't work out
> quite as easily as I thought. Because, as you pointed out, it's difficult to
> know if a non-DEAD tuple that is part of a HOT chain will be visited later
> as part of the chain processing, or needs to be counted at the top of
> heap_prune_chain().
> 
> The solution I came up with is to add a third phase to pruning. At the top
> of heap_prune_chain(), if we see a live heap-only tuple, and we're not sure
> if it will be counted later as part of a HOT chain, we stash it away and
> revisit it later, after processing all the hot chains. That's somewhat
> similar to your 'counted' array, but not quite.

I like this idea (the third phase). I've just started reviewing this but
I thought I would give the initial thoughts I had inline.

> One change with this is that live_tuples and many of the other fields are
> now again updated, even if the caller doesn't need them. It was hard to skip
> them in a way that would save any cycles, with the other refactorings.

I am worried we are writing checks that are going to have to come out of
SELECT queries' bank accounts, but I'll do some benchmarking when we're
all done with major refactoring.

> From 2f38628373ccfb6e8f8fd883955056030092569d Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Thu, 28 Mar 2024 00:16:09 +0200
> Subject: [PATCH v8 21/22] Add comment about a pre-existing issue
> 
> Not sure if we want to keep this, but I wanted to document it for
> discussion at least.
> ---
>  src/backend/access/heap/pruneheap.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
> index e37ba655a7d..2b720ab6aa1 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -792,6 +792,23 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
>               * Note that we might first arrive at a dead heap-only tuple
>               * either here or while following a chain below.  Whichever path
>               * gets there first will mark the tuple unused.
> +             *
> +             * FIXME: The next paragraph isn't new with these patches, but
> +             * just something I realized while looking at this. But maybe we should
> +             * add a comment like this? Or is it too much detail?

I think a comment is a good idea.

> +             *
> +             * Whether we arrive at the dead HOT tuple first here or while
> +             * following a chain below affects whether preceding RECENTLY_DEAD
> +             * tuples in the chain can be removed or not.  Imagine that you
> +             * have a chain with two tuples: RECENTLY_DEAD -> DEAD.  If we
> +             * reach the RECENTLY_DEAD tuple first, the chain-following logic
> +             * will find the DEAD tuple and conclude that both tuples are in
> +             * fact dead and can be removed.  But if we reach the DEAD tuple
> +             * at the end of the chain first, when we reach the RECENTLY_DEAD
> +             * tuple later, we will not follow the chain because the DEAD
> +             * TUPLE is already 'marked', and will not remove the
> +             * RECENTLY_DEAD tuple.  This is not a correctness issue, and the
> +             * RECENTLY_DEAD tuple will be removed by a later VACUUM.
>               */
>              if (!HeapTupleHeaderIsHotUpdated(htup))

Is this intentional? Like would it be correct to remove the
RECENTLY_DEAD tuple during the current vacuum?

> From c2ee7508456d0e76de985f9997a6840450e342a8 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Thu, 28 Mar 2024 00:45:26 +0200
> Subject: [PATCH v8 22/22] WIP
> 
> - Got rid of all_visible_except_removable again. We're back to the
>   roughly the same mechanism as on 'master', where the all_visible
>   doesn't include LP_DEAD items, but at the end of
>   heap_page_prune_and_freeze() when we return to the caller, we clear
>   it if there were any LP_DEAD items. I considered calling the
>   variable 'all_visible_except_lp_dead', which would be more accurate,
>   but it's also very long.

not longer than all_visible_except_removable. I would be happy to keep
it more exact, but I'm also okay with just all_visible.

> - I duplicated all the fields from PageFreezeResult to PruneState. Now
>   heap_prune_chain() and all the subroutines don't need the
>   PageFreezeResult argument, and you don't need to remember which
>   struct each field is kept in. It's all now in PruneState, and the
>   fields that the caller is interested in are copied to
>   PageFreezeResult at the end of heap_page_prune_and_freeze()

yea, this makes sense to me. Makes me wonder if we shouldn't just have
PruneFreezeResult->live_tuples/recently_dead_tuples/etc be pointers and
then lazy_scan_prune() can pass the actual vacrel->live_tuples counter
and heap_page_prune_and_freeze() can increment it itself. Maybe that's
weird though.

> - Replaced the 'counted' array with 'revisit' array. I thought I could
>   get rid of it altogether, by just being careful to call the right
>   heap_prune_record_*() subroutine for each tuple in heap_prune_chain(),
>   but with live and recently-dead tuples that are part of a HOT chain,
>   we might visit the tuple as part of the HOT chain or not, depending
>   on what it's position in the chain is. So I invented a new revisit
>   phase. All live heap-only tuples that we find, that haven't already
>   been processed as part of a hot chain, are stashed away in the
>   'revisit' array. After processing all the HOT chains, the 'revisit'
>   tuples are re-checked, and counted if they haven't already been counted.

makes sense.

> - Live tuples are now also marked in the 'marked' array, when they are
>   counted. This gives a nice invariant: all tuples must be marked
>   exactly once, as part of a hot chain or otherwise. Added an
>   assertion for that.

this is a nice thing to have.

> ---
>  src/backend/access/heap/pruneheap.c  | 706 +++++++++++++++++----------
>  src/backend/access/heap/vacuumlazy.c |   6 +-
>  src/include/access/heapam.h          |  37 +-
>  3 files changed, 464 insertions(+), 285 deletions(-)
> 
> diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
> index 2b720ab6aa1..a8ed11a1858 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
>  
> -static void heap_prune_record_live_or_recently_dead(Page page, PruneState *prstate,
> -                                                    OffsetNumber offnum, PruneFreezeResult *presult);
>  static void page_verify_redirects(Page page);
>  
>  
> @@ -242,6 +284,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
>   * vistest is used to distinguish whether tuples are DEAD or RECENTLY_DEAD
>   * (see heap_prune_satisfies_vacuum).
>   *

What's this "cutoffs TODO"?

> + * cutoffs TODO
> + *
>   * presult contains output parameters needed by callers such as the number of
>   * tuples removed and the number of line pointers newly marked LP_DEAD.
>   * heap_page_prune_and_freeze() is responsible for initializing it.
> @@ -326,70 +370,63 @@ heap_page_prune_and_freeze(Relation relation, Buffer buffer,
>      prstate.latest_xid_removed = InvalidTransactionId;
>      prstate.nredirected = prstate.ndead = prstate.nunused = prstate.nfrozen = 0;
>      memset(prstate.marked, 0, sizeof(prstate.marked));
> -    memset(prstate.counted, 0, sizeof(prstate.counted));
> +    prstate.nrevisit = 0;
>  

>          if (presult->nfrozen > 0)
> @@ -728,10 +832,12 @@ htsv_get_valid_status(int status)
>   * DEAD, our visibility test is just too coarse to detect it.
>   *
>   * In general, pruning must never leave behind a DEAD tuple that still has
> - * tuple storage.  VACUUM isn't prepared to deal with that case.  That's why
> + * tuple storage.  VACUUM isn't prepared to deal with that case (FIXME: it no longer cares, right?).
> + * That's why
>   * VACUUM prunes the same heap page a second time (without dropping its lock
>   * in the interim) when it sees a newly DEAD tuple that we initially saw as
> - * in-progress.  Retrying pruning like this can only happen when an inserting
> + * in-progress (FIXME: Really? Does it still do that?).

Yea, I think all that is no longer true. I missed this comment back
then.

> + * Retrying pruning like this can only happen when an inserting
>   * transaction concurrently aborts.
>   *
>   * The root line pointer is redirected to the tuple immediately after the
> @@ -743,15 +849,18 @@ htsv_get_valid_status(int status)
>   * prstate showing the changes to be made.  Items to be redirected are added
>   * to the redirected[] array (two entries per redirection); items to be set to
>   * LP_DEAD state are added to nowdead[]; and items to be set to LP_UNUSED
> - * state are added to nowunused[].
> - *
> - * Returns the number of tuples (to be) deleted from the page.
> + * state are added to nowunused[].  We perform bookkeeping of live tuples,
> + * visibility etc. based on what the page will look like after the changes
> + * applied.  All that bookkeeping is performed in the heap_prune_record_*()
> + * subroutines.  The division of labor is that heap_prune_chain() decides the
> + * fate of each tuple, ie. whether it's going to be removed, redirected or
> + * left unchanged, and the heap_prune_record_*() subroutines update PruneState
> + * based on that outcome.
>   */
> -static int
> +static void
>  heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
> -                 PruneState *prstate, PruneFreezeResult *presult)
> +                 PruneState *prstate)
>  {
> -    int            ndeleted = 0;
>      Page        dp = (Page) BufferGetPage(buffer);
>      TransactionId priorXmax = InvalidTransactionId;
>      ItemId        rootlp;
> @@ -794,8 +903,8 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
>               * gets there first will mark the tuple unused.
>               *
>               * FIXME: The next paragraph isn't new with these patches, but
> -             * just something I realized while looking at this. But maybe we should
> -             * add a comment like this? Or is it too much detail?
> +             * just something I realized while looking at this. But maybe we
> +             * should add a comment like this? Or is it too much detail?

i don't think it is too much detail.

>               *
>               * Whether we arrive at the dead HOT tuple first here or while
>               * following a chain below affects whether preceding RECENTLY_DEAD
> @@ -809,43 +918,52 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
>               * TUPLE is already 'marked', and will not remove the
>               * RECENTLY_DEAD tuple.  This is not a correctness issue, and the
>               * RECENTLY_DEAD tuple will be removed by a later VACUUM.
> +             *
> +             * FIXME: Now that we have the 'revisit' array, we could stash
> +             * these DEAD items there too, instead of processing them here
> +             * immediately.  That way, DEAD tuples that are still part of a
> +             * chain would always get processed as part of the chain.
>               */

I really like this idea!

>              if (!HeapTupleHeaderIsHotUpdated(htup))
>              {
>                  if (prstate->htsv[rootoffnum] == HEAPTUPLE_DEAD)
>                  {
> -                    heap_prune_record_unused(prstate, rootoffnum);
> +                    heap_prune_record_unused(prstate, rootoffnum, true);
>                      HeapTupleHeaderAdvanceConflictHorizon(htup,
>                                                            &prstate->latest_xid_removed);
> -                    ndeleted++;
> -                }

I think we could really do with some more comments with examples like
this in the pruning code (that go through an example series of steps).
Not least of which because you can't see RECENTLY_DEAD in pageinspect
(you'd have to create some kind of status for it from the different
tuples on the page).

For example, I hadn't thought of this one:

REDIRECT -> RECENTY_DEAD -> DEAD -> RECENTLY_DEAD -> DEAD

> +    /*----
> +     * FIXME: this helped me to visualize how different chains might look like
> +     * here. It's not an exhaustive list, just some examples to help with
> +     * thinking. Remove this comment from final version, or refine.
> +     *
> +     * REDIRECT -> LIVE (stop) -> ...
> +     * REDIRECT -> RECENTY_DEAD -> LIVE (stop) -> ...
> +     * REDIRECT -> RECENTY_DEAD -> RECENTLY_DEAD
> +     * REDIRECT -> RECENTY_DEAD -> DEAD
> +     * REDIRECT -> RECENTY_DEAD -> DEAD -> RECENTLY_DEAD -> DEAD
> +     * RECENTLY_DEAD -> ...
> +     */
> +
>      /* while not end of the chain */
>      for (;;)
>      {
> @@ -897,19 +1015,12 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
>  
>              case HEAPTUPLE_RECENTLY_DEAD:
>                  recent_dead = true;
> -
> -                /*
> -                 * This tuple may soon become DEAD.  Update the hint field so
> -                 * that the page is reconsidered for pruning in future.
> -                 */
> -                heap_prune_record_prunable(prstate,
> -                                           HeapTupleHeaderGetUpdateXid(htup));
>                  break;
>  
>              case HEAPTUPLE_DELETE_IN_PROGRESS:
> -
> -                /*
> -                 * This tuple may soon become DEAD.  Update the hint field so
> -                 * that the page is reconsidered for pruning in future.
> -                 */
> -                heap_prune_record_prunable(prstate,
> -                                           HeapTupleHeaderGetUpdateXid(htup));
> -                break;
> -
>              case HEAPTUPLE_LIVE:
>              case HEAPTUPLE_INSERT_IN_PROGRESS:
> -
> -                /*
> -                 * If we wanted to optimize for aborts, we might consider
> -                 * marking the page prunable when we see INSERT_IN_PROGRESS.
> -                 * But we don't.  See related decisions about when to mark the
> -                 * page prunable in heapam.c.
> -                 */
>                  break;
>  
>              default:
> @@ -981,7 +1069,18 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
>           * RECENTLY_DEAD tuples just in case there's a DEAD one after them;
>           * but we can't advance past anything else.  We have to make sure that
>           * we don't miss any DEAD tuples, since DEAD tuples that still have
> -         * tuple storage after pruning will confuse VACUUM.
> +         * tuple storage after pruning will confuse VACUUM (FIXME: not anymore
> +         * I think?).

Meaning, it won't confuse vacuum anymore or there won't be DEAD tuples
with storage after pruning anymore?

> +         *
> +         * FIXME: Not a new issue, but spotted it now : If there is a chain
> +         * like RECENTLY_DEAD -> DEAD, we will remove both tuples, but will
> +         * not call HeapTupleHeaderAdvanceConflictHorizon() for the
> +         * RECENTLY_DEAD tuple. Is that OK?  I think it is.  In a HOT chain,
> +         * we know that the later tuple committed before any earlier tuples in
> +         * the chain, therefore it ought to be enough to set the conflict
> +         * horizon based on the later tuple. If all snapshots on the standby
> +         * see the deleter of the last tuple as committed, they must consider
> +         * all the earlier ones as committed too.
>           */

This makes sense to me. (that if all the snapshots on the standby see
the deleter of the last tuple as committed, then they consider the
earlier ones deleted too). Probably wouldn't hurt to call this out here
though.

> @@ -1206,15 +1318,6 @@ heap_prune_record_live_or_recently_dead(Page page, PruneState *prstate, OffsetNu
>       * can't run inside a transaction block, which makes some cases impossible
>       * (e.g. in-progress insert from the same transaction).
>       *
> -     * We treat LP_DEAD items (which are the closest thing to DEAD tuples that
> -     * might be seen here) differently, too: we assume that they'll become
> -     * LP_UNUSED before VACUUM finishes.  This difference is only superficial.
> -     * VACUUM effectively agrees with ANALYZE about DEAD items, in the end.
> -     * VACUUM won't remember LP_DEAD items, but only because they're not
> -     * supposed to be left behind when it is done. (Cases where we bypass
> -     * index vacuuming will violate this optimistic assumption, but the
> -     * overall impact of that should be negligible.)
> -     *
>       * HEAPTUPLE_LIVE tuples are naturally counted as live. This is also what
>       * acquire_sample_rows() does.
>       *
> @@ -1224,10 +1327,21 @@ heap_prune_record_live_or_recently_dead(Page page, PruneState *prstate, OffsetNu
>       * ensure the math works out. The assumption that the transaction will
>       * commit and update the counters after we report is a bit shaky; but it
>       * is what acquire_sample_rows() does, so we do the same to be consistent.
> +     *
> +     * HEAPTUPLE_DEAD are handled by the other heap_prune_record_*()
> +     * subroutines.  They don't count dead items like acquire_sample_rows()
> +     * does, because we assume that all dead items will become LP_UNUSED
> +     * before VACUUM finishes.  This difference is only superficial.  VACUUM
> +     * effectively agrees with ANALYZE about DEAD items, in the end.  VACUUM
> +     * won't remember LP_DEAD items, but only because they're not supposed to
> +     * be left behind when it is done. (Cases where we bypass index vacuuming
> +     * will violate this optimistic assumption, but the overall impact of that
> +     * should be negligible.) FIXME: I don't understand that last sentence in
> +     * parens. It was copied from elsewhere.

If we bypass index vacuuming, there will be LP_DEAD items left behind
when we are done because we didn't do index vacuuming and then reaping
of those dead items. All of these comments are kind of a copypasta,
though.

>       */
>      htup = (HeapTupleHeader) PageGetItem(page, PageGetItemId(page, offnum));
>  
> -    switch (status)

Okay, that's all for now. I'll do more in-depth review tomorrow.

- Melanie



Re: Combine Prune and Freeze records emitted by vacuum

From
Heikki Linnakangas
Date:
On 28/03/2024 03:53, Melanie Plageman wrote:
> On Thu, Mar 28, 2024 at 01:04:04AM +0200, Heikki Linnakangas wrote:
>> One change with this is that live_tuples and many of the other fields are
>> now again updated, even if the caller doesn't need them. It was hard to skip
>> them in a way that would save any cycles, with the other refactorings.
> 
> I am worried we are writing checks that are going to have to come out of
> SELECT queries' bank accounts, but I'll do some benchmarking when we're
> all done with major refactoring.

Sounds good, thanks.

>> +             *
>> +             * Whether we arrive at the dead HOT tuple first here or while
>> +             * following a chain below affects whether preceding RECENTLY_DEAD
>> +             * tuples in the chain can be removed or not.  Imagine that you
>> +             * have a chain with two tuples: RECENTLY_DEAD -> DEAD.  If we
>> +             * reach the RECENTLY_DEAD tuple first, the chain-following logic
>> +             * will find the DEAD tuple and conclude that both tuples are in
>> +             * fact dead and can be removed.  But if we reach the DEAD tuple
>> +             * at the end of the chain first, when we reach the RECENTLY_DEAD
>> +             * tuple later, we will not follow the chain because the DEAD
>> +             * TUPLE is already 'marked', and will not remove the
>> +             * RECENTLY_DEAD tuple.  This is not a correctness issue, and the
>> +             * RECENTLY_DEAD tuple will be removed by a later VACUUM.
>>                */
>>               if (!HeapTupleHeaderIsHotUpdated(htup))
> 
> Is this intentional? Like would it be correct to remove the
> RECENTLY_DEAD tuple during the current vacuum?

Yes, it would be correct. And if we happen to visit the items in 
different order, the RECENTLY_DEAD tuple first, it will get removed.

(This is just based on me reading the code, I'm not sure if I've missed 
something. Would be nice to construct a test case for that and step 
through it with a debugger to really see what happens. But this is not a 
new issue, doesn't need to be part of this patch)

>>  From c2ee7508456d0e76de985f9997a6840450e342a8 Mon Sep 17 00:00:00 2001
>> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
>> Date: Thu, 28 Mar 2024 00:45:26 +0200
>> Subject: [PATCH v8 22/22] WIP
>>
>> - Got rid of all_visible_except_removable again. We're back to the
>>    roughly the same mechanism as on 'master', where the all_visible
>>    doesn't include LP_DEAD items, but at the end of
>>    heap_page_prune_and_freeze() when we return to the caller, we clear
>>    it if there were any LP_DEAD items. I considered calling the
>>    variable 'all_visible_except_lp_dead', which would be more accurate,
>>    but it's also very long.
> 
> not longer than all_visible_except_removable. I would be happy to keep
> it more exact, but I'm also okay with just all_visible.

Ok, let's make it 'all_visible_except_lp_dead' for clarity.

> What's this "cutoffs TODO"?
> 
>> + * cutoffs TODO
>> + *
>>    * presult contains output parameters needed by callers such as the number of
>>    * tuples removed and the number of line pointers newly marked LP_DEAD.
>>    * heap_page_prune_and_freeze() is responsible for initializing it.

All the other arguments are documented in the comment, except 'cutoffs'.

>> @@ -728,10 +832,12 @@ htsv_get_valid_status(int status)
>>    * DEAD, our visibility test is just too coarse to detect it.
>>    *
>>    * In general, pruning must never leave behind a DEAD tuple that still has
>> - * tuple storage.  VACUUM isn't prepared to deal with that case.  That's why
>> + * tuple storage.  VACUUM isn't prepared to deal with that case (FIXME: it no longer cares, right?).
>> + * That's why
>>    * VACUUM prunes the same heap page a second time (without dropping its lock
>>    * in the interim) when it sees a newly DEAD tuple that we initially saw as
>> - * in-progress.  Retrying pruning like this can only happen when an inserting
>> + * in-progress (FIXME: Really? Does it still do that?).
> 
> Yea, I think all that is no longer true. I missed this comment back
> then.

Committed a patch to remove it.

>> @@ -981,7 +1069,18 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
>>            * RECENTLY_DEAD tuples just in case there's a DEAD one after them;
>>            * but we can't advance past anything else.  We have to make sure that
>>            * we don't miss any DEAD tuples, since DEAD tuples that still have
>> -         * tuple storage after pruning will confuse VACUUM.
>> +         * tuple storage after pruning will confuse VACUUM (FIXME: not anymore
>> +         * I think?).
> 
> Meaning, it won't confuse vacuum anymore or there won't be DEAD tuples
> with storage after pruning anymore?

I meant that it won't confuse VACUUM anymore. lazy_scan_prune() doesn't 
loop through the items on the page checking their visibility anymore.

Hmm, one confusion remains though: In the 2nd phase of vacuum, we remove 
all the dead line pointers that we have now removed from the indexes. 
When we do that, we assume them all to be dead line pointers, without 
storage, rather than normal tuples that happen to be HEAPTUPLE_DEAD. So 
it's important that if pruning would leave behind HEAPTUPLE_DEAD tuples, 
they are not included in 'deadoffsets'.

In any case, let's just make sure that pruning doesn't leave 
HEAPTUPLE_DEAD tuples. There's no reason it should.


>> @@ -1224,10 +1327,21 @@ heap_prune_record_live_or_recently_dead(Page page, PruneState *prstate, OffsetNu
>>        * ensure the math works out. The assumption that the transaction will
>>        * commit and update the counters after we report is a bit shaky; but it
>>        * is what acquire_sample_rows() does, so we do the same to be consistent.
>> +     *
>> +     * HEAPTUPLE_DEAD are handled by the other heap_prune_record_*()
>> +     * subroutines.  They don't count dead items like acquire_sample_rows()
>> +     * does, because we assume that all dead items will become LP_UNUSED
>> +     * before VACUUM finishes.  This difference is only superficial.  VACUUM
>> +     * effectively agrees with ANALYZE about DEAD items, in the end.  VACUUM
>> +     * won't remember LP_DEAD items, but only because they're not supposed to
>> +     * be left behind when it is done. (Cases where we bypass index vacuuming
>> +     * will violate this optimistic assumption, but the overall impact of that
>> +     * should be negligible.) FIXME: I don't understand that last sentence in
>> +     * parens. It was copied from elsewhere.
> 
> If we bypass index vacuuming, there will be LP_DEAD items left behind
> when we are done because we didn't do index vacuuming and then reaping
> of those dead items. All of these comments are kind of a copypasta,
> though.

Ah, gotcha, makes sense now. I didn't remember that we sometimes by pass 
index vacuuming if there are very few dead items. I thought this talked 
about the case that there are no indexes, but that case is OK.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Thu, Mar 28, 2024 at 4:49 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 28/03/2024 03:53, Melanie Plageman wrote:
> > On Thu, Mar 28, 2024 at 01:04:04AM +0200, Heikki Linnakangas wrote:
> >> One change with this is that live_tuples and many of the other fields are
> >> now again updated, even if the caller doesn't need them. It was hard to skip
> >> them in a way that would save any cycles, with the other refactorings.
> >
> > I am worried we are writing checks that are going to have to come out of
> > SELECT queries' bank accounts, but I'll do some benchmarking when we're
> > all done with major refactoring.
>
> Sounds good, thanks.

Actually, after having looked at it again, there really are only a few
more increments of various counters, the memory consumed by revisit,
and the additional setting of offsets in marked. I think a few
carefully constructed queries which do on-access pruning could test
the impact of this (as opposed to a bigger benchmarking endeavor).

I also wonder if there would be any actual impact of marking the
various record_*() routines inline.

> >> @@ -728,10 +832,12 @@ htsv_get_valid_status(int status)
> >>    * DEAD, our visibility test is just too coarse to detect it.
> >>    *
> >>    * In general, pruning must never leave behind a DEAD tuple that still has
> >> - * tuple storage.  VACUUM isn't prepared to deal with that case.  That's why
> >> + * tuple storage.  VACUUM isn't prepared to deal with that case (FIXME: it no longer cares, right?).
> >> + * That's why
> >>    * VACUUM prunes the same heap page a second time (without dropping its lock
> >>    * in the interim) when it sees a newly DEAD tuple that we initially saw as
> >> - * in-progress.  Retrying pruning like this can only happen when an inserting
> >> + * in-progress (FIXME: Really? Does it still do that?).
> >
> > Yea, I think all that is no longer true. I missed this comment back
> > then.
>
> Committed a patch to remove it.
>
> >> @@ -981,7 +1069,18 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
> >>               * RECENTLY_DEAD tuples just in case there's a DEAD one after them;
> >>               * but we can't advance past anything else.  We have to make sure that
> >>               * we don't miss any DEAD tuples, since DEAD tuples that still have
> >> -             * tuple storage after pruning will confuse VACUUM.
> >> +             * tuple storage after pruning will confuse VACUUM (FIXME: not anymore
> >> +             * I think?).
> >
> > Meaning, it won't confuse vacuum anymore or there won't be DEAD tuples
> > with storage after pruning anymore?
>
> I meant that it won't confuse VACUUM anymore. lazy_scan_prune() doesn't
> loop through the items on the page checking their visibility anymore.
>
> Hmm, one confusion remains though: In the 2nd phase of vacuum, we remove
> all the dead line pointers that we have now removed from the indexes.
> When we do that, we assume them all to be dead line pointers, without
> storage, rather than normal tuples that happen to be HEAPTUPLE_DEAD. So
> it's important that if pruning would leave behind HEAPTUPLE_DEAD tuples,
> they are not included in 'deadoffsets'.

It seems like master only adds items it is marking LP_DEAD to
deadoffsets. And things marked LP_DEAD have lp_len set to 0.

> In any case, let's just make sure that pruning doesn't leave
> HEAPTUPLE_DEAD tuples. There's no reason it should.

Maybe worth adding an assert to

static void
heap_prune_record_unchanged_lp_dead(ItemId itemid, PruneState
*prstate, OffsetNumber offnum)
{
...
    Assert(!ItemIdHasStorage(itemid));
    prstate->deadoffsets[prstate->lpdead_items++] = offnum;
}

By the way, I wasn't quite sure about the purpose of
heap_prune_record_unchanged_lp_dead(). It is called in
heap_prune_chain() in a place where we didn't add things to
deadoffsets before, no?

        /*
         * Likewise, a dead line pointer can't be part of the chain. (We
         * already eliminated the case of dead root tuple outside this
         * function.)
         */
        if (ItemIdIsDead(lp))
        {
            /*
             * If the caller set PRUNE_DO_MARK_UNUSED_NOW, we can set dead
             * line pointers LP_UNUSED now.
             */
            if (unlikely(prstate->actions & PRUNE_DO_MARK_UNUSED_NOW))
                heap_prune_record_unused(prstate, offnum, false);
            else
                heap_prune_record_unchanged_lp_dead(lp, prstate, offnum);
            break;
        }

> >> @@ -1224,10 +1327,21 @@ heap_prune_record_live_or_recently_dead(Page page, PruneState *prstate, OffsetNu
> >>       * ensure the math works out. The assumption that the transaction will
> >>       * commit and update the counters after we report is a bit shaky; but it
> >>       * is what acquire_sample_rows() does, so we do the same to be consistent.
> >> +     *
> >> +     * HEAPTUPLE_DEAD are handled by the other heap_prune_record_*()
> >> +     * subroutines.  They don't count dead items like acquire_sample_rows()
> >> +     * does, because we assume that all dead items will become LP_UNUSED
> >> +     * before VACUUM finishes.  This difference is only superficial.  VACUUM
> >> +     * effectively agrees with ANALYZE about DEAD items, in the end.  VACUUM
> >> +     * won't remember LP_DEAD items, but only because they're not supposed to
> >> +     * be left behind when it is done. (Cases where we bypass index vacuuming
> >> +     * will violate this optimistic assumption, but the overall impact of that
> >> +     * should be negligible.) FIXME: I don't understand that last sentence in
> >> +     * parens. It was copied from elsewhere.
> >
> > If we bypass index vacuuming, there will be LP_DEAD items left behind
> > when we are done because we didn't do index vacuuming and then reaping
> > of those dead items. All of these comments are kind of a copypasta,
> > though.
>
> Ah, gotcha, makes sense now. I didn't remember that we sometimes by pass
> index vacuuming if there are very few dead items. I thought this talked
> about the case that there are no indexes, but that case is OK.

These comments could use another pass. I had added some extra
(probably redundant) content when I thought I was refactoring it a
certain way and then changed my mind.

Attached is a diff with some ideas I had for a bit of code simplification.

Are you working on cleaning this patch up or should I pick it up?

I wonder if it makes sense to move some of the changes to
heap_prune_chain() with setting things in marked/revisited to the
start of the patch set (to be committed first).

- Melanie

Attachment

Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Thu, Mar 28, 2024 at 11:07:10AM -0400, Melanie Plageman wrote:
> These comments could use another pass. I had added some extra
> (probably redundant) content when I thought I was refactoring it a
> certain way and then changed my mind.
> 
> Attached is a diff with some ideas I had for a bit of code simplification.
> 
> Are you working on cleaning this patch up or should I pick it up?

Attached v9 is rebased over master. But, more importantly, I took
another pass at heap_prune_chain() and am pretty happy with what I came
up with. See 0021. I simplified the traversal logic and then grouped the
chain processing into three branches at the end. I find it much easier
to understand what we are doing for different types of HOT chains.

I got rid of revisited. We can put it back, but I was thinking: we stash
all HOT tuples and then loop over them later, calling record_unchanged()
on the ones that aren't marked. But, if we have a lot of HOT tuples, is
this really that much better than just looping through all the offsets
and calling record_unchanged() on just the ones that aren't marked?

I've done that in my version. While testing this, I found that only
on-access pruning needed this final loop calling record_unchanged() on
items not yet marked. I know we can't skip this final loop entirely in
the ON ACCESS case because it calls record_prunable(), but we could
consider moving that back out into heap_prune_chain()? Or what do you
think?

I haven't finished updating all the comments, but I am really interested
to know what you think about heap_prune_chain() now.

Note that patches 0001-0020 are still the same as before. Only 0021 is
the new changes I made (they are built on top of your v8 0022).

Tomorrow I will start first thing figuring out how to break this down
into parts that can apply on master and then rebase the rest of the
patches on top of it.

- Melanie

Attachment

Re: Combine Prune and Freeze records emitted by vacuum

From
Heikki Linnakangas
Date:
On 29/03/2024 07:04, Melanie Plageman wrote:
> On Thu, Mar 28, 2024 at 11:07:10AM -0400, Melanie Plageman wrote:
>> These comments could use another pass. I had added some extra
>> (probably redundant) content when I thought I was refactoring it a
>> certain way and then changed my mind.
>>
>> Attached is a diff with some ideas I had for a bit of code simplification.
>>
>> Are you working on cleaning this patch up or should I pick it up?
> 
> Attached v9 is rebased over master. But, more importantly, I took
> another pass at heap_prune_chain() and am pretty happy with what I came
> up with. See 0021. I simplified the traversal logic and then grouped the
> chain processing into three branches at the end. I find it much easier
> to understand what we are doing for different types of HOT chains.

Ah yes, agreed, that's nicer.

The 'survivor' variable is a little confusing, especially here:

    if (!survivor)
    {
        int            i;

        /*
         * If no DEAD tuple was found, and the root is redirected, mark it as
         * such.
         */
        if ((i = ItemIdIsRedirected(rootlp)))
            heap_prune_record_unchanged_lp_redirect(prstate, rootoffnum);

        /* the rest of tuples in the chain are normal, unchanged tuples */
        for (; i < nchain; i++)
            heap_prune_record_unchanged(dp, prstate, chainitems[i]);
    }

You would think that "if(!survivor)", it means that there is no live 
tuples on the chain, i.e. no survivors. But in fact it's the opposite; 
it means that the are all live. Maybe call it 'ndeadchain' instead, 
meaning the number of dead items in the chain.

> I got rid of revisited. We can put it back, but I was thinking: we stash
> all HOT tuples and then loop over them later, calling record_unchanged()
> on the ones that aren't marked. But, if we have a lot of HOT tuples, is
> this really that much better than just looping through all the offsets
> and calling record_unchanged() on just the ones that aren't marked?

Well, it requires looping through all the offsets one more time, and 
unless you have a lot of HOT tuples, most items would be marked already. 
But maybe the overhead is negligible anyway.

> I've done that in my version. While testing this, I found that only
> on-access pruning needed this final loop calling record_unchanged() on
> items not yet marked. I know we can't skip this final loop entirely in
> the ON ACCESS case because it calls record_prunable(), but we could
> consider moving that back out into heap_prune_chain()? Or what do you
> think?

Hmm, why is that different with on-access pruning?

Here's another idea: In the first loop through the offsets, where we 
gather the HTSV status of each item, also collect the offsets of all HOT 
and non-HOT items to two separate arrays. Call heap_prune_chain() for 
all the non-HOT items first, and then process any remaining HOT tuples 
that haven't been marked yet.

> I haven't finished updating all the comments, but I am really interested
> to know what you think about heap_prune_chain() now.

Looks much better now, thanks!

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Fri, Mar 29, 2024 at 12:00 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 29/03/2024 07:04, Melanie Plageman wrote:
> > On Thu, Mar 28, 2024 at 11:07:10AM -0400, Melanie Plageman wrote:
> >> These comments could use another pass. I had added some extra
> >> (probably redundant) content when I thought I was refactoring it a
> >> certain way and then changed my mind.
> >>
> >> Attached is a diff with some ideas I had for a bit of code simplification.
> >>
> >> Are you working on cleaning this patch up or should I pick it up?
> >
> > Attached v9 is rebased over master. But, more importantly, I took
> > another pass at heap_prune_chain() and am pretty happy with what I came
> > up with. See 0021. I simplified the traversal logic and then grouped the
> > chain processing into three branches at the end. I find it much easier
> > to understand what we are doing for different types of HOT chains.
>
> Ah yes, agreed, that's nicer.
>
> The 'survivor' variable is a little confusing, especially here:
>
>         if (!survivor)
>         {
>                 int                     i;
>
>                 /*
>                  * If no DEAD tuple was found, and the root is redirected, mark it as
>                  * such.
>                  */
>                 if ((i = ItemIdIsRedirected(rootlp)))
>                         heap_prune_record_unchanged_lp_redirect(prstate, rootoffnum);
>
>                 /* the rest of tuples in the chain are normal, unchanged tuples */
>                 for (; i < nchain; i++)
>                         heap_prune_record_unchanged(dp, prstate, chainitems[i]);
>         }
>
> You would think that "if(!survivor)", it means that there is no live
> tuples on the chain, i.e. no survivors. But in fact it's the opposite;
> it means that the are all live. Maybe call it 'ndeadchain' instead,
> meaning the number of dead items in the chain.

Makes sense.

> > I've done that in my version. While testing this, I found that only
> > on-access pruning needed this final loop calling record_unchanged() on
> > items not yet marked. I know we can't skip this final loop entirely in
> > the ON ACCESS case because it calls record_prunable(), but we could
> > consider moving that back out into heap_prune_chain()? Or what do you
> > think?
>
> Hmm, why is that different with on-access pruning?

Well, it is highly possible we just don't hit cases like this with
vacuum in our test suite (not that it is unreachable by vacuum).

It's just very easy to get in this situation with on-access pruning.
Imagine an UPDATE which caused the following chain:

RECENTLY_DEAD -> DELETE_IN_PROGRESS -> INSERT_IN_PROGRESS

It invokes heap_page_prune_and_freeze() (assume the page meets the
criteria for on-access pruning) and eventually enters
heap_prune_chain() with the first offset in this chain.

The first item is LP_NORMAL and the tuple is RECENTLY_DEAD, so the
survivor variable stays 0 and we record_unchanged() for that tuple and
return. The next two items are LP_NORMAL and the tuples are HOT
tuples, so we just return from the "fast path" at the top of
heap_prune_chain(). After invoking heap_prune_chain() for all of them,
the first offset is marked but the other two are not. Thus, we end up
having to record_unchanged() later. This kind of thing is a super
common case that we see all the time in queries in the regression test
suite.

> Here's another idea: In the first loop through the offsets, where we
> gather the HTSV status of each item, also collect the offsets of all HOT
> and non-HOT items to two separate arrays. Call heap_prune_chain() for
> all the non-HOT items first, and then process any remaining HOT tuples
> that haven't been marked yet.

That's an interesting idea. I'll try it out and see how it works.

> > I haven't finished updating all the comments, but I am really interested
> > to know what you think about heap_prune_chain() now.
>
> Looks much better now, thanks!

I am currently doing chain traversal refactoring in heap_prune_chain()
on top of master as the first patch in the set.

- Melanie



Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Fri, Mar 29, 2024 at 12:32:21PM -0400, Melanie Plageman wrote:
> On Fri, Mar 29, 2024 at 12:00 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 29/03/2024 07:04, Melanie Plageman wrote:
> > > On Thu, Mar 28, 2024 at 11:07:10AM -0400, Melanie Plageman wrote:
> > >> These comments could use another pass. I had added some extra
> > >> (probably redundant) content when I thought I was refactoring it a
> > >> certain way and then changed my mind.
> > >>
> > >> Attached is a diff with some ideas I had for a bit of code simplification.
> > >>
> > >> Are you working on cleaning this patch up or should I pick it up?
> > >
> > > Attached v9 is rebased over master. But, more importantly, I took
> > > another pass at heap_prune_chain() and am pretty happy with what I came
> > > up with. See 0021. I simplified the traversal logic and then grouped the
> > > chain processing into three branches at the end. I find it much easier
> > > to understand what we are doing for different types of HOT chains.
> >
> > Ah yes, agreed, that's nicer.
> >
> > The 'survivor' variable is a little confusing, especially here:
> >
> >         if (!survivor)
> >         {
> >                 int                     i;
> >
> >                 /*
> >                  * If no DEAD tuple was found, and the root is redirected, mark it as
> >                  * such.
> >                  */
> >                 if ((i = ItemIdIsRedirected(rootlp)))
> >                         heap_prune_record_unchanged_lp_redirect(prstate, rootoffnum);
> >
> >                 /* the rest of tuples in the chain are normal, unchanged tuples */
> >                 for (; i < nchain; i++)
> >                         heap_prune_record_unchanged(dp, prstate, chainitems[i]);
> >         }
> >
> > You would think that "if(!survivor)", it means that there is no live
> > tuples on the chain, i.e. no survivors. But in fact it's the opposite;
> > it means that the are all live. Maybe call it 'ndeadchain' instead,
> > meaning the number of dead items in the chain.
> 
> Makes sense.

I've done this in attached v10.

> > Here's another idea: In the first loop through the offsets, where we
> > gather the HTSV status of each item, also collect the offsets of all HOT
> > and non-HOT items to two separate arrays. Call heap_prune_chain() for
> > all the non-HOT items first, and then process any remaining HOT tuples
> > that haven't been marked yet.
> 
> That's an interesting idea. I'll try it out and see how it works.

Attached v10 implements this method of dividing tuples into HOT and
non-HOT and processing the potential HOT chains first then processing
tuples not marked by calling heap_prune_chain().

I have applied the refactoring of heap_prune_chain() to master and then
built the other patches on top of that.

I discovered while writing this that LP_DEAD item offsets must be in
order in the deadoffsets array (the one that is used to populate
LVRelState->dead_items).

When I changed heap_page_prune_and_freeze() to partition the offsets
into HOT and non-HOT during the first loop through the item pointers
array (where we get tuple visibility information), we add dead item
offsets as they are encountered. So, they are no longer in order. I've
added a quicksort of the deadoffsets array to satisfy vacuum.

I think that we are actually successfully removing more RECENTLY_DEAD
HOT tuples than in master with heap_page_prune()'s new approach, and I
think it is correct; but let me know if I am missing something.

The early patches in the set include some additional comment cleanup as
well. 0001 is fairly polished. 0004 could use some variable renaming
(this patch partitions the tuples into HOT and not HOT and then
processes them). I was struggling with some of the names here
(chainmembers and chaincandidates is confusing).

The bulk of the combining of pruning and freezing is lumped into 0010.

I had planned to separate 0010 into 4 separate patches: 1 to execute
freezing in heap_prune_chain(), 1 for the freeze heuristic approximating
what is on master, and 1 for emitting a single record containing both
the pruning and freezing page modifications.

I ended up not doing this because I felt like the grouping of changes in
0007-0009 is off. As long as I still execute freezing in
lazy_scan_prune(), I have to share lots of state between
lazy_scan_prune() and heap_page_prune(). This meant I added a lot of
parameters to heap_page_prune() that later commits removed -- making the
later patches noisy and not so easy to understand.

I'm actually not sure what should go in what commit (either for review
clarity or for the actual final version).

But, I think we should probably focus on review of the code and not as
much how it is split up yet.

The final state of the code could definitely use more cleanup. I've been
staring at it for awhile, so I could use some thoughts/ideas about what
part to focus on improving.

- Melanie

Attachment

Re: Combine Prune and Freeze records emitted by vacuum

From
Robert Haas
Date:
On Sat, Mar 30, 2024 at 1:57 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I think that we are actually successfully removing more RECENTLY_DEAD
> HOT tuples than in master with heap_page_prune()'s new approach, and I
> think it is correct; but let me know if I am missing something.

/me blinks.

Isn't zero the only correct number of RECENTLY_DEAD tuples to remove?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Sat, Mar 30, 2024 at 8:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Mar 30, 2024 at 1:57 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > I think that we are actually successfully removing more RECENTLY_DEAD
> > HOT tuples than in master with heap_page_prune()'s new approach, and I
> > think it is correct; but let me know if I am missing something.
>
> /me blinks.
>
> Isn't zero the only correct number of RECENTLY_DEAD tuples to remove?

At the top of the comment for heap_prune_chain() in master, it says

 * If the item is an index-referenced tuple (i.e. not a heap-only tuple),
 * the HOT chain is pruned by removing all DEAD tuples at the start of the HOT
 * chain.  We also prune any RECENTLY_DEAD tuples preceding a DEAD tuple.
 * This is OK because a RECENTLY_DEAD tuple preceding a DEAD tuple is really
 * DEAD, our visibility test is just too coarse to detect it.

Heikki had added a comment in one of his patches to the fast path for
HOT tuples at the top of heap_prune_chain():

             * Note that we might first arrive at a dead heap-only tuple
             * either while following a chain or here (in the fast
path).  Whichever path
             * gets there first will mark the tuple unused.
             *
             * Whether we arrive at the dead HOT tuple first here or while
             * following a chain above affects whether preceding RECENTLY_DEAD
             * tuples in the chain can be removed or not.  Imagine that you
             * have a chain with two tuples: RECENTLY_DEAD -> DEAD.  If we
             * reach the RECENTLY_DEAD tuple first, the chain-following logic
             * will find the DEAD tuple and conclude that both tuples are in
             * fact dead and can be removed.  But if we reach the DEAD tuple
             * at the end of the chain first, when we reach the RECENTLY_DEAD
             * tuple later, we will not follow the chain because the DEAD
             * TUPLE is already 'marked', and will not remove the
             * RECENTLY_DEAD tuple.  This is not a correctness issue, and the
             * RECENTLY_DEAD tuple will be removed by a later VACUUM.

My patch splits the tuples into HOT and non-HOT while gathering their
visibility information and first calls heap_prune_chain() on the
non-HOT tuples and then processes the yet unmarked HOT tuples in a
separate loop afterward. This will follow all of the chains and
process them completely as well as processing all HOT tuples which may
not be reachable from a valid chain. The fast path contains a special
check to ensure that line pointers for DEAD not HOT-updated HOT tuples
(dead orphaned tuples from aborted HOT updates) are still marked
LP_UNUSED even though they are not reachable from a valid HOT chain.
By doing this later, we don't preclude ourselves from following all
chains.

- Melanie



Re: Combine Prune and Freeze records emitted by vacuum

From
Heikki Linnakangas
Date:
On 30/03/2024 07:57, Melanie Plageman wrote:
> On Fri, Mar 29, 2024 at 12:32:21PM -0400, Melanie Plageman wrote:
>> On Fri, Mar 29, 2024 at 12:00 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> Here's another idea: In the first loop through the offsets, where we
>>> gather the HTSV status of each item, also collect the offsets of all HOT
>>> and non-HOT items to two separate arrays. Call heap_prune_chain() for
>>> all the non-HOT items first, and then process any remaining HOT tuples
>>> that haven't been marked yet.
>>
>> That's an interesting idea. I'll try it out and see how it works.
> 
> Attached v10 implements this method of dividing tuples into HOT and
> non-HOT and processing the potential HOT chains first then processing
> tuples not marked by calling heap_prune_chain().
> 
> I have applied the refactoring of heap_prune_chain() to master and then
> built the other patches on top of that.

Committed some of the changes. Continuing to reviewing the rest.

> I discovered while writing this that LP_DEAD item offsets must be in
> order in the deadoffsets array (the one that is used to populate
> LVRelState->dead_items).
> 
> When I changed heap_page_prune_and_freeze() to partition the offsets
> into HOT and non-HOT during the first loop through the item pointers
> array (where we get tuple visibility information), we add dead item
> offsets as they are encountered. So, they are no longer in order. I've
> added a quicksort of the deadoffsets array to satisfy vacuum.

Good catch.

> I think that we are actually successfully removing more RECENTLY_DEAD
> HOT tuples than in master with heap_page_prune()'s new approach, and I
> think it is correct; but let me know if I am missing something.

Yep. In the case of a two-item chain, RECENTLY_DEAD -> DEAD, the new 
code can always remove both items. On 'master', it depends on which item 
it happens to process first. If it processes the RECENTLY_DEAD item 
first, then it follows the chain and removes both. But if it processes 
the DEAD item first, the RECENTLY_DEAD item is left behind. It will be 
removed by the next VACUUM, so it's not a correctness issue, and 
probably doesn't make any practical performance difference either as 
it's a rare corner case, but I feel better that it's more deterministic now.

> The early patches in the set include some additional comment cleanup as
> well. 0001 is fairly polished. 0004 could use some variable renaming
> (this patch partitions the tuples into HOT and not HOT and then
> processes them). I was struggling with some of the names here
> (chainmembers and chaincandidates is confusing).

I didn't understand why you wanted to juggle both partitions in the same 
array. So I separated them into two arrays, and called them 'root_items' 
and 'heaponly_items'.

In some micro-benchmarks, the order that the items were processed made a 
measurable difference. So I'm processing the items in the reverse order. 
That roughly matches the order the items are processed in master, as it 
iterates the offsets from high-to-low in the first loop, and low-to-high 
in the second loop.

> The bulk of the combining of pruning and freezing is lumped into 0010.
> 
> I had planned to separate 0010 into 4 separate patches: 1 to execute
> freezing in heap_prune_chain(), 1 for the freeze heuristic approximating
> what is on master, and 1 for emitting a single record containing both
> the pruning and freezing page modifications.
> 
> I ended up not doing this because I felt like the grouping of changes in
> 0007-0009 is off. As long as I still execute freezing in
> lazy_scan_prune(), I have to share lots of state between
> lazy_scan_prune() and heap_page_prune(). This meant I added a lot of
> parameters to heap_page_prune() that later commits removed -- making the
> later patches noisy and not so easy to understand.
> 
> I'm actually not sure what should go in what commit (either for review
> clarity or for the actual final version).
> 
> But, I think we should probably focus on review of the code and not as
> much how it is split up yet.

Yeah, that's fine, 0010 is manageable-sized now.

> The final state of the code could definitely use more cleanup. I've been
> staring at it for awhile, so I could use some thoughts/ideas about what
> part to focus on improving.

Committed some of the changes. I plan to commit at least the first of 
these remaining patches later today. I'm happy with it now, but I'll 
give it a final glance over after dinner.

I'll continue to review the rest after that, but attached is what I have 
now.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Mon, Apr 01, 2024 at 05:17:51PM +0300, Heikki Linnakangas wrote:
> On 30/03/2024 07:57, Melanie Plageman wrote:
> > On Fri, Mar 29, 2024 at 12:32:21PM -0400, Melanie Plageman wrote:
> > > On Fri, Mar 29, 2024 at 12:00 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > > > Here's another idea: In the first loop through the offsets, where we
> > > > gather the HTSV status of each item, also collect the offsets of all HOT
> > > > and non-HOT items to two separate arrays. Call heap_prune_chain() for
> > > > all the non-HOT items first, and then process any remaining HOT tuples
> > > > that haven't been marked yet.
> > > 
> > > That's an interesting idea. I'll try it out and see how it works.
> > 
> > Attached v10 implements this method of dividing tuples into HOT and
> > non-HOT and processing the potential HOT chains first then processing
> > tuples not marked by calling heap_prune_chain().
> > 
> > I have applied the refactoring of heap_prune_chain() to master and then
> > built the other patches on top of that.
> 
> Committed some of the changes. Continuing to reviewing the rest.

Thanks!

> > The early patches in the set include some additional comment cleanup as
> > well. 0001 is fairly polished. 0004 could use some variable renaming
> > (this patch partitions the tuples into HOT and not HOT and then
> > processes them). I was struggling with some of the names here
> > (chainmembers and chaincandidates is confusing).
> 
> I didn't understand why you wanted to juggle both partitions in the same
> array. So I separated them into two arrays, and called them 'root_items' and
> 'heaponly_items'.

I thought it was worth it to save the space. And the algorithm for doing
it seemed pretty straightforward. But looking at your patch, it is a lot
easier to understand with two arrays (since, for example, they can each
have a name).

> In some micro-benchmarks, the order that the items were processed made a
> measurable difference. So I'm processing the items in the reverse order.
> That roughly matches the order the items are processed in master, as it
> iterates the offsets from high-to-low in the first loop, and low-to-high in
> the second loop.

This makes sense. I noticed you didn't there isn't a comment about this
above the loop. It might be worth it to mention it.

Below is a review of only 0001. I'll look at the others shortly.

> From a6ab891779876e7cc1b4fb6fddb09f52f0094646 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Mon, 1 Apr 2024 16:59:38 +0300
> Subject: [PATCH v11 1/7] Handle non-chain tuples outside of heap_prune_chain()
> ---
>  src/backend/access/heap/pruneheap.c | 264 +++++++++++++++++-----------
>  1 file changed, 166 insertions(+), 98 deletions(-)
> 
> diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
> @@ -256,15 +270,16 @@ heap_page_prune(Relation relation, Buffer buffer,
>      tup.t_tableOid = RelationGetRelid(relation);
>  
>      /*
> -     * Determine HTSV for all tuples.
> +     * Determine HTSV for all tuples, and queue them up for processing as HOT
> +     * chain roots or as a heap-only items.

Reading this comment now as a whole, I would add something like
"Determining HTSV for all tuples once is required for correctness" to
the start of the second paragraph. The new conjunction on the first
paragraph sentence followed by the next paragraph is a bit confusing
because it sounds like queuing them up for processing is required for
correctness (which, I suppose it is on some level). Basically, I'm just
saying that it is now less clear what is required for correctness.

>       * This is required for correctness to deal with cases where running HTSV
>       * twice could result in different results (e.g. RECENTLY_DEAD can turn to
>       * DEAD if another checked item causes GlobalVisTestIsRemovableFullXid()
>       * to update the horizon, INSERT_IN_PROGRESS can change to DEAD if the
> -     * inserting transaction aborts, ...). That in turn could cause
> -     * heap_prune_chain() to behave incorrectly if a tuple is reached twice,
> -     * once directly via a heap_prune_chain() and once following a HOT chain.
> +     * inserting transaction aborts, ...).  VACUUM assumes that there are no
> +     * normal DEAD tuples left on the page after pruning, so it needs to have
> +     * the same understanding of what is DEAD and what is not.
>       *
>       * It's also good for performance. Most commonly tuples within a page are
>       * stored at decreasing offsets (while the items are stored at increasing
> @@ -282,52 +297,140 @@ heap_page_prune(Relation relation, Buffer buffer,

> +    /*
> +     * Process any heap-only tuples that were not already processed as part of
> +     * a HOT chain.
> +     */

While I recognize this is a matter of style and not important, I
personally prefer this for reverse looping:

    for (int i = prstate.nheaponly_items; i --> 0;)

I do think a comment about the reverse order would be nice. I know it
says something above the first loop to this effect:

* Processing the items in reverse order (and thus the tuples in
* increasing order) increases prefetching efficiency significantly /
* decreases the number of cache misses.

So perhaps we could just say "as above, process the items in reverse
order"

- Melanie



Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Mon, Apr 01, 2024 at 05:17:51PM +0300, Heikki Linnakangas wrote:
> On 30/03/2024 07:57, Melanie Plageman wrote:
> 
> > The final state of the code could definitely use more cleanup. I've been
> > staring at it for awhile, so I could use some thoughts/ideas about what
> > part to focus on improving.
> 
> Committed some of the changes. I plan to commit at least the first of these
> remaining patches later today. I'm happy with it now, but I'll give it a
> final glance over after dinner.
> 
> I'll continue to review the rest after that, but attached is what I have
> now.

Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
you didn't modify them much/at all, but I noticed some things in my code
that could be better.

> From 17e183835a968e81daf7b74a4164b243e2de35aa Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 29 Mar 2024 19:43:09 -0400
> Subject: [PATCH v11 3/7] Introduce PRUNE_DO_* actions
> 
> We will eventually take additional actions in heap_page_prune() at the
> discretion of the caller. For now, introduce these PRUNE_DO_* macros and
> turn mark_unused_now, a paramter to heap_page_prune(), into a PRUNE_DO_

paramter -> parameter

> action.
> ---
>  src/backend/access/heap/pruneheap.c  | 51 ++++++++++++++--------------
>  src/backend/access/heap/vacuumlazy.c | 11 ++++--
>  src/include/access/heapam.h          | 13 ++++++-
>  3 files changed, 46 insertions(+), 29 deletions(-)
> 
> diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
> index fb0ad834f1b..30965c3c5a1 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -29,10 +29,11 @@
>  /* Working data for heap_page_prune and subroutines */
>  typedef struct
>  {
> +    /* PRUNE_DO_* arguments */
> +    uint8        actions;

I wasn't sure if actions is a good name. What do you think?

> +
>      /* tuple visibility test, initialized for the relation */
>      GlobalVisState *vistest;
> -    /* whether or not dead items can be set LP_UNUSED during pruning */
> -    bool        mark_unused_now;
>  
>      TransactionId new_prune_xid;    /* new prune hint value for page */
>      TransactionId snapshotConflictHorizon;    /* latest xid removed */

> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> index 32a3fbce961..35b8486c34a 100644
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
> @@ -191,6 +191,17 @@ typedef struct HeapPageFreeze
>  
>  } HeapPageFreeze;
>  
> +/*
> + * Actions that can be taken during pruning and freezing. By default, we will
> + * at least attempt regular pruning.
> + */
> +
> +/*
> + * PRUNE_DO_MARK_UNUSED_NOW indicates whether or not dead items can be set
> + * LP_UNUSED during pruning.
> + */
> +#define        PRUNE_DO_MARK_UNUSED_NOW (1 << 1)


No reason for me to waste the zeroth bit here. I just realized that I
did this with XLHP_IS_CATALOG_REL too.

#define        XLHP_IS_CATALOG_REL            (1 << 1)

> From 083690b946e19ab5e536a9f2689772e7b91d2a70 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 29 Mar 2024 21:22:14 -0400
> Subject: [PATCH v11 4/7] Prepare freeze tuples in heap_page_prune()
> 
> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> index 35b8486c34a..ac129692c13 100644
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
>  
> +/*
> + * Prepare to freeze if advantageous or required and try to advance
> + * relfrozenxid and relminmxid. To attempt freezing, we will need to determine
> + * if the page is all frozen. So, if this action is set, we will also inform
> + * the caller if the page is all-visible and/or all-frozen and calculate a

I guess we don't inform the caller if the page is all-visible, so this
is not quite right.

> + * snapshot conflict horizon for updating the visibility map.
> + */
> +#define        PRUNE_DO_TRY_FREEZE (1 << 2)

> From ef8cb2c089ad9474a6da309593029c08f71b0bb9 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 29 Mar 2024 21:36:37 -0400
> Subject: [PATCH v11 5/7] Set hastup in heap_page_prune
> 
> diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
> index 8bdd6389b25..65b0ed185ff 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -66,6 +66,9 @@ typedef struct
>      bool        processed[MaxHeapTuplesPerPage + 1];
>  
>      int            ndeleted;        /* Number of tuples deleted from the page */
> +
> +    /* Whether or not the page makes rel truncation unsafe */
> +    bool        hastup;
>  } PruneState;
>  
>  /* Local functions */
> @@ -271,6 +274,7 @@ heap_page_prune(Relation relation, Buffer buffer,
>      prstate.ndeleted = 0;
>      prstate.nroot_items = 0;
>      prstate.nheaponly_items = 0;
> +    prstate.hastup = false;
>  
>      /*
>       * If we will prepare to freeze tuples, consider that it might be possible
> @@ -280,7 +284,7 @@ heap_page_prune(Relation relation, Buffer buffer,
>          presult->all_frozen = true;
>      else
>          presult->all_frozen = false;
> -
> +    presult->hastup = prstate.hastup;
>  
>      /*
>       * presult->htsv is not initialized here because all ntuple spots in the
> @@ -819,6 +823,8 @@ heap_prune_record_redirect(PruneState *prstate,
>       */
>      if (was_normal)
>          prstate->ndeleted++;
> +
> +    prstate->hastup = true;
>  }
>  
>  /* Record line pointer to be marked dead */
> @@ -901,11 +907,15 @@ static void
>  heap_prune_record_unchanged_lp_normal(Page page, int8 *htsv, PruneState *prstate,
>                                        PruneResult *presult, OffsetNumber offnum)
>  {
> -    HeapTupleHeader htup = (HeapTupleHeader) PageGetItem(page, PageGetItemId(page, offnum));
> +    HeapTupleHeader htup;
>  
>      Assert(!prstate->processed[offnum]);
>      prstate->processed[offnum] = true;
>  
> +    presult->hastup = true;        /* the page is not empty */

My fault, but hastup being set sometimes in PruneState and sometimes in
PruneResult is quite unpalatable.


> From dffa90d0bd8e972cfe26da96860051dc8c2a8576 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Sat, 30 Mar 2024 01:27:08 -0400
> Subject: [PATCH v11 6/7] Save dead tuple offsets during heap_page_prune
> ---
> diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
> index 212d76045ef..7f1e4db55c0 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -1373,6 +1373,15 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
>      return false;
>  }
>  
> +static int
> +OffsetNumber_cmp(const void *a, const void *b)
> +{
> +    OffsetNumber na = *(const OffsetNumber *) a,
> +                nb = *(const OffsetNumber *) b;
> +
> +    return na < nb ? -1 : na > nb;
> +}

This probably doesn't belong here. I noticed spgdoinsert.c had a static
function for sorting OffsetNumbers, but I didn't see anything general
purpose anywhere else.

- Melanie



Re: Combine Prune and Freeze records emitted by vacuum

From
Heikki Linnakangas
Date:
On 01/04/2024 19:08, Melanie Plageman wrote:
> On Mon, Apr 01, 2024 at 05:17:51PM +0300, Heikki Linnakangas wrote:
>> diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
>> @@ -256,15 +270,16 @@ heap_page_prune(Relation relation, Buffer buffer,
>>       tup.t_tableOid = RelationGetRelid(relation);
>>   
>>       /*
>> -     * Determine HTSV for all tuples.
>> +     * Determine HTSV for all tuples, and queue them up for processing as HOT
>> +     * chain roots or as a heap-only items.
> 
> Reading this comment now as a whole, I would add something like
> "Determining HTSV for all tuples once is required for correctness" to
> the start of the second paragraph. The new conjunction on the first
> paragraph sentence followed by the next paragraph is a bit confusing
> because it sounds like queuing them up for processing is required for
> correctness (which, I suppose it is on some level). Basically, I'm just
> saying that it is now less clear what is required for correctness.

Fixed.

> While I recognize this is a matter of style and not important, I
> personally prefer this for reverse looping:
> 
>     for (int i = prstate.nheaponly_items; i --> 0;)

I don't think we use that style anywhere in the Postgres source tree 
currently. (And I don't like it ;-) )

> I do think a comment about the reverse order would be nice. I know it
> says something above the first loop to this effect:
> 
> * Processing the items in reverse order (and thus the tuples in
> * increasing order) increases prefetching efficiency significantly /
> * decreases the number of cache misses.
> 
> So perhaps we could just say "as above, process the items in reverse
> order"

I'm actually not sure why it makes a difference. I would assume all the 
data to already be in CPU cache at this point, since the first loop 
already accessed it, so I think there's something else going on. But I 
didn't investigate it deeper. Anyway, added a comment.

Committed the first of the remaining patches with those changes. And 
also this, which is worth calling out:

>         if (prstate.htsv[offnum] == HEAPTUPLE_DEAD)
>         {
>             ItemId        itemid = PageGetItemId(page, offnum);
>             HeapTupleHeader htup = (HeapTupleHeader) PageGetItem(page, itemid);
> 
>             if (likely(!HeapTupleHeaderIsHotUpdated(htup)))
>             {
>                 HeapTupleHeaderAdvanceConflictHorizon(htup,
>                                                       &prstate.latest_xid_removed);
>                 heap_prune_record_unused(&prstate, offnum, true);
>             }
>             else
>             {
>                 /*
>                  * This tuple should've been processed and removed as part of
>                  * a HOT chain, so something's wrong.  To preserve evidence,
>                  * we don't dare to remove it.  We cannot leave behind a DEAD
>                  * tuple either, because that will cause VACUUM to error out.
>                  * Throwing an error with a distinct error message seems like
>                  * the least bad option.
>                  */
>                 elog(ERROR, "dead heap-only tuple (%u, %d) is not linked to from any HOT chain",
>                      blockno, offnum);
>             }
>         }
>         else
>             heap_prune_record_unchanged_lp_normal(page, &prstate, offnum);

As you can see, I turned that into a hard error. Previously, that code 
was at the top of heap_prune_chain(), and it was normal to see DEAD 
heap-only tuples there, because they would presumably get processed 
later as part of a HOT chain. But now this is done after all the HOT 
chains have already been processed.

Previously if there was a dead heap-only tuple like that on the page for 
some reason, it was silently not processed by heap_prune_chain() 
(because it was assumed that it would be processed later as part of a 
HOT chain), and was left behind as a HEAPTUPLE_DEAD tuple. If the 
pruning was done as part of VACUUM, VACUUM would fail with "ERROR: 
unexpected HeapTupleSatisfiesVacuum result". Or am I missing something?

Now you get that above error also on on-access pruning, which is not 
ideal. But I don't remember hearing about corruption like that, and 
you'd get the error on VACUUM anyway.

With the next patches, heap_prune_record_unchanged() will do more, and 
will also throw an error on a HEAPTUPLE_LIVE tuple, so even though in 
the first patch we could print just a WARNING and move on, it gets more 
awkward with the rest of the patches.

(Continuing with the remaining patches..)

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Mon, Apr 1, 2024 at 1:37 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Committed the first of the remaining patches with those changes. And
> also this, which is worth calling out:
>
> >               if (prstate.htsv[offnum] == HEAPTUPLE_DEAD)
> >               {
> >                       ItemId          itemid = PageGetItemId(page, offnum);
> >                       HeapTupleHeader htup = (HeapTupleHeader) PageGetItem(page, itemid);
> >
> >                       if (likely(!HeapTupleHeaderIsHotUpdated(htup)))
> >                       {
> >                               HeapTupleHeaderAdvanceConflictHorizon(htup,
> >
&prstate.latest_xid_removed);
> >                               heap_prune_record_unused(&prstate, offnum, true);
> >                       }
> >                       else
> >                       {
> >                               /*
> >                                * This tuple should've been processed and removed as part of
> >                                * a HOT chain, so something's wrong.  To preserve evidence,
> >                                * we don't dare to remove it.  We cannot leave behind a DEAD
> >                                * tuple either, because that will cause VACUUM to error out.
> >                                * Throwing an error with a distinct error message seems like
> >                                * the least bad option.
> >                                */
> >                               elog(ERROR, "dead heap-only tuple (%u, %d) is not linked to from any HOT chain",
> >                                        blockno, offnum);
> >                       }
> >               }
> >               else
> >                       heap_prune_record_unchanged_lp_normal(page, &prstate, offnum);
>
> As you can see, I turned that into a hard error. Previously, that code
> was at the top of heap_prune_chain(), and it was normal to see DEAD
> heap-only tuples there, because they would presumably get processed
> later as part of a HOT chain. But now this is done after all the HOT
> chains have already been processed.
>
> Previously if there was a dead heap-only tuple like that on the page for
> some reason, it was silently not processed by heap_prune_chain()
> (because it was assumed that it would be processed later as part of a
> HOT chain), and was left behind as a HEAPTUPLE_DEAD tuple. If the
> pruning was done as part of VACUUM, VACUUM would fail with "ERROR:
> unexpected HeapTupleSatisfiesVacuum result". Or am I missing something?

I think you are right. I wasn't sure if there was some way for a HOT,
DEAD tuple to be not HOT-updated, but that doesn't make much sense.

> Now you get that above error also on on-access pruning, which is not
> ideal. But I don't remember hearing about corruption like that, and
> you'd get the error on VACUUM anyway.

Yea, that makes sense. One thing I don't really understand is why
vacuum has its own system for saving and restoring error information
for context messages (LVSavedErrInfo and
update/restore_vacuum_err_info()). I'll confess I don't know much
about how error cleanup works in any sub-system. But it stuck out to
me that vacuum has its own. I assume it is okay to add new error
messages and they somehow will work with the existing system?

- Melanie



Re: Combine Prune and Freeze records emitted by vacuum

From
Heikki Linnakangas
Date:
On 01/04/2024 20:22, Melanie Plageman wrote:
>>  From 17e183835a968e81daf7b74a4164b243e2de35aa Mon Sep 17 00:00:00 2001
>> From: Melanie Plageman <melanieplageman@gmail.com>
>> Date: Fri, 29 Mar 2024 19:43:09 -0400
>> Subject: [PATCH v11 3/7] Introduce PRUNE_DO_* actions
>>
>> We will eventually take additional actions in heap_page_prune() at the
>> discretion of the caller. For now, introduce these PRUNE_DO_* macros and
>> turn mark_unused_now, a paramter to heap_page_prune(), into a PRUNE_DO_
> 
> paramter -> parameter
> 
>> action.
>> ---
>>   src/backend/access/heap/pruneheap.c  | 51 ++++++++++++++--------------
>>   src/backend/access/heap/vacuumlazy.c | 11 ++++--
>>   src/include/access/heapam.h          | 13 ++++++-
>>   3 files changed, 46 insertions(+), 29 deletions(-)
>>
>> diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
>> index fb0ad834f1b..30965c3c5a1 100644
>> --- a/src/backend/access/heap/pruneheap.c
>> +++ b/src/backend/access/heap/pruneheap.c
>> @@ -29,10 +29,11 @@
>>   /* Working data for heap_page_prune and subroutines */
>>   typedef struct
>>   {
>> +    /* PRUNE_DO_* arguments */
>> +    uint8        actions;
> 
> I wasn't sure if actions is a good name. What do you think?

Committed this part, with the name 'options'. There's some precedent for 
that in heap_insert().

I decided to keep it a separate bool field here in the PruneState 
struct, though, and only changed it in the heap_page_prune() function 
signature. It didn't feel worth the code churn here, and 
'prstate.mark_unused_now' is a shorter than "(prstate.options & 
HEAP_PRUNE_PAGE_MARK_UNUSED_NOW) != 0" anyway.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Combine Prune and Freeze records emitted by vacuum

From
Heikki Linnakangas
Date:
On 01/04/2024 20:22, Melanie Plageman wrote:
> Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
> you didn't modify them much/at all, but I noticed some things in my code
> that could be better.

Ok, here's what I have now. I made a lot of small comment changes here 
and there, and some minor local refactorings, but nothing major. I lost 
track of all the individual changes I'm afraid, so I'm afraid you'll 
have to just diff against the previous version if you want to see what's 
changed. I hope I didn't break anything.

I'm pretty happy with this now. I will skim through it one more time 
later today or tomorrow, and commit. Please review once more if you have 
a chance.

> This probably doesn't belong here. I noticed spgdoinsert.c had a static
> function for sorting OffsetNumbers, but I didn't see anything general
> purpose anywhere else.

I copied the spgdoinsert.c implementation to vacuumlazy.c as is. Would 
be nice to have just one copy of this in some common place, but I also 
wasn't sure where to put it.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Tue, Apr 2, 2024 at 9:11 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 01/04/2024 20:22, Melanie Plageman wrote:
> > Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
> > you didn't modify them much/at all, but I noticed some things in my code
> > that could be better.
>
> Ok, here's what I have now. I made a lot of small comment changes here
> and there, and some minor local refactorings, but nothing major. I lost
> track of all the individual changes I'm afraid, so I'm afraid you'll
> have to just diff against the previous version if you want to see what's
> changed. I hope I didn't break anything.
>
> I'm pretty happy with this now. I will skim through it one more time
> later today or tomorrow, and commit. Please review once more if you have
> a chance.

Thanks!

0001 looks good. Attached are some comment updates and such on top of
0001 and 0002.

I started some performance testing of 0002 but haven't finished yet. I
wanted to provide my other review first.

> > This probably doesn't belong here. I noticed spgdoinsert.c had a static
> > function for sorting OffsetNumbers, but I didn't see anything general
> > purpose anywhere else.
>
> I copied the spgdoinsert.c implementation to vacuumlazy.c as is. Would
> be nice to have just one copy of this in some common place, but I also
> wasn't sure where to put it.

I looked a bit through utils and common and didn't see anywhere
obvious to put it. We could make a new file? 0003 fixes where you
forgot to change the name of the qsort function, though.

- Melanie

Attachment

Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Tue, Apr 02, 2024 at 01:24:27PM -0400, Melanie Plageman wrote:
> On Tue, Apr 2, 2024 at 9:11 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 01/04/2024 20:22, Melanie Plageman wrote:
> > > Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
> > > you didn't modify them much/at all, but I noticed some things in my code
> > > that could be better.
> >
> > Ok, here's what I have now. I made a lot of small comment changes here
> > and there, and some minor local refactorings, but nothing major. I lost
> > track of all the individual changes I'm afraid, so I'm afraid you'll
> > have to just diff against the previous version if you want to see what's
> > changed. I hope I didn't break anything.
> >
> > I'm pretty happy with this now. I will skim through it one more time
> > later today or tomorrow, and commit. Please review once more if you have
> > a chance.
> 
> Thanks!
> 
> 0001 looks good. Attached are some comment updates and such on top of
> 0001 and 0002.
> 
> I started some performance testing of 0002 but haven't finished yet. I
> wanted to provide my other review first.

I tried to do some performance tests of just on-access HOT pruning with
the patches in this thread applied. I'm not sure if I succeeded in being
targeted enough to have usable results.

Off-list Andres gave me some suggestions of how to improve my test case
and setup and this is what I ended up doing:

----------------------------------------
On-access pruning during a SELECT query:
----------------------------------------

# Make a table with a single not NULL column of a small datatype to fit
# as many tuples as possible on the page so each page we prune exercises
# those loops in heap_page_prune_and_freeze() and heap_prune_chain() as
# much as possible

psql -c "create table small(col smallint not null)"

# Insert data that is the same except for ~1 row per page with a different value

for i in $(seq 1000)
do
  psql -c "INSERT INTO small VALUES(2);" -c "INSERT INTO small SELECT 1 FROM (SELECT generate_series(1,220));"
done

# COPY this data to a file
psql -c "COPY small TO '/tmp/small.data';"

# Start postgres bound to a single CPU core

# Run the following script with pgbench

    # Make the table unlogged table so we don't see the effects of WAL writes in
    # results
    #
    # Make sure autovacuum doesn't run on the table
  drop table if exists small;
  create unlogged table small(col smallint not null) with (autovacuum_enabled = false);
  copy small from '/tmp/small.data';
  update small set col = 9 where col = 2;
  select * from small where col = 0;

pgbench -n -f query.sql -t 100 -M prepared -r

# (I made sure that HOT pruning was actually happening for the SELECT
# query before running this with pgbench)

# There seemed to be no meaningful difference for this example with the
# patches:

on current master:
statement latencies in milliseconds and failures:
        12.387           0  drop table if exists small;
         1.914           0  create unlogged table small(col smallint not null) with (autovacuum_enabled = false);
       100.152           0  copy small from '/tmp/small.data';
        49.480           0  update small set col = 9 where col = 2;
        46.835           0  select * from small where col = 0;

with the patches applied:
statement latencies in milliseconds and failures:
        13.281           0  drop table if exists small;
         1.952           0  create unlogged table small(col smallint not null) with (autovacuum_enabled = false);
        99.418           0  copy small from '/tmp/small.data';
        47.397           0  update small set col = 9 where col = 2;
        46.887           0  select * from small where col = 0;

--------------------------------
On-access pruning during UPDATE:
--------------------------------

# The idea is to test a query which would be calling the new
# heap_prune_record_unchanged_lp_normal() function a lot

# I made the same table but filled entirely with the same value

psql -c "create table small(col smallint not null)" \
        -c "INSERT INTO small SELECT 1 FROM (SELECT generate_series(1, 221000));"

# COPY this data to a file
psql -c "COPY small TO '/tmp/small_univalue.data';"

# Start postgres bound to a single CPU core

# Run the following script with pgbench

    # Pick a low fillfactor so we have room for the HOT updates
  drop table if exists small;
  create unlogged table small(col smallint not null) with (autovacuum_enabled = false, fillfactor=25);
  copy small from '/tmp/small_univalue.data';
  update small set col = 3;
  update small set col = 4;
  update small set col = 5;
  update small set col = 6;

pgbench -n -f update.sql -t 20 -M prepared -r

# There again seems to be no meaningful difference with the patches
# applied

on master:
statement latencies in milliseconds and failures:
        19.880           0  drop table if exists small;
         2.099           0  create unlogged table small(col smallint not null) with (autovacuum_enabled = false,
fillfactor=25);
       130.793           0  copy small from '/tmp/small_univalue.data';
       377.707           0  update small set col = 3;
       417.644           0  update small set col = 4;
       483.974           0  update small set col = 5;
       422.956           0  update small set col = 6;

with patches applied:
statement latencies in milliseconds and failures:
        19.995           0  drop table if exists small;
         2.034           0  create unlogged table small(col smallint not null) with (autovacuum_enabled = false,
fillfactor=25);
       124.270           0  copy small from '/tmp/small_univalue.data';
       375.327           0  update small set col = 3;
       419.336           0  update small set col = 4;
       483.750           0  update small set col = 5;
       420.451           0  update small set col = 6;

- Melanie



Re: Combine Prune and Freeze records emitted by vacuum

From
Heikki Linnakangas
Date:
On 02/04/2024 16:11, Heikki Linnakangas wrote:
> On 01/04/2024 20:22, Melanie Plageman wrote:
>> Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
>> you didn't modify them much/at all, but I noticed some things in my code
>> that could be better.
> 
> Ok, here's what I have now. I made a lot of small comment changes here
> and there, and some minor local refactorings, but nothing major. I lost
> track of all the individual changes I'm afraid, so I'm afraid you'll
> have to just diff against the previous version if you want to see what's
> changed. I hope I didn't break anything.
> 
> I'm pretty happy with this now. I will skim through it one more time
> later today or tomorrow, and commit. Please review once more if you have
> a chance.
> 
>> This probably doesn't belong here. I noticed spgdoinsert.c had a static
>> function for sorting OffsetNumbers, but I didn't see anything general
>> purpose anywhere else.
> 
> I copied the spgdoinsert.c implementation to vacuumlazy.c as is. Would
> be nice to have just one copy of this in some common place, but I also
> wasn't sure where to put it.

One more version, with two small fixes:

1. I fumbled the offsetnumber-cmp function at the last minute so that it 
didn't compile. Fixed. that

2. On VACUUM on an unlogged or temp table, the logic always thought that 
we would be generating an FPI, causing it to always freeze when it 
could. But of course, you never generate FPIs on an unlogged table. 
Fixed that. (Perhaps we should indeed freeze more aggressively on an 
unlogged table, but changing the heuristic is out of scope for this patch.)

Off-list, Melanie reported that there is a small regression with the 
benchmark script she posted yesterday, after all, but I'm not able to 
reproduce that.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Wed, Apr 3, 2024 at 8:39 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 02/04/2024 16:11, Heikki Linnakangas wrote:
> > On 01/04/2024 20:22, Melanie Plageman wrote:
> >> Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
> >> you didn't modify them much/at all, but I noticed some things in my code
> >> that could be better.
> >
> > Ok, here's what I have now. I made a lot of small comment changes here
> > and there, and some minor local refactorings, but nothing major. I lost
> > track of all the individual changes I'm afraid, so I'm afraid you'll
> > have to just diff against the previous version if you want to see what's
> > changed. I hope I didn't break anything.
> >
> > I'm pretty happy with this now. I will skim through it one more time
> > later today or tomorrow, and commit. Please review once more if you have
> > a chance.
> >
> >> This probably doesn't belong here. I noticed spgdoinsert.c had a static
> >> function for sorting OffsetNumbers, but I didn't see anything general
> >> purpose anywhere else.
> >
> > I copied the spgdoinsert.c implementation to vacuumlazy.c as is. Would
> > be nice to have just one copy of this in some common place, but I also
> > wasn't sure where to put it.
>
> One more version, with two small fixes:
>
> 1. I fumbled the offsetnumber-cmp function at the last minute so that it
> didn't compile. Fixed. that

I noticed you didn't make the comment updates I suggested in my
version 13 here [1]. A few of them are outdated references to
heap_page_prune() and one to a now deleted variable name
(all_visible_except_removable).

I applied them to your v13 and attached the diff.

> Off-list, Melanie reported that there is a small regression with the
> benchmark script she posted yesterday, after all, but I'm not able to
> reproduce that.

Actually, I think it was noise.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_aPqZkThyfr0USaHp-3cN_ruEdAHBKtNQJqXDTjWUz0rw%40mail.gmail.com

Attachment

Re: Combine Prune and Freeze records emitted by vacuum

From
Heikki Linnakangas
Date:
On 03/04/2024 17:18, Melanie Plageman wrote:
> I noticed you didn't make the comment updates I suggested in my
> version 13 here [1]. A few of them are outdated references to
> heap_page_prune() and one to a now deleted variable name
> (all_visible_except_removable).
> 
> I applied them to your v13 and attached the diff.

Applied those changes, and committed. Thank you!

>> Off-list, Melanie reported that there is a small regression with the
>> benchmark script she posted yesterday, after all, but I'm not able to
>> reproduce that.
> 
> Actually, I think it was noise.

Ok, phew.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Combine Prune and Freeze records emitted by vacuum

From
Melanie Plageman
Date:
On Wed, Apr 3, 2024 at 12:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 03/04/2024 17:18, Melanie Plageman wrote:
> > I noticed you didn't make the comment updates I suggested in my
> > version 13 here [1]. A few of them are outdated references to
> > heap_page_prune() and one to a now deleted variable name
> > (all_visible_except_removable).
> >
> > I applied them to your v13 and attached the diff.
>
> Applied those changes, and committed. Thank you!

Thanks! And thanks for updating the commitfest entry!

- Melanie



Re: Combine Prune and Freeze records emitted by vacuum

From
Peter Geoghegan
Date:
On Wed, Apr 3, 2024 at 1:04 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Thanks! And thanks for updating the commitfest entry!

Nice work!

--
Peter Geoghegan