Thread: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

Hi,

If vacuum fails to remove a tuple with xmax older than
VacuumCutoffs->OldestXmin and younger than
GlobalVisState->maybe_needed, it will ERROR out when determining
whether or not to freeze the tuple with "cannot freeze committed
xmax".

In back branches starting with 14, failing to remove tuples older than
OldestXmin during pruning caused vacuum to infinitely loop in
lazy_scan_prune(), as investigated on this [1] thread.

On master, after 1ccc1e05ae removed the retry loop in
lazy_scan_prune() and stopped comparing tuples to OldestXmin, the hang
could no longer happen, but we can still attempt to freeze dead tuples
visibly killed before OldestXmin -- resulting in an ERROR.

Pruning may fail to remove dead tuples with xmax before OldestXmin if
the tuple is not considered removable by GlobalVisState.

For vacuum, the GlobalVisState is initially calculated at the
beginning of vacuuming the relation -- at the same time and with the
same value as VacuumCutoffs->OldestXmin.
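
For reference, this is roughly how heap_vacuum_rel() sets both up today
(paraphrased, not the exact code):

    /* both values are derived from the same horizon computation */
    vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
                                        /* sets cutoffs.OldestXmin via
                                         * GetOldestNonRemovableTransactionId() */
    vacrel->vistest = GlobalVisTestFor(rel);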

A backend's GlobalVisState may be updated again when it is accessed if
a new snapshot is taken or if something caused ComputeXidHorizons() to
be called.

This can happen, for example, at the end of a round of index vacuuming
when GetOldestNonRemovableTransactionId() is called.
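
For the index vacuuming case, the path is roughly this (paraphrased from
the btree code):

    _bt_pendingfsm_finalize()
      -> GetOldestNonRemovableTransactionId(NULL)
         -> ComputeXidHorizons()   /* also refreshes this backend's GlobalVisState */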

Normally this may result in GlobalVisState's horizon moving forward --
potentially allowing more dead tuples to be removed.

However, if a disconnected standby with a running transaction older
than VacuumCutoffs->OldestXmin reconnects to the primary after vacuum
initially calculates GlobalVisState and OldestXmin but before
GlobalVisState is updated, the value of GlobalVisState->maybe_needed
could go backwards.

If this happens in the middle of vacuum's first pruning and freezing
pass, it is possible that pruning/freezing could then encounter a
tuple whose xmax is younger than GlobalVisState->maybe_needed and
older than VacuumCutoffs->OldestXmin. heap_prune_satisfies_vacuum()
would deem the tuple HEAPTUPLE_RECENTLY_DEAD and would not remove it.
But the heap_pre_freeze_checks() would ERROR out with "cannot freeze
committed xmax". This check is to avoid freezing dead tuples.

We can fix this by always removing tuples considered dead before
VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
has a transaction that sees that tuple as alive, because it will
simply wait to replay the removal until it would be correct to do so
or recovery conflict handling will cancel the transaction that sees
the tuple as alive and allow replay to continue.
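
Concretely, the new check in heap_prune_satisfies_vacuum() looks roughly
like this (a sketch of the attached patch, not the exact diff):

    res = HeapTupleSatisfiesVacuumHorizon(tup, buffer, &dead_after);

    if (res == HEAPTUPLE_RECENTLY_DEAD)
    {
        /*
         * Tuples deleted before OldestXmin must be removed even if the
         * GlobalVisState horizon has moved backwards and still considers
         * them RECENTLY_DEAD -- otherwise we might later try to freeze
         * them.
         */
        if (prstate->cutoffs &&
            TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
            NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
            res = HEAPTUPLE_DEAD;
        else if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
            res = HEAPTUPLE_DEAD;
    }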

Attached is the suggested fix for master plus a repro. I wrote it as a
recovery suite TAP test, but I am _not_ proposing we add it to the
ongoing test suite. It is, amongst other things, definitely prone to
flaking. I also had to use loads of data to force two index vacuuming
passes now that we have TIDStore, so it is a slow test.

If you want to run the repro with meson, you'll have to add
't/099_vacuum_hang.pl' to src/test/recovery/meson.build and then run it with:

meson test postgresql:recovery / recovery/099_vacuum_hang

If you use autotools, you can run it with:
make check PROVE_TESTS="t/099_vacuum_hang.pl"

The repro forces a round of index vacuuming after the standby
reconnects and before pruning a dead tuple whose xmax is older than
OldestXmin.

At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
calls GetOldestNonRemovableTransactionId(), thereby updating the
backend's GlobalVisState and moving maybe_needed backwards.

Then vacuum's first pass will continue with pruning and find our later
inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to
maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin.

I make sure that the standby reconnects between vacuum_get_cutoffs()
and pruning because I have a cursor on the page keeping VACUUM FREEZE
from getting a cleanup lock.

See the repro for step-by-step explanations of how it works.

I have a modified version of this that repros the infinite loop on
14-16 with substantially less data. See it here [2]. Also, the repro
attached to this mail won't work on 14 and 15 because of changes to
background_psql.

- Melanie

[1] https://postgr.es/m/20240415173913.4zyyrwaftujxthf2%40awork3.anarazel.de#1b216b7768b5bd577a3d3d51bd5aadee
[2] https://www.postgresql.org/message-id/CAAKRu_Y_NJzF4-8gzTTeaOuUL3CcGoXPjXcAHbTTygT8AyVqag%40mail.gmail.com

On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> If vacuum fails to remove a tuple with xmax older than
> VacuumCutoffs->OldestXmin and younger than
> GlobalVisState->maybe_needed, it will ERROR out when determining
> whether or not to freeze the tuple with "cannot freeze committed
> xmax".
>
> In back branches starting with 14, failing to remove tuples older than
> OldestXmin during pruning caused vacuum to infinitely loop in
> lazy_scan_prune(), as investigated on this [1] thread.

This is a great summary.

> We can fix this by always removing tuples considered dead before
> VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
> has a transaction that sees that tuple as alive, because it will
> simply wait to replay the removal until it would be correct to do so
> or recovery conflict handling will cancel the transaction that sees
> the tuple as alive and allow replay to continue.

I think that this is the right general approach.

> The repro forces a round of index vacuuming after the standby
> reconnects and before pruning a dead tuple whose xmax is older than
> OldestXmin.
>
> At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
> calls GetOldestNonRemovableTransactionId(), thereby updating the
> backend's GlobalVisState and moving maybe_needed backwards.

Right. I saw details exactly consistent with this when I used GDB
against a production instance.

I'm glad that you were able to come up with a repro that involves
exactly the same basic elements, including index page deletion.

--
Peter Geoghegan



Hi, Melanie! I'm glad to hear that you have found the root cause of the problem) Thank you for that!

On 21.06.2024 02:42, Melanie Plageman wrote:
> Hi,
>
> If vacuum fails to remove a tuple with xmax older than
> VacuumCutoffs->OldestXmin and younger than
> GlobalVisState->maybe_needed, it will ERROR out when determining
> whether or not to freeze the tuple with "cannot freeze committed
> xmax".
>
> In back branches starting with 14, failing to remove tuples older than
> OldestXmin during pruning caused vacuum to infinitely loop in
> lazy_scan_prune(), as investigated on this [1] thread.
>
> On master, after 1ccc1e05ae removed the retry loop in
> lazy_scan_prune() and stopped comparing tuples to OldestXmin, the hang
> could no longer happen, but we can still attempt to freeze dead tuples
> visibly killed before OldestXmin -- resulting in an ERROR.
>
> Pruning may fail to remove dead tuples with xmax before OldestXmin if
> the tuple is not considered removable by GlobalVisState.
>
> For vacuum, the GlobalVisState is initially calculated at the
> beginning of vacuuming the relation -- at the same time and with the
> same value as VacuumCutoffs->OldestXmin.
>
> A backend's GlobalVisState may be updated again when it is accessed if
> a new snapshot is taken or if something caused ComputeXidHorizons() to
> be called.
>
> This can happen, for example, at the end of a round of index vacuuming
> when GetOldestNonRemovableTransactionId() is called.
>
> Normally this may result in GlobalVisState's horizon moving forward --
> potentially allowing more dead tuples to be removed.
>
> However, if a disconnected standby with a running transaction older
> than VacuumCutoffs->OldestXmin reconnects to the primary after vacuum
> initially calculates GlobalVisState and OldestXmin but before
> GlobalVisState is updated, the value of GlobalVisState->maybe_needed
> could go backwards.
>
> If this happens in the middle of vacuum's first pruning and freezing
> pass, it is possible that pruning/freezing could then encounter a
> tuple whose xmax is younger than GlobalVisState->maybe_needed and
> older than VacuumCutoffs->OldestXmin. heap_prune_satisfies_vacuum()
> would deem the tuple HEAPTUPLE_RECENTLY_DEAD and would not remove it.
> But the heap_pre_freeze_checks() would ERROR out with "cannot freeze
> committed xmax". This check is to avoid freezing dead tuples.
>
> We can fix this by always removing tuples considered dead before
> VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
> has a transaction that sees that tuple as alive, because it will
> simply wait to replay the removal until it would be correct to do so
> or recovery conflict handling will cancel the transaction that sees
> the tuple as alive and allow replay to continue.

This is an interesting and difficult case) I noticed that when initializing the cluster, in my opinion, we do excessive freezing. Initialization takes a long time, which can lead, for example, to longer test execution. I got rid of this by adding a check that OldestMxact is not FirstMultiXactId, and it works fine.

if (prstate->cutoffs &&
    TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
    prstate->cutoffs->OldestMxact != FirstMultiXactId &&
    NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
    return HEAPTUPLE_DEAD;

Can I keep it?

> Attached is the suggested fix for master plus a repro. I wrote it as a
> recovery suite TAP test, but I am _not_ proposing we add it to the
> ongoing test suite. It is, amongst other things, definitely prone to
> flaking. I also had to use loads of data to force two index vacuuming
> passes now that we have TIDStore, so it is a slow test.
>
> If you want to run the repro with meson, you'll have to add
> 't/099_vacuum_hang.pl' to src/test/recovery/meson.build and then run it with:
>
> meson test postgresql:recovery / recovery/099_vacuum_hang
>
> If you use autotools, you can run it with:
> make check PROVE_TESTS="t/099_vacuum_hang.pl"
>
> The repro forces a round of index vacuuming after the standby
> reconnects and before pruning a dead tuple whose xmax is older than
> OldestXmin.
>
> At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
> calls GetOldestNonRemovableTransactionId(), thereby updating the
> backend's GlobalVisState and moving maybe_needed backwards.
>
> Then vacuum's first pass will continue with pruning and find our later
> inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to
> maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin.
>
> I make sure that the standby reconnects between vacuum_get_cutoffs()
> and pruning because I have a cursor on the page keeping VACUUM FREEZE
> from getting a cleanup lock.
>
> See the repro for step-by-step explanations of how it works.
>
> I have a modified version of this that repros the infinite loop on
> 14-16 with substantially less data. See it here [2]. Also, the repro
> attached to this mail won't work on 14 and 15 because of changes to
> background_psql.
>
> [1] https://postgr.es/m/20240415173913.4zyyrwaftujxthf2%40awork3.anarazel.de#1b216b7768b5bd577a3d3d51bd5aadee
> [2] https://www.postgresql.org/message-id/CAAKRu_Y_NJzF4-8gzTTeaOuUL3CcGoXPjXcAHbTTygT8AyVqag%40mail.gmail.com

To be honest, the meson test is new to me, but I see its useful features. I think I will use it for checking my features)

I couldn't understand why the replica is necessary here. Now I am digging into why I got similar behavior without a replica, when I have only one instance. I'm still checking this in my test, but I believe this patch fixes the original problem because the symptoms were the same.

-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 21/06/2024 03:02, Peter Geoghegan wrote:
> On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>> If vacuum fails to remove a tuple with xmax older than
>> VacuumCutoffs->OldestXmin and younger than
>> GlobalVisState->maybe_needed, it will ERROR out when determining
>> whether or not to freeze the tuple with "cannot freeze committed
>> xmax".
>>
>> In back branches starting with 14, failing to remove tuples older than
>> OldestXmin during pruning caused vacuum to infinitely loop in
>> lazy_scan_prune(), as investigated on this [1] thread.
> 
> This is a great summary.

+1

>> We can fix this by always removing tuples considered dead before
>> VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
>> has a transaction that sees that tuple as alive, because it will
>> simply wait to replay the removal until it would be correct to do so
>> or recovery conflict handling will cancel the transaction that sees
>> the tuple as alive and allow replay to continue.
> 
> I think that this is the right general approach.

+1

>> The repro forces a round of index vacuuming after the standby
>> reconnects and before pruning a dead tuple whose xmax is older than
>> OldestXmin.
>>
>> At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
>> calls GetOldestNonRemovableTransactionId(), thereby updating the
>> backend's GlobalVisState and moving maybe_needed backwards.
> 
> Right. I saw details exactly consistent with this when I used GDB
> against a production instance.
> 
> I'm glad that you were able to come up with a repro that involves
> exactly the same basic elements, including index page deletion.

Would it be possible to make it robust so that we could always run it 
with "make check"? This seems like an important corner case to 
regression test.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




On Mon, Jun 24, 2024 at 4:10 AM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
>
> We can fix this by always removing tuples considered dead before
> VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
> has a transaction that sees that tuple as alive, because it will
> simply wait to replay the removal until it would be correct to do so
> or recovery conflict handling will cancel the transaction that sees
> the tuple as alive and allow replay to continue.
>
> This is an interesting and difficult case) I noticed that when initializing the cluster, in my opinion, we provide
> excessive freezing. Initialization takes a long time, which can lead, for example, to longer test execution. I got rid
> of this by adding the OldestMxact checkbox is not FirstMultiXactId, and it works fine.
>
> if (prstate->cutoffs &&
> TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
> prstate->cutoffs->OldestMxact != FirstMultiXactId &&
> NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
>     return HEAPTUPLE_DEAD;
>
> Can I keep it?

This looks like an addition to the new criteria I added to
heap_prune_satisfies_vacuum(). Is that what you are suggesting? If so,
it looks like it would only return HEAPTUPLE_DEAD (and thus only
remove) a subset of the tuples my original criteria would remove. When
vacuum calculates OldestMxact as FirstMultiXactId, it would not remove
those tuples deleted before OldestXmin. It seems like OldestMxact will
equal FirstMultiXactID sometimes right after initdb and after
transaction ID wraparound. I'm not sure I totally understand the
criteria.

One thing I find confusing about this is that this would actually
remove less tuples than with my criteria -- which could lead to more
freezing. When vacuum calculates OldestMxact == FirstMultiXactID, we
would not remove tuples deleted before OldestXmin and thus return
HEAPTUPLE_RECENTLY_DEAD for those tuples. Then we would consider
freezing them. So, it seems like we would do more freezing by adding
this criteria.

Could you explain more about how the criteria you are suggesting
works? Are you saying it does less freezing than master or less
freezing than with my patch?

> Attached is the suggested fix for master plus a repro. I wrote it as a
> recovery suite TAP test, but I am _not_ proposing we add it to the
> ongoing test suite. It is, amongst other things, definitely prone to
> flaking. I also had to use loads of data to force two index vacuuming
> passes now that we have TIDStore, so it is a slow test.
-- snip --
> I have a modified version of this that repros the infinite loop on
> 14-16 with substantially less data. See it here [2]. Also, the repro
> attached to this mail won't work on 14 and 15 because of changes to
> background_psql.
>
> I couldn't understand why the replica is necessary here. Now I am digging why I got the similar behavior without
> replica when I have only one instance. I'm still checking this in my test, but I believe this patch fixes the original
> problem because the symptoms were the same.

Did you get similar behavior on master or on back branches? Was the
behavior you observed the infinite loop or the error during
heap_prepare_freeze_tuple()?

In my examples, the replica is needed because something has to move
the horizon on the primary backwards. When a standby reconnects with
an older oldest running transaction ID than any of the running
transactions on the primary and the vacuuming backend recomputes its
RecentXmin, the horizon may move backwards when compared to the
horizon calculated at the beginning of the vacuum. Vacuum does not
recompute cutoffs->OldestXmin during vacuuming a relation but it may
recompute the values in the GlobalVisState it uses for pruning.

We knew of only one other way that the horizon could move backwards
which Matthias describes here [1]. However, this is thought to be its
own concurrency-related bug in the commit-abort path that should be
fixed -- as opposed to the standby reconnecting with an older oldest
running transaction ID which can be expected.

Do you know if you were seeing the effects of the scenario Matthias describes?

- Melanie

[1]
https://www.postgresql.org/message-id/CAEze2WjMTh4KS0%3DQEQB-Jq%2BtDLPR%2B0%2BzVBMfVwSPK5A%3DWZa95Q%40mail.gmail.com



On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 21/06/2024 03:02, Peter Geoghegan wrote:
> > On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
> >
> >> The repro forces a round of index vacuuming after the standby
> >> reconnects and before pruning a dead tuple whose xmax is older than
> >> OldestXmin.
> >>
> >> At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
> >> calls GetOldestNonRemovableTransactionId(), thereby updating the
> >> backend's GlobalVisState and moving maybe_needed backwards.
> >
> > Right. I saw details exactly consistent with this when I used GDB
> > against a production instance.
> >
> > I'm glad that you were able to come up with a repro that involves
> > exactly the same basic elements, including index page deletion.
>
> Would it be possible to make it robust so that we could always run it
> with "make check"? This seems like an important corner case to
> regression test.

I'd have to look into how to ensure I can stabilize some of the parts
that seem prone to flaking. I can probably stabilize the vacuum bit
with a query of pg_stat_activity making sure it is waiting to acquire
the cleanup lock.

I don't, however, see a good way around the large amount of data
required to trigger more than one round of index vacuuming. I could
generate the data more efficiently than I am doing here
(generate_series() in the from clause). Perhaps with a copy? I know it
is too slow now to go in an ongoing test, but I don't have an
intuition around how fast it would have to be to be acceptable. Is
there a set of additional tests that are slower that we don't always
run? I didn't follow how the wraparound test ended up, but that seems
like one that would have been slow.

- Melanie





On 6/24/24 16:53, Melanie Plageman wrote:
> On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> On 21/06/2024 03:02, Peter Geoghegan wrote:
>>> On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
>>> <melanieplageman@gmail.com> wrote:
>>>
>>>> The repro forces a round of index vacuuming after the standby
>>>> reconnects and before pruning a dead tuple whose xmax is older than
>>>> OldestXmin.
>>>>
>>>> At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
>>>> calls GetOldestNonRemovableTransactionId(), thereby updating the
>>>> backend's GlobalVisState and moving maybe_needed backwards.
>>>
>>> Right. I saw details exactly consistent with this when I used GDB
>>> against a production instance.
>>>
>>> I'm glad that you were able to come up with a repro that involves
>>> exactly the same basic elements, including index page deletion.
>>
>> Would it be possible to make it robust so that we could always run it
>> with "make check"? This seems like an important corner case to
>> regression test.
> 
> I'd have to look into how to ensure I can stabilize some of the parts
> that seem prone to flaking. I can probably stabilize the vacuum bit
> with a query of pg_stat_activity making sure it is waiting to acquire
> the cleanup lock.
> 
> I don't, however, see a good way around the large amount of data
> required to trigger more than one round of index vacuuming. I could
> generate the data more efficiently than I am doing here
> (generate_series() in the from clause). Perhaps with a copy? I know it
> is too slow now to go in an ongoing test, but I don't have an
> intuition around how fast it would have to be to be acceptable. Is
> there a set of additional tests that are slower that we don't always
> run? I didn't follow how the wraparound test ended up, but that seems
> like one that would have been slow.
> 

I think it depends on what the impact on the 'make check' duration is.
If it could be added to one of the existing test groups, then it depends
on how long the slowest test in that group is. If the new test needs to
be in a separate group, it probably needs to be very fast.

But I was wondering how much time we are talking about, so I tried

creating a table, filling it with 300k rows => 250ms
creating an index => 180ms
delete 90% data => 200ms
vacuum t => 130ms

which with m_w_m=1MB does two rounds of index cleanup. That's ~760ms.
I'm not sure how much more stuff does the test need to do, but this
would be pretty reasonable, if we could add it to an existing group.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> We can fix this by always removing tuples considered dead before
> VacuumCutoffs->OldestXmin.

I don't have a great feeling about this fix. It's not that I think
it's wrong. It's just that the underlying problem here is that we have
heap_page_prune_and_freeze() getting both GlobalVisState *vistest and
struct VacuumCutoffs *cutoffs, and the vistest wants to be in charge
of deciding what gets pruned, but that doesn't actually work, because
as I pointed out in
http://postgr.es/m/CA+Tgmob1BtWcP6R5-toVHB5wqHasPTSR2TJkcDCutMzaUYBaHQ@mail.gmail.com
it's not properly synchronized with vacrel->cutoffs.OldestXmin. Your
fix is to consider both variables, which again may be totally correct,
but wouldn't it be a lot better if we didn't have two variables
fighting for control of the same behavior?

(I'm not trying to be a nuisance here -- I think it's great that
you've done the work to pin this down and perhaps there is no better
fix than what you've proposed.)

--
Robert Haas
EDB: http://www.enterprisedb.com



On Mon, Jun 24, 2024 at 11:44 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't have a great feeling about this fix. It's not that I think
> it's wrong. It's just that the underlying problem here is that we have
> heap_page_prune_and_freeze() getting both GlobalVisState *vistest and
> struct VacuumCutoffs *cutoffs, and the vistest wants to be in charge
> of deciding what gets pruned, but that doesn't actually work, because
> as I pointed out in
> http://postgr.es/m/CA+Tgmob1BtWcP6R5-toVHB5wqHasPTSR2TJkcDCutMzaUYBaHQ@mail.gmail.com
> it's not properly synchronized with vacrel->cutoffs.OldestXmin. Your
> fix is to consider both variables, which again may be totally correct,
> but wouldn't it be a lot better if we didn't have two variables
> fighting for control of the same behavior?

Why would it be better? It's to our advantage to have vistest prune
away extra tuples when possible. Andres placed a lot of emphasis on
that during the snapshot scalability work -- vistest can be updated on
the fly.

The problem here is that OldestXmin is supposed to be more
conservative than vistest, which it almost always is, except in this
one edge case. I don't think that plugging that hole changes the basic
fact that there is one source of truth about what *needs* to be
pruned. There is such a source of truth: OldestXmin.

--
Peter Geoghegan



On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The problem here is that OldestXmin is supposed to be more
> conservative than vistest, which it almost always is, except in this
> one edge case. I don't think that plugging that hole changes the basic
> fact that there is one source of truth about what *needs* to be
> pruned. There is such a source of truth: OldestXmin.

Well, another approach could be to make it so that OldestXmin actually
is always more conservative than vistest rather than almost always.

I agree with you that letting the pruning horizon move forward during
vacuum is desirable. I'm just wondering if having the vacuum code need
to know a second horizon is really the best way to address that.

--
Robert Haas
EDB: http://www.enterprisedb.com



On Mon, Jun 24, 2024 at 1:05 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > The problem here is that OldestXmin is supposed to be more
> > conservative than vistest, which it almost always is, except in this
> > one edge case. I don't think that plugging that hole changes the basic
> > fact that there is one source of truth about what *needs* to be
> > pruned. There is such a source of truth: OldestXmin.
>
> Well, another approach could be to make it so that OldestXmin actually
> is always more conservative than vistest rather than almost always.

If we did things like that then it would still be necessary to write a
patch like the one Melanie came up with, on the grounds that we'd
really need to be paranoid about having missed some subtlety. We might
as well just rely on the mechanism directly. I just don't think that
it makes much difference.

--
Peter Geoghegan



On Mon, Jun 24, 2024 at 1:05 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > The problem here is that OldestXmin is supposed to be more
> > conservative than vistest, which it almost always is, except in this
> > one edge case. I don't think that plugging that hole changes the basic
> > fact that there is one source of truth about what *needs* to be
> > pruned. There is such a source of truth: OldestXmin.
>
> Well, another approach could be to make it so that OldestXmin actually
> is always more conservative than vistest rather than almost always.

For the purposes of pruning, we are effectively always using the more
conservative of the two with this patch.

Are you more concerned about having a single horizon for pruning or
about having a horizon that does not move backwards after being
established at the beginning of vacuuming the relation?

Right now, in master, we do use a single horizon when determining what
is pruned -- that from GlobalVisState. OldestXmin is only used for
freezing and full page visibility determinations. Using a different
horizon for pruning by vacuum than freezing is what is causing the
error on master.

> I agree with you that letting the pruning horizon move forward during
> vacuum is desirable. I'm just wondering if having the vacuum code need
> to know a second horizon is really the best way to address that.

I was thinking about this some more and I realized I don't really get
why we think using GlobalVisState for pruning will let us remove more
tuples in the common case.

I had always thought it was because the vacuuming backend's
GlobalVisState will get updated periodically throughout vacuum and so,
assuming the oldest running transaction changes, our horizon for
vacuum would change. But, in writing this repro, it is actually quite
hard to get GlobalVisState to update. Our backend's RecentXmin needs
to have changed. And there aren't very many places where we take a new
snapshot after starting to vacuum a relation. One of those is at the
end of index vacuuming, but that can only affect the pruning horizon
if we have to do multiple rounds of index vacuuming. Is that really
the case we are thinking of when we say we want the pruning horizon to
move forward during vacuum?

- Melanie



On Mon, Jun 24, 2024 at 3:23 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I had always thought it was because the vacuuming backend's
> GlobalVisState will get updated periodically throughout vacuum and so,
> assuming the oldest running transaction changes, our horizon for
> vacuum would change.

I believe that it's more of an aspirational thing at this point. That
it is currently aspirational (it happens to some extent but isn't ever
particularly useful) shouldn't change the analysis about how to fix
this bug.

> One of those is at the
> end of index vacuuming, but that can only affect the pruning horizon
> if we have to do multiple rounds of index vacuuming. Is that really
> the case we are thinking of when we say we want the pruning horizon to
> move forward during vacuum?

No, that's definitely not what we were thinking of. It's just an
accident that it's almost the only thing that'll do that.

--
Peter Geoghegan



On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> If vacuum fails to remove a tuple with xmax older than
> VacuumCutoffs->OldestXmin and younger than
> GlobalVisState->maybe_needed, it will ERROR out when determining
> whether or not to freeze the tuple with "cannot freeze committed
> xmax".

One thing I don't understand is why it is okay to freeze the xmax of a
dead tuple just because it is from an aborted update.
heap_prepare_freeze_tuple() is called on HEAPTUPLE_RECENTLY_DEAD
tuples with normal xmaxes (non-multis) so that it can freeze tuples
from aborted updates. The only case in which we freeze dead tuples
with a non-multi xmax is if the xmax is from before OldestXmin and is
also not committed (so from an aborted update). Freezing dead tuples
replaces their xmax with InvalidTransactionId -- which would make them
look alive. So, it makes sense we don't do this for dead tuples in the
common case. But why is it 1) okay and 2) desirable to freeze xmaxes
of tuples from aborted updates? Won't it make them look alive again?

- Melanie



On Mon, Jun 24, 2024 at 3:23 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Are you more concerned about having a single horizon for pruning or
> about having a horizon that does not move backwards after being
> established at the beginning of vacuuming the relation?

I'm not sure I understand. The most important thing here is fixing the
bug. But if we have a choice of how to fix the bug, I'd prefer to do
it by having the pruning code test one horizon that is always correct,
rather than (as I think the patch does) having it test against two
horizons as a way of covering possible discrepancies between
those values.

> Right now, in master, we do use a single horizon when determining what
> is pruned -- that from GlobalVisState. OldestXmin is only used for
> freezing and full page visibility determinations. Using a different
> horizon for pruning by vacuum than freezing is what is causing the
> error on master.

Agreed.

> I had always thought it was because the vacuuming backend's
> GlobalVisState will get updated periodically throughout vacuum and so,
> assuming the oldest running transaction changes, our horizon for
> vacuum would change. But, in writing this repro, it is actually quite
> hard to get GlobalVisState to update. Our backend's RecentXmin needs
> to have changed. And there aren't very many places where we take a new
> snapshot after starting to vacuum a relation. One of those is at the
> end of index vacuuming, but that can only affect the pruning horizon
> if we have to do multiple rounds of index vacuuming. Is that really
> the case we are thinking of when we say we want the pruning horizon to
> move forward during vacuum?

I thought the idea was that the GlobalVisTest stuff would force a
recalculation now and then, but maybe it doesn't actually do that?

Suppose process A begins a transaction, acquires an XID, and then goes
idle. Process B now begins a giant vacuum. At some point in the middle
of the vacuum, A ends the transaction. Are you saying that B's
GlobalVisTest never really notices that this has happened?

--
Robert Haas
EDB: http://www.enterprisedb.com



On Mon, Jun 24, 2024 at 3:36 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> One thing I don't understand is why it is okay to freeze the xmax of a
> dead tuple just because it is from an aborted update.

We don't do that with XID-based xmaxs. Though perhaps we should, since
we'll already prune-away the successor tuple, and so might as well go
one tiny step further and clear the xmax for the original tuple via
freezing/setting it InvalidTransactionId. Instead we just leave the
original tuple largely undisturbed, with its original xmax.

We do something like that with Multi-based xmax fields, though not
with the specific goal of cleaning up after aborts in mind (we can
also remove lockers that are no longer running, regardless of where
they are relative to OldestXmin, stuff like that). The actual goal
with that is to enforce MultiXactCutoff, independently of whether or
not their member XIDs are < FreezeLimit yet.

> The only case in which we freeze dead tuples
> with a non-multi xmax is if the xmax is from before OldestXmin and is
> also not committed (so from an aborted update).

Perhaps I misunderstand, but: we simply don't freeze DEAD (not
RECENTLY_DEAD) tuples in the first place, because we don't have to
(pruning removes them instead). It doesn't matter if they're DEAD due
to being from aborted transactions or DEAD due to being
deleted/updated by a transaction that committed (committed and <
OldestXmin).

The freezing related code paths in heapam.c don't particularly care
whether a tuple xmax is RECENTLY_DEAD or LIVE to HTSV + OldestXmin.
Just as long as it's not fully DEAD (then it should have been pruned).

--
Peter Geoghegan



On Mon, Jun 24, 2024 at 4:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I'm not sure I understand. The most important thing here is fixing the
> bug. But if we have a choice of how to fix the bug, I'd prefer to do
> it by having the pruning code test one horizon that is always correct,
> rather than (as I think the patch does) having it test against two
> horizons as a way of covering possible discrepancies between
> those values.

Your characterizing of OldestXmin + vistest as two horizons seems
pretty arbitrary to me. I know what you mean, of course, but it seems
like a distinction without a difference.

> I thought the idea was that the GlobalVisTest stuff would force a
> recalculation now and then, but maybe it doesn't actually do that?

It definitely can do that. Just not in a way that meaningfully
increases the number of heap tuples that we can recognize as DEAD and
remove. At least not currently.

> Suppose process A begins a transaction, acquires an XID, and then goes
> idle. Process B now begins a giant vacuum. At some point in the middle
> of the vacuum, A ends the transaction. Are you saying that B's
> GlobalVisTest never really notices that this has happened?

That's my understanding, yes. That is, vistest is approximately the
same thing as OldestXmin anyway. At least for now.

--
Peter Geoghegan



On Mon, Jun 24, 2024 at 4:42 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Jun 24, 2024 at 3:36 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > One thing I don't understand is why it is okay to freeze the xmax of a
> > dead tuple just because it is from an aborted update.
>
> We don't do that with XID-based xmaxs. Though perhaps we should, since
> we'll already prune-away the successor tuple, and so might as well go
> one tiny step further and clear the xmax for the original tuple via
> freezing/setting it InvalidTransactionId. Instead we just leave the
> original tuple largely undisturbed, with its original xmax.

I thought that was the case too, but we call
heap_prepare_freeze_tuple() on HEAPTUPLE_RECENTLY_DEAD tuples and then

    else if (TransactionIdIsNormal(xid))
    {
        /* Raw xmax is normal XID */
        freeze_xmax = TransactionIdPrecedes(xid, cutoffs->OldestXmin);
    }

And then later we

    if (freeze_xmax)
        frz->xmax = InvalidTransactionId;

and then when we execute freezing the tuple in heap_execute_freeze_tuple()

    HeapTupleHeaderSetXmax(tuple, frz->xmax);

Which sets the xmax to InvalidTransactionId. Or am I missing something?

> > The only case in which we freeze dead tuples
> > with a non-multi xmax is if the xmax is from before OldestXmin and is
> > also not committed (so from an aborted update).
>
> Perhaps I misunderstand, but: we simply don't freeze DEAD (not
> RECENTLY_DEAD) tuples in the first place, because we don't have to
> (pruning removes them instead). It doesn't matter if they're DEAD due
> to being from aborted transactions or DEAD due to being
> deleted/updated by a transaction that committed (committed and <
> OldestXmin).

Right, I'm talking about HEAPTUPLE_RECENTLY_DEAD tuples.
HEAPTUPLE_DEAD tuples are pruned away. But we can't replace the xmax
of a tuple that has been deleted or updated by a transaction that
committed with InvalidTransactionId. And it seems like the code does
that? Why even call heap_prepare_freeze_tuple() on
HEAPTUPLE_RECENTLY_DEAD tuples? Is it mainly to handle MultiXact
freezing?

> The freezing related code paths in heapam.c don't particularly care
> whether a tuple xmax is RECENTLY_DEAD or LIVE to HTSV + OldestXmin.
> Just as long as it's not fully DEAD (then it should have been pruned).

But it just seems like we shouldn't freeze RECENTLY_DEAD either.

- Melanie



On Mon, Jun 24, 2024 at 4:51 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Jun 24, 2024 at 4:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I thought the idea was that the GlobalVisTest stuff would force a
> > recalculation now and then, but maybe it doesn't actually do that?
>
> It definitely can do that. Just not in a way that meaningfully
> increases the number of heap tuples that we can recognize as DEAD and
> remove. At least not currently.
>
> > Suppose process A begins a transaction, acquires an XID, and then goes
> > idle. Process B now begins a giant vacuum. At some point in the middle
> > of the vacuum, A ends the transaction. Are you saying that B's
> > GlobalVisTest never really notices that this has happened?
>
> That's my understanding, yes. That is, vistest is approximately the
> same thing as OldestXmin anyway. At least for now.

Exactly. Something has to cause this backend to update its view of the
horizon. At the end of index vacuuming,
GetOldestNonRemovableTransactionId() will explicitly
ComputeXidHorizons() which will update our backend's GlobalVisStates.
Otherwise, if our backend's RecentXmin is updated, by taking a new
snapshot, then we may update our GlobalVisStates. See
GlobalVisTestShouldUpdate() for the conditions under which we would
update our GlobalVisStates during the normal visibility checks
happening during pruning.
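
For reference, the condition in GlobalVisTestShouldUpdate() is roughly
this (paraphrasing procarray.c from memory; details may differ slightly):

    /* the horizons haven't been computed yet in this backend */
    if (!TransactionIdIsValid(ComputeXidHorizonsResultLastXmin))
        return true;

    /* only recompute the approximate horizons if RecentXmin has changed */
    return RecentXmin != ComputeXidHorizonsResultLastXmin;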

Vacuum used to open indexes after calculating horizons before starting
its first pass. This led to a recomputation of the horizon. But, in
master, there aren't many obvious places where such a thing would be
happening.

- Melanie



On Mon, Jun 24, 2024 at 04:51:24PM -0400, Peter Geoghegan wrote:
> On Mon, Jun 24, 2024 at 4:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I'm not sure I understand. The most important thing here is fixing the
> > bug. But if we have a choice of how to fix the bug, I'd prefer to do
> > it by having the pruning code test one horizon that is always correct,
> > rather than (as I think the patch does) having it test against two
> > horizons as a way of covering possible discrepancies between
> > those values.
> 
> Your characterizing of OldestXmin + vistest as two horizons seems
> pretty arbitrary to me. I know what you mean, of course, but it seems
> like a distinction without a difference.

"Two horizons" matches how I model it.  If the two were _always_ indicating
the same notion of visibility, we wouldn't have this thread.

On Mon, Jun 24, 2024 at 03:23:39PM -0400, Melanie Plageman wrote:
> Right now, in master, we do use a single horizon when determining what
> is pruned -- that from GlobalVisState. OldestXmin is only used for
> freezing and full page visibility determinations. Using a different
> horizon for pruning by vacuum than freezing is what is causing the
> error on master.

Agreed, and I think using different sources for pruning and freezing is a
recipe for future bugs.  Fundamentally, both are about answering "is
snapshot_considers_xid_in_progress(snapshot, xid) false for every snapshot?"
That's not to say this thread shall unify the two, but I suspect that's the
right long-term direction.



On Mon, Jun 24, 2024 at 9:30 PM Noah Misch <noah@leadboat.com> wrote:
> On Mon, Jun 24, 2024 at 03:23:39PM -0400, Melanie Plageman wrote:
> > Right now, in master, we do use a single horizon when determining what
> > is pruned -- that from GlobalVisState. OldestXmin is only used for
> > freezing and full page visibility determinations. Using a different
> > horizon for pruning by vacuum than freezing is what is causing the
> > error on master.
>
> Agreed, and I think using different sources for pruning and freezing is a
> recipe for future bugs.  Fundamentally, both are about answering "is
> snapshot_considers_xid_in_progress(snapshot, xid) false for every snapshot?"
> That's not to say this thread shall unify the two, but I suspect that's the
> right long-term direction.

What does it really mean to unify the two, though?

If the OldestXmin field was located in struct GlobalVisState (next to
definitely_needed and maybe_needed), but everything worked in
essentially the same way as it will with Melanie's patch in place,
would that count as unifying the two? Why or why not?

--
Peter Geoghegan



On Mon, Jun 24, 2024 at 09:49:53PM -0400, Peter Geoghegan wrote:
> On Mon, Jun 24, 2024 at 9:30 PM Noah Misch <noah@leadboat.com> wrote:
> > On Mon, Jun 24, 2024 at 03:23:39PM -0400, Melanie Plageman wrote:
> > > Right now, in master, we do use a single horizon when determining what
> > > is pruned -- that from GlobalVisState. OldestXmin is only used for
> > > freezing and full page visibility determinations. Using a different
> > > horizon for pruning by vacuum than freezing is what is causing the
> > > error on master.
> >
> > Agreed, and I think using different sources for pruning and freezing is a
> > recipe for future bugs.  Fundamentally, both are about answering "is
> > snapshot_considers_xid_in_progress(snapshot, xid) false for every snapshot?"
> > That's not to say this thread shall unify the two, but I suspect that's the
> > right long-term direction.
> 
> What does it really mean to unify the two, though?
> 
> If the OldestXmin field was located in struct GlobalVisState (next to
> definitely_needed and maybe_needed), but everything worked in
> essentially the same way as it will with Melanie's patch in place,
> would that count as unifying the two? Why or why not?

To me, no, unification would mean removing the data redundancy.  Relocating
the redundant field and/or code that updates the redundant field certainly
could reduce the risk of bugs, so I'm not opposing everything short of
removing the data redundancy.  I'm just agreeing with the "prefer" from
https://postgr.es/m/CA+TgmoYzS_bkt_MrNxr5QrXDKfedmh4tStn8UBTTBXqv=3JTew@mail.gmail.com



Hi,

On 2024-06-24 16:35:50 -0400, Robert Haas wrote:
> On Mon, Jun 24, 2024 at 3:23 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > Are you more concerned about having a single horizon for pruning or
> > about having a horizon that does not move backwards after being
> > established at the beginning of vacuuming the relation?
> 
> I'm not sure I understand. The most important thing here is fixing the
> bug. But if we have a choice of how to fix the bug, I'd prefer to do
> it by having the pruning code test one horizon that is always correct,
> rather than (as I think the patch does) having it test against two
> horizons as a way of covering possible discrepancies between
> those values.

I think that's going in the wrong direction. We *want* to prune more
aggressively if we can (*), the necessary state is represented by the
vistest. That's a different thing than *having* to prune tuples beyond a
certain xmin (the cutoff determined by vacuum.c/vacuumlazy.c). The problem
we're having here is that the two states can get out of sync due to the
vistest "moving backwards", because of hot_standby_feedback (and perhaps also
an issue around aborts).

To prevent that we can
a) make sure that we always take the hard cutoff into account
b) prevent vistest from going backwards

(*) we really ought to become more aggressive, by removing intermediary row
    versions when they're not visible to anyone, but not yet old enough to be
    below the horizon. But that realistically will only be possible in *some*
    cases, e.g. when the predecessor row version is on the same page.



> > I had always thought it was because the vacuuming backend's
> > GlobalVisState will get updated periodically throughout vacuum and so,
> > assuming the oldest running transaction changes, our horizon for
> > vacuum would change. But, in writing this repro, it is actually quite
> > hard to get GlobalVisState to update. Our backend's RecentXmin needs
> > to have changed. And there aren't very many places where we take a new
> > snapshot after starting to vacuum a relation. One of those is at the
> > end of index vacuuming, but that can only affect the pruning horizon
> > if we have to do multiple rounds of index vacuuming. Is that really
> > the case we are thinking of when we say we want the pruning horizon to
> > move forward during vacuum?
> 
> I thought the idea was that the GlobalVisTest stuff would force a
> recalculation now and then, but maybe it doesn't actually do that?

It forces an accurate horizon to be determined the first time it would require
it to determine visibility. The "first time" is determined by RecentXmin not
having changed.

The main goal of the vistest stuff was to not have to determine an accurate
horizon in GetSnapshotData(). Determining an accurate horizon requires
accessing each backend's ->xmin, which causes things to scale badly, as ->xmin
changes so frequently.

The cost of determining the accurate horizon is irrelevant for vacuums, but
it's not at all irrelevant for on-access pruning.


> Suppose process A begins a transaction, acquires an XID, and then goes
> idle. Process B now begins a giant vacuum. At some point in the middle
> of the vacuum, A ends the transaction. Are you saying that B's
> GlobalVisTest never really notices that this has happened?

Not at the moment, but we should add heuristics like that.


Greetings,

Andres Freund



On Tue, Jun 25, 2024 at 8:03 AM Andres Freund <andres@anarazel.de> wrote:
> I think that's going in the wrong direction. We *want* to prune more
> aggressively if we can (*), the necessary state is represented by the
> vistest. That's a different thing than *having* to prune tuples beyond a
> certain xmin (the cutoff determined by vacuum.c/vacuumlazy.c). The problem
> we're having here is that the two states can get out of sync due to the
> vistest "moving backwards", because of hot_standby_feedback (and perhaps also
> an issue around aborts).

I agree that we want to prune more aggressively if we can. I think
that fixing this by preventing vistest from going backward is
reasonable, and I like it better than what Melanie proposed, although
I like what Melanie proposed much better than not fixing it! I'm not
sure how to do that cleanly, but one of you may have an idea.

I do think that having a bunch of different XID values that function
as horizons and a vistest object that holds some more XID horizons
floating around in vacuum makes the code hard to understand. The
relationships between the various values are not well-documented. For
instance, the vistest has to be after vacrel->cutoffs.OldestXmin for
correctness, but I don't think there's a single comment anywhere
saying that; meanwhile, the comments for VacuumCutoffs say "OldestXmin
is the Xid below which tuples deleted by any xact (that committed)
should be considered DEAD, not just RECENTLY_DEAD." Surely the reader
can be forgiven for thinking that this is the cutoff that will
actually be used by pruning, but it isn't.

And more generally, it seems like a fairly big problem to me that
LVRelState directly stores NewRelfrozenXid; contains a VacuumCutoffs
object that stores relfrozenxid, OldestXmin, and FreezeLimit; and also
points to a GlobalVisState object that contains definitely_needed and
maybe_needed. That is six different XID cutoffs for one vacuum
operation. That's a lot. I can't describe how they're all different
from each other or what the necessary relationships between them are
off-hand, and I bet nobody else could either, at least until recently,
else we might not have this bug. I feel like if it were possible to
have fewer of them and still have things work, we'd be better off. I'm
not sure that's doable. But six seems like a lot.

--
Robert Haas
EDB: http://www.enterprisedb.com



On 2024-06-25 08:42:02 -0400, Robert Haas wrote:
> On Tue, Jun 25, 2024 at 8:03 AM Andres Freund <andres@anarazel.de> wrote:
> > I think that's going in the wrong direction. We *want* to prune more
> > aggressively if we can (*), the necessary state is represented by the
> > vistest. That's a different thing than *having* to prune tuples beyond a
> > certain xmin (the cutoff determined by vacuum.c/vacuumlazy.c). The problem
> > we're having here is that the two states can get out of sync due to the
> > vistest "moving backwards", because of hot_standby_feedback (and perhaps also
> > an issue around aborts).
> 
> I agree that we want to prune more aggressively if we can. I think
> that fixing this by preventing vistest from going backward is
> reasonable, and I like it better than what Melanie proposed, although
> I like what Melanie proposed much better than not fixing it! I'm not
> sure how to do that cleanly, but one of you may have an idea.

It's not hard - but it has downsides. It'll mean that - outside of vacuum -
we'll much more often not react to horizons going backwards due to
hot_standby_feedback. Which means that hot_standby_feedback, when used without
slots, will prevent fewer conflicts.


> I do think that having a bunch of different XID values that function
> as horizons and a vistest object that holds some more XID horizons
> floating around in vacuum makes the code hard to understand. The
> relationships between the various values are not well-documented. For
> instance, the vistest has to be after vacrel->cutoffs.OldestXmin for
> correctness, but I don't think there's a single comment anywhere
> saying that;

It is somewhat documented:

 * Note: the approximate horizons (see definition of GlobalVisState) are
 * updated by the computations done here. That's currently required for
 * correctness and a small optimization. Without doing so it's possible that
 * heap vacuum's call to heap_page_prune_and_freeze() uses a more conservative
 * horizon than later when deciding which tuples can be removed - which the
 * code doesn't expect (breaking HOT).


> And more generally, it seems like a fairly big problem to me that
> LVRelState directly stores NewRelfrozenXid; contains a VacuumCutoffs
> object that stores relfrozenxid, OldestXmin, and FreezeLimit; and also
> points to a GlobalVisState object that contains definitely_needed and
> maybe_needed. That is six different XID cutoffs for one vacuum
> operation. That's a lot. I can't describe how they're all different
> from each other or what the necessary relationships between them are
> off-hand, and I bet nobody else could either, at least until recently,
> else we might not have this bug. I feel like if it were possible to
> have fewer of them and still have things work, we'd be better off. I'm
> not sure that's doable. But six seems like a lot.

Agreed. I don't think you can just unify things though, they actually are all
different for good, or at least decent, reasons.  I think improving the naming
alone could help a good bit though.

Greetings,

Andres Freund



On Tue, Jun 25, 2024 at 9:07 AM Andres Freund <andres@anarazel.de> wrote:
> It's not hard - but it has downsides. It'll mean that - outside of vacuum -
> we'll much more often not react to horizons going backwards due to
> hot_standby_feedback. Which means that hot_standby_feedback, when used without
> slots, will prevent fewer conflicts.

Can you explain this in more detail?

--
Robert Haas
EDB: http://www.enterprisedb.com



On 24.06.2024 17:37, Melanie Plageman wrote:
> On Mon, Jun 24, 2024 at 4:10 AM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
>> We can fix this by always removing tuples considered dead before
>> VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
>> has a transaction that sees that tuple as alive, because it will
>> simply wait to replay the removal until it would be correct to do so
>> or recovery conflict handling will cancel the transaction that sees
>> the tuple as alive and allow replay to continue.
>>
>> This is an interesting and difficult case) I noticed that when initializing the cluster, in my opinion, we provide excessive freezing. Initialization takes a long time, which can lead, for example, to longer test execution. I got rid of this by adding the OldestMxact checkbox is not FirstMultiXactId, and it works fine.
>>
>> if (prstate->cutoffs &&
>>     TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
>>     prstate->cutoffs->OldestMxact != FirstMultiXactId &&
>>     NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
>>     return HEAPTUPLE_DEAD;
>>
>> Can I keep it?
>
> This looks like an addition to the new criteria I added to
> heap_prune_satisfies_vacuum(). Is that what you are suggesting? If so,
> it looks like it would only return HEAPTUPLE_DEAD (and thus only
> remove) a subset of the tuples my original criteria would remove. When
> vacuum calculates OldestMxact as FirstMultiXactId, it would not remove
> those tuples deleted before OldestXmin. It seems like OldestMxact will
> equal FirstMultiXactID sometimes right after initdb and after
> transaction ID wraparound. I'm not sure I totally understand the
> criteria.
>
> One thing I find confusing about this is that this would actually
> remove less tuples than with my criteria -- which could lead to more
> freezing. When vacuum calculates OldestMxact == FirstMultiXactID, we
> would not remove tuples deleted before OldestXmin and thus return
> HEAPTUPLE_RECENTLY_DEAD for those tuples. Then we would consider
> freezing them. So, it seems like we would do more freezing by adding
> this criteria.
>
> Could you explain more about how the criteria you are suggesting
> works? Are you saying it does less freezing than master or less
> freezing than with my patch?


At first, I noticed that with this patch, vacuum fouls the nodes more often than before, and it seemed to me that more time was spent initializing the cluster with this patch than before, so I suggested considering this condition. After checking again, I found that the problem was with my laptop. So, sorry for the noise.

Attached is the suggested fix for master plus a repro. I wrote it as a
recovery suite TAP test, but I am _not_ proposing we add it to the
ongoing test suite. It is, amongst other things, definitely prone to
flaking. I also had to use loads of data to force two index vacuuming
passes now that we have TIDStore, so it is a slow test.
-- snip --
I have a modified version of this that repros the infinite loop on
14-16 with substantially less data. See it here [2]. Also, the repro
attached to this mail won't work on 14 and 15 because of changes to
background_psql.

I couldn't understand why the replica is necessary here. Now I am digging why I got the similar behavior without replica when I have only one instance. I'm still checking this in my test, but I believe this patch fixes the original problem because the symptoms were the same.
Did you get similar behavior on master or on back branches? Was the
behavior you observed the infinite loop or the error during
heap_prepare_freeze_tuple()?

In my examples, the replica is needed because something has to move
the horizon on the primary backwards. When a standby reconnects with
an older oldest running transaction ID than any of the running
transactions on the primary and the vacuuming backend recomputes its
RecentXmin, the horizon may move backwards when compared to the
horizon calculated at the beginning of the vacuum. Vacuum does not
recompute cutoffs->OldestXmin while vacuuming a relation, but it may
recompute the values in the GlobalVisState it uses for pruning.

We knew of only one other way that the horizon could move backwards
which Matthias describes here [1]. However, this is thought to be its
own concurrency-related bug in the commit-abort path that should be
fixed -- as opposed to the standby reconnecting with an older oldest
running transaction ID which can be expected.

Do you know if you were seeing the effects of the scenario Matthias describes?


[1] https://www.postgresql.org/message-id/CAEze2WjMTh4KS0%3DQEQB-Jq%2BtDLPR%2B0%2BzVBMfVwSPK5A%3DWZa95Q%40mail.gmail.com
I'm sorry, I need a little more time to figure this out. I will answer this question later.
-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Jun 25, 2024 at 10:31 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jun 25, 2024 at 9:07 AM Andres Freund <andres@anarazel.de> wrote:
> > It's not hard - but it has downsides. It'll mean that - outside of vacuum -
> > we'll much more often not react to horizons going backwards due to
> > hot_standby_feedback. Which means that hot_standby_feedback, when used without
> > slots, will prevent fewer conflicts.
>
> Can you explain this in more detail?

If we prevent GlobalVisState from moving backward, then we would less
frequently be pushing the horizon backward on the primary in response
to hot standby feedback. Then, the primary would do more things that
would not be safely replayable on the standby -- so the standby could
end up encountering more recovery conflicts.

- Melanie



On Tue, Jun 25, 2024 at 11:39 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Tue, Jun 25, 2024 at 10:31 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Tue, Jun 25, 2024 at 9:07 AM Andres Freund <andres@anarazel.de> wrote:
> > > It's not hard - but it has downsides. It'll mean that - outside of vacuum -
> > > we'll much more often not react to horizons going backwards due to
> > > hot_standby_feedback. Which means that hot_standby_feedback, when used without
> > > slots, will prevent fewer conflicts.
> >
> > Can you explain this in more detail?
>
> If we prevent GlobalVisState from moving backward, then we would less
> frequently be pushing the horizon backward on the primary in response
> to hot standby feedback. Then, the primary would do more things that
> would not be safely replayable on the standby -- so the standby could
> end up encountering more recovery conflicts.

I don't get it. hot_standby_feedback only moves horizons backward on
the primary, AFAIK, when it first connects, or when it reconnects.
Which I guess could be frequent for some users with flaky networks,
but does that really rise to the level of "much more often"?

--
Robert Haas
EDB: http://www.enterprisedb.com



Hi,

On 2024-06-25 12:31:11 -0400, Robert Haas wrote:
> On Tue, Jun 25, 2024 at 11:39 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > On Tue, Jun 25, 2024 at 10:31 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > On Tue, Jun 25, 2024 at 9:07 AM Andres Freund <andres@anarazel.de> wrote:
> > > > It's not hard - but it has downsides. It'll mean that - outside of vacuum -
> > > > we'll much more often not react to horizons going backwards due to
> > > > hot_standby_feedback. Which means that hot_standby_feedback, when used without
> > > > slots, will prevent fewer conflicts.
> > >
> > > Can you explain this in more detail?
> >
> > If we prevent GlobalVisState from moving backward, then we would less
> > frequently be pushing the horizon backward on the primary in response
> > to hot standby feedback. Then, the primary would do more things that
> > would not be safely replayable on the standby -- so the standby could
> > end up encountering more recovery conflicts.
> 
> I don't get it. hot_standby_feedback only moves horizons backward on
> the primary, AFAIK, when it first connects, or when it reconnects.
> Which I guess could be frequent for some users with flaky networks,
> but does that really rise to the level of "much more often"?

Well, the thing is that with the "prevent it from going backwards" approach,
once the horizon is set to something recent in a backend, it's "sticky".  If a
replica is a bit behind or has a long-lived snapshot and then disconnects,
the vistest state will advance beyond where the replica needs it to be.  Even
if the standby later reconnects, the vistest in long-lived sessions will still
have the more advanced state. So all future pruning these backends do runs
into the risk of performing pruning that removes rows the standby still deems
visible and thus causes recovery conflicts.

I.e. you don't even need frequent disconnects, you just need one disconnect
and sessions that aren't short-lived.

That said, obviously there will be plenty setups where this won't cause an
issue. I don't really have a handle on how often it'd be a problem.


Greetings,

Andres Freund



On Tue, Jun 25, 2024 at 1:10 PM Andres Freund <andres@anarazel.de> wrote:
> That said, obviously there will be plenty setups where this won't cause an
> issue. I don't really have a handle on how often it'd be a problem.

Fair enough. Even if it's not super-common, it doesn't seem like a
great idea to regress such scenarios in the back-branches.

Is there any way that we could instead tweak things so that we adjust
the visibility test object itself? Like can we have a GlobalVisTest API
where we can supply the OldestXmin from the VacuumCutoffs and have it
... do something useful with that?

--
Robert Haas
EDB: http://www.enterprisedb.com



On 2024-06-25 14:35:00 -0400, Robert Haas wrote:
> Is there any way that we could instead tweak things so that we adjust
> the visibility test object itself? Like can we have a GlobalVisTest API
> where we can supply the OldestXmin from the VacuumCutoffs and have it
> ... do something useful with that?

I doubt that's doable in the back branches. And even on HEAD, I don't think
it's a particularly attractive option - there's just a global vistest for each
of the types of objects with a specific horizon (they need to be updated
occasionally, e.g. when taking snapshots). So there's not really a spot to put
an associated OldestXmin. We could put it there and remove it at the end of
vacuum / in an exception handler, but that seems substantially worse.



On Tue, Jun 25, 2024 at 4:41 PM Andres Freund <andres@anarazel.de> wrote:
> I doubt that's doable in the back branches. And even on HEAD, I don't think
> it's a particularly attractive option - there's just a global vistest for each
> of the types of objects with a specific horizon (they need to be updated
> occasionally, e.g. when taking snapshots). So there's not really a spot to put
> an associated OldestXmin. We could put it there and remove it at the end of
> vacuum / in an exception handler, but that seems substantially worse.

Oh, right: I forgot that the visibility test objects were just
pointers to global variables.

Well, I don't know. I guess that doesn't leave any real options but to
fix it as Melanie proposed. But I still don't like it very much. I
feel like having to test against two different thresholds in the
pruning code is surely a sign that we're doing something wrong.

--
Robert Haas
EDB: http://www.enterprisedb.com



On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Would it be possible to make it robust so that we could always run it
> with "make check"? This seems like an important corner case to
> regression test.

Okay, I've attached a new version of the patch and a new version of
the repro that may be fast and stable enough to commit. It is more
minimal than the previous version. I made the table as small as I
could to still trigger two rounds of index vacuuming. I tried to make
it as stable as possible. I also removed the cursor on the standby
that could trigger recovery conflicts. It would be super helpful if
someone could take a look at the test and point out any ways I could
make it even more likely to be stable.

- Melanie

Attachment
On Tue, Jul 2, 2024 at 7:07 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > Would it be possible to make it robust so that we could always run it
> > with "make check"? This seems like an important corner case to
> > regression test.
>
> Okay, I've attached a new version of the patch and a new version of
> the repro that may be fast and stable enough to commit. It is more
> minimal than the previous version. I made the table as small as I
> could to still trigger two rounds of index vacuuming. I tried to make
> it as stable as possible. I also removed the cursor on the standby
> that could trigger recovery conflicts. It would be super helpful if
> someone could take a look at the test and point out any ways I could
> make it even more likely to be stable.

Attached v3 has one important additional component in the test -- I
use pg_stat_progress_vacuum to confirm that we actually do more than
one pass of index vacuuming. Otherwise, it would have been trivial for
the test to incorrectly pass.

I could still use another pair of eyes on the test (looking out for
stability enhancing measures I could take). I also would be happy if
someone looked at the commit message and/or comments to let me know if
they make sense.

I'll finish with versions of the patch and test targeted at v14-16 and
propose those before committing this.

- Melanie

Attachment
On Mon, Jul 8, 2024 at 2:25 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Attached v3 has one important additional component in the test -- I
> use pg_stat_progress_vacuum to confirm that we actually do more than
> one pass of index vacuuming. Otherwise, it would have been trivial for
> the test to incorrectly pass.

That's a good idea.

> I could still use another pair of eyes on the test (looking out for
> stability enhancing measures I could take).

First, the basics: I found that your test failed reliably without your
fix, and passed reliably with your fix.

Next, performance: the total test runtime (as indicated by "time meson
test -q recovery/043_vacuum_horizon_floor") was comparable to other
recovery/* TAP tests. The new vacuum_horizon_floor test is a little on
the long running side, as these tests go, but I think that's fine.

Minor nitpicking about the comments in your TAP test:

* It is necessary but not sufficient for your test to "skewer"
maybe_needed, relative to OldestXmin. Obviously, it is not sufficient
because the test can only fail when VACUUM prunes a heap page after
the backend's horizons have been "skewered" in this sense.

Pruning is when we get stuck, and if there's no more pruning then
there's no opportunity for VACUUM to get stuck. Perhaps this point
should be noted directly in the comments. You could add a sentence
immediately after the existing sentence "Then vacuum's first pass will
continue and pruning...". This new sentence would then add commentary
such as "Finally, vacuum's second pass over the heap...".

* Perhaps you should point out that you're using VACUUM FREEZE for
this because it'll force the backend to always get a cleanup lock.
This is something you rely on to make the repro reliable, but that's
it.

In other words, point out to the reader that this bug has nothing to
do with freezing; it just so happens to be convenient to use VACUUM
FREEZE this way, due to implementation details.

* The sentence "VACUUM proceeds with pruning and does a visibility
check on each tuple..." describes the bug in terms of the current
state of things on Postgres 17, but Postgres 17 hasn't been released
just yet. Does that really make sense?

If you want to describe the invariant that caused
heap_pre_freeze_checks() to error-out on HEAD/Postgres 17, then the
commit message of your fix seems like the right place for that. You
could reference these errors in passing. The errors seem fairly
incidental to the real problem, at least to me.

I think that there is some chance that this test will break the build
farm in whatever way, since there is a long history of VACUUM not
quite behaving as expected with these sorts of tests. I think that you
should commit the test case separately, first thing in the morning,
and then keep an eye on the build farm for the rest of the day. I
don't think that it's sensible to bend over backwards, just to avoid
breaking the build farm in this way.

Thanks for working on this
--
Peter Geoghegan



On Mon, Jul 15, 2024 at 6:02 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Jul 8, 2024 at 2:25 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > I could still use another pair of eyes on the test (looking out for
> > stability enhancing measures I could take).
>
> First, the basics: I found that your test failed reliably without your
> fix, and passed reliably with your fix.

Thanks for the review.

> Minor nitpicking about the comments in your TAP test:
>
> * It is necessary but not sufficient for your test to "skewer"
> maybe_needed, relative to OldestXmin. Obviously, it is not sufficient
> because the test can only fail when VACUUM prunes a heap page after
> the backend's horizons have been "skewered" in this sense.
>
> Pruning is when we get stuck, and if there's no more pruning then
> there's no opportunity for VACUUM to get stuck. Perhaps this point
> should be noted directly in the comments. You could add a sentence
> immediately after the existing sentence "Then vacuum's first pass will
> continue and pruning...". This new sentence would then add commentary
> such as "Finally, vacuum's second pass over the heap...".

I've added a description to the top of the test of the scenario
required and then reworked the comment you are describing to try and
make this more clear.

> * Perhaps you should point out that you're using VACUUM FREEZE for
> this because it'll force the backend to always get a cleanup lock.
> This is something you rely on to make the repro reliable, but that's
> it.
>
> In other words, point out to the reader that this bug has nothing to
> do with freezing; it just so happens to be convenient to use VACUUM
> FREEZE this way, due to implementation details.

I've mentioned this in a comment.

> * The sentence "VACUUM proceeds with pruning and does a visibility
> check on each tuple..." describes the bug in terms of the current
> state of things on Postgres 17, but Postgres 17 hasn't been released
> just yet. Does that really make sense?

In the patch targeted at master, I think it makes sense to describe
the code as it is. In the backpatch versions, I reworked this comment
to be correct for those versions.

> If you want to describe the invariant that caused
> heap_pre_freeze_checks() to error-out on HEAD/Postgres 17, then the
> commit message of your fix seems like the right place for that. You
> could reference these errors in passing. The errors seem fairly
> incidental to the real problem, at least to me.

The errors are mentioned in the fix commit message.

> I think that there is some chance that this test will break the build
> farm in whatever way, since there is a long history of VACUUM not
> quite behaving as expected with these sorts of tests. I think that you
> should commit the test case separately, first thing in the morning,
> and then keep an eye on the build farm for the rest of the day. I
> don't think that it's sensible to bend over backwards, just to avoid
> breaking the build farm in this way.

Sounds good.

- Melanie

Attachment
On Wed, Jul 17, 2024 at 11:07 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Mon, Jul 15, 2024 at 6:02 PM Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > I think that there is some chance that this test will break the build
> > farm in whatever way, since there is a long history of VACUUM not
> > quite behaving as expected with these sorts of tests. I think that you
> > should commit the test case separately, first thing in the morning,
> > and then keep an eye on the build farm for the rest of the day. I
> > don't think that it's sensible to bend over backwards, just to avoid
> > breaking the build farm in this way.
>
> Sounds good.

Hmm. So, I was just running all the versions through CI again, and I
noticed that the test failed on master on CI on Linux - Debian
Bookworm with Meson. (This passes locally for me and has passed on
previous CI runs).

[15:43:41.547] stderr:
[15:43:41.547] # poll_query_until timed out executing this query:
[15:43:41.547] #
[15:43:41.547] # SELECT index_vacuum_count > 0
[15:43:41.547] # FROM pg_stat_progress_vacuum
[15:43:41.547] # WHERE datname='test_db' AND relid::regclass = 'vac_horizon_floor_table'::regclass;
[15:43:41.547] #
[15:43:41.547] # expecting this output:
[15:43:41.547] # t
[15:43:41.547] # last actual query output:
[15:43:41.547] # f

We didn't end up doing two index vacuum passes. Because it doesn't
repro locally for me, I can only assume that the conditions for
forcing two index vacuuming passes in master just weren't met in this
case. I'm unsurprised, as it is much harder since 17 to force two
passes of index vacuuming. It seems like this might be as unstable as
I feared. I could add more dead data. Or, I could just commit the test
to the back branches before 17. What do you think?

- Melanie



On Wed, Jul 17, 2024 at 12:07 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> We didn't end up doing two index vacuum passes. Because it doesn't
> repro locally for me, I can only assume that the conditions for
> forcing two index vacuuming passes in master just weren't met in this
> case. I'm unsurprised, as it is much harder since 17 to force two
> passes of index vacuuming. It seems like this might be as unstable as
> I feared. I could add more dead data. Or, I could just commit the test
> to the back branches before 17. What do you think?

How much margin of error do you have, in terms of total number of
dead_items? That is, have you whittled it down to the minimum possible
threshold for 2 passes?

Some logging with VACUUM VERBOSE (run on the ci instance) might be illuminating.


--
Peter Geoghegan



On Wed, Jul 17, 2024 at 12:11 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Jul 17, 2024 at 12:07 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > We didn't end up doing two index vacuum passes. Because it doesn't
> > repro locally for me, I can only assume that the conditions for
> > forcing two index vacuuming passes in master just weren't met in this
> > case. I'm unsurprised, as it is much harder since 17 to force two
> > passes of index vacuuming. It seems like this might be as unstable as
> > I feared. I could add more dead data. Or, I could just commit the test
> > to the back branches before 17. What do you think?
>
> How much margin of error do you have, in terms of total number of
> dead_items? That is, have you whittled it down to the minimum possible
> threshold for 2 passes?

When I run it on my machine with some added logging, the space taken
by dead items is about 330 kB more than maintenance_work_mem (which is
set to 1 MB). I could roughly double the excess by increasing the
number of inserted tuples from 400000 to 600000. I'll do this.

> Some logging with VACUUM VERBOSE (run on the ci instance) might be illuminating.

VACUUM VERBOSE will only tell us the number of dead tuples and dead
item identifiers, but not how much space they take up -- which is how
we decide whether or not to do index vacuuming.

- Melanie



Melanie Plageman <melanieplageman@gmail.com> writes:
> When I run it on my machine with some added logging, the space taken
> by dead items is about 330 kB more than maintenance_work_mem (which is
> set to 1 MB). I could roughly double the excess by increasing the
> number of inserted tuples from 400000 to 600000. I'll do this.

So, after about two days in the buildfarm, we have failure reports
from this test on gull, mamba, mereswine, and copperhead.  mamba
is mine, and I was able to reproduce the failure in a manual run.
The problem seems to be that the test simply takes too long and
we hit the default 180-second timeout on one step or another.
I was able to make it pass by dint of

$ export PG_TEST_TIMEOUT_DEFAULT=1800

However, the test then took 908 seconds:

$ time make installcheck PROVE_TESTS=t/043_vacuum_horizon_floor.pl
...
# +++ tap install-check in src/test/recovery +++
t/043_vacuum_horizon_floor.pl .. ok
All tests successful.
Files=1, Tests=3, 908 wallclock secs ( 0.17 usr  0.01 sys + 21.42 cusr 35.03 csys = 56.63 CPU)
Result: PASS
      909.26 real        22.10 user        35.21 sys

This is even slower than the 027_stream_regress.pl test, which
currently takes around 847 seconds on that machine.

mamba, gull, and mereswine are 32-bit machines, which aside from
being old and slow suffer an immediate 2x size-of-test penalty:

>> # The TIDStore vacuum uses to store dead items is optimized for its target
>> # system. On a 32-bit system, our example requires twice as many pages with
>> # the same number of dead items per page to fill the TIDStore and trigger a
>> # second round of index vacuuming.
>> my $is_64bit = $node_primary->safe_psql($test_db,
>>     qq[SELECT typbyval FROM pg_type WHERE typname = 'int8';]);
>>
>> my $nrows = $is_64bit eq 't' ? 400000 : 800000;

copperhead is 64-bit but is nonetheless even slower than the
other three, so the fact that it's also timing out isn't
that surprising.

I do not think the answer to this is to nag the respective animal
owners to raise PG_TEST_TIMEOUT_DEFAULT.  IMV this test is simply
not worth the cycles it takes, at least not for these machines.
I'm not sure whether to propose reverting it entirely or just
disabling it on 32-bit hardware.  I don't think we'd lose anything
meaningful in test coverage if we did the latter; but that won't be
enough to make copperhead happy.  I am also suspicious that we'll
get bad news from other very slow animals such as dikkop.

I wonder if there is a less expensive way to trigger the test
situation than brute-forcing things with a large index.
Maybe the injection point infrastructure could help?

            regards, tom lane



On Sun, Jul 21, 2024 at 12:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I do not think the answer to this is to nag the respective animal
> owners to raise PG_TEST_TIMEOUT_DEFAULT.  IMV this test is simply
> not worth the cycles it takes, at least not for these machines.

Can't we just move it to PG_TEST_EXTRA? Alongside the existing
"xid_wraparound" test?

We didn't even have basic coverage of multi-pass VACUUMs before now.
This new test added that coverage. I think that it will pull its
weight.

There will always be a small number of extremely slow buildfarm
animals. Optimizing for things like Raspberry pi animals with SD cards
just doesn't seem like a good use of developer time. I really care
about keeping the tests fast, but only on platforms that hackers
actually use for their development work.

--
Peter Geoghegan



Peter Geoghegan <pg@bowt.ie> writes:
> On Sun, Jul 21, 2024 at 12:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I do not think the answer to this is to nag the respective animal
>> owners to raise PG_TEST_TIMEOUT_DEFAULT.  IMV this test is simply
>> not worth the cycles it takes, at least not for these machines.

> Can't we just move it to PG_TEST_EXTRA? Alongside the existing
> "xid_wraparound" test?

Perhaps.  xid_wraparound seems entirely too slow for what it's
testing as well, if you ask me, and there's a concurrent thread
about that test causing problems too.

> There will always be a small number of extremely slow buildfarm
> animals. Optimizing for things like Raspberry pi animals with SD cards
> just doesn't seem like a good use of developer time. I really care
> about keeping the tests fast, but only on platforms that hackers
> actually use for their development work.

I find this argument completely disingenuous.  If a test is slow
enough to cause timeout failures on slower machines, then it's also
eating a disproportionate number of cycles in every other check-world
run --- many of which have humans waiting for them to finish.  Caring
about the runtime of test cases is good for future-you not just
obsolete buildfarm animals.

I note also that the PG_TEST_EXTRA approach has caused xid_wraparound
to get next-to-zero buildfarm coverage.  If that test is actually
capable of revealing problems, we're unlikely to find out under the
status quo.

            regards, tom lane



On Sun, Jul 21, 2024 at 5:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > There will always be a small number of extremely slow buildfarm
> > animals. Optimizing for things like Raspberry pi animals with SD cards
> > just doesn't seem like a good use of developer time. I really care
> > about keeping the tests fast, but only on platforms that hackers
> > actually use for their development work.
>
> I find this argument completely disingenuous.

Disingenuous? Really?

> If a test is slow
> enough to cause timeout failures on slower machines, then it's also
> eating a disproportionate number of cycles in every other check-world
> run --- many of which have humans waiting for them to finish.  Caring
> about the runtime of test cases is good for future-you not just
> obsolete buildfarm animals.

That's not necessarily true, though.

I actually benchmarked this new test. I found that its runtime was a
little on the long side as these recovery TAP tests go, but not to an
excessive degree. It wasn't the slowest by any means.

It's entirely possible that the new test would in fact be far slower
than other comparable tests, were I to run a similar benchmark on
something like a Raspberry pi with an SD card -- that would explain
the apparent inconsistency here. Obviously Raspberry pi type hardware
is expected to be much slower than the machine I use day to day, but
that isn't the only thing that matters. A Raspberry pi can also have
completely different performance characteristics to high quality
workstation hardware. The CPU might be tolerably fast, while I/O is a
huge bottleneck.

> I note also that the PG_TEST_EXTRA approach has caused xid_wraparound
> to get next-to-zero buildfarm coverage.  If that test is actually
> capable of revealing problems, we're unlikely to find out under the
> status quo.

I saw that.

I think that there is significant value in providing a way for
individual developers to test wraparound. Both by providing a TAP
test, and providing the associated SQL callable C test functions.
There is less value in testing it on every conceivable platform.

--
Peter Geoghegan



On Sun, Jul 21, 2024 at 12:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Melanie Plageman <melanieplageman@gmail.com> writes:
> > When I run it on my machine with some added logging, the space taken
> > by dead items is about 330 kB more than maintenance_work_mem (which is
> > set to 1 MB). I could roughly double the excess by increasing the
> > number of inserted tuples from 400000 to 600000. I'll do this.
>
> So, after about two days in the buildfarm, we have failure reports
> from this test on gull, mamba, mereswine, and copperhead.  mamba
> is mine, and I was able to reproduce the failure in a manual run.
> The problem seems to be that the test simply takes too long and
> we hit the default 180-second timeout on one step or another.
> I was able to make it pass by dint of
>
> $ export PG_TEST_TIMEOUT_DEFAULT=1800
>
> However, the test then took 908 seconds:

Thanks for taking the time to do this. If the test failures can be
fixed by increasing the timeout, that means that at least multiple index
vacuums are reliably triggered with that number of rows. Obviously we
can't have a super slow, flaky test, but I was worried the test might
fail on some platforms because the row count was somehow insufficient
to cause multiple index vacuums there (due to adaptive radix tree size
being dependent on many factors).

> $ time make installcheck PROVE_TESTS=t/043_vacuum_horizon_floor.pl
> ...
> # +++ tap install-check in src/test/recovery +++
> t/043_vacuum_horizon_floor.pl .. ok
> All tests successful.
> Files=1, Tests=3, 908 wallclock secs ( 0.17 usr  0.01 sys + 21.42 cusr 35.03 csys = 56.63 CPU)
> Result: PASS
>       909.26 real        22.10 user        35.21 sys
>
> This is even slower than the 027_stream_regress.pl test, which
> currently takes around 847 seconds on that machine.
>
> mamba, gull, and mereswine are 32-bit machines, which aside from
> being old and slow suffer an immediate 2x size-of-test penalty:
>
> >> # The TIDStore vacuum uses to store dead items is optimized for its target
> >> # system. On a 32-bit system, our example requires twice as many pages with
> >> # the same number of dead items per page to fill the TIDStore and trigger a
> >> # second round of index vacuuming.
> >> my $is_64bit = $node_primary->safe_psql($test_db,
> >>      qq[SELECT typbyval FROM pg_type WHERE typname = 'int8';]);
> >>
> >> my $nrows = $is_64bit eq 't' ? 400000 : 800000;
>
> copperhead is 64-bit but is nonetheless even slower than the
> other three, so the fact that it's also timing out isn't
> that surprising.
>
> I do not think the answer to this is to nag the respective animal
> owners to raise PG_TEST_TIMEOUT_DEFAULT.  IMV this test is simply
> not worth the cycles it takes, at least not for these machines.
> I'm not sure whether to propose reverting it entirely or just
> disabling it on 32-bit hardware.  I don't think we'd lose anything
> meaningful in test coverage if we did the latter; but that won't be
> enough to make copperhead happy.  I am also suspicious that we'll
> get bad news from other very slow animals such as dikkop.

I am happy to do what Peter suggests and move it to PG_TEST_EXTRA, to
disable for 32-bit, or to revert it.

> I wonder if there is a less expensive way to trigger the test
> situation than brute-forcing things with a large index.
> Maybe the injection point infrastructure could help?

The issue with an injection point is that we need more than for the
vacuuming backend to pause at a specific point: we need a refresh of
GlobalVisState to be forced at that point. Even if the horizon moves
backward on the primary, this backend won't notice unless it has to
update its GlobalVisState -- which often happens due to taking a new
snapshot, but also happens explicitly at the end of index vacuuming.

- Melanie



On Sun, Jul 21, 2024 at 5:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Peter Geoghegan <pg@bowt.ie> writes:
> > On Sun, Jul 21, 2024 at 12:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I do not think the answer to this is to nag the respective animal
> >> owners to raise PG_TEST_TIMEOUT_DEFAULT.  IMV this test is simply
> >> not worth the cycles it takes, at least not for these machines.
>
> > Can't we just move it to PG_TEST_EXTRA? Alongside the existing
> > "xid_wraparound" test?
>
> Perhaps.  xid_wraparound seems entirely too slow for what it's
> testing as well, if you ask me, and there's a concurrent thread
> about that test causing problems too.
>
> > There will always be a small number of extremely slow buildfarm
> > animals. Optimizing for things like Raspberry pi animals with SD cards
> > just doesn't seem like a good use of developer time. I really care
> > about keeping the tests fast, but only on platforms that hackers
> > actually use for their development work.
>
> I find this argument completely disingenuous.  If a test is slow
> enough to cause timeout failures on slower machines, then it's also
> eating a disproportionate number of cycles in every other check-world
> run --- many of which have humans waiting for them to finish.  Caring
> about the runtime of test cases is good for future-you not just
> obsolete buildfarm animals.
>
> I note also that the PG_TEST_EXTRA approach has caused xid_wraparound
> to get next-to-zero buildfarm coverage.  If that test is actually
> capable of revealing problems, we're unlikely to find out under the
> status quo.

What is the argument for PG_TEST_EXTRA if it is not running on almost
any buildfarm animals? Are some of those tests valuable for other
reasons than being consistently automatically run (e.g. developer
understanding of how a particular part of code works)?

If they aren't being run, how do we know that they still work (as test
infrastructure changes)? The recovery conflict test is skipped on 15,
which means that backported perl test changes may break it without us
knowing. I don't know if you mean that PG_TEST_EXTRA tests are never
run or just seldom run. If they are never run on the buildfarm, then
they could end up silently breaking too.

- Melanie



On Sun, Jul 21, 2024 at 4:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Sun, Jul 21, 2024 at 12:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I do not think the answer to this is to nag the respective animal
> > owners to raise PG_TEST_TIMEOUT_DEFAULT.  IMV this test is simply
> > not worth the cycles it takes, at least not for these machines.
>
> Can't we just move it to PG_TEST_EXTRA? Alongside the existing
> "xid_wraparound" test?
>
> We didn't even have basic coverage of multi-pass VACUUMs before now.
> This new test added that coverage. I think that it will pull its
> weight.

Andres has suggested in the past that we allow maintenance_work_mem be
set to a lower value or introduce some kind of development GUC so that
we can more easily test multiple pass index vacuuming. Do you think
this would be worth it?

- Melanie



On 2024-Jul-22, Melanie Plageman wrote:

> On Sun, Jul 21, 2024 at 5:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> > I note also that the PG_TEST_EXTRA approach has caused xid_wraparound
> > to get next-to-zero buildfarm coverage.  If that test is actually
> > capable of revealing problems, we're unlikely to find out under the
> > status quo.
> 
> What is the argument for PG_TEST_EXTRA if it is not running on almost
> any buildfarm animals? Are some of those tests valuable for other
> reasons than being consistently automatically run (e.g. developer
> understanding of how a particular part of code works)?

I think it's a bad idea to require buildfarm owners to edit their config
files as we add tests that depend on PG_TEST_EXTRA.  AFAIR we invented
that setting so that tests that had security implications could be made
opt-in instead of opt-out; I think this was a sensible thing to do, to
avoid possibly compromising the machines in some way.  But I think these
new tests have a different problem, so we shouldn't use the same
mechanism.

What about some brainstorming to improve this?

For example: have something in the tree that lets committers opt some
tests out from specific BF machines without having to poke at the BF
machines.  I imagine two files: one that carries tags for buildfarm
members, something like the /etc/groups file,

src/test/tags.lst
  slow: gull,mamba,mereswine,copperhead

and another file that lists tests to skip on members that have certain
tags,

src/tools/buildfarm/do_not_run.lst
  slow:src/test/modules/xid_wraparound
  slow:src/test/recovery/t/043_vacuum_horizon_floor.pl

so that run_build.pl knows that if the current member has tag slow, then
these two tests are to be skipped.

Then we can have xid_wraparound enabled generally (without requiring
PG_TEST_EXTRA), and the BF client knows not to run it in the particular
cases where it's not wanted.

This proposal has a number of problems (a glaring one being the
maintenance of the list of members per tag), but maybe it inspires
better ideas.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)



On Mon, Jul 22, 2024 at 9:32 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Andres has suggested in the past that we allow maintenance_work_mem be
> set to a lower value or introduce some kind of development GUC so that
> we can more easily test multiple pass index vacuuming. Do you think
> this would be worth it?

No, I don't.

--
Peter Geoghegan



Melanie Plageman <melanieplageman@gmail.com> writes:
> On Sun, Jul 21, 2024 at 5:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I note also that the PG_TEST_EXTRA approach has caused xid_wraparound
>> to get next-to-zero buildfarm coverage.  If that test is actually
>> capable of revealing problems, we're unlikely to find out under the
>> status quo.

> What is the argument for PG_TEST_EXTRA if it is not running on almost
> any buildfarm animals? Are some of those tests valuable for other
> reasons than being consistently automatically run (e.g. developer
> understanding of how a particular part of code works)?

The point of PG_TEST_EXTRA is to make some of the tests be opt-in.
Originally it was just used for tests that might have security
implications (e.g. the kerberos tests, which involve running a
not-terribly-locked-down kerberos server).  I'm a little suspicious
of using it for tests that merely take an unreasonable amount of
time --- to me, that indicates laziness on the part of the test
author.  It'd be better to get the test runtime down to the point
where it's reasonable to expect all the buildfarm animals to run it.
As an example, we're not getting any Valgrind coverage on
xid_wraparound, and we won't ever get it with the current approach,
which I find bad.

> I don't know if you mean that PG_TEST_EXTRA tests are never
> run or just seldom run. If they are never run on the buildfarm, then
> they could end up silently breaking too.

They are opt-in, meaning that both buildfarm owners and regular
developers have to take extra action (i.e. set the PG_TEST_EXTRA
environment variable) to run them.  There's a reasonable number
of animals opting into ssl, kerberos, etc, but I see only two
that are opting into xid_wraparound.  If we change this new test
to be conditional on PG_TEST_EXTRA, it won't get run unless you
successfully nag some buildfarm owners to run it.

            regards, tom lane



Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, Jul 22, 2024 at 9:32 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>> Andres has suggested in the past that we allow maintenance_work_mem be
>> set to a lower value or introduce some kind of development GUC so that
>> we can more easily test multiple pass index vacuuming. Do you think
>> this would be worth it?

> No, I don't.

I don't see why that's not a good idea.

            regards, tom lane



Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> I think it's a bad idea to require buildfarm owners to edit their config
> files as we add tests that depend on PG_TEST_EXTRA.  AFAIR we invented
> that setting so that tests that had security implications could be made
> opt-in instead of opt-out; I think this was a sensible thing to do, to
> avoid possibly compromising the machines in some way.  But I think these
> new tests have a different problem, so we shouldn't use the same
> mechanism.

That's my feeling also.

> What about some brainstorming to improve this?

> For example: have something in the tree that lets committers opt some
> tests out from specific BF machines without having to poke at the BF
> machines.  I imagine two files: one that carries tags for buildfarm
> members, something like the /etc/groups file,

I'd turn it around, and provide some way for buildfarm owners to
say "this machine is slow".  Maybe make the tests respond to the
presence of an environment variable PG_TEST_SKIP_SLOW, or some
such thing.  That particular solution would require no new
infrastructure (such as a new buildfarm client release); it'd
only require editing the config files of affected animals.

            regards, tom lane



On Mon, Jul 22, 2024 at 11:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Andres has suggested in the past that we allow maintenance_work_mem be
> >> set to a lower value or introduce some kind of development GUC so that
> >> we can more easily test multiple pass index vacuuming. Do you think
> >> this would be worth it?
>
> > No, I don't.
>
> I don't see why that's not a good idea.

I don't think that it's worth going to that trouble. Testing multiple
passes isn't hard -- not in any real practical sense.

I accept that there needs to be some solution to the problem of the
tests timing out on slow running buildfarm animals. Your
PG_TEST_SKIP_SLOW proposal seems like a good approach.

--
Peter Geoghegan



Hi,

On 2024-07-21 12:51:51 -0400, Tom Lane wrote:
> Melanie Plageman <melanieplageman@gmail.com> writes:
> > When I run it on my machine with some added logging, the space taken
> > by dead items is about 330 kB more than maintenance_work_mem (which is
> > set to 1 MB). I could roughly double the excess by increasing the
> > number of inserted tuples from 400000 to 600000. I'll do this.

> mamba, gull, and mereswine are 32-bit machines, which aside from
> being old and slow suffer an immediate 2x size-of-test penalty:

I think what we ought to do here is to lower the lower limit for memory usage
for vacuum. With the new state in 17+ it basically has become impossible to
test multi-pass vacuums in a way that won't get your test thrown out - that's
bad.
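
To be concrete, something like the following in guc_tables.c is all I
have in mind (a sketch only -- the existing values are quoted from
memory and the exact new floor is debatable):

    &maintenance_work_mem,
    65536, 64, MAX_KILOBYTES,    /* sketch: lower the minimum from 1024 kB to 64 kB */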


> I do not think the answer to this is to nag the respective animal owners to
> raise PG_TEST_TIMEOUT_DEFAULT.  IMV this test is simply not worth the cycles
> it takes, at least not for these machines.

This specific area of the code has a *long* history of bugs, I'd be very loath
to give up testing.

Greetings,

Andres Freund



On Mon, Jul 22, 2024 at 11:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm a little suspicious
> of using it for tests that merely take an unreasonable amount of
> time --- to me, that indicates laziness on the part of the test
> author.

Laziness would have been not bothering to develop a TAP test for this
at all. Going to the trouble of creating one and not being able to
make it as fast or as stable as everybody would like is just being
human.

I never quite know what to do about TAP testing for issues like this.
Ideally, we want a test case that runs quickly, is highly stable, is
perfectly sensitive to the bug being fixed, and has a reasonable
likelihood of being sensitive to future bugs of the same ilk. But such
a test case need not exist, and even if it does, it need not be the
case that any of us are able to find it. Or maybe finding it is
possible but will take an unreasonable amount of time: if it took a
committer six months to come up with such a test case for this bug,
would that be worth it, or just overkill? I'd say overkill: I'd rather
have that committer working on other stuff than spending six months
trying to craft the perfect test case for a bug that's already fixed.

Also, this particular bug seems to require a very specific combination
of circumstances in order to trigger it. So the test gets complicated.
As mentioned, that makes it harder to get the test case fast and
stable, but it also reduces the chances that the test case will ever
find anything. I don't think that this will be the last time we make a
mistake around VACUUM's xmin handling, but the next mistake may well
require an equally baroque but *different* setup to cause a problem. I
hate to come to the conclusion that we just shouldn't test for this,
but I don't think it's fair to send Melanie off on a wild goose chase
looking for a perfect test case that may not realistically exist,
either.

--
Robert Haas
EDB: http://www.enterprisedb.com



On 2024-07-22 13:16:49 -0400, Robert Haas wrote:
> On Mon, Jul 22, 2024 at 11:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I'm a little suspicious
> > of using it for tests that merely take an unreasonable amount of
> > time --- to me, that indicates laziness on the part of the test
> > author.
> 
> Laziness would have been not bothering to develop a TAP test for this
> at all. Going to the trouble of creating one and not being able to
> make it as fast or as stable as everybody would like is just being
> human.

Yea, I think calling weeks of effort by Melanie lazy is, uhm, not kind.

It's not like somebody else had a great suggestion for how to do this in a
better way either.



On 2024-07-22 12:00:51 -0400, Peter Geoghegan wrote:
> On Mon, Jul 22, 2024 at 11:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> Andres has suggested in the past that we allow maintenance_work_mem be
> > >> set to a lower value or introduce some kind of development GUC so that
> > >> we can more easily test multiple pass index vacuuming. Do you think
> > >> this would be worth it?
> >
> > > No, I don't.
> >
> > I don't see why that's not a good idea.
> 
> I don't think that it's worth going to that trouble. Testing multiple
> passes isn't hard -- not in any real practical sense.

It's hard now (i.e. in 17+) because you need a substantial number of rows to be
able to trigger it, which makes it a hard fight to introduce. And the cost of
setting the GUC limit lower is essentially zero.

What's the point of having such a high lower limit?

Greetings,

Andres Freund



On Mon, Jul 22, 2024 at 1:17 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jul 22, 2024 at 11:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I'm a little suspicious
> > of using it for tests that merely take an unreasonable amount of
> > time --- to me, that indicates laziness on the part of the test
> > author.
>
> Laziness would have been not bothering to develop a TAP test for this
> at all. Going to the trouble of creating one and not being able to
> make it as fast or as stable as everybody would like is just being
> human.
>
> I never quite know what to do about TAP testing for issues like this.
> Ideally, we want a test case that runs quickly, is highly stable, is
> perfectly sensitive to the bug being fixed, and has a reasonable
> likelihood of being sensitive to future bugs of the same ilk. But such
> a test case need not exist, and even if it does, it need not be the
> case that any of us are able to find it. Or maybe finding it is
> possible but will take an unreasonable amount of time: if it took a
> committer six months to come up with such a test case for this bug,
> would that be worth it, or just overkill? I'd say overkill: I'd rather
> have that committer working on other stuff than spending six months
> trying to craft the perfect test case for a bug that's already fixed.
>
> Also, this particular bug seems to require a very specific combination
> of circumstances in order to trigger it. So the test gets complicated.
> As mentioned, that makes it harder to get the test case fast and
> stable, but it also reduces the chances that the test case will ever
> find anything. I don't think that this will be the last time we make a
> mistake around VACUUM's xmin handling, but the next mistake may well
> require an equally baroque but *different* setup to cause a problem. I
> hate to come to the conclusion that we just shouldn't test for this,
> but I don't think it's fair to send Melanie off on a wild goose chase
> looking for a perfect test case that may not realistically exist,
> either.

So, I've just gone through all the test failures on master and 17 for
mamba, gull, mereswine, and copperhead. I wanted to confirm that the
test was always failing for the same reason and also if it had any
failures pre-TIDStore.

We've only run tests with this commit on some of the back branches for
some of these animals. Of those, I don't see any failures so far. So,
it seems the test instability is just related to trying to get
multiple passes of index vacuuming reliably with TIDStore.

AFAICT, all the 32-bit machine failures are timeouts waiting for the
standby to catch up (mamba, gull, mereswine). Unfortunately, the
failures on copperhead (a 64 bit machine) are because we don't
actually succeed in triggering a second vacuum pass. This would not be
fixed by a longer timeout.

Because of this, I'm inclined to revert the test on 17 and master to
avoid distracting folks committing other work and seeing those animals
go red.

I wonder if Sawada-san or John have a test case minimally reproducing
a case needing multiple index vacuuming rounds. You can't do it with
my example and just more dead rows per page. If you just increase the
number of dead tuples, it doesn't increase the size of the TIDStore
unless those dead tuples are at different offsets. And I couldn't find
DDL which would cause the TIDStore to be > 1MB without using a low
fill-factor and many rows. Additionally, the fact that the same number
of rows does not trigger the multiple passes on two different 64bit
machines worries me and makes me think that we will struggle to
trigger these conditions without overshooting the minimum by quite a
bit.

- Melanie



On Mon, Jul 22, 2024 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:
> It's hard by now (i.e. 17+) because you need substantial amounts of rows to be
> able to trigger it which makes it a hard fight to introduce.

I didn't think that it was particularly hard when I tested the test
that Melanie committed.

> And the cost of
> setting the GUC limit lower is essentially zero.

Apparently you know more about TID Store than me.

If it really is trivial to lower the limit, then I have no objections
to doing so. That would make it easy to fix the test flappiness issues
by just using the much lower limit.

--
Peter Geoghegan



On Mon, Jul 22, 2024 at 2:17 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> So, I've just gone through all the test failures on master and 17 for
> mamba, gull, mereswine, and copperhead. I wanted to confirm that the
> test was always failing for the same reason and also if it had any
> failures pre-TIDStore.
>
> We've only run tests with this commit on some of the back branches for
> some of these animals. Of those, I don't see any failures so far. So,
> it seems the test instability is just related to trying to get
> multiple passes of index vacuuming reliably with TIDStore.
>
> AFAICT, all the 32bit machine failures are timeouts waiting for the
> standby to catch up (mamba, gull, merswine). Unfortunately, the
> failures on copperhead (a 64 bit machine) are because we don't
> actually succeed in triggering a second vacuum pass. This would not be
> fixed by a longer timeout.
>
> Because of this, I'm inclined to revert the test on 17 and master to
> avoid distracting folks committing other work and seeing those animals
> go red.

Okay, I reverted this for now on 17 and master. Adding Sawada-san to
the thread to see if he has any ideas for a smaller two-round index
vacuum example.

- Melanie



Melanie Plageman <melanieplageman@gmail.com> writes:
> We've only run tests with this commit on some of the back branches for
> some of these animals. Of those, I don't see any failures so far. So,
> it seems the test instability is just related to trying to get
> multiple passes of index vacuuming reliably with TIDStore.

> AFAICT, all the 32bit machine failures are timeouts waiting for the
> standby to catch up (mamba, gull, merswine). Unfortunately, the
> failures on copperhead (a 64 bit machine) are because we don't
> actually succeed in triggering a second vacuum pass. This would not be
> fixed by a longer timeout.

Ouch.  This seems to me to raise the importance of getting a better
way to test multiple-index-vacuum-passes.  Peter argued upthread
that we don't need a better way, but I don't see how that argument
holds water if copperhead was not reaching it despite being 64-bit.
(Did you figure out exactly why it doesn't reach the code?)

> Because of this, I'm inclined to revert the test on 17 and master to
> avoid distracting folks committing other work and seeing those animals
> go red.

Agreed as a short-term measure.

            regards, tom lane



On Mon, Jul 22, 2024 at 10:37:17AM -0700, Andres Freund wrote:
> On 2024-07-22 13:16:49 -0400, Robert Haas wrote:
>> Laziness would have been not bothering to develop a TAP test for this
>> at all. Going to the trouble of creating one and not being able to
>> make it as fast or as stable as everybody would like is just being
>> human.
>
> Yea, I think calling weeks of effort by Melanie lazy is, uhm, not kind.

FWIW, I'm really impressed by what Melanie has achieved here,
fixing a hard bug while hacking a crazily-complicated test to make it
reproducible.  This has an incredible amount of value in the long run.

> It's not like somebody else had a great suggestion for how to do this in a
> better way either.

Sawada-san and John are the two people in the best position to answer
that.  I'm not sure how to force a second index pass, either.
Hmm.  An idea would be to manipulate the TIDStore stack under the
injection points switch?  This is run by the CI and some buildfarm
members already.
--
Michael

Attachment
Michael Paquier <michael@paquier.xyz> writes:
> Sawada-san and John are the two ones in the best position to answer
> that.  I'm not sure either how to force a second index pass, either.

Yeah, I think we've established that having some way to force that,
without using a huge test case, would be really desirable.  Maybe
just provide a way to put an artificial limit on how many tuples
processed per pass?

(And no, I wasn't trying to rag on Melanie.  My point here is that
we've failed to design-in easy testability of this code path, and
that's surely not her fault.)

            regards, tom lane



On Mon, Jul 22, 2024 at 2:04 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Mon, Jul 22, 2024 at 2:17 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > So, I've just gone through all the test failures on master and 17 for
> > mamba, gull, mereswine, and copperhead. I wanted to confirm that the
> > test was always failing for the same reason and also if it had any
> > failures pre-TIDStore.
> >
> > We've only run tests with this commit on some of the back branches for
> > some of these animals. Of those, I don't see any failures so far. So,
> > it seems the test instability is just related to trying to get
> > multiple passes of index vacuuming reliably with TIDStore.
> >
> > AFAICT, all the 32bit machine failures are timeouts waiting for the
> > standby to catch up (mamba, gull, merswine). Unfortunately, the
> > failures on copperhead (a 64 bit machine) are because we don't
> > actually succeed in triggering a second vacuum pass. This would not be
> > fixed by a longer timeout.
> >
> > Because of this, I'm inclined to revert the test on 17 and master to
> > avoid distracting folks committing other work and seeing those animals
> > go red.
>
> Okay, I reverted this for now on 17 and master. Adding Sawada-san to
> the thread to see if he has any ideas for a smaller two-round index
> vacuum example.
>

+   CREATE TABLE ${table1}(col1 int)
+       WITH (autovacuum_enabled=false, fillfactor=10);
+   INSERT INTO $table1 VALUES(7);
+   INSERT INTO $table1 SELECT generate_series(1, $nrows) % 3;
+   CREATE INDEX on ${table1}(col1);
+   UPDATE $table1 SET col1 = 3 WHERE col1 = 0;
+   INSERT INTO $table1 VALUES(7);

These queries make sense to me; they make the radix tree wide and use
more nodes, instead of fattening leaf nodes (i.e. the offset bitmap).
The $table1 has 18182 blocks and the statistics of radix tree shows:

max_val = 65535
num_keys = 18182
height = 1, n4 = 0, n16 = 1, n32 = 0, n64 = 0, n256 = 72, leaves = 18182

Which means that the height of the tree is 2 and we use the maximum
size node for all nodes except for 1 node.

I don't have any great idea for substantially reducing the total
number of tuples in $table1. Probably we can use DELETE instead of
UPDATE to make garbage tuples (although I'm not sure it's okay for
this test). That reduces the amount of WAL from 11MB to 4MB and would
reduce the time for the standby to catch up, but I'm not sure how much
it would help. There might be ideas to trigger a two-round index
vacuum with fewer tuples, but if the tests are too optimized for the
current TidStore, we will have to re-adjust them if the TidStore
changes in the future. So I think it's better and more reliable to
allow maintenance_work_mem to be set to a lower value or to use
injection points somehow.
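
For example, the garbage-generating statement in the test could become
something like this (an untested sketch; $table1 is the test's table
from the hunk quoted above, and I haven't checked whether the rest of
the test relies on the updated values):

DELETE FROM $table1 WHERE col1 = 0;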

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Mon, Jul 22, 2024 at 6:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Melanie Plageman <melanieplageman@gmail.com> writes:
> > We've only run tests with this commit on some of the back branches for
> > some of these animals. Of those, I don't see any failures so far. So,
> > it seems the test instability is just related to trying to get
> > multiple passes of index vacuuming reliably with TIDStore.
>
> > AFAICT, all the 32bit machine failures are timeouts waiting for the
> > standby to catch up (mamba, gull, merswine). Unfortunately, the
> > failures on copperhead (a 64 bit machine) are because we don't
> > actually succeed in triggering a second vacuum pass. This would not be
> > fixed by a longer timeout.
>
> Ouch.  This seems to me to raise the importance of getting a better
> way to test multiple-index-vacuum-passes.  Peter argued upthread
> that we don't need a better way, but I don't see how that argument
> holds water if copperhead was not reaching it despite being 64-bit.
> (Did you figure out exactly why it doesn't reach the code?)

I wasn't able to reproduce the failure (failing to do > 1 index vacuum
pass) on my local machine (which is 64 bit) without decreasing the
number of tuples inserted. The copperhead failure confuses me because
the speed of the machine should *not* affect how much space the dead
item TIDStore takes up. I would have bet money that the same number
and offsets of dead tuples per page in a relation would take up the
same amount of space in a TIDStore on any 64-bit system -- regardless
of how slowly it runs vacuum.

Here is some background on how I came up with the DDL and tuple count
for the test: TIDStore uses a BITS_PER_BITMAPWORD of 32 on 32-bit
systems and 64 on 64-bit systems. So, if you only have one
bitmapword's worth of dead items per page, it is easy to see that you
would need double the number of pages with dead items to take up the
same amount of TIDStore space on a 32-bit system as on a 64-bit
system.

I wanted to figure out how to take up double the amount of TIDStore
space *without* doubling the number of tuples. This is not
straightforward. You can't just delete twice as many dead tuples per
page. For starters, many dead tuples can be represented compactly in a
single bitmapword. Beyond that, the amount of space the adaptive radix
tree takes up seems to be affected by whether the dead items are at
the same offsets on all the pages. I thought this might have to do
with being able to use the same chunk (in ART terms)? I spent some
time trying to figure it out, but I gave up once I got confused enough
to try to read the adaptive radix tree paper.

I found myself wishing there was some way to visualize the TIDStore. I
don't have good ideas how to represent this, but if we found one, we
could add a function to the test_tidstore module.

I also think it would be useful to have peak TIDStore usage in bytes
in the vacuum verbose output. I had it on my list to propose something
like this after I hacked together a version myself while trying to
debug the test locally.

- Melanie



On Mon, Jul 22, 2024 at 6:26 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Mon, Jul 22, 2024 at 6:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Melanie Plageman <melanieplageman@gmail.com> writes:
> > > We've only run tests with this commit on some of the back branches for
> > > some of these animals. Of those, I don't see any failures so far. So,
> > > it seems the test instability is just related to trying to get
> > > multiple passes of index vacuuming reliably with TIDStore.
> >
> > > AFAICT, all the 32bit machine failures are timeouts waiting for the
> > > standby to catch up (mamba, gull, merswine). Unfortunately, the
> > > failures on copperhead (a 64 bit machine) are because we don't
> > > actually succeed in triggering a second vacuum pass. This would not be
> > > fixed by a longer timeout.
> >
> > Ouch.  This seems to me to raise the importance of getting a better
> > way to test multiple-index-vacuum-passes.  Peter argued upthread
> > that we don't need a better way, but I don't see how that argument
> > holds water if copperhead was not reaching it despite being 64-bit.
> > (Did you figure out exactly why it doesn't reach the code?)
>
> I wasn't able to reproduce the failure (failing to do > 1 index vacuum
> pass) on my local machine (which is 64 bit) without decreasing the
> number of tuples inserted. The copperhead failure confuses me because
> the speed of the machine should *not* affect how much space the dead
> item TIDStore takes up. I would have bet money that the same number
> and offsets of dead tuples per page in a relation would take up the
> same amount of space in a TIDStore on any 64-bit system -- regardless
> of how slowly it runs vacuum.

Looking at copperhead's failure logs, I could not find where "VACUUM
(VERBOSE, FREEZE) vac_horizon_floor_table;" reported the number of
index scans. Is there any clue that made you think the test failed to
do multiple index vacuum passes?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Mon, Jul 22, 2024 at 10:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jul 22, 2024 at 6:26 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > On Mon, Jul 22, 2024 at 6:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >
> > > Melanie Plageman <melanieplageman@gmail.com> writes:
> > > > We've only run tests with this commit on some of the back branches for
> > > > some of these animals. Of those, I don't see any failures so far. So,
> > > > it seems the test instability is just related to trying to get
> > > > multiple passes of index vacuuming reliably with TIDStore.
> > >
> > > > AFAICT, all the 32bit machine failures are timeouts waiting for the
> > > > standby to catch up (mamba, gull, merswine). Unfortunately, the
> > > > failures on copperhead (a 64 bit machine) are because we don't
> > > > actually succeed in triggering a second vacuum pass. This would not be
> > > > fixed by a longer timeout.
> > >
> > > Ouch.  This seems to me to raise the importance of getting a better
> > > way to test multiple-index-vacuum-passes.  Peter argued upthread
> > > that we don't need a better way, but I don't see how that argument
> > > holds water if copperhead was not reaching it despite being 64-bit.
> > > (Did you figure out exactly why it doesn't reach the code?)
> >
> > I wasn't able to reproduce the failure (failing to do > 1 index vacuum
> > pass) on my local machine (which is 64 bit) without decreasing the
> > number of tuples inserted. The copperhead failure confuses me because
> > the speed of the machine should *not* affect how much space the dead
> > item TIDStore takes up. I would have bet money that the same number
> > and offsets of dead tuples per page in a relation would take up the
> > same amount of space in a TIDStore on any 64-bit system -- regardless
> > of how slowly it runs vacuum.
>
> Looking at copperhead's failure logs, I could not find that "VACUUM
> (VERBOSE, FREEZE) vac_horizon_floor_table;" wrote the number of index
> scans in logs. Is there any clue that made you think the test failed
> to do multiple index vacuum passes?

The vacuum doesn't actually finish, because I hold a cursor that keeps
it from finishing. I then query pg_stat_progress_vacuum after the
point at which the first index vacuuming round should have happened,
and it shows that the index vacuum has not been done:

[20:39:34.645](351.522s) # poll_query_until timed out executing this query:
#
# SELECT index_vacuum_count > 0
# FROM pg_stat_progress_vacuum
# WHERE datname='test_db' AND relid::regclass =
'vac_horizon_floor_table'::regclass;
#
# expecting this output:
# t
# last actual query output:
# f

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-07-22%2015%3A00%3A11

I suppose it is possible that it did in fact time out and the index
vacuum was still in progress. But most of the other "too slow"
failures were when the standby was trying to catch up. Usually the
pg_stat_progress_vacuum test fails because we didn't actually do that
index vacuuming round yet.
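
For reference, the cursor setup on the primary looks roughly like this
(a sketch only; the real test drives it from the TAP script, and the
table and column names and the col1 = 7 condition are assumptions
based on the test DDL quoted upthread):

BEGIN;
DECLARE vac_cursor CURSOR FOR
    SELECT * FROM vac_horizon_floor_table WHERE col1 = 7;
FETCH FORWARD 1 FROM vac_cursor;
-- keep this transaction open; as long as the cursor pins a heap page,
-- the concurrent VACUUM (VERBOSE, FREEZE) cannot finish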

- Melanie



On Tue, Jul 23, 2024 at 5:43 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Mon, Jul 22, 2024 at 10:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Jul 22, 2024 at 6:26 PM Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
> > >
> > > On Mon, Jul 22, 2024 at 6:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > >
> > > > Melanie Plageman <melanieplageman@gmail.com> writes:
> > > > > We've only run tests with this commit on some of the back branches for
> > > > > some of these animals. Of those, I don't see any failures so far. So,
> > > > > it seems the test instability is just related to trying to get
> > > > > multiple passes of index vacuuming reliably with TIDStore.
> > > >
> > > > > AFAICT, all the 32bit machine failures are timeouts waiting for the
> > > > > standby to catch up (mamba, gull, merswine). Unfortunately, the
> > > > > failures on copperhead (a 64 bit machine) are because we don't
> > > > > actually succeed in triggering a second vacuum pass. This would not be
> > > > > fixed by a longer timeout.
> > > >
> > > > Ouch.  This seems to me to raise the importance of getting a better
> > > > way to test multiple-index-vacuum-passes.  Peter argued upthread
> > > > that we don't need a better way, but I don't see how that argument
> > > > holds water if copperhead was not reaching it despite being 64-bit.
> > > > (Did you figure out exactly why it doesn't reach the code?)
> > >
> > > I wasn't able to reproduce the failure (failing to do > 1 index vacuum
> > > pass) on my local machine (which is 64 bit) without decreasing the
> > > number of tuples inserted. The copperhead failure confuses me because
> > > the speed of the machine should *not* affect how much space the dead
> > > item TIDStore takes up. I would have bet money that the same number
> > > and offsets of dead tuples per page in a relation would take up the
> > > same amount of space in a TIDStore on any 64-bit system -- regardless
> > > of how slowly it runs vacuum.
> >
> > Looking at copperhead's failure logs, I could not find that "VACUUM
> > (VERBOSE, FREEZE) vac_horizon_floor_table;" wrote the number of index
> > scans in logs. Is there any clue that made you think the test failed
> > to do multiple index vacuum passes?
>
> The vacuum doesn't actually finish because I have a cursor that keeps
> it from finishing and then I query pg_stat_progress_vacuum after the
> first index vacuuming round should have happened and it did not do the
> index vacuum:
>
> [20:39:34.645](351.522s) # poll_query_until timed out executing this query:
> #
> # SELECT index_vacuum_count > 0
> # FROM pg_stat_progress_vacuum
> # WHERE datname='test_db' AND relid::regclass =
> 'vac_horizon_floor_table'::regclass;
> #
> # expecting this output:
> # t
> # last actual query output:
> # f
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-07-22%2015%3A00%3A11
>
> I suppose it is possible that it did in fact time out and the index
> vacuum was still in progress. But most of the other "too slow"
> failures were when the standby was trying to catch up. Usually the
> pg_stat_progress_vacuum test fails because we didn't actually do that
> index vacuuming round yet.

Thank you for your explanation! I understood the test cases.

I figured out why the two-round index vacuum was not triggered on
copperhead even though it's a 64-bit system. In short, this test case
depends on MEMORY_CONTEXT_CHECKING (or USE_ASSERT_CHECKING) being on.

In this test case, every BlocktableEntry would be 16 bytes: the header
is 8 bytes and the offset bitmap is 8 bytes (covering offsets up to
63). We calculate the memory size (required_size in BumpAlloc()) to
allocate in a bump memory context as follows:

#ifdef MEMORY_CONTEXT_CHECKING
    /* ensure there's always space for the sentinel byte */
    chunk_size = MAXALIGN(size + 1);
#else
    chunk_size = MAXALIGN(size);
#endif

   (snip)

    required_size = chunk_size + Bump_CHUNKHDRSZ;

Without MEMORY_CONTEXT_CHECKING, if size is 16 bytes, required_size is
also 16 bytes, as it's already 8-byte aligned and Bump_CHUNKHDRSZ is 0.
On the other hand, with MEMORY_CONTEXT_CHECKING, the required_size is
bumped to 40 bytes, as chunk_size is 24 bytes and Bump_CHUNKHDRSZ is 16
bytes. Therefore, with MEMORY_CONTEXT_CHECKING, we allocate more memory
and use more Bump memory blocks, resulting in filling up the TidStore
in the test cases. We can easily reproduce this test failure with a
PostgreSQL server built without --enable-cassert. It seems that
copperhead is the sole BF animal that doesn't use --enable-cassert but
runs recovery-check.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Wed, Jul 24, 2024 at 5:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> Without MEMORY_CONTEXT_CHECKING, if size is 16 bytes, required_size is
> also 16 bytes, as it's already 8-byte aligned and Bump_CHUNKHDRSZ is 0.
> On the other hand, with MEMORY_CONTEXT_CHECKING, the required_size is
> bumped to 40 bytes, as chunk_size is 24 bytes and Bump_CHUNKHDRSZ is 16
> bytes. Therefore, with MEMORY_CONTEXT_CHECKING, we allocate more memory
> and use more Bump memory blocks, resulting in filling up the TidStore
> in the test cases. We can easily reproduce this test failure with a
> PostgreSQL server built without --enable-cassert. It seems that
> copperhead is the sole BF animal that doesn't use --enable-cassert but
> runs recovery-check.

It seems we could force the bitmaps to be larger, and also reduce the
number of updated tuples, by updating only the last few tuples (say
5-10) on each page, selected by looking at the ctid's offset. This
requires some trickery, but I believe I've done it in the past by
casting to text and extracting with a regex. (I'm assuming the number
of tuples updated is more important than the number of tuples inserted
on a newly created table.)
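
For concreteness, a hypothetical version of that trickery might look
like this (table, column, and threshold are invented here; the point
is just to touch only the high-offset tuples on each page):

UPDATE test SET a = a + 1
WHERE substring(ctid::text from ',(\d+)\)')::int > 20;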

As for lowering the limit, we've experimented with 256kB here:

https://www.postgresql.org/message-id/CANWCAZZUTvZ3LsYpauYQVzcEZXZ7Qe+9ntnHgYZDTWxPuL++zA@mail.gmail.com

As I mention there, going lower than that would need a small amount of
reorganization in the radix tree. Not difficult -- the thing I'm
concerned about is that we'd likely need to document a separate
minimum for DSA, since that behaves strangely with 256kB and might not
work at all lower than that.



On Wed, Jul 24, 2024 at 2:42 PM John Naylor <johncnaylorls@gmail.com> wrote:
> As for lowering the limit, we've experimented with 256kB here:
>
> https://www.postgresql.org/message-id/CANWCAZZUTvZ3LsYpauYQVzcEZXZ7Qe+9ntnHgYZDTWxPuL++zA@mail.gmail.com
>
> As I mention there, going lower than that would need a small amount of
> reorganization in the radix tree. Not difficult -- the thing I'm
> concerned about is that we'd likely need to document a separate
> minimum for DSA, since that behaves strangely with 256kB and might not
> work at all lower than that.

For experimentation, here's a rough patch (really two, squashed
together for now) that allows m_w_m to go down to 64kB.

drop table if exists test;
create table test (a int) with (autovacuum_enabled=false, fillfactor=10);
insert into test (a) select i from generate_series(1,2000) i;
create index on test (a);
update test set a = a + 1;

set maintenance_work_mem = '64kB';
vacuum (verbose) test;

INFO:  vacuuming "john.public.test"
INFO:  finished vacuuming "john.public.test": index scans: 3
pages: 0 removed, 91 remain, 91 scanned (100.00% of total)

The advantage with this is that we don't need to care about
MEMORY_CONTEXT_CHECKING or 32/64 bit-ness, since allocating a single
large node will immediately blow the limit, and that will happen
fairly quickly regardless. I suspect going this low will not work with
dynamic shared memory and if so would need a warning comment.

On Mon, Jul 22, 2024 at 9:26 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> +   CREATE TABLE ${table1}(col1 int)
> +       WITH (autovacuum_enabled=false, fillfactor=10);
> +   INSERT INTO $table1 VALUES(7);
> +   INSERT INTO $table1 SELECT generate_series(1, $nrows) % 3;
> +   CREATE INDEX on ${table1}(col1);
> +   UPDATE $table1 SET col1 = 3 WHERE col1 = 0;
> +   INSERT INTO $table1 VALUES(7);
>
> These queries make sense to me; they make the radix tree wide and use
> more nodes, instead of fattening the leaf nodes (i.e. the offset bitmaps).
> The $table1 has 18182 blocks and the statistics of radix tree shows:
>
> max_val = 65535
> num_keys = 18182
> height = 1, n4 = 0, n16 = 1, n32 = 0, n64 = 0, n256 = 72, leaves = 18182
>
> Which means that the height of the tree is 2 and we use the maximum
> size node for all nodes except for 1 node.

Do you have some kind of tool that prints this out for you? That would
be really handy.

> I don't have any great idea to substantially reduce the total number
> of tuples in the $table1. Probably we can use DELETE instead of UPDATE
> to make garbage tuples (although I'm not sure it's okay for this
> test). Which reduces the amount of WAL records from 11MB to 4MB and
> would reduce the time to catch up. But I'm not sure how much it would
> help. There might be ideas to trigger a two-round index vacuum with
> fewer tuples but if the tests are too optimized for the current
> TidStore, we will have to re-adjust them if the TidStore changes in
> the future. So I think it's better and reliable to allow
> maintenance_work_mem to be a lower value or use injection points
> somehow.

I think we can make improvements in overall time on master and 17 with
the examples John provided later in the thread. However, I realized
you are right about using a DELETE instead of an UPDATE. At some point
in my development, I needed the UPDATE to satisfy some other aspect of
the test. But that is no longer true. A DELETE works just as well as
an UPDATE WRT the dead items and, as you point out, much less WAL is
created and replay is much faster.

I also realized I forgot to add 043_vacuum_horizon_floor.pl to
src/test/recovery/meson.build in 16. I will post a patch here this
weekend which changes the UPDATE to a DELETE in 14-16 (sped up the
test by about 20% for me locally) and adds 043_vacuum_horizon_floor.pl
to src/test/recovery/meson.build in 16. I'll plan to push it on Monday
to save myself any weekend buildfarm embarrassment.

As for 17 and master, I'm going to try out John's examples and see if
it seems like it will be fast enough to commit to 17/master without
lowering the maintenance_work_mem lower bound.

If we want to lower it, I wonder if we should just halve it -- it
seems like the tests with half the number of tuples were fast enough
to avoid timing out on the slower buildfarm animals. Or do we need
some more meaningful value to decrease it to?

- Melanie



On Wed, Jul 24, 2024 at 8:19 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Wed, Jul 24, 2024 at 2:42 PM John Naylor <johncnaylorls@gmail.com> wrote:
> > As for lowering the limit, we've experimented with 256kB here:
> >
> > https://www.postgresql.org/message-id/CANWCAZZUTvZ3LsYpauYQVzcEZXZ7Qe+9ntnHgYZDTWxPuL++zA@mail.gmail.com
> >
> > As I mention there, going lower than that would need a small amount of
> > reorganization in the radix tree. Not difficult -- the thing I'm
> > concerned about is that we'd likely need to document a separate
> > minimum for DSA, since that behaves strangely with 256kB and might not
> > work at all lower than that.
>
> For experimentation, here's a rough patch (really two, squashed
> together for now) that allows m_w_m to go down to 64kB.

Oh, great, thanks! I didn't read this closely enough before I posed my
upthread question about how small we should make the minimum. It
sounds like you've thought a lot about this.

I ran my test with your patch (on my 64-bit system, non-assert build)
and the result is great:

master with my test (slightly modified to now use DELETE instead of
UPDATE as mentioned upthread)
    3.09s

master with your patch applied, MWM set to 64kB and 9000 rows instead of 800000
    1.06s

> drop table if exists test;
> create table test (a int) with (autovacuum_enabled=false, fillfactor=10);
> insert into test (a) select i from generate_series(1,2000) i;
> create index on test (a);
> update test set a = a + 1;
>
> set maintenance_work_mem = '64kB';
> vacuum (verbose) test;
>
> INFO:  vacuuming "john.public.test"
> INFO:  finished vacuuming "john.public.test": index scans: 3
> pages: 0 removed, 91 remain, 91 scanned (100.00% of total)
>
> The advantage with this is that we don't need to care about
> MEMORY_CONTEXT_CHECKING or 32/64 bit-ness, since allocating a single
> large node will immediately blow the limit, and that will happen
> fairly quickly regardless. I suspect going this low will not work with
> dynamic shared memory and if so would need a warning comment.

I took a look at the patch, but I can't say I know enough about the
memory allocation subsystems and how TIDStore works to meaningfully
review it -- nor enough about DSM to comment about the interactions.

I suspect 256kB would also be fast enough to avoid my test timing out
on the buildfarm, but it is appealing to have a minimum for
maintenance_work_mem that is the same as work_mem.

- Melanie



On Wed, Jul 24, 2024 at 3:43 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Wed, Jul 24, 2024 at 5:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > Without MEMORY_CONTEXT_CHECKING, if size is 16 bytes, required_size is
> > also 16 bytes, as it's already 8-byte aligned and Bump_CHUNKHDRSZ is 0.
> > On the other hand, with MEMORY_CONTEXT_CHECKING, the required_size is
> > bumped to 40 bytes, as chunk_size is 24 bytes and Bump_CHUNKHDRSZ is 16
> > bytes. Therefore, with MEMORY_CONTEXT_CHECKING, we allocate more memory
> > and use more Bump memory blocks, resulting in filling up the TidStore
> > in the test cases. We can easily reproduce this test failure with a
> > PostgreSQL server built without --enable-cassert. It seems that
> > copperhead is the sole BF animal that doesn't use --enable-cassert but
> > runs recovery-check.
>
> It seems we could force the bitmaps to be larger, and also reduce the
> number of updated tuples by updating only the last few tuples (say
> 5-10) by looking at the ctid's offset. This requires some trickery,
> but I believe I've done it in the past by casting to text and
> extracting with a regex. (I'm assuming the number of tuples updated is
> more important than the number of tuples inserted on a newly created
> table.)

Yes, the only thing that is important is having two rounds of index
vacuuming and having one tuple with a value matching my cursor
condition before the first index vacuum and one after. What do you
mean update only the last few tuples though?

- Melanie





On Saturday, July 27, 2024, Melanie Plageman <melanieplageman@gmail.com> wrote:

> Yes, the only thing that is important is having two rounds of index
> vacuuming and having one tuple with a value matching my cursor
> condition before the first index vacuum and one after. What do you
> mean update only the last few tuples though?

I meant we could update tuples with the highest offsets on each page.
That would then lead to longer arrays of bitmaps to store offsets
during vacuum. Lowering the minimum memory setting is easier to code
and reason about, however.

On Fri, Jul 26, 2024 at 1:27 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Mon, Jul 22, 2024 at 9:26 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > +   CREATE TABLE ${table1}(col1 int)
> > +       WITH (autovacuum_enabled=false, fillfactor=10);
> > +   INSERT INTO $table1 VALUES(7);
> > +   INSERT INTO $table1 SELECT generate_series(1, $nrows) % 3;
> > +   CREATE INDEX on ${table1}(col1);
> > +   UPDATE $table1 SET col1 = 3 WHERE col1 = 0;
> > +   INSERT INTO $table1 VALUES(7);
> >
> > These queries make sense to me; they make the radix tree wide and use
> > more nodes, instead of fattening the leaf nodes (i.e. the offset bitmaps).
> > The $table1 has 18182 blocks and the statistics of radix tree shows:
> >
> > max_val = 65535
> > num_keys = 18182
> > height = 1, n4 = 0, n16 = 1, n32 = 0, n64 = 0, n256 = 72, leaves = 18182
> >
> > Which means that the height of the tree is 2 and we use the maximum
> > size node for all nodes except for 1 node.
>
> Do you have some kind of tool that prints this out for you? That would
> be really handy.

You can add '#define RT_DEBUG' for the radix tree used in TidStore and
then call RT_STATS (e.g., local_ts_stats()).

>
> > I don't have any great idea to substantially reduce the total number
> > of tuples in the $table1. Probably we can use DELETE instead of UPDATE
> > to make garbage tuples (although I'm not sure it's okay for this
> > test). Which reduces the amount of WAL records from 11MB to 4MB and
> > would reduce the time to catch up. But I'm not sure how much it would
> > help. There might be ideas to trigger a two-round index vacuum with
> > fewer tuples but if the tests are too optimized for the current
> > TidStore, we will have to re-adjust them if the TidStore changes in
> > the future. So I think it's better and reliable to allow
> > maintenance_work_mem to be a lower value or use injection points
> > somehow.
>
> I think we can make improvements in overall time on master and 17 with
> the examples John provided later in the thread. However, I realized
> you are right about using a DELETE instead of an UPDATE. At some point
> in my development, I needed the UPDATE to satisfy some other aspect of
> the test. But that is no longer true. A DELETE works just as well as
> an UPDATE WRT the dead items and, as you point out, much less WAL is
> created and replay is much faster.
>
> I also realized I forgot to add 043_vacuum_horizon_floor.pl to
> src/test/recovery/meson.build in 16. I will post a patch here this
> weekend which changes the UPDATE to a DELETE in 14-16 (sped up the
> test by about 20% for me locally) and adds 043_vacuum_horizon_floor.pl
> to src/test/recovery/meson.build in 16. I'll plan to push it on Monday
> to save myself any weekend buildfarm embarrassment.
>
> As for 17 and master, I'm going to try out John's examples and see if
> it seems like it will be fast enough to commit to 17/master without
> lowering the maintenance_work_mem lower bound.

+1. Thanks.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> In back branches starting with 14, failing to remove tuples older than
> OldestXmin during pruning caused vacuum to infinitely loop in
> lazy_scan_prune(), as investigated on this [1] thread.

Shouldn't somebody remove the entry that we have for this issue under
https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items#Older_bugs_affecting_stable_branches?

--
Peter Geoghegan



On Wed, Jul 31, 2024 at 4:38 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > In back branches starting with 14, failing to remove tuples older than
> > OldestXmin during pruning caused vacuum to infinitely loop in
> > lazy_scan_prune(), as investigated on this [1] thread.
>
> Shouldn't somebody remove the entry that we have for this issue under
> https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items#Older_bugs_affecting_stable_branches?

Thanks for the reminder. Done!

- Melanie



On Sat, Jul 27, 2024 at 4:04 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Wed, Jul 24, 2024 at 8:19 AM John Naylor <johncnaylorls@gmail.com> wrote:

> I ran my test with your patch (on my 64-bit system, non-assert build)
> and the result is great:
>
> master with my test (slightly modified to now use DELETE instead of
> UPDATE as mentioned upthread)
>     3.09s
>
> master with your patch applied, MWM set to 64kB and 9000 rows instead of 800000
>     1.06s

Glad to hear it!

> I took a look at the patch, but I can't say I know enough about the
> memory allocation subsystems and how TIDStore works to meaningfully
> review it -- nor enough about DSM to comment about the interactions.

I tried using parallel vacuum with 64kB and it succeeded, but needed
to perform an index scan for every heap page pruned. It's not hard to
imagine some code moving around so that it doesn't work anymore, but
since this is for testing only, it seems a warning comment is enough.

> I suspect 256kB would also be fast enough to avoid my test timing out
> on the buildfarm, but it is appealing to have a minimum for
> maintenance_work_mem that is the same as work_mem.

Agreed on both counts:

I came up with a simple ctid expression to make the bitmap arrays larger:

delete from test where ctid::text like '%,2__)';

That pattern matches only tuples whose offset number is in the 200-299
range, so each affected page gets a long offset bitmap from relatively
few deleted tuples. With that, it still takes between 250k and 300k
tuples to force a second index scan with 256kB m_w_m, default
fillfactor, and without asserts. (It may need a few more pages for
32-bit, but not many more.) The table is around 1300 pages, whereas on
v16 it's about 900. But with fewer tuples deleted, the WAL for deletes
should be lower, so it might be comparable to v16's test.

It also turns out that, to support 64kB memory settings, we actually
wouldn't need to change the radix tree to lazily create memory
contexts -- at least currently, SlabCreate doesn't allocate a keeper
block, so a newly created slab context reports 0 for "mem_allocated".
So I'm inclined to go ahead and change the minimum m_w_m on v17 and
master to 64kB. It's the quickest and (I think) most future-proof way
to make this test work. Any objections?



On Tue, Aug 6, 2024 at 9:58 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> It also turns out that to support 64kB memory settings, we actually
> wouldn't need to change radix tree to lazily create memory contexts --
> at least currently, SlabCreate doesn't allocate a keeper block, so a
> newly created slab context reports 0 for "mem_allocated". So I'm
> inclined to go ahead change the minimum m_w_m on v17 and master to
> 64kB. It's the quickest and (I think) most future-proof way to make
> this test work. Any objections?

This is done. I also changed autovacuum_work_mem just for the sake of
consistency. I did some quick math and found that there shouldn't be a
difference between 32- and 64-bit platforms for when they exceed 64kB
in the tid store. That's because exceeding the limit is caused by
allocating the first block of one of the slab contexts. That
independence may not be stable, so I'm thinking of hard-coding the
block sizes in master only, but I've left that for another time.



On 2024-08-10 16:01:18 +0700, John Naylor wrote:
> On Tue, Aug 6, 2024 at 9:58 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > It also turns out that to support 64kB memory settings, we actually
> > wouldn't need to change radix tree to lazily create memory contexts --
> > at least currently, SlabCreate doesn't allocate a keeper block, so a
> > newly created slab context reports 0 for "mem_allocated". So I'm
> > inclined to go ahead change the minimum m_w_m on v17 and master to
> > 64kB. It's the quickest and (I think) most future-proof way to make
> > this test work. Any objections?
> 
> This is done. I also changed autovacuum_work_mem just for the sake of
> consistency. I did some quick math and found that there shouldn't be a
> difference between 32- and 64-bit platforms for when they exceed 64kB
> in the tid store. That's because exceeding the limit is caused by
> allocating the first block of one of the slab contexts. That
> independence may not be stable, so I'm thinking of hard-coding the
> block sizes in master only, but I've left that for another time.

Thanks a lot!

Greetings,

Andres Freund