Thread: Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Alexander Lakhin
Date:
29.10.2021 16:00, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference:      17257
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 14.0
> Operating system:   Ubuntu 20.04
> Description:        
>
> When running concurrent installchecks (x100) I've observed the autovacuum
> process hanging:
>
I can propose the debugging patch to reproduce the issue that replaces
the hang with the assert and modifies a pair of crash-causing test
scripts to simplify the reproducing. (Sorry, I have no time now to prune
down the scripts further as I have to leave for a week.)

The reproducing script is:
###
export PGDATABASE=regression
createdb regression

echo "
vacuum (verbose, skip_locked, index_cleanup off) pg_catalog.pg_class;
select pg_sleep(random()/50);
" >/tmp/pseudo-autovacuum.sql

( timeout 10m bash -c "while true; do psql -f src/test/regress/sql/inherit.sql >>psql1.log 2>&1; coredumpctl --no-pager >/dev/null 2>&1 && break; done" ) &
( timeout 10m bash -c "while true; do psql -f src/test/regress/sql/vacuum.sql >>psql2.log 2>&1; coredumpctl --no-pager >/dev/null 2>&1 && break; done" ) &
pgbench -n -f /tmp/pseudo-autovacuum.sql -C -c 40 -T 600 >pgbench.log 2>&1 &

wait
###

(I've set:
autovacuum=off
fsync=off
in postgresql.conf)

It leads to:
TIME                            PID   UID   GID SIG COREFILE  EXE
Fri 2021-10-29 16:15:59 MSK  2337121  1000  1000   6 present  
.../usr/local/pgsql/bin/postgres

real    2m19,425s
user    4m26,078s
sys     0m31,658s

Core was generated by `postgres: law regression [local]
VACUUM                                       '.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f3d17e88859 in __GI_abort () at abort.c:79
#2  0x0000557358fbecb8 in ExceptionalCondition
(conditionName=conditionName@entry=0x557359027544 "numretries < 100",
    errorType=errorType@entry=0x55735901800b "FailedAssertion",
fileName=0x7ffef5bbfa70 "\231\354\373XsU",
    fileName@entry=0x557359027520 "vacuumlazy.c",
lineNumber=lineNumber@entry=1726) at assert.c:69
#3  0x0000557358ba0bdb in lazy_scan_prune (vacrel=0x55735a508d60,
buf=16184, blkno=270, page=0x7f3d16ef6f80 "",
    vistest=0x5573592dd540 <GlobalVisCatalogRels>,
prunestate=0x7ffef5bc0fb0) at vacuumlazy.c:1823
#4  0x0000557358ba38c5 in lazy_scan_heap (aggressive=false,
params=0x7ffef5bc13a0, vacrel=<optimized out>)
    at vacuumlazy.c:1384
#5  heap_vacuum_rel (rel=0x7f3d183ebec8, params=0x7ffef5bc13a0,
bstrategy=<optimized out>) at vacuumlazy.c:638
#6  0x0000557358cec785 in table_relation_vacuum (bstrategy=<optimized
out>, params=0x7ffef5bc13a0, rel=0x7f3d183ebec8)
    at ../../../src/include/access/tableam.h:1678
#7  vacuum_rel (relid=1259, relation=<optimized out>,
params=params@entry=0x7ffef5bc13a0) at vacuum.c:2034
#8  0x0000557358cedf9e in vacuum (relations=0x55735a57bea8,
params=0x7ffef5bc13a0, bstrategy=<optimized out>,
    isTopLevel=<optimized out>) at vacuum.c:475
#9  0x0000557358cee4f5 in ExecVacuum
(pstate=pstate@entry=0x55735a4ab730, vacstmt=vacstmt@entry=0x55735a48a408,
    isTopLevel=isTopLevel@entry=true) at vacuum.c:268
#10 0x0000557358e9de4e in standard_ProcessUtility (pstmt=0x55735a48a758,
    queryString=0x55735a4895e0 "vacuum (verbose, skip_locked,
index_cleanup off) pg_catalog.pg_class;",
    readOnlyTree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL,
params=0x0, queryEnv=0x0, dest=0x55735a48a848,
    qc=0x7ffef5bc1700) at utility.c:858
#11 0x0000557358e9c411 in PortalRunUtility
(portal=portal@entry=0x55735a4ed040, pstmt=pstmt@entry=0x55735a48a758,
    isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false,
dest=dest@entry=0x55735a48a848,
    qc=qc@entry=0x7ffef5bc1700) at pquery.c:1155
#12 0x0000557358e9c54d in PortalRunMulti
(portal=portal@entry=0x55735a4ed040, isTopLevel=isTopLevel@entry=true,
    setHoldSnapshot=setHoldSnapshot@entry=false,
dest=dest@entry=0x55735a48a848, altdest=altdest@entry=0x55735a48a848,
    qc=qc@entry=0x7ffef5bc1700) at pquery.c:1312
#13 0x0000557358e9cbe9 in PortalRun (portal=portal@entry=0x55735a4ed040,
count=count@entry=9223372036854775807,
    isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
dest=dest@entry=0x55735a48a848,
    altdest=altdest@entry=0x55735a48a848, qc=0x7ffef5bc1700) at pquery.c:788
#14 0x0000557358e9895b in exec_simple_query (
    query_string=0x55735a4895e0 "vacuum (verbose, skip_locked,
index_cleanup off) pg_catalog.pg_class;")
    at postgres.c:1214
#15 0x0000557358e9a561 in PostgresMain (argc=argc@entry=1,
argv=argv@entry=0x7ffef5bc1b70, dbname=<optimized out>,
    username=<optimized out>) at postgres.c:4486
#16 0x0000557358e064dd in BackendRun (port=0x55735a4aeb00,
port=0x55735a4aeb00) at postmaster.c:4506
#17 BackendStartup (port=0x55735a4aeb00) at postmaster.c:4228
#18 ServerLoop () at postmaster.c:1745
#19 0x0000557358e07481 in PostmasterMain (argc=<optimized out>,
argv=<optimized out>) at postmaster.c:1417
#20 0x0000557358b352e2 in main (argc=3, argv=0x55735a4838a0) at main.c:209

Sometimes the server crashes in another way, as described in bug #17255.

And I've also seen the following crash (maybe it's yet another bug, but
I can't explore it now):
Core was generated by `postgres: law regression [local]
EXPLAIN                                   '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  RelationBuildPartitionDesc (omit_detached=true, rel=0x7f09b87a27e8)
at partdesc.c:230
230                             datum = heap_getattr(tuple,
Anum_pg_class_relpartbound,
(gdb) bt
#0  RelationBuildPartitionDesc (omit_detached=true, rel=0x7f09b87a27e8)
at partdesc.c:230
#1  RelationGetPartitionDesc (rel=0x7f09b87a27e8,
omit_detached=<optimized out>) at partdesc.c:110
#2  0x000055ec12169032 in PartitionDirectoryLookup (pdir=0x55ec12ee7660,
rel=rel@entry=0x7f09b87a27e8)
    at partdesc.c:425
#3  0x000055ec12158b35 in set_relation_partition_info
(relation=0x7f09b87a27e8, rel=0x55ec12f337c0,
    root=0x55ec12effae0) at plancat.c:2197
#4  get_relation_info (root=<optimized out>, relationObjectId=<optimized
out>, inhparent=<optimized out>,
    rel=<optimized out>) at plancat.c:472
#5  0x000055ec1215d26f in build_simple_rel
(root=root@entry=0x55ec12effae0, relid=5,
    parent=parent@entry=0x55ec12ee7330) at relnode.c:307
#6  0x000055ec1214f495 in expand_partitioned_rtentry
(root=0x55ec12effae0, relinfo=0x55ec12ee7330,
    parentrte=0x55ec12e006d0, parentRTindex=1, parentrel=0x7f09b87a0630,
top_parentrc=0x0, lockmode=1) at inherit.c:398
#7  0x000055ec1214f67e in expand_inherited_rtentry
(root=root@entry=0x55ec12effae0, rel=0x55ec12ee7330,
    rte=0x55ec12e006d0, rti=rti@entry=1) at inherit.c:143
#8  0x000055ec1212eb8d in add_other_rels_to_query
(root=root@entry=0x55ec12effae0) at initsplan.c:163
#9  0x000055ec12131e3f in query_planner (root=root@entry=0x55ec12effae0,
    qp_callback=qp_callback@entry=0x55ec12132b20 <standard_qp_callback>,
qp_extra=qp_extra@entry=0x7ffcffe4e890)
    at planmain.c:264
#10 0x000055ec121371b7 in grouping_planner (root=<optimized out>,
tuple_fraction=<optimized out>) at planner.c:1442
#11 0x000055ec12139d0e in subquery_planner
(glob=glob@entry=0x55ec12ef3eb8, parse=parse@entry=0x55ec12e005b8,
    parent_root=parent_root@entry=0x0,
hasRecursion=hasRecursion@entry=false,
tuple_fraction=tuple_fraction@entry=0)
    at planner.c:1019
#12 0x000055ec1213a383 in standard_planner (parse=0x55ec12e005b8,
query_string=<optimized out>, cursorOptions=2048,
    boundParams=<optimized out>) at planner.c:400
#13 0x000055ec1220d7f8 in pg_plan_query (querytree=0x55ec12e005b8,
querytree@entry=0x7ffcffe4ead0,
    query_string=query_string@entry=0x55ec12dff520 "explain (costs off)
select * from range_list_parted;",
    cursorOptions=cursorOptions@entry=2048,
boundParams=boundParams@entry=0x0) at postgres.c:847
#14 0x000055ec12025fef in ExplainOneQuery (query=0x7ffcffe4ead0,
cursorOptions=2048, into=0x0, es=0x55ec12ef3bd0,
    queryString=0x55ec12dff520 "explain (costs off) select * from
range_list_parted;", params=0x0, queryEnv=0x0)
    at explain.c:397
#15 0x000055ec1202679f in ExplainQuery (pstate=0x55ec12ef3a70,
stmt=0x55ec12e003d8, params=0x0, dest=0x55ec12ef39d8)
    at ../../../src/include/nodes/nodes.h:604
#16 0x000055ec122132c9 in standard_ProcessUtility (pstmt=0x55ec12e01058,
    queryString=0x55ec12dff520 "explain (costs off) select * from
range_list_parted;", readOnlyTree=<optimized out>,
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x55ec12ef39d8, qc=0x7ffcffe4ed60)
    at utility.c:862
#17 0x000055ec122116cf in PortalRunUtility
(portal=portal@entry=0x55ec12e63060, pstmt=0x55ec12e01058,
    isTopLevel=<optimized out>,
setHoldSnapshot=setHoldSnapshot@entry=true, dest=dest@entry=0x55ec12ef39d8,
    qc=qc@entry=0x7ffcffe4ed60) at pquery.c:1155
#18 0x000055ec12211bb0 in FillPortalStore (portal=0x55ec12e63060,
isTopLevel=<optimized out>)
    at ../../../src/include/nodes/nodes.h:604
#19 0x000055ec12211ebd in PortalRun (portal=portal@entry=0x55ec12e63060,
count=count@entry=9223372036854775807,
    isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
dest=dest@entry=0x55ec12e01148,
    altdest=altdest@entry=0x55ec12e01148, qc=0x7ffcffe4ef60) at pquery.c:760
#20 0x000055ec1220dcf5 in exec_simple_query (
    query_string=0x55ec12dff520 "explain (costs off) select * from
range_list_parted;") at postgres.c:1214
#21 0x000055ec1220f886 in PostgresMain (argc=argc@entry=1,
argv=argv@entry=0x7ffcffe4f4b0, dbname=<optimized out>,
    username=<optimized out>) at postgres.c:4486
#22 0x000055ec1218071a in BackendRun (port=<optimized out>,
port=<optimized out>) at postmaster.c:4506
#23 BackendStartup (port=<optimized out>) at postmaster.c:4228
#24 ServerLoop () at postmaster.c:1745
#25 0x000055ec12181682 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x55ec12df9ce0) at postmaster.c:1417
#26 0x000055ec11eca3a0 in main (argc=3, argv=0x55ec12df9ce0) at main.c:209

Best regards,
Alexander

Attachment

Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Peter Geoghegan
Date:
On Fri, Oct 29, 2021 at 6:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> I can propose the debugging patch to reproduce the issue that replaces
> the hang with the assert and modifies a pair of crash-causing test
> scripts to simplify the reproducing. (Sorry, I have no time now to prune
> down the scripts further as I have to leave for a week.)

This bug is similar to the one fixed in commit d9d8aa9b. And so I
wonder if code like GlobalVisTestFor() is missing something that it
needs for partitioned tables.


-- 
Peter Geoghegan



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Matthias van de Meent
Date:
On Fri, 29 Oct 2021 at 20:17, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Oct 29, 2021 at 6:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > I can propose the debugging patch to reproduce the issue that replaces
> > the hang with the assert and modifies a pair of crash-causing test
> > scripts to simplify the reproducing. (Sorry, I have no time now to prune
> > down the scripts further as I have to leave for a week.)
>
> This bug is similar to the one fixed in commit d9d8aa9b. And so I
> wonder if code like GlobalVisTestFor() is missing something that it
> needs for partitioned tables.

Without `autovacuum = off; fsync = off` I could not replicate the
issue in the configured 10m time window; with those options I did get
the reported trace in minutes.

I think that I also have found the culprit, which is something we
talked about in [0]: GlobalVisState->maybe_needed was not guaranteed
to never move backwards when recalculated, and because vacuum can
update its snapshot bounds (heap_prune_satisfies_vacuum ->
GlobalVisTestIsRemovableFullXid -> GlobalVisUpdate) this maybe_needed
could move backwards, resulting in the observed behaviour.

It was my understanding based on the mail conversation that Andres
would fix this observed issue too while fixing [0] (whose fix was
included with beta 2), but apparently I was wrong; I can't find the
code for 'maybe_needed'-won't-move-backwards-in-a-backend.

I (again) propose the attached patch, which ensures that this
maybe_needed field will not move backwards for a backend. It is
based on 14, but should be applied on head as well, because it's
lacking there as well.
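
To give an idea of the intent without opening the attachment, the shape
of the change is roughly the following (an illustrative sketch only, not
the patch itself; the struct and function here are stand-ins for the
real GlobalVisState bookkeeping in procarray.c):

/*
 * Illustrative stand-in for the per-backend horizon state kept in
 * procarray.c (the real struct is GlobalVisState); XIDs preceding
 * maybe_needed are considered removable.
 */
typedef struct VisHorizons
{
	FullTransactionId maybe_needed;
	FullTransactionId definitely_needed;
} VisHorizons;

/*
 * Accept a freshly recomputed maybe_needed only if it does not move
 * backwards, so the pruning horizon never retreats within one backend.
 */
static void
clamp_maybe_needed(VisHorizons *vis, FullTransactionId recomputed)
{
	if (FullTransactionIdFollows(recomputed, vis->maybe_needed))
		vis->maybe_needed = recomputed;
}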

Another alternative would be to replace the use of vacrel->OldestXmin
with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe
that is not legal in how vacuum works (we cannot unilaterally decide
that we want to retain tuples < OldestXmin).

Note: After fixing the issue with retreating maybe_needed I also hit
your segfault, and I'm still trying to find out what the source of
that issue might be. I do think it is an issue separate from stuck
vacuum, though.


Kind regards,

Matthias van de Meent

[0]
https://www.postgresql.org/message-id/flat/20210609184506.rqm5rikoikm47csf%40alap3.anarazel.de#e9d55b5cfff34238a24dc85c8c75a46f

Attachment

Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Matthias van de Meent
Date:
On Mon, 1 Nov 2021 at 16:15, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> On Fri, 29 Oct 2021 at 20:17, Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Fri, Oct 29, 2021 at 6:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > > I can propose the debugging patch to reproduce the issue that replaces
> > > the hang with the assert and modifies a pair of crash-causing test
> > > scripts to simplify the reproducing. (Sorry, I have no time now to prune
> > > down the scripts further as I have to leave for a week.)
> >
> > This bug is similar to the one fixed in commit d9d8aa9b. And so I
> > wonder if code like GlobalVisTestFor() is missing something that it
> > needs for partitioned tables.
>
> Without `autovacuum = off; fsync = off` I could not replicate the
> issue in the configured 10m time window; with those options I did get
> the reported trace in minutes.
>
> I think that I also have found the culprit, which is something we
> talked about in [0]: GlobalVisState->maybe_needed was not guaranteed
> to never move backwards when recalculated, and because vacuum can
> update its snapshot bounds (heap_prune_satisfies_vacuum ->
> GlobalVisTestIsRemovableFullXid -> GlobalVisUpdate) this maybe_needed
> could move backwards, resulting in the observed behaviour.
>
> It was my understanding based on the mail conversation that Andres
> would fix this observed issue too while fixing [0] (whose fix was
> included with beta 2), but apparently I was wrong; I can't find the
> code for 'maybe_needed'-won't-move-backwards-in-a-backend.
>
> I (again) propose the attached patch, which ensures that this
> maybe_needed field will not move backwards for a backend. It is
> based on 14, but should be applied on head as well, because it's
> lacking there as well.
>
> Another alternative would be to replace the use of vacrel->OldestXmin
> with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe
> that is not legal in how vacuum works (we cannot unilaterally decide
> that we want to retain tuples < OldestXmin).
>
> Note: After fixing the issue with retreating maybe_needed I also hit
> your segfault, and I'm still trying to find out what the source of
> that issue might be. I do think it is an issue separate from stuck
> vacuum, though.

After further debugging, I think these both might be caused by the
same issue, due to xmin horizon confusion as a result of restored
snapshots:

I seem to repeatedly get backends of which the xmin is set from
InvalidTransactionId to some value < min(ProcGlobal->xids), which then
result in shared_oldest_nonremovable (and others) being less than the
value of their previous iteration. This leads to the infinite loop in
lazy_scan_prune (it stores and uses one value of
*_oldest_nonremovable, whereas heap_page_prune uses a more up-to-date
variant). Ergo, this issue is not really solved by my previous patch,
because apparently at this point we have snapshots with an xmin that is
only registered in the backend's procarray entry when the xmin is
already out of scope, which makes it generally impossible to determine
what tuples may or may not yet be vacuumed.
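
Schematically, the disagreement inside lazy_scan_prune looks like this
(a heavily simplified sketch, assumed to sit in vacuumlazy.c; here
"oldest_xmin" is the value VACUUM captured at its start and "vistest"
the GlobalVisState that pruning refreshes on the fly):

static bool
horizons_disagree(HeapTuple tuple, Buffer buf,
				  TransactionId oldest_xmin, GlobalVisState *vistest)
{
	TransactionId xmax = HeapTupleHeaderGetUpdateXid(tuple->t_data);

	/* DEAD according to the OldestXmin VACUUM captured at its start ... */
	bool		dead_by_oldest_xmin =
		HeapTupleSatisfiesVacuum(tuple, oldest_xmin, buf) == HEAPTUPLE_DEAD;

	/* ... but not removable according to the retreated GlobalVisState */
	bool		removable_by_vistest =
		GlobalVisTestIsRemovableXid(vistest, xmax);

	/* whenever this is true, lazy_scan_prune() keeps hitting "goto retry" */
	return dead_by_oldest_xmin && !removable_by_vistest;
}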

I noticed that when this happens, generally a parallel vacuum worker
is involved. I also think that this is intimately related to [0], and
how snapshots are restored in parallel workers: A vacuum worker is
generally ignored, but if its snapshot has the oldest xmin available,
then a parallel worker launched from that vacuum worker will move the
visible xmin backwards. Same for concurrent index creation jobs.

Kind regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/flat/202110191807.5svc3kmm32tl%40alvherre.pgsql



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Peter Geoghegan
Date:
On Wed, Nov 3, 2021 at 8:46 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> I seem to repeatedly get backends of which the xmin is set from
> InvalidTransactionId to some value < min(ProcGlobal->xids), which then
> result in shared_oldest_nonremovable (and others) being less than the
> value of their previous iteration. This leads to the infinite loop in
> lazy_scan_prune (it stores and uses one value of
> *_oldest_nonremovable, whereas heap_page_prune uses a more up-to-date
> variant).

> I noticed that when this happens, generally a parallel vacuum worker
> is involved.

Hmm. That is plausible. The way that VACUUM (and concurrent index
builds) avoid being seen via the PROC_IN_VACUUM thing is pretty
delicate. Wouldn't surprise me if the parallel VACUUM issue subtly
broke lazy_scan_prune in the way that we see here.
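
Roughly speaking, the hazard is something like this (a toy model only,
not the real ComputeXidHorizons() code, which distinguishes several
horizons and also knows about PROC_IN_SAFE_IC):

static TransactionId
toy_oldest_nonremovable(PGPROC **procs, int nprocs)
{
	TransactionId result = InvalidTransactionId;

	for (int i = 0; i < nprocs; i++)
	{
		PGPROC	   *proc = procs[i];

		/* the VACUUM leader is skipped because of its status flag ... */
		if (proc->statusFlags & PROC_IN_VACUUM)
			continue;

		if (!TransactionIdIsValid(proc->xmin))
			continue;

		/*
		 * ... but a parallel worker that restored the leader's snapshot
		 * without inheriting the flag re-exposes the old xmin here, so
		 * the computed horizon can come out lower than last time.
		 */
		if (!TransactionIdIsValid(result) ||
			TransactionIdPrecedes(proc->xmin, result))
			result = proc->xmin;
	}

	return result;
}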

What about testing? Can we find a simple way of reducing this
complicated repro to a less complicated repro with a failing
assertion? Maybe an assertion that we get to keep after the bug is
fixed?

-- 
Peter Geoghegan



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Matthias van de Meent
Date:
On Wed, 3 Nov 2021 at 17:21, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Nov 3, 2021 at 8:46 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > I seem to repeatedly get backends of which the xmin is set from
> > InvalidTransactionId to some value < min(ProcGlobal->xids), which then
> > result in shared_oldest_nonremovable (and others) being less than the
> > value of their previous iteration. This leads to the infinite loop in
> > lazy_scan_prune (it stores and uses one value of
> > *_oldest_nonremovable, whereas heap_page_prune uses a more up-to-date
> > variant).
>
> > I noticed that when this happens, generally a parallel vacuum worker
> > is involved.
>
> Hmm. That is plausible. The way that VACUUM (and concurrent index
> builds) avoid being seen via the PROC_IN_VACUUM thing is pretty
> delicate. Wouldn't surprise me if the parallel VACUUM issue subtly
> broke lazy_scan_prune in the way that we see here.
>
> What about testing? Can we find a simple way of reducing this
> complicated repro to a less complicated repro with a failing
> assertion? Maybe an assertion that we get to keep after the bug is
> fixed?

I added the attached instrumentation for checking xmin validity, which
asserts what I believe are correct claims about the proc
infrastructure:

- It is always safe to set ->xmin to InvalidTransactionId: This
removes any claim that we have a snapshot anyone should worry about.
- If we have a valid ->xmin set, it is always safe to increase its value.
- Otherwise, the xmin must not lower the overall xmin of the database
it is connected to, plus some potential conditions for status flags.
  It also may not be set without first taking the ProcArrayLock:
without synchronised access to the proc array, you cannot guarantee
you can set your xmin to a globally correct value.
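
In code form, the checks amount to roughly the following (a sketch of
the idea, not the attached patch; "database_xmin" stands for whatever
value the overall database horizon is computed from):

static void
AssertXminUpdateIsSane(TransactionId old_xmin, TransactionId new_xmin,
					   TransactionId database_xmin)
{
	/* dropping our snapshot claim entirely is always fine */
	if (!TransactionIdIsValid(new_xmin))
		return;

	/* an already-published xmin may only move forwards */
	if (TransactionIdIsValid(old_xmin))
	{
		Assert(TransactionIdFollowsOrEquals(new_xmin, old_xmin));
		return;
	}

	/*
	 * Publishing a fresh xmin: we must hold ProcArrayLock, and we must
	 * not drag the database-wide horizon backwards.
	 */
	Assert(LWLockHeldByMe(ProcArrayLock));
	Assert(TransactionIdFollowsOrEquals(new_xmin, database_xmin));
}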

It worked well with the bgworker flags patch [0], until I added this
instrumentation to SnapshotResetXmin and ran the regression tests: I
stumbled upon the following issue with aborting transactions, and I
don't know what the correct solution is to solve it:

AbortTransaction (see xact.c) calls ProcArrayEndTransaction, which can
reset MyProc->xmin to InvalidTransactionId (both directly and through
ProcArrayEndTransactionInternal). So far, this is safe.

However, later in AbortTransaction we call ResourceOwnerRelease(...,
RESOURCE_RELEASE_AFTER_LOCKS...), which will clean up the snapshots
stored in its owner->snapshotarr array using UnregisterSnapshot.
Then, if UnregisterSnapshot determines that a snapshot is now not
referenced anymore, and that snapshot has no active count, then it
will call SnapshotResetXmin().
Finally, when SnapshotResetXmin() is called, the oldest still
registered snapshot in RegisteredSnapshots will be pulled and
MyProc->xmin will be set to that snapshot's xmin.

Similarly, in AbortTransaction we call AtEOXact_Inval, which calls
ProcessInvalidationMessages -> LocalExecuteInvalidationMessage ->
InvalidateCatalogSnapshot -> SnapshotResetXmin, also setting
MyProc->xmin back to a non-InvalidXid value.

Note that from a third-party observer's standpoint we've just moved
our horizons backwards, and the regression tests (correctly) fail when
assertions are enabled.

I don't know what the expected behaviour is, but I do know that this
violates the expected invariant that xmin never goes backwards (at the
cluster, database, or data level).

Kind regards,

Matthias van de Meent


[0] https://www.postgresql.org/message-id/CAD21AoDkERUJkGEuQRiyGKmVRt2duU378UgnwBpqXQjA%2BEY3Lg%40mail.gmail.com

Attachment

Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Peter Geoghegan
Date:
On Fri, Nov 5, 2021 at 4:43 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> I added the attached instrumentation for checking xmin validity, which
> asserts what I believe are correct claims about the proc
> infrastructure:

This test case involves partitioning, but also pruning, which is very
particular about heap tuple headers being a certain way following
updates. I wonder if we're missing a
HeapTupleHeaderIndicatesMovedPartitions() test somewhere. Could be in
heapam/VACUUM/pruning code, or could be somewhere else.

Take a look at commit f16241bef7 to get some idea of what I mean.

-- 
Peter Geoghegan



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Matthias van de Meent
Date:
On Fri, 5 Nov 2021 at 22:25, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Nov 5, 2021 at 4:43 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > I added the attached instrumentation for checking xmin validity, which
> > asserts what I believe are correct claims about the proc
> > infrastructure:
>
> This test case involves partitioning, but also pruning, which is very
> particular about heap tuple headers being a certain way following
> updates. I wonder if we're missing a
> HeapTupleHeaderIndicatesMovedPartitions() test somewhere. Could be in
> heapam/VACUUM/pruning code, or could be somewhere else.

If you watch closely, the second backtrace in [0] (the segfault)
originates from the code that builds the partition bounds based on
relcaches / catalog tables, which are never partitioned. Although it
is indeed in the partition infrastructure, if we'd have a tuple with
HeapTupleHeaderIndicatesMovedPartitions() at that point, then that'd
be a bug (we do not partition catalogs).

But I hit this same segfault earlier while testing, and I deduced that
the problem at that point was that an index entry could not be resolved
to a heap tuple (or the scan at partdesc.c:227 otherwise returned NULL
where one result was expected); so the tuple is NULL at partdesc.c:230,
and heap_getattr subsequently segfaults when it dereferences the null
tuple pointer to access its fields.
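
For illustration only, a defensive check of this kind just before the
heap_getattr() call would turn the crash into a reportable error (the
variable name is illustrative, not the actual RelationBuildPartitionDesc()
code, and this is not a fix for the underlying problem):

	if (!HeapTupleIsValid(tuple))
		elog(ERROR, "could not find pg_class tuple for partition %u",
			 inhrelid);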

Due to the blatant visibility horizon confusion, the failing scan
being on the pg_class table, and the test case including aggressive
manual vacuuming of the pg_class table, I assume that the error was
caused by vacuum having removed tuples from pg_class, while other
backends still required / expected access to these tuples.

Kind regards,

Matthias

[0] https://www.postgresql.org/message-id/d5d5af5d-ba46-aff3-9f91-776c70246cc3%40gmail.com



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Andres Freund
Date:
On 2021-11-05 12:43:00 +0100, Matthias van de Meent wrote:
> I added the attached instrumentation for checking xmin validity, which
> asserts what I believe are correct claims about the proc
> infrastructure:
> 
> - It is always safe to set ->xmin to InvalidTransactionId: This
> removes any claim that we have a snapshot anyone should worry about.
> - If we have a valid ->xmin set, it is always safe to increase its value.

I think I know what you mean, but of course you cannot just increase xmin if
there are existing snapshots requiring that xmin.


> - Otherwise, the xmin must not lower the overall xmin of the database
> it is connected to, plus some potential conditions for status flags.

walsenders can end up doing this IIRC.


>   It also may not be set without first taking the ProcArrayLock:
> without synchronised access to the proc array, you cannot guarantee
> you can set your xmin to a globally correct value.

There's possibly one exception around this, which is snapshot import. But
that's rare enough that an unnecessary acquisition is fine.


> It worked well with the bgworker flags patch [0], until I added this
> instrumentation to SnapshotResetXmin and ran the regression tests: I
> stumbled upon the following issue with aborting transactions, and I
> don't know what the correct solution is to solve it:
> 
> AbortTransaction (see xact.c) calls ProcArrayEndTransaction, which can
> reset MyProc->xmin to InvalidTransactionId (both directly and through
> ProcArrayEndTransactionInternal). So far, this is safe.
> 
> However, later in AbortTransaction we call ResourceOwnerRelease(...,
> RESOURCE_RELEASE_AFTER_LOCKS...), which will clean up the snapshots
> stored in its owner->snapshotarr array using UnregisterSnapshot.
> Then, if UnregisterSnapshot determines that a snapshot is now not
> referenced anymore, and that snapshot has no active count, then it
> will call SnapshotResetXmin().
> Finally, when SnapshotResetXmin() is called, the oldest still
> registered snapshot in RegisteredSnapshots will be pulled and
> MyProc->xmin will be set to that snapshot's xmin.

Yea, that's not great. This is a pretty old behaviour, IIRC?


> We have an unwritten rule that a backend's xmin may not go back, but
> this is not enforced.

I don't think we can make any of this hard assertions. There's e.g. the
following comment:

 * Note: despite the above, it's possible for the calculated values to move
 * backwards on repeated calls. The calculated values are conservative, so
 * that anything older is definitely not considered as running by anyone
 * anymore, but the exact values calculated depend on a number of things. For
 * example, if there are no transactions running in the current database, the
 * horizon for normal tables will be latestCompletedXid. If a transaction
 * begins after that, its xmin will include in-progress transactions in other
 * databases that started earlier, so another call will return a lower value.
 * Nonetheless it is safe to vacuum a table in the current database with the
 * first result.  There are also replication-related effects: a walsender
 * process can set its xmin based on transactions that are no longer running
 * on the primary but are still being replayed on the standby, thus possibly
 * making the values go backwards.  In this case there is a possibility that
 * we lose data that the standby would like to have, but unless the standby
 * uses a replication slot to make its xmin persistent there is little we can
 * do about that --- data is only protected if the walsender runs continuously
 * while queries are executed on the standby.  (The Hot Standby code deals
 * with such cases by failing standby queries that needed to access
 * already-removed data, so there's no integrity bug.)  The computed values
 * are also adjusted with vacuum_defer_cleanup_age, so increasing that setting
 * on the fly is another easy way to make horizons move backwards, with no
 * consequences for data integrity.

Greetings,

Andres Freund



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Peter Geoghegan
Date:
On Fri, Oct 29, 2021 at 6:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> I can propose the debugging patch to reproduce the issue that replaces
> the hang with the assert and modifies a pair of crash-causing test
> scripts to simplify the reproducing. (Sorry, I have no time now to prune
> down the scripts further as I have to leave for a week.)
>
> The reproducing script is:

I cannot reproduce this bug by following your steps, even when the
assertion is made to fail after only 5 retries (5 is still ludicrously
excessive, 100 might be overkill). And even when I don't use a debug
build (and make the assertion into an equivalent PANIC). I wonder why
that is. I didn't have much trouble following your similar repro for
bug #17255.

My immediate goal in trying to follow your reproducer was to determine
what effect (if any) the pending bugfix for #17255 [1] has on this
bug. It seems more than possible that this bug is in fact a different
manifestation of the same underlying problem we see in #17255. And so
that should be the next thing we check here.

[1] https://postgr.es/m/CAH2-WzkpG9KLQF5sYHaOO_dSVdOjM+dv=nTEn85oNfMUTk836Q@mail.gmail.com
--
Peter Geoghegan



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Noah Misch
Date:
On Mon, Nov 01, 2021 at 04:15:27PM +0100, Matthias van de Meent wrote:
> Another alternative would be to replace the use of vacrel->OldestXmin
> with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe

v17 commit 1ccc1e05ae essentially did that.

> that is not legal in how vacuum works (we cannot unilaterally decide
> that we want to retain tuples < OldestXmin).

Do you think commit 1ccc1e05ae creates problems in that respect?  It does have
the effect of retaining tuples for which GlobalVisState rules "retain" but
HeapTupleSatisfiesVacuum() would have ruled "delete".  If that doesn't create
problems, then back-patching commit 1ccc1e05ae could be a fix for remaining
infinite-retries scenarios, if any.

On Wed, Nov 10, 2021 at 12:28:43PM -0800, Peter Geoghegan wrote:
> On Fri, Oct 29, 2021 at 6:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > I can propose the debugging patch to reproduce the issue that replaces
> > the hang with the assert and modifies a pair of crash-causing test
> > scripts to simplify the reproducing. (Sorry, I have no time now to prune
> > down the scripts further as I have to leave for a week.)
> >
> > The reproducing script is:
> 
> I cannot reproduce this bug by following your steps, even when the
> assertion is made to fail after only 5 retries (5 is still ludicrously
> excessive, 100 might be overkill). And even when I don't use a debug
> build (and make the assertion into an equivalent PANIC). I wonder why
> that is. I didn't have much trouble following your similar repro for
> bug #17255.

For what it's worth, I needed "-X" on the script's psql command lines to keep
my ~/.psqlrc from harming things.  I also wondered if the regression database
needed to be populated with a "make installcheck" run.  The script had a
"createdb regression" without a "make installcheck", so I assumed an empty
regression database was intended.

> My immediate goal in trying to follow your reproducer was to determine
> what effect (if any) the pending bugfix for #17255 [1] has on this
> bug. It seems more than possible that this bug is in fact a different
> manifestation of the same underlying problem we see in #17255. And so
> that should be the next thing we check here.
> 
> [1] https://postgr.es/m/CAH2-WzkpG9KLQF5sYHaOO_dSVdOjM+dv=nTEn85oNfMUTk836Q@mail.gmail.com

Using the https://postgr.es/m/d5d5af5d-ba46-aff3-9f91-776c70246cc3@gmail.com
procedure, I see these results:

- A commit from the day of that email, 2021-10-29, (5ccceb2946) reproduced the
  "numretries" assertion failure in each of five 10m runs.

- At the 2022-01-13 commit (18b87b201f^) just before the fix for #17255, the
  script instead gets: FailedAssertion("HeapTupleHeaderIsHeapOnly(htup)",
  File: "pruneheap.c", Line: 964.  That happened once in two 10m runs, so it
  was harder to reach than the numretries failure.

- At 18b87b201f, a 1440m script run got no failures.

I've seen symptoms that suggest the infinite-numretries bug remains reachable,
but I don't know how to reproduce them.  (Given the upthread notes about xmin
going backward during end-of-xact, I'd first try some pauses there.)  If it
does remain reachable, likely some other code change between 2021-10 and
2022-01 made this particular test script no longer reach it.

Thanks,
nm



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Peter Geoghegan
Date:
On Sun, Dec 24, 2023 at 6:44 PM Noah Misch <noah@leadboat.com> wrote:
> On Mon, Nov 01, 2021 at 04:15:27PM +0100, Matthias van de Meent wrote:
> > Another alternative would be to replace the use of vacrel->OldestXmin
> > with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe
>
> v17 commit 1ccc1e05ae essentially did that.

Obviously, 1ccc1e05ae would fix the immediate problem of infinite
retries, since it just rips out the loop.

> > that is not legal in how vacuum works (we cannot unilaterally decide
> > that we want to retain tuples < OldestXmin).
>
> Do you think commit 1ccc1e05ae creates problems in that respect?  It does have
> the effect of retaining tuples for which GlobalVisState rules "retain" but
> HeapTupleSatisfiesVacuum() would have ruled "delete".  If that doesn't create
> problems, then back-patching commit 1ccc1e05ae could be a fix for remaining
> infinite-retries scenarios, if any.

My guess is that there is a decent chance that backpatching 1ccc1e05ae
would be okay, but that isn't much use. I really don't know either way
right now. And I wouldn't like to speculate too much further before
gaining a proper understanding of what's going on here. Seems to be
specific to partitioning with cross-partition updates.

> Using the https://postgr.es/m/d5d5af5d-ba46-aff3-9f91-776c70246cc3@gmail.com
> procedure, I see these results:
>
> - A commit from the day of that email, 2021-10-29, (5ccceb2946) reproduced the
>   "numretries" assertion failure in each of five 10m runs.
>
> - At the 2022-01-13 commit (18b87b201f^) just before the fix for #17255, the
>   script instead gets: FailedAssertion("HeapTupleHeaderIsHeapOnly(htup)",
>   File: "pruneheap.c", Line: 964.  That happened once in two 10m runs, so it
>   was harder to reach than the numretries failure.
>
> - At 18b87b201f, a 1440m script run got no failures.
>
> I've seen symptoms that suggest the infinite-numretries bug remains reachable,
> but I don't know how to reproduce them.  (Given the upthread notes about xmin
> going backward during end-of-xact, I'd first try some pauses there.)  If it
> does remain reachable, likely some other code change between 2021-10 and
> 2022-01 made this particular test script no longer reach it.

I am aware of a production database that appears to run into the same
problem. Inserts and concurrent cross-partition updates are used
heavily on this instance (the table in question uses partitioning).
Perhaps you happened upon a similar problematic production database,
and found this thread when researching the issue? Maybe we've both
seen the same problem in the wild?

I have seen VACUUM get stuck like this on multiple versions, all
associated with the same application code/partitioning
scheme/workload. This includes a 15.4 instance, and various 14.* point
release instances. It seems likely that there is a bug here, and that
it affects Postgres 14, 15, and 16.

--
Peter Geoghegan



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Noah Misch
Date:
On Sun, Dec 31, 2023 at 03:53:34PM -0800, Peter Geoghegan wrote:
> On Sun, Dec 24, 2023 at 6:44 PM Noah Misch <noah@leadboat.com> wrote:
> > On Mon, Nov 01, 2021 at 04:15:27PM +0100, Matthias van de Meent wrote:
> > > Another alternative would be to replace the use of vacrel->OldestXmin
> > > with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe
> >
> > v17 commit 1ccc1e05ae essentially did that.
> 
> Obviously, 1ccc1e05ae would fix the immediate problem of infinite
> retries, since it just rips out the loop.

Yep.

> > > that is not legal in how vacuum works (we cannot unilaterally decide
> > > that we want to retain tuples < OldestXmin).
> >
> > Do you think commit 1ccc1e05ae creates problems in that respect?  It does have
> > the effect of retaining tuples for which GlobalVisState rules "retain" but
> > HeapTupleSatisfiesVacuum() would have ruled "delete".  If that doesn't create
> > problems, then back-patching commit 1ccc1e05ae could be a fix for remaining
> > infinite-retries scenarios, if any.
> 
> My guess is that there is a decent chance that backpatching 1ccc1e05ae
> would be okay, but that isn't much use. I really don't know either way
> right now. And I wouldn't like to speculate too much further before
> gaining a proper understanding of what's going on here.

Fair enough.  While I agree there's a decent chance back-patching would be
okay, I think there's also a decent chance that 1ccc1e05ae creates the problem
Matthias theorized.  Something like: we update relfrozenxid based on
OldestXmin, even though GlobalVisState caused us to retain a tuple older than
OldestXmin.  Then relfrozenxid disagrees with table contents.

> Seems to be
> specific to partitioning with cross-partition updates.
> 
> > Using the https://postgr.es/m/d5d5af5d-ba46-aff3-9f91-776c70246cc3@gmail.com
> > procedure, I see these results:
> >
> > - A commit from the day of that email, 2021-10-29, (5ccceb2946) reproduced the
> >   "numretries" assertion failure in each of five 10m runs.
> >
> > - At the 2022-01-13 commit (18b87b201f^) just before the fix for #17255, the
> >   script instead gets: FailedAssertion("HeapTupleHeaderIsHeapOnly(htup)",
> >   File: "pruneheap.c", Line: 964.  That happened once in two 10m runs, so it
> >   was harder to reach than the numretries failure.
> >
> > - At 18b87b201f, a 1440m script run got no failures.
> >
> > I've seen symptoms that suggest the infinite-numretries bug remains reachable,
> > but I don't know how to reproduce them.  (Given the upthread notes about xmin
> > going backward during end-of-xact, I'd first try some pauses there.)  If it
> > does remain reachable, likely some other code change between 2021-10 and
> > 2022-01 made this particular test script no longer reach it.
> 
> I am aware of a production database that appears to run into the same
> problem. Inserts and concurrent cross-partition updates are used
> heavily on this instance (the table in question uses partitioning).
> Perhaps you happened upon a similar problematic production database,
> and found this thread when researching the issue? Maybe we've both
> seen the same problem in the wild?

I did find this thread while researching the symptoms I was seeing.  No
partitioning where I saw them.

> I have seen VACUUM get stuck like this on multiple versions, all
> associated with the same application code/partitioning
> scheme/workload. This includes a 15.4 instance, and various 14.* point
> release instances. It seems likely that there is a bug here, and that
> it affects Postgres 14, 15, and 16.

Agreed.



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Peter Geoghegan
Date:
On Sat, Jan 6, 2024 at 12:24 PM Noah Misch <noah@leadboat.com> wrote:
> On Sun, Dec 31, 2023 at 03:53:34PM -0800, Peter Geoghegan wrote:
> > My guess is that there is a decent chance that backpatching 1ccc1e05ae
> > would be okay, but that isn't much use. I really don't know either way
> > right now. And I wouldn't like to speculate too much further before
> > gaining a proper understanding of what's going on here.
>
> Fair enough.  While I agree there's a decent chance back-patching would be
> okay, I think there's also a decent chance that 1ccc1e05ae creates the problem
> Matthias theorized.  Something like: we update relfrozenxid based on
> OldestXmin, even though GlobalVisState caused us to retain a tuple older than
> OldestXmin.  Then relfrozenxid disagrees with table contents.

Either every relevant code path has the same OldestXmin to work off
of, or the whole NewRelfrozenXid/relfrozenxid-tracking thing can't be
expected to work as designed. I find it a bit odd that
pruneheap.c/GlobalVisState has no direct understanding of this
dependency (none that I can discern, at least). Wouldn't it at least
be more natural if pruneheap.c could access OldestXmin when run inside
VACUUM? (Could just be used by defensive hardening code.)

We're also relying on vacuumlazy.c's call to vacuum_get_cutoffs()
(which itself calls GetOldestNonRemovableTransactionId) taking place
before vacuumlazy.c goes on to call GlobalVisTestFor() a few lines
further down (I think). It seems like even the code in procarray.c
might have something to say about the vacuumlazy.c dependency, too.
But offhand it doesn't look like it does, either. Why shouldn't we
expect random implementation details in code like ComputeXidHorizons()
to break the assumption/dependency within vacuumlazy.c?

I also worry about the possibility that GlobalVisTestShouldUpdate()
masks problems in this area (as opposed to causing the problems). It
seems very hard to test.

> I did find this thread while researching the symptoms I was seeing.  No
> partitioning where I saw them.

If this was an isolated incident then it could perhaps have been a
symptom of corruption. Though corruption seems highly unlikely to be
involved with the cases that I've seen, since they appear
intermittently, across a variety of different contexts/versions.

--
Peter Geoghegan



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Peter Geoghegan
Date:
On Sat, Jan 6, 2024 at 1:30 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Sat, Jan 6, 2024 at 12:24 PM Noah Misch <noah@leadboat.com> wrote:
> > Fair enough.  While I agree there's a decent chance back-patching would be
> > okay, I think there's also a decent chance that 1ccc1e05ae creates the problem
> > Matthias theorized.  Something like: we update relfrozenxid based on
> > OldestXmin, even though GlobalVisState caused us to retain a tuple older than
> > OldestXmin.  Then relfrozenxid disagrees with table contents.
>
> Either every relevant code path has the same OldestXmin to work off
> of, or the whole NewRelfrozenXid/relfrozenxid-tracking thing can't be
> expected to work as designed. I find it a bit odd that
> pruneheap.c/GlobalVisState has no direct understanding of this
> dependency (none that I can discern, at least).

What do you think of the idea of adding a defensive "can't happen"
error to lazy_scan_prune() that will catch DEAD or RECENTLY_DEAD
tuples with storage that still contain an xmax < OldestXmin? This
probably won't catch every possible problem, but I suspect it'll work
well enough.
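
Something along these lines, perhaps (a sketch of the idea only,
untested; it would be called for any tuple that HeapTupleSatisfiesVacuum()
judged DEAD or RECENTLY_DEAD but that pruning nevertheless left in place):

static void
check_tuple_vs_oldestxmin(HeapTupleHeader htup, TransactionId oldest_xmin)
{
	TransactionId xmax = HeapTupleHeaderGetUpdateXid(htup);

	/*
	 * A dead tuple left behind by pruning should never have an xmax
	 * that already precedes VACUUM's OldestXmin.
	 */
	if (TransactionIdIsNormal(xmax) &&
		TransactionIdPrecedes(xmax, oldest_xmin))
		elog(ERROR,
			 "found dead tuple with xmax %u preceding OldestXmin %u after pruning",
			 xmax, oldest_xmin);
}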

--
Peter Geoghegan



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Noah Misch
Date:
On Sat, Jan 06, 2024 at 01:30:40PM -0800, Peter Geoghegan wrote:
> On Sat, Jan 6, 2024 at 12:24 PM Noah Misch <noah@leadboat.com> wrote:
> > On Sun, Dec 31, 2023 at 03:53:34PM -0800, Peter Geoghegan wrote:
> > > My guess is that there is a decent chance that backpatching 1ccc1e05ae
> > > would be okay, but that isn't much use. I really don't know either way
> > > right now. And I wouldn't like to speculate too much further before
> > > gaining a proper understanding of what's going on here.
> >
> > Fair enough.  While I agree there's a decent chance back-patching would be
> > okay, I think there's also a decent chance that 1ccc1e05ae creates the problem
> > Matthias theorized.  Something like: we update relfrozenxid based on
> > OldestXmin, even though GlobalVisState caused us to retain a tuple older than
> > OldestXmin.  Then relfrozenxid disagrees with table contents.
> 
> Either every relevant code path has the same OldestXmin to work off
> of, or the whole NewRelfrozenXid/relfrozenxid-tracking thing can't be
> expected to work as designed. I find it a bit odd that
> pruneheap.c/GlobalVisState has no direct understanding of this
> dependency (none that I can discern, at least). Wouldn't it at least
> be more natural if pruneheap.c could access OldestXmin when run inside
> VACUUM? (Could just be used by defensive hardening code.)

Tied to that decision is the choice of semantics when the xmin horizon moves
backward during one VACUUM, e.g. when a new walsender xmin does so.  Options:

1. Continue to remove tuples based on the OldestXmin from VACUUM's start.  We
   could have already removed some of those tuples, so the walsender xmin
   won't achieve a guarantee anyway.  (VACUUM would want ratchet-like behavior
   in GlobalVisState, possibly by sharing OldestXmin with pruneheap like you
   say.)

2. Move OldestXmin backward, to reflect the latest xmin horizon.  (Perhaps
   VACUUM would just pass GlobalVisState to a function that returns the
   compatible OldestXmin.)

Which way is better?

> We're also relying on vacuumlazy.c's call to vacuum_get_cutoffs()
> (which itself calls GetOldestNonRemovableTransactionId) taking place
> before vacuumlazy.c goes on to call GlobalVisTestFor() a few lines
> further down (I think). It seems like even the code in procarray.c
> might have something to say about the vacuumlazy.c dependency, too.
> But offhand it doesn't look like it does, either. Why shouldn't we
> expect random implementation details in code like ComputeXidHorizons()
> to break the assumption/dependency within vacuumlazy.c?

Makes sense.

On Sat, Jan 06, 2024 at 01:41:23PM -0800, Peter Geoghegan wrote:
> What do you think of the idea of adding a defensive "can't happen"
> error to lazy_scan_prune() that will catch DEAD or RECENTLY_DEAD
> tuples with storage that still contain an xmax < OldestXmin? This
> probably won't catch every possible problem, but I suspect it'll work
> well enough.

So before the "goto retry", ERROR if the tuple header suggests this mismatch
is happening?  That, or limiting the retry count, could help.



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Peter Geoghegan
Date:
On Sat, Jan 6, 2024 at 5:44 PM Noah Misch <noah@leadboat.com> wrote:
> Tied to that decision is the choice of semantics when the xmin horizon moves
> backward during one VACUUM, e.g. when a new walsender xmin does so.  Options:
>
> 1. Continue to remove tuples based on the OldestXmin from VACUUM's start.  We
>    could have already removed some of those tuples, so the walsender xmin
>    won't achieve a guarantee anyway.  (VACUUM would want ratchet-like behavior
>    in GlobalVisState, possibly by sharing OldestXmin with pruneheap like you
>    say.)
>
> 2. Move OldestXmin backward, to reflect the latest xmin horizon.  (Perhaps
>    VACUUM would just pass GlobalVisState to a function that returns the
>    compatible OldestXmin.)
>
> Which way is better?

I suppose that a hybrid of these two approaches makes the most sense.
A design that's a lot closer to your #1 than to your #2.

Under this scheme, pruneheap.c would be explicitly aware of
OldestXmin, and would promise to respect the exact invariant that we
need to avoid getting stuck in lazy_scan_prune's loop (or avoid
confused NewRelfrozenXid tracking on HEAD, which no longer has this
loop). But it'd be limited to that exact invariant; we'd still avoid
unduly imposing any requirements on pruning-away deleted tuples whose
xmax was >= OldestXmin. lazy_scan_prune/vacuumlazy.c shouldn't care if
we prune away any "extra" heap tuples, just because we can (or just
because it's convenient to the implementation). Andres has in the past
placed a lot of emphasis on being able to update the
GlobalVisState-wise bounds on the fly. Not sure that it's really that
important that VACUUM does that, but there is no reason to not allow
it. So we can keep that property (as well as the aforementioned
high-level OldestXmin immutability property).

More importantly (at least to me), this scheme allows vacuumlazy.c to
continue to treat OldestXmin as an immutable cutoff for both pruning
and freezing -- the high level design doesn't need any revisions. We
already "freeze away" multixact member XIDs >= OldestXmin in certain
rare cases (i.e. we remove lockers that are determined to no longer be
running in FreezeMultiXactId's "second pass" slow path), so there is
nothing fundamentally novel about the idea of removing some extra XIDs
>= OldestXmin in passing, just because it happens to be convenient to
some low-level piece of code that's external to vacuumlazy.c.

What do you think of that general approach? I see no reason why it
matters if OldestXmin goes backwards across two VACUUM operations, so
I haven't tried to avoid that.
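
As a rough sketch of the invariant (not a patch; the function name and
its placement are hypothetical), pruning on VACUUM's behalf would treat
removability as the union of the two judgments:

static bool
vacuum_prune_xid_removable(GlobalVisState *vistest,
						   TransactionId vacuum_oldest_xmin,
						   TransactionId dead_after)
{
	/* anything OldestXmin says is removable must be removable ... */
	if (TransactionIdIsValid(vacuum_oldest_xmin) &&
		TransactionIdPrecedes(dead_after, vacuum_oldest_xmin))
		return true;

	/* ... and extra removals allowed by GlobalVisState remain fine */
	return GlobalVisTestIsRemovableXid(vistest, dead_after);
}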

> On Sat, Jan 06, 2024 at 01:41:23PM -0800, Peter Geoghegan wrote:
> > What do you think of the idea of adding a defensive "can't happen"
> > error to lazy_scan_prune() that will catch DEAD or RECENTLY_DEAD
> > tuples with storage that still contain an xmax < OldestXmin? This
> > probably won't catch every possible problem, but I suspect it'll work
> > well enough.
>
> So before the "goto retry", ERROR if the tuple header suggests this mismatch
> is happening?  That, or limiting the retry count, could help.

When I wrote this code, my understanding was that the sole reason for
needing to loop back was a concurrently-aborted xact. In principle we
ought to be able to test the tuple to detect if it's that exact case
(the only truly valid case), and then throw an error if we somehow got
it wrong. That kind of hardening would at least be correct according
to my original understanding of things.

There is an obvious practical concern with adding such hardening now:
what if the current loop is accidentally protective, in whatever way?
That seems quite possible. I seem to recall that Andres supposed at
some point that the loop's purpose wasn't limited to the
concurrently-aborted-inserter case that I believed was the only
relevant case back when I worked on what became commit 8523492d4e
("Remove tupgone special case from vacuumlazy.c"). I don't have a
reference for that, but I'm pretty sure it was said at some point
around the release of 14.

--
Peter Geoghegan



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Noah Misch
Date:
On Mon, Jan 08, 2024 at 12:02:01PM -0500, Peter Geoghegan wrote:
> On Sat, Jan 6, 2024 at 5:44 PM Noah Misch <noah@leadboat.com> wrote:
> > Tied to that decision is the choice of semantics when the xmin horizon moves
> > backward during one VACUUM, e.g. when a new walsender xmin does so.  Options:
> >
> > 1. Continue to remove tuples based on the OldestXmin from VACUUM's start.  We
> >    could have already removed some of those tuples, so the walsender xmin
> >    won't achieve a guarantee anyway.  (VACUUM would want ratchet-like behavior
> >    in GlobalVisState, possibly by sharing OldestXmin with pruneheap like you
> >    say.)
> >
> > 2. Move OldestXmin backward, to reflect the latest xmin horizon.  (Perhaps
> >    VACUUM would just pass GlobalVisState to a function that returns the
> >    compatible OldestXmin.)
> >
> > Which way is better?
> 
> I suppose that a hybrid of these two approaches makes the most sense.
> A design that's a lot closer to your #1 than to your #2.
> 
> Under this scheme, pruneheap.c would be explicitly aware of
> OldestXmin, and would promise to respect the exact invariant that we
> need to avoid getting stuck in lazy_scan_prune's loop (or avoid
> confused NewRelfrozenXid tracking on HEAD, which no longer has this
> loop). But it'd be limited to that exact invariant; we'd still avoid
> unduly imposing any requirements on pruning-away deleted tuples whose
> xmax was >= OldestXmin. lazy_scan_prune/vacuumlazy.c shouldn't care if
> we prune away any "extra" heap tuples, just because we can (or just
> because it's convenient to the implementation). Andres has in the past
> placed a lot of emphasis on being able to update the
> GlobalVisState-wise bounds on the fly. Not sure that it's really that
> important that VACUUM does that, but there is no reason to not allow
> it. So we can keep that property (as well as the aforementioned
> high-level OldestXmin immutability property).
> 
> More importantly (at least to me), this scheme allows vacuumlazy.c to
> continue to treat OldestXmin as an immutable cutoff for both pruning
> and freezing -- the high level design doesn't need any revisions. We
> already "freeze away" multixact member XIDs >= OldestXmin in certain
> rare cases (i.e. we remove lockers that are determined to no longer be
> running in FreezeMultiXactId's "second pass" slow path), so there is
> nothing fundamentally novel about the idea of removing some extra XIDs
> >= OldestXmin in passing, just because it happens to be convenient to
> some low-level piece of code that's external to vacuumlazy.c.
> 
> What do you think of that general approach?

That all sounds good to me.

> I see no reason why it
> matters if OldestXmin goes backwards across two VACUUM operations, so
> I haven't tried to avoid that.

That may be fully okay, or we may want to clamp OldestXmin to be no older than
relfrozenxid.  I don't feel great about the system moving relfrozenxid
backward unless it observed an older XID, and observing an older XID would be
a corruption signal.  I don't have a specific way non-monotonic relfrozenxid
breaks things, though.



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Peter Geoghegan
Date:
On Mon, Jan 8, 2024 at 1:21 PM Noah Misch <noah@leadboat.com> wrote:
> > I see no reason why it
> > matters if OldestXmin goes backwards across two VACUUM operations, so
> > I haven't tried to avoid that.
>
> That may be fully okay, or we may want to clamp OldestXmin to be no older than
> relfrozenxid.  I don't feel great about the system moving relfrozenxid
> backward unless it observed an older XID, and observing an older XID would be
> a corruption signal.  I don't have a specific way non-monotonic relfrozenxid
> breaks things, though.

We're already prepared for this -- relfrozenxid simply cannot go
backwards, regardless of what vacuumlazy.c thinks. That is,
vac_update_relstats() won't accept a new relfrozenxid that is < its
existing value (unless it's a value "from the future", which is a way
of recovering after historical pg_upgrade-related corruption bugs).
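
Behaviourally, the guard amounts to something like this (an
illustrative paraphrase from memory, not the literal
vac_update_relstats() code):

static TransactionId
clamp_new_relfrozenxid(TransactionId current, TransactionId proposed)
{
	/* accept only advances, unless the stored value is "from the future" */
	if (TransactionIdPrecedes(current, proposed) ||
		TransactionIdPrecedes(ReadNextTransactionId(), current))
		return proposed;

	return current;
}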

If memory serves it doesn't take much effort to exercise the relevant
code within vac_update_relstats(). I'm pretty sure that the regression
tests will fail if you run them after removing its defensive
no-older-relfrozenxid test (though I haven't checked recently).

--
Peter Geoghegan



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Peter Geoghegan
Date:
On Sat, Jan 6, 2024 at 3:24 PM Noah Misch <noah@leadboat.com> wrote:
> On Sun, Dec 31, 2023 at 03:53:34PM -0800, Peter Geoghegan wrote:
> > I am aware of a production database that appears to run into the same
> > problem. Inserts and concurrent cross-partition updates are used
> > heavily on this instance (the table in question uses partitioning).
> > Perhaps you happened upon a similar problematic production database,
> > and found this thread when researching the issue? Maybe we've both
> > seen the same problem in the wild?
>
> I did find this thread while researching the symptoms I was seeing.  No
> partitioning where I saw them.

I'm starting to think that partitioning was just a red herring. The
instance had problems in the presence of DELETEs, not UPDATEs (or
cross-partition UPDATEs).

Did the affected system that you investigated happen to have an
atypically high number of databases? The 15.4 system that I saw
the problem on had almost 3,000 databases.

--
Peter Geoghegan



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Noah Misch
Date:
On Tue, Jan 09, 2024 at 03:59:19PM -0500, Peter Geoghegan wrote:
> On Sat, Jan 6, 2024 at 3:24 PM Noah Misch <noah@leadboat.com> wrote:
> > On Sun, Dec 31, 2023 at 03:53:34PM -0800, Peter Geoghegan wrote:
> > > I am aware of a production database that appears to run into the same
> > > problem. Inserts and concurrent cross-partition updates are used
> > > heavily on this instance (the table in question uses partitioning).
> > > Perhaps you happened upon a similar problematic production database,
> > > and found this thread when researching the issue? Maybe we've both
> > > seen the same problem in the wild?
> >
> > I did find this thread while researching the symptoms I was seeing.  No
> > partitioning where I saw them.
> 
> I'm starting to think that partitioning was just a red herring. The
> instance had problems in the presence of DELETEs, not UPDATEs (or
> cross-partition UPDATEs).
> 
> Did the affected system that you investigated happen to have an
> atypically high number of databases? The 15.4 system that I saw
> the problem on had almost 3,000 databases.

No, single-digit database count here.



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Peter Geoghegan
Date:
On Tue, Jan 9, 2024 at 4:44 PM Noah Misch <noah@leadboat.com> wrote:
> On Tue, Jan 09, 2024 at 03:59:19PM -0500, Peter Geoghegan wrote:
> > Did the affected system that you investigated happen to have an
> > atypically high number of databases? The 15.4 system that I saw
> > the problem on had almost 3,000 databases.
>
> No, single-digit database count here.

My suspicion was that this factor might increase the propensity of
calls to GetOldestNonRemovableTransactionId (used to establish
VACUUM's OldestXmin) to not agree with the GlobalVis* based state used
by pruneheap.c, in the way that we need to worry about here (i.e.
inconsistencies that lead to VACUUM getting stuck inside
lazy_scan_prune's loop).

Using gdb I was able to determine that
ComputeXidHorizonsResultLastXmin == RecentXmin at some point long
after the system gets stuck (when I actually looked). So
GlobalVisTestShouldUpdate() doesn't return true at that point. And, I
see that VACUUM's OldestXmin value is between
GlobalVisDataRels.maybe_needed and
GlobalVisDataRels.definitely_needed. The deleted tuple's xmax is
committed according to OldestXmin (i.e. it's a value < OldestXmin),
and is < GlobalVisDataRels.definitely_needed, too. The same tuple xmax
is > GlobalVisDataRels.maybe_needed. As for this tuple's xmin, it was
already frozen by a previous VACUUM operation. The tuple infomask
flags indicate that it's a pretty standard deleted tuple.
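
To make the shape of that disagreement concrete, here is a standalone toy
model of the state described above. The names and numbers are invented, and
real XID comparisons are circular rather than the plain comparisons used
here; the point is only that the xmax is below OldestXmin (so
lazy_scan_prune expects HEAPTUPLE_DEAD) yet not below maybe_needed (so
pruning declines to remove it), which is exactly the combination that
produces the retry:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t ToyXid;

typedef struct ToyVisTest
{
    ToyXid      maybe_needed;      /* removable if xid < maybe_needed */
    ToyXid      definitely_needed; /* kept if xid >= definitely_needed */
} ToyVisTest;

/* HeapTupleSatisfiesVacuum()-style verdict, judged against OldestXmin. */
static bool
toy_vacuum_thinks_dead(ToyXid xmax, ToyXid oldest_xmin)
{
    return xmax < oldest_xmin;
}

/*
 * GlobalVisTest-style verdict used by pruning; the "in between" range is
 * modeled as "not removed", which is the case that matters here.
 */
static bool
toy_pruning_removes(const ToyVisTest *t, ToyXid xmax)
{
    return xmax < t->maybe_needed;
}

int
main(void)
{
    ToyVisTest  vistest = {.maybe_needed = 950, .definitely_needed = 1010};
    ToyXid      oldest_xmin = 1000; /* computed once, at VACUUM start */
    ToyXid      xmax = 980;         /* committed deleter of the tuple */

    if (toy_vacuum_thinks_dead(xmax, oldest_xmin) &&
        !toy_pruning_removes(&vistest, xmax))
        printf("HEAPTUPLE_DEAD according to OldestXmin, but not pruned: retry\n");
    return 0;
}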

Overall, there aren't a lot of details here that seem like they might
be out of the ordinary, hinting at a specific underlying cause.

It looks more like the assumptions that we make about OldestXmin
agreeing with GlobalVis* state just aren't quite robust, in general.
Ideally I'd be able to point to some specific assumption that has been
violated -- and we might yet tie the problem to some specific detail
that I've yet to identify. As I said upthread, I'm concerned that code
in places like procarray.c is rather loose about how the horizons are
recomputed, in a way that doesn't sit well with me.
GlobalVisTestShouldUpdate() thinks that it's okay to use
ComputeXidHorizonsResultLastXmin-based heuristics to decide when to
recompute horizons. It is more or less treated as a matter of weighing
costs against benefits -- not as a potential correctness issue.

--
Peter Geoghegan



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Peter Geoghegan
Date:
On Tue, Jan 9, 2024 at 6:16 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jan 9, 2024 at 4:44 PM Noah Misch <noah@leadboat.com> wrote:
> > On Tue, Jan 09, 2024 at 03:59:19PM -0500, Peter Geoghegan wrote:
> > > Did the affected system that you investigated happen to have an
> > > atypically high number of databases? The 15.4 system that I saw
> > > the problem on had almost 3,000 databases.
> >
> > No, single-digit database count here.
>
> My suspicion was that this factor might increase the propensity of
> calls to GetOldestNonRemovableTransactionId (used to establish
> VACUUM's OldestXmin) to not agree with the GlobalVis* based state used
> by pruneheap.c, in the way that we need to worry about here (i.e.
> inconsistencies that lead to VACUUM getting stuck inside
> lazy_scan_prune's loop).

Another question about your database/system: does VACUUM get stuck
while scanning a page some time after it has already completed a round
of index vacuuming? And if so, does an nbtree bulk delete end up
deleting and then recycling many index leaf pages (e.g., due to bulk
range deletions)?

That's what I see here -- I don't think that pruning leaves behind
even a single live heap tuple, despite scanning thousands of pages
before reaching the page that it gets stuck on. Could be another red
herring. But it doesn't seem impossible that some of the nbtree calls
to procarray.c routines performed by code added by my commit
9dd963ae25, "Recycle nbtree pages deleted during same VACUUM", are
somehow related. That is, that code could be part of the chain of
events that cause the problem (whether or not the code itself is
technically at fault).

I'm referring to calls such as the
"GetOldestNonRemovableTransactionId(NULL)" and
"GlobalVisCheckRemovableFullXid()" calls that take place inside
_bt_pendingfsm_finalize(). It's not like we do stuff like that in very
many other places.

--
Peter Geoghegan



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Noah Misch
Date:
On Wed, Jan 10, 2024 at 02:06:42PM -0500, Peter Geoghegan wrote:
> On Tue, Jan 9, 2024 at 6:16 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > On Tue, Jan 9, 2024 at 4:44 PM Noah Misch <noah@leadboat.com> wrote:
> > > On Tue, Jan 09, 2024 at 03:59:19PM -0500, Peter Geoghegan wrote:
> > > > Did the affected system that you investigated happen to have an
> > > > atypically high number of databases? The 15.4 system that I saw
> > > > the problem on had almost 3,000 databases.
> > >
> > > No, single-digit database count here.
> >
> > My suspicion was that this factor might increase the propensity of
> > calls to GetOldestNonRemovableTransactionId (used to establish
> > VACUUM's OldestXmin) to not agree with the GlobalVis* based state used
> > by pruneheap.c, in the way that we need to worry about here (i.e.
> > inconsistencies that lead to VACUUM getting stuck inside
> > lazy_scan_prune's loop).
> 
> Another question about your database/system: does VACUUM get stuck
> while scanning a page some time after it has already completed a round
> of index vacuuming?

I don't know.  That particular system experienced the infinite loop only once.

> That's what I see here -- I don't think that pruning leaves behind
> even a single live heap tuple, despite scanning thousands of pages
> before reaching the page that it gets stuck on. Could be another red
> herring. But it doesn't seem impossible that some of the nbtree calls
> to procarray.c routines performed by code added by my commit
> 9dd963ae25, "Recycle nbtree pages deleted during same VACUUM", are
> somehow related. That is, that code could be part of the chain of
> events that cause the problem (whether or not the code itself is
> technically at fault).
> 
> I'm referring to calls such as the
> "GetOldestNonRemovableTransactionId(NULL)" and
> "GlobalVisCheckRemovableFullXid()" calls that take place inside
> _bt_pendingfsm_finalize(). It's not like we do stuff like that in very
> many other places.

I see what you mean about the rarity and potential importance of
"GetOldestNonRemovableTransactionId(NULL)".  There's just one other caller,
vac_update_datfrozenxid(), which calls it for an unrelated cause.



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Peter Geoghegan
Date:
On Wed, Jan 10, 2024 at 2:38 PM Noah Misch <noah@leadboat.com> wrote:
> I don't know.  That particular system experienced the infinite loop only once.

While I certainly can't recreate the problem on demand, it has been
seen on this same application far more than once.

> > I'm referring to calls such as the
> > "GetOldestNonRemovableTransactionId(NULL)" and
> > "GlobalVisCheckRemovableFullXid()" calls that take place inside
> > _bt_pendingfsm_finalize(). It's not like we do stuff like that in very
> > many other places.
>
> I see what you mean about the rarity and potential importance of
> "GetOldestNonRemovableTransactionId(NULL)".  There's just one other caller,
> vac_update_datfrozenxid(), which calls it for an unrelated cause.

I just noticed another detail that adds significant weight to this
theory: it looks like the problem is hit on the first tuple located on
the first heap page that VACUUM scans *after* it completes its first
round of index vacuuming (I'm inferring this from vacrel state,
in particular its lpdead_items instrumentation counter). The dead_items
array is as large as possible here (just under 1 GiB), and
lpdead_items is 178956692 (which uses up all of our dead_items space).
VACUUM scans tens of gigabytes of heap pages before it begins this
initial round of index vacuuming (according to vacrel->scanned_pages).
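
For anyone checking that arithmetic: assuming the 6-byte ItemPointerData
representation used for each dead TID in the dead_items array, the reported
counter works out to just under the 1 GiB ceiling. A toy computation, not
PostgreSQL code:

#include <stdio.h>

int
main(void)
{
    long long   lpdead_items = 178956692;           /* value reported above */
    long long   tid_bytes = 6;                      /* assumed sizeof(ItemPointerData) */
    long long   total = lpdead_items * tid_bytes;   /* 1,073,740,152 bytes */

    printf("dead_items space used: %lld bytes (~1 GiB)\n", total);
    return 0;
}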

What are the chances that all of this is just a coincidence? Low, I'd say.

--
Peter Geoghegan



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Noah Misch
Date:
On Wed, Jan 10, 2024 at 02:57:34PM -0500, Peter Geoghegan wrote:
> On Wed, Jan 10, 2024 at 2:38 PM Noah Misch <noah@leadboat.com> wrote:
> > > I'm referring to calls such as the
> > > "GetOldestNonRemovableTransactionId(NULL)" and
> > > "GlobalVisCheckRemovableFullXid()" calls that take place inside
> > > _bt_pendingfsm_finalize(). It's not like we do stuff like that in very
> > > many other places.
> >
> > I see what you mean about the rarity and potential importance of
> > "GetOldestNonRemovableTransactionId(NULL)".  There's just one other caller,
> > vac_update_datfrozenxid(), which calls it for an unrelated cause.
> 
> I just noticed another detail that adds significant weight to this
> theory: it looks like the problem is hit on the first tuple located on
> the first heap page that VACUUM scans *after* it completes its first
> round of index vacuuming (I'm inferring this from vacrel state,
> in particular its lpdead_items instrumentation counter). The dead_items
> array is as large as possible here (just under 1 GiB), and
> lpdead_items is 178956692 (which uses up all of our dead_items space).
> VACUUM scans tens of gigabytes of heap pages before it begins this
> initial round of index vacuuming (according to vacrel->scanned_pages).
> 
> What are the chances that all of this is just a coincidence? Low, I'd say.

Agreed.  I bet you've made an important finding, there.



relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Noah Misch
Date:
On Sat, Jan 06, 2024 at 02:44:48PM -0800, Noah Misch wrote:
> On Sat, Jan 06, 2024 at 01:30:40PM -0800, Peter Geoghegan wrote:
> > On Sat, Jan 6, 2024 at 12:24 PM Noah Misch <noah@leadboat.com> wrote:
> > > Fair enough.  While I agree there's a decent chance back-patching would be
> > > okay, I think there's also a decent chance that 1ccc1e05ae creates the problem
> > > Matthias theorized.  Something like: we update relfrozenxid based on
> > > OldestXmin, even though GlobalVisState caused us to retain a tuple older than
> > > OldestXmin.  Then relfrozenxid disagrees with table contents.

I figure Matthias's upthread theory is more likely than not to hold.  If it
does hold, commit 1ccc1e05ae created a new corruption route.  Hence, I'm
adding a v17 open item for commit 1ccc1e05ae.

> > Either every relevant code path has the same OldestXmin to work off
> > of, or the whole NewRelfrozenXid/relfrozenxid-tracking thing can't be
> > expected to work as designed. I find it a bit odd that
> > pruneheap.c/GlobalVisState has no direct understanding of this
> > dependency (none that I can discern, at least). Wouldn't it at least
> > be more natural if pruneheap.c could access OldestXmin when run inside
> > VACUUM? (Could just be used by defensive hardening code.)
> 
> Tied to that decision is the choice of semantics when the xmin horizon moves
> backward during one VACUUM, e.g. when a new walsender xmin does so.  Options:
> 
> 1. Continue to remove tuples based on the OldestXmin from VACUUM's start.  We
>    could have already removed some of those tuples, so the walsender xmin
>    won't achieve a guarantee anyway.  (VACUUM would want ratchet-like behavior
>    in GlobalVisState, possibly by sharing OldestXmin with pruneheap like you
>    say.)
> 
> 2. Move OldestXmin backward, to reflect the latest xmin horizon.  (Perhaps
>    VACUUM would just pass GlobalVisState to a function that returns the
>    compatible OldestXmin.)
> 
> Which way is better?



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Robert Haas
Date:
On Sun, Mar 3, 2024 at 7:07 PM Noah Misch <noah@leadboat.com> wrote:
> I figure Matthias's upthread theory is more likely than not to hold.  If it
> does hold, commit 1ccc1e05ae created a new corruption route.  Hence, I'm
> adding a v17 open item for commit 1ccc1e05ae.

I need some help understanding what's going on here. I became aware of
this thread because I took a look at the open items list.

This email seems to have branched off of the thread for bug #17257,
reported 2021-10-29. The antecedent of "Matthias's upthread theory" is
unclear to me. These emails seem like the most relevant ones:

https://www.postgresql.org/message-id/CAEze2Wj7O5tnM_U151Baxr5ObTJafwH%3D71_JEmgJV%2B6eBgjL7g%40mail.gmail.com
https://www.postgresql.org/message-id/CAEze2WhxhEQEx%2Bc%2BCXoDpQs1H1HgkYUK4BW-hFw5_eQxuVWqRw%40mail.gmail.com
https://www.postgresql.org/message-id/20240106202413.e5%40rfd.leadboat.com

But I'm having a hard time piecing it all together. The general
picture seems to be that pruning and vacuum disagree about whether a
particular tuple is prunable; before 1ccc1e05ae, that caused the retry
loop in heap_page_prune() to retry forever. Now, it causes
relfrozenxid to be set to too new a value, which is a data-corruption
scenario. If that's right, I'm slightly miffed to find this being
labeled as an open item, since that makes it seem like 1ccc1e05ae
didn't create any new problem but only caused existing defects in the
GlobalVisTest machinery to have different consequences. Perhaps it's
all for the best, though. It's kind of embarrassing that we haven't
fixed whatever the problem is here yet.

But what exactly is the problem, and what's the fix? In the first of
the emails linked above, Matthias argues that the problem is that
GlobalVisState->maybe_needed can move backward. Peter Geoghegan seems
to agree with that here:

https://www.postgresql.org/message-id/CAH2-Wzk_L7Z7LREHTtg5vY08eeWdnHO70m98eWx4U1uwvW%3D0sA%40mail.gmail.com

And Peter seems to have been trying to make sense of Andres's remarks
here, which I think are saying the same thing:

https://www.postgresql.org/message-id/20210616192202.6q63mu66h4uyn343%40alap3.anarazel.de

So it seems like Matthias, Peter, and Andres all agree that
GlobalVisState->maybe_needed going backward is bad and causes this
problem. Unfortunately, I don't understand the mechanism.
vacuumlazy.c's LVRelState has this:

        /* VACUUM operation's cutoffs for freezing and pruning */
        struct VacuumCutoffs cutoffs;
        GlobalVisState *vistest;
        /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
        TransactionId NewRelfrozenXid;
        MultiXactId NewRelminMxid;

This looks scary, because we've got an awful lot of different
variables serving rather closely-related purposes. cutoffs actually
contains three separate XID/MXID pairs; GlobalVisTest contains XIDs;
and then we've got another XID/MXID pair in the form of
NewRel{frozenXid,minMxid}. It's certainly understandable how there
could be a bug here: you just need whichever of these things
ultimately goes into relfrozenxid to disagree with whichever of these
things actually controls the pruning behavior. But it is not clear to
me exactly which of these things disagree with each other in exactly
what way, and, perhaps as a result, it is also not clear to my why
preventing GlobalVisTest->maybe_needed from moving backward would fix
anything.

Can someone explain, or point me to the relevant previous email?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Matthias van de Meent
Date:
On Thu, 21 Mar 2024 at 17:15, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sun, Mar 3, 2024 at 7:07 PM Noah Misch <noah@leadboat.com> wrote:
> > I figure Matthias's upthread theory is more likely than not to hold.  If it
> > does hold, commit 1ccc1e05ae created a new corruption route.  Hence, I'm
> > adding a v17 open item for commit 1ccc1e05ae.
>
> I need some help understanding what's going on here. I became aware of
> this thread because I took a look at the open items list.
>
> This email seems to have branched off of the thread for bug #17257,
> reported 2021-10-29. The antecedent of "Matthias's upthread theory" is
> unclear to me. These emails seem like the most relevant ones:
>
> https://www.postgresql.org/message-id/CAEze2Wj7O5tnM_U151Baxr5ObTJafwH%3D71_JEmgJV%2B6eBgjL7g%40mail.gmail.com
> https://www.postgresql.org/message-id/CAEze2WhxhEQEx%2Bc%2BCXoDpQs1H1HgkYUK4BW-hFw5_eQxuVWqRw%40mail.gmail.com
> https://www.postgresql.org/message-id/20240106202413.e5%40rfd.leadboat.com
>
> But I'm having a hard time piecing it all together. The general
> picture seems to be that pruning and vacuum disagree about whether a
> particular tuple is prunable; before 1ccc1e05ae, that caused the retry
> loop in heap_page_prune() to retry forever. Now, it causes
> relfrozenxid to be set to too new a value, which is a data-corruption
> scenario. If that's right, I'm slightly miffed to find this being
> labeled as an open item, since that makes it seem like 1ccc1e05ae
> didn't create any new problem but only caused existing defects in the
> GlobalVisTest machinery to have different consequences. Perhaps it's
> all for the best, though. It's kind of embarrassing that we haven't
> fixed whatever the problem is here yet.
>
> But what exactly is the problem, and what's the fix? In the first of
> the emails linked above, Matthias argues that the problem is that
> GlobalVisState->maybe_needed can move backward. Peter Geoghegan seems
> to agree with that here:
>
> https://www.postgresql.org/message-id/CAH2-Wzk_L7Z7LREHTtg5vY08eeWdnHO70m98eWx4U1uwvW%3D0sA%40mail.gmail.com
>
> And Peter seems to have been trying to make sense of Andres's remarks
> here, which I think are saying the same thing:
>
> https://www.postgresql.org/message-id/20210616192202.6q63mu66h4uyn343%40alap3.anarazel.de
>
> So it seems like Matthias, Peter, and Andres all agree that
> GlobalVisState->maybe_needed going backward is bad and causes this
> problem. Unfortunately, I don't understand the mechanism.

There are 2 mechanisms I know of which allow this value to go backwards:

1. Replication slots that connect may set their backend's xmin to an
xmin < GlobalXmin.
This is known and has been documented, and was considered OK when this
was discussed on the list previously.

2. The commit abort path has a short window in which the backend's
xmin is unset and does not mirror the xmin of registered snapshots.
This is what I described in [0], and may be the worst (?) offender.

> --
> Robert Haas
> EDB: http://www.enterprisedb.com

[0] https://www.postgresql.org/message-id/CAEze2Wj%2BV0kTx86xB_YbyaqTr5hnE_igdWAwuhSyjXBYscf5-Q%40mail.gmail.com



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Robert Haas
Date:
On Thu, Mar 21, 2024 at 1:22 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> > So it seems like Matthias, Peter, and Andres all agree that
> > GlobalVisState->maybe_needed going backward is bad and causes this
> > problem. Unfortunately, I don't understand the mechanism.
>
> There are 2 mechanisms I know of which allow this value to go backwards:

I actually wasn't asking about the mechanism by which
GlobalVisState->maybe_needed could go backwards. I was asking about
the mechanism by which that could cause bad things to happen.

> 1. Replication slots that connect may set their backend's xmin to an
> xmin < GlobalXmin.
> This is known and has been documented, and was considered OK when this
> was discussed on the list previously.

Right, OK.

> 2. The commit abort path has a short window in which the backend's
> xmin is unset and does not mirror the xmin of registered snapshots.
> This is what I described in [0], and may be the worst (?) offender.
>
> [0] https://www.postgresql.org/message-id/CAEze2Wj%2BV0kTx86xB_YbyaqTr5hnE_igdWAwuhSyjXBYscf5-Q%40mail.gmail.com

So, what I would say is that this sounds inadvertent and so perhaps we
should do something about it, but also, it seems wrong to me that it
causes any serious problem. As far as I know, we've always treated the
result of an xmin calculation going backward as a rare but expected
case with which everything that depends on xmin calculations must
cope.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Melanie Plageman
Date:
On Fri, Mar 22, 2024 at 8:22 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Mar 21, 2024 at 1:22 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > > So it seems like Matthias, Peter, and Andres all agree that
> > > GlobalVisState->maybe_needed going backward is bad and causes this
> > > problem. Unfortunately, I don't understand the mechanism.
> >
> > There are 2 mechanisms I know of which allow this value to go backwards:
>
> I actually wasn't asking about the mechanism by which
> GlobalVisState->maybe_needed could go backwards. I was asking about
> the mechanism by which that could cause bad things to happen.
>
> > 1. Replication slots that connect may set their backend's xmin to an
> > xmin < GlobalXmin.
> > This is known and has been documented, and was considered OK when this
> > was discussed on the list previously.
>
> Right, OK.
>
> > 2. The commit abort path has a short window in which the backend's
> > xmin is unset and does not mirror the xmin of registered snapshots.
> > This is what I described in [0], and may be the worst (?) offender.
> >
> > [0] https://www.postgresql.org/message-id/CAEze2Wj%2BV0kTx86xB_YbyaqTr5hnE_igdWAwuhSyjXBYscf5-Q%40mail.gmail.com
>
> So, what I would say is that this sounds inadvertent and so perhaps we
> should do something about it, but also, it seems wrong to me that it
> causes any serious problem. As far as I know, we've always treated the
> result of an xmin calculation going backward as a rare but expected
> case with which everything that depends on xmin calculations must
> cope.

I'm still catching up here, so forgive me if this is a dumb question:
Does using GlobalVisState instead of VacuumCutoffs->OldestXmin when
freezing and determining relfrozenxid not solve the problem?

- Melanie



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Robert Haas
Date:
On Fri, Mar 22, 2024 at 2:41 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I'm still catching up here, so forgive me if this is a dumb question:
> Does using GlobalVisState instead of VacuumCutoffs->OldestXmin when
> freezing and determining relfrozenxid not solve the problem?

I think we need to understand what's actually happening here before we
jump to solutions. I think it's clear that we can't determine a
relfrozenxid in a way that is untethered from the decisions made while
pruning, but surely whoever wrote this code intended for there to be
some relationship between GlobalVisState and
VacuumCutoffs->OldestXmin. For example, it may be that they intended
for VacuumCutoffs->OldestXmin to always contain an XID that is older
than any XID that the GlobalVisState could possibly have pruned; but
they might have done that incorrectly, in such a fashion that when
GlobalVisState->maybe_needed moves backward the intended invariant no
longer holds.

But I think that we surely cannot think that they just threw some XID
that fell off the cart into VacuumCutoffs->OldestXmin and some other
XID that they pulled out of the air into GlobalVisState and hoped
everything worked out. It seems really likely that they had some
reason for thinking that we needed two things here rather than just
one. If not, OK, let's get rid of one of them and make it simpler, but
if we just assume that there's no real reason for having both fields
here and simplify accordingly, we're probably just going to be trading
this bug for some other one.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Noah Misch
Date:
On Fri, Mar 22, 2024 at 02:41:25PM -0400, Melanie Plageman wrote:
> On Fri, Mar 22, 2024 at 8:22 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Thu, Mar 21, 2024 at 1:22 PM Matthias van de Meent
> > <boekewurm+postgres@gmail.com> wrote:
> > > > So it seems like Matthias, Peter, and Andres all agree that
> > > > GlobalVisState->maybe_needed going backward is bad and causes this
> > > > problem. Unfortunately, I don't understand the mechanism.
> > >
> > > There are 2 mechanisms I know of which allow this value to go backwards:
> >
> > I actually wasn't asking about the mechanism by which
> > GlobalVisState->maybe_needed could go backwards. I was asking about
> > the mechanism by which that could cause bad things to happen.
> >
> > > 1. Replication slots that connect may set their backend's xmin to an
> > > xmin < GlobalXmin.
> > > This is known and has been documented, and was considered OK when this
> > > was discussed on the list previously.
> >
> > Right, OK.
> >
> > > 2. The commit abort path has a short window in which the backend's
> > > xmin is unset and does not mirror the xmin of registered snapshots.
> > > This is what I described in [0], and may be the worst (?) offender.
> > >
> > > [0] https://www.postgresql.org/message-id/CAEze2Wj%2BV0kTx86xB_YbyaqTr5hnE_igdWAwuhSyjXBYscf5-Q%40mail.gmail.com
> >
> > So, what I would say is that this sounds inadvertent and so perhaps we
> > should do something about it, but also, it seems wrong to me that it
> > causes any serious problem. As far as I know, we've always treated the
> > result of an xmin calculation going backward as a rare but expected
> > case with which everything that depends on xmin calculations must
> > cope.
> 
> I'm still catching up here, so forgive me if this is a dumb question:
> Does using GlobalVisState instead of VacuumCutoffs->OldestXmin when
> freezing and determining relfrozenxid not solve the problem?

One could fix it along those lines.  If GlobalVisState moves forward during
VACUUM, that's fine, but relfrozenxid needs to reflect the overall outcome,
not just the final GlobalVisState.  Suppose we remove XIDs <100 at page 1, <99
at page 2, and <101 at page 3.  relfrozenxid needs the value it would get if
we had removed <99 at every page.  I think GlobalVisState doesn't track that
today, but it could.  The 2024-03-14 commit e85662d added
GetStrictOldestNonRemovableTransactionId(), which targets a similar problem.
I've not reviewed it, but I suggest checking it for relevance to $SUBJECT.
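
A standalone sketch of the bookkeeping that example calls for (invented
names, not PostgreSQL code): whatever horizon was actually applied to each
page, the value fed into relfrozenxid must be the minimum over all of them,
not whatever the final horizon happened to be:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t ToyXid;

int
main(void)
{
    /* Horizons actually used while pruning pages 1..3, per the example. */
    ToyXid      used_horizon[] = {100, 99, 101};
    ToyXid      safe_bound = UINT32_MAX;

    for (int i = 0; i < 3; i++)
    {
        if (used_horizon[i] < safe_bound)
            safe_bound = used_horizon[i];
    }

    /* XIDs >= 99 may survive on page 2, so relfrozenxid must not exceed 99. */
    printf("relfrozenxid bound from pruning: %u\n", (unsigned) safe_bound);
    return 0;
}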



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Robert Haas
Date:
On Fri, Mar 22, 2024 at 3:34 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I think we need to understand what's actually happening here before we
> jump to solutions. I think it's clear that we can't determine a
> relfrozenxid in a way that is untethered from the decisions made while
> pruning, but surely whoever wrote this code intended for there to be
> some relationship between GlobalVisState and
> VacuumCutoffs->OldestXmin. For example, it may be that they intended
> for VacuumCutoffs->OldestXmin to always contain an XID that is older
> than any XID that the GlobalVisState could possibly have pruned; but
> they might have done that incorrectly, in such a fashion that when
> GlobalVisState->maybe_needed moves backward the intended invariant no
> longer holds.

OK, I did some more analysis of this. I kind of feel like maybe this
should be getting more attention from Melanie (since it was the commit
of her patch that triggered the addition of this to the open items
list) or Andres (because it has been alleged to be related to the
GlobalVisTest stuff), but here we are.

Maybe this is already clear to some other people here, but I think the
bug is essentially the result of this code in heap_vacuum_rel:

    vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
    vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
    vacrel->vistest = GlobalVisTestFor(rel);

The idea here seems to be that because we call vacuum_get_cutoffs()
before we call GlobalVisTestFor(), the cutoff established by the
former should precede the cutoff established by the latter.
vacrel->cutoffs.OldestXmin is propagated into vacrel->NewRelfrozenXid,
and although the latter value seems like it can get updated later, it
remains true that vacrel->cutoffs.OldestXmin affects how relfrozenxid
ultimately gets set. However, that value is set in
vacuum_get_cutoffs() to the result of
GetOldestNonRemovableTransactionId(rel); and that function works with
the results of ComputeXidHorizons(). Meanwhile, GlobalVisTestFor()
creates a visibility test object which is ultimately going to be
tested inside heap_page_prune() using GlobalVisTestIsRemovableXid()
which is a wrapper around GlobalVisTestIsRemovableFullXid() which can
call GlobalVisUpdate() which can call ... ta da ...
ComputeXidHorizons(). So the behavior of ComputeXidHorizons() is the
key point here. And sure enough, the header comment for that function
says this:

 * Note: despite the above, it's possible for the calculated values to move
 * backwards on repeated calls. The calculated values are conservative, so
 * that anything older is definitely not considered as running by anyone
 * anymore, but the exact values calculated depend on a number of things. For
 * example, if there are no transactions running in the current database, the
 * horizon for normal tables will be latestCompletedXid. If a transaction
 * begins after that, its xmin will include in-progress transactions in other
 * databases that started earlier, so another call will return a lower value.
 * Nonetheless it is safe to vacuum a table in the current database with the
 * first result.  There are also replication-related effects: a walsender
 * process can set its xmin based on transactions that are no longer running
 * on the primary but are still being replayed on the standby, thus possibly
 * making the values go backwards.  In this case there is a possibility that
 * we lose data that the standby would like to have, but unless the standby
 * uses a replication slot to make its xmin persistent there is little we can
 * do about that --- data is only protected if the walsender runs continuously
 * while queries are executed on the standby.  (The Hot Standby code deals
 * with such cases by failing standby queries that needed to access
 * already-removed data, so there's no integrity bug.)

That means that heap_vacuum_rel() isn't guaranteeing anything by
calling vacuum_get_cutoffs() before it calls GlobalVisTestFor(). I
started wondering when this bug got introduced, and concluded that it
probably dates all the way back to
dc7420c2c9274a283779ec19718d2d16323640c0, when Andres first introduced
the GlobalVisTest stuff. The code was substantially refactored
afterwards by Peter Geoghegan, in
b4af70cb210393c9c8f41643acf6b213e21178e7 and
73f6ec3d3c8d5786c54373e71a096e5acf78e7ca. However, as far as I can
see, neither of those commits created the problem; they just moved
code around, and in fact I'd say that the problem is easier to see now
than it was before all of that refactoring because the two problematic
initializations are now closer together than they were originally. I
think that Peter may have been under the impression at the time of the
latter commit that the order of initializing vacrel->cutoffs vs.
vacrel->vistest was significant, because the commit message says:

    Explain the relationship between vacuumlazy.c's vistest and OldestXmin
    cutoffs.  These closely related cutoffs are different in subtle but
    important ways.  Also document a closely related rule: we must establish
    rel_pages _after_ OldestXmin to ensure that no XID < OldestXmin can be
    missed by lazy_scan_heap().

And a comment in that commit says:

+     * We expect vistest will always make heap_page_prune remove any deleted
+     * tuple whose xmax is < OldestXmin.  lazy_scan_prune must never become

Neither of these is perhaps 100% explicit about the ordering being
relevant, but if that wasn't thought to make this code safe, then I
don't know what was thought to make it safe.

At any rate, unless I'm missing something, which is definitely
possible, the actual bug dates to dc7420c2c, which means whatever fix
we come up with here will need to be back-patched back to v14.

But what exactly is the fix here? I now understand why there was
discussion of preventing maybe_needed from going backward, but I'm not
sure that's quite the right idea. I think the problem isn't exactly
that the maybe_needed value that is updated by GlobalVisUpdateApply()
can go backwards, but rather that the SOMETHING_oldest_removable value
on which it depends can go backwards. AFAICS, the problem isn't so
much that the GlobalVisTest itself can retreat to an older value, but
that it can retreat compared to the OldestXmin obtained from calling
GetOldestNonRemovableTransactionId(), which depends on the
SOMETHING_oldest_removable threshold but *not* on maybe_needed.
Perhaps, for some reason I'm not grokking at the moment, preventing
maybe_needed from retreating would be good enough to prevent trouble
in practice, but it doesn't appear to me to be the most principled
solution as of this writing.
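
For concreteness, one shape the "prevent it from retreating" idea could
take -- whether applied to maybe_needed or to the oldest-removable value it
is derived from -- is a ratchet against the OldestXmin that VACUUM already
computed; whether that is the right invariant is exactly what is in
question here. A standalone toy sketch with invented names, not a proposed
patch (real XID comparisons are circular; plain comparisons keep the toy
readable):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t ToyXid;

/*
 * Clamp a freshly recomputed pruning horizon so it never falls below the
 * OldestXmin established at the start of this VACUUM.
 */
static ToyXid
toy_refresh_horizon(ToyXid recomputed, ToyXid vacuum_oldest_xmin)
{
    return (recomputed < vacuum_oldest_xmin) ? vacuum_oldest_xmin : recomputed;
}

int
main(void)
{
    ToyXid      oldest_xmin = 1000;

    /* A recomputation that went backwards gets clamped... */
    printf("%u\n", (unsigned) toy_refresh_horizon(950, oldest_xmin));  /* 1000 */
    /* ...while one that advanced is kept as-is. */
    printf("%u\n", (unsigned) toy_refresh_horizon(1200, oldest_xmin)); /* 1200 */
    return 0;
}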

I kind of feel like the code is just going at this in a way that's
completely backwards. It feels like what we should be doing is looking
at the actual states of the pages post-pruning, and specifically what
the oldest (M)XIDs that are really present there are. But changing
that seems like a lot to do in a fix that needs to be back-patched.
It's tempting to look for a narrower fix, where we just find out the
oldest horizon that was ever actually used. But I don't feel very good
about doing that by tampering with ComputeXidHorizons(), either. That
seems like it could have nasty side effects in terms of either
performance or correctness, which is really scary for a fix that needs
to be back-patched. I wonder if there's some way that we can get a
more reliable idea of what XIDs were used for pruning that is strictly
local to vacuum and can't affect the rest of the system (and is also
hopefully not so conservative that it causes damage by unduly reducing
relfrozenxid advancement). However, I don't have a good idea about
what such a fix might look like right now, and I'm out of time for
today.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Peter Geoghegan
Date:
On Fri, Mar 29, 2024 at 12:05 PM Robert Haas <robertmhaas@gmail.com> wrote:
> That means that heap_vacuum_rel() isn't guaranteeing anything by
> calling vacuum_get_cutoffs() before it calls GlobalVisTestFor(). I
> started wondering when this bug got introduced, and concluded that it
> probably dates all the way back to
> dc7420c2c9274a283779ec19718d2d16323640c0, when Andres first introduced
> the GlobalVisTest stuff. The code was substantially refactored
> afterwards by Peter Geoghegan, in
> b4af70cb210393c9c8f41643acf6b213e21178e7 and
> 73f6ec3d3c8d5786c54373e71a096e5acf78e7ca. However, as far as I can
> see, neither of those commits created the problem; they just moved
> code around, and in fact I'd say that the problem is easier to see now
> than it was before all of that refactoring because the two problematic
> initializations are now closer together than they were originally. I
> think that Peter may have been under the impression at the time of the
> latter commit that the order of initializing vacrel->cutoffs vs.
> vacrel->vistest was significant, because the commit message says:

FWIW I never thought that the order that we called
vacuum_get_cutoffs() relative to when we call GlobalVisTestFor() was
directly significant (though I did think that about the order that we
obtain VACUUM's rel_pages and the vacuum_get_cutoffs() call). I can't
have thought that, because clearly GlobalVisTestFor() just returns a
pointer, and so cannot directly affect backend local state.

>     Explain the relationship between vacuumlazy.c's vistest and OldestXmin
>     cutoffs.  These closely related cutoffs are different in subtle but
>     important ways.  Also document a closely related rule: we must establish
>     rel_pages _after_ OldestXmin to ensure that no XID < OldestXmin can be
>     missed by lazy_scan_heap().
>
> And a comment in that commit says:
>
> +     * We expect vistest will always make heap_page_prune remove any deleted
> +     * tuple whose xmax is < OldestXmin.  lazy_scan_prune must never become
>
> Neither of these is perhaps 100% explicit about the ordering being
> relevant, but if that wasn't thought to make this code safe, then I
> don't know what was thought to make it safe.

I intended to highlight the general nature of the dependency  --
nothing more. I don't claim to be an authoritative source on this,
though I was confident that the code was *supposed* to work like this
(how else could it possibly work?).

It was clear that this is an important issue, from an early stage.
Pre-release 14 had 2 or 3 bugs that all had the same symptom:
lazy_scan_prune would loop forever. This was true even though each of
the bugs had fairly different underlying causes (all tied to
dc7420c2c). I figured that there might well be more bugs like that in
the future.

> At any rate, unless I'm missing something, which is definitely
> possible, the actual bug dates to dc7420c2c, which means whatever fix
> we come up with here will need to be back-patched back to v14.

I agree.

> I kind of feel like the code is just going at this in a way that's
> completely backwards. It feels like what we should be doing is looking
> at the actual states of the pages post-pruning, and specifically what
> the oldest (M)XIDs that are really present there are. But changing
> that seems like a lot to do in a fix that needs to be back-patched.

I said the same thing recently, to Noah. This is still something that
ought to be made very explicit -- pruneheap.c should be directly aware
of its responsibility to not leave any XID < VACUUM's OldestXmin
behind, except for any XIDs that only freezing can deal with. (We're
moving in the direction of making pruneheap.c responsible for
freezing, so perhaps my caveat about freezing shouldn't apply on
HEAD.)

> It's tempting to look for a narrower fix, where we just find out the
> oldest horizon that was ever actually used. But I don't feel very good
> about doing that by tampering with ComputeXidHorizons(), either.

I have every reason to believe that the remaining problems in this
area are extremely rare. I wonder if it would make sense to focus on
making the infinite loop behavior in lazy_scan_prune just throw an
error.

I now fear that that'll be harder than one might think. At the time
that I added the looping behavior (in commit 8523492d), I believed
that the only "legitimate" reason that it could ever be needed was the
same reason why we needed the old tupgone behavior (to deal with
concurrently-inserted tuples from transactions that abort in flight).
But now I worry that it's actually protective, in some way that isn't
generally understood. And so it might be that converting the retry
into a hard error (e.g., erroring-out after MaxHeapTuplesPerPage
retries) will create new problems.

--
Peter Geoghegan



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Robert Haas
Date:
On Fri, Mar 29, 2024 at 1:17 PM Peter Geoghegan <pg@bowt.ie> wrote:
> FWIW I never thought that the order that we called
> vacuum_get_cutoffs() relative to when we call GlobalVisTestFor() was
> directly significant (though I did think that about the order that we
> attain VACUUM's rel_pages and the vacuum_get_cutoffs() call). I can't
> have thought that, because clearly GlobalVisTestFor() just returns a
> pointer, and so cannot directly affect backend local state.

Hmm, OK.

> It was clear that this is an important issue, from an early stage.
> Pre-release 14 had 2 or 3 bugs that all had the same symptom:
> lazy_scan_prune would loop forever. This was true even though each of
> the bugs had fairly different underlying causes (all tied to
> dc7420c2c). I figured that there might well be more bugs like that in
> the future.

Looks like you were right.

> I have every reason to believe that the remaining problems in this
> area are extremely rare. I wonder if it would make sense to focus on
> making the infinite loop behavior in lazy_scan_prune just throw an
> error.
>
> I now fear that that'll be harder than one might think. At the time
> that I added the looping behavior (in commit 8523492d), I believed
> that the only "legitimate" reason that it could ever be needed was the
> same reason why we needed the old tupgone behavior (to deal with
> concurrently-inserted tuples from transactions that abort in flight).
> But now I worry that it's actually protective, in some way that isn't
> generally understood. And so it might be that converting the retry
> into a hard error (e.g., erroring-out after MaxHeapTuplesPerPage
> retries) will create new problems.

It also sounds like it would boil down to "ERROR: our code sucks", so
count me as not a fan of that approach. As much as I don't like the
idea of significant changes to the back-branches, I think I like that
idea even less.

On the other hand, I also don't have an idea that I do like right now,
so it's probably too early to decide anything just yet. I'll try to
find more time to study this (and I hope others do the same).

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Melanie Plageman
Date:
On Fri, Mar 29, 2024 at 12:05 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Mar 22, 2024 at 3:34 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I think we need to understand what's actually happening here before we
> > jump to solutions. I think it's clear that we can't determine a
> > relfrozenxid in a way that is untethered from the decisions made while
> > pruning, but surely whoever wrote this code intended for there to be
> > some relationship between GlobalVisState and
> > VacuumCutoffs->OldestXmin. For example, it may be that they intended
> > for VacuumCutoffs->OldestXmin to always contain an XID that is older
> > than any XID that the GlobalVisState could possibly have pruned; but
> > they might have done that incorrectly, in such a fashion that when
> > GlobalVisState->maybe_needed moves backward the intended invariant no
> > longer holds.
>
> OK, I did some more analysis of this. I kind of feel like maybe this
> should be getting more attention from Melanie (since it was the commit
> of her patch that triggered the addition of this to the open items
> list) or Andres (because it has been alleged to be related to the
> GlobalVisTest stuff), but here we are.

Agreed that I should be doing more of the analysis. I plan to spend
time on this very soon.

Obviously we should actually fix this on back branches, but could we
at least make the retry loop interruptible in some way so people could
use pg_cancel/terminate_backend() on a stuck autovacuum worker or
vacuum process?
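
As a rough illustration of what "interruptible" could mean here -- a
standalone toy loop with invented names, not a patch against
lazy_scan_prune() -- the retry loop would need to service cancel/terminate
requests on each iteration, and could additionally give up after some
bounded number of attempts, as mooted upthread:

#include <stdbool.h>
#include <stdio.h>

#define TOY_MAX_RETRIES 100

/* Stand-in for CHECK_FOR_INTERRUPTS(): returns true if a cancel arrived. */
static bool
toy_interrupt_pending(void)
{
    return false;               /* a real backend would consult its signal flags */
}

/* Stand-in for the per-tuple recheck that keeps coming back "retry". */
static bool
toy_recheck_says_retry(void)
{
    return true;
}

int
main(void)
{
    for (int attempt = 0; attempt < TOY_MAX_RETRIES; attempt++)
    {
        if (toy_interrupt_pending())
        {
            printf("canceled after %d attempts\n", attempt);
            return 1;
        }
        if (!toy_recheck_says_retry())
        {
            printf("made progress\n");
            return 0;
        }
    }
    printf("giving up after %d retries\n", TOY_MAX_RETRIES);
    return 1;
}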

- Melanie



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Robert Haas
Date:
On Sat, Mar 30, 2024 at 12:19 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Obviously we should actually fix this on back branches, but could we
> at least make the retry loop interruptible in some way so people could
> use pg_cancel/terminate_backend() on a stuck autovacuum worker or
> vacuum process?

Yeah, as long as that's not the only thing we do.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Robert Haas
Date:
On Fri, Oct 29, 2021 at 9:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> I can propose the debugging patch to reproduce the issue that replaces
> the hang with the assert and modifies a pair of crash-causing test
> scripts to simplify the reproducing. (Sorry, I have no time now to prune
> down the scripts further as I have to leave for a week.)

Just FYI, I tried to reproduce this today on v16, using this formula,
with some hacking around to try to get it working on my MacBook, and I
couldn't get it to crash.

I doubt my failure to reproduce is because anything has been fixed.
It's probably due to some kind of dumbitude on my part. The patch
doesn't apply to the head of v16, unsurprisingly, and a bunch of the
Linux commands don't work here, so it's all kind of a muddle for me
trying to get the same setup in place. I should probably be better at
reproducing things like this than I am.

I also tried to reproduce with a simpler setup where I just ran a
normal pgbench in one terminal and a pgbench running only "vacuum
pgbench_accounts;" in another. What I see is that such a setup never
does a "goto retry;". I tried putting a loop around this code:

        res = HeapTupleSatisfiesVacuum(&tuple, vacrel->cutoffs.OldestXmin,
                                       buf);

...to do the same thing 1000 times, and I do see that if I do that,
the return value of HeapTupleSatisfiesVacuum() sometimes changes part
way through the thousand iterations. But so far I haven't seen a
single instance of it changing to HEAPTUPLE_DEAD. It seems to be all
stuff like HEAPTUPLE_LIVE -> HEAPTUPLE_DELETE_IN_PROGRESS or
HEAPTUPLE_DELETE_IN_PROGRESS -> HEAPTUPLE_RECENTLY_DEAD. There are a
few other combinations that I see appearing too, but the point is that
the new state in this test never seems to be HEAPTUPLE_DEAD, and
therefore there's no retry and no ability to reproduce the bug.

Any ideas?

P.S. See also discussion on the "relfrozenxid may disagree with row
XIDs after 1ccc1e05ae" thread.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Alexander Lakhin
Date:
Hi Robert,

05.04.2024 23:20, Robert Haas wrote:
> On Fri, Oct 29, 2021 at 9:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>> I can propose the debugging patch to reproduce the issue that replaces
>> the hang with the assert and modifies a pair of crash-causing test
>> scripts to simplify the reproducing. (Sorry, I have no time now to prune
>> down the scripts further as I have to leave for a week.)
> Just FYI, I tried to reproduce this today on v16, using this formula,
> with some hacking around to try to get it working on my MacBook, and I
> couldn't get it to crash.

I've refreshed the script and simplified it a bit not to use Linux
specifics. This works for me (on REL_14_0, with the patch applied,
CPPFLAGS="-O0" ./configure --enable-debug --enable-cassert ...):
echo "
autovacuum=off
fsync=off
" >> "$PGDATA/postgresql.conf"

pg_ctl -w -l server.log start

export PGDATABASE=regression
createdb regression

echo "
vacuum (verbose, skip_locked, index_cleanup off) pg_catalog.pg_class;
select pg_sleep(random()/50);
" >/tmp/17257/pseudo-autovacuum.sql

pgbench -n -f /tmp/17257/inherit.sql -C -T 1200 >pgbench-1.log 2>&1 &
pgbench -n -f /tmp/17257/vacuum.sql -C -T 1200 >pgbench-2.log 2>&1 &
pgbench -n -f /tmp/17257/pseudo-autovacuum.sql -C -c 10 -T 1200 >pgbench-3.log 2>&1 &
wait
grep -E "(TRAP|terminated)" server.log

(Please use the attached inherit.sql, vacuum.sql (excerpts from
src/test/regress/sql/{inherit,vacuum}.sql).)

With PGDATA placed on tmpfs, this script failed for me after 1m31s,
2m35s, 4m12s:
TRAP: FailedAssertion("numretries < 100", File: "vacuumlazy.c", Line: 1726, PID: 951498)

Another possible outcome:
TRAP: FailedAssertion("relid == targetRelId", File: "relcache.c", Line: 1062, PID: 1257766)

And also:
2024-04-07 05:03:21.656 UTC [2905313] LOG:  server process (PID 2984687) was terminated by signal 6: Aborted
2024-04-07 05:03:21.656 UTC [2905313] DETAIL:  Failed process was running: create table matest0 (id serial primary key, name text);
With the stack trace:
...
#4  0x00007fc30b4007f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x0000559f50220719 in index_delete_sort_cmp (deltid1=0x559f523a9f40, deltid2=0x7ffd2f9623f8) at heapam.c:7582
#6  0x0000559f50220847 in index_delete_sort (delstate=0x7ffd2f9636f0) at heapam.c:7623
...
(as in [1])

But on dad1539ae I got no failures for 3 runs (the same is on
REL_16_STABLE with a slightly modified lazy_scan_prune patch).

[1] https://www.postgresql.org/message-id/17255-14c0ac58d0f9b583%40postgresql.org

Best regards,
Alexander

Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Robert Haas
Date:
On Sun, Apr 7, 2024 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> I've refreshed the script and simplified it a bit not to use Linux
> specifics.

Thanks. I was able to reproduce this today on macOS 13.6.2 with
REL_14_0. It didn't seem like it reproduced as quickly for me as it
did for you, but I got it to fail an assertion eventually. Then my
debugging efforts were frustrated by
http://postgr.es/m/CA+TgmobW9bEuvSrQR1D1K6_8=DmY2tzkuepAjCWF=j4B1w0rWw@mail.gmail.com
- gah

> But on dad1539ae I got no failures for 3 runs (the same is on
> REL_16_STABLE with a slightly modified lazy_scan_prune patch).

I'm having trouble understanding what this means exactly -- are you
talking about REL_16_STABLE, or about dad1539ae, or both, or what? At
any rate, it's really important here that we understand whether we
still have a bug here, and if so, in which releases and with what test
case. I wasn't aware of dad1539ae but that certainly seems like it
might've made a big difference, if not fixing the problem entirely
then at least making it a lot less likely. And I think it's possible
that some of the related freezing+pruning commits on master might have
removed the problem altogether, but that needs to be tested.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Alexander Lakhin
Date:
09.04.2024 21:59, Robert Haas wrote:
>
>> But on dad1539ae I got no failures for 3 runs (the same is on
>> REL_16_STABLE with a slightly modified lazy_scan_prune patch).
> I'm having trouble understanding what this means exactly -- are you
> talking about REL_16_STABLE, or about dad1539ae, or both, or what? At
> any rate, it's really important here that we understand whether we
> still have a bug here, and if so, in which releases and with what test
> case. I wasn't aware of dad1539ae but that certainly seems like it
> might've made a big difference, if not fixing the problem entirely
> then at least making it a lot less likely. And I think it's possible
> that some of the related freezing+pruning commits on master might have
> removed the problem altogether, but that needs to be tested.

I was talking about both — dad1539ae eliminated the bug (judging from the
repro) on REL_14_STABLE, and I suppose that 18b87b201 did the same for
REL_15_STABLE/REL_16_STABLE...
(On 18b87b201~1 I've got (twice):
TRAP: FailedAssertion("HeapTupleHeaderIsHeapOnly(htup)", File: "pruneheap.c", Line: 964, PID: 1854895)

I can double-check exactly 18b87b201 with the repro, if it makes sense.)

Best regards,
Alexander



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Robert Haas
Date:
On Tue, Apr 9, 2024 at 4:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> I was talking about both — dad1539ae eliminated the bug (judging from the
> repro) on REL_14_STABLE, and I suppose that 18b87b201 did the same for
> REL_15_STABLE/REL_16_STABLE...
> (On 18b87b201~1 I've got (twice):
> TRAP: FailedAssertion("HeapTupleHeaderIsHeapOnly(htup)", File: "pruneheap.c", Line: 964, PID: 1854895)

I'm inclined to believe that there's a residual bug here, or at least
that there was one before 6dbb490261a6170a3fc3e326c6983ad63e795047 and
related commits. But if we can't reproduce the issue, I don't know how
much we can realistically do here. This logic is so complicated and so
tricky that I don't trust myself to say "yes, this is definitely
wrong, and after change XYZ it's definitely right." The bug is as
likely to be in my analysis as it is to be in the code itself.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Robert Haas
Date:
Hi,

I propose to remove this open item from
https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items

On the original thread (BUG #17257), Alexander Lakhin says that he
can't reproduce this after dad1539ae/18b87b201. Based on my analysis
of the code, I suspect that there is a residual bug, or at least that
there was one prior to 6f47f6883151366c031cd6fd4011e66d2c702a90. (On
the other thread, I cited 6dbb490261a6170a3fc3e326c6983ad63e795047,
but that's not really what I meant.)

But this code is too hairy for me to be certain whether there's a bug
or not without some kind of a test case, and six weeks after the open
item was added, we still don't have one that works on any v16 commit,
either before or after the one named in the subject line. I hate to
leave a potential data-corrupting bug unaddressed, but under present
circumstances, I find it hard to justify spending more time on this.

...Robert



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Andres Freund
Date:
Hi,

I've tried a couple times to catch up with this thread. But always kinda felt
I must be missing something. It might be that this is one part of the
confusion:

On 2024-01-06 12:24:13 -0800, Noah Misch wrote:
> Fair enough.  While I agree there's a decent chance back-patching would be
> okay, I think there's also a decent chance that 1ccc1e05ae creates the problem
> Matthias theorized.  Something like: we update relfrozenxid based on
> OldestXmin, even though GlobalVisState caused us to retain a tuple older than
> OldestXmin.  Then relfrozenxid disagrees with table contents.

Looking at the state as of 1ccc1e05ae, I don't see how - in lazy_scan_prune(),
if heap_page_prune() spuriously didn't prune a tuple, because the horizon went
backwards, we'd encounter the tuple in the loop below and call
heap_prepare_freeze_tuple(), which would error out with one of

    /*
     * Process xmin, while keeping track of whether it's already frozen, or
     * will become frozen iff our freeze plan is executed by caller (could be
     * neither).
     */
    xid = HeapTupleHeaderGetXmin(tuple);
    if (!TransactionIdIsNormal(xid))
        xmin_already_frozen = true;
    else
    {
        if (TransactionIdPrecedes(xid, cutoffs->relfrozenxid))
            ereport(ERROR,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg_internal("found xmin %u from before relfrozenxid %u",
                                     xid, cutoffs->relfrozenxid)));

or
        if (TransactionIdPrecedes(update_xact, cutoffs->relfrozenxid))
            ereport(ERROR,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg_internal("multixact %u contains update XID %u from before relfrozenxid %u",
                                     multi, update_xact,
                                     cutoffs->relfrozenxid)));
or
        /* Raw xmax is normal XID */
        if (TransactionIdPrecedes(xid, cutoffs->relfrozenxid))
            ereport(ERROR,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg_internal("found xmax %u from before relfrozenxid %u",
                                     xid, cutoffs->relfrozenxid)));


I'm not saying that spuriously erroring out would be ok. But I guess I just
don't understand the data corruption theory in this subthread, because we'd
error out if we encountered a tuple that should have been frozen but wasn't?

Greetings,

Andres Freund



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Andres Freund
Date:
Hi,

On 2024-04-15 12:35:59 -0400, Robert Haas wrote:
> I propose to remove this open item from
> https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items
> 
> On the original thread (BUG #17257), Alexander Lakhin says that he
> can't reproduce this after dad1539ae/18b87b201. Based on my analysis
> of the code, I suspect that there is a residual bug, or at least that
> there was one prior to 6f47f6883151366c031cd6fd4011e66d2c702a90. (On
> the other thread, I cited 6dbb490261a6170a3fc3e326c6983ad63e795047,
> but that's not really what I meant.)

I assume the bug you suspect is that the horizon could go backwards during
pruning, which'd lead to a spurious "found xmin %u from before relfrozenxid
%u" error?


I think Matthias' observation that transaction aborts can lead to the xmin
horizon going backward needs to be fixed - it seems quite problematic,
regardless of whether it causes issues during VACUUM. Even if it were
perfectly safe today, it seems very likely to cause issues in the future.


> But this code is too hairy for me to be certain whether there's a bug
> or not without some kind of a test case, and six weeks after the open
> item was added, we still don't have one that works on any v16 commit,
> either before or after the one named in the subject line.

Nor does any of the test cases fail on any of the other branches, right?
I.e. they're just testing a bug that's been fixed for ~3 years?

Greetings,

Andres Freund



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Andres Freund
Date:
Hi,

On 2024-03-29 12:05:31 -0400, Robert Haas wrote:
> But what exactly is the fix here? I now understand why there was
> discussion of preventing maybe_needed from going backward, but I'm not
> sure that's quite the right idea. I think the problem isn't exactly
> that the maybe_needed value that is updated by GlobalVisUpdateApply()
> can go backwards, but rather that the SOMETHING_oldest_removable value
> on which it depends can go backwards. AFAICS, the problem isn't so
> much that the GlobalVisTest itself can retreat to an older value, but
> that it can retreat compared to the OldestXmin obtained from calling
> GetOldestNonRemovableTransactionId(), which depends on the
> SOMETHING_oldest_removable threshold but *not* on maybe_needed.

I agree that in theory SOMETHING_oldest_removable should never go
backwards. But in practice it's decidedly nontrivial to prevent that. There's
a reason we've had a version of this comment:
 * Note: despite the above, it's possible for the calculated values to move
 * backwards on repeated calls.
for a long time.

We really ought to ensure that it doesn't go backwards even across sessions,
because one session vacuuming a relation with one horizon, and then a
subsequent session vacuuming the same relation with an earlier horizon seems
like a recipe for bugs.

But because the horizon depends on the database of the affected object, we
can't just stash a value for each of the possible horizons in shared
memory. We'd need a database-oid keyed hash-table or such.
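
To make the shape of that concrete, a purely hypothetical sketch (none of these
names or helpers exist in PostgreSQL today; locking, XID wraparound and entry
eviction are all hand-waved): a shared table keyed by database OID whose stored
horizon only ever ratchets forward.

    /* Hypothetical sketch only, not existing code. */
    typedef struct DbHorizonEntry
    {
        Oid             dboid;              /* hash key */
        TransactionId   last_removable;     /* newest horizon handed out so far */
    } DbHorizonEntry;

    static TransactionId
    ClampHorizonForDatabase(Oid dboid, TransactionId computed)
    {
        DbHorizonEntry *entry;

        LWLockAcquire(DbHorizonLock, LW_EXCLUSIVE);     /* hypothetical lock */
        entry = dbhorizon_lookup_or_insert(dboid);      /* hypothetical helper */

        if (TransactionIdIsValid(entry->last_removable) &&
            TransactionIdPrecedes(computed, entry->last_removable))
            computed = entry->last_removable;   /* never hand out an older horizon */
        else
            entry->last_removable = computed;   /* ratchet forward */

        LWLockRelease(DbHorizonLock);

        return computed;
    }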


OTOH, for hot_standby_feedback without a slot, we kinda *do* want the horizons
to go backwards, so that a brief failure of the replication connection is less
likely to cause ongoing queries to fail over the next while.


> Perhaps, for some reason I'm not grokking at the moment, preventing
> maybe_needed from retreating would be good enough to prevent trouble
> in practice, but it doesn't appear to me to be the most principled
> solution as of this writing.

If we prevent maybe_needed from retreating below the OldestXmin value used for
cutoff, then pruning should never be less aggressive than what's needed by
lazy_scan_prune(). So lazy_scan_prune() shouldn't be able to encounter tuples
that weren't considered removable in heap_page_prune(), in < 17, nor would
heap_page_prune_and_freeze() have that issue.



I think one fairly easy fix for this would be to pass OldestXmin to
heap_page_prune[_and_freeze](), and have heap_prune_satisfies_vacuum() first
check OldestXmin and only if that considers the tuple still running, use
GlobalVisTestIsRemovableXid(). That continues to allow us to prune more
aggressively, without any potential risk of IsRemoval being too conservative.
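
As a rough sketch of that ordering inside heap_prune_satisfies_vacuum(), once
HeapTupleSatisfiesVacuumHorizon() has reported HEAPTUPLE_RECENTLY_DEAD with
dead_after set (the oldest_xmin field and its plumbing into the prune state are
assumptions here, not existing code):

    if (TransactionIdIsValid(prstate->oldest_xmin) &&
        TransactionIdPrecedes(dead_after, prstate->oldest_xmin))
    {
        /* already removable by the cutoff lazy_scan_prune() will use */
        res = HEAPTUPLE_DEAD;
    }
    else if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
    {
        /* GlobalVisState may still allow pruning more aggressively */
        res = HEAPTUPLE_DEAD;
    }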



> I kind of feel like the code is just going at this in a way that's
> completely backwards. It feels like what we should be doing is looking
> at the actual states of the pages post-pruning, and specifically what
> the oldest (M)XIDs that are really present there are.

Isn't that pretty much what the code is doing these days? We do additionally
enforce a specific boundary in some cases, to be able to advance relfrozenxid
in some cases, but we advance it more aggressively if there aren't any xids
preventing that (if we scan all the necessary pages etc).


Greetings,

Andres Freund



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Noah Misch
Date:
On Mon, Apr 15, 2024 at 12:35:59PM -0400, Robert Haas wrote:
> I propose to remove this open item from
> https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items
> 
> On the original thread (BUG #17257), Alexander Lakhin says that he
> can't reproduce this after dad1539ae/18b87b201. Based on my analysis

I have observed the infinite loop in production with v15.5, so that
non-reproduce outcome is a limitation in the test procedure.  (v14.2 added
those two commits.)

> of the code, I suspect that there is a residual bug, or at least that
> there was one prior to 6f47f6883151366c031cd6fd4011e66d2c702a90.

Can you say more about how 6f47f6883151366c031cd6fd4011e66d2c702a90 mitigated
the regression that 1ccc1e05ae introduced?  Thanks for discovering that.



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Andres Freund
Date:
Hi,

On 2024-04-15 13:52:04 -0700, Noah Misch wrote:
> On Mon, Apr 15, 2024 at 12:35:59PM -0400, Robert Haas wrote:
> > I propose to remove this open item from
> > https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items
> > 
> > On the original thread (BUG #17257), Alexander Lakhin says that he
> > can't reproduce this after dad1539ae/18b87b201. Based on my analysis
> 
> I have observed the infinite loop in production with v15.5, so that
> non-reproduce outcome is a limitation in the test procedure.  (v14.2 added
> those two commits.)

How closely have you analyzed those production occurrences? It's not too hard
to imagine some form of corruption that leads to such a loop, but which isn't
related to the horizon going backwards?  E.g. a corrupted HOT chain can lead
to heap_page_prune() not acting on a DEAD tuple, but lazy_scan_prune() would
then encounter a DEAD tuple.


> > of the code, I suspect that there is a residual bug, or at least that
> > there was one prior to 6f47f6883151366c031cd6fd4011e66d2c702a90.
> 
> Can you say more about how 6f47f6883151366c031cd6fd4011e66d2c702a90 mitigated
> the regression that 1ccc1e05ae introduced?  Thanks for discovering that.

Which regression has 1ccc1e05ae actually introduced?  As I pointed out
upthread, the proposed path to corruption doesn't seem to actually lead to
corruption, "just" an error? Which actually seems considerably better than an
endless retry loop that cannot be cancelled.

Greetings,

Andres Freund



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Michael Paquier
Date:
On Mon, Apr 15, 2024 at 12:35:59PM -0400, Robert Haas wrote:
> I propose to remove this open item from
> https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items

Rather than remove this item and take the risk of forgetting about it, I'd
recommend moving it to the section "Older bugs affecting stable branches"
under "Live issues".
--
Michael

Attachment

Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Noah Misch
Date:
On Mon, Apr 15, 2024 at 02:10:20PM -0700, Andres Freund wrote:
> On 2024-04-15 13:52:04 -0700, Noah Misch wrote:
> > On Mon, Apr 15, 2024 at 12:35:59PM -0400, Robert Haas wrote:
> > > I propose to remove this open item from
> > > https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items
> > > 
> > > On the original thread (BUG #17257), Alexander Lakhin says that he
> > > can't reproduce this after dad1539ae/18b87b201. Based on my analysis
> > 
> > I have observed the infinite loop in production with v15.5, so that
> > non-reproduce outcome is a limitation in the test procedure.  (v14.2 added
> > those two commits.)
> 
> How closely have you analyzed those production occurrences?  It's not too hard
> to imagine some form of corruption that leads to such a loop, but which isn't
> related to the horizon going backwards?  E.g. a corrupted HOT chain can lead
> to heap_page_prune() not acting on a DEAD tuple, but lazy_scan_prune() would
> then encounter a DEAD tuple.

One occurrence had these facts:

HeapTupleHeaderGetXmin             = 95271613
HeapTupleHeaderGetUpdateXid        = 95280147
vacrel->OldestXmin                 = 95317451
vacrel->vistest->definitely_needed = 95318928
vacrel->vistest->maybe_needed      = 93624425

How compatible are those with the corruption vectors you have in view?
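
For orientation, and assuming these are all normal XIDs on the same side of
wraparound, they order as:

    maybe_needed (93624425) < xmin (95271613) < update XID (95280147) < OldestXmin (95317451) < definitely_needed (95318928)

i.e. the tuple's XIDs precede OldestXmin, so HeapTupleSatisfiesVacuum() in
lazy_scan_prune() can classify the tuple HEAPTUPLE_DEAD (given the updater
committed), while they do not precede maybe_needed, so pruning left it as
RECENTLY_DEAD - exactly the combination the retry loop cannot escape.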

> > > of the code, I suspect that there is a residual bug, or at least that
> > > there was one prior to 6f47f6883151366c031cd6fd4011e66d2c702a90.
> > 
> > Can you say more about how 6f47f6883151366c031cd6fd4011e66d2c702a90 mitigated
> > the regression that 1ccc1e05ae introduced?  Thanks for discovering that.
> 
> Which regression has 1ccc1e05ae actually introduced?  As I pointed out
> upthread, the proposed path to corruption doesn't seem to actually lead to
> corruption, "just" an error? Which actually seems considerably better than an
> endless retry loop that cannot be cancelled.

A transient, spurious error is far better than an uninterruptible infinite
loop with a buffer content lock held.  If a transient error is the consistent
outcome, I would agree 1ccc1e05ae improved the situation and didn't regress
it.  That would close the open item.  I tried briefly to understand
https://postgr.es/m/flat/20240415173913.4zyyrwaftujxthf2@awork3.anarazel.de
but I felt verifying its argument was going to be a big job for me.  Would
those errors happen transiently, like the infinite loop, or would they persist
until something resets the tuple fields (e.g. ATRewriteTables())?

Thanks,
nm



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Andres Freund
Date:
Hi,

On 2024-04-15 20:58:25 -0700, Noah Misch wrote:
> On Mon, Apr 15, 2024 at 02:10:20PM -0700, Andres Freund wrote:
> > On 2024-04-15 13:52:04 -0700, Noah Misch wrote:
> > > On Mon, Apr 15, 2024 at 12:35:59PM -0400, Robert Haas wrote:
> > > > I propose to remove this open item from
> > > > https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items
> > > >
> > > > On the original thread (BUG #17257), Alexander Lakhin says that he
> > > > can't reproduce this after dad1539ae/18b87b201. Based on my analysis
> > >
> > > I have observed the infinite loop in production with v15.5, so that
> > > non-reproduce outcome is a limitation in the test procedure.  (v14.2 added
> > > those two commits.)
> >
> > How closely have you analyzed those production occurrences?  It's not too hard
> > to imagine some form of corruption that leads to such a loop, but which isn't
> > related to the horizon going backwards?  E.g. a corrupted HOT chain can lead
> > to heap_page_prune() not acting on a DEAD tuple, but lazy_scan_prune() would
> > then encounter a DEAD tuple.
>
> One occurrence had these facts:
>
> HeapTupleHeaderGetXmin             = 95271613
> HeapTupleHeaderGetUpdateXid        = 95280147
> vacrel->OldestXmin                 = 95317451
> vacrel->vistest->definitely_needed = 95318928
> vacrel->vistest->maybe_needed      = 93624425
>
> How compatible are those with the corruption vectors you have in view?

Do you have more information about the page this was on? E.g. pageinspect
output? Or at least the infomasks of that tuple?  I assume this was a normal
data table (i.e. not a [shared|user] catalog table or temp table)?

Do you know what ComputeXidHorizonsResultLastXmin, RecentXmin were set to?


> > > > of the code, I suspect that there is a residual bug, or at least that
> > > > there was one prior to 6f47f6883151366c031cd6fd4011e66d2c702a90.
> > >
> > > Can you say more about how 6f47f6883151366c031cd6fd4011e66d2c702a90 mitigated
> > > the regression that 1ccc1e05ae introduced?  Thanks for discovering that.
> >
> > Which regression has 1ccc1e05ae actually introduced?  As I pointed out
> > upthread, the proposed path to corruption doesn't seem to actually lead to
> > corruption, "just" an error? Which actually seems considerably better than an
> > endless retry loop that cannot be cancelled.
>
> A transient, spurious error is far better than an uninterruptible infinite
> loop with a buffer content lock held.  If a transient error is the consistent
> outcome, I would agree 1ccc1e05ae improved the situation and didn't regress
> it.

I don't think an error is a guaranteed outcome - if there's no conflict
between the tuple's xids and the visibility cutoff, we'd just continue without
a problem. But that'd not be a correctness issue; we'd just not be able to
advance relfrozenxid as far as we'd like. But in cases where the xids are
before the cutoffs, yes, an error would occur.


> I tried briefly to understand
> https://postgr.es/m/flat/20240415173913.4zyyrwaftujxthf2@awork3.anarazel.de
> but I felt verifying its argument was going to be a big job for me.  Would
> those errors happen transiently, like the infinite loop, or would they
> persist until something resets the tuple fields (e.g. ATRewriteTables())?

I think they'd be transient, because the visibility information during the
next vacuum would presumably not be "skewed" anymore? Of course it's possible
you'd re-encounter the problem, if you constantly have horizons going back and
forth. But I'd still classify that as transient.

Greetings,

Andres Freund



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Noah Misch
Date:
On Tue, Apr 16, 2024 at 11:01:08AM -0700, Andres Freund wrote:
> On 2024-04-15 20:58:25 -0700, Noah Misch wrote:
> > On Mon, Apr 15, 2024 at 02:10:20PM -0700, Andres Freund wrote:
> > > On 2024-04-15 13:52:04 -0700, Noah Misch wrote:
> > > > I have observed the infinite loop in production with v15.5, so that
> > > > non-reproduce outcome is a limitation in the test procedure.  (v14.2 added
> > > > those two commits.)
> > >
> > > How closely have you analyzed those production occurrences?  It's not too hard
> > > to imagine some form of corruption that leads to such a loop, but which isn't
> > > related to the horizon going backwards?  E.g. a corrupted HOT chain can lead
> > > to heap_page_prune() not acting on a DEAD tuple, but lazy_scan_prune() would
> > > then encounter a DEAD tuple.

I've not seen this recur for any one table, so I think we can rule out
corruption modes that would reach the loop every time.  (If a hypothesized
loop explanation calls for both corruption and horizon movement, that could
still apply.)

> > One occurrence had these facts:
> >
> > HeapTupleHeaderGetXmin             = 95271613
> > HeapTupleHeaderGetUpdateXid        = 95280147
> > vacrel->OldestXmin                 = 95317451
> > vacrel->vistest->definitely_needed = 95318928
> > vacrel->vistest->maybe_needed      = 93624425
> >
> > How compatible are those with the corruption vectors you have in view?
> 
> Do you have more information about the page this was on? E.g. pageinspect
> output? Or at least the infomasks of that tuple?

No, unfortunately.

> I assume this was a normal
> data table (i.e. not a [shared|user] catalog table or temp table)?

Normal data table

> Do you know what ComputeXidHorizonsResultLastXmin, RecentXmin were set to?

No.

> > I tried briefly to understand
> > https://postgr.es/m/flat/20240415173913.4zyyrwaftujxthf2@awork3.anarazel.de
> > but I felt verifying its argument was going to be a big job for me.  Would
> > those errors happen transiently, like the infinite loop, or would they
> > persist until something resets the tuple fields (e.g. ATRewriteTables())?
> 
> I think they'd be transient, because the visibility information during the
> next vacuum would presumably not be "skewed" anymore?

That is good.

> Of course it's possible
> you'd re-encounter the problem, if you constantly have horizons going back and
> forth. But I'd still classify that as transient.

Certainly.



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Melanie Plageman
Date:
On Mon, Apr 15, 2024 at 2:52 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2024-03-29 12:05:31 -0400, Robert Haas wrote:
> > Perhaps, for some reason I'm not grokking at the moment, preventing
> > maybe_needed from retreating would be good enough to prevent trouble
> > in practice, but it doesn't appear to me to be the most principled
> > solution as of this writing.
>
> If we prevent maybe_needed from retreating below the OldestXmin value used for
> cutoff, then pruning should never be less aggressive than what's needed by
> lazy_scan_prune(). So lazy_scan_prune() shouldn't be able to encounter tuples
> that weren't considered removable in heap_page_prune(), in < 17, nor would
> heap_page_prune_and_freeze() have that issue.
>
> I think one fairly easy fix for this would be to pass OldestXmin to
> heap_page_prune[_and_freeze](), and have heap_prune_satisfies_vacuum() first
> check OldestXmin and only if that considers the tuple still running, use
> GlobalVisTestIsRemovableXid(). That continues to allow us to prune more
> aggressively, without any potential risk of IsRemoval being too conservative.

It seems to me that in all stable versions containing the retry loop
(from 8523492d4e34), the following can theoretically happen, causing
an infinite loop in lazy_scan_prune():

1) tuple's xmin and xmax are older than VacuumCutoffs->OldestXmin
2) heap_page_prune()-> heap_prune_satisfies_vacuum()->
HeapTupleSatisfiesVacuumHorizon() returns HEAPTUPLE_RECENTLY_DEAD
3) in GlobalVisTestIsRemovableXid(), dead_after is between
GlobalVisState->maybe_needed and definitely_needed, causing
GlobalVisState to be recomputed
4) GlobalVisState->maybe_needed moves backwards
5) tuple is not removable because dead_after is now newer than maybe_needed
6) in lazy_scan_prune(), HeapTupleSatisfiesVacuum() returns
HEAPTUPLE_DEAD because dead_after is older than OldestXmin
7) infinite loop
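
For reference, the control flow behind steps 2)-7) is roughly the following
(heavily abridged pseudocode of the 14-16 shape of lazy_scan_prune(); real
arguments and page-level bookkeeping omitted):

    retry:
        /* prune using the GlobalVisState-based horizon, which can move backwards */
        tuples_deleted = heap_page_prune(rel, buf, vistest, ...);

        for (/* each remaining item on the page */)
        {
            /* re-check visibility against the fixed OldestXmin cutoff */
            res = HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf);

            /*
             * DEAD means "pruning should have removed this".  If pruning used
             * a horizon older than OldestXmin, it did not, and the next pass
             * hits the same tuple again: step 7 above.
             */
            if (unlikely(res == HEAPTUPLE_DEAD))
                goto retry;
            ...
        }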

One way to fix this (as mentioned earlier in this thread) is to
backport 1ccc1e05ae because it eliminates the retry loop. In our above
example, lazy_scan_prune() reuses the tuple's HEAPTUPLE_RECENTLY_DEAD
HTSV_Result instead of recomputing it. We'd have to rule out any of
the data corruption claims about that commit to justify this, but I am
not yet convinced that 1ccc1e05ae could cause any problems with
relfrozenxid advancement.

Andres' idea of passing VacuumCutoffs->OldestXmin to heap_page_prune()
makes sense. We could add a member to PruneState, oldest_xmin, and
initialize it to InvalidTransactionId for on-access pruning and to
VacuumCutoffs->OldestXmin for vacuum. Then, in
heap_prune_satisfies_vacuum(), we could add this if branch:

if (TransactionIdPrecedes(dead_after, prstate->oldest_xmin))

to here:

-   if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
+
+   if (TransactionIdPrecedes(dead_after, prstate->oldest_xmin))
+       res = HEAPTUPLE_DEAD;
+   else if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
        res = HEAPTUPLE_DEAD;
    else if (OldSnapshotThresholdActive())

This seems like a more targeted fix than backpatching 1ccc1e05ae.

I plan to attempt to write a test case that repros this scenario
(infinite loop on stable branch) next week.

- Melanie



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Melanie Plageman
Date:
On Mon, Apr 15, 2024 at 1:39 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> I've tried a couple times to catch up with this thread. But always kinda felt
> I must be missing something. It might be that this is one part of the
> confusion:
>
> On 2024-01-06 12:24:13 -0800, Noah Misch wrote:
> > Fair enough.  While I agree there's a decent chance back-patching would be
> > okay, I think there's also a decent chance that 1ccc1e05ae creates the problem
> > Matthias theorized.  Something like: we update relfrozenxid based on
> > OldestXmin, even though GlobalVisState caused us to retain a tuple older than
> > OldestXmin.  Then relfrozenxid disagrees with table contents.
>
> Looking at the state as of 1ccc1e05ae, I don't see how - in lazy_scan_prune(),
> if heap_page_prune() spuriously didn't prune a tuple, because the horizon went
> backwards, we'd encounter the tuple in the loop below and call
> heap_prepare_freeze_tuple(), which would error out with one of
>
>     /*
>      * Process xmin, while keeping track of whether it's already frozen, or
>      * will become frozen iff our freeze plan is executed by caller (could be
>      * neither).
>      */
>     xid = HeapTupleHeaderGetXmin(tuple);
>     if (!TransactionIdIsNormal(xid))
>         xmin_already_frozen = true;
>     else
>     {
>         if (TransactionIdPrecedes(xid, cutoffs->relfrozenxid))
>             ereport(ERROR,
>                     (errcode(ERRCODE_DATA_CORRUPTED),
>                      errmsg_internal("found xmin %u from before relfrozenxid %u",
>                                      xid, cutoffs->relfrozenxid)));
>
> or
>                 if (TransactionIdPrecedes(update_xact, cutoffs->relfrozenxid))
>                         ereport(ERROR,
>                                         (errcode(ERRCODE_DATA_CORRUPTED),
>                                          errmsg_internal("multixact %u contains update XID %u from before relfrozenxid %u",
>                                                                          multi, update_xact,
>                                                                          cutoffs->relfrozenxid)));
> or
>                 /* Raw xmax is normal XID */
>                 if (TransactionIdPrecedes(xid, cutoffs->relfrozenxid))
>                         ereport(ERROR,
>                                         (errcode(ERRCODE_DATA_CORRUPTED),
>                                          errmsg_internal("found xmax %u from before relfrozenxid %u",
>                                                                          xid, cutoffs->relfrozenxid)));
>
>
> I'm not saying that spuriously erroring out would be ok. But I guess I just
> don't understand the data corruption theory in this subthread, because we'd
> error out if we encountered a tuple that should have been frozen but wasn't?

I have a more basic question. How could GlobalVisState->maybe_needed
going backwards cause a problem with relfrozenxid? Yes, if
maybe_needed goes backwards, we may not remove a tuple whose xmin/xmax
are older than VacuumCutoffs->OldestXmin. But, if that tuple's
xmin/xmax are older than OldestXmin, then wouldn't we freeze it? If we
freeze it, there isn't an issue. And if the tuple's xids are not newer
than OldestXmin, then how could we end up advancing relfrozenxid to a
value greater than the tuple's xids?

- Melanie



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Peter Geoghegan
Date:
On Fri, Apr 26, 2024 at 4:46 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I have a more basic question. How could GlobalVisState->maybe_needed
> going backwards cause a problem with relfrozenxid? Yes, if
> maybe_needed goes backwards, we may not remove a tuple whose xmin/xmax
> are older than VacuumCutoffs->OldestXmin. But, if that tuple's
> xmin/xmax are older than OldestXmin, then wouldn't we freeze it?

You can't freeze every XID older than OldestXmin.
heap_prepare_freeze_tuple() isn't prepared for HEAPTUPLE_DEAD tuples,
and expects that those will be taken care of by the time it is called.

--
Peter Geoghegan



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Melanie Plageman
Date:
On Fri, Apr 26, 2024 at 5:28 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Apr 26, 2024 at 4:46 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > I have a more basic question. How could GlobalVisState->maybe_needed
> > going backwards cause a problem with relfrozenxid? Yes, if
> > maybe_needed goes backwards, we may not remove a tuple whose xmin/xmax
> > are older than VacuumCutoffs->OldestXmin. But, if that tuple's
> > xmin/xmax are older than OldestXmin, then wouldn't we freeze it?
>
> You can't freeze every XID older than OldestXmin.
> heap_prepare_freeze_tuple() isn't prepared for HEAPTUPLE_DEAD tuples,
> and expects that those will be taken care of by the time it is called.

But, the tuple isn't HEAPTUPLE_DEAD -- it's HEAPTUPLE_RECENTLY_DEAD.
It will always be HEAPTUPLE_RECENTLY_DEAD in 17 and in <= 16, if
HeapTupleSatisfiesVacuum() returns HEAPTUPLE_DEAD, we wouldn't call
heap_prepare_freeze_tuple() because of the retry loop.

- Melanie



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Peter Geoghegan
Date:
On Fri, Apr 26, 2024 at 5:56 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Fri, Apr 26, 2024 at 5:28 PM Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Fri, Apr 26, 2024 at 4:46 PM Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
> > > I have a more basic question. How could GlobalVisState->maybe_needed
> > > going backwards cause a problem with relfrozenxid? Yes, if
> > > maybe_needed goes backwards, we may not remove a tuple whose xmin/xmax
> > > are older than VacuumCutoffs->OldestXmin. But, if that tuple's
> > > xmin/xmax are older than OldestXmin, then wouldn't we freeze it?
> >
> > You can't freeze every XID older than OldestXmin.
> > heap_prepare_freeze_tuple() isn't prepared for HEAPTUPLE_DEAD tuples,
> > and expects that those will be taken care of by the time it is called.
>
> But, the tuple isn't HEAPTUPLE_DEAD -- it's HEAPTUPLE_RECENTLY_DEAD.

Why? What tuple is this?

> It will always be HEAPTUPLE_RECENTLY_DEAD in 17 and in <= 16, if
> HeapTupleSatisfiesVacuum() returns HEAPTUPLE_DEAD, we wouldn't call
> heap_prepare_freeze_tuple() because of the retry loop.

The retry loop exists precisely because heap_prepare_freeze_tuple()
isn't prepared to deal with HEAPTUPLE_DEAD tuples. So I agree that
that won't be allowed to happen on versions that have the retry loop
(14 - 16).

As Andres pointed out, even if we were to call
heap_prepare_freeze_tuple() with a HEAPTUPLE_DEAD tuple, we'd get a
"can't happen" error (though it's hard to see this because it doesn't
actually rely on the hint bits set in the tuple).

--
Peter Geoghegan



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Melanie Plageman
Date:
On Fri, Apr 26, 2024 at 6:21 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Apr 26, 2024 at 5:56 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > On Fri, Apr 26, 2024 at 5:28 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > >
> > > On Fri, Apr 26, 2024 at 4:46 PM Melanie Plageman
> > > <melanieplageman@gmail.com> wrote:
> > > > I have a more basic question. How could GlobalVisState->maybe_needed
> > > > going backwards cause a problem with relfrozenxid? Yes, if
> > > > maybe_needed goes backwards, we may not remove a tuple whose xmin/xmax
> > > > are older than VacuumCutoffs->OldestXmin. But, if that tuple's
> > > > xmin/xmax are older than OldestXmin, then wouldn't we freeze it?
> > >
> > > You can't freeze every XID older than OldestXmin.
> > > heap_prepare_freeze_tuple() isn't prepared for HEAPTUPLE_DEAD tuples,
> > > and expects that those will be taken care of by the time it is called.
> >
> > But, the tuple isn't HEAPTUPLE_DEAD -- it's HEAPTUPLE_RECENTLY_DEAD.
>
> Why? What tuple is this?

In 17, we don't ever get a new HTSV_Result, so if the tuple is not
removed, it would be because HeapTupleSatisfiesVacuumHorizon()
returned HEAPTUPLE_RECENTLY_DEAD and, if GlobalVisTestIsRemovableXid()
was called, dead_after did not precede GlobalVisState->maybe_needed.
This tuple, during this vacuum of the relation, would never be
determined to be HEAPTUPLE_DEAD or it would have been removed.

> > It will always be HEAPTUPLE_RECENTLY_DEAD in 17 and in <= 16, if
> > HeapTupleSatisfiesVacuum() returns HEAPTUPLE_DEAD, we wouldn't call
> > heap_prepare_freeze_tuple() because of the retry loop.
>
> The retry loop exists precisely because heap_prepare_freeze_tuple()
> isn't prepared to deal with HEAPTUPLE_DEAD tuples. So I agree that
> that won't be allowed to happen on versions that have the retry loop
> (14 - 16).

So, it can't happen in back branches. Let's just address 17. Help me
understand how this can happen in 17.

We are talking about a tuple with xmin/xmax older than
VacuumCutoffs->OldestXmin in heap_prepare_freeze_tuple(). So,
freeze_xmin is true:
        freeze_xmin = TransactionIdPrecedes(xid, cutoffs->OldestXmin);

Then, in heap_tuple_should_freeze(), we may ratchet back
NoFreezePageRelfrozenXid:
        if (TransactionIdPrecedes(xid, *NoFreezePageRelfrozenXid))
            *NoFreezePageRelfrozenXid = xid;
NoFreezePageRelfrozenXid was initialized with
VacuumCutoffs->OldestXmin and our tuple xmin/xmax is older than
VacuumCutoffs->OldestXmin.

We make a freeze plan for the tuple and move on. Assuming
HeapPageFreeze->freeze_required is never set to true and we don't end
up meeting the other criteria for opportunistic freezing, we may
decide not to freeze any tuples on the page (including this tuple). In
this case, we set relfrozenxid to NoFreezePageRelfrozenXid. This
should not be a value newer than our tuple's xids.

> As Andres pointed out, even if we were to call
> heap_prepare_freeze_tuple() with a HEAPTUPLE_DEAD tuple, we'd get a
> "can't happen" error (though it's hard to see this because it doesn't
> actually rely on the hint bits set in the tuple).

I just don't see how in 17 we could end up calling
heap_prepare_freeze_tuple() on a HEAPTUPLE_DEAD tuple. If
heap_prune_satisfies_vacuum() returned HEAPTUPLE_DEAD, we wouldn't
heap_prepare_freeze_tuple() the tuple.

I assume you are talking about a HEAPTUPLE_RECENTLY_DEAD tuple that
*should* be HEAPTUPLE_DEAD. But it doesn't really matter if it should
be HEAPTUPLE_DEAD and we fail to remove it, as long as we don't leave
it unfrozen and advance relfrozenxid past its xids. And I don't see
how we would do that.

- Melanie




Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Peter Geoghegan
Date:
On Sat, Apr 27, 2024 at 10:38 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> In 17, we don't ever get a new HTSV_Result, so if the tuple is not
> removed, it would be because HeapTupleSatisfiesVacuumHorizon()
> returned HEAPTUPLE_RECENTLY_DEAD and, if GlobalVisTestIsRemovableXid()
> was called, dead_after did not precede GlobalVisState->maybe_needed.
> This tuple, during this vacuum of the relation, would never be
> determined to be HEAPTUPLE_DEAD or it would have been removed.

That makes sense.

> > > It will always be HEAPTUPLE_RECENTLY_DEAD in 17 and in <= 16, if
> > > HeapTupleSatisfiesVacuum() returns HEAPTUPLE_DEAD, we wouldn't call
> > > heap_prepare_freeze_tuple() because of the retry loop.
> >
> > The retry loop exists precisely because heap_prepare_freeze_tuple()
> > isn't prepared to deal with HEAPTUPLE_DEAD tuples. So I agree that
> > that won't be allowed to happen on versions that have the retry loop
> > (14 - 16).
>
> So, it can't happen in back branches. Let's just address 17. Help me
> understand how this can happen in 17.

Just to be clear, I never said that it was possible in 17. If I
somehow implied it, then I didn't mean to.

> > As Andres pointed out, even if we were to call
> > heap_prepare_freeze_tuple() with a HEAPTUPLE_DEAD tuple, we'd get a
> > "can't happen" error (though it's hard to see this because it doesn't
> > actually rely on the hint bits set in the tuple).
>
> I just don't see how in 17 we could end up calling
> heap_prepare_freeze_tuple() on a HEAPTUPLE_DEAD tuple. If
> heap_prune_satisfies_vacuum() returned HEAPTUPLE_DEAD, we wouldn't
> heap_prepare_freeze_tuple() the tuple.

That part seems relatively straightforward.

> I assume you are talking about a HEAPTUPLE_RECENTLY_DEAD tuple that
> *should* be HEAPTUPLE_DEAD. But it doesn't really matter if it should
> be HEAPTUPLE_DEAD and we fail to remove it, as long as we don't leave
> it unfrozen and advance relfrozenxid past its xids.

I'm not sure that I agree with this part. You're describing a scheme
under which aggressive VACUUM isn't strictly guaranteed to respect
FreezeLimit. And so (for example) a VACUUM FREEZE wouldn't advance
relfrozenxid right up to OldestXmin/FreezeLimit. Right?

It certainly seems possible that, all things considered, you're right
-- perhaps it literally doesn't matter at all on 17, at least in
practice. I cannot think of any problem that's worse than an assertion
failure within heap_vacuum_rel, just before we go to advance the rel's
pg_class.relfrozenxid (and even that seems extremely unlikely). Nobody
particularly intended for things to work this way (the FreezeLimit
invariant isn't supposed to be violated), but it'd hardly be the first
time that something like this happened. Far from it.

--
Peter Geoghegan



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Alena Rybakina
Date:
On 02.05.2024 19:52, Peter Geoghegan wrote:
On Sat, Apr 27, 2024 at 10:38 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
In 17, we don't ever get a new HTSV_Result, so if the tuple is not
removed, it would be because HeapTupleSatisfiesVacuumHorizon()
returned HEAPTUPLE_RECENTLY_DEAD and, if GlobalVisTestIsRemovableXid()
was called, dead_after did not precede GlobalVisState->maybe_needed.
This tuple, during this vacuum of the relation, would never be
determined to be HEAPTUPLE_DEAD or it would have been removed.
That makes sense.

It will always be HEAPTUPLE_RECENTLY_DEAD in 17 and in <= 16, if
HeapTupleSatisfiesVacuum() returns HEAPTUPLE_DEAD, we wouldn't call
heap_prepare_freeze_tuple() because of the retry loop.
The retry loop exists precisely because heap_prepare_freeze_tuple()
isn't prepared to deal with HEAPTUPLE_DEAD tuples. So I agree that
that won't be allowed to happen on versions that have the retry loop
(14 - 16).
So, it can't happen in back branches. Let's just address 17. Help me
understand how this can happen in 17.
Just to be clear, I never said that it was possible in 17. If I
somehow implied it, then I didn't mean to.

Hi! I also investigated this issue and reproduced it using a test added to the isolation tests, where I add 2 tuples, delete them, run vacuum, and print the tuple_deleted and dead_tuples statistics (I attached the test case to this email as a patch). For the first 400 or so iterations, I got these results:

 n_dead_tup|n_live_tup|n_tup_del
 ----------+----------+---------
          0|         0|        0
 (1 row)

After 400 or more runs, I saw a difference from the earlier output:

 n_dead_tup|n_live_tup|n_tup_del
 ----------+----------+---------
-         0|         0|        0
+         2|         0|        0
 (1 row)


I debugged and found that the test produces results with 0 dead tuples if GlobalVisTempRels.maybe_needed is less than the x_max of the tuple. In the code, this condition works in heap_prune_satisfies_vacuum:

else if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
{
     res = HEAPTUPLE_DEAD;
}

But when GlobalVisTempRels.maybe_needed is equal to the x_max xid of the tuple, vacuum does not touch this tuple, because heap_prune_satisfies_vacuum() returns HEAPTUPLE_RECENTLY_DEAD for it.

Unfortunately, I have not found any explanation why GlobalVisTempRels.maybe_needed does not change after 400 iterations or more. I'm still studying it. Perhaps this information will help you.

I reproduced the problem on REL_16_STABLE.

-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Alena Rybakina
Date:
On 02.05.2024 21:01, Alena Rybakina wrote:
On 02.05.2024 19:52, Peter Geoghegan wrote:
On Sat, Apr 27, 2024 at 10:38 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
In 17, we don't ever get a new HTSV_Result, so if the tuple is not
removed, it would be because HeapTupleSatisfiesVacuumHorizon()
returned HEAPTUPLE_RECENTLY_DEAD and, if GlobalVisTestIsRemovableXid()
was called, dead_after did not precede GlobalVisState->maybe_needed.
This tuple, during this vacuum of the relation, would never be
determined to be HEAPTUPLE_DEAD or it would have been removed.
That makes sense.

It will always be HEAPTUPLE_RECENTLY_DEAD in 17 and in <= 16, if
HeapTupleSatisfiesVacuum() returns HEAPTUPLE_DEAD, we wouldn't call
heap_prepare_freeze_tuple() because of the retry loop.
The retry loop exists precisely because heap_prepare_freeze_tuple()
isn't prepared to deal with HEAPTUPLE_DEAD tuples. So I agree that
that won't be allowed to happen on versions that have the retry loop
(14 - 16).
So, it can't happen in back branches. Let's just address 17. Help me
understand how this can happen in 17.
Just to be clear, I never said that it was possible in 17. If I
somehow implied it, then I didn't mean to.

Hi! I also investigated this issue and reproduced it using this test added to the isolated tests, where I added 2 tuples, deleted them and ran vacuum and printed the tuple_deleted and dead_tuples statistics (I attached test c to this email as a patch). Within 400 iterations or more, I got the results:

n_dead_tup|n_live_tup|n_tup_del ----------------+------------+------------- 0| 0| 0 (1 row)

After 400 or more running cycles, I felt the differences, as shown earlier:

 n_dead_tup|n_live_tup|n_tup_del
 ----------+----------+---------
-         0|         0|        0
+         2|         0|        0
 (1 row)


I debugged and found that the test produces results with 0 dead tuples if GlobalVisTempRels.maybe_needed is less than the x_max of the tuple. In the code, this condition works in heap_prune_satisfies_vacuum:

else if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
{
     res = HEAPTUPLE_DEAD;
} But when GlobalVisTempRels.maybe_needed is equal to the x_max xid of the tuple, vacuum does not touch this tuple, because the heap_prune_satisfies_vacuum function returns the status of the RECENTLY_DEAD tuple.

Unfortunately, I have not found any explanation why GlobalVisTempRels.maybe_needed does not change after 400 iterations or more. I'm still studying it. Perhaps this information will help you.

I reproduced the problem on REL_16_STABLE.

I reproduced this test in the master branch as well, but used a more complex test for it: I added 700 tuples to the table, deleted half of the table, and then started vacuum. I expected to get only 350 live tuples and 0 dead and deleted tuples, but after 800 iterations I got 350 dead tuples and 350 live tuples:

 n_dead_tup|n_live_tup|n_tup_del
 ----------+----------+---------
-         0|       350|        0
+       350|       350|        0
 (1 row)

I have added other steps to the test, but so far I have not seen any failures there, or have not reached them.


Just in case, I ran the test with this bash command:

for i in `seq 2000`;do echo "ITER $i"; make -s installcheck -C src/test/isolation/ || break;done

-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Matthias van de Meent
Date:
On Thu, 2 May 2024 at 20:36, Alena Rybakina <lena.ribackina@yandex.ru> wrote:
>
> Hi! I also investigated this issue and reproduced it using this test added to the isolated tests, where I added 2 tuples, deleted them and ran vacuum and printed the tuple_deleted and dead_tuples statistics (I attached test c to this email as a patch). Within 400 iterations or more, I got the results:
>
>  n_dead_tup|n_live_tup|n_tup_del
>  ----------+----------+---------
>           0|         0|        0
>  (1 row)
>
> After 400 or more running cycles, I felt the differences, as shown earlier:
>
>  n_dead_tup|n_live_tup|n_tup_del
>  ----------+----------+---------
> -         0|         0|        0
> +         2|         0|        0
>  (1 row)
>
>
> I debugged and found that the test produces results with 0 dead tuples if GlobalVisTempRels.maybe_needed is less than the x_max of the tuple. In the code, this condition works in heap_prune_satisfies_vacuum:
>
> else if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
> {
>      res = HEAPTUPLE_DEAD;
> }
>
> But when GlobalVisTempRels.maybe_needed is equal to the x_max xid of the tuple, vacuum does not touch this tuple, because the heap_prune_satisfies_vacuum function returns the status of the RECENTLY_DEAD tuple.
>
> Unfortunately, I have not found any explanation why GlobalVisTempRels.maybe_needed does not change after 400 iterations or more. I'm still studying it. Perhaps this information will help you.

You should probably check GlobalVisTestShouldUpdate to see why it
wouldn't get updated. IIRC, GlobalVis* are only updated every so often
based on a local copy of a shared variable, to reduce contention on
the shared state in the proc arrays. It is quite likely that your
backend was waiting for new other changes, when the current
GlobalVisTempRels state was already "good enough" for the current
visibility snapshot.
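
The check it performs is roughly (abridged and paraphrased, not the literal
source; other short-circuit conditions elided):

    static bool
    GlobalVisTestShouldUpdate(GlobalVisState *state)
    {
        /* never computed horizons in this backend yet */
        if (!TransactionIdIsValid(ComputeXidHorizonsResultLastXmin))
            return true;

        /* ... other cheap short-circuit checks elided ... */

        /* only recompute if a newer snapshot xmin has been seen since */
        return RecentXmin != ComputeXidHorizonsResultLastXmin;
    }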

> I reproduced the problem on REL_16_STABLE.
>
> I reproduced this test in the master branch as well, but used a more complex test for it: I added 700 tuples to the table, deleted half of the table, and then started vacuum. I expected to get only 350 live tuples and 0 dead and deleted tuples, but after 800 iterations I got 350 dead tuples and 350 live tuples:
>
>  n_dead_tup|n_live_tup|n_tup_del
>  ----------+----------+---------
> -         0|       350|        0
> +       350|       350|        0
>  (1 row)

I think this output is expected. Concurrent tasks (such as autovacuum
and autoanalyze) can have their own visibility snapshots and
transactions, which will block VACUUM from removing rows that may be
visible to those snapshots. Did you verify there was no other
(background) process active in the PostgreSQL test node during your
tests?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From
Alena Rybakina
Date:
Hi!
On 03.05.2024 12:35, Matthias van de Meent wrote:
On Thu, 2 May 2024 at 20:36, Alena Rybakina <lena.ribackina@yandex.ru> wrote:
Hi! I also investigated this issue and reproduced it using this test added to the isolated tests, where I added 2 tuples, deleted them and ran vacuum and printed the tuple_deleted and dead_tuples statistics (I attached test c to this email as a patch). Within 400 iterations or more, I got the results:

n_dead_tup|n_live_tup|n_tup_del ----------------+------------+------------- 0| 0| 0 (1 row)

After 400 or more running cycles, I felt the differences, as shown earlier:
 n_dead_tup|n_live_tup|n_tup_del ----------+----------+---------
-         0|         0|        0
+         2|         0|        0 (1 row)


I debugged and found that the test produces results with 0 dead tuples if GlobalVisTempRels.maybe_needed is less than the x_max of the tuple. In the code, this condition works in heap_prune_satisfies_vacuum:

else if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
{     res = HEAPTUPLE_DEAD;
}

But when GlobalVisTempRels.maybe_needed is equal to the x_max xid of the tuple, vacuum does not touch this tuple, because the heap_prune_satisfies_vacuum function returns the status of the RECENTLY_DEAD tuple.

Unfortunately, I have not found any explanation why GlobalVisTempRels.maybe_needed does not change after 400 iterations or more. I'm still studying it. Perhaps this information will help you.
You should probably check GlobalVisTestShouldUpdate to see why it
wouldn't get updated. IIRC, GobalVis* are only updated every so often
based on a local copy of a shared variable, to reduce contention on
the shared state in the proc arrays. It is quite likely that your
backend was waiting for new other changes, when the current
GlobalVisTempRels state was already "good enough" for the current
visibility snapshot.

Thank you for your advice, I am checking this now. It is a new area for me and it needs some more investigation on my part.

I reproduced the problem on REL_16_STABLE.

I reproduced this test in the master branch as well, but used a more complex test for it: I added 700 tuples to the table, deleted half of the table, and then started vacuum. I expected to get only 350 live tuples and 0 dead and deleted tuples, but after 800 iterations I got 350 dead tuples and 350 live tuples: n_dead_tup|n_live_tup|n_tup_del ---------------+-------------

+-------------
-                0|          350|             0
+          350|          350|             0
(1 row)
I think this output is expected. Concurrent tasks (such as autovacuum
and autoanalyze) can have their own visibility snapshots and
transactions, which will block VACUUM from removing rows that may be
visible to those snapshots. Did you verify there was no other
(background) process active in the PostgreSQL test node during your
tests?

I see background processes like autovacuum launcher, client backend, logical replication launcher, checkpointer, walwriter, and background writer. I checked the backends through the pg_stat_activity view and I didn't find a special backend that impacted the difference in the test. Maybe I missed something?

Besides, I had a hypothesis that there might be too many backends, or some special backend, interfering with updating the snapshot, so that some transaction couldn't finish in time and we finally got a RECENTLY_DEAD tuple, but I didn't find any evidence. To check it I added a query that outputs all backends before vacuum and collected its results; when the test failed I diffed the last output against the others to find something new, but it looked usual. Even when the test showed the expected results (0 dead_tuples and 0 deleted tuples) I saw the same backends - maybe some of them appeared more often, but I didn't find anything unusual.

Maybe I should pay attention to something special?

-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Melanie Plageman
Date:
On Fri, Apr 26, 2024 at 4:46 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Mon, Apr 15, 2024 at 2:52 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2024-03-29 12:05:31 -0400, Robert Haas wrote:
> > > Perhaps, for some reason I'm not grokking at the moment, preventing
> > > maybe_needed from retreating would be good enough to prevent trouble
> > > in practice, but it doesn't appear to me to be the most principled
> > > solution as of this writing.
> >
> > If we prevent maybe_needed from retreating below the OldestXmin value used for
> > cutoff, then pruning should never be less aggressive than what's needed by
> > lazy_scan_prune(). So lazy_scan_prune() shouldn't be able to encounter tuples
> > that weren't considered removable in heap_page_prune(), in < 17, nor would
> > heap_page_prune_and_freeze() have that issue.
> >
> > I think one fairly easy fix for this would be to pass OldestXmin to
> > heap_page_prune[_and_freeze](), and have heap_prune_satisfies_vacuum() first
> > check OldestXmin and only if that considers the tuple still running, use
> > GlobalVisTestIsRemovableXid(). That continues to allow us to prune more
> > aggressively, without any potential risk of IsRemoval being too conservative.
>
> It seems to me that in all stable versions containing the retry loop
> (from 8523492d4e34), the following can theoretically happen, causing
> an infinite loop in lazy_scan_prune():
>
> 1) tuple's xmin and xmax are older than VacuumCutoffs->OldestXmin
> 2) heap_page_prune()-> heap_prune_satisfies_vacuum()->
> HeapTupleSatisfiesVacuumHorizon() returns HEAPTUPLE_RECENTLY_DEAD
> 3) in GlobalVisTestIsRemovableXid(), dead_after is between
> GlobalVisState->maybe_needed and definitely_needed, causing
> GlobalVisState to be recomputed
> 4) GlobalVisState->maybe_needed moves backwards
> 5) tuple is not removable because dead_after is now newer than maybe_needed
> 6) in lazy_scan_prune(), HeapTupleSatisfiesVacuum() returns
> HEAPTUPLE_DEAD because dead_after is older than OldestXmin
> 7) infinite loop
>
> One way to fix this (as mentioned earlier in this thread) is to
> backport 1ccc1e05ae because it eliminates the retry loop. In our above
> example, lazy_scan_prune() reuses the tuple's HEAPTUPLE_RECENTLY_DEAD
> HTSV_Result instead of recomputing it. We'd have to rule out any of
> the data corruption claims about that commit to justify this, but I am
> not yet convinced that 1ccc1e05ae could cause any problems with
> relfrozenxid advancement.
>
> Andres' idea of passing VacuumCutoffs->OldestXmin to heap_page_prune()
> makes sense. We could add a member to PruneState, oldest_xmin, and
> initialize it to InvalidTransactionId for on-access pruning and to
> VacuumCutoffs->OldestXmin for vacuum. Then, in
> heap_prune_satisfies_vacuum(), we could add this if branch:
>
> if (TransactionIdPrecedes(dead_after, prstate->oldest_xmin))
>
> to here:
>
> -   if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
> +
> +   if (TransactionIdPrecedes(dead_after, prstate->oldest_xmin))
> +       res = HEAPTUPLE_DEAD;
> +   else if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
>         res = HEAPTUPLE_DEAD;
>     else if (OldSnapshotThresholdActive())
>
> This seems like a more targeted fix than backpatching 1ccc1e05ae.
>
> I plan to attempt to write a test case that repros this scenario
> (infinite loop on stable branch) next week.

I can repro the hang on 14 and 15 with the following:

I start a primary with the following options:

wal_level = replica, hot_standby_feedback=on
primary_conninfo='host=localhost port=6432 dbname=postgres'

then take a basebackup and start the replica.

Then I connect with psql five times (sessions A, B, and C on the
primary and sessions A and B on the replica):

Then I do the following steps.

step 1:
  PRIMARY SESSION A:
  --
  CREATE TABLE foo(a INT) WITH (autovacuum_enabled=false);
  CREATE INDEX foo_idx ON foo(a);
  INSERT INTO foo VALUES (1);
  UPDATE foo SET a = 100 WHERE a = 1;

  DROP TABLESPACE IF EXISTS foo_tablespace;
  CREATE TABLESPACE foo_tablespace LOCATION 'somelocation';

step 2:
  REPLICA SESSION A:
  --
  ALTER SYSTEM SET primary_conninfo = 'host=localhost port=9999
dbname=postgres';
  SELECT pg_reload_conf();
  SELECT pg_terminate_backend((select pid from pg_stat_activity where
backend_type = 'walreceiver'));
  BEGIN;
    DECLARE c1 CURSOR FOR SELECT * FROM foo;
    FETCH FORWARD FROM c1;

step 3:
  PRIMARY SESSION A:
  --
  INSERT INTO foo VALUES (99);
  UPDATE foo SET a = 100 WHERE a = 99;

step 4:
  PRIMARY SESSION B:
  --
  BEGIN;
    DECLARE c1 CURSOR FOR SELECT * FROM foo;
    FETCH FORWARD FROM c1;

step 5:
  PRIMARY SESSION C:
  --
  BEGIN;
    ALTER INDEX foo_idx SET TABLESPACE foo_tablespace;

step 6:
  PRIMARY SESSION A:
  --
  VACUUM (FREEZE) foo;

step 7:
  PRIMARY SESSION B:
  --
  COMMIT;

step 8:
  REPLICA SESSION B:
  --
  ALTER SYSTEM SET primary_conninfo = 'host=localhost port=6432
dbname=postgres';
  SELECT pg_reload_conf();

step 9:
  PRIMARY SESSION C:
  --
  COMMIT;

step 10:
  REPLICA SESSION A:
  --
  COMMIT;

This infinitely loops in lazy_scan_prune() on 14 and 15. On 16 and
master, everything is normal (for this specific scenario).

The steps roughly follow what I wrote in my previous email.

The difference between 16/master and 14/15 is that in 14/15,
vacuum_set_xid_limits() is called before opening indexes. This allows
an intervening relcache invalidation caused by modifying an index on
the table being vacuumed to force us to rebuild the catalog snapshot
between vacuum_set_xid_limits() and eventually pruning the tuples.
Rebuilding the catalog snapshot will update RecentXmin to a
potentially different value than ComputeXidHorizonsResultLastXmin
(ComputeXidHorizonsResultLastXmin is set to RecentXmin during
vacuum_set_xid_limits()). This provides the opportunity for
GlobalVisTestShouldUpdate() to return true while pruning RECENTLY_DEAD
tuples.

See this line in GlobalVisTestShouldUpdate():
    return RecentXmin != ComputeXidHorizonsResultLastXmin;

On 16 and master, AFAICT, RecentXmin will always equal
ComputeXidHorizonsResultLastXmin in GlobalVisTestShouldUpdate() when
called through heap_prune_satisfies_vacuum(). So, even if a wal sender
pushes the horizon backwards on the primary, vacuum's
GlobalVisState->maybe_needed won't be pushed back to a value lower
than OldestXmin.

I haven't yet written the fix and tested it. I want to investigate a
bit more and write more detailed notes on the repro steps. I also want
to fully convince myself this isn't possible on master or 16.

- Melanie



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Bowen Shi
Date:
Hi, 
Obviously we should actually fix this on back branches, but could we
at least make the retry loop interruptible in some way so people could
use pg_cancel/terminate_backend() on a stuck autovacuum worker or
vacuum process?

If the problem happens in versions <= PG 16, we don't have a good solution (vacuum process holds the exclusive lock cause checkpoint hangs).

Maybe we can make the retry loop interruptible first. However, since we are using START_CRIT_SECTION, we cannot simply use CHECK_FOR_INTERRUPTS to handle it.

Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Melanie Plageman
Date:
On Sun, May 12, 2024 at 11:19 PM Bowen Shi <zxwsbg12138@gmail.com> wrote:
>
> Hi,
>>
>> Obviously we should actually fix this on back branches, but could we
>> at least make the retry loop interruptible in some way so people could
>> use pg_cancel/terminate_backend() on a stuck autovacuum worker or
>> vacuum process?
>
>
> If the problem happens in versions <= PG 16, we don't have a good solution (vacuum process holds the exclusive lock cause checkpoint hangs).
>
> Maybe we can make the retry loop interruptible first. However, since we are using START_CRIT_SECTION, we cannot simply use CHECK_FOR_INTERRUPTS to handle it.

As far as I can tell, in 14 and 15, the versions where the issue
reported here is present, there is not a critical section in the
section of code looped through in the retry loop in lazy_scan_prune().
We can actually fix the particular issue I reproduced with the
attached patch. However, I think it is still worth making the retry
loop interruptible in case there are other ways to end up infinitely
looping in the retry loop in lazy_scan_prune().

- Melanie

Attachment

Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Bowen Shi
Date:
On Mon, May 13, 2024 at 10:42 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
On Sun, May 12, 2024 at 11:19 PM Bowen Shi <zxwsbg12138@gmail.com> wrote:
>
> Hi,
>>
>> Obviously we should actually fix this on back branches, but could we
>> at least make the retry loop interruptible in some way so people could
>> use pg_cancel/terminate_backend() on a stuck autovacuum worker or
>> vacuum process?
>
>
> If the problem happens in versions <= PG 16, we don't have a good solution (vacuum process holds the exclusive lock cause checkpoint hangs).
>
> Maybe we can make the retry loop interruptible first. However, since we are using START_CRIT_SECTION, we cannot simply use CHECK_FOR_INTERRUPTS to handle it.

As far as I can tell, in 14 and 15, the versions where the issue
reported here is present, there is not a critical section in the
section of code looped through in the retry loop in lazy_scan_prune().

You are correct, I tried again to add CHECK_FOR_INTERRUPTS in the retry loop, and when attempting to interrupt the current loop using the pg_terminate_backend function, the value of InterruptHoldoffCount is 1, which causes the vacuum to not end.

We can actually fix the particular issue I reproduced with the
attached patch. However, I think it is still worth making the retry
loop interruptible in case there are other ways to end up infinitely
looping in the retry loop in lazy_scan_prune().

I attempted to apply the patch on the REL_15_STABLE branch, but encountered some conflicts. May I ask which branch you are using?

--
Regards
Bowen Shi

Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Melanie Plageman
Date:
On Mon, May 13, 2024 at 11:42 PM Bowen Shi <zxwsbg12138@gmail.com> wrote:
>
> On Mon, May 13, 2024 at 10:42 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
>>
>> On Sun, May 12, 2024 at 11:19 PM Bowen Shi <zxwsbg12138@gmail.com> wrote:
>> >
>> > Hi,
>> >>
>> >> Obviously we should actually fix this on back branches, but could we
>> >> at least make the retry loop interruptible in some way so people could
>> >> use pg_cancel/terminate_backend() on a stuck autovacuum worker or
>> >> vacuum process?
>> >
>> >
>> > If the problem happens in versions <= PG 16, we don't have a good solution (the vacuum process holds the exclusive lock, which causes checkpoints to hang).
>> >
>> > Maybe we can make the retry loop interruptible first. However, since we are using START_CRIT_SECTION, we cannot simply use CHECK_FOR_INTERRUPTS to handle it.
>>
>> As far as I can tell, in 14 and 15, the versions where the issue
>> reported here is present, there is not a critical section in the
>> section of code looped through in the retry loop in lazy_scan_prune().
>
>
> You are correct. I tried again to add CHECK_FOR_INTERRUPTS in the retry loop, but when attempting to interrupt the loop using pg_terminate_backend(), the value of InterruptHoldoffCount is 1, which prevents the vacuum from ending.

Yes, great point. Actually, Andres and I discussed this today
off-list, and he reminded me that because vacuum is holding a content
lock on the page here, InterruptHoldoffCount will be at least 1. We
could RESUME_INTERRUPTS() but we probably don't want to process
interrupts while holding the page lock here if we don't do it in other
places. And it's hard to say under what conditions we would want to
drop the page lock here.
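
In case it helps anyone following along, here is a tiny standalone model of
that mechanism. The names mirror miscadmin.h, but the bodies are stand-ins,
not the real PostgreSQL code; it just shows why a CHECK_FOR_INTERRUPTS()
added inside the retry loop is a no-op while the buffer content lock (an
LWLock, whose acquisition holds off interrupts) is still held:

/*
 * Simplified model of the interrupt-holdoff mechanism, for illustration
 * only. In the real code, LWLockAcquire() (taken for the buffer content
 * lock) is what bumps InterruptHoldoffCount.
 */
#include <stdbool.h>
#include <stdio.h>

static int InterruptHoldoffCount = 0;
static bool InterruptPending = true;    /* pretend a termination request arrived */

#define HOLD_INTERRUPTS()   (InterruptHoldoffCount++)
#define RESUME_INTERRUPTS() (InterruptHoldoffCount--)

static void
CHECK_FOR_INTERRUPTS(void)
{
    /* interrupts are only serviced when nothing is holding them off */
    if (InterruptPending && InterruptHoldoffCount == 0)
        printf("ProcessInterrupts(): vacuum would abort here\n");
    else
        printf("interrupt pending, but held off (count = %d)\n",
               InterruptHoldoffCount);
}

int
main(void)
{
    HOLD_INTERRUPTS();          /* analogous to taking the content lock */
    CHECK_FOR_INTERRUPTS();     /* no-op: holdoff count is 1 */
    RESUME_INTERRUPTS();        /* analogous to releasing the lock */
    CHECK_FOR_INTERRUPTS();     /* now the interrupt would be processed */
    return 0;
}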

Are you reproducing the hang locally with my repro? Or do you have
your own repro? How are you testing pg_terminate_backend() and seeing
that the InterruptHoldoffCount is 1?

>> We can actually fix the particular issue I reproduced with the
>> attached patch. However, I think it is still worth making the retry
>> loop interruptible in case there are other ways to end up infinitely
>> looping in the retry loop in lazy_scan_prune().
>
>
> I attempted to apply the patch on the REL_15_STABLE branch, but encountered some conflicts. May I ask which branch you are using?

Sorry, I should have mentioned that patch was against REL_14_STABLE.
Attached patch has the same functionality but should apply cleanly
against REL_15_STABLE.

- Melanie

Attachment

Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Bowen Shi
Date:

On Wed, May 15, 2024 at 7:33 AM Melanie Plageman <melanieplageman@gmail.com> wrote:

Are you reproducing the hang locally with my repro? Or do you have
your own repro? How are you testing pg_terminate_backend() and seeing
that the InterruptHoldoffCount is 1?

I noticed this problem occurring several times on PG16, but I didn't use your repro and haven't found a way to reproduce the problem. Instead, I simply forced the vacuum to enter the retry loop with the following change and found that InterruptHoldoffCount is 1.

/* temporary hack: with the condition commented out, every tuple takes the retry path */
// if (unlikely(res == HEAPTUPLE_DEAD))
    goto retry;

Sorry, I should have mentioned that patch was against REL_14_STABLE.
Attached patch has the same functionality but should apply cleanly
against REL_15_STABLE.
 
Thanks, I will try that.


--
Regards
Bowen Shi

Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Melanie Plageman
Date:
On Tue, May 14, 2024 at 10:43 PM Bowen Shi <zxwsbg12138@gmail.com> wrote:
>
>
>> On Wed, May 15, 2024 at 7:33 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
>>
>>
>> Are you reproducing the hang locally with my repro? Or do you have
>> your own repro? How are you testing pg_terminate_backend() and seeing
>> that the InterruptHoldoffCount is 1?
>
>
> I noticed this problem occurring several times on PG16, but I didn't use your repro and haven't found a way to reproduce the problem. Instead, I simply forced the vacuum to enter the retry loop with the following change and found that InterruptHoldoffCount is 1.

You saw this problem on PG16? Did you see vacuum hanging, or did you see
some kind of incorrect relfrozenxid advancement? My repro does not
work for PG16, and I do not see how we could face this problem on
PG16. It would be very helpful if you have any more details.

- Melanie



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Bowen Shi
Date:
On Wed, May 15, 2024 at 9:27 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
 
You saw this problem on PG16? Did you see vacuum hanging, or did you see
some kind of incorrect relfrozenxid advancement? My repro does not
work for PG16, and I do not see how we could face this problem on
PG16. It would be very helpful if you have any more details.

Yes,  I found the vacuum process hanging in a retry loop on PG16. I think it was the problem discussed in this thread. Here is some information:

vacrel: OldestXmin = 1106, relfrozenxid = 927
vacrel->vistest: maybe_needed = 1079, definitely_needed = 1106

--
Regards
Bowen Shi

Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Andres Freund
Date:
Hi,

On 2024-05-16 10:21:59 +0800, Bowen Shi wrote:
> Yes,  I found the vacuum process hanging in a retry loop on PG16. I think
> it was the problem discussed in this thread. Here is some information:

Very interesting.


Do you have the page contents that triggered the issue?


> vacrel: OldestXmin = 1106, relfrozenxid = 927
> vacrel->vistest: maybe_needed = 1079, definitely_needed = 1106

Do you have any other variables? E.g. RecentXmin,
ComputeXidHorizonsResultLastXmin?

Greetings,

Andres Freund



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Bowen Shi
Date:
Hi, 

On Thu, May 16, 2024 at 11:33 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2024-05-16 10:21:59 +0800, Bowen Shi wrote:
> Yes,  I found the vacuum process hanging in a retry loop on PG16. I think
> it was the problem discussed in this thread. Here is some information:

Very interesting.


Do you have the page contents that triggered the issue?
 
Yes, I have preserved the core dump files of the vacuum process and several other related processes. It had been stuck in the retry loop for at least 10 minutes when I dumped this core file.

(gdb) p *(BufferDesc *) (&BufferDescriptors[(buffer)].bufferdesc)
$8 = {tag = {spcOid = 1663, dbOid = 16425, relNumber = 61755, forkNum = MAIN_FORKNUM,
    blockNum = 50002}, buf_id = 67521, state = {value = 3548643328},
  wait_backend_pgprocno = 2147483647, freeNext = -2, content_lock = {tranche = 78, state = {
      value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}}

And in lazy_scan_prune() :
(gdb) p maxoff
$11 = 32
(gdb) p offnum
$12 = 1

Item pointer info for offsets 1 to 32, printed with a gdb loop:
(gdb) define print_itemids
> set $i = 1
> while $i <= 32
> printf "offset %d: ", $i
> print ((PageHeader)(page))->pd_linp[$i - 1]
> set $i = $i + 1
> end
> end
(gdb) print_itemids

offset 1: $1 = {lp_off = 7944, lp_flags = 1, lp_len = 248}
offset 2: $2 = {lp_off = 7696, lp_flags = 1, lp_len = 248}
offset 3: $3 = {lp_off = 7448, lp_flags = 1, lp_len = 248}
offset 4: $4 = {lp_off = 7200, lp_flags = 1, lp_len = 248}
offset 5: $5 = {lp_off = 6952, lp_flags = 1, lp_len = 248}
offset 6: $6 = {lp_off = 6704, lp_flags = 1, lp_len = 248}
offset 7: $7 = {lp_off = 6456, lp_flags = 1, lp_len = 248}
offset 8: $8 = {lp_off = 6208, lp_flags = 1, lp_len = 248}
offset 9: $9 = {lp_off = 5960, lp_flags = 1, lp_len = 248}
offset 10: $10 = {lp_off = 5712, lp_flags = 1, lp_len = 248}
offset 11: $11 = {lp_off = 5464, lp_flags = 1, lp_len = 248}
offset 12: $12 = {lp_off = 5216, lp_flags = 1, lp_len = 248}
offset 13: $13 = {lp_off = 4968, lp_flags = 1, lp_len = 248}
offset 14: $14 = {lp_off = 4720, lp_flags = 1, lp_len = 248}
offset 15: $15 = {lp_off = 4472, lp_flags = 1, lp_len = 248}
offset 16: $16 = {lp_off = 4224, lp_flags = 1, lp_len = 248}
offset 17: $17 = {lp_off = 3976, lp_flags = 1, lp_len = 248}
offset 18: $18 = {lp_off = 3728, lp_flags = 1, lp_len = 248}
offset 19: $19 = {lp_off = 3480, lp_flags = 1, lp_len = 248}
offset 20: $20 = {lp_off = 3232, lp_flags = 1, lp_len = 248}
offset 21: $21 = {lp_off = 2984, lp_flags = 1, lp_len = 248}
offset 22: $22 = {lp_off = 2736, lp_flags = 1, lp_len = 248}
offset 23: $23 = {lp_off = 2488, lp_flags = 1, lp_len = 248}
offset 24: $24 = {lp_off = 2240, lp_flags = 1, lp_len = 248}
offset 25: $25 = {lp_off = 1992, lp_flags = 1, lp_len = 248}
offset 26: $26 = {lp_off = 1744, lp_flags = 1, lp_len = 248}
offset 27: $27 = {lp_off = 1496, lp_flags = 1, lp_len = 248}
offset 28: $28 = {lp_off = 1248, lp_flags = 1, lp_len = 248}
offset 29: $29 = {lp_off = 1000, lp_flags = 1, lp_len = 248}
offset 30: $30 = {lp_off = 752, lp_flags = 1, lp_len = 248}
offset 31: $31 = {lp_off = 504, lp_flags = 1, lp_len = 248}
offset 32: $32 = {lp_off = 256, lp_flags = 1, lp_len = 248}

Using the following gdb function to print the HeapTupleHeader info:
(gdb) define print_all_heaptupleheaders
> set $i = 1
> while $i <= 32
> printf "offset %d: ", $i
> set $tup = (HeapTupleHeader) ((char *) (page) + ((PageHeader) (page))->pd_linp[$i - 1].lp_off)
> print *$tup
> set $i = $i + 1
> end
> end
(gdb) print_all_heaptupleheaders

offset 1: $1 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 7},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b729f1f ""}
offset 2: $2 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 8},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b729e27 ""}
offset 3: $3 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 9},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b729d2f ""}
offset 4: $4 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 10},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b729c37 ""}
offset 5: $5 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 11},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b729b3f ""}
offset 6: $6 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 12},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b729a47 ""}
offset 7: $7 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 13},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b72994f ""}
offset 8: $8 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 14},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b729857 ""}
offset 9: $9 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 15},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b72975f ""}
offset 10: $10 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 16},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b729667 ""}
offset 11: $11 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 17},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b72956f ""}
offset 12: $12 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 18},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b729477 ""}
offset 13: $13 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 19},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b72937f ""}
offset 14: $14 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 20},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b729287 ""}
offset 15: $15 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 21},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b72918f ""}
offset 16: $16 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 22},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b729097 ""}
offset 17: $17 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 23},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b728f9f ""}
offset 18: $18 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 24},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b728ea7 ""}
offset 19: $19 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 25},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b728daf ""}
offset 20: $20 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 26},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b728cb7 ""}
offset 21: $21 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 27},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b728bbf ""}
offset 22: $22 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 28},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b728ac7 ""}
offset 23: $23 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 29},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b7289cf ""}
offset 24: $24 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 30},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b7288d7 ""}
offset 25: $25 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 31},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b7287df ""}
offset 26: $26 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54842}, ip_posid = 32},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b7286e7 ""}
offset 27: $27 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54843}, ip_posid = 1},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b7285ef ""}
offset 28: $28 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54843}, ip_posid = 2},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b7284f7 ""}
offset 29: $29 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54843}, ip_posid = 3},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b7283ff ""}
offset 30: $30 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54843}, ip_posid = 4},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b728307 ""}
offset 31: $31 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54843}, ip_posid = 5},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b72820f ""}
offset 32: $32 = {t_choice = {t_heap = {t_xmin = 1094, t_xmax = 1094, t_field3 = {t_cid = 5986,
        t_xvac = 5986}}, t_datum = {datum_len_ = 1094, datum_typmod = 1094,
      datum_typeid = 5986}}, t_ctid = {ip_blkid = {bi_hi = 6, bi_lo = 54843}, ip_posid = 6},
  t_infomask2 = 24, t_infomask = 9506, t_hoff = 24 '\030', t_bits = 0x7fa25b728117 ""}

Do you have any other variables? E.g. RecentXmin,
ComputeXidHorizonsResultLastXmin?

RecentXmin = 1079
ComputeXidHorizonsResultLastXmin = 1079
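
For what it's worth, plugging those numbers into a toy version of the two
checks (not the real HeapTupleSatisfiesVacuum()/GlobalVisTest code, and
ignoring XID wraparound) shows the disagreement that keeps lazy_scan_prune()
retrying: xmax = 1094 from the tuple headers above is dead by OldestXmin =
1106, but only recently dead by maybe_needed = 1079:

/*
 * Toy illustration only; no wraparound handling and none of the real
 * visibility logic. It just compares the values reported above.
 */
#include <stdio.h>
#include <stdint.h>

typedef uint32_t ToyXid;

int
main(void)
{
    ToyXid xmax = 1094;          /* t_xmax from the tuple headers above */
    ToyXid OldestXmin = 1106;    /* vacrel->cutoffs.OldestXmin */
    ToyXid maybe_needed = 1079;  /* vacrel->vistest->maybe_needed */

    int dead_by_oldestxmin = xmax < OldestXmin;     /* 1: HEAPTUPLE_DEAD */
    int removable_by_vistest = xmax < maybe_needed; /* 0: RECENTLY_DEAD  */

    printf("dead by OldestXmin: %d, removable by vistest: %d\n",
           dead_by_oldestxmin, removable_by_vistest);
    /* pruning leaves the tuple, HeapTupleSatisfiesVacuum says DEAD: retry */
    return 0;
}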


--
Regards
Bowen Shi

Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Andres Freund
Date:
Hi,

On 2024-05-16 15:31:20 +0800, Bowen Shi wrote:
> Yes, I have preserved the coredump file of vacuum and several other related
> processes. It has been at least 10 minutes stuck in the retry loop since I
> dumped this core file.

Could you print out *vacrel?

I'm wondering if there was index processing, due to the number of tuples. And
if so, what type of indexes. There'd need to be something that could lead to
new snapshots being acquired...

Greetings,

Andres Freund



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Peter Geoghegan
Date:
On Thu, May 16, 2024 at 12:38 PM Andres Freund <andres@anarazel.de> wrote:
> I'm wondering if there was index processing, due to the number of tuples. And
> if so, what type of indexes. There'd need to be something that could lead to
> new snapshots being acquired...

Did you ever see this theory of mine, about B-Tree page deletion +
recycling? See:


https://www.postgresql.org/message-id/flat/CAH2-Wz%3DzLcnZO8MqPXQLqOLY%3DCAwQhdvs5Ncg6qMb5nMAam0EA%40mail.gmail.com#d058a6d4b8c8fa7d1ff14349b3a50c3c

(And related nearby emails from me.)

It looked very much like index vacuuming was involved in some way when
I actually had the opportunity to use gdb against an affected
production instance that ran into the problem.

--
Peter Geoghegan



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Andres Freund
Date:
Hi,

On 2024-05-16 12:49:00 -0400, Peter Geoghegan wrote:
> On Thu, May 16, 2024 at 12:38 PM Andres Freund <andres@anarazel.de> wrote:
> > I'm wondering if there was index processing, due to the number of tuples. And
> > if so, what type of indexes. There'd need to be something that could lead to
> > new snapshots being acquired...
> 
> Did you ever see this theory of mine, about B-Tree page deletion +
> recycling? See:
> 
>
https://www.postgresql.org/message-id/flat/CAH2-Wz%3DzLcnZO8MqPXQLqOLY%3DCAwQhdvs5Ncg6qMb5nMAam0EA%40mail.gmail.com#d058a6d4b8c8fa7d1ff14349b3a50c3c
> 
> (And related nearby emails from me.)

> It looked very much like index vacuuming was involved in some way when
> I actually had the opportunity to use gdb against an affected
> production instance that ran into the problem.

Hm, were the cases you observed that way using parallel vacuuming? And what
index types were involved?

Melanie's reproducer works because there are catalog accesses that can trigger
a recomputation of the fuzzy horizon. For testing, the "easy" window for that is
the vac_open_indexes() call in versions < 16, because it happens after determining
the horizon, but before actually vacuuming.

Now I wonder if there is some codepath triggering catalog lookups during bulk
delete.

Greetings,

Andres Freund



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Andres Freund
Date:
Hi,

On 2024-05-16 12:39:53 -0700, Andres Freund wrote:
> On 2024-05-16 12:49:00 -0400, Peter Geoghegan wrote:
> > It looked very much like index vacuuming was involved in some way when
> > I actually had the opportunity to use gdb against an affected
> > production instance that ran into the problem.
> 
> Hm, were the cases you observed that way using parallel vacuuming? And what
> index types were involved?
> 
> Melanies reproducer works because there are catalog accesses that can trigger
> a recomputation of fuzzy horizon. For testing the "easy" window for that is
> the vac_open_indexes() < 16, because it happens after determining the horizon,
> but before actually vacuuming.
> 
> Now I wonder if there is some codepath triggering catalog lookups during bulk
> delete.

There is in gin:

2024-05-16 13:02:56.114 PDT [1632692][client backend][2/4:0][psql] CONTEXT:  while scanning block 33 of relation
"public.foo"
2024-05-16 13:02:56.114 PDT [1632692][client backend][2/4:0][psql] STATEMENT:  VACUUM (VERBOSE, PARALLEL 0) foo;
1632692: AIM
1632692: AIM
1632692: AIM
1632692: AIM
1632692: AIM
1632692: AIM
1632692: AIM
1632692: AIM
1632692: AIM
2024-05-16 13:02:56.116 PDT [1632261][postmaster][:0][] DEBUG:  server process (PID 1632484) exited with exit code 0
2024-05-16 13:02:56.140 PDT [1632692][client backend][2/4:0][psql] DEBUG:  scanned index "foo_gin" to remove 7458 row
versions


#0  AcceptInvalidationMessages () at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/inval.c:808
#1  0x0000000000ce1342 in LockRelationOid (relid=1255, lockmode=1) at
../../../../../home/andres/src/postgresql/src/backend/storage/lmgr/lmgr.c:136
#2  0x000000000081b718 in relation_open (relationId=1255, lockmode=1) at
../../../../../home/andres/src/postgresql/src/backend/access/common/relation.c:55
#3  0x00000000008d8116 in table_open (relationId=1255, lockmode=1) at
../../../../../home/andres/src/postgresql/src/backend/access/table/table.c:44
#4  0x0000000000ec8b04 in SearchCatCacheMiss (cache=0x289f180, nkeys=1, hashValue=3157698538, hashIndex=106, v1=16641,
v2=0,v3=0, v4=0)
 
    at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1468
#5  0x0000000000ec8a80 in SearchCatCacheInternal (cache=0x289f180, nkeys=1, v1=16641, v2=0, v3=0, v4=0)
    at ../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1413
#6  0x0000000000ec8773 in SearchCatCache1 (cache=0x289f180, v1=16641) at
../../../../../home/andres/src/postgresql/src/backend/utils/cache/catcache.c:1281
#7  0x0000000000ee59cf in SearchSysCache1 (cacheId=45, key1=16641) at
../../../../../home/andres/src/postgresql/src/backend/utils/cache/syscache.c:225
#8  0x0000000000ef6496 in fmgr_info_cxt_security (functionId=16641, finfo=0x28e7dd8, mcxt=0x28fa5b0,
ignore_security=false)
    at ../../../../../home/andres/src/postgresql/src/backend/utils/fmgr/fmgr.c:181
#9  0x0000000000ef63cb in fmgr_info_cxt (functionId=16641, finfo=0x28e7dd8, mcxt=0x28fa5b0)
    at ../../../../../home/andres/src/postgresql/src/backend/utils/fmgr/fmgr.c:139
#10 0x0000000000894ee7 in index_getprocinfo (irel=0x7f53fae86e60, attnum=1, procnum=2)
    at ../../../../../home/andres/src/postgresql/src/backend/access/index/indexam.c:902
#11 0x000000000083d962 in initGinState (state=0x7ffc7c077960, index=0x7f53fae86e60)
    at ../../../../../home/andres/src/postgresql/src/backend/access/gin/ginutil.c:155
#12 0x000000000083ff1c in ginbulkdelete (info=0x7ffc7c07a010, stats=0x0, callback=0xa559ca <vac_tid_reaped>,
callback_state=0x29036d8)
    at ../../../../../home/andres/src/postgresql/src/backend/access/gin/ginvacuum.c:582
#13 0x00000000008949bd in index_bulk_delete (info=0x7ffc7c07a010, istat=0x0, callback=0xa559ca <vac_tid_reaped>,
callback_state=0x29036d8)
    at ../../../../../home/andres/src/postgresql/src/backend/access/index/indexam.c:758
#14 0x0000000000a55872 in vac_bulkdel_one_index (ivinfo=0x7ffc7c07a010, istat=0x0, dead_items=0x29036d8,
dead_items_info=0x29036b8)
    at ../../../../../home/andres/src/postgresql/src/backend/commands/vacuum.c:2498
#15 0x000000000088e94c in lazy_vacuum_one_index (indrel=0x7f53fae86e60, istat=0x0, reltuples=40000, vacrel=0x2902770)
    at ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:2451
#16 0x000000000088dc94 in lazy_vacuum_all_indexes (vacrel=0x2902770) at
../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:2032
#17 0x000000000088d9e4 in lazy_vacuum (vacrel=0x2902770) at
../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:1946
#18 0x000000000088c2e9 in lazy_scan_heap (vacrel=0x2902770) at
../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:897


Greetings,

Andres Freund



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Peter Geoghegan
Date:
On Thu, May 16, 2024 at 3:39 PM Andres Freund <andres@anarazel.de> wrote:
> Hm, were the cases you observed that way using parallel vacuuming? And what
> index types were involved?

They definitely didn't use parallel VACUUM, since autovacuum was
involved in all cases. Only B-Tree indexes were involved. There were
plenty of index page deletions involved, since the underlying workload
involved bulk deletions. Deletions that seemed to leave every heap
page without any heap tuples.

Partitioning also happened to be involved. No particular reason to
suspect that that matters, though.

> Melanies reproducer works because there are catalog accesses that can trigger
> a recomputation of fuzzy horizon. For testing the "easy" window for that is
> the vac_open_indexes() < 16, because it happens after determining the horizon,
> but before actually vacuuming.

What about the call to GetOldestNonRemovableTransactionId() that takes
place in _bt_pendingfsm_finalize()?

> Now I wonder if there is some codepath triggering catalog lookups during bulk
> delete.

I don't think that there's any rule that says that VACUUM cannot do
catalog lookups during bulk deletions. B-Tree page deletion needs to
generate an insertion scan key, so that it can "refind" a page
undergoing deletion. That might require catalog lookups.

--
Peter Geoghegan



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Andres Freund
Date:
Hi,

On 2024-05-16 16:13:35 -0400, Peter Geoghegan wrote:
> On Thu, May 16, 2024 at 3:39 PM Andres Freund <andres@anarazel.de> wrote:
> > Melanies reproducer works because there are catalog accesses that can trigger
> > a recomputation of fuzzy horizon. For testing the "easy" window for that is
> > the vac_open_indexes() < 16, because it happens after determining the horizon,
> > but before actually vacuuming.
>
> What about the call to GetOldestNonRemovableTransactionId() that takes
> place in _bt_pendingfsm_finalize()?

Ah, good catch! That'd do it.



> > Now I wonder if there is some codepath triggering catalog lookups during bulk
> > delete.
>
> I don't think that there's any rule that says that VACUUM cannot do
> catalog lookups during bulk deletions. B-Tree page deletion needs to
> generate an insertion scan key, so that it can "refind" a page
> undergoing deletion. That might require catalog lookups.

I'm not saying there's a hard rule against it. Just that there wasn't an
immediately apparent, nor immediately observable, path for it. As I didn't see
the path to the horizon recomputation, I didn't know how a btbulkdelete in the
middle of the scan would potentially trigger the problem.

Greetings,

Andres Freund



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Peter Geoghegan
Date:
On Thu, May 16, 2024 at 4:29 PM Andres Freund <andres@anarazel.de> wrote:
> On 2024-05-16 16:13:35 -0400, Peter Geoghegan wrote:
> > On Thu, May 16, 2024 at 3:39 PM Andres Freund <andres@anarazel.de> wrote:
> > > Melanies reproducer works because there are catalog accesses that can trigger
> > > a recomputation of fuzzy horizon. For testing the "easy" window for that is
> > > the vac_open_indexes() < 16, because it happens after determining the horizon,
> > > but before actually vacuuming.
> >
> > What about the call to GetOldestNonRemovableTransactionId() that takes
> > place in _bt_pendingfsm_finalize()?
>
> Ah, good catch! That'd do it.

That was definitely what happened in the problem cases I saw -- plenty
of B-Tree page deletions. Plus the heap page that lazy_scan_prune
actually got stuck on happened to be the first one after the first
round of bulk deletes of index tuples -- *exactly* the first heap page
scanned. Not just on one occasion -- there were several affected
instances that I had access to at various points, that all looked like
this (same workload for all of them, though). These were 14 and 15
instances (no 16 instances, likely just because 16 wasn't even really
available at the time).

It seems just about impossible that these details were all coincidental.

--
Peter Geoghegan



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Andres Freund
Date:
Hi,

On 2024-05-16 13:29:49 -0700, Andres Freund wrote:
> On 2024-05-16 16:13:35 -0400, Peter Geoghegan wrote:
> > > Now I wonder if there is some codepath triggering catalog lookups during bulk
> > > delete.
> >
> > I don't think that there's any rule that says that VACUUM cannot do
> > catalog lookups during bulk deletions. B-Tree page deletion needs to
> > generate an insertion scan key, so that it can "refind" a page
> > undergoing deletion. That might require catalog lookups.
>
> I'm not saying there's a hard rule against it. Just that there wasn't an
> immediately apparent, nor immediately observable, path for it. As I didn't see
> the path to the horizon recomputation, I didn't know how a btbulkdelete in the
> middle of the scan would potentially trigger the problem.

Hm. Actually. I think it might not be correct to do catalog lookups at that
point. But it's a bigger issue than just catalog lookups during bulkdelete:
Once we've done
     MyProc->statusFlags |= PROC_IN_VACUUM;

the current backend's snapshots don't prevent rows from being removed
anymore.

I first wrote:
> That's not a huge issue for the pg_class entry itself, as the locks should
> prevent it from being updated. But there are a lot of catalog lookups that
> aren't protected by locks, just normal snapshot semantics.

but as it turns out we haven't even locked the relation at the point we set
PROC_IN_VACUUM.

That seems quite broken.


WRT bulkdelete, there's this comment where we set PROC_IN_VACUUM:

         * In lazy vacuum, we can set the PROC_IN_VACUUM flag, which lets
         * other concurrent VACUUMs know that they can ignore this one while
         * determining their OldestXmin.  (The reason we don't set it during a
         * full VACUUM is exactly that we may have to run user-defined
         * functions for functional indexes, and we want to make sure that if
         * they use the snapshot set above, any tuples it requires can't get
         * removed from other tables.  An index function that depends on the
         * contents of other tables is arguably broken, but we won't break it
         * here by violating transaction semantics.)

the parenthetical explains that/why we can't evaluate user defined
functions. Which seems to be violated by doing key comparisons, no?

Greetings,

Andres Freund



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Melanie Plageman
Date:
On Thu, May 9, 2024 at 5:56 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> I can repro the hang on 14 and 15 with the following:

-- snip --

I finally managed to write a TAP test which does the same thing as my
repro. The attached patch contains a test that is meant purely for
demonstration and reproduction by others -- not as a suggestion to add
to the recovery test suite on an ongoing basis.

I also attached the fix (targeting REL_15_STABLE) so that you can try
the repro TAP test with and without the fix.

The test is a bit fragile, so I won't be surprised if it doesn't work
for everyone.

Thank you so much to Andres, Thomas, and Daniel, all of whom helped me
debug various issues I was having with trying to make background_psql
work reliably for this test on 15. This was truly painful, but I never
would have gotten through it without my IPC::Run support group.

I'll probably add more robust comments to the test next week in
preparation for writing a detailed commit message for the fix
explaining the scenario.

- Melanie

Attachment

Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Bowen Shi
Date:
Hi

On Fri, May 17, 2024 at 12:38 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2024-05-16 15:31:20 +0800, Bowen Shi wrote:
> Yes, I have preserved the coredump file of vacuum and several other related
> processes. It has been at least 10 minutes stuck in the retry loop since I
> dumped this core file.

Could you print out *vacrel?

I'm wondering if there was index processing, due to the number of tuples. And
if so, what type of indexes. There'd need to be something that could lead to
new snapshots being acquired...
 
(gdb) p * vacrel
$2 = {rel = 0x2403028, indrels = 0x234e550, nindexes = 12, bstrategy = 0x2368e28, pvs = 0x0,
  aggressive = false, skipwithvm = true, consider_bypass_optimization = false,
  do_index_vacuuming = true, do_index_cleanup = true, do_rel_truncate = true, cutoffs = {
    relfrozenxid = 927, relminmxid = 1, OldestXmin = 1106, OldestMxact = 1,
    FreezeLimit = 4244968402, MultiXactCutoff = 4289967297},
  vistest = 0xe693f0 <GlobalVisDataRels>, NewRelfrozenXid = 1106, NewRelminMxid = 1,
  skippedallvis = false, dbname = 0x234e7f0 "db1", relnamespace = 0x234eb38 "public",
  relname = 0x234eb48 "t1", indname = 0x0, blkno = 349565,
  offnum = 10, phase = VACUUM_ERRCB_PHASE_SCAN_HEAP, verbose = false,
  dead_items = 0x7fa22f48d040, rel_pages = 1302478, scanned_pages = 349566, removed_pages = 0,
  frozen_pages = 592, lpdead_item_pages = 349565, missed_dead_pages = 0,
  nonempty_pages = 349495, new_rel_tuples = 0, new_live_tuples = 0, indstats = 0x234e6e0,
  num_index_scans = 1, tuples_deleted = 11184540, tuples_frozen = 1540, lpdead_items = 11184540,
  live_tuples = 1540, recently_dead_tuples = 0, missed_dead_tuples = 0}
(gdb) p * vacrel->bstrategy
$3 = {btype = BAS_VACUUM, nbuffers = 32, current = 30, buffers = 0x2368e34}




--
Regards
Bowen Shi

Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Bowen Shi
Date:
Hi,

On Fri, May 17, 2024 at 12:49 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, May 16, 2024 at 12:38 PM Andres Freund <andres@anarazel.de> wrote:
> I'm wondering if there was index processing, due to the number of tuples. And
> if so, what type of indexes. There'd need to be something that could lead to
> new snapshots being acquired...

Did you ever see this theory of mine, about B-Tree page deletion +
recycling? See:

https://www.postgresql.org/message-id/flat/CAH2-Wz%3DzLcnZO8MqPXQLqOLY%3DCAwQhdvs5Ncg6qMb5nMAam0EA%40mail.gmail.com#d058a6d4b8c8fa7d1ff14349b3a50c3c

(And related nearby emails from me.)

It looked very much like index vacuuming was involved in some way when
I actually had the opportunity to use gdb against an affected
production instance that ran into the problem.

In my case, the call stack is following:

#0  0x000000000050d50e in heap_page_prune (relation=relation@entry=0x2403028,
    buffer=buffer@entry=67521, vistest=<optimized out>, old_snap_xmin=old_snap_xmin@entry=0,
    old_snap_ts=old_snap_ts@entry=0, nnewlpdead=nnewlpdead@entry=0x7fff8be98ecc,
    off_loc=off_loc@entry=0x234e3cc)
#1  0x0000000000510678 in lazy_scan_prune (vacrel=vacrel@entry=0x234e348, buf=buf@entry=67521,
    blkno=blkno@entry=349565, page=page@entry=0x7fa25b728000 "'",
    prunestate=prunestate@entry=0x7fff8be9a0d0)
#2  0x0000000000511a70 in lazy_scan_heap (vacrel=0x234e348)
#3  heap_vacuum_rel (rel=0x2403028, params=0x2358064, bstrategy=<optimized out>)
#4  0x00000000006767e7 in table_relation_vacuum (bstrategy=0x2368e28, params=0x2358064,
    rel=0x2403028)
#5  vacuum_rel (relid=18930, relation=<optimized out>, params=params@entry=0x2358064,
    bstrategy=bstrategy@entry=0x2368e28)
#6  0x0000000000677be0 in vacuum (relations=0x23652f8, relations@entry=0x2363310,
    params=params@entry=0x2358064, bstrategy=bstrategy@entry=0x2368e28,
    vac_context=vac_context@entry=0x23651a0, isTopLevel=isTopLevel@entry=true)
#7  0x0000000000778080 in autovacuum_do_vac_analyze (bstrategy=0x2368e28, tab=0x2358060)
#8  do_autovacuum ()
#9  0x0000000000778510 in AutoVacWorkerMain (argv=0x0, argc=0)
#10 0x00000000007785eb in StartAutoVacWorker ()
#11 0x000000000077efe1 in StartAutovacuumWorker ()
#12 process_pm_pmsignal ()
#13 ServerLoop ()
#14 0x0000000000780328 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x22583a0)
#15 0x00000000004bc368 in main (argc=3, argv=0x22583a0) 

(gdb) p * MyProc            
$4 = {links = {prev = 0x0, next = 0x0}, procgloballist = 0x7fa34153ecb8, sem = 0x7fa237311138,
  waitStatus = PROC_WAIT_STATUS_OK, procLatch = {is_set = 1, maybe_sleeping = 0,
    is_shared = true, owner_pid = 2303}, xid = 0, xmin = 1079, lxid = 237689, pid = 2303,
  pgxactoff = 13, pgprocno = 2050, backendId = 13, databaseId = 16425, roleId = 0,
  tempNamespaceId = 0, isBackgroundWorker = false, recoveryConflictPending = false,
  lwWaiting = 0 '\000', lwWaitMode = 0 '\000', lwWaitLink = {next = 0, prev = 0}, cvWaitLink = {
    next = 0, prev = 0}, waitLock = 0x0, waitProcLock = 0x0, waitLockMode = 0, heldLocks = 0,
  waitStart = {value = 0}, delayChkptFlags = 0, statusFlags = 3 '\003', waitLSN = 0,
  syncRepState = 0, syncRepLinks = {prev = 0x0, next = 0x0}, myProcLocks = {{head = {
        prev = 0x7fa3416f7518, next = 0x7fa3416f7518}}, {head = {prev = 0x7fa33e401d60,
        next = 0x7fa33e401d60}}, {head = {prev = 0x7fa33e453db0, next = 0x7fa33e453db0}}, {
      head = {prev = 0x7fa3416f7548, next = 0x7fa3416f7548}}, {head = {prev = 0x7fa33ea10060,
        next = 0x7fa33e4f7220}}, {head = {prev = 0x7fa33e548cd0, next = 0x7fa33e548cd0}}, {
      head = {prev = 0x7fa33eab22b0, next = 0x7fa33eab2350}}, {head = {prev = 0x7fa33e5f9020,
        next = 0x7fa33eb11960}}, {head = {prev = 0x7fa3416f7598, next = 0x7fa3416f7598}}, {
      head = {prev = 0x7fa33eba74f0, next = 0x7fa33eba74f0}}, {head = {prev = 0x7fa3416f75b8,
        next = 0x7fa3416f75b8}}, {head = {prev = 0x7fa3416f75c8, next = 0x7fa3416f75c8}}, {
      head = {prev = 0x7fa33eca8c60, next = 0x7fa33eca89e0}}, {head = {prev = 0x7fa3416f75e8,
        next = 0x7fa3416f75e8}}, {head = {prev = 0x7fa3416f75f8, next = 0x7fa3416f75f8}}, {
      head = {prev = 0x7fa33e884de0, next = 0x7fa33e884de0}}}, subxidStatus = {count = 0 '\000',
    overflowed = false}, subxids = {xids = {0 <repeats 64 times>}},
  procArrayGroupMember = false, procArrayGroupNext = {value = 2147483647},
  procArrayGroupMemberXid = 0, wait_event_info = 0, clogGroupMember = false, clogGroupNext = {
    value = 2147483647}, clogGroupMemberXid = 0, clogGroupMemberXidStatus = 0,
  clogGroupMemberPage = -1, clogGroupMemberLsn = 0, fpInfoLock = {tranche = 81, state = {
      value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, fpLockBits = 0,
  fpRelId = {2840, 2662, 2659, 3379, 2841, 2840, 2662, 27770, 26889, 26008, 24246, 23365, 2659,
    1249, 2690, 53019}, fpVXIDLock = true, fpLocalTransactionId = 237689, lockGroupLeader = 0x0,
  lockGroupMembers = {head = {prev = 0x7fa3416f77b0, next = 0x7fa3416f77b0}}, lockGroupLink = {
    prev = 0x0, next = 0x0}}

From the current situation, I feel that the scenario I encountered is different from yours, because it doesn't seem to be the first page of the vacuum scan.




--
Regards
Bowen Shi

Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Peter Geoghegan
Date:
On Thu, May 16, 2024 at 11:36 PM Bowen Shi <zxwsbg12138@gmail.com> wrote:
> From the current situation, I feel that the scenario I encountered is different from yours, because it doesn't seem to be the first page of the vacuum scan.

Really? I think it looks like the same issue.

What I saw was that the heap page that lazy_scan_prune locked up on
was the first page scanned *after* the first round of index vacuuming
-- *not* the first heap page scanned overall.

I also saw "num_index_scans == 1", "tuples_deleted == lpdead_items",
and "scanned_pages == lpdead_item_pages" when I looked at the locked
up Postgres 15 production instance (or close to it). I can't tell if
your dead_items is shown as empty, too, but if it is then that
indicates that the heap page that your lazy_scan_prune trace relates
to is the first page scanned after a round of index vacuuming (the
first round).

If you can show more about dead_items, then it will either confirm or
disprove my theory about it being the same issue.

--
Peter Geoghegan



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Noah Misch
Date:
On Thu, May 16, 2024 at 07:57:27PM -0400, Melanie Plageman wrote:
> On Thu, May 9, 2024 at 5:56 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> > I can repro the hang on 14 and 15 with the following:

> I'll probably add more robust comments to the test next week in
> preparation for writing a detailed commit message for the fix
> explaining the scenario.

Are there obstacles to fixing the hang by back-patching 1ccc1e05ae instead of
this?  We'll need to get confident about 1ccc1e05ae before v17, and that
sounds potentially easier than getting confident about both 1ccc1e05ae and
this other approach.  Regarding getting confident about 1ccc1e05ae, I think I
follow the upthread arguments that it does operate correctly.  As a cross
check, I looked at each mention of oldestxmin in vacuumlazy.c and vacuum.c.
Does the following heap_vacuum_rel() comment need an update?

    /*
     * Get cutoffs that determine which deleted tuples are considered DEAD,
     * not just RECENTLY_DEAD, and which XIDs/MXIDs to freeze.  Then determine
     * the extent of the blocks that we'll scan in lazy_scan_heap.  It has to
     * happen in this order to ensure that the OldestXmin cutoff field works
     * as an upper bound on the XIDs stored in the pages we'll actually scan
     * (NewRelfrozenXid tracking must never be allowed to miss unfrozen XIDs).
     *
     * Next acquire vistest, a related cutoff that's used in pruning.  We
     * expect vistest will always make heap_page_prune_and_freeze() remove any
     * deleted tuple whose xmax is < OldestXmin.



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Melanie Plageman
Date:
On Sat, May 18, 2024 at 6:23 PM Noah Misch <noah@leadboat.com> wrote:
>
> Are there obstacles to fixing the hang by back-patching 1ccc1e05ae instead of
> this?  We'll need to get confident about 1ccc1e05ae before v17, and that
> sounds potentially easier than getting confident about both 1ccc1e05ae and
> this other approach.

I haven't tried back-patching 1ccc1e05ae yet, but I don't understand
why we would want to use stable back branches to get comfortable with
an approach committed to an unreleased version of Postgres.

The small fix proposed in this thread is extremely minimal and
straightforward. It seems much less risky as a backpatch.

> Regarding getting confident about 1ccc1e05ae, I think I
> follow the upthread arguments that it does operate correctly.  As a cross
> check, I looked at each mention of oldestxmin in vacuumlazy.c and vacuum.c.
> Does the following heap_vacuum_rel() comment need an update?
>
>         /*
>          * Get cutoffs that determine which deleted tuples are considered DEAD,
>          * not just RECENTLY_DEAD, and which XIDs/MXIDs to freeze.  Then determine
>          * the extent of the blocks that we'll scan in lazy_scan_heap.  It has to
>          * happen in this order to ensure that the OldestXmin cutoff field works
>          * as an upper bound on the XIDs stored in the pages we'll actually scan
>          * (NewRelfrozenXid tracking must never be allowed to miss unfrozen XIDs).
>          *
>          * Next acquire vistest, a related cutoff that's used in pruning.  We
>          * expect vistest will always make heap_page_prune_and_freeze() remove any
>          * deleted tuple whose xmax is < OldestXmin.

Which part are you thinking needs an update? The part about vistest
always making heap_page_prune_and_freeze() remove any deleted tuples
whose xmax < OldestXmin?

- Melanie



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Noah Misch
Date:
On Mon, May 20, 2024 at 11:58:23AM -0400, Melanie Plageman wrote:
> On Sat, May 18, 2024 at 6:23 PM Noah Misch <noah@leadboat.com> wrote:
> > Are there obstacles to fixing the hang by back-patching 1ccc1e05ae instead of
> > this?  We'll need to get confident about 1ccc1e05ae before v17, and that
> > sounds potentially easier than getting confident about both 1ccc1e05ae and
> > this other approach.
> 
> I haven't tried back-patching 1ccc1e05ae yet, but I don't understand
> why we would want to use stable back branches to get comfortable with
> an approach committed to an unreleased version of Postgres.

I wouldn't say we want to use stable back branches to get comfortable with an
approach.  I wanted to say that it's less work to be confident about one new
way of doing things than two new ways of doing things.

> The small fix proposed in this thread is extremely minimal and
> straightforward. It seems much less risky as a backpatch.

Here's how I model the current and proposed code:

1. v15 VACUUM removes tuples that are HEAPTUPLE_DEAD according to VisTest.
   OldestXmin doesn't cause tuple removal, but there's a hang when OldestXmin
   rules DEAD after VisTest ruled RECENTLY_DEAD.

2. With 1ccc1e05ae, v17 VACUUM still removes tuples that are HEAPTUPLE_DEAD
   according to VisTest only.  OldestXmin doesn't come into play.

3. fix_hang_15.patch would make v15 VACUUM remove tuples that are
   HEAPTUPLE_DEAD according to _either_ VisTest or OldestXmin.

Since (3) is the only list entry that removes tuples that VisTest ruled
RECENTLY_DEAD, I find it higher risk.  For all three, the core task of
certifying the behavior is confirming that its outcome is sound when VisTest
and OldestXmin disagree.  How do you model it?
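
Spelled out as a toy decision rule (a sketch of the three entries above, not
actual vacuumlazy.c code), that's:

/*
 * Toy sketch of the three behaviors listed above. "vistest_dead" stands for
 * "HEAPTUPLE_DEAD according to VisTest", "oldestxmin_dead" for
 * "xmax < OldestXmin".
 */
#include <stdbool.h>
#include <stdio.h>

/* 1. v15 as released: remove only what VisTest rules DEAD; the
 *    VisTest-RECENTLY_DEAD / OldestXmin-DEAD combination hangs instead. */
static bool
remove_v15(bool vistest_dead, bool oldestxmin_dead)
{
    (void) oldestxmin_dead;     /* only consulted afterwards, in the retry check */
    return vistest_dead;
}

/* 2. v17 with 1ccc1e05ae: same removal rule; OldestXmin doesn't come into play. */
static bool
remove_v17(bool vistest_dead, bool oldestxmin_dead)
{
    (void) oldestxmin_dead;
    return vistest_dead;
}

/* 3. fix_hang_15.patch: remove if either cutoff rules the tuple DEAD. */
static bool
remove_fix_hang_15(bool vistest_dead, bool oldestxmin_dead)
{
    return vistest_dead || oldestxmin_dead;
}

int
main(void)
{
    /* the problem case: VisTest says RECENTLY_DEAD, OldestXmin says DEAD */
    printf("v15: %d (and it hangs), v17: %d, fix_hang_15: %d\n",
           remove_v15(false, true),
           remove_v17(false, true),
           remove_fix_hang_15(false, true));
    return 0;
}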

> > Regarding getting confident about 1ccc1e05ae, I think I
> > follow the upthread arguments that it does operate correctly.  As a cross
> > check, I looked at each mention of oldestxmin in vacuumlazy.c and vacuum.c.
> > Does the following heap_vacuum_rel() comment need an update?
> >
> >         /*
> >          * Get cutoffs that determine which deleted tuples are considered DEAD,
> >          * not just RECENTLY_DEAD, and which XIDs/MXIDs to freeze.  Then determine

Since 1ccc1e05ae, the cutoffs (values acquired in vacuum_get_cutoffs()) don't
decide DEAD.

> >          * the extent of the blocks that we'll scan in lazy_scan_heap.  It has to

Still accurate.

> >          * happen in this order to ensure that the OldestXmin cutoff field works
> >          * as an upper bound on the XIDs stored in the pages we'll actually scan

That sentence's problem predates 1ccc1e05ae.  (OldestXmin is a lower bound on
XIDs in pages we don't scan, not an upper bound on XIDs in pages we scan.  To
be pedantic, snapshots don't affect this bound, so the oldest running XID is
also a lower bound.)

> >          * (NewRelfrozenXid tracking must never be allowed to miss unfrozen XIDs).

Still accurate.

> >          *
> >          * Next acquire vistest, a related cutoff that's used in pruning.  We

Still accurate.

> >          * expect vistest will always make heap_page_prune_and_freeze() remove any
> >          * deleted tuple whose xmax is < OldestXmin.

Since 1ccc1e05ae, that sentence does not apply.

> Which part are you thinking needs an update? The part about vistest
> always making heap_page_prune_and_freeze() remove any deleted tuples
> whose xmax < OldestXmin?

That and one other sentence marked above.


(Long-term, I favor removal of the cutoffs->OldestXmin field.  I'm not aware
of any calculation for which it remains the optimal value.)



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Melanie Plageman
Date:
On Mon, May 20, 2024 at 4:48 PM Noah Misch <noah@leadboat.com> wrote:
>
> On Mon, May 20, 2024 at 11:58:23AM -0400, Melanie Plageman wrote:
> > On Sat, May 18, 2024 at 6:23 PM Noah Misch <noah@leadboat.com> wrote:
> > > Are there obstacles to fixing the hang by back-patching 1ccc1e05ae instead of
> > > this?  We'll need to get confident about 1ccc1e05ae before v17, and that
> > > sounds potentially easier than getting confident about both 1ccc1e05ae and
> > > this other approach.
> >
> > I haven't tried back-patching 1ccc1e05ae yet, but I don't understand
> > why we would want to use stable back branches to get comfortable with
> > an approach committed to an unreleased version of Postgres.
>
> I wouldn't say we want to use stable back branches to get comfortable with an
> approach.  I wanted to say that it's less work to be confident about one new
> way of doing things than two new ways of doing things.

I think I understand your point better now.

> > The small fix proposed in this thread is extremely minimal and
> > straightforward. It seems much less risky as a backpatch.
>
> Here's how I model the current and proposed code:
>
> 1. v15 VACUUM removes tuples that are HEAPTUPLE_DEAD according to VisTest.
>    OldestXmin doesn't cause tuple removal, but there's a hang when OldestXmin
>    rules DEAD after VisTest ruled RECENTLY_DEAD.
>
> 2. With 1ccc1e05ae, v17 VACUUM still removes tuples that are HEAPTUPLE_DEAD
>    according to VisTest only.  OldestXmin doesn't come into play.
>
> 3. fix_hang_15.patch would make v15 VACUUM remove tuples that are
>    HEAPTUPLE_DEAD according to _either_ VisTest or OldestXmin.
>
> Since (3) is the only list entry that removes tuples that VisTest ruled
> RECENTLY_DEAD, I find it higher risk.  For all three, the core task of
> certifying the behavior is confirming that its outcome is sound when VisTest
> and OldestXmin disagree.  How do you model it?

Okay, I see your point. In 1 and 2, tuples that would have been
considered HEAPTUPLE_DEAD at the beginning of vacuum but are
considered HEAPTUPLE_RECENTLY_DEAD when pruning them are not removed.
In 3, tuples that would have been considered HEAPTUPLE_DEAD at the
beginning of vacuum are always removed, regardless of whether or not
they would be considered HEAPTUPLE_RECENTLY_DEAD when pruning them.

One option is to add the logic in fix_hang_15.patch to master as well
(always remove tuples older than OldestXmin). This addresses your
concern about gaining confidence in a single solution.

However, I can see how removing more tuples could be concerning. In
the case that the horizon moves backwards because of a standby
reconnecting, I think the worst case is that removing that tuple
causes a recovery conflict on the standby (depending on the value of
max_standby_streaming_delay et al).

I'll experiment with applying 1ccc1e05ae to 14 and 15 and see how it goes.

> > > Regarding getting confident about 1ccc1e05ae, I think I
> > > follow the upthread arguments that it does operate correctly.  As a cross
> > > check, I looked at each mention of oldestxmin in vacuumlazy.c and vacuum.c.
> > > Does the following heap_vacuum_rel() comment need an update?
> > >
> > >         /*
> > >          * Get cutoffs that determine which deleted tuples are considered DEAD,
> > >          * not just RECENTLY_DEAD, and which XIDs/MXIDs to freeze.  Then determine
>
> Since 1ccc1e05ae, the cutoffs (values acquired in vacuum_get_cutoffs()) don't
> decide DEAD.

Thinking about it more, this is misleading either way. The values
acquired in vacuum_get_cutoffs() don't determine which tuples are
removed because they are considered HEAPTUPLE_DEAD. If they are
considered HEAPTUPLE_RECENTLY_DEAD by VisTest and HEAPTUPLE_DEAD when
compared to OldestXmin, there will be a hang.

> > >          * the extent of the blocks that we'll scan in lazy_scan_heap.  It has to
>
> Still accurate.

I thought about this more, and I actually don't understand what "Then
determine the extent of the blocks that we'll scan in lazy_scan_heap"
means. I assume it refers to RelationGetNumberOfBlocks() which is
called after vacuum_get_cutoffs(). But, if so, I have no idea how that
relates to the following sentence which says that it "has to happen in
this order to ensure that the OldestXmin cutoff field works as an
upper bound"...

> > >          * happen in this order to ensure that the OldestXmin cutoff field works
> > >          * as an upper bound on the XIDs stored in the pages we'll actually scan
>
> That sentence's problem predates 1ccc1e05ae.  (OldestXmin is a lower bound on
> XIDs in pages we don't scan, not an upper bound on XIDs in pages we scan.  To
> be pedantic, snapshots don't affect this bound, so the oldest running XID is
> also a lower bound.)

Hmm. I'm not sure I agree with your interpretation here. Or perhaps I
don't understand it. We initialize NewRelfrozenXid with OldestXmin and
that is, in turn, used to initialize pagefrz.NoFreezePageRelfrozenXid.

HeapPageFreeze->NoFreezePageRelfrozenXid is updated with any older xid
found in an unfrozen tuple on the pages we _do_ scan. That makes
OldestXmin an upper bound on unfrozen xids on the pages we do scan.
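
As a toy illustration of what I mean (ignoring XID wraparound; this is not
the real freeze-tracking code, just the bookkeeping described above):
NewRelfrozenXid starts at OldestXmin and only ever moves backwards when an
older unfrozen XID is seen on a scanned page, so it can never end up newer
than OldestXmin.

#include <stdio.h>
#include <stdint.h>

typedef uint32_t ToyXid;

/* toy version of the NewRelfrozenXid tracking described above */
static ToyXid
track_new_relfrozenxid(ToyXid OldestXmin, const ToyXid *unfrozen_xids, int n)
{
    ToyXid NewRelfrozenXid = OldestXmin;    /* initial value */

    for (int i = 0; i < n; i++)
    {
        /* each unfrozen XID left behind can only drag the result down */
        if (unfrozen_xids[i] < NewRelfrozenXid)
            NewRelfrozenXid = unfrozen_xids[i];
    }
    return NewRelfrozenXid;                 /* always <= OldestXmin */
}

int
main(void)
{
    ToyXid unfrozen[] = {1102, 1094, 1079};

    printf("NewRelfrozenXid = %u\n",
           (unsigned) track_new_relfrozenxid(1106, unfrozen, 3));
    return 0;
}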

I don't think we can say anything about the XIDs of tuples on pages we
don't scan. We skip pages if they are all visible. We don't know that
tuples on those pages are newer than OldestXmin. That's why we can't
update relfrozenxid when we skip pages that are not all frozen.

And even if we skip scanning pages because they are all frozen, their
xids will be older than OldestXmin, so it wouldn't be a lower bound on
those either.

I don't understand "snapshots don't affect this bound, so the oldest
running XID is also a lower bound".

> > >          * (NewRelfrozenXid tracking must never be allowed to miss unfrozen XIDs).
>
> Still accurate.

Agree

> > >          *
> > >          * Next acquire vistest, a related cutoff that's used in pruning.  We
>
> Still accurate.

Agree.

> > >          * expect vistest will always make heap_page_prune_and_freeze() remove any
> > >          * deleted tuple whose xmax is < OldestXmin.
>
> Since 1ccc1e05ae, that sentence does not apply.

That's true. I wonder how we should modify it. Some options:

1) change it from "always" to "usually"
2) remove the sentence because it is not worth mentioning OldestXmin
in relation to removing tuples
3) explain the situation with GlobalVisTest moving the horizon
backwards to before OldestXmin
4) change it to say vistest will always cause pruning to remove
deleted tuples visible as deleted to everyone

I also think we should change this part:

     *  lazy_scan_prune must never
     * become confused about whether a tuple should be frozen or removed.  (In
     * the future we might want to teach lazy_scan_prune to recompute vistest
     * from time to time, to increase the number of dead tuples it can prune
     * away.)

lazy_scan_prune() can't really become confused about whether a tuple
should be frozen or removed. It may choose to freeze a tuple that was
ruled RECENTLY_DEAD by vistest and not removed. But that is okay. I
don't think there is any confusion.

The second part I'm not sure about either. Recomputing vistest in
lazy_scan_prune() would not cause more tuples to be removed. Doing so
in heap_page_prune_and_freeze() for reasons other than the current
logic in GlobalVisTestShouldUpdate() would have this effect. But, I
can't tell if that is even what this comment is suggesting.

> (Long-term, I favor removal of the cutoffs->OldestXmin field.  I'm not aware
> of any calculation for which it remains the optimal value.)

I'm trying to think if there is any complication with using
GlobalVisState for freezing given that it could be updated during the
course of vacuuming the relation. If we used GlobalVisState when
determining what to freeze and maybe_needed moves backwards, perhaps
we could end up incorrectly updating relfrozenxid?

- Melanie



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Noah Misch
Date:
On Wed, May 22, 2024 at 12:57:26PM -0400, Melanie Plageman wrote:
> On Mon, May 20, 2024 at 4:48 PM Noah Misch <noah@leadboat.com> wrote:
> > On Mon, May 20, 2024 at 11:58:23AM -0400, Melanie Plageman wrote:
> > > On Sat, May 18, 2024 at 6:23 PM Noah Misch <noah@leadboat.com> wrote:
> > > > I looked at each mention of oldestxmin in vacuumlazy.c and vacuum.c.
> > > > Does the following heap_vacuum_rel() comment need an update?
> > > >
> > > >         /*
> > > >          * Get cutoffs that determine which deleted tuples are considered DEAD,
> > > >          * not just RECENTLY_DEAD, and which XIDs/MXIDs to freeze.  Then determine
> >
> > Since 1ccc1e05ae, the cutoffs (values acquired in vacuum_get_cutoffs()) don't
> > decide DEAD.
> 
> Thinking about it more, this is misleading either way. The values
> acquired in vacuum_get_cutoffs() don't determine which tuples are
> removed because they are considered HEAPTUPLE_DEAD. If they are
> considered HEAPTUPLE_RECENTLY_DEAD by VisTest and HEAPTUPLE_DEAD when
> compared to OldestXmin, there will be a hang.

Agreed.

> > > >          * the extent of the blocks that we'll scan in lazy_scan_heap.  It has to
> >
> > Still accurate.
> 
> I thought about this more, and I actually don't understand what "Then
> determine the extent of the blocks that we'll scan in lazy_scan_heap"
> means. I assume it refers to RelationGetNumberOfBlocks() which is
> called after vacuum_get_cutoffs().

I think it does refer to that.

> But, if so, I have no idea how that
> relates to the following sentence which says that it "has to happen in
> this order to ensure that the OldestXmin cutoff field works as an
> upper bound"...

When the comment says "pages we'll actually scan", it means "pages that
existed at the start of VACUUM" or equivalently "pages of the rel except those
pages extended after this point in time".  I bet that phrase of the comment
predates the other ways VACUUM can now opt not to visit specific pages.

The lines where order is important:

    vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
    vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);

Suppose the first line sets vacrel->cutoffs.OldestXmin=1234 and the second
line sets orig_rel_pages=10.  Suppose, by the end of the VACUUM, the rel size
has grown to 15 pages.  The last 5 pages can contain only XIDs >= 1234.  We
can assume that without scanning those pages because of the order of those two
lines.  If we swapped the order of those two lines, XID=1233 could extend the
rel and commit between the two lines, without any cutoff reflecting that.
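
To spell out that race with the hypothetical swapped ordering (an annotated
sketch reusing the numbers above, not real code):

    /* hypothetical BAD order: measure the relation first ... */
    vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);    /* 10 pages */
    /* ... XID 1233 extends the rel to 11 pages and commits here ... */
    vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs); /* OldestXmin = 1234 */
    /*
     * Page 11 now holds XID 1233 < OldestXmin, beyond orig_rel_pages, so no
     * cutoff (and no scan) accounts for it.
     */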

> I don't think we can say anything about the XIDs of tuples on pages we
> don't scan. We skip pages if they are all visible. We don't know that
> tuples on those pages are newer than OldestXmin. That's why we can't
> update relfrozenxid when we skip pages that are not all frozen.

Agreed.  We can infer bounds only for newly-extended pages, not for pages not
scanned for some other reason.

> And even if we skip scanning pages because they are all frozen, their
> xids will be older than OldestXmin, so it wouldn't be a lower bound on
> those either.
> 
> I don't understand "snapshots don't affect this bound, so the oldest
> running XID is also a lower bound".

Assume OldestXmin=1M and OldestXid=2M, because there's a read-only transaction
with an old snapshot.  If something extends the rel after this code runs,
newly-extended pages won't add any transaction older than xid=2M.

> > > >          * expect vistest will always make heap_page_prune_and_freeze() remove any
> > > >          * deleted tuple whose xmax is < OldestXmin.
> >
> > Since 1ccc1e05ae, that sentence does not apply.
> 
> That's true. I wonder how we should modify it. Some options:
> 
> 1) change it from "always" to "usually"
> 2) remove the sentence because it is not worth mentioning OldestXmin
> in relation to removing tuples
> 3) explain the situation with GlobalVisTest moving the horizon
> backwards to before OldestXmin
> 4) change it to say vistest will always cause pruning to remove
> deleted tuples visible as deleted to everyone

Any of (2),(3), or (4) works for me.

> I also think we should change this part:
> 
>      *  lazy_scan_prune must never
>      * become confused about whether a tuple should be frozen or removed.  (In
>      * the future we might want to teach lazy_scan_prune to recompute vistest
>      * from time to time, to increase the number of dead tuples it can prune
>      * away.)
> 
> lazy_scan_prune() can't really become confused about whether a tuple
> should be frozen or removed. It may choose to freeze a tuple that was
> ruled RECENTLY_DEAD by vistest and not removed. But that is okay. I
> don't think there is any confusion.

Agreed.

> The second part I'm not sure about either. Recomputing vistest in
> lazy_scan_prune() would not cause more tuples to be removed. Doing so
> in heap_page_prune_and_freeze() for reasons other than the current
> logic in GlobalVisTestShouldUpdate() would have this effect. But, I
> can't tell if that is even what this comment is suggesting.

I read it as the following general idea.  If other backends update their xmin
values and/or end their transactions while a VACUUM is running, VACUUM could
opt to prune more as that news arrives.  VACUUM could check for such news
before starting on each page or whatever.

> > (Long-term, I favor removal of the cutoffs->OldestXmin field.  I'm not aware
> > of any calculation for which it remains the optimal value.)

For avoidance of doubt, I don't think this thread should make such a change.

> I'm trying to think if there is any complication with using
> GlobalVisState for freezing given that it could be updated during the
> course of vacuuming the relation. If we used GlobalVisState when
> determining what to freeze and maybe_needed moves backwards, perhaps
> we could end up incorrectly updating relfrozenxid?

I think it suffices to calculate relfrozenxid=min(OldestXid at VACUUM start,
all XIDs actually present in pages).  OldestXid might still insert a tuple on
any page of the rel.  OldestXid might not even start until after VACUUM
finishes.  Hence, we can't set any relfrozenxid lower than OldestXid.  But a
transaction with an old snapshot and no XID doesn't constrain relfrozenxid.
If that transaction does assign an XID, it will assign a value >= OldestXid.
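
Stated as code, that bound is just the wraparound-aware minimum (hypothetical
helper, purely illustrative):

    static TransactionId
    safe_new_relfrozenxid(TransactionId oldest_xid_at_vacuum_start,
                          TransactionId oldest_xid_seen_in_pages)
    {
        /* return whichever XID is older in circular XID order */
        return TransactionIdPrecedes(oldest_xid_at_vacuum_start,
                                     oldest_xid_seen_in_pages)
            ? oldest_xid_at_vacuum_start
            : oldest_xid_seen_in_pages;
    }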



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Bowen Shi
Date:
Hi,

On Fri, May 17, 2024 at 12:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, May 16, 2024 at 11:36 PM Bowen Shi <zxwsbg12138@gmail.com> wrote:
> > From the current situation, I feel that the scenario I encountered is different from yours because it doesn't seem to be the first page scanned by the vacuum.
>
> Really? I think it looks like the same issue.
>
> What I saw was that the heap page that lazy_scan_prune locked up on
> was the first page scanned *after* the first round of index vacuuming
> -- *not* the first heap page scanned overall.

I misunderstood earlier.

> I also saw "num_index_scans == 1", "tuples_deleted == lpdead_items",
> and "scanned_pages == lpdead_item_pages" when I looked at the locked
> up Postgres 15 production instance (or close to it). I can't tell if
> your dead_items is shown as empty, too, but if it is then that
> indicates that the heap page that your lazy_scan_prune trace relates
> to is the first page scanned after a round of index vacuuming (the
> first round).
>
> If you can show more about dead_items, then it will either confirm or
> disprove my theory about it being the same issue.

(gdb) p *vacrel->dead_items
$2 = {max_items = 11184809, num_items = 0, items = 0x7fa22f48d048}
(gdb) p *vacrel->dead_items->items
$3 = {ip_blkid = {bi_hi = 0, bi_lo = 0}, ip_posid = 1}

The num_items is zero.


--
Regards
Bowen Shi

Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Peter Geoghegan
Date:
On Mon, May 27, 2024 at 11:32 PM Bowen Shi <zxwsbg12138@gmail.com> wrote:
>> If you can show more about dead_items, then it will either confirm or
>> disprove my theory about it being the same issue.
>
>
> (gdb) p *vacrel->dead_items
> $2 = {max_items = 11184809, num_items = 0, items = 0x7fa22f48d048}
> (gdb) p *vacrel->dead_items->items
> $3 = {ip_blkid = {bi_hi = 0, bi_lo = 0}, ip_posid = 1}
>
> The num_items is zero.

That is what I suspected. And so the issue you see looks exactly the
same as the one that I saw when debugging a production instance.

--
Peter Geoghegan



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Bowen Shi
Date:
Hi,

I didn't have time to check the fix_hang_15.patch until today.

On Thu, May 23, 2024 at 12:57 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
> On Mon, May 20, 2024 at 4:48 PM Noah Misch <noah@leadboat.com> wrote:
> >
> > On Mon, May 20, 2024 at 11:58:23AM -0400, Melanie Plageman wrote:
> > > On Sat, May 18, 2024 at 6:23 PM Noah Misch <noah@leadboat.com> wrote:
> > > > Are there obstacles to fixing the hang by back-patching 1ccc1e05ae instead of
> > > > this?  We'll need to get confident about 1ccc1e05ae before v17, and that
> > > > sounds potentially easier than getting confident about both 1ccc1e05ae and
> > > > this other approach.
> > >
> > > I haven't tried back-patching 1ccc1e05ae yet, but I don't understand
> > > why we would want to use stable back branches to get comfortable with
> > > an approach committed to an unreleased version of Postgres.
> >
> > I wouldn't say we want to use stable back branches to get comfortable with an
> > approach.  I wanted to say that it's less work to be confident about one new
> > way of doing things than two new ways of doing things.
>
> I think I understand your point better now.
>
> > > The small fix proposed in this thread is extremely minimal and
> > > straightforward. It seems much less risky as a backpatch.
> >
> > Here's how I model the current and proposed code:
> >
> > 1. v15 VACUUM removes tuples that are HEAPTUPLE_DEAD according to VisTest.
> >    OldestXmin doesn't cause tuple removal, but there's a hang when OldestXmin
> >    rules DEAD after VisTest ruled RECENTLY_DEAD.
> >
> > 2. With 1ccc1e05ae, v17 VACUUM still removes tuples that are HEAPTUPLE_DEAD
> >    according to VisTest only.  OldestXmin doesn't come into play.
> >
> > 3. fix_hang_15.patch would make v15 VACUUM remove tuples that are
> >    HEAPTUPLE_DEAD according to _either_ VisTest or OldestXmin.
> >
> > Since (3) is the only list entry that removes tuples that VisTest ruled
> > RECENTLY_DEAD, I find it higher risk.  For all three, the core task of
> > certifying the behavior is confirming that its outcome is sound when VisTest
> > and OldestXmin disagree.  How do you model it?
>
> Okay, I see your point. In 1 and 2, tuples that would have been
> considered HEAPTUPLE_DEAD at the beginning of vacuum but are
> considered HEAPTUPLE_RECENTLY_DEAD when pruning them are not removed.
> In 3, tuples that would have been considered HEAPTUPLE_DEAD at the
> beginning of vacuum are always removed, regardless of whether or not
> they would be considered HEAPTUPLE_RECENTLY_DEAD when pruning them.
>
> One option is to add the logic in fix_hang_15.patch to master as well
> (always remove tuples older than OldestXmin). This addresses your
> concern about gaining confidence in a single solution.
>
> However, I can see how removing more tuples could be concerning. In
> the case that the horizon moves backwards because of a standby
> reconnecting, I think the worst case is that removing that tuple
> causes a recovery conflict on the standby (depending on the value of
> max_standby_streaming_delay et al).

What would happen if we simply skipped the current page when we find that the vacuum process has entered the infinite loop (using a counter)?

--
Regards
Bowen Shi

Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Bowen Shi
Date:
Hi,

On Thu, May 23, 2024 at 12:57 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
> One option is to add the logic in fix_hang_15.patch to master as well
> (always remove tuples older than OldestXmin). This addresses your
> concern about gaining confidence in a single solution.
>
> However, I can see how removing more tuples could be concerning. In
> the case that the horizon moves backwards because of a standby
> reconnecting, I think the worst case is that removing that tuple
> causes a recovery conflict on the standby (depending on the value of
> max_standby_streaming_delay et al).

I'm confused about this part. The comment above OldestXmin says:
/*
 * OldestXmin is the Xid below which tuples deleted by any xact (that
 * committed) should be considered DEAD, not just RECENTLY_DEAD.
 */
With the fix_hang_15 patch, why is there a risk here when we use OldestXmin to judge whether a tuple can be removed?

I think the key point is: which of OldestXmin and VisTest is more accurate when we judge whether to remove the tuple?



--
Regards
Bowen Shi

Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Melanie Plageman
Date:
On Tue, May 28, 2024 at 4:29 AM Bowen Shi <zxwsbg12138@gmail.com> wrote:
>
> Hi,
>
> I didn't have time to check the fix_hang_15.patch until today.

Did it end up working for you?

> What would happen if we simply skipped the current page when we find that
> the vacuum process has entered the infinite loop (using a counter)?

Skipping the page would mean we can't advance relfrozenxid for the
table on this vacuum. Since we have ways to fix it, it seems better to
do that.

- Melanie



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Melanie Plageman
Date:
On Tue, May 28, 2024 at 5:03 AM Bowen Shi <zxwsbg12138@gmail.com> wrote:
>
> On Thu, May 23, 2024 at 12:57 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
>>
>> One option is to add the logic in fix_hang_15.patch to master as well
>> (always remove tuples older than OldestXmin). This addresses your
>> concern about gaining confidence in a single solution.
>>
>> However, I can see how removing more tuples could be concerning. In
>> the case that the horizon moves backwards because of a standby
>> reconnecting, I think the worst case is that removing that tuple
>> causes a recovery conflict on the standby (depending on the value of
>> max_standby_streaming_delay et al).
>
>
> I'm confused about this part. The comment above OldestXmin says:
> /*
>  * OldestXmin is the Xid below which tuples deleted by any xact (that
>  * committed) should be considered DEAD, not just RECENTLY_DEAD.
>  */
> With the fix_hang_15 patch, why is there a risk here when we use OldestXmin to judge whether a tuple can be removed?

The risk is not so much about whether or not it is okay to remove the
tuple. At least for the case of a lagging standby reconnecting during
vacuum and moving maybe_needed backwards, there is no real danger in
removing tuples considered non-removable when compared to the new
value of maybe_needed. The standby will not replay the tuple removal
until the tuple is removable. If the horizon doesn't move forward on
the standby after some time, a recovery conflict will occur --
canceling the culprit holding back the horizon and allowing the vacuum
to proceed.

I have not, however, investigated the second way maybe_needed may go
backwards cited by Matthias in [1] to see if it is similarly okay to
remove tuples older than OldestXmin but newer than maybe_needed in
this case.

I think Noah's concern about using OldestXmin is more that it is a
different behavior than what we have on master. Picking one solution
and standardizing on it would reduce maintenance complexity across all
branches.

You make a good point about the comment though. If OldestXmin is only
used during vacuum for freezing tuples and determining full page
visibility status, then we should update the comment above it in
VacuumCutoffs and remove mention of HEAPTUPLE_DEAD.

I'll propose both this comment update and the one Noah suggested to
heap_vacuum_rel() against master in a separate thread. Then we can
backport the comments (as relevant) when we fix the back branches.

> I think the key point is: which of OldestXmin and VisTest is more accurate when we judge whether to remove the tuple?

GlobalVisState will have a potentially more up-to-date horizon value
than OldestXmin when determining whether or not to remove the tuple.
But I wouldn't call it incorrect to remove a tuple older than
OldestXmin but younger than maybe_needed.

- Melanie

[1]
https://www.postgresql.org/message-id/CAEze2WjMTh4KS0%3DQEQB-Jq%2BtDLPR%2B0%2BzVBMfVwSPK5A%3DWZa95Q%40mail.gmail.com



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Melanie Plageman
Date:
On Wed, May 22, 2024 at 12:57 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Mon, May 20, 2024 at 4:48 PM Noah Misch <noah@leadboat.com> wrote:
> >
> > On Mon, May 20, 2024 at 11:58:23AM -0400, Melanie Plageman wrote:
> > > On Sat, May 18, 2024 at 6:23 PM Noah Misch <noah@leadboat.com> wrote:
> > > > Are there obstacles to fixing the hang by back-patching 1ccc1e05ae instead of
> > > > this?  We'll need to get confident about 1ccc1e05ae before v17, and that
> > > > sounds potentially easier than getting confident about both 1ccc1e05ae and
> > > > this other approach.
> > >
> > > I haven't tried back-patching 1ccc1e05ae yet, but I don't understand
> > > why we would want to use stable back branches to get comfortable with
> > > an approach committed to an unreleased version of Postgres.
> >
> > I wouldn't say we want to use stable back branches to get comfortable with an
> > approach.  I wanted to say that it's less work to be confident about one new
> > way of doing things than two new ways of doing things.
>
> I think I understand your point better now.
>
> > > The small fix proposed in this thread is extremely minimal and
> > > straightforward. It seems much less risky as a backpatch.
> >
> > Here's how I model the current and proposed code:
> >
> > 1. v15 VACUUM removes tuples that are HEAPTUPLE_DEAD according to VisTest.
> >    OldestXmin doesn't cause tuple removal, but there's a hang when OldestXmin
> >    rules DEAD after VisTest ruled RECENTLY_DEAD.
> >
> > 2. With 1ccc1e05ae, v17 VACUUM still removes tuples that are HEAPTUPLE_DEAD
> >    according to VisTest only.  OldestXmin doesn't come into play.
> >
> > 3. fix_hang_15.patch would make v15 VACUUM remove tuples that are
> >    HEAPTUPLE_DEAD according to _either_ VisTest or OldestXmin.
> >
> > Since (3) is the only list entry that removes tuples that VisTest ruled
> > RECENTLY_DEAD, I find it higher risk.  For all three, the core task of
> > certifying the behavior is confirming that its outcome is sound when VisTest
> > and OldestXmin disagree.  How do you model it?
>
> Okay, I see your point. In 1 and 2, tuples that would have been
> considered HEAPTUPLE_DEAD at the beginning of vacuum but are
> considered HEAPTUPLE_RECENTLY_DEAD when pruning them are not removed.
> In 3, tuples that would have been considered HEAPTUPLE_DEAD at the
> beginning of vacuum are always removed, regardless of whether or not
> they would be considered HEAPTUPLE_RECENTLY_DEAD when pruning them.
>
> One option is to add the logic in fix_hang_15.patch to master as well
> (always remove tuples older than OldestXmin). This addresses your
> concern about gaining confidence in a single solution.
>
> However, I can see how removing more tuples could be concerning. In
> the case that the horizon moves backwards because of a standby
> reconnecting, I think the worst case is that removing that tuple
> causes a recovery conflict on the standby (depending on the value of
> max_standby_streaming_delay et al).
>
> I'll experiment with applying 1ccc1e05ae to 14 and 15 and see how it goes.

I ended up manually backporting the logic from 1ccc1e05ae as opposed
to cherry-picking because it relied on a struct introduced in
4e9fc3a9762065.  Attached is a patch set with this backport against
REL_15_STABLE. The first patch is an updated repro (now even more
minimal) with copious additional comments. I am not proposing we add
this as an ongoing test. It won't be stable. It is purely for
illustration.
The fix's commit message still needs editing and citations.

My repro no longer works against REL_14_STABLE, though I was able to
backport the fix there. I'll investigate that.

There are some comments that should be updated around OldestXmin if we
go with this approach. I think they should be updated first in master
and then backported. So, I'll start a thread on those.

Finally, upthread there is discussion of how we could end up doing a
catalog lookup after vacuum_get_cutoffs() and before the tuple
visibility check on 16. Assuming this is true, we would want to
backport the fix to 16 as well. I could use some help getting a repro
(using btree index deletion for example) of the infinite loop on 16.

- Melanie

Attachment

Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Melanie Plageman
Date:
On Tue, Jun 18, 2024 at 6:51 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> I ended up manually backporting the logic from 1ccc1e05ae as opposed
> to cherry-picking because it relied on a struct introduced in
> 4e9fc3a9762065.  Attached is a patch set with this backport against
> REL_15_STABLE. The first patch is an updated repro (now even more
> minimal) with copious additional comments. I am not proposing we add
> this as an ongoing test. It won't be stable. It is purely for
> illustration.
> The fix's commit message still needs editing and citations.
>
> My repro no longer works against REL_14_STABLE, though I was able to
> backport the fix there. I'll investigate that.

I figured out why it wasn't repro-ing on 14 -- just a timing issue. I
threw in a sleep for now. There is also something I can do with
pg_stat_progress_vacuum's phase column in the repro, but that was
harder without poll_query_until() (added in 16).

> Finally, upthread there is discussion of how we could end up doing a
> catalog lookup after vacuum_get_cutoffs() and before the tuple
> visibility check on 16. Assuming this is true, we would want to
> backport the fix to 16 as well. I could use some help getting a repro
> (using btree index deletion for example) of the infinite loop on 16.

So, I ended up working on a new repro that works by forcing a round of
index vacuuming after the standby reconnects and before pruning a dead
tuple whose xmax is older than OldestXmin.

At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
calls GetOldestNonRemovableTransactionId(), thereby updating the
backend's GlobalVisState and moving maybe_needed backwards.

Then vacuum's first pass will continue with pruning and find our later
inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to
maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin.

I make sure that the standby reconnects between vacuum_get_cutoffs()
(vacuum_set_xid_limits() on 14/15) and pruning because I have a cursor
on the page keeping VACUUM FREEZE from getting a cleanup lock.

See the repros for step-by-step explanations of how it works.

With this, I can repro the infinite loop on 14-16.

Backporting 1ccc1e05ae fixes 16 but, with the new repro, 14 and 15
error out with "cannot freeze committed xmax". I'm going to
investigate further why this is happening. It definitely makes me
wonder about the fix.

Attached is the backport and repros for 15 and 16. Note that because
of differences in background_psql, the Perl test had to be different on
15 than on 16, so you'll have to use the repro targeted at the correct
version.

- Melanie

Attachment

Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Melanie Plageman
Date:
On Thu, Jun 20, 2024 at 11:49 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Tue, Jun 18, 2024 at 6:51 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > Finally, upthread there is discussion of how we could end up doing a
> > catalog lookup after vacuum_get_cutoffs() and before the tuple
> > visibility check on 16. Assuming this is true, we would want to
> > backport the fix to 16 as well. I could use some help getting a repro
> > (using btree index deletion for example) of the infinite loop on 16.
>
> So, I ended up working on a new repro that works by forcing a round of
> index vacuuming after the standby reconnects and before pruning a dead
> tuple whose xmax is older than OldestXmin.
>
> At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
> calls GetOldestNonRemovableTransactionId(), thereby updating the
> backend's GlobalVisState and moving maybe_needed backwards.
>
> Then vacuum's first pass will continue with pruning and find our later
> inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to
> maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin.
>
> I make sure that the standby reconnects between vacuum_get_cutoffs()
> (vacuum_set_xid_limits() on 14/15) and pruning because I have a cursor
> on the page keeping VACUUM FREEZE from getting a cleanup lock.
>
> See the repros for step-by-step explanations of how it works.
>
> With this, I can repro the infinite loop on 14-16.
>
> Backporting 1ccc1e05ae fixes 16 but, with the new repro, 14 and 15
> error out with "cannot freeze committed xmax". I'm going to
> investigate further why this is happening. It definitely makes me
> wonder about the fix.

It turns out it was also erroring out on 16 (i.e. backporting
1ccc1e05ae did not fix anything), but I didn't notice it because the
perl TAP test passed. I also discovered we can hit this error in
master, so I started a thread about that here [1].

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_bDD7oq9ZwB2OJqub5BovMG6UjEYsoK2LVttadjEqyRGg%40mail.gmail.com



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Noah Misch
Date:
On Thu, Jun 20, 2024 at 11:49:50AM -0400, Melanie Plageman wrote:
> On Tue, Jun 18, 2024 at 6:51 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> > I ended up manually backporting the logic from 1ccc1e05ae as opposed
> > to cherry-picking because it relied on a struct introduced in
> > 4e9fc3a9762065.

> Attached is the backport and repros for 15 and 16.

> When this happends, a tuple with an xmax older than OldestXmin but newer

s/happends/happens/

> @@ -700,7 +707,7 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
>              break;
>  
>          Assert(ItemIdIsNormal(lp));
> -        Assert(prstate->htsv[offnum] != -1);
> +        Assert(htsv[offnum] != -1);

The master patch removed this assert instead.  Is the back branch non-removal
deliberate?  If so, the patch is ready.  I also confirm your results from your
repro recipe; thanks for including that.



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Melanie Plageman
Date:
On Tue, Jun 25, 2024 at 3:37 PM Noah Misch <noah@leadboat.com> wrote:
>
> On Thu, Jun 20, 2024 at 11:49:50AM -0400, Melanie Plageman wrote:
> > On Tue, Jun 18, 2024 at 6:51 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> > > I ended up manually backporting the logic from 1ccc1e05ae as opposed
> > > to cherry-picking because it relied on a struct introduced in
> > > 4e9fc3a9762065.
>
> > Attached is the backport and repros for 15 and 16.

I think we are going with the fix proposed for master [1] which
compares dead_after to OldestXmin before using GlobalVisState.
Backporting 1ccc1e05ae doesn't actually fix the problem. We just end
up erroring out when attempting to freeze the tuple we didn't remove.

As such, attached is my proposed fix for affected stable branches. It
is based off of the fix proposed in [1] but is a bit different in each
version due to surrounding code changes.
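
For anyone following along without reading the patches, the shape of the fix
is roughly this (a sketch with a hypothetical function name, assuming the
usual backend headers; the attached per-branch patches are the real thing):

    /*
     * Sketch only: when VACUUM decides whether a deleted tuple is removable,
     * consult the OldestXmin cutoff taken at the start of VACUUM before
     * falling back to GlobalVisState, which may have moved backwards while
     * VACUUM was running.
     */
    static HTSV_Result
    vacuum_htsv_for_dead_after(TransactionId dead_after,
                               TransactionId oldest_xmin,
                               GlobalVisState *vistest)
    {
        if (TransactionIdIsValid(oldest_xmin) &&
            TransactionIdPrecedes(dead_after, oldest_xmin))
            return HEAPTUPLE_DEAD;

        if (GlobalVisTestIsRemovableXid(vistest, dead_after))
            return HEAPTUPLE_DEAD;

        return HEAPTUPLE_RECENTLY_DEAD;
    }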

The test I added passes locally and on Linux and Windows in CI (on 15+,
which have CI). I don't have enough Cirrus credits to run the tests on
macOS. I am nervous about the test flaking on the buildfarm. But I did
the best I could to try to make it stable. I think keeping it as a
separate commit should be easiest in case we have to revert it?

Thanks to Heikki for backporting BackgroundPsql -- this made my life
much easier!!

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_Z4PybtZ0i_NKOr-vbrFW5p1ZdfEfUqaeU8fLPhszpP_g%40mail.gmail.com

Attachment

Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Noah Misch
Date:
On Mon, Jul 08, 2024 at 05:23:52PM -0400, Melanie Plageman wrote:
> I am nervous about the test flaking on the buildfarm. But I did
> the best I could to try to make it stable. I think keeping it as a
> separate commit should be easiest in case we have to revert it?

Whether one uses a separate commit or not, it's easy to remove a new test.  I
think either way is fine.  We've not typically committed tests separately.



Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From
Michael Paquier
Date:
On Thu, Jul 11, 2024 at 02:16:08PM -0700, Noah Misch wrote:
> On Mon, Jul 08, 2024 at 05:23:52PM -0400, Melanie Plageman wrote:
>> I am nervous about the test flaking on the buildfarm. But I did
>> the best I could to try to make it stable. I think keeping it as a
>> separate commit should be easiest in case we have to revert it?
>
> Whether one uses a separate commit or not, it's easy to remove a new test.  I
> think either way is fine.  We've not typically committed tests separately.

Depends.  Melanie, if you are nervous about the test's stability, I'd
suggest doing what you think is better and separating it into two
commits to ease your flow.  That will limit the damage should the test
need to be reverted.  I'm pretty sure I have done that more than once
in the past when I had similar concerns.
--
Michael

Attachment