Thread: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum

BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum

From: PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      17255
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 14.0
Operating system:   Ubuntu 20.04
Description:

The following scenario:
###
createdb regression
export PGDATABASE=regression

echo "
-- excerpt from inherit.sql
CREATE TABLE errtst_parent (
    partid int not null,
    shdata int not null,
    data int NOT NULL DEFAULT 0,
    CONSTRAINT shdata_small CHECK(shdata < 3)
) PARTITION BY RANGE (partid);

CREATE TABLE errtst_child_plaindef (
    partid int not null,
    shdata int not null,
    data int NOT NULL DEFAULT 0,
    CONSTRAINT shdata_small CHECK(shdata < 3),
    CHECK(data < 10)
);

CREATE TABLE errtst_child_reorder (
    data int NOT NULL DEFAULT 0,
    shdata int not null,
    partid int not null,
    CONSTRAINT shdata_small CHECK(shdata < 3),
    CHECK(data < 10)
);

ALTER TABLE errtst_parent ATTACH PARTITION errtst_child_plaindef FOR VALUES
FROM (10) TO (20);
ALTER TABLE errtst_parent ATTACH PARTITION errtst_child_reorder FOR VALUES
FROM (20) TO (30);

DROP TABLE errtst_parent;
" >/tmp/mini-inherit.sql

echo "
-- excerpt from vacuum.sql
CREATE TEMPORARY TABLE tmp (a int PRIMARY KEY);
SELECT pg_sleep(random()/500);
CREATE INDEX tmp_idx1 ON tmp (a);
" >/tmp/mini-vacuum.sql

echo "
VACUUM (skip_locked, index_cleanup off) pg_catalog.pg_class;
SELECT pg_sleep(random()/500);
" >/tmp/pseudo-autovacuum.sql

pgbench -n -f /tmp/mini-vacuum.sql -f /tmp/pseudo-autovacuum.sql -C -c 40 -T 600 >/dev/null 2>&1 &
pgbench -n -f /tmp/mini-inherit.sql -C -c 1 -T 600 >/dev/null 2>&1 &

wait
###
with the following settings in postgresql.conf:
autovacuum=off
fsync=off

causes a server crash:
TIME                            PID   UID   GID SIG COREFILE  EXE
Fri 2021-10-29 09:37:09 MSK  383805  1000  1000   6 present
.../usr/local/pgsql/bin/postgres

real    3m20,335s
user    0m7,245s
sys     0m8,306s

with the following stack:
Core was generated by `postgres: law regression [local] CREATE INDEX
                        '.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f8a7f97a859 in __GI_abort () at abort.c:79
#2  0x0000562dabb49700 in index_delete_sort_cmp (deltid2=<synthetic
pointer>, deltid1=<optimized out>) at heapam.c:7582
#3  index_delete_sort (delstate=0x7fff6f609f10, delstate=0x7fff6f609f10) at
heapam.c:7623
#4  heap_index_delete_tuples (rel=0x7f8a76523e08, delstate=0x7fff6f609f10)
at heapam.c:7296
#5  0x0000562dabc5519a in table_index_delete_tuples
(delstate=0x7fff6f609f10, rel=0x562dac23d6c2)
    at ../../../../src/include/access/tableam.h:1327
#6  _bt_delitems_delete_check (rel=rel@entry=0x7f8a7652cc80,
buf=buf@entry=191, heapRel=heapRel@entry=0x7f8a76523e08, 
    delstate=delstate@entry=0x7fff6f609f10) at nbtpage.c:1541
#7  0x0000562dabc4dbe1 in _bt_simpledel_pass (maxoff=<optimized out>,
minoff=<optimized out>, newitem=<optimized out>, 
    ndeletable=55, deletable=0x7fff6f609f30, heapRel=0x7f8a76523e08,
buffer=191, rel=0x7f8a7652cc80)
    at nbtinsert.c:2899
#8  _bt_delete_or_dedup_one_page (rel=0x7f8a7652cc80,
heapRel=0x7f8a76523e08, insertstate=0x7fff6f60a340, 
    simpleonly=<optimized out>, checkingunique=<optimized out>,
uniquedup=<optimized out>, indexUnchanged=false)
    at nbtinsert.c:2712
#9  0x0000562dabc523f3 in _bt_findinsertloc (heapRel=0x7f8a76523e08,
stack=0x562dad8c1320, indexUnchanged=false, 
    checkingunique=true, insertstate=0x7fff6f60a340, rel=0x7f8a7652cc80) at
nbtinsert.c:904
#10 _bt_doinsert (rel=rel@entry=0x7f8a7652cc80,
itup=itup@entry=0x562dad8b9f20, 
    checkUnique=checkUnique@entry=UNIQUE_CHECK_YES,
indexUnchanged=indexUnchanged@entry=false, 
    heapRel=heapRel@entry=0x7f8a76523e08) at nbtinsert.c:255
#11 0x0000562dabc58451 in btinsert (rel=0x7f8a7652cc80, values=<optimized
out>, isnull=<optimized out>, 
    ht_ctid=0x562dad81f7d4, heapRel=0x7f8a76523e08,
checkUnique=UNIQUE_CHECK_YES, indexUnchanged=false, 
    indexInfo=0x562dad8b9ad8) at nbtree.c:199
#12 0x0000562dabcc0f14 in CatalogIndexInsert
(indstate=indstate@entry=0x562dad8c0fb0, 
    heapTuple=heapTuple@entry=0x562dad81f7d0) at indexing.c:158
#13 0x0000562dabcc119f in CatalogTupleInsert (heapRel=0x7f8a76523e08,
tup=0x562dad81f7d0) at indexing.c:231
#14 0x0000562dabcb92a6 in InsertPgClassTuple
(pg_class_desc=pg_class_desc@entry=0x7f8a76523e08, 
    new_rel_desc=new_rel_desc@entry=0x7f8a7654cd28, new_rel_oid=<optimized
out>, relacl=relacl@entry=0, 
    reloptions=reloptions@entry=0) at heap.c:986
#15 0x0000562dabcbec8d in index_create
(heapRelation=heapRelation@entry=0x7f8a7654c6a0, 
    indexRelationName=indexRelationName@entry=0x562dad7f7ff8 "tmp_idx1",
indexRelationId=571743, 
    indexRelationId@entry=0, parentIndexRelid=parentIndexRelid@entry=0,
parentConstraintId=parentConstraintId@entry=0, 
    relFileNode=<optimized out>, indexInfo=<optimized out>,
indexColNames=<optimized out>, 
    accessMethodObjectId=<optimized out>, tableSpaceId=<optimized out>,
collationObjectId=<optimized out>, 
    classObjectId=<optimized out>, coloptions=<optimized out>,
reloptions=<optimized out>, flags=<optimized out>, 
    constr_flags=<optimized out>, allow_system_table_mods=<optimized out>,
is_internal=<optimized out>, 
    constraintId=<optimized out>) at index.c:968
#16 0x0000562dabd533f8 in DefineIndex (relationId=relationId@entry=571674,
stmt=stmt@entry=0x562dad7f8168, 
    indexRelationId=indexRelationId@entry=0,
parentIndexId=parentIndexId@entry=0, 
    parentConstraintId=parentConstraintId@entry=0,
is_alter_table=is_alter_table@entry=false, check_rights=true, 
    check_not_in_use=true, skip_build=false, quiet=false) at
indexcmds.c:1137
#17 0x0000562dabf41217 in ProcessUtilitySlow (pstate=0x562dad8c0c80,
pstmt=0x562dad7f8518, 
    queryString=0x562dad7f75e0 "CREATE INDEX tmp_idx1 ON tmp (a);",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, 
    queryEnv=0x0, qc=0x7fff6f60b2a0, dest=<optimized out>) at
utility.c:1534
#18 0x0000562dabf3fd23 in standard_ProcessUtility (pstmt=0x562dad7f8518, 
    queryString=0x562dad7f75e0 "CREATE INDEX tmp_idx1 ON tmp (a);",
readOnlyTree=<optimized out>, 
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x562dad7f8608, qc=0x7fff6f60b2a0)
    at utility.c:1066
#19 0x0000562dabf3e3f1 in PortalRunUtility
(portal=portal@entry=0x562dad85b040, pstmt=pstmt@entry=0x562dad7f8518, 
    isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x562dad7f8608,
    qc=qc@entry=0x7fff6f60b2a0) at pquery.c:1155
#20 0x0000562dabf3e52d in PortalRunMulti
(portal=portal@entry=0x562dad85b040, isTopLevel=isTopLevel@entry=true, 
    setHoldSnapshot=setHoldSnapshot@entry=false,
dest=dest@entry=0x562dad7f8608, altdest=altdest@entry=0x562dad7f8608, 
    qc=qc@entry=0x7fff6f60b2a0) at pquery.c:1312
#21 0x0000562dabf3ebc9 in PortalRun (portal=portal@entry=0x562dad85b040,
count=count@entry=9223372036854775807, 
    isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
dest=dest@entry=0x562dad7f8608, 
    altdest=altdest@entry=0x562dad7f8608, qc=0x7fff6f60b2a0) at
pquery.c:788
#22 0x0000562dabf3a93b in exec_simple_query (query_string=0x562dad7f75e0
"CREATE INDEX tmp_idx1 ON tmp (a);")
    at postgres.c:1214
#23 0x0000562dabf3c541 in PostgresMain (argc=argc@entry=1,
argv=argv@entry=0x7fff6f60b710, dbname=<optimized out>, 
    username=<optimized out>) at postgres.c:4486
#24 0x0000562dabea84bd in BackendRun (port=0x562dad81be40,
port=0x562dad81be40) at postmaster.c:4506
#25 BackendStartup (port=0x562dad81be40) at postmaster.c:4228
#26 ServerLoop () at postmaster.c:1745
#27 0x0000562dabea9461 in PostmasterMain (argc=<optimized out>,
argv=<optimized out>) at postmaster.c:1417
#28 0x0000562dabbd72e2 in main (argc=3, argv=0x562dad7f18a0) at main.c:209

Discovered while hunting for another bug related to autovacuum
(unfortunately, I still can't produce a reliable reproduction script for
that one).

Best regards,
Alexander


> On Fri, Oct 29, 2021 at 07:00:01AM +0000, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference:      17255
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 14.0
> Operating system:   Ubuntu 20.04
> Description:
>
> with the following stack:
> Core was generated by `postgres: law regression [local] CREATE INDEX
>                         '.
> Program terminated with signal SIGABRT, Aborted.
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> 50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1  0x00007f8a7f97a859 in __GI_abort () at abort.c:79
> #2  0x0000562dabb49700 in index_delete_sort_cmp (deltid2=<synthetic
> pointer>, deltid1=<optimized out>) at heapam.c:7582
> #3  index_delete_sort (delstate=0x7fff6f609f10, delstate=0x7fff6f609f10) at
> heapam.c:7623
> #4  heap_index_delete_tuples (rel=0x7f8a76523e08, delstate=0x7fff6f609f10)
> at heapam.c:7296
> #5  0x0000562dabc5519a in table_index_delete_tuples
> (delstate=0x7fff6f609f10, rel=0x562dac23d6c2)
>     at ../../../../src/include/access/tableam.h:1327
> #6  _bt_delitems_delete_check (rel=rel@entry=0x7f8a7652cc80,
> buf=buf@entry=191, heapRel=heapRel@entry=0x7f8a76523e08,
>     delstate=delstate@entry=0x7fff6f609f10) at nbtpage.c:1541
> #7  0x0000562dabc4dbe1 in _bt_simpledel_pass (maxoff=<optimized out>,
> minoff=<optimized out>, newitem=<optimized out>,
>     ndeletable=55, deletable=0x7fff6f609f30, heapRel=0x7f8a76523e08,
> buffer=191, rel=0x7f8a7652cc80)
>     at nbtinsert.c:2899
> ...
>
> Discovered while hunting to another bug related to autovacuum (unfortunately
> I still can't produce the reliable reproducing script for that).

Thanks for the report (in fact, I'm impressed by how many issues you've
discovered -- hopefully there are at least some "I've found X bugs in
PostgreSQL" t-shirts available as a reward) and for putting effort into
the reproduction steps. I believe I've managed to reproduce at least a
similar crash with the same trace.

In my case it crashed on pg_unreachable() (which is an abort when asserts
are enabled) inside index_delete_sort_cmp(). It seems that both item
pointers being compared have the same block and offset number. In view of
the recent discussions I thought it could somehow be related to the
issues with duplicate TIDs, but delstate->deltids doesn't in fact contain
any duplicate entries -- so I'm not sure about that, and am still
investigating the core dump.
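
For reference, the comparator in question looks roughly like this (a
paraphrase of index_delete_sort_cmp() in heapam.c, not the verbatim
source; details vary by version). The final pg_unreachable() encodes the
assumption that no two deltids entries ever carry the same TID -- which
is exactly the assumption that gets violated here:

    /* Paraphrased sketch of index_delete_sort_cmp() */
    static inline int
    index_delete_sort_cmp(TM_IndexDelete *deltid1, TM_IndexDelete *deltid2)
    {
        ItemPointer  tid1 = &deltid1->tid;
        ItemPointer  tid2 = &deltid2->tid;
        BlockNumber  blk1 = ItemPointerGetBlockNumber(tid1);
        BlockNumber  blk2 = ItemPointerGetBlockNumber(tid2);
        OffsetNumber pos1 = ItemPointerGetOffsetNumber(tid1);
        OffsetNumber pos2 = ItemPointerGetOffsetNumber(tid2);

        if (blk1 != blk2)
            return (blk1 < blk2) ? -1 : 1;
        if (pos1 != pos2)
            return (pos1 < pos2) ? -1 : 1;

        /* Two entries with equal TIDs are never expected */
        pg_unreachable();

        return 0;
    }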



Hi,

On 2021-10-29 16:55:32 +0200, Dmitry Dolgov wrote:
> > On Fri, Oct 29, 2021 at 07:00:01AM +0000, PG Bug reporting form wrote:
> > The following bug has been logged on the website:
> >
> > Bug reference:      17255
> > Logged by:          Alexander Lakhin
> > Email address:      exclusion@gmail.com
> > PostgreSQL version: 14.0
> > Operating system:   Ubuntu 20.04
> > Description:
> >
> > with the following stack:
> > Core was generated by `postgres: law regression [local] CREATE INDEX
> >                         '.
> > Program terminated with signal SIGABRT, Aborted.
> > #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> > 50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> > (gdb) bt
> > #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> > #1  0x00007f8a7f97a859 in __GI_abort () at abort.c:79
> > #2  0x0000562dabb49700 in index_delete_sort_cmp (deltid2=<synthetic
> > pointer>, deltid1=<optimized out>) at heapam.c:7582
> > #3  index_delete_sort (delstate=0x7fff6f609f10, delstate=0x7fff6f609f10) at
> > heapam.c:7623
> > #4  heap_index_delete_tuples (rel=0x7f8a76523e08, delstate=0x7fff6f609f10)
> > at heapam.c:7296
> > #5  0x0000562dabc5519a in table_index_delete_tuples
> > (delstate=0x7fff6f609f10, rel=0x562dac23d6c2)
> >     at ../../../../src/include/access/tableam.h:1327
> > #6  _bt_delitems_delete_check (rel=rel@entry=0x7f8a7652cc80,
> > buf=buf@entry=191, heapRel=heapRel@entry=0x7f8a76523e08,
> >     delstate=delstate@entry=0x7fff6f609f10) at nbtpage.c:1541
> > #7  0x0000562dabc4dbe1 in _bt_simpledel_pass (maxoff=<optimized out>,
> > minoff=<optimized out>, newitem=<optimized out>,
> >     ndeletable=55, deletable=0x7fff6f609f30, heapRel=0x7f8a76523e08,
> > buffer=191, rel=0x7f8a7652cc80)
> >     at nbtinsert.c:2899
> > ...
> >
> > Discovered while hunting to another bug related to autovacuum (unfortunately
> > I still can't produce the reliable reproducing script for that).
> 
> Thanks for reporting (in fact I'm impressed how many issues you've
> discovered, hopefully there are at least some t-shirts "I've found X
> bugs in PostgreSQL" available as a reward) and putting efforts into the
> reproducing steps. I believe I've managed to reproduce at least a
> similar crash with the same trace.
> 
> In my case it crashed on pg_unreachable (which is an abort, when asserts
> are enabled) inside index_delete_sort_cmp. It seems like item pointers
> to compare both have the same block and offset number. In the view of
> the recent discussions I was thinking it could be somehow related to the
> issues with duplicated TIDs, but delstate->deltids doesn't in fact have
> any duplicated entries -- so not sure about that, still investigating
> the core dump.

I suspect this is the same bug as #17245. Could you check if it's fixed by
https://www.postgresql.org/message-id/CAH2-WzkN5aESSLfK7-yrYgsXxYUi__VzG4XpZFwXm98LUtoWuQ%40mail.gmail.com

The crash is somewhere in pg_class, which the test also VACUUMs manually, so
it could trigger the issue we found in the other thread. The likely reason
the loop in the repro is needed is that it pushes one of the indexes on
pg_class over the 512kB min_parallel_index_scan_size boundary, at which point
vacuum starts using parallel index vacuuming.

Greetings,

Andres Freund



On Sat, Oct 30, 2021 at 2:39 PM Andres Freund <andres@anarazel.de> wrote:
> The crash is somewhere in pg_class, which is also manually VACUUMed by the
> test, which could trigger the issue we found in the other thread. The likely
> reason the loop in the repro is needed is that that'll push one of the indexes
> on pg_class over the 512kb/min_parallel_index_scan_size boundary to start
> using paralell vacuum.

Quite possible, but I can't rule out the possibility that it's
actually this other bug, since as you say this is pg_class, a system
catalog index:

https://www.postgresql.org/message-id/CAH2-WzkjjCoq5Y4LeeHJcjYJVxGm3M3SAWZ0%3D6J8K1FPSC9K0w%40mail.gmail.com

I am in too much of a hurry right now to spend the 5 minutes it would
take to confirm either way. (Even if it's not this other bug, I needed
to remind you of its existence anyway.)

-- 
Peter Geoghegan



> On Sat, Oct 30, 2021 at 02:39:48PM -0700, Andres Freund wrote:
> > In my case it crashed on pg_unreachable (which is an abort, when asserts
> > are enabled) inside index_delete_sort_cmp. It seems like item pointers
> > to compare both have the same block and offset number. In the view of
> > the recent discussions I was thinking it could be somehow related to the
> > issues with duplicated TIDs, but delstate->deltids doesn't in fact have
> > any duplicated entries -- so not sure about that, still investigating
> > the core dump.
>
> I suspect this is the same bug as #17245. Could you check if it's fixed by
> https://www.postgresql.org/message-id/CAH2-WzkN5aESSLfK7-yrYgsXxYUi__VzG4XpZFwXm98LUtoWuQ%40mail.gmail.com
>
> The crash is somewhere in pg_class, which is also manually VACUUMed by the
> test, which could trigger the issue we found in the other thread. The likely
> reason the loop in the repro is needed is that that'll push one of the indexes
> on pg_class over the 512kb/min_parallel_index_scan_size boundary to start
> using paralell vacuum.

I've applied both patches from Peter: the fix itself and the
index-points-to-LP_UNUSED-item assertions. Now it doesn't crash on
pg_unreachable(), but it hits the extra assertions from the second patch:

    #0  0x00007f251875f2fb in raise () from /lib64/libc.so.6
    #1  0x00007f2518748ef6 in abort () from /lib64/libc.so.6
    #2  0x000056387b62a4c7 in ExceptionalCondition (conditionName=0x56387b6be622 "ItemIdIsUsed(iid)", errorType=0x56387b6bc849 "FailedAssertion", fileName=0x56387b6bc928 "heapam.c", lineNumber=7467) at assert.c:69
    #3  0x000056387afb4ba9 in heap_index_delete_tuples (rel=0x7f25195f8e20, delstate=0x7ffe817bdf00) at heapam.c:7467
    #4  0x000056387afe4a38 in table_index_delete_tuples (rel=0x7f25195f8e20, delstate=0x7ffe817bdf00) at ../../../../src/include/access/tableam.h:1327
    #5  0x000056387afe83b7 in _bt_delitems_delete_check (rel=0x7f2519601880, buf=182, heapRel=0x7f25195f8e20, delstate=0x7ffe817bdf00) at nbtpage.c:1541
    #6  0x000056387afe4452 in _bt_simpledel_pass (rel=0x7f2519601880, buffer=182, heapRel=0x7f25195f8e20, deletable=0x7ffe817bdfb0, ndeletable=55, newitem=0x56387c05cfe0, minoff=1, maxoff=271) at nbtinsert.c:2896
    #7  0x000056387afe3cb1 in _bt_delete_or_dedup_one_page (rel=0x7f2519601880, heapRel=0x7f25195f8e20, insertstate=0x7ffe817be3b0, simpleonly=false, checkingunique=true, uniquedup=false, indexUnchanged=false) at nbtinsert.c:2709
    #8  0x000056387afdea4b in _bt_findinsertloc (rel=0x7f2519601880, insertstate=0x7ffe817be3b0, checkingunique=true, indexUnchanged=false, stack=0x56387c05d008, heapRel=0x7f25195f8e20) at nbtinsert.c:901
    #9  0x000056387afdd3c7 in _bt_doinsert (rel=0x7f2519601880, itup=0x56387c05cfe0, checkUnique=UNIQUE_CHECK_YES, indexUnchanged=false, heapRel=0x7f25195f8e20) at nbtinsert.c:255
    #10 0x000056387afecfee in btinsert (rel=0x7f2519601880, values=0x7ffe817be510, isnull=0x7ffe817be4f0, ht_ctid=0x56387c05b994, heapRel=0x7f25195f8e20, checkUnique=UNIQUE_CHECK_YES, indexUnchanged=false, indexInfo=0x56387c05cec8) at nbtree.c:199
    #11 0x000056387afd7f05 in index_insert (indexRelation=0x7f2519601880, values=0x7ffe817be510, isnull=0x7ffe817be4f0, heap_t_ctid=0x56387c05b994, heapRelation=0x7f25195f8e20, checkUnique=UNIQUE_CHECK_YES, indexUnchanged=false, indexInfo=0x56387c05cec8) at indexam.c:193
    #12 0x000056387b08c396 in CatalogIndexInsert (indstate=0x56387bfc0388, heapTuple=0x56387c05b990) at indexing.c:158
    #13 0x000056387b08c51a in CatalogTupleInsert (heapRel=0x7f25195f8e20, tup=0x56387c05b990) at indexing.c:231
    #14 0x000056387b07ed40 in InsertPgClassTuple (pg_class_desc=0x7f25195f8e20, new_rel_desc=0x7f251960fa18, new_rel_oid=957915, relacl=0, reloptions=0) at heap.c:984
    #15 0x000056387b07eec6 in AddNewRelationTuple (pg_class_desc=0x7f25195f8e20, new_rel_desc=0x7f251960fa18, new_rel_oid=957915, new_type_oid=957917, reloftype=0, relowner=10, relkind=114 'r', relfrozenxid=412531, relminmxid=1, relacl=0, reloptions=0) at heap.c:1056
    #16 0x000056387b07f60d in heap_create_with_catalog (relname=0x7ffe817bec60 "tmp", relnamespace=16686, reltablespace=0, relid=957915, reltypeid=0, reloftypeid=0, ownerid=10, accessmtd=2, tupdesc=0x56387bfbb6b0, cooked_constraints=0x0, relkind=114 'r', relpersistence=116 't', shared_relation=false, mapped_relation=false, oncommit=ONCOMMIT_NOOP, reloptions=0, use_user_acl=true, allow_system_table_mods=false, is_internal=false, relrewrite=0, typaddress=0x0) at heap.c:1409
    #17 0x000056387b1a9bb7 in DefineRelation (stmt=0x56387bf98620, relkind=114 'r', ownerId=10, typaddress=0x0, queryString=0x56387bf97810 "CREATE TEMPORARY TABLE tmp (a int PRIMARY KEY);") at tablecmds.c:933
    #18 0x000056387b47fde1 in ProcessUtilitySlow (pstate=0x56387bfb9890, pstmt=0x56387bf989d0, queryString=0x56387bf97810 "CREATE TEMPORARY TABLE tmp (a int PRIMARY KEY);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x56387bf98ac0, qc=0x7ffe817bf4a0) at utility.c:1163
    #19 0x000056387b47fb3d in standard_ProcessUtility (pstmt=0x56387bf989d0, queryString=0x56387bf97810 "CREATE TEMPORARY TABLE tmp (a int PRIMARY KEY);", readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x56387bf98ac0, qc=0x7ffe817bf4a0) at utility.c:1066
    #20 0x000056387b47eb01 in ProcessUtility (pstmt=0x56387bf989d0, queryString=0x56387bf97810 "CREATE TEMPORARY TABLE tmp (a int PRIMARY KEY);", readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x56387bf98ac0, qc=0x7ffe817bf4a0) at utility.c:527
    #21 0x000056387b47d5e8 in PortalRunUtility (portal=0x56387bffb860, pstmt=0x56387bf989d0, isTopLevel=true, setHoldSnapshot=false, dest=0x56387bf98ac0, qc=0x7ffe817bf4a0) at pquery.c:1155
    #22 0x000056387b47d858 in PortalRunMulti (portal=0x56387bffb860, isTopLevel=true, setHoldSnapshot=false, dest=0x56387bf98ac0, altdest=0x56387bf98ac0, qc=0x7ffe817bf4a0) at pquery.c:1312
    #23 0x000056387b47ccdb in PortalRun (portal=0x56387bffb860, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x56387bf98ac0, altdest=0x56387bf98ac0, qc=0x7ffe817bf4a0) at pquery.c:788
    #24 0x000056387b475fad in exec_simple_query (query_string=0x56387bf97810 "CREATE TEMPORARY TABLE tmp (a int PRIMARY KEY);") at postgres.c:1214
    #25 0x000056387b47ab2b in PostgresMain (dbname=0x56387bfc3748 "regression", username=0x56387bfc3728 "erthalion") at postgres.c:4497
    #26 0x000056387b3a636f in BackendRun (port=0x56387bfbb460) at postmaster.c:4560
    #27 0x000056387b3a5c60 in BackendStartup (port=0x56387bfbb460) at postmaster.c:4288
    #28 0x000056387b3a1e81 in ServerLoop () at postmaster.c:1801
    #29 0x000056387b3a1630 in PostmasterMain (argc=3, argv=0x56387bf91ca0) at postmaster.c:1473
    #30 0x000056387b297f62 in main (argc=3, argv=0x56387bf91ca0) at main.c:198

The ItemId in question:

    >>> p *iid
    $2 = {
      lp_off = 0,
      lp_flags = 0,
      lp_len = 0
    }
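
(For reference: lp_flags == 0 is LP_UNUSED, and ItemIdIsUsed() is
essentially just

    /* paraphrased from src/include/storage/itemid.h */
    #define ItemIdIsUsed(itemId) \
        ((itemId)->lp_flags != LP_UNUSED)

so the failed assertion means the index TID being deleted points at a
line pointer that has already been marked unused.)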



31.10.2021 22:20, Dmitry Dolgov wrote:
>>
>> I suspect this is the same bug as #17245. Could you check if it's fixed by
>> https://www.postgresql.org/message-id/CAH2-WzkN5aESSLfK7-yrYgsXxYUi__VzG4XpZFwXm98LUtoWuQ%40mail.gmail.com
>>
>> The crash is somewhere in pg_class, which is also manually VACUUMed by the
>> test, which could trigger the issue we found in the other thread. The likely
>> reason the loop in the repro is needed is that that'll push one of the indexes
>> on pg_class over the 512kb/min_parallel_index_scan_size boundary to start
>> using paralell vacuum.
> I've applied both patches from Peter, the fix itself and
> index-points-to-LP_UNUSED-item assertions. Now it doesn't crash on
> pg_unreachable, but hits those extra assertions in the second patch:
Yes, the committed fix for bug #17245 doesn't help here.
I've also noticed that the server crash is not the only possible
outcome. You can also get unexpected errors like:
ERROR:  relation "errtst_parent" already exists
ERROR:  relation "tmp_idx1" already exists
ERROR:  relation "errtst_child_plaindef" already exists
or
ERROR:  could not open relation with OID 1033921
STATEMENT:  DROP TABLE errtst_parent;
in the server.log (and no crash).
These strange errors and the crash inside index_delete_sort_cmp() can be
seen starting from commit dc7420c2. On the previous commit (b8443eae)
the reproduction script completes without a crash or errors
(triple-checked).
Probably bug #17257 has the same root cause, but the patch [1] applied
to REL_14_STABLE (b0f6bd48) doesn't prevent the crash.
Initially I thought that the infinite loop in vacuum was a problem in
itself, so I decided to report it separately, but maybe the two bugs are
too closely related to be discussed apart.

Best regards,
Alexander

[1]
https://www.postgresql.org/message-id/CAEze2Wj7O5tnM_U151Baxr5ObTJafwH%3D71_JEmgJV%2B6eBgjL7g%40mail.gmail.com



On Mon, Nov 8, 2021 at 9:28 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Interesting, I don't think I've observed those errors. In fact after the
> recent changes (I've compiled here from 39a31056) around assertion logic
> and index_delete_check_htid now I'm getting another type of crashes
> using your scripts. This time heap_page_prune_execute stumbles upon a
> non heap-only tuple trying to update unused line pointers:

It looks like the new heap_page_prune_execute() assertions catch the
same problem earlier. It's hard not to suspect the code in pruneheap.c
itself. Whatever else may have happened, the code in pruneheap.c ought
not even try to set a non-heap-only tuple's item to LP_UNUSED. ISTM that
it should explicitly look out for that case and avoid it.

Offhand, I wonder if we just need to have an additional check in
heap_prune_chain(), which is like the existing "If we find a redirect
somewhere else, stop --- it must not be same chain" handling used for
LP_REDIRECT items that aren't at the start of a HOT chain:

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index c7331d810..3c72cdf67 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -646,6 +646,9 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
             !TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
             break;

+        if (nchain > 0 && !HeapTupleHeaderIsHeapOnly(htup))
+            break;      /* not at start of chain */
+
         /*
          * OK, this tuple is indeed a member of the chain.
          */

I would expect the "Check the tuple XMIN against prior XMAX" test to
do what we need in the vast vast majority of cases, but why should it
be 100% effective?

-- 
Peter Geoghegan



> On Mon, Nov 08, 2021 at 10:32:39AM -0800, Peter Geoghegan wrote:
> On Mon, Nov 8, 2021 at 9:28 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > Interesting, I don't think I've observed those errors. In fact after the
> > recent changes (I've compiled here from 39a31056) around assertion logic
> > and index_delete_check_htid now I'm getting another type of crashes
> > using your scripts. This time heap_page_prune_execute stumbles upon a
> > non heap-only tuple trying to update unused line pointers:
>
> It looks like the new heap_page_prune_execute() assertions catch the
> same problem earlier. It's hard to not suspect the code in pruneheap.c
> itself. Whatever else may have happened, the code in pruneheap.c ought
> to not even try to set a non-heap-only tuple item to LP_UNUSED. ISTM
> that it should explicitly look out for and avoid doing that.
>
> Offhand, I wonder if we just need to have an additional check in
> heap_prune_chain(), which is like the existing "If we find a redirect
> somewhere else, stop --- it must not be same chain" handling used for
> LP_REDIRECT items that aren't at the start of a HOT chain:

Yes, adding such a condition works in this case: no non-heap-only tuples
were recorded as unused in heap_prune_chain(), and nothing else popped up
afterwards. But after a couple of runs I could also reproduce (at least
partially) what Alexander was talking about:

    ERROR:  could not open relation with OID 1056321

Not sure yet where it is coming from.



On Tue, Nov 9, 2021 at 7:01 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Yes, adding such condition works in this case, no non-heap-only tuples
> were recorded as unused in heap_prune_chain, and nothing else popped up
> afterwards. But now after a couple of runs I could also reproduce (at
> least partially) what Alexander was talking about:
>
>     ERROR:  could not open relation with OID 1056321

I've seen that too.

> Not sure yet where is it coming from.

I think that the additional check that I sketched (in
heap_prune_chain()) is protective in that it prevents a bad situation
from becoming even worse. At the same time it doesn't actually fix
anything.

I've discussed this privately with Andres -- expect more from him
soon. I came up with more sophisticated instrumentation (better
assertions, really) that shows that the problem begins in VACUUM, not
opportunistic pruning (certainly with the test case we have).

The real problem (identified by Andres) seems to be that pruning feels
entitled to corrupt HOT chains by making an existing LP_REDIRECT
continue to point to a DEAD item that it marks LP_UNUSED. That's never
supposed to happen. This seems to occur in the aborted heap-only tuple
"If the tuple is DEAD and doesn't chain to anything else, mark it
unused immediately" code path at the top of heap_prune_chain(). That
code path seems wonky, despite not having changed much in many years.

This wonky heap_prune_chain() code cares about whether the
to-be-set-LP_UNUSED heap-only tuple item chains to other items -- it's
only really needed for aborted tuples, but doesn't discriminate
between aborted tuples and other kinds of DEAD tuples. It doesn't seem
to get all the details right. In particular, it doesn't account for
the fact that it's not okay to break a HOT chain between the root
LP_REDIRECT item and the first heap-only tuple. It's only okay to do
that between two heap-only tuples.
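
To make this concrete, the precheck being discussed looks roughly like
this (a condensed paraphrase of the top of heap_prune_chain(), reusing
that function's variables; details vary by branch):

    /* Condensed paraphrase -- not the verbatim source */
    if (ItemIdIsNormal(rootlp))
    {
        htup = (HeapTupleHeader) PageGetItem(dp, rootlp);

        if (HeapTupleHeaderIsHeapOnly(htup))
        {
            /*
             * If the tuple is DEAD and doesn't chain to anything else, mark
             * it unused immediately.  Intended to reap heap-only tuples left
             * behind by aborted HOT updates, which may not be reachable from
             * any root item.
             */
            if (heap_prune_satisfies_vacuum(prstate, &tup, buffer) == HEAPTUPLE_DEAD &&
                !HeapTupleHeaderIsHotUpdated(htup))
                heap_prune_record_unused(prstate, rootoffnum);

            /* A heap-only tuple is never the start of a HOT chain */
            return ndeleted;
        }
    }

Note that nothing here checks whether some LP_REDIRECT elsewhere on the
page still points at the item being set LP_UNUSED -- which is the gap
described above.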

HOT chain traversal code (in places like heap_hot_search_buffer())
knows how to deal with broken HOT chains when the breakage occurs
between two heap-only tuples, but not in this other LP_REDIRECT case.
It's just not possible for HOT chain traversal code to deal with that,
because there is not enough information in the LP_REDIRECT (just a
link to another item, no xmin or xmax) to validate anything on the
fly. It's pretty clear that we need to specifically make sure that
LP_REDIRECT items always point to something sensible. Anything less
risks causing confusion about which HOT chain lives at which TID.

Obviously we were quite right to suspect that there wasn't enough
rigor around the HOT chain invariants. I wasn't specifically expecting
it to help with this bug.

-- 
Peter Geoghegan



On Tue, Nov 9, 2021 at 9:51 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I've discussed this privately with Andres -- expect more from him
> soon. I came up with more sophisticated instrumentation (better
> assertions, really) that shows that the problem begins in VACUUM, not
> opportunistic pruning (certainly with the test case we have).

Attached is a WIP fix for the bug. The idea here is to follow all HOT
chains in an initial pass over the page, even following LIVE heap-only
tuples. Any heap-only tuples that we don't determine to be part of some
valid HOT chain (after that initial pass over the whole heap page) are
then processed in a second pass over the page. We expect (and assert)
that these "disconnected" heap-only tuples will all be either DEAD or
RECENTLY_DEAD. We treat them as DEAD either way, on the grounds that they
must be from an aborted xact in any case. Note that we sometimes do
something very similar already -- we can sometimes consider some tuples
from a HOT chain DEAD even though they're RECENTLY_DEAD (provided a later
tuple from the chain really is DEAD).

The patch also has more detailed assertions inside heap_page_prune().
These should catch any HOT chain invariant violations at just about the
earliest opportunity, at least when assertions are enabled -- especially
because we now follow every HOT chain from beginning to end, even when we
already know that there are no more DEAD/RECENTLY_DEAD tuples in the
chain to be found.
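
In outline, the approach amounts to something like the following. This is
only an illustration of the idea -- the helper names (is_chain_root,
follow_chain, is_heap_only, htsv) are made up for the sketch and are not
the names used in the attached patch:

    bool    fromvalidchain[MaxHeapTuplesPerPage + 1] = {false};

    /*
     * Pass 1: walk every HOT chain from its root item, following even
     * LIVE heap-only tuples, and remember each heap-only tuple reached.
     */
    for (OffsetNumber off = FirstOffsetNumber; off <= maxoff; off++)
        if (is_chain_root(page, off))
            follow_chain(page, off, fromvalidchain, prstate);

    /*
     * Pass 2: any heap-only tuple not reached in pass 1 belongs to no
     * valid chain.  It must be from an aborted xact (DEAD or
     * RECENTLY_DEAD), so treat it as DEAD and mark it LP_UNUSED.
     */
    for (OffsetNumber off = FirstOffsetNumber; off <= maxoff; off++)
        if (is_heap_only(page, off) && !fromvalidchain[off])
        {
            Assert(htsv(page, off) == HEAPTUPLE_DEAD ||
                   htsv(page, off) == HEAPTUPLE_RECENTLY_DEAD);
            heap_prune_record_unused(prstate, off);
        }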

I'm not sure why this seems to have become more of a problem following
the snapshot scalability work from Andres -- Alexander mentioned that
commit dc7420c2 looked like it was the source of the problem here, but
I can't see any reason why that might be true (even though I accept
that it might well *appear* to be true). I believe Andres has some
theory on that, but I don't know the details myself. AFAICT, this is a
live bug on all supported versions. We simply weren't careful enough
about preserving the invariant that an LP_REDIRECT can only point to a
valid heap-only tuple. The really surprising thing here is that it took
this long for it to visibly break.

-- 
Peter Geoghegan

On Tue, Nov 9, 2021 at 3:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is a WIP fix for the bug. The idea here is to follow all HOT
> chains in an initial pass over the page, while even following LIVE
> heap-only tuples. Any heap-only tuples that we don't determine are
> part of some valid HOT chain (following an initial pass over the whole
> heap page) will now be processed in a second pass over the page.

I realized that I could easily go further than in v1, and totally get
rid of the "marked" array (which tracks whether we have decided to
mark an item as LP_DEAD/LP_UNUSED/a new LP_REDIRECT/newly pointed to
by another LP_REDIRECT). In my v1 from earlier today we already had an
array that records whether or not each item is part of any known valid
chain, which is strictly better than knowing whether or not they were
"marked" earlier. So why bother with the "marked" array at all, even
for assertions? It is less robust (not to mention less efficient) than
just using the new "fromvalidchain" array.

Attached is v2, which gets rid of the "marked" array as described. It
also has better worked out comments and assertions. The patch has
stood up to a fair amount of stress-testing. I repeated Alexander's
original test case for over an hour with this. Getting the test case
to cause an assertion failure would usually take about 5 minutes
without any fix.

I have yet to do any work on validating the performance of this patch,
though that definitely needs to happen.

Does anybody have thoughts on how far this should be backpatched? We'll
probably need to backpatch it to Postgres 14. I'm less sure about the
other branches, which haven't been directly shown to be affected by the
bug so far. I haven't tried to break earlier branches with Alexander's
test case, though I will note again that Alexander couldn't do so when
he tried.

-- 
Peter Geoghegan

Hi,

On 2021-11-09 15:31:37 -0800, Peter Geoghegan wrote:
> I'm not sure why this seems to have become more of a problem following
> the snapshot scalability work from Andres -- Alexander mentioned that
> commit dc7420c2 looked like it was the source of the problem here, but
> I can't see any reason why that might be true (even though I accept
> that it might well *appear* to be true). I believe Andres has some
> theory on that, but I don't know the details myself. AFAICT, this is a
> live bug on all supported versions. We simply weren't being careful
> enough about breaking the invariant that an LP_REDIRECT can only point
> to a valid heap-only tuple. The really surprising thing here is that
> it took this long for it to visibly break.

The way this definitely breaks - I have been able to reproduce this in
isolation - is when one tuple is processed twice by heap_prune_chain(), and
the result of HeapTupleSatisfiesVacuum() changes from
HEAPTUPLE_DELETE_IN_PROGRESS to DEAD.

Consider a page like this:

lp 1: redirect to lp2
lp 2: deleted by xid x, not yet committed

and a sequence of events like this:

1) heap_prune_chain(rootlp = 1)
2) commit x
3) heap_prune_chain(rootlp = 2)

1) heap_prune_chain(rootlp = 1) will go to lp2, and see a
HEAPTUPLE_DELETE_IN_PROGRESS and thus not do anything.

3) then could, with the snapshot scalability changes, get DEAD back from
HTSV. Due to the "fuzzy" nature of the post-snapshot-scalability xid horizons,
that is possible, because we can end up rechecking the boundary condition and
seeing that now the horizon allows us to prune x / lp2.

At that point we have a redirect line pointer pointing to an unused slot,
which is "illegal", because something unrelated can later be inserted into
that slot.


What made this hard to understand (and likely hard to hit) is that we don't
recompute the xid horizons more than once per hot pruning ([1]). At first I
concluded that a change from RECENTLY_DEAD to DEAD could thus not happen - and
it doesn't: We go from HEAPTUPLE_DELETE_IN_PROGRESS to DEAD, which is possible
because there was no horizon test for HEAPTUPLE_DELETE_IN_PROGRESS.



Note that there are several paths, even before 14, that cause HTSV()'s
answer to change for the same xid. E.g. when the transaction inserting a
tuple version aborts, we go from HEAPTUPLE_INSERT_IN_PROGRESS to DEAD. But
I haven't quite found a path that triggers problems with that, because
there won't be redirects to a tuple version that is
HEAPTUPLE_INSERT_IN_PROGRESS (but there can be redirects to one that is
HEAPTUPLE_DELETE_IN_PROGRESS or RECENTLY_DEAD).



I hit a crash once in 13 with a slightly evolved version of the test (many
connections creating / dropping the partitions as in the original scenario,
using :client_id to target different tables). It's possible that my
instrumentation was the cause of that. Unfortunately it took quite a few hours
to hit the problem in 13...



Greetings,

Andres Freund

[1] it's a bit more complicated than that, we only recompute the horizon when
a) we've not done it before in the current xact, b) RecentXmin changed during
a snapshot computation. Recomputing the horizon is expensive-ish, so we don't
want to do it constantly.



On Wed, Nov 10, 2021 at 11:20 AM Andres Freund <andres@anarazel.de> wrote:
> The way this definitely breaks - I have been able to reproduce this in
> isolation - is when one tuple is processed twice by heap_prune_chain(), and
> the result of HeapTupleSatisfiesVacuum() changes from
> HEAPTUPLE_DELETE_IN_PROGRESS to DEAD.

I had no idea that that was now possible. I really think that this
ought to be documented centrally.

As you know, I don't like the way that vacuumlazy.c doesn't explain
anything about the relationship between OldestXmin (which still
exists, but isn't used for pruning), and the similar GlobalVisState
state (used only during pruning). Surely this deserves to be
addressed, because we expect these two things to agree in certain
specific ways. But not necessarily in others.

> Note that there are several paths < 14, that cause HTSV()'s answer to change
> for the same xid. E.g. when the transaction inserting a tuple version aborts,
> we go from HEAPTUPLE_INSERT_IN_PROGRESS to DEAD.

Right -- that one I certainly knew about.  After all, the
tupgone-ectomy work from my commit 8523492d specifically targeted this
case.

> But I haven't quite found a
> path to trigger problems with that, because there won't be redirects to a
> tuple version that is HEAPTUPLE_INSERT_IN_PROGRESS (but there can be redirects
> to a HEAPTUPLE_DELETE_IN_PROGRESS or RECENTLY_DEAD).

That explains why the snapshot scalability work either made these
problems possible for the first time, or at the very least made them
far, far more likely in practice.

The relevant code in pruneheap.c was always incredibly fragile -- no
question. Even so, there is really no good reason to believe that it was
actually a problem before commit dc7420c2. Even if we assume that there's
a problem before 14, the surface area is vastly smaller than on 14 -- the
relevant pruneheap.c code hasn't really changed since HOT went in. And so
I think that the most sensible course of action here is this: commit a
fix to Postgres 14 + HEAD only -- no backpatch to earlier versions.

We could go back further than that, but ISTM that the risk of causing
new problems far outweighs the benefits. Whereas I feel pretty
confident that we need to do something on 14.

-- 
Peter Geoghegan



On Wed, Nov 10, 2021 at 1:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Nov 10, 2021 at 11:20 AM Andres Freund <andres@anarazel.de> wrote:
> > The way this definitely breaks - I have been able to reproduce this in
> > isolation - is when one tuple is processed twice by heap_prune_chain(), and
> > the result of HeapTupleSatisfiesVacuum() changes from
> > HEAPTUPLE_DELETE_IN_PROGRESS to DEAD.
>
> I had no idea that that was now possible. I really think that this
> ought to be documented centrally.

BTW, is it just a coincidence that we have only seen these problems
with pg_class? Or does it have something to do with the specifics of
VISHORIZON_CATALOG/horizons.catalog_oldest_nonremovable, or something
else along those lines?

-- 
Peter Geoghegan



On Wed, Nov 10, 2021 at 11:20 AM Andres Freund <andres@anarazel.de> wrote:
> I hit a crash once in 13 with a slightly evolved version of the test (many
> connections creating / dropping the partitions as in the original scenario,
> using :client_id to target different tables). It's possible that my
> instrumentation was the cause of that. Unfortunately it took quite a few hours
> to hit the problem in 13...

Have you thought about the case where a transaction does a HOT update
of the same row twice, and then aborts?

I'm asking because I notice that the fragile "We need this primarily
to handle aborted HOT updates" precheck for
HeapTupleHeaderIsHeapOnly() doesn't just check whether the heap-only
tuple is DEAD before deciding to mark it LP_UNUSED. It also checks
HeapTupleHeaderIsHotUpdated() against the target tuple -- that's
another condition for the tuple being marked unused. Of course, whether
or not a given tuple is considered HeapTupleHeaderIsHotUpdated() can
change from true to false when an updater concurrently aborts. Could
that lead to race conditions?

In other words: what if the aforementioned "aborted HOT updates"
precheck code doesn't deal with a DEAD tuple, imagining that it's not
a relevant tuple, while at the same time the later HOT-chain-chasing
code *also* doesn't get to the tuple? What if they each assume that
the other will/has taken care of it, due to a race?

So far we've been worried about cases where these two code paths
clobber each other -- that's what we've actually seen. We should also
at least consider the possibility that we have the opposite problem,
which is what this really is.

-- 
Peter Geoghegan



Hi,

On 2021-11-10 13:04:43 -0800, Peter Geoghegan wrote:
> On Wed, Nov 10, 2021 at 11:20 AM Andres Freund <andres@anarazel.de> wrote:
> > The way this definitely breaks - I have been able to reproduce this in
> > isolation - is when one tuple is processed twice by heap_prune_chain(), and
> > the result of HeapTupleSatisfiesVacuum() changes from
> > HEAPTUPLE_DELETE_IN_PROGRESS to DEAD.
> 
> I had no idea that that was now possible. I really think that this
> ought to be documented centrally.

Where would you suggest?


> The relevant code in pruneheap.c was always incredibly fragile -- no
> question. Even still, there is really no good reason to believe that
> that was actually a problem before commit dc7420c2. Even if we assume
> that there's a problem before 14, the surface area is vastly smaller
> than on 14 -- the relevant pruneheap.c code hasn't really ever changed
> since HOT went in. And so I think that the most sensible course of
> action here is this: commit a fix to Postgres 14 + HEAD only -- no
> backpatch to earlier versions.

Yea. The fact that I also saw *one* error in 13 worries me a bit, but perhaps
that was something else. Even if we eventually need to backpatch something
further, having it in 14/master first is good.

The fact that 13 didn't trigger the problem reliably doesn't necessarily mean
much - it's a pretty limited workload. There are, e.g., no aborts.


I think we might be able to do something a bit more limited than what you
propose. But I'm not sure it's worth going for that.

Greetings,

Andres Freund



Hi,

On 2021-11-10 14:18:01 -0800, Peter Geoghegan wrote:
> On Wed, Nov 10, 2021 at 11:20 AM Andres Freund <andres@anarazel.de> wrote:
> > I hit a crash once in 13 with a slightly evolved version of the test (many
> > connections creating / dropping the partitions as in the original scenario,
> > using :client_id to target different tables). It's possible that my
> > instrumentation was the cause of that. Unfortunately it took quite a few hours
> > to hit the problem in 13...
> 
> Have you thought about the case where a transaction does a HOT update
> of the same row twice, and then aborts?

Yes. I don't think it's problematic right now, because the redirect would, I
think, in all cases have to point to the chain element before those tuples,
because the preceding tuple version would just have to be DELETE_IN_PROGRESS,
which we don't follow in heap_prune_chain().


> I'm asking because I notice that the fragile "We need this primarily
> to handle aborted HOT updates" precheck for
> HeapTupleHeaderIsHeapOnly() doesn't just check if the heap-only tuple
> is DEAD before deciding to mark it LP_UNUSED. It also checks
> HeapTupleHeaderIsHotUpdated() against the target tuple -- that's
> another condition of the tuple being marked unused. Of course, whether
> or not a given tuple is considered HeapTupleHeaderIsHotUpdated() can
> change from true to false when an updater concurrently aborts. Could
> that have race conditions?

I wondered about that too, but I couldn't *quite* come up with a problematic
scenario, because I don't think any of the cases that can change
HeapTupleHeaderIsHotUpdated() would have allowed to set the redirect to a
subsequent chain element.


> In other words: what if the aforementioned "aborted HOT updates"
> precheck code doesn't deal with a DEAD tuple, imagining that it's not
> a relevant tuple, while at the same time the later HOT-chain-chasing
> code *also* doesn't get to the tuple? What if they each assume that
> the other will/has taken care of it, due to a race?

Then we'd just end up not pruning the tuple, I think. Which should be fine, as
it could only happen for fairly new tuples.

Greetings,

Andres Freund



Hi,

On 2021-11-10 13:40:25 -0800, Peter Geoghegan wrote:
> On Wed, Nov 10, 2021 at 1:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > On Wed, Nov 10, 2021 at 11:20 AM Andres Freund <andres@anarazel.de> wrote:
> > > The way this definitely breaks - I have been able to reproduce this in
> > > isolation - is when one tuple is processed twice by heap_prune_chain(), and
> > > the result of HeapTupleSatisfiesVacuum() changes from
> > > HEAPTUPLE_DELETE_IN_PROGRESS to DEAD.
> >
> > I had no idea that that was now possible. I really think that this
> > ought to be documented centrally.
> 
> BTW, is it just a coincidence that we have only seen these problems
> with pg_class? Or does it have something to do with the specifics of
> VISHORIZON_CATALOG/horizons.catalog_oldest_nonremovable, or something
> else along those lines?

It's the test case. The test instructions specify autovacuum=off and only
pg_class is vacuumed. There are other HOT updates besides the ones to
pg_class, but there's much less HOT pruning on those relations because
there's no loop of VACUUMs over them.

Greetings,

Andres Freund



On Wed, Nov 10, 2021 at 4:57 PM Andres Freund <andres@anarazel.de> wrote:
> Yes. I don't think it's problematic right now, because the redirect would, I
> think, in all cases have to point to the chain element before those tuples,
> because the preceding value would just have to be DELETE_IN_PROGRESS, which we
> we don't follow in heap_prune_chain().

Actually, I was more worried about versions before Postgres 14 here --
versions that don't have that new DELETE_IN_PROGRESS pruning behavior
you mentioned.

> > I'm asking because I notice that the fragile "We need this primarily
> > to handle aborted HOT updates" precheck for
> > HeapTupleHeaderIsHeapOnly() doesn't just check if the heap-only tuple
> > is DEAD before deciding to mark it LP_UNUSED. It also checks
> > HeapTupleHeaderIsHotUpdated() against the target tuple -- that's
> > another condition of the tuple being marked unused. Of course, whether
> > or not a given tuple is considered HeapTupleHeaderIsHotUpdated() can
> > change from true to false when an updater concurrently aborts. Could
> > that have race conditions?
>
> I wondered about that too, but I couldn't *quite* come up with a problematic
> scenario, because I don't think any of the cases that can change
> HeapTupleHeaderIsHotUpdated() would have allowed to set the redirect to a
> subsequent chain element.

It's pretty complicated.

> > In other words: what if the aforementioned "aborted HOT updates"
> > precheck code doesn't deal with a DEAD tuple, imagining that it's not
> > a relevant tuple, while at the same time the later HOT-chain-chasing
> > code *also* doesn't get to the tuple? What if they each assume that
> > the other will/has taken care of it, due to a race?
>
> Then we'd just end up not pruning the tuple, I think. Which should be fine, as
> it could only happen for fairly new tuples.

It's pretty far from ideal if the situation cannot correct itself in a
timely manner. Offhand I wonder if this might have something to do
with remaining suspected cases where we restart pruning during VACUUM,
based on the belief that an inserter concurrently aborted. If the DEAD
tuple is unreachable by pruning *forever*, then we're in trouble.

The good news is that the fix for missing DEAD tuples (a hypothetical
problem) is the same as the fix for the known, concrete problem:
process whole HOT chains first, and only later (in a "second pass"
over the page) process remaining heap-only tuples -- those heap-only
tuples that cannot be located any other way (because we tried and
failed).

You probably noticed that my patch does *not* have the same
"!HeapTupleHeaderIsHotUpdated()" check for these
unreachable-via-HOT-chain heap-only tuples. The mere fact that they're
unreachable is a good enough reason to consider them DEAD (and so mark
them LP_UNUSED). We do check heap_prune_satisfies_vacuum() for these
tuples too, but that is theoretically unnecessary. This feels pretty
watertight to me -- contradictory interpretations of which heap-only
tuple belongs to which HOT chain (if any) now seem impossible.

I benchmarked the patch, and it looks like there is a consistent
reduction in latency -- which you probably won't find too surprising.
That isn't the goal, but it is nice to see that new performance
regressions are unlikely to be a problem for the bug fix.

-- 
Peter Geoghegan



On Wed, Nov 10, 2021 at 4:47 PM Andres Freund <andres@anarazel.de> wrote:
> On 2021-11-10 13:04:43 -0800, Peter Geoghegan wrote:
> > On Wed, Nov 10, 2021 at 11:20 AM Andres Freund <andres@anarazel.de> wrote:
> > > The way this definitely breaks - I have been able to reproduce this in
> > > isolation - is when one tuple is processed twice by heap_prune_chain(), and
> > > the result of HeapTupleSatisfiesVacuum() changes from
> > > HEAPTUPLE_DELETE_IN_PROGRESS to DEAD.
> >
> > I had no idea that that was now possible. I really think that this
> > ought to be documented centrally.
>
> Where would you suggest?

Offhand I'd say that it would be a good idea to add comments over the
call to vacuum_set_xid_limits() made from vacuumlazy.c.

You might also move the call to GlobalVisTestFor() out of
lazy_scan_heap(), so that it gets called right after
vacuum_set_xid_limits(). That would make the new explanation easier to
follow, since you are after all explaining the relationship between
OldestXmin (or the vacuum_set_xid_limits() call itself) and vistest
(or the GlobalVisTestFor() call itself).

Why do they have to be called in that order? Or do they? I noticed
that "make check-world" won't break if you switch the order.

I assume that you're going to want to say something about what needs
to happen in lazy_scan_prune() in these new comments -- since that is
where the relationship between these two things is most crucial.

Thanks
-- 
Peter Geoghegan



Hi,

On 2021-11-10 17:19:14 -0800, Peter Geoghegan wrote:
> On Wed, Nov 10, 2021 at 4:57 PM Andres Freund <andres@anarazel.de> wrote:
> > Yes. I don't think it's problematic right now, because the redirect would, I
> > think, in all cases have to point to the chain element before those tuples,
> > because the preceding value would just have to be DELETE_IN_PROGRESS, which we
> > we don't follow in heap_prune_chain().
> 
> Actually, I was more worried about versions before Postgres 14 here --
> versions that don't have that new DELETE_IN_PROGRESS pruning behavior
> you mentioned.

I think we're actually saying the same thing. I.e. that we, so far, don't have
a concrete reason to worry about pre 14 versions.


> > > In other words: what if the aforementioned "aborted HOT updates"
> > > precheck code doesn't deal with a DEAD tuple, imagining that it's not
> > > a relevant tuple, while at the same time the later HOT-chain-chasing
> > > code *also* doesn't get to the tuple? What if they each assume that
> > > the other will/has taken care of it, due to a race?
> >
> > Then we'd just end up not pruning the tuple, I think. Which should be fine, as
> > it could only happen for fairly new tuples.
> 
> It's pretty far from ideal if the situation cannot correct itself in a
> timely manner.

Wouldn't it be corrected immediately when called from vacuum, due to the DEAD
check? Afaict this is a one-way ratchet, so the next prune is guaranteed to
get it?


> Offhand I wonder if this might have something to do with remaining suspected
> cases where we restart pruning during VACUUM, based on the belief that an
> inserter concurrently aborted. If the DEAD tuple is unreachable by pruning
> *forever*, then we're in trouble.

But why would it be unreachable forever? The next hot prune will see
!HeapTupleHeaderIsHotUpdated() and process it via the "If the tuple is DEAD
and doesn't chain to anything else" block?


> You probably noticed that my patch does *not* have the same
> "!HeapTupleHeaderIsHotUpdated()" check for these
> unreachable-via-HOT-chain heap-only tuples. The mere fact that they're
> unreachable is a good enough reason to consider them DEAD (and so mark
> them LP_UNUSED). We do check heap_prune_satisfies_vacuum() for these
> tuples too, but that is theoretically unnecessary. This feels pretty
> water tight to me -- contradictory interpretations of what heap-only
> tuple is and is in what HOT chain (if any) now seem impossible.

Hm. I guess you *have* to actually process them regardless of
!HeapTupleHeaderIsHotUpdated() with the new approach, because otherwise we'd
potentially process only the tail item of a disconnected chain in each
heap_hot_prune() call? Although that might be unreachable due to the HTSV
calls likely breaking the chain in all disconnected cases (but perhaps not
with multixacts, where there could be remaining lockers).

Greetings,

Andres Freund



On Wed, Nov 10, 2021 at 5:43 PM Andres Freund <andres@anarazel.de> wrote:
> > Actually, I was more worried about versions before Postgres 14 here --
> > versions that don't have that new DELETE_IN_PROGRESS pruning behavior
> > you mentioned.
>
> I think we're actually saying the same thing. I.e. that we, so far, don't have
> a concrete reason to worry about pre 14 versions.

Mostly, yes. I think that it's not a good idea to backpatch a fix
beyond 14 at this time.

All I was saying beyond that was: If there was a problem before 14,
then it would probably be a problem with the
"HEAPTUPLE_INSERT_IN_PROGRESS becomes DEAD" case. And, it's probably
also more likely that it would be "leaking DEAD tuples", since that is
much less obvious than the bug we know about, in 14. In general
problems with "incorrectly broken" HOT chains are very subtle, and
less well understood than other problems.

> Wouldn't it immediately corrected when called from vacuum due to the DEAD
> check? Afaict this is a one-way ratched, so the next prune is guaranteed to
> get it?

I believe that's true, but I still worry just a little about extreme
cases. I'm not concerned enough to advocate backpatch beyond 14 (not
even close!), at least for now. And so my concern is pretty abstract
and theoretical at this point.

It'll be nice to clearly not have even the theoretical problem on 14,
which needs a fix for the non-theoretical problem anyway -- AFAICT
fixing both together adds no extra risk.

> But why would it be unreachable forever? The next hot prune will see
> !HeapTupleHeaderIsHotUpdated() and process it via the "If the tuple is DEAD
> and doesn't chain to anything else" block?

I agree.

> > You probably noticed that my patch does *not* have the same
> > "!HeapTupleHeaderIsHotUpdated()" check for these
> > unreachable-via-HOT-chain heap-only tuples. The mere fact that they're
> > unreachable is a good enough reason to consider them DEAD (and so mark
> > them LP_UNUSED). We do check heap_prune_satisfies_vacuum() for these
> > tuples too, but that is theoretically unnecessary. This feels pretty
> > water tight to me -- contradictory interpretations of what heap-only
> > tuple is and is in what HOT chain (if any) now seem impossible.
>
> Hm. I guess you *have* to actually process them regardless of
> !HeapTupleHeaderIsHotUpdated() with the new approach, because otherwise we'd
> potentially process only the tail item of a disconnected chain in each
> heap_hot_prune() call?

That's true.

> Although that might be unreachable due to the HTSV
> calls likely breaking the chain in all disconnected cases (but perhaps not
> with multixacts, there could be remaining lockers).

Also true. But I like that we don't really have to worry about it with
the fix in place.

If we cannot reach a heap-only tuple via a valid HOT chain, it's
already effectively DEAD, anyway -- all bets are off. We could even
add a "can't happen" ERROR to catch the case where HTSV disagrees.
Unsure if that would be worth it, but it's certainly worth an
assertion -- which I've already added.

What do you think of the idea of a "can't happen" ERROR here?

-- 
Peter Geoghegan



Hi,

On 2021-11-10 17:37:38 -0800, Peter Geoghegan wrote:
> On Wed, Nov 10, 2021 at 4:47 PM Andres Freund <andres@anarazel.de> wrote:
> Offhand I'd say that it would be a good idea to add comments over the
> call to vacuum_set_xid_limits() made from vacuumlazy.c.

Hm. To me all of this is more general than vacuum[lazy].c. Or even than
anything heap related.


> You might also move the call to GlobalVisTestFor() out of
> lazy_scan_heap(), so that it gets called right after
> vacuum_set_xid_limits(). That would make the new explanation easier to
> follow, since you are after all explaining the relationship between
> OldestXmin (or the vacuum_set_xid_limits() call itself) and vistest
> (or the GlobalVisTestFor() call itself).
> 
> Why do they have to be called in that order? Or do they? I noticed
> that "make check-world" won't break if you switch the order.

We need pruning to be at least as aggressive as relfrozenxid. If we did it the
other way round, we couldn't guarantee that.

I think we should work towards not actually using a statically determined
relfrozenxid. We cause a lot of unnecessary re-vacuuming by using a static
cutoff - instead we should check what the actual oldest xid in the table is
and set relfrozenxid to that.

Greetings,

Andres Freund



Hi,

On 2021-11-10 18:16:18 -0800, Andres Freund wrote:
> On 2021-11-10 17:37:38 -0800, Peter Geoghegan wrote:
> > You might also move the call to GlobalVisTestFor() out of
> > lazy_scan_heap(), so that it gets called right after
> > vacuum_set_xid_limits(). That would make the new explanation easier to
> > follow, since you are after all explaining the relationship between
> > OldestXmin (or the vacuum_set_xid_limits() call itself) and vistest
> > (or the GlobalVisTestFor() call itself).
> > 
> > Why do they have to be called in that order? Or do they? I noticed
> > that "make check-world" won't break if you switch the order.
> 
> We need pruning to be at least as aggressive as relfrozenxid. If we did it the
> other way round, we couldn't guarantee that.

There's a relevant comment above ComputeXidHorizons():

 * Note: the approximate horizons (see definition of GlobalVisState) are
 * updated by the computations done here. That's currently required for
 * correctness and a small optimization. Without doing so it's possible that
 * heap vacuum's call to heap_page_prune() uses a more conservative horizon
 * than later when deciding which tuples can be removed - which the code
 * doesn't expect (breaking HOT).

Greetings,

Andres Freund



On Wed, Nov 10, 2021 at 6:16 PM Andres Freund <andres@anarazel.de> wrote:
> Hm. To me all of this is more general than vacuum[lazy].c. Or even than
> anything heap related.

Here is a sensible compromise: put most of what you want to say
wherever (I guess procarray.c), and then move the vacuumlazy.c call to
GlobalVisTestFor() back, so that it comes immediately after the
vacuum_set_xid_limits() call. Then place a few breadcrumb comments
that reference the place in procarray.c that has the real discussion.

> We need pruning to be at least as aggressive as relfrozenxid. If we did it the
> other way round, we couldn't guarantee that.

I thought that that's what it was, but the code doesn't actually say
anything about it. The distance between the two actually-related
things is jarring, at least to me.

> I think we should work towards not actually using a statically determined
> relfrozenxid. We cause a lot of unnecessary re-vacuuming by using a static
> cutoff - instead we should check what the actually oldest xid in the table is
> and set relfrozenxid to that.

I agree, but that doesn't seem relevant to me. AFAICT the "effective"
relfrozenxid when applying this hypothetical future optimization (the
"actually oldest xid in the table", as you put it) must never end up
exceeding the original OldestXmin cutoff. And so I don't think that it
changes the fundamental invariants for either OldestXmin, or for
freezeLimit/relfrozenxid. Specifically, the "freezeLimit <=
OldestXmin" invariant.

We could probably *also* freeze tuples opportunistically (e.g., freeze
a few tuples on a page early to be able to mark it all-frozen sooner),
since freezing is basically just an all-visible marking that applies
at the tuple level. We could perhaps even do this when the tuples
would not be visible to our original OldestXmin (they just have to be
visible to every possible MVCC snapshot, and so VACUUM's OldestXmin
itself doesn't necessarily have to be considered). This additional
optimization doesn't seem like it changes the invariants, either,
though, since I'm pretty sure that freezing tuples early isn't
compatible with allowing those tuples to affect the final
freezeLimit/relfrozenxid (when we have both optimizations working
together).

-- 
Peter Geoghegan



Hi,

On 2021-11-09 19:07:02 -0800, Peter Geoghegan wrote:
> diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
> index c7331d810..31fa4d2a3 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -31,6 +31,7 @@
>  typedef struct
>  {
>      Relation    rel;
> +    BlockNumber targetblkno;

It's a good optimization to remove the redundant BufferGetBlockNumber(), but
perhaps it's worth committing that separately?


> @@ -256,10 +260,8 @@ heap_page_prune(Relation relation, Buffer buffer,
>           offnum <= maxoff;
>           offnum = OffsetNumberNext(offnum))
>      {
> -        ItemId        itemid;
> -
>          /* Ignore items already processed as part of an earlier chain */
> -        if (prstate.marked[offnum])
> +        if (prstate.fromvalidchain[offnum])
>              continue;
>  
>          /*

ISTM that we should now call heap_prune_chain() only for the start of actual
chains (i.e. redirect or non-hot tuple).

I think it'd be good to have a comment above the loop explaining that we're
just iterating over chains starting at a non-HOT element.
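
Roughly along these lines, just as a sketch of what I mean (names as in
pruneheap.c, not tested):

    /* Sketch only: have the first loop skip heap-only tuples entirely, so
     * that heap_prune_chain() is called just for plausible chain starts,
     * i.e. a redirect line pointer or a non-heap-only tuple. */
    itemid = PageGetItemId(page, offnum);
    if (ItemIdIsNormal(itemid) &&
        HeapTupleHeaderIsHeapOnly((HeapTupleHeader) PageGetItem(page, itemid)))
        continue;       /* reached via its root item, if reachable at all */

    /* Process this chain of items */
    ndeleted += heap_prune_chain(buffer, offnum, &prstate);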


> @@ -269,15 +271,29 @@ heap_page_prune(Relation relation, Buffer buffer,
>          if (off_loc)
>              *off_loc = offnum;
>  
> -        /* Nothing to do if slot is empty or already dead */
> -        itemid = PageGetItemId(page, offnum);
> -        if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
> -            continue;
> -
>          /* Process this item or chain of items */
>          ndeleted += heap_prune_chain(buffer, offnum, &prstate);
>      }

Why did you move this into heap_prune_chain()?


> +    /*
> +     * Scan the page again, processing any heap-only tuples that we now know
> +     * are not part of any valid HOT chain
> +     */
> +    for (offnum = FirstOffsetNumber;
> +         offnum <= maxoff;
> +         offnum = OffsetNumberNext(offnum))
> +    {
> +        /* Ignore items already processed as part of a known valid chain */
> +        if (prstate.fromvalidchain[offnum])
> +            continue;
> +
> +        if (off_loc)
> +            *off_loc = offnum;
> +
> +        /* Process this disconnected heap-only tuple */
> +        ndeleted += heap_prune_heaponly(buffer, offnum, &prstate);
> +    }
> +
>      /* Clear the offset information once we have processed the given page. */
>      if (off_loc)
>          *off_loc = InvalidOffsetNumber;
> @@ -383,19 +399,8 @@ heap_page_prune(Relation relation, Buffer buffer,
>          pgstat_update_heap_dead_tuples(relation, ndeleted - prstate.ndead);
>  
>      /*
> -     * XXX Should we update the FSM information of this page ?
> -     *
> -     * There are two schools of thought here. We may not want to update FSM
> -     * information so that the page is not used for unrelated UPDATEs/INSERTs
> -     * and any free space in this page will remain available for further
> -     * UPDATEs in *this* page, thus improving chances for doing HOT updates.
> -     *
> -     * But for a large table and where a page does not receive further UPDATEs
> -     * for a long time, we might waste this space by not updating the FSM
> -     * information. The relation may get extended and fragmented further.
> -     *
> -     * One possibility is to leave "fillfactor" worth of space in this page
> -     * and update FSM with the remaining space.
> +     * Don't update the FSM information on this page.  We leave it up to our
> +     * caller to decide what to do about that.
>       */

I don't think this should be part of the commit?



>      offnum = rootoffnum;
> +    nchain = 0;

>      /* while not end of the chain */
>      for (;;)
>      {
>          ItemId        lp;
> -        bool        tupdead,
> -                    recent_dead;
> +        HeapTupleHeader htup;
> +        HeapTupleData tup;
> +        bool        tupdead;
>  
>          /* Sanity check (pure paranoia) */
>          if (offnum < FirstOffsetNumber)
> @@ -601,15 +603,11 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
>              break;
>  
>          /* If item is already processed, stop --- it must not be same chain */
> -        if (prstate->marked[offnum])
> +        if (nchain != 0 && prstate->fromvalidchain[offnum])
>              break;

I think we should make it an error to reach the same tuple multiple ways. It's
not *quite* as trivial as making this an assert/error though, as we can only
raise an error after checking that the xmin/xmax thing passes.


>          /*
>           * If we are looking at the redirected root line pointer, jump to the
>           * first normal tuple in the chain.  If we find a redirect somewhere
> @@ -619,42 +617,63 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
>          {
>              if (nchain > 0)
>                  break;            /* not at start of chain */
> +            Assert(rootisredirect);
>              chainitems[nchain++] = offnum;
>              offnum = ItemIdGetRedirect(rootlp);
>              continue;
>          }

Why are we handling this inside the loop?


> -        /*
> -         * Likewise, a dead line pointer can't be part of the chain. (We
> -         * already eliminated the case of dead root tuple outside this
> -         * function.)
> -         */
> -        if (ItemIdIsDead(lp))
> +        /* LP_UNUSED or LP_DEAD items obviously not part of the chain */
> +        if (!ItemIdIsUsed(lp) || ItemIdIsDead(lp))
> +        {
> +            /*
> +             * XXX What if we just came from root item, which is a plain heap
> +             * tuple?  Do we need assert that notices when we reach an LP_DEAD
> +             * or LP_UNUSED item having started from such a root item?
> +             */
> +            Assert(!rootisredirect || nchain > 1);

I don't think we can assert that. It's perfectly possible to have a non-hot
update chain where the following element can be vacuumed, but the preceding
element can't (e.g. when the updater has an older xid that prevents it from
being pruned).



>          chainitems[nchain++] = offnum;
> +        prstate->fromvalidchain[offnum] = true;
>  
>          /*
>           * Check tuple's visibility status.
>           */
> -        tupdead = recent_dead = false;
> +        tupdead = false;
>  
>          switch (heap_prune_satisfies_vacuum(prstate, &tup, buffer))
>          {
> @@ -663,7 +682,6 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
>                  break;
>  
>              case HEAPTUPLE_RECENTLY_DEAD:
> -                recent_dead = true;
>  
>                  /*
>                   * This tuple may soon become DEAD.  Update the hint field so
> @@ -679,6 +697,7 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
>                   * This tuple may soon become DEAD.  Update the hint field so
>                   * that the page is reconsidered for pruning in future.
>                   */
> +                Assert(!tupdead);
>                  heap_prune_record_prunable(prstate,
>                                             HeapTupleHeaderGetUpdateXid(htup));
>                  break;

> @@ -692,6 +711,7 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
>                   * But we don't.  See related decisions about when to mark the
>                   * page prunable in heapam.c.
>                   */
> +                Assert(!tupdead);
>                  break;
>  
>              default:

I don't understand these new asserts? We just set tupdead to false above?


> @@ -700,11 +720,18 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
>          }
>  
>          /*
> -         * Remember the last DEAD tuple seen.  We will advance past
> -         * RECENTLY_DEAD tuples just in case there's a DEAD one after them;
> -         * but we can't advance past anything else.  We have to make sure that
> -         * we don't miss any DEAD tuples, since DEAD tuples that still have
> -         * tuple storage after pruning will confuse VACUUM.
> +         * Remember the last DEAD tuple seen from this HOT chain.
> +         *
> +         * We expect to sometimes find a RECENTLY_DEAD tuple after a DEAD
> +         * tuple.  When this happens, the RECENTLY_DEAD tuple will be treated
> +         * as if it was DEAD all along.  Manage that now.

I now actually wonder why this is correct. There's another comment about it:

> We also prune any RECENTLY_DEAD tuples preceding a DEAD tuple.
> * This is OK because a RECENTLY_DEAD tuple preceding a DEAD tuple is really

But that doesn't justify very much.

What prevents the scenario that some other backend e.g. has a snapshot with
xmin=xmax=RECENTLY_DEAD-row. If the RECENTLY_DEAD row has an xid that is later
than the DEAD row, this afaict would make it perfectly legal to prune the DEAD
row, but *not* the RECENTLY_DEAD one.


> +         * We're not just interested in DEAD and RECENTLY_DEAD tuples.  We
> +         * need to traverse each and every HOT chain exhaustively, in order to
> +         * determine which heap-only tuples are part of a valid HOT chain.
> +         * Heap-only tuples that cannot be located like this must not be part
> +         * of a valid HOT chain.  They are therefore processed during our
> +         * second pass over the page.
>           */
>          if (tupdead)
>          {

This comment can't really be understood without the historical behaviour of
the function.


> +static int
> +heap_prune_heaponly(Buffer buffer, OffsetNumber offnum, PruneState *prstate)
> +{
> +    Page        dp = (Page) BufferGetPage(buffer);
> +    ItemId        lp;
> +    HeapTupleHeader htup;
> +    HeapTupleData tup;
> +    HTSV_Result res;
> +
> +    lp = PageGetItemId(dp, offnum);
> +    Assert(ItemIdIsNormal(lp));
> +    htup = (HeapTupleHeader) PageGetItem(dp, lp);
> +
> +    /*
> +     * Caller must make sure that the tuple at 'offnum' is in fact a heap-only
> +     * tuple that is disconnected from its HOT chain (i.e. isn't reachable
> +     * through a HOT traversal that starts at any plausible-looking root item
> +     * on the page).
> +     */
> +    Assert(!prstate->fromvalidchain[offnum]);
> +    Assert(HeapTupleHeaderIsHeapOnly(htup));
> +
> +    /*
> +     * We expect that disconnected heap-only tuples must be from aborted
> +     * transactions.  Any RECENTLY_DEAD tuples we see here are really DEAD,
> +     * but the heap_prune_satisfies_vacuum test is too coarse to detect it.
> +     */
> +    tup.t_len = ItemIdGetLength(lp);
> +    tup.t_tableOid = RelationGetRelid(prstate->rel);
> +    tup.t_data = htup;
> +    ItemPointerSet(&(tup.t_self), prstate->targetblkno, offnum);
> +    res = heap_prune_satisfies_vacuum(prstate, &tup, buffer);
> +    if (res == HEAPTUPLE_DEAD || res == HEAPTUPLE_RECENTLY_DEAD)
> +    {
> +        heap_prune_record_unused(prstate, offnum);
> +        HeapTupleHeaderAdvanceLatestRemovedXid(htup,
> +                                               &prstate->latestRemovedXid);
> +    }
> +    else
> +        Assert(false);
> +
> +    /*
> +     * Should always be DEAD.  A DEAD heap-only tuple is always counted in
> +     * top-level ndeleted counter for pruning operation.
> +     */
> +    return 1;
> +}

It seems weird to have the if (res == HEAPTUPLE_DEAD ..) branch, but then to
unconditionally return 1.

I'm not actually sure the Assert is unreachable. I can imagine cases where
we'd see e.g. DELETE/INSERT_IN_PROGRESS due to a concurrent subtransaction
abort or such.


Greetings,

Andres Freund



Hi,

On 2021-11-10 18:40:13 -0800, Peter Geoghegan wrote:
> On Wed, Nov 10, 2021 at 6:16 PM Andres Freund <andres@anarazel.de> wrote:
> > Hm. To me all of this is more general than vacuum[lazy].c. Or even than
> > anything heap related.
> 
> Here is a sensible compromise: put most of what you want to say
> wherever (I guess procarray.c), and then move the vacuumlazy.c call to
> GlobalVisTestFor() back, so that it comes immediately after the
> vacuum_set_xid_limits() call. Then place a few breadcrumb comments
> that reference the place in procarray.c that has the real discussion.

I don't really think it's a good idea to move it to earlier. If anything I
would want to move it later (and eventually occasionally re-compute it, that'd
be a *huge* win for longrunning vacuums). We can't really do the same with the
hard horizons. It's currently super likely that the horizon moves forward a
lot between vacuum_set_xid_limits() and GlobalVisTestFor(), but moving it
closer still is going in the wrong direction.

So I'm inclined to add a comment saying that we need to do the
GlobalVisTestFor(), pointing to the comment in ComputeXidHorizons(), and
adding a TODO that we should find a heuristic when to recompute horizons?



> We could probably *also* freeze tuples opportunistically (e.g., freeze
> a few tuples on a page early to be able to mark it all-frozen sooner),

We should definitely do that when a page is already being dirtied for other
reasons.

Greetings,

Andres Freund



On Wed, Nov 10, 2021 at 7:16 PM Andres Freund <andres@anarazel.de> wrote:
> I don't really think it's a good idea to move it to earlier. If anything I
> would want to move it later (and eventually occasionally re-compute it, that'd
> be a *huge* win for longrunning vacuums).

I agree that it would be a huge win, especially when we want to set
the VM after heap vacuuming, during the second heap pass. But...

> We can't really do the same with the
> hard horizons. It's currently super likely that the horizon moves forward a
> lot between vacuum_set_xid_limits() and GlobalVisTestFor(), but moving it
> closer still is going in the wrong direction.

...why is that super likely? I have to admit that I have no idea what
you mean by that.

> So I'm inclined to add a comment saying that we need to do the
> GlobalVisTestFor(), pointing to the comment in ComputeXidHorizons(), and
> adding a TODO that we should find a heuristic when to recompute horizons?

Okay. Even still, I suggest that you say something about this new
DELETE_IN_PROGRESS pruning behavior in vacuumlazy.c. Maybe in
lazy_scan_prune()?

> > We could probably *also* freeze tuples opportunistically (e.g., freeze
> > a few tuples on a page early to be able to mark it all-frozen sooner),
>
> We should definitely do that when a page is already being dirtied for other
> reasons.

Agreed. But at the very least we should be thinking about the
whole-page picture. If we made freezing a whole page cheap enough
(smaller WAL records), then maybe the separate all-visible state
becomes totally unnecessary (except for pg_upgrade).

-- 
Peter Geoghegan



Hi,

On 2021-11-10 19:33:01 -0800, Peter Geoghegan wrote:
> On Wed, Nov 10, 2021 at 7:16 PM Andres Freund <andres@anarazel.de> wrote:
> > We can't really do the same with the
> > hard horizons. It's currently super likely that the horizon moves forward a
> > lot between vacuum_set_xid_limits() and GlobalVisTestFor(), but moving it
> > closer still is going in the wrong direction.
> 
> ...why is that super likely? I have to admit that I have no idea what
> you mean by that.

Err, I missed a *not* in there, sorry for the confusion.


> > So I'm inclined to add a comment saying that we need to do the
> > GlobalVisTestFor(), pointing to the comment in ComputeXidHorizons(), and
> > adding a TODO that we should find a heuristic when to recompute horizons?
> 
> Okay. Even still, I suggest that you say something about this new
> DELETE_IN_PROGRESS pruning behavior in vacuumlazy.c. Maybe in
> lazy_scan_prune()?

That doesn't quite make sense to me. There are quite a few ways that an HTSV
result can change for the same tuple, and listing them all in vacuumlazy.c
doesn't really help people notice that when they add a new call or such. There
are plenty of other callsites - like the one in heap_prune_chain()...

How about changing the comment above lazy_scan_prune() to note that the
concurrent abort case is one example of this behaviour? And then add a comment
to HeapTupleSatisfiesVacuum() along the lines of:

NB: The result of repeated calls to HeapTupleSatisfiesVacuum for the same
tuple may change even while the buffer is locked. Cases of this include:
- the transaction that inserted a tuple (version) aborted, changing the result
  from HEAPTUPLE_INSERT_IN_PROGRESS to HEAPTUPLE_DEAD
- when used with GlobalVisTestIsRemovableXid(), HEAPTUPLE_DELETE_IN_PROGRESS
  can change to HEAPTUPLE_DEAD if the horizon is updated

However, HEAPTUPLE_DEAD will not change back to HEAPTUPLE_RECENTLY_DEAD or
such.


> > > We could probably *also* freeze tuples opportunistically (e.g., freeze
> > > a few tuples on a page early to be able to mark it all-frozen sooner),
> >
> > We should definitely do that when a page is already being dirtied for other
> > reasons.
> 
> Agreed. But at the very least we should be thinking about the
> whole-page picture. If we made freezing a whole page cheap enough
> (smaller WAL records), then maybe the separate all-visible state
> becomes totally unnecessary (except for pg_upgrade).

I doubt it. But I think getting rid of unnecessarily dirtying the page a
second (or third) time would be a big enough win already...

Greetings,

Andres Freund



On Thu, Nov 11, 2021 at 12:05 PM Andres Freund <andres@anarazel.de> wrote:
> > ...why is that super likely? I have to admit that I have no idea what
> > you mean by that.
>
> Err, I missed a *not* in there, sorry for the confusion.

Can you at least move the GlobalVisTestFor() call back? Move it from
lazy_scan_heap() to the point right before we call lazy_scan_heap(),
from heap_vacuum_rel()? That still seems like a big improvement, while
obviously having no real side-effects.

> > Okay. Even still, I suggest that you say something about this new
> > DELETE_IN_PROGRESS pruning behavior in vacuumlazy.c. Maybe in
> > lazy_scan_prune()?
>
> That doesn't quite make sense to me. There are quite a few ways that an HTSV
> result can change for the same tuple, and listing them all in vacuumlazy.c
> doesn't really help people notice that when they add a new call or such. There
> are plenty of other callsites - like the one in heap_prune_chain()...

Well, the retry in lazy_scan_prune() says that it only expects to have
to retry in the event of an aborted transaction. If nothing else, it
seems like you might want to amend that. If in fact the same
"DELETE_IN_PROGRESS becomes DEAD" issue applies (can't see why it
wouldn't).

> How about changing the comment above lazy_scan_prune() to note that the
> concurrent abort case is one example of this behaviour? And then add a comment
> to HeapTupleSatisfiesVacuum() along the lines of:
>
> NB: The result of repeated calls to HeapTupleSatisfiesVacuum for the same
> tuple may change even while the buffer is locked. Cases of this include:
> - the transaction that inserted a tuple (version) aborted, changing the result
>   from HEAPTUPLE_INSERT_IN_PROGRESS to HEAPTUPLE_DEAD
> - when used with GlobalVisTestIsRemovableXid(), HEAPTUPLE_DELETE_IN_PROGRESS
>   can change to HEAPTUPLE_DEAD if the horizon is updated
>
> However, HEAPTUPLE_DEAD will not change back to HEAPTUPLE_RECENTLY_DEAD or
> such.

Okay - sounds good.

> > Agreed. But at the very least we should be thinking about the
> > whole-page picture. If we made freezing a whole page cheap enough
> > (smaller WAL records), then maybe the separate all-visible state
> > becomes totally unnecessary (except for pg_upgrade).
>
> I doubt it. But I think getting rid of unnecessarily dirtying the page a
> second (or third) time would be a big enough win already...

Why do you doubt it? Do you suspect that the additional cost of
freezing the whole page (relative to just setting it all-visible)
cannot be reduced to the point where it can be ignored? That might be
true, but I'd like to know if that's what you meant. Because I can't
think of any other reason, and it's not like I haven't thought about
it a fair bit already.

-- 
Peter Geoghegan



On Thu, Nov 11, 2021 at 12:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Can you at least move the GlobalVisTestFor() call back? Move it from
> lazy_scan_heap() to the point right before we call lazy_scan_heap(),
> from heap_vacuum_rel()? That still seems like a big improvement, while
> obviously having no real side-effects.

I imagine that you'd also move vistest into the main state struct
(vacrel) -- it could go right next to the existing OldestXmin field in
vacrel/LVRelState. These two things are too similar to not put next to
each other, and address together. We never use one without using the
other (except in heap_page_is_all_visible(), but that's just a
stripped-down version of lazy_scan_prune() anyway, so it shouldn't
count).

-- 
Peter Geoghegan



On Wed, Nov 10, 2021 at 7:08 PM Andres Freund <andres@anarazel.de> wrote:
> It's a good optimization to remove the reundant BufferGetBlockNumber(), but
> perhaps it's worth committing that separately?

Ordinarily I would, but the intention here is to avoid new performance
regressions. Is it okay if I backpatch this performance optimization,
trivial though it may be? Or should I just do that on HEAD?

Attached is v3, which works though some of your feedback, though there
are still unresolved issues. I expect that we'll need to very
carefully review this bugfix, since it's a particularly tricky issue.
I expect to produce several more revisions.

> > @@ -256,10 +260,8 @@ heap_page_prune(Relation relation, Buffer buffer,
> >                offnum <= maxoff;
> >                offnum = OffsetNumberNext(offnum))
> >       {
> > -             ItemId          itemid;
> > -
> >               /* Ignore items already processed as part of an earlier chain */
> > -             if (prstate.marked[offnum])
> > +             if (prstate.fromvalidchain[offnum])
> >                       continue;
> >
> >               /*
>
> ISTM that we should now call heap_prune_chain() only for the start of actual
> chains (i.e. redirect or non-hot tuple).

Seems easier to put the "heap-only tuple at rootoffnum, so it might be
disconnected -- not for me to process unless and until I find it
during a later call that uses this item's HOT chain's root offset
number" logic here. That's at least closer to what we have already.
Let me know if you still don't like it in v3, though. I might just
change it, but it's related to the way that the array is now called
"visited" -- a question that I'll consider in the paragraph
immediately following this one.

> I think it'd be good to have a comment above the loop explaining that we're
> just iterating over chains starting at a non-HOT element.

You mean like heap_get_root_tuples() does it, with its own root_offsets array?

I have changed the array -- it's now "visited". It's no longer
specifically about HOT chains. It's actually more like "definitely not
disconnected heap-only tuple" items, but I didn't call it that for the
obvious reason. By the end of the first pass over the heap, all items
that are not heap-only tuples (as well as most heap-only tuples) will
be marked visited in v3.

The advantage of not explicitly making it about root-ness is that it
avoids relying on the already-squishy definition of "root item". We
want to visit an LP_DEAD or LP_UNUSED item in the first pass over the
page -- we must get them out of the way before the second pass (we
don't expect to have to deal with any item that isn't a disconnected
heap-only tuple at that point).

Another advantage of the "visited" array design is that it doesn't
matter if we happen to visit an LP_DEAD/LP_UNUSED item via a heap-only
tuple's bogus-but-valid t_ctid chain. Might as well just mark it
"visited" at that point, avoiding a second visit during the first heap
pass (in addition to the second pass). That micro optimization may not
be worth much, but it feels kind of natural to me. What do you think?

> > @@ -269,15 +271,29 @@ heap_page_prune(Relation relation, Buffer buffer,
> >               if (off_loc)
> >                       *off_loc = offnum;
> >
> > -             /* Nothing to do if slot is empty or already dead */
> > -             itemid = PageGetItemId(page, offnum);
> > -             if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
> > -                     continue;
> > -
> >               /* Process this item or chain of items */
> >               ndeleted += heap_prune_chain(buffer, offnum, &prstate);
> >       }
>
> Why did you move this into heap_prune_chain()?

I find it simpler that way. Now both the first and second pass loops
have essentially the same structure as each other.

> >
> >       /*
> > -      * XXX Should we update the FSM information of this page ?
> > -      *

> I don't think this should be part of the commit?

Agreed. Pushed that as a separate commit just now.

> I think we should make it an error to reach the same tuple multiple ways. It's
> not *quite* as trivial as making this an assert/error though, as we can only
> raise an error after checking that the xmin/xmax thing passes.

But as you go on to say, we should *not* do that for LP_DEAD or
LP_UNUSED items (nor heap-only tuples that don't pass the xmin/xmax
crosscheck) -- since in general a heap-only tuple's t_ctid link is
allowed to point to just about anything (e.g., past the end of a heap
page's line pointer array).

Wouldn't it be simpler and more robust if we leaned on
heap_prune_satisfies_vacuum() for this instead? There will end up
being some remaining "disconnected" heap-only tuples when the same
tuple is reachable multiple ways. It's very likely that these
apparently-disconnected-but-who-knows tuples will fail some kind of
validation by heap_prune_satisfies_vacuum() in the second pass over
the page -- we expect to be able to treat them as DEAD, which provides
a good cross-check. (Granted, there is an unresolved question about
what exact HTSV return codes we can expect here, but I imagine that a
LIVE disconnected tuple will be fully impossible, for example.)

Note also that v3 directly Asserts() for the truly important case: The
case where a HOT chain whose root item is an LP_REDIRECT has only one
valid-looking item (the root item itself). That assertion is near the
end of heap_prune_chain() in v3. This assertion more or less replaces
the very dangerous "We found a redirect item that doesn't point to a
valid follow-on" block from HEAD.

> >               /*
> >                * If we are looking at the redirected root line pointer, jump to the
> >                * first normal tuple in the chain.  If we find a redirect somewhere
> > @@ -619,42 +617,63 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
> >               {
> >                       if (nchain > 0)
> >                               break;                  /* not at start of chain */
> > +                     Assert(rootisredirect);
> >                       chainitems[nchain++] = offnum;
> >                       offnum = ItemIdGetRedirect(rootlp);
> >                       continue;
> >               }
>
> Why are we handling this inside the loop?

Removed in v3.

> > +             if (!ItemIdIsUsed(lp) || ItemIdIsDead(lp))
> > +             {
> > +                     /*
> > +                      * XXX What if we just came from root item, which is a plain heap
> > +                      * tuple?  Do we need assert that notices when we reach an LP_DEAD
> > +                      * or LP_UNUSED item having started from such a root item?
> > +                      */
> > +                     Assert(!rootisredirect || nchain > 1);
>
> I don't think we can assert that.

Agreed, it has been removed in v3.

> I don't understand these new assert? We just set tupdead to false above?

I don't either. Removed in v3.

> I now actually wonder why this is correct. There's another comment about it:
>
> > We also prune any RECENTLY_DEAD tuples preceding a DEAD tuple.
> > * This is OK because a RECENTLY_DEAD tuple preceding a DEAD tuple is really
>
> But that doesn't justify very much.
>
> What prevents the scenario that some other backend e.g. has a snapshot with
> xmin=xmax=RECENTLY_DEAD-row. If the RECENTLY_DEAD row has an xid that is later
> than the DEAD row, this afaict would make it perfectly legal to prune the DEAD
> row, but *not* the RECENTLY_DEAD one.

I'll need to think about this very carefully. I didn't think it was
worth blocking v3 on, though naturally it's a big concern.

> This comment can't really be understood without the historical behaviour of
> the function.

Agreed. Removed in v3.

> > +     if (res == HEAPTUPLE_DEAD || res == HEAPTUPLE_RECENTLY_DEAD)
> > +     {
> > +             heap_prune_record_unused(prstate, offnum);
> > +             HeapTupleHeaderAdvanceLatestRemovedXid(htup,
> > +                                                    &prstate->latestRemovedXid);
> > +     }
> > +     else
> > +             Assert(false);

> It seems weird to have the if (res == HEAPTUPLE_DEAD ..) branch, but then to
> unconditionally return 1.

I changed that in v3.

> I'm not actually sure the Assert is unreachable. I can imagine cases where
> we'd see e.g. DELETE/INSERT_IN_PROGRESS due to a concurrent subtransaction
> abort or such.

I don't know either. I am leaving this question unresolved for now.

--
Peter Geoghegan

On Thu, Nov 11, 2021 at 4:58 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > What prevents the scenario that some other backend e.g. has a snapshot with
> > xmin=xmax=RECENTLY_DEAD-row. If the RECENTLY_DEAD row has an xid that is later
> > than the DEAD row, this afaict would make it perfectly legal to prune the DEAD
> > row, but *not* the RECENTLY_DEAD one.
>
> I'll need to think about this very carefully. I didn't think it was
> worth blocking v3 on, though naturally it's a big concern.

If we're to traverse HOT chains right to the end in
heap_prune_chain(), reading even LIVE tuples (per the approach
proposed in my bugfix patch), we probably need to be more careful
about concurrently aborted xacts -- relying on the usual
!HeapTupleHeaderIsHotUpdated(htup) test doesn't seem safe.

Imagine if we land on a concurrently-aborted DEAD tuple at the end of
a physical HOT chain -- this might not be caught before we test the
previous tuple in the chain using HeapTupleHeaderIsHotUpdated(htup) --
the abort might happen just as we land on the final/aborted tuple. We
certainly shouldn't conclude that the whole HOT chain is now DEAD,
just because that one tuple is dead.

That definitely cannot happen on HEAD, I think, because we just give
up as soon as we see anything that isn't either DEAD or RECENTLY_DEAD.
But maybe it's possible with the patch.

-- 
Peter Geoghegan



On Thu, Nov 11, 2021 at 8:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Nov 11, 2021 at 4:58 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > > What prevents the scenario that some other backend e.g. has a snapshot with
> > > xmin=xmax=RECENTLY_DEAD-row. If the RECENTLY_DEAD row has an xid that is later
> > > than the DEAD row, this afaict would make it perfectly legal to prune the DEAD
> > > row, but *not* the RECENTLY_DEAD one.
> >
> > I'll need to think about this very carefully. I didn't think it was
> > worth blocking v3 on, though naturally it's a big concern.
>
> If we're to traverse HOT chains right to the end in
> heap_prune_chain(), reading even LIVE tuples (per the approach
> proposed in my bugfix patch), we probably need to be more careful
> about concurrently aborted xacts -- relying on the usual
> !HeapTupleHeaderIsHotUpdated(htup) test doesn't seem safe.

I wonder if we're approaching this business with "RECENTLY_DEAD can be
upgraded to DEAD" in entirely the wrong way. Why not just not do that
at all anymore, on the off chance that it's unsafe? Why even take a
small chance? Our decision has to work at the level of the whole
entire HOT chain, and it seems to me that we should make that as
simple as possible.

I'm pretty sure that we're giving up nothing this way. We can just
take the conservative position that it's never okay for
heap_prune_chain() to consider a heap-only tuple from a validated HOT
chain DEAD, unless A.) HTSV says it is DEAD when asked, and B.) HTSV
said the same thing (tuple DEAD) about any and all earlier heap-only
tuples in the chain. Moreover, if HTSV says
LIVE/RECENTLY_DEAD/whatever about one tuple in the chain, then we
refuse to treat *ALL* successor tuples in the same chain as DEAD --
even when HTSV says that they're DEAD directly. As long as these
DEAD-to-HTSV heap-only tuples appear to be from the same original HOT
chain, we don't need to change our mind about the HOT chain.
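
To spell that rule out, here's a tiny standalone illustration (made-up
names, not code from the patch):

    /* Illustration only: chain members are treated as prunable just while
     * every earlier member of the same chain was also DEAD. */
    typedef enum { TUPLE_DEAD, TUPLE_RECENTLY_DEAD, TUPLE_LIVE } TupleState;

    /* Returns how many leading members of the chain may be pruned. */
    static int
    prunable_prefix_len(const TupleState *chain, int nchain)
    {
        int     n = 0;

        /* Stop at the first member that isn't DEAD; nothing after it is
         * treated as DEAD either, even if HTSV reports DEAD for it. */
        while (n < nchain && chain[n] == TUPLE_DEAD)
            n++;
        return n;
    }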

The original structure of heap_prune_chain() from the HOT commit in
2007 had more or less the same code and structure as it has now, but
almost everything was in a critical section -- the state arrays and
so on only came about a year later, in commit 6f10eb2111. The proposed
bug fix more or less finishes the work of the second commit, which
didn't go far enough. As long as we are starting by building a
consistent picture of valid HOT chains on the page, and only later
handle disconnected heap-only tuples, everything works out.

-- 
Peter Geoghegan



On Thu, Nov 11, 2021 at 9:46 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I wonder if we're approaching this business with "RECENTLY_DEAD can be
> upgraded to DEAD" in entirely the wrong way. Why not just not do that
> at all anymore, on the off chance that it's unsafe? Why even take a
> small chance? Our decision has to work at the level of the whole
> entire HOT chain, and it seems to me that we should make that as
> simple as possible.

Attached revision does it that way.

It also addresses the separate issue of DEAD vs RECENTLY_DEAD
disconnected tuples -- that was the other unresolved question. This
revision takes a harder line on the state of disconnected heap-only
tuples. Andres said that he doesn't know for sure that disconnected
heap-only tuples cannot be DELETE/INSERT_IN_PROGRESS -- "I'm not
actually sure the Assert is unreachable. I can imagine cases where
we'd see e.g. DELETE/INSERT_IN_PROGRESS due to a concurrent
subtransaction abort or such". But I don't see how that's possible. In
fact, I don't even see how it's possible for these items to be
RECENTLY_DEAD -- I think that they must always be DEAD (or we're in
big trouble anyway).

These are not just any heap-only tuples. They're heap-only tuples that
cannot possibly be accessed from a HOT chain. And so it's just
physically impossible for them to be returned by index scans -- this
is a certainty. How could they not be DEAD, in every possible sense?
How could they not come from an aborted transaction, specifically?

Naturally, I also went through the exercise of trying to find a
counterexample, where pruning doesn't see a disconnected tuple as DEAD
in its HTSV. I could not get the assertion to fail with Alexander's
test case, nor with make check-world. If the assertion did fail, then
I imagine that that would have to be due to a problem elsewhere -- it
couldn't be a problem with the "disconnected heap-only tuples must
already be DEAD to HTSV" assumption itself. That is one of the few
things about this patch that *isn't* complicated.

-- 
Peter Geoghegan

Hi,

On 2021-11-12 13:11:54 -0800, Peter Geoghegan wrote:
> It also addresses the separate issue of DEAD vs RECENTLY_DEAD
> disconnected tuples -- that was the other unresolved question. This
> revision takes a harder line on the state of disconnected heap-only
> tuples. Andres said that he doesn't know for sure that disconnected
> heap-only tuples cannot be DELETE/INSERT_IN_PROGRESS -- "I'm not
> actually sure the Assert is unreachable. I can imagine cases where
> we'd see e.g. DELETE/INSERT_IN_PROGRESS due to a concurrent
> subtransaction abort or such". But I don't see how that's possible. In
> fact, I don't even see how it's possible for these items to be
> RECENTLY_DEAD -- I think that they must always be DEAD (or we're in
> big trouble anyway).
>
> These are not just any heap-only tuples. They're heap-only tuples that
> cannot possibly be accessed from a HOT chain. And so it's just
> physically impossible for them to be returned by index scans -- this
> is a certainty. How could they not be DEAD, in every possible sense?
> How could they not come from an aborted transaction, specifically?

With subtransactions abort is a bit more complicated than with plain
transactions. I'm not at all sure a problematic scenario exists, but I
wouldn't want to rely on it.

Especially if suboverflowed comes into play there can be scenarios where one
backend uses TransactionIdDidAbort() + SubTransGetTopmostTransaction() for
in-progress determination while another just relies on the procarray. Those
aren't updated atomically with respect to each other.

Also, heap_update()'s wait = true path uses a bit different logic again to
wait for other backends than what HeapTupleSatisfiesVacuum() ends up with.


> Naturally, I also went through the exercise of trying to find a
> counterexample, where pruning doesn't see a disconnected tuple as DEAD
> in its HTSV. I could not get the assertion to fail with Alexander's
> test case, nor with make check-world.

I don't think that provides meaningful coverage. Alexander's test has a
quite limited set of operations (which e.g. doesn't include any subxacts), and
our own test coverage of subtransactions, and particularly of concurrent
subtransaction heavy work, is quite, uh, minimal.

Greetings,

Andres Freund



On Fri, Nov 12, 2021 at 2:29 PM Andres Freund <andres@anarazel.de> wrote:
> With subtransactions abort is a bit more complicated than with plain
> transactions. I'm not at all sure a problematic scenario exists, but I
> wouldn't want to rely on it.

What would it actually mean to rely on it, or to not rely on it?

As I've pointed out many times already, a disconnected heap tuple
cannot be accessed from an index scan -- this is something that you
*can* rely on, because we've performed exactly the same steps as
heap_hot_search_buffer() would in making that determination. When you
talk about what HTSV thinks of the tuple, you're merely talking about
how to behave in the event of a specific form of HOT chain corruption
(a theoretical background risk for HOT chains that's nothing new).

This is a question of trade-offs around adding defensive checks and so
on. It is not a question of making the corruption itself any less
likely (unless early detection allows the user to prevent further
corruption). I'm a bit confused here, because it sounds like you might
not agree with that.

> > Naturally, I also went through the exercise of trying to find a
> > counterexample, where pruning doesn't see a disconnected tuple as DEAD
> > in its HTSV. I could not get the assertion to fail with Alexander's
> > test case, nor with make check-world.
>
> I don't think that provides meaningful coverage. Alexander's test has a
> quite limited set of operations (which e.g. doesn't include any subxacts), and
> our own test coverage of subtransactions, and particularly of concurrent
> subtransaction heavy work, is quite, uh, minimal.

It's a start.

We need to be pragmatic here. There is some uncertainty about what
HTSV might say about a disconnected tuple in the absence of
corruption, or there is a risk of a new problem like that coming up in
the future -- let's work within those confines, then. What do you want
to do about that? There aren't that many choices, since, to repeat,
the tuple is "morally" DEAD no matter what. Even with corruption, even
without corruption in the presence of some unanticipated corner case
with HTSV -- this is fundamental.

-- 
Peter Geoghegan



Hi,

On 2021-11-12 14:46:22 -0800, Peter Geoghegan wrote:
> On Fri, Nov 12, 2021 at 2:29 PM Andres Freund <andres@anarazel.de> wrote:
> > With subtransactions abort is a bit more complicated than with plain
> > transactions. I'm not at all sure a problematic scenario exists, but I
> > wouldn't want to rely on it.
> 
> What would it actually mean to rely on it, or to not rely on it?

That we shouldn't throw an error / assert out if we find such a tuple.


> As I've pointed out many times already, a disconnected heap tuple
> cannot be accessed from an index scan -- this is something that you
> *can* rely on, because we've performed exactly the same steps as
> heap_hot_search_buffer() would in making that determination.

Yes, it'd also not be considered visible by SatisfiesMVCC().


> When you talk about what HTSV thinks of the tuple, you're merely talking
> about how to behave in the event of a specific form of HOT chain corruption
> (a theoretical background risk for HOT chains that's nothing new).

My point is that I don't think it necessarily signals corruption -- it could
just be a very short term transient state under heavy concurrency.


> We need to be pragmatic here. There is some uncertainty about what
> HTSV might say about a disconnected tuple in the absence of
> corruption, or there is a risk of a new problem like that coming up in
> the future -- let's work within those confines, then. What do you want
> to do about that? There aren't that many choices, since, to repeat,
> the tuple is "morally" DEAD no matter what. Even with corruption, even
> without corruption in the presence of some unanticipated corner case
> with HTSV -- this is fundamental.

I think we can assert/error out if it's visible -- that's clearly
corruption. I'd personally not add assert/error checks for other states, given
that it could plausibly happen without indicating a problem. Debugging
transient errors that happen rarely, under high load, with nontrivial
workloads isn't fun.

Greetings,

Andres Freund



On Fri, Nov 12, 2021 at 2:57 PM Andres Freund <andres@anarazel.de> wrote:
> > What would it actually mean to rely on it, or to not rely on it?
>
> That we shouldn't throw an error / assert out if we find such a tuple.

I agree with you about making this an error -- let's not do that. But
I disagree about making this an assertion. I *strongly* disagree, in
fact.

I am quite willing to accept general uncertainty about what's really
possible when it comes to what HTSV might say in edge cases. But I
think we need to take a firm position on what we believe is possible,
by having an assertion that formally represents what we believe to be
true, including our reasoning. If that turns out to be wrong, great --
we then have the opportunity to be less wrong in the future. What has
it cost us? The minor annoyance of turning a buildfarm animal red for
a while?

> > When you talk about what HTSV thinks of the tuple, you're merely talking
> > about how to behave in the event of a specific form of HOT chain corruption
> > (a theoretical background risk for HOT chains that's nothing new).
>
> My point is that I don't think it necessarily signals corruption -- it could
> just be a very short term transient state under heavy concurrency.

I think that that's reasonable as a working assumption -- I really do.
I also think that you need pretty thorough assertions for this.

> > We need to be pragmatic here. There is some uncertainty about what
> > HTSV might say about a disconnected tuple in the absence of
> > corruption, or there is a risk of a new problem like that coming up in
> > the future -- let's work within those confines, then. What do you want
> > to do about that? There aren't that many choices, since, to repeat,
> > the tuple is "morally" DEAD no matter what. Even with corruption, even
> > without corruption in the presence of some unanticipated corner case
> > with HTSV -- this is fundamental.
>
> I think we can assert/error out if it's visible -- that's clearly
> corruption. I'd personally not add assert/error checks for other states, given
> that it could plausibly happen without indicating a problem. Debugging
> transient errors that happen rarely, under high load, with nontrivial
> workloads isn't fun.

What if it's just an assertion failure (just for non-LIVE HTSV return
codes)? An assertion is a very different thing from a defensive "can't
happen" ERROR.

There is a decent chance that there is a bug if the return code is,
say, HEAPTUPLE_INSERT_IN_PROGRESS. I just cannot think of a scenario
where this specific code path sees a tuple as INSERT_IN_PROGRESS that
isn't also very broken.


--
Peter Geoghegan



Hi,

On 2021-11-12 13:11:54 -0800, Peter Geoghegan wrote:
> Attached revision does it that way.

I wonder if we should try to go for something considerably simpler for 14. How
about having a new array that just stores the HTSV state for every
ItemIdIsNormal(). For simplicity, we could populate that array eagerly in a
separate loop.
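
Something like the following, just as a sketch of the idea (htsv would be a
new array in the pruning state, it doesn't exist today):

    /* Sketch only: compute HTSV once per normal item, up front, so that
     * later chain-following never re-evaluates it for the same tuple.
     * htsv is assumed to be e.g. an int8 array added to PruneState. */
    for (offnum = FirstOffsetNumber;
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      itemid = PageGetItemId(page, offnum);
        HeapTupleData tup;

        prstate.htsv[offnum] = -1;      /* not a normal item */
        if (!ItemIdIsNormal(itemid))
            continue;

        tup.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
        tup.t_len = ItemIdGetLength(itemid);
        tup.t_tableOid = RelationGetRelid(relation);
        ItemPointerSet(&tup.t_self, BufferGetBlockNumber(buffer), offnum);

        prstate.htsv[offnum] = heap_prune_satisfies_vacuum(&prstate, &tup,
                                                           buffer);
    }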

That'd fix the known bugs, and yield better efficiency (because we'd not
re-compute HTSV all the time). Then for HEAD go for something that fixes
pruning more fundamentally.

Greetings,

Andres Freund



On Fri, Nov 12, 2021 at 3:31 PM Andres Freund <andres@anarazel.de> wrote:
> I wonder if we should try to go for something considerably simpler for 14. How
> about having a new array that just stores the HTSV state for every
> ItemIdIsNormal(). For simplicity, we could populate that array eagerly in a
> separate loop.

Why is that simpler than a boolean array, which represents whether or
not the item has had its heap_prune_record_unused() call yet (if it's
a tuple with storage)?

> That'd fix the known bugs, and yield better efficiency (because we'd not
> re-compute HTSV all the time). Then for HEAD go for something that fixes
> pruning more fundamentally.

I don't know what you mean about the patch recomputing HTSV all the
time. The patch doesn't do that.

It's true that we'll call HTSV (heap_prune_record_unused(), actually)
more often when following HOT chains, because now we follow them until
the end. However, the heap_prune_record_unused() calls only happen
after we've already validated that we found a heap-only tuple that's
part of the same HOT chain. That just leaves disconnected tuples. We
do call heap_prune_record_unused() there too (which is theoretically
unnecessary), but only once.

Overall, we are *guaranteed* to only call heap_prune_record_unused()
at most once per tuple with storage. I believe that this is a small
reduction, since HEAD will do the maybe-aborted precheck call to
heap_prune_record_unused() before anything else.

I guess you might have meant something about the more-conservative
behavior with DEAD vs RECENTLY_DEAD during HOT chain traversal. But
the cases where that makes any difference at all ought to be very rare
indeed.

-- 
Peter Geoghegan



On Fri, Nov 12, 2021 at 3:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Overall, we are *guaranteed* to only call heap_prune_record_unused()
> at most once per tuple with storage. I believe that this is a small
> reduction, since HEAD will do the maybe-aborted precheck call to
> heap_prune_record_unused() before anything else.

This is copy-pasta. I meant heap_prune_satisfies_vacuum(), not
heap_prune_record_unused().


-- 
Peter Geoghegan



Hi,

On 2021-11-12 15:47:45 -0800, Peter Geoghegan wrote:
> On Fri, Nov 12, 2021 at 3:31 PM Andres Freund <andres@anarazel.de> wrote:
> > I wonder if we should try to go for something considerably simpler for 14. How
> > about having a new array that just stores the HTSV state for every
> > ItemIdIsNormal(). For simplicity, we could populate that array eagerly in a
> > separate loop.
> 
> Why is that simpler than a boolean array, which represents whether or
> not the item has had its heap_prune_record_unused() call yet (if it's
> a tuple with storage)?

Well, your change is basically a new approach to pruning - a much better
one. But it's a larger change than just eliminating the repeated HTSV calls so
that they cannot change over time. That'd be ~10-15 lines.


> > That'd fix the known bugs, and yield better efficiency (because we'd not
> > re-compute HTSV all the time). Then for HEAD go for something that fixes
> > pruning more fundamentally.
> 
> I don't know what you mean about the patch recomputing HTSV all the
> time. The patch doesn't do that.

That was in comparison to HEAD, not your patch.

Greetings,

Andres Freund



On Fri, Nov 12, 2021 at 3:56 PM Andres Freund <andres@anarazel.de> wrote:
> > Why is that simpler than a boolean array, which represents whether or
> > not the item has had its heap_prune_record_unused() call yet (if it's
> > a tuple with storage)?
>
> Well, your change is basically a new approach to pruning - a much better
> one. But it's a larger change than just eliminating the repeated HTSV calls so
> that they cannot change over time. That'd be ~10-15 lines.

Before Wednesday, I had no idea that a HEAPTUPLE_DELETE_IN_PROGRESS
could turn to DEAD during VACUUM/pruning at all. And so I now have to
wonder: what else don't I know?

There is one thing that I am fairly confident of here: the HOT chain
validation stuff is very robust. Experience has shown the invariants
to be very reliable (and if they're not reliable then we're in big
trouble anyway). The invariants can be leveraged to make the pruning
fix robust against both the issue we know about, and hypothetical
undiscovered issues that also may affect pruning. The overall risk
seems much lower with my patch. It isn't the smallest possible patch,
certainly, but that doesn't seem all that relevant, given the
specifics of the situation.

-- 
Peter Geoghegan



On 2021-11-12 16:37:24 -0800, Peter Geoghegan wrote:
> There is one thing that I am fairly confident of here: the HOT chain
> validation stuff is very robust. Experience has shown the invariants
> to be very reliable (and if they're not reliable then we're in big
> trouble anyway).

What experience? Isn't this all new stuff?



On Fri, Nov 12, 2021 at 5:48 PM Andres Freund <andres@anarazel.de> wrote:
> On 2021-11-12 16:37:24 -0800, Peter Geoghegan wrote:
> > There is one thing that I am fairly confident of here: the HOT chain
> > validation stuff is very robust. Experience has shown the invariants
> > to be very reliable (and if they're not reliable then we're in big
> > trouble anyway).
>
> What experience? Isn't this all new stuff?

I meant heapam since 2007, when HOT was first added -- all of it.

Notably, v4 of the patch makes the most conservative possible
assumptions about how HTSV might change its mind about an XID -- no
more "A RECENTLY_DEAD tuple is actually sometimes DEAD in the specific
context of processing a HOT chain". Now it's more like "A DEAD tuple
is actually sometimes RECENTLY_DEAD at the level of a HOT chain,
except for disconnected/aborted heap-only tuples, and if you don't
like that (i.e. during VACUUM) you can just retry pruning from
scratch immediately afterwards".

You said it yourself: who knows exactly what the justification for
RECENTLY_DEAD->DEAD was? I have to imagine it had something to do with the
"INSERT_IN_PROGRESS becomes DEAD due to concurrent xact abort" thing,
but that's unclear. And even if it was clear, and even if we knew that
it was 100% safe at one point, it still wouldn't be clear that it's
safe today, in Postgres 14.

--
Peter Geoghegan



On Fri, Nov 12, 2021 at 5:57 PM Peter Geoghegan <pg@bowt.ie> wrote:
> You said it yourself: who knows exactly what the justification for
> RECENTLY_DEAD->DEAD was? I have to imagine it had something to do with the
> "INSERT_IN_PROGRESS becomes DEAD due to concurrent xact abort" thing,
> but that's unclear. And even if it was clear, and even if we knew that
> it was 100% safe at one point, it still wouldn't be clear that it's
> safe today, in Postgres 14.

Another relevant factor is how we deal with already-corrupt HOT chains
affected by the bug. I would be comfortable with a full "can't happen"
error in the new code path for disconnected and aborted heap-only
tuples, provided the error only gets raised when the tuple is fully
LIVE according to HTSV (and also assert that it's DEAD). Something
like my v4 plus this LIVE-should-be-DEAD defensive error seems very
likely to avoid making the corruption any worse. There is a huge
amount of redundancy in the tuple headers that we can cross check
inexpensively.
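
To make that concrete, here is a minimal sketch of the check I have in mind
(not the actual patch; "htsv", "offnum", "buffer" and "prstate" are just the
names used for illustration, and the error wording is made up):

    if (htsv == HEAPTUPLE_LIVE)
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg_internal("orphaned heap-only tuple at (%u,%u) is unexpectedly LIVE",
                                 BufferGetBlockNumber(buffer), offnum)));

    /* In the absence of corruption the tuple must be prunable */
    Assert(htsv == HEAPTUPLE_DEAD);
    heap_prune_record_unused(prstate, offnum);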

-- 
Peter Geoghegan



> On Fri, Nov 12, 2021 at 02:46:22PM -0800, Peter Geoghegan wrote:
> On Fri, Nov 12, 2021 at 2:29 PM Andres Freund <andres@anarazel.de> wrote:
> > > Naturally, I also went through the exercise of trying to find a
> > > counterexample, where pruning doesn't see a disconnected tuple as DEAD
> > > in its HTSV. I could not get the assertion to fail with Alexander's
> > > test case, nor with make check-world.
> >
> > I don't think that provides meaningful coverage. Alexander's test has a
> > quite limited set of operations (which e.g. doesn't include any subxacts), and our
> > own tests around subtransactions, and particularly concurrent subtransaction
> > heavy work, are quite, uh, minimal.
>
> It's a start.
>
> We need to be pragmatic here. There is some uncertainty about what
> HTSV might say about a disconnected tuple in the absence of
> corruption, or there is a risk of a new problem like that coming up in
> the future -- let's work within those confines, then. What do you want
> to do about that? There aren't that many choices, since, to repeat,
> the tuple is "morally" DEAD no matter what. Even with corruption, even
> without corruption in the presence of some unanticipated corner case
> with HTSV -- this is fundamental.

I got curious whether modifying Alexander's test case could reveal
something interesting, and sprinkled it with savepoints and rollbacks.
Almost immediately a new problem has manifested itself, although the
crash has nothing to do with the disconnected tuples as far as I can
tell -- still probably worth mentioning. In this case vacuum invoked
lazy_scan_prune, and during the first scan one of the chains had a
HEAPTUPLE_DEAD at the third position. The processing flow fell through
to heap_prune_record_prunable and crashed on an assert with an
InvalidTransactionId:

    #3  0x000055a2b260d1f9 in heap_prune_record_prunable (prstate=0x7ffd0c0ecdf0, xid=0) at pruneheap.c:872
    #4  0x000055a2b260ca72 in heap_prune_chain (buffer=2117, rootoffnum=150, prstate=0x7ffd0c0ecdf0) at pruneheap.c:695
    #5  0x000055a2b260bcd6 in heap_page_prune (relation=0x7fb98e217e20, buffer=2117, vistest=0x55a2b31d2d60 <GlobalVisCatalogRels>, old_snap_xmin=0, old_snap_ts=0, report_stats=false, off_loc=0x55a2b3e6a0cc) at pruneheap.c:288
    #6  0x000055a2b261309c in lazy_scan_prune (vacrel=0x55a2b3e6a060, buf=2117, blkno=192, page=0x7fb97856bf80 "", vistest=0x55a2b31d2d60 <GlobalVisCatalogRels>, prunestate=0x7ffd0c0ee9d0) at vacuumlazy.c:1739

Applying heap_prune_record_prunable only if TransactionIdIsNormal seems
to help. The original implementation didn't reach
heap_prune_record_prunable either and also doesn't crash.
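
Roughly speaking, the workaround amounts to guarding the call in
heap_prune_chain() like this (just a sketch of the idea, not a polished
patch; "htup" is the tuple header being examined, as in the existing code):

    /*
     * Only record the update XID as "prunable" when it is a normal
     * transaction ID.  A DEAD tuple can carry an invalid update XID (e.g.
     * after an aborted update), which is what tripped the assertion in
     * heap_prune_record_prunable().
     */
    if (TransactionIdIsNormal(HeapTupleHeaderGetUpdateXid(htup)))
        heap_prune_record_prunable(prstate, HeapTupleHeaderGetUpdateXid(htup));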



On Sat, Nov 13, 2021 at 7:05 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> I got curious whether modifying Alexander's test case could reveal
> something interesting, and sprinkled it with savepoints and rollbacks.
> Almost immediately a new problem has manifested itself, although the
> crash has nothing to do with the disconnected tuples as far as I can
> tell -- still probably worth mentioning. In this case vacuum invoked
> lazy_scan_prune, and during the first scan one of the chains had a
> HEAPTUPLE_DEAD at the third position. The processing flow fell through
> to heap_prune_record_prunable and crashed on an assert with an
> InvalidTransactionId:

Is this just with the bugfix applied? I think that it is. Looks like a
minor bug to me.

I think that I need to consistently "break" in the DEAD case, to avoid
ending up here. In other words, it should not literally be
"reinterpreted" as RECENTLY_DEAD by falling through in the switch
statement (though the concept of reinterpreting certain DEAD tuples as
RECENTLY_DEAD still seems perfectly sound).

Here's why the assertion (invalid xmax/update xid cannot be used in
heap_prune_record_prunable() call) fails:

DEAD means that you might not have a valid update XID -- aborted
update is what we expect. But RECENTLY_DEAD means that there must have
been a deleter xact, and that the xact must have committed (can't have
been that the inserter aborted). This is a consequence of the fact
that the tuple is at least still visible to somebody (or could be),
unlike in the DEAD case. And so xmax must be a valid XID, and so the
existing RECENTLY_DEAD case handling can legitimately always expect
that. But I cannot (and should not) allow a call to
heap_prune_record_prunable() with a DEAD-to-HTSV tuple, even when I
"reinterpret" it as RECENTLY_DEAD in order to make a clean
determination of the tuple to delete up until for the entire HOT
chain.
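
In terms of the switch statement in heap_prune_chain(), the structure I have
in mind is roughly the following (only a sketch, not the committed code;
"latestdead" is the chain-local offset of the latest prunable tuple and
"htsv" the precomputed HeapTupleSatisfiesVacuum() result for "offnum"):

    switch (htsv)
    {
        case HEAPTUPLE_DEAD:
            /*
             * Remember this as the latest prunable tuple in the chain, but
             * "break" rather than fall through: a DEAD tuple's update XID
             * may be invalid (aborted updater), so it must never reach
             * heap_prune_record_prunable().
             */
            latestdead = offnum;
            break;

        case HEAPTUPLE_RECENTLY_DEAD:
            /*
             * A committed deleter is implied here, so xmax is a valid XID
             * and can be recorded, making the page get reconsidered for
             * pruning later on.
             */
            heap_prune_record_prunable(prstate,
                                       HeapTupleHeaderGetUpdateXid(htup));
            break;

        default:
            /* LIVE or *_IN_PROGRESS: cannot prune past this point */
            break;
    }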

-- 
Peter Geoghegan



Hi,

On 2021-11-10 19:08:21 -0800, Andres Freund wrote:
> On 2021-11-09 19:07:02 -0800, Peter Geoghegan wrote:
> > @@ -700,11 +720,18 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
> >          }
> >
> >          /*
> > -         * Remember the last DEAD tuple seen.  We will advance past
> > -         * RECENTLY_DEAD tuples just in case there's a DEAD one after them;
> > -         * but we can't advance past anything else.  We have to make sure that
> > -         * we don't miss any DEAD tuples, since DEAD tuples that still have
> > -         * tuple storage after pruning will confuse VACUUM.
> > +         * Remember the last DEAD tuple seen from this HOT chain.
> > +         *
> > +         * We expect to sometimes find a RECENTLY_DEAD tuple after a DEAD
> > +         * tuple.  When this happens, the RECENTLY_DEAD tuple will be treated
> > +         * as if it was DEAD all along.  Manage that now.
>
> I now actually wonder why this is correct. There's another comment about it:
>
> > We also prune any RECENTLY_DEAD tuples preceding a DEAD tuple.
> > * This is OK because a RECENTLY_DEAD tuple preceding a DEAD tuple is really
>
> But that doesn't justify very much.
>
> What prevents the scenario that some other backend e.g. has a snapshot with
> xmin=xmax=RECENTLY_DEAD-row. If the RECENTLY_DEAD row has an xid that is later
> than the DEAD row, this afaict would make it perfectly legal to prune the DEAD
> row, but *not* the RECENTLY_DEAD one.

I think the code is actually correct if awfully commented. I.e. the above
scenario cannot happen in a harmful situation. The relevant piece is that the
set of "valid" snapshots is limited (for a specific cluster history).

A row can only be RECENTLY_DEAD if the deleting / updating transaction
committed.

An intermediary RECENTLY_DEAD row followed by a DEAD row would only be visible
to a snapshot that either had

1) row.xmax >= Snapshot.xmax
2) row.xmax in Snapshot.xip.

There cannot be a snapshot that sees the RECENTLY_DEAD row while also seeing
the subsequent row as DEAD, because the existence of one of the two conditions
would have held back the xid horizon:

1) is not possible, because Snapshot.xmin is <= Snapshot.xmax, and
  Snapshot.xmin >= MyProc->xmin. The subsequent row could not be DEAD.

2) is not possible, because .xip only contains xids >= Snapshot.xmin.


IOW, there *can* be a snapshot with xmin=xmax=RECENTLY_DEAD-row, but it'd not
see the RECENTLY_DEAD row anyway.

Greetings,

Andres Freund



On Mon, Nov 15, 2021 at 6:39 PM Andres Freund <andres@anarazel.de> wrote:
> > What prevents the scenario that some other backend e.g. has a snapshot with
> > xmin=xmax=RECENTLY_DEAD-row. If the RECENTLY_DEAD row has an xid that is later
> > than the DEAD row, this afaict would make it perfectly legal to prune the DEAD
> > row, but *not* the RECENTLY_DEAD one.
>
> I think the code is actually correct if awfully commented. I.e. the above
> scenario cannot happen in a harmful situation. The relevant piece is that the
> set of "valid" snapshots is limited (for a specific cluster history).

> IOW, there *can* be a snapshot with xmin=xmax=RECENTLY_DEAD-row, but it'd not
> see the RECENTLY_DEAD row anyway.

I'm not surprised that this turned out to be correct. It hasn't made
me change my mind; I still believe that the best way forward is to
backpatch a more comprehensive fix, like my patch. Again, I just think
that that approach has the best chance of avoiding any further
problems.

The alternative proposal that you lean towards (minimal fix on 14,
commit the comprehensive fix to HEAD only) seems fine overall, though.
In any case I don't think that there is much point in further debating
it. If I was going to convince you, it would have happened by now. It
is your bug (kind of), and so I defer to you on this.

Why don't you produce a minimal fix for backpatch? I'll review that,
just as you reviewed my patch.

-- 
Peter Geoghegan



Hi,

I spent quite a bit more time trying to come up with a good testcase for
reproducing this reliably. I didn't feel confident relying on testing that
takes substantial amounts of time to hit bugs.


On 2021-11-10 11:20:10 -0800, Andres Freund wrote:
> The way this definitely breaks - I have been able to reproduce this in
> isolation - is when one tuple is processed twice by heap_prune_chain(), and
> the result of HeapTupleSatisfiesVacuum() changes from
> HEAPTUPLE_DELETE_IN_PROGRESS to DEAD.
>
> Consider a page like this:
>
> lp 1: redirect to lp2
> lp 2: deleted by xid x, not yet committed
>
> and a sequence of events like this:
>
> 1) heap_prune_chain(rootlp = 1)
> 2) commit x
> 3) heap_prune_chain(rootlp = 2)
>
> 1) heap_prune_chain(rootlp = 1) will go to lp2, and see a
> HEAPTUPLE_DELETE_IN_PROGRESS and thus not do anything.
>
> 3) then could, with the snapshot scalability changes, get DEAD back from
> HTSV. Due to the "fuzzy" nature of the post-snapshot-scalability xid horizons,
> that is possible, because we can end up rechecking the boundary condition and
> seeing that now the horizon allows us to prune x / lp2.
>
> At that point we have a redirect tuple pointing into an unused slot. Which is
> "illegal", because something independent can be inserted into that slot.

The above isn't quite right - it turns out this requires additional steps to
be triggerable (without debugging code to make things easier to hit). As shown
above, the result won't change, because x will have held the xmin horizon far
enough back that GlobalVisTestShouldUpdate() won't ever recompute the horizon
for x on its own.

However, what *can* happen is that *another* tuple is within the "fuzzy"
horizon window, and that tuple can trigger a recomputation of the
horizon. Which then basically triggers the above problem.

However even that is very hard to hit, because normally we won't recompute the
horizon in that case either, because of

 * The current heuristic is that we update only if RecentXmin has changed
 * since the last update. If the oldest currently running transaction has not
 * finished, it is unlikely that recomputing the horizon would be useful.

I.e. a snapshot needs to be computed between the vacuum_set_xid_limits()
and the heap_page_prune() call. Turns out, there's a window:
vac_open_indexes() needs to lock the indexes. Which in turn will trigger
AcceptInvalidationMessages(). Relcache invalidations then can trigger a
catalog snapshot to be built, which can increase RecentXmin, and later trigger
a new accurate horizon to be built.

Consider a page like:

lp 1: redirect to lp 3
lp 2: RECENTLY_DEAD tuple xmax x1, held back by session y's xmin horizon
lp 3: DELETE_IN_PROGRESS tuple xmax x2

and the following sequence of actions:

1) vacuum_set_xid_limits()
   -> sets ComputeXidHorizonsResultLastXmin = RecentXmin
2) session y commits
3) vac_open_indexes()
   -> AcceptInvalidationMessages()
      -> InvalidateCatalogSnapshot()
      -> RelationCacheInvalidateEntry()
        -> GetCatalogSnapshot()
          -> GetSnapshotData()
             updates RecentXmin
4) x2 commits
5) heap_prune_chain(1)
   -> sees lp3 as RECENTLY_DEAD (or DELETE_IN_PROGRESS depending on timing)
   -> doesn't do anything
6) heap_prune_chain(2)
   -> sees a RECENTLY_DEAD, updates GlobalVisTestShouldUpdate() says yes,
   updates xid horizon to above x2
7) heap_prune_chain(3)
   -> uses the HeapOnly path at the beginning, sees a DEAD tuple, marks the
      element as unused

Boom.


Not sure if it's worth it, but I think this can be made into a reliable
isolation test by causing a table's index to be locked exclusively, blocking
vac_open_indexes(), which can then reliably schedule a new relcache inval,
which in turn can compute a new RecentXmin. Might be too fragile to be worth
it.


As an independent issue, I don't think it can ever be safe to do catalog
access with PROC_IN_VACUUM set. Which vac_open_indexes() (and likely some
other things) clearly do. Without the current backend's xmin set, another
backend can remove still required tuples. Which can cause us to build corrupt
relcache entries, and then subsequently skip indexing some indexes, causing
corruption.  I think I hit this while trying to repro the above, but it's hard
to be sure.


> I hit a crash once in 13 with a slightly evolved version of the test (many
> connections creating / dropping the partitions as in the original scenario,
> using :client_id to target different tables). It's possible that my
> instrumentation was the cause of that. Unfortunately it took quite a few hours
> to hit the problem in 13...

I wonder if this was actually the above PROC_IN_VACUUM thing.

Greetings,

Andres Freund



Hi,

On 2021-11-17 23:19:41 -0800, Andres Freund wrote:
> Not sure if it's worth it, but I think this can be made into a reliable
> isolation test by causing a table's index to be locked exclusively, blocking
> vac_open_indexes(), which can then reliably schedule a new relcache inval,
> which in turn can compute a new RecentXmin. Might be too fragile to be worth
> it.

Attached is such an isolationtest. In an unmodified HEAD it ends up with

step s1_select_1:
  SET LOCAL enable_seqscan = false;
  SELECT ctid, /*xmin, xmax, */ id FROM many_updates WHERE id = 1;
  RESET enable_seqscan;

ctid |id
-----+--
(0,3)|17
(1 row)

without amcheck detecting corruption at that point:(.


It's clearly not yet something we could consider committable, but I think it
might be a useful basis for such a test.

Greetings,

Andres Freund

Attachment
On Thu, Nov 18, 2021 at 4:36 PM Andres Freund <andres@anarazel.de> wrote:
> Attached is such an isolationtest. In an unmodified HEAD it ends up with

> without amcheck detecting corruption at that point:(.

The best way to teach amcheck to detect this kind of thing is bound to
be verification of HOT chains themselves:

https://www.postgresql.org/message-id/flat/CAH2-Wznphpd490o%2Bur_bi6-YmC47hRu3Qyxod72RLqC-Uvuhcg%40mail.gmail.com

It would also be nice to teach verify_nbtree.c to look out for the
presence of index tuples that shouldn't be there during heapallindexed
verification, but that's significantly harder.

--
Peter Geoghegan



Hi,

On November 18, 2021 6:55:08 PM PST, Peter Geoghegan <pg@bowt.ie> wrote:
>On Thu, Nov 18, 2021 at 4:36 PM Andres Freund <andres@anarazel.de> wrote:
>> Attached is such an isolationtest. In an unmodified HEAD it ends up with
>
>> without amcheck detecting corruption at that point:(.
>
>The best way to teach amcheck to detect this kind of thing is bound to
>be verification of HOT chains themselves:
>
>https://www.postgresql.org/message-id/flat/CAH2-Wznphpd490o%2Bur_bi6-YmC47hRu3Qyxod72RLqC-Uvuhcg%40mail.gmail.com

That'd not reliably catch this kind of thing, because the HOT chains can easily end up looking correct enough.


>It would also be nice to teach verify_nbtree.c to look out for the
>presence of index tuples that shouldn't be there during heapallindexed
>verification, but that's significantly harder.

I think what we really need is a pass that verifies that heap tuples reached via the index actually match the index
key. There's just too many things that can break that. It'll not be blazingly fast, but there's just way too many things
that other approaches just won't find.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Hi,

On 2021-11-16 16:22:06 -0800, Peter Geoghegan wrote:
> I'm not surprised that this turned out to be correct. It hasn't made
> me change my mind; I still believe that the best way forward is to
> backpatch a more comprehensive fix, like my patch. Again, I just think
> that that approach has the best chance of avoiding any further
> problems.
> 
> The alternative proposal that you lean towards (minimal fix on 14,
> commit the comprehensive fix to HEAD only) seems fine overall, though.
> In any case I don't think that there is much point in further debating
> it. If I was going to convince you, it would have happened by now. It
> is your bug (kind of), and so I defer to you on this.

I'm not quite sure what you're referring to with "convincing me"? Whether to go
for something comprehensive or more minimal? I'm kind of agnostic on what the
right approach is.


> Why don't you produce a minimal fix for backpatch? I'll review that,
> just as you reviewed my patch.

Here's a draft of that change.


I think it's worth cleaning up the regression test I wrote and using it
regardless of which version of the fix we end up choosing. But I'm a bit
on the fence - it's quite complicated.

OTOH, I think with the framework in place it'd not be too hard to write a few
more tests for odd pruning scenarios...

If we were to, it'd probably be best to comment out the mon_* stuff, as that
is what requires the amcheck / pageinspect extensions, which we'd likely not
want to depend on. But it's pretty crucial for development.

Greetings,

Andres Freund

Attachment
On Mon, Nov 22, 2021 at 9:59 AM Andres Freund <andres@anarazel.de> wrote:
> I'm not quite sure what you're referring to with "convincing me"? Whether to go
> for something comprehensive or more minimal? I'm kind of agnostic on what the
> right approach is.

I thought that you were clearly leaning in the direction of a
minimal-only fix for backpatch. I just meant that there is no point in
saying much more about it. It's not like I have a problem with your
approach. My personal opinion is that we'd be somewhat better off if
we went with my more comprehensive approach, even on 14. That's not
how you see it - which is fair enough (reasoning about these risks is
hard, and it tends to be quite subjective).

And so, as I said before, I can leave it at that.

> > Why don't you produce a minimal fix for backpatch? I'll review that,
> > just as you reviewed my patch.
>
> Here's a draft of that change.

Looks reasonable to me.

> I think it's worth cleaning up the regression test I wrote and using it
> regardless of which version of the fix we end up choosing. But I'm a bit
> on the fence - it's quite complicated.

+1 to the idea of keeping it somewhere. Without necessarily running it
on the BF.

> OTOH, I think with the framework in place it'd not be too hard to write a few
> more tests for odd pruning scenarios...

Can we commit a test like this without running it by default, at all?
It's not like it has never been done before.

> If we were to, it'd probably be best to comment out the mon_* stuff, as that
> is what requires the amcheck / pageinspect extensions, which we'd likely not
> want to depend on. But it's pretty crucial for development.

Good idea.

The nice thing about using amcheck for something like this (compared
to pageinspect) is that we don't have to specify exact details about
what it looks like when the test passes. Because passing the test just
means the absence of any error.

How hard would it be to just add amcheck coverage on HEAD? Something
that goes just a bit further with LP_REDIRECT validation (currently
verify_heapam does practically nothing like that). We should add
comprehensive HOT chain validation too, but that can wait a little
while longer. It's a lot more complicated.
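
For example, the extra LP_REDIRECT validation could be something along these
lines (a sketch of the concept only, not verify_heapam's actual code; the
function name is made up):

    #include "postgres.h"
    #include "access/htup_details.h"
    #include "storage/bufpage.h"

    /*
     * Sketch: every LP_REDIRECT item must point at a used, normal line
     * pointer whose tuple is marked heap-only.  Returns false otherwise.
     */
    static bool
    redirect_is_sane(Page page, ItemId lp)
    {
        OffsetNumber    targoff = ItemIdGetRedirect(lp);
        ItemId          targlp;
        HeapTupleHeader targhdr;

        if (targoff < FirstOffsetNumber ||
            targoff > PageGetMaxOffsetNumber(page))
            return false;       /* redirect points outside the item array */

        targlp = PageGetItemId(page, targoff);
        if (!ItemIdIsNormal(targlp))
            return false;       /* redirect to an unused/dead/redirect item */

        targhdr = (HeapTupleHeader) PageGetItem(page, targlp);
        if (!HeapTupleHeaderIsHeapOnly(targhdr))
            return false;       /* target must be a heap-only tuple */

        return true;
    }

Checking the xmin/xmax relationships along the rest of the HOT chain on top
of that is the harder part mentioned above.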

Another thing is that the new assertions themselves are part of our
overall testing strategy. I see that you've gone further with that
than I did in your v3-0003-* patch. I should adopt that code to my
more comprehensive fix for HEAD, while still keeping around the
existing heap_prune_chain() assertions (perhaps just as
documentation/comments to make the code easier to follow).

-- 
Peter Geoghegan



Hi,

On 2021-11-22 12:41:32 -0800, Peter Geoghegan wrote:
> On Mon, Nov 22, 2021 at 9:59 AM Andres Freund <andres@anarazel.de> wrote:
> > I'm not quite sure what you're referring to with "convincing me"? Whether to go
> > for something comprehensive or more minimal? I'm kind of agnostic on what the
> > right approach is.
> 
> I thought that you were clearly leaning in the direction of a
> minimal-only fix for backpatch. I just meant that there is no point in
> saying much more about it. It's not like I have a problem with your
> approach. My personal opinion is that we'd be somewhat better off if
> we went with my more comprehensive approach, even on 14. That's not
> how you see it - which is fair enough (reasoning about these risks is
> hard, and it tends to be quite subjective).

I think I'm actually just not sure what the better approach for the
backbranches is. I'm worried about the size of your patch, I'm worried about
the craziness of the current architecture (if you can call it that) of
heap_page_prune() with my patch.

I think either way your approach should incorporate the dedicated loop to
check the visibility - for performance reasons alone.


> > I think it's worth cleaning up the regression test I wrote and using it
> > regardless of which version of the fix we end up choosing. But I'm a bit
> > on the fence - it's quite complicated.
> 
> +1 to the idea of keeping it somewhere. Without necessarily running it
> on the BF.
> 
> > OTOH, I think with the framework in place it'd not be too hard to write a few
> > more tests for odd pruning scenarios...
> 
> Can we commit a test like this without running it by default, at all?
> It's not like it has never been done before.

Is there a reason not to run it if we commit it? IME tests not run by default
tend to bitrot very quickly.  The only issue that I can think of is that it
might be hard to make the test useful and robust in the presence of
debug_discard_caches != 0.


> The nice thing about using amcheck for something like this (compared
> to pageinspect) is that we don't have to specify exact details about
> what it looks like when the test passes. Because passing the test just
> means the absence of any error.

Unless amcheck gets more in-depth there's just too much it doesn't show :(


> How hard would it be to just add amcheck coverage on HEAD? Something
> that goes just a bit further with LP_REDIRECT validation (currently
> verify_heapam does practically nothing like that). We should add
> comprehensive HOT chain validation too, but that can wait a little
> while longer. It's a lot more complicated.

What exactly do you mean with "amcheck coverage" in this context?


> Another thing is that the new assertions themselves are part of our
> overall testing strategy. I see that you've gone further with that
> than I did in your v3-0003-* patch. I should adopt that code to my
> more comprehensive fix for HEAD, while still keeping around the
> existing heap_prune_chain() assertions (perhaps just as
> documentation/comments to make the code easier to follow).

I found that style of check helpful because it notices problems sooner. What
took the longest when debugging this issue was that simple assertions (e.g. during
heap_page_prune_execute() or heap_prune_record_unused()) would trigger well
after the actual bug. E.g. marking an item unused that is referenced by a
pre-existing LP_REDIRECT item can't easily be detected at that time. It'd
trigger an assertion at the time the LP_REDIRECT item is marked dead, but that
doesn't help to pinpoint the origin of the problem much.

I'm inclined to think that the costs of the check are worth incurring by
default for cassert builds.
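
To illustrate the idea (this is only a sketch, and considerably less thorough
than the committed assertions; it assumes a PruneState pointer with the
marked[] array discussed in this thread plus the usual nowunused[]/nunused
bookkeeping): once the pruning decisions have been collected, but before they
are executed, verify that no surviving redirect points at a slot that is about
to become LP_UNUSED, so the bad decision is caught where it is made:

    #ifdef USE_ASSERT_CHECKING
        for (OffsetNumber off = FirstOffsetNumber;
             off <= PageGetMaxOffsetNumber(page);
             off = OffsetNumberNext(off))
        {
            ItemId      lp = PageGetItemId(page, off);

            /* only redirects that pruning leaves untouched are interesting */
            if (!ItemIdIsRedirected(lp) || prstate->marked[off])
                continue;

            /* target must not be among the items we plan to mark unused */
            for (int i = 0; i < prstate->nunused; i++)
                Assert(prstate->nowunused[i] != ItemIdGetRedirect(lp));
        }
    #endif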


I wish I knew of a good way to have pgbench check the validity of its
results. There's a huge difference between not crashing and returning correct
results. And there's just a lot of bugs that will result in valid-enough
on-disk/in-buffer data, but which are not correct.

Greetings,

Andres Freund



Hi,

On 2021-11-22 14:34:36 -0800, Andres Freund wrote:
> I think I'm actually just not sure what the better approach for the
> backbranches is. I'm worried about the size of your patch, I'm worried about
> the craziness of the current architecture (if you can call it that) of
> heap_page_prune() with my patch.

Given the lack of further discussion, let's go with the "minimal" fix and your
stuff later. Attached is further polished patch - no significant changes,
plenty copy-editing stuff.

0001 is the bugfix

0002 adds the additional assertions - I'm wondering about only committing that
  in HEAD?

  I think your patch should easily be able to use prstate->htsv?

0003 removes the redundant RelationGetNumberOfBlocks() calls. As 0001 does not
  change the total number of RelationGetNumberOfBlocks() calls I'm inclined to
  just apply this to HEAD?

0004 is a patch that tries to address your point about the GlobalVisTestFor()
  placement, sort of

  My earlier point that moving it to earlier would be a bad idea because it'd
  make the horizon "older" was bogus - GlobalVisTestFor() doesn't itself do
  any horizon determination. However moving the GlobalVisTestFor() earlier
  still seems wrong - imo the vacuum_set_xid_limits() call should be moved to
  later. The attached patch moves both, wrapped in a new function, to just
  before the scan in lazy_scan_heap().

  This clearly is HEAD only material - I'm only bringing it up here because
  the issue of the GlobalVisTestFor() placement was raised in here...


> > > I think it's worth cleaning up the regression test I wrote and using it
> > > regardless of which version of the fix we end up choosing. But I'm a bit
> > > on the fence - it's quite complicated.
> >
> > +1 to the idea of keeping it somewhere. Without necessarily running it
> > on the BF.
> >
> > > OTOH, I think with the framework in place it'd not be too hard to write a few
> > > more tests for odd pruning scenarios...
> >
> > Can we commit a test like this without running it by default, at all?
> > It's not like it has never been done before.
>
> Is there a reason not to run it if we commit it? IME tests not run by default
> tend to bitrot very quickly.  The only issue that I can think of is that it
> might be hard to make the test useful and robust in the presence of
> debug_discard_caches != 0.

I tried pretty hard to get the test to be reliable enough to commit it. But in
the end I think using the index locking as a "control" mechanism just isn't
great - as evidenced by 0003 breaking the test entirely. I think unfortunately
there's not much hope for testing this unless we get wait points - which seems
not too close :(.

Greetings,

Andres Freund

Attachment
Hi,

On 2021-11-13 16:06:40 +0100, Dmitry Dolgov wrote:
> I got curious whether modifying Alexander's test case could reveal
> something interesting, and sprinkled it with savepoints and rollbacks.
> Almost immediately a new problem has manifested itself, although the
> crash has nothing to do with the disconnected tuples as far as I can
> tell -- still probably worth mentioning. In this case vacuum invoked
> lazy_scan_prune, and during the first scan one of the chains had a
> HEAPTUPLE_DEAD at the third position. The processing flow fell through
> to heap_prune_record_prunable and crashed on an assert with an
> InvalidTransactionId:
> 
>     #3  0x000055a2b260d1f9 in heap_prune_record_prunable (prstate=0x7ffd0c0ecdf0, xid=0) at pruneheap.c:872
>     #4  0x000055a2b260ca72 in heap_prune_chain (buffer=2117, rootoffnum=150, prstate=0x7ffd0c0ecdf0) at pruneheap.c:695
>     #5  0x000055a2b260bcd6 in heap_page_prune (relation=0x7fb98e217e20, buffer=2117, vistest=0x55a2b31d2d60 <GlobalVisCatalogRels>, old_snap_xmin=0, old_snap_ts=0, report_stats=false, off_loc=0x55a2b3e6a0cc) at pruneheap.c:288
>     #6  0x000055a2b261309c in lazy_scan_prune (vacrel=0x55a2b3e6a060, buf=2117, blkno=192, page=0x7fb97856bf80 "", vistest=0x55a2b31d2d60 <GlobalVisCatalogRels>, prunestate=0x7ffd0c0ee9d0) at vacuumlazy.c:1739
> 
> Applying heap_prune_record_prunable only if TransactionIdIsNormal seems
> to help. The original implementation didn't reach
> heap_prune_record_prunable either and also doesn't crash.

Does your modified test still find problems with 0001 & 0002 from
https://postgr.es/m/20211211045710.ljtuu4gfloh754rs%40alap3.anarazel.de
applied?

Greetings,

Andres Freund



On Fri, Dec 10, 2021 at 8:57 PM Andres Freund <andres@anarazel.de> wrote:
> Given the lack of further discussion, let's go with the "minimal" fix and your
> stuff later. Attached is further polished patch - no significant changes,
> plenty copy-editing stuff.

Makes sense.

> 0001 is the bugfix

LGTM.

> 0002 adds the additional assertions - I'm wondering about only committing that
>   in HEAD?

I don't think that it makes much difference, since these are only
assertions. I probably wouldn't bother with that myself, but I also
have zero concern about it breaking anything.

>   I think your patch should easily be able to use prstate->htsv?

Let's get this committed, to 14 and HEAD. And then leave it to me to
rebase, as additional HEAD-only hardening.

> 0003 removes the redundant RelationGetNumberOfBlocks() calls. As 0001 does not
>   change the total number of RelationGetNumberOfBlocks() calls I'm inclined to
>   just apply this to HEAD?

Yeah, that can be part of my follow-up hardening patch, which (to
repeat) will be HEAD-only.

> 0004 is a patch that tries to address your point about the GlobalVisTestFor()
>   placement, sort of
>
>   My earlier point that moving it to earlier would be a bad idea because it'd
>   make the horizon "older" was bogus - GlobalVisTestFor() doesn't itself do
>   any horizon determination. However moving the GlobalVisTestFor() earlier
>   still seems wrong - imo the vacuum_set_xid_limits() call should be moved to
>   later. The attached patch moves both, wrapped in a new function, to just
>   before the scan in lazy_scan_heap().

I think that this is the right direction, and I like the comments a
lot. But I don't like the idea of putting this initialization into
lazy_scan_heap(), for purely structural reasons. I plan to get rid of
the VacuumParams argument to lazy_scan_heap() completely, which would
make "VacuumParams setup" the sole responsibility of its caller,
heap_vacuum_rel(). The draft version of my "remove special cases to
advance relfrozenxid" patch (the one you've reviewed) already does
this. In general I really think it's important to make the state in
vacuumlazy.c as simple as possible.

I have no objection to delaying the lazy_scan_heap_limits() stuff
until right before lazy_scan_heap() is called. However, I do think
that we should always know certain basic "immutable" facts about a
VACUUM at the point that we call lazy_scan_heap(), which is not the
case with this patch.

Honestly, I'm surprised that you see much value in delaying the
lazy_scan_heap_limits() stuff until the very last microsecond. How
many microseconds could we possibly delay it by?

> I tried pretty hard to get the test to be reliable enough to commit it. But in
> the end I think using the index locking as a "control" mechanism just isn't
> great - as evidenced by 0003 breaking the test entirely. I think unfortunately
> there's not much hope for testing this unless we get wait points - which seems
> not too close :(.

Hopefully somebody will seriously commit to that at some point. It's
important, but not very glamorous.

-- 
Peter Geoghegan



On Sat, Dec 11, 2021 at 2:03 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I have no objection to delaying the lazy_scan_heap_limits() stuff
> until right before lazy_scan_heap() is called. However, I do think
> that we should always know certain basic "immutable" facts about a
> VACUUM at the point that we call lazy_scan_heap(), which is not the
> case with this patch.
>
> Honestly, I'm surprised that you see much value in delaying the
> lazy_scan_heap_limits() stuff until the very last microsecond. How
> many microseconds could we possibly delay it by?

Have you thought about the implications for the ongoing work to set
pg_class.relfrozenxid to the oldest observed XID in the table, instead
of just using FreezeLimit naively (which I've prototyped but haven't
posted)?

What if the target heap relation gets extended after we've established
nblocks/rel_pages for the vacuumlazy.c operation (by calling
RelationGetNumberOfBlocks()), but before we get to the
lazy_scan_heap_limits() stuff for the same operation? What if there
are a small number of heap pages at the end of the relation that we
won't get to at all in the ongoing VACUUM? They could have heap tuples
whose header XIDs are from just before our OldestXmin cutoff. I
believe it follows that we cannot miss them (at least not in an
aggressive VACUUM, maybe not ever with my patch).

-- 
Peter Geoghegan



> On Fri, Dec 10, 2021 at 08:58:26PM -0800, Andres Freund wrote:
> Hi,
>
> On 2021-11-13 16:06:40 +0100, Dmitry Dolgov wrote:
> > I got curious whether modifying Alexander's test case could reveal
> > something interesting, and sprinkled it with savepoints and rollbacks.
> > Almost immediately a new problem has manifested itself, although the
> > crash has nothing to do with the disconnected tuples as far as I can
> > tell -- still probably worth mentioning. In this case vacuum invoked
> > lazy_scan_prune, and during the first scan one of the chains had a
> > HEAPTUPLE_DEAD at the third position. The processing flow fell through
> > to heap_prune_record_prunable and crashed on an assert with an
> > InvalidTransactionId:
> >
> >     #3  0x000055a2b260d1f9 in heap_prune_record_prunable (prstate=0x7ffd0c0ecdf0, xid=0) at pruneheap.c:872
> >     #4  0x000055a2b260ca72 in heap_prune_chain (buffer=2117, rootoffnum=150, prstate=0x7ffd0c0ecdf0) at pruneheap.c:695
> >     #5  0x000055a2b260bcd6 in heap_page_prune (relation=0x7fb98e217e20, buffer=2117, vistest=0x55a2b31d2d60 <GlobalVisCatalogRels>, old_snap_xmin=0, old_snap_ts=0, report_stats=false, off_loc=0x55a2b3e6a0cc) at pruneheap.c:288
> >     #6  0x000055a2b261309c in lazy_scan_prune (vacrel=0x55a2b3e6a060, buf=2117, blkno=192, page=0x7fb97856bf80 "", vistest=0x55a2b31d2d60 <GlobalVisCatalogRels>, prunestate=0x7ffd0c0ee9d0) at vacuumlazy.c:1739
> >
> > Applying heap_prune_record_prunable only if TransactionIdIsNormal seems
> > to help. The original implementation didn't reach
> > heap_prune_record_prunable either and also doesn't crash.
>
> Does your modified test still find problems with 0001 & 0002 from
> https://postgr.es/m/20211211045710.ljtuu4gfloh754rs%40alap3.anarazel.de
> applied?

Nope, everything seems to be working smoothly.



On 2021-12-13 13:21:54 +0100, Dmitry Dolgov wrote:
> > On Fri, Dec 10, 2021 at 08:58:26PM -0800, Andres Freund wrote:
> > Does your modified test still find problems with 0001 & 0002 from
> > https://postgr.es/m/20211211045710.ljtuu4gfloh754rs%40alap3.anarazel.de
> > applied?
> 
> Nope, everything seems to be working smoothly.

Thanks!



On Fri, Dec 10, 2021 at 8:57 PM Andres Freund <andres@anarazel.de> wrote:
> Given the lack of further discussion, let's go with the "minimal" fix and your
> stuff later. Attached is further polished patch - no significant changes,
> plenty copy-editing stuff.

We have roughly 3 weeks to get this into 14.2. Can you push your
minimal fix soon? Any blockers?

Thanks
-- 
Peter Geoghegan



Hi,

On 2022-01-10 11:31:58 -0800, Peter Geoghegan wrote:
> On Fri, Dec 10, 2021 at 8:57 PM Andres Freund <andres@anarazel.de> wrote:
> > Given the lack of further discussion, let's go with the "minimal" fix and your
> > stuff later. Attached is further polished patch - no significant changes,
> > plenty copy-editing stuff.
> 
> We have roughly 3 weeks to get this into 14.2. Can you push your
> minimal fix soon?

Working on it now. I've a meeting or two in a bit, it'll probably be after...


> Any blockers?

I'm just struggling with / procrastinating on the commit message, tbh. The
whole issue is kinda complicated to explain... :/

Greetings,

Andres Freund



On Wed, Jan 12, 2022 at 11:25 AM Andres Freund <andres@anarazel.de> wrote:
> > Any blockers?
>
> I'm just struggling with / procrastinating on the commit message, tbh. The
> whole issue is kinda complicated to explain... :/

I think that it would make sense for the commit message to frame the
problem as: pruneheap.c doesn't take sufficient care when traversing
HOT chains to determine their full extent, for the purposes of
pruning. There was a general lack of robustness, and the snapshot
scalability work happened to run into that, resulting in hot chain
corruption under very specific conditions.

If I was in your position I think I would resist framing the problem
in this way; I'd probably be concerned that it would come off as
shifting the blame elsewhere. This high level explanation of things
makes the most sense to me, though. Surely that's the most important
thing.


--
Peter Geoghegan



On 2022-01-12 13:05:45 -0800, Peter Geoghegan wrote:
> On Wed, Jan 12, 2022 at 11:25 AM Andres Freund <andres@anarazel.de> wrote:
> > > Any blockers?
> >
> > I'm just struggling with / procrastinating on the commit message, tbh. The
> > whole issue is kinda complicated to explain... :/

After struggling some more, I *finally* pushed the fix and the new assertions.


Thanks for the bug report, investigation, review, etc!


> I think that it would make sense for the commit message to frame the
> problem as: pruneheap.c doesn't take sufficient care when traversing
> HOT chains to determine their full extent, for the purposes of
> pruning. There was a general lack of robustness, and the snapshot
> scalability work happened to run into that, resulting in hot chain
> corruption under very specific conditions.
>
> If I was in your position I think I would resist framing the problem
> in this way; I'd probably be concerned that it would come off as
> shifting the blame elsewhere. This high level explanation of things
> makes the most sense to me, though. Surely that's the most important
> thing.

Thanks!



14.01.2022 05:15, Andres Freund wrote:
> On 2022-01-12 13:05:45 -0800, Peter Geoghegan wrote:
>> On Wed, Jan 12, 2022 at 11:25 AM Andres Freund <andres@anarazel.de> wrote:
>>>> Any blockers?
>>> I'm just struggling with / procrastinating on the commit message, tbh. The
>>> whole issue is kinda complicated to explain... :/
> After struggling some more, I *finally* pushed the fix and the new assertions.
>
>
> Thanks for the bug report, investigation, review, etc!
Thank you, Andres and Peter, for the fix!
It was a tough bug to crack for me.

P. S. I'm not seeing the fix in REL_14_STABLE yet, but as soon as it
is committed I'm going to continue the concurrent installcheck testing.

Best regards,
Alexander



On 2022-01-14 20:00:00 +0300, Alexander Lakhin wrote:
> P. S. I'm not seeing the fix in REL_14_STABLE yet, but as soon as it
> is committed I'm going to continue the concurrent installcheck testing.

Gah, I guess I was too tired last night. Pushed now. Thanks for noticing...



Andres Freund <andres@anarazel.de> writes:
>>> I'm just struggling with / procrastinating on the commit message, tbh. The
>>> whole issue is kinda complicated to explain... :/

> After struggling some more, I *finally* pushed the fix and the new assertions.

I'm writing release notes and wondering what I can tell users about
how to detect or recover from this bug.  Is a REINDEX sufficient,
or is the presence of the bogus redirect item going to cause
persistent problems?  If the latter, can we provide some amelioration?
"Your data is broken and there is nothing you can do about it"
is not nice to be writing in release notes.

            regards, tom lane



Hi,

On 2022-02-03 15:54:28 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> >>> I'm just struggling with / procrastinating on the commit message, tbh. The
> >>> whole issue is kinda complicated to explain... :/
>
> > After struggling some more, I *finally* pushed the fix and the new assertions.
>
> I'm writing release notes and wondering what I can tell users about
> how to detect or recover from this bug.  Is a REINDEX sufficient,
> or is the presence of the bogus redirect item going to cause
> persistent problems?

Good questions.

It's hard to answer whether there's any danger after a REINDEX. Afaics the
build scan would just pick the "lower offset" version of the root
pointer. Which should be fine.

It's possible there could be trouble down the line, e.g. heap pruning doing
something weird once starting in a corrupted state, that then leads REINDEX to
do something bogus. The simple cases look OK, because a second visit/action by
heap_prune_chain for one tid from two different root pointers would see
->marked[offnum] as true. It gets more complicated once multiple intermediary
row versions are involved, because the intermediary row versions won't be in
->marked if an entire chain is pruned. But afaict that should still end up
looking like a hot chain ending in an aborted tuple or such.



> If the latter, can we provide some amelioration?
> "Your data is broken and there is nothing you can do about it"
> is not nice to be writing in release notes.

We've had quite a few bugs around broken HOT chains (which this is an instance
of), and I don't think we've provided a useful recipe for any of them :(. We now
have heap amcheck, but it unfortunately doesn't detect most HOT corruption
(redirect pointing to unused is the only case I think).


One ameliorating factor is that, as far as I can tell, the window for this bug
is pretty narrow (see sequence in [1] for details). Without adding sleeps it's
hard to cause a snapshot computation to happen between vacuum_set_xid_limits()
and lazy_scan_heap(). Cache invalidation processing needs to trigger
GetSnapshotData() below vac_open_indexes() (which in turn requires a catalog
invalidation to be received, followed by use of the catalog snapshot), and the
computed RecentXmin needs to change.


Except that it's not trivial to get right, I could see it being worthwhile to
add verification of hot chains to amcheck, and backpatch that to 14.

Greetings,

Andres Freund


[1] https://www.postgresql.org/message-id/20211118071941.e4far2w5vufnahun%40alap3.anarazel.de



Andres Freund <andres@anarazel.de> writes:
> On 2022-02-03 15:54:28 -0500, Tom Lane wrote:
>> I'm writing release notes and wondering what I can tell users about
>> how to detect or recover from this bug.  Is a REINDEX sufficient,
>> or is the presence of the bogus redirect item going to cause
>> persistent problems?

> Good questions.

> It's hard to answer whether there's any danger after a REINDEX. Afaics the
> build scan would just pick the "lower offset" version of the root
> pointer. Which should be fine.

> It's possible there could be trouble down the line, e.g. heap pruning doing
> something weird once starting in a corrupted state, that then leads REINDEX to
> do something bogus. The simple cases look OK, because a second visit/action by
> heap_prune_chain for one tid from two different root pointers would see
> ->marked[offnum] as true. It gets more complicated once multiple intermediary
> row versions are involved, because the intermediary row versions won't be in
> ->marked if an entire chain is pruned. But afaict that should still end up
> looking like a hot chain ending in an aborted tuple or such.

OK, I'll just recommend REINDEX.

> Except that it's not trivial to get right, I could see it being worthwhile to
> add verification of hot chains to amcheck, and backpatch that to 14.

I'd have thought that'd be a fundamental component of a heap check
module, so +1 for adding it.  Dunno about the back-patch part though.
It seems like a new feature.

            regards, tom lane



On Fri, Dec 10, 2021 at 8:57 PM Andres Freund <andres@anarazel.de> wrote:
> 0004 is a patch that tries to address your point about the GlobalVisTestFor()
>   placement, sort of

Following up on this now (will follow up on the "harden pruning" patch
separately).

>   My earlier point that moving it to earlier would be a bad idea because it'd
>   make the horizon "older" was bogus - GlobalVisTestFor() doesn't itself do
>   any horizon determination. However moving the GlobalVisTestFor() earlier
>   still seems wrong - imo the vacuum_set_xid_limits() call should be moved to
>   later. The attached patch moves both, wrapped in a new function, to just
>   before the scan in lazy_scan_heap().

Attached is (roughly speaking) a rebased version of this patch (i.e.
your v4-0004- patch from the patch series). Like your patch, this
patch documents our expectations for vistest and OldestXmin. Unlike
your patch, it makes all vacrel initialization take place before
lazy_scan_heap is called, from inside heap_vacuum_rel.

An important detail here is how OldestXmin (and vistest) are related
to rel_pages -- I also added something about that. It definitely
wouldn't be okay to (say) move the call to RelationGetNumberOfBlocks
to any place before the call to vacuum_set_xid_limits. And so it very
much seems in scope here.

I'm a bit confused about at least one detail here: you've said here
that "GlobalVisTestFor() doesn't itself do any horizon determination",
which seems to contradict comments from the v4-0004- patch that was
attached to the same email. This WIP patch claims that there is some
advantage to delaying the call to GlobalVisTestFor regardless, though
I think that there is a good chance that that detail is wrong. Please
let me know which it is.

I also updated nearby comments about the update to the target heap
rel's pg_class entry -- that work was done in passing.

Thanks
-- 
Peter Geoghegan

Attachment
On Thu, Mar 3, 2022 at 4:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Following up on this now (will follow up on the "harden pruning" patch
> separately).

Attached is a new revision of my fix. This is more or less a
combination of my v4 fix from November 12 [1] and Andres'
already-committed fix (commit 18b87b20), rebased on top of HEAD. This
patch was originally a bugfix, but now it's technically just
refactoring/hardening of the logic in pruneheap.c. It hasn't changed
all that much, though.

We're still doing an up-front scan of the heap page to precalculate
HTSV for each tuple (no change from commit 18b87b20), but we no longer
do that on correctness grounds. It is purely a performance thing now.
Note that I am making a working assumption that that still makes sense
for now (I think that it probably does, but I haven't yet verified it
for myself).

We now do "3 passes" over the page. The first is the aforementioned
"precalculate HTSV" pass, the second is for determining the extent of
HOT chains, and the third is for any remaining disconnected/orphaned
HOT chains. I suppose that that's okay, since the amount of work has
hardly increased in proportion to this "extra pass over the page". Not
100% sure about everything myself right now, though. I guess that "3
passes" might be considered excessive.

[1] https://postgr.es/m/CAH2-WzmNk6V6tqzuuabxoxM8HJRaWU6h12toaS-bqYcLiht16A@mail.gmail.com
-- 
Peter Geoghegan

Attachment
Hi,

On 2022-03-03 16:22:11 -0800, Peter Geoghegan wrote:
> Attached is (roughly speaking) a rebased version of this patch (i.e.
> your v4-0004- patch from the patch series). Like your patch, this
> patch documents our expectations for vistest and OldestXmin. Unlike
> your patch, it makes all vacrel initialization take place before
> lazy_scan_heap is called, from inside heap_vacuum_rel.

I think it'd be nicer if we did the horizon determination after allocating
space for dead tuples... But it's still better than today, so...


> An important detail here is how OldestXmin (and vistest) are related
> to rel_pages -- I also added something about that. It definitely
> wouldn't be okay to (say) move the call to RelationGetNumberOfBlocks
> to any place before the call to vacuum_set_xid_limits. And so it very
> much seems in scope here.

You're right, it needs to be that way round.


> I'm a bit confused about at least one detail here: you've said here
> that "GlobalVisTestFor() doesn't itself do any horizon determination",
> which seems to contradict comments from the v4-0004- patch that was
> attached to the same email.

Which comment in the patch is the quoted statement contradicting? The on-point
comment seems to be

+        * [...] vacrel->vistest is always at least as aggressive as the
+        * limits vacuum_set_xid_limits() computes because ComputeXidHorizons()
+        * (via vacuum_set_xid_limits() ->GetOldestNonRemovableTransactionId())
+        * ensures the approximate horizons are always at least as aggressive as
+        * the precise horizons.



> + * Also, we delay establishing vistest
> + * as a minor optimization.  A later cutoff can enable more eager pruning.
> + */

I don't think the minor optimization does anything (which I had stated wrongly
at some point in this thread).

>      /*
>       * Call lazy_scan_heap to perform all required heap pruning, index
> -     * vacuuming, and heap vacuuming (plus related processing)
> +     * vacuuming, and heap vacuuming (plus related processing).
> +     *
> +     * Note: Resource management for parallel VACUUM is quite subtle, and so
> +     * lazy_scan_heap must allocate (and deallocate) vacrel->dead_items space
> +     * for itself.  Every other vacrel field is initialized by now, though.
>       */
>      lazy_scan_heap(vacrel, params->nworkers);

What are you referring to with "quite subtle"?


Greetings,

Andres Freund



Hi,

On 2022-03-03 19:31:32 -0800, Peter Geoghegan wrote:
> Attached is a new revision of my fix. This is more or less a
> combination of my v4 fix from November 12 [1] and Andres'
> already-committed fix (commit 18b87b20), rebased on top of HEAD. This
> patch was originally a bugfix, but now it's technically just
> refactoring/hardening of the logic in pruneheap.c. It hasn't changed
> all that much, though.

Perhaps worth discussing outside of -bugs?


> We're still doing an up-front scan of the heap page to precalculate
> HTSV for each tuple (no change from commit 18b87b20), but we no longer
> do that on correctness grounds. It is purely a performance thing now.
> Note that I am making a working assumption that that still makes sense
> for now (I think that it probably does, but I haven't yet verified it
> for myself).

I'd be surprised if not.


> We now do "3 passes" over the page. The first is the aforementioned
> "precalculate HTSV" pass, the second is for determining the extent of
> HOT chains, and the third is for any remaining disconnected/orphaned
> HOT chains. I suppose that that's okay, since the amount of work has
> hardly increased in proportion to this "extra pass over the page". Not
> 100% sure about everything myself right now, though. I guess that "3
> passes" might be considered excessive.

We should be able to optimize away the third pass in the majority of cases, by
keeping track of the number of tuples visited or such. That seems like it
might be worth doing?


> +    /*
> +     * Start from the root item.  Mark it as valid up front, since root items
> +     * are always processed here (not as disconnected tuples in third pass
> +     * over page).
> +     */
> +    prstate->visited[rootoffnum] = true;
>      offnum = rootoffnum;
> +    nchain = 0;

I wonder if it'd be clearer if we dealt with redirects outside of the
loop. Would make it easier to assert that the target of a redirect may not be
unused / !heap-only?


> +
> +                /*
> +                 * Remember the offnum of the last DEAD tuple in this HOT
> +                 * chain.  To keep things simple, don't treat heap-only tuples
> +                 * from a HOT chain as DEAD unless they're only preceded by
> +                 * other DEAD tuples (in addition to actually being DEAD).

s/actually/themselves/?


> +                 * Remaining tuples that appear DEAD (but don't get treated as
> +                 * such by us) are from concurrently aborting updaters.

I don't understand this bit. A DEAD tuple following a RECENTLY_DEAD one won't
be removed now, and doesn't need to involve a concurrent abort? Are you
thinking of "remaining" as the tuples not referenced in the previous sentences?


> +                 * VACUUM will ask us to prune the heap page again when it
> +                 * sees that there is a DEAD tuple left behind, but that would
> +                 * be necessary regardless of our approach here.
> +                 */

Only as long as we do another set of HTSV calls...


>              case HEAPTUPLE_LIVE:
>              case HEAPTUPLE_INSERT_IN_PROGRESS:
> +                pastlatestdead = true;    /* no further DEAD tuples in CHAIN */

If we don't do anything to the following tuples, why don't we just abort here?
I assume it is because we'd then treat them as disconnected? That should
probably be called out...



> -    }
> -    else if (nchain < 2 && ItemIdIsRedirected(rootlp))
> -    {
> -        /*
> -         * We found a redirect item that doesn't point to a valid follow-on
> -         * item.  This can happen if the loop in heap_page_prune caused us to
> -         * visit the dead successor of a redirect item before visiting the
> -         * redirect item.  We can clean up by setting the redirect item to
> -         * DEAD state.
> -         */
> -        heap_prune_record_dead(prstate, rootoffnum);
> +
> +        return ndeleted;
>      }

Could there be such tuples from before a pg_upgrade? Do we need to deal with
them somehow?


Greetings,

Andres Freund



Attached is the v2 revision of the less critical vacuumlazy.c vistest
patch. I will respond to your second email about the pruneheap.c patch
on a new dedicated thread, per your suggestion.

NB: I intend to commit this revision of the patch (or something pretty
close to it) in the next few days, barring any objections.

On Wed, Mar 9, 2022 at 4:25 PM Andres Freund <andres@anarazel.de> wrote:
> I think it'd be nicer if we did the horizon determination after allocating
> space for dead tuples... But it's still better than today, so...

Why would it be nicer? Do you mean that it would be neater that way,
since every single field in vacrel gets initialized before the
lazy_scan_prune call, without exception?

I agree with that, but it doesn't seem important right now.

> > An important detail here is how OldestXmin (and vistest) are related
> > to rel_pages -- I also added something about that. It definitely
> > wouldn't be okay to (say) move the call to RelationGetNumberOfBlocks
> > to any place before the call to vacuum_set_xid_limits. And so it very
> > much seems in scope here.
>
> You're right, it needs to be that way round.

I think that this is well worth documenting. But it's also nice that
*almost* all initialization takes place inside heap_vacuum_rel() now,
before we call lazy_scan_heap().

> I don't think the minor optimization does anything (which I had stated wrongly
> at some point in this thread).

Got it. Comments that said that there was value in that are removed
from v2. But v2 does at least speculate about teaching lazy_scan_prune
to advance vistest as an optimization, from time to time -- which, as
you pointed out, is likely to be added in the future.

Talking about the optimization as future work makes sense, though not
because the future work is necessarily all that important. It just
seemed to me to be the easiest way of getting the following point
across to the reader: "vistest and OldestXmin are two distinct
representations of the same cutoff for now -- though don't make the
mistake of believing that they have to be the same cutoff".

> >       /*
> >        * Call lazy_scan_heap to perform all required heap pruning, index
> > -      * vacuuming, and heap vacuuming (plus related processing)
> > +      * vacuuming, and heap vacuuming (plus related processing).
> > +      *
> > +      * Note: Resource management for parallel VACUUM is quite subtle, and so
> > +      * lazy_scan_heap must allocate (and deallocate) vacrel->dead_items space
> > +      * for itself.  Every other vacrel field is initialized by now, though.
> >        */
> >       lazy_scan_heap(vacrel, params->nworkers);
>
> What are you referring to with "quite subtle"?

There are 2 subtleties that I can think of offhand:

1. We deliberately do the initial lazy_check_wraparound_failsafe()
precheck before reaching dead_items_alloc(). Per the comments at the
start of lazy_scan_heap(), we won't use parallel VACUUM if the
failsafe kicks in during the precheck.

2. There is also deallocation to consider here: lazy_scan_heap() also
calls dead_items_cleanup(), which might have to release shared memory
-- something that is actually handled by ending parallel mode.
dead_items_cleanup() must be called before update_index_statistics(),
which comes last of all -- since we can't update pg_class when in
parallel mode.

In v2 the same comment block now simply says that we have initialized
everything from vacrel except for vacrel->dead_items. It doesn't
explain why.
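
For reference, the ordering I have in mind, as a simplified sketch (it
leans on vacuumlazy.c's own static functions and types, so it's
illustrative rather than compilable on its own):

/* Simplified sketch of lazy_scan_heap()'s resource-management ordering */
static void
lazy_scan_heap_sketch(LVRelState *vacrel, int nworkers)
{
    /*
     * 1. The failsafe precheck comes first.  If the failsafe triggers
     *    here, parallel VACUUM is never even started.
     */
    lazy_check_wraparound_failsafe(vacrel);

    /*
     * 2. Allocate vacrel->dead_items.  For a parallel VACUUM this is also
     *    where parallel mode begins and shared memory is set up.
     */
    dead_items_alloc(vacrel, nworkers);

    /* ... main loop: pruning, index vacuuming, heap vacuuming ... */

    /*
     * 3. Release dead_items.  For a parallel VACUUM the shared memory is
     *    released by ending parallel mode.
     */
    dead_items_cleanup(vacrel);

    /*
     * 4. Only now can pg_class be updated for the indexes, since that
     *    isn't allowed while still in parallel mode.
     */
    update_index_statistics(vacrel);
}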

Thanks
--
Peter Geoghegan

Attachment
Hi,

On 2022-03-10 18:03:46 -0800, Peter Geoghegan wrote:
> Attached is the v2 revision of the less critical vacuumlazy.c vistest
> patch. I will respond to your second email about the pruneheap.c patch
> on a new dedicated thread, per your suggestion.
> 
> NB: I intend to commit this revision of the patch (or something pretty
> close to it) in the next few days, barring any objections.

WFM.


> On Wed, Mar 9, 2022 at 4:25 PM Andres Freund <andres@anarazel.de> wrote:
> > I think it'd be nicer if we did the horizon determination after allocating
> > space for dead tuples... But it's still better than today, so...
> 
> Why would it be nicer?

Large allocations can take a bit. Especially since dead_items_alloc() sneakily
initializes parallelism (which is darn ugly). Determining the horizon after
doing expensive stuff gives you a slightly better horizon...



>      /*
> +     * Now actually update rel's pg_class entry.
> +     *
>       * Aggressive VACUUM must reliably advance relfrozenxid (and relminmxid).
>       * We are able to advance relfrozenxid in a non-aggressive VACUUM too,
>       * provided we didn't skip any all-visible (not all-frozen) pages using
> @@ -787,7 +832,7 @@ static void
>  lazy_scan_heap(LVRelState *vacrel, int nworkers)
>  {
>      VacDeadItems *dead_items;
> -    BlockNumber nblocks,
> +    BlockNumber rel_pages = vacrel->rel_pages,
>                  blkno,
>                  next_unskippable_block,
>                  next_failsafe_block,

> @@ -843,7 +865,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
>  
>      /* Report that we're scanning the heap, advertising total # of blocks */
>      initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
> -    initprog_val[1] = nblocks;
> +    initprog_val[1] = rel_pages;
>      initprog_val[2] = dead_items->max_items;
>      pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
>  
> @@ -867,9 +889,9 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
>       * Before entering the main loop, establish the invariant that
>       * next_unskippable_block is the next block number >= blkno that we can't
>       * skip based on the visibility map, either all-visible for a regular scan
> -     * or all-frozen for an aggressive scan.  We set it to nblocks if there's
> -     * no such block.  We also set up the skipping_blocks flag correctly at
> -     * this stage.
> +     * or all-frozen for an aggressive scan.  We set it to rel_pages when
> +     * there's no such block.  We also set up the skipping_blocks flag
> +     * correctly at this stage.
>       *
>       * Note: The value returned by visibilitymap_get_status could be slightly
>       * out-of-date, since we make this test before reading the corresponding
> @@ -887,7 +909,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
>      next_unskippable_block = 0;
>      if (vacrel->skipwithvm)
>      {
> -        while (next_unskippable_block < nblocks)
> +        while (next_unskippable_block < rel_pages)
>          {
>              uint8        vmstatus;
>  
> @@ -914,7 +936,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
>      else
>          skipping_blocks = false;
>  
> -    for (blkno = 0; blkno < nblocks; blkno++)
> +    for (blkno = 0; blkno < rel_pages; blkno++)
>      {
>          Buffer        buf;
>          Page        page;
> @@ -932,7 +954,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
>              next_unskippable_block++;
>              if (vacrel->skipwithvm)
>              {
> -                while (next_unskippable_block < nblocks)
> +                while (next_unskippable_block < rel_pages)
>                  {
>                      uint8        vmskipflags;
>  
> @@ -977,7 +999,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
>              /*
>               * The current page can be skipped if we've seen a long enough run
>               * of skippable blocks to justify skipping it -- provided it's not
> -             * the last page in the relation (according to rel_pages/nblocks).
> +             * the last page in the relation (according to rel_pages).
>               *
>               * We always scan the table's last page to determine whether it
>               * has tuples or not, even if it would otherwise be skipped. This
> @@ -985,7 +1007,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
>               * on the table to attempt a truncation that just fails
>               * immediately because there are tuples on the last page.
>               */
> -            if (skipping_blocks && blkno < nblocks - 1)
> +            if (skipping_blocks && blkno < rel_pages - 1)
>              {
>                  /*
>                   * Tricky, tricky.  If this is in aggressive vacuum, the page
> @@ -1153,7 +1175,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
>           * were pruned some time earlier.  Also considers freezing XIDs in the
>           * tuple headers of remaining items with storage.
>           */
> -        lazy_scan_prune(vacrel, buf, blkno, page, vistest, &prunestate);
> +        lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
>  
>          Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
>  
> @@ -1352,7 +1374,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
>      vacrel->blkno = InvalidBlockNumber;
>  
>      /* now we can compute the new value for pg_class.reltuples */
> -    vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, nblocks,
> +    vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
>                                                       vacrel->scanned_pages,
>                                                       vacrel->live_tuples);

The whole s/nblocks/rel_pages/ seems like it should be done separately.

Greetings,

Andres Freund



On Thu, Mar 10, 2022 at 7:13 PM Andres Freund <andres@anarazel.de> wrote:
> > NB: I intend to commit this revision of the patch (or something pretty
> > close to it) in the next few days, barring any objections.
>
> WFM.

Cool.

> > On Wed, Mar 9, 2022 at 4:25 PM Andres Freund <andres@anarazel.de> wrote:
> > > I think it'd be nicer if we did the horizon determination after allocating
> > > space for dead tuples... But it's still better than today, so...
> >
> > Why would it be nicer?
>
> Large allocations can take a bit. Especially since dead_items_alloc() sneakily
> initializes parallelism (which is darn ugly). Determining the horizon after
> doing expensive stuff gives you a slightly better horizon...

I'm confused. You recently said "I don't think the minor optimization
[delaying establishing vistest] does anything (which I had stated
wrongly at some point in this thread)". But you now seem to be saying
that delaying establishing vistest has at least some small value as an
optimization. At least in theory.

> The whole s/nblocks/rel_pages/ seems like it should be done separately.

I'll break the mechanical s/nblocks/rel_pages/ out into a dedicated
commit, then.

-- 
Peter Geoghegan



Hi,

On 2022-03-10 19:40:21 -0800, Peter Geoghegan wrote:
> On Thu, Mar 10, 2022 at 7:13 PM Andres Freund <andres@anarazel.de> wrote:
> > > On Wed, Mar 9, 2022 at 4:25 PM Andres Freund <andres@anarazel.de> wrote:
> > > > I think it'd be nicer if we did the horizon determination after allocating
> > > > space for dead tuples... But it's still better than today, so...
> > >
> > > Why would it be nicer?
> >
> > Large allocations can take a bit. Especially since dead_items_alloc() sneakily
> > initializes parallelism (which is darn ugly). Determining the horizon after
> > doing expensive stuff gives you a slightly better horizon...
> 
> I'm confused. You recently said "I don't think the minor optimization
> [delaying establishing vistest] does anything (which I had stated
> wrongly at some point in this thread)". But you now seem to be saying
> that delaying establishing vistest has at least some small value as an
> optimization. At least in theory.

I'm not talking about just moving the vistest acquisition, but also
vacuum_set_xid_limits(). Which obviously *does* benefit from delaying as long
as possible.


> > The whole s/nblocks/rel_pages/ seems like it should be done separately.
> 
> I'll break the mechanical s/nblocks/rel_pages/ out into a dedicated
> commit, then.

Cool.

Greetings,

Andres Freund



On Thu, Mar 10, 2022 at 7:46 PM Andres Freund <andres@anarazel.de> wrote:
> I'm not talking about just moving the vistest acquisition, but also
> vacuum_set_xid_limits(). Which obviously *does* benefit from delaying as long
> as possible.

That sounds hard, or at least a lot of work given the benefits.

As the patch points out, we are required to establish rel_pages after
we have established OldestXmin. We *also* use rel_pages to determine
the size of the dead_items array -- we don't want to allocate space
that couldn't possibly be used (i.e. a dead_items array with room for
more than `MaxHeapTuplesPerPage * rel_pages` dead items in total).
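
As a hedged sketch of just that sizing rule (the helper name is mine;
the real code also weighs maintenance_work_mem and whether the table
has indexes at all):

#include "postgres.h"
#include "access/htup_details.h"    /* MaxHeapTuplesPerPage */
#include "storage/block.h"
#include "storage/itemptr.h"

/* Illustrative only: cap the dead_items array by what the heap can hold */
static int64
dead_items_cap(BlockNumber rel_pages, Size mem_budget)
{
    int64       max_items = mem_budget / sizeof(ItemPointerData);

    /* never allocate room for more TIDs than rel_pages pages can contain */
    if (max_items > (int64) MaxHeapTuplesPerPage * rel_pages)
        max_items = (int64) MaxHeapTuplesPerPage * rel_pages;

    return max_items;
}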

In short, it will be necessary to break that dependency somehow.
Which is possible, certainly, but still quite messy.

-- 
Peter Geoghegan



Hi,

On 2022-03-10 19:52:24 -0800, Peter Geoghegan wrote:
> As the patch points out, we are required to establish rel_pages after
> we have established OldestXmin.

Hm. Now that I think more about it, I'm not as convinced that's entirely true
anymore. We just need to make sure that we have an accurate "xid horizon" from
before determining the relation size, rather than an "xmin horizon", and then
use the Min() of the two horizons.  The "xid horizon" obviously is typically
well ahead of the "xmin horizon" (and always >=).

And even on the "xid horizon" I suspect something more aggressive could be
used.
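
Something of roughly this shape, purely to illustrate (the helper name
is invented, it's not from any patch):

#include "postgres.h"
#include "access/transam.h"

/*
 * The "xid horizon" is taken before RelationGetNumberOfBlocks(); whatever
 * xmin horizon gets computed afterwards is clamped to it, so only the
 * Min() of the two is trusted for decisions tied to rel_pages.
 */
static TransactionId
clamp_to_prior_xid_horizon(TransactionId xid_horizon, TransactionId OldestXmin)
{
    if (TransactionIdPrecedes(xid_horizon, OldestXmin))
        return xid_horizon;
    return OldestXmin;
}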

But that's probably best done separately - though perhaps it's worth
"weakening" the comment a bit?

Greetings,

Andres Freund



On Thu, Mar 10, 2022 at 8:08 PM Andres Freund <andres@anarazel.de> wrote:
> But that's probably best done separately - but perhaps worth to "weaken" the
> comment a bit?

I'm confused again.

I don't get why it's only going to be worth delaying establishing
vistest if we can also delay establishing OldestXmin in about the same
way. AFAICT the only compelling reason to have two separate cutoffs
from the point of view of vacuumlazy.c is that it enables periodically
refreshing vistest within lazy_scan_prune, to allow it to prune dead
tuples more eagerly (fewer recently dead tuples, more dead tuples
removed). That I can see, that much I get.

Periodically refreshing OldestXmin for freezing won't work, though. At
the very least it seems much less compelling than the vistest idea.
Yes, it probably is true that we could in principle further split
OldestXmin along "xmin vs xid horizon" lines (I should say "split it
again" -- you already split OldestXmin into "OldestXmin for freezing,
vistest for pruning" as part of your snapshot scalability work). But
would it really make any sense? If so, I can't see why. At least not
right now.

--
Peter Geoghegan



On Thu, Mar 10, 2022 at 7:13 PM Andres Freund <andres@anarazel.de> wrote:
> Large allocation can take a bit. Especially dead_item_alloc() sneakily
> initializes parallelism (which is darn ugly).

The attached v3 revision of the patch makes this a bit clearer.

The patch now moves the pg_class updates for indexes from the end of
lazy_scan_heap() to its heap_vacuum_rel() caller, at the point just
after lazy_scan_heap() returns. This seems logical to me because it
makes all pg_class updates take place inside heap_vacuum_rel(). It
also means that we pretty much start parallel mode right at the beginning
of lazy_scan_heap(), and end it right at the end -- which also seems
like a small improvement.

And so we no longer end parallel mode in order to be able to safely
update pg_class for indexes. Rather, we end parallel mode because
processing by lazy_scan_heap() as a whole completed.
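
In outline, the structure is now roughly this (a simplified sketch, not
the patch text):

/* Simplified sketch of the v3 structure described above */
static void
heap_vacuum_rel_sketch(LVRelState *vacrel, VacuumParams *params)
{
    /* ... OldestXmin, vistest, rel_pages and the other vacrel fields ... */

    /*
     * lazy_scan_heap() allocates vacrel->dead_items right at its start
     * (possibly entering parallel mode) and releases it right at its end
     * (leaving parallel mode).
     */
    lazy_scan_heap(vacrel, params->nworkers);

    /*
     * All pg_class updates now happen here, safely outside parallel mode:
     * first for the indexes (formerly done inside lazy_scan_heap), then
     * for the heap relation itself.
     */
    update_index_statistics(vacrel);
    /* ... vac_update_relstats() etc. for the heap relation ... */
}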

-- 
Peter Geoghegan

Attachment
Hi,

On 2022-03-10 20:55:37 -0800, Peter Geoghegan wrote:
> On Thu, Mar 10, 2022 at 8:08 PM Andres Freund <andres@anarazel.de> wrote:
> > But that's probably best done separately - but perhaps worth to "weaken" the
> > comment a bit?
> 
> I'm confused again.
> 
> I don't get why it's only going to be worth delaying establishing
> vistest if we can also delay establishing OldestXmin in about the same
> way.

GlobalVisTestFor() doesn't trawl through the procarray to compute horizons. It
just gets already computed horizons, potentially very approximate ones that
later will get updated when necessary to determine the visibility of a
concrete xid.  You could call GlobalVisTestFor() and then later call
vacuum_set_xid_limits() and the test by GlobalVisTestFor() would still be
accurate.


> AFAICT the only compelling reason to have two separate cutoffs from the
> point of view of vacuumlazy.c is that it enables periodically refreshing
> vistest within lazy_scan_prune, to allow it to prune dead tuples more
> eagerly (fewer recently dead tuples, more dead tuples removed). That I can
> see, that much I get.

The main reason to have vistest at all is that, for on-access pruning,
computing accurate horizons is excruciatingly expensive - the reason behind
procarray scaling badly was all the xmin accesses during normal snapshot
computations! And due to the way the pruning code is used for both vacuuming
and on-access pruning, that kind of bleeds into vacuumlazy.c.  That could have
been reduced, but it seemed beneficial to allow updating horizons as we go.


> Periodically refreshing OldestXmin for freezing won't work, though. At
> the very least it seems much less compelling than the vistest idea.

I think there's a fair bit of value in using an "as aggressive as possible"
initial OldestXmin. Which is used to set a bunch of cutoffs / determine
aggressiveness etc.

Once we compute "measured" relfrozenxid however, there's afaict otherwise not
much point in updating OldestXmin as we go, at least if we start to use
vistest for the horizon determinations inside vacuumlazy.c.


> Yes, it probably is true that we could in principle further split
> OldestXmin along "xmin vs xid horizon" lines (I should say "split it
> again" -- you already split OldestXmin into "OldestXmin for freezing,
> vistest for pruning" as part of your snapshot scalability work). But
> would it really make any sense? If so, I can't see why. At least not
> right now.

If we don't switch to using vistest for HTSV determinations in vacuumlazy.c,
then I don't think we can refresh vistest without causing problems in
lazy_scan_prune().

But I think refreshing horizons is a different discussion from using
as-recent-as-possible initial horizons...

Greetings,

Andres Freund



On Sat, Mar 12, 2022 at 1:49 PM Andres Freund <andres@anarazel.de> wrote:
> GlobalVisTestFor() doesn't trawl through the procarray to compute horizons. It
> just gets already computed horizons, potentially very approximate ones that
> later will get updated when necessary to determine the visibility of a
> concrete xid.

That much is easy to understand. It's the most significant way in
which the snapshot scalability work improves scalability.

> You could call GlobalVisTestFor() and then later call
> vacuum_set_xid_limits() and the test by GlobalVisTestFor() would still be
> accurate.

That seems kinda obvious, since all it does is get the appropriate
state for the relkind of the relation being vacuumed. Typically it
just returns a pointer to GlobalVisDataRels, which is a static
variable in procarray.c.

> The main reason to have vistest at all is that, for on-access pruning,
> computing accurate horizons is excruciatingly expensive - the reason behind
> procarray scaling badly was all the xmin accesses during normal snapshot
> computations! And due to the way the pruning code is used for both vacuuming
> and on-access pruning, that kind of bleeds into vacuumlazy.c.  That could have
> been reduced, but it seemed beneficial to allow updating horizons as we go.

That I get too.

> > Periodically refreshing OldestXmin for freezing won't work, though. At
> > the very least it seems much less compelling than the vistest idea.
>
> I think there's a fair bit of value in using an "as aggressive as possible"
> initial OldestXmin. Which is used to set a bunch of cutoffs / determine
> aggressiveness etc.

I just don't think that it makes very much difference, since we're
only talking about freezing here -- which is not something we tend to
be too eager with anyway (FreezeLimit is very seldom the same as
OldestXmin, etc).

Maybe it could take a long time for vac_open_indexes() to return, but
that's really an edge case. I'm puzzled why you're placing so much
emphasis on this. I can see why that's valuable in a theoretical
abstract kind of way -- but that's about it. And so it feels like I'm
still missing something.

> Once we compute "measured" relfrozenxid however, there's afaict otherwise not
> much point in updating OldestXmin as we go, at least if we start to use
> vistest for the horizon determinations inside vacuumlazy.c.

> If we don't switch to using vistest for HTSV determinations in vacuumlazy.c,
> then I don't think we can refresh vistest without causing problems in
> lazy_scan_prune().

Why? Right now we could easily have a concurrent opportunistic prune
that uses a vistest that's well ahead of the one in VACUUM. And so
AFAICT it's already quite possible that any page encountered within
lazy_scan_prune was already pruned using a much more recent
vistest. So what's the problem with doing the same thing inside the
backend running VACUUM instead?

Is it something to do with the VACUUM backend's state?

We can recompute backend horizons during VACUUM after OldestXmin is
established, so I can't see why the backend's procarray.c state should
matter. If it did that would seem like a big problem in general. In
particular, my commit 9dd963ae25 (an nbtree VACUUM enhancement)
independently calls GetOldestNonRemovableTransactionId() midway
through VACUUM.

> But I think refreshing horizons is a different discussion from using
> as-recent-as-possible initial horizons...

Agreed. This is why I find your emphasis on as-recent-as-possible
initial horizons confusing. It seems as if you're giving almost equal
emphasis to both issues, even though one issue (more eager pruning
during VACUUM) is of obvious practical value, while the other
(as-recent-as-possible initial horizons) is pretty theoretical and
academic -- at best.

--
Peter Geoghegan



Hi,

On 2022-03-14 12:05:14 -0700, Peter Geoghegan wrote:
> On Sat, Mar 12, 2022 at 1:49 PM Andres Freund <andres@anarazel.de> wrote:
> > > Periodically refreshing OldestXmin for freezing won't work, though. At
> > > the very least it seems much less compelling than the vistest idea.
> >
> > I think there's a fair bit of value in using an "as aggressive as possible"
> > initial OldestXmin. Which is used to set a bunch of cutoffs / determine
> > aggressiveness etc.
> 
> I just don't think that it makes very much difference, since we're
> only talking about freezing here -- which is not something we tend to
> be too eager with anyway (FreezeLimit is very seldom the same as
> OldestXmin, etc).

Well, it's not uncommon to VACUUM FREEZE after ETL etc. But more importantly,
the concrete OldestXmin value matters, because it's the difference between
removing or keeping dead rows.


> Maybe it could take a long time for vac_open_indexes() to return, but
> that's really an edge case. I'm puzzled why you're placing so much
> emphasis on this. I can see why that's valuable in a theoretical
> abstract kind of way -- but that's about it. And so it feels like I'm
> still missing something.

I'm not saying it's a *huge* improvement. But especially for smaller tables
it's not uncommon to see multiple autovacuums on a table in quick succession
just because of a slightly outdated horizon. Of course you're not going to see
that in a workload with a handful of heavily modified tables, but if there
are many tables it's a different story.


> > Once we compute "measured" relfrozenxid however, there's afaict otherwise not
> > much point in updating OldestXmin as we go, at least if we start to use
> > vistest for the horizon determinations inside vacuumlazy.c.
> 
> > If we don't switch to using vistest for HTSV determinations in vacuumlazy.c,
> > then I don't think we can refresh vistest without causing problems in
> > lazy_scan_prune().
> 
> Why? Right now we could easily have a concurrent opportunistic prune
> that uses a vistest that's well ahead of the one in VACUUM. And so
> AFAICT it's already quite possible that any page encountered within
> lazy_scan_prune was already pruned using a much more recent
> vistest. So what's the problem with doing the same thing inside the
> backend running VACUUM instead?

That was a brainfart on my end...


> > But I think refreshing horizons is a different discussion from using
> > as-recent-as-possible initial horizons...
> 
> Agreed. This is why I find your emphasis on as-recent-as-possible
> initial horizons confusing. It seems as if you're giving almost equal
> emphasis to both issues, even though one issue (more eager pruning
> during VACUUM) is of obvious practical value, while the other
> (as-recent-as-possible initial horizons) is pretty theoretical and
> academic -- at best.

I'm not intending to weigh them the same. I think using a more recent horizon
is more important than you describe here, but "computed horizons" will be a
considerably larger win.

Greetings,

Andres Freund



On Fri, Mar 18, 2022 at 7:53 AM Andres Freund <andres@anarazel.de> wrote:
> > Agreed. This is why I find your emphasis on as-recent-as-possible
> > initial horizons confusing. It seems as if you're giving almost equal
> > emphasis to both issues, even though one issue (more eager pruning
> > during VACUUM) is of obvious practical value, while the other
> > (as-recent-as-possible initial horizons) is pretty theoretical and
> > academic -- at best.
>
> I'm not intending to weigh them the same. I think using a more recent horizon
> is more important than you describe here, but "computed horizons" will be a
> considerably larger win.

Okay, got it.

Patch v10-0002-* from my relfrozenxid and freezing patch series took a
new approach to avoiding the aggressive VACUUM special case around the
visibility map, which seems like it might be relevant here (surprisingly
enough).
The latest version totally embraces making all visibility map access
inside lazy_scan_heap use local state describing a range of skippable
blocks. Not that different, really, but still something that hints at
an interesting future direction.

What if we built a palloc'd array of these ranges up front, right
after establishing OldestXmin? We could even build a complete picture
of the entire table before we'd even scanned the first block. I don't
think it would require too much space. We'd then be operating against
a "snapshot of the visibility map just after OldestXmin was
established", even in a multi-hour VACUUM operation. This is kind of a
move in the opposite direction, but it's not necessarily the wrong
direction; at least this way we're not going to dirty concurrently
modified pages from the largest tables, just to set hint bits or clean
up one or two concurrently aborted transactions.
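
To sketch the idea (purely illustrative; the struct and helper names are
made up, and the real patch has to care about aggressive vs.
non-aggressive skipping rules):

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

typedef struct SkippableRange
{
    BlockNumber start;          /* first skippable block in the run */
    BlockNumber nblocks;        /* length of the run */
    bool        all_frozen;     /* whole run all-frozen, not just all-visible */
} SkippableRange;

/*
 * Hypothetical helper: take one pass over the visibility map right after
 * OldestXmin is established, and remember the skippable runs in a
 * palloc'd array (sized for the worst case).
 */
static SkippableRange *
collect_skippable_ranges(Relation rel, BlockNumber rel_pages, int *nranges)
{
    SkippableRange *ranges = palloc(sizeof(SkippableRange) * (rel_pages + 1));
    Buffer      vmbuffer = InvalidBuffer;
    int         n = 0;

    for (BlockNumber blkno = 0; blkno < rel_pages; blkno++)
    {
        uint8       vmstatus = visibilitymap_get_status(rel, blkno, &vmbuffer);

        if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
            continue;           /* block must be scanned */

        if (n > 0 && ranges[n - 1].start + ranges[n - 1].nblocks == blkno)
        {
            /* extend the current run */
            ranges[n - 1].nblocks++;
            if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
                ranges[n - 1].all_frozen = false;
        }
        else
        {
            /* start a new run */
            ranges[n].start = blkno;
            ranges[n].nblocks = 1;
            ranges[n].all_frozen = (vmstatus & VISIBILITYMAP_ALL_FROZEN) != 0;
            n++;
        }
    }

    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);

    *nranges = n;
    return ranges;
}

lazy_scan_heap would then just walk this array instead of consulting the
VM as it goes, which is also what makes it such a natural prefetch
target.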

I also think that this would help with your aio work. Having a
palloc'd array like this would be an ideal target for prefetching. I
actually checked what you were doing here. Evidently the visibility
map stuff at the top of lazy_scan_heap was already a problem for aio:


https://github.com/anarazel/postgres/blob/29d3a3a3ddd47906d9128bc68c218664ab15a2c6/src/backend/access/heap/vacuumlazy.c#L927

I'm posting a more refined version of just that patch here, since it's
closer to the ideal for this kind of thing; it takes all of the
decisions away from lazy_scan_heap, which needs only to execute the
skipping by following instructions built by the new helper function.

--
Peter Geoghegan

Attachment