Thread: BUG #17821: Assertion failed in heap_update() due to heap pruning
The following bug has been logged on the website:

Bug reference: 17821
Logged by: Alexander Lakhin
Email address: exclusion@gmail.com
PostgreSQL version: 15.2
Operating system: Ubuntu 22.04
Description:

When performing multi-threaded tests, I've stumbled upon a failed assertion with the following stack trace:

Core was generated by `postgres: postgres regress003 127.0.0.1(35610) GRANT '.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140650166969280) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140650166969280) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140650166969280) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140650166969280, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007febad28e476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007febad2747f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x0000556725c44981 in ExceptionalCondition (conditionName=conditionName@entry=0x556725dc6b00 "ItemIdIsNormal(lp)", fileName=fileName@entry=0x556725dc48e0 "heapam.c", lineNumber=lineNumber@entry=3024) at assert.c:66
#6  0x00005567249f3aa7 in heap_update (relation=relation@entry=0x7feb61671cd8, otid=otid@entry=0x61900002c3fc, newtup=newtup@entry=0x61900002c3f8, cid=<optimized out>, crosscheck=crosscheck@entry=0x0, wait=wait@entry=true, tmfd=<optimized out>, lockmode=<optimized out>) at heapam.c:3024
#7  0x00005567249f8d46 in simple_heap_update (relation=relation@entry=0x7feb61671cd8, otid=otid@entry=0x61900002c3fc, tup=tup@entry=0x61900002c3f8) at heapam.c:3969
#8  0x0000556724c367ba in CatalogTupleUpdate (heapRel=heapRel@entry=0x7feb61671cd8, otid=0x61900002c3fc, tup=0x61900002c3f8) at indexing.c:309
#9  0x0000556724bfc30c in ExecGrant_Relation (istmt=istmt@entry=0x7ffdeb8ca1e0) at aclchk.c:2041
#10 0x0000556724bffee3 in ExecGrantStmt_oids (istmt=istmt@entry=0x7ffdeb8ca1e0) at aclchk.c:598
#11 0x0000556724c00ab0 in ExecuteGrantStmt (stmt=stmt@entry=0x625000005df0) at aclchk.c:583
#12 0x00005567257ccac7 in ProcessUtilitySlow (pstate=pstate@entry=0x61900002c298, pstmt=pstmt@entry=0x625000005ec0, queryString=queryString@entry=0x625000005218 "GRANT SELECT ON pg_proc TO CURRENT_USER;", context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=<optimized out>, qc=<optimized out>) at utility.c:1802
#13 0x00005567257c92a5 in standard_ProcessUtility (pstmt=pstmt@entry=0x625000005ec0, queryString=queryString@entry=0x625000005218 "GRANT SELECT ON pg_proc TO CURRENT_USER;", readOnlyTree=readOnlyTree@entry=false, context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=<optimized out>, qc=<optimized out>) at utility.c:973
#14 0x00007feba7e91d97 in pgss_ProcessUtility (pstmt=<optimized out>, queryString=<optimized out>, readOnlyTree=<optimized out>, context=<optimized out>, params=<optimized out>, queryEnv=<optimized out>, dest=<optimized out>, qc=<optimized out>) at pg_stat_statements.c:1143
#15 0x00005567257c9906 in ProcessUtility (pstmt=pstmt@entry=0x625000005ec0, queryString=<optimized out>, readOnlyTree=<optimized out>, context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=<optimized out>, queryEnv=<optimized out>, dest=0x6250000066f0, qc=0x7ffdeb8cae50) at utility.c:526
#16 0x00005567257c2488 in PortalRunUtility (portal=portal@entry=0x62500002fa18, pstmt=pstmt@entry=0x625000005ec0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x6250000066f0, qc=qc@entry=0x7ffdeb8cae50) at pquery.c:1158
#17 0x00005567257c3010 in PortalRunMulti (portal=portal@entry=0x62500002fa18, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x6250000066f0, altdest=altdest@entry=0x6250000066f0, qc=qc@entry=0x7ffdeb8cae50) at pquery.c:1315
#18 0x00005567257c3d82 in PortalRun (portal=portal@entry=0x62500002fa18, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x6250000066f0, altdest=altdest@entry=0x6250000066f0, qc=<optimized out>) at pquery.c:791
#19 0x00005567257b7f59 in exec_simple_query (query_string=query_string@entry=0x625000005218 "GRANT SELECT ON pg_proc TO CURRENT_USER;") at postgres.c:1240
#20 0x00005567257bcead in PostgresMain (dbname=dbname@entry=0x62900001b268 "regress003", username=username@entry=0x62900001b248 "postgres") at postgres.c:4572
#21 0x000055672555f72b in BackendRun (port=port@entry=0x614000230640) at postmaster.c:4461
#22 0x0000556725567fc7 in BackendStartup (port=port@entry=0x614000230640) at postmaster.c:4189
#23 0x00005567255685c4 in ServerLoop () at postmaster.c:1779
#24 0x000055672556acbb in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x603000000370) at postmaster.c:1463
#25 0x00005567251c4499 in main (argc=3, argv=0x603000000370) at main.c:200

After a series of experiments I came to the following simple reproduction:

numclients=2
for ((i=1;i<=300;i++)); do
  echo "iteration $i"
  for ((j=1;j<=numclients;j++)); do
    for ((k=1;k<=300;k++)); do echo "GRANT ALL ON pg_class TO public;"; done | psql >psql-$j.log 2>&1 &
  done
  wait
  grep 'TRAP:' server.log && { res=0; break; }
  psql -c "VACUUM FULL pg_class" >psql-vacuum.log
done
pg_ctl -w -t 10 -D "$PGDB" stop

...
iteration 3
TRAP: failed Assert("ItemIdIsNormal(lp)"), File: "heapam.c", Line: 3024, PID: 1817582
waiting for server to shut down............. failed

To ease the reproduction I did:
sed 's/minfree = Max(minfree, BLCKSZ \/ 10)/minfree = Max(minfree, BLCKSZ \/ 1)/' -i src/backend/access/heap/pruneheap.c
(I believe this doesn't change the behavior of page pruning too much.)
With some debugging messages added, I see:

2023-03-02 15:22:13.027 MSK|[unknown]|[unknown]|640094f5.1ba5ab|LOG: connection received: host=[local]
2023-03-02 15:22:13.027 MSK|law|test1|640094f5.1ba5ab|LOG: connection authorized: user=law database=test1 application_name=psql
2023-03-02 15:22:13.027 MSK|[unknown]|[unknown]|640094f5.1ba5ac|LOG: connection received: host=[local]
2023-03-02 15:22:13.028 MSK|law|test1|640094f5.1ba5ac|LOG: connection authorized: user=law database=test1 application_name=psql
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ab|LOG: statement: GRANT ALL ON pg_class TO public;
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ab|LOG: ExecuteGrantStmt(): before objectNamesToOids()
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ab|STATEMENT: GRANT ALL ON pg_class TO public;
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ab|LOG: ExecuteGrantStmt(): after objectNamesToOids()
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ab|STATEMENT: GRANT ALL ON pg_class TO public;
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ab|LOG: ExecGrant_Relation() after SearchSysCache1(); ItemPointerGetOffsetNumber(&tuple->t_self): 24
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ab|STATEMENT: GRANT ALL ON pg_class TO public;
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ab|LOG: ExecGrant_Relationi(): before CatalogTupleUpdate(); ItemPointerGetOffsetNumber(&newtuple->t_self): 24
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ab|STATEMENT: GRANT ALL ON pg_class TO public;
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ab|LOG: heap_update(): after ReadBuffer(); buffer: 2012, ItemPointerGetOffsetNumber(otid): 24
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ab|STATEMENT: GRANT ALL ON pg_class TO public;
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ab|LOG: heap_update(): after PageGetItemId(); ItemPointerGetOffsetNumber(otid): 24, lp->lp_flags: 1
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ab|STATEMENT: GRANT ALL ON pg_class TO public;
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ac|LOG: statement: GRANT ALL ON pg_class TO public;
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ac|LOG: ExecuteGrantStmt(): before objectNamesToOids()
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ac|STATEMENT: GRANT ALL ON pg_class TO public;
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ac|LOG: heap_page_prune_execute() before ItemIdSetUnused(); buffer: 2012, off: 24
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ac|STATEMENT: GRANT ALL ON pg_class TO public;
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ac|LOG: ExecuteGrantStmt(): after objectNamesToOids()
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ac|STATEMENT: GRANT ALL ON pg_class TO public;
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ac|LOG: ExecGrant_Relation() after SearchSysCache1(); ItemPointerGetOffsetNumber(&tuple->t_self): 24
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ac|STATEMENT: GRANT ALL ON pg_class TO public;
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ac|LOG: ExecGrant_Relationi(): before CatalogTupleUpdate(); ItemPointerGetOffsetNumber(&newtuple->t_self): 24
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ac|STATEMENT: GRANT ALL ON pg_class TO public;
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ac|LOG: heap_update(): after ReadBuffer(); buffer: 2012, ItemPointerGetOffsetNumber(otid): 24
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ac|STATEMENT: GRANT ALL ON pg_class TO public;
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ac|LOG: heap_update(): after PageGetItemId(); ItemPointerGetOffsetNumber(otid): 24, lp->lp_flags: 0
2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ac|STATEMENT: GRANT ALL ON pg_class TO public;
TRAP: failed Assert("ItemIdIsNormal(lp)"), File: "heapam.c", Line: 3025, PID: 1811884

So it looks like session 1ba5ac made line pointer 24 unused while pruning in the guts of objectNamesToOids(), then got the same offset (24) back from SearchSysCache1(), and finally was surprised inside heap_update() to find an unused line pointer.

This raises two questions for me:
1) Was it legal to prune that line pointer (what if ExecGrant does not perform heap_modify_tuple() in this case — will an unused line pointer be left behind)?
2) Should page pruning be reflected in the SysCache somehow?
02.03.2023 16:00, PG Bug reporting form wrote:
> #5  0x0000556725c44981 in ExceptionalCondition (conditionName=conditionName@entry=0x556725dc6b00 "ItemIdIsNormal(lp)", fileName=fileName@entry=0x556725dc48e0 "heapam.c", lineNumber=lineNumber@entry=3024) at assert.c:66
> ...
> 2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ac|LOG: ExecuteGrantStmt(): before objectNamesToOids()
> 2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ac|STATEMENT: GRANT ALL ON pg_class TO public;
> 2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ac|LOG: heap_page_prune_execute() before ItemIdSetUnused(); buffer: 2012, off: 24
> 2023-03-02 15:22:13.031 MSK|law|test1|640094f5.1ba5ac|STATEMENT: GRANT ALL ON pg_class TO public;
> ...
> So it looks like session 1ba5ac made line pointer 24 unused while pruning in the guts of objectNamesToOids(), then got the same offset (24) back from SearchSysCache1(), and finally was surprised inside heap_update() to find an unused line pointer.
> This raises two questions for me:
> 1) Was it legal to prune that line pointer (what if ExecGrant does not perform heap_modify_tuple() in this case — will an unused line pointer be left behind)?
> 2) Should page pruning be reflected in the SysCache somehow?
Further investigation gave me a positive answer to the first question.
With more debug logging added, I see that a redirection happened when that
pruning was executed. Dead items were found in a tuple chain starting from
position 22, and a redirection 22 -> 25 was performed, so item 24 became
unused.
A fragment of the detailed log is attached, and you can also look at a dump
of the modified page taken at the moment of the Assert.
SELECT * from heap_page_item_attrs(pg_read_binary_file('page.bin'), 1259::oid, true);
shows:
...
21 | 3600 | 1 | 225 | 680 | 0 | 0 | (10,21) | 33 | 11011 | 32 | 1111111111111111111111111111111000000000 | | ...
22 | 25 | 2 | 0 | | | | | | | | | |
23 | 0 | 3 | 0 | | | | | | | | | |
24 | 0 | 0 | 0 | | | | | | | | | |
25 | 3368 | 1 | 225 | 222501 | 0 | 0 | (10,25) | 32801 | 10499 | 32 | 1111111111111111111111111111111000000000 | | ...
(25 rows)
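(To decode the third column of that dump, lp_flags: 0 is LP_UNUSED, 1 is LP_NORMAL, 2 is LP_REDIRECT (lp_off then holds the redirect target), and 3 is LP_DEAD — so item 22 now redirects to 25, item 23 is dead, and item 24 is unused. A minimal sketch of the same check against the live relation, assuming the pageinspect extension is available; block 10 is taken from the t_ctid values above, and the offset range is only for illustration:)

-- Sketch only: inspect the line pointer state of block 10 of pg_class.
-- lp_flags: 0 = unused, 1 = normal, 2 = redirect (lp_off = target offset), 3 = dead.
CREATE EXTENSION IF NOT EXISTS pageinspect;

SELECT lp, lp_off, lp_flags, lp_len, t_ctid
FROM heap_page_items(get_raw_page('pg_class', 10))
WHERE lp BETWEEN 21 AND 25;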
So we have only the second question left.
Best regards,
Alexander
Attachment
Hi

On March 3, 2023 10:00:00 PM PST, Alexander Lakhin <exclusion@gmail.com> wrote:
> 02.03.2023 16:00, PG Bug reporting form wrote:
> So we have only the second question left.

My first suspicion would be that we aren't holding a lock in the right moment, to enforce cache invalidation processing.

Greetings,

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Hi,

04.03.2023 19:48, Andres Freund wrote:
> My first suspicion would be that we aren't holding a lock in the right moment, to enforce cache invalidation processing.
>

I've encountered another case of that assertion failure (this time with pg_statistic):

numclients=10
for ((c=1;c<=numclients;c++)); do
  createdb db$c
done

for ((i=1;i<=50;i++)); do
  echo "iteration $i"
  for ((c=1;c<=numclients;c++)); do
    echo "
CREATE TABLE t(i int4, t1 text[], t2 text[]);
INSERT INTO t SELECT g, ARRAY[repeat('x', g)], ARRAY[repeat('x', g)] FROM generate_series(1, 200) g;
ANALYZE t;
" | psql -d db$c >/dev/null
  done

  for ((c=1;c<=numclients;c++)); do
    echo "
ANALYZE t;
" | psql -d db$c >/dev/null &
    echo "
SELECT * FROM t WHERE i = 1;
ANALYZE t;
" | psql -d db$c >/dev/null &
  done
  wait

  grep 'TRAP:' server.log && { break; }

  for ((c=1;c<=numclients;c++)); do
    echo "
DROP TABLE t;
VACUUM pg_statistic;
" | psql -d db$c >/dev/null
  done
done

...
iteration 13
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost
TRAP: failed Assert("ItemIdIsNormal(lp)"), File: "heapam.c", Line: 3074, PID: 89697

Reproduced on REL_12_STABLE (starting from 5e0928005) .. master.
(Modifying pruneheap.c as in [1] is not needed for this case.)

Here we have the following sequence of events:

Session 1:
ANALYZE t; // needed just to produce material for pruning
  do_analyze_rel() -> update_attstats() -> CatalogTupleUpdateWithInfo() -> heap_update(ctid(18,16));
  PrepareToInvalidateCacheTuple(hashvalue: {hash1})

Session 2:
SELECT * FROM t WHERE i = 1;
  SearchCatCacheInternal(hashValue: {hash1})
  SearchCatCacheMiss(hashValue: {hash1})
  ProcessInvalidationMessages()
    CatCacheInvalidate(hashValue: {hash1})
    CatCacheInvalidate(hashValue: {hash2})
    CatCacheInvalidate(hashValue: {hash1})
  SearchCatCacheMiss returns tid(18,16)
ANALYZE t;
  do_analyze_rel() -> update_attstats()
    attno=0 SearchSysCache3() -> SearchCatCacheInternal(hashValue: {hash2}) -> SearchCatCacheMiss() -> heap_page_prune(block: 18)
    attno=1 SearchSysCache3() -> SearchCatCacheInternal(hashValue: {hash1}) -> oldtup = tid(18,16)
    CatalogTupleUpdateWithInfo() -> heap_update(tid(18, 16))

(A server.log with additional tracing messages for one of the failed runs is attached.)

In other words, a backend can get some tuple id from the catcache and then prune the underlying page while searching for another (missing) entry in the cache (any other reading of the page that may trigger pruning will do as well). After that, an attempt to update that tuple might trigger the Assert().

[1] https://www.postgresql.org/message-id/17821-dd8c334263399284%40postgresql.org

Best regards,
Alexander
Attachment
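(As an aside on the pg_statistic repro above: one quick, hedged way to watch the statistics tuples for the test table move between ANALYZE runs — assuming a table named t in one of the db$c databases, as in the script — is to query their current heap positions; each ANALYZE rewrites these rows, so their previous versions become the dead tuples that pruning later removes.)

-- Sketch only: show where the pg_statistic rows for table t currently live.
-- Re-running this after each ANALYZE shows the ctids changing as old versions die.
SELECT ctid, staattnum, stainherit
FROM pg_statistic
WHERE starelid = 't'::regclass
ORDER BY staattnum;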
Hello,

07.10.2023 11:00, Alexander Lakhin wrote:
>
> I've encountered another case of that assertion failure (this time with pg_statistic):

That second case is not reproduced anymore, after ad98fb142, but the first
is still actual.

I also discovered another one, with pg_subscription:
echo "
CREATE ROLE su SUPERUSER;
CREATE SUBSCRIPTION sub CONNECTION 'dbname=nonexisting'
PUBLICATION pub WITH (connect = false);
" | psql

for ((i=1;i<=20;i++)); do
echo "iteration $i"

for ((k=1;k<=100;k++)); do
for ((j=1;j<=200;j++)); do echo "
ALTER SUBSCRIPTION sub SET (synchronous_commit = false);
"; done | psql >/dev/null 2>&1 &

for ((j=1;j<=200;j++)); do echo "
ALTER SUBSCRIPTION sub OWNER TO su;
ALTER SUBSCRIPTION sub SET (slot_name = NONE);
"; done | psql >/dev/null 2>&1 &
wait
done

grep 'TRAP:' server.log && break;
done

This script fails for me as below:
...
iteration 5
TRAP: failed Assert("ItemIdIsNormal(lp)"), File: "heapam.c", Line: 3296, PID: 388093

(sed 's/minfree = Max(minfree, BLCKSZ \/ 10)/minfree = Max(minfree, BLCKSZ \/1)/'
-i src/backend/access/heap/pruneheap.c
is highly recommended for reproducing)

Melanie and Heikki, as you have been working on pruning over the recent
months, perhaps this defect could be of interest to you...

Best regards,
Alexander
On Tue, Apr 9, 2024 at 8:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello,
>
> 07.10.2023 11:00, Alexander Lakhin wrote:
> >
> > I've encountered another case of that assertion failure (this time with pg_statistic):
>
> That second case is not reproduced anymore, after ad98fb142, but the first
> is still actual.
>
> I also discovered another one, with pg_subscription:
> echo "
> CREATE ROLE su SUPERUSER;
> CREATE SUBSCRIPTION sub CONNECTION 'dbname=nonexisting'
> PUBLICATION pub WITH (connect = false);
> " | psql
>
> for ((i=1;i<=20;i++)); do
> echo "iteration $i"
>
> for ((k=1;k<=100;k++)); do
> for ((j=1;j<=200;j++)); do echo "
> ALTER SUBSCRIPTION sub SET (synchronous_commit = false);
> "; done | psql >/dev/null 2>&1 &
>
> for ((j=1;j<=200;j++)); do echo "
> ALTER SUBSCRIPTION sub OWNER TO su;
> ALTER SUBSCRIPTION sub SET (slot_name = NONE);
> "; done | psql >/dev/null 2>&1 &
> wait
> done
>
> grep 'TRAP:' server.log && break;
> done
>
> This script fails for me as below:
> ...
> iteration 5
> TRAP: failed Assert("ItemIdIsNormal(lp)"), File: "heapam.c", Line: 3296, PID: 388093
>
> (sed 's/minfree = Max(minfree, BLCKSZ \/ 10)/minfree = Max(minfree, BLCKSZ \/1)/'
> -i src/backend/access/heap/pruneheap.c
> is highly recommended for reproducing)
>
> Melanie and Heikki, as you have been working on pruning over the recent
> months, perhaps this defect could be of interest to you...

Hi Alexander,

Just to confirm, the original bug was filed against 15, but I see ad98fb142 was committed on master as well as backpatched. You are able to reproduce your pg_subscription example on master? I tried your repro (including the modification to minfree) and wasn't able to reproduce on master on my machine. This could totally be my fault, but I just wanted to confirm you were able to repro on master.

- Melanie
Hi Melanie,

09.04.2024 20:41, Melanie Plageman wrote:
> Just to confirm, the original bug was filed against 15, but I see
> ad98fb142 was committed on master as well as backpatched. You are able
> to reproduce your pg_subscription example on master? I tried your
> repro (including the modification to minfree) and wasn't able to
> reproduce on master on my machine. This could totally be my fault, but
> I just wanted to confirm you were able to repro on master.
>

Yes, I reproduced both cases on master (27074bce0).

Please look at the self-contained script that works for me:
.../repro-17821-SUBSCRIPTION.sh
...
CREATE ROLE
WARNING:  subscription was created, but is not connected
HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
CREATE SUBSCRIPTION
iteration 1
iteration 2
iteration 3
TRAP: failed Assert("ItemIdIsNormal(lp)"), File: "heapam.c", Line: 3296, PID: 1069025
TRAP: failed Assert("ItemIdIsNormal(lp)"), File: "heapam.c", Line: 3296, PID: 1069621

Best regards,
Alexander
Attachment
Hi,

On 2025-01-22 18:13:49 -0800, Noah Misch wrote:
> Attached. Outside of comments and tests, it's a simple 20-line patch. Since
> only two weeks will remain before release freeze, in the absence of review, I
> would push this after 2025-01-25T16:00+0000.

I just fixed skink, the valgrind animal, so it runs not just the main regression tests but all tests with a valgrind-ified postgres. Unfortunately, the next run triggered a failure in the test added in this thread:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2025-03-03%2019%3A52%3A38&stg=injection_points-check

diff -U3 /home/bf/bf-build/skink-master/HEAD/pgsql/src/test/modules/injection_points/expected/syscache-update-pruned.out /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/injection_points/isolation/results/syscache-update-pruned.out
--- /home/bf/bf-build/skink-master/HEAD/pgsql/src/test/modules/injection_points/expected/syscache-update-pruned.out    2025-01-25 19:30:50.005386842 +0000
+++ /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/injection_points/isolation/results/syscache-update-pruned.out    2025-03-03 21:08:02.025314915 +0000
@@ -75,6 +75,7 @@
 SELECT FROM injection_points_wakeup('heap_update-before-pin');
 <waiting ...>
 step grant1: <... completed>
+ERROR:  tuple concurrently deleted
 step wakegrant4: <... completed>
 step inspect4:
 	SELECT relhastriggers, relhassubclass FROM pg_class
@@ -82,6 +83,6 @@

 relhastriggers|relhassubclass
 --------------+--------------
-f |f
+t |t
 (1 row)

Greetings,

Andres Freund
On Mon, Mar 03, 2025 at 05:07:10PM -0500, Andres Freund wrote:
> On 2025-01-22 18:13:49 -0800, Noah Misch wrote:
> > Attached. Outside of comments and tests, it's a simple 20-line patch. Since
> > only two weeks will remain before release freeze, in the absence of review, I
> > would push this after 2025-01-25T16:00+0000.
>
> I just fixed skink, the valgrind animal, so it runs not just the main
> regression tests but all tests with a valgrind-ified postgres.

Thanks.

> Unfortunately,
> the next run triggered a failure in the test added in this thread:
>
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2025-03-03%2019%3A52%3A38&stg=injection_points-check
>
> diff -U3 /home/bf/bf-build/skink-master/HEAD/pgsql/src/test/modules/injection_points/expected/syscache-update-pruned.out /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/injection_points/isolation/results/syscache-update-pruned.out
> --- /home/bf/bf-build/skink-master/HEAD/pgsql/src/test/modules/injection_points/expected/syscache-update-pruned.out    2025-01-25 19:30:50.005386842 +0000
> +++ /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/injection_points/isolation/results/syscache-update-pruned.out    2025-03-03 21:08:02.025314915 +0000
> @@ -75,6 +75,7 @@
>  SELECT FROM injection_points_wakeup('heap_update-before-pin');
>  <waiting ...>
>  step grant1: <... completed>
> +ERROR:  tuple concurrently deleted
>  step wakegrant4: <... completed>
>  step inspect4:
>  	SELECT relhastriggers, relhassubclass FROM pg_class
> @@ -82,6 +83,6 @@
>
>  relhastriggers|relhassubclass
>  --------------+--------------
> -f |f
> +t |t
>  (1 row)

That isolation permutation tests an unfixed bug. Here, it's giving a result as though the bug were fixed. The suite passed the next two skink runs. I'd tend to defer debugging exactly what went wrong until the project to fix the bug under test. I could delete the permutation, or I could leave it awhile to see how high-probability this failure is. I'm inclined to leave it until it gets four failures, then delete the permutation.
On Mon, Mar 03, 2025 at 07:34:42PM -0800, Noah Misch wrote:
> On Mon, Mar 03, 2025 at 05:07:10PM -0500, Andres Freund wrote:
> > I just fixed skink, the valgrind animal, so it runs not just the main
> > regression tests but all tests with a valgrind-ified postgres.
>
> Thanks.
>
> > Unfortunately,
> > the next run triggered a failure in the test added in this thread:
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2025-03-03%2019%3A52%3A38&stg=injection_points-check
> >
> > diff -U3 /home/bf/bf-build/skink-master/HEAD/pgsql/src/test/modules/injection_points/expected/syscache-update-pruned.out /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/injection_points/isolation/results/syscache-update-pruned.out
> > --- /home/bf/bf-build/skink-master/HEAD/pgsql/src/test/modules/injection_points/expected/syscache-update-pruned.out    2025-01-25 19:30:50.005386842 +0000
> > +++ /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/injection_points/isolation/results/syscache-update-pruned.out    2025-03-03 21:08:02.025314915 +0000
> > @@ -75,6 +75,7 @@
> >  SELECT FROM injection_points_wakeup('heap_update-before-pin');
> >  <waiting ...>
> >  step grant1: <... completed>
> > +ERROR:  tuple concurrently deleted
> >  step wakegrant4: <... completed>
> >  step inspect4:
> >  	SELECT relhastriggers, relhassubclass FROM pg_class
> > @@ -82,6 +83,6 @@
> >
> >  relhastriggers|relhassubclass
> >  --------------+--------------
> > -f |f
> > +t |t
> >  (1 row)
>
> That isolation permutation tests an unfixed bug. Here, it's giving a result
> as though the bug were fixed. The suite passed the next two skink runs. I'd
> tend to defer debugging exactly what went wrong until the project to fix the
> bug under test. I could delete the permutation, or I could leave it awhile to
> see how high-probability this failure is. I'm inclined to leave it until it
> gets four failures, then delete the permutation.

Alexander Lakhin shared a recipe that reproduced it. I found permutation 2 also failed under that recipe, so removing permutation 3 got less attractive.

These tests need a concept of "wait until VACUUM definitely will prune a particular tuple". My removable_cutoff() header comment made an argument for having achieved that, but it had two gaps. The attached patch passes 1000 Valgrind iterations of the spec. Without the patch, ~3% of iterations failed.

Commit c2dc1a7 added DISABLE_PAGE_SKIPPING to stop buffer pins thwarting other VACUUM tests. However, I suspect FREEZE is necessary and sufficient in all these tests, since we need to coax this code to insist on the cleanup lock:

	/*
	 * If we didn't get the cleanup lock, we can still collect LP_DEAD
	 * items in the dead_items area for later vacuuming, count live and
	 * recently dead tuples for vacuum logging, and determine if this
	 * block could later be truncated. If we encounter any xid/mxids that
	 * require advancing the relfrozenxid/relminxid, we'll have to wait
	 * for a cleanup lock and call lazy_scan_prune().
	 */
	if (!got_cleanup_lock &&
		!lazy_scan_noprune(vacrel, buf, blkno, page, &has_lpdead_items))
	{
		/*
		 * lazy_scan_noprune could not do all required processing.  Wait
		 * for a cleanup lock, and call lazy_scan_prune in the usual way.
		 */
		Assert(vacrel->aggressive);
		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
		LockBufferForCleanup(buf);
		got_cleanup_lock = true;
	}

FREEZE causes the VACUUM to "require advancing the relfrozenxid/relminxid", hence the effectiveness of FREEZE for this purpose.
The code has moved around, but I think the policy was the same at the time of that commit c2dc1a7. Per the doc for DISABLE_PAGE_SKIPPING, it's for working around a corrupt vm, not for pin contention. The spec still passes 1000 Valgrind iterations if I remove DISABLE_PAGE_SKIPPING.
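(For illustration only — these are not the exact statements from the spec file; pg_class merely stands in for the catalog under test — the difference between the two options discussed above might look like this:)

-- Sketch, assuming a superuser session.
-- DISABLE_PAGE_SKIPPING keeps VACUUM from skipping pages via the visibility
-- map, but a pinned page can still be handled by lazy_scan_noprune without
-- waiting for the cleanup lock, so the target tuple may remain unpruned.
VACUUM (DISABLE_PAGE_SKIPPING) pg_class;

-- FREEZE, per the reasoning above, makes recently-created xids require
-- freezing, so lazy_scan_noprune cannot finish such a page and VACUUM waits
-- for the cleanup lock, guaranteeing the pruning the test depends on.
VACUUM (FREEZE) pg_class;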