Re: BUG #17821: Assertion failed in heap_update() due to heap pruning - Mailing list pgsql-bugs

From Noah Misch
Subject Re: BUG #17821: Assertion failed in heap_update() due to heap pruning
Date
Msg-id 20250402170436.4a.nmisch@google.com
Whole thread Raw
In response to Re: BUG #17821: Assertion failed in heap_update() due to heap pruning  (Noah Misch <noah@leadboat.com>)
List pgsql-bugs
On Mon, Mar 03, 2025 at 07:34:42PM -0800, Noah Misch wrote:
> On Mon, Mar 03, 2025 at 05:07:10PM -0500, Andres Freund wrote:
> > I just fixed skink, the valgrind animal, so it runs not just the main
> > regression tests but all tests with a valgrind-ified postgres.
> 
> Thanks.
> 
> > Unfortunately,
> > the next run triggered a failure in the test added in this thread:
> > 
> >
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2025-03-03%2019%3A52%3A38&stg=injection_points-check
> > 
> > diff -U3
/home/bf/bf-build/skink-master/HEAD/pgsql/src/test/modules/injection_points/expected/syscache-update-pruned.out
/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/injection_points/isolation/results/syscache-update-pruned.out
> > --- /home/bf/bf-build/skink-master/HEAD/pgsql/src/test/modules/injection_points/expected/syscache-update-pruned.out
  2025-01-25 19:30:50.005386842 +0000
 
> > +++
/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/injection_points/isolation/results/syscache-update-pruned.out
2025-03-0321:08:02.025314915 +0000
 
> > @@ -75,6 +75,7 @@
> >      SELECT FROM injection_points_wakeup('heap_update-before-pin');
> >   <waiting ...>
> >  step grant1: <... completed>
> > +ERROR:  tuple concurrently deleted
> >  step wakegrant4: <... completed>
> >  step inspect4:
> >      SELECT relhastriggers, relhassubclass FROM pg_class
> > @@ -82,6 +83,6 @@
> > 
> >  relhastriggers|relhassubclass
> >  --------------+--------------
> > -f             |f
> > +t             |t
> >  (1 row)
> 
> That isolation permutation tests an unfixed bug.  Here, it's giving a result
> as though the bug were fixed.  The suite passed the next two skink runs.  I'd
> tend to defer debugging exactly what went wrong until the project to fix the
> bug under test.  I could delete the permutation, or I could leave it awhile to
> see how high-probability this failure is.  I'm inclined to leave it until it
> gets four failures, then delete the permutation.

Alexander Lakhin shared a recipe that reproduced it.  I found permutation 2
also failed under that recipe, so removing permutation 3 got less attractive.

These tests need a concept of "wait until VACUUM definitely will prune a
particular tuple".  My removable_cutoff() header comment made an argument for
having achieved that, but it had two gaps.  The attached patch passes 1000
Valgrind iterations of the spec.  Without the patch, ~3% of iterations failed.

Commit c2dc1a7 added DISABLE_PAGE_SKIPPING to stop buffer pins thwarting other
VACUUM tests.  However, I suspect FREEZE is necessary and sufficient in all
these tests, since we need to coax this code to insist on the cleanup lock:

        /*
         * If we didn't get the cleanup lock, we can still collect LP_DEAD
         * items in the dead_items area for later vacuuming, count live and
         * recently dead tuples for vacuum logging, and determine if this
         * block could later be truncated. If we encounter any xid/mxids that
         * require advancing the relfrozenxid/relminxid, we'll have to wait
         * for a cleanup lock and call lazy_scan_prune().
         */
        if (!got_cleanup_lock &&
            !lazy_scan_noprune(vacrel, buf, blkno, page, &has_lpdead_items))
        {
            /*
             * lazy_scan_noprune could not do all required processing.  Wait
             * for a cleanup lock, and call lazy_scan_prune in the usual way.
             */
            Assert(vacrel->aggressive);
            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
            LockBufferForCleanup(buf);
            got_cleanup_lock = true;
        }

FREEZE causes the VACUUM to "require advancing the relfrozenxid/relminxid",
hence the effectiveness of FREEZE for this purpose.  The code has moved
around, but I think the policy was the same at the time of that commit
c2dc1a7.  Per the doc for DISABLE_PAGE_SKIPPING, it's for working around a
corrupt vm, not for pin contention.  The spec still passes 1000 Valgrind
iterations if I remove DISABLE_PAGE_SKIPPING.

Attachment

pgsql-bugs by date:

Previous
From: Álvaro Herrera
Date:
Subject: Re: BUG #18874: CVE-2024-4317
Next
From: PG Bug reporting form
Date:
Subject: BUG #18875: COPY BINARY tsvector FROM file leads to misaligned memory access