Thread: pgsql: Avoid improbable PANIC during heap_update.

pgsql: Avoid improbable PANIC during heap_update.

From
Tom Lane
Date:
Avoid improbable PANIC during heap_update.

heap_update needs to clear any existing "all visible" flag on
the old tuple's page (and on the new page too, if different).
Per coding rules, to do this it must acquire pin on the appropriate
visibility-map page while not holding exclusive buffer lock;
which creates a race condition since someone else could set the
flag whenever we're not holding the buffer lock.  The code is
supposed to handle that by re-checking the flag after acquiring
buffer lock and retrying if it became set.  However, one code
path through heap_update itself, as well as one in its subroutine
RelationGetBufferForTuple, failed to do this.  The end result,
in the unlikely event that a concurrent VACUUM did set the flag
while we're transiently not holding lock, is a non-recurring
"PANIC: wrong buffer passed to visibilitymap_clear" failure.

This has been seen a few times in the buildfarm since recent VACUUM
changes that added code paths that could set the all-visible flag
while holding only exclusive buffer lock.  Previously, the flag
was (usually?) set only after doing LockBufferForCleanup, which
would insist on buffer pin count zero, thus preventing the flag
from becoming set partway through heap_update.  However, it's
clear that it's heap_update not VACUUM that's at fault here.

What's less clear is whether there is any hazard from these bugs
in released branches.  heap_update is certainly violating API
expectations, but if there is no code path that can set all-visible
without a cleanup lock then it's only a latent bug.  That's not
100% certain though, besides which we should worry about extensions
or future back-patch fixes that could introduce such code paths.

I chose to back-patch to v12.  Fixing RelationGetBufferForTuple
before that would require also back-patching portions of older
fixes (notably 0d1fe9f74), which is more code churn than seems
prudent to fix a hypothetical issue.

Discussion: https://postgr.es/m/2247102.1618008027@sss.pgh.pa.us

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/34f581c39e97e2ea237255cf75cccebccc02d477

Modified Files
--------------
src/backend/access/heap/heapam.c | 44 ++++++++++++++++++++++++----------------
src/backend/access/heap/hio.c    | 24 ++++++++++++++++------
2 files changed, 45 insertions(+), 23 deletions(-)


Re: pgsql: Avoid improbable PANIC during heap_update.

From
Jaime Casanova
Date:
On Tue, Apr 13, 2021 at 04:17:39PM +0000, Tom Lane wrote:
> Avoid improbable PANIC during heap_update.
> 
> heap_update needs to clear any existing "all visible" flag on
> the old tuple's page (and on the new page too, if different).
> Per coding rules, to do this it must acquire pin on the appropriate
> visibility-map page while not holding exclusive buffer lock;
> which creates a race condition since someone else could set the
> flag whenever we're not holding the buffer lock.  The code is
> supposed to handle that by re-checking the flag after acquiring
> buffer lock and retrying if it became set.  However, one code
> path through heap_update itself, as well as one in its subroutine
> RelationGetBufferForTuple, failed to do this.  The end result,
> in the unlikely event that a concurrent VACUUM did set the flag
> while we're transiently not holding lock, is a non-recurring
> "PANIC: wrong buffer passed to visibilitymap_clear" failure.
> 

Hi,

This doesn't look so improbable: I've seen it at least 3 times with
v15beta4.

The first time I thought it was my fault; then I tried with a commit from
September 25 (I don't remember which one exactly, but that doesn't seem
too relevant).
Finally I saw it again in a build with TRACE_VISIBILITYMAP defined (the
same commit).

But I haven't seen it anymore on RC1. Anyway, I'm attaching the backtrace
(from the build with TRACE_VISIBILITYMAP). The query running at the time
was the following (no changes were made to the quad_poly_tbl table, nor
were any indexes added to it):

"""
update public.quad_poly_tbl set
  id = public.quad_poly_tbl.id
returning
  public.quad_poly_tbl.id as c0
"""

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL

Attachment

Re: pgsql: Avoid improbable PANIC during heap_update.

From
Jaime Casanova
Date:
On Thu, Sep 29, 2022 at 02:55:40AM -0500, Jaime Casanova wrote:
> On Tue, Apr 13, 2021 at 04:17:39PM +0000, Tom Lane wrote:
> > Avoid improbable PANIC during heap_update.
> > 
> > heap_update needs to clear any existing "all visible" flag on
> > the old tuple's page (and on the new page too, if different).
> > Per coding rules, to do this it must acquire pin on the appropriate
> > visibility-map page while not holding exclusive buffer lock;
> > which creates a race condition since someone else could set the
> > flag whenever we're not holding the buffer lock.  The code is
> > supposed to handle that by re-checking the flag after acquiring
> > buffer lock and retrying if it became set.  However, one code
> > path through heap_update itself, as well as one in its subroutine
> > RelationGetBufferForTuple, failed to do this.  The end result,
> > in the unlikely event that a concurrent VACUUM did set the flag
> > while we're transiently not holding lock, is a non-recurring
> > "PANIC: wrong buffer passed to visibilitymap_clear" failure.
> > 
> 
> Hi,
> 
> This doesn't look so improbable: I've seen it at least 3 times with
> v15beta4.
> 
> The first time I thought it was my fault; then I tried with a commit from
> September 25 (I don't remember which one exactly, but that doesn't seem
> too relevant).
> Finally I saw it again in a build with TRACE_VISIBILITYMAP defined (the
> same commit).
> 
> But I haven't seen it anymore on RC1. Anyway, I'm attaching the backtrace
> (from the build with TRACE_VISIBILITYMAP). The query running at the time
> was the following (no changes were made to the quad_poly_tbl table, nor
> were any indexes added to it):
> 
> """
> update public.quad_poly_tbl set
>   id = public.quad_poly_tbl.id
> returning
>   public.quad_poly_tbl.id as c0
> """
> 

Just to confirm I saw this on RC1

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL



Re: pgsql: Avoid improbable PANIC during heap_update.

From
Tom Lane
Date:
Jaime Casanova <jcasanov@systemguards.com.ec> writes:
> Just to confirm I saw this on RC1

What test case are you using?

            regards, tom lane



Re: pgsql: Avoid improbable PANIC during heap_update.

From
Tom Lane
Date:
Jaime Casanova <jcasanov@systemguards.com.ec> writes:
> Just to confirm I saw this on RC1

Ugh ... I think I see the problem.  There's still one path through
RelationGetBufferForTuple that fails to guarantee that it's acquired
a vmbuffer pin if the all-visible flag becomes set in the otherBuffer.
Namely, if we're forced to extend the relation, then we deal with
vm pins when ConditionalLockBuffer(otherBuffer) fails ... but not
when it succeeds.  I think the fix is just to move the last
GetVisibilityMapPins call out of the "if
(unlikely(!ConditionalLockBuffer(otherBuffer)))" stanza.

It'd still be good to have a test case for this ...

            regards, tom lane



Re: pgsql: Avoid improbable PANIC during heap_update.

From
Peter Geoghegan
Date:
On Fri, Sep 30, 2022 at 2:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jaime Casanova <jcasanov@systemguards.com.ec> writes:
> > Just to confirm I saw this on RC1
>
> Ugh ... I think I see the problem.  There's still one path through
> RelationGetBufferForTuple that fails to guarantee that it's acquired
> a vmbuffer pin if the all-visible flag becomes set in the otherBuffer.

> It'd still be good to have a test case for this ...

FWIW it seems possible that the Postgres 15 vacuumlazy.c work that
added lazy_scan_noprune() made this scenario more likely in practice
-- even compared to Postgres 14.

Note that VACUUM will collect preexisting LP_DEAD items in heap pages
that cannot be cleanup locked during VACUUM's first heap pass in
Postgres 15 (in lazy_scan_noprune). There is no need for a cleanup
lock in the second heap pass, either (that detail is the same as in 14,
but not in earlier versions). So 15 is the first version that doesn't
need a cleanup lock in either the first heap pass or the second heap
pass to be able to set the heap page all-visible. That difference
seems like it could be "protective" on 14, especially when vacuuming
smaller tables.

-- 
Peter Geoghegan



Re: pgsql: Avoid improbable PANIC during heap_update.

From
Tom Lane
Date:
I wrote:
> Ugh ... I think I see the problem.  There's still one path through
> RelationGetBufferForTuple that fails to guarantee that it's acquired
> a vmbuffer pin if the all-visible flag becomes set in the otherBuffer.
> Namely, if we're forced to extend the relation, then we deal with
> vm pins when ConditionalLockBuffer(otherBuffer) fails ... but not
> when it succeeds.  I think the fix is just to move the last
> GetVisibilityMapPins call out of the "if
> (unlikely(!ConditionalLockBuffer(otherBuffer)))" stanza.

Concretely, about like this.

            regards, tom lane

diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index ae2e2ce37a..b0ece66629 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -678,29 +678,34 @@ loop:
             LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
             LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
             LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+        }
 
-            /*
-             * Because the buffers were unlocked for a while, it's possible,
-             * although unlikely, that an all-visible flag became set or that
-             * somebody used up the available space in the new page.  We can
-             * use GetVisibilityMapPins to deal with the first case.  In the
-             * second case, just retry from start.
-             */
-            GetVisibilityMapPins(relation, otherBuffer, buffer,
-                                 otherBlock, targetBlock, vmbuffer_other,
-                                 vmbuffer);
+        /*
+         * Because the buffers were unlocked for a while, it's possible,
+         * although unlikely, that an all-visible flag became set or that
+         * somebody used up the available space in the new page.  We can use
+         * GetVisibilityMapPins to deal with the first case.  In the second
+         * case, just retry from start.
+         */
+        GetVisibilityMapPins(relation, otherBuffer, buffer,
+                             otherBlock, targetBlock, vmbuffer_other,
+                             vmbuffer);
 
-            if (len > PageGetHeapFreeSpace(page))
-            {
-                LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
-                UnlockReleaseBuffer(buffer);
+        /*
+         * Note that we have to check the available space even if our
+         * conditional lock succeeded, because GetVisibilityMapPins might've
+         * transiently released lock on the target buffer to acquire a VM pin
+         * for the otherBuffer.
+         */
+        if (len > PageGetHeapFreeSpace(page))
+        {
+            LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
+            UnlockReleaseBuffer(buffer);
 
-                goto loop;
-            }
+            goto loop;
         }
     }
-
-    if (len > PageGetHeapFreeSpace(page))
+    else if (len > PageGetHeapFreeSpace(page))
     {
         /* We should not get here given the test at the top */
         elog(PANIC, "tuple is too big: size %zu", len);

Re: pgsql: Avoid improbable PANIC during heap_update.

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Fri, Sep 30, 2022 at 2:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Ugh ... I think I see the problem.  There's still one path through
>> RelationGetBufferForTuple that fails to guarantee that it's acquired
>> a vmbuffer pin if the all-visible flag becomes set in the otherBuffer.

> FWIW it seems possible that the Postgres 15 vacuumlazy.c work that
> added lazy_scan_noprune() made this scenario more likely in practice
> -- even compared to Postgres 14.

Could be, because we haven't seen field reports of this in v14 yet.
And there's still no hard evidence of a bug pre-14.  Nonetheless,
I'm inclined to backpatch to v12 as 34f581c39 was.

            regards, tom lane



Re: pgsql: Avoid improbable PANIC during heap_update.

From
Peter Geoghegan
Date:
On Fri, Sep 30, 2022 at 2:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > FWIW it seems possible that the Postgres 15 vacuumlazy.c work that
> > added lazy_scan_noprune() made this scenario more likely in practice
> > -- even compared to Postgres 14.
>
> Could be, because we haven't seen field reports of this in v14 yet.

I would be more confident here were it not for the recent
heap_delete() issue reported by one of my AWS colleagues (and fixed by
another, Jeff Davis). See commit 163b0993 if you missed it before now.

> And there's still no hard evidence of a bug pre-14.  Nonetheless,
> I'm inclined to backpatch to v12 as 34f581c39 was.

+1

-- 
Peter Geoghegan



Re: pgsql: Avoid improbable PANIC during heap_update.

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Fri, Sep 30, 2022 at 2:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Could be, because we haven't seen field reports of this in v14 yet.

> I would be more confident here were it not for the recent
> heap_delete() issue reported by one of my AWS colleagues (and fixed by
> another, Jeff Davis). See commit 163b0993 if you missed it before now.

Hmm, okay, though that's really a distinct bug of the same ilk.
You're right that I'd not paid close attention to that thread after
Jeff diagnosed the problem.  It does seem like Robins' report
shows that there's some way that v13 will set the AV bit without
a cleanup lock ... does that constitute a bug in itself?

            regards, tom lane



Re: pgsql: Avoid improbable PANIC during heap_update.

From
Peter Geoghegan
Date:
On Fri, Sep 30, 2022 at 4:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I would be more confident here were it not for the recent
> > heap_delete() issue reported by one of my AWS colleagues (and fixed by
> > another, Jeff Davis). See commit 163b0993 if you missed it before now.
>
> Hmm, okay, though that's really a distinct bug of the same ilk.
> You're right that I'd not paid close attention to that thread after
> Jeff diagnosed the problem.

I just meant that I don't feel particularly confident about what might
be possible or likely with this new heap_update() issue on Postgres 14
point releases that lack today's bugfix. My theory about
lazy_scan_noprune() might be correct, but take it with a grain of salt.

> It does seem like Robins' report
> shows that there's some way that v13 will set the AV bit without
> a cleanup lock ... does that constitute a bug in itself?

We never got to the bottom of that part, strangely enough. I can ask again.

In any case we cannot really treat the information that we have about
that as a bug report -- not as things stand. Why should the question
of whether or not we ever set a page PD_ALL_VISIBLE without a cleanup
lock on v13 be a mystery at all? Why wouldn't a simple grep get to the
bottom of it? I have to imagine that the true explanation is very
simple and boring.

-- 
Peter Geoghegan



Re: pgsql: Avoid improbable PANIC during heap_update.

From
Peter Geoghegan
Date:
On Fri, Sep 30, 2022 at 5:09 PM Peter Geoghegan <pg@bowt.ie> wrote:
> In any case we cannot really treat the information that we have about
> that as a bug report -- not as things stand. Why should the question
> of whether or not we ever set a page PD_ALL_VISIBLE without a cleanup
> lock on v13 be a mystery at all? Why wouldn't a simple grep get to the
> bottom of it? I have to imagine that the true explanation is very
> simple and boring.

I talked to Robins about this privately. I was wrong; there isn't a
simple or boring explanation.

Robins set out to find bugs like this in Postgres via stress-testing.
He used a lab environment for this, and was quite methodical. So there
is no reason to doubt that a PANIC happened on v13 at least once.
There must be some relatively complicated explanation for that, but
right now I can only speculate.

--
Peter Geoghegan



Re: pgsql: Avoid improbable PANIC during heap_update.

From
Peter Geoghegan
Date:
On Fri, Sep 30, 2022 at 6:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I talked to Robins about this privately. I was wrong; there isn't a
> simple or boring explanation.

I think that I figured it out. With or without bugfix commit 163b0993,
we do these steps early in heap_delete() (this is 13 code as of
today):

2490     page = BufferGetPage(buffer);
2491
2492     /*
2493      * Before locking the buffer, pin the visibility map page if it appears to
2494      * be necessary.  Since we haven't got the lock yet, someone else might be
2495      * in the middle of changing this, so we'll need to recheck after we have
2496      * the lock.
2497      */
2498     if (PageIsAllVisible(page))
2499         visibilitymap_pin(relation, block, &vmbuffer);

So we're calling visibilitymap_pin() having just acquired a buffer pin
on a heap page buffer for the first time, and without acquiring a
buffer lock on the same heap page (we don't hold one now, and we've
never held one at some earlier point).

Without Jeff's bugfix, nothing stops heap_delete() from getting a pin
on a heap page that happens to have already been cleanup locked by
another session running VACUUM. The same session could then
(correctly) observe that the page does not have PD_ALL_VISIBLE set --
not yet. VACUUM might then set PD_ALL_VISIBLE, without having to
acquire any kind of interlock that heap_delete() will reliably notice.
After all, VACUUM had a cleanup lock before the other session's call
to heap_delete() even began -- so the responsibility has to lie with
heap_delete().

Jeff's bugfix will fix the bug on 13 too. The bugfix doesn't take the
aggressive/conservative approach of simply getting an exclusive lock
to check PageIsAllVisible() at the same point, on performance grounds
(no change there). The bugfix does make this old heap_delete()
no-buffer-lock behavior safe by teaching heap_delete() to not assume
that a page that didn't have PD_ALL_VISIBLE initially set cannot have
it set concurrently.

So 13 is only different from 14 in that there are fewer ways for
essentially the same race to happen. This is probably only true for
the heap_delete() issue, not either of the similar heap_update()
issues.

--
Peter Geoghegan



Re: pgsql: Avoid improbable PANIC during heap_update.

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> ... nothing stops heap_delete() from getting a pin
> on a heap page that happens to have already been cleanup locked by
> another session running VACUUM. The same session could then
> (correctly) observe that the page does not have PD_ALL_VISIBLE set --
> not yet. VACUUM might then set PD_ALL_VISIBLE, without having to
> acquire any kind of interlock that heap_delete() will reliably notice.

I'm too tired to think this through completely clearly, but this
sounds right, and what it seems to imply is that this race condition
exists in all PG versions.  Which would imply that we need to do the
work to back-patch these three fixes into v11/v10.  I would rather
not do that, because then we'd have to also back-patch some other
more-invasive changes, and the net risk of introducing new bugs
seems uncomfortably high.  (Especially for v10, where there will
be no second chance after the November releases.)

So what is bothering me about this line of thought is: how come
there have not been reports of these failures in older branches?
Is there some aspect we're not thinking about that masks the bug?

            regards, tom lane



Re: pgsql: Avoid improbable PANIC during heap_update.

From
Peter Geoghegan
Date:
On Fri, Sep 30, 2022 at 9:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm too tired to think this through completely clearly, but this
> sounds right, and what it seems to imply is that this race condition
> exists in all PG versions.

I think that the heap_delete() issue is probably in all PG versions.

> Which would imply that we need to do the
> work to back-patch these three fixes into v11/v10.

I am not aware of any reason why we should need the heap_update()
fixes to be backpatched any further. Though I will need to think about
it some more.

> So what is bothering me about this line of thought is: how come
> there have not been reports of these failures in older branches?
> Is there some aspect we're not thinking about that masks the bug?

The likely explanation is that Robins was able to find the
heap_delete() bug by throwing lots of resources (human effort and
machine time) into it. It literally took weeks of adversarial
stress-testing to find the bug. It's entirely possible and perhaps
likely that this isn't representative of real world conditions in some
crucial way.

-- 
Peter Geoghegan



Re: pgsql: Avoid improbable PANIC during heap_update.

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> I think that the heap_delete() issue is probably in all PG versions.

Yeah, that's what I'm afraid of ...

> I am not aware of any reason why we should need the heap_update()
> fixes to be backpatched any further.

How so?  AFAICS these are exactly the same oversight, ie failure
to deal with the all-visible bit getting set partway through the
operation.  You've explained how that can happen.

            regards, tom lane



Re: pgsql: Avoid improbable PANIC during heap_update.

From
Peter Geoghegan
Date:
On Fri, Sep 30, 2022 at 10:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> How so?  AFAICS these are exactly the same oversight, ie failure
> to deal with the all-visible bit getting set partway through the
> operation.  You've explained how that can happen.

I thought that there might have been something protective about how
the loop would work in heap_update(), but perhaps that's not true. It
might just be that heap_update() does lots of stuff in between, so
it's less likely to be affected by this particular race (the race
which seems to be present in all versions).

-- 
Peter Geoghegan



Re: pgsql: Avoid improbable PANIC during heap_update.

From
Jeff Davis
Date:
On Fri, 2022-09-30 at 21:23 -0700, Peter Geoghegan wrote:
> 2490     page = BufferGetPage(buffer);
> 2491
> 2492     /*
> 2493      * Before locking the buffer, pin the visibility map page if it appears to
> 2494      * be necessary.  Since we haven't got the lock yet, someone else might be
> 2495      * in the middle of changing this, so we'll need to recheck after we have
> 2496      * the lock.
> 2497      */
> 2498     if (PageIsAllVisible(page))
> 2499         visibilitymap_pin(relation, block, &vmbuffer);
>
> So we're calling visibilitymap_pin() having just acquired a buffer pin
> on a heap page buffer for the first time, and without acquiring a
> buffer lock on the same heap page (we don't hold one now, and we've
> never held one at some earlier point).
>
> Without Jeff's bugfix, nothing stops heap_delete() from getting a pin
> on a heap page that happens to have already been cleanup locked by
> another session running VACUUM. The same session could then
> (correctly) observe that the page does not have PD_ALL_VISIBLE set --
> not yet. 

With you so far; I had considered this code path and was still unable
to repro.

> VACUUM might then set PD_ALL_VISIBLE, without having to
> acquire any kind of interlock that heap_delete() will reliably
> notice.
> After all, VACUUM had a cleanup lock before the other session's call
> to heap_delete() even began -- so the responsibility has to lie with
> heap_delete().

Directly after the code you reference above, there is (in 5f9dda4c06,
right before my patch):

 2501     LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 2502
 2503     /*
 2504      * If we didn't pin the visibility map page and the page has become all
 2505      * visible while we were busy locking the buffer, we'll have to unlock and
 2506      * re-lock, to avoid holding the buffer lock across an I/O.  That's a bit
 2507      * unfortunate, but hopefully shouldn't happen often.
 2508      */
 2509     if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
 2510     {
 2511         LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 2512         visibilitymap_pin(relation, block, &vmbuffer);
 2513         LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 2514     }

Doesn't that deal with the case you brought up directly? heap_delete()
can't get the exclusive lock until VACUUM releases its cleanup lock, at
which point all-visible will be set. Then heap_delete() should notice
in line 2509 and then pin the VM buffer. Right?

And I don't think a similar issue exists when the locks are briefly
released on lines 2563 or 2606. The pin is held until after the VM bit
is cleared (aside from an error path and an early return):

 2489     buffer = ReadBuffer(relation, block);
 ...
 2717     if (PageIsAllVisible(page))
 2718     {
 2719         all_visible_cleared = true;
 2720         PageClearAllVisible(page);
 ...
 2835     ReleaseBuffer(buffer);

If VACUUM hasn't acquired the cleanup lock before line 2489, it can't get
one until heap_delete() is done looking at the all-visible bit anyway.
So I don't see how my patch would have fixed it even if that was the
problem.

Of course, I must be wrong somewhere, because the bug seems to exist.

--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: pgsql: Avoid improbable PANIC during heap_update.

From
Peter Geoghegan
Date:
On Sat, Oct 1, 2022 at 9:35 AM Jeff Davis <pgsql@j-davis.com> wrote:
> Doesn't that deal with the case you brought up directly? heap_delete()
> can't get the exclusive lock until VACUUM releases its cleanup lock, at
> which point all-visible will be set. Then heap_delete() should notice
> in line 2509 and then pin the VM buffer. Right?

I now believe that you're right. I don't think that this code was ever
designed to rely on cleanup locks in any way; that was kind of an
accident all along. Even still, I'm not sure how I missed such an
obvious thing. Sorry for the misdirection.

Still, there has to be *some* reason why the bug could repro on 13.

--
Peter Geoghegan



Re: pgsql: Avoid improbable PANIC during heap_update.

From
Jaime Casanova
Date:
On Fri, Sep 30, 2022 at 04:51:20PM -0400, Tom Lane wrote:
> Jaime Casanova <jcasanov@systemguards.com.ec> writes:
> > Just to confirm I saw this on RC1
> 
> What test case are you using?
> 

Hi,

Currently the way I have to reproduce it is:

- install the regression database
- drop all tables but: 
    hash_i4_heap, hash_name_heap, hash_txt_heap,
      quad_poly_tbl, road
- run 10 sqlsmith processes... normally the problem appears in an hour
  or less

I have the logs from the last time it happened, so maybe I can trace the
exact pattern to reproduce it at will... at least to keep a test
somewhere.

BTW, so far so good with your last fix (about 12 hours now)...

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL