Re: BUG #17741: vacuum process hangs after pg_surgery manipulations - Mailing list pgsql-bugs

From Masahiko Sawada
Subject Re: BUG #17741: vacuum process hangs after pg_surgery manipulations
Date
Msg-id CAD21AoBYvTfc9E+3p6ecN2n=UsftggWaQiZo1xtYnObQ-uTiQQ@mail.gmail.com
In response to Re: BUG #17741: vacuum process hangs after pg_surgery manipulations  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Responses Re: BUG #17741: vacuum process hangs after pg_surgery manipulations  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List pgsql-bugs
On Tue, Jan 17, 2023 at 12:37 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2023-Jan-09, PG Bug reporting form wrote:
>
> > On the REL_15_STABLE, you can hang vacuum freeze. Maybe this is not
> > desired?
> > https://www.postgresql.org/docs/current/pgsurgery.html
> >
> > reproduce script:
> > create extension pg_surgery;
>
> Using pg_surgery is the equivalent of introducing corruption in your
> data.  It has, of course, completely valid uses, but if you break the
> system while using it, it's on you to fix it.
>
> The pg_surgery documentation you cite states:
>
> : These functions are unsafe by design and using them may corrupt (or
> : further corrupt) your database.
>
> So, you've been warned.

While this is completely true and I agree, can we improve the
situation somewhat so that vacuum ends with an error instead of
hanging?

In this case, the tuple with a = 1, the root of the HOT chain, was
killed, and the tuple with a = 2 was a heap-only tuple that had been
HOT-updated. In heap_page_prune(), we would normally prune the tuple
with a = 2 as part of pruning its chain, but since the root tuple had
already been killed, we could not prune this tuple. We then retried
heap_page_prune() because it looked as if the tuple had become dead
after heap_page_prune() examined it. Normally retrying
heap_page_prune() works, but in this case, without the root tuple,
pruning misses the tuple again, and vacuum ends up retrying forever.
I think we didn't have this hang before 8523492d4e3, even with the
same corruption. One idea to improve the situation is to add a sanity
check for whether we have already retried because of the same tuple.
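
To illustrate the idea, here is a minimal sketch of such a check. It is a
toy model, not PostgreSQL source: ToyItem, toy_prune() and toy_scan_prune()
are hypothetical names, and the page is reduced to an array of item states.
It only shows the shape of the check, which remembers the item that forced
the previous retry and gives up with an error if the same item forces
another one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: every name below is hypothetical, not a PostgreSQL API. */
typedef enum { ITEM_UNUSED, ITEM_LIVE, ITEM_DEAD } ItemState;

typedef struct
{
    ItemState state;
    bool      heap_only;    /* reachable only through its HOT chain root */
    bool      root_present; /* false once the chain root has been killed */
} ToyItem;

/*
 * Remove dead items. A dead heap-only item is removed only when its chain
 * root still exists; an orphaned one is skipped, as in the corruption case
 * described above. Returns the number of items removed.
 */
static int
toy_prune(ToyItem *items, size_t n)
{
    int removed = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (items[i].state != ITEM_DEAD)
            continue;
        if (items[i].heap_only && !items[i].root_present)
            continue;           /* orphan: pruning cannot reach it */
        items[i].state = ITEM_UNUSED;
        removed++;
    }
    return removed;
}

/*
 * Prune, then scan the page. A dead item left behind after pruning normally
 * means it became dead concurrently, so we retry. If the same item forces a
 * second retry in a row, report an error (-1) instead of looping forever.
 * Returns 0 on success.
 */
static int
toy_scan_prune(ToyItem *items, size_t n)
{
    size_t last_retry = n;      /* n means "no retry yet" */

    assert(items != NULL || n == 0);

    for (;;)
    {
        size_t i;

        toy_prune(items, n);

        for (i = 0; i < n; i++)
        {
            if (items[i].state == ITEM_DEAD)
                break;          /* pruning missed this item */
        }

        if (i == n)
            return 0;           /* nothing dead left: done */
        if (i == last_retry)
            return -1;          /* same item again: give up with an error */
        last_retry = i;
    }
}
```

With an intact chain the loop finishes in one pass. With an orphaned
heap-only item it returns -1 on the second attempt, where a plain retry
loop would spin. A real check would raise an error through the usual
reporting path and might want to allow more than one retry.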

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


