Thread: BUG #17741: vacuum process hangs after pg_surgery manipulations

BUG #17741: vacuum process hangs after pg_surgery manipulations

From

PG Bug reporting form

Date:

09 January 2023, 08:37:37

The following bug has been logged on the website:

Bug reference:      17741
Logged by:          Alexander Kozhemyakin
Email address:      a.kozhemyakin@postgrespro.ru
PostgreSQL version: 15.1
Operating system:   Debian 10 (x86_64)
Description:

On the REL_15_STABLE, you can hang vacuum freeze. Maybe this is not
desired?
https://www.postgresql.org/docs/current/pgsurgery.html

reproduce script:
create extension pg_surgery;
create table t1(a int);
insert into t1 values (1);
update t1 set a = 2;
select heap_force_kill('t1'::regclass, ARRAY['(0, 1)']::tid[]);
select ctid from t1;
update t1 set a = 3;
select ctid from t1;
vacuum freeze t1;

Now we have hung vacuum process.
stacktrace:
#0  0x0000561b89170515 in heap_prune_chain (prstate=0x7ffd243a8fb0,
rootoffnum=2, buffer=129) at pruneheap.c:615
#1  heap_page_prune (relation=relation@entry=0x7fd14fa3cba8,
buffer=buffer@entry=129, vistest=<optimized out>,
old_snap_xmin=old_snap_xmin@entry=0, old_snap_ts=old_snap_ts@entry=0,
nnewlpdead=nnewlpdead@entry=0x7ffd243a9c04, off_loc=0x561b8ae3f87c)
    at pruneheap.c:376
#2  0x0000561b89172757 in lazy_scan_prune
(vacrel=vacrel@entry=0x561b8ae3f7f8, buf=buf@entry=129, blkno=blkno@entry=0,
page=page@entry=0x7fd1503abc00 "",
prunestate=prunestate@entry=0x7ffd243aaeb0) at vacuumlazy.c:1590
#3  0x0000561b8917451b in lazy_scan_heap (vacrel=0x561b8ae3f7f8) at
vacuumlazy.c:1048
#4  heap_vacuum_rel (rel=0x7fd14fa3cba8, params=0x7ffd243ab360,
bstrategy=<optimized out>) at vacuumlazy.c:534
#5  0x0000561b8929d1bb in table_relation_vacuum (bstrategy=<optimized out>,
params=0x7ffd243ab360, rel=0x7fd14fa3cba8) at
../../../src/include/access/tableam.h:1680
#6  vacuum_rel (relid=24576, relation=<optimized out>,
params=params@entry=0x7ffd243ab360) at vacuum.c:2086
#7  0x0000561b8929e6f5 in vacuum (relations=0x561b8aee6668,
params=0x7ffd243ab360, bstrategy=<optimized out>, isTopLevel=<optimized
out>) at vacuum.c:475
#8  0x0000561b8929ec1f in ExecVacuum (pstate=pstate@entry=0x561b8aedbae8,
vacstmt=vacstmt@entry=0x561b8ae1a268, isTopLevel=isTopLevel@entry=true) at
vacuum.c:275
#9  0x0000561b8940c3d0 in standard_ProcessUtility (pstmt=0x561b8ae1a628,
queryString=0x561b8ae197b8 "vacuum freeze t1;", readOnlyTree=<optimized
out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x561b8ae1a708, qc=0x7ffd243ab6b0)
    at utility.c:866
#10 0x0000561b8940ab3f in PortalRunUtility
(portal=portal@entry=0x561b8ae87018, pstmt=pstmt@entry=0x561b8ae1a628,
isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false, dest=0x561b8ae1a708,
qc=0x7ffd243ab6b0) at pquery.c:1158
#11 0x0000561b8940ac73 in PortalRunMulti
(portal=portal@entry=0x561b8ae87018, isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x561b8ae1a708,
altdest=altdest@entry=0x561b8ae1a708, qc=qc@entry=0x7ffd243ab6b0)
    at pquery.c:1315
#12 0x0000561b8940b17f in PortalRun (portal=portal@entry=0x561b8ae87018,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,
run_once=run_once@entry=true, dest=dest@entry=0x561b8ae1a708,
altdest=altdest@entry=0x561b8ae1a708,
    qc=0x7ffd243ab6b0) at pquery.c:791
#13 0x0000561b89407763 in exec_simple_query (query_string=0x561b8ae197b8
"vacuum freeze t1;") at postgres.c:1250
#14 0x0000561b8940840b in PostgresMain (dbname=<optimized out>,
username=<optimized out>) at postgres.c:4593
#15 0x0000561b8938e791 in BackendRun (port=<optimized out>, port=<optimized
out>) at postmaster.c:4504
#16 BackendStartup (port=<optimized out>) at postmaster.c:4232
#17 ServerLoop () at postmaster.c:1806
#18 0x0000561b8938f718 in PostmasterMain (argc=3, argv=0x561b8ae13db0) at
postmaster.c:1478
#19 0x0000561b89123f19 in main (argc=3, argv=0x561b8ae13db0) at main.c:202

Re: BUG #17741: vacuum process hangs after pg_surgery manipulations

From

Alvaro Herrera

Date:

16 January 2023, 15:31:04

On 2023-Jan-09, PG Bug reporting form wrote:

> On the REL_15_STABLE, you can hang vacuum freeze. Maybe this is not
> desired?
> https://www.postgresql.org/docs/current/pgsurgery.html
> 
> reproduce script:
> create extension pg_surgery;

Using pg_surgery is the equivalent of introducing corruption in your
data.  It has, of course, completely valid uses, but if you break the
system while using it, it's on you to fix it.

The pg_surgery documentation you cite states:

: These functions are unsafe by design and using them may corrupt (or
: further corrupt) your database.

So, you've been warned.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"It takes less than 2 seconds to get to 78% complete; that's a good sign.
A few seconds later it's at 90%, but it seems to have stuck there.  Did
somebody make percentages logarithmic while I wasn't looking?"
                http://smylers.hates-software.com/2005/09/08/1995c749.html

Re: BUG #17741: vacuum process hangs after pg_surgery manipulations

From

Masahiko Sawada

Date:

18 January 2023, 01:40:39

On Tue, Jan 17, 2023 at 12:37 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2023-Jan-09, PG Bug reporting form wrote:
>
> > On the REL_15_STABLE, you can hang vacuum freeze. Maybe this is not
> > desired?
> > https://www.postgresql.org/docs/current/pgsurgery.html
> >
> > reproduce script:
> > create extension pg_surgery;
>
> Using pg_surgery is the equivalent of introducing corruption in your
> data.  It has, of course, completely valid uses, but if you break the
> system while using it, it's on you to fix it.
>
> The pg_surgery documentation you cite states:
>
> : These functions are unsafe by design and using them may corrupt (or
> : further corrupt) your database.
>
> So, you've been warned.

While this is completely true and I agree, can we improve this
situation somewhat so that it ends up with an error instead of getting
hanged?

In this case, the tuple with a = 1, the root of the HOT chain, was
killed, and the tuple with a = 2 was heap-only tuple and HOT-updated.
In heap_page_prune(), we normally can prune the tuple with a = 2 as
part of pruning its chain, but since the root tuple was already killed
we could not prune this tuple. Then, we ended up retrying
heap_page_prune() since we saw as if the tuple became dead since
heap_page_prune() looked. Normally retrying heap_page_prune() works
but in this case since we didn't have the root tuple it misses again,
and gets hanged after all. I think that we didn't have this hang
before 8523492d4e3 even in the same corruption case. One idea is to
improve this situation is that we have a sanity check that we have
retired due to the same tuple.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: BUG #17741: vacuum process hangs after pg_surgery manipulations

From

Alvaro Herrera

Date:

29 January 2023, 19:24:21

On 2023-Jan-18, Masahiko Sawada wrote:

> While this is completely true and I agree, can we improve this
> situation somewhat so that it ends up with an error instead of getting
> hanged?

Well, I don't know.  I think in this case we would have to look at a
patch that claimed to change the behavior, so that we can determine
whether it's likely to break something else.

> In this case, the tuple with a = 1, the root of the HOT chain, was
> killed, and the tuple with a = 2 was heap-only tuple and HOT-updated.
> In heap_page_prune(), we normally can prune the tuple with a = 2 as
> part of pruning its chain, but since the root tuple was already killed
> we could not prune this tuple. Then, we ended up retrying
> heap_page_prune() since we saw as if the tuple became dead since
> heap_page_prune() looked.

My intuition for attacking this, is that we should definitely strive to
change the behavior if the pattern of corruption is something that is
seen to appear with some frequency.  If it only happens because somebody
was careless while running pg_surgery, then let's just leave it to her
to complete the surgery.  But if some unknown server bug causes it and
we have a lot of people with vacuum hanging because of it, then I agree
with might want to look for alternatives.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/