Thread: BUG #17741: vacuum process hangs after pg_surgery manipulations
The following bug has been logged on the website: Bug reference: 17741 Logged by: Alexander Kozhemyakin Email address: a.kozhemyakin@postgrespro.ru PostgreSQL version: 15.1 Operating system: Debian 10 (x86_64) Description: On the REL_15_STABLE, you can hang vacuum freeze. Maybe this is not desired? https://www.postgresql.org/docs/current/pgsurgery.html reproduce script: create extension pg_surgery; create table t1(a int); insert into t1 values (1); update t1 set a = 2; select heap_force_kill('t1'::regclass, ARRAY['(0, 1)']::tid[]); select ctid from t1; update t1 set a = 3; select ctid from t1; vacuum freeze t1; Now we have hung vacuum process. stacktrace: #0 0x0000561b89170515 in heap_prune_chain (prstate=0x7ffd243a8fb0, rootoffnum=2, buffer=129) at pruneheap.c:615 #1 heap_page_prune (relation=relation@entry=0x7fd14fa3cba8, buffer=buffer@entry=129, vistest=<optimized out>, old_snap_xmin=old_snap_xmin@entry=0, old_snap_ts=old_snap_ts@entry=0, nnewlpdead=nnewlpdead@entry=0x7ffd243a9c04, off_loc=0x561b8ae3f87c) at pruneheap.c:376 #2 0x0000561b89172757 in lazy_scan_prune (vacrel=vacrel@entry=0x561b8ae3f7f8, buf=buf@entry=129, blkno=blkno@entry=0, page=page@entry=0x7fd1503abc00 "", prunestate=prunestate@entry=0x7ffd243aaeb0) at vacuumlazy.c:1590 #3 0x0000561b8917451b in lazy_scan_heap (vacrel=0x561b8ae3f7f8) at vacuumlazy.c:1048 #4 heap_vacuum_rel (rel=0x7fd14fa3cba8, params=0x7ffd243ab360, bstrategy=<optimized out>) at vacuumlazy.c:534 #5 0x0000561b8929d1bb in table_relation_vacuum (bstrategy=<optimized out>, params=0x7ffd243ab360, rel=0x7fd14fa3cba8) at ../../../src/include/access/tableam.h:1680 #6 vacuum_rel (relid=24576, relation=<optimized out>, params=params@entry=0x7ffd243ab360) at vacuum.c:2086 #7 0x0000561b8929e6f5 in vacuum (relations=0x561b8aee6668, params=0x7ffd243ab360, bstrategy=<optimized out>, isTopLevel=<optimized out>) at vacuum.c:475 #8 0x0000561b8929ec1f in ExecVacuum (pstate=pstate@entry=0x561b8aedbae8, vacstmt=vacstmt@entry=0x561b8ae1a268, isTopLevel=isTopLevel@entry=true) at vacuum.c:275 #9 0x0000561b8940c3d0 in standard_ProcessUtility (pstmt=0x561b8ae1a628, queryString=0x561b8ae197b8 "vacuum freeze t1;", readOnlyTree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x561b8ae1a708, qc=0x7ffd243ab6b0) at utility.c:866 #10 0x0000561b8940ab3f in PortalRunUtility (portal=portal@entry=0x561b8ae87018, pstmt=pstmt@entry=0x561b8ae1a628, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=0x561b8ae1a708, qc=0x7ffd243ab6b0) at pquery.c:1158 #11 0x0000561b8940ac73 in PortalRunMulti (portal=portal@entry=0x561b8ae87018, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x561b8ae1a708, altdest=altdest@entry=0x561b8ae1a708, qc=qc@entry=0x7ffd243ab6b0) at pquery.c:1315 #12 0x0000561b8940b17f in PortalRun (portal=portal@entry=0x561b8ae87018, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x561b8ae1a708, altdest=altdest@entry=0x561b8ae1a708, qc=0x7ffd243ab6b0) at pquery.c:791 #13 0x0000561b89407763 in exec_simple_query (query_string=0x561b8ae197b8 "vacuum freeze t1;") at postgres.c:1250 #14 0x0000561b8940840b in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4593 #15 0x0000561b8938e791 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4504 #16 BackendStartup (port=<optimized out>) at postmaster.c:4232 #17 ServerLoop () at postmaster.c:1806 #18 0x0000561b8938f718 in PostmasterMain (argc=3, argv=0x561b8ae13db0) at postmaster.c:1478 #19 0x0000561b89123f19 in main (argc=3, argv=0x561b8ae13db0) at main.c:202
On 2023-Jan-09, PG Bug reporting form wrote: > On the REL_15_STABLE, you can hang vacuum freeze. Maybe this is not > desired? > https://www.postgresql.org/docs/current/pgsurgery.html > > reproduce script: > create extension pg_surgery; Using pg_surgery is the equivalent of introducing corruption in your data. It has, of course, completely valid uses, but if you break the system while using it, it's on you to fix it. The pg_surgery documentation you cite states: : These functions are unsafe by design and using them may corrupt (or : further corrupt) your database. So, you've been warned. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "It takes less than 2 seconds to get to 78% complete; that's a good sign. A few seconds later it's at 90%, but it seems to have stuck there. Did somebody make percentages logarithmic while I wasn't looking?" http://smylers.hates-software.com/2005/09/08/1995c749.html
On Tue, Jan 17, 2023 at 12:37 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2023-Jan-09, PG Bug reporting form wrote: > > > On the REL_15_STABLE, you can hang vacuum freeze. Maybe this is not > > desired? > > https://www.postgresql.org/docs/current/pgsurgery.html > > > > reproduce script: > > create extension pg_surgery; > > Using pg_surgery is the equivalent of introducing corruption in your > data. It has, of course, completely valid uses, but if you break the > system while using it, it's on you to fix it. > > The pg_surgery documentation you cite states: > > : These functions are unsafe by design and using them may corrupt (or > : further corrupt) your database. > > So, you've been warned. While this is completely true and I agree, can we improve this situation somewhat so that it ends up with an error instead of getting hanged? In this case, the tuple with a = 1, the root of the HOT chain, was killed, and the tuple with a = 2 was heap-only tuple and HOT-updated. In heap_page_prune(), we normally can prune the tuple with a = 2 as part of pruning its chain, but since the root tuple was already killed we could not prune this tuple. Then, we ended up retrying heap_page_prune() since we saw as if the tuple became dead since heap_page_prune() looked. Normally retrying heap_page_prune() works but in this case since we didn't have the root tuple it misses again, and gets hanged after all. I think that we didn't have this hang before 8523492d4e3 even in the same corruption case. One idea is to improve this situation is that we have a sanity check that we have retired due to the same tuple. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On 2023-Jan-18, Masahiko Sawada wrote: > While this is completely true and I agree, can we improve this > situation somewhat so that it ends up with an error instead of getting > hanged? Well, I don't know. I think in this case we would have to look at a patch that claimed to change the behavior, so that we can determine whether it's likely to break something else. > In this case, the tuple with a = 1, the root of the HOT chain, was > killed, and the tuple with a = 2 was heap-only tuple and HOT-updated. > In heap_page_prune(), we normally can prune the tuple with a = 2 as > part of pruning its chain, but since the root tuple was already killed > we could not prune this tuple. Then, we ended up retrying > heap_page_prune() since we saw as if the tuple became dead since > heap_page_prune() looked. My intuition for attacking this, is that we should definitely strive to change the behavior if the pattern of corruption is something that is seen to appear with some frequency. If it only happens because somebody was careless while running pg_surgery, then let's just leave it to her to complete the surgery. But if some unknown server bug causes it and we have a lot of people with vacuum hanging because of it, then I agree with might want to look for alternatives. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/