Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae - Mailing list pgsql-bugs
From: Melanie Plageman <melanieplageman@gmail.com>
Subject: Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae
Msg-id: CAAKRu_aUJ-yWy67QnytdNmvKjsAyZtVgnQpsdi-3hk_SbTL3Mg@mail.gmail.com
In response to: Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae (Melanie Plageman <melanieplageman@gmail.com>)
Responses: Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae
List: pgsql-bugs
On Fri, Apr 26, 2024 at 4:46 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Mon, Apr 15, 2024 at 2:52 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2024-03-29 12:05:31 -0400, Robert Haas wrote:
> > > Perhaps, for some reason I'm not grokking at the moment, preventing
> > > maybe_needed from retreating would be good enough to prevent trouble
> > > in practice, but it doesn't appear to me to be the most principled
> > > solution as of this writing.
> >
> > If we prevent maybe_needed from retreating below the OldestXmin value
> > used for cutoff, then pruning should never be less aggressive than
> > what's needed by lazy_scan_prune(). So lazy_scan_prune() shouldn't be
> > able to encounter tuples that weren't considered removable in
> > heap_page_prune(), in < 17, nor would heap_page_prune_and_freeze()
> > have that issue.
> >
> > I think one fairly easy fix for this would be to pass OldestXmin to
> > heap_page_prune[_and_freeze](), and have heap_prune_satisfies_vacuum()
> > first check OldestXmin and only if that considers the tuple still
> > running, use GlobalVisTestIsRemovableXid(). That continues to allow us
> > to prune more aggressively, without any potential risk of IsRemoval
> > being too conservative.
>
> It seems to me that in all stable versions containing the retry loop
> (from 8523492d4e34), the following can theoretically happen, causing
> an infinite loop in lazy_scan_prune():
>
> 1) tuple's xmin and xmax are older than VacuumCutoffs->OldestXmin
> 2) heap_page_prune() -> heap_prune_satisfies_vacuum() ->
>    HeapTupleSatisfiesVacuumHorizon() returns HEAPTUPLE_RECENTLY_DEAD
> 3) in GlobalVisTestIsRemovableXid(), dead_after is between
>    GlobalVisState->maybe_needed and definitely_needed, causing
>    GlobalVisState to be recomputed
> 4) GlobalVisState->maybe_needed moves backwards
> 5) tuple is not removable because dead_after is now newer than
>    maybe_needed
> 6) in lazy_scan_prune(), HeapTupleSatisfiesVacuum() returns
>    HEAPTUPLE_DEAD because dead_after is older than OldestXmin
> 7) infinite loop
>
> One way to fix this (as mentioned earlier in this thread) is to
> backport 1ccc1e05ae because it eliminates the retry loop. In our above
> example, lazy_scan_prune() reuses the tuple's HEAPTUPLE_RECENTLY_DEAD
> HTSV_Result instead of recomputing it. We'd have to rule out any of
> the data corruption claims about that commit to justify this, but I am
> not yet convinced that 1ccc1e05ae could cause any problems with
> relfrozenxid advancement.
>
> Andres' idea of passing VacuumCutoffs->OldestXmin to heap_page_prune()
> makes sense. We could add a member to PruneState, oldest_xmin, and
> initialize it to InvalidTransactionId for on-access pruning and to
> VacuumCutoffs->OldestXmin for vacuum. Then, in
> heap_prune_satisfies_vacuum(), we could add this if branch:
>
>     if (TransactionIdPrecedes(dead_after, prstate->oldest_xmin))
>
> to here:
>
> -   if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
> +
> +   if (TransactionIdPrecedes(dead_after, prstate->oldest_xmin))
> +       res = HEAPTUPLE_DEAD;
> +   else if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
>         res = HEAPTUPLE_DEAD;
>     else if (OldSnapshotThresholdActive())
>
> This seems like a more targeted fix than backpatching 1ccc1e05ae.
>
> I plan to attempt to write a test case that repros this scenario
> (infinite loop on stable branch) next week.
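For context, a rough sketch of what heap_prune_satisfies_vacuum() might
look like with the quoted branch applied. This is an illustration, not a
tested patch: the oldest_xmin member and the explicit
TransactionIdIsValid() guard are assumptions about the final shape, and
the old_snapshot_threshold branch present on the back branches is
elided.

/*
 * Sketch only -- follows the diff quoted above.  Assumes PruneState has
 * gained an oldest_xmin member, set to VacuumCutoffs->OldestXmin by
 * VACUUM and to InvalidTransactionId for on-access pruning.
 */
static HTSV_Result
heap_prune_satisfies_vacuum(PruneState *prstate, HeapTuple tup, Buffer buffer)
{
    HTSV_Result res;
    TransactionId dead_after;

    res = HeapTupleSatisfiesVacuumHorizon(tup, buffer, &dead_after);

    if (res != HEAPTUPLE_RECENTLY_DEAD)
        return res;

    /*
     * Check the OldestXmin horizon VACUUM already computed before
     * consulting GlobalVisState.  This keeps pruning at least as
     * aggressive as lazy_scan_prune()'s own HeapTupleSatisfiesVacuum()
     * calls, even if GlobalVisState->maybe_needed later retreats below
     * OldestXmin.  The guard makes the on-access case (oldest_xmin ==
     * InvalidTransactionId) explicit.
     */
    if (TransactionIdIsValid(prstate->oldest_xmin) &&
        TransactionIdPrecedes(dead_after, prstate->oldest_xmin))
        res = HEAPTUPLE_DEAD;
    else if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
        res = HEAPTUPLE_DEAD;
    /* (back branches also consult old_snapshot_threshold here) */

    return res;
}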
I can repro the hang on 14 and 15 with the following. I start a primary
with the following options:

    wal_level = replica
    hot_standby_feedback = on
    primary_conninfo = 'host=localhost port=6432 dbname=postgres'

then take a basebackup and start the replica. Then I connect with psql
five times (sessions A, B, and C on the primary and sessions A and B on
the replica) and do the following steps.

step 1: PRIMARY SESSION A:
--
CREATE TABLE foo(a INT) WITH (autovacuum_enabled=false);
CREATE INDEX foo_idx ON foo(a);
INSERT INTO foo VALUES (1);
UPDATE foo SET a = 100 WHERE a = 1;
DROP TABLESPACE IF EXISTS foo_tablespace;
CREATE TABLESPACE foo_tablespace LOCATION 'somelocation';

step 2: REPLICA SESSION A:
--
ALTER SYSTEM SET primary_conninfo = 'host=localhost port=9999 dbname=postgres';
SELECT pg_reload_conf();
SELECT pg_terminate_backend((select pid from pg_stat_activity where backend_type = 'walreceiver'));
BEGIN;
DECLARE c1 CURSOR FOR SELECT * FROM foo;
FETCH FORWARD FROM c1;

step 3: PRIMARY SESSION A:
--
INSERT INTO foo VALUES (99);
UPDATE foo SET a = 100 WHERE a = 99;

step 4: PRIMARY SESSION B:
--
BEGIN;
DECLARE c1 CURSOR FOR SELECT * FROM foo;
FETCH FORWARD FROM c1;

step 5: PRIMARY SESSION C:
--
BEGIN;
ALTER INDEX foo_idx SET TABLESPACE foo_tablespace;

step 6: PRIMARY SESSION A:
--
VACUUM (FREEZE) foo;

step 7: PRIMARY SESSION B:
--
COMMIT;

step 8: REPLICA SESSION B:
--
ALTER SYSTEM SET primary_conninfo = 'host=localhost port=6432 dbname=postgres';
SELECT pg_reload_conf();

step 9: PRIMARY SESSION C:
--
COMMIT;

step 10: REPLICA SESSION A:
--
COMMIT;

This infinitely loops in lazy_scan_prune() on 14 and 15. On 16 and
master, everything is normal (for this specific scenario). The steps
roughly follow what I wrote in my previous email.

The difference between 16/master and 14/15 is that in 14/15,
vacuum_set_xid_limits() is called before opening indexes. This allows
an intervening relcache invalidation, caused by modifying an index on
the table being vacuumed, to force us to rebuild the catalog snapshot
between vacuum_set_xid_limits() and eventually pruning the tuples.
Rebuilding the catalog snapshot will update RecentXmin to a potentially
different value than ComputeXidHorizonsResultLastXmin
(ComputeXidHorizonsResultLastXmin is set to RecentXmin during
vacuum_set_xid_limits()). This provides the opportunity for
GlobalVisTestShouldUpdate() to return true while pruning RECENTLY_DEAD
tuples. See this line in GlobalVisTestShouldUpdate():

    return RecentXmin != ComputeXidHorizonsResultLastXmin;

On 16 and master, AFAICT, RecentXmin will always equal
ComputeXidHorizonsResultLastXmin in GlobalVisTestShouldUpdate() when
called through heap_prune_satisfies_vacuum(). So, even if a walsender
pushes the horizon backwards on the primary, vacuum's
GlobalVisState->maybe_needed won't be pushed back to a value lower than
OldestXmin. (The standalone sketch below models the livelock that
results when it is.)

I haven't yet written and tested the fix. I want to investigate a bit
more and write more detailed notes on the repro steps. I also want to
fully convince myself this isn't possible on master or 16.

- Melanie
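To make that failure mode concrete, here is a tiny standalone C model
of the two disagreeing visibility checks. This is illustration only,
not PostgreSQL code: the type, the precedes() helper, and all XID
values are invented, and real XID comparisons handle wraparound, which
is ignored here.

/*
 * Model of the livelock: heap_page_prune() decides removability
 * against GlobalVisState->maybe_needed, while lazy_scan_prune()
 * re-checks against the OldestXmin captured earlier by
 * vacuum_set_xid_limits().  If maybe_needed retreats below OldestXmin,
 * the two checks disagree on every pass.
 */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TransactionId;

/* Simplified: true if xid precedes horizon (ignores wraparound). */
static bool
precedes(TransactionId xid, TransactionId horizon)
{
    return xid < horizon;
}

int
main(void)
{
    TransactionId oldest_xmin = 1000;   /* from vacuum_set_xid_limits() */
    TransactionId maybe_needed = 990;   /* retreated below oldest_xmin */
    TransactionId dead_after = 995;     /* the tuple's xmax */
    int attempts = 0;

    while (attempts < 5)                /* the real loop has no cap */
    {
        attempts++;

        /* heap_page_prune(): RECENTLY_DEAD, so the tuple stays */
        bool pruned = precedes(dead_after, maybe_needed);

        /* lazy_scan_prune(): HEAPTUPLE_DEAD, so it demands a retry */
        bool dead = precedes(dead_after, oldest_xmin);

        printf("attempt %d: pruned=%d dead=%d\n", attempts, pruned, dead);

        if (!dead || pruned)
            break;                      /* never reached here */
        /* goto retry; */
    }
    return 0;
}

With the fix discussed upthread, heap_prune_satisfies_vacuum() would
test dead_after against oldest_xmin before consulting GlobalVisState,
so the tuple would be pruned on the first pass and the retry would
never trigger.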