
From Melanie Plageman
Subject Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae
Msg-id CAAKRu_aUJ-yWy67QnytdNmvKjsAyZtVgnQpsdi-3hk_SbTL3Mg@mail.gmail.com
In response to Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae  (Melanie Plageman <melanieplageman@gmail.com>)
List pgsql-bugs
On Fri, Apr 26, 2024 at 4:46 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Mon, Apr 15, 2024 at 2:52 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2024-03-29 12:05:31 -0400, Robert Haas wrote:
> > > Perhaps, for some reason I'm not grokking at the moment, preventing
> > > maybe_needed from retreating would be good enough to prevent trouble
> > > in practice, but it doesn't appear to me to be the most principled
> > > solution as of this writing.
> >
> > If we prevent maybe_needed from retreating below the OldestXmin value used for
> > cutoff, then pruning should never be less aggressive than what's needed by
> > lazy_scan_prune(). So lazy_scan_prune() shouldn't be able to encounter tuples
> > that weren't considered removable in heap_page_prune(), in < 17, nor would
> > heap_page_prune_and_freeze() have that issue.
> >
> > I think one fairly easy fix for this would be to pass OldestXmin to
> > heap_page_prune[_and_freeze](), and have heap_prune_satisfies_vacuum() first
> > check OldestXmin and only if that considers the tuple still running, use
> > GlobalVisTestIsRemovableXid(). That continues to allow us to prune more
> > aggressively, without any potential risk of IsRemoval being too conservative.
>
> It seems to me that in all stable versions containing the retry loop
> (from 8523492d4e34), the following can theoretically happen, causing
> an infinite loop in lazy_scan_prune():
>
> 1) tuple's xmin and xmax are older than VacuumCutoffs->OldestXmin
> 2) heap_page_prune()-> heap_prune_satisfies_vacuum()->
> HeapTupleSatisfiesVacuumHorizon() returns HEAPTUPLE_RECENTLY_DEAD
> 3) in GlobalVisTestIsRemovableXid(), dead_after is between
> GlobalVisState->maybe_needed and definitely_needed, causing
> GlobalVisState to be recomputed
> 4) GlobalVisState->maybe_needed moves backwards
> 5) tuple is not removable because dead_after is now newer than maybe_needed
> 6) in lazy_scan_prune(), HeapTupleSatisfiesVacuum() returns
> HEAPTUPLE_DEAD because dead_after is older than OldestXmin
> 7) infinite loop
>
> One way to fix this (as mentioned earlier in this thread) is to
> backport 1ccc1e05ae because it eliminates the retry loop. In our above
> example, lazy_scan_prune() reuses the tuple's HEAPTUPLE_RECENTLY_DEAD
> HTSV_Result instead of recomputing it. We'd have to rule out any of
> the data corruption claims about that commit to justify this, but I am
> not yet convinced that 1ccc1e05ae could cause any problems with
> relfrozenxid advancement.
>
> Andres' idea of passing VacuumCutoffs->OldestXmin to heap_page_prune()
> makes sense. We could add a member to PruneState, oldest_xmin, and
> initialize it to InvalidTransactionId for on-access pruning and to
> VacuumCutoffs->OldestXmin for vacuum. Then, in
> heap_prune_satisfies_vacuum(), we could add this if branch:
>
> if (TransactionIdPrecedes(dead_after, prstate->oldest_xmin))
>
> to here:
>
> -   if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
> +
> +   if (TransactionIdPrecedes(dead_after, prstate->oldest_xmin))
> +       res = HEAPTUPLE_DEAD;
> +   else if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
>         res = HEAPTUPLE_DEAD;
>     else if (OldSnapshotThresholdActive())
>
> This seems like a more targeted fix than backpatching 1ccc1e05ae.
>
> I plan to attempt to write a test case that repros this scenario
> (infinite loop on stable branch) next week.

I can repro the hang on 14 and 15 with the following:

I start a primary with the following options:

wal_level = replica
hot_standby_feedback = on
primary_conninfo = 'host=localhost port=6432 dbname=postgres'

then take a basebackup and start the replica.

Then I connect with psql five times (sessions A, B, and C on the
primary and sessions A and B on the replica) and run the following
steps.

step 1:
  PRIMARY SESSION A:
  --
  CREATE TABLE foo(a INT) WITH (autovacuum_enabled=false);
  CREATE INDEX foo_idx ON foo(a);
  INSERT INTO foo VALUES (1);
  UPDATE foo SET a = 100 WHERE a = 1;

  DROP TABLESPACE IF EXISTS foo_tablespace;
  CREATE TABLESPACE foo_tablespace LOCATION 'somelocation';

step 2:
  REPLICA SESSION A:
  --
  ALTER SYSTEM SET primary_conninfo = 'host=localhost port=9999 dbname=postgres';
  SELECT pg_reload_conf();
  SELECT pg_terminate_backend(
    (SELECT pid FROM pg_stat_activity WHERE backend_type = 'walreceiver'));
  BEGIN;
    DECLARE c1 CURSOR FOR SELECT * FROM foo;
    FETCH FORWARD FROM c1;

step 3:
  PRIMARY SESSION A:
  --
  INSERT INTO foo VALUES (99);
  UPDATE foo SET a = 100 WHERE a = 99;

step 4:
  PRIMARY SESSION B:
  --
  BEGIN;
    DECLARE c1 CURSOR FOR SELECT * FROM foo;
    FETCH FORWARD FROM c1;

step 5:
  PRIMARY SESSION C:
  --
  BEGIN;
    ALTER INDEX foo_idx SET TABLESPACE foo_tablespace;

step 6:
  PRIMARY SESSION A:
  --
  VACUUM (FREEZE) foo;

step 7:
  PRIMARY SESSION B:
  --
  COMMIT;

step 8:
  REPLICA SESSION B:
  --
  ALTER SYSTEM SET primary_conninfo = 'host=localhost port=6432 dbname=postgres';
  SELECT pg_reload_conf();

step 9:
  PRIMARY SESSION C:
  --
  COMMIT;

step 10:
  REPLICA SESSION A:
  --
  COMMIT;

This infinitely loops in lazy_scan_prune() on 14 and 15. On 16 and
master, everything is normal (for this specific scenario).

The steps roughly follow what I wrote in my previous email.

The difference between 16/master and 14/15 is that in 14/15,
vacuum_set_xid_limits() is called before the table's indexes are
opened. A relcache invalidation caused by modifying an index on the
table being vacuumed (here, the ALTER INDEX ... SET TABLESPACE) can
therefore be processed between vacuum_set_xid_limits() and pruning,
forcing the catalog snapshot to be rebuilt. Rebuilding the catalog
snapshot updates RecentXmin to a potentially different value than
ComputeXidHorizonsResultLastXmin (ComputeXidHorizonsResultLastXmin is
set to RecentXmin during vacuum_set_xid_limits()). This gives
GlobalVisTestShouldUpdate() the opportunity to return true while
pruning RECENTLY_DEAD tuples.

See this line in GlobalVisTestShouldUpdate():
    return RecentXmin != ComputeXidHorizonsResultLastXmin;
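
Putting that together, the problematic ordering on 14/15 is roughly
the following (a simplified call sequence; some intermediate frames
omitted):

    heap_vacuum_rel()
        vacuum_set_xid_limits()
            -- computes OldestXmin; ComputeXidHorizonsResultLastXmin
            -- ends up equal to the then-current RecentXmin
        vac_open_indexes()
            -- processes the relcache invalidation from the concurrent
            -- ALTER INDEX, so the catalog snapshot is rebuilt and
            -- RecentXmin moves
        lazy_scan_heap() -> lazy_scan_prune() -> heap_page_prune()
            heap_prune_satisfies_vacuum()
                GlobalVisTestIsRemovableXid()
                    -- GlobalVisTestShouldUpdate() sees RecentXmin !=
                    -- ComputeXidHorizonsResultLastXmin, recomputes
                    -- GlobalVisState, and maybe_needed can retreat
                    -- below OldestXmin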

On 16 and master, AFAICT, RecentXmin will always equal
ComputeXidHorizonsResultLastXmin in GlobalVisTestShouldUpdate() when
called through heap_prune_satisfies_vacuum(). So, even if a wal sender
pushes the horizon backwards on the primary, vacuum's
GlobalVisState->maybe_needed won't be pushed back to a value lower
than OldestXmin.

I haven't written or tested the fix yet. I want to investigate a bit
more and write more detailed notes on the repro steps. I also want to
fully convince myself this isn't possible on master or 16.
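
For concreteness, though, the change I have in mind in
heap_prune_satisfies_vacuum() would look roughly like this (an
untested sketch; oldest_xmin is the proposed PruneState member,
InvalidTransactionId for on-access pruning and
VacuumCutoffs->OldestXmin for vacuum; 14/15's old_snapshot_threshold
handling is elided):

    static HTSV_Result
    heap_prune_satisfies_vacuum(PruneState *prstate, HeapTuple tup,
                                Buffer buffer)
    {
        HTSV_Result res;
        TransactionId dead_after;

        res = HeapTupleSatisfiesVacuumHorizon(tup, buffer, &dead_after);

        if (res != HEAPTUPLE_RECENTLY_DEAD)
            return res;

        /*
         * Never consider a tuple less removable than the caller's
         * OldestXmin does, even if GlobalVisState has since been
         * recomputed with an older maybe_needed.
         */
        if (TransactionIdIsValid(prstate->oldest_xmin) &&
            TransactionIdPrecedes(dead_after, prstate->oldest_xmin))
            res = HEAPTUPLE_DEAD;
        else if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
            res = HEAPTUPLE_DEAD;

        return res;
    }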

- Melanie


