Re: recovering from "found xmin ... from before relfrozenxid ..." - Mailing list pgsql-hackers

From Tom Lane
Subject Re: recovering from "found xmin ... from before relfrozenxid ..."
Date
Msg-id 849990.1600621996@sss.pgh.pa.us
In response to Re: recovering from "found xmin ... from before relfrozenxid ..."  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: recovering from "found xmin ... from before relfrozenxid ..."  (Ashutosh Sharma <ashu.coek88@gmail.com>)
Re: recovering from "found xmin ... from before relfrozenxid ..."  (Amit Kapila <amit.kapila16@gmail.com>)
Re: recovering from "found xmin ... from before relfrozenxid ..."  (Robert Haas <robertmhaas@gmail.com>)
Re: recovering from "found xmin ... from before relfrozenxid ..."  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
I wrote:
> So this confirms the suspicion that the cause of the buildfarm
> failures is a concurrently-open transaction, presumably from
> autovacuum.  I don't have time to poke further right now.

I spent some more time analyzing this, and there seem to be two distinct
issues:

1. My patch a7212be8b does indeed have a problem.  It will allow
vacuum_set_xid_limits to compute freezeLimit = nextXid for a temp
table if freeze_min_age is zero (ie VACUUM FREEZE).  If there's
there are any concurrent transactions, this falls foul of
heap_prepare_freeze_tuple's expectation that

 * NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
 * XID older than it could neither be running nor seen as running by any
 * open transaction.  This ensures that the replacement will not change
 * anyone's idea of the tuple state.

The "cannot freeze committed xmax" error message appears to be banking on
the assumption that we'd not reach heap_prepare_freeze_tuple for any
committed-dead tuple unless its xmax is past the specified cutoff_xid.

2. As Amit suspected, there's an inconsistency between pruneheap.c's
rules for which tuples are removable and vacuum.c's rules for that.
This seems like a massive bug in its own right: what's the point of
pruneheap.c going to huge effort to decide whether it should keep a
tuple if vacuum will then kill it anyway?  I do not understand why
whoever put in the GlobalVisState stuff only applied it in pruneheap.c
and not VACUUM proper.

These two points interact, in that we don't get to the "cannot freeze"
failure except when pruneheap has decided not to remove something that
is removable according to VACUUM's rules.  (VACUUM doesn't actually
remove it, because lazy_scan_heap won't try to remove HeapOnly tuples
even when it thinks they're HEAPTUPLE_DEAD; but then it tries to freeze
the tuple, and heap_prepare_freeze_tuple spits up.)  However, if I revert
a7212be8b then the pg_surgery test still fails in the presence of a
concurrent transaction (both of the expected "skipping TID" notices
disappear).  So reverting that patch wouldn't get us out of trouble.

I think to move forward, we need to figure out what the freezing
behavior ought to be for temp tables.  We could make it the same
as it was before a7212be8b, which'd just require some more complexity
in vacuum_set_xid_limits.  However, that negates the idea that we'd
like VACUUM's behavior on a temp table to be fully independent of
whether concurrent transactions exist.  I'd prefer to allow a7212be8b's
behavior to stand, but then it seems we need to lobotomize the error
check in heap_prepare_freeze_tuple to some extent.

Independently of that, it seems like we need to fix things so that
when pruneheap.c is called by vacuum, it makes EXACTLY the same
dead-or-not-dead decisions that the main vacuum code makes.  This
business with applying some GlobalVisState rule or other instead
seems just as unsafe as can be.

AFAICS, there is no chance of the existing pg_surgery regression test
being fully stable if we don't fix both things.

BTW, attached is a quick-hack patch to allow automated testing
of this scenario, along the lines I sketched yesterday.  This
test passes if you run the two scripts serially, but not when
you run them in parallel.  I'm not proposing this for commit;
it's a hack and its timing behavior is probably not stable enough
for the buildfarm.  But it's pretty useful for poking at these
issues.

            regards, tom lane

diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
index a66776c4c4..a7bcfca0fb 100644
--- a/contrib/pg_surgery/Makefile
+++ b/contrib/pg_surgery/Makefile
@@ -9,7 +9,7 @@ EXTENSION = pg_surgery
 DATA = pg_surgery--1.0.sql
 PGFILEDESC = "pg_surgery - perform surgery on a damaged relation"

-REGRESS = heap_surgery
+REGRESS = --schedule=surgery_schedule

 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/pg_surgery/expected/heap_surgery.out b/contrib/pg_surgery/expected/heap_surgery.out
index d4a757ffa0..6ef5597e7c 100644
--- a/contrib/pg_surgery/expected/heap_surgery.out
+++ b/contrib/pg_surgery/expected/heap_surgery.out
@@ -1,4 +1,10 @@
 create extension pg_surgery;
+select pg_sleep(0.1);  -- ensure concurrent transaction is ready
+ pg_sleep
+----------
+
+(1 row)
+
 -- create a normal heap table and insert some rows.
 -- use a temp table so that vacuum behavior doesn't depend on global xmin
 create temp table htab (a int);
diff --git a/contrib/pg_surgery/expected/sleep.out b/contrib/pg_surgery/expected/sleep.out
new file mode 100644
index 0000000000..208e31163d
--- /dev/null
+++ b/contrib/pg_surgery/expected/sleep.out
@@ -0,0 +1,6 @@
+select pg_sleep(1.0);  -- long enough for whole heap_surgery test
+ pg_sleep
+----------
+
+(1 row)
+
diff --git a/contrib/pg_surgery/sql/heap_surgery.sql b/contrib/pg_surgery/sql/heap_surgery.sql
index 6526b27535..212c657f3c 100644
--- a/contrib/pg_surgery/sql/heap_surgery.sql
+++ b/contrib/pg_surgery/sql/heap_surgery.sql
@@ -1,5 +1,7 @@
 create extension pg_surgery;

+select pg_sleep(0.1);  -- ensure concurrent transaction is ready
+
 -- create a normal heap table and insert some rows.
 -- use a temp table so that vacuum behavior doesn't depend on global xmin
 create temp table htab (a int);
diff --git a/contrib/pg_surgery/sql/sleep.sql b/contrib/pg_surgery/sql/sleep.sql
new file mode 100644
index 0000000000..a85c25bca2
--- /dev/null
+++ b/contrib/pg_surgery/sql/sleep.sql
@@ -0,0 +1 @@
+select pg_sleep(1.0);  -- long enough for whole heap_surgery test
diff --git a/contrib/pg_surgery/surgery_schedule b/contrib/pg_surgery/surgery_schedule
new file mode 100644
index 0000000000..1bb3aa5e53
--- /dev/null
+++ b/contrib/pg_surgery/surgery_schedule
@@ -0,0 +1 @@
+test: heap_surgery sleep
