Re: prion failed with ERROR: missing chunk number 0 for toast value 14334 in pg_toast_2619 - Mailing list pgsql-hackers

From Tom Lane
Subject Re: prion failed with ERROR: missing chunk number 0 for toast value 14334 in pg_toast_2619
Date
Msg-id 2472976.1621117273@sss.pgh.pa.us
Whole thread Raw
In response to prion failed with ERROR: missing chunk number 0 for toast value 14334 in pg_toast_2619  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: prion failed with ERROR: missing chunk number 0 for toast value 14334 in pg_toast_2619
List pgsql-hackers
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> After my commit c532d15ddd to split up copy.c, buildfarm animal "prion"
> failed in pg_upgrade
> (https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2020-11-23%2009%3A23%3A16):

prion's continued to fail, rarely, in this same way:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-02-09%2009%3A13%3A16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-02-15%2004%3A08%3A06
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-05-12%2018%3A43%3A14

The failures are remarkably identical, and they also look a lot like
field reports we've been seeing off and on for years.  I do not know
why it always seems to be pg_toast_2619 (i.e. pg_statistic) that's
affected, but the pattern is pretty undeniable by now.

What I do have that's new is that *I can reproduce it*, at long last.
For me, installing the attached patch and running pg_upgrade's
"make check" fails, pretty much every time, with symptoms identical
to prion's.

The patch consists of

(1) 100ms delay just before detoasting, when loading a pg_statistic
catcache entry that has toasted datums

(2) provisions to mark such catcache entries dead immediately
(effectively, CATCACHE_FORCE_RELEASE for only these tuples);
this is to force us through (1) as often as possible

(3) some quick-hack debugging aids added to HeapTupleSatisfiesToast,
plus convert the missing-chunk error to PANIC to get a stack
trace.

If it doesn't reproduce for you, try adjusting the delay.  100ms
was the first value I tried, though, so I think it's probably
not too sensitive.

The trace I'm getting shows pretty positively that autovacuum
has fired on pg_statistic, and removed the needed toast entries,
just before the failure.  So something is wrong with our logic
about when toast entries can be removed.

I do not have a lot of idea why, but I see something that is
probably a honkin' big clue:

2021-05-15 17:28:05.965 EDT [2469444] LOG:  HeapTupleSatisfiesToast: xmin 2 t_infomask 0x0b02

That is, the toast tuples in question are not just frozen, but
actually have xmin = FrozenTransactionId.

I do not think that is normal --- at least, it's not the state
immediately after initdb, and I can't make it happen with
"vacuum freeze pg_statistic".  A plausible theory is that pg_upgrade
caused this to happen (but how?) and then vacuuming of toast rows
goes off the rails because of it.

Anyway, I have no more time to poke at this right now, so I'm
posting the reproducer in case someone else wants to look at it.

            regards, tom lane

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index d3c57cd16a..5830df83c2 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -73,11 +73,14 @@
 #include "access/xlog.h"
 #include "storage/bufmgr.h"
 #include "storage/procarray.h"
+#include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
 #include "utils/snapmgr.h"


+bool trace_toast_visibility = false;
+
 /*
  * SetHintBits()
  *
@@ -366,6 +369,13 @@ HeapTupleSatisfiesToast(HeapTuple htup, Snapshot snapshot,
     Assert(ItemPointerIsValid(&htup->t_self));
     Assert(htup->t_tableOid != InvalidOid);

+    if (trace_toast_visibility)
+        ereport(LOG,
+                errmsg("HeapTupleSatisfiesToast: xmin %u t_infomask 0x%04x",
+                       HeapTupleHeaderGetXmin(tuple),
+                       tuple->t_infomask),
+                debug_query_string ? 0 : errbacktrace());
+
     if (!HeapTupleHeaderXminCommitted(tuple))
     {
         if (HeapTupleHeaderXminInvalid(tuple))
@@ -420,6 +430,11 @@ HeapTupleSatisfiesToast(HeapTuple htup, Snapshot snapshot,
             return false;
     }

+    if (trace_toast_visibility)
+        elog(LOG, "HeapTupleSatisfiesToast: xmin %u t_infomask 0x%04x OK",
+             HeapTupleHeaderGetXmin(tuple),
+             tuple->t_infomask);
+
     /* otherwise assume the tuple is valid for TOAST. */
     return true;
 }
diff --git a/src/backend/access/heap/heaptoast.c b/src/backend/access/heap/heaptoast.c
index 55bbe1d584..03ea2dc80d 100644
--- a/src/backend/access/heap/heaptoast.c
+++ b/src/backend/access/heap/heaptoast.c
@@ -781,7 +781,7 @@ heap_fetch_toast_slice(Relation toastrel, Oid valueid, int32 attrsize,
      * Final checks that we successfully fetched the datum
      */
     if (expectedchunk != (endchunk + 1))
-        ereport(ERROR,
+        ereport(PANIC,
                 (errcode(ERRCODE_DATA_CORRUPTED),
                  errmsg_internal("missing chunk number %d for toast value %u in %s",
                                  expectedchunk, valueid,
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 4fbdc62d8c..a1f923e02c 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -23,6 +23,7 @@
 #include "access/xact.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_operator.h"
+#include "catalog/pg_statistic.h"
 #include "catalog/pg_type.h"
 #include "common/hashfn.h"
 #include "miscadmin.h"
@@ -39,6 +40,7 @@
 #include "utils/resowner_private.h"
 #include "utils/syscache.h"

+extern bool trace_toast_visibility;

  /* #define CACHEDEBUG */    /* turns DEBUG elogs on */

@@ -1810,6 +1812,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
 {
     CatCTup    *ct;
     HeapTuple    dtp;
+    bool makedead = false;
     MemoryContext oldcxt;

     /* negative entries have no tuple associated */
@@ -1827,7 +1830,16 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
          * something using a slightly stale catcache entry.
          */
         if (HeapTupleHasExternal(ntp))
+        {
+            if (cache->cc_reloid == StatisticRelationId)
+            {
+                pg_usleep(100000);
+                makedead = true;
+                trace_toast_visibility = true;
+            }
             dtp = toast_flatten_tuple(ntp, cache->cc_tupdesc);
+            trace_toast_visibility = false;
+        }
         else
             dtp = ntp;

@@ -1887,7 +1899,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->my_cache = cache;
     ct->c_list = NULL;
     ct->refcount = 0;            /* for the moment */
-    ct->dead = false;
+    ct->dead = makedead;
     ct->negative = negative;
     ct->hash_value = hashValue;


pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: compute_query_id and pg_stat_statements
Next
From: Alvaro Herrera
Date:
Subject: Re: compute_query_id and pg_stat_statements