Re: Recovering from detoast-related catcache invalidations - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Recovering from detoast-related catcache invalidations
Date
Msg-id 754826.1705090452@sss.pgh.pa.us
Whole thread Raw
In response to Re: Recovering from detoast-related catcache invalidations  (Xiaoran Wang <fanfuxiaoran@gmail.com>)
Responses Re: Recovering from detoast-related catcache invalidations
List pgsql-hackers
Xiaoran Wang <fanfuxiaoran@gmail.com> writes:
>> Maybe, but that undocumented hack in SetHintBits seems completely
>> unacceptable.  Isn't there a cleaner way to make this check?

> Maybe we don't need to call 'HeapTupleSatisfiesVisibility' to check if the
> tuple has been deleted.
> As the tuple's xmin must been committed, so we just need to check if its
> xmax is committed,

I'm not super thrilled with that.  Something I realized last night is
that your proposal only works if "ntp" is pointing directly into the
catalog's disk buffers.  If something earlier than this code had made
a local-memory copy of the catalog tuple, then it's possible that its
header fields (particularly xmax) are out of date compared to shared
buffers and would fail to tell us that some other process just
invalidated the tuple.  Now in fact, with the current implementation
of syscache_getnext() the result is actually a live tuple and so we
can expect to see any relevant updates.  But I think we'd better add
some Asserts that that's so; and that also provides us with a way to
call HeapTupleSatisfiesVisibility fully legally, because we can get
the buffer reference out of the scan descriptor too.

This is uncomfortably much in bed with the tuple table slot code,
perhaps, but I don't see a way to do it more cleanly unless we want
to add some new provisions to that API.  Andres, do you have any
thoughts about that?

Anyway, this approach gets rid of false positives, which is great
for performance and bad for testing.  Code coverage says that now
we never hit the failure paths during regression tests, which is
unsurprising, but I'm not very comfortable with leaving those paths
unexercised.  I tried to make an isolation test to exercise them,
but there's no good way at the SQL level to get a session to block
during the detoast step.  LOCK TABLE on some catalog's toast table
would do, but we disallow it.  I thought about adding a small C
function to regress.so to take out such a lock, but we have no
infrastructure for referencing regress.so from isolation tests.
What I ended up doing is adding a random failure about 0.1% of
the time in USE_ASSERT_CHECKING builds --- that's intellectually
ugly for sure, but doing better seems like way more work than
it's worth.

            regards, tom lane

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index c0591cffe5..b1ce7bf513 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -15,6 +15,7 @@
 #include "postgres.h"

 #include "access/genam.h"
+#include "access/heapam.h"
 #include "access/heaptoast.h"
 #include "access/relscan.h"
 #include "access/sysattr.h"
@@ -24,8 +25,10 @@
 #include "catalog/pg_operator.h"
 #include "catalog/pg_type.h"
 #include "common/hashfn.h"
+#include "common/pg_prng.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
+#include "storage/bufmgr.h"
 #ifdef CATCACHE_STATS
 #include "storage/ipc.h"        /* for on_proc_exit */
 #endif
@@ -90,10 +93,10 @@ static void CatCachePrintStats(int code, Datum arg);
 static void CatCacheRemoveCTup(CatCache *cache, CatCTup *ct);
 static void CatCacheRemoveCList(CatCache *cache, CatCList *cl);
 static void CatalogCacheInitializeCache(CatCache *cache);
-static CatCTup *CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp,
+static CatCTup *CatalogCacheCreateEntry(CatCache *cache,
+                                        HeapTuple ntp, SysScanDesc scandesc,
                                         Datum *arguments,
-                                        uint32 hashValue, Index hashIndex,
-                                        bool negative);
+                                        uint32 hashValue, Index hashIndex);

 static void ReleaseCatCacheWithOwner(HeapTuple tuple, ResourceOwner resowner);
 static void ReleaseCatCacheListWithOwner(CatCList *list, ResourceOwner resowner);
@@ -1372,6 +1375,7 @@ SearchCatCacheMiss(CatCache *cache,
     SysScanDesc scandesc;
     HeapTuple    ntp;
     CatCTup    *ct;
+    bool        stale;
     Datum        arguments[CATCACHE_MAXKEYS];

     /* Initialize local parameter array */
@@ -1380,16 +1384,6 @@ SearchCatCacheMiss(CatCache *cache,
     arguments[2] = v3;
     arguments[3] = v4;

-    /*
-     * Ok, need to make a lookup in the relation, copy the scankey and fill
-     * out any per-call fields.
-     */
-    memcpy(cur_skey, cache->cc_skey, sizeof(ScanKeyData) * nkeys);
-    cur_skey[0].sk_argument = v1;
-    cur_skey[1].sk_argument = v2;
-    cur_skey[2].sk_argument = v3;
-    cur_skey[3].sk_argument = v4;
-
     /*
      * Tuple was not found in cache, so we have to try to retrieve it directly
      * from the relation.  If found, we will add it to the cache; if not
@@ -1404,9 +1398,28 @@ SearchCatCacheMiss(CatCache *cache,
      * will eventually age out of the cache, so there's no functional problem.
      * This case is rare enough that it's not worth expending extra cycles to
      * detect.
+     *
+     * Another case, which we *must* handle, is that the tuple could become
+     * outdated during CatalogCacheCreateEntry's attempt to detoast it (since
+     * AcceptInvalidationMessages can run during TOAST table access).  We do
+     * not want to return already-stale catcache entries, so we loop around
+     * and do the table scan again if that happens.
      */
     relation = table_open(cache->cc_reloid, AccessShareLock);

+    do
+    {
+        /*
+         * Ok, need to make a lookup in the relation, copy the scankey and
+         * fill out any per-call fields.  (We must re-do this when retrying,
+         * because systable_beginscan scribbles on the scankey.)
+         */
+        memcpy(cur_skey, cache->cc_skey, sizeof(ScanKeyData) * nkeys);
+        cur_skey[0].sk_argument = v1;
+        cur_skey[1].sk_argument = v2;
+        cur_skey[2].sk_argument = v3;
+        cur_skey[3].sk_argument = v4;
+
     scandesc = systable_beginscan(relation,
                                   cache->cc_indexoid,
                                   IndexScanOK(cache, cur_skey),
@@ -1415,12 +1428,18 @@ SearchCatCacheMiss(CatCache *cache,
                                   cur_skey);

     ct = NULL;
+    stale = false;

     while (HeapTupleIsValid(ntp = systable_getnext(scandesc)))
     {
-        ct = CatalogCacheCreateEntry(cache, ntp, arguments,
-                                     hashValue, hashIndex,
-                                     false);
+        ct = CatalogCacheCreateEntry(cache, ntp, scandesc, NULL,
+                                     hashValue, hashIndex);
+        /* upon failure, we must start the scan over */
+        if (ct == NULL)
+        {
+            stale = true;
+            break;
+        }
         /* immediately set the refcount to 1 */
         ResourceOwnerEnlarge(CurrentResourceOwner);
         ct->refcount++;
@@ -1429,6 +1448,7 @@ SearchCatCacheMiss(CatCache *cache,
     }

     systable_endscan(scandesc);
+    } while (stale);

     table_close(relation, AccessShareLock);

@@ -1447,9 +1467,11 @@ SearchCatCacheMiss(CatCache *cache,
         if (IsBootstrapProcessingMode())
             return NULL;

-        ct = CatalogCacheCreateEntry(cache, NULL, arguments,
-                                     hashValue, hashIndex,
-                                     true);
+        ct = CatalogCacheCreateEntry(cache, NULL, NULL, arguments,
+                                     hashValue, hashIndex);
+
+        /* Creating a negative cache entry shouldn't fail */
+        Assert(ct != NULL);

         CACHE_elog(DEBUG2, "SearchCatCache(%s): Contains %d/%d tuples",
                    cache->cc_relname, cache->cc_ntup, CacheHdr->ch_ntup);
@@ -1663,7 +1685,8 @@ SearchCatCacheList(CatCache *cache,
      * We have to bump the member refcounts temporarily to ensure they won't
      * get dropped from the cache while loading other members. We use a PG_TRY
      * block to ensure we can undo those refcounts if we get an error before
-     * we finish constructing the CatCList.
+     * we finish constructing the CatCList.  ctlist must be valid throughout
+     * the PG_TRY block.
      */
     ctlist = NIL;

@@ -1672,19 +1695,23 @@ SearchCatCacheList(CatCache *cache,
         ScanKeyData cur_skey[CATCACHE_MAXKEYS];
         Relation    relation;
         SysScanDesc scandesc;
-
-        /*
-         * Ok, need to make a lookup in the relation, copy the scankey and
-         * fill out any per-call fields.
-         */
-        memcpy(cur_skey, cache->cc_skey, sizeof(ScanKeyData) * cache->cc_nkeys);
-        cur_skey[0].sk_argument = v1;
-        cur_skey[1].sk_argument = v2;
-        cur_skey[2].sk_argument = v3;
-        cur_skey[3].sk_argument = v4;
+        bool        stale;

         relation = table_open(cache->cc_reloid, AccessShareLock);

+        do
+        {
+            /*
+             * Ok, need to make a lookup in the relation, copy the scankey and
+             * fill out any per-call fields.  (We must re-do this when
+             * retrying, because systable_beginscan scribbles on the scankey.)
+             */
+            memcpy(cur_skey, cache->cc_skey, sizeof(ScanKeyData) * cache->cc_nkeys);
+            cur_skey[0].sk_argument = v1;
+            cur_skey[1].sk_argument = v2;
+            cur_skey[2].sk_argument = v3;
+            cur_skey[3].sk_argument = v4;
+
         scandesc = systable_beginscan(relation,
                                       cache->cc_indexoid,
                                       IndexScanOK(cache, cur_skey),
@@ -1695,6 +1722,8 @@ SearchCatCacheList(CatCache *cache,
         /* The list will be ordered iff we are doing an index scan */
         ordered = (scandesc->irel != NULL);

+        stale = false;
+
         while (HeapTupleIsValid(ntp = systable_getnext(scandesc)))
         {
             uint32        hashValue;
@@ -1737,9 +1766,32 @@ SearchCatCacheList(CatCache *cache,
             if (!found)
             {
                 /* We didn't find a usable entry, so make a new one */
-                ct = CatalogCacheCreateEntry(cache, ntp, arguments,
-                                             hashValue, hashIndex,
-                                             false);
+                ct = CatalogCacheCreateEntry(cache, ntp, scandesc, NULL,
+                                             hashValue, hashIndex);
+                /* upon failure, we must start the scan over */
+                if (ct == NULL)
+                {
+                    /*
+                     * Release refcounts on any items we already had.  We dare
+                     * not try to free them if they're now unreferenced, since
+                     * an error while doing that would result in the PG_CATCH
+                     * below doing extra refcount decrements.  Besides, we'll
+                     * likely re-adopt those items in the next iteration, so
+                     * it's not worth complicating matters to try to get rid
+                     * of them.
+                     */
+                    foreach(ctlist_item, ctlist)
+                    {
+                        ct = (CatCTup *) lfirst(ctlist_item);
+                        Assert(ct->c_list == NULL);
+                        Assert(ct->refcount > 0);
+                        ct->refcount--;
+                    }
+                    /* Reset ctlist in preparation for new try */
+                    ctlist = NIL;
+                    stale = true;
+                    break;
+                }
             }

             /* Careful here: add entry to ctlist, then bump its refcount */
@@ -1749,6 +1801,7 @@ SearchCatCacheList(CatCache *cache,
         }

         systable_endscan(scandesc);
+        } while (stale);

         table_close(relation, AccessShareLock);

@@ -1865,22 +1918,41 @@ ReleaseCatCacheListWithOwner(CatCList *list, ResourceOwner resowner)
  * CatalogCacheCreateEntry
  *        Create a new CatCTup entry, copying the given HeapTuple and other
  *        supplied data into it.  The new entry initially has refcount 0.
+ *
+ * To create a normal cache entry, ntp must be the HeapTuple just fetched
+ * from scandesc, and "arguments" is not used.  To create a negative cache
+ * entry, pass NULL for ntp and scandesc; then "arguments" is the cache
+ * keys to use.  In either case, hashValue/hashIndex are the hash values
+ * computed from the cache keys.
+ *
+ * Returns NULL if we attempt to detoast the tuple and observe that it
+ * became stale.  (This cannot happen for a negative entry.)  Caller must
+ * retry the tuple lookup in that case.
  */
 static CatCTup *
-CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
-                        uint32 hashValue, Index hashIndex,
-                        bool negative)
+CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, SysScanDesc scandesc,
+                        Datum *arguments,
+                        uint32 hashValue, Index hashIndex)
 {
     CatCTup    *ct;
     HeapTuple    dtp;
     MemoryContext oldcxt;

-    /* negative entries have no tuple associated */
     if (ntp)
     {
         int            i;

-        Assert(!negative);
+        /*
+         * The visibility check below essentially never fails during our
+         * regression tests, and there's no easy way to force it to fail for
+         * testing purposes.  To ensure we have test coverage for the retry
+         * paths in our callers, make debug builds randomly fail about 0.1% of
+         * the times through this code path.
+         */
+#ifdef USE_ASSERT_CHECKING
+        if (pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 1000))
+            return NULL;
+#endif

         /*
          * If there are any out-of-line toasted fields in the tuple, expand
@@ -1888,9 +1960,32 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
          * entry, and also protects us against the possibility of the toast
          * tuples being freed before we attempt to fetch them, in case of
          * something using a slightly stale catcache entry.
+         *
+         * The tuple could become stale while we are doing toast table access
+         * (since AcceptInvalidationMessages can run then), so we must recheck
+         * its visibility afterwards.  To do that, we must be undesirably
+         * familiar with the fact that systable_getnext() will return a direct
+         * pointer into a catalog disk buffer.  This code is likely to need
+         * work if we ever want to support non-heap catalogs.
          */
         if (HeapTupleHasExternal(ntp))
+        {
+            BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) scandesc->slot;
+
+            /* Assert that we have a live pointer to a pinned catalog buffer */
+            Assert(TTS_IS_BUFFERTUPLE(scandesc->slot));
+            Assert(ntp == bslot->base.tuple);
+            Assert(BufferIsValid(bslot->buffer));
+
             dtp = toast_flatten_tuple(ntp, cache->cc_tupdesc);
+            if (!HeapTupleSatisfiesVisibility(ntp,
+                                              GetCatalogSnapshot(cache->cc_reloid),
+                                              bslot->buffer))
+            {
+                heap_freetuple(dtp);
+                return NULL;
+            }
+        }
         else
             dtp = ntp;

@@ -1929,7 +2024,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     }
     else
     {
-        Assert(negative);
+        /* Set up keys for a negative cache entry */
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
         ct = (CatCTup *) palloc(sizeof(CatCTup));

@@ -1951,7 +2046,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->c_list = NULL;
     ct->refcount = 0;            /* for the moment */
     ct->dead = false;
-    ct->negative = negative;
+    ct->negative = (ntp == NULL);
     ct->hash_value = hashValue;

     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);

pgsql-hackers by date:

Previous
From: Melanie Plageman
Date:
Subject: Re: Emit fewer vacuum records by reaping removable tuples during pruning
Next
From: Robert Haas
Date:
Subject: Re: Emit fewer vacuum records by reaping removable tuples during pruning