CacheInvalidateRelcache in btree is a crummy idea - Mailing list pgsql-hackers

From Tom Lane
Subject CacheInvalidateRelcache in btree is a crummy idea
Date
Msg-id 28062.1391377520@sss.pgh.pa.us
Whole thread Raw
Responses Re: CacheInvalidateRelcache in btree is a crummy idea  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
While investigating the assertion failure Noah presented at
http://www.postgresql.org/message-id/20130805170931.GA369289@tornado.leadboat.com
I was confused for a bit as to why we're getting a relcache rebuild
request for t1's index at exit of the subtransaction, when the
subtransaction hasn't made any change to the index's schema.
Investigation showed that the reason is that _bt_getroot() creates
a root page for the index during the subtransaction's INSERT, and
then it does CacheInvalidateRelcache() to notify other backends
that it's updated the index's metapage.

Now, this is a dumb idea, because relcache invalidation is assumed
to be a transactional operation, which means that if the subtransaction
rolls back, we don't send any invalidation event to other backends.
Update of a btree metapage, however, is not a transactional thing.
So if we actually *needed* that notification to be sent, this would be
a bug.

But we don't need that.  _bt_getroot() is the primary user of the
cached metapage, and it sufficiently verifies that the page it's
reached is the right one, which is why we've not seen bug reports.
_bt_getrootheight() also uses the cached data, but that's only for a
statistical estimate, so it doesn't matter if it's a bit stale.

Moreover, if we did need reliable transmission of the invalidation
message, that would've gotten broken even more by commit 96675bff1, which
suppressed doing that when InRecovery.  (In fairness, it wasn't a problem
at the time since that was pre-Hot Standby.  But the lack of notification
would be an issue now for hot standby sessions, if they needed one.)

So I'm thinking my commit d2896a9ed, which introduced this mechanism,
was poorly thought out and we should just remove the relcache invals
as per the attached patch.  Letting _bt_getroot() update the cached
metapage at next use should be a lot cheaper than a full relcache
rebuild for the index.

Note: this change will mean that Noah's test case no longer triggers
the problem discussed in that thread ... but I'm sure we can find
another test case for that.

            regards, tom lane

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 40f09e3..bf5bc38 100644
*** a/src/backend/access/nbtree/README
--- b/src/backend/access/nbtree/README
*************** location of the root page --- both the t
*** 438,451 ****
  root ("fast" root).  To avoid fetching the metapage for every single index
  search, we cache a copy of the meta-data information in the index's
  relcache entry (rd_amcache).  This is a bit ticklish since using the cache
! implies following a root page pointer that could be stale.  We require
! every metapage update to send out a SI "relcache inval" message on the
! index relation.  That ensures that each backend will flush its cached copy
! not later than the start of its next transaction.  Therefore, stale
! pointers cannot be used for longer than the current transaction, which
! reduces the problem to the same one already dealt with for concurrent
! VACUUM --- we can just imagine that each open transaction is potentially
! "already in flight" to the old root.

  The algorithm assumes we can fit at least three items per page
  (a "high key" and two real data items).  Therefore it's unsafe
--- 438,454 ----
  root ("fast" root).  To avoid fetching the metapage for every single index
  search, we cache a copy of the meta-data information in the index's
  relcache entry (rd_amcache).  This is a bit ticklish since using the cache
! implies following a root page pointer that could be stale.  However, a
! backend following a cached pointer can sufficiently verify whether it
! reached the intended page; either by checking the is-root flag when it
! is going to the true root, or by checking that the page has no siblings
! when going to the fast root.  At worst, this could result in descending
! some extra tree levels if we have a cached pointer to a fast root that is
! now above the real fast root.  Such cases shouldn't arise often enough to
! be worth optimizing; and in any case we can expect a relcache flush will
! discard the cached metapage before long, since a VACUUM that's moved the
! fast root pointer can be expected to issue a statistics update for the
! index.

  The algorithm assumes we can fit at least three items per page
  (a "high key" and two real data items).  Therefore it's unsafe
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 49c084a..9140749 100644
*** a/src/backend/access/nbtree/nbtinsert.c
--- b/src/backend/access/nbtree/nbtinsert.c
***************
*** 21,27 ****
  #include "miscadmin.h"
  #include "storage/lmgr.h"
  #include "storage/predicate.h"
- #include "utils/inval.h"
  #include "utils/tqual.h"


--- 21,26 ----
*************** _bt_insertonpg(Relation rel,
*** 868,880 ****

          END_CRIT_SECTION();

!         /* release buffers; send out relcache inval if metapage changed */
          if (BufferIsValid(metabuf))
-         {
-             if (!InRecovery)
-                 CacheInvalidateRelcache(rel);
              _bt_relbuf(rel, metabuf);
-         }

          _bt_relbuf(rel, buf);
      }
--- 867,875 ----

          END_CRIT_SECTION();

!         /* release buffers */
          if (BufferIsValid(metabuf))
              _bt_relbuf(rel, metabuf);

          _bt_relbuf(rel, buf);
      }
*************** _bt_newroot(Relation rel, Buffer lbuf, B
*** 1963,1972 ****

      END_CRIT_SECTION();

-     /* send out relcache inval for metapage change */
-     if (!InRecovery)
-         CacheInvalidateRelcache(rel);
-
      /* done with metapage */
      _bt_relbuf(rel, metabuf);

--- 1958,1963 ----
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index c7c65a6..1e78858 100644
*** a/src/backend/access/nbtree/nbtpage.c
--- b/src/backend/access/nbtree/nbtpage.c
***************
*** 28,34 ****
  #include "storage/indexfsm.h"
  #include "storage/lmgr.h"
  #include "storage/predicate.h"
- #include "utils/inval.h"
  #include "utils/snapmgr.h"


--- 28,33 ----
*************** _bt_getroot(Relation rel, int access)
*** 247,258 ****
          END_CRIT_SECTION();

          /*
-          * Send out relcache inval for metapage change (probably unnecessary
-          * here, but let's be safe).
-          */
-         CacheInvalidateRelcache(rel);
-
-         /*
           * swap root write lock for read lock.    There is no danger of anyone
           * else accessing the new root page while it's unlocked, since no one
           * else knows where it is yet.
--- 246,251 ----
*************** _bt_pagedel(Relation rel, Buffer buf, BT
*** 1545,1556 ****

      END_CRIT_SECTION();

!     /* release metapage; send out relcache inval if metapage changed */
      if (BufferIsValid(metabuf))
-     {
-         CacheInvalidateRelcache(rel);
          _bt_relbuf(rel, metabuf);
!     }
      /* can always release leftsib immediately */
      if (BufferIsValid(lbuf))
          _bt_relbuf(rel, lbuf);
--- 1538,1547 ----

      END_CRIT_SECTION();

!     /* release metapage */
      if (BufferIsValid(metabuf))
          _bt_relbuf(rel, metabuf);
!
      /* can always release leftsib immediately */
      if (BufferIsValid(lbuf))
          _bt_relbuf(rel, lbuf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b7f9e2e..1da9548 100644
*** a/src/backend/access/nbtree/nbtree.c
--- b/src/backend/access/nbtree/nbtree.c
*************** btbuild(PG_FUNCTION_ARGS)
*** 149,162 ****
  #endif   /* BTREE_BUILD_STATS */

      /*
-      * If we are reindexing a pre-existing index, it is critical to send out a
-      * relcache invalidation SI message to ensure all backends re-read the
-      * index metapage.    We expect that the caller will ensure that happens
-      * (typically as a side effect of updating index stats, but it must happen
-      * even if the stats don't change!)
-      */
-
-     /*
       * Return statistics
       */
      result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
--- 149,154 ----

pgsql-hackers by date:

Previous
From: Gavin Flower
Date:
Subject: Re: [COMMITTERS] pgsql: Include planning time in EXPLAIN ANALYZE output.
Next
From: Tom Lane
Date:
Subject: Re: [COMMITTERS] pgsql: Include planning time in EXPLAIN ANALYZE output.