CacheInvalidateRelcache in btree is a crummy idea

From: Tom Lane

While investigating the assertion failure Noah presented at
http://www.postgresql.org/message-id/20130805170931.GA369289@tornado.leadboat.com
I was confused for a bit as to why we're getting a relcache rebuild
request for t1's index at exit of the subtransaction, when the
subtransaction hasn't made any change to the index's schema.
Investigation showed that the reason is that _bt_getroot() creates
a root page for the index during the subtransaction's INSERT, and
then it does CacheInvalidateRelcache() to notify other backends
that it's updated the index's metapage.

Now, this is a dumb idea, because relcache invalidation is assumed
to be a transactional operation, which means that if the subtransaction
rolls back, we don't send any invalidation event to other backends.
Update of a btree metapage, however, is not a transactional thing.
So if we actually *needed* that notification to be sent, this would be
a bug.
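
In code terms, the pattern being removed is roughly this (a sketch only;
the wrapper function and its name are mine, but the calls mirror what
_bt_insertonpg() and _bt_newroot() did before the patch):

#include "postgres.h"
#include "access/nbtree.h"
#include "access/xlog.h"
#include "miscadmin.h"
#include "utils/inval.h"

/*
 * Sketch of the pre-patch pattern.  The metapage change made inside the
 * critical section is physical and takes effect immediately, but
 * CacheInvalidateRelcache() only queues a transactional event: it is
 * sent to other backends at commit, and silently discarded if the
 * (sub)transaction aborts.
 */
static void
bt_metapage_update_pattern(Relation rel, Buffer metabuf)
{
    START_CRIT_SECTION();
    /* ... modify and WAL-log the metapage here ... */
    END_CRIT_SECTION();

    /* queued transactionally; never sent if we roll back */
    if (!InRecovery)
        CacheInvalidateRelcache(rel);

    _bt_relbuf(rel, metabuf);
}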

But we don't need that.  _bt_getroot() is the primary user of the
cached metapage, and it sufficiently verifies that the page it's
reached is the right one, which is why we've not seen bug reports.
_bt_getrootheight() also uses the cached data, but that's only for a
statistical estimate, so it doesn't matter if it's a bit stale.
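
For reference, the check that makes a stale pointer safe looks roughly
like this (a simplified sketch of _bt_getroot()'s cached-metapage path,
not the actual code; the helper name is mine):

#include "postgres.h"
#include "access/nbtree.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/*
 * Simplified sketch: follow the cached fast-root pointer, but accept the
 * page only if it is still a usable (fast) root.  Otherwise drop the
 * cache and let the caller re-read the metapage.
 */
static Buffer
bt_try_cached_root(Relation rel)
{
    BTMetaPageData *metad = (BTMetaPageData *) rel->rd_amcache;
    Buffer      rootbuf;
    Page        rootpage;
    BTPageOpaque rootopaque;

    if (metad == NULL)
        return InvalidBuffer;   /* nothing cached yet */

    rootbuf = _bt_getbuf(rel, metad->btm_fastroot, BT_READ);
    rootpage = BufferGetPage(rootbuf);
    rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);

    /* must not be dead/half-dead, and must be alone on its level */
    if (!P_IGNORE(rootopaque) &&
        P_LEFTMOST(rootopaque) &&
        P_RIGHTMOST(rootopaque))
        return rootbuf;         /* cached pointer is still good */

    /* stale cache: discard it; caller falls back to the metapage */
    _bt_relbuf(rel, rootbuf);
    pfree(rel->rd_amcache);
    rel->rd_amcache = NULL;
    return InvalidBuffer;
}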

Moreover, if we did need reliable transmission of the invalidation
message, that would've gotten broken even more by commit 96675bff1, which
suppressed doing that when InRecovery.  (In fairness, it wasn't a problem
at the time since that was pre-Hot Standby.  But the lack of notification
would be an issue now for hot standby sessions, if they needed one.)

So I'm thinking my commit d2896a9ed, which introduced this mechanism,
was poorly thought out and we should just remove the relcache invals
as per the attached patch.  Letting _bt_getroot() update the cached
metapage at next use should be a lot cheaper than a full relcache
rebuild for the index.
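
Concretely, the "update at next use" amounts to re-reading one page and
re-filling rd_amcache, along these lines (a simplified sketch of what
_bt_getroot() already does on a cache miss; the helper name is mine):

#include "postgres.h"
#include "access/nbtree.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Simplified sketch: read the metapage and copy it into the index's
 * relcache entry.  This is the whole cost of letting the cache go
 * stale, versus rebuilding the entire relcache entry on an inval.
 */
static void
bt_fill_metapage_cache(Relation rel)
{
    Buffer      metabuf;
    BTMetaPageData *metad;

    metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
    metad = BTPageGetMeta(BufferGetPage(metabuf));

    if (rel->rd_amcache == NULL)
        rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
                                             sizeof(BTMetaPageData));
    memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));

    _bt_relbuf(rel, metabuf);
}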

Note: this change will mean that Noah's test case no longer triggers
the problem discussed in that thread ... but I'm sure we can find
another test case for that.

            regards, tom lane

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 40f09e3..bf5bc38 100644
*** a/src/backend/access/nbtree/README
--- b/src/backend/access/nbtree/README
*************** location of the root page --- both the t
*** 438,451 ****
  root ("fast" root).  To avoid fetching the metapage for every single index
  search, we cache a copy of the meta-data information in the index's
  relcache entry (rd_amcache).  This is a bit ticklish since using the cache
! implies following a root page pointer that could be stale.  We require
! every metapage update to send out a SI "relcache inval" message on the
! index relation.  That ensures that each backend will flush its cached copy
! not later than the start of its next transaction.  Therefore, stale
! pointers cannot be used for longer than the current transaction, which
! reduces the problem to the same one already dealt with for concurrent
! VACUUM --- we can just imagine that each open transaction is potentially
! "already in flight" to the old root.

  The algorithm assumes we can fit at least three items per page
  (a "high key" and two real data items).  Therefore it's unsafe
--- 438,454 ----
  root ("fast" root).  To avoid fetching the metapage for every single index
  search, we cache a copy of the meta-data information in the index's
  relcache entry (rd_amcache).  This is a bit ticklish since using the cache
! implies following a root page pointer that could be stale.  However, a
! backend following a cached pointer can sufficiently verify whether it
! reached the intended page; either by checking the is-root flag when it
! is going to the true root, or by checking that the page has no siblings
! when going to the fast root.  At worst, this could result in descending
! some extra tree levels if we have a cached pointer to a fast root that is
! now above the real fast root.  Such cases shouldn't arise often enough to
! be worth optimizing; and in any case we can expect a relcache flush will
! discard the cached metapage before long, since a VACUUM that's moved the
! fast root pointer can be expected to issue a statistics update for the
! index.

  The algorithm assumes we can fit at least three items per page
  (a "high key" and two real data items).  Therefore it's unsafe
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 49c084a..9140749 100644
*** a/src/backend/access/nbtree/nbtinsert.c
--- b/src/backend/access/nbtree/nbtinsert.c
***************
*** 21,27 ****
  #include "miscadmin.h"
  #include "storage/lmgr.h"
  #include "storage/predicate.h"
- #include "utils/inval.h"
  #include "utils/tqual.h"


--- 21,26 ----
*************** _bt_insertonpg(Relation rel,
*** 868,880 ****

          END_CRIT_SECTION();

!         /* release buffers; send out relcache inval if metapage changed */
          if (BufferIsValid(metabuf))
-         {
-             if (!InRecovery)
-                 CacheInvalidateRelcache(rel);
              _bt_relbuf(rel, metabuf);
-         }

          _bt_relbuf(rel, buf);
      }
--- 867,875 ----

          END_CRIT_SECTION();

!         /* release buffers */
          if (BufferIsValid(metabuf))
              _bt_relbuf(rel, metabuf);

          _bt_relbuf(rel, buf);
      }
*************** _bt_newroot(Relation rel, Buffer lbuf, B
*** 1963,1972 ****

      END_CRIT_SECTION();

-     /* send out relcache inval for metapage change */
-     if (!InRecovery)
-         CacheInvalidateRelcache(rel);
-
      /* done with metapage */
      _bt_relbuf(rel, metabuf);

--- 1958,1963 ----
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index c7c65a6..1e78858 100644
*** a/src/backend/access/nbtree/nbtpage.c
--- b/src/backend/access/nbtree/nbtpage.c
***************
*** 28,34 ****
  #include "storage/indexfsm.h"
  #include "storage/lmgr.h"
  #include "storage/predicate.h"
- #include "utils/inval.h"
  #include "utils/snapmgr.h"


--- 28,33 ----
*************** _bt_getroot(Relation rel, int access)
*** 247,258 ****
          END_CRIT_SECTION();

          /*
-          * Send out relcache inval for metapage change (probably unnecessary
-          * here, but let's be safe).
-          */
-         CacheInvalidateRelcache(rel);
-
-         /*
           * swap root write lock for read lock.    There is no danger of anyone
           * else accessing the new root page while it's unlocked, since no one
           * else knows where it is yet.
--- 246,251 ----
*************** _bt_pagedel(Relation rel, Buffer buf, BT
*** 1545,1556 ****

      END_CRIT_SECTION();

!     /* release metapage; send out relcache inval if metapage changed */
      if (BufferIsValid(metabuf))
-     {
-         CacheInvalidateRelcache(rel);
          _bt_relbuf(rel, metabuf);
!     }
      /* can always release leftsib immediately */
      if (BufferIsValid(lbuf))
          _bt_relbuf(rel, lbuf);
--- 1538,1547 ----

      END_CRIT_SECTION();

!     /* release metapage */
      if (BufferIsValid(metabuf))
          _bt_relbuf(rel, metabuf);
!
      /* can always release leftsib immediately */
      if (BufferIsValid(lbuf))
          _bt_relbuf(rel, lbuf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b7f9e2e..1da9548 100644
*** a/src/backend/access/nbtree/nbtree.c
--- b/src/backend/access/nbtree/nbtree.c
*************** btbuild(PG_FUNCTION_ARGS)
*** 149,162 ****
  #endif   /* BTREE_BUILD_STATS */

      /*
-      * If we are reindexing a pre-existing index, it is critical to send out a
-      * relcache invalidation SI message to ensure all backends re-read the
-      * index metapage.    We expect that the caller will ensure that happens
-      * (typically as a side effect of updating index stats, but it must happen
-      * even if the stats don't change!)
-      */
-
-     /*
       * Return statistics
       */
      result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
--- 149,154 ----

Re: CacheInvalidateRelcache in btree is a crummy idea

From: Heikki Linnakangas

On 02/02/2014 11:45 PM, Tom Lane wrote:
> So I'm thinking my commit d2896a9ed, which introduced this mechanism,
> was poorly thought out and we should just remove the relcache invals
> as per the attached patch.  Letting _bt_getroot() update the cached
> metapage at next use should be a lot cheaper than a full relcache
> rebuild for the index.

Looks good to me.

- Heikki



Re: CacheInvalidateRelcache in btree is a crummy idea

From: Tom Lane

Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 02/02/2014 11:45 PM, Tom Lane wrote:
>> So I'm thinking my commit d2896a9ed, which introduced this mechanism,
>> was poorly thought out and we should just remove the relcache invals
>> as per the attached patch.  Letting _bt_getroot() update the cached
>> metapage at next use should be a lot cheaper than a full relcache
>> rebuild for the index.

> Looks good to me.

I did think of one possible objection to this idea.  Although
_bt_getroot() checks that the page it arrives at is a usable fast root,
it could still fail if the page number is off the end of the relation;
that is, we have a cache entry that predates an index truncation.
Currently, the only way for a btree to get physically shorter is a
REINDEX, which implies a relfilenode change and hence a (transactional)
relcache flush, so this couldn't happen.  And if we tried to allow
VACUUM to truncate an index without a full lock, we'd probably have
the same type of issue for any concurrent process that's following
a stale cross-page link, not just a link to the root.  So I'm not
particularly impressed by this objection, but it could be made.
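
Spelled out, the objection boils down to a missing block-number sanity
check before trusting the cached pointer, something like this
(hypothetical, not part of the attached patch; the helper name is mine):

#include "postgres.h"
#include "access/nbtree.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Hypothetical guard: refuse to follow a cached fast-root pointer that
 * lies past the current end of the index, as could only happen if an
 * index were physically truncated without a relfilenode change.
 */
static bool
bt_cached_root_in_bounds(Relation rel)
{
    BTMetaPageData *metad = (BTMetaPageData *) rel->rd_amcache;

    if (metad == NULL)
        return false;
    return metad->btm_fastroot < RelationGetNumberOfBlocks(rel);
}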

If we ever did need to make that work, a possible solution would be
to refactor things so that the metapage cache lives at the smgr level
not the relcache level, where it'd get blown away by a cross-backend
smgr inval (CacheInvalidateSmgr) --- which is nontransactional,
thereby fixing the basic problem with the way it's being done now.
There would still be some issues about locking, but at least the
cache per se wouldn't be a hazard anymore.
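
For what it's worth, sending such a nontransactional inval would look
about like this under the 9.x smgr APIs (a hypothetical sketch of the
refactoring described above, not anything that exists in the tree; the
function name is mine):

#include "postgres.h"
#include "storage/smgr.h"
#include "utils/inval.h"
#include "utils/rel.h"

/*
 * Hypothetical sketch: invalidate an smgr-level metapage cache.  Unlike
 * CacheInvalidateRelcache(), CacheInvalidateSmgr() sends its message
 * immediately and is not undone by transaction or subtransaction abort,
 * which is exactly the property the relcache-based scheme lacks.
 */
static void
bt_send_smgr_inval(Relation rel)
{
    RelationOpenSmgr(rel);
    CacheInvalidateSmgr(rel->rd_smgr->smgr_rnode);
}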

Since I'm not aware of any plans to make on-the-fly btree truncation
work, I won't bother to implement that now, but I thought I'd mention
it for the archives.
        regards, tom lane