Re: Recovery inconsistencies, standby much larger than primary - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Recovery inconsistencies, standby much larger than primary
Date
Msg-id 20918.1392240811@sss.pgh.pa.us
In response to Re: Recovery inconsistencies, standby much larger than primary  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers

I wrote:
> What I think we probably want to do is forcibly cause the target page
> to exist, using a P_NEW loop like what I committed, and then decide
> on the basis of whether it's all-zeroes whether to consider it invalid
> or not.  This seems sane on the grounds that it's just the extension
> to the page level of the existing policy of creating the file whether
> it existed or not.  It could only result in a large amount of wasted
> work if the passed-in target block is insane --- but since we got it
> out of a CRC-checked WAL record, I think it's safe to not worry too
> much about that.

Something like the attached patch.  One possible complaint is that if the
WAL sequence contains updates against large relations that are later
dropped, replay will waste time and disk space re-creating those relations
and applying the updates as best it can.  That doesn't seem like a case
worth optimizing for, however.

            regards, tom lane

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index f1918f6..7a820c0 100644
*** a/src/backend/access/transam/xlogutils.c
--- b/src/backend/access/transam/xlogutils.c
*************** XLogReadBuffer(RelFileNode rnode, BlockN
*** 277,297 ****
   * XLogReadBufferExtended
   *        Read a page during XLOG replay
   *
!  * This is functionally comparable to ReadBufferExtended. There's some
!  * differences in the behavior wrt. the "mode" argument:
   *
!  * In RBM_NORMAL mode, if the page doesn't exist, or contains all-zeroes, we
!  * return InvalidBuffer. In this case the caller should silently skip the
!  * update on this page. (In this situation, we expect that the page was later
!  * dropped or truncated. If we don't see evidence of that later in the WAL
!  * sequence, we'll complain at the end of WAL replay.)
   *
   * In RBM_ZERO and RBM_ZERO_ON_ERROR modes, if the page doesn't exist, the
   * relation is extended with all-zeroes pages up to the given block number.
   *
!  * In RBM_NORMAL_NO_LOG mode, we return InvalidBuffer if the page doesn't
!  * exist, and we don't check for all-zeroes.  Thus, no log entry is made
!  * to imply that the page should be dropped or truncated later.
   */
  Buffer
  XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
--- 277,307 ----
   * XLogReadBufferExtended
   *        Read a page during XLOG replay
   *
!  * This is functionally comparable to ReadBufferExtended, except that we
!  * are willing to create the target page (and indeed the whole relation)
!  * if it doesn't currently exist.  This allows safe replay of WAL sequences
!  * in which a relation was later dropped or truncated.
   *
!  * The "mode" argument provides some control over this behavior.  (See also
!  * ReadBufferExtended's specification of what the modes do.)
   *
   * In RBM_ZERO and RBM_ZERO_ON_ERROR modes, if the page doesn't exist, the
   * relation is extended with all-zeroes pages up to the given block number.
+  * These modes should be used if the caller is going to initialize the page
+  * contents from scratch, and doesn't need it to be valid already.
   *
!  * In RBM_NORMAL mode, we similarly create the page if needed, but if the
!  * page contains all zeroes (including the case where we just created it),
!  * we return InvalidBuffer.  Then the caller should silently skip the update
!  * on this page.  This mode should be used for incremental updates where the
!  * caller needs to see a valid page.  (In this case, we expect that the page
!  * later gets dropped or truncated. If we don't see evidence of that later in
!  * the WAL sequence, we'll complain at the end of WAL replay.)
!  *
!  * RBM_NORMAL_NO_LOG mode is like RBM_NORMAL except that we will return an
!  * all-zeroes page, and not log it as one that ought to get dropped later.
!  * This mode is for when the caller wants to read a page that might validly
!  * contain zeroes.
   */
  Buffer
  XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
*************** XLogReadBufferExtended(RelFileNode rnode
*** 299,304 ****
--- 309,315 ----
  {
      BlockNumber lastblock;
      Buffer        buffer;
+     bool        present;
      SMgrRelation smgr;

      Assert(blkno != P_NEW);
*************** XLogReadBufferExtended(RelFileNode rnode
*** 316,342 ****
       */
      smgrcreate(smgr, forknum, true);

      lastblock = smgrnblocks(smgr, forknum);

      if (blkno < lastblock)
      {
          /* page exists in file */
          buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
                                             mode, NULL);
      }
      else
      {
!         /* hm, page doesn't exist in file */
!         if (mode == RBM_NORMAL)
!         {
!             log_invalid_page(rnode, forknum, blkno, false);
!             return InvalidBuffer;
!         }
!         if (mode == RBM_NORMAL_NO_LOG)
!             return InvalidBuffer;
!         /* OK to extend the file */
          /* we do this in recovery only - no rel-extension lock needed */
          Assert(InRecovery);
          buffer = InvalidBuffer;
          do
          {
--- 327,357 ----
       */
      smgrcreate(smgr, forknum, true);

+     /*
+      * On the same principle, if the page doesn't already exist in the file,
+      * create it by extending the relation as far as needed.
+      *
+      * When we are working in a not-yet-consistent database, it's possible for
+      * P_NEW to behave somewhat inconsistently as a result of incomplete
+      * segment files.  Don't assume that the returned pages are necessarily
+      * consecutive.  When we're done with this loop, however, any segments
+      * before the target page's segment have been zero-filled until complete.
+      */
      lastblock = smgrnblocks(smgr, forknum);

      if (blkno < lastblock)
      {
          /* page exists in file */
+         present = true;
          buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
                                             mode, NULL);
      }
      else
      {
!         /* must extend the file */
          /* we do this in recovery only - no rel-extension lock needed */
          Assert(InRecovery);
+         present = false;
          buffer = InvalidBuffer;
          do
          {
*************** XLogReadBufferExtended(RelFileNode rnode
*** 352,357 ****
--- 367,374 ----
              ReleaseBuffer(buffer);
              buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
                                                 mode, NULL);
+             /* page was not in fact created by P_NEW extension */
+             present = true;
          }
      }

*************** XLogReadBufferExtended(RelFileNode rnode
*** 368,374 ****
          if (PageIsNew(page))
          {
              ReleaseBuffer(buffer);
!             log_invalid_page(rnode, forknum, blkno, true);
              return InvalidBuffer;
          }
      }
--- 385,391 ----
          if (PageIsNew(page))
          {
              ReleaseBuffer(buffer);
!             log_invalid_page(rnode, forknum, blkno, present);
              return InvalidBuffer;
          }
      }
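
For readers who want to see the post-patch control flow in isolation, here
is a minimal standalone sketch of it.  It is a toy model under stated
assumptions, not PostgreSQL code: ToyRel, ToyMode, toy_read_page, PAGE_SIZE
and MAX_BLOCKS are all hypothetical names invented for illustration, the
"relation" is just an in-memory array, and a NULL return stands in for
InvalidBuffer.  It also ignores the incomplete-segment case in which P_NEW
does not hand back consecutive pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE  16   /* toy page size, unrelated to PostgreSQL's BLCKSZ */
#define MAX_BLOCKS 64   /* toy capacity of the in-memory "relation" */

/* Hypothetical stand-ins for RBM_NORMAL, RBM_ZERO and RBM_NORMAL_NO_LOG. */
typedef enum { MODE_NORMAL, MODE_ZERO, MODE_NORMAL_NO_LOG } ToyMode;

typedef struct
{
    unsigned char pages[MAX_BLOCKS][PAGE_SIZE];
    size_t      nblocks;        /* current length of the relation */
    int         invalid_logged; /* how many pages were reported invalid */
    bool        last_present;   /* "present" flag of the latest report */
} ToyRel;

/* Roughly what PageIsNew() answers in this model: is the page all zeroes? */
bool
page_is_all_zero(const unsigned char *page)
{
    for (size_t i = 0; i < PAGE_SIZE; i++)
    {
        if (page[i] != 0)
            return false;
    }
    return true;
}

/*
 * Always make the target page exist first, then decide from its contents
 * whether to treat it as invalid.  Returns NULL in place of InvalidBuffer.
 */
unsigned char *
toy_read_page(ToyRel *rel, size_t blkno, ToyMode mode)
{
    bool        present;
    unsigned char *page;

    assert(blkno < MAX_BLOCKS);

    if (blkno < rel->nblocks)
        present = true;         /* page exists in file */
    else
    {
        /* must extend the relation with zero pages up to the target */
        present = false;
        while (rel->nblocks <= blkno)
        {
            memset(rel->pages[rel->nblocks], 0, PAGE_SIZE);
            rel->nblocks++;
        }
    }

    page = rel->pages[blkno];

    /* Only the "normal" mode refuses an all-zeroes page and records it. */
    if (mode == MODE_NORMAL && page_is_all_zero(page))
    {
        rel->invalid_logged++;
        rel->last_present = present;
        return NULL;
    }
    return page;
}
```

In this model, calling toy_read_page(&rel, 5, MODE_NORMAL) on an empty
relation extends it to six blocks, records one invalid page with
present = false, and returns NULL, whereas the same call with
MODE_NORMAL_NO_LOG hands back the zeroed page without recording anything.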
