Re: Report: race conditions in WAL replay routines - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Report: race conditions in WAL replay routines
Date
Msg-id 22343.1328480598@sss.pgh.pa.us
Whole thread Raw
In response to Re: Report: race conditions in WAL replay routines  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: Report: race conditions in WAL replay routines
List pgsql-hackers
Simon Riggs <simon@2ndQuadrant.com> writes:
> Please post the patch rather than fixing directly. There's some subtle
> stuff there and it would be best to discuss first.

Here's a proposed patch for the issues around unlocked updates of
shared-memory state.  After going through this I believe that there is
little risk of any real problems in the current state of the code; this
is more in the nature of future-proofing against foreseeable changes.
(One such is that we'd discussed fixing the age() function to work
during Hot Standby.)  So I suggest applying this to HEAD but not
back-patching.

Except for one thing.  I realized while looking at the NEXTOID replay
code that it is completely broken: it only advances
ShmemVariableCache->nextOid when that's less than the value in the WAL
record.  So that comparison fails if the OID counter wraps around during
replay.  I've fixed this in the attached patch by just forcibly
assigning the new value instead of trying to be smart, and I think
probably that aspect of it needs to be back-patched.

            regards, tom lane

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 454ca310f339db34a4cc352899fea1d663abf93b..0f4cea124d7436380c730203e6cfd1518bc5d3b2 100644
*** a/src/backend/access/transam/multixact.c
--- b/src/backend/access/transam/multixact.c
*************** CheckPointMultiXact(void)
*** 1655,1662 ****
   * Set the next-to-be-assigned MultiXactId and offset
   *
   * This is used when we can determine the correct next ID/offset exactly
!  * from a checkpoint record.  We need no locking since it is only called
!  * during bootstrap and XLog replay.
   */
  void
  MultiXactSetNextMXact(MultiXactId nextMulti,
--- 1655,1663 ----
   * Set the next-to-be-assigned MultiXactId and offset
   *
   * This is used when we can determine the correct next ID/offset exactly
!  * from a checkpoint record.  Although this is only called during bootstrap
!  * and XLog replay, we take the lock in case any hot-standby backends are
!  * examining the values.
   */
  void
  MultiXactSetNextMXact(MultiXactId nextMulti,
*************** MultiXactSetNextMXact(MultiXactId nextMu
*** 1664,1671 ****
--- 1665,1674 ----
  {
      debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
                  nextMulti, nextMultiOffset);
+     LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
      MultiXactState->nextMXact = nextMulti;
      MultiXactState->nextOffset = nextMultiOffset;
+     LWLockRelease(MultiXactGenLock);
  }

  /*
*************** MultiXactSetNextMXact(MultiXactId nextMu
*** 1674,1685 ****
   *
   * This is used when we can determine minimum safe values from an XLog
   * record (either an on-line checkpoint or an mxact creation log entry).
!  * We need no locking since it is only called during XLog replay.
   */
  void
  MultiXactAdvanceNextMXact(MultiXactId minMulti,
                            MultiXactOffset minMultiOffset)
  {
      if (MultiXactIdPrecedes(MultiXactState->nextMXact, minMulti))
      {
          debug_elog3(DEBUG2, "MultiXact: setting next multi to %u", minMulti);
--- 1677,1690 ----
   *
   * This is used when we can determine minimum safe values from an XLog
   * record (either an on-line checkpoint or an mxact creation log entry).
!  * Although this is only called during XLog replay, we take the lock in case
!  * any hot-standby backends are examining the values.
   */
  void
  MultiXactAdvanceNextMXact(MultiXactId minMulti,
                            MultiXactOffset minMultiOffset)
  {
+     LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
      if (MultiXactIdPrecedes(MultiXactState->nextMXact, minMulti))
      {
          debug_elog3(DEBUG2, "MultiXact: setting next multi to %u", minMulti);
*************** MultiXactAdvanceNextMXact(MultiXactId mi
*** 1691,1696 ****
--- 1696,1702 ----
                      minMultiOffset);
          MultiXactState->nextOffset = minMultiOffset;
      }
+     LWLockRelease(MultiXactGenLock);
  }

  /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6e84cd0a21671486693e7f94d5fda8efdedf4bb4..3d08e92d3a747cc9426206dc0623ff390f18b09c 100644
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
*************** PrescanPreparedTransactions(TransactionI
*** 1715,1720 ****
--- 1715,1724 ----
              /*
               * Examine subtransaction XIDs ... they should all follow main
               * XID, and they may force us to advance nextXid.
+              *
+              * We don't expect anyone else to modify nextXid, hence we don't
+              * need to hold a lock while examining it.  We still acquire the
+              * lock to modify it, though.
               */
              subxids = (TransactionId *)
                  (buf + MAXALIGN(sizeof(TwoPhaseFileHeader)));
*************** PrescanPreparedTransactions(TransactionI
*** 1726,1733 ****
--- 1730,1739 ----
                  if (TransactionIdFollowsOrEquals(subxid,
                                                   ShmemVariableCache->nextXid))
                  {
+                     LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
                      ShmemVariableCache->nextXid = subxid;
                      TransactionIdAdvance(ShmemVariableCache->nextXid);
+                     LWLockRelease(XidGenLock);
                  }
              }

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b128bfda36dbf00c0ef49a34bd2d48d6cdf42218..5e59c1ab1967a0516ef5e240286c36cab9ae2570 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** StartupXLOG(void)
*** 6625,6636 ****
                  errcontext.previous = error_context_stack;
                  error_context_stack = &errcontext;

!                 /* nextXid must be beyond record's xid */
                  if (TransactionIdFollowsOrEquals(record->xl_xid,
                                                   ShmemVariableCache->nextXid))
                  {
                      ShmemVariableCache->nextXid = record->xl_xid;
                      TransactionIdAdvance(ShmemVariableCache->nextXid);
                  }

                  /*
--- 6625,6644 ----
                  errcontext.previous = error_context_stack;
                  error_context_stack = &errcontext;

!                 /*
!                  * ShmemVariableCache->nextXid must be beyond record's xid.
!                  *
!                  * We don't expect anyone else to modify nextXid, hence we
!                  * don't need to hold a lock while examining it.  We still
!                  * acquire the lock to modify it, though.
!                  */
                  if (TransactionIdFollowsOrEquals(record->xl_xid,
                                                   ShmemVariableCache->nextXid))
                  {
+                     LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
                      ShmemVariableCache->nextXid = record->xl_xid;
                      TransactionIdAdvance(ShmemVariableCache->nextXid);
+                     LWLockRelease(XidGenLock);
                  }

                  /*
*************** StartupXLOG(void)
*** 6656,6661 ****
--- 6664,6670 ----
                      TransactionIdIsValid(record->xl_xid))
                      RecordKnownAssignedTransactionIds(record->xl_xid);

+                 /* Now apply the WAL record itself */
                  RmgrTable[record->xl_rmid].rm_redo(EndRecPtr, record);

                  /* Pop the error context stack */
*************** StartupXLOG(void)
*** 6971,6978 ****
--- 6980,6989 ----
      XLogCtl->ckptXid = ControlFile->checkPointCopy.nextXid;

      /* also initialize latestCompletedXid, to nextXid - 1 */
+     LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
      ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
      TransactionIdRetreat(ShmemVariableCache->latestCompletedXid);
+     LWLockRelease(ProcArrayLock);

      /*
       * Start up the commit log and subtrans, if not already done for
*************** xlog_redo(XLogRecPtr lsn, XLogRecord *re
*** 8547,8558 ****
      {
          Oid            nextOid;

          memcpy(&nextOid, XLogRecGetData(record), sizeof(Oid));
!         if (ShmemVariableCache->nextOid < nextOid)
!         {
!             ShmemVariableCache->nextOid = nextOid;
!             ShmemVariableCache->oidCount = 0;
!         }
      }
      else if (info == XLOG_CHECKPOINT_SHUTDOWN)
      {
--- 8558,8575 ----
      {
          Oid            nextOid;

+         /*
+          * We used to try to take the maximum of ShmemVariableCache->nextOid
+          * and the recorded nextOid, but that fails if the OID counter wraps
+          * around.  Since no OID allocation should be happening during replay
+          * anyway, better to just believe the record exactly.  We still take
+          * OidGenLock while setting the variable, just in case.
+          */
          memcpy(&nextOid, XLogRecGetData(record), sizeof(Oid));
!         LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
!         ShmemVariableCache->nextOid = nextOid;
!         ShmemVariableCache->oidCount = 0;
!         LWLockRelease(OidGenLock);
      }
      else if (info == XLOG_CHECKPOINT_SHUTDOWN)
      {
*************** xlog_redo(XLogRecPtr lsn, XLogRecord *re
*** 8560,8568 ****
--- 8577,8589 ----

          memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
          /* In a SHUTDOWN checkpoint, believe the counters exactly */
+         LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
          ShmemVariableCache->nextXid = checkPoint.nextXid;
+         LWLockRelease(XidGenLock);
+         LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
          ShmemVariableCache->nextOid = checkPoint.nextOid;
          ShmemVariableCache->oidCount = 0;
+         LWLockRelease(OidGenLock);
          MultiXactSetNextMXact(checkPoint.nextMulti,
                                checkPoint.nextMultiOffset);
          SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
*************** xlog_redo(XLogRecPtr lsn, XLogRecord *re
*** 8575,8581 ****
          if (InArchiveRecovery &&
              !XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
              XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
!             ereport(ERROR,
                      (errmsg("online backup was canceled, recovery cannot continue")));

          /*
--- 8596,8602 ----
          if (InArchiveRecovery &&
              !XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
              XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
!             ereport(PANIC,
                      (errmsg("online backup was canceled, recovery cannot continue")));

          /*
*************** xlog_redo(XLogRecPtr lsn, XLogRecord *re
*** 8641,8655 ****
          CheckPoint    checkPoint;

          memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
!         /* In an ONLINE checkpoint, treat the counters like NEXTOID */
          if (TransactionIdPrecedes(ShmemVariableCache->nextXid,
                                    checkPoint.nextXid))
              ShmemVariableCache->nextXid = checkPoint.nextXid;
!         if (ShmemVariableCache->nextOid < checkPoint.nextOid)
!         {
!             ShmemVariableCache->nextOid = checkPoint.nextOid;
!             ShmemVariableCache->oidCount = 0;
!         }
          MultiXactAdvanceNextMXact(checkPoint.nextMulti,
                                    checkPoint.nextMultiOffset);
          if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
--- 8662,8678 ----
          CheckPoint    checkPoint;

          memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
!         /* In an ONLINE checkpoint, treat the XID counter as a minimum */
!         LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
          if (TransactionIdPrecedes(ShmemVariableCache->nextXid,
                                    checkPoint.nextXid))
              ShmemVariableCache->nextXid = checkPoint.nextXid;
!         LWLockRelease(XidGenLock);
!         /* ... but still treat OID counter as exact */
!         LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
!         ShmemVariableCache->nextOid = checkPoint.nextOid;
!         ShmemVariableCache->oidCount = 0;
!         LWLockRelease(OidGenLock);
          MultiXactAdvanceNextMXact(checkPoint.nextMulti,
                                    checkPoint.nextMultiOffset);
          if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 8bda474bc712938ce8619e5ba715c12e7ba45deb..09b7311e7bcd82fbdc16d62ffcd8f3214193ab50 100644
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
*************** ProcArrayApplyRecoveryInfo(RunningTransa
*** 654,670 ****
                                running->latestCompletedXid))
          ShmemVariableCache->latestCompletedXid = running->latestCompletedXid;

!     /* nextXid must be beyond any observed xid */
      nextXid = latestObservedXid;
      TransactionIdAdvance(nextXid);
      if (TransactionIdFollows(nextXid, ShmemVariableCache->nextXid))
          ShmemVariableCache->nextXid = nextXid;

-     Assert(TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid));
      Assert(TransactionIdIsValid(ShmemVariableCache->nextXid));

-     LWLockRelease(ProcArrayLock);
-
      KnownAssignedXidsDisplay(trace_recovery(DEBUG3));
      if (standbyState == STANDBY_SNAPSHOT_READY)
          elog(trace_recovery(DEBUG1), "recovery snapshots are now enabled");
--- 654,681 ----
                                running->latestCompletedXid))
          ShmemVariableCache->latestCompletedXid = running->latestCompletedXid;

!     Assert(TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid));
!
!     LWLockRelease(ProcArrayLock);
!
!     /*
!      * ShmemVariableCache->nextXid must be beyond any observed xid.
!      *
!      * We don't expect anyone else to modify nextXid, hence we don't need to
!      * hold a lock while examining it.  We still acquire the lock to modify
!      * it, though.
!      */
      nextXid = latestObservedXid;
      TransactionIdAdvance(nextXid);
      if (TransactionIdFollows(nextXid, ShmemVariableCache->nextXid))
+     {
+         LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
          ShmemVariableCache->nextXid = nextXid;
+         LWLockRelease(XidGenLock);
+     }

      Assert(TransactionIdIsValid(ShmemVariableCache->nextXid));

      KnownAssignedXidsDisplay(trace_recovery(DEBUG3));
      if (standbyState == STANDBY_SNAPSHOT_READY)
          elog(trace_recovery(DEBUG1), "recovery snapshots are now enabled");
*************** GetOldestActiveTransactionId(void)
*** 1690,1695 ****
--- 1701,1713 ----

      LWLockAcquire(ProcArrayLock, LW_SHARED);

+     /*
+      * It's okay to read nextXid without acquiring XidGenLock because (1) we
+      * assume TransactionIds can be read atomically and (2) we don't care if
+      * we get a slightly stale value.  It can't be very stale anyway, because
+      * the LWLockAcquire above will have done any necessary memory
+      * interlocking.
+      */
      oldestRunningXid = ShmemVariableCache->nextXid;

      /*
*************** RecordKnownAssignedTransactionIds(Transa
*** 2609,2615 ****
--- 2627,2635 ----
          /* ShmemVariableCache->nextXid must be beyond any observed xid */
          next_expected_xid = latestObservedXid;
          TransactionIdAdvance(next_expected_xid);
+         LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
          ShmemVariableCache->nextXid = next_expected_xid;
+         LWLockRelease(XidGenLock);
      }
  }


pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: Report: race conditions in WAL replay routines
Next
From: Noah Misch
Date:
Subject: Re: initdb and fsync