Thread: Report: race conditions in WAL replay routines

Report: race conditions in WAL replay routines

From

Tom Lane

Date:

05 February 2012, 15:19:36

Pursuant to the recent discussion about bugs 6200 and 6245, I went
trawling through all the WAL redo routines looking for bugs such as
inadequate locking or transiently trashing shared buffers.  Here's
what I found:

* As we already knew, RestoreBkpBlocks is broken because it transiently
trashes a shared buffer, which another process could be accessing while
holding only a pin.

* seq_redo() has the same disease, since we allow SELECT * FROM
sequences.

* Everything else seems to be free of that specific issue; in particular
the index-related replay routines are at fairly low risk since we don't
have any coding rules allowing index pages to be examined without
holding a buffer lock.

* There are assorted replay routines that assume they can whack fields
of ShmemVariableCache around without any lock.  However, it's pretty
inconsistent; about half do it like that, while the other half assume
that they can read ShmemVariableCache without lock but should acquire
lock to modify it.  I think the latter coding rule is a whole lot safer
in the presence of Hot Standby and should be adopted throughout.

* Same goes for MultiXactSetNextMXact and MultiXactAdvanceNextMXact.
It's not entirely clear to me that no read-only transaction can ever
examine the shared-memory variables they change.  In any case, if there
is in fact no other process examining those variables, there can be no
contention for the lock so it should be cheap to get.

* Not exactly a race condition, but: tblspc_redo does ereport(ERROR)
if it fails to clean out tablespace directories.  This seems to me to be
the height of folly, especially when the failure is more or less an
expected case.  If the error occurs the database is dead in the water,
because that error is actually a PANIC and will recur on subsequent
restart attempts.  Therefore there is no way to recover short of manual
intervention to clean out the non-empty directory.  And why are we
pulling the fire alarm like this?  Well, uh, it's because we might fail
to recover some disk space in the dropped tablespace.  Seems to me to be
a lot better to just elog(LOG) and move on.  This is quite analogous to
the case of failing to unlink a file after commit --- wasting disk space
might be bad, but it's very much the lesser evil compared to this.

Barring objections I'm going to fix all this stuff and back-patch as
far as 9.0 where hot standby was added.
        regards, tom lane

Re: Report: race conditions in WAL replay routines

From

Simon Riggs

Date:

05 February 2012, 17:01:00

On Sun, Feb 5, 2012 at 7:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Pursuant to the recent discussion about bugs 6200 and 6245, I went
> trawling through all the WAL redo routines looking for bugs such as
> inadequate locking or transiently trashing shared buffers.  Here's
> what I found:
>
> * As we already knew, RestoreBkpBlocks is broken because it transiently
> trashes a shared buffer, which another process could be accessing while
> holding only a pin.

Agreed

> * seq_redo() has the same disease, since we allow SELECT * FROM
> sequences.

Why do we do that?

> * Everything else seems to be free of that specific issue; in particular
> the index-related replay routines are at fairly low risk since we don't
> have any coding rules allowing index pages to be examined without
> holding a buffer lock.

Yep

> * There are assorted replay routines that assume they can whack fields
> of ShmemVariableCache around without any lock.  However, it's pretty
> inconsistent; about half do it like that, while the other half assume
> that they can read ShmemVariableCache without lock but should acquire
> lock to modify it.  I think the latter coding rule is a whole lot safer
> in the presence of Hot Standby and should be adopted throughout.

Agreed

> * Same goes for MultiXactSetNextMXact and MultiXactAdvanceNextMXact.
> It's not entirely clear to me that no read-only transaction can ever
> examine the shared-memory variables they change.  In any case, if there
> is in fact no other process examining those variables, there can be no
> contention for the lock so it should be cheap to get.

Row locking requires a WAL record to be written, so that whole path is
dead during HS.

> * Not exactly a race condition, but: tblspc_redo does ereport(ERROR)
> if it fails to clean out tablespace directories.  This seems to me to be
> the height of folly, especially when the failure is more or less an
> expected case.  If the error occurs the database is dead in the water,
> because that error is actually a PANIC and will recur on subsequent
> restart attempts.  Therefore there is no way to recover short of manual
> intervention to clean out the non-empty directory.  And why are we
> pulling the fire alarm like this?  Well, uh, it's because we might fail
> to recover some disk space in the dropped tablespace.  Seems to me to be
> a lot better to just elog(LOG) and move on.  This is quite analogous to
> the case of failing to unlink a file after commit --- wasting disk space
> might be bad, but it's very much the lesser evil compared to this.

If the sysadmin is managing the db properly then this shouldn't ever
happen - the only cause is if the tablespace being dropped is being
used as a temp tablespace on the standby.

The ERROR is appropriate because we first try to remove the files. If
they won't go we then raise an all-session conflict and then try
again. Only when we fail the second time does it ERROR, which seems
OK.

If you just LOG, when exactly would we get rid of the tablespace?

> Barring objections I'm going to fix all this stuff and back-patch as
> far as 9.0 where hot standby was added.

Please post the patch rather than fixing directly. There's some subtle
stuff there and it would be best to discuss first.

Thanks

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Report: race conditions in WAL replay routines

From

Tom Lane

Date:

05 February 2012, 17:03:52

Simon Riggs <simon@2ndQuadrant.com> writes:
> On Sun, Feb 5, 2012 at 7:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> * seq_redo() has the same disease, since we allow SELECT * FROM
>> sequences.

> Why do we do that?

It's the only existing way to obtain the parameters of a sequence.
Even if we felt like inventing a different API for doing that, it'd take
years to change every client that knows about this.

>> * Not exactly a race condition, but: tblspc_redo does ereport(ERROR)
>> if it fails to clean out tablespace directories.  This seems to me to be
>> the height of folly, especially when the failure is more or less an
>> expected case.  If the error occurs the database is dead in the water,
>> because that error is actually a PANIC and will recur on subsequent
>> restart attempts.  Therefore there is no way to recover short of manual
>> intervention to clean out the non-empty directory.  And why are we
>> pulling the fire alarm like this?  Well, uh, it's because we might fail
>> to recover some disk space in the dropped tablespace.  Seems to me to be
>> a lot better to just elog(LOG) and move on.  This is quite analogous to
>> the case of failing to unlink a file after commit --- wasting disk space
>> might be bad, but it's very much the lesser evil compared to this.

> If the sysadmin is managing the db properly then this shouldn't ever
> happen - the only cause is if the tablespace being dropped is being
> used as a temp tablespace on the standby.

Right, but that is an expected/foreseeable situation.  It should not
lead to a dead-and-unrestartable database.

> If you just LOG, when exactly would we get rid of the tablespace?

The tablespace *is* gone, or at least its catalog entries are.  All we
are trying to do here is release some underlying disk space.  It's
exactly analogous to the case where we drop a table and then find (post
commit) that unlinking the disk file fails for some weird reason.
We've done what we can to clean the disk space and should just let it
go --- there is no risk to database integrity in leaving some files
behind, so killing the server is a huge overreaction.
        regards, tom lane

Re: Report: race conditions in WAL replay routines

From

Simon Riggs

Date:

05 February 2012, 17:29:39

On Sun, Feb 5, 2012 at 9:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> * Not exactly a race condition, but: tblspc_redo does ereport(ERROR)
>>> if it fails to clean out tablespace directories.  This seems to me to be
>>> the height of folly, especially when the failure is more or less an
>>> expected case.  If the error occurs the database is dead in the water,
>>> because that error is actually a PANIC and will recur on subsequent
>>> restart attempts.  Therefore there is no way to recover short of manual
>>> intervention to clean out the non-empty directory.  And why are we
>>> pulling the fire alarm like this?  Well, uh, it's because we might fail
>>> to recover some disk space in the dropped tablespace.  Seems to me to be
>>> a lot better to just elog(LOG) and move on.  This is quite analogous to
>>> the case of failing to unlink a file after commit --- wasting disk space
>>> might be bad, but it's very much the lesser evil compared to this.
>
>> If the sysadmin is managing the db properly then this shouldn't ever
>> happen - the only cause is if the tablespace being dropped is being
>> used as a temp tablespace on the standby.
>
> Right, but that is an expected/foreseeable situation.  It should not
> lead to a dead-and-unrestartable database.
>
>> If you just LOG, when exactly would we get rid of the tablespace?
>
> The tablespace *is* gone, or at least its catalog entries are.  All we
> are trying to do here is release some underlying disk space.  It's
> exactly analogous to the case where we drop a table and then find (post
> commit) that unlinking the disk file fails for some weird reason.
> We've done what we can to clean the disk space and should just let it
> go --- there is no risk to database integrity in leaving some files
> behind, so killing the server is a huge overreaction.

I agree the tablespace entries are gone, but that won't stop existing
users from continuing.

If we're not sure of the reason why tablespace removal fails it
doesn't seem safe to continue to me.

But since this is a rare corner case, and we already try to remove
users, then LOG seems OK.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Report: race conditions in WAL replay routines

From

Tom Lane

Date:

05 February 2012, 18:23:39

Simon Riggs <simon@2ndQuadrant.com> writes:
> Please post the patch rather than fixing directly. There's some subtle
> stuff there and it would be best to discuss first.

Here's a proposed patch for the issues around unlocked updates of
shared-memory state.  After going through this I believe that there is
little risk of any real problems in the current state of the code; this
is more in the nature of future-proofing against foreseeable changes.
(One such is that we'd discussed fixing the age() function to work
during Hot Standby.)  So I suggest applying this to HEAD but not
back-patching.

Except for one thing.  I realized while looking at the NEXTOID replay
code that it is completely broken: it only advances
ShmemVariableCache->nextOid when that's less than the value in the WAL
record.  So that comparison fails if the OID counter wraps around during
replay.  I've fixed this in the attached patch by just forcibly
assigning the new value instead of trying to be smart, and I think
probably that aspect of it needs to be back-patched.

            regards, tom lane

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 454ca310f339db34a4cc352899fea1d663abf93b..0f4cea124d7436380c730203e6cfd1518bc5d3b2 100644
*** a/src/backend/access/transam/multixact.c
--- b/src/backend/access/transam/multixact.c
*************** CheckPointMultiXact(void)
*** 1655,1662 ****
   * Set the next-to-be-assigned MultiXactId and offset
   *
   * This is used when we can determine the correct next ID/offset exactly
!  * from a checkpoint record.  We need no locking since it is only called
!  * during bootstrap and XLog replay.
   */
  void
  MultiXactSetNextMXact(MultiXactId nextMulti,
--- 1655,1663 ----
   * Set the next-to-be-assigned MultiXactId and offset
   *
   * This is used when we can determine the correct next ID/offset exactly
!  * from a checkpoint record.  Although this is only called during bootstrap
!  * and XLog replay, we take the lock in case any hot-standby backends are
!  * examining the values.
   */
  void
  MultiXactSetNextMXact(MultiXactId nextMulti,
*************** MultiXactSetNextMXact(MultiXactId nextMu
*** 1664,1671 ****
--- 1665,1674 ----
  {
      debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
                  nextMulti, nextMultiOffset);
+     LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
      MultiXactState->nextMXact = nextMulti;
      MultiXactState->nextOffset = nextMultiOffset;
+     LWLockRelease(MultiXactGenLock);
  }

  /*
*************** MultiXactSetNextMXact(MultiXactId nextMu
*** 1674,1685 ****
   *
   * This is used when we can determine minimum safe values from an XLog
   * record (either an on-line checkpoint or an mxact creation log entry).
!  * We need no locking since it is only called during XLog replay.
   */
  void
  MultiXactAdvanceNextMXact(MultiXactId minMulti,
                            MultiXactOffset minMultiOffset)
  {
      if (MultiXactIdPrecedes(MultiXactState->nextMXact, minMulti))
      {
          debug_elog3(DEBUG2, "MultiXact: setting next multi to %u", minMulti);
--- 1677,1690 ----
   *
   * This is used when we can determine minimum safe values from an XLog
   * record (either an on-line checkpoint or an mxact creation log entry).
!  * Although this is only called during XLog replay, we take the lock in case
!  * any hot-standby backends are examining the values.
   */
  void
  MultiXactAdvanceNextMXact(MultiXactId minMulti,
                            MultiXactOffset minMultiOffset)
  {
+     LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
      if (MultiXactIdPrecedes(MultiXactState->nextMXact, minMulti))
      {
          debug_elog3(DEBUG2, "MultiXact: setting next multi to %u", minMulti);
*************** MultiXactAdvanceNextMXact(MultiXactId mi
*** 1691,1696 ****
--- 1696,1702 ----
                      minMultiOffset);
          MultiXactState->nextOffset = minMultiOffset;
      }
+     LWLockRelease(MultiXactGenLock);
  }

  /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6e84cd0a21671486693e7f94d5fda8efdedf4bb4..3d08e92d3a747cc9426206dc0623ff390f18b09c 100644
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
*************** PrescanPreparedTransactions(TransactionI
*** 1715,1720 ****
--- 1715,1724 ----
              /*
               * Examine subtransaction XIDs ... they should all follow main
               * XID, and they may force us to advance nextXid.
+              *
+              * We don't expect anyone else to modify nextXid, hence we don't
+              * need to hold a lock while examining it.  We still acquire the
+              * lock to modify it, though.
               */
              subxids = (TransactionId *)
                  (buf + MAXALIGN(sizeof(TwoPhaseFileHeader)));
*************** PrescanPreparedTransactions(TransactionI
*** 1726,1733 ****
--- 1730,1739 ----
                  if (TransactionIdFollowsOrEquals(subxid,
                                                   ShmemVariableCache->nextXid))
                  {
+                     LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
                      ShmemVariableCache->nextXid = subxid;
                      TransactionIdAdvance(ShmemVariableCache->nextXid);
+                     LWLockRelease(XidGenLock);
                  }
              }

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b128bfda36dbf00c0ef49a34bd2d48d6cdf42218..5e59c1ab1967a0516ef5e240286c36cab9ae2570 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** StartupXLOG(void)
*** 6625,6636 ****
                  errcontext.previous = error_context_stack;
                  error_context_stack = &errcontext;

!                 /* nextXid must be beyond record's xid */
                  if (TransactionIdFollowsOrEquals(record->xl_xid,
                                                   ShmemVariableCache->nextXid))
                  {
                      ShmemVariableCache->nextXid = record->xl_xid;
                      TransactionIdAdvance(ShmemVariableCache->nextXid);
                  }

                  /*
--- 6625,6644 ----
                  errcontext.previous = error_context_stack;
                  error_context_stack = &errcontext;

!                 /*
!                  * ShmemVariableCache->nextXid must be beyond record's xid.
!                  *
!                  * We don't expect anyone else to modify nextXid, hence we
!                  * don't need to hold a lock while examining it.  We still
!                  * acquire the lock to modify it, though.
!                  */
                  if (TransactionIdFollowsOrEquals(record->xl_xid,
                                                   ShmemVariableCache->nextXid))
                  {
+                     LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
                      ShmemVariableCache->nextXid = record->xl_xid;
                      TransactionIdAdvance(ShmemVariableCache->nextXid);
+                     LWLockRelease(XidGenLock);
                  }

                  /*
*************** StartupXLOG(void)
*** 6656,6661 ****
--- 6664,6670 ----
                      TransactionIdIsValid(record->xl_xid))
                      RecordKnownAssignedTransactionIds(record->xl_xid);

+                 /* Now apply the WAL record itself */
                  RmgrTable[record->xl_rmid].rm_redo(EndRecPtr, record);

                  /* Pop the error context stack */
*************** StartupXLOG(void)
*** 6971,6978 ****
--- 6980,6989 ----
      XLogCtl->ckptXid = ControlFile->checkPointCopy.nextXid;

      /* also initialize latestCompletedXid, to nextXid - 1 */
+     LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
      ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
      TransactionIdRetreat(ShmemVariableCache->latestCompletedXid);
+     LWLockRelease(ProcArrayLock);

      /*
       * Start up the commit log and subtrans, if not already done for
*************** xlog_redo(XLogRecPtr lsn, XLogRecord *re
*** 8547,8558 ****
      {
          Oid            nextOid;

          memcpy(&nextOid, XLogRecGetData(record), sizeof(Oid));
!         if (ShmemVariableCache->nextOid < nextOid)
!         {
!             ShmemVariableCache->nextOid = nextOid;
!             ShmemVariableCache->oidCount = 0;
!         }
      }
      else if (info == XLOG_CHECKPOINT_SHUTDOWN)
      {
--- 8558,8575 ----
      {
          Oid            nextOid;

+         /*
+          * We used to try to take the maximum of ShmemVariableCache->nextOid
+          * and the recorded nextOid, but that fails if the OID counter wraps
+          * around.  Since no OID allocation should be happening during replay
+          * anyway, better to just believe the record exactly.  We still take
+          * OidGenLock while setting the variable, just in case.
+          */
          memcpy(&nextOid, XLogRecGetData(record), sizeof(Oid));
!         LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
!         ShmemVariableCache->nextOid = nextOid;
!         ShmemVariableCache->oidCount = 0;
!         LWLockRelease(OidGenLock);
      }
      else if (info == XLOG_CHECKPOINT_SHUTDOWN)
      {
*************** xlog_redo(XLogRecPtr lsn, XLogRecord *re
*** 8560,8568 ****
--- 8577,8589 ----

          memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
          /* In a SHUTDOWN checkpoint, believe the counters exactly */
+         LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
          ShmemVariableCache->nextXid = checkPoint.nextXid;
+         LWLockRelease(XidGenLock);
+         LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
          ShmemVariableCache->nextOid = checkPoint.nextOid;
          ShmemVariableCache->oidCount = 0;
+         LWLockRelease(OidGenLock);
          MultiXactSetNextMXact(checkPoint.nextMulti,
                                checkPoint.nextMultiOffset);
          SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
*************** xlog_redo(XLogRecPtr lsn, XLogRecord *re
*** 8575,8581 ****
          if (InArchiveRecovery &&
              !XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
              XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
!             ereport(ERROR,
                      (errmsg("online backup was canceled, recovery cannot continue")));

          /*
--- 8596,8602 ----
          if (InArchiveRecovery &&
              !XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
              XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
!             ereport(PANIC,
                      (errmsg("online backup was canceled, recovery cannot continue")));

          /*
*************** xlog_redo(XLogRecPtr lsn, XLogRecord *re
*** 8641,8655 ****
          CheckPoint    checkPoint;

          memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
!         /* In an ONLINE checkpoint, treat the counters like NEXTOID */
          if (TransactionIdPrecedes(ShmemVariableCache->nextXid,
                                    checkPoint.nextXid))
              ShmemVariableCache->nextXid = checkPoint.nextXid;
!         if (ShmemVariableCache->nextOid < checkPoint.nextOid)
!         {
!             ShmemVariableCache->nextOid = checkPoint.nextOid;
!             ShmemVariableCache->oidCount = 0;
!         }
          MultiXactAdvanceNextMXact(checkPoint.nextMulti,
                                    checkPoint.nextMultiOffset);
          if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
--- 8662,8678 ----
          CheckPoint    checkPoint;

          memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
!         /* In an ONLINE checkpoint, treat the XID counter as a minimum */
!         LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
          if (TransactionIdPrecedes(ShmemVariableCache->nextXid,
                                    checkPoint.nextXid))
              ShmemVariableCache->nextXid = checkPoint.nextXid;
!         LWLockRelease(XidGenLock);
!         /* ... but still treat OID counter as exact */
!         LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
!         ShmemVariableCache->nextOid = checkPoint.nextOid;
!         ShmemVariableCache->oidCount = 0;
!         LWLockRelease(OidGenLock);
          MultiXactAdvanceNextMXact(checkPoint.nextMulti,
                                    checkPoint.nextMultiOffset);
          if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 8bda474bc712938ce8619e5ba715c12e7ba45deb..09b7311e7bcd82fbdc16d62ffcd8f3214193ab50 100644
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
*************** ProcArrayApplyRecoveryInfo(RunningTransa
*** 654,670 ****
                                running->latestCompletedXid))
          ShmemVariableCache->latestCompletedXid = running->latestCompletedXid;

!     /* nextXid must be beyond any observed xid */
      nextXid = latestObservedXid;
      TransactionIdAdvance(nextXid);
      if (TransactionIdFollows(nextXid, ShmemVariableCache->nextXid))
          ShmemVariableCache->nextXid = nextXid;

-     Assert(TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid));
      Assert(TransactionIdIsValid(ShmemVariableCache->nextXid));

-     LWLockRelease(ProcArrayLock);
-
      KnownAssignedXidsDisplay(trace_recovery(DEBUG3));
      if (standbyState == STANDBY_SNAPSHOT_READY)
          elog(trace_recovery(DEBUG1), "recovery snapshots are now enabled");
--- 654,681 ----
                                running->latestCompletedXid))
          ShmemVariableCache->latestCompletedXid = running->latestCompletedXid;

!     Assert(TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid));
!
!     LWLockRelease(ProcArrayLock);
!
!     /*
!      * ShmemVariableCache->nextXid must be beyond any observed xid.
!      *
!      * We don't expect anyone else to modify nextXid, hence we don't need to
!      * hold a lock while examining it.  We still acquire the lock to modify
!      * it, though.
!      */
      nextXid = latestObservedXid;
      TransactionIdAdvance(nextXid);
      if (TransactionIdFollows(nextXid, ShmemVariableCache->nextXid))
+     {
+         LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
          ShmemVariableCache->nextXid = nextXid;
+         LWLockRelease(XidGenLock);
+     }

      Assert(TransactionIdIsValid(ShmemVariableCache->nextXid));

      KnownAssignedXidsDisplay(trace_recovery(DEBUG3));
      if (standbyState == STANDBY_SNAPSHOT_READY)
          elog(trace_recovery(DEBUG1), "recovery snapshots are now enabled");
*************** GetOldestActiveTransactionId(void)
*** 1690,1695 ****
--- 1701,1713 ----

      LWLockAcquire(ProcArrayLock, LW_SHARED);

+     /*
+      * It's okay to read nextXid without acquiring XidGenLock because (1) we
+      * assume TransactionIds can be read atomically and (2) we don't care if
+      * we get a slightly stale value.  It can't be very stale anyway, because
+      * the LWLockAcquire above will have done any necessary memory
+      * interlocking.
+      */
      oldestRunningXid = ShmemVariableCache->nextXid;

      /*
*************** RecordKnownAssignedTransactionIds(Transa
*** 2609,2615 ****
--- 2627,2635 ----
          /* ShmemVariableCache->nextXid must be beyond any observed xid */
          next_expected_xid = latestObservedXid;
          TransactionIdAdvance(next_expected_xid);
+         LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
          ShmemVariableCache->nextXid = next_expected_xid;
+         LWLockRelease(XidGenLock);
      }
  }

Re: Report: race conditions in WAL replay routines

From

Tom Lane

Date:

05 February 2012, 19:14:52

Simon Riggs <simon@2ndQuadrant.com> writes:
> Please post the patch rather than fixing directly. There's some subtle
> stuff there and it would be best to discuss first.

And here's a proposed patch for not throwing ERROR during replay of DROP
TABLESPACE.  I had originally thought this would be a one-liner
s/ERROR/LOG/, but on inspection destroy_tablespace_directories() really
needs to be changed too, so that it doesn't throw error for unremovable
directories.

            regards, tom lane

diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 8ed6d06bb2370a54349f2a86ee13e08a381eacf2..ed0c9ea3ee5a372c6926f41fc0fc93ed7d80d6cd 100644
*** a/src/backend/commands/tablespace.c
--- b/src/backend/commands/tablespace.c
*************** create_tablespace_directories(const char
*** 626,634 ****
  /*
   * destroy_tablespace_directories
   *
!  * Attempt to remove filesystem infrastructure
   *
!  * 'redo' indicates we are redoing a drop from XLOG; okay if nothing there
   *
   * Returns TRUE if successful, FALSE if some subdirectory is not empty
   */
--- 626,638 ----
  /*
   * destroy_tablespace_directories
   *
!  * Attempt to remove filesystem infrastructure for the tablespace.
   *
!  * 'redo' indicates we are redoing a drop from XLOG; in that case we should
!  * not throw an ERROR for problems, just LOG them.  The worst consequence of
!  * not removing files here would be failure to release some disk space, which
!  * does not justify throwing an error that would require manual intervention
!  * to get the database running again.
   *
   * Returns TRUE if successful, FALSE if some subdirectory is not empty
   */
*************** destroy_tablespace_directories(Oid table
*** 681,686 ****
--- 685,700 ----
              pfree(linkloc_with_version_dir);
              return true;
          }
+         else if (redo)
+         {
+             /* in redo, just log other types of error */
+             ereport(LOG,
+                     (errcode_for_file_access(),
+                      errmsg("could not open directory \"%s\": %m",
+                             linkloc_with_version_dir)));
+             pfree(linkloc_with_version_dir);
+             return false;
+         }
          /* else let ReadDir report the error */
      }

*************** destroy_tablespace_directories(Oid table
*** 704,710 ****

          /* remove empty directory */
          if (rmdir(subfile) < 0)
!             ereport(ERROR,
                      (errcode_for_file_access(),
                       errmsg("could not remove directory \"%s\": %m",
                              subfile)));
--- 718,724 ----

          /* remove empty directory */
          if (rmdir(subfile) < 0)
!             ereport(redo ? LOG : ERROR,
                      (errcode_for_file_access(),
                       errmsg("could not remove directory \"%s\": %m",
                              subfile)));
*************** destroy_tablespace_directories(Oid table
*** 716,738 ****

      /* remove version directory */
      if (rmdir(linkloc_with_version_dir) < 0)
!         ereport(ERROR,
                  (errcode_for_file_access(),
                   errmsg("could not remove directory \"%s\": %m",
                          linkloc_with_version_dir)));

      /*
       * Try to remove the symlink.  We must however deal with the possibility
       * that it's a directory instead of a symlink --- this could happen during
       * WAL replay (see TablespaceCreateDbspace), and it is also the case on
       * Windows where junction points lstat() as directories.
       */
      linkloc = pstrdup(linkloc_with_version_dir);
      get_parent_directory(linkloc);
      if (lstat(linkloc, &st) == 0 && S_ISDIR(st.st_mode))
      {
          if (rmdir(linkloc) < 0)
!             ereport(ERROR,
                      (errcode_for_file_access(),
                       errmsg("could not remove directory \"%s\": %m",
                              linkloc)));
--- 730,759 ----

      /* remove version directory */
      if (rmdir(linkloc_with_version_dir) < 0)
!     {
!         ereport(redo ? LOG : ERROR,
                  (errcode_for_file_access(),
                   errmsg("could not remove directory \"%s\": %m",
                          linkloc_with_version_dir)));
+         pfree(linkloc_with_version_dir);
+         return false;
+     }

      /*
       * Try to remove the symlink.  We must however deal with the possibility
       * that it's a directory instead of a symlink --- this could happen during
       * WAL replay (see TablespaceCreateDbspace), and it is also the case on
       * Windows where junction points lstat() as directories.
+      *
+      * Note: in the redo case, we'll return true if this final step fails;
+      * there's no point in retrying it.
       */
      linkloc = pstrdup(linkloc_with_version_dir);
      get_parent_directory(linkloc);
      if (lstat(linkloc, &st) == 0 && S_ISDIR(st.st_mode))
      {
          if (rmdir(linkloc) < 0)
!             ereport(redo ? LOG : ERROR,
                      (errcode_for_file_access(),
                       errmsg("could not remove directory \"%s\": %m",
                              linkloc)));
*************** destroy_tablespace_directories(Oid table
*** 740,746 ****
      else
      {
          if (unlink(linkloc) < 0)
!             ereport(ERROR,
                      (errcode_for_file_access(),
                       errmsg("could not remove symbolic link \"%s\": %m",
                              linkloc)));
--- 761,767 ----
      else
      {
          if (unlink(linkloc) < 0)
!             ereport(redo ? LOG : ERROR,
                      (errcode_for_file_access(),
                       errmsg("could not remove symbolic link \"%s\": %m",
                              linkloc)));
*************** tblspc_redo(XLogRecPtr lsn, XLogRecord *
*** 1451,1464 ****
          xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);

          /*
!          * If we issued a WAL record for a drop tablespace it is because there
!          * were no files in it at all. That means that no permanent objects
!          * can exist in it at this point.
           *
           * It is possible for standby users to be using this tablespace as a
           * location for their temporary files, so if we fail to remove all
           * files then do conflict processing and try again, if currently
           * enabled.
           */
          if (!destroy_tablespace_directories(xlrec->ts_id, true))
          {
--- 1472,1490 ----
          xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);

          /*
!          * If we issued a WAL record for a drop tablespace it implies that
!          * there were no files in it at all when the DROP was done. That means
!          * that no permanent objects can exist in it at this point.
           *
           * It is possible for standby users to be using this tablespace as a
           * location for their temporary files, so if we fail to remove all
           * files then do conflict processing and try again, if currently
           * enabled.
+          *
+          * Other possible reasons for failure include bollixed file permissions
+          * on a standby server when they were okay on the primary, etc etc.
+          * There's not much we can do about that, so just remove what we can
+          * and press on.
           */
          if (!destroy_tablespace_directories(xlrec->ts_id, true))
          {
*************** tblspc_redo(XLogRecPtr lsn, XLogRecord *
*** 1466,1480 ****

              /*
               * If we did recovery processing then hopefully the backends who
!              * wrote temp files should have cleaned up and exited by now. So
!              * lets recheck before we throw an error. If !process_conflicts
!              * then this will just fail again.
               */
              if (!destroy_tablespace_directories(xlrec->ts_id, true))
!                 ereport(ERROR,
                          (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                           errmsg("tablespace %u is not empty",
!                                 xlrec->ts_id)));
          }
      }
      else
--- 1492,1509 ----

              /*
               * If we did recovery processing then hopefully the backends who
!              * wrote temp files should have cleaned up and exited by now.  So
!              * retry before complaining.  If we fail again, this is just a LOG
!              * condition, because it's not worth throwing an ERROR for (as
!              * that would crash the database and require manual intervention
!              * before we could get past this WAL record on restart).
               */
              if (!destroy_tablespace_directories(xlrec->ts_id, true))
!                 ereport(LOG,
                          (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                           errmsg("tablespace %u is not empty",
!                                 xlrec->ts_id),
!                          errdetail("Continuing even though some files could not be removed.")));
          }
      }
      else

Re: Report: race conditions in WAL replay routines

From

Simon Riggs

Date:

06 February 2012, 05:13:31

On Sun, Feb 5, 2012 at 10:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> Please post the patch rather than fixing directly. There's some subtle
>> stuff there and it would be best to discuss first.
>
> Here's a proposed patch for the issues around unlocked updates of
> shared-memory state.  After going through this I believe that there is
> little risk of any real problems in the current state of the code; this
> is more in the nature of future-proofing against foreseeable changes.
> (One such is that we'd discussed fixing the age() function to work
> during Hot Standby.)  So I suggest applying this to HEAD but not
> back-patching.

All looks very good to me. Agreed.

> Except for one thing.  I realized while looking at the NEXTOID replay
> code that it is completely broken: it only advances
> ShmemVariableCache->nextOid when that's less than the value in the WAL
> record.  So that comparison fails if the OID counter wraps around during
> replay.  I've fixed this in the attached patch by just forcibly
> assigning the new value instead of trying to be smart, and I think
> probably that aspect of it needs to be back-patched.

Ouch! Well spotted.

Suggest fixing that as a separate patch; looks like backpatch to 8.0.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Report: race conditions in WAL replay routines

From

Simon Riggs

Date:

06 February 2012, 05:29:23

On Sun, Feb 5, 2012 at 11:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> Please post the patch rather than fixing directly. There's some subtle
>> stuff there and it would be best to discuss first.
>
> And here's a proposed patch for not throwing ERROR during replay of DROP
> TABLESPACE.  I had originally thought this would be a one-liner
> s/ERROR/LOG/, but on inspection destroy_tablespace_directories() really
> needs to be changed too, so that it doesn't throw error for unremovable
> directories.

Looks good.

The existing errmsg of "tablespace is not empty" doesn't cover all
reasons why tablespace was not removed.

The final message should have
errmsg "tablespace not fully removed"
errhint "you should resolve this manually if it causes further problems"

The errdetail is good.


--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Report: race conditions in WAL replay routines

From

Tom Lane

Date:

06 February 2012, 14:32:38

Simon Riggs <simon@2ndQuadrant.com> writes:
> The existing errmsg of "tablespace is not empty" doesn't cover all
> reasons why tablespace was not removed.

Yeah, in fact that particular statement is really pretty bogus for the
replay case, because as the comment says we know that the tablespace
*is* empty so far as full-fledged database objects are concerned.

> The final message should have
> errmsg "tablespace not fully removed"
> errhint "you should resolve this manually if it causes further problems"

Planning to go with this:
                        errmsg("directories for tablespace %u could not be removed",
xlrec->ts_id),                       errhint("You can remove the directories manually if necessary.")));
 

I thought about an errdetail, but the preceding LOG entries from
destroy_tablespace_directories should provide the details reasonably
well.
        regards, tom lane

Re: Report: race conditions in WAL replay routines

From

Simon Riggs

Date:

06 February 2012, 15:44:39

On Mon, Feb 6, 2012 at 6:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> The existing errmsg of "tablespace is not empty" doesn't cover all
>> reasons why tablespace was not removed.
>
> Yeah, in fact that particular statement is really pretty bogus for the
> replay case, because as the comment says we know that the tablespace
> *is* empty so far as full-fledged database objects are concerned.
>
>> The final message should have
>> errmsg "tablespace not fully removed"
>> errhint "you should resolve this manually if it causes further problems"
>
> Planning to go with this:
>
>                         errmsg("directories for tablespace %u could not be removed",
>                                xlrec->ts_id),
>                         errhint("You can remove the directories manually if necessary.")));
>
> I thought about an errdetail, but the preceding LOG entries from
> destroy_tablespace_directories should provide the details reasonably
> well.

Looks good.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services