Thread: BUG #4879: bgwriter fails to fsync the file in recovery mode

BUG #4879: bgwriter fails to fsync the file in recovery mode

From
"Fujii Masao"
Date:
The following bug has been logged online:

Bug reference:      4879
Logged by:          Fujii Masao
Email address:      masao.fujii@gmail.com
PostgreSQL version: 8.4dev
Operating system:   RHEL5.1 x86_64
Description:        bgwriter fails to fsync the file in recovery mode
Details:

The restartpoint by bgwriter in recovery mode caused the following error.

    ERROR:  could not fsync segment 0 of relation base/11564/16422_fsm: No
such file or directory

The following procedure can reproduce this error.

(1) create warm-standby environment
(2) execute "pgbench -i -s10"
(3) execute the following SQLs

    TRUNCATE pgbench_accounts ;
    TRUNCATE pgbench_branches ;
    TRUNCATE pgbench_history ;
    TRUNCATE pgbench_tellers ;
    CHECKPOINT ;
    SELECT pg_switch_xlog();

(4) wait a minute, then the upcoming restartpoint would cause the error
    in the standby server.


Whether this error happens or not depends on the timing of operations.
So, you might need to repeat the procedure (2) and (3) in order to
reproduce the error.

I suspect that the cause of this error is the race condition between
file deletion by startup process and fsync by bgwriter: TRUNCATE xlog
record immediately deletes the corresponding file, while it might be
scheduled to be fsynced by bgwriter. We should leave the actual file
deletion to bgwriter instead of startup process, like normal mode?

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 12:55 +0000, Fujii Masao wrote:
> The following bug has been logged online:
>
> Bug reference:      4879
> Logged by:          Fujii Masao
> Email address:      masao.fujii@gmail.com
> PostgreSQL version: 8.4dev
> Operating system:   RHEL5.1 x86_64
> Description:        bgwriter fails to fsync the file in recovery mode
> Details:

Looking at it now.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2009-06-25 at 12:55 +0000, Fujii Masao wrote:
>> The following bug has been logged online:
>>
>> Bug reference:      4879
>> Logged by:          Fujii Masao
>> Email address:      masao.fujii@gmail.com
>> PostgreSQL version: 8.4dev
>> Operating system:   RHEL5.1 x86_64
>> Description:        bgwriter fails to fsync the file in recovery mode
>> Details:
>
> Looking at it now.

Thanks.

>> I suspect that the cause of this error is the race condition between
>> file deletion by startup process and fsync by bgwriter: TRUNCATE xlog
>> record immediately deletes the corresponding file, while it might be
>> scheduled to be fsynced by bgwriter. We should leave the actual file
>> deletion to bgwriter instead of startup process, like normal mode?

I think the real problem is this in mdunlink():

>     /* Register request to unlink first segment later */
>     if (!isRedo && forkNum == MAIN_FORKNUM)
>         register_unlink(rnode);

When we replay the unlink of the relation, we don't te bgwriter about
it. Normally we do, so bgwriter knows that if the fsync() fails with
ENOENT, it's ok since the file was deleted.

It's tempting to just remove the "!isRedo" condition, but then we have
another problem: if bgwriter hasn't been started yet, and the shmem
queue is full, we get stuck in register_unlink() trying to send the
message and failing.

In archive recovery, we always start bgwriter at the beginning of WAL
replay. In crash recovery, we don't start bgwriter until the end of wAL
replay. So we could change the "!isRedo" condition to
"!InArchiveRecovery". It's not a very clean solution, but it's simple.

Hmm, what happens when the startup process performs a write, and
bgwriter is not running? Do the fsync requests queue up in the shmem
queue until the end of recovery when bgwriter is launched? I guess I'll
have to try it out...

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> Hmm, what happens when the startup process performs a write, and
> bgwriter is not running? Do the fsync requests queue up in the shmem
> queue until the end of recovery when bgwriter is launched? I guess I'll
> have to try it out...

Oh dear, doesn't look good. The startup process has a pendingOpsTable of
its own. bgwriter won't fsync() files that the startup process has
written itself. That needs to be fixed, or you can lose data when an
archive recovery crashes after a restartpoint.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> In archive recovery, we always start bgwriter at the beginning of WAL
> replay. In crash recovery, we don't start bgwriter until the end of wAL
> replay. So we could change the "!isRedo" condition to
> "!InArchiveRecovery". It's not a very clean solution, but it's simple.

This is probably what is needed.  We need to look around for other tests
of "in redo" that have been obsoleted by the change in bgwriter
behavior.

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 12:55 +0000, Fujii Masao wrote:

> The restartpoint by bgwriter in recovery mode caused the following error.
>
>     ERROR:  could not fsync segment 0 of relation base/11564/16422_fsm: No
> such file or directory

I think I see a related bug also.

register_dirty_segment() run by Startup process doesn't differentiate
correctly whether the bgwriter is active. md.c line 1194. So it will
continue to register fsync requests into its own private pendingOpsTable
when it should be forwarding them to bgwriter. When bgwriter runs
mdsync() the contents of startup's pendingOpsTable will be ignored.

That looks to me to be a serious bug.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 17:35 +0300, Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
> > Hmm, what happens when the startup process performs a write, and
> > bgwriter is not running? Do the fsync requests queue up in the shmem
> > queue until the end of recovery when bgwriter is launched? I guess I'll
> > have to try it out...
>
> Oh dear, doesn't look good. The startup process has a pendingOpsTable of
> its own. bgwriter won't fsync() files that the startup process has
> written itself. That needs to be fixed, or you can lose data when an
> archive recovery crashes after a restartpoint.

Yes, that's what I see also. Patch attached.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Attachment

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> In archive recovery, we always start bgwriter at the beginning of WAL
>> replay. In crash recovery, we don't start bgwriter until the end of wAL
>> replay. So we could change the "!isRedo" condition to
>> "!InArchiveRecovery". It's not a very clean solution, but it's simple.
>
> This is probably what is needed.  We need to look around for other tests
> of "in redo" that have been obsoleted by the change in bgwriter
> behavior.

We have another problem with the end-of-recovery checkpoint. When the
startup process does the checkpoint, it won't know to perform the
pending fsyncs() the bgwriter has absorbed.

A short fix would be to have bgwriter do the shutdown checkpoint instead
in archive recovery. I don't recall if there was a reason it wasn't
coded like that to begin with, though.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Heikki Linnakangas wrote:
>> Hmm, what happens when the startup process performs a write, and
>> bgwriter is not running? Do the fsync requests queue up in the shmem
>> queue until the end of recovery when bgwriter is launched? I guess I'll
>> have to try it out...

> Oh dear, doesn't look good. The startup process has a pendingOpsTable of
> its own. bgwriter won't fsync() files that the startup process has
> written itself. That needs to be fixed, or you can lose data when an
> archive recovery crashes after a restartpoint.

Ouch.  I'm beginning to think that the best thing is to temporarily
revert the change that made bgwriter active during recovery.  It's
obviously not been adequately thought through or tested.

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 17:02 +0300, Heikki Linnakangas wrote:

> I think the real problem is this in mdunlink():
>
> >     /* Register request to unlink first segment later */
> >     if (!isRedo && forkNum == MAIN_FORKNUM)
> >         register_unlink(rnode);
>
> When we replay the unlink of the relation, we don't te bgwriter about
> it. Normally we do, so bgwriter knows that if the fsync() fails with
> ENOENT, it's ok since the file was deleted.
>
> It's tempting to just remove the "!isRedo" condition, but then we have
> another problem: if bgwriter hasn't been started yet, and the shmem
> queue is full, we get stuck in register_unlink() trying to send the
> message and failing.
>
> In archive recovery, we always start bgwriter at the beginning of WAL
> replay. In crash recovery, we don't start bgwriter until the end of wAL
> replay. So we could change the "!isRedo" condition to
> "!InArchiveRecovery". It's not a very clean solution, but it's simple.

That seems to work for me, though I have some doubts as to the way two
phase commit is coded. 2PC seems to assume that if a file still exists
we must be in recovery and its OK to ignore.

Clean? We've changed the conditions under which the unlink needs to be
registered and !InArchiveRecovery defines the changed conditions
precisely.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 11:31 -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> > Heikki Linnakangas wrote:
> >> Hmm, what happens when the startup process performs a write, and
> >> bgwriter is not running? Do the fsync requests queue up in the shmem
> >> queue until the end of recovery when bgwriter is launched? I guess I'll
> >> have to try it out...
>
> > Oh dear, doesn't look good. The startup process has a pendingOpsTable of
> > its own. bgwriter won't fsync() files that the startup process has
> > written itself. That needs to be fixed, or you can lose data when an
> > archive recovery crashes after a restartpoint.
>
> Ouch.  I'm beginning to think that the best thing is to temporarily
> revert the change that made bgwriter active during recovery.

That seems the safest course, to avoid derailing the schedule.

Please define "temporarily". Will it be re-enabled in 8.4.1, assuming we
find an acceptable fix?

> It's
> obviously not been adequately thought through or tested.

Hmmm...

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Heikki Linnakangas wrote:
>>> Hmm, what happens when the startup process performs a write, and
>>> bgwriter is not running? Do the fsync requests queue up in the shmem
>>> queue until the end of recovery when bgwriter is launched? I guess I'll
>>> have to try it out...
>
>> Oh dear, doesn't look good. The startup process has a pendingOpsTable of
>> its own. bgwriter won't fsync() files that the startup process has
>> written itself. That needs to be fixed, or you can lose data when an
>> archive recovery crashes after a restartpoint.
>
> Ouch.  I'm beginning to think that the best thing is to temporarily
> revert the change that made bgwriter active during recovery.  It's
> obviously not been adequately thought through or tested.

That was my first thought too, but unfortunately we now rely on bgwriter
to perform restartpoints :-(.

I came up with the attached patch, which includes Simon's patch to have
all fsync requests forwarded to bgwriter during archive recovery. To fix
the startup checkpoint issue, startup process requests a forced
restartpoint, which will flush any fsync requests bgwriter has
accumulated, before doing the actual checkpoint in the startup process.
This is completely untested still, but does anyone immediately see any
more problems?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4fd9d41..5114664 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5156,6 +5156,7 @@ StartupXLOG(void)
     XLogRecord *record;
     uint32        freespace;
     TransactionId oldestActiveXID;
+    bool        bgwriterLaunched = false;

     XLogCtl->SharedRecoveryInProgress = true;

@@ -5472,7 +5473,11 @@ StartupXLOG(void)
              * process in addition to postmaster!
              */
             if (InArchiveRecovery && IsUnderPostmaster)
+            {
+                SetForwardFsyncRequests();
                 SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
+                bgwriterLaunched = true;
+            }

             /*
              * main redo apply loop
@@ -5742,7 +5747,16 @@ StartupXLOG(void)
          * assigning a new TLI, using a shutdown checkpoint allows us to have
          * the rule that TLI only changes in shutdown checkpoints, which
          * allows some extra error checking in xlog_redo.
+         *
+         * If bgwriter is active, we can't do the necessary fsyncs in the
+         * startup process because bgwriter has accumulated fsync requests
+         * into its private memory. So we request a last forced restart point,
+         * which will flush them to disk, before performing the actual
+         * checkpoint.
          */
+        if (bgwriterLaunched)
+            RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+                              CHECKPOINT_WAIT);
         CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);

         /*
@@ -6219,27 +6233,11 @@ CreateCheckPoint(int flags)

     /*
      * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
-     * During normal operation, bgwriter is the only process that creates
-     * checkpoints, but at the end of archive recovery, the bgwriter can be
-     * busy creating a restartpoint while the startup process tries to perform
-     * the startup checkpoint.
+     * (This is just pro forma, since in the present system structure there is
+     * only one process that is allowed to issue checkpoints at any given
+     * time.)
      */
-    if (!LWLockConditionalAcquire(CheckpointLock, LW_EXCLUSIVE))
-    {
-        Assert(InRecovery);
-
-        /*
-         * A restartpoint is in progress. Wait until it finishes. This can
-         * cause an extra restartpoint to be performed, but that's OK because
-         * we're just about to perform a checkpoint anyway. Flushing the
-         * buffers in this restartpoint can take some time, but that time is
-         * saved from the upcoming checkpoint so the net effect is zero.
-         */
-        ereport(DEBUG2, (errmsg("hurrying in-progress restartpoint")));
-        RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT);
-
-        LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
-    }
+    LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);

     /*
      * Prepare to accumulate statistics.
@@ -6660,8 +6658,9 @@ CreateRestartPoint(int flags)
      * restartpoint. It's assumed that flushing the buffers will do that as a
      * side-effect.
      */
-    if (XLogRecPtrIsInvalid(lastCheckPointRecPtr) ||
-        XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo))
+    if (!(flags & CHECKPOINT_FORCE) &&
+        (XLogRecPtrIsInvalid(lastCheckPointRecPtr) ||
+         XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo)))
     {
         XLogRecPtr    InvalidXLogRecPtr = {0, 0};

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index d42e86a..b01b2ab 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -204,6 +204,21 @@ mdinit(void)
 }

 /*
+ * In archive recovery, we rely on bgwriter to do fsyncs(), but we don't
+ * know that we do archive recovery at process startup when pendingOpsTable
+ * has already been created. Calling this function drops pendingOpsTable
+ * and causes any subsequent requests to be forwarded to bgwriter.
+ */
+void
+SetForwardFsyncRequests(void)
+{
+    /* Perform any pending ops we may have queued up */
+    if (pendingOpsTable)
+        mdsync();
+    pendingOpsTable = NULL;
+}
+
+/*
  *    mdexists() -- Does the physical file exist?
  *
  * Note: this will return true for lingering files, with pending deletions
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 7556b14..fd79d7b 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -109,6 +109,7 @@ extern void mdpreckpt(void);
 extern void mdsync(void);
 extern void mdpostckpt(void);

+extern void SetForwardFsyncRequests(void);
 extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
                      BlockNumber segno);
 extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 18:12 +0300, Heikki Linnakangas wrote:

> A short fix would be to have bgwriter do the shutdown checkpoint instead
> in archive recovery. I don't recall if there was a reason it wasn't
> coded like that to begin with, though.

I think the problem was that it was coded both ways at different stages
of patch evolution. The decision to retain the shutdown checkpoint by
the startup process was taken in January, IIRC.

Having startup process issue this

if (InArchiveRecovery)
    RequestCheckpoint(CHECKPOINT_IS_SHUTDOWN |
                CHECKPOINT_FORCE |
                CHECKPOINT_WAIT)
else

should make the startup process wait for bgwriter to complete the
checkpoint. But we need to set LocalRecoveryInProgress appropriately
also.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 18:57 +0300, Heikki Linnakangas wrote:

> I came up with the attached patch, which includes Simon's patch to
> have all fsync requests forwarded to bgwriter during archive recovery.
> To fix the startup checkpoint issue, startup process requests a forced
> restartpoint, which will flush any fsync requests bgwriter has
> accumulated, before doing the actual checkpoint in the startup
> process.

That looks fine.

> This is completely untested still, but does anyone immediately see any
> more problems?

Nothing seen.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2009-06-25 at 18:12 +0300, Heikki Linnakangas wrote:
>
>> A short fix would be to have bgwriter do the shutdown checkpoint instead
>> in archive recovery. I don't recall if there was a reason it wasn't
>> coded like that to begin with, though.
>
> I think the problem was that it was coded both ways at different stages
> of patch evolution. The decision to retain the shutdown checkpoint by
> the startup process was taken in January, IIRC.
>
> Having startup process issue this
>
> if (InArchiveRecovery)
>     RequestCheckpoint(CHECKPOINT_IS_SHUTDOWN |
>                 CHECKPOINT_FORCE |
>                 CHECKPOINT_WAIT)
> else
>
> should make the startup process wait for bgwriter to complete the
> checkpoint. But we need to set LocalRecoveryInProgress appropriately
> also.

Yeah, the trouble is to tell bgwriter that it's OK for it to create the
checkpoint, which includes writing a WAL record, while still keeping the
system "in-recovery" for all other purposes. While we could just relax
the checks, that seems like a very important safeguard.

(I posted in the other mail to do a restartpoint before the startup
process does the checkpoint)

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 19:20 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> >

> But we need to set LocalRecoveryInProgress appropriately
> > also.
>
> Yeah, the trouble is to tell bgwriter that it's OK for it to create the
> checkpoint, which includes writing a WAL record, while still keeping the
> system "in-recovery" for all other purposes. While we could just relax
> the checks, that seems like a very important safeguard.
>
> (I posted in the other mail to do a restartpoint before the startup
> process does the checkpoint)

I think we're in strong agreement, just thinking and writing emails in
realtime gives a few race conditions... your patch covers everything I
mentioned above.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> This is completely untested still, but does anyone immediately see any
> more problems?

Isn't SetForwardFsyncRequests going to cause the final mdsync to fail?

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> This is completely untested still, but does anyone immediately see any
>> more problems?
>
> Isn't SetForwardFsyncRequests going to cause the final mdsync to fail?

Yes, I just noticed that myself, testing it for the first time. That
check needs to be suppressed in the startup checkpoint.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 12:33 -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> > This is completely untested still, but does anyone immediately see any
> > more problems?
>
> Isn't SetForwardFsyncRequests going to cause the final mdsync to fail?

Yes it will, as it is now.

I didn't read the parts of the patch that were described as being from
my patch. Heikki, please tell me if you change things... it means I have
to be both patch author and reviewer. Perhaps just separate patches, so
we can review them independently.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> Isn't SetForwardFsyncRequests going to cause the final mdsync to fail?

> Yes, I just noticed that myself, testing it for the first time. That
> check needs to be suppressed in the startup checkpoint.

Ugh.  Better would be to move the responsibility for the final
checkpoint to the bgwriter.

However, this whole line of thought still seems to be leading towards
"fix the patch", and I don't have much confidence that we can have a
trustworthy fix today.  What about "revert the patch"?

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 19:37 +0300, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> >> This is completely untested still, but does anyone immediately see any
> >> more problems?
> >
> > Isn't SetForwardFsyncRequests going to cause the final mdsync to fail?
>
> Yes, I just noticed that myself, testing it for the first time. That
> check needs to be suppressed in the startup checkpoint.

Better to do it as it was in my patch. We can turn off/on then.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 12:43 -0400, Tom Lane wrote:

> What about "revert the patch"?

That's probably just as dangerous.

Much easier to just avoid that state altogether, if you must.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2009-06-25 at 12:43 -0400, Tom Lane wrote:
>
>> What about "revert the patch"?
>
> That's probably just as dangerous.

I don't feel comfortable either reverting such a big patch at last
minute. Would need a fair amount of testing to make sure that the
revertion is correct and that no patches committed after the patch are
now broken.

I'm testing the attached patch at the moment. It's the same as the
previous one, with the elog() in mdsync() issue fixed.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4fd9d41..274369f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5156,6 +5156,7 @@ StartupXLOG(void)
     XLogRecord *record;
     uint32        freespace;
     TransactionId oldestActiveXID;
+    bool        bgwriterLaunched = false;

     XLogCtl->SharedRecoveryInProgress = true;

@@ -5472,7 +5473,11 @@ StartupXLOG(void)
              * process in addition to postmaster!
              */
             if (InArchiveRecovery && IsUnderPostmaster)
+            {
+                SetForwardFsyncRequests();
                 SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
+                bgwriterLaunched = true;
+            }

             /*
              * main redo apply loop
@@ -5742,7 +5747,17 @@ StartupXLOG(void)
          * assigning a new TLI, using a shutdown checkpoint allows us to have
          * the rule that TLI only changes in shutdown checkpoints, which
          * allows some extra error checking in xlog_redo.
+         *
+         * If bgwriter is active, we can't do the necessary fsyncs in the
+         * startup process because bgwriter has accumulated fsync requests
+         * into its private memory. So we request a last forced restart point,
+         * which will flush them to disk, before performing the actual
+         * checkpoint. It's safe to flush the buffers first, because no other
+         * process can do (WAL-logged) changes to data pages in between.
          */
+        if (bgwriterLaunched)
+            RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+                              CHECKPOINT_WAIT);
         CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);

         /*
@@ -6219,27 +6234,11 @@ CreateCheckPoint(int flags)

     /*
      * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
-     * During normal operation, bgwriter is the only process that creates
-     * checkpoints, but at the end of archive recovery, the bgwriter can be
-     * busy creating a restartpoint while the startup process tries to perform
-     * the startup checkpoint.
+     * (This is just pro forma, since in the present system structure there is
+     * only one process that is allowed to issue checkpoints at any given
+     * time.)
      */
-    if (!LWLockConditionalAcquire(CheckpointLock, LW_EXCLUSIVE))
-    {
-        Assert(InRecovery);
-
-        /*
-         * A restartpoint is in progress. Wait until it finishes. This can
-         * cause an extra restartpoint to be performed, but that's OK because
-         * we're just about to perform a checkpoint anyway. Flushing the
-         * buffers in this restartpoint can take some time, but that time is
-         * saved from the upcoming checkpoint so the net effect is zero.
-         */
-        ereport(DEBUG2, (errmsg("hurrying in-progress restartpoint")));
-        RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT);
-
-        LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
-    }
+    LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);

     /*
      * Prepare to accumulate statistics.
@@ -6660,8 +6659,9 @@ CreateRestartPoint(int flags)
      * restartpoint. It's assumed that flushing the buffers will do that as a
      * side-effect.
      */
-    if (XLogRecPtrIsInvalid(lastCheckPointRecPtr) ||
-        XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo))
+    if (!(flags & CHECKPOINT_FORCE) &&
+        (XLogRecPtrIsInvalid(lastCheckPointRecPtr) ||
+         XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo)))
     {
         XLogRecPtr    InvalidXLogRecPtr = {0, 0};

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index d42e86a..56824d5 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -204,6 +204,21 @@ mdinit(void)
 }

 /*
+ * In archive recovery, we rely on bgwriter to do fsyncs(), but we don't
+ * know that we do archive recovery at process startup when pendingOpsTable
+ * has already been created. Calling this function drops pendingOpsTable
+ * and causes any subsequent requests to be forwarded to bgwriter.
+ */
+void
+SetForwardFsyncRequests(void)
+{
+    /* Perform any pending ops we may have queued up */
+    if (pendingOpsTable)
+        mdsync();
+    pendingOpsTable = NULL;
+}
+
+/*
  *    mdexists() -- Does the physical file exist?
  *
  * Note: this will return true for lingering files, with pending deletions
@@ -908,9 +923,18 @@ mdsync(void)
     /*
      * This is only called during checkpoints, and checkpoints should only
      * occur in processes that have created a pendingOpsTable.
+     *
+     * Startup process performing a startup checkpoint after archive recovery
+     * is an exception. It has no pendingOpsTable, but that's OK because it
+     * has requested bgwriter to perform a restartpoint before the checkpoint
+     * which does mdsync() instead.
      */
     if (!pendingOpsTable)
+    {
+        if (InRecovery)
+            return;
         elog(ERROR, "cannot sync without a pendingOpsTable");
+    }

     /*
      * If we are in the bgwriter, the sync had better include all fsync
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 7556b14..fd79d7b 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -109,6 +109,7 @@ extern void mdpreckpt(void);
 extern void mdsync(void);
 extern void mdpostckpt(void);

+extern void SetForwardFsyncRequests(void);
 extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
                      BlockNumber segno);
 extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> It's tempting to just remove the "!isRedo" condition, but then we have
> another problem: if bgwriter hasn't been started yet, and the shmem
> queue is full, we get stuck in register_unlink() trying to send the
> message and failing.

> In archive recovery, we always start bgwriter at the beginning of WAL
> replay. In crash recovery, we don't start bgwriter until the end of wAL
> replay. So we could change the "!isRedo" condition to
> "!InArchiveRecovery". It's not a very clean solution, but it's simple.

I looked through the callers of smgrdounlink(), and found that
FinishPreparedTransaction passes isRedo = true.  I'm not sure at the
moment what the reasoning behind that was, but it does seem to mean that
checking InArchiveRecovery instead of isRedo may not be quite right.

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Simon Riggs wrote:
>> On Thu, 2009-06-25 at 12:43 -0400, Tom Lane wrote:
>>> What about "revert the patch"?
>>
>> That's probably just as dangerous.

> I don't feel comfortable either reverting such a big patch at last
> minute.

Yeah, I'm not very happy with that either.  However, I've still got no
confidence in anything proposed so far.

> I'm testing the attached patch at the moment. It's the same as the
> previous one, with the elog() in mdsync() issue fixed.

This seems like a kluge on top of a hack.  Can't we have the bgwriter
do the final checkpoint instead?

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 13:25 -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> > It's tempting to just remove the "!isRedo" condition, but then we have
> > another problem: if bgwriter hasn't been started yet, and the shmem
> > queue is full, we get stuck in register_unlink() trying to send the
> > message and failing.
>
> > In archive recovery, we always start bgwriter at the beginning of WAL
> > replay. In crash recovery, we don't start bgwriter until the end of wAL
> > replay. So we could change the "!isRedo" condition to
> > "!InArchiveRecovery". It's not a very clean solution, but it's simple.
>
> I looked through the callers of smgrdounlink(), and found that
> FinishPreparedTransaction passes isRedo = true.  I'm not sure at the
> moment what the reasoning behind that was, but it does seem to mean that
> checking InArchiveRecovery instead of isRedo may not be quite right.

I think that's because FinishPreparedTransaction() implicitly assumes
that if a file still exists we are in recovery. I can't comment on
whether that is necessarily true for some reason, but it doesn't sound
like it is a safe assumption. I would guess it's using isRedo as an
implicit override rather than using the correct meaning of the variable.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2009-06-25 at 13:25 -0400, Tom Lane wrote:
>> I looked through the callers of smgrdounlink(), and found that
>> FinishPreparedTransaction passes isRedo = true.  I'm not sure at the
>> moment what the reasoning behind that was, but it does seem to mean that
>> checking InArchiveRecovery instead of isRedo may not be quite right.
>
> I think that's because FinishPreparedTransaction() implicitly assumes
> that if a file still exists we are in recovery. I can't comment on
> whether that is necessarily true for some reason, but it doesn't sound
> like it is a safe assumption. I would guess it's using isRedo as an
> implicit override rather than using the correct meaning of the variable.

It looks like a copy-pasto. In 8.3 it used to be:

>         for (i = 0; i < hdr->ncommitrels; i++)
>             smgrdounlink(smgropen(commitrels[i]), false, false);

and now it's:

>         for (fork = 0; fork <= MAX_FORKNUM; fork++)
>         {
>             if (smgrexists(srel, fork))
>             {
>                 XLogDropRelation(delrels[i], fork);
>                 smgrdounlink(srel, fork, false, true);
>             }
>         }

That clearly assumes that we're in recovery, hence the XLogDropRelation
call, but FinishPreparedTransaction is never called in recovery.

I don't think this is related to the recovery bug we have at hand. Still
needs to be fixed, though it's not that urgent. AFAICS it causes no
other ill effect except that we don't complain if the file doesn't
exist, and we don't leave it lingering like we do for other files to
avoid the OID wraparound issue. I'll fix this as a separate patch.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Simon Riggs wrote:
>>> On Thu, 2009-06-25 at 12:43 -0400, Tom Lane wrote:
>>>> What about "revert the patch"?
>>> That's probably just as dangerous.
>
>> I don't feel comfortable either reverting such a big patch at last
>> minute.
>
> Yeah, I'm not very happy with that either.  However, I've still got no
> confidence in anything proposed so far.
>
>> I'm testing the attached patch at the moment. It's the same as the
>> previous one, with the elog() in mdsync() issue fixed.
>
> This seems like a kluge on top of a hack.  Can't we have the bgwriter
> do the final checkpoint instead?

Here's a patch taking that approach, and I think it's better than the
previous  one. I was afraid we would lose robustness if we have to set
the shared state as "out of recovery" before requesting the checkpoint,
but we can use the same trick we were using in startup process and set
LocalRecoveryInProgress=false before setting the shared variable. I
introduced a new CHECKPOINT_IS_STARTUP flag, which is otherwise
identical to CHECKPOINT_IS_SHUTDOWN, but in a startup checkpoint
CreateCheckPoint() excpects to be called while recovery is still active,
and sets LocalRecoveryInProgress=false. It also tells bgwriter that it
needs to do a checkpoint instead of a restartpoint, even though recovery
is still in progress.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4fd9d41..18c47aa 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5156,6 +5156,7 @@ StartupXLOG(void)
     XLogRecord *record;
     uint32        freespace;
     TransactionId oldestActiveXID;
+    bool        bgwriterLaunched;

     XLogCtl->SharedRecoveryInProgress = true;

@@ -5472,7 +5473,11 @@ StartupXLOG(void)
              * process in addition to postmaster!
              */
             if (InArchiveRecovery && IsUnderPostmaster)
+            {
+                SetForwardFsyncRequests();
                 SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
+                bgwriterLaunched = true;
+            }

             /*
              * main redo apply loop
@@ -5709,12 +5714,6 @@ StartupXLOG(void)
     /* Pre-scan prepared transactions to find out the range of XIDs present */
     oldestActiveXID = PrescanPreparedTransactions();

-    /*
-     * Allow writing WAL for us, so that we can create a checkpoint record.
-     * But not yet for other backends!
-     */
-    LocalRecoveryInProgress = false;
-
     if (InRecovery)
     {
         int            rmid;
@@ -5743,7 +5742,11 @@ StartupXLOG(void)
          * the rule that TLI only changes in shutdown checkpoints, which
          * allows some extra error checking in xlog_redo.
          */
-        CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+        if (bgwriterLaunched)
+            RequestCheckpoint(CHECKPOINT_IS_STARTUP | CHECKPOINT_IMMEDIATE |
+                              CHECKPOINT_WAIT);
+        else
+            CreateCheckPoint(CHECKPOINT_IS_STARTUP | CHECKPOINT_IMMEDIATE);

         /*
          * And finally, execute the recovery_end_command, if any.
@@ -5806,7 +5809,7 @@ StartupXLOG(void)
     }

     /*
-     * All done. Allow others to write WAL.
+     * All done. Allow backends to write WAL.
      */
     XLogCtl->SharedRecoveryInProgress = false;
 }
@@ -6123,12 +6126,13 @@ LogCheckpointStart(int flags, bool restartpoint)
      * the main message, but what about all the flags?
      */
     if (restartpoint)
-        msg = "restartpoint starting:%s%s%s%s%s%s";
+        msg = "restartpoint starting:%s%s%s%s%s%s%s";
     else
-        msg = "checkpoint starting:%s%s%s%s%s%s";
+        msg = "checkpoint starting:%s%s%s%s%s%s%s";

     elog(LOG, msg,
          (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
+         (flags & CHECKPOINT_IS_STARTUP) ? " startup" : "",
          (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
          (flags & CHECKPOINT_FORCE) ? " force" : "",
          (flags & CHECKPOINT_WAIT) ? " wait" : "",
@@ -6190,10 +6194,12 @@ LogCheckpointEnd(bool restartpoint)
  *
  * flags is a bitwise OR of the following:
  *    CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
+ *    CHECKPOINT_IS_STARTUP: checkpoint is for database startup.
  *    CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
  *        ignoring checkpoint_completion_target parameter.
  *    CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occured
- *        since the last one (implied by CHECKPOINT_IS_SHUTDOWN).
+ *        since the last one (implied by CHECKPOINT_IS_SHUTDOWN and
+ *        CHECKPOINT_IS_STARTUP).
  *
  * Note: flags contains other bits, of interest here only for logging purposes.
  * In particular note that this routine is synchronous and does not pay
@@ -6202,7 +6208,7 @@ LogCheckpointEnd(bool restartpoint)
 void
 CreateCheckPoint(int flags)
 {
-    bool        shutdown = (flags & CHECKPOINT_IS_SHUTDOWN) != 0;
+    bool        shutdown;
     CheckPoint    checkPoint;
     XLogRecPtr    recptr;
     XLogCtlInsert *Insert = &XLogCtl->Insert;
@@ -6213,35 +6219,40 @@ CreateCheckPoint(int flags)
     TransactionId *inCommitXids;
     int            nInCommit;

-    /* shouldn't happen */
-    if (RecoveryInProgress())
-        elog(ERROR, "can't create a checkpoint during recovery");
+    /*
+     * A startup checkpoint is really a shutdown checkpoint, just issued at
+     * a different time.
+     */
+    shutdown = (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IS_STARTUP)) != 0;

     /*
-     * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
-     * During normal operation, bgwriter is the only process that creates
-     * checkpoints, but at the end of archive recovery, the bgwriter can be
-     * busy creating a restartpoint while the startup process tries to perform
-     * the startup checkpoint.
+     * A startup checkpoint is created before anyone else is allowed to
+     * write WAL. To allow us to write the checkpoint record, set
+     * LocalRecoveryInProgress to false. This lets us write WAL, but others
+     * are still not allowed to do so.
      */
-    if (!LWLockConditionalAcquire(CheckpointLock, LW_EXCLUSIVE))
+    if (flags & CHECKPOINT_IS_STARTUP)
     {
-        Assert(InRecovery);
-
-        /*
-         * A restartpoint is in progress. Wait until it finishes. This can
-         * cause an extra restartpoint to be performed, but that's OK because
-         * we're just about to perform a checkpoint anyway. Flushing the
-         * buffers in this restartpoint can take some time, but that time is
-         * saved from the upcoming checkpoint so the net effect is zero.
-         */
-        ereport(DEBUG2, (errmsg("hurrying in-progress restartpoint")));
-        RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT);
-
-        LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
+        Assert(RecoveryInProgress());
+        LocalRecoveryInProgress = false;
+        InitXLOGAccess();
+    }
+    else
+    {
+        /* shouldn't happen */
+        if (RecoveryInProgress())
+            elog(ERROR, "can't create a checkpoint during recovery");
     }

     /*
+     * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
+     * (This is just pro forma, since in the present system structure there is
+     * only one process that is allowed to issue checkpoints at any given
+     * time.)
+     */
+    LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
+
+    /*
      * Prepare to accumulate statistics.
      *
      * Note: because it is possible for log_checkpoints to change while a
@@ -6298,7 +6309,8 @@ CreateCheckPoint(int flags)
      * the end of the last checkpoint record, and its redo pointer must point
      * to itself.
      */
-    if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_FORCE)) == 0)
+    if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IS_STARTUP |
+                  CHECKPOINT_FORCE)) == 0)
     {
         XLogRecPtr    curInsert;

@@ -6528,7 +6540,7 @@ CreateCheckPoint(int flags)
      * in subtrans.c).    During recovery, though, we mustn't do this because
      * StartupSUBTRANS hasn't been called yet.
      */
-    if (!InRecovery)
+    if (!InRecovery && (flags & CHECKPOINT_IS_STARTUP) == 0)
         TruncateSUBTRANS(GetOldestXmin(true, false));

     /* All real work is done, but log before releasing lock. */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 55dff57..2c42584 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -449,6 +449,13 @@ BackgroundWriterMain(void)
             SpinLockRelease(&bgs->ckpt_lck);

             /*
+             * A startup checkpoint is a real checkpoint that's performed
+             * while we're still in recovery.
+             */
+            if (flags & CHECKPOINT_IS_STARTUP)
+                do_restartpoint = false;
+
+            /*
              * We will warn if (a) too soon since last checkpoint (whatever
              * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
              * since the last checkpoint start.  Note in particular that this
@@ -895,10 +902,12 @@ BgWriterShmemInit(void)
  *
  * flags is a bitwise OR of the following:
  *    CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
+ *    CHECKPOINT_IS_STARTUP: checkpoint is for database startup.
  *    CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
  *        ignoring checkpoint_completion_target parameter.
  *    CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occured
- *        since the last one (implied by CHECKPOINT_IS_SHUTDOWN).
+ *        since the last one (implied by CHECKPOINT_IS_SHUTDOWN and
+ *        CHECKPOINT_IS_STARTUP).
  *    CHECKPOINT_WAIT: wait for completion before returning (otherwise,
  *        just signal bgwriter to do it, and return).
  *    CHECKPOINT_CAUSE_XLOG: checkpoint is requested due to xlog filling.
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index d42e86a..b01b2ab 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -204,6 +204,21 @@ mdinit(void)
 }

 /*
+ * In archive recovery, we rely on bgwriter to do fsyncs(), but we don't
+ * know that we do archive recovery at process startup when pendingOpsTable
+ * has already been created. Calling this function drops pendingOpsTable
+ * and causes any subsequent requests to be forwarded to bgwriter.
+ */
+void
+SetForwardFsyncRequests(void)
+{
+    /* Perform any pending ops we may have queued up */
+    if (pendingOpsTable)
+        mdsync();
+    pendingOpsTable = NULL;
+}
+
+/*
  *    mdexists() -- Does the physical file exist?
  *
  * Note: this will return true for lingering files, with pending deletions
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f8720bb..d206b93 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -166,6 +166,7 @@ extern bool XLOG_DEBUG;
 /* These indicate the cause of a checkpoint request */
 #define CHECKPOINT_CAUSE_XLOG    0x0010    /* XLOG consumption */
 #define CHECKPOINT_CAUSE_TIME    0x0020    /* Elapsed time */
+#define CHECKPOINT_IS_STARTUP    0x0040    /* Checkpoint is for startup */

 /* Checkpoint statistics */
 typedef struct CheckpointStatsData
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 7556b14..fd79d7b 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -109,6 +109,7 @@ extern void mdpreckpt(void);
 extern void mdsync(void);
 extern void mdpostckpt(void);

+extern void SetForwardFsyncRequests(void);
 extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
                      BlockNumber segno);
 extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Here's a patch taking that approach, and I think it's better than the
> previous  one. I was afraid we would lose robustness if we have to set
> the shared state as "out of recovery" before requesting the checkpoint,
> but we can use the same trick we were using in startup process and set
> LocalRecoveryInProgress=false before setting the shared variable. I
> introduced a new CHECKPOINT_IS_STARTUP flag,

This isn't a "startup" checkpoint, it's an "end of recovery" checkpoint,
because we don't do it during normal startup.  Not sure about an equally
concise name based on that, but I don't like IS_STARTUP.

On the other point: are we going to eliminate mdunlink's isRedo
parameter?  Maybe the better thing is to have its callers pass the value
of InArchiveRecovery?

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 14:46 -0400, Tom Lane wrote:

> On the other point: are we going to eliminate mdunlink's isRedo
> parameter?  Maybe the better thing is to have its callers pass the value
> of InArchiveRecovery?

We have one caller that is using a parameter incorrectly. It seems we
should correct the caller rather than change the API and potentially
effect all callers.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Here's a patch taking that approach, and I think it's better than the
>> previous  one. I was afraid we would lose robustness if we have to set
>> the shared state as "out of recovery" before requesting the checkpoint,
>> but we can use the same trick we were using in startup process and set
>> LocalRecoveryInProgress=false before setting the shared variable. I
>> introduced a new CHECKPOINT_IS_STARTUP flag,
>
> This isn't a "startup" checkpoint, it's an "end of recovery" checkpoint,
> because we don't do it during normal startup.  Not sure about an equally
> concise name based on that, but I don't like IS_STARTUP.
>
> On the other point: are we going to eliminate mdunlink's isRedo
> parameter?  Maybe the better thing is to have its callers pass the value
> of InArchiveRecovery?

I think my initial analysis of this bug was bogus. There's nothing wrong
with mdunlink() as it is, it's calling ForgetRelationFsyncRequests()
before it unlinks anything, regardless of the isRedo parameter. In
Fujii-san's scenario, it was just going to the pendingOpsTable of the
startup process and not sent to bgwriter as it should. Setting
pendingOpsTable=NULL when bgwriter is launched will fix that.

I somehow confused register_unlink() and ForgetRelationFsyncRequests()
and thought that we need the register_unlink() call when bgwriter is
active, but we don't.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
While nosing around the problem areas, I think I've found yet another
issue here.  The global bool InRecovery is only maintained correctly
in the startup process, which wasn't a problem before 8.4.  However,
if we are making the bgwriter execute the end-of-recovery checkpoint,
there are multiple places where it is tested that are going to be
executed by bgwriter.  I think (but am not 100% sure) that these
are all the at-risk references:
    XLogFlush
    CheckPointMultiXact
    CreateCheckPoint (2 places)
Heikki's latest patch deals with the tests in CreateCheckPoint (rather
klugily IMO) but not the others.  I think it might be better to fix
things so that InRecovery is maintained correctly in the bgwriter too.

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 21:59 +0300, Heikki Linnakangas wrote:

> I think my initial analysis of this bug was bogus.

Perhaps, but the pendingOpsTable issue is a serious one, so we haven't
wasted our time by fixing it.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> On the other point: are we going to eliminate mdunlink's isRedo
>> parameter?  Maybe the better thing is to have its callers pass the value
>> of InArchiveRecovery?

> I think my initial analysis of this bug was bogus. There's nothing wrong
> with mdunlink() as it is, it's calling ForgetRelationFsyncRequests()
> before it unlinks anything, regardless of the isRedo parameter. In
> Fujii-san's scenario, it was just going to the pendingOpsTable of the
> startup process and not sent to bgwriter as it should. Setting
> pendingOpsTable=NULL when bgwriter is launched will fix that.

+1, I think that's okay.  So to summarize the state of play, it seems
we have these issues:

* need to delete startup process's local pendingOpsTable once bgwriter
is launched, so that requests go to bgwriter instead

* need to push end-of-recovery checkpoint into bgwriter

* need to do something about IsRecovery tests that will now be executed
in bgwriter

* need to fix mistaken code in FinishPreparedTransaction

Anything else?

Are you (Heikki and Simon) in a position to finish these things today?
I know it's starting to get late on your side of the pond.  Core have
already been discussing whether we have to abort the release for this.

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> While nosing around the problem areas, I think I've found yet another
> issue here.  The global bool InRecovery is only maintained correctly
> in the startup process, which wasn't a problem before 8.4.  However,
> if we are making the bgwriter execute the end-of-recovery checkpoint,
> there are multiple places where it is tested that are going to be
> executed by bgwriter.  I think (but am not 100% sure) that these
> are all the at-risk references:
>     XLogFlush
>     CheckPointMultiXact
>     CreateCheckPoint (2 places)
> Heikki's latest patch deals with the tests in CreateCheckPoint (rather
> klugily IMO) but not the others.  I think it might be better to fix
> things so that InRecovery is maintained correctly in the bgwriter too.

We could set InRecovery=true in CreateCheckPoint if it's a startup
checkpoint, and reset it afterwards. I'm not 100% sure it's safe to have
bgwriter running with InRecovery=true at other times. Grepping for
InRecovery doesn't show anything that bgwriter calls, but it feels safer
that way.

Hmm, I see another small issue. We now keep track of the "minimum
recovery point". Whenever a data page is flushed, we set minimum
recovery point to the LSN of the page in XLogFlush(), instead of
fsyncing WAL like we do in normal operation. During the end-of-recovery
checkpoint, however, RecoveryInProgress() returns false, so we don't
update minimum recovery point in XLogFlush(). You're unlikely to be
bitten by that in practice; you would need to crash during the
end-of-recovery checkpoint, and then set the recovery target to an
earlier point. It should be fixed nevertheless.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> +1, I think that's okay.  So to summarize the state of play, it seems
> we have these issues:
>
> * need to delete startup process's local pendingOpsTable once bgwriter
> is launched, so that requests go to bgwriter instead
>
> * need to push end-of-recovery checkpoint into bgwriter
>
> * need to do something about IsRecovery tests that will now be executed
> in bgwriter

Yep, that's all I have in mind.

> * need to fix mistaken code in FinishPreparedTransaction

This one I committed already.

> Are you (Heikki and Simon) in a position to finish these things today?
> I know it's starting to get late on your side of the pond.  Core have
> already been discussing whether we have to abort the release for this.

I'm planning to finish this today still. Assuming no new issues are
found, it won't take long to finish the patch, and then it needs an hour
or two of testing. If there's any new issues, it will have to wait 'till
tomorrow since after a few hours I won't be able to think straight anymore.

I'm not familiar with the release process so I don't have an opinion on
what to do about the release.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Here's a patch taking that approach, and I think it's better than the
>> previous  one. I was afraid we would lose robustness if we have to set
>> the shared state as "out of recovery" before requesting the checkpoint,
>> but we can use the same trick we were using in startup process and set
>> LocalRecoveryInProgress=false before setting the shared variable. I
>> introduced a new CHECKPOINT_IS_STARTUP flag,
>
> This isn't a "startup" checkpoint, it's an "end of recovery" checkpoint,
> because we don't do it during normal startup.  Not sure about an equally
> concise name based on that, but I don't like IS_STARTUP.

Ok, END_OF_RECOVERY works for me.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> ... I think it might be better to fix
>> things so that InRecovery is maintained correctly in the bgwriter too.

> We could set InRecovery=true in CreateCheckPoint if it's a startup
> checkpoint, and reset it afterwards. I'm not 100% sure it's safe to have
> bgwriter running with InRecovery=true at other times. Grepping for
> InRecovery doesn't show anything that bgwriter calls, but it feels safer
> that way.

Actually, my thought was exactly that it would be better if it was set
correctly earlier in the run --- if there ever are any places where it
matters, this way is more likely to be right.  (I'm not convinced that
it doesn't matter today, anyhow --- are we sure these places are not
called in a restartpoint?)

What you suggest might be a good minimum-change solution for right now,
but I think eventually it should work the other way.

> Hmm, I see another small issue. We now keep track of the "minimum
> recovery point". Whenever a data page is flushed, we set minimum
> recovery point to the LSN of the page in XLogFlush(), instead of
> fsyncing WAL like we do in normal operation. During the end-of-recovery
> checkpoint, however, RecoveryInProgress() returns false, so we don't
> update minimum recovery point in XLogFlush(). You're unlikely to be
> bitten by that in practice; you would need to crash during the
> end-of-recovery checkpoint, and then set the recovery target to an
> earlier point. It should be fixed nevertheless.

We would want the end-of-recovery checkpoint to act like it's not in
recovery anymore for this purpose, no?

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
I wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Hmm, I see another small issue. We now keep track of the "minimum
>> recovery point". Whenever a data page is flushed, we set minimum
>> recovery point to the LSN of the page in XLogFlush(), instead of
>> fsyncing WAL like we do in normal operation.

> We would want the end-of-recovery checkpoint to act like it's not in
> recovery anymore for this purpose, no?

Actually, what in the world is the purpose of that code at all?
I can't see a case where it would trigger that wouldn't be a result
of data corruption rather than a real need to advance the min restart
point.

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> I wrote:
>> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>>> Hmm, I see another small issue. We now keep track of the "minimum
>>> recovery point". Whenever a data page is flushed, we set minimum
>>> recovery point to the LSN of the page in XLogFlush(), instead of
>>> fsyncing WAL like we do in normal operation.
>
>> We would want the end-of-recovery checkpoint to act like it's not in
>> recovery anymore for this purpose, no?
>
> Actually, what in the world is the purpose of that code at all?
> I can't see a case where it would trigger that wouldn't be a result
> of data corruption rather than a real need to advance the min restart
> point.

Huh? The only other place where we advance minimum recovery point is
CreateRestartPoint() when we skip the restartpoint. The
UpdateMinRecoveryPoint() call in XLogFlush() is what we rely on.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> Actually, what in the world is the purpose of that code at all?

> Huh? The only other place where we advance minimum recovery point is
> CreateRestartPoint() when we skip the restartpoint. The
> UpdateMinRecoveryPoint() call in XLogFlush() is what we rely on.

I would like an explanation of why minimum recovery point needs to
be advanced at all.  If there's any use for it, it is surely part
of functionality that is not there in 8.4.

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> ... I think it might be better to fix
>>> things so that InRecovery is maintained correctly in the bgwriter too.
>
>> We could set InRecovery=true in CreateCheckPoint if it's a startup
>> checkpoint, and reset it afterwards. I'm not 100% sure it's safe to have
>> bgwriter running with InRecovery=true at other times. Grepping for
>> InRecovery doesn't show anything that bgwriter calls, but it feels safer
>> that way.
>
> Actually, my thought was exactly that it would be better if it was set
> correctly earlier in the run --- if there ever are any places where it
> matters, this way is more likely to be right.

Well, we have RecoveryInProgress() now that answers the question "is
recovery still in progress in the system". InRecovery now means "am I a
process that's performing WAL replay?".

>  (I'm not convinced that
> it doesn't matter today, anyhow --- are we sure these places are not
> called in a restartpoint?)

Hmm, good point, I didn't think of restartpoints. But skimming though
all the references to InRecovery, I can't see any.

>> Hmm, I see another small issue. We now keep track of the "minimum
>> recovery point". Whenever a data page is flushed, we set minimum
>> recovery point to the LSN of the page in XLogFlush(), instead of
>> fsyncing WAL like we do in normal operation. During the end-of-recovery
>> checkpoint, however, RecoveryInProgress() returns false, so we don't
>> update minimum recovery point in XLogFlush(). You're unlikely to be
>> bitten by that in practice; you would need to crash during the
>> end-of-recovery checkpoint, and then set the recovery target to an
>> earlier point. It should be fixed nevertheless.
>
> We would want the end-of-recovery checkpoint to act like it's not in
> recovery anymore for this purpose, no?

For the purpose of updating min recovery point, we want it to act like
it *is* still in recovery. But in the XLogFlush() call in
CreateCheckPoint(), we really want it to flush the WAL, not update min
recovery point.

A simple fix is to call UpdateMinRecoveryPoint() after the WAL replay is
finished, but before creating the checkpoint. exitArchiveRecovery()
seems like a good place.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 15:10 -0400, Tom Lane wrote:

> Are you (Heikki and Simon) in a position to finish these things today?

Certainly will try.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> We would want the end-of-recovery checkpoint to act like it's not in
>> recovery anymore for this purpose, no?

> For the purpose of updating min recovery point, we want it to act like
> it *is* still in recovery.

Well, without a clear explanation of why min recovery point should move
at all, I'm unable to evaluate that statement.

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 15:10 -0400, Tom Lane wrote:

> So to summarize the state of play, it seems
> we have these issues:
>
> * need to delete startup process's local pendingOpsTable once bgwriter
> is launched, so that requests go to bgwriter instead

Need to ensure that fsync requests are directed to the process that will
act on the fsync requests.

> * need to push end-of-recovery checkpoint into bgwriter

That's probably the easiest thing to do, but the issue is that we must
fsync all files mentioned in the pendingOpsTable in *any* process that
has been accumulating such requests.

> * need to do something about IsRecovery tests that will now be executed
> in bgwriter

Yes

> * need to fix mistaken code in FinishPreparedTransaction

Yes

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> Actually, what in the world is the purpose of that code at all?
>
>> Huh? The only other place where we advance minimum recovery point is
>> CreateRestartPoint() when we skip the restartpoint. The
>> UpdateMinRecoveryPoint() call in XLogFlush() is what we rely on.
>
> I would like an explanation of why minimum recovery point needs to
> be advanced at all.  If there's any use for it, it is surely part
> of functionality that is not there in 8.4.

It gives protection against starting up the database too early. It stops
the database from starting up if you stop recovery, and move recovery
target backwards or delete WAL files from pg_xlog, in a way that's not
safe. Of course, that's a case of "don't do that!", but it's a mistake
you can make when trying to manually recover to a given point in time.
Or if your recovery scripts are broken.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> I would like an explanation of why minimum recovery point needs to
>> be advanced at all.  If there's any use for it, it is surely part
>> of functionality that is not there in 8.4.

> It gives protection against starting up the database too early.

Protection against what failure scenario?

> It stops
> the database from starting up if you stop recovery, and move recovery
> target backwards or delete WAL files from pg_xlog, in a way that's not
> safe. Of course, that's a case of "don't do that!", but it's a mistake
> you can make when trying to manually recover to a given point in time.

You can't move the target backwards --- it's embedded in pg_control.
And even if you could, I don't see what protection this particular
mechanism gives you.  To the extent that it does anything at all, it's
going to be full of holes because it only examines pages that happen to
be dirtied by (the current instance of) the recovery process.

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 22:29 +0300, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > While nosing around the problem areas, I think I've found yet another
> > issue here.  The global bool InRecovery is only maintained correctly
> > in the startup process, which wasn't a problem before 8.4.  However,
> > if we are making the bgwriter execute the end-of-recovery checkpoint,
> > there are multiple places where it is tested that are going to be
> > executed by bgwriter.  I think (but am not 100% sure) that these
> > are all the at-risk references:
> >     XLogFlush
> >     CheckPointMultiXact
> >     CreateCheckPoint (2 places)
> > Heikki's latest patch deals with the tests in CreateCheckPoint (rather
> > klugily IMO) but not the others.  I think it might be better to fix
> > things so that InRecovery is maintained correctly in the bgwriter too.
>
> We could set InRecovery=true in CreateCheckPoint if it's a startup
> checkpoint, and reset it afterwards.

That seems like a bad idea.

As you said earlier,

On Thu, 2009-06-25 at 23:15 +0300, Heikki Linnakangas wrote:

> Well, we have RecoveryInProgress() now that answers the question "is
> recovery still in progress in the system". InRecovery now means "am I
> a process that's performing WAL replay?".

so it would be a mistake to do as you propose above because it changes
the meaning of these well defined parts of the system.

* XLogFlush mentions InRecovery right at the end and the correct usage
would be RecoveryIsInProgress() rather than InRecovery.

* The usage of InRecovery in CheckPointMultiXact() should also be
replaced with RecoveryIsInProgress()

* The two usages of InRecovery can also be replaced by
RecoveryIsInProgress()

So, yes, there are some places where InRecovery is used in code executed
by the bgwriter, but the correct fix is to use RecoveryIsInProgress().
It is a hack to try to set the InRecovery state flag in bgwriter and one
that would change the clear meaning of InRecovery, to no good purpose.

Subsequent discussion on this subthread may no longer be relevant.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> I would like an explanation of why minimum recovery point needs to
>>> be advanced at all.  If there's any use for it, it is surely part
>>> of functionality that is not there in 8.4.
>
>> It gives protection against starting up the database too early.
>
> Protection against what failure scenario?
>
>> It stops
>> the database from starting up if you stop recovery, and move recovery
>> target backwards or delete WAL files from pg_xlog, in a way that's not
>> safe. Of course, that's a case of "don't do that!", but it's a mistake
>> you can make when trying to manually recover to a given point in time.
>
> You can't move the target backwards --- it's embedded in pg_control.

No it's not. It's in recovery.conf.

We do store the minimum recovery point required by the base backup in
control file, but it's not advanced once the recovery starts. So if you
start recovery, starting from say 123, let it progress to location 456,
kill recovery and start it again, but only let it go up to 300, you end
up with a corrupt database.

> And even if you could, I don't see what protection this particular
> mechanism gives you.  To the extent that it does anything at all, it's
> going to be full of holes because it only examines pages that happen to
> be dirtied by (the current instance of) the recovery process.

The only process dirtying pages during recovery is the startup process,
when it replays WAL. And for the sake of a consistent database if we
crash, we only care about pages that have been written to disk.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> So, yes, there are some places where InRecovery is used in code executed
> by the bgwriter, but the correct fix is to use RecoveryIsInProgress().

Agreed, but this gets us no closer to solving the real problem, which is
that when we perform the end-of-recovery checkpoint, we need to act like
we are *not* in recovery anymore, for at least some purposes.  Most
notably, to allow us to write a WAL entry at all; but I am suspicious
that pretty much every InRecovery/RecoveryIsInProgress test that that
checkpoint might execute should behave as if we're not in recovery.

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> We do store the minimum recovery point required by the base backup in
> control file, but it's not advanced once the recovery starts. So if you
> start recovery, starting from say 123, let it progress to location 456,
> kill recovery and start it again, but only let it go up to 300, you end
> up with a corrupt database.

I don't believe that you have the ability to do that.  Once the recovery
process is launched, the user does not have the ability to control where
the WAL scan proceeds from.  Or at least that's how it used to work, and
if someone has tried to change it, it's broken for exactly this reason.

The behavior you seem to have in mind would be completely disastrous
from a performance standpoint, as we'd be writing and fsyncing
pg_control constantly during a recovery.  I wouldn't consider that a
good idea from a reliability standpoint either --- the more writes to
pg_control, the more risk of fatal corruption of that file.

This is sounding more and more like something that needs to be reverted.

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 22:29 +0300, Heikki Linnakangas wrote:

> Hmm, I see another small issue. We now keep track of the "minimum
> recovery point". Whenever a data page is flushed, we set minimum
> recovery point to the LSN of the page in XLogFlush(), instead of
> fsyncing WAL like we do in normal operation. During the end-of-recovery
> checkpoint, however, RecoveryInProgress() returns false, so we don't
> update minimum recovery point in XLogFlush(). You're unlikely to be
> bitten by that in practice; you would need to crash during the
> end-of-recovery checkpoint, and then set the recovery target to an
> earlier point. It should be fixed nevertheless.

RecoveryInProgress returns false only because you set
LocalRecoveryInProgress at the start of the checkpoint, not at the end.
It only needs to be set immediately prior to the call for XLogInsert()
in CreateCheckpoint(). If we do that then XLogFlush() acts as advertised
and we have no worries.

Tom: Heikki's work on MinRecoveryPoint seems sound to me and altering it
now seems like something too dangerous to attempt at this stage. I see
no need; we have no new bug, just a minor point of how we code the fix
to the bug we do have.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 17:11 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > So, yes, there are some places where InRecovery is used in code executed
> > by the bgwriter, but the correct fix is to use RecoveryIsInProgress().
>
> Agreed, but this gets us no closer to solving the real problem, which is
> that when we perform the end-of-recovery checkpoint, we need to act like
> we are *not* in recovery anymore, for at least some purposes.

Not for some purposes, just for *one* purpose: the end of recovery
checkpoint needs to run XLogInsert() which has specific protection
against being run during recovery.

> Most
> notably, to allow us to write a WAL entry at all; but I am suspicious
> that pretty much every InRecovery/RecoveryIsInProgress test that that
> checkpoint might execute should behave as if we're not in recovery.

You are right to question whether we should revoke the patch. ISTM that
we are likely to decrease code robustness by doing that at this stage.

I think we have the problems pretty much solved now.

If we want to increase robustness, I would suggest we add a
recovery.conf parameter to explicitly enable use of bgwriter during
recovery, off by default. In addition to the fixes being worked on
currently.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2009-06-25 at 15:10 -0400, Tom Lane wrote:
>
>> So to summarize the state of play, it seems
>> we have these issues:
>>
>> * need to delete startup process's local pendingOpsTable once bgwriter
>> is launched, so that requests go to bgwriter instead
>
> Need to ensure that fsync requests are directed to the process that will
> act on the fsync requests.
>
>> * need to push end-of-recovery checkpoint into bgwriter
>
> That's probably the easiest thing to do, but the issue is that we must
> fsync all files mentioned in the pendingOpsTable in *any* process that
> has been accumulating such requests.
>
>> * need to do something about IsRecovery tests that will now be executed
>> in bgwriter
>
> Yes
>
>> * need to fix mistaken code in FinishPreparedTransaction
>
> Yes

Ok, I've committed the above fixes everyone agreed on.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 17:17 -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> > We do store the minimum recovery point required by the base backup in
> > control file, but it's not advanced once the recovery starts. So if you
> > start recovery, starting from say 123, let it progress to location 456,
> > kill recovery and start it again, but only let it go up to 300, you end
> > up with a corrupt database.
>
> I don't believe that you have the ability to do that.  Once the recovery
> process is launched, the user does not have the ability to control where
> the WAL scan proceeds from.  Or at least that's how it used to work, and
> if someone has tried to change it, it's broken for exactly this reason.
>
> The behavior you seem to have in mind would be completely disastrous
> from a performance standpoint, as we'd be writing and fsyncing
> pg_control constantly during a recovery.  I wouldn't consider that a
> good idea from a reliability standpoint either --- the more writes to
> pg_control, the more risk of fatal corruption of that file.
>
> This is sounding more and more like something that needs to be reverted.

AFAICS the problem Heikki is worried about exists 8.2+. If you stop
recovery, edit recovery.conf to an earlier recovery target and then
re-run recovery then it is possible that data that would not exist until
after the (new) recovery point has made its way to disk. The code in 8.4
does a few things to improve that but the base problem persists and
revoking code won't change that. We should describe the issue in the
docs and leave it at that - there is no particular reason to believe
anybody would want to do such a thing.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> AFAICS the problem Heikki is worried about exists 8.2+. If you stop
> recovery, edit recovery.conf to an earlier recovery target and then
> re-run recovery then it is possible that data that would not exist until
> after the (new) recovery point has made its way to disk. The code in 8.4
> does a few things to improve that but the base problem persists and
> revoking code won't change that. We should describe the issue in the
> docs and leave it at that - there is no particular reason to believe
> anybody would want to do such a thing.

The way I've bumped into that is when playing with pg_standby:

1. Create base backup, set up standby
2. Let pg_standby catch up, so that it waits for the next segment to arrive.
3. killall -9 postmaster
4. Start standby again
5. Create trigger file, before recovery catches up again (that is,
before it reaches the point where it was before killing it)

Now that we have the "smart" mode in pg_standby, that's harder to do by
accident, but can still happen if you e.g remove a WAL file from archive.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Simon Riggs wrote:
>> AFAICS the problem Heikki is worried about exists 8.2+. If you stop
>> recovery, edit recovery.conf to an earlier recovery target and then
>> re-run recovery then it is possible that data that would not exist until
>> after the (new) recovery point has made its way to disk. The code in 8.4
>> does a few things to improve that but the base problem persists and
>> revoking code won't change that. We should describe the issue in the
>> docs and leave it at that - there is no particular reason to believe
>> anybody would want to do such a thing.

> The way I've bumped into that is when playing with pg_standby:
> [ different scenario *not* involving any explicit recovery target ]

Okay, I misunderstood that code as being intended to prevent some
scenario that was new with Hot Standby.  I still think it's a bad
solution though because of the large number of pg_control writes it
will cause.  I agree that the code can be made to work in connection
with the fixes for the immediate bugs, but I am not convinced that we
want it there in its current form.

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> The behavior you seem to have in mind would be completely disastrous
> from a performance standpoint, as we'd be writing and fsyncing
> pg_control constantly during a recovery.

Please define "constantly". We discussed that part of the patch here:

http://archives.postgresql.org/message-id/498AB55D.50408@enterprisedb.com

>  I wouldn't consider that a
> good idea from a reliability standpoint either --- the more writes to
> pg_control, the more risk of fatal corruption of that file.

We certainly update it an order of magnitude more often than before, but
I don't think that's an issue. We're talking about archive recovery
here. It's not like in normal operation where a corrupt pg_control file
means that you lose your data. It will stop the server from starting up,
but there's many other files that can be corrupt in a way that causes
recovery to fail or stop too early.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Fri, 2009-06-26 at 00:37 +0300, Heikki Linnakangas wrote:

> Ok, I've committed the above fixes everyone agreed on.

Sorry, but I haven't agreed to very much of what you just committed.

There are some unnecessary and confusing hacks that really don't help
anybody sort out why any of this has been done.

At this stage of RC, I expected you to post the patch and have the other
2 people working actively on this issue review it before you commit.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> Tom Lane wrote:
>>  I wouldn't consider that a
>> good idea from a reliability standpoint either --- the more writes to
>> pg_control, the more risk of fatal corruption of that file.
>
> We certainly update it an order of magnitude more often than before, but
> I don't think that's an issue. We're talking about archive recovery
> here. It's not like in normal operation where a corrupt pg_control file
> means that you lose your data. It will stop the server from starting up,
> but there's many other files that can be corrupt in a way that causes
> recovery to fail or stop too early.

If you still find the frequent pg_control updates unacceptable, we could
always move minRecoveryPoint to a file of its own..

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> The behavior you seem to have in mind would be completely disastrous
>> from a performance standpoint, as we'd be writing and fsyncing
>> pg_control constantly during a recovery.

> Please define "constantly". We discussed that part of the patch here:
> http://archives.postgresql.org/message-id/498AB55D.50408@enterprisedb.com

Okay, after reading the code a bit more I found this:

        /*
         * To avoid having to update the control file too often, we update it
         * all the way to the last record being replayed, even though 'lsn'
         * would suffice for correctness.
         */

which should alleviate the too-many-writes syndrome.  Never mind that
complaint then.  (But shouldn't there be an Assert that this is >= lsn?)

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Fri, 2009-06-26 at 00:37 +0300, Heikki Linnakangas wrote:
>> Ok, I've committed the above fixes everyone agreed on.

> At this stage of RC, I expected you to post the patch and have the other
> 2 people working actively on this issue review it before you commit.

We're going to slip release a day to allow time to review this properly.
Given that, I have no problem with the proposed fix being in CVS --- it
makes it a shade easier for other people to test it.

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 18:40 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Fri, 2009-06-26 at 00:37 +0300, Heikki Linnakangas wrote:
> >> Ok, I've committed the above fixes everyone agreed on.
>
> > At this stage of RC, I expected you to post the patch and have the other
> > 2 people working actively on this issue review it before you commit.
>
> We're going to slip release a day to allow time to review this properly.
> Given that, I have no problem with the proposed fix being in CVS --- it
> makes it a shade easier for other people to test it.

I wasn't suggesting that we post the patch and leave it for days, I was
thinking more of the 3 people actively working together on the problem
looking at the code before its committed. Describing it as stuff we
agreed seems particularly strange in that light.

On the patch, the kluge to set InRecovery is unnecessary and confusing.
If I submitted that you'd kick my ass all around town.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On the patch, the kluge to set InRecovery is unnecessary and confusing.

I'll look into a better way to do that tonight.  Did you have any
specific improvement in mind?

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 19:16 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On the patch, the kluge to set InRecovery is unnecessary and confusing.
>
> I'll look into a better way to do that tonight.  Did you have any
> specific improvement in mind?

Yes, all already mentioned on this thread.

I'm unable to spend time on this tomorrow, but I think the main elements
of the solution are there now. The bug was found in my feature, so mea
culpa. Thanks to everybody for helping identify and fix it.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Thu, 2009-06-25 at 19:16 -0400, Tom Lane wrote:
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>> On the patch, the kluge to set InRecovery is unnecessary and confusing.
>>
>> I'll look into a better way to do that tonight.  Did you have any
>> specific improvement in mind?

> Yes, all already mentioned on this thread.

After looking at this a bit more, the "easy" solution mentioned upthread
doesn't work.  We want the end-of-recovery checkpoint to be able to
write WAL (obviously), but it *must not* try to call TruncateMultiXact
or TruncateSUBTRANS, because those subsystems aren't initialized yet.
The check in XLogFlush needs to behave as though InRecovery were true
too.  So the idea of testing RecoveryInProgress() rather than InRecovery
cannot handle all these cases.

However, I still agree with your thought that having InRecovery true
only in the process that's applying WAL records is probably the cleanest
definition.  And it also seems to me that having crystal-clear
definitions of these states is essential, because not being clear about
them is exactly what got us into this mess.

What I am thinking is that instead of the hack of clearing
LocalRecoveryInProgress to allow the current process to write WAL,
we should have a separate test function WALWriteAllowed() with a
state variable LocalWALWriteAllowed, and the hack should set that
state without playing any games with LocalRecoveryInProgress.  Then
RecoveryInProgress() remains true during the end-of-recovery checkpoint
and we can condition the TruncateMultiXact and TruncateSUBTRANS calls
on that.  Meanwhile the various places that check RecoveryInProgress
to decide if WAL writing is allowed should call WALWriteAllowed()
instead.

I have not yet gone through all the code sites to make sure this is a
consistent way to do it, but we clearly need more states than we have
now if we are to avoid weird overloadings of the state meanings.

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Fujii Masao
Date:
Hi,

Thank you for addressing the problem!

On Fri, Jun 26, 2009 at 7:14 AM, Heikki
Linnakangas<heikki.linnakangas@enterprisedb.com> wrote:
> We certainly update it an order of magnitude more often than before, but
> I don't think that's an issue. We're talking about archive recovery
> here. It's not like in normal operation where a corrupt pg_control file
> means that you lose your data. It will stop the server from starting up,
> but there's many other files that can be corrupt in a way that causes
> recovery to fail or stop too early.

Frequent updating of pg_control causes the significant performance
degradation of archive recovery. I think that this is an issue to be fixed.
The warm-standby users (including me) care about the performance
of the standby server, because that affects the failover time, for example.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
I wrote:
> ... Never mind that complaint then.

Be advised that I'm going to be on the warpath again in the morning.
I observe that the substantial amount of care we have taken over
XLogFlush's handling of bad-input-LSN scenarios has been completely
destroyed by the UpdateMinRecoveryPoint patch, which will fail
disastrously (leaving the database unstartable/unrecoverable) if a
bogusly large LSN is encountered during recovery.

However, I grow weary...

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Thu, 2009-06-25 at 20:25 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Thu, 2009-06-25 at 19:16 -0400, Tom Lane wrote:
> >> Simon Riggs <simon@2ndQuadrant.com> writes:
> >>> On the patch, the kluge to set InRecovery is unnecessary and confusing.
> >>
> >> I'll look into a better way to do that tonight.  Did you have any
> >> specific improvement in mind?
>
> > Yes, all already mentioned on this thread.
>
> After looking at this a bit more, the "easy" solution mentioned upthread
> doesn't work.  We want the end-of-recovery checkpoint to be able to
> write WAL (obviously), but it *must not* try to call TruncateMultiXact
> or TruncateSUBTRANS, because those subsystems aren't initialized yet.
> The check in XLogFlush needs to behave as though InRecovery were true
> too.  So the idea of testing RecoveryInProgress() rather than InRecovery
> cannot handle all these cases.

Yes it can, but only if LocalRecoveryInProgress is set immediately
before XLogInsert() in CreateCheckpoint(). Then it is correct

> However, I still agree with your thought that having InRecovery true
> only in the process that's applying WAL records is probably the cleanest
> definition.  And it also seems to me that having crystal-clear
> definitions of these states is essential, because not being clear about
> them is exactly what got us into this mess.

Yes

> What I am thinking is that instead of the hack of clearing
> LocalRecoveryInProgress to allow the current process to write WAL,
> we should have a separate test function WALWriteAllowed() with a
> state variable LocalWALWriteAllowed, and the hack should set that
> state without playing any games with LocalRecoveryInProgress.  Then
> RecoveryInProgress() remains true during the end-of-recovery checkpoint
> and we can condition the TruncateMultiXact and TruncateSUBTRANS calls
> on that.  Meanwhile the various places that check RecoveryInProgress
> to decide if WAL writing is allowed should call WALWriteAllowed()
> instead.

No need.

> I have not yet gone through all the code sites to make sure this is a
> consistent way to do it, but we clearly need more states than we have
> now if we are to avoid weird overloadings of the state meanings.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Fri, 2009-06-26 at 10:56 +0900, Fujii Masao wrote:
> Hi,
>
> Thank you for addressing the problem!
>
> On Fri, Jun 26, 2009 at 7:14 AM, Heikki
> Linnakangas<heikki.linnakangas@enterprisedb.com> wrote:
> > We certainly update it an order of magnitude more often than before, but
> > I don't think that's an issue. We're talking about archive recovery
> > here. It's not like in normal operation where a corrupt pg_control file
> > means that you lose your data. It will stop the server from starting up,
> > but there's many other files that can be corrupt in a way that causes
> > recovery to fail or stop too early.
>
> Frequent updating of pg_control causes the significant performance
> degradation of archive recovery. I think that this is an issue to be fixed.
> The warm-standby users (including me) care about the performance
> of the standby server, because that affects the failover time, for example.

An important point.

The update rate "should" be once per clock cycle at most and should
occur in bgwriter not startup. If there is evidence of a problem then I
would support reducing the update rate. The purpose of the patch was
performance, not to fix an obscure and unreported inconsistency.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Fri, 2009-06-26 at 05:14 +0100, Simon Riggs wrote:
> On Thu, 2009-06-25 at 20:25 -0400, Tom Lane wrote:

> > What I am thinking is that instead of the hack of clearing
> > LocalRecoveryInProgress to allow the current process to write WAL,
> > we should have a separate test function WALWriteAllowed() with a
> > state variable LocalWALWriteAllowed, and the hack should set that
> > state without playing any games with LocalRecoveryInProgress.  Then
> > RecoveryInProgress() remains true during the end-of-recovery checkpoint
> > and we can condition the TruncateMultiXact and TruncateSUBTRANS calls
> > on that.  Meanwhile the various places that check RecoveryInProgress
> > to decide if WAL writing is allowed should call WALWriteAllowed()
> > instead.
>
> No need.

Belay that. Yes, agree need for additional state, though think its more
like EndRecoveryIsComplete().

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Fujii Masao wrote:
> On Fri, Jun 26, 2009 at 7:14 AM, Heikki
> Linnakangas<heikki.linnakangas@enterprisedb.com> wrote:
>> We certainly update it an order of magnitude more often than before, but
>> I don't think that's an issue. We're talking about archive recovery
>> here. It's not like in normal operation where a corrupt pg_control file
>> means that you lose your data. It will stop the server from starting up,
>> but there's many other files that can be corrupt in a way that causes
>> recovery to fail or stop too early.
>
> Frequent updating of pg_control causes the significant performance
> degradation of archive recovery. I think that this is an issue to be fixed.
> The warm-standby users (including me) care about the performance
> of the standby server, because that affects the failover time, for example.

Are you actually seeing performance degradation caused by frequent
pg_control updates? In the simple test scenarios I've tested, pg_control
is updated only once every few WAL segments, and this with
shared_buffer=32MB. With larger shared_buffers, it happens even less
frequently.

There's a DEBUG2-line in UpdateMinRecoveryPoint() that you can bump to
LOG level if you want to observe that behavior.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> I observe that the substantial amount of care we have taken over
> XLogFlush's handling of bad-input-LSN scenarios has been completely
> destroyed by the UpdateMinRecoveryPoint patch, which will fail
> disastrously (leaving the database unstartable/unrecoverable) if a
> bogusly large LSN is encountered during recovery.

Note that we don't update minRecoveryPoint to the LSN from the data
page, but to the LSN of the last replayed WAL record. A warning similar
to that at the end of XLogFlush() would be a good idea though, if the
data page LSN is greater.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Fujii Masao
Date:
Hi,

On Fri, Jun 26, 2009 at 3:22 PM, Heikki
Linnakangas<heikki.linnakangas@enterprisedb.com> wrote:
> Are you actually seeing performance degradation caused by frequent
> pg_control updates? In the simple test scenarios I've tested, pg_control
> is updated only once every few WAL segments, and this with
> shared_buffer=32MB. With larger shared_buffers, it happens even less
> frequently.

Oh, I misunderstood about how UpdateMinRecoveryPoint() is called.
ISTM that recovery is still fast unless shared_buffers is set to unreasonable
value. Sorry for the noise.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Fri, 2009-06-26 at 05:14 +0100, Simon Riggs wrote:
>> On Thu, 2009-06-25 at 20:25 -0400, Tom Lane wrote:
>
>>> What I am thinking is that instead of the hack of clearing
>>> LocalRecoveryInProgress to allow the current process to write WAL,
>>> we should have a separate test function WALWriteAllowed() with a
>>> state variable LocalWALWriteAllowed, and the hack should set that
>>> state without playing any games with LocalRecoveryInProgress.  Then
>>> RecoveryInProgress() remains true during the end-of-recovery checkpoint
>>> and we can condition the TruncateMultiXact and TruncateSUBTRANS calls
>>> on that.  Meanwhile the various places that check RecoveryInProgress
>>> to decide if WAL writing is allowed should call WALWriteAllowed()
>>> instead.
>> No need.
>
> Belay that. Yes, agree need for additional state, though think its more
> like EndRecoveryIsComplete().

Here's a patch implementing the WALWriteAllowed() idea (I'm not wedded
to the name). There's two things that trouble me with this patch:

- CreateCheckPoint() calls AdvanceXLInsertBuffer() before setting
LocalWALWriteAllowed. I don't see anything in AdvanceXLInsertBuffer()
that would fail, but it doesn't feel right. While strictly speaking it
doesn't insert new WAL records, it does write WAL page headers.

- As noted with an XXX comment in the patch, CreateCheckPoint() now
resets LocalWALWriteAllowed to false after a shutdown/end-of-recovery
checkpoint. But that's not enough to stop WAL inserts after a shutdown
checkpoint, because when RecoveryInProgress() is false, we
WALWriteAllowed() still returns true. We haven't had such a safeguard in
place before, so we can keep living without it, but now that we have a
WALWriteAllowed() macro it would be nice if it returned false when WAL
writes are no longer allowed after a shutdown checkpoint. (that would've
caught a bug in Guillaume Smet's original patch to rotate a WAL segment
at shutdown, where the xlog switch was done after shutdown checkpoint)

On whole, this is probably the right way going forward, but I'm not sure
if it'd make 8.4 more or less robust than what's in CVS now.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 7314341..6f86961 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1543,7 +1543,7 @@ CheckPointMultiXact(void)
      * SimpleLruTruncate would get confused.  It seems best not to risk
      * removing any data during recovery anyway, so don't truncate.
      */
-    if (!InRecovery)
+    if (!RecoveryInProgress())
         TruncateMultiXact();

     TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d6f63c7..f28bd4d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -142,6 +142,16 @@ static bool InArchiveRecovery = false;
  */
 static bool LocalRecoveryInProgress = true;

+/*
+ * Am I allowed to write new WAL records? It's always allowed after recovery.
+ * End-of-recovery checkpoint sets LocalWALWriteAllowed before we exit
+ * recovery, so that it can write the checkpoint record even though
+ * RecoveryInProgress() is still true.
+ */
+#define WALWriteAllowed() (LocalWALWriteAllowed || !RecoveryInProgress())
+
+static bool LocalWALWriteAllowed = false;
+
 /* Was the last xlog file restored from archive, or local? */
 static bool restoredFromArchive = false;

@@ -537,7 +547,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
     bool        isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);

     /* cross-check on whether we should be here or not */
-    if (RecoveryInProgress())
+    if (!WALWriteAllowed())
         elog(FATAL, "cannot make new WAL entries during recovery");

     /* info's high bits are reserved for use by me */
@@ -1846,7 +1856,7 @@ XLogFlush(XLogRecPtr record)
      * During REDO, we don't try to flush the WAL, but update minRecoveryPoint
      * instead.
      */
-    if (RecoveryInProgress())
+    if (!WALWriteAllowed())
     {
         UpdateMinRecoveryPoint(record, false);
         return;
@@ -1949,7 +1959,7 @@ XLogFlush(XLogRecPtr record)
      * and so we will not force a restart for a bad LSN on a data page.
      */
     if (XLByteLT(LogwrtResult.Flush, record))
-        elog(InRecovery ? WARNING : ERROR,
+        elog(RecoveryInProgress() ? WARNING : ERROR,
         "xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
              record.xlogid, record.xrecoff,
              LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
@@ -1977,7 +1987,7 @@ XLogBackgroundFlush(void)
     bool        flexible = true;

     /* XLOG doesn't need flushing during recovery */
-    if (RecoveryInProgress())
+    if (!WALWriteAllowed())
         return;

     /* read LogwrtResult and update local state */
@@ -5849,7 +5859,10 @@ RecoveryInProgress(void)
          * recovery is finished.
          */
         if (!LocalRecoveryInProgress)
+        {
             InitXLOGAccess();
+            LocalWALWriteAllowed = true;
+        }

         return LocalRecoveryInProgress;
     }
@@ -6225,7 +6238,6 @@ CreateCheckPoint(int flags)
     uint32        _logSeg;
     TransactionId *inCommitXids;
     int            nInCommit;
-    bool        OldInRecovery = InRecovery;

     /*
      * An end-of-recovery checkpoint is really a shutdown checkpoint, just
@@ -6236,33 +6248,9 @@ CreateCheckPoint(int flags)
     else
         shutdown = false;

-    /*
-     * A startup checkpoint is created before anyone else is allowed to
-     * write WAL. To allow us to write the checkpoint record, set
-     * LocalRecoveryInProgress to false. This lets us write WAL, but others
-     * are still not allowed to do so.
-     */
-    if (flags & CHECKPOINT_END_OF_RECOVERY)
-    {
-        Assert(RecoveryInProgress());
-        LocalRecoveryInProgress = false;
-        InitXLOGAccess();
-
-        /*
-         * Before 8.4, end-of-recovery checkpoints were always performed by
-         * the startup process, and InRecovery was set true. InRecovery is not
-         * normally set in bgwriter, but we set it here temporarily to avoid
-         * confusing old code in the end-of-recovery checkpoint code path that
-         * rely on it.
-         */
-        InRecovery = true;
-    }
-    else
-    {
-        /* shouldn't happen */
-        if (RecoveryInProgress())
-            elog(ERROR, "can't create a checkpoint during recovery");
-    }
+    /* shouldn't happen */
+    if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
+        elog(ERROR, "can't create a checkpoint during recovery");

     /*
      * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
@@ -6305,7 +6293,6 @@ CreateCheckPoint(int flags)

     /* Begin filling in the checkpoint WAL record */
     MemSet(&checkPoint, 0, sizeof(checkPoint));
-    checkPoint.ThisTimeLineID = ThisTimeLineID;
     checkPoint.time = (pg_time_t) time(NULL);

     /*
@@ -6473,6 +6460,24 @@ CreateCheckPoint(int flags)
     START_CRIT_SECTION();

     /*
+     * An end-of-recovery checkpoint is created before anyone is allowed to
+     * write WAL. To allow us to write the checkpoint record, temporarily
+     * enable LocalWALWriteAllowed.
+     */
+    if (flags & CHECKPOINT_END_OF_RECOVERY)
+    {
+        Assert(RecoveryInProgress());
+        LocalWALWriteAllowed = true;
+        InitXLOGAccess();
+    }
+
+    /*
+     * this needs to be done after the InitXLOGAccess() call above or
+     * ThisTimeLineID might be uninitialized
+     */
+    checkPoint.ThisTimeLineID = ThisTimeLineID;
+
+    /*
      * Now insert the checkpoint record into XLOG.
      */
     rdata.data = (char *) (&checkPoint);
@@ -6488,6 +6493,19 @@ CreateCheckPoint(int flags)
     XLogFlush(recptr);

     /*
+     * We mustn't write any new WAL after a shutdown checkpoint, or it will
+     * be overwritten at next startup. No-one should even try, this just
+     * allows a bit more sanity-checking. XXX: since RecoveryInProgress() is
+     * false at that point, we'll just set LocalWriteAllowed again if anyone
+     * calls XLogInsert(), so this is actually useless in the shutdown case.
+     *
+     * Don't allow WAL writes after an end-of-recovery checkpoint either. It
+     * will be enabled again after the startup is fully completed.
+     */
+    if (shutdown)
+        LocalWALWriteAllowed = false;
+
+    /*
      * We now have ProcLastRecPtr = start of actual checkpoint record, recptr
      * = end of actual checkpoint record.
      */
@@ -6560,7 +6578,7 @@ CreateCheckPoint(int flags)
      * in subtrans.c).    During recovery, though, we mustn't do this because
      * StartupSUBTRANS hasn't been called yet.
      */
-    if (!InRecovery)
+    if (!RecoveryInProgress())
         TruncateSUBTRANS(GetOldestXmin(true, false));

     /* All real work is done, but log before releasing lock. */
@@ -6574,9 +6592,6 @@ CreateCheckPoint(int flags)
                                      CheckpointStats.ckpt_segs_recycled);

     LWLockRelease(CheckpointLock);
-
-    /* Restore old value */
-    InRecovery = OldInRecovery;
 }

 /*

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> I observe that the substantial amount of care we have taken over
>> XLogFlush's handling of bad-input-LSN scenarios has been completely
>> destroyed by the UpdateMinRecoveryPoint patch, which will fail
>> disastrously (leaving the database unstartable/unrecoverable) if a
>> bogusly large LSN is encountered during recovery.

> Note that we don't update minRecoveryPoint to the LSN from the data
> page, but to the LSN of the last replayed WAL record. A warning similar
> to that at the end of XLogFlush() would be a good idea though, if the
> data page LSN is greater.

Ah, roger, so actually we can make this *better* than it was before.
The special case in XLogFlush is no longer needed, because the case
in which we formerly wished to use WARNING is now diverted to
UpdateMinRecoveryPoint.  But the latter ought to handle the situation
explicitly.

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Here's a patch implementing the WALWriteAllowed() idea (I'm not wedded
> to the name).

I'm not either ... this morning it seems to me that XLogWriteAllowed
(or maybe XLogInsertAllowed?) would be more in keeping with the existing
code names in this area.

> There's two things that trouble me with this patch:

> - CreateCheckPoint() calls AdvanceXLInsertBuffer() before setting
> LocalWALWriteAllowed. I don't see anything in AdvanceXLInsertBuffer()
> that would fail, but it doesn't feel right. While strictly speaking it
> doesn't insert new WAL records, it does write WAL page headers.

The ordering is necessary because we have to do that before we start
flushing buffers, and XLogFlush has to see WriteAllowed as FALSE or it
will do the wrong thing.  If we ever put something into
AdvanceXLInsertBuffer that would depend on this, we could flip the flag
to TRUE just for the duration of calling AdvanceXLInsertBuffer, though
I agree that's a bit ugly.  Perhaps calling the flag/routine
XLogInsertAllowed() would alleviate this issue somewhat?

> - As noted with an XXX comment in the patch, CreateCheckPoint() now
> resets LocalWALWriteAllowed to false after a shutdown/end-of-recovery
> checkpoint. But that's not enough to stop WAL inserts after a shutdown
> checkpoint, because when RecoveryInProgress() is false, we
> WALWriteAllowed() still returns true. We haven't had such a safeguard in
> place before, so we can keep living without it, but now that we have a
> WALWriteAllowed() macro it would be nice if it returned false when WAL
> writes are no longer allowed after a shutdown checkpoint. (that would've
> caught a bug in Guillaume Smet's original patch to rotate a WAL segment
> at shutdown, where the xlog switch was done after shutdown checkpoint)

It would be possible to make LocalWALWriteAllowed a three-state flag:
    * allowed, don't check RecoveryInProgress,
    * disallowed, don't check RecoveryInProgress
    * check RecoveryInProgress, transition to first state if clear
Not sure if worth the trouble.

> On whole, this is probably the right way going forward, but I'm not sure
> if it'd make 8.4 more or less robust than what's in CVS now.

I think we should do it.  There is a lot of stuff I'm still not happy
with in this area, but without a clean and understandable set of state
definitions we aren't going to be able to improve it.

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> - CreateCheckPoint() calls AdvanceXLInsertBuffer() before setting
>> LocalWALWriteAllowed. I don't see anything in AdvanceXLInsertBuffer()
>> that would fail, but it doesn't feel right. While strictly speaking it
>> doesn't insert new WAL records, it does write WAL page headers.
>
> The ordering is necessary because we have to do that before we start
> flushing buffers, and XLogFlush has to see WriteAllowed as FALSE or it
> will do the wrong thing.  If we ever put something into
> AdvanceXLInsertBuffer that would depend on this, we could flip the flag
> to TRUE just for the duration of calling AdvanceXLInsertBuffer, though
> I agree that's a bit ugly.  Perhaps calling the flag/routine
> XLogInsertAllowed() would alleviate this issue somewhat?

Yeah, it would look much less dirty if called XLogInsertAllowed().

>> - As noted with an XXX comment in the patch, CreateCheckPoint() now
>> resets LocalWALWriteAllowed to false after a shutdown/end-of-recovery
>> checkpoint. But that's not enough to stop WAL inserts after a shutdown
>> checkpoint, because when RecoveryInProgress() is false, we
>> WALWriteAllowed() still returns true. We haven't had such a safeguard in
>> place before, so we can keep living without it, but now that we have a
>> WALWriteAllowed() macro it would be nice if it returned false when WAL
>> writes are no longer allowed after a shutdown checkpoint. (that would've
>> caught a bug in Guillaume Smet's original patch to rotate a WAL segment
>> at shutdown, where the xlog switch was done after shutdown checkpoint)
>
> It would be possible to make LocalWALWriteAllowed a three-state flag:
>     * allowed, don't check RecoveryInProgress,
>     * disallowed, don't check RecoveryInProgress
>     * check RecoveryInProgress, transition to first state if clear
> Not sure if worth the trouble.

One idea is to merge LocalWALWriteAllowed and LocalRecoveryInProgress
(and the corresponding shared variables too) into one XLogState variable
with three states
* in-recovery
* normal operation
* shutdown.

The variable would always move from a lower state to higher, similar to
PMState in postmaster.c, so it could be cached like RecoveryInProgress
works now. WAL inserts would only be allowed in normal operation. An
end-of-recovery checkpoint would set LocalXLogState to "normal
operation" ahead of SharedXLogState, to allow writing the checkpoint
record, and a shutdown checkpoint would set Shared- and LocalXLogState
to "shutdown" right after writing the checkpoint record.

>> On whole, this is probably the right way going forward, but I'm not sure
>> if it'd make 8.4 more or less robust than what's in CVS now.
>
> I think we should do it.  There is a lot of stuff I'm still not happy
> with in this area, but without a clean and understandable set of state
> definitions we aren't going to be able to improve it.

I agree, we need clear states and invariants that can be checked and
relied on. This gets even more complicated when the hot standby stuff
gets involved.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> One idea is to merge LocalWALWriteAllowed and LocalRecoveryInProgress
> (and the corresponding shared variables too) into one XLogState variable
> with three states
> * in-recovery
> * normal operation
> * shutdown.

> The variable would always move from a lower state to higher,

Hmm ... this doesn't really feel cleaner to me, although I'm not sure
why not.  One point is that you really don't have enough states there;
there's a difference between "in recovery" and "writing end-of-recovery
checkpoint".  Also, you stated that you want to disable XLogInsert again
as soon as the EOR checkpoint is written, so I don't believe that the
state sequence is really monotonic, unless you split in-recovery into
three sub-states.

On the whole I prefer keeping "InRecovery" and "XLogInsertAllowed" as
separate concepts, I think.

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
I wrote:
> Hmm ... this doesn't really feel cleaner to me, although I'm not sure
> why not.

Oh, I thought of a more concrete point: InRecovery is inherently a
system-wide state, but XLogInsertAllowed is *not*.  While we write
the EOR checkpoint, we really want only the bgwriter to be authorized
to write WAL, but the scheme you propose would effectively authorize
all processes during that window.

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Alvaro Herrera
Date:
Tom Lane escribió:
> I wrote:
> > Hmm ... this doesn't really feel cleaner to me, although I'm not sure
> > why not.
>
> Oh, I thought of a more concrete point: InRecovery is inherently a
> system-wide state, but XLogInsertAllowed is *not*.  While we write
> the EOR checkpoint, we really want only the bgwriter to be authorized
> to write WAL, but the scheme you propose would effectively authorize
> all processes during that window.

BTW one of the problems of the block-CRC patch was that we needed to
emit WAL records for hint bit states set during shutdown, as buffers
were flushed out, which obviously caused that error message to show up.
Due to another more serious problem I didn't investigate the solution to
this one, but it looks like it would be easy to set XLogInsertAllowed to
true while flushing buffers.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> I wrote:
>> Hmm ... this doesn't really feel cleaner to me, although I'm not sure
>> why not.
>
> Oh, I thought of a more concrete point: InRecovery is inherently a
> system-wide state, but XLogInsertAllowed is *not*.  While we write
> the EOR checkpoint, we really want only the bgwriter to be authorized
> to write WAL, but the scheme you propose would effectively authorize
> all processes during that window.

I was thinking of doing the same hack we have now, and set
LocalXLogState to "normal processing" in bgwriter, but leave
SharedXLogState still in "recovery".


It's getting late again, and I'm afraid I have to stop for today :-(.
I'll probably check my emails tomorrow, but probably won't be doing much
else until Monday.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> It's getting late again, and I'm afraid I have to stop for today :-(.

OK, I'll start from your last patch and see what I can do.

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Here's a patch implementing the WALWriteAllowed() idea (I'm not wedded
> to the name). There's two things that trouble me with this patch:

I've committed this patch plus the mentioned revisions, along with a
bunch of code review of my own (mostly doctoring comments that needed
it).  There are a couple of loose ends that should get looked at later:

* AFAICS there is no mechanism preventing the bgwriter from trying to
issue a restartpoint after the end-of-recovery checkpoint and before the
startup process manages to clear SharedRecoveryInProgress.  It's
unlikely, because the EOR checkpoint would advance the local idea of
when the next checkpoint should happen, but it doesn't appear to be
impossible.  If it did happen it would be really bad because
pg_control's notion of the latest checkpoint could go backward, perhaps
into a part of XLOG that isn't there anymore.  You'd never notice unless
you were unfortunate enough to crash before the next regular checkpoint.
I put a hack defense into CreateRestartPoint so that it won't overwrite
pg_control if the state doesn't look right once it's got
ControlFileLock, but I wonder if there's a better answer.

* CreateRestartPoint tries to report recoveryLastXTime, but that is dead
code because that variable isn't maintained in the bgwriter process.
This is not critical functionality so I don't mind releasing 8.4 with it
broken.  Perhaps we should use the time from the restartpoint's
checkpoint?

* I'm unconvinced by the code you added to "Update min recovery point
one last time".  What is that supposed to be good for?  It won't do
anything in a crash-recovery case (when minRecoveryPoint is 0/0);
does that matter?

* I find the RecoveryInProgress test in XLogNeedsFlush rather dubious.
Why is it okay to disable that?  For at least one of the two callers
(SetHintBits) it seems like the safe answer is "true" not "false".
This doesn't matter too much yet but it will for hot standby.

* I find the RecoveryInProgress test in ShutdownXLOG rather dubious too.
If that code path can actually be taken, isn't it trying to shut down
modules that haven't been initialized?

            regards, tom lane

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Simon Riggs
Date:
On Fri, 2009-06-26 at 16:48 -0400, Tom Lane wrote:

> * I find the RecoveryInProgress test in XLogNeedsFlush rather dubious.
> Why is it okay to disable that?  For at least one of the two callers
> (SetHintBits) it seems like the safe answer is "true" not "false".
> This doesn't matter too much yet but it will for hot standby.

IIUC you think it is safest to *not* write hint bits during Hot Standby?

So we would treat all commits as async commits?

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Fri, 2009-06-26 at 16:48 -0400, Tom Lane wrote:
>> * I find the RecoveryInProgress test in XLogNeedsFlush rather dubious.
>> Why is it okay to disable that?  For at least one of the two callers
>> (SetHintBits) it seems like the safe answer is "true" not "false".
>> This doesn't matter too much yet but it will for hot standby.

> IIUC you think it is safest to *not* write hint bits during Hot Standby?

Well, the point I was trying to make is that if you want the code to do
nothing then hot-wiring it to return "false" all the time during hot
standby doesn't accomplish that.  I think it probably is safe to update
hint bits during HS, but we need to think through the timing relative to
WAL processing and make sure that we don't set them too early.

Maybe that analysis has actually been done already, but there's no
evidence of it in the code or comments.

            regards, tom lane