Thread: END_OF_RECOVERY shutdowns and ResetUnloggedRelations()

END_OF_RECOVERY shutdowns and ResetUnloggedRelations()

From
Andres Freund
Date:
Hi,

Abhijit and I investigated a customer problem which has showed that crash recovery +
unlogged relations don't always work well together:

A condensed version of how crash recovery works is:

StartupXLOG()
{
...   if (ControlFile->state != DB_SHUTDOWNED)      InRecovery = true;      if (InRecovery)   {
ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);      ....       /* perform crash recovery till the end of WAL */
 ...       CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);       ...   }
PreallocXlogFiles(EndOfLog);     if (InRecovery)       ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
 
...   /* finish startup */
}

the second important part is:

CreateCheckPoint(flags)
{
...   if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))       shutdown = true;
...   if (shutdown)       ControlFile->state = DB_SHUTDOWNED;   UpdateControlFile();
...
}

If you consider a crash due to ENOSPC the problem is that first crash
recovery is performed. Then a checkpoint is performed which is marked as
END_OF_RECOVERY - which marks the database as being properly shut down!
So far, while not pretty, so good. The problem is that if we crash after
the CreateCheckPoint(), e.g. because of xlog preallocation or the new
files created in ResetUnloggedRelations(), the server will restart *but*
will *not* perform crash recovery anymore as pg_control marked as
DB_SHUTDOWNED!

That leaves you with a database which has all the _init forks, but not
the main forks... Leading to scary an unexpected errors.

Should somebody google this: The problem can be fixed by forcing the
server into crash recovery again using an immediate shutdown.


Personally I think it's quite the mistake that END_OF_RECOVERY
checkpoints are treated as shutdown checkpoints. The claimed reason
that:    *    * Note that we write a shutdown checkpoint rather than an on-line    * one. This is not particularly
critical,but since we may be    * assigning a new TLI, using a shutdown checkpoint allows us to have    * the rule that
TLIonly changes in shutdown checkpoints, which    * allows some extra error checking in xlog_redo.    *
 
and   /*    * An end-of-recovery checkpoint is really a shutdown checkpoint, just    * issued at a different time.
*/

isn't very convincing as those checks could just as well be saved in a
flags argument in the checkpoint. The likelihood of this confusion
causing further bugs (IIRC it already has caused a couple) seems high.

What I like even less is that pg_control is actually marked as
DB_SHUTDOWNED due to END_OF_RECOVERY. That's just plain wrong. Obviously
the database was *NOT* shutdown peacefully. I don't see active bugs due
it besides this, but I think it's likely to either have or create futher
ones.


Because at least the former is something that obviously we can't (and
don't want) to change in the back branches I think the solution for this
particular problem is to simply move the ResetUnloggedRelations() call a
couple lines up to before the CreateCheckPoint(). That should fix this.

Comments, other opinions?

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: END_OF_RECOVERY shutdowns and ResetUnloggedRelations()

From
Andres Freund
Date:
Hi,

On 2014-09-12 13:22:46 +0200, Andres Freund wrote:
> Because at least the former is something that obviously we can't (and
> don't want) to change in the back branches I think the solution for this
> particular problem is to simply move the ResetUnloggedRelations() call a
> couple lines up to before the CreateCheckPoint(). That should fix this.
> 
> Comments, other opinions?

A second thing I realized, just after hitting send, is that I think
ResetUnloggedRelations(UNLOGGED_RELATION_INIT) actually has to fsync the
created files. As we rely on the !init relations being there they really
need to be fsynced. During normal running that's ensured via the smgr
during checkpoints, but that's not the case here.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: END_OF_RECOVERY shutdowns and ResetUnloggedRelations()

From
Robert Haas
Date:
On Fri, Sep 12, 2014 at 7:22 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> What I like even less is that pg_control is actually marked as
> DB_SHUTDOWNED due to END_OF_RECOVERY. That's just plain wrong. Obviously
> the database was *NOT* shutdown peacefully. I don't see active bugs due
> it besides this, but I think it's likely to either have or create futher
> ones.

I agree.  The database clearly isn't shut down at end of recovery;
it's still active and we're still doing things to it.  If we crash at
that point, we need to go back into recovery on restart.  This seems
open and shut, but maybe I'm missing something.  Why shouldn't we fix
*that*?

With regard to your second email, I agree that
ResetUnloggedRelation(UNLOGGED_RELATION_INIT) needs to issue fsyncs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: END_OF_RECOVERY shutdowns and ResetUnloggedRelations()

From
Andres Freund
Date:
On 2014-09-12 14:44:48 -0400, Robert Haas wrote:
> On Fri, Sep 12, 2014 at 7:22 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > What I like even less is that pg_control is actually marked as
> > DB_SHUTDOWNED due to END_OF_RECOVERY. That's just plain wrong. Obviously
> > the database was *NOT* shutdown peacefully. I don't see active bugs due
> > it besides this, but I think it's likely to either have or create futher
> > ones.
> 
> I agree.  The database clearly isn't shut down at end of recovery;
> it's still active and we're still doing things to it.  If we crash at
> that point, we need to go back into recovery on restart.  This seems
> open and shut, but maybe I'm missing something.  Why shouldn't we fix
> *that*?

Well, I think we might want to do both. There doesn't seem to be a
reason to *not* do the ResetUnloggedRelation(UNLOGGED_RELATION_INIT)
around the ShutdownWalRcv(). That seems much closer where it, for me,
logically belongs. And it'd fix the concrete problem.

For DB_SHUTDOWNED (is that actually a word? Looks like it could be from
me...) the case isn't that clear:

If you start a node after a crash and stop it as soon as it finished, it
doesn't need recovery again. Similar if a node is promoted and doesn't
use fast promotion or a older release. Now, I think this is a pretty
dubious benefit. But I'm not sure it's wise to change it in the back
branches.

> With regard to your second email, I agree that
> ResetUnloggedRelation(UNLOGGED_RELATION_INIT) needs to issue fsyncs.

Good.

The topic reminds me of the fact that I actually think at the very least
pg_xlog/ and pg_control needs the same treatment. Consider the following
sequence:
1) postgres writes loads of stuff. Lots of it to *not* fsynced WAL  segments
2) postgres crashes and crash recovery happens. Replays *not* fsynced  WAL
3) the OS crashes
4) Bad. We now might hava pg_control with a minRecovery that's *later*  than some potentially unsynced WAL segments

I think the easiest would be to just fsync() the entire data directory
at the start when ControlFile->state !=
DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY

Note that that's independent of the fsync for unlogged relations.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: END_OF_RECOVERY shutdowns and ResetUnloggedRelations()

From
Robert Haas
Date:
On Thu, Sep 18, 2014 at 4:31 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-09-12 14:44:48 -0400, Robert Haas wrote:
>> On Fri, Sep 12, 2014 at 7:22 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > What I like even less is that pg_control is actually marked as
>> > DB_SHUTDOWNED due to END_OF_RECOVERY. That's just plain wrong. Obviously
>> > the database was *NOT* shutdown peacefully. I don't see active bugs due
>> > it besides this, but I think it's likely to either have or create futher
>> > ones.
>>
>> I agree.  The database clearly isn't shut down at end of recovery;
>> it's still active and we're still doing things to it.  If we crash at
>> that point, we need to go back into recovery on restart.  This seems
>> open and shut, but maybe I'm missing something.  Why shouldn't we fix
>> *that*?
>
> Well, I think we might want to do both. There doesn't seem to be a
> reason to *not* do the ResetUnloggedRelation(UNLOGGED_RELATION_INIT)
> around the ShutdownWalRcv(). That seems much closer where it, for me,
> logically belongs. And it'd fix the concrete problem.

That might be all right.  I've booted enough things in this area of
the code to not have a lot of confidence in my ability to not boot
things in this area of the code ... but I don't see a problem with it.
It does seem a little odd to do that before checking whether we
reached the minimum recovery point, or deciding whether to assign a
new TLI, but I don't see a concrete problem with putting it there.

>> With regard to your second email, I agree that
>> ResetUnloggedRelation(UNLOGGED_RELATION_INIT) needs to issue fsyncs.
>
> Good.
>
> The topic reminds me of the fact that I actually think at the very least
> pg_xlog/ and pg_control needs the same treatment. Consider the following
> sequence:
> 1) postgres writes loads of stuff. Lots of it to *not* fsynced WAL
>    segments
> 2) postgres crashes and crash recovery happens. Replays *not* fsynced
>    WAL
> 3) the OS crashes
> 4) Bad. We now might hava pg_control with a minRecovery that's *later*
>    than some potentially unsynced WAL segments
>
> I think the easiest would be to just fsync() the entire data directory
> at the start when ControlFile->state !=
> DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY
>
> Note that that's independent of the fsync for unlogged relations.

Seems reasonable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: END_OF_RECOVERY shutdowns and ResetUnloggedRelations()

From
Abhijit Menon-Sen
Date:
Hi Andres, Robert.

I've attached four patches here.

1. Move the call to ResetUnloggedRelations(UNLOGGED_RELATION_INIT) to
   earlier in StartupXLOG.

2. Inside that function, issue fsync()s for the main forks we create by
   copying the _init fork.

3. A small fixup to add a const to "typedef char *FileName", because the
   earlier patch gave me warnings about discarding const-ness. This is
   consistent with many other functions in fd.c that take const char *.

4. Issue an fsync() on the data directory at startup if we need to
   perform crash recovery.

I created some unlogged relations, performed an immediate shutdown, and
then straced postgres as it performed crash recovery. The changes in (2)
do indeed fsync the files we create by copying *_init, and don't fsync
anything else (at least not after I fixed a bug ;-).

I did not do anything about the END_OF_RECOVERY checkpoint setting the
ControlFile->state to DB_SHUTDOWNED, because it wasn't clear to me if
there was any agreement on what to do. I would be happy to submit a
followup patch if we reach some decision about it.

Is this what you had in mind?

-- Abhijit

Attachment

Re: END_OF_RECOVERY shutdowns and ResetUnloggedRelations()

From
Andres Freund
Date:
Hi,

On 2014-09-24 17:06:05 +0530, Abhijit Menon-Sen wrote:
> Hi Andres, Robert.
> 
> I've attached four patches here.

Cool. Will review.

> 1. Move the call to ResetUnloggedRelations(UNLOGGED_RELATION_INIT) to
>    earlier in StartupXLOG.
> 
> 2. Inside that function, issue fsync()s for the main forks we create by
>    copying the _init fork.

These two imo should definitely be backpatched.

> 3. A small fixup to add a const to "typedef char *FileName", because the
>    earlier patch gave me warnings about discarding const-ness. This is
>    consistent with many other functions in fd.c that take const char *.

I'm happy with consider that one for master (although I seem to recall having had
a patch for it rejected?), but I don't think we want to do that in the
backbranches.

> 4. Issue an fsync() on the data directory at startup if we need to
>    perform crash recovery.

I personally think this one should be backpatched too. Does anyone
believe differently?

> From d8726c06cdf11674661eac1d091cf7edd05c2a0c Mon Sep 17 00:00:00 2001
> From: Abhijit Menon-Sen <ams@2ndQuadrant.com>
> Date: Wed, 24 Sep 2014 14:43:18 +0530
> Subject: Call ResetUnloggedRelations(UNLOGGED_RELATION_INIT) earlier
> 
> We need to call this after recovery, but not after the END_OF_RECOVERY
> checkpoint is written. If we crash after that checkpoint, for example
> because of ENOSPC in PreallocXlogFiles, the checkpoint has already set
> the ControlFile->state to DB_SHUTDOWNED, so we don't enter recovery
> again at startup.
> 
> Because we did call ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP)
> in the first cleanup, this leaves the database with a bunch of _init
> forks for unlogged relations, but no main forks, leading to scary
> errors.
> 
> See thread from 20140912112246.GA4984@alap3.anarazel.de for details.

With a explanation. Shiny!

> ---
>  src/backend/access/transam/xlog.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index 46eef5f..218f7fb 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -6863,6 +6863,14 @@ StartupXLOG(void)
>      ShutdownWalRcv();
>  
>      /*
> +     * Reset initial contents of unlogged relations.  This has to be done
> +     * AFTER recovery is complete so that any unlogged relations created
> +     * during recovery also get picked up.
> +     */
> +    if (InRecovery)
> +        ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
> +

+
* Also recovery shouldn't be regarded successful if the reset fails -* e.g. because of ENOSPC.

> +
> +        /*
> +         * copy_file() above has already called pg_flush_data() on the
> +         * files it created. Now we need to fsync those files, because
> +         * a checkpoint won't do it for us while we're in recovery.
> +         */

+

* Doing this in a separate pass is advantageous for performance reasons
* because it allows the kernel to perform all the flushes at once.

> +    /*
> +     * If we need to perform crash recovery, we issue an fsync on the
> +     * data directory to try to ensure that any data written before the
> +     * crash are flushed to disk. Otherwise a power failure in the near
> +     * future might mean that earlier unflushed writes are lost, but the
> +     * more recent data written to disk from here on are persisted.
> +     */
> +
> +    if (ControlFile->state != DB_SHUTDOWNED &&
> +        ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
> +        fsync_fname(data_directory, true);
> +
>      if (ControlFile->state == DB_SHUTDOWNED)
>      {
>          /* This is the expected case, so don't be chatty in standalone mode */
> -- 

Unless I miss something this isn't sufficient. We need to fsync the
files in the data directory, not just the toplevel directory?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: END_OF_RECOVERY shutdowns and ResetUnloggedRelations()

From
Abhijit Menon-Sen
Date:
At 2014-09-25 22:41:18 +0200, andres@2ndquadrant.com wrote:
>
>  * Also recovery shouldn't be regarded successful if the reset fails -
>  * e.g. because of ENOSPC.

OK.

> * Doing this in a separate pass is advantageous for performance reasons
> * because it allows the kernel to perform all the flushes at once.

OK.

> Unless I miss something this isn't sufficient. We need to fsync the
> files in the data directory, not just the toplevel directory?

No, of course you're right. So a separate function that does the moral
equivalent of "find $PGDATA -exec fsync_fname …"?

Will resubmit with the additional comments.

-- Abhijit



Re: END_OF_RECOVERY shutdowns and ResetUnloggedRelations()

From
Andres Freund
Date:
On 2014-09-26 02:34:06 +0530, Abhijit Menon-Sen wrote:
> At 2014-09-25 22:41:18 +0200, andres@2ndquadrant.com wrote:
> > Unless I miss something this isn't sufficient. We need to fsync the
> > files in the data directory, not just the toplevel directory?
>
> No, of course you're right. So a separate function that does the moral
> equivalent of "find $PGDATA -exec fsync_fname …"?

Probably will require some care to deal correctly with tablespaces.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: END_OF_RECOVERY shutdowns and ResetUnloggedRelations()

From
Abhijit Menon-Sen
Date:
At 2014-09-25 22:41:18 +0200, andres@2ndquadrant.com wrote:
>
> On 2014-09-24 17:06:05 +0530, Abhijit Menon-Sen wrote:
> >
> > 1. Move the call to ResetUnloggedRelations(UNLOGGED_RELATION_INIT) to
> >    earlier in StartupXLOG.
> >
> > 2. Inside that function, issue fsync()s for the main forks we create by
> >    copying the _init fork.
>
> These two imo should definitely be backpatched.

I've attached updated versions of these two patches. The only changes
are to revise the comments as you suggested.

> > 3. A small fixup to add a const to "typedef char *FileName", because the
> >    earlier patch gave me warnings about discarding const-ness. This is
> >    consistent with many other functions in fd.c that take const char *.
>
> I'm happy with consider that one for master (although I seem to recall
> having had a patch for it rejected?), but I don't think we want to do
> that in the backbranches.

(I gave up on this for now as an unrelated diversion.)

> > 4. Issue an fsync() on the data directory at startup if we need to
> > perform crash recovery.
>
> I personally think this one should be backpatched too. Does anyone
> believe differently?

I revised this patch as well, I'll rebase and send it separately along
with the initdb -S patch shortly.

-- Abhijit

Attachment

Re: END_OF_RECOVERY shutdowns and ResetUnloggedRelations()

From
Andres Freund
Date:
On 2014-10-27 16:09:30 +0530, Abhijit Menon-Sen wrote:
> At 2014-09-25 22:41:18 +0200, andres@2ndquadrant.com wrote:
> >
> > On 2014-09-24 17:06:05 +0530, Abhijit Menon-Sen wrote:
> > > 
> > > 1. Move the call to ResetUnloggedRelations(UNLOGGED_RELATION_INIT) to
> > >    earlier in StartupXLOG.
> > > 
> > > 2. Inside that function, issue fsync()s for the main forks we create by
> > >    copying the _init fork.
> > 
> > These two imo should definitely be backpatched.
> 
> I've attached updated versions of these two patches. The only changes
> are to revise the comments as you suggested.

Committed and backpatched (to 9.1) these. That required also
backpatching the move of fsync_fname() to fd.c/h to 9.1-9.3.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services