Thread: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

On Thu, Jan 28, 2010 at 12:27 AM, Heikki Linnakangas
<heikki@postgresql.org> wrote:
> Log Message:
> -----------
> Make standby server continuously retry restoring the next WAL segment with
> restore_command, if the connection to the primary server is lost. This
> ensures that the standby can recover automatically, if the connection is
> lost for a long time and standby falls behind so much that the required
> WAL segments have been archived and deleted in the master.
>
> This also makes standby_mode useful without streaming replication; the
> server will keep retrying restore_command every few seconds until the
> trigger file is found. That's the same basic functionality pg_standby
> offers, but without the bells and whistles.
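For context, the standby mode the commit message describes is driven by a recovery.conf along these lines. This is a hypothetical minimal sketch, not from the commit; the host, paths, and trigger-file location are placeholders:

```shell
# Hypothetical minimal recovery.conf for the built-in standby mode
# described above (placeholder paths and conninfo):
cat > recovery.conf <<'EOF'
standby_mode = 'on'
restore_command = 'cp /mnt/archive/%f "%p"'
primary_conninfo = 'host=primary port=5432'
trigger_file = '/tmp/pgsql.trigger'
EOF
grep -c '=' recovery.conf
```

With standby_mode on, the server keeps retrying restore_command (and the streaming connection, if primary_conninfo is set) until the trigger file appears.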

http://archives.postgresql.org/pgsql-hackers/2010-01/msg01520.php
http://archives.postgresql.org/pgsql-hackers/2010-01/msg02589.php

As I pointed out previously, the standby might restore a partially-filled
WAL file that is being archived by the primary, and cause a FATAL error.
And this happened in my box when I was testing the SR.
  sby [20088] FATAL:  archive file "000000010000000000000087" has
wrong size: 14139392 instead of 16777216
  sby [20076] LOG:  startup process (PID 20088) exited with exit code 1
  sby [20076] LOG:  terminating any other active server processes
  act [18164] LOG:  received immediate shutdown request

If the startup process is in standby mode, I think it should retry
starting replication instead of emitting an error when it finds a
partially-filled file in the archive. Then, if replication has been
terminated, it only has to restore the archived file again. Thoughts?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

From: Heikki Linnakangas

Fujii Masao wrote:
> As I pointed out previously, the standby might restore a partially-filled
> WAL file that is being archived by the primary, and cause a FATAL error.
> And this happened in my box when I was testing the SR.
> 
>   sby [20088] FATAL:  archive file "000000010000000000000087" has
> wrong size: 14139392 instead of 16777216
>   sby [20076] LOG:  startup process (PID 20088) exited with exit code 1
>   sby [20076] LOG:  terminating any other active server processes
>   act [18164] LOG:  received immediate shutdown request
> 
> If the startup process is in standby mode, I think it should retry
> starting replication instead of emitting an error when it finds a
> partially-filled file in the archive. Then, if replication has been
> terminated, it only has to restore the archived file again. Thoughts?

Hmm, so after running restore_command, check the file size and if it's
too short, treat it the same as if restore_command returned non-zero?
And it will be retried on the next iteration. Works for me, though OTOH
it will then fail to complain about a WAL file that's genuinely
truncated for some reason. I guess there's no way around that: even if
you have a script as restore_command that does the file size check, it
will have the same problem.
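A restore_command wrapper with the file-size check discussed here might look like the sketch below. This is hypothetical (the function name and layout are mine), and as noted above the check is inherently racy; a real script would also pass .history and .backup files through without the size test, since those are legitimately shorter than a segment:

```shell
WALSEG=16777216   # standard 16MB WAL segment size

# restore_checked ARCHIVE_DIR WAL_FILENAME DESTINATION
# Fails (so the server retries) if the segment is absent or shorter than
# a full segment, i.e. probably still being copied into the archive.
restore_checked() {
  local src="$1/$2"
  [ -f "$src" ] || return 1                          # not archived yet
  [ "$(wc -c < "$src")" -ge "$WALSEG" ] || return 1  # short: likely mid-copy
  cp "$src" "$3"
}

# demo against a fake archive containing a partially-copied segment
mkdir -p /tmp/demo_archive
printf 'partial' > /tmp/demo_archive/000000010000000000000087
restore_checked /tmp/demo_archive 000000010000000000000087 /tmp/seg \
  && echo restored || echo "retry later"
```

Run as shown, the demo declines the 7-byte file and prints "retry later"; once a full-sized segment appears under the same name, the copy goes through.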

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


On Wed, Feb 10, 2010 at 4:32 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Hmm, so after running restore_command, check the file size and if it's
> too short, treat it the same as if restore_command returned non-zero?

Yes, but only in the standby-mode case. OTOH I think that normal archive
recovery should treat it as a FATAL error.

> And it will be retried on the next iteration. Works for me, though OTOH
> it will then fail to complain about a WAL file that's genuinely
> truncated for some reason. I guess there's no way around that, even if
> you have a script as restore_command that does the file size check, it
> will have the same problem.

Right. But does the server in standby mode really need to complain about
that? We might be able to read such a seemingly-truncated WAL file in
full from the primary via SR, or from the archive a few seconds later.
So it seems odd to me to give up on the standby just because it found a
WAL file whose size is short. I believe warm standby (+ pg_standby) is
based on the same reasoning.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


* Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100210 02:33]:
> Hmm, so after running restore_command, check the file size and if it's
> too short, treat it the same as if restore_command returned non-zero?
> And it will be retried on the next iteration. Works for me, though OTOH
> it will then fail to complain about a WAL file that's genuinely
> truncated for some reason. I guess there's no way around that, even if
> you have a script as restore_command that does the file size check, it
> will have the same problem.

But isn't this something every current PITR archive already "works
around"?  Everybody doing PITR archiving already knows the importance of
making the *appearance* of the WAL filename in the archive atomic.

Don't the docs warn about plain cp not being atomic and allowing "short"
files to appear in the archive...

I'm not sure why "streaming recovery" suddenly changes the requirements...
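The atomic appearance Aidan refers to is conventionally done by copying under a temporary name and then renaming into place, since a rename within one filesystem is atomic. A hypothetical sketch of such an archive_command wrapper (the function name and paths are mine, not from the thread):

```shell
# archive_atomically SOURCE_PATH WAL_FILENAME ARCHIVE_DIR
# Copy under a temporary name first, then rename into place: readers of
# the archive never see a half-written file under its final name.
archive_atomically() {
  cp "$1" "$3/$2.tmp" &&
  mv "$3/$2.tmp" "$3/$2"    # rename is atomic within a single filesystem
}

# demo
mkdir -p /tmp/demo_arch
printf 'WALDATA' > /tmp/wal_seg
archive_atomically /tmp/wal_seg 000000010000000000000042 /tmp/demo_arch && echo archived
```

In postgresql.conf terms this would be wired up as something like archive_command = 'archive_atomically.sh %p %f /mnt/archive' (names hypothetical).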

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

From: Heikki Linnakangas

Aidan Van Dyk wrote:
> * Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100210 02:33]:
>  
>> Hmm, so after running restore_command, check the file size and if it's
>> too short, treat it the same as if restore_command returned non-zero?
>> And it will be retried on the next iteration. Works for me, though OTOH
>> it will then fail to complain about a WAL file that's genuinely
>> truncated for some reason. I guess there's no way around that, even if
>> you have a script as restore_command that does the file size check, it
>> will have the same problem.
> 
> But isn't this something every current PITR archive already "works
> around"...  Everybody doing PITR archives already know the importance of
> making the *appearance* of the WAL filename in the archive atomic.

Well, pg_standby does defend against that, but you don't use pg_standby
with the built-in standby mode anymore. It would be reasonable to have
the same level of defenses built-in. It's essentially a one-line change,
and saves a lot of trouble and risk of subtle misconfiguration for admins.

> Don't docs warn about plain cp not being atomic and allowing "short"
> files to appear in the archive... 

Hmm, I don't see anything about that at a quick glance. Besides, normal
PITR doesn't have a problem with this, because it will stop when it
reaches the end of archived WAL anyway.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:
> Fujii Masao wrote:
> > As I pointed out previously, the standby might restore a partially-filled
> > WAL file that is being archived by the primary, and cause a FATAL error.
> > And this happened in my box when I was testing the SR.
> > 
> >   sby [20088] FATAL:  archive file "000000010000000000000087" has
> > wrong size: 14139392 instead of 16777216
> >   sby [20076] LOG:  startup process (PID 20088) exited with exit code 1
> >   sby [20076] LOG:  terminating any other active server processes
> >   act [18164] LOG:  received immediate shutdown request
> > 
> > If the startup process is in standby mode, I think it should retry
> > starting replication instead of emitting an error when it finds a
> > partially-filled file in the archive. Then, if replication has been
> > terminated, it only has to restore the archived file again. Thoughts?
> 
> Hmm, so after running restore_command, check the file size and if it's
> too short, treat it the same as if restore_command returned non-zero?
> And it will be retried on the next iteration. Works for me, though OTOH
> it will then fail to complain about a WAL file that's genuinely
> truncated for some reason. I guess there's no way around that, even if
> you have a script as restore_command that does the file size check, it
> will have the same problem.

Are we trying to re-invent pg_standby here?

-- 
Simon Riggs           www.2ndQuadrant.com



Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

From: Heikki Linnakangas

Simon Riggs wrote:
> On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:
>> Hmm, so after running restore_command, check the file size and if it's
>> too short, treat it the same as if restore_command returned non-zero?
>> And it will be retried on the next iteration. Works for me, though OTOH
>> it will then fail to complain about a WAL file that's genuinely
>> truncated for some reason. I guess there's no way around that, even if
>> you have a script as restore_command that does the file size check, it
>> will have the same problem.
> 
> Are we trying to re-invent pg_standby here?

That's not the goal, but we seem to need some of the same functionality
in the backend now.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


On Thu, 2010-02-11 at 14:22 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:
> >> Hmm, so after running restore_command, check the file size and if it's
> >> too short, treat it the same as if restore_command returned non-zero?
> >> And it will be retried on the next iteration. Works for me, though OTOH
> >> it will then fail to complain about a WAL file that's genuinely
> >> truncated for some reason. I guess there's no way around that, even if
> >> you have a script as restore_command that does the file size check, it
> >> will have the same problem.
> > 
> > Are we trying to re-invent pg_standby here?
> 
> That's not the goal, but we seem to need some of the same functionality
> in the backend now.

I think you need to say why...

-- 
Simon Riggs           www.2ndQuadrant.com



Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

From: Heikki Linnakangas

Simon Riggs wrote:
> On Thu, 2010-02-11 at 14:22 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:
>>>> Hmm, so after running restore_command, check the file size and if it's
>>>> too short, treat it the same as if restore_command returned non-zero?
>>>> And it will be retried on the next iteration. Works for me, though OTOH
>>>> it will then fail to complain about a WAL file that's genuinely
>>>> truncated for some reason. I guess there's no way around that, even if
>>>> you have a script as restore_command that does the file size check, it
>>>> will have the same problem.
>>> Are we trying to re-invent pg_standby here?
>> That's not the goal, but we seem to need some of the same functionality
>> in the backend now.
> 
> I think you need to say why...

See the quoted paragraph above. We should check the file size, so that
we will not fail if the WAL file is just being copied into the archive
directory.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


On Thu, 2010-02-11 at 14:44 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2010-02-11 at 14:22 +0200, Heikki Linnakangas wrote:
> >> Simon Riggs wrote:
> >>> On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:
> >>>> Hmm, so after running restore_command, check the file size and if it's
> >>>> too short, treat it the same as if restore_command returned non-zero?
> >>>> And it will be retried on the next iteration. Works for me, though OTOH
> >>>> it will then fail to complain about a WAL file that's genuinely
> >>>> truncated for some reason. I guess there's no way around that, even if
> >>>> you have a script as restore_command that does the file size check, it
> >>>> will have the same problem.
> >>> Are we trying to re-invent pg_standby here?
> >> That's not the goal, but we seem to need some of the same functionality
> >> in the backend now.
> > 
> > I think you need to say why...
> 
> See the quoted paragraph above. We should check the file size, so that
> we will not fail if the WAL file is just being copied into the archive
> directory.

We can read, but that's not an explanation. By giving terse answers in
that way you are giving the impression that you don't want discussion on
these points.

If you were running pg_standby as the restore_command then this error
wouldn't happen. So you need to explain why running pg_standby cannot
solve your problem and why we must fix it by replicating code that has
previously existed elsewhere.

-- 
Simon Riggs           www.2ndQuadrant.com



Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

From: Heikki Linnakangas

Simon Riggs wrote:
> If you were running pg_standby as the restore_command then this error
> wouldn't happen. So you need to explain why running pg_standby cannot
> solve your problem and why we must fix it by replicating code that has
> previously existed elsewhere.

pg_standby cannot be used with streaming replication.

I guess your next question is: why not?

The startup process alternates between streaming, and restoring files
from archive using restore_command. It will progress using streaming as
long as it can, but if the connection is lost, it will try to poll the
archive until the connection is established again. The startup process
expects the restore_command to try to restore the file and fail if it's
not found. If restore_command goes to sleep waiting for the file
to arrive, that defeats the retry logic in the server, because the
startup process never gets control back to retry establishing the
connection.
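The alternation described above can be sketched as a toy loop. The stub functions stand in for the walreceiver connection attempt and the restore_command invocation; this is an illustration of the control flow, not the actual backend code:

```shell
attempts=0
try_streaming() { return 1; }    # stub: pretend the connection to the primary stays down
try_restore() {                  # stub: the archived file "appears" on the third poll
  attempts=$((attempts + 1))
  [ "$attempts" -ge 3 ]
}

# The startup process keeps control, alternating between the two sources.
# If restore_command slept internally instead of failing fast, the loop
# would never get back to try_streaming.
while true; do
  try_streaming && continue                  # stream as long as possible
  try_restore && { echo "restored after $attempts attempts"; break; }
  sleep 1                                    # then poll again shortly
done
```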

That's the essence of my proposal here:
http://archives.postgresql.org/message-id/4B50AFB4.4060902@enterprisedb.com
which is what has now been implemented.

To support a restore_command that does the sleeping itself, like
pg_standby, would require a major rearchitecting of the retry logic. And
I don't see why that'd be desirable anyway. It's easier for the admin to
set up using simple commands like 'cp' or 'scp' than to require him/her
to write scripts that handle the sleeping and retry logic.


The real problem we have right now is missing documentation. It's
starting to hurt us more and more every day as more people start to
test this, as shown by this thread and some other recent posts.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Simon Riggs <simon@2ndQuadrant.com> writes:
> If you were running pg_standby as the restore_command then this error
> wouldn't happen. So you need to explain why running pg_standby cannot
> solve your problem and why we must fix it by replicating code that has
> previously existed elsewhere.

Let me try.

pg_standby will not let the server get back to streaming replication
mode once it's done driving the replay of all the WAL files available
in the archive; instead, it has the server sit there waiting for the
next file.

The way it's implemented now, the server switches back and forth between
replaying from the archive and streaming from the master. So we want the
server to restore from the archive the same way pg_standby used to,
except that if the archive does not contain the next WAL file, we want
to go back to streaming.

And archive reading will resume at the next network glitch.

I think that's the reasoning; I hope it explains what you're seeing.
-- 
dim


On Thu, 2010-02-11 at 15:28 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > If you were running pg_standby as the restore_command then this error
> > wouldn't happen. So you need to explain why running pg_standby cannot
> > solve your problem and why we must fix it by replicating code that has
> > previously existed elsewhere.
> 
> pg_standby cannot be used with streaming replication.

> I guess you're next question is: why not?
> 
> The startup process alternates between streaming, and restoring files
> from archive using restore_command. It will progress using streaming as
> long as it can, but if the connection is lost, it will try to poll the
> archive until the connection is established again. The startup process
> expects the restore_command to try to restore the file and fail if it's
> not found. If restore_command goes to sleep waiting for the file
> to arrive, that defeats the retry logic in the server, because the
> startup process never gets control back to retry establishing the
> connection.

Why does the startup process need to regain control? Why not just let it
sit and wait? Have you considered that if someone does use pg_standby or
a similar script in the restore_command, the server will never regain
control in the way you hope? Would that cause a sporadic hang?

The overall design was previously that the solution implementor was in
charge of the archive and only they knew its characteristics.

It seems strange that we will be forced to explicitly ban people from
using a utility they are accustomed to and that is still included with
the distro, and then implement in the server the very things the utility
did. Only this time the solution implementor will not be in control.

I would not be against implementing all aspects of pg_standby into the
server. It would make life easier in some ways. I am against
implementing only a *few* of the aspects because that leaves solution
architects in a difficult position to know what to do.

Please lay out some options here for discussion by the community. This
seems like a difficult area and not one to be patched up quickly.

> That's the essence of my proposal here:
> http://archives.postgresql.org/message-id/4B50AFB4.4060902@enterprisedb.com
> which is what has now been implemented.
> 
> To support a restore_command that does the sleeping itself, like
> pg_standby, would require a major rearchitecting of the retry logic. And
> I don't see why that'd be desirable anyway. It's easier for the admin to
> set up using simple commands like 'cp' or 'scp', than require him/her to
> write scripts that handle the sleeping and retry logic.
> 
> 
> The real problem we have right now is missing documentation. It's
> starting to hurt us more and more every day, as more people start to
> test this. As shown by this thread and some other recent posts.

-- 
Simon Riggs           www.2ndQuadrant.com



On Thu, 2010-02-11 at 14:41 +0100, Dimitri Fontaine wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > If you were running pg_standby as the restore_command then this error
> > wouldn't happen. So you need to explain why running pg_standby cannot
> > solve your problem and why we must fix it by replicating code that has
> > previously existed elsewhere.
> 
> Let me try.
> 
> pg_standby will not let the server get back to streaming replication
> mode once it's done with driving the replay of all the WAL files
> available in the archive, but will have the server sits there waiting
> for the next file.
> 
> The way we want that is implemented now is to have the server switch
> back and forth between replaying from the archive and streaming from the
> master. So we want the server to restore from the archive the same way
> pg_standby used to, except that if the archive does not contain the next
> WAL files, we want to get back to streaming.
> 
> And the archive reading will resume at next network glitch.
> 
> I think it's the reasonning, I hope it explains what you see happening.

OK, thanks.

One question then: how do we ensure that the archive does not grow too
big? pg_standby cleans down the archive using %R. That facility appears
not to exist anymore.

-- 
Simon Riggs           www.2ndQuadrant.com



Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

From: Heikki Linnakangas

Simon Riggs wrote:
> One question then: how do we ensure that the archive does not grow too
> big? pg_standby cleans down the archive using %R. That function appears
> to not exist anymore. 

You can still use %R. Of course, plain 'cp' won't know what to do with
it, so a script will then be required. We should probably provide a
sample of that in the docs, or even a ready-made tool similar to
pg_standby but without the waiting.
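Such a sample might look like the sketch below: restore the requested segment, then prune everything that sorts before the restart file passed via the restart-file substitution discussed here. The script name and invocation are entirely hypothetical:

```shell
# walrestore ARCHIVE_DIR WAL_FILENAME DESTINATION RESTART_FILENAME
# e.g. restore_command = 'walrestore.sh /mnt/archive %f %p %r'   (hypothetical)
walrestore() {
  cp "$1/$2" "$3" || return 1
  for f in "$1"/*; do
    base=$(basename "$f")
    # WAL segment names sort chronologically: anything sorting strictly
    # before the restart file is no longer needed by this standby
    if [ "$base" != "$4" ] && \
       [ "$(printf '%s\n%s\n' "$base" "$4" | sort | head -n 1)" = "$base" ]; then
      rm -f "$f"
    fi
  done
}

# demo: archive holds segments 85..87; restoring 87 with restart file 86
# prunes segment 85 and keeps the rest
mkdir -p /tmp/demo_clean
for seg in 000000010000000000000085 000000010000000000000086 000000010000000000000087; do
  printf 'x' > "/tmp/demo_clean/$seg"
done
walrestore /tmp/demo_clean 000000010000000000000087 /tmp/restored 000000010000000000000086
ls /tmp/demo_clean
```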

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


* Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100211 08:29]:
> To support a restore_command that does the sleeping itself, like
> pg_standby, would require a major rearchitecting of the retry logic. And
> I don't see why that'd be desirable anyway. It's easier for the admin to
> set up using simple commands like 'cp' or 'scp', than require him/her to
> write scripts that handle the sleeping and retry logic.

But colour me confused, I'm still not understanding why this is any
different than with normal PITR recovery.

So even with a plain "cp" in your recovery command instead of a
sleep+copy (a la pg_standby, or PITR tools, or all the home-grown
solutions out there), I'm not seeing how it's going to get "half files".
The only way I can see that is if you're out of disk space in your
recovering pg_xlog.

It's well known in PostgreSQL WAL archiving: you don't just "shove"
files into the archive, you make sure they appear there with the right
name atomically.  And if the master is only running the archive command
on whole WAL files, I just don't understand this whole short-WAL problem.

And don't try to tell me you're just "poaching" files from a running
cluster's pg_xlog directory, because I'm going to cry...

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

On Thu, 2010-02-11 at 15:55 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > One question then: how do we ensure that the archive does not grow too
> > big? pg_standby cleans down the archive using %R. That function appears
> > to not exist anymore. 
> 
> You can still use %R. Of course, plain 'cp' won't know what to do with
> it, so a script will then be required. We should probably provide a
> sample of that in the docs, or even a ready-made tool similar to
> pg_standby but without the waiting.

So we still need a script but it can't be pg_standby? Hmmm, OK...

Might it not be simpler to add a parameter onto pg_standby?
We send %s to tell pg_standby the standby_mode of the server which is
calling it so it can decide how to act in each case.

-- 
Simon Riggs           www.2ndQuadrant.com



Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

From: Heikki Linnakangas

Aidan Van Dyk wrote:
> But colour me confused, I'm still not understanding why this is any
> different than with normal PITR recovery.
> 
> So even with a plain "cp" in your recovery command instead of a
> sleep+copy (a la pg_standby, or PITR tools, or all the home-grown
> solutions out there), I'm not seeing how it's going to get "half files".

If the file is just being copied to the archive when restore_command
('cp', say) is launched, it will copy a half file. That's not a problem
for PITR, because PITR will end at the end of valid WAL anyway, but
returning a half WAL file in standby mode is a problem.

> It's well known in PostgreSQL WAL archiving: you don't just "shove"
> files into the archive, you make sure they appear there with the right
> name atomically.  And if the master is only running the archive command
> on whole WAL files, I just don't understand this whole short-WAL problem.

Yeah, if you're careful about that, then this change isn't required. But
pg_standby protects against that, so I think it'd be reasonable to have
the same level of protection built-in. It's not a lot of code.

We could well just document that you should do that, ie. make sure the
file appears in the archive atomically with the right size.

> And don't try to tell me you're just "poaching" files from a running
> cluster's pg_xlog directory, because I'm going to cry...

No :-).

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

From: Heikki Linnakangas

Simon Riggs wrote:
> Might it not be simpler to add a parameter onto pg_standby?
> We send %s to tell pg_standby the standby_mode of the server which is
> calling it so it can decide how to act in each case.

That would work too, but it doesn't seem any simpler to me. On the contrary.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


On Thu, 2010-02-11 at 16:22 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > Might it not be simpler to add a parameter onto pg_standby?
> > We send %s to tell pg_standby the standby_mode of the server which is
> > calling it so it can decide how to act in each case.
> 
> That would work too, but it doesn't seem any simpler to me. On the contrary.

It would mean that pg_standby would act appropriately according to the
setting of standby_mode. So you wouldn't need multiple examples of use,
it would all just work whatever the setting of standby_mode. Nice simple
entry in the docs.

We've said we need a script and pg_standby is it. Having a second
script, pg_standby2, rather than adding something to the existing tools
just seems confusing and easier to get wrong.

I'm happy to do the patch if you like.

-- 
Simon Riggs           www.2ndQuadrant.com



* Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100211 09:17]:

> If the file is just being copied to the archive when restore_command
> ('cp', say) is launched, it will copy a half file. That's not a problem
> for PITR, because PITR will end at the end of valid WAL anyway, but
> returning a half WAL file in standby mode is a problem.

But it can be a problem: without the last WAL file (or at least enough
of it) that the master switched and archived, you have no guarantee of
becoming consistent again (I'm thinking specifically of recovering from
a fresh backup).

> Yeah, if you're careful about that, then this change isn't required. But
> pg_standby protects against that, so I think it'd be reasonable to have
> the same level of protection built-in. It's not a lot of code.

This one check isn't, but what about the rest of the things pg_standby
does?  How much of its functionality should we bring in?  Ideally, "all" of it.

> We could well just document that you should do that, ie. make sure the
> file appears in the archive atomically with the right size.

I have to admit, today was the first time I went and re-read the PITR
docs, and no, the docs don't seem to talk about that... Maybe it was
just plain obvious to me because it (the atomic appearance) is something
unix developers have always had to deal with, so it's ingrained in me.
But I'm *sure* I've seen it bandied around as common knowledge on the
lists, and it's one of the reasons we always see warnings about using
rsync instead of plain scp, etc.

So yeah, we should probably mention that somewhere in the docs.  Section
24.3.6. Caveats?

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> Might it not be simpler to add a parameter onto pg_standby?
>> We send %s to tell pg_standby the standby_mode of the server which is
>> calling it so it can decide how to act in each case.
> 
> That would work too, but it doesn't seem any simpler to me. On the contrary.

I don't know that the ideas are mutually exclusive.  Should the server
internals do a basic check that the files they're about to process seem
sane before processing them?  Yes.  Should we provide a full-featured
program that knows how far back it can delete archives rather than
expecting people to write their own?  Yes.  And a modified pg_standby
seems the easiest way to do that, since it already knows how to do the
computations, among other benefits.

I already have a work-in-progress patch to pg_standby I'm racing to
finish here that cleans up some of the UI warts in the program,
including the scary-sounding and avoidable warnings Selena submitted a
patch to improve.  I think I'm going to add this feature to it, too,
whether or not it's deemed necessary.  I'm really tired of example
setups provided that say "just write a simple script using cp like
<trivial example>" and then supporting those non-production-quality
messes in the field.  My scripts for this sort of thing have more error
checking in them than functional code.  And pg_standby already has a
healthy sense of paranoia built into it.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us

Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

From: Heikki Linnakangas

Simon Riggs wrote:
> On Thu, 2010-02-11 at 16:22 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> Might it not be simpler to add a parameter onto pg_standby?
>>> We send %s to tell pg_standby the standby_mode of the server which is
>>> calling it so it can decide how to act in each case.
>> That would work too, but it doesn't seem any simpler to me. On the contrary.
> 
> It would mean that pg_standby would act appropriately according to the
> setting of standby_mode. So you wouldn't need multiple examples of use,
> it would all just work whatever the setting of standby_mode. Nice simple
> entry in the docs.
> 
> We've said we need a script and pg_standby is it. Having a second
> script, pg_standby2, rather than adding something to the existing tools
> just seems confusing and easier to get wrong.
> 
> I'm happy to do the patch if you like.

Well, doesn't really seem that useful to me, but I won't object if you
want to do it.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

From: Euler Taveira de Oliveira

Simon Riggs escreveu:
> It would mean that pg_standby would act appropriately according to the
> setting of standby_mode. So you wouldn't need multiple examples of use,
> it would all just work whatever the setting of standby_mode. Nice simple
> entry in the docs.
> 
+1. I like the %s idea. IMHO making pg_standby work with SR is a
must-fix; I'm foreseeing a lot of users asking why pg_standby doesn't
work in SR mode...


-- 
Euler Taveira de Oliveira
http://www.timbira.com/


Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

From: Heikki Linnakangas

Aidan Van Dyk wrote:
> * Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100211 09:17]:
> 
>> If the file is just being copied to the archive when restore_command
>> ('cp', say) is launched, it will copy a half file. That's not a problem
>> for PITR, because PITR will end at the end of valid WAL anyway, but
>> returning a half WAL file in standby mode is a problem.
> 
> But it can be a problem - without the last WAL (or at least enough of
> it) the master switched and archived, you have no guarantee of having
> being consistent again (I'm thinking specifically of recovering from a
> fresh backup)

You have to wait for the last WAL file required by the backup to be
archived before starting recovery. Otherwise there's no guarantee anyway.

>> We could well just document that you should do that, ie. make sure the
>> file appears in the archive atomically with the right size.
> 
> I have to admit, today was the first time I went and re-read the PITR
> docs, and no, the docs don't seem to talk about that... Maybe it was
> just plain obvious to me because it (the atomic appearance) is something
> unix developers have always had to deal with, so it's ingrained in me.
> But I'm *sure* that I've seen that bandied around as common knowledge on
> the lists, and one of the reasons we always see warnings about using
> rsync instead of plain SCP, etc.
> 
> So ya, we should probably mention that somewhere in the docs.  Section
> 24.3.6. Caveats?

-1. it isn't necessary for PITR. It's a new requirement for
standby_mode='on', unless we add the file size check into the backend. I
think we should add the file size check to the backend instead and save
admins the headache.
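Until the backend does that check, the protection pg_standby provides could be approximated in a restore_command wrapper. A minimal sketch, assuming the default 16 MB segment size (the function name and layout are hypothetical, not pg_standby's actual code):

```shell
SEGSIZE=16777216   # 16 MB, the default WAL segment size

# Report a partial segment as "not found" so the standby keeps retrying
# instead of failing with a wrong-size FATAL error.
restore_segment() {
    archive_dir="$1"; wal_file="$2"; dest="$3"
    [ -f "$archive_dir/$wal_file" ] || return 1   # not archived yet
    size=$(wc -c < "$archive_dir/$wal_file")
    if [ "$size" -ne "$SEGSIZE" ]; then
        return 1   # still being copied: pretend it does not exist
    fi
    cp "$archive_dir/$wal_file" "$dest"
}
```

A script built around this could then be wired up as, e.g., restore_command = 'restore_wal.sh /mnt/server/archivedir %f %p' (hypothetical script name).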

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

From: Heikki Linnakangas
Aidan Van Dyk wrote:
> * Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100211 09:17]:
> 
>> Yeah, if you're careful about that, then this change isn't required. But
>> pg_standby protects against that, so I think it'd be reasonable to have
>> the same level of protection built-in. It's not a lot of code.
> 
> This 1 check isn't, but what about the rest of the things pg_standby
> does.  How much functionality should we bring it?  Ideally, "all" of it.

Well, how about we bite the bullet then and add enough bells and
whistles to the backend that pg_standby really isn't needed anymore, and
remove it from contrib?

Looking at the options to pg_standby, we're not missing much:

> Options:
>   -c                 copies file from archive (default)
>   -l                 links into archive (leaves file in archive)

Obsolete (link mode not supported anymore)

>   -d                 generate lots of debugging output (testing only)

We have DEBUG statements in the server...

>   -k NUMFILESTOKEEP  if RESTARTWALFILE not used, removes files prior to limit
>                      (0 keeps all)

This is dangerous, and obsoleted by the RESTARTWALFILE option (%r).

>   -r MAXRETRIES      max number of times to retry, with progressive wait
>                      (default=3)

Frankly, this seems pretty useless, but it would be easy to implement.

>   -s SLEEPTIME       seconds to wait between file checks (min=1, max=60,
>                      default=5)

The sleep time in the backend is currently hard-coded at 5 s. Should we
make it configurable?

>   -t TRIGGERFILE     defines a trigger file to initiate failover (no default)

We have this in the backend already.

>   -w MAXWAITTIME     max seconds to wait for a file (0=no limit) (default=0)

We don't have a timeout in the backend. Should we? (the timeout in
pg_standby won't work, even if we add the option to use pg_standby with
standby_mode='on' as Simon suggested)


So the only major feature we're missing is the ability to clean up old
files.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


* Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100211 12:04]:

> > But it can be a problem - without the last WAL (or at least enough of
> > it) the master switched and archived, you have no guarantee of having
> > being consistent again (I'm thinking specifically of recovering from a
> > fresh backup)
> 
> You have to wait for the last WAL file required by the backup to be
> archived before starting recovery. Otherwise there's no guarantee anyway.

Right, but now define "wait for".  If you pull only "half" the last WAL
(because you've accepted that you *can* have short WAL files in the
archive), you have the problem with PITR.

Is it "wait for it to be in the archive", or "wait for it to be in the
archive, and be sure that the contents satisfy some criteria".

I've always made my PITR such that "in the archive" (i.e. the first
moment a recovery can see it) implies that it's bit-for-bit identical to
the original (or at least as bit-for-bit I can assume by checking
various hashes I can afford to).  I just assumed that was kind of common
practice.

I'm amazed that "partial WAL" files are ever available in anyone's
archive, for anyone's restore command to actually pull. I find that
scary, and sure, it probably won't regularly be noticed... But man, I'd
hate for the time I need that emergency PITR restore to be the one time
when it needs that WAL, pulls it slightly before the copy has finished
(i.e. the master is pushing the WAL over a WAN to a 2nd site), and still
have my restore complete "consistently"...

a.




-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> -1. it isn't necessary for PITR. It's a new requirement for
> standby_mode='on', unless we add the file size check into the backend. I
> think we should add the file size check to the backend instead and save
> admins the headache.

I think the file size check needs to be in the backend purely on safety
grounds.  Whether we make pg_standby useful for this context is a
different discussion.
        regards, tom lane


Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

From: Heikki Linnakangas
Aidan Van Dyk wrote:
> * Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100211 12:04]:
> 
>>> But it can be a problem - without the last WAL (or at least enough of
>>> it) the master switched and archived, you have no guarantee of having
>>> being consistent again (I'm thinking specifically of recovering from a
>>> fresh backup)
>> You have to wait for the last WAL file required by the backup to be
>> archived before starting recovery. Otherwise there's no guarantee anyway.
> 
> Right, but now define "wait for".

As in don't start postmaster until the last WAL file needed by backup
has been fully copied to the archive.

> (because you've accepted that you *can* have short WAL files in the
> archive)

Only momentarily, while the copy is in progress.

> I've always made my PITR such that "in the archive" (i.e. the first
> moment a recovery can see it) implies that it's bit-for-bit identical to
> the original (or at least as bit-for-bit I can assume by checking
> various hashes I can afford to).  I just assumed that was kind of common
> practice.

It's certainly good practice, agreed, but hasn't been absolutely required.

> I'm amazed that "partial WAL" files are ever available in anyone's
> archive, for anyone's restore command to actually pull. I find that
> scary, and sure, it probably won't regularly be noticed... But man, I'd
> hate for the time I need that emergency PITR restore to be the one time
> when it needs that WAL, pulls it slightly before the copy has finished
> (i.e. the master is pushing the WAL over a WAN to a 2nd site), and still
> have my restore complete "consistently"...

It's not as dramatic as you make it sound. We're only talking about the
last WAL file, and only when it's just being copied to the archive. If
you have a archive_command like 'cp', and you look at the archive at the
same millisecond that 'cp' runs, then yes you will see that the latest
WAL file in the archive is only partially copied. It's not a problem for
robustness; if you had looked one millisecond earlier you would not have
seen the file there at all.

Windows 'copy' command preallocates the whole file, which poses a
different problem: if you look at the file while it's being copied, the
file has the right length, but isn't in fact fully copied yet. I think
'rsync' has the same problem. To avoid that issue, you have to use
something like copy+rename to make it atomic. There isn't much we can do
in the server (or in pg_standby) to work around that, because there's no
way to distinguish a file that's being copied from a fully-copied
corrupt file.

We do advise setting up an archive_command that doesn't overwrite
existing files. That, together with a partial WAL segment, can cause a
problem: if archive_command crashes while it's writing the file, leaving
a partial file in the archive, the subsequent run of archive_command
won't overwrite it and will get stuck trying. However, there's a small
window for that even if you put the file into the archive atomically: if
you crash just after fully copying the file, but before the .done file
is created, upon restart the server will also try to copy the file to
archive, find that it already exists, and fail.
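Both properties, atomic appearance in the archive and refusing to overwrite an existing segment, can be combined in the archive_command itself. A minimal sketch (the helper name and temp-file naming are illustrative assumptions, not a prescribed implementation):

```shell
# Copy the segment under a temporary name, then rename it into place.
# rename() within one filesystem is atomic, so a reader never sees a
# partial segment; the existence check refuses to overwrite.
archive_segment() {
    src="$1"; archive_dir="$2"; wal_file="$3"
    if [ -f "$archive_dir/$wal_file" ]; then
        return 1   # already archived: never overwrite
    fi
    cp "$src" "$archive_dir/.tmp.$wal_file" &&
        mv "$archive_dir/.tmp.$wal_file" "$archive_dir/$wal_file"
}
```

Hooked up as, e.g., archive_command = 'archive_wal.sh %p /mnt/server/archivedir %f' (hypothetical script name), this avoids the half-copied-file window entirely on the archiving side.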

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


On Thu, 2010-02-11 at 19:29 +0200, Heikki Linnakangas wrote:
> Aidan Van Dyk wrote:
> > * Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100211 09:17]:
> > 
> >> Yeah, if you're careful about that, then this change isn't required. But
> >> pg_standby protects against that, so I think it'd be reasonable to have
> >> the same level of protection built-in. It's not a lot of code.
> > 
> > This 1 check isn't, but what about the rest of the things pg_standby
> > does.  How much functionality should we bring it?  Ideally, "all" of it.
> 
> Well, how about we bite the bullet then and add enough bells and
> whistles to the backend that pg_standby really isn't needed anymore, and
> remove it from contrib?
> 
> Looking at the options to pg_standby, we're not missing much:

Your attitude to this makes me laugh, and cry.

Removing pg_standby can be done and if you manage it well, I would be
happy. I laugh because it seems like you think I have some ego tied up
in it. It's hardly my finest hour of coding, especially since I've not
had the ability to improve it much before now.

Not missing *much*?! I cry because you're so blase about missing one or
two essential features that took years to put in place. Keeping
pg_standby at least guarantees we can do something about it when we
discover you didn't do a good enough job removing it.

-- 
Simon Riggs           www.2ndQuadrant.com



On Thu, 2010-02-11 at 13:08 -0500, Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> > -1. it isn't necessary for PITR. It's a new requirement for
> > standby_mode='on', unless we add the file size check into the backend. I
> > think we should add the file size check to the backend instead and save
> > admins the headache.
> 
> I think the file size check needs to be in the backend purely on safety
> grounds.  Whether we make pg_standby useful for this context is a
> different discussion.

Happy with that.

-- 
Simon Riggs           www.2ndQuadrant.com



Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> I think 'rsync' has the same problem.

There is a switch you can use to create the problem under rsync, but
by default rsync copies to a temporary file name and moves the
completed file to the target name.

-Kevin


On Thu, Feb 11, 2010 at 01:22:44PM -0500, Kevin Grittner wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
>  
> > I think 'rsync' has the same problem.
>  
> There is a switch you can use to create the problem under rsync, but
> by default rsync copies to a temporary file name and moves the
> completed file to the target name.
>  
> -Kevin

I don't use PITR, so I don't know any of the well-understood facts about
PITR with Postgres, but my understanding of rsync is ...

It doesn't fsync data before rename; it's something like: open / write /
close / rename, which could lead to zero-length files on some filesystems.
(Are there other anomalies to worry about here?)

Would that be a concern?  

Garick



On Thu, Feb 11, 2010 at 11:22 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Simon Riggs wrote:
>> Might it not be simpler to add a parameter onto pg_standby?
>> We send %s to tell pg_standby the standby_mode of the server which is
>> calling it so it can decide how to act in each case.
>
> That would work too, but it doesn't seem any simpler to me. On the contrary.

Agreed.

There could be three kinds of SR configurations. Let's think of them separately.

(1) SR without restore_command
(2) SR with 'cp'
(3) SR with pg_standby

(1) is the straightforward configuration. In this case the standby replays only
the WAL files in the pg_xlog directory, and starts SR when it has found an
invalid record or can find no more WAL files. Then, if SR is terminated for
some reason, the standby periodically tries to connect to the primary and start
SR again. If you choose this, you don't need to care about the problem discussed
on this thread.

In case (2) the standby replays the WAL files not only in pg_xlog but also in
the archive, and starts SR when it has found an invalid record or can find no
more WAL files. If the archive is shared between the primary and the
standby, the standby might restore a partial WAL file that is being archived
(copied) by the primary. This can happen because 'cp' is not an atomic operation.

Currently, when the standby finds a WAL file whose size is less than 16MB,
it emits a FATAL error. This is the problem that I presented upthread. That is
undesirable behavior, so I proposed to just treat that case the same as if no
more WAL files were found. If so, the standby can start SR instead of emitting
the FATAL error. (2) is a useful configuration, as described in Heikki's
commit message:
http://archives.postgresql.org/pgsql-committers/2010-01/msg00395.php

(3) was an unexpected configuration (at least for me). This would work fine for
*file-based* log shipping but not for SR. Since pg_standby doesn't return when
no more WAL files are found in the archive (i.e., it waits until the next
complete WAL file is available), SR will never start. OTOH, since pg_standby
treats a partial file as nonexistent, the problem discussed on this thread
doesn't happen.

Questions:
(A) Is my proposal for (2) reasonable? For me, Yes.
(B) Should we allow (3) to work as "streaming replication"? In fact, should we
    create a new mode that makes pg_standby return as soon as it doesn't find
    a complete WAL file in the archive? I agree with Heikki, i.e., I don't
    think it's worth doing. Though pg_standby already has the capability to
    remove the old WAL files, we would still need a cron job that removes them
    periodically, because pg_standby is not executed while SR is running
    normally.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

From: Heikki Linnakangas
Simon Riggs wrote:
> On Thu, 2010-02-11 at 13:08 -0500, Tom Lane wrote:
>> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>>> -1. it isn't necessary for PITR. It's a new requirement for
>>> standby_mode='on', unless we add the file size check into the backend. I
>>> think we should add the file size check to the backend instead and save
>>> admins the headache.
>> I think the file size check needs to be in the backend purely on safety
>> grounds.  Whether we make pg_standby useful for this context is a
>> different discussion.
> 
> Happy with that.

Committed such a check now.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


On Fri, 2010-02-12 at 14:38 +0900, Fujii Masao wrote:
> On Thu, Feb 11, 2010 at 11:22 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
> > Simon Riggs wrote:
> >> Might it not be simpler to add a parameter onto pg_standby?
> >> We send %s to tell pg_standby the standby_mode of the server which is
> >> calling it so it can decide how to act in each case.
> >
> > That would work too, but it doesn't seem any simpler to me. On the contrary.
> 
> Agreed.
> 
> There could be three kinds of SR configurations. Let's think of them separately.
> 
> (1) SR without restore_command
> (2) SR with 'cp'
> (3) SR with pg_standby

Thanks for the explanation.

> (1) is the straightforward configuration. In this case the standby replays only
> the WAL files in pg_xlog directory, and starts SR when it has found the invalid
> record or been able to find no more WAL file. Then if SR is terminated for some
> reasons, the standby would periodically try to connect to the primary and start
> SR again. If you choose this, you don't need to care about the problem discussed
> on the thread.
> 
> In the (2) case the standby replays the WAL files in not only pg_xlog but also
> the archive, and starts SR when it has found the invalid record or been able to
> find no more WAL file. If the archive is shared between the primary and the
> standby, the standby might restore the partial WAL file being archived (copied)
> by the primary. This could happen because 'cp' is not an atomic operation.
> 
> Currently when the standby finds the WAL file whose file size is less than 16MB,
> it emits the FATAL error. This is the problem that I presented upthread. That is
> undesirable behavior, so I proposed to just treat that case the same as if no
> more WAL file is found. If so, the standby can start SR instead of emitting the
> FATAL error. (2) is a useful configuration, as described in Heikki's
> commit message.
> http://archives.postgresql.org/pgsql-committers/2010-01/msg00395.php

> (3) was unexpected configuration (at least for me). This would work fine as a
> *file-based* log shipping but not SR. Since pg_standby doesn't return when no
> more WAL file is found in the archive (i.e., it waits until the next complete
> WAL file is available), SR will never start. OTOH, since pg_standby treats the
> partial file as "nonexistence", the problem discussed on this thread doesn't
> happen.

When we refer to "pg_standby" we mean any script. pg_standby is just a
reference implementation of a useful script. The discussion shouldn't
really focus on pg_standby, nor should we think of it as a different
approach. My original question was whether we are seeking to remove
pg_standby and, if so, have we implemented all of the technical features
that pg_standby provides? Worryingly the answer seems to be Yes and No.
I don't care if we get rid of pg_standby as long as we retain all the
things it does. *Losing* features is not acceptable.

> Questions:
> (A) Is my proposal for (2) reasonable? For me, Yes.
> (B) Should we allow (3) to work as "streaming replication"? In fact, we should
>     create the new mode that makes pg_standby return as soon as it doesn't find
>     a complete WAL file in the archive? I agree with Heikki, i.e., don't think
>     it's worth doing. Though pg_standby already has the capability to remove the
>     old WAL files, we would still need the cron job that removes them
> periodically
>     because pg_standby is not executed during SR is running normally.

Yes, I realised that myself overnight and was going to raise this point
with you today. 

In 8.4 it is pg_standby that was responsible for clearing down the
archive, which is why I suggested using pg_standby for that again. I
agree that will not work. The important thing is not pg_standby but that
we have a valid mechanism for clearing down the archive.

If (2) is a fully supported replication method then we cannot rely on
the existence of an external cron job to clear down the archive. Most
importantly, that cron job would not know the point up to which to clear
down the archive, the information given to pg_standby by %r.

So I suggest that you have a new action that gets called after every
checkpoint to clear down the archive. It will remove all files from the
archive prior to %r. We can implement that as a sequence of unlink()s
from within the server, or we can just call a script to do it. I prefer
the latter approach. However we do it, we need something initiated by
the server to maintain the archive and stop it from overflowing. 
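The script flavor of that proposal could look roughly like this; a hypothetical sketch of an archive_cleanup_command that removes every segment sorting before the %r file name (not the actual pg_standby -a implementation, and it ignores non-segment files such as .history or .backup for brevity):

```shell
# Remove archived segments no longer needed, i.e. everything whose name
# sorts before the restartpoint file name passed in as %r.
cleanup_archive() {
    archive_dir="$1"; restart_file="$2"   # $2 corresponds to %r
    for f in "$archive_dir"/*; do
        [ -f "$f" ] || continue
        # WAL segment names are fixed-width hex, so a plain string
        # comparison (bash [[ ]]) orders them correctly
        if [[ "$(basename "$f")" < "$restart_file" ]]; then
            rm -f "$f"
        fi
    done
}
```

Invoked after each restartpoint as, e.g., archive_cleanup_command = 'cleanup_archive.sh /mnt/server/archivedir %r' (hypothetical script name), this keeps the archive from overflowing without any external cron job.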

-- 
Simon Riggs           www.2ndQuadrant.com



Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

From: Heikki Linnakangas
Simon Riggs wrote:
> In 8.4 it is pg_standby that was responsible for clearing down the
> archive, which is why I suggested using pg_standby for that again. I
> agree that will not work. The important thing is not pg_standby but that
> we have a valid mechanism for clearing down the archive.

Good point. When streaming from the master, the standby doesn't call
restore_command, and restore_command doesn't get a chance to clean up
old files.

> So I suggest that you have a new action that gets called after every
> checkpoint to clear down the archive. It will remove all files from the
> archive prior to %r. We can implement that as a sequence of unlink()s
> from within the server, or we can just call a script to do it. I prefer
> the latter approach. However we do it, we need something initiated by
> the server to maintain the archive and stop it from overflowing. 

+1

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


On Fri, 2010-02-12 at 12:54 +0000, Simon Riggs wrote:

> So I suggest that you have a new action that gets called after every
> checkpoint to clear down the archive. It will remove all files from the
> archive prior to %r. We can implement that as a sequence of unlink()s
> from within the server, or we can just call a script to do it. I prefer
> the latter approach. However we do it, we need something initiated by
> the server to maintain the archive and stop it from overflowing.

Attached patch implements pg_standby for use as an
archive_cleanup_command, reusing existing code with a new -a option.

e.g.

archive_cleanup_command = 'pg_standby -a -d /mnt/server/archiverdir %r'

Happy to add the archive_cleanup_command into main server as well, if
you like. Won't take long.

--
 Simon Riggs           www.2ndQuadrant.com

On Fri, Feb 12, 2010 at 10:10 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>> So I suggest that you have a new action that gets called after every
>> checkpoint to clear down the archive. It will remove all files from the
>> archive prior to %r. We can implement that as a sequence of unlink()s
>> from within the server, or we can just call a script to do it. I prefer
>> the latter approach. However we do it, we need something initiated by
>> the server to maintain the archive and stop it from overflowing.
>
> +1

If we leave executing the remove_command to the bgwriter, a restartpoint
might unfortunately not happen for a long time. To prevent that situation, the
archiver should execute the command instead, I think. Thoughts?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

From: Heikki Linnakangas
Fujii Masao wrote:
> On Fri, Feb 12, 2010 at 10:10 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>>> So I suggest that you have a new action that gets called after every
>>> checkpoint to clear down the archive. It will remove all files from the
>>> archive prior to %r. We can implement that as a sequence of unlink()s
>>> from within the server, or we can just call a script to do it. I prefer
>>> the latter approach. However we do it, we need something initiated by
>>> the server to maintain the archive and stop it from overflowing.
>> +1
> 
> If we leave executing the remove_command to the bgwriter, a restartpoint
> might unfortunately not happen for a long time.

Are you thinking of a scenario where remove_command gets stuck, and
prevents bgwriter from performing restartpoints while it's stuck? You
have trouble if restore_command gets stuck like that as well, so I think
we can require that the remove_command returns in a reasonable period of
time, i.e. in a few minutes.

> To prevent that situation, the
> archiver should execute the command, I think. Thought?

The archiver isn't running in standby, so that's not going to work. And
it's not connected to shared memory either, so it doesn't know what the
latest restartpoint is.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Simon Riggs <simon@2ndQuadrant.com> writes:
> Attached patch implements pg_standby for use as an
> archive_cleanup_command, reusing existing code with new -a option.
>
> Happy to add the archive_cleanup_command into main server as well, if
> you like. Won't take long.

Would it be possible to have the server do the cleanup (and the cp for
that matter) on its own or use a script?

Either have archive_cleanup = on with a commented-out (empty)
archive_cleanup_command (and the same for restore_command), or a
special magic value, like 'postgresql', to activate the internal one.

This way we have both a zero-config working default setup and the
flexibility to adapt to any archiving solution our users might already be
using.

A default archive_command would be nice too. Like you give it a
directory name or a rsync path and it does the basic right thing.

I'm not sure my detailed approach is the right one, but the bird's-eye view
is to have the simple case really simple to set up, and the complex cases
still possible. Default built-in archive, restore, and cleanup commands
would be awesome.

Oh, and the basic simple case is actually a remote standby: without
NFS or the like, there is no shared mount point for archiving.

Regards,
-- 
dim


I don't much like having the server doing the cleanup, because it doesn't
necessarily have the whole picture. It doesn't know if it is the only replica
reading these log files, or if the site policy is to keep them for disaster
recovery purposes.

I like having this as an external command though. Or alternately, now that we
have read-only replicas, you could implement this by polling the slaves and
asking them what log position they are up to.

greg

On 12 Feb 2010 17:25, "Dimitri Fontaine" <dfontaine@hi-media.com> wrote:
> Would it be possible to have the server do the cleanup (and the cp for
> that matter) on its own or use a script?
> [...]
On Sat, Feb 13, 2010 at 1:10 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Are you thinking of a scenario where remove_command gets stuck, and
> prevents bgwriter from performing restartpoints while it's stuck?

Yes. If the archive is on a remote server and a network outage
happens, remove_command might get stuck, I'm afraid.

> You
> have trouble if restore_command gets stuck like that as well, so I think
> we can require that the remove_command returns in a reasonable period of
> time, ie. in a few minutes.

Oh, you are right!

BTW, we need to note that the remove_command approach would be useless if one
archive is shared by multiple standbys. One standby might wrongly remove an
archived WAL file that is still required by another standby. In this case, we
need a job script that calculates which archived WAL files are required by no
standby, and removes them.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


On Fri, Feb 12, 2010 at 2:29 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> So the only major feature we're missing is the ability to clean up old
> files.

I found another missing feature in the new file-based log shipping (i.e.,
standby_mode is enabled and 'cp' is used as the restore_command).

After the trigger file is found, the startup process with pg_standby
tries to replay all of the WAL files in both pg_xlog and the archive.
So, when the primary fails, if the latest WAL file in pg_xlog of the
primary can be read, we can prevent the data loss by copying it to
pg_xlog of the standby before creating the trigger file.

On the other hand, the startup process in standby mode doesn't
replay the WAL files in pg_xlog after the trigger file is found. So
failover always causes data loss even if the latest WAL file can
be read from the primary. And if the latest WAL file is copied to the
archive instead, it can be replayed, but a PANIC error would happen
because it's not fully filled.

Should we remove this restriction?
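The ordering this relies on, salvage the primary's newest segment first, then drop the trigger file, can be sketched like this (a local-path illustration only; in practice the copy from the failed primary would be scp or rsync over the network, and all names here are hypothetical):

```shell
# Minimize data loss at failover: grab the primary's last, possibly
# partial, WAL segment before telling the standby to leave recovery.
failover_with_salvage() {
    primary_xlog="$1"; standby_xlog="$2"; trigger_file="$3"
    # 1. copy the primary's newest segment into the standby's pg_xlog,
    #    where a partial file is legal to replay (segment names are
    #    fixed-width hex, so lexicographic sort finds the newest)
    newest=$(ls "$primary_xlog" | sort | tail -n1)
    [ -n "$newest" ] && cp "$primary_xlog/$newest" "$standby_xlog/"
    # 2. only then create the trigger file to end recovery
    touch "$trigger_file"
}
```

With pg_standby this sequence works today; the point of the question above is that the built-in standby mode does not replay that salvaged file from pg_xlog.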

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Fujii Masao wrote:
> I found another missing feature in new file-based log shipping (i.e.,
> standby_mode is enabled and 'cp' is used as restore_command).
> 
> After the trigger file is found, the startup process with pg_standby
> tries to replay all of the WAL files in both pg_xlog and the archive.
> So, when the primary fails, if the latest WAL file in pg_xlog of the
> primary can be read, we can prevent the data loss by copying it to
> pg_xlog of the standby before creating the trigger file.
> 
> On the other hand, the startup process with standby mode doesn't
> replay the WAL files in pg_xlog after the trigger file is found. So
> failover always causes the data loss even if the latest WAL file can
> be read from the primary. And if the latest WAL file is copied to the
> archive instead, it can be replayed but a PANIC error would happen
> because it's not filled.
> 
> We should remove this restriction?

Looking into this, I realized that we have a bigger problem related to
this. Although streaming replication stores the streamed WAL files in
pg_xlog, so that they can be re-replayed after a standby restart without
connecting to the master, we don't try to replay those either. So if you
restart standby, it will fail to start up if the WAL it needs can't be
found in archive or by connecting to the master. That must be fixed.

I'd imagine that the ability to restore WAL files manually copied to
pg_xlog will fall out of that fix too.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


On Wed, 2010-03-17 at 12:35 +0200, Heikki Linnakangas wrote:

> Looking into this, I realized that we have a bigger problem...

A lot of this would be easier if you wrote the docs first, then worked
through the problems. The new system is more complex, since it has two
modes rather than one, plus multiple processes and a live connection.
The number of failure cases must be higher than before.

Documenting how it is supposed to work in the event of failure will help
everyone check those and comment on them.

-- Simon Riggs           www.2ndQuadrant.com



On Wed, Mar 17, 2010 at 7:35 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Fujii Masao wrote:
>> I found another missing feature in new file-based log shipping (i.e.,
>> standby_mode is enabled and 'cp' is used as restore_command).
>>
>> After the trigger file is found, the startup process with pg_standby
>> tries to replay all of the WAL files in both pg_xlog and the archive.
>> So, when the primary fails, if the latest WAL file in pg_xlog of the
>> primary can be read, we can prevent the data loss by copying it to
>> pg_xlog of the standby before creating the trigger file.
>>
>> On the other hand, the startup process with standby mode doesn't
>> replay the WAL files in pg_xlog after the trigger file is found. So
>> failover always causes the data loss even if the latest WAL file can
>> be read from the primary. And if the latest WAL file is copied to the
>> archive instead, it can be replayed but a PANIC error would happen
>> because it's not filled.
>>
>> We should remove this restriction?
>
> Looking into this, I realized that we have a bigger problem related to
> this. Although streaming replication stores the streamed WAL files in
> pg_xlog, so that they can be re-replayed after a standby restart without
> connecting to the master, we don't try to replay those either. So if you
> restart standby, it will fail to start up if the WAL it needs can't be
> found in archive or by connecting to the master. That must be fixed.

I agree that this is a bigger problem. Since the standby always starts
walreceiver before replaying any WAL files in pg_xlog, walreceiver tries
to receive the WAL files following the REDO starting point even if they
are already in pg_xlog. IOW, the same WAL files might be shipped
from the primary to the standby many times. This behavior is wasteful
and should be addressed.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


On Thu, 2010-03-18 at 23:27 +0900, Fujii Masao wrote:

> I agree that this is a bigger problem. Since the standby always starts
> walreceiver before replaying any WAL files in pg_xlog, walreceiver tries
> to receive the WAL files following the REDO starting point even if they
> have already been in pg_xlog. IOW, the same WAL files might be shipped
> from the primary to the standby many times. This behavior is unsmart,
> and should be addressed.

We might also have written half a file many times. The files in pg_xlog
are suspect whereas the files in the archive are not. If we have both we
should prefer the archive.

-- Simon Riggs           www.2ndQuadrant.com



Simon Riggs wrote:
> On Thu, 2010-03-18 at 23:27 +0900, Fujii Masao wrote:
>
>> I agree that this is a bigger problem. Since the standby always starts
>> walreceiver before replaying any WAL files in pg_xlog, walreceiver tries
>> to receive the WAL files following the REDO starting point even if they
>> have already been in pg_xlog. IOW, the same WAL files might be shipped
>> from the primary to the standby many times. This behavior is unsmart,
>> and should be addressed.
>
> We might also have written half a file many times. The files in pg_xlog
> are suspect whereas the files in the archive are not. If we have both we
> should prefer the archive.

Yep.

Here's a patch I've been playing with. The idea is that in standby mode,
the server keeps trying to make progress in the recovery by:

a) restoring files from archive
b) replaying files from pg_xlog
c) streaming from master

When recovery reaches an invalid WAL record, typically caused by a
half-written WAL file, it closes the file and moves to the next source.
If an error is found in a file restored from archive or in a portion
just streamed from master, however, a PANIC is thrown, because it's not
expected to have errors in the archive or in the master.

When a file is streamed from master, it's left in pg_xlog, so it's found
there after a standby restart, and recovery can progress to the same
point as before restart. It also means that you can copy partial WAL
files to pg_xlog at any time and have them replayed in a few seconds.

The code structure is a bit spaghetti-like, I'm afraid. Any suggestions
on how to improve that are welcome.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 445,464 **** static uint32 openLogSeg = 0;
  static uint32 openLogOff = 0;

  /*
   * These variables are used similarly to the ones above, but for reading
   * the XLOG.  Note, however, that readOff generally represents the offset
   * of the page just read, not the seek position of the FD itself, which
   * will be just past that page. readLen indicates how much of the current
!  * page has been read into readBuf.
   */
  static int    readFile = -1;
  static uint32 readId = 0;
  static uint32 readSeg = 0;
  static uint32 readOff = 0;
  static uint32 readLen = 0;

! /* Is the currently open segment being streamed from primary? */
! static bool readStreamed = false;

  /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
  static char *readBuf = NULL;
--- 445,477 ----
  static uint32 openLogOff = 0;

  /*
+  * Codes indicating where we got a WAL file from during recovery, or where
+  * to attempt to get one.
+  */
+ #define XLOG_FROM_ARCHIVE        (1<<0)    /* Restored using restore_command */
+ #define XLOG_FROM_PG_XLOG        (1<<1)    /* Existing file in pg_xlog */
+ #define XLOG_FROM_STREAM        (1<<2)    /* Streamed from master */
+
+ /*
   * These variables are used similarly to the ones above, but for reading
   * the XLOG.  Note, however, that readOff generally represents the offset
   * of the page just read, not the seek position of the FD itself, which
   * will be just past that page. readLen indicates how much of the current
!  * page has been read into readBuf, and readSource indicates where we got
!  * the currently open file from.
   */
  static int    readFile = -1;
  static uint32 readId = 0;
  static uint32 readSeg = 0;
  static uint32 readOff = 0;
  static uint32 readLen = 0;
+ static int readSource = 0;        /* XLOG_FROM_* code */

! /*
!  * Keeps track of which sources we've tried to read the current WAL
!  * record from and failed.
!  */
! static int failedSources = 0;

  /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
  static char *readBuf = NULL;
***************
*** 512,520 **** static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
                         bool find_free, int *max_advance,
                         bool use_lock);
  static int XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
!              bool fromArchive, bool notexistOk);
  static int XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode,
!                    bool fromArchive);
  static bool XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
               bool randAccess);
  static void XLogFileClose(void);
--- 525,533 ----
                         bool find_free, int *max_advance,
                         bool use_lock);
  static int XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
!              int source, bool notexistOk);
  static int XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode,
!                    int sources);
  static bool XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
               bool randAccess);
  static void XLogFileClose(void);
***************
*** 2567,2573 **** XLogFileOpen(uint32 log, uint32 seg)
   */
  static int
  XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
!              bool fromArchive, bool notfoundOk)
  {
      char        xlogfname[MAXFNAMELEN];
      char        activitymsg[MAXFNAMELEN + 16];
--- 2580,2586 ----
   */
  static int
  XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
!              int source, bool notfoundOk)
  {
      char        xlogfname[MAXFNAMELEN];
      char        activitymsg[MAXFNAMELEN + 16];
***************
*** 2576,2598 **** XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,

      XLogFileName(xlogfname, tli, log, seg);

!     if (fromArchive)
      {
!         /* Report recovery progress in PS display */
!         snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
!                  xlogfname);
!         set_ps_display(activitymsg, false);

!         restoredFromArchive = RestoreArchivedFile(path, xlogfname,
!                                                   "RECOVERYXLOG",
!                                                   XLogSegSize);
!         if (!restoredFromArchive)
!             return -1;
!     }
!     else
!     {
!         XLogFilePath(path, tli, log, seg);
!         restoredFromArchive = false;
      }

      fd = BasicOpenFile(path, O_RDONLY | PG_BINARY, 0);
--- 2589,2616 ----

      XLogFileName(xlogfname, tli, log, seg);

!     switch (source)
      {
!         case XLOG_FROM_ARCHIVE:
!             /* Report recovery progress in PS display */
!             snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
!                      xlogfname);
!             set_ps_display(activitymsg, false);

!             restoredFromArchive = RestoreArchivedFile(path, xlogfname,
!                                                       "RECOVERYXLOG",
!                                                       XLogSegSize);
!             if (!restoredFromArchive)
!                 return -1;
!             break;
!
!         case XLOG_FROM_PG_XLOG:
!             XLogFilePath(path, tli, log, seg);
!             restoredFromArchive = false;
!             break;
!
!         default:
!             elog(ERROR, "invalid XLogFileRead source %d", source);
      }

      fd = BasicOpenFile(path, O_RDONLY | PG_BINARY, 0);
***************
*** 2606,2611 **** XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
--- 2624,2631 ----
                   xlogfname);
          set_ps_display(activitymsg, false);

+         readSource = source;
+
          return fd;
      }
      if (errno != ENOENT || !notfoundOk) /* unexpected failure? */
***************
*** 2624,2630 **** XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
   * searched in pg_xlog if not found in archive.
   */
  static int
! XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, bool fromArchive)
  {
      char        path[MAXPGPATH];
      ListCell   *cell;
--- 2644,2650 ----
   * searched in pg_xlog if not found in archive.
   */
  static int
! XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, int sources)
  {
      char        path[MAXPGPATH];
      ListCell   *cell;
***************
*** 2647,2666 **** XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, bool fromArchive)
          if (tli < curFileTLI)
              break;                /* don't bother looking at too-old TLIs */

!         fd = XLogFileRead(log, seg, emode, tli, fromArchive, true);
!         if (fd != -1)
!             return fd;

!         /*
!          * If not in StandbyMode, fall back to searching pg_xlog. In
!          * StandbyMode we're streaming segments from the primary to pg_xlog,
!          * and we mustn't confuse the (possibly partial) segments in pg_xlog
!          * with complete segments ready to be applied. We rather wait for the
!          * records to arrive through streaming.
!          */
!         if (!StandbyMode && fromArchive)
          {
!             fd = XLogFileRead(log, seg, emode, tli, false, true);
              if (fd != -1)
                  return fd;
          }
--- 2667,2685 ----
          if (tli < curFileTLI)
              break;                /* don't bother looking at too-old TLIs */

!         if (sources & XLOG_FROM_ARCHIVE)
!         {
!             fd = XLogFileRead(log, seg, emode, tli, XLOG_FROM_ARCHIVE, true);
!             if (fd != -1)
!             {
!                 elog(DEBUG1, "got WAL segment from archive");
!                 return fd;
!             }
!         }

!         if (sources & XLOG_FROM_PG_XLOG)
          {
!             fd = XLogFileRead(log, seg, emode, tli, XLOG_FROM_PG_XLOG, true);
              if (fd != -1)
                  return fd;
          }
***************
*** 3530,3545 **** ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
      uint32        pageHeaderSize;
      int            emode;

-     /*
-      * We don't expect any invalid records during streaming recovery: we
-      * should never hit the end of WAL because we wait for it to be streamed.
-      * Therefore treat any broken WAL as PANIC, instead of failing over.
-      */
-     if (StandbyMode)
-         emode = PANIC;
-     else
-         emode = emode_arg;
-
      if (readBuf == NULL)
      {
          /*
--- 3549,3554 ----
***************
*** 3591,3600 **** ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
          randAccess = true;        /* allow curFileTLI to go backwards too */
      }

      /* Read the page containing the record */
!     if (!XLogPageRead(RecPtr, emode, fetching_ckpt, randAccess))
          return NULL;

      pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
      targetRecOff = RecPtr->xrecoff % XLOG_BLCKSZ;
      if (targetRecOff == 0)
--- 3600,3623 ----
          randAccess = true;        /* allow curFileTLI to go backwards too */
      }

+     /* This is the first attempt to read this page. */
+     failedSources = 0;
+ retry:
      /* Read the page containing the record */
!     if (!XLogPageRead(RecPtr, emode_arg, fetching_ckpt, randAccess))
          return NULL;

+     /*
+      * We don't expect any invalid records in archive or in records streamed
+      * from master: we should never hit the end of WAL because we wait for it
+      * to be streamed. Therefore treat any broken WAL as PANIC, instead of
+      * failing over.
+      */
+     if (readSource == XLOG_FROM_STREAM || readSource == XLOG_FROM_ARCHIVE)
+         emode = PANIC;
+     else
+         emode = emode_arg;
+
      pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
      targetRecOff = RecPtr->xrecoff % XLOG_BLCKSZ;
      if (targetRecOff == 0)
***************
*** 3828,3833 **** next_record_is_invalid:;
--- 3851,3864 ----
          close(readFile);
          readFile = -1;
      }
+
+     /* In standby-mode, retry from another source */
+     if (StandbyMode)
+     {
+         failedSources |= readSource;
+         goto retry;
+     }
+
      return NULL;
  }

***************
*** 8698,8704 **** StartupProcessMain(void)
   * as for waiting for the requested WAL record to arrive in standby mode.
   */
  static bool
! XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
               bool randAccess)
  {
      static XLogRecPtr receivedUpto = {0, 0};
--- 8729,8735 ----
   * as for waiting for the requested WAL record to arrive in standby mode.
   */
  static bool
! XLogPageRead(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt,
               bool randAccess)
  {
      static XLogRecPtr receivedUpto = {0, 0};
***************
*** 8707,8719 **** XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
      uint32        targetRecOff;
      uint32        targetId;
      uint32        targetSeg;

      XLByteToSeg(*RecPtr, targetId, targetSeg);
      targetPageOff = ((RecPtr->xrecoff % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
      targetRecOff = RecPtr->xrecoff % XLOG_BLCKSZ;

      /* Fast exit if we have read the record in the current buffer already */
!     if (targetId == readId && targetSeg == readSeg &&
          targetPageOff == readOff && targetRecOff < readLen)
          return true;

--- 8738,8752 ----
      uint32        targetRecOff;
      uint32        targetId;
      uint32        targetSeg;
+     int            emode;
+     static pg_time_t last_fail_time = 0;

      XLByteToSeg(*RecPtr, targetId, targetSeg);
      targetPageOff = ((RecPtr->xrecoff % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
      targetRecOff = RecPtr->xrecoff % XLOG_BLCKSZ;

      /* Fast exit if we have read the record in the current buffer already */
!     if (failedSources == 0 && targetId == readId && targetSeg == readSeg &&
          targetPageOff == readOff && targetRecOff < readLen)
          return true;

***************
*** 8725,8742 **** XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
      {
          close(readFile);
          readFile = -1;
      }

      XLByteToSeg(*RecPtr, readId, readSeg);

      /* See if we need to retrieve more data */
      if (readFile < 0 ||
!         (readStreamed && !XLByteLT(*RecPtr, receivedUpto)))
      {
          if (StandbyMode)
          {
-             bool        last_restore_failed = false;
-
              /*
               * In standby mode, wait for the requested record to become
               * available, either via restore_command succeeding to restore the
--- 8758,8775 ----
      {
          close(readFile);
          readFile = -1;
+         readSource = 0;
      }

      XLByteToSeg(*RecPtr, readId, readSeg);

+ retry:
      /* See if we need to retrieve more data */
      if (readFile < 0 ||
!         (readSource == XLOG_FROM_STREAM && !XLByteLT(*RecPtr, receivedUpto)))
      {
          if (StandbyMode)
          {
              /*
               * In standby mode, wait for the requested record to become
               * available, either via restore_command succeeding to restore the
***************
*** 8746,8751 **** XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
--- 8779,8786 ----
              {
                  if (WalRcvInProgress())
                  {
+                     failedSources = 0;
+
                      /*
                       * While walreceiver is active, wait for new WAL to arrive
                       * from primary.
***************
*** 8761,8775 **** XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
                          {
                              readFile =
                                  XLogFileRead(readId, readSeg, PANIC,
!                                              recoveryTargetTLI, false, false);
                              switched_segment = true;
!                             readStreamed = true;
                          }
                          break;
                      }

                      if (CheckForStandbyTrigger())
!                         goto next_record_is_invalid;

                      /*
                       * When streaming is active, we want to react quickly when
--- 8796,8811 ----
                          {
                              readFile =
                                  XLogFileRead(readId, readSeg, PANIC,
!                                              recoveryTargetTLI,
!                                              XLOG_FROM_PG_XLOG, false);
                              switched_segment = true;
!                             readSource = XLOG_FROM_STREAM;
                          }
                          break;
                      }

                      if (CheckForStandbyTrigger())
!                         goto triggered;

                      /*
                       * When streaming is active, we want to react quickly when
***************
*** 8779,8784 **** XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
--- 8815,8823 ----
                  }
                  else
                  {
+                     int sources;
+                     pg_time_t now;
+
                      /*
                       * Until walreceiver manages to reconnect, poll the
                       * archive.
***************
*** 8791,8828 **** XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
                      /* Reset curFileTLI if random fetch. */
                      if (randAccess)
                          curFileTLI = 0;
!                     readFile = XLogFileReadAnyTLI(readId, readSeg, DEBUG2, true);
                      switched_segment = true;
-                     readStreamed = false;
                      if (readFile != -1)
                      {
-                         elog(DEBUG1, "got WAL segment from archive");
                          break;
                      }

                      /*
!                      * If we succeeded restoring some segments from archive
!                      * since the last connection attempt (or we haven't tried
!                      * streaming yet, retry immediately. But if we haven't,
!                      * assume the problem is persistent, so be less
!                      * aggressive.
                       */
!                     if (last_restore_failed)
                      {
!                         /*
!                          * Check to see if the trigger file exists. Note that
!                          * we do this only after failure, so when you create
!                          * the trigger file, we still finish replaying as much
!                          * as we can before failover.
!                          */
!                         if (CheckForStandbyTrigger())
!                             goto next_record_is_invalid;
!                         pg_usleep(5000000L);    /* 5 seconds */
                      }
!                     last_restore_failed = true;

                      /*
!                      * Nope, not found in archive. Try to stream it.
                       *
                       * If fetching_ckpt is TRUE, RecPtr points to the initial
                       * checkpoint location. In that case, we use RedoStartLSN
--- 8830,8877 ----
                      /* Reset curFileTLI if random fetch. */
                      if (randAccess)
                          curFileTLI = 0;
!
!                     /*
!                      * Try to restore the file from archive, or read an
!                      * existing file from pg_xlog.
!                      */
!                     sources = XLOG_FROM_ARCHIVE | XLOG_FROM_PG_XLOG;
!                     sources &= ~failedSources;
!                     readFile = XLogFileReadAnyTLI(readId, readSeg, DEBUG2,
!                                                   sources);
                      switched_segment = true;
                      if (readFile != -1)
                      {
                          break;
                      }

                      /*
!                      * Nope, not found in archive.
!                      */
!
!                     /*
!                      * Check to see if the trigger file exists. Note that
!                      * we do this only after failure, so when you create
!                      * the trigger file, we still finish replaying as much
!                      * as we can from archive and pg_xlog before failover.
                       */
!                     if (CheckForStandbyTrigger())
!                         goto triggered;
!
!                     /*
!                      * Sleep if it hasn't been long since last attempt.
!                      */
!                     now = (pg_time_t) time(NULL);
!                     if ((now - last_fail_time) < 5)
                      {
!                         pg_usleep(1000000L * (5 - (now - last_fail_time)));
!                         now = (pg_time_t) time(NULL);
                      }
!                     last_fail_time = now;

                      /*
!                      * If primary_conninfo is set, launch walreceiver to
!                      * try to stream the missing WAL.
                       *
                       * If fetching_ckpt is TRUE, RecPtr points to the initial
                       * checkpoint location. In that case, we use RedoStartLSN
***************
*** 8847,8859 **** XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
              /* In archive or crash recovery. */
              if (readFile < 0)
              {
                  /* Reset curFileTLI if random fetch. */
                  if (randAccess)
                      curFileTLI = 0;
!                 readFile = XLogFileReadAnyTLI(readId, readSeg, emode,
!                                               InArchiveRecovery);
                  switched_segment = true;
-                 readStreamed = false;
                  if (readFile < 0)
                      return false;
              }
--- 8896,8914 ----
              /* In archive or crash recovery. */
              if (readFile < 0)
              {
+                 int sources;
                  /* Reset curFileTLI if random fetch. */
                  if (randAccess)
                      curFileTLI = 0;
!
!                 sources = XLOG_FROM_PG_XLOG;
!                 if (InArchiveRecovery)
!                     sources |= XLOG_FROM_ARCHIVE;
!                 sources &= ~failedSources;
!
!                 readFile = XLogFileReadAnyTLI(readId, readSeg, emode_arg,
!                                               sources);
                  switched_segment = true;
                  if (readFile < 0)
                      return false;
              }
***************
*** 8861,8878 **** XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
      }

      /*
!      * At this point, we have the right segment open and we know the requested
!      * record is in it.
       */
      Assert(readFile != -1);

      /*
       * If the current segment is being streamed from master, calculate how
       * much of the current page we have received already. We know the
       * requested record has been received, but this is for the benefit of
       * future calls, to allow quick exit at the top of this function.
       */
!     if (readStreamed)
      {
          if (RecPtr->xlogid != receivedUpto.xlogid ||
              (RecPtr->xrecoff / XLOG_BLCKSZ) != (receivedUpto.xrecoff / XLOG_BLCKSZ))
--- 8916,8944 ----
      }

      /*
!      * At this point, we have the right segment open and if we're streaming
!      * we know the requested record is in it.
       */
      Assert(readFile != -1);

      /*
+      * We don't expect any invalid records in archive or in records streamed
+      * from master: we should never hit the end of WAL because we wait for it
+      * to be streamed. Therefore treat any broken WAL as PANIC, instead of
+      * failing over.
+      */
+     if (readSource == XLOG_FROM_STREAM || readSource == XLOG_FROM_ARCHIVE)
+         emode = PANIC;
+     else
+         emode = emode_arg;
+
+     /*
       * If the current segment is being streamed from master, calculate how
       * much of the current page we have received already. We know the
       * requested record has been received, but this is for the benefit of
       * future calls, to allow quick exit at the top of this function.
       */
!     if (readSource == XLOG_FROM_STREAM)
      {
          if (RecPtr->xlogid != receivedUpto.xlogid ||
              (RecPtr->xrecoff / XLOG_BLCKSZ) != (receivedUpto.xrecoff / XLOG_BLCKSZ))
***************
*** 8936,8946 **** XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
      return true;

  next_record_is_invalid:
      if (readFile >= 0)
          close(readFile);
      readFile = -1;
-     readStreamed = false;
      readLen = 0;

      return false;
  }
--- 9002,9026 ----
      return true;

  next_record_is_invalid:
+     failedSources |= readSource;
+
+     if (readFile >= 0)
+         close(readFile);
+     readFile = -1;
+     readLen = 0;
+     readSource = 0;
+
+     if (StandbyMode)
+         goto retry;
+     else
+         return false;
+
+ triggered:
      if (readFile >= 0)
          close(readFile);
      readFile = -1;
      readLen = 0;
+     readSource = 0;

      return false;
  }

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Simon Riggs wrote:
>> We might also have written half a file many times. The files in pg_xlog
>> are suspect whereas the files in the archive are not. If we have both we
>> should prefer the archive.

> Yep.

Really?  That will result in a change in the longstanding behavior of
ordinary recovery.  I'm unconvinced that this is wise.
        regards, tom lane


Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Simon Riggs wrote:
>>> We might also have written half a file many times. The files in pg_xlog
>>> are suspect whereas the files in the archive are not. If we have both we
>>> should prefer the archive.
> 
>> Yep.
> 
> Really?  That will result in a change in the longstanding behavior of
> ordinary recovery.

Really? Not as far as I can see. Recovery has always tried to restore
WAL files from archive first, falling back to the copy in pg_xlog only
if restore_command returns failure.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Heikki Linnakangas wrote:

> When recovery reaches an invalid WAL record, typically caused by a
> half-written WAL file, it closes the file and moves to the next source.
> If an error is found in a file restored from archive or in a portion
> just streamed from master, however, a PANIC is thrown, because it's not
> expected to have errors in the archive or in the master.

Hmm, I think I've heard that tools like walmgr do incremental copies of
the current WAL segment to the archive.  Doesn't this change break that?
(Maybe I'm confused and it doesn't work that way)

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
> 
>> When recovery reaches an invalid WAL record, typically caused by a
>> half-written WAL file, it closes the file and moves to the next source.
>> If an error is found in a file restored from archive or in a portion
>> just streamed from master, however, a PANIC is thrown, because it's not
>> expected to have errors in the archive or in the master.
> 
> Hmm, I think I've heard that tools like walmgr do incremental copies of
> the current WAL segment to the archive.  Doesn't this change break that?

Hmm, you could have a restore_command that checks the size before
restoring, to keep that working. I note that pg_standby does that, but
of course you can't use pg_standby with the built-in standby mode. Or
maybe we should modify the built-in standby mode to handle partial files
coming from restore_command: instead of throwing an error, recover to
the end of the partial file, then retry restore_command with the
same filename until the whole file has been recovered (or the missing
WAL is received through other means, i.e. streaming replication).
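A size-checking restore_command could be sketched like this. This is a
hypothetical wrapper script, not part of PostgreSQL; it assumes the default
16 MB XLogSegSize, and the function name and archive path are made up for
illustration:

```shell
# Hypothetical restore_command helper: restore an archived WAL segment
# only once it has reached the full segment size, so a standby never
# restores a partially-archived file.
#
# recovery.conf sketch (paths are assumptions):
#   restore_command = '/usr/local/bin/restore_if_complete.sh /mnt/archive/%f %p'

restore_if_complete() {
    archive_file=$1
    dest=$2
    segsize=${3:-16777216}          # XLogSegSize; 16 MB unless built otherwise

    # Not archived yet: report failure so recovery retries later.
    [ -f "$archive_file" ] || return 1

    # Partially archived: also report failure rather than restore it.
    actual=$(wc -c < "$archive_file" | tr -d ' ')
    [ "$actual" -eq "$segsize" ] || return 1

    cp "$archive_file" "$dest"
}
```

Since restore_command is retried anyway, returning failure for a short file
simply postpones the restore until the archiver finishes writing the segment.
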

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Sorry for the delay.

On Fri, Mar 19, 2010 at 8:37 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Here's a patch I've been playing with.

Thanks! I'm reading the patch.

> The idea is that in standby mode,
> the server keeps trying to make progress in the recovery by:
>
> a) restoring files from archive
> b) replaying files from pg_xlog
> c) streaming from master
>
> When recovery reaches an invalid WAL record, typically caused by a
> half-written WAL file, it closes the file and moves to the next source.
> If an error is found in a file restored from archive or in a portion
> just streamed from master, however, a PANIC is thrown, because it's not
> expected to have errors in the archive or in the master.

But in the current (v8.4 or before) behavior, recovery ends normally
when an invalid record is found in an archived WAL file. Otherwise,
the server would never be able to start normal processing when there
is a corrupted archived file for some reason. So, that invalid record
should not be treated as a PANIC if the server is not in standby mode
or the trigger file has been created. Thought?

When I tested the patch, the following PANIC error was thrown in the
normal archive recovery. This seems to derive from the above change.
The detail error sequence:
1. In ReadRecord(), emode was set to PANIC after 00000001000000000000000B
   was read.
2. 00000001000000000000000C including the contrecord tried to be read
   by using the emode (= PANIC). But since 00000001000000000000000C did
   not exist, PANIC error was thrown.

-----------------
LOG:  restored log file "00000001000000000000000B" from archive
cp: cannot stat `../data.arh/00000001000000000000000C': No such file
or directory
PANIC:  could not open file "pg_xlog/00000001000000000000000C" (log
file 0, segment 12): No such file or directory
LOG:  startup process (PID 17204) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes
-----------------

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Fujii Masao wrote:
> But in the current (v8.4 or before) behavior, recovery ends normally
> when an invalid record is found in an archived WAL file. Otherwise,
> the server would never be able to start normal processing when there
> is a corrupted archived file for some reason. So, that invalid record
> should not be treated as a PANIC if the server is not in standby mode
> or the trigger file has been created. Thought?

Hmm, true, this changes behavior over previous releases. I tend to think
that it's always an error if there's a corrupt file in the archive,
though, and PANIC is appropriate. If the administrator wants to start up
the database anyway, he can remove the corrupt file from the archive and
place it directly in pg_xlog instead.

> When I tested the patch, the following PANIC error was thrown in the
> normal archive recovery. This seems to derive from the above change.
> The detail error sequence:
> 1. In ReadRecord(), emode was set to PANIC after 00000001000000000000000B
>    was read.
> 2. 00000001000000000000000C including the contrecord tried to be read
>    by using the emode (= PANIC). But since 00000001000000000000000C did
>    not exist, PANIC error was thrown.
> 
> -----------------
> LOG:  restored log file "00000001000000000000000B" from archive
> cp: cannot stat `../data.arh/00000001000000000000000C': No such file
> or directory
> PANIC:  could not open file "pg_xlog/00000001000000000000000C" (log
> file 0, segment 12): No such file or directory
> LOG:  startup process (PID 17204) was terminated by signal 6: Aborted
> LOG:  terminating any other active server processes
> -----------------

Thanks. That's easily fixable (applies over the previous patch):

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3773,7 +3773,7 @@ retry:
                pagelsn.xrecoff = 0;
            }
            /* Wait for the next page to become available */
-           if (!XLogPageRead(&pagelsn, emode, false, false))
+           if (!XLogPageRead(&pagelsn, emode_arg, false, false))
                return NULL;

            /* Check that the continuation record looks valid */

Perhaps the emode/emode_arg convention is a bit hard to read.

I'll go through the patch myself once more, and commit later today or
tomorrow if no new issues crop up.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


On Wed, Mar 24, 2010 at 9:31 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Hmm, true, this changes behavior over previous releases. I tend to think
> that it's always an error if there's a corrupt file in the archive,
> though, and PANIC is appropriate. If the administrator wants to start up
> the database anyway, he can remove the corrupt file from the archive and
> place it directly in pg_xlog instead.

Okay.

> Thanks. That's easily fixable (applies over the previous patch):
>
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -3773,7 +3773,7 @@ retry:
>                pagelsn.xrecoff = 0;
>            }
>            /* Wait for the next page to become available */
> -           if (!XLogPageRead(&pagelsn, emode, false, false))
> +           if (!XLogPageRead(&pagelsn, emode_arg, false, false))
>                return NULL;
>
>            /* Check that the continuation record looks valid */

Seems correct.

> sources &= ~failedSources;
> failedSources |= readSource;

The above lines in XLogPageRead() seem not to be required in normal
recovery case (i.e., standby_mode = off). So how about the attached
patch?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

On Wed, Mar 24, 2010 at 10:20 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> Thanks. That's easily fixable (applies over the previous patch):
>>
>> --- a/src/backend/access/transam/xlog.c
>> +++ b/src/backend/access/transam/xlog.c
>> @@ -3773,7 +3773,7 @@ retry:
>>                pagelsn.xrecoff = 0;
>>            }
>>            /* Wait for the next page to become available */
>> -           if (!XLogPageRead(&pagelsn, emode, false, false))
>> +           if (!XLogPageRead(&pagelsn, emode_arg, false, false))
>>                return NULL;
>>
>>            /* Check that the continuation record looks valid */
>
> Seems correct.

On second thought, the following lines seem to be necessary just after
calling XLogPageRead() since it reads new WAL file from another source.

>    if (readSource == XLOG_FROM_STREAM || readSource == XLOG_FROM_ARCHIVE)
>        emode = PANIC;
>    else
>        emode = emode_arg;

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


On Wed, 2010-03-24 at 14:31 +0200, Heikki Linnakangas wrote:
> Fujii Masao wrote:
> > But in the current (v8.4 or before) behavior, recovery ends normally
> > when an invalid record is found in an archived WAL file. Otherwise,
> > the server would never be able to start normal processing when there
> > is a corrupted archived file for some reason. So, that invalid record
> > should not be treated as a PANIC if the server is not in standby mode
> > or the trigger file has been created. Thought?
> 
> Hmm, true, this changes behavior over previous releases. I tend to think
> that it's always an error if there's a corrupt file in the archive,
> though, and PANIC is appropriate. If the administrator wants to start up
> the database anyway, he can remove the corrupt file from the archive and
> place it directly in pg_xlog instead.

I don't agree with changing the behaviour from previous releases.

PANICing won't change the situation, so it just destroys server
availability. If we had 1 master and 42 slaves then this behaviour would
take down almost the whole server farm at once. Very uncool.

You might have reason to prevent the server starting up at that point,
when in standby mode, but that is not a reason to PANIC. We don't really
want all of the standbys thinking they can be the master all at once
either. Better to throw a serious ERROR and have the server still up and
available for reads.

--
 Simon Riggs           www.2ndQuadrant.com



On Thu, Mar 25, 2010 at 8:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> PANICing won't change the situation, so it just destroys server
> availability. If we had 1 master and 42 slaves then this behaviour would
> take down almost the whole server farm at once. Very uncool.
>
> You might have reason to prevent the server starting up at that point,
> when in standby mode, but that is not a reason to PANIC. We don't really
> want all of the standbys thinking they can be the master all at once
> either. Better to throw a serious ERROR and have the server still up and
> available for reads.

OK. How about making the startup process emit WARNING, stop WAL replay and
wait for the presence of trigger file, when an invalid record is found?
Which keeps the server up for readonly queries. And if the trigger file is
found, I think that the startup process should emit a FATAL, i.e., the
server should exit immediately, to prevent the server from becoming the
primary in a half-finished state. Also to allow such a halfway failover,
we should provide fast failover mode as pg_standby does?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Fujii Masao <masao.fujii@gmail.com> writes:
> OK. How about making the startup process emit WARNING, stop WAL replay and
> wait for the presence of trigger file, when an invalid record is found?
> Which keeps the server up for readonly queries. And if the trigger file is
> found, I think that the startup process should emit a FATAL, i.e., the
> server should exit immediately, to prevent the server from becoming the
> primary in a half-finished state. Also to allow such a halfway failover,
> we should provide fast failover mode as pg_standby does?

I find it extremely scary to read this sort of blue-sky design
discussion going on now, two months after we were supposedly
feature-frozen for 9.0.  We need to be looking for the *rock bottom
minimum* amount of work to do to get 9.0 out the door in a usable
state; not what would be nice to have later on.
        regards, tom lane


On Thu, 2010-03-25 at 11:08 +0900, Fujii Masao wrote:
> On Thu, Mar 25, 2010 at 8:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > PANICing won't change the situation, so it just destroys server
> > availability. If we had 1 master and 42 slaves then this behaviour would
> > take down almost the whole server farm at once. Very uncool.
> >
> > You might have reason to prevent the server starting up at that point,
> > when in standby mode, but that is not a reason to PANIC. We don't really
> > want all of the standbys thinking they can be the master all at once
> > either. Better to throw a serious ERROR and have the server still up and
> > available for reads.
> 
> OK. How about making the startup process emit WARNING, stop WAL replay and
> wait for the presence of trigger file, when an invalid record is found?
> Which keeps the server up for readonly queries. And if the trigger file is
> found, I think that the startup process should emit a FATAL, i.e., the
> server should exit immediately, to prevent the server from becoming the
> primary in a half-finished state. Also to allow such a halfway failover,
> we should provide fast failover mode as pg_standby does?

The lack of docs begins to show a lack of coherent high-level design
here. By now, I've forgotten what this thread was even about. The major
design decision in this that keeps showing up is "remove pg_standby, at
all costs" but no reason has ever been given for that. I do believe
there is a "better way", but we won't find it by trial and error, even
if we had time to do so.

Please work on some clear docs for the failure modes in this system.
That way we can all read them and understand them, or point out further
issues. Moving straight to code is not a solution to this, since what we
need now is to all agree on the way forwards. If we ignore this, then
there is considerable risk that streaming rep will have a fatal
operational flaw.

Please just document/diagram how it works now, highlighting the problems
that still remain to be solved. We're all behind you and I'm helping
wherever I can.

--
 Simon Riggs           www.2ndQuadrant.com



Tom Lane wrote:
> Fujii Masao <masao.fujii@gmail.com> writes:
>> OK. How about making the startup process emit WARNING, stop WAL replay and
>> wait for the presence of trigger file, when an invalid record is found?
>> Which keeps the server up for readonly queries. And if the trigger file is
>> found, I think that the startup process should emit a FATAL, i.e., the
>> server should exit immediately, to prevent the server from becoming the
>> primary in a half-finished state. Also to allow such a halfway failover,
>> we should provide fast failover mode as pg_standby does?
> 
> I find it extremely scary to read this sort of blue-sky design
> discussion going on now, two months after we were supposedly
> feature-frozen for 9.0.  We need to be looking for the *rock bottom
> minimum* amount of work to do to get 9.0 out the door in a usable
> state; not what would be nice to have later on.

Agreed, this is getting complicated. I'm already worried about the
amount of changes needed to make it work, I don't want to add any new
modes. PANIC seems like the appropriate solution for now.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


On Thu, 2010-03-25 at 11:08 +0900, Fujii Masao wrote:
> On Thu, Mar 25, 2010 at 8:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > PANICing won't change the situation, so it just destroys server
> > availability. If we had 1 master and 42 slaves then this behaviour would
> > take down almost the whole server farm at once. Very uncool.
> >
> > You might have reason to prevent the server starting up at that point,
> > when in standby mode, but that is not a reason to PANIC. We don't really
> > want all of the standbys thinking they can be the master all at once
> > either. Better to throw a serious ERROR and have the server still up and
> > available for reads.
> 
> OK. How about making the startup process emit WARNING, stop WAL replay and
> wait for the presence of trigger file, when an invalid record is found?
> Which keeps the server up for readonly queries. 

Yes. Receiving new WAL records is a completely separate activity from
running the rest of the server (in this release...).

> And if the trigger file is
> found, I think that the startup process should emit a FATAL, i.e., the
> server should exit immediately, to prevent the server from becoming the
> primary in a half-finished state. 

Please remember that "half-finished" is your judgment on what has
happened in the particular scenario you are considering. In many cases,
an invalid WAL record clearly and simply indicates the end of WAL and we
should start up normally.

"State" is a good word here. I'd like to see the server have a clear
state model with well documented transitions between them. The state
should also be externally queriable, so we can work out what its doing
and how long we can expect it to keep doing it for.

I don't want to be in a position where we are waiting for the server to
sort itself out from a complex set of retries. 

> Also to allow such a halfway failover,
> we should provide fast failover mode as pg_standby does?

Yes, we definitely need a JFDI solution for immediate failover.

--
 Simon Riggs           www.2ndQuadrant.com



On Thu, 2010-03-25 at 10:11 +0200, Heikki Linnakangas wrote:

> PANIC seems like the appropriate solution for now.

It definitely is not. Think some more.

--
 Simon Riggs           www.2ndQuadrant.com



Simon Riggs wrote:
> On Thu, 2010-03-25 at 11:08 +0900, Fujii Masao wrote:
>> And if the trigger file is
>> found, I think that the startup process should emit a FATAL, i.e., the
>> server should exit immediately, to prevent the server from becoming the
>> primary in a half-finished state. 
> 
> Please remember that "half-finished" is your judgment on what has
> happened in the particular scenario you are considering. In many cases,
> an invalid WAL record clearly and simply indicates the end of WAL and we
> should start up normally.

Not in the archive. An invalid WAL record in a file in pg_xlog is
usually an indication of end-of-WAL, but there should be no invalid
records in the archived WAL files, or streamed from the master.

> "State" is a good word here. I'd like to see the server have a clear
> state model with well documented transitions between them. The state
> should also be externally queriable, so we can work out what its doing
> and how long we can expect it to keep doing it for.

Agreed.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


(cc'ing docs list)

Simon Riggs wrote:
> The lack of docs begins to show a lack of coherent high-level design
> here.

Yeah, I think you're right. It's becoming hard to keep track of how it's
supposed to behave.

> By now, I've forgotten what this thread was even about. The major
> design decision in this that keeps showing up is "remove pg_standby, at
> all costs" but no reason has ever been given for that. I do believe
> there is a "better way", but we won't find it by trial and error, even
> if we had time to do so.

This has nothing to do with pg_standby.

> Please work on some clear docs for the failure modes in this system.
> That way we can all read them and understand them, or point out further
> issues. Moving straight to code is not a solution to this, since what we
> need now is to all agree on the way forwards. If we ignore this, then
> there is considerable risk that streaming rep will have a fatal
> operational flaw.
>
> Please just document/diagram how it works now, highlighting the problems
> that still remain to be solved. We're all behind you and I'm helping
> wherever I can.

Ok, here's my attempt at the docs. Read it as a replacement for the
"High Availability, Load Balancing, and Replication" chapter, but of
course many of the sections will be unchanged, as indicated below.

-------------
Chapter 25. High Availability, Load Balancing, and Replication

25.1 Comparison of different solutions

<no changes>


25.2 Log-Shipping Standby servers

<overview from current "File-based Log Shipping" section. With small
changes so that it applies to the built-in standby mode as well as
pg_standby like solutions>

A standby server can also be used for read-only queries. This is called
Hot Standby mode, see chapter XXX

25.2.1 Planning

Set up two servers with identical hardware ...

<two first paragraphs of current File-based log-shipping / Planning section>

25.2.3 Standby mode

In standby mode, the server continuously applies WAL received from the
master server. The standby server can receive WAL from a WAL archive
(see restore_command) or directly from the master over a TCP connection
(streaming replication). The standby server will also attempt to restore
any WAL found in the standby's pg_xlog. That typically happens after a
server restart, to replay again any WAL streamed from the master before
the restart, but you can also manually copy files to pg_xlog at any time
to have them replayed.

At startup, the standby begins by restoring all WAL available in the
archive location, calling restore_command. Once it reaches the end of
WAL available there and restore_command fails, it tries to restore any
WAL available in the pg_xlog directory (possibly stored there by
streaming replication before restart). If that fails, and streaming
replication has been configured, the standby tries to connect to the
master server and stream WAL from it. If that fails or streaming
replication is not configured, or if the connection is disconnected
later on, the standby goes back to step 1 and tries restoring the
file from the archive again. This loop of retries from the archive,
pg_xlog, and via streaming replication goes on until the server is
stopped or failover is triggered by a trigger file.

A corrupt or half-finished WAL file in the archive, or streamed from the
master, causes a PANIC and immediate shutdown of the standby server. A
corrupt WAL file is always a serious event which requires administrator
action. If you want to recover a WAL file known to be corrupt as far as
it can be, you can copy the file manually into pg_xlog.

Standby mode is exited and the server switches to normal operation, when
a trigger file is found (trigger_file). Before failover, it will restore
any WAL available in the archive or in pg_xlog, but won't try to connect
to the master or wait for files to become available in the archive.


25.2.4 Preparing Master for Standby servers

Set up continuous archiving to a WAL archive on the master, as described
in the chapter "Continuous Archiving and Point-In-Time Recovery". The
archive location should be accessible from the standby even when the
master is down, ie. it should reside on the standby server itself or
another trusted server, not on the master server.

If you want to use streaming replication, set up authentication to allow
streaming replication connections. Set max_wal_senders.

Take a base backup as described in chapter Continuous Archiving and
Point-In-Time Recovery / Making a Base Backup.

25.2.4.1 Authentication for streaming replication

Ensure that listen_addresses allows connections from the standby server.

<current Streaming Replication / Authentication section, describing
pg_hba.conf>


25.2.5 Setting up the standby server

1. Take a base backup, and copy it to the standby

2. Create a restore_command to restore files from the WAL archive.

3. Set standby_mode=on

4. If you want to use streaming replication, set primary_conninfo
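
For illustration, those four steps might translate into a recovery.conf
such as the following (the archive path, connection details, and trigger
file location are placeholders):

```
standby_mode = 'on'
restore_command = 'cp /mnt/server/archivedir/%f %p'
primary_conninfo = 'host=192.168.1.50 port=5432 user=replication'
trigger_file = '/var/lib/pgsql/data/failover.trigger'
```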


You can use restartpoint_command to prune the archive of files no longer
needed by the standby.

You can have any number of standby servers as long as you set
max_wal_senders high enough in the master to allow them to be connected
simultaneously.

25.2.6 Archive recovery based log shipping

An alternative to the built-in standby mode described in the previous
sections is to use a restore_command that polls the archive location.
This was the only option available in versions 8.4 and below. In this
setup, set standby_mode=off, because you are implementing the polling
required for a standby server yourself. See contrib/pg_standby for a
reference implementation of this.

Note that in this mode, the server will apply WAL one file at a
time, so if you use the standby server for queries (see Hot Standby),
there is a bigger delay between an action in the master and when the
action becomes visible in the standby, corresponding to the time it takes
to fill up the WAL file. archive_timeout can be used to make that delay
shorter. Also note that you can't combine streaming replication with
this method.

<all the sections from current File-based log shipping that don't apply
to built-in standby mode go here>

25.3 Hot Standby

<no changes>

25.4 Incrementally Updated backups

<no changes>

-------------


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Simon Riggs wrote:
> On Thu, 2010-03-25 at 10:11 +0200, Heikki Linnakangas wrote:
> 
>> PANIC seems like the appropriate solution for now.
> 
> It definitely is not. Think some more.

Well, what happens in previous versions with pg_standby et al is
that the standby starts up. That doesn't seem appropriate either.

Hmm, it would be trivial to just stay in the standby mode at a corrupt
file, continuously retrying to restore it and continue replay. If it's
genuinely corrupt, it will never succeed and the standby gets stuck at
that point. Maybe that's better; it's close to what Fujii suggested
except that you don't need a new mode for it.

I'm worried that the administrator won't notice the error promptly
because at a quick glance the server is up and running, while it's
actually stuck at the error and falling indefinitely behind the master.
Maybe if we make it a WARNING, that's enough to alleviate that. It's
true that if the standby is actively being used for read-only queries,
shutting it down just to get the administrator's attention isn't good either.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> On Thu, 2010-03-25 at 10:11 +0200, Heikki Linnakangas wrote:
>>
>>> PANIC seems like the appropriate solution for now.
>> It definitely is not. Think some more.
> 
> Well, what happens now in previous versions with pg_standby et al is
> that the standby starts up. That doesn't seem appropriate either.
> 
> Hmm, it would be trivial to just stay in the standby mode at a corrupt
> file, continuously retrying to restore it and continue replay. If it's
> genuinely corrupt, it will never succeed and the standby gets stuck at
> that point. Maybe that's better; it's close to what Fujii suggested
> except that you don't need a new mode for it.

On second thought, that's easy for the built-in standby mode, but not
for archive recovery where we're not currently retrying anything. In
archive recovery, we could throw a WARNING and start up, which would be
like the current behavior in older versions except you get a WARNING
instead of LOG, or we could PANIC. I'm leaning towards PANIC, which
makes most sense for genuine point-in-time or archive recovery (ie. not
a standby server), but I can see the rationale for WARNING too, for
pg_standby and for the sake of preserving old behavior.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Fujii Masao wrote:
>> sources &= ~failedSources;
>> failedSources |= readSource;
> 
> The above lines in XLogPageRead() seem not to be required in normal
> recovery case (i.e., standby_mode = off). So how about the attached
> patch?
>
> *** 9050,9056 **** next_record_is_invalid:
> --- 9047,9056 ----
>       readSource = 0;
>   
>       if (StandbyMode)
> +     {
> +         failedSources |= readSource;
>           goto retry;
> +     }
>       else
>           return false;

That doesn't work because readSource is cleared above. But yeah,
failedSources is not needed in archive recovery, so that line can be
removed.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Fujii Masao wrote:
> On second thought, the following lines seem to be necessary just after
> calling XLogPageRead() since it reads new WAL file from another source.
>
>>     if (readSource == XLOG_FROM_STREAM || readSource == XLOG_FROM_ARCHIVE)
>>         emode = PANIC;
>>     else
>>         emode = emode_arg;

Yep.

Here's an updated patch, with these changes since the last patch:

* Fix the bug of a spurious PANIC in archive recovery, if the WAL ends
in the middle of a WAL record that continues over a WAL segment boundary.

* If a corrupt WAL record is found in archive or streamed from master in
standby mode, throw WARNING instead of PANIC, and keep trying. In
archive recovery (ie. standby_mode=off) it's still a PANIC. We can make
it a WARNING too, which gives the pre-9.0 behavior of starting up the
server on corruption. I prefer PANIC but the discussion is still going on.

* Small code changes to handling of failedSources, inspired by your
comment. No change in functionality.

This is also available in my git repository at
git://git.postgresql.org/git/users/heikki/postgres.git, branch "xlogchanges"

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e57f22e..4aa1870 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -450,20 +450,33 @@ static uint32 openLogSeg = 0;
 static uint32 openLogOff = 0;

 /*
+ * Codes indicating where we got a WAL file from during recovery, or where
+ * to attempt to get one.
+ */
+#define XLOG_FROM_ARCHIVE        (1<<0)    /* Restored using restore_command */
+#define XLOG_FROM_PG_XLOG        (1<<1)    /* Existing file in pg_xlog */
+#define XLOG_FROM_STREAM        (1<<2)    /* Streamed from master */
+
+/*
  * These variables are used similarly to the ones above, but for reading
  * the XLOG.  Note, however, that readOff generally represents the offset
  * of the page just read, not the seek position of the FD itself, which
  * will be just past that page. readLen indicates how much of the current
- * page has been read into readBuf.
+ * page has been read into readBuf, and readSource indicates where we got
+ * the currently open file from.
  */
 static int    readFile = -1;
 static uint32 readId = 0;
 static uint32 readSeg = 0;
 static uint32 readOff = 0;
 static uint32 readLen = 0;
+static int readSource = 0;        /* XLOG_FROM_* code */

-/* Is the currently open segment being streamed from primary? */
-static bool readStreamed = false;
+/*
+ * Keeps track of which sources we've tried to read the current WAL
+ * record from and failed.
+ */
+static int failedSources = 0;

 /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
 static char *readBuf = NULL;
@@ -517,11 +530,12 @@ static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
                        bool find_free, int *max_advance,
                        bool use_lock);
 static int XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
-             bool fromArchive, bool notexistOk);
+             int source, bool notexistOk);
 static int XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode,
-                   bool fromArchive);
+                   int sources);
 static bool XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
              bool randAccess);
+static int emode_for_corrupt_record(int endofwalmode);
 static void XLogFileClose(void);
 static bool RestoreArchivedFile(char *path, const char *xlogfname,
                     const char *recovername, off_t expectedSize);
@@ -2573,7 +2587,7 @@ XLogFileOpen(uint32 log, uint32 seg)
  */
 static int
 XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
-             bool fromArchive, bool notfoundOk)
+             int source, bool notfoundOk)
 {
     char        xlogfname[MAXFNAMELEN];
     char        activitymsg[MAXFNAMELEN + 16];
@@ -2582,23 +2596,28 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,

     XLogFileName(xlogfname, tli, log, seg);

-    if (fromArchive)
+    switch (source)
     {
-        /* Report recovery progress in PS display */
-        snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
-                 xlogfname);
-        set_ps_display(activitymsg, false);
+        case XLOG_FROM_ARCHIVE:
+            /* Report recovery progress in PS display */
+            snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
+                     xlogfname);
+            set_ps_display(activitymsg, false);

-        restoredFromArchive = RestoreArchivedFile(path, xlogfname,
-                                                  "RECOVERYXLOG",
-                                                  XLogSegSize);
-        if (!restoredFromArchive)
-            return -1;
-    }
-    else
-    {
-        XLogFilePath(path, tli, log, seg);
-        restoredFromArchive = false;
+            restoredFromArchive = RestoreArchivedFile(path, xlogfname,
+                                                      "RECOVERYXLOG",
+                                                      XLogSegSize);
+            if (!restoredFromArchive)
+                return -1;
+            break;
+
+        case XLOG_FROM_PG_XLOG:
+            XLogFilePath(path, tli, log, seg);
+            restoredFromArchive = false;
+            break;
+
+        default:
+            elog(ERROR, "invalid XLogFileRead source %d", source);
     }

     fd = BasicOpenFile(path, O_RDONLY | PG_BINARY, 0);
@@ -2612,6 +2631,8 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
                  xlogfname);
         set_ps_display(activitymsg, false);

+        readSource = source;
+
         return fd;
     }
     if (errno != ENOENT || !notfoundOk) /* unexpected failure? */
@@ -2630,7 +2651,7 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
  * searched in pg_xlog if not found in archive.
  */
 static int
-XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, bool fromArchive)
+XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, int sources)
 {
     char        path[MAXPGPATH];
     ListCell   *cell;
@@ -2653,20 +2674,19 @@ XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, bool fromArchive)
         if (tli < curFileTLI)
             break;                /* don't bother looking at too-old TLIs */

-        fd = XLogFileRead(log, seg, emode, tli, fromArchive, true);
-        if (fd != -1)
-            return fd;
+        if (sources & XLOG_FROM_ARCHIVE)
+        {
+            fd = XLogFileRead(log, seg, emode, tli, XLOG_FROM_ARCHIVE, true);
+            if (fd != -1)
+            {
+                elog(DEBUG1, "got WAL segment from archive");
+                return fd;
+            }
+        }

-        /*
-         * If not in StandbyMode, fall back to searching pg_xlog. In
-         * StandbyMode we're streaming segments from the primary to pg_xlog,
-         * and we mustn't confuse the (possibly partial) segments in pg_xlog
-         * with complete segments ready to be applied. We rather wait for the
-         * records to arrive through streaming.
-         */
-        if (!StandbyMode && fromArchive)
+        if (sources & XLOG_FROM_PG_XLOG)
         {
-            fd = XLogFileRead(log, seg, emode, tli, false, true);
+            fd = XLogFileRead(log, seg, emode, tli, XLOG_FROM_PG_XLOG, true);
             if (fd != -1)
                 return fd;
         }
@@ -3520,7 +3540,7 @@ RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
  * the returned record pointer always points there.
  */
 static XLogRecord *
-ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
+ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
 {
     XLogRecord *record;
     char       *buffer;
@@ -3530,17 +3550,6 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
                 total_len;
     uint32        targetRecOff;
     uint32        pageHeaderSize;
-    int            emode;
-
-    /*
-     * We don't expect any invalid records during streaming recovery: we
-     * should never hit the end of WAL because we wait for it to be streamed.
-     * Therefore treat any broken WAL as PANIC, instead of failing over.
-     */
-    if (StandbyMode)
-        emode = PANIC;
-    else
-        emode = emode_arg;

     if (readBuf == NULL)
     {
@@ -3593,6 +3602,9 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
         randAccess = true;        /* allow curFileTLI to go backwards too */
     }

+    /* This is the first try to read this page. */
+    failedSources = 0;
+retry:
     /* Read the page containing the record */
     if (!XLogPageRead(RecPtr, emode, fetching_ckpt, randAccess))
         return NULL;
@@ -3611,7 +3623,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
     }
     else if (targetRecOff < pageHeaderSize)
     {
-        ereport(emode,
+        ereport(emode_for_corrupt_record(emode),
                 (errmsg("invalid record offset at %X/%X",
                         RecPtr->xlogid, RecPtr->xrecoff)));
         goto next_record_is_invalid;
@@ -3619,7 +3631,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
     if ((((XLogPageHeader) readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
         targetRecOff == pageHeaderSize)
     {
-        ereport(emode,
+        ereport(emode_for_corrupt_record(emode),
                 (errmsg("contrecord is requested by %X/%X",
                         RecPtr->xlogid, RecPtr->xrecoff)));
         goto next_record_is_invalid;
@@ -3634,7 +3646,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
     {
         if (record->xl_len != 0)
         {
-            ereport(emode,
+            ereport(emode_for_corrupt_record(emode),
                     (errmsg("invalid xlog switch record at %X/%X",
                             RecPtr->xlogid, RecPtr->xrecoff)));
             goto next_record_is_invalid;
@@ -3642,7 +3654,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
     }
     else if (record->xl_len == 0)
     {
-        ereport(emode,
+        ereport(emode_for_corrupt_record(emode),
                 (errmsg("record with zero length at %X/%X",
                         RecPtr->xlogid, RecPtr->xrecoff)));
         goto next_record_is_invalid;
@@ -3651,14 +3663,14 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
         record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
         XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
     {
-        ereport(emode,
+        ereport(emode_for_corrupt_record(emode),
                 (errmsg("invalid record length at %X/%X",
                         RecPtr->xlogid, RecPtr->xrecoff)));
         goto next_record_is_invalid;
     }
     if (record->xl_rmid > RM_MAX_ID)
     {
-        ereport(emode,
+        ereport(emode_for_corrupt_record(emode),
                 (errmsg("invalid resource manager ID %u at %X/%X",
                         record->xl_rmid, RecPtr->xlogid, RecPtr->xrecoff)));
         goto next_record_is_invalid;
@@ -3671,7 +3683,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
          */
         if (!XLByteLT(record->xl_prev, *RecPtr))
         {
-            ereport(emode,
+            ereport(emode_for_corrupt_record(emode),
                     (errmsg("record with incorrect prev-link %X/%X at %X/%X",
                             record->xl_prev.xlogid, record->xl_prev.xrecoff,
                             RecPtr->xlogid, RecPtr->xrecoff)));
@@ -3687,7 +3699,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
          */
         if (!XLByteEQ(record->xl_prev, ReadRecPtr))
         {
-            ereport(emode,
+            ereport(emode_for_corrupt_record(emode),
                     (errmsg("record with incorrect prev-link %X/%X at %X/%X",
                             record->xl_prev.xlogid, record->xl_prev.xrecoff,
                             RecPtr->xlogid, RecPtr->xrecoff)));
@@ -3716,7 +3728,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
         {
             readRecordBufSize = 0;
             /* We treat this as a "bogus data" condition */
-            ereport(emode,
+            ereport(emode_for_corrupt_record(emode),
                     (errmsg("record length %u at %X/%X too long",
                             total_len, RecPtr->xlogid, RecPtr->xrecoff)));
             goto next_record_is_invalid;
@@ -3756,7 +3768,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
             /* Check that the continuation record looks valid */
             if (!(((XLogPageHeader) readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD))
             {
-                ereport(emode,
+                ereport(emode_for_corrupt_record(emode),
                         (errmsg("there is no contrecord flag in log file %u, segment %u, offset %u",
                                 readId, readSeg, readOff)));
                 goto next_record_is_invalid;
@@ -3766,7 +3778,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
             if (contrecord->xl_rem_len == 0 ||
                 total_len != (contrecord->xl_rem_len + gotlen))
             {
-                ereport(emode,
+                ereport(emode_for_corrupt_record(emode),
                         (errmsg("invalid contrecord length %u in log file %u, segment %u, offset %u",
                                 contrecord->xl_rem_len,
                                 readId, readSeg, readOff)));
@@ -3784,7 +3796,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
                    contrecord->xl_rem_len);
             break;
         }
-        if (!RecordIsValid(record, *RecPtr, emode))
+        if (!RecordIsValid(record, *RecPtr, emode_for_corrupt_record(emode)))
             goto next_record_is_invalid;
         pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
         EndRecPtr.xlogid = readId;
@@ -3798,7 +3810,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
     }

     /* Record does not cross a page boundary */
-    if (!RecordIsValid(record, *RecPtr, emode))
+    if (!RecordIsValid(record, *RecPtr, emode_for_corrupt_record(emode)))
         goto next_record_is_invalid;
     EndRecPtr.xlogid = RecPtr->xlogid;
     EndRecPtr.xrecoff = RecPtr->xrecoff + MAXALIGN(total_len);
@@ -3824,13 +3836,20 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
     }
     return (XLogRecord *) buffer;

-next_record_is_invalid:;
+next_record_is_invalid:
+    failedSources |= readSource;
+
     if (readFile >= 0)
     {
         close(readFile);
         readFile = -1;
     }
-    return NULL;
+
+    /* In standby mode, keep trying */
+    if (StandbyMode)
+        goto retry;
+    else
+        return NULL;
 }

 /*
@@ -8731,10 +8750,15 @@ StartupProcessMain(void)

 /*
  * Read the XLOG page containing RecPtr into readBuf (if not read already).
- * Returns true if successful, false otherwise or fails if emode is PANIC.
+ * Returns true if the page is read successfully.
  *
  * This is responsible for restoring files from archive as needed, as well
  * as for waiting for the requested WAL record to arrive in standby mode.
+ *
+ * 'emode' specifies the log level used for reporting "file not found" or
+ * "end of WAL" situations in archive recovery, or in standby mode if a trigger
+ * file is found. If set to WARNING or below, XLogPageRead() returns false
+ * in those situations, otherwise the ereport() will cause an error exit.
  */
 static bool
 XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
@@ -8746,13 +8770,14 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
     uint32        targetRecOff;
     uint32        targetId;
     uint32        targetSeg;
+    static pg_time_t last_fail_time = 0;

     XLByteToSeg(*RecPtr, targetId, targetSeg);
     targetPageOff = ((RecPtr->xrecoff % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
     targetRecOff = RecPtr->xrecoff % XLOG_BLCKSZ;

     /* Fast exit if we have read the record in the current buffer already */
-    if (targetId == readId && targetSeg == readSeg &&
+    if (failedSources == 0 && targetId == readId && targetSeg == readSeg &&
         targetPageOff == readOff && targetRecOff < readLen)
         return true;

@@ -8764,18 +8789,18 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
     {
         close(readFile);
         readFile = -1;
+        readSource = 0;
     }

     XLByteToSeg(*RecPtr, readId, readSeg);

+retry:
     /* See if we need to retrieve more data */
     if (readFile < 0 ||
-        (readStreamed && !XLByteLT(*RecPtr, receivedUpto)))
+        (readSource == XLOG_FROM_STREAM && !XLByteLT(*RecPtr, receivedUpto)))
     {
         if (StandbyMode)
         {
-            bool        last_restore_failed = false;
-
             /*
              * In standby mode, wait for the requested record to become
              * available, either via restore_command succeeding to restore the
@@ -8800,15 +8825,16 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
                         {
                             readFile =
                                 XLogFileRead(readId, readSeg, PANIC,
-                                             recoveryTargetTLI, false, false);
+                                             recoveryTargetTLI,
+                                             XLOG_FROM_PG_XLOG, false);
                             switched_segment = true;
-                            readStreamed = true;
+                            readSource = XLOG_FROM_STREAM;
                         }
                         break;
                     }

                     if (CheckForStandbyTrigger())
-                        goto next_record_is_invalid;
+                        goto triggered;

                     /*
                      * When streaming is active, we want to react quickly when
@@ -8818,6 +8844,9 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
                 }
                 else
                 {
+                    int sources;
+                    pg_time_t now;
+
                     /*
                      * Until walreceiver manages to reconnect, poll the
                      * archive.
@@ -8830,48 +8859,73 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
                     /* Reset curFileTLI if random fetch. */
                     if (randAccess)
                         curFileTLI = 0;
-                    readFile = XLogFileReadAnyTLI(readId, readSeg, DEBUG2, true);
-                    switched_segment = true;
-                    readStreamed = false;
-                    if (readFile != -1)
-                    {
-                        elog(DEBUG1, "got WAL segment from archive");
-                        break;
-                    }

                     /*
-                     * If we succeeded restoring some segments from archive
-                     * since the last connection attempt (or we haven't tried
-                     * streaming yet, retry immediately. But if we haven't,
-                     * assume the problem is persistent, so be less
-                     * aggressive.
+                     * Try to restore the file from archive, or read an
+                     * existing file from pg_xlog.
                      */
-                    if (last_restore_failed)
+                    sources = XLOG_FROM_ARCHIVE | XLOG_FROM_PG_XLOG;
+                    if (!(sources & ~failedSources))
                     {
                         /*
-                         * Check to see if the trigger file exists. Note that
-                         * we do this only after failure, so when you create
-                         * the trigger file, we still finish replaying as much
-                         * as we can before failover.
+                         * We've exhausted all options for retrieving the
+                         * file. Retry ...
                          */
-                        if (CheckForStandbyTrigger())
-                            goto next_record_is_invalid;
-                        pg_usleep(5000000L);    /* 5 seconds */
+                        failedSources = 0;
+
+                        /*
+                         * ... but sleep first if it hasn't been long since
+                         * last attempt.
+                         */
+                        now = (pg_time_t) time(NULL);
+                        if ((now - last_fail_time) < 5)
+                        {
+                            pg_usleep(1000000L * (5 - (now - last_fail_time)));
+                            now = (pg_time_t) time(NULL);
+                        }
+                        last_fail_time = now;
+
+                        /*
+                         * If primary_conninfo is set, launch walreceiver to
+                         * try to stream the missing WAL, before retrying
+                         * to restore from archive/pg_xlog.
+                         *
+                         * If fetching_ckpt is TRUE, RecPtr points to the
+                         * initial checkpoint location. In that case, we use
+                         * RedoStartLSN as the streaming start position instead
+                         * of RecPtr, so that when we later jump backwards to
+                         * start redo at RedoStartLSN, we will have the logs
+                         * streamed already.
+                         */
+                        if (PrimaryConnInfo)
+                        {
+                            RequestXLogStreaming(
+                                fetching_ckpt ? RedoStartLSN : *RecPtr,
+                                PrimaryConnInfo);
+                            continue;
+                        }
                     }
-                    last_restore_failed = true;
+                    /* Don't try to read from a source that just failed */
+                    sources &= ~failedSources;
+                    readFile = XLogFileReadAnyTLI(readId, readSeg, DEBUG2,
+                                                  sources);
+                    switched_segment = true;
+                    if (readFile != -1)
+                        break;

                     /*
-                     * Nope, not found in archive. Try to stream it.
-                     *
-                     * If fetching_ckpt is TRUE, RecPtr points to the initial
-                     * checkpoint location. In that case, we use RedoStartLSN
-                     * as the streaming start position instead of RecPtr, so
-                     * that when we later jump backwards to start redo at
-                     * RedoStartLSN, we will have the logs streamed already.
+                     * Nope, not found in archive and/or pg_xlog.
                      */
-                    if (PrimaryConnInfo)
-                        RequestXLogStreaming(fetching_ckpt ? RedoStartLSN : *RecPtr,
-                                             PrimaryConnInfo);
+                    failedSources |= sources;
+
+                    /*
+                     * Check to see if the trigger file exists. Note that
+                     * we do this only after failure, so when you create
+                     * the trigger file, we still finish replaying as much
+                     * as we can from archive and pg_xlog before failover.
+                     */
+                    if (CheckForStandbyTrigger())
+                        goto triggered;
                 }

                 /*
@@ -8886,13 +8940,18 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
             /* In archive or crash recovery. */
             if (readFile < 0)
             {
+                int sources;
                 /* Reset curFileTLI if random fetch. */
                 if (randAccess)
                     curFileTLI = 0;
+
+                sources = XLOG_FROM_PG_XLOG;
+                if (InArchiveRecovery)
+                    sources |= XLOG_FROM_ARCHIVE;
+
                 readFile = XLogFileReadAnyTLI(readId, readSeg, emode,
-                                              InArchiveRecovery);
+                                              sources);
                 switched_segment = true;
-                readStreamed = false;
                 if (readFile < 0)
                     return false;
             }
@@ -8900,8 +8959,8 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
     }

     /*
-     * At this point, we have the right segment open and we know the requested
-     * record is in it.
+     * At this point, we have the right segment open and if we're streaming
+     * we know the requested record is in it.
      */
     Assert(readFile != -1);

@@ -8911,7 +8970,7 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
      * requested record has been received, but this is for the benefit of
      * future calls, to allow quick exit at the top of this function.
      */
-    if (readStreamed)
+    if (readSource == XLOG_FROM_STREAM)
     {
         if (RecPtr->xlogid != receivedUpto.xlogid ||
             (RecPtr->xrecoff / XLOG_BLCKSZ) != (receivedUpto.xrecoff / XLOG_BLCKSZ))
@@ -8936,13 +8995,14 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
         readOff = 0;
         if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
         {
-            ereport(emode,
+            ereport(emode_for_corrupt_record(emode),
                     (errcode_for_file_access(),
                      errmsg("could not read from log file %u, segment %u, offset %u: %m",
                             readId, readSeg, readOff)));
             goto next_record_is_invalid;
         }
-        if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode))
+        if (!ValidXLOGHeader((XLogPageHeader) readBuf,
+                             emode_for_corrupt_record(emode)))
             goto next_record_is_invalid;
     }

@@ -8950,7 +9010,7 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
     readOff = targetPageOff;
     if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0)
     {
-        ereport(emode,
+        ereport(emode_for_corrupt_record(emode),
                 (errcode_for_file_access(),
          errmsg("could not seek in log file %u, segment %u to offset %u: %m",
                 readId, readSeg, readOff)));
@@ -8958,13 +9018,13 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
     }
     if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
     {
-        ereport(emode,
+        ereport(emode_for_corrupt_record(emode),
                 (errcode_for_file_access(),
          errmsg("could not read from log file %u, segment %u, offset %u: %m",
                 readId, readSeg, readOff)));
         goto next_record_is_invalid;
     }
-    if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode))
+    if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode_for_corrupt_record(emode)))
         goto next_record_is_invalid;

     Assert(targetId == readId);
@@ -8975,16 +9035,67 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
     return true;

 next_record_is_invalid:
+    failedSources |= readSource;
+
     if (readFile >= 0)
         close(readFile);
     readFile = -1;
-    readStreamed = false;
     readLen = 0;
+    readSource = 0;
+
+    /* In standby mode, keep trying */
+    if (StandbyMode)
+        goto retry;
+    else
+        return false;
+
+triggered:
+    if (readFile >= 0)
+        close(readFile);
+    readFile = -1;
+    readLen = 0;
+    readSource = 0;

     return false;
 }

 /*
+ * Determine what log level should be used to report a corrupt WAL record
+ * in the current WAL page, previously read by XLogPageRead().
+ *
+ * 'emode' is the error mode that would be used to report a file-not-found
+ * or legitimate end-of-WAL situation. It is upgraded to WARNING or PANIC
+ * if a corrupt record is not expected at this point.
+ */
+static int
+emode_for_corrupt_record(int emode)
+{
+    /*
+     * We don't expect any invalid records in archive or in records streamed
+     * from master. Files in the archive should be complete, and we should
+     * never hit the end of WAL because we stop and wait for more WAL to
+     * arrive before replaying it.
+     *
+     * In standby mode, throw a WARNING and keep retrying. If we're lucky
+     * it's a transient error and will go away by itself, and in any case
+     * it's better to keep the standby open for any possible read-only
+     * queries. In PITR, however, stop recovery immediately, rather than
+     * fail over.
+     */
+    if (readSource == XLOG_FROM_STREAM || readSource == XLOG_FROM_ARCHIVE)
+    {
+        if (StandbyMode)
+        {
+            if (emode < WARNING)
+                emode = WARNING;
+        }
+        else
+            emode = PANIC;
+    }
+    return emode;
+}
+
+/*
  * Check to see if the trigger file exists. If it does, request postmaster
  * to shut down walreceiver, wait for it to exit, remove the trigger
  * file, and return true.

On Thu, Mar 25, 2010 at 8:55 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> * If a corrupt WAL record is found in archive or streamed from master in
> standby mode, throw WARNING instead of PANIC, and keep trying. In
> archive recovery (ie. standby_mode=off) it's still a PANIC. We can make
> it a WARNING too, which gives the pre-9.0 behavior of starting up the
> server on corruption. I prefer PANIC but the discussion is still going on.

I don't think we should be changing pre-9.0 behavior more than necessary.

...Robert


On Thu, 2010-03-25 at 12:15 +0200, Heikki Linnakangas wrote:
> (cc'ing docs list)
>
> Simon Riggs wrote:
> > The lack of docs begins to show a lack of coherent high-level design
> > here.
>
> Yeah, I think you're right. It's becoming hard to keep track of how it's
> supposed to behave.

Thank you for responding to that comment positively, I am relieved.

--
 Simon Riggs           www.2ndQuadrant.com


On Thu, 2010-03-25 at 12:26 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2010-03-25 at 10:11 +0200, Heikki Linnakangas wrote:
> > 
> >> PANIC seems like the appropriate solution for now.
> > 
> > It definitely is not. Think some more.
> 
> Well, what happens now in previous versions with pg_standby et al is
> that the standby starts up. That doesn't seem appropriate either.

Agreed. I said that also, immediately upthread.

Bottom line is I am against anyone being allowed to PANIC the server
just because their piece of it ain't working. The whole purpose of all
of this is High Availability and we don't get that if everybody keeps
stopping for a tea break every time things get tricky. Staying up when
problems occur is the only way to avoid a falling domino taking out the
whole farm.

> I'm worried that the administrator won't notice the error promptly
> because at a quick glance the server is up and running, while it's
> actually stuck at the error and falling indefinitely behind the master.
> Maybe if we make it a WARNING, that's enough to alleviate that. It's
> true that if the standby is actively being used for read-only queries,
> shutting it down just to get the administrator's attention isn't good either.

That's what monitoring is for. Let's just make sure this state is
accessible, so people will notice.

--
 Simon Riggs           www.2ndQuadrant.com



On Thu, Mar 25, 2010 at 9:55 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> * Fix the bug of a spurious PANIC in archive recovery, if the WAL ends
> in the middle of a WAL record that continues over a WAL segment boundary.
>
> * If a corrupt WAL record is found in archive or streamed from master in
> standby mode, throw WARNING instead of PANIC, and keep trying. In
> archive recovery (ie. standby_mode=off) it's still a PANIC. We can make
> it a WARNING too, which gives the pre-9.0 behavior of starting up the
> server on corruption. I prefer PANIC but the discussion is still going on.

Seems reasonable to me.

> * Small code changes to handling of failedSources, inspired by your
> comment. No change in functionality.
>
> This is also available in my git repository at
> git://git.postgresql.org/git/users/heikki/postgres.git, branch "xlogchanges"

I've looked over the patch and haven't found any big problems so far.
The attached small patch fixes a typo.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Fujii Masao wrote:
>> * Small code changes to handling of failedSources, inspired by your
>> comment. No change in functionality.
>>
>> This is also available in my git repository at
>> git://git.postgresql.org/git/users/heikki/postgres.git, branch "xlogchanges"
> 
> I've looked over the patch and haven't found any big problems so far.
> The attached small patch fixes a typo.

Thanks. Committed with that typo-fix, and I also added a comment
explaining how failedSources and retrying XLogPageRead() works.

I'm now happy with the standby mode logic. It was a bigger struggle than
I anticipated back in January/February; I hope others now find it
intuitive as well. I'm going to work on the documentation of this, along
the lines of the draft I posted last week.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


On Wed, Mar 31, 2010 at 1:28 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Fujii Masao wrote:
>>> * Small code changes to handling of failedSources, inspired by your
>>> comment. No change in functionality.
>>>
>>> This is also available in my git repository at
>>> git://git.postgresql.org/git/users/heikki/postgres.git, branch "xlogchanges"
>>
>> I've looked over the patch and haven't found any big problems so far.
>> The attached small patch fixes a typo.
>
> Thanks. Committed with that typo-fix, and I also added a comment
> explaining how failedSources and retrying XLogPageRead() works.

Great! Thanks a lot!

This change affects the various recovery patterns we now support:
normal crash recovery, archive recovery, recovery using pg_standby,
recovery in standby mode, and so on. So we need to test that all of
those patterns still work with the change.
I'll do as much of that as possible.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center